INTERNET-DRAFT                                            David Noveck
Expires: April 2006                            Network Appliance, Inc.
                                                     Rodney C. Burnett
                                                              IBM, Inc.
                                                           October 2005

              Next Steps for NFSv4 Migration/Replication
                    draft-noveck-nfsv4-migrep-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2005).  All Rights Reserved.

Abstract

   The fs_locations attribute in NFSv4 provides support for fs
   migration, replication and referral.  Given the current work on
   supporting these features, and new needs such as support for a
   global namespace, it is time to look at this area and see what
   further development of this protocol area may be required.  This
   document makes suggestions for the further development of these
   features in NFSv4.1 and also presents ideas for work that might be
   done as part of future minor versions.

Table of Contents

   1. Introduction
      1.1. History
      1.2. Areas to be Addressed
   2. Clarifications/Corrections to V4.0 Functionality
      2.1. Attributes Returned by GETATTR and READDIR
           2.1.1. fsid
           2.1.2. mounted_on_fileid
           2.1.3. fileid
           2.1.4. filehandle
      2.2. Issues with the Error NFS4ERR_MOVED
           2.2.1. Issue of when to check current filehandle
           2.2.2. Issue of GETFH
           2.2.3. Handling of PUTFH
           2.2.4. Inconsistent handling of GETATTR
           2.2.5. Ops not allowed to return NFS4ERR_MOVED
           2.2.6. Summary of NFS4ERR_MOVED
      2.3. Issues of Incomplete Attribute Sets
           2.3.1. Handling of attributes for READDIR
      2.4. Referral Issues
           2.4.1. Editorial Changes Related to Referrals
   3. Feature Extensions
      3.1. Attribute Continuity
           3.1.1. filehandle
           3.1.2. fileid
           3.1.3. change attribute
           3.1.4. fsid
      3.2. Additional Attributes
           3.2.1. fs_absent
           3.2.2. fs_location_info
           3.2.3. fh_replacement
           3.2.4. fs_status
   4. Migration Protocol
      4.1. NFSv4.x as a Migration Protocol
   Acknowledgements
   Normative References
   Informative References
   Authors' Addresses
   Full Copyright Statement

1. Introduction

1.1. History

   When the fs_locations attribute was introduced, it was done with
   the expectation that a server-to-server migration protocol was in
   the offing.  Including the fs_locations-related features provided
   client support that could be used once such a protocol was
   developed, and that could in the meantime support vendor-specific
   migration between homogeneous servers.

   As things happened, development of a server-to-server migration
   protocol stalled.  In part, this was due to the demands of NFSv4
   implementation itself.  Also, until V4 clients which supported
   these features were widely deployed, it was hard to justify the
   long-term effort for a new server-to-server protocol.

   Now that serious implementation work has begun, a number of issues
   have been discovered with the treatment of these features in
   RFC3530.  There are no significant protocol bugs, but there are
   numerous cases in which the text is unclear or contradictory on
   significant points.  Also, a number of suggestions have been made
   regarding small things left undone in the original specification,
   leading to the question of whether it is now an appropriate time
   to rectify those inadequacies.

   Another important development has been the idea of referrals.
   Referrals, a limiting case of migration, were not recognized when
   the spec was written, even though the protocol defined therein
   does support them.  See [referrals] for an explanation of
   referrals implementation.  Also, it has turned out that referrals
   are an important building-block for the development of a global
   namespace for NFSv4.

1.2. Areas to be Addressed

   This document is motivated in large part by the opportunity
   represented by NFSv4.1.  First, this will provide a way to revise
   the treatment of these features in the spec, to make it clearer,
   to avoid ambiguities and contradictions, and to incorporate
   explicit discussion of referrals into the text.

   NFSv4.1 also affords the opportunity to provide small extensions
   to these facilities, to make them more generally useful, in
   particular in environments in which migration between servers of
   different types is to be performed.  Use of these features in a
   global-namespace environment will also motivate certain
   extensions.

   The remaining issue in this area is the development of a vendor-
   independent migration mechanism.  This is definitely not something
   that can be done immediately (i.e. in v4.1), but the working group
   needs to figure out when this effort can be revived.  This
   document will examine a somewhat lower-overhead alternative to
   development of a separate server-to-server migration protocol.
   The alternative that will be explored is the use of NFSv4 itself,
   with a small set of additions, by a server operating as an NFSv4
   client to either pull or push file system state to or from another
   server.  It seems that this sort of incremental development can
   provide a more efficient path to a migration mechanism than
   development of a new protocol that would inevitably duplicate a
   lot of NFSv4.  Since NFSv4 must have the general ability to
   represent fs state that is accessible via NFSv4, using the core
   protocol as the base, and adding only the extensions needed to do
   data transfer efficiently and to transfer locking state, should be
   more efficient in terms of design time.  The needed extensions
   could be introduced within a minor version.  It is not proposed or
   expected that these extensions would be in NFSv4.1.

2. Clarifications/Corrections to V4.0 Functionality

   All of the sub-sections below deal with the basic functionality
   described, explicitly or implicitly, in RFC3530.  While the
   majority of the material is simply corrections, clarifications,
   and the resolution of ambiguities, in some cases there is cleanup
   to make things more consistent in v4.1, without adding any new
   functionality.  Functional changes are addressed in separate
   sections.

2.1. Attributes Returned by GETATTR and READDIR

   While RFC3530 allows the server to return attributes in addition
   to fs_locations when GETATTR is used with a current filehandle
   within an absent filesystem, not much guidance is given to help
   clarify what is appropriate.  Such vagueness can result in serious
   interoperability issues.  Instead of simply allowing an undefined
   set of attributes to be returned, the NFSv4.1 spec should clearly
   define the circumstances under which attributes for absent
   filesystems are to be returned.

   While some leeway may be necessary to accommodate different
   NFSv4.1 servers, unnecessary leeway should be avoided.  In
   particular, there are a number of attributes which most server
   implementations should find relatively easy to supply and which
   are of critical importance to clients, particularly in those cases
   in which NFS4ERR_MOVED is returned when first crossing into an
   absent file system that the client has not previously referenced,
   i.e. a referral.

   NFSv4.1 should require servers to return fsid for an absent file
   system as well as fs_locations.  In order for the client to
   properly determine the boundaries of absent filesystems, it needs
   access to fsid.  In addition, when at the root of an absent
   filesystem, mounted_on_fileid needs to be returned.

   On the other hand, a number of attributes pose difficulties when
   returned for an absent filesystem.  While not prohibiting the
   server from returning these, the NFSv4.1 spec should explain the
   issues which may result in problems, since these are not always
   obvious.  Handling of some specific attributes is discussed below.

2.1.1. fsid

   The fsid attribute allows clients to recognize when fs boundaries
   have been crossed.  This applies also when one crosses into an
   absent filesystem.  While it might seem that returning fsid is not
   absolutely required, since fs boundaries are also reflected, in
   this case, by means of the fs_root field of the fs_locations
   attribute, there are renaming issues that make this unreliable.
   Returning fsid is necessary for clients, and servers should have
   no difficulty in providing it.
   To avoid misunderstanding, the NFSv4.1 spec should note that the
   fsid provided in this case is solely so that the fs boundaries can
   be properly noted and that the fsid returned will not necessarily
   be valid after resolution of the migration event.  The logic of
   fsid handling for NFSv4 is that fsid's are only unique within a
   per-server context.  This would seem to be a strong indication
   that they need not be persistent when file systems are moved from
   server to server, although RFC 3530 does not specifically address
   the matter.

2.1.2. mounted_on_fileid

   The mounted_on_fileid attribute is of particular importance to
   many clients, in that they need this information to form a proper
   response to a readdir() call.  When a readdir() call is done
   within UNIX, the d_ino field of each of the entries needs to have
   a unique value, normally derived from the NFSv4 fileid attribute.
   Using the fileid attribute for this purpose poses problems when a
   file system boundary is crossed, particularly when crossing into
   an absent fs.  Note first that the fileid attribute, since it is
   within a new fs and thus a new fileid space, will not be unique
   within the directory.  Also, since the fs, at its new location,
   may arrange things differently, the fileid decided on at the
   directing server may be overridden at the target server, making it
   of little value.  Neither of these problems arises in the case of
   mounted_on_fileid, since that fileid is in the context of the
   mounted-on fs and unique within it.

2.1.3. fileid

   For reasons explained above under mounted_on_fileid, it would be
   difficult for the referring server to provide a fileid value that
   is of any use to the client.  Given this, it seems much better for
   the server never to return fileid values for files on an absent
   fs.

2.1.4. filehandle

   Returning filehandles for files in the absent fs, whether by use
   of GETFH (discussed below) or by using the filehandle attribute
   with GETATTR or READDIR, poses problems for the client, as the
   server to which it is referred is likely not to assign the same
   filehandle value to the object in question.  Even though it is
   possible that volatile filehandles may allow a change, the
   referring server should not prejudge the issue of filehandle
   volatility for the server which actually has the fs.  By not
   providing the filehandle, the referring server allows the target
   server freedom to choose the filehandle value without constraint.

2.2. Issues with the Error NFS4ERR_MOVED

   RFC3530, in addition to being somewhat unclear about the
   situations in which NFS4ERR_MOVED is to be returned, is self-
   contradictory.  In particular, in section 6.2 it is stated, "The
   NFS4ERR_MOVED error is returned for all operations except PUTFH
   and GETATTR.", which is contradicted by the error lists in the
   detailed operation descriptions.  Specifically,

   o  NFS4ERR_MOVED is listed as an error code for PUTFH (section
      14.2.20), despite the statement noted above.

   o  NFS4ERR_MOVED is listed as an error code for GETATTR (section
      14.2.7), despite the statement noted above.

   o  Despite the "all operations except" in the statement above, six
      operations (PUTROOTFH, PUTPUBFH, RENEW, SETCLIENTID,
      SETCLIENTID_CONFIRM, RELEASE_OWNER) are not allowed to return
      NFS4ERR_MOVED.
2.2.1. Issue of when to check current filehandle

   In providing the definition of NFS4ERR_MOVED, RFC 3530 refers to
   the "filesystem which contains the current filehandle object"
   being moved to another server.  This has led to some confusion
   when considering the case of operations which change the current
   filehandle and potentially the current file system.  For example,
   a LOOKUP which causes a transition to an absent file system might
   be supposed to result in this error.  This should be clarified to
   make it explicit that only the current filehandle at the start of
   the operation can result in NFS4ERR_MOVED.

2.2.2. Issue of GETFH

   While RFC 3530 does not make any exception for GETFH when the
   current filehandle is within an absent filesystem, the fact that
   GETFH is such a passive, purely interrogative operation may lead
   readers to wrongly suppose that an NFS4ERR_MOVED error will not
   arise in this situation.  Any new NFSv4 RFC should explicitly
   state that GETFH will return this error if the current filehandle
   is within an absent filesystem.

   This fact has a particular importance in the case of referrals, as
   it means that filehandles within absent filesystems will never be
   seen by clients.  Filehandles not seen by clients can pose no
   expiration or consistency issues on the target server.

2.2.3. Handling of PUTFH

   As noted above, the handling of PUTFH regarding NFS4ERR_MOVED is
   not clear in RFC3530.  Part of the problem is that there is felt
   to be a need for an exception for PUTFH, to enable the sequence
   PUTFH-GETATTR(fs_locations).  However, if one clearly establishes,
   as should be established, that the check for an absent filesystem
   is only to be made at the start of each operation, then no such
   exception is required.  The sequence PUTFH-GETATTR(fs_locations)
   requires an exception for the GETATTR but not for the PUTFH.

   PUTFH can return NFS4ERR_MOVED, but only if the current
   filehandle, as established by a previous operation, is within an
   absent filesystem.  Whether the filehandle established by the
   PUTFH is within an absent filesystem is of no consequence in
   determining whether such an error is returned, since the check is
   to be done at the start of the operation.

2.2.4. Inconsistent handling of GETATTR

   While, as noted above, RFC 3530 indicates that NFS4ERR_MOVED is
   not returned for a GETATTR operation, NFS4ERR_MOVED is listed as
   an error that can be returned by GETATTR.  The best resolution for
   this is to limit the exception for GETATTR to the specific cases
   in which it is required:

   o  If all of the attributes requested can be provided (e.g. fsid,
      fs_locations, and mounted_on_fileid in the case of the root of
      an absent filesystem), then NFS4ERR_MOVED is not returned.

   o  If an attribute is requested which indicates that the client is
      aware of the likelihood of migration having happened (such as
      fs_locations), then NFS4ERR_MOVED is not returned, irrespective
      of what additional attributes are requested.  The newly-
      proposed attributes fs_absent and fs_location_info (see
      sections 3.2.1 and 3.2.2) would, like fs_locations, also cause
      NFS4ERR_MOVED not to be returned.

   For the rest of this document, the phrase "fs_locations-like
   attributes" is to be understood as including fs_locations, and the
   new attributes fs_absent and fs_location_info, if added to the
   protocol.

   In all other cases, if the current filesystem is absent,
   NFS4ERR_MOVED is to be returned.
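   These rules reduce to a simple bitmask test.  The following C
   sketch is illustrative only: the attribute bit numbers for fsid
   (8), fs_locations (24), and mounted_on_fileid (55) follow RFC
   3530, while the numbers used for the proposed fs_absent and
   fs_location_info attributes are placeholders, since those
   attributes have not been assigned.

      #include <stdbool.h>
      #include <stdint.h>

      #define FATTR4_FSID               8
      #define FATTR4_FS_LOCATIONS      24
      #define FATTR4_MOUNTED_ON_FILEID 55
      #define FATTR4_FS_ABSENT         56   /* placeholder value */
      #define FATTR4_FS_LOCATION_INFO  57   /* placeholder value */

      #define ATTRBIT(n) ((uint64_t)1 << (n))

      /* Attributes a referring server can supply for an absent fs. */
      static const uint64_t absent_fs_attrs =
          ATTRBIT(FATTR4_FSID) | ATTRBIT(FATTR4_FS_LOCATIONS) |
          ATTRBIT(FATTR4_MOUNTED_ON_FILEID) |
          ATTRBIT(FATTR4_FS_ABSENT) | ATTRBIT(FATTR4_FS_LOCATION_INFO);

      /* The fs_locations-like attributes, as defined above. */
      static const uint64_t fs_locations_like =
          ATTRBIT(FATTR4_FS_LOCATIONS) | ATTRBIT(FATTR4_FS_ABSENT) |
          ATTRBIT(FATTR4_FS_LOCATION_INFO);

      /* Does GETATTR on an absent fs fail with NFS4ERR_MOVED? */
      bool getattr_returns_moved(uint64_t requested)
      {
          if (requested & fs_locations_like)
              return false;   /* client is migration-aware */
          if ((requested & ~absent_fs_attrs) == 0)
              return false;   /* everything requested is available */
          return true;
      }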
2.2.5. Ops not allowed to return NFS4ERR_MOVED

   As noted above, RFC 3530 does not allow the following ops to
   return NFS4ERR_MOVED:

   o  PUTROOTFH

   o  PUTPUBFH

   o  RENEW

   o  SETCLIENTID

   o  SETCLIENTID_CONFIRM

   o  RELEASE_OWNER

   All of these are ops which do not require a current filehandle,
   although two other ops that also do not require a current
   filehandle, DELEGPURGE and PUTFH, are allowed to return
   NFS4ERR_MOVED.  There is no good reason to continue these as
   exceptions.  In future NFSv4 versions it should be the case that
   if there is a current filehandle and the associated filesystem is
   not present, an NFS4ERR_MOVED error should result, just as it does
   for other ops.

2.2.6. Summary of NFS4ERR_MOVED

   To summarize, NFSv4.1 should:

   o  Make clear that the check for an absent filesystem is to occur
      at the start (and only at the start) of each operation.

   o  Allow NFS4ERR_MOVED to be returned by all ops, including those
      not allowed to return it in RFC3530.

   o  Be clear about the circumstances in which GETATTR will or will
      not return NFS4ERR_MOVED.

   o  Delete the confusing text regarding an exception for PUTFH.

   o  Make it clear that GETFH will return NFS4ERR_MOVED rather than
      a filehandle within an absent filesystem.

2.3. Issues of Incomplete Attribute Sets

   Migration or referral events naturally create situations in which
   not all of the attributes normally supported on a server are
   obtainable.  RFC3530 is in places ambivalent and/or apparently
   self-contradictory on such issues.  Any new NFSv4 RFC should take
   a clear position on these issues (and it should not impose undue
   difficulties on support for migration).

   The first problem concerns the statement in the third paragraph of
   section 6.2:

      "If the client requests more attributes than just fs_locations,
      the server may return fs_locations only.  This is to be
      expected since the server has migrated the filesystem and may
      not have a method of obtaining additional attribute data."

   While the above seems quite reasonable, it is seemingly
   contradicted by the following text from the second paragraph of
   the DESCRIPTION for GETATTR in section 14.2.7:

      "The server must return a value for each attribute that the
      client requests if the attribute is supported by the server.
      If the server does not support an attribute or cannot
      approximate a useful value then it must not return the
      attribute value and must not set the attribute bit in the
      result bitmap.  The server must return an error if it supports
      an attribute but cannot obtain its value.  In that case no
      attribute values will be returned."

   The above is a useful restriction in that it allows clients to
   simplify their attribute interpretation code: they may assume that
   all of the attributes they request are present, often making it
   possible to get successive attributes at fixed offsets within the
   data stream.  However, it seems to contradict what is said in
   section 6.2, where it is clearly anticipated, at least when
   fs_locations is requested, that fewer (often many fewer)
   attributes will be available than are requested.

   It could be argued that the two could be harmonized by being
   creative with the interpretation of the phrase "if the attribute
   is supported by the server".
   One could argue that many attributes are simply not supported by
   the server for an absent fs, although the text, by speaking of
   attributes "supported by a server", seems to indicate that support
   is not allowed to differ between filesystems (which is troublesome
   in itself, as a single server might have some filesystems that
   support ACLs and some that do not).  Note, however, that the
   following paragraph in the description says, "All servers must
   support the mandatory attributes as specified in the section 'File
   Attributes'".  That's reasonable enough in general, but for an
   absent fs it is not reasonable, and so section 14.2.7 and section
   6.2 are contradictory.  NFSv4.1 should remove the contradiction by
   making an explicit exception for the case of an absent filesystem.

2.3.1. Handling of attributes for READDIR

   A related issue concerns attributes in a READDIR.  There has been
   discussion, without any resolution yet, regarding the server's
   obligation (or not) to return the attributes requested with
   READDIR.  There has been discussion of cases in which this is
   inconvenient for the server, and an argument has been made that
   the attribute request should be treated as a hint, since the
   client can do a GETATTR to get requested attributes that are not
   supplied by the server.

   Regardless of how this issue is resolved, it needs to be made
   clear that, at least in the case of a directory that contains the
   roots of absent filesystems, the server must not be required to
   return attributes that it is simply unable to return, just as it
   cannot with GETATTR.

   The following rules, derived from section 3.1 of [referrals] and
   modified for suggested attribute changes in NFSv4.1, represent a
   good basis for handling this issue (a sketch of them in code
   follows the list), although the resolution of the general issue
   regarding the attribute mask for READDIR will affect the ultimate
   choices for NFSv4.1.

   o  When any of the fs_locations-like attributes is among the
      attributes requested, the server may provide a subset of the
      other requested attributes together with the requested
      fs_locations-like attributes for roots of absent fs's, without
      causing any error for the READDIR as a whole.  If rdattr_error
      is also requested and there are attributes which are not
      available, then rdattr_error will receive the value
      NFS4ERR_MOVED.

   o  When no fs_locations-like attributes are requested, but all of
      the attributes requested can be provided, then they will be
      provided and no NFS4ERR_MOVED will be generated.  An example
      would be READDIR's that request mounted_on_fileid, either with
      or without fsid.

   o  When none of the fs_locations-like attributes is requested, but
      rdattr_error is, and some attributes requested are not
      available because of the absence of the filesystem, the server
      will return NFS4ERR_MOVED for the rdattr_error attribute and,
      in addition, the requested attributes that are valid for the
      root of an absent filesystem.

   o  When none of the fs_locations-like attributes is requested and
      there is a directory within an absent fs within the directory
      being read, if some unavailable attributes are requested, the
      handling will depend on the overall decision about READDIR
      referred to above.  If the attribute mask is to be treated as a
      hint, only available attributes will be returned.  Otherwise,
      no data will be returned and the READDIR will get an
      NFS4ERR_MOVED error.
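   As promised above, here is a sketch of these rules in C, for a
   single READDIR entry that is the root of an absent filesystem.
   This is illustrative only; the enum names and parameters are not
   proposed protocol elements, and rdattr_error is attribute 11 per
   RFC 3530.

      #include <stdbool.h>
      #include <stdint.h>

      #define ATTRBIT(n) ((uint64_t)1 << (n))
      #define FATTR4_RDATTR_ERROR 11

      enum rd_entry_result {
          RD_ALL_ATTRS,    /* every requested attribute returned */
          RD_SUBSET,       /* available subset returned, no error */
          RD_SUBSET_RDERR, /* subset plus rdattr_error=NFS4ERR_MOVED */
          RD_FAIL_MOVED    /* whole READDIR fails, NFS4ERR_MOVED */
      };

      enum rd_entry_result
      readdir_absent_entry(uint64_t requested,
                           uint64_t locations_like, /* fs_locations etc. */
                           uint64_t available,      /* fsid, etc. */
                           bool mask_is_hint)       /* open WG question */
      {
          uint64_t missing = requested & ~available;
          bool rderr = (requested & ATTRBIT(FATTR4_RDATTR_ERROR)) != 0;

          if (requested & locations_like) {
              if (!missing)
                  return RD_ALL_ATTRS;
              return rderr ? RD_SUBSET_RDERR : RD_SUBSET;
          }
          if (!missing)
              return RD_ALL_ATTRS;
          if (rderr)
              return RD_SUBSET_RDERR;
          return mask_is_hint ? RD_SUBSET : RD_FAIL_MOVED;
      }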
2.4. Referral Issues

   RFC 3530 defines a migration feature which allows the server to
   direct clients to another server for the purpose of accessing a
   given file system.  While that document explains the feature in
   terms of a client accessing a given file system and then finding
   that it has moved, an important limiting case is that in which the
   clients are redirected as part of their first attempt to access a
   given file system.

2.4.1. Editorial Changes Related to Referrals

   Given the above framework for implementing referrals within the
   basic migration framework described in RFC 3530, we need to
   consider how future NFSv4 RFC's should be modified, relative to
   RFC 3530, to address referrals.  The most important change is to
   include an explanation of how referrals fit into the v4 migration
   model.  Because the existing discussion does not specifically call
   out the case in which the absence of a filesystem is noted while
   attempting to cross into the absent file system, it is hard to
   understand how referrals work and how they relate to other sorts
   of migration events.

   It makes sense to present a description of referrals in a new sub-
   section following the "Migration" section, which would be section
   6.2.1 given the current numbering scheme of RFC 3530.  The
   material in [referrals], suitably modified for the changes
   proposed for v4.1, would be very helpful in providing the basis
   for this sub-section.

   There are also a number of cases in which the existing wording of
   RFC 3530 seems to ignore the referral case of the migration
   feature.  In the following specific cases, some suggestions are
   made for edits to tidy this up.

   o  In section 1.4.3.3, in the third sentence of the first
      paragraph, the phrase "In the event of a migration of a
      filesystem" is unnecessarily restrictive, and having the
      sentence read "In the event of the absence of a filesystem, the
      client will receive an error when operating on the filesystem
      and it can then query the server as to the current location of
      the file system" would be better.

   o  In section 6.2, the following should be added as a new second
      paragraph: "Migration may be signaled when a file system is
      absent on a given server, when the file system in question has
      never actually been located on the server in question.  In such
      a case, the server acts to refer the client to the proper fs
      location, using fs_locations to indicate the server location,
      with the existence of the server as a migration source being
      purely conventional."

   o  In the existing second paragraph of section 6.2, the first
      sentence should be modified to read as follows: "Once a
      filesystem has been successfully established at a new server
      location, the error NFS4ERR_MOVED will be returned for
      subsequent requests received by the server whose role is as the
      source of the filesystem, whether the filesystem actually
      resided on that server, or whether its original location was
      purely nominal (i.e. the pure referral case)."

   o  The following should be added as an additional paragraph at the
      end of section 6.4: "Note that in the case of a referral, there
      is no issue of filehandle recovery since no filehandles for the
      absent filesystem are communicated to the client (and neither
      is the fh_expire_type)".
   o  The following should be added as an additional paragraph at the
      end of section 8.14.1: "Note that in the case of referral,
      there is no issue of state recovery since no state can have
      been generated for the absent filesystem."

   o  In section 12, in the description of NFS4ERR_MOVED, the first
      sentence should read, "The filesystem which contains the
      current filehandle object is now located on another server."

3. Feature Extensions

   A number of small extensions can be made within NFSv4's minor
   versioning framework to enhance the ability to provide multi-
   vendor implementations of migration and replication where the
   transition from server instance to server instance is transparent
   to client users.  This includes transitions due to migration or
   transitions among replicas due to server or network problems.
   These same extensions would enhance the ability of the server to
   present clients with multiple replicas in a referral situation, so
   that the most appropriate one might be selected.  These extensions
   would all be in the form of additional recommended attributes.

3.1. Attribute Continuity

   There are a number of issues with the existing protocol that
   revolve around the continuity (or lack thereof) of attribute
   values across a migration event.  In some cases, the spec is not
   clear about whether such continuity is required, and different
   readers may make different assumptions.  In other cases,
   continuity is not required, but there are significant cases in
   which there would be a benefit, and there is no way for the client
   to take advantage of attribute continuity when it exists.  A third
   situation is that attribute continuity is generally assumed
   (although not specified in the spec), but allowing change at a
   migration event would add greatly to flexibility in handling a
   global namespace.

3.1.1. filehandle

   The issue of filehandle continuity is not fully addressed in
   RFC3530.  In many cases of vendor-specific migration or
   replication (where an entire fs image is copied, for instance), it
   is relatively easy to provide that the same persistent filehandles
   used on the source server be recognized on the destination server.

   On the other hand, for many forms of migration, filehandle
   continuity across a migration event cannot be provided, requiring
   that filehandles be re-established.  Within RFC3530, volatile
   filehandles (FH4_VOL_MIGRATION) are the only mechanism available
   to satisfy this need, and in many environments they will work
   fine.  Unfortunately, in the case in which an open file is renamed
   by another client, the re-establishment of the filehandle on the
   destination target will give the wrong result and the client will
   attempt to re-open an incorrect file on the target.  There needs
   to be a way to address this difficulty in order to provide
   transparent switching among file system instances, both in the
   event of migration and when transitioning among replicas.

3.1.2. fileid

   RFC3530 gives no real guidance on the issue of continuity of
   fileid's in the event of migration or a transition between two
   replicas.  The general expectation has been that in situations in
   which the two filesystem instances are created by a single vendor
   using some sort of filesystem image copy, fileid's will be
   consistent across the transition, while in the analogous multi-
   vendor transitions they will not.  This latter can pose some
   difficulties.
   It is important to note that while clients themselves may have no
   trouble with a fileid changing as a result of a filesystem
   transition event, applications do typically have access to the
   fileid (e.g. via stat), and the result of this is that an
   application may work perfectly well if there is no filesystem
   instance transition, or if any such transition is among instances
   created by a single vendor, yet be unable to deal with a multi-
   vendor transition that occurs at the wrong time.

   Providing the same fileid's in a multi-vendor (multiple server
   vendors) environment has generally been held to be quite
   difficult.  While there is work to be done, it needs to be pointed
   out that this difficulty is partly self-imposed.  Servers have
   typically identified fileid with inode number, i.e. with a
   quantity used to find the file in question.  This identification
   poses special difficulties for migration of an fs between vendors,
   where assigning the same index to a given file may not be
   possible.  Note that a fileid is not required to be useful for
   finding the file in question, only to be unique within the given
   fs.  Servers prepared to accept a fileid as a single piece of
   metadata, and to store it apart from the value used to index the
   file information, can relatively easily maintain a fileid value
   across a migration event, making a truly transparent migration
   event possible.

   In any case, where servers can provide continuity of fileids, they
   should, and the client should be able to find out that such
   continuity is available and take appropriate action.

3.1.3. change attribute

   Currently the change attribute is defined as strictly the province
   of the server, making it necessary for the client to re-establish
   the change attribute value on the new server.  This has the
   further consequence that the lack of continuity between change
   values on the source and destination servers creates a window
   during which the client has no reliable way of determining whether
   its caches are still valid.  Where there is a transition among
   writable filesystem instances, even if most of the access is for
   reading (in fact, particularly if it is), this can be a
   significant performance issue.

   Where the co-operating servers can provide continuity of change
   number across the migration event, the client should be able to
   determine this fact and use this knowledge to avoid unneeded
   attribute fetches and client cache flushes.

3.1.4. fsid

   Although RFC3530 does not say so explicitly, it has been the
   general expectation that although the fsid is expected to change
   as part of migration (since the fsid space is per-server), the
   boundaries of a filesystem when migrated will be the same as they
   were on the source.  The possibility of splitting an existing
   filesystem into two or more as part of migration can provide
   important additional functionality in a global namespace
   environment.

   When one divides up pieces of a global namespace into convenient-
   sized fs's (to allow their independent assignment to individual
   servers), difficulties will arise over time.  As the sizes of
   directories grow, what was once a convenient set of files,
   embodied as a separate fs, may become inconveniently large.  This
   requires a means to divide it into a new set of pieces which are
   of a convenient size.
   The important point is that while there are many ways to do that
   currently, they are all disruptive.  A method is needed which
   allows this division to occur without disrupting access.

3.2. Additional Attributes

   A small number of additional attributes in V4.1 can provide
   significant additional functionality.  They address the attribute
   continuity issues discussed above and allow more complete
   information about the possible replicas, post-migration locations,
   or referral targets for a given filesystem, allowing the client to
   choose the one most suited to its needs and to handle the
   transition to a new target server more effectively.

   All of the proposed attributes would be defined as validly
   requested when the current filehandle is within an absent
   filesystem, i.e. an attempt to obtain these attributes would not
   result in NFS4ERR_MOVED.  In some cases, it may be optional to
   actually provide the requested attribute information based on the
   presence or absence of the filesystem.  The specifics will be
   discussed under each of the individual attributes.

3.2.1. fs_absent

   In NFSv4.0, fs_locations is the only attribute which, when
   fetched, indicates that the client is aware of the possibility
   that the current filesystem may be absent.  Since fs_locations is
   a complicated attribute and the client may simply want an
   indication of whether the filesystem is present, we propose the
   addition of a boolean attribute named "fs_absent" to provide this
   information simply.

   As noted above, this attribute, when supported, may be requested
   of absent filesystems without causing NFS4ERR_MOVED to be
   returned, and it should always be available.  Servers are strongly
   urged to support this attribute on all filesystems if they support
   it on any filesystem.

3.2.2. fs_location_info

   The fs_location_info attribute is intended as a more functional
   replacement for fs_locations, which will continue to exist and be
   supported.  Clients which need the additional information provided
   by this attribute will interrogate it and get the information from
   servers that support it.  When the server does not support
   fs_location_info, fs_locations can be used to get a subset of the
   information.  A server which supports fs_location_info MUST
   support fs_locations as well.

   There are several sorts of additional information present in
   fs_location_info that aren't available in fs_locations:

   o  Attribute continuity information, to allow a client to select a
      location which meets the transparency requirements of the
      applications accessing the data and to take advantage of
      optimizations that server guarantees as to attribute continuity
      may provide (e.g. for the change attribute).

   o  Filesystem identity information which indicates when multiple
      replicas, from the client's point of view, correspond to the
      same target filesystem, allowing them to be used
      interchangeably, without disruption, as multiple paths to the
      same thing.

   o  Information which will bear on the suitability of various
      replicas, depending on the use that the client intends.  For
      example, many applications need an absolutely up-to-date copy
      (e.g. those that write), while others may only need access to
      the most up-to-date copy reasonably available.

   o  Server-derived preference information for replicas, which can
      be used to implement load-balancing while giving the client the
      entire fs list to be used in case the primary fails.
   Attribute continuity and filesystem identity information define a
   number of identity relations among the various filesystem
   replicas.  Most often, the relevant question for the client will
   be whether a given replica is identical-with/continuous-to the
   current one in a given respect, but the information as to whether
   two other replicas match in that respect should be available as
   well.

   The way in which such pairwise filesystem comparisons are
   relatively compactly encoded is to associate with each replica a
   32-bit integer, the location id.  The fs_location_info attribute
   then contains, for each of the identity relations among replicas,
   a 32-bit mask.  If that mask, when anded with the location ids of
   the two replicas, results in fields which are identical, then the
   two replicas are defined as belonging to the corresponding
   identity relation.  This scheme allows the server to accommodate
   relatively large sets of replicas, distinct according to a given
   criterion, without requiring large amounts of data to be sent for
   each replica.

   Server-specified preference information is also provided in a
   fashion that allows a number of different relations (in this case,
   order relations) to be expressed in a compact way.  In this case,
   each location4_server structure contains a 32-bit priority word
   which can be broken into fields devoted to these relations in any
   way the server wishes.  The location4_info structure contains a
   set of 32-bit masks, one for each relation.  Two replicas can be
   compared via that relation by anding the corresponding mask with
   the priority word for each replica and comparing the results.

   The fs_location_info attribute consists of a root pathname (just
   like fs_locations), together with an array of location4_item
   structures.

      struct location4_server {
              uint32_t        priority;
              uint32_t        flags;
              uint32_t        location_id;
              int32_t         currency;
              utf8str_cis     server;
      };

      const LIF_FHR_OPEN  = 0x00000001;
      const LIF_FHR_ALL   = 0x00000002;
      const LIF_MULTI_FS  = 0x00000004;
      const LIF_WRITABLE  = 0x00000008;
      const LIF_CUR_REQ   = 0x00000010;
      const LIF_ABSENT    = 0x00000020;
      const LIF_GOING     = 0x00000040;
      const LIF_VLCACHE   = 0x00000080;

      struct location4_item {
              location4_server        entries<>;
              pathname4               rootpath;
      };

      struct location4_info {
              pathname4               fs_root;
              location4_item          items<>;
              uint32_t                fileid_keep_mask;
              uint32_t                change_cont_mask;
              uint32_t                same_fh_mask;
              uint32_t                same_state_mask;
              uint32_t                same_fs_mask;
              uint32_t                valid_for;
              uint32_t                read_rank_mask;
              uint32_t                read_order_mask;
              uint32_t                write_rank_mask;
              uint32_t                write_order_mask;
      };

   The fs_location_info attribute is structured similarly to the
   fs_locations attribute.  A top-level structure (fs_locations4 or
   location4_info) contains the entire attribute, including the root
   pathname of the fs and an array of lower-level structures that
   define replicas that share a common root path on their respective
   servers.  Those lower-level structures in turn (fs_location4 or
   location4_item) contain a specific pathname and information on one
   or more individual server replicas.  For that last, lowest-level
   information, fs_locations has a server name in the form of
   utf8str_cis, while fs_location_info has a location4_server
   structure that contains per-server-replica information in addition
   to the server name.
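   Concretely, the pairwise comparison is a mask-and-compare.  The
   following C sketch is illustrative only; the function names are
   not protocol elements:

      #include <stdbool.h>
      #include <stdint.h>

      /*
       * Two replicas belong to a given identity relation exactly
       * when their location ids agree on every bit of that
       * relation's mask.
       */
      static bool
      same_relation(uint32_t mask, uint32_t loc_id_a, uint32_t loc_id_b)
      {
          /* Equal under the mask <=> no differing bits in the mask. */
          return ((loc_id_a ^ loc_id_b) & mask) == 0;
      }

      /* Example: will fileids be preserved across a transition from
       * replica a to replica b?  The mask comes from location4_info
       * (fileid_keep_mask); the ids from each location4_server. */
      static bool
      fileids_preserved(uint32_t fileid_keep_mask,
                        uint32_t loc_id_a, uint32_t loc_id_b)
      {
          return same_relation(fileid_keep_mask, loc_id_a, loc_id_b);
      }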
The location4_server structure consists of the following items: o The priority word is used to implement server-specified ordering relations among replicas. These relations are intended to be used to select a replica when migration (including a referral) occurs, when a server appears to be down, when the server directs the client to find a new replica (see LIF_GOING) and, optionally, when a new filesystem is first entered. See the location4_info fields read_order, read_rank, write_order, and write_rank for details of the ordering relations. o A word of flags providing information about this replica/target. These flags are defined below. o An indication of file system up-to-date-ness (currency) in terms of approximate seconds before the present. A negative value indicates that the server is unable to give any reasonably useful value here. A zero indicates that filesystem is the actual writable data or a reliably coherent and fully up-to-date copy. Positive values indicate how out- of-date this copy can normally be before it is considered for update. Such a value is not a guarantee that such updates will always be performed on the required schedule but instead serve as a hint about how far behind the most up-to-date copy of the data, this copy would normally be expected to be. o A location id for the replica, to be used together with masks in the location4_info structure to determine whether that replica matches other in various respects, as described above. See below (after the mask definitions) for an example of how the location_id can be used to communicate filesystem information. When two location id's are identical, then access to the corresponding replicas are defined as identical in all respects. They access the same filesystem with the same filehandles and share v4 file state. Further, multiple connections to the two replicas may be done as part of the same session. Two such replicas will share a common root path and are best presented within two location4_server entries in a common location4_item. These replicas should have identical values for the currency field although the flags and priority fields may be different. Noveck, Burnett April 2006 [Page 20] Internet-Draft Next Steps for NFSv4 Migration/Replication October 2005 Clients may find it helpful to associate all of the location4_server structures that share a location_id value and treat this set as representing a single fs target. When they do so, they should take proper care to note that priority fields for these may be different and the selection of location4_server needs to reflect rank and order considerations (see below) for the individual entries. o The server string. For the case of the replica currently being accessed (via GETATTR), a null string may be used to indicate the current address is using for the RPC call. The flags field has the following bits defined: o LIF_FHR_OPEN indicates that the server will normally make a replacement filehandle available for files that are open at the time of a filesystem image transition. When this flag is associated with an alternative filesystem instance, the client may get the replacement filehandle to be used on the new filesystem instance from the current server. When this flag is associated with the current filesystem instance, a replacement for filehandles from a previous instance may be obtained on this one. See section 3.2.2, fh_replacement, for details. 
      Because of the possibility of hardware and software failures,
      this is not a guarantee, but when this bit is returned, the
      server should make all reasonable efforts to provide the
      replacement filehandle.

   o  LIF_FHR_ALL indicates that a replacement filehandle will be
      made available for all files when there is a migration event or
      a replica switch.  Like LIF_FHR_OPEN, it may indicate
      replacement availability on the source or the destination, and
      the details are described in section 3.2.3.

   o  LIF_MULTI_FS indicates that when a transition occurs from the
      current filesystem instance to this one, the replacement may
      consist of multiple filesystems.  In this case, the client has
      to be prepared for the possibility that objects on the same fs
      before migration will be on different ones after.  Note that
      LIF_MULTI_FS is not incompatible with the two filesystems
      agreeing with respect to the fileid-keep mask since, if one has
      a set of fileid's that are unique within an fs, each subset
      assigned to a smaller fs after migration would not have any
      conflicts internal to that fs.

      A client, in the case of a split filesystem, will interrogate
      existing files with which it has a continuing connection (it is
      free simply to forget cached filehandles).  If the client
      remembers the directory filehandle associated with each open
      file, it may proceed upward using LOOKUPP to find the new fs
      boundaries.

      Once the client recognizes that one filesystem has been split
      into two, it could maintain applications running without
      disruption by presenting the two filesystems as a single one
      until a convenient point at which to recognize the transition,
      such as a reboot.  This would require a mapping from the
      server's fsids to the fsids as seen by the client, but this is
      already necessary for other reasons anyway.  As noted above,
      existing fileids within the two descendant fs's will not
      conflict.  Creation of new files in the two descendant fs's may
      require some amount of fileid mapping, which can be performed
      very simply in many important cases.

   o  LIF_WRITABLE indicates that this fs target is writable,
      allowing it to be selected by clients which may need to write
      on this filesystem.  When the current filesystem instance is
      writable, then any other filesystem to which the client might
      switch must incorporate within its data any committed write
      made on the current filesystem instance.  See below, in the
      section on the same-fs mask, for issues related to uncommitted
      writes.  While there is no harm in not setting this flag for a
      filesystem that turns out to be writable, turning the flag on
      for a read-only filesystem can cause problems for clients who
      select a migration or replication target based on it and then
      find themselves unable to write.

   o  LIF_VLCACHE indicates that this replica is a cached copy, where
      the measured latency of operations may differ very
      significantly depending on the particular data requested, in
      that already-cached data may be provided with very low latency
      while other data may require transfer from a distant source.

   o  LIF_CUR_REQ indicates that this replica is the one on which the
      request is being made.  Only a single server entry may have
      this flag set and, in the case of a referral, no entry will
      have it.

   o  LIF_ABSENT indicates that this entry corresponds to an absent
      filesystem replica.  It can only be set if LIF_CUR_REQ is set.
      When both bits are set, this indicates that the filesystem
      instance is not usable but that the information in the entry
      can be used to determine the sorts of continuity available when
      switching from this replica to other possible replicas.  Since
      this bit can only be true if LIF_CUR_REQ is true, the value
      could be determined using the fs_absent attribute, but the
      information is also made available here for the convenience of
      the client.  An entry with this bit set, since it represents a
      true filesystem (albeit absent), does not appear in the event
      of a referral, but only where a filesystem has been accessed at
      this location and subsequently migrated.

   o  LIF_GOING indicates that a replica, while still available,
      should not be used further.  The client, if using it, should
      make an orderly transfer to another filesystem instance as
      expeditiously as possible.  It is expected that filesystems
      going out of service will be announced as LIF_GOING some time
      before the actual loss of service, and that the valid_for value
      will be sufficiently small to allow clients to detect and act
      on scheduled events, while large enough that the cost of the
      requests to fetch the fs_location_info values will not be
      excessive.  Values on the order of ten minutes seem reasonable.

   The location4_item structure, analogous to an fs_location4
   structure, specifies the root pathname shared by an array of
   server replica entries.

   The location4_info structure, encoding the fs_location_info
   attribute, contains the following:

   o  The fs_root field, which contains the pathname of the root of
      the current filesystem on the current server, just as it does
      in the fs_locations4 structure.

   o  An array of location4_item structures, which contain
      information about replicas of the current filesystem.  Where
      the current filesystem is actually present, or has been
      present, i.e. this is not a referral situation, one of the
      location4_item structures will contain a location4_server entry
      for the current server.  This entry will have LIF_ABSENT set if
      the current filesystem is absent, i.e. normal access to it will
      return NFS4ERR_MOVED.

   o  The fileid-keep mask indicates, in combination with the
      appropriate location ids, that fileids will not change (i.e.
      they will be reliably maintained with no lack of continuity)
      across a transition between the two filesystem instances,
      whether by migration or a replica transition.  This allows a
      transition to occur safely, without any chance that
      applications that depend on fileids will be impacted.
In particular, when an open-reclaim is not available and the file is re-opened, a check for an unexpected change in the change attribute must be done. o The same-fh mask indicates, in combination with the appropriate location ids, whether two replicas will have the same fh's for corresponding objects. When this is true, both filesystems must have the same filehandle expiration type. When this is true and that type is persistent, those filehandles may be used across a migration event, without disruption. o The same-state mask indicates, in combination with the appropriate location ids, whether two replicas will have the same state environment. This does not necessarily mean that when performing migration, the client will not have to reclaim state. However it does mean that the client may proceed using his current clientid just as if there were no migration event and only reclaim state when an NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID error is received. Filesystems marked as having the same state should also have same filehandles. In other words the same-fh mask should be a subset (not necessarily proper) of the same-state mask. o The same-fs mask indicates, in combination with the appropriate location ids, whether two replicas in fact designate the same filesystem in all respects. If so, any action taken on one is immediately on the other and the client can consider them as effectively the same thing. The same-fs mask must include all bits in the same-fh mask, the change-cont mask, and same-state mask. Thus, filesystem instances marked as same-fs must also share state, have the same filehandles, and be change continuous. These considerations imply that a transition can occur with no Noveck, Burnett April 2006 [Page 24] Internet-Draft Next Steps for NFSv4 Migration/Replication October 2005 application disruption and no significant client work to update state related to the filesystem. When the same-fs mask indicates two filesystems are the same the clients are entitled to assume that there will also be no significant delay for the server to re-establish its state to effectively support the client. Where same-fs is not true and the other constituent continuity indication are true (fileid- keep, change-cont, same-fh), there may be significant delay under some circumstances, in line with the fact that the filesystems are being represented as being carefully kept in complete synchronization yet they are not the same. When two filesystems on separate servers have location ids which match on all the bits within the same-fs mask, clients should present the same nfs_client_id to both with the expectation the servers may be able to generate a shared clientid to be used when communicating with either. Such servers are expected to co-ordinate at least to the degree that they will not provide the same clientid to a client while not actually sharing the underlying state data. In handling of uncommitted writes, two servers with any pair of filesystems having the same-fs relation, write verifiers must be sufficiently unique that a client switching between the servers can determine whether previous async writes need to be reissued. This is unlike the general case of filesystems not bearing this relation, in which it must be assumed that asynchronous writes will be lost across a filesystem transition. 
      When two replicas' location ids match on all the bits within
      the same-fs mask but are not identical, a client using sessions
      will establish separate sessions to each, which together share
      any such common clientid.

   o  The valid_for field specifies a time for which it is reasonable
      for a client to use the fs_location_info attribute without
      refetch.  The valid_for value does not provide a guarantee of
      validity, since servers can unexpectedly go out of service or
      become inaccessible for any number of reasons.  Clients are
      well-advised to refetch this information for an actively
      accessed filesystem every valid_for seconds.  This is
      particularly important when filesystem replicas may go out of
      service in a controlled way, using the LIF_GOING flag to
      communicate an ongoing change.  The server should set valid_for
      to a value which allows well-behaved clients to notice the
      LIF_GOING flag and make an orderly switch before the loss of
      service becomes effective.  If this value is zero, then no
      refetch interval is appropriate and the client need not refetch
      this data on any particular schedule.

      In the event of a transition to a new filesystem instance, a
      new value of the fs_location_info attribute will be fetched at
      the destination, and it is to be expected that this may have a
      different valid_for value, which the client should then use in
      the same fashion as the previous value.

   o  The read-rank, read-order, write-rank, and write-order masks
      are used, together with the priority words of the various
      replicas, to order the replicas according to the server's
      preference.  (A sketch of this selection logic in code follows
      this list.)  See the discussion below for the interaction of
      rank, order, and the client's own preferences and needs.

      Read-rank and read-order are used to direct clients which only
      need read access, while write-rank and write-order are used to
      direct clients that require some degree of write access to the
      filesystem.  Depending on the potential need for write access
      by a given client, one of the pairs of rank and order masks is
      used, together with the priority words, to determine a rank and
      an order for each instance under consideration.  The read rank
      and order should only be used if the client knows that only
      reading will ever be done, or if it is prepared to switch to a
      different replica in the event that any write access capability
      is required in the future.

      The rank is obtained by anding the selected rank mask with the
      priority word, and the order is obtained similarly by anding
      the selected order mask with the priority word.  The resulting
      rank and order are compared as described below, with lower
      always being better (more preferred).

      Rank is used to express a strict server-imposed ordering on
      clients, with lower values indicating "more preferred".
      Clients should attempt to use all replicas with a given rank
      before they use one with a higher rank.  Only if all of those
      servers are unavailable should the client proceed to servers of
      a higher rank.

      Within a rank, the order value is used to specify the server's
      preference, to guide the client's selection when the client's
      own preferences are not controlling, with lower values of order
      indicating "more preferred".  If replicas are approximately
      equal in all respects, clients should defer to the order
      specified by the server.  When clients look at server latency
      as part of their selection, they are free to use this
      criterion, but it is suggested that when latency differences
      are not significant, the server-specified order should guide
      selection.
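   The rank/order selection just described might look as follows in
   C.  This is an illustrative sketch of one reasonable client
   policy, not prescribed behavior; it returns the first replica a
   client should try, given the priority words of the candidates and
   the rank and order masks appropriate to the intended access.

      #include <stddef.h>
      #include <stdint.h>

      /* Lower (rank, order) is better; rank dominates order. */
      static int better(uint32_t pri_a, uint32_t pri_b,
                        uint32_t rank_mask, uint32_t order_mask)
      {
          uint32_t rank_a = pri_a & rank_mask;
          uint32_t rank_b = pri_b & rank_mask;

          if (rank_a != rank_b)
              return rank_a < rank_b;
          return (pri_a & order_mask) < (pri_b & order_mask);
      }

      /* Index of the most-preferred usable candidate, or -1 if none.
       * 'priority' holds the location4_server priority words and
       * 'usable' flags the entries the client considers acceptable. */
      int pick_replica(const uint32_t *priority, const int *usable,
                       size_t n, uint32_t rank_mask, uint32_t order_mask)
      {
          int best = -1;

          for (size_t i = 0; i < n; i++) {
              if (!usable[i])
                  continue;
              if (best < 0 ||
                  better(priority[i], priority[best],
                         rank_mask, order_mask))
                  best = (int)i;
          }
          return best;
      }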
When clients look at server latency as part of their selection, they are free to use this criterion, but it is suggested that when latency differences are not significant, the server-specified order should guide selection.

The server may configure the rank and order masks to considerably simplify the decisions if it so chooses. For example, if the read vs. write distinction is not to be important in the selection process, then the location4_info should be one in which the read-rank and write-rank masks, and the read-order and write-order masks, are equal. If the server wishes to totally direct the process via rank, leaving no room for client choice, it may simply set the write-order mask and the read-order mask to zero. Conversely, if it wishes to give general preferences with more scope for client choice, it may set the read-rank mask and the write-rank mask to zero. A server may even set all the masks to zero and allow the client to make its own choices. The protocol allows multiple policies to be used as found appropriate.

The use of location ids together with the masks in the location4_info structure can be illustrated by an example. Suppose one has the following sets of servers:

o Server A with four IP addresses A1 through A4.

o Servers B, C, D sharing a cluster filesystem with A and each having four IP addresses, B1, B2, ... D3, D4.

o A point-in-time copy of the filesystem created using image copy, which shares filehandles and is change-attribute continuous with the filesystem on A-D, and which has two IP addresses X1 and X2.

o A point-in-time copy of the filesystem which was created at a higher level, but which shares fileids with the one on A-D and is accessed (via a clustered filesystem) by servers Ya and Yb.

o A copy of the filesystem made by simple user-level copy tools, which is served from server Z.

Given the above, one way of presenting these relationships is to assign the following location ids:

o A1-4 would get 0x1111
o B1-4 would get 0x1112
o C1-4 would get 0x1113
o D1-4 would get 0x1114
o X1-2 would get 0x1125
o Ya would get 0x1236
o Yb would get 0x1237
o Z would get 0x2348

And then the following mask values would be used:

o The same-fs and same-state masks would both be 0xfff0.
o The same-fh and change-cont masks would both be 0xff00.
o The fileid-keep mask would be 0xf000.

(A short sketch at the end of this section illustrates these relationships in code.)

This scheme allows the number of bits devoted to various kinds of similarity classes to be adjusted as needed with no change to the protocol. The total of thirty-two bits is expected to suffice indefinitely.

As noted above, the fs_location_info attribute, when supported, may be requested of absent filesystems without causing NFS4ERR_MOVED to be returned, and it is generally expected that it will be available for both present and absent filesystems, even if only a single location_server entry is present, designating the current (present) filesystem, or two location_server entries are present, designating the current (and now previous) location of an absent filesystem and its successor location. Servers are strongly urged to support this attribute on all filesystems if they support it on any filesystem.
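Tying the example together, this short C check (illustrative only; the id and mask values are copied from the example above) confirms which similarity classes the assigned location ids and masks produce.

    #include <stdint.h>
    #include <stdio.h>

    /* Replicas share a property when their location ids agree on all
     * bits of the property's mask (see the same_class() sketch above). */
    static int same(uint32_t a, uint32_t b, uint32_t mask)
    {
        return (a & mask) == (b & mask);
    }

    int main(void)
    {
        uint32_t A = 0x1111, X = 0x1125, Ya = 0x1236, Z = 0x2348;
        uint32_t same_fs = 0xfff0, same_fh = 0xff00, fileid_keep = 0xf000;

        printf("A/X  same-fh:     %d\n", same(A, X, same_fh));      /* 1 */
        printf("A/X  same-fs:     %d\n", same(A, X, same_fs));      /* 0 */
        printf("A/Ya fileid-keep: %d\n", same(A, Ya, fileid_keep)); /* 1 */
        printf("A/Ya same-fh:     %d\n", same(A, Ya, same_fh));     /* 0 */
        printf("A/Z  fileid-keep: %d\n", same(A, Z, fileid_keep));  /* 0 */
        return 0;
    }

The output matches the described relationships: X shares filehandles but not state with A-D, Ya shares only fileids, and the user-level copy on Z shares nothing.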
3.2.3. fh_replacement

The fh_replacement attribute provides a way of supplying a substitute filehandle to be used on a target server when a migration event or other fs instance switching event occurs. This provides an alternative to maintaining access via the existing persistent filehandle (which may be difficult) or using volatile filehandles (which will not give the correct result in all cases).

When a migration event occurs, information on the new location (or location choices) will be available via the fs_location_info attribute applied to any filehandle within the source filesystem. When LIF_FHR_OPEN or LIF_FHR_ALL is present, the fh_replacement attribute may be used to get the corresponding filehandle for filehandles that the client has accessed.

Similarly, after such an event, when the fs_location_info attribute is fetched on the new server, LIF_FHR_OPEN or LIF_FHR_ALL may be present in the server entry corresponding to the current filesystem instance. In this case, the fh_replacement attribute can be used to get the new filehandles corresponding to each of the now-outdated filehandles on the previous instance. In either of these ways, the client may be assured of a consistent mapping from old to new filehandles without relying on a purely name-based mapping, which in some cases will not be correct.

The choice of providing replacement on the source filesystem instance or on the target will normally be based on which server has the proper mapping. Generally, when the image is created by a push from the source, the source server naturally has the appropriate filehandles corresponding to its files and can provide them to the client. When the image transfer is done via a pull, the target server will be aware of the source filehandles and can provide the appropriate mapping when the client requests it. Note that the target server can only provide replacement filehandles if it can assure filehandle uniqueness, i.e. that filehandles from the source do not conflict with valid filehandles on the destination server. In the case where such uniqueness can be assured, source filehandles can be accepted for the purpose of providing replacements, with NFS4ERR_FHEXPIRED returned for any use other than interrogation of the fh_replacement attribute via GETATTR.

Replacement filehandles for multiple migration targets may be provided via multiple fhrep4_entry values. Each fhrep4_entry provides a replacement filehandle applying to all targets whose location id, when ANDed with the same-fh mask (from the fs_location_info attribute), matches the location_set value in the fhrep4_entry. Such a set of replicas shares the same filehandles, and thus a single entry can provide replacement filehandles for all of the members. Note that the location_set value will match that of the current filesystem instance only when the client presents a filehandle from the previous filesystem instance and the target filesystem provides its own replacement filehandles.

     union fhrep4_entry switch (bool present) {
     case TRUE:
             uint32_t        location_set;
             nfs_fh4         replacement;
     case FALSE:
             void;
     };

     struct fh4_replacement {
             fhrep4_entry    entries<>;
     };
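The matching rule above can be sketched in C as follows, assuming (illustratively) that the entries and the same-fh mask have already been fetched via GETATTR; the types here are simplified stand-ins for the XDR above, not protocol definitions.

    #include <stddef.h>
    #include <stdint.h>

    /* Simplified stand-ins for the XDR definitions above. */
    struct fhrep_entry {
        int            present;      /* the bool 'present' union arm */
        uint32_t       location_set;
        const uint8_t *replacement;  /* opaque filehandle bytes */
    };

    /* Find the replacement filehandle for a target replica: an entry
     * applies when the target's location id, ANDed with the same-fh
     * mask from fs_location_info, equals the entry's location_set. */
    static const uint8_t *find_replacement(const struct fhrep_entry *entries,
                                           size_t n_entries,
                                           uint32_t target_location_id,
                                           uint32_t same_fh_mask)
    {
        for (size_t i = 0; i < n_entries; i++) {
            if (entries[i].present &&
                (target_location_id & same_fh_mask) ==
                    entries[i].location_set)
                return entries[i].replacement;
        }
        return NULL;  /* no replacement available for this target */
    }

A client would call find_replacement() once per old filehandle, with the location id of whichever target it has selected, to build its old-to-new filehandle map.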
When a filesystem becomes absent, the server, in responding to requests for the fh_replacement attribute, is not required to validate all fields of the filehandle if it does not maintain per-file information. This matches the current handling of fs_locations (and applies as well to fs_location_info). For example, if a server has an fsid field within its filehandle implementation, it may simply recognize that value and return filehandles with the corresponding new fsid without validating other information within the handle. This can result in a filesystem accepting a filehandle which under other circumstances might result in NFS4ERR_STALE, just as it can when interrogating the fs_locations or fs_location_info attributes. Note that when it does so, it will return a replacement which, when presented to the new filesystem, will get an NFS4ERR_STALE there.

Use of the fh_replacement attribute can allow wholesale change of filehandles to implement storage re-organization, even within the context of a single server. If NFS4ERR_MOVED is returned, the client will fetch fs_location_info, which may refer to a location on the original server. Use of fh_replacement in this context allows a new set of filehandles to be established as part of storage reconfiguration (including possibly a split into multiple fs's) without requiring the client to maintain name information against the possibility of such a reconfiguration (as it must for volatile filehandles).

Servers are not required to maintain the availability of replacement filehandles for any particular length of time, but in order to maintain continuity of access in the face of network disruptions, servers should generally maintain the mapping from the pre-replacement filehandles persistently across server reboots, and for a considerable time. It should be the case that, even under severe network disruption, any client that received pre-replacement filehandles is given an opportunity to obtain the replacements. When this mapping is no longer made available, the pre-replacement filehandles should not be re-used, just as is the case for any other superseded filehandle.

As noted above, this attribute, when supported, may be requested of absent filesystems without causing NFS4ERR_MOVED to be returned, and it should always be available. When it is requested and the attribute is supported, if no replacement filehandle information is present, either because the filesystem is still present and there is no migration event or because there are currently no replacement filehandles available, a zero-length array of fhrep4_entry structures should be returned.

3.2.4. fs_status

In an environment in which multiple copies of the same basic set of data are available, information regarding the particular source of such data and the relationships among different copies can be very helpful in providing consistent data to applications.

     enum status4_type {
             STATUS4_FIXED = 1,
             STATUS4_UPDATED = 2,
             STATUS4_VERSIONED = 3,
             STATUS4_WRITABLE = 4,
             STATUS4_ABSENT = 5
     };

     struct fs4_status {
             status4_type    type;
             utf8str_cs      source;
             utf8str_cs      current;
             nfstime4        version;
     };

The type value indicates the kind of filesystem image represented. This is of particular importance when using the version values to determine the appropriate succession of filesystem images. Five types are distinguished:

o STATUS4_FIXED, which indicates a read-only image in the sense that it will never change. The possibility is allowed that, as a result of migration or a switch to a different image, changed data can be accessed, but within the confines of this instance no change is allowed. The client can use this fact to cache aggressively.
o STATUS4_UPDATED, which indicates an image that cannot be updated by the user writing to it but may be changed exogenously, typically because it is a periodically updated copy of another writable filesystem somewhere else.

o STATUS4_VERSIONED, which indicates that the image, like the STATUS4_UPDATED case, is updated exogenously, but with a guarantee that the server will carefully update the associated version value, so that the client may, if it chooses, protect itself from a situation in which it reads data from one version of the filesystem and then later reads data from an earlier version of the same filesystem. See below for a discussion of how this can be done.

o STATUS4_WRITABLE, which indicates that the filesystem is an actual writable one. The client need not, of course, actually write to the filesystem, but once it does, it should not accept a transition to anything other than a writable instance of that same filesystem.

o STATUS4_ABSENT, which indicates that the information is the last valid information for a filesystem which is no longer present.

The opaque strings source and current provide a way of presenting information about the source of the filesystem image being presented. It is not intended that the client do anything with this information other than make it available to administrative tools. It is intended that this information be helpful when researching possible problems with a filesystem image that might arise when it is unclear whether the correct image is being accessed and, if not, how that image came to be made. This kind of debugging information will be helpful if, as seems likely, copies of filesystems are made in many different ways (e.g. simple user-level copies, filesystem-level point-in-time copies, cloning of the underlying storage), under a variety of administrative arrangements. In such environments, determining how a given set of data was constructed can be very helpful in resolving problems.

The opaque string 'source' is used to indicate the source of a given filesystem, with the expectation that tools capable of creating a filesystem image propagate this information when that is possible. It is understood that this may not always be possible, since a user-level copy may be thought of as creating a new data set, and the tools used may have no mechanism to propagate this data. When a filesystem is initially created, data regarding how, where, and by whom it was created can be put in this attribute in human-readable string form, so that it will be available when propagated to subsequent copies of the data.

The opaque string 'current' should provide whatever information is available about the source of the current copy, such as the tool that created it, any relevant parameters to that tool, the time at which the copy was made, the user making the change, and the server on which the change was made. All such information should be in human-readable string form.

The version field provides a version identification, in the form of a time value, such that successive versions always have later time values. When the filesystem type is anything other than STATUS4_VERSIONED, the server may provide such a value, but there is no guarantee as to its validity, and clients will not use it except to provide additional information to add to 'source' and 'current'. When the type is STATUS4_VERSIONED, servers should provide a value of version which progresses monotonically whenever any new version of the data is established. This allows the client, if reliable image progression is important to it, to fetch this attribute as part of each COMPOUND where data or metadata from the filesystem is used.

When it is important to the client to make sure that only valid successor images are accepted, it must make sure that it does not read data or metadata from the filesystem without updating its sense of the current state of the image. This avoids the possibility that the fs_status which the client holds will be one for an earlier image, leading it to accept a new filesystem instance which is later than that but still earlier than data the client has already read. In order to do this reliably, it must do a GETATTR of fs_status that follows any interrogation of data or metadata within the filesystem in question. Often this is most conveniently done by appending such a GETATTR after all other operations that reference a given filesystem. When errors occur between reading filesystem data and performing such a GETATTR, care must be exercised to make sure that the data in question is not used before the proper fs_status value is obtained. In this connection, when an OPEN is done within such a versioned filesystem and the associated GETATTR of fs_status is not successfully completed, the open file in question must not be accessed until that fs_status is fetched.

The procedure above will ensure that, before using any data from the filesystem, the client has in hand a newly fetched current version of the filesystem image. Multiple values from multiple requests in flight can be resolved by assembling them into the required partial order (the elements should form a total order within it) and using the last. The client may then, when switching among filesystem instances, decline to use an instance which is not of type STATUS4_VERSIONED or whose version field is earlier than the last one obtained from the predecessor filesystem instance.
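A minimal client-side sketch of this acceptance check follows, in C; the structures mirror fs4_status in simplified form, and the helper names and the representation of nfstime4 as seconds/nanoseconds fields are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified stand-ins for nfstime4 and fs4_status. */
    struct nfstime {
        int64_t  seconds;
        uint32_t nseconds;
    };

    enum status_type {
        STATUS_FIXED = 1, STATUS_UPDATED = 2, STATUS_VERSIONED = 3,
        STATUS_WRITABLE = 4, STATUS_ABSENT = 5
    };

    struct fs_status {
        enum status_type type;
        struct nfstime   version;
    };

    static bool time_before(struct nfstime a, struct nfstime b)
    {
        return a.seconds < b.seconds ||
               (a.seconds == b.seconds && a.nseconds < b.nseconds);
    }

    /* Accept a candidate successor instance only if it is versioned
     * and its version is not earlier than the last version observed
     * on the predecessor instance. */
    static bool acceptable_successor(const struct fs_status *candidate,
                                     struct nfstime last_seen_version)
    {
        return candidate->type == STATUS_VERSIONED &&
               !time_before(candidate->version, last_seen_version);
    }

Here last_seen_version would be the maximum version assembled from the in-flight responses, as described above, and acceptable_successor() would gate any switch to a new filesystem instance.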
4. Migration Protocol

As discussed above, it has always been anticipated that a migration protocol would be developed to address the issue of migration of a filesystem between different filesystem implementations. This need remains, and it can be expected that, as client implementations of migration become more common, it will become more pressing; the working group needs to consider seriously how that need may best be addressed.

We suggest that the working group should seriously consider what may be a significantly lighter-weight alternative: the addition of features to support server-to-server migration within NFSv4 itself, taking advantage of existing NFSv4 facilities and adding only the features needed to support efficient migration, as items within a minor version.

One thing that needs to be made clear is that a common migration protocol does not mean a common migration approach or common migration functionality.
Thus the need for the kinds of information provided by fs_location_info remains. For example, the fact that the migration protocol will make available on the target the fileid, filehandle, and change attribute from the source does not mean that the receiving server can store these values natively, or that it will choose to implement translation support to accommodate the values exported by the source. This will remain an implementation choice. Clients will need information about those various choices, such as would be provided by fs_location_info, in order to deal with the various implementations.

4.1. NFSv4.x as a Migration Protocol

Whether the following approach or any other is adopted, considerable work will still be required to flesh out the details, requiring a number of drafts for a problem statement, initial protocol spec, etc. But to give an idea of what would be involved in this kind of approach, a rough sketch is given below.

First, let us fix for the moment on a pull model, in which the target server, selected by a management application, pulls data from the source using NFSv4.x. The target server acts as a client, albeit a specially privileged one, to copy the existing data.

The first point to be made is that using NFSv4 means that we have a representation for all data that is representable within NFSv4, and that this representation is maintained automatically as minor versioning proceeds. That is, when attributes are added in a minor version of NFSv4, they are "automatically" added to the migration copy protocol, because the two are the same.

The presence of COMPOUND is a further help, in that implementations will be able to maintain high throughput when copying without creating a special protocol devoted to that purpose. For example, when copying a large set of small files, these files can all be read with a single COMPOUND. This means that the benefit of creating a stream format for the entire fs is much reduced, and existing servers (with small modifications) can simply support the kinds of access they have to support anyway. The servers acting as clients would probably use a non-standard implementation, but they would share much infrastructure with more standard clients, so this would probably be a win on the implementation side as well as on the specification side.

One other point is that if the migration protocol were in fact an NFSv4.x, NFSv4 developments such as pNFS would be available for high-performance migration, with no special effort.

Clearly, there is still considerable work to be done here, even if it is not of the same order as a new protocol. The working group needs to discuss this and see if there is agreement that a means of cross-server migration is worthwhile and whether this is the best way to get there. Here is a basic list of things that would have to be dealt with to effect a transfer:

o Reads without changing access times. This is probably best done as a per-session attribute (it is best to assume sessions here).

o Reads that ignore share reservations and mandatory locks. It may be that the existing all-ones special stateid is adequate.

o A way to obtain the locking state information for the source fs: the locks (byte-range and share reservations) for that fs, including associated stateids and owner opaque strings, clientids, and the other identifying client information for all clients with locks on that fs.
This is all protocol-defined, rather than implementation-specific, data.

o A way to lock out changes on a filesystem. This would be similar to a read delegation on the entire filesystem, but would have a greater degree of privilege, in that the holder would be allowed to keep it as long as its lease was renewed.

o A way to permanently terminate existing access to the filesystem (by everyone except the calling session) and report it as MOVED to the users.

Conventions as to appropriate security for such operations would have to be developed to assure interoperability, but this is a question of establishing conventions rather than defining new mechanisms.

Given the facilities above, one could get an initial image of a filesystem, and then rescan and update the destination until the amount of change to be propagated stabilized. At this point, changes could be locked out and a final set of updates propagated while read-only access to the filesystem continued. At that point, further access would be locked out, and the locking state and any final changes to access time would be propagated. The access-time scan would be manageable, since the client could issue long COMPOUNDs with many PUTFH-GETATTR pairs, and many such requests could be in flight at a time (see the sketch at the end of this section).

If it were required that the disruption to access be smaller, some small additions to the functionality might be quite effective:

o Notifications for a filesystem, perhaps building on the notifications proposed in the directory delegations document, would limit the rescanning for changes, and so would make the window in which additional changes could happen much smaller. This would greatly reduce the window in which write access would have to be locked out.

o A facility for global scans for attribute changes could help reduce lockout periods. Something that gave a list of object filehandles that met a given attribute search criterion (e.g. attribute x greater than, less than, or equal to some value) could reduce rescan update times and also rescan times for access-time updates.

These lists assume that the server initiating the transfer is doing its own writing to disk. Extending this to writing the new fs via NFSv4 would require further protocol support. The basic message for the working group is that the set of things to do is of moderate size and builds in large part on existing or already proposed facilities.
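To illustrate the batching pattern mentioned above, here is a rough C sketch of assembling a COMPOUND of PUTFH-GETATTR pairs for an access-time scan. The op tags and builder interface are hypothetical stand-ins, not actual NFSv4 library calls; real code would XDR-encode each operation and its arguments.

    #include <stdio.h>

    /* Hypothetical op tags standing in for XDR-encoded NFSv4 ops. */
    enum op { OP_PUTFH, OP_GETATTR };

    #define MAX_OPS 1024

    struct compound {
        enum op ops[MAX_OPS];
        int     n;
    };

    static int add_op(struct compound *c, enum op o)
    {
        if (c->n >= MAX_OPS)
            return -1;      /* COMPOUND full; caller starts a new one */
        c->ops[c->n++] = o;
        return 0;
    }

    /* Batch one PUTFH-GETATTR pair per file into a single COMPOUND;
     * a migrating server acting as client would keep several such
     * COMPOUNDs in flight at once. */
    int main(void)
    {
        struct compound c = { .n = 0 };
        int files = 400;
        for (int i = 0; i < files; i++) {
            if (add_op(&c, OP_PUTFH) < 0 || add_op(&c, OP_GETATTR) < 0)
                break;      /* send this COMPOUND, begin the next */
        }
        printf("built COMPOUND with %d ops (%d files)\n", c.n, c.n / 2);
        return 0;
    }

The point of the sketch is only the shape of the request: many per-file op pairs amortized over one round trip, which is what makes the access-time scan tractable without a special-purpose protocol.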
Acknowledgements

The authors wish to thank Ted Anderson and Jon Haswell for their contributions to the ideas within this document.

Authors' Addresses

David Noveck
Network Appliance, Inc.
375 Totten Pond Road
Waltham, MA 02451 USA

Phone: +1 781 768 5347
EMail: dnoveck@netapp.com

Rodney C. Burnett
IBM, Inc.
13001 Trailwood Rd
Austin, TX 78727 USA

Phone: +1 512 838 8498
EMail: cburnett@us.ibm.com

Full Copyright Statement

Copyright (C) The Internet Society (2005).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification, can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgement

Funding for the RFC Editor function is currently provided by the Internet Society.