INTERNET-DRAFT                                            David Noveck
Expires: April 2006                            Network Appliance, Inc.
                                                     Rodney C. Burnett
                                                              IBM, Inc.
                                                           October 2005

              Next Steps for NFSv4 Migration/Replication
                    draft-noveck-nfsv4-migrep-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2005).  All Rights Reserved.

Abstract

   The fs_locations attribute in NFSv4 provides support for fs
   migration, replication and referral.  Given the current work on
   supporting these features, and new needs such as support for a
   global namespace, it is time to look at this area and see what
   further development of this protocol area may be required.  This
   document makes suggestions for the further development of these
   features in NFSv4.1 and also presents ideas for work that might be
   done as part of future minor versions.

Table of Contents

   1. Introduction
      1.1. History
      1.2. Areas to be Addressed
   2. Clarifications/Corrections to V4.0 Functionality
      2.1. Attributes Returned by GETATTR and READDIR
           2.1.1. fsid
           2.1.2. mounted_on_fileid
           2.1.3. fileid
           2.1.4. filehandle
      2.2. Issues with the Error NFS4ERR_MOVED
           2.2.1. Issue of when to check current filehandle
           2.2.2. Issue of GETFH
           2.2.3. Handling of PUTFH
           2.2.4. Inconsistent handling of GETATTR
           2.2.5. Ops not allowed to return NFS4ERR_MOVED
           2.2.6. Summary of NFS4ERR_MOVED
      2.3. Issues of Incomplete Attribute Sets
           2.3.1. Handling of attributes for READDIR
      2.4. Referral Issues
           2.4.1. Editorial Changes Related to Referrals
   3. Feature Extensions
      3.1. Attribute Continuity
           3.1.1. filehandle
           3.1.2. fileid
           3.1.3. change attribute
           3.1.4. fsid
      3.2. Additional Attributes
           3.2.1. fs_absent
           3.2.2. fs_location_info
           3.2.3. fh_replacement
           3.2.4. fs_status
   4. Migration Protocol
      4.1. NFSv4.x as a Migration Protocol
   Acknowledgements
   Normative References
   Informative References
   Authors' Addresses
   Full Copyright Statement

1. Introduction

1.1. History

   When the fs_locations attribute was introduced, it was done with
   the expectation that a server-to-server migration protocol was in
   the offing.  Including the fs_locations-related features provided
   client support that could be used once such a protocol was
   developed, and that could in the meantime support vendor-specific
   migration between homogeneous servers.

   As things happened, development of a server-to-server migration
   protocol stalled.  In part, this was due to the demands of NFSv4
   implementation itself.  Also, until V4 clients which supported
   these features were widely deployed, it was hard to justify the
   long-term effort for a new server-to-server protocol.

   Now that serious implementation work has begun, a number of issues
   have been discovered with the treatment of these features in
   RFC3530.  There are no significant protocol bugs, but there are
   numerous cases in which the text is unclear or contradictory on
   significant points.  Also, a number of suggestions have been made
   regarding small things left undone in the original specification,
   leading to the question of whether it is now an appropriate time
   to rectify those inadequacies.

   Another important development has been the idea of referrals.
   Referrals, a limiting case of migration, were not recognized when
   the spec was written, even though the protocol defined therein
   does support them.  See [referrals] for an explanation of
   referrals implementation.  Also, it has turned out that referrals
   are an important building-block for the development of a global
   namespace for NFSv4.

1.2. Areas to be Addressed

   This document is motivated in large part by the opportunity
   represented by NFSv4.1.  First, this will provide a way to revise
   the treatment of these features in the spec, to make it clearer,
   to avoid ambiguities and contradictions, and to incorporate
   explicit discussion of referrals into the text.

   NFSv4.1 also affords the opportunity to provide small extensions
   to these facilities, to make them more generally useful, in
   particular in environments in which migration between servers of
   different types is to be performed.  Use of these features in a
   global-namespace environment will also motivate certain
   extensions.

   The remaining issue in this area is the development of a vendor-
   independent migration mechanism.  This is definitely not something
   that can be done immediately (i.e. in v4.1), but the working group
   needs to figure out when this effort can be revived.  This
   document will examine a somewhat lower-overhead alternative to
   development of a separate server-to-server migration protocol.
   The alternative that will be explored is the use of NFSv4 itself,
   with a small set of additions, by a server operating as an NFSv4
   client to either pull or push file system state to or from another
   server.  It seems that this sort of incremental development can
   provide a more efficient path to a migration mechanism than
   development of a new protocol that would inevitably duplicate a
   lot of NFSv4.  Since NFSv4 must have the general ability to
   represent fs state that is accessible via NFSv4, using the core
   protocol as the base, and adding only the extensions needed to do
   data transfer efficiently and to transfer locking state, should be
   more efficient in terms of design time.  The needed extensions
   could be introduced within a minor version.  It is not proposed or
   expected that these extensions would be in NFSv4.1.

2. Clarifications/Corrections to V4.0 Functionality

   All of the sub-sections below deal with the basic functionality
   described, explicitly or implicitly, in RFC3530.  While the
   majority of the material is simply corrections, clarifications,
   and the resolution of ambiguities, in some cases there is cleanup
   to make things more consistent in v4.1, without adding any new
   functionality.  Functional changes are addressed in separate
   sections.

2.1. Attributes Returned by GETATTR and READDIR

   While RFC3530 allows the server to return attributes in addition
   to fs_locations when GETATTR is used with a current filehandle
   within an absent filesystem, not much guidance is given to help
   clarify what is appropriate.  Such vagueness can result in serious
   interoperability issues.  Instead of simply allowing an undefined
   set of attributes to be returned, the NFSv4.1 spec should clearly
   define the circumstances under which attributes for absent
   filesystems are to be returned.

   While some leeway may be necessary to accommodate different
   NFSv4.1 servers, unnecessary leeway should be avoided.  In
   particular, there are a number of attributes which most server
   implementations should find relatively easy to supply and which
   are of critical importance to clients, particularly in those cases
   in which NFS4ERR_MOVED is returned when first crossing into an
   absent file system that the client has not previously referenced,
   i.e. a referral.

   NFSv4.1 should require servers to return fsid for an absent file
   system as well as fs_locations.  In order for the client to
   properly determine the boundaries of absent filesystems, it needs
   access to fsid.  In addition, when at the root of an absent
   filesystem, mounted_on_fileid needs to be returned.

   On the other hand, a number of attributes pose difficulties when
   returned for an absent filesystem.  While not prohibiting the
   server from returning these, the NFSv4.1 spec should explain the
   issues which may result in problems, since these are not always
   obvious.  Handling of some specific attributes is discussed below.

2.1.1. fsid

   The fsid attribute allows clients to recognize when fs boundaries
   have been crossed.  This applies also when one crosses into an
   absent filesystem.  While it might seem that returning fsid is not
   absolutely required, since fs boundaries are also reflected, in
   this case, by means of the fs_root field of the fs_locations
   attribute, there are renaming issues that make this unreliable.
   Returning fsid is necessary for clients, and servers should have
   no difficulty in providing it.
   To avoid misunderstanding, the NFSv4.1 spec should note that the
   fsid provided in this case is solely so that the fs boundaries can
   be properly noted and that the fsid returned will not necessarily
   be valid after resolution of the migration event.  The logic of
   fsid handling for NFSv4 is that fsid's are only unique within a
   per-server context.  This would seem to be a strong indication
   that they need not be persistent when file systems are moved from
   server to server, although RFC 3530 does not specifically address
   the matter.

2.1.2. mounted_on_fileid

   The mounted_on_fileid attribute is of particular importance to
   many clients, in that they need this information to form a proper
   response to a readdir() call.  When a readdir() call is done
   within UNIX, the d_ino field of each of the entries needs to have
   a unique value, normally derived from the NFSv4 fileid attribute.
   Using the fileid attribute for this purpose poses problems when a
   file system boundary is crossed, particularly when crossing into
   an absent fs.  Note first that the fileid attribute, since it is
   within a new fs and thus a new fileid space, will not be unique
   within the directory.  Also, since the fs, at its new location,
   may arrange things differently, the fileid decided on at the
   directing server may be overridden at the target server, making it
   of little value.  Neither of these problems arises in the case of
   mounted_on_fileid, since that fileid is in the context of the
   mounted-on fs and unique within it.

2.1.3. fileid

   For reasons explained above under mounted_on_fileid, it would be
   difficult for the referring server to provide a fileid value that
   is of any use to the client.  Given this, it seems much better for
   the server never to return fileid values for files on an absent
   fs.

2.1.4. filehandle

   Returning filehandles for files in the absent fs, whether by use
   of GETFH (discussed below) or by using the filehandle attribute
   with GETATTR or READDIR, poses problems for the client, as the
   server to which it is referred is likely not to assign the same
   filehandle value to the object in question.  Even though it is
   possible that volatile filehandles may allow a change, the
   referring server should not prejudge the issue of filehandle
   volatility for the server which actually has the fs.  By not
   providing the filehandle, the referring server allows the target
   server freedom to choose the filehandle value without constraint.

2.2. Issues with the Error NFS4ERR_MOVED

   RFC3530, in addition to being somewhat unclear about the
   situations in which NFS4ERR_MOVED is to be returned, is self-
   contradictory.  In particular, in section 6.2 it is stated, "The
   NFS4ERR_MOVED error is returned for all operations except PUTFH
   and GETATTR.", which is contradicted by the error lists in the
   detailed operation descriptions.  Specifically,

   o  NFS4ERR_MOVED is listed as an error code for PUTFH (section
      14.2.20), despite the statement noted above.

   o  NFS4ERR_MOVED is listed as an error code for GETATTR (section
      14.2.7), despite the statement noted above.

   o  Despite the "all operations except" in the statement above, six
      operations (PUTROOTFH, PUTPUBFH, RENEW, SETCLIENTID,
      SETCLIENTID_CONFIRM, RELEASE_OWNER) are not allowed to return
      NFS4ERR_MOVED.
2.2.1. Issue of when to check current filehandle

   In providing the definition of NFS4ERR_MOVED, RFC 3530 refers to
   the "filesystem which contains the current filehandle object"
   being moved to another server.  This has led to some confusion
   when considering the case of operations which change the current
   filehandle and potentially the current file system.  For example,
   a LOOKUP which causes a transition to an absent file system might
   be supposed to result in this error.  This should be clarified to
   make it explicit that only the current filehandle at the start of
   the operation can result in NFS4ERR_MOVED.

2.2.2. Issue of GETFH

   While RFC 3530 does not make any exception for GETFH when the
   current filehandle is within an absent filesystem, the fact that
   GETFH is such a passive, purely interrogative operation may lead
   readers to wrongly suppose that an NFS4ERR_MOVED error will not
   arise in this situation.  Any new NFSv4 RFC should explicitly
   state that GETFH will return this error if the current filehandle
   is within an absent filesystem.

   This fact has a particular importance in the case of referrals, as
   it means that filehandles within absent filesystems will never be
   seen by clients.  Filehandles not seen by clients can pose no
   expiration or consistency issues on the target server.

2.2.3. Handling of PUTFH

   As noted above, the handling of PUTFH regarding NFS4ERR_MOVED is
   not clear in RFC3530.  Part of the problem is that there is felt
   to be a need for an exception for PUTFH, to enable the sequence
   PUTFH-GETATTR(fs_locations).  However, if one clearly establishes,
   as should be established, that the check for an absent filesystem
   is only to be made at the start of each operation, then no such
   exception is required.  The sequence PUTFH-GETATTR(fs_locations)
   requires an exception for the GETATTR but not for the PUTFH.

   PUTFH can return NFS4ERR_MOVED, but only if the current
   filehandle, as established by a previous operation, is within an
   absent filesystem.  Whether the filehandle established by the
   PUTFH is within an absent filesystem is of no consequence in
   determining whether such an error is returned, since the check is
   to be done at the start of the operation.

2.2.4. Inconsistent handling of GETATTR

   While, as noted above, RFC 3530 indicates that NFS4ERR_MOVED is
   not returned for a GETATTR operation, NFS4ERR_MOVED is listed as
   an error that can be returned by GETATTR.  The best resolution for
   this is to limit the exception for GETATTR to the specific cases
   in which it is required:

   o  If all of the attributes requested can be provided (e.g. fsid,
      fs_locations, and mounted_on_fileid in the case of the root of
      an absent filesystem), then NFS4ERR_MOVED is not returned.

   o  If an attribute is requested which indicates that the client is
      aware of the likelihood of migration having happened (such as
      fs_locations), then NFS4ERR_MOVED is not returned, irrespective
      of what additional attributes are requested.  The newly-
      proposed attributes fs_absent and fs_location_info (see
      sections 3.2.1 and 3.2.2) would, like fs_locations, also cause
      NFS4ERR_MOVED not to be returned.

   For the rest of this document, the phrase "fs_locations-like
   attributes" is to be understood as including fs_locations, and the
   new attributes fs_absent and fs_location_info, if added to the
   protocol.

   In all other cases, if the current filesystem is absent,
   NFS4ERR_MOVED is to be returned.
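   These rules reduce to a simple bitmask test.  The following C
   sketch is illustrative only: the attribute bit numbers for fsid
   (8), fs_locations (24), and mounted_on_fileid (55) follow RFC
   3530, while the numbers used for the proposed fs_absent and
   fs_location_info attributes are placeholders, since those
   attributes have not been assigned.

      #include <stdbool.h>
      #include <stdint.h>

      #define FATTR4_FSID               8
      #define FATTR4_FS_LOCATIONS      24
      #define FATTR4_MOUNTED_ON_FILEID 55
      #define FATTR4_FS_ABSENT         56   /* placeholder value */
      #define FATTR4_FS_LOCATION_INFO  57   /* placeholder value */

      #define ATTRBIT(n) ((uint64_t)1 << (n))

      /* Attributes a referring server can supply for an absent fs. */
      static const uint64_t absent_fs_attrs =
          ATTRBIT(FATTR4_FSID) | ATTRBIT(FATTR4_FS_LOCATIONS) |
          ATTRBIT(FATTR4_MOUNTED_ON_FILEID) |
          ATTRBIT(FATTR4_FS_ABSENT) | ATTRBIT(FATTR4_FS_LOCATION_INFO);

      /* The fs_locations-like attributes, as defined above. */
      static const uint64_t fs_locations_like =
          ATTRBIT(FATTR4_FS_LOCATIONS) | ATTRBIT(FATTR4_FS_ABSENT) |
          ATTRBIT(FATTR4_FS_LOCATION_INFO);

      /* Does GETATTR on an absent fs fail with NFS4ERR_MOVED? */
      bool getattr_returns_moved(uint64_t requested)
      {
          if (requested & fs_locations_like)
              return false;   /* client is migration-aware */
          if ((requested & ~absent_fs_attrs) == 0)
              return false;   /* everything requested is available */
          return true;
      }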
2.2.5. Ops not allowed to return NFS4ERR_MOVED

   As noted above, RFC 3530 does not allow the following ops to
   return NFS4ERR_MOVED:

   o  PUTROOTFH

   o  PUTPUBFH

   o  RENEW

   o  SETCLIENTID

   o  SETCLIENTID_CONFIRM

   o  RELEASE_OWNER

   All of these are ops which do not require a current filehandle,
   although two other ops that also do not require a current
   filehandle, DELEGPURGE and PUTFH, are allowed to return
   NFS4ERR_MOVED.  There is no good reason to continue these as
   exceptions.  In future NFSv4 versions it should be the case that
   if there is a current filehandle and the associated filesystem is
   not present, an NFS4ERR_MOVED error should result, just as it does
   for other ops.

2.2.6. Summary of NFS4ERR_MOVED

   To summarize, NFSv4.1 should:

   o  Make clear that the check for an absent filesystem is to occur
      at the start (and only at the start) of each operation.

   o  Allow NFS4ERR_MOVED to be returned by all ops, including those
      not allowed to return it in RFC3530.

   o  Be clear about the circumstances in which GETATTR will or will
      not return NFS4ERR_MOVED.

   o  Delete the confusing text regarding an exception for PUTFH.

   o  Make it clear that GETFH will return NFS4ERR_MOVED rather than
      a filehandle within an absent filesystem.

2.3. Issues of Incomplete Attribute Sets

   Migration or referral events naturally create situations in which
   not all of the attributes normally supported on a server are
   obtainable.  RFC3530 is in places ambivalent and/or apparently
   self-contradictory on such issues.  Any new NFSv4 RFC should take
   a clear position on these issues (and it should not impose undue
   difficulties on support for migration).

   The first problem concerns the statement in the third paragraph of
   section 6.2:

      "If the client requests more attributes than just fs_locations,
      the server may return fs_locations only.  This is to be
      expected since the server has migrated the filesystem and may
      not have a method of obtaining additional attribute data."

   While the above seems quite reasonable, it is seemingly
   contradicted by the following text from the second paragraph of
   the DESCRIPTION for GETATTR in section 14.2.7:

      "The server must return a value for each attribute that the
      client requests if the attribute is supported by the server.
      If the server does not support an attribute or cannot
      approximate a useful value then it must not return the
      attribute value and must not set the attribute bit in the
      result bitmap.  The server must return an error if it supports
      an attribute but cannot obtain its value.  In that case no
      attribute values will be returned."

   The above is a useful restriction in that it allows clients to
   simplify their attribute interpretation code: they may assume that
   all of the attributes they request are present, often making it
   possible to get successive attributes at fixed offsets within the
   data stream.  However, it seems to contradict what is said in
   section 6.2, where it is clearly anticipated, at least when
   fs_locations is requested, that fewer (often many fewer)
   attributes will be available than are requested.

   It could be argued that the two could be harmonized by being
   creative with the interpretation of the phrase "if the attribute
   is supported by the server".
   One could argue that many attributes are simply not supported by
   the server for an absent fs, although the text, by speaking of
   attributes "supported by a server", seems to indicate that support
   is not allowed to differ between filesystems (which is troublesome
   in itself, as a single server might have some filesystems that
   support ACLs and some that do not).  Note, however, that the
   following paragraph in the description says, "All servers must
   support the mandatory attributes as specified in the section 'File
   Attributes'".  That's reasonable enough in general, but for an
   absent fs it is not reasonable, and so section 14.2.7 and section
   6.2 are contradictory.  NFSv4.1 should remove the contradiction by
   making an explicit exception for the case of an absent filesystem.

2.3.1. Handling of attributes for READDIR

   A related issue concerns attributes in a READDIR.  There has been
   discussion, without any resolution yet, regarding the server's
   obligation (or not) to return the attributes requested with
   READDIR.  There has been discussion of cases in which this is
   inconvenient for the server, and an argument has been made that
   the attribute request should be treated as a hint, since the
   client can do a GETATTR to get requested attributes that are not
   supplied by the server.

   Regardless of how this issue is resolved, it needs to be made
   clear that, at least in the case of a directory that contains the
   roots of absent filesystems, the server must not be required to
   return attributes that it is simply unable to return, just as it
   cannot with GETATTR.

   The following rules, derived from section 3.1 of [referrals] and
   modified for suggested attribute changes in NFSv4.1, represent a
   good basis for handling this issue (a sketch of them in code
   follows the list), although the resolution of the general issue
   regarding the attribute mask for READDIR will affect the ultimate
   choices for NFSv4.1.

   o  When any of the fs_locations-like attributes is among the
      attributes requested, the server may provide a subset of the
      other requested attributes together with the requested
      fs_locations-like attributes for roots of absent fs's, without
      causing any error for the READDIR as a whole.  If rdattr_error
      is also requested and there are attributes which are not
      available, then rdattr_error will receive the value
      NFS4ERR_MOVED.

   o  When no fs_locations-like attributes are requested, but all of
      the attributes requested can be provided, then they will be
      provided and no NFS4ERR_MOVED will be generated.  An example
      would be READDIR's that request mounted_on_fileid, either with
      or without fsid.

   o  When none of the fs_locations-like attributes is requested, but
      rdattr_error is, and some attributes requested are not
      available because of the absence of the filesystem, the server
      will return NFS4ERR_MOVED for the rdattr_error attribute and,
      in addition, the requested attributes that are valid for the
      root of an absent filesystem.

   o  When none of the fs_locations-like attributes is requested and
      there is a directory within an absent fs within the directory
      being read, if some unavailable attributes are requested, the
      handling will depend on the overall decision about READDIR
      referred to above.  If the attribute mask is to be treated as a
      hint, only available attributes will be returned.  Otherwise,
      no data will be returned and the READDIR will get an
      NFS4ERR_MOVED error.
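   As promised above, here is a sketch of these rules in C, for a
   single READDIR entry that is the root of an absent filesystem.
   This is illustrative only; the enum names and parameters are not
   proposed protocol elements, and rdattr_error is attribute 11 per
   RFC 3530.

      #include <stdbool.h>
      #include <stdint.h>

      #define ATTRBIT(n) ((uint64_t)1 << (n))
      #define FATTR4_RDATTR_ERROR 11

      enum rd_entry_result {
          RD_ALL_ATTRS,    /* every requested attribute returned */
          RD_SUBSET,       /* available subset returned, no error */
          RD_SUBSET_RDERR, /* subset plus rdattr_error=NFS4ERR_MOVED */
          RD_FAIL_MOVED    /* whole READDIR fails, NFS4ERR_MOVED */
      };

      enum rd_entry_result
      readdir_absent_entry(uint64_t requested,
                           uint64_t locations_like, /* fs_locations etc. */
                           uint64_t available,      /* fsid, etc. */
                           bool mask_is_hint)       /* open WG question */
      {
          uint64_t missing = requested & ~available;
          bool rderr = (requested & ATTRBIT(FATTR4_RDATTR_ERROR)) != 0;

          if (requested & locations_like) {
              if (!missing)
                  return RD_ALL_ATTRS;
              return rderr ? RD_SUBSET_RDERR : RD_SUBSET;
          }
          if (!missing)
              return RD_ALL_ATTRS;
          if (rderr)
              return RD_SUBSET_RDERR;
          return mask_is_hint ? RD_SUBSET : RD_FAIL_MOVED;
      }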
2.4. Referral Issues

   RFC 3530 defines a migration feature which allows the server to
   direct clients to another server for the purpose of accessing a
   given file system.  While that document explains the feature in
   terms of a client accessing a given file system and then finding
   that it has moved, an important limiting case is that in which the
   clients are redirected as part of their first attempt to access a
   given file system.

2.4.1. Editorial Changes Related to Referrals

   Given the above framework for implementing referrals within the
   basic migration framework described in RFC 3530, we need to
   consider how future NFSv4 RFC's should be modified, relative to
   RFC 3530, to address referrals.  The most important change is to
   include an explanation of how referrals fit into the v4 migration
   model.  Because the existing discussion does not specifically call
   out the case in which the absence of a filesystem is noted while
   attempting to cross into the absent file system, it is hard to
   understand how referrals work and how they relate to other sorts
   of migration events.

   It makes sense to present a description of referrals in a new sub-
   section following the "Migration" section, which would be section
   6.2.1 given the current numbering scheme of RFC 3530.  The
   material in [referrals], suitably modified for the changes
   proposed for v4.1, would be very helpful in providing the basis
   for this sub-section.

   There are also a number of cases in which the existing wording of
   RFC 3530 seems to ignore the referral case of the migration
   feature.  In the following specific cases, some suggestions are
   made for edits to tidy this up.

   o  In section 1.4.3.3, in the third sentence of the first
      paragraph, the phrase "In the event of a migration of a
      filesystem" is unnecessarily restrictive, and having the
      sentence read "In the event of the absence of a filesystem, the
      client will receive an error when operating on the filesystem
      and it can then query the server as to the current location of
      the file system" would be better.

   o  In section 6.2, the following should be added as a new second
      paragraph: "Migration may be signaled when a file system is
      absent on a given server, when the file system in question has
      never actually been located on the server in question.  In such
      a case, the server acts to refer the client to the proper fs
      location, using fs_locations to indicate the server location,
      with the existence of the server as a migration source being
      purely conventional."

   o  In the existing second paragraph of section 6.2, the first
      sentence should be modified to read as follows: "Once a
      filesystem has been successfully established at a new server
      location, the error NFS4ERR_MOVED will be returned for
      subsequent requests received by the server whose role is as the
      source of the filesystem, whether the filesystem actually
      resided on that server, or whether its original location was
      purely nominal (i.e. the pure referral case)."

   o  The following should be added as an additional paragraph at the
      end of section 6.4: "Note that in the case of a referral, there
      is no issue of filehandle recovery since no filehandles for the
      absent filesystem are communicated to the client (and neither
      is the fh_expire_type)".
   o  The following should be added as an additional paragraph at the
      end of section 8.14.1: "Note that in the case of referral,
      there is no issue of state recovery since no state can have
      been generated for the absent filesystem."

   o  In section 12, in the description of NFS4ERR_MOVED, the first
      sentence should read, "The filesystem which contains the
      current filehandle object is now located on another server."

3. Feature Extensions

   A number of small extensions can be made within NFSv4's minor
   versioning framework to enhance the ability to provide multi-
   vendor implementations of migration and replication where the
   transition from server instance to server instance is transparent
   to client users.  This includes transitions due to migration or
   transitions among replicas due to server or network problems.
   These same extensions would enhance the ability of the server to
   present clients with multiple replicas in a referral situation, so
   that the most appropriate one might be selected.  These extensions
   would all be in the form of additional recommended attributes.

3.1. Attribute Continuity

   There are a number of issues with the existing protocol that
   revolve around the continuity (or lack thereof) of attribute
   values across a migration event.  In some cases, the spec is not
   clear about whether such continuity is required, and different
   readers may make different assumptions.  In other cases,
   continuity is not required, but there are significant cases in
   which there would be a benefit, and there is no way for the client
   to take advantage of attribute continuity when it exists.  A third
   situation is that attribute continuity is generally assumed
   (although not specified in the spec), but allowing change at a
   migration event would add greatly to flexibility in handling a
   global namespace.

3.1.1. filehandle

   The issue of filehandle continuity is not fully addressed in
   RFC3530.  In many cases of vendor-specific migration or
   replication (where an entire fs image is copied, for instance), it
   is relatively easy to provide that the same persistent filehandles
   used on the source server be recognized on the destination server.

   On the other hand, for many forms of migration, filehandle
   continuity across a migration event cannot be provided, requiring
   that filehandles be re-established.  Within RFC3530, volatile
   filehandles (FH4_VOL_MIGRATION) are the only mechanism available
   to satisfy this need, and in many environments they will work
   fine.  Unfortunately, in the case in which an open file is renamed
   by another client, the re-establishment of the filehandle on the
   destination target will give the wrong result and the client will
   attempt to re-open an incorrect file on the target.  There needs
   to be a way to address this difficulty in order to provide
   transparent switching among file system instances, both in the
   event of migration and when transitioning among replicas.

3.1.2. fileid

   RFC3530 gives no real guidance on the issue of continuity of
   fileid's in the event of migration or a transition between two
   replicas.  The general expectation has been that in situations in
   which the two filesystem instances are created by a single vendor
   using some sort of filesystem image copy, fileid's will be
   consistent across the transition, while in the analogous multi-
   vendor transitions they will not.  This latter can pose some
   difficulties.
   It is important to note that while clients themselves may have no
   trouble with a fileid changing as a result of a filesystem
   transition event, applications do typically have access to the
   fileid (e.g. via stat), and the result of this is that an
   application may work perfectly well if there is no filesystem
   instance transition, or if any such transition is among instances
   created by a single vendor, yet be unable to deal with a multi-
   vendor transition that occurs at the wrong time.

   Providing the same fileid's in a multi-vendor (multiple server
   vendors) environment has generally been held to be quite
   difficult.  While there is work to be done, it needs to be pointed
   out that this difficulty is partly self-imposed.  Servers have
   typically identified fileid with inode number, i.e. with a
   quantity used to find the file in question.  This identification
   poses special difficulties for migration of an fs between vendors,
   where assigning the same index to a given file may not be
   possible.  Note that a fileid is not required to be useful for
   finding the file in question, only to be unique within the given
   fs.  Servers prepared to accept a fileid as a single piece of
   metadata, and to store it apart from the value used to index the
   file information, can relatively easily maintain a fileid value
   across a migration event, making a truly transparent migration
   event possible.

   In any case, where servers can provide continuity of fileids, they
   should, and the client should be able to find out that such
   continuity is available and take appropriate action.

3.1.3. change attribute

   Currently the change attribute is defined as strictly the province
   of the server, making it necessary for the client to re-establish
   the change attribute value on the new server.  This has the
   further consequence that the lack of continuity between change
   values on the source and destination servers creates a window
   during which the client has no reliable way of determining whether
   its caches are still valid.  Where there is a transition among
   writable filesystem instances, even if most of the access is for
   reading (in fact, particularly if it is), this can be a
   significant performance issue.

   Where the co-operating servers can provide continuity of change
   number across the migration event, the client should be able to
   determine this fact and use this knowledge to avoid unneeded
   attribute fetches and client cache flushes.

3.1.4. fsid

   Although RFC3530 does not say so explicitly, it has been the
   general expectation that although the fsid is expected to change
   as part of migration (since the fsid space is per-server), the
   boundaries of a filesystem when migrated will be the same as they
   were on the source.  The possibility of splitting an existing
   filesystem into two or more as part of migration can provide
   important additional functionality in a global namespace
   environment.

   When one divides up pieces of a global namespace into convenient-
   sized fs's (to allow their independent assignment to individual
   servers), difficulties will arise over time.  As the sizes of
   directories grow, what was once a convenient set of files,
   embodied as a separate fs, may become inconveniently large.  This
   requires a means to divide it into a new set of pieces which are
   of a convenient size.
   The important point is that while there are many ways to do that
   currently, they are all disruptive.  A method is needed which
   allows this division to occur without disrupting access.

3.2. Additional Attributes

   A small number of additional attributes in V4.1 can provide
   significant additional functionality.  They address the attribute
   continuity issues discussed above and allow more complete
   information about the possible replicas, post-migration locations,
   or referral targets for a given filesystem, allowing the client to
   choose the one most suited to its needs and to handle the
   transition to a new target server more effectively.

   All of the proposed attributes would be defined as validly
   requested when the current filehandle is within an absent
   filesystem, i.e. an attempt to obtain these attributes would not
   result in NFS4ERR_MOVED.  In some cases, it may be optional to
   actually provide the requested attribute information based on the
   presence or absence of the filesystem.  The specifics will be
   discussed under each of the individual attributes.

3.2.1. fs_absent

   In NFSv4.0, fs_locations is the only attribute which, when
   fetched, indicates that the client is aware of the possibility
   that the current filesystem may be absent.  Since fs_locations is
   a complicated attribute and the client may simply want an
   indication of whether the filesystem is present, we propose the
   addition of a boolean attribute named "fs_absent" to provide this
   information simply.

   As noted above, this attribute, when supported, may be requested
   of absent filesystems without causing NFS4ERR_MOVED to be
   returned, and it should always be available.  Servers are strongly
   urged to support this attribute on all filesystems if they support
   it on any filesystem.

3.2.2. fs_location_info

   The fs_location_info attribute is intended as a more functional
   replacement for fs_locations, which will continue to exist and be
   supported.  Clients which need the additional information provided
   by this attribute will interrogate it and get the information from
   servers that support it.  When the server does not support
   fs_location_info, fs_locations can be used to get a subset of the
   information.  A server which supports fs_location_info MUST
   support fs_locations as well.

   There are several sorts of additional information present in
   fs_location_info that aren't available in fs_locations:

   o  Attribute continuity information, to allow a client to select a
      location which meets the transparency requirements of the
      applications accessing the data and to take advantage of
      optimizations that server guarantees as to attribute continuity
      may provide (e.g. for the change attribute).

   o  Filesystem identity information which indicates when multiple
      replicas, from the client's point of view, correspond to the
      same target filesystem, allowing them to be used
      interchangeably, without disruption, as multiple paths to the
      same thing.

   o  Information which will bear on the suitability of various
      replicas, depending on the use that the client intends.  For
      example, many applications need an absolutely up-to-date copy
      (e.g. those that write), while others may only need access to
      the most up-to-date copy reasonably available.

   o  Server-derived preference information for replicas, which can
      be used to implement load-balancing while giving the client the
      entire fs list to be used in case the primary fails.
   Attribute continuity and filesystem identity information define a
   number of identity relations among the various filesystem
   replicas.  Most often, the relevant question for the client will
   be whether a given replica is identical-with/continuous-to the
   current one in a given respect, but the information as to whether
   two other replicas match in that respect should be available as
   well.

   The way in which such pairwise filesystem comparisons are
   relatively compactly encoded is to associate with each replica a
   32-bit integer, the location id.  The fs_location_info attribute
   then contains, for each of the identity relations among replicas,
   a 32-bit mask.  If that mask, when anded with the location ids of
   the two replicas, results in fields which are identical, then the
   two replicas are defined as belonging to the corresponding
   identity relation.  This scheme allows the server to accommodate
   relatively large sets of replicas, distinct according to a given
   criterion, without requiring large amounts of data to be sent for
   each replica.

   Server-specified preference information is also provided in a
   fashion that allows a number of different relations (in this case,
   order relations) to be expressed in a compact way.  In this case,
   each location4_server structure contains a 32-bit priority word
   which can be broken into fields devoted to these relations in any
   way the server wishes.  The location4_info structure contains a
   set of 32-bit masks, one for each relation.  Two replicas can be
   compared via that relation by anding the corresponding mask with
   the priority word for each replica and comparing the results.

   The fs_location_info attribute consists of a root pathname (just
   like fs_locations), together with an array of location4_item
   structures.

      struct location4_server {
              uint32_t        priority;
              uint32_t        flags;
              uint32_t        location_id;
              int32_t         currency;
              utf8str_cis     server;
      };

      const LIF_FHR_OPEN  = 0x00000001;
      const LIF_FHR_ALL   = 0x00000002;
      const LIF_MULTI_FS  = 0x00000004;
      const LIF_WRITABLE  = 0x00000008;
      const LIF_CUR_REQ   = 0x00000010;
      const LIF_ABSENT    = 0x00000020;
      const LIF_GOING     = 0x00000040;
      const LIF_VLCACHE   = 0x00000080;

      struct location4_item {
              location4_server        entries<>;
              pathname4               rootpath;
      };

      struct location4_info {
              pathname4               fs_root;
              location4_item          items<>;
              uint32_t                fileid_keep_mask;
              uint32_t                change_cont_mask;
              uint32_t                same_fh_mask;
              uint32_t                same_state_mask;
              uint32_t                same_fs_mask;
              uint32_t                valid_for;
              uint32_t                read_rank_mask;
              uint32_t                read_order_mask;
              uint32_t                write_rank_mask;
              uint32_t                write_order_mask;
      };

   The fs_location_info attribute is structured similarly to the
   fs_locations attribute.  A top-level structure (fs_locations4 or
   location4_info) contains the entire attribute, including the root
   pathname of the fs and an array of lower-level structures that
   define replicas that share a common root path on their respective
   servers.  Those lower-level structures in turn (fs_location4 or
   location4_item) contain a specific pathname and information on one
   or more individual server replicas.  For that last, lowest-level
   information, fs_locations has a server name in the form of
   utf8str_cis, while fs_location_info has a location4_server
   structure that contains per-server-replica information in addition
   to the server name.
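   Concretely, the pairwise comparison is a mask-and-compare.  The
   following C sketch is illustrative only; the function names are
   not protocol elements:

      #include <stdbool.h>
      #include <stdint.h>

      /*
       * Two replicas belong to a given identity relation exactly
       * when their location ids agree on every bit of that
       * relation's mask.
       */
      static bool
      same_relation(uint32_t mask, uint32_t loc_id_a, uint32_t loc_id_b)
      {
          /* Equal under the mask <=> no differing bits in the mask. */
          return ((loc_id_a ^ loc_id_b) & mask) == 0;
      }

      /* Example: will fileids be preserved across a transition from
       * replica a to replica b?  The mask comes from location4_info
       * (fileid_keep_mask); the ids from each location4_server. */
      static bool
      fileids_preserved(uint32_t fileid_keep_mask,
                        uint32_t loc_id_a, uint32_t loc_id_b)
      {
          return same_relation(fileid_keep_mask, loc_id_a, loc_id_b);
      }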
The location4_server structure consists of the following items: o The priority word is used to implement server-specified ordering relations among replicas. These relations are intended to be used to select a replica when migration (including a referral) occurs, when a server appears to be down, when the server directs the client to find a new replica (see LIF_GOING) and, optionally, when a new filesystem is first entered. See the location4_info fields read_order, read_rank, write_order, and write_rank for details of the ordering relations. o A word of flags providing information about this replica/target. These flags are defined below. o An indication of file system up-to-date-ness (currency) in terms of approximate seconds before the present. A negative value indicates that the server is unable to give any reasonably useful value here. A zero indicates that filesystem is the actual writable data or a reliably coherent and fully up-to-date copy. Positive values indicate how out- of-date this copy can normally be before it is considered for update. Such a value is not a guarantee that such updates will always be performed on the required schedule but instead serve as a hint about how far behind the most up-to-date copy of the data, this copy would normally be expected to be. o A location id for the replica, to be used together with masks in the location4_info structure to determine whether that replica matches other in various respects, as described above. See below (after the mask definitions) for an example of how the location_id can be used to communicate filesystem information. When two location id's are identical, then access to the corresponding replicas are defined as identical in all respects. They access the same filesystem with the same filehandles and share v4 file state. Further, multiple connections to the two replicas may be done as part of the same session. Two such replicas will share a common root path and are best presented within two location4_server entries in a common location4_item. These replicas should have identical values for the currency field although the flags and priority fields may be different. Noveck, Burnett April 2006 [Page 20] Internet-Draft Next Steps for NFSv4 Migration/Replication October 2005 Clients may find it helpful to associate all of the location4_server structures that share a location_id value and treat this set as representing a single fs target. When they do so, they should take proper care to note that priority fields for these may be different and the selection of location4_server needs to reflect rank and order considerations (see below) for the individual entries. o The server string. For the case of the replica currently being accessed (via GETATTR), a null string may be used to indicate the current address is using for the RPC call. The flags field has the following bits defined: o LIF_FHR_OPEN indicates that the server will normally make a replacement filehandle available for files that are open at the time of a filesystem image transition. When this flag is associated with an alternative filesystem instance, the client may get the replacement filehandle to be used on the new filesystem instance from the current server. When this flag is associated with the current filesystem instance, a replacement for filehandles from a previous instance may be obtained on this one. See section 3.2.2, fh_replacement, for details. 
      Because of the possibility of hardware and software failures,
      this is not a guarantee, but when this bit is returned, the
      server should make all reasonable efforts to provide the
      replacement filehandle.

   o  LIF_FHR_ALL indicates that a replacement filehandle will be
      made available for all files when there is a migration event or
      a replica switch.  Like LIF_FHR_OPEN, it may indicate
      replacement availability on the source or the destination, and
      the details are described in section 3.2.3.

   o  LIF_MULTI_FS indicates that when a transition occurs from the
      current filesystem instance to this one, the replacement may
      consist of multiple filesystems.  In this case, the client has
      to be prepared for the possibility that objects on the same fs
      before migration will be on different ones after.  Note that
      LIF_MULTI_FS is not incompatible with the two filesystems
      agreeing with respect to the fileid-keep mask since, if one has
      a set of fileid's that are unique within an fs, each subset
      assigned to a smaller fs after migration would not have any
      conflicts internal to that fs.

      A client, in the case of a split filesystem, will interrogate
      existing files with which it has a continuing connection (it is
      free simply to forget cached filehandles).  If the client
      remembers the directory filehandle associated with each open
      file, it may proceed upward using LOOKUPP to find the new fs
      boundaries.

      Once the client recognizes that one filesystem has been split
      into two, it could maintain applications running without
      disruption by presenting the two filesystems as a single one
      until a convenient point at which to recognize the transition,
      such as a reboot.  This would require a mapping from the
      server's fsids to the fsids as seen by the client, but this is
      already necessary for other reasons anyway.  As noted above,
      existing fileids within the two descendant fs's will not
      conflict.  Creation of new files in the two descendant fs's may
      require some amount of fileid mapping, which can be performed
      very simply in many important cases.

   o  LIF_WRITABLE indicates that this fs target is writable,
      allowing it to be selected by clients which may need to write
      on this filesystem.  When the current filesystem instance is
      writable, then any other filesystem to which the client might
      switch must incorporate within its data any committed write
      made on the current filesystem instance.  See below, in the
      section on the same-fs mask, for issues related to uncommitted
      writes.  While there is no harm in not setting this flag for a
      filesystem that turns out to be writable, turning the flag on
      for a read-only filesystem can cause problems for clients who
      select a migration or replication target based on it and then
      find themselves unable to write.

   o  LIF_VLCACHE indicates that this replica is a cached copy, where
      the measured latency of operations may differ very
      significantly depending on the particular data requested, in
      that already-cached data may be provided with very low latency
      while other data may require transfer from a distant source.

   o  LIF_CUR_REQ indicates that this replica is the one on which the
      request is being made.  Only a single server entry may have
      this flag set and, in the case of a referral, no entry will
      have it.

   o  LIF_ABSENT indicates that this entry corresponds to an absent
      filesystem replica.  It can only be set if LIF_CUR_REQ is set.
      When both bits are set, this indicates that the filesystem
      instance is not usable but that the information in the entry
      can be used to determine the sorts of continuity available when
      switching from this replica to other possible replicas.  Since
      this bit can only be true if LIF_CUR_REQ is true, the value
      could be determined using the fs_absent attribute, but the
      information is also made available here for the convenience of
      the client.  An entry with this bit set, since it represents a
      true filesystem (albeit absent), does not appear in the event
      of a referral, but only where a filesystem has been accessed at
      this location and subsequently migrated.

   o  LIF_GOING indicates that a replica, while still available,
      should not be used further.  The client, if using it, should
      make an orderly transfer to another filesystem instance as
      expeditiously as possible.  It is expected that filesystems
      going out of service will be announced as LIF_GOING some time
      before the actual loss of service, and that the valid_for value
      will be sufficiently small to allow clients to detect and act
      on scheduled events, while large enough that the cost of the
      requests to fetch the fs_location_info values will not be
      excessive.  Values on the order of ten minutes seem reasonable.

   The location4_item structure, analogous to an fs_location4
   structure, specifies the root pathname shared by an array of
   server replica entries.

   The location4_info structure, encoding the fs_location_info
   attribute, contains the following:

   o  The fs_root field, which contains the pathname of the root of
      the current filesystem on the current server, just as it does
      in the fs_locations4 structure.

   o  An array of location4_item structures, which contain
      information about replicas of the current filesystem.  Where
      the current filesystem is actually present, or has been
      present, i.e. this is not a referral situation, one of the
      location4_item structures will contain a location4_server entry
      for the current server.  This entry will have LIF_ABSENT set if
      the current filesystem is absent, i.e. normal access to it will
      return NFS4ERR_MOVED.

   o  The fileid-keep mask indicates, in combination with the
      appropriate location ids, that fileids will not change (i.e.
      they will be reliably maintained with no lack of continuity)
      across a transition between the two filesystem instances,
      whether by migration or a replica transition.  This allows a
      transition to occur safely, without any chance that
      applications that depend on fileids will be impacted.
In particular, when an open-reclaim is not available and the file is re-opened, a check for an unexpected change in the change attribute must be done. o The same-fh mask indicates, in combination with the appropriate location ids, whether two replicas will have the same fh's for corresponding objects. When this is true, both filesystems must have the same filehandle expiration type. When this is true and that type is persistent, those filehandles may be used across a migration event, without disruption. o The same-state mask indicates, in combination with the appropriate location ids, whether two replicas will have the same state environment. This does not necessarily mean that when performing migration, the client will not have to reclaim state. However it does mean that the client may proceed using his current clientid just as if there were no migration event and only reclaim state when an NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID error is received. Filesystems marked as having the same state should also have same filehandles. In other words the same-fh mask should be a subset (not necessarily proper) of the same-state mask. o The same-fs mask indicates, in combination with the appropriate location ids, whether two replicas in fact designate the same filesystem in all respects. If so, any action taken on one is immediately on the other and the client can consider them as effectively the same thing. The same-fs mask must include all bits in the same-fh mask, the change-cont mask, and same-state mask. Thus, filesystem instances marked as same-fs must also share state, have the same filehandles, and be change continuous. These considerations imply that a transition can occur with no Noveck, Burnett April 2006 [Page 24] Internet-Draft Next Steps for NFSv4 Migration/Replication October 2005 application disruption and no significant client work to update state related to the filesystem. When the same-fs mask indicates two filesystems are the same the clients are entitled to assume that there will also be no significant delay for the server to re-establish its state to effectively support the client. Where same-fs is not true and the other constituent continuity indication are true (fileid- keep, change-cont, same-fh), there may be significant delay under some circumstances, in line with the fact that the filesystems are being represented as being carefully kept in complete synchronization yet they are not the same. When two filesystems on separate servers have location ids which match on all the bits within the same-fs mask, clients should present the same nfs_client_id to both with the expectation the servers may be able to generate a shared clientid to be used when communicating with either. Such servers are expected to co-ordinate at least to the degree that they will not provide the same clientid to a client while not actually sharing the underlying state data. In handling of uncommitted writes, two servers with any pair of filesystems having the same-fs relation, write verifiers must be sufficiently unique that a client switching between the servers can determine whether previous async writes need to be reissued. This is unlike the general case of filesystems not bearing this relation, in which it must be assumed that asynchronous writes will be lost across a filesystem transition. 
      When two replicas' location ids match on all the bits within
      the same-fs mask but are not identical, a client using sessions
      will establish separate sessions to each, which together share
      any such common clientid.

   o  The valid_for field specifies a time for which it is reasonable
      for a client to use the fs_location_info attribute without
      refetch.  The valid_for value does not provide a guarantee of
      validity, since servers can unexpectedly go out of service or
      become inaccessible for any number of reasons.  Clients are
      well-advised to refetch this information for an actively
      accessed filesystem every valid_for seconds.  This is
      particularly important when filesystem replicas may go out of
      service in a controlled way, using the LIF_GOING flag to
      communicate an ongoing change.  The server should set valid_for
      to a value which allows well-behaved clients to notice the
      LIF_GOING flag and make an orderly switch before the loss of
      service becomes effective.  If this value is zero, then no
      refetch interval is appropriate and the client need not refetch
      this data on any particular schedule.

      In the event of a transition to a new filesystem instance, a
      new value of the fs_location_info attribute will be fetched at
      the destination, and it is to be expected that this may have a
      different valid_for value, which the client should then use in
      the same fashion as the previous value.

   o  The read-rank, read-order, write-rank, and write-order masks
      are used, together with the priority words of the various
      replicas, to order the replicas according to the server's
      preference.  (A sketch of this selection logic in code follows
      this list.)  See the discussion below for the interaction of
      rank, order, and the client's own preferences and needs.

      Read-rank and read-order are used to direct clients which only
      need read access, while write-rank and write-order are used to
      direct clients that require some degree of write access to the
      filesystem.  Depending on the potential need for write access
      by a given client, one of the pairs of rank and order masks is
      used, together with the priority words, to determine a rank and
      an order for each instance under consideration.  The read rank
      and order should only be used if the client knows that only
      reading will ever be done, or if it is prepared to switch to a
      different replica in the event that any write access capability
      is required in the future.

      The rank is obtained by anding the selected rank mask with the
      priority word, and the order is obtained similarly by anding
      the selected order mask with the priority word.  The resulting
      rank and order are compared as described below, with lower
      always being better (more preferred).

      Rank is used to express a strict server-imposed ordering on
      clients, with lower values indicating "more preferred".
      Clients should attempt to use all replicas with a given rank
      before they use one with a higher rank.  Only if all of those
      servers are unavailable should the client proceed to servers of
      a higher rank.

      Within a rank, the order value is used to specify the server's
      preference, to guide the client's selection when the client's
      own preferences are not controlling, with lower values of order
      indicating "more preferred".  If replicas are approximately
      equal in all respects, clients should defer to the order
      specified by the server.  When clients look at server latency
      as part of their selection, they are free to use this
      criterion, but it is suggested that when latency differences
      are not significant, the server-specified order should guide
      selection.
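   The rank/order selection just described might look as follows in
   C.  This is an illustrative sketch of one reasonable client
   policy, not prescribed behavior; it returns the first replica a
   client should try, given the priority words of the candidates and
   the rank and order masks appropriate to the intended access.

      #include <stddef.h>
      #include <stdint.h>

      /* Lower (rank, order) is better; rank dominates order. */
      static int better(uint32_t pri_a, uint32_t pri_b,
                        uint32_t rank_mask, uint32_t order_mask)
      {
          uint32_t rank_a = pri_a & rank_mask;
          uint32_t rank_b = pri_b & rank_mask;

          if (rank_a != rank_b)
              return rank_a < rank_b;
          return (pri_a & order_mask) < (pri_b & order_mask);
      }

      /* Index of the most-preferred usable candidate, or -1 if none.
       * 'priority' holds the location4_server priority words and
       * 'usable' flags the entries the client considers acceptable. */
      int pick_replica(const uint32_t *priority, const int *usable,
                       size_t n, uint32_t rank_mask, uint32_t order_mask)
      {
          int best = -1;

          for (size_t i = 0; i < n; i++) {
              if (!usable[i])
                  continue;
              if (best < 0 ||
                  better(priority[i], priority[best],
                         rank_mask, order_mask))
                  best = (int)i;
          }
          return best;
      }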
When clients look at server latency as part of their selection, they are free to use this criterion, but it is suggested that when latency differences are not significant, the server-specified order should guide selection.

The server may configure the rank and order masks to considerably simplify the decisions if it so chooses. For example, if the read vs. write distinction is not to be important in the selection process, then the location4_info should be one in which the read-rank and write-rank masks, and the read-order and write-order masks, are equal. If the server wishes to totally direct the process via rank, leaving no room for client choice, it may simply set the write-order mask and the read-order mask to zero. Conversely, if it wishes to give general preferences with more scope for client choice, it may set the read-rank mask and the write-rank mask to zero. A server may even set all the masks to zero and allow the client to make its own choices. The protocol allows multiple policies to be used as found appropriate.

The use of location ids together with the masks in the location4_info structure can be illustrated by an example. Suppose one has the following sets of servers:

o Server A with four IP addresses A1 through A4.

o Servers B, C, D sharing a cluster filesystem with A and each having four IP addresses, B1, B2, ... D3, D4.

o A point-in-time copy of the filesystem created using image copy, which shares filehandles and is change-attribute continuous with the filesystem on A-D, and which has two IP addresses X1 and X2.

o A point-in-time copy of the filesystem which was created at a higher level, but which shares fileids with the one on A-D and is accessed (via a clustered filesystem) by servers Ya and Yb.

o A copy of the filesystem made by simple user-level copy tools, which is served from server Z.

Given the above, one way of presenting these relationships is to assign the following location ids:

o A1-4 would get 0x1111
o B1-4 would get 0x1112
o C1-4 would get 0x1113
o D1-4 would get 0x1114
o X1-2 would get 0x1125
o Ya would get 0x1236
o Yb would get 0x1237
o Z would get 0x2348

And then the following mask values would be used:

o The same-fs and same-state masks would both be 0xfff0.
o The same-fh and change-cont masks would both be 0xff00.
o The fileid-keep mask would be 0xf000.

(A short sketch at the end of this section illustrates these relationships in code.)

This scheme allows the number of bits devoted to various kinds of similarity classes to be adjusted as needed with no change to the protocol. The total of thirty-two bits is expected to suffice indefinitely.

As noted above, the fs_location_info attribute, when supported, may be requested of absent filesystems without causing NFS4ERR_MOVED to be returned, and it is generally expected that it will be available for both present and absent filesystems, even if only a single location_server entry is present, designating the current (present) filesystem, or two location_server entries are present, designating the current (and now previous) location of an absent filesystem and its successor location. Servers are strongly urged to support this attribute on all filesystems if they support it on any filesystem.
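Tying the example together, this short C check (illustrative only; the id and mask values are copied from the example above) confirms which similarity classes the assigned location ids and masks produce.

    #include <stdint.h>
    #include <stdio.h>

    /* Replicas share a property when their location ids agree on all
     * bits of the property's mask (see the same_class() sketch above). */
    static int same(uint32_t a, uint32_t b, uint32_t mask)
    {
        return (a & mask) == (b & mask);
    }

    int main(void)
    {
        uint32_t A = 0x1111, X = 0x1125, Ya = 0x1236, Z = 0x2348;
        uint32_t same_fs = 0xfff0, same_fh = 0xff00, fileid_keep = 0xf000;

        printf("A/X  same-fh:     %d\n", same(A, X, same_fh));      /* 1 */
        printf("A/X  same-fs:     %d\n", same(A, X, same_fs));      /* 0 */
        printf("A/Ya fileid-keep: %d\n", same(A, Ya, fileid_keep)); /* 1 */
        printf("A/Ya same-fh:     %d\n", same(A, Ya, same_fh));     /* 0 */
        printf("A/Z  fileid-keep: %d\n", same(A, Z, fileid_keep));  /* 0 */
        return 0;
    }

The output matches the described relationships: X shares filehandles but not state with A-D, Ya shares only fileids, and the user-level copy on Z shares nothing.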
3.2.3. fh_replacement

The fh_replacement attribute provides a way of supplying a substitute filehandle to be used on a target server when a migration event or other fs instance switching event occurs. This provides an alternative to maintaining access via the existing persistent filehandle (which may be difficult) or using volatile filehandles (which will not give the correct result in all cases).

When a migration event occurs, information on the new location (or location choices) will be available via the fs_location_info attribute applied to any filehandle within the source filesystem. When LIF_FHR_OPEN or LIF_FHR_ALL is present, the fh_replacement attribute may be used to get the corresponding filehandle for filehandles that the client has accessed.

Similarly, after such an event, when the fs_location_info attribute is fetched on the new server, LIF_FHR_OPEN or LIF_FHR_ALL may be present in the server entry corresponding to the current filesystem instance. In this case, the fh_replacement attribute can be used to get the new filehandles corresponding to each of the now-outdated filehandles on the previous instance. In either of these ways, the client may be assured of a consistent mapping from old to new filehandles without relying on a purely name-based mapping, which in some cases will not be correct.

The choice of providing replacement on the source filesystem instance or on the target will normally be based on which server has the proper mapping. Generally, when the image is created by a push from the source, the source server naturally has the appropriate filehandles corresponding to its files and can provide them to the client. When the image transfer is done via a pull, the target server will be aware of the source filehandles and can provide the appropriate mapping when the client requests it. Note that the target server can only provide replacement filehandles if it can assure filehandle uniqueness, i.e. that filehandles from the source do not conflict with valid filehandles on the destination server. In the case where such uniqueness can be assured, source filehandles can be accepted for the purpose of providing replacements, with NFS4ERR_FHEXPIRED returned for any use other than interrogation of the fh_replacement attribute via GETATTR.

Replacement filehandles for multiple migration targets may be provided via multiple fhrep4_entry values. Each fhrep4_entry provides a replacement filehandle applying to all targets whose location id, when ANDed with the same-fh mask (from the fs_location_info attribute), matches the location_set value in the fhrep4_entry. Such a set of replicas shares the same filehandles, and thus a single entry can provide replacement filehandles for all of the members. Note that the location_set value will match that of the current filesystem instance only when the client presents a filehandle from the previous filesystem instance and the target filesystem provides its own replacement filehandles.

     union fhrep4_entry switch (bool present) {
     case TRUE:
             uint32_t        location_set;
             nfs_fh4         replacement;
     case FALSE:
             void;
     };

     struct fh4_replacement {
             fhrep4_entry    entries<>;
     };
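The matching rule above can be sketched in C as follows, assuming (illustratively) that the entries and the same-fh mask have already been fetched via GETATTR; the types here are simplified stand-ins for the XDR above, not protocol definitions.

    #include <stddef.h>
    #include <stdint.h>

    /* Simplified stand-ins for the XDR definitions above. */
    struct fhrep_entry {
        int            present;      /* the bool 'present' union arm */
        uint32_t       location_set;
        const uint8_t *replacement;  /* opaque filehandle bytes */
    };

    /* Find the replacement filehandle for a target replica: an entry
     * applies when the target's location id, ANDed with the same-fh
     * mask from fs_location_info, equals the entry's location_set. */
    static const uint8_t *find_replacement(const struct fhrep_entry *entries,
                                           size_t n_entries,
                                           uint32_t target_location_id,
                                           uint32_t same_fh_mask)
    {
        for (size_t i = 0; i < n_entries; i++) {
            if (entries[i].present &&
                (target_location_id & same_fh_mask) ==
                    entries[i].location_set)
                return entries[i].replacement;
        }
        return NULL;  /* no replacement available for this target */
    }

A client would call find_replacement() once per old filehandle, with the location id of whichever target it has selected, to build its old-to-new filehandle map.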
When a filesystem becomes absent, the server, in responding to requests for the fh_replacement attribute, is not required to validate all fields of the filehandle if it does not maintain per-file information. This matches the current handling of fs_locations (and applies as well to fs_location_info). For example, if a server has an fsid field within its filehandle implementation, it may simply recognize that value and return filehandles with the corresponding new fsid without validating other information within the handle. This can result in a filesystem accepting a filehandle which under other circumstances might result in NFS4ERR_STALE, just as it can when interrogating the fs_locations or fs_location_info attributes. Note that when it does so, it will return a replacement which, when presented to the new filesystem, will get an NFS4ERR_STALE there.

Use of the fh_replacement attribute can allow wholesale change of filehandles to implement storage re-organization, even within the context of a single server. If NFS4ERR_MOVED is returned, the client will fetch fs_location_info, which may refer to a location on the original server. Use of fh_replacement in this context allows a new set of filehandles to be established as part of storage reconfiguration (including possibly a split into multiple fs's) without requiring the client to maintain name information against the possibility of such a reconfiguration (as it must for volatile filehandles).

Servers are not required to maintain the availability of replacement filehandles for any particular length of time, but in order to maintain continuity of access in the face of network disruptions, servers should generally maintain the mapping from the pre-replacement filehandles persistently across server reboots, and for a considerable time. It should be the case that, even under severe network disruption, any client that received pre-replacement filehandles is given an opportunity to obtain the replacements. When this mapping is no longer made available, the pre-replacement filehandles should not be re-used, just as is the case for any other superseded filehandle.

As noted above, this attribute, when supported, may be requested of absent filesystems without causing NFS4ERR_MOVED to be returned, and it should always be available. When it is requested and the attribute is supported, if no replacement filehandle information is present, either because the filesystem is still present and there is no migration event or because there are currently no replacement filehandles available, a zero-length array of fhrep4_entry structures should be returned.

3.2.4. fs_status

In an environment in which multiple copies of the same basic set of data are available, information regarding the particular source of such data and the relationships among different copies can be very helpful in providing consistent data to applications.

     enum status4_type {
             STATUS4_FIXED = 1,
             STATUS4_UPDATED = 2,
             STATUS4_VERSIONED = 3,
             STATUS4_WRITABLE = 4,
             STATUS4_ABSENT = 5
     };

     struct fs4_status {
             status4_type    type;
             utf8str_cs      source;
             utf8str_cs      current;
             nfstime4        version;
     };

The type value indicates the kind of filesystem image represented. This is of particular importance when using the version values to determine the appropriate succession of filesystem images. Five types are distinguished:

o STATUS4_FIXED, which indicates a read-only image in the sense that it will never change. The possibility is allowed that, as a result of migration or a switch to a different image, changed data can be accessed, but within the confines of this instance no change is allowed. The client can use this fact to cache aggressively.
o STATUS4_UPDATED, which indicates an image that cannot be updated by the user writing to it but may be changed exogenously, typically because it is a periodically updated copy of another writable filesystem somewhere else.

o STATUS4_VERSIONED, which indicates that the image, like the STATUS4_UPDATED case, is updated exogenously, but with a guarantee that the server will carefully update the associated version value, so that the client may, if it chooses, protect itself from a situation in which it reads data from one version of the filesystem and then later reads data from an earlier version of the same filesystem. See below for a discussion of how this can be done.

o STATUS4_WRITABLE, which indicates that the filesystem is an actual writable one. The client need not, of course, actually write to the filesystem, but once it does, it should not accept a transition to anything other than a writable instance of that same filesystem.

o STATUS4_ABSENT, which indicates that the information is the last valid information for a filesystem which is no longer present.

The opaque strings source and current provide a way of presenting information about the source of the filesystem image being presented. It is not intended that the client do anything with this information other than make it available to administrative tools. It is intended that this information be helpful when researching possible problems with a filesystem image that might arise when it is unclear whether the correct image is being accessed and, if not, how that image came to be made. This kind of debugging information will be helpful if, as seems likely, copies of filesystems are made in many different ways (e.g. simple user-level copies, filesystem-level point-in-time copies, cloning of the underlying storage), under a variety of administrative arrangements. In such environments, determining how a given set of data was constructed can be very helpful in resolving problems.

The opaque string 'source' is used to indicate the source of a given filesystem, with the expectation that tools capable of creating a filesystem image propagate this information when that is possible. It is understood that this may not always be possible, since a user-level copy may be thought of as creating a new data set, and the tools used may have no mechanism to propagate this data. When a filesystem is initially created, data regarding how, where, and by whom it was created can be put in this attribute in human-readable string form, so that it will be available when propagated to subsequent copies of the data.

The opaque string 'current' should provide whatever information is available about the source of the current copy, such as the tool that created it, any relevant parameters to that tool, the time at which the copy was made, the user making the change, and the server on which the change was made. All such information should be in human-readable string form.

The version field provides a version identification, in the form of a time value, such that successive versions always have later time values. When the filesystem type is anything other than STATUS4_VERSIONED, the server may provide such a value, but there is no guarantee as to its validity, and clients will not use it except to provide additional information to add to 'source' and 'current'. When the type is STATUS4_VERSIONED, servers should provide a value of version which progresses monotonically whenever any new version of the data is established. This allows the client, if reliable image progression is important to it, to fetch this attribute as part of each COMPOUND where data or metadata from the filesystem is used.

When it is important to the client to make sure that only valid successor images are accepted, it must make sure that it does not read data or metadata from the filesystem without updating its sense of the current state of the image. This avoids the possibility that the fs_status which the client holds will be one for an earlier image, leading it to accept a new filesystem instance which is later than that but still earlier than data the client has already read. In order to do this reliably, it must do a GETATTR of fs_status that follows any interrogation of data or metadata within the filesystem in question. Often this is most conveniently done by appending such a GETATTR after all other operations that reference a given filesystem. When errors occur between reading filesystem data and performing such a GETATTR, care must be exercised to make sure that the data in question is not used before the proper fs_status value is obtained. In this connection, when an OPEN is done within such a versioned filesystem and the associated GETATTR of fs_status is not successfully completed, the open file in question must not be accessed until that fs_status is fetched.

The procedure above will ensure that, before using any data from the filesystem, the client has in hand a newly fetched current version of the filesystem image. Multiple values from multiple requests in flight can be resolved by assembling them into the required partial order (the elements should form a total order within it) and using the last. The client may then, when switching among filesystem instances, decline to use an instance which is not of type STATUS4_VERSIONED or whose version field is earlier than the last one obtained from the predecessor filesystem instance.
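A minimal client-side sketch of this acceptance check follows, in C; the structures mirror fs4_status in simplified form, and the helper names and the representation of nfstime4 as seconds/nanoseconds fields are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified stand-ins for nfstime4 and fs4_status. */
    struct nfstime {
        int64_t  seconds;
        uint32_t nseconds;
    };

    enum status_type {
        STATUS_FIXED = 1, STATUS_UPDATED = 2, STATUS_VERSIONED = 3,
        STATUS_WRITABLE = 4, STATUS_ABSENT = 5
    };

    struct fs_status {
        enum status_type type;
        struct nfstime   version;
    };

    static bool time_before(struct nfstime a, struct nfstime b)
    {
        return a.seconds < b.seconds ||
               (a.seconds == b.seconds && a.nseconds < b.nseconds);
    }

    /* Accept a candidate successor instance only if it is versioned
     * and its version is not earlier than the last version observed
     * on the predecessor instance. */
    static bool acceptable_successor(const struct fs_status *candidate,
                                     struct nfstime last_seen_version)
    {
        return candidate->type == STATUS_VERSIONED &&
               !time_before(candidate->version, last_seen_version);
    }

Here last_seen_version would be the maximum version assembled from the in-flight responses, as described above, and acceptable_successor() would gate any switch to a new filesystem instance.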
4. Migration Protocol

As discussed above, it has always been anticipated that a migration protocol would be developed to address the issue of migration of a filesystem between different filesystem implementations. This need remains, and it can be expected that, as client implementations of migration become more common, it will become more pressing; the working group needs to consider seriously how that need may best be addressed.

We suggest that the working group should seriously consider what may be a significantly lighter-weight alternative: the addition of features to support server-to-server migration within NFSv4 itself, taking advantage of existing NFSv4 facilities and adding only the features needed to support efficient migration, as items within a minor version.

One thing that needs to be made clear is that a common migration protocol does not mean a common migration approach or common migration functionality.
Thus the need for the kinds of information provided by fs_location_info remains. For example, the fact that the migration protocol will make available on the target the fileid, filehandle, and change attribute from the source does not mean that the receiving server can store these values natively, or that it will choose to implement translation support to accommodate the values exported by the source. This will remain an implementation choice. Clients will need information about those various choices, such as would be provided by fs_location_info, in order to deal with the various implementations.

4.1. NFSv4.x as a Migration Protocol

Whether the following approach or any other is adopted, considerable work will still be required to flesh out the details, requiring a number of drafts for a problem statement, initial protocol spec, etc. But to give an idea of what would be involved in this kind of approach, a rough sketch is given below.

First, let us fix for the moment on a pull model, in which the target server, selected by a management application, pulls data from the source using NFSv4.x. The target server acts as a client, albeit a specially privileged one, to copy the existing data.

The first point to be made is that using NFSv4 means that we have a representation for all data that is representable within NFSv4, and that this representation is maintained automatically as minor versioning proceeds. That is, when attributes are added in a minor version of NFSv4, they are "automatically" added to the migration copy protocol, because the two are the same.

The presence of COMPOUND is a further help, in that implementations will be able to maintain high throughput when copying without creating a special protocol devoted to that purpose. For example, when copying a large set of small files, these files can all be read with a single COMPOUND. This means that the benefit of creating a stream format for the entire fs is much reduced, and existing servers (with small modifications) can simply support the kinds of access they have to support anyway. The servers acting as clients would probably use a non-standard implementation, but they would share much infrastructure with more standard clients, so this would probably be a win on the implementation side as well as on the specification side.

One other point is that if the migration protocol were in fact an NFSv4.x, NFSv4 developments such as pNFS would be available for high-performance migration, with no special effort.

Clearly, there is still considerable work to be done here, even if it is not of the same order as a new protocol. The working group needs to discuss this and see if there is agreement that a means of cross-server migration is worthwhile and whether this is the best way to get there. Here is a basic list of things that would have to be dealt with to effect a transfer:

o Reads without changing access times. This is probably best done as a per-session attribute (it is best to assume sessions here).

o Reads that ignore share reservations and mandatory locks. It may be that the existing all-ones special stateid is adequate.

o A way to obtain the locking state information for the source fs: the locks (byte-range and share reservations) for that fs, including associated stateids and owner opaque strings, clientids, and the other identifying client information for all clients with locks on that fs.
This is all protocol-defined, rather than implementation-specific, data.

o A way to lock out changes on a filesystem. This would be similar to a read delegation on the entire filesystem, but would have a greater degree of privilege, in that the holder would be allowed to keep it as long as its lease was renewed.

o A way to permanently terminate existing access to the filesystem (by everyone except the calling session) and report it as MOVED to the users.

Conventions as to appropriate security for such operations would have to be developed to assure interoperability, but this is a question of establishing conventions rather than defining new mechanisms.

Given the facilities above, one could get an initial image of a filesystem, and then rescan and update the destination until the amount of change to be propagated stabilized. At this point, changes could be locked out and a final set of updates propagated while read-only access to the filesystem continued. At that point, further access would be locked out, and the locking state and any final changes to access time would be propagated. The access-time scan would be manageable, since the client could issue long COMPOUNDs with many PUTFH-GETATTR pairs, and many such requests could be in flight at a time (see the sketch at the end of this section).

If it were required that the disruption to access be smaller, some small additions to the functionality might be quite effective:

o Notifications for a filesystem, perhaps building on the notifications proposed in the directory delegations document, would limit the rescanning for changes, and so would make the window in which additional changes could happen much smaller. This would greatly reduce the window in which write access would have to be locked out.

o A facility for global scans for attribute changes could help reduce lockout periods. Something that gave a list of object filehandles that met a given attribute search criterion (e.g. attribute x greater than, less than, or equal to some value) could reduce rescan update times and also rescan times for access-time updates.

These lists assume that the server initiating the transfer is doing its own writing to disk. Extending this to writing the new fs via NFSv4 would require further protocol support. The basic message for the working group is that the set of things to do is of moderate size and builds in large part on existing or already proposed facilities.
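To illustrate the batching pattern mentioned above, here is a rough C sketch of assembling a COMPOUND of PUTFH-GETATTR pairs for an access-time scan. The op tags and builder interface are hypothetical stand-ins, not actual NFSv4 library calls; real code would XDR-encode each operation and its arguments.

    #include <stdio.h>

    /* Hypothetical op tags standing in for XDR-encoded NFSv4 ops. */
    enum op { OP_PUTFH, OP_GETATTR };

    #define MAX_OPS 1024

    struct compound {
        enum op ops[MAX_OPS];
        int     n;
    };

    static int add_op(struct compound *c, enum op o)
    {
        if (c->n >= MAX_OPS)
            return -1;      /* COMPOUND full; caller starts a new one */
        c->ops[c->n++] = o;
        return 0;
    }

    /* Batch one PUTFH-GETATTR pair per file into a single COMPOUND;
     * a migrating server acting as client would keep several such
     * COMPOUNDs in flight at once. */
    int main(void)
    {
        struct compound c = { .n = 0 };
        int files = 400;
        for (int i = 0; i < files; i++) {
            if (add_op(&c, OP_PUTFH) < 0 || add_op(&c, OP_GETATTR) < 0)
                break;      /* send this COMPOUND, begin the next */
        }
        printf("built COMPOUND with %d ops (%d files)\n", c.n, c.n / 2);
        return 0;
    }

The point of the sketch is only the shape of the request: many per-file op pairs amortized over one round trip, which is what makes the access-time scan tractable without a special-purpose protocol.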
Acknowledgements

The authors wish to thank Ted Anderson and Jon Haswell for their contributions to the ideas within this document.

Authors' Addresses

David Noveck
Network Appliance, Inc.
375 Totten Pond Road
Waltham, MA 02451 USA

Phone: +1 781 768 5347
EMail: dnoveck@netapp.com

Rodney C. Burnett
IBM, Inc.
13001 Trailwood Rd
Austin, TX 78727 USA

Phone: +1 512 838 8498
EMail: cburnett@us.ibm.com

Full Copyright Statement

Copyright (C) The Internet Society (2005).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification, can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgement

Funding for the RFC Editor function is currently provided by the Internet Society.