Network Working Group                                     Robert Thurlow
Internet Draft                                                 June 2002
Document: draft-thurlow-nfsv4-repl-mig-design-00.txt

   Server-to-Server Replication/Migration Protocol Design Principles

Status of this Memo

   This document is an Internet-Draft and is subject to all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

   Discussion and suggestions for improvement are requested. This document will expire in December, 2002. Distribution of this draft is unlimited.

Abstract

   NFS Version 4 [RFC3010] provided support for client/server interactions to support replication and migration, but left unspecified how replication and migration would be done. This document discusses the nature of a protocol to be used to transfer filesystem data and metadata for use with replication and migration services for NFS Version 4.

Expires: December 2002                                          [Page 1]

Title            Replication/Migration Design Principles       June 2002

Table of Contents

   1. Introduction
   1.1. Definitions of terms
   1.1.1. Replication
   1.1.2. Migration
   1.2. Current practice
   1.3. The problem
   1.3.1. NFS clients today
   1.3.2. NFS Version 4
   1.4. The need for a transfer protocol
   2. Requirements
   2.1. Interoperability
   2.2. Transparency
   2.3. Security
   2.4. Efficiency
   2.5. Scalability
   3. What the protocol will not do (now)
   4. Design considerations
   4.1. Basic structure
   4.2. Administrative Control
   4.3. Basic environment
   4.4. Handling file changes
   4.5. Replication model
   5. Security considerations
   6. Implementation considerations
   6.1. Filehandle preservation
   6.2. Data transfer phases
   6.3. Operation on filesystem subsets
   7. Difficult issues
   7.1. Transparency violations
   7.2. Directory access
   8. Bibliography
   9. Author's Address

1. Introduction

   Though used in different circumstances, replication of data and migration of data share a common problem: how to accurately transfer data (which may be in use by applications) from one location to another with reasonable bandwidth usage and in reasonable time. Years ago, this was done by taking storage offline (or at least preventing write access), making a tape copy of the data files, and walking it to the new machine, after warning the twenty or so people who cared about it. Networks reduced wear on sneakers, but many of the data formats we use for filesystem copies are little improved - they are either lowest-common-denominator standards like "tar" and "cpio" or internal dump formats which are non-standard. Today, with distributed filesystems like NFS Version 4, richer metadata including Access Control Lists (ACLs) and extended attributes, and potential users all over the enterprise and the Internet, we need something better - a standard, complete and extensible protocol to transfer filesystems.

   Though data replication and transfer are needed in many areas, this document will focus primarily on solving the problem of providing replication and migration support between NFS Version 4 servers. It is assumed that the reader is familiar with NFS Version 4 [RFC3010].

1.1. Definitions of terms

1.1.1. Replication

   Filesystem replication is the creation of a functionally identical copy of a filesystem, usually to enhance availability or provide for redundancy or disaster recovery. For example, a company may set up replicas of a customer database accessed by employees in different geographies. The data sets are often read-only, and initial creation of a replica is not as interesting a problem as maintaining the replica efficiently over time via incremental updates, which will likely be set up to push automatically.

1.1.2. Migration

   Filesystem migration is the moving of a filesystem to another server for load balancing purposes or because a user or server has moved. For example, a user may have moved from one building to another, or across the country, and want his home directory to follow him, or it may just be time to decommission an old server and move data to a new one. Only one data transfer is done, and it is important for this to be done efficiently and with the lowest possible impact on users.

1.2. Current practice

   System administrators typically have several options available to them to replicate or migrate files, but none of them covers the whole problem space:

   o The pax, cpio and tar tape archivers as defined by IEEE 1003.1 or ISO/IEC 9945-1 are often used without tape over a network for data transfer; these support only generic Unix metadata and do not support ACLs or extended attributes

   o The rdist (http://www.magnicomp.com/rdist) and rsync (http://samba.anu.edu.au/rsync) applications focus on propagating changes to replicas, but are documented only by source code, are not available on all platforms, and do not support more than generic Unix metadata

   o "cp -r" or its equivalent over NFS Version 4 could work in cases where capabilities of servers were the same, but if the destination did not support ACLs or extended attributes, would it do what the user wanted?
   o Most server filesystems have a "dump" format of some kind, which can preserve all data and metadata as long as there are no architectural differences in the servers

   o Most server vendors have products which can keep replicas in sync by monitoring changes at the block level below the server filesystem, which are again inherently tied to one architecture

   o Most of the above tools are not set up to properly deal with exotic metadata which may be present on filesystems like MacOS's HFS or NTFS, which can result in loss of data even when transferring to the same platform

1.3. The problem

1.3.1. NFS clients today

   Replication and migration events both cause problems for NFS clients, which may have applications operating on data when the event occurs. Past versions of NFS did not provide any support in the protocol for the client, and typical clients did not even attempt to find another replica which might provide service.

1.3.2. NFS Version 4

   NFS Version 4 [RFC3010] introduced some extra error codes and attributes to improve this situation. For replication, the new "fs_locations" attribute could be retrieved by the client to determine if multiple locations were available, so that when a server became unavailable, the client could fail over to a new location without hoping updated information was available in its name service. For migration and in the case of a decommissioned replica, the NFS4ERR_MOVED error would inform a client that it should consult "fs_locations" and make contact with a new server responsible for the data. In both cases, a client is required to establish a relationship with a new server, which may involve state recovery and using saved pathname information to discover new filehandles.

1.4. The need for a transfer protocol

   To support NFS Version 4, a method is needed to transfer functionally complete filesystem data from one server to another.
   The shortcomings of the common tools listed in Section 1.2 demonstrate that there is value in a standard protocol to transfer filesystem data.

2. Requirements

   The requirements for a replication and migration protocol are to be addressed in a separate document, but are approximately these:

2.1. Interoperability

   The replication/migration protocol must first and foremost be one which can potentially be implemented on any server. Several vendors already have a replication mechanism in their product lines which takes advantage of known properties of their servers to replicate at the block level, but this is inherently tied to one system.

2.2. Transparency

   When a client has been using a file which has been migrated, it should be able to detect this and recover the file state on the new server without applications needing to take action. Similarly, when a client has availability problems with a particular replica, it should be able to adapt to the use of the new replica without application involvement. This implies that, as far as possible, the replication/migration protocol must copy all filesystem data, as much metadata as possible, and all non-recoverable transient state such as outstanding lock and delegation state, completely and correctly. It is acceptable that the client must recover some state, as occurs in the event of a server reboot.

2.3. Security

   NFS Version 4 supports strong mandatory-to-implement security mechanisms to protect the integrity and privacy of file data and metadata. The replication/migration protocol must specify mandatory-to-implement security to protect data in transit, and provide a security payload and an encryption mechanism to ensure strong security for each message. It is expected that the security mechanisms will correlate well with NFS Version 4 [RFC3010].

2.4. Efficiency

   The replication/migration protocol must get the job of data movement done as efficiently as possible in terms of both bandwidth and time. Components of this are:

   o the protocol will conserve bandwidth by streaming data in large blocks with limited header overhead

   o the protocol will transfer changed regions in files rather than complete files whenever possible

   o the protocol will permit restart in the event of a server failure or lost connection

2.5. Scalability

   The replication/migration protocol must be able to handle both huge files and huge filesystems, while maintaining low enough overhead to work well with small filesystems.

3. What the protocol will not do (now)

   There have been discussions about the things a good replication protocol could do which are not considered part of the scope of this work, though some of them could be specified by future RFCs. These non-requirements include:

   o being an "rdist" or "rsync" replacement

   o being a tool to permit unprivileged users to copy file trees

   o being used for replication of other types of data

4. Design considerations

4.1. Basic structure

   For best performance, a replication/migration protocol should be able to move large amounts of data without frequent small packets in the direction of data movement. Use of RPC [RFC1831] may be inappropriate; current thinking is that the protocol should be composed of messages encoded with XDR [RFC1832], exchanged under the control of a finite state machine. Groups of messages would probably include:

   o Initialization and negotiation messages

   o Filesystem information messages

   o Data transfer messages

   o Finalization messages

4.2. Administrative Control

   The replication and migration protocol should include nothing specifying how an administrative user contacts a server to initiate replication or migration. A separate document should define a mechanism suitable for this purpose.
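
   As an illustration of the structure described in Section 4.1, the following Python sketch shows XDR-style message framing checked by a small finite state machine. It is not part of any specification: the group numbering, the transition table, the framing (a 32-bit group identifier followed by a variable-length opaque payload) and all names such as MsgGroup, ALLOWED and Session are assumptions made for the example.

```python
import struct
from enum import IntEnum


class MsgGroup(IntEnum):
    """Hypothetical message groups, named after the list in Section 4.1."""
    INIT = 1      # initialization and negotiation
    FS_INFO = 2   # filesystem information
    DATA = 3      # data transfer
    FINAL = 4     # finalization


# Assumed transition table: the groups that may follow each state.
# None is the state before any message has been accepted.
ALLOWED = {
    None: {MsgGroup.INIT},
    MsgGroup.INIT: {MsgGroup.INIT, MsgGroup.FS_INFO},
    MsgGroup.FS_INFO: {MsgGroup.FS_INFO, MsgGroup.DATA},
    MsgGroup.DATA: {MsgGroup.DATA, MsgGroup.FINAL},
    MsgGroup.FINAL: set(),
}


def encode_message(group, payload):
    """Encode as XDR: unsigned int group, then variable-length opaque data.

    XDR items are big-endian and padded with zero bytes to a multiple of 4.
    """
    pad = (4 - len(payload) % 4) % 4
    return struct.pack(">II", group, len(payload)) + payload + b"\x00" * pad


def decode_message(data):
    """Return (group, payload) from one encoded message."""
    group, length = struct.unpack_from(">II", data, 0)
    return MsgGroup(group), data[8:8 + length]


class Session:
    """Receiver-side state machine that enforces message group ordering."""

    def __init__(self):
        self.state = None

    def accept(self, data):
        group, payload = decode_message(data)
        if group not in ALLOWED[self.state]:
            raise ValueError(f"{group.name} not allowed in state {self.state}")
        self.state = group
        return payload
```

   A receiver built this way rejects, for example, a data transfer message that arrives before negotiation has completed. A real design would likely need finer-grained states and distinct message types within each group.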

4.3. Basic environment

   The replication/migration protocol should be available to a privileged context on a well-known TCP port on an NFSv4 server, able to authenticate and act on control messages from administration clients and general messages from other servers.

4.4. Handling file changes

   For replication, it should be possible to handle large files changed in small ways without transferring the entire file. The protocol needs to be able to express changes to byte ranges within a file; ideally, the server will be able to extract such changes from some kind of change log or from internal filesystem data. However, this may not be practical. The existence of "rdist" shows that a bidirectional protocol can determine differences in files at a reasonable bandwidth cost, and it would be good for the replication/migration protocol to be able to operate in this mode.

4.5. Replication model

   Replication is usually set up as a series of read-only replicas, with the master copy of the filesystem generally inaccessible to the client or accessible through a different mount point. It is possible to envision a case where, along with several read-only replicas, a single writer is available and "marked" as such in the fs_locations attribute. The client would have to ensure that all reads and writes were directed to the writable copy from the time a particular file on the filesystem was first written to the time the client ceased caring about the file. This is considered beyond our current scope.

5. Security considerations

   NFS Version 4 is the primary impetus behind a replication/migration protocol, so this protocol should mandate a strong security scheme and security negotiation in a manner compatible with NFS Version 4.
   Since NFS Version 4 specifies RPCSEC_GSS [RFC2203], which in turn builds on GSS-API [RFC2078], it makes sense for a replication/migration protocol to specify RPCSEC_GSS if it is based on RPC, and GSS-API if it is not based on RPC. Kerberos Version 5 will be used as described in [RFC1964] to provide one security framework. The LIPKEY GSS-API mechanism described in [RFC2847] will be used to provide for the use of user password and server public key. An initial message exchange will permit security negotiation.

   The replication/migration protocol will also specify a NULL security mechanism to optimize its performance when used with strong host-based security mechanisms such as SSH and IPsec.

6. Implementation considerations

6.1. Filehandle preservation

   Filehandles are the basic shorthand used by clients to perform most operations on files. They are opaque to the client, but are usually derived from:

   o the fsid of the filesystem

   o the fileid or "inode number" of the directory shared by the server

   o the fileid or "inode number" of the file

   o the "generation number", an internal field to support inode reuse.

   It is, in some circumstances, desirable to preserve persistent filehandles across a replication or migration event. The most likely circumstance for this is when both servers are of the same architecture, and when the destination server can assign values to these fields as data is accepted. To support this case, the filehandle should be available as an attribute which can be passed to the new server. Some operating environments will not have interfaces to support access to this data or a way to recreate it, so this should be negotiated so that this data is not sent unnecessarily.

   Even if a server implementation can transfer and accept persistent filehandles, it must ensure that the client is not falsely promised that this will happen.
   [RFC3010] specifies that a server may migrate a filesystem with persistent filehandles as long as the new server also uses persistent filehandles and the same filehandles will correspond to the same files after migration. In the general case, the decision to migrate a filesystem, perhaps to a heterogeneous server with different filehandles, will be made after clients have accessed filesystems and learned the value of the "fh_expire_type" attribute. Thus it seems necessary that servers return an "fh_expire_type" of at least FH4_VOL_MIGRATION so that clients will always store partial pathnames for later use. It is possible for clients to attempt to use pre-event filehandles with the new server in the hope that persistent filehandles have been transferred intact, but there is no way for the server to promise this unless it will never transfer to a server of a different implementation.

6.2. Data transfer phases

   For both replication and migration, transfer most generally happens in two phases: first, the bulk of the data is copied to the target while access to the source filesystem continues, and second, changes made since the start of the first phase are transferred while write access to the source filesystem is curtailed. This reduces the window during which clients will see restrictions, at the cost of needing a method to lock out writes to files in the file tree. For replication, it would be possible to bypass locking by the use of multiple point-in-time copies ("snapshots"), since the delta represented by each snapshot could be used to update the replicas.

6.3. Operation on filesystem subsets

   When NFSv4 clients discover that they must react to a replication or migration event, [RFC3010] states that they will recover at the granularity of an entire filesystem, i.e. a set of files sharing the same "fsid" attribute. This protocol could also be useful for splitting up large filesystems so that the pieces can be replicated and migrated separately.
   This can most easily be done if the server can arrange to return distinct "fsid"s for subdirectories of what it manages as a single filesystem.

7. Difficult issues

7.1. Transparency violations

   When the protocol is used between servers that are sufficiently different, it may be impossible for the new server to support some metadata enumerated in the data stream, or it may be that metadata critical to the new server are not supported on the old. When this happens, the client may notice and react badly to the loss of transparency. Sources of this kind of problem include:

   o Filename encoding differences

   o Attributes supported on one server and not the other

   o A failure of atomicity during transfer

   o Incomplete or no transfer of locking, delegation and other state

7.2. Directory access

   When a directory is read, a series of RPCs is used to get the entries in small parts. The sequence of RPCs is tied together by a "cookie" returned by the server in each reply and used by the client in the next request. The sequence can be interrupted by a replication or migration event, which can lead to NFS4ERR_BAD_COOKIE on the new server, even if the servers are of the same architecture, because directory entries may have been created in a different order or compacted differently.

8. Bibliography

   [RFC1831] R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification Version 2", RFC1831, August 1995.

   [RFC1832] R. Srinivasan, "XDR: External Data Representation Standard", RFC1832, August 1995.

   [RFC3010] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, D. Noveck, "NFS version 4 Protocol", RFC3010, December 2000.

   [RDIST] MagniComp, Inc., "The RDist Home Page", http://www.magnicomp.com/rdist.

   [RSYNC] The Samba Team, "The rsync web pages", http://samba.anu.edu.au/rsync.

9. Author's Address

   Address comments related to this memorandum to:

      nfsv4-wg@sunroof.eng.sun.com

   Robert Thurlow
   Sun Microsystems, Inc.
   500 Eldorado Boulevard, UBRM05-171
   Broomfield, CO 80021

   Phone: 877-718-3419
   E-mail: robert.thurlow@sun.com