INTERNET-DRAFT Brent Welch Panasas Inc. Benny Halevy Panasas Inc. David Black EMC Corporation Andy Adamson CITI University of Michigan Dave Noveck Network Appliance Document: draft-welch-pnfs-ops-00.txt October 2004 Expires: April 2005 pNFS Operations Summary October 2004 Status of this Memo By submitting this Internet-Draft, I certify that any applicable patent or other IPR claims of which I am aware have been disclosed, or will be disclosed, and any of which I become aware will be disclosed, in accordance with RFC 3668. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Copyright Notice Copyright (C) The Internet Society (2004). All Rights Reserved. Abstract This Internet-Draft provides a description of the pNFS extension for NFSv4. The key feature of the protocol extension is the ability for clients to perform read and write operations that go directly from the client to individual storage system elements without funneling all such accesses through a single file server. Of course, the file server must coordinate the client I/O so that the file system retains its integrity. welch-pnfs-ops Expires - April 2005 [Page 1] Internet-Draft pNFS Operations Summary October 2004 The extension adds operations that query and manage layout information that allows parallel I/O between clients and storage system elements. 
The layouts are managed in a way similar to delegations in that they have leases and can be recalled by the server, but layout information is independent of delegations.

Table of Contents

   1. Introduction                                        3
   2. General Definitions                                 3
   2.1 Metadata                                           3
   2.2 Storage Device                                     4
   2.3 Storage Protocol                                   4
   2.4 Management Protocol                                4
   2.5 Layout                                             4
   3. Layouts and Aggregation                             5
   4. Security Information                                6
   4.1 Object Storage Security                            6
   4.2 File Security                                      6
   4.3 Block Security                                     7
   5. pNFS Typed data structures                          7
   5.1 pnfs_layoutclass4                                  7
   5.2 pnfs_deviceid4                                     7
   5.3 pnfs_devaddr4                                      7
   5.4 pnfs_devlist_item4                                 8
   5.5 pnfs_layouttype4                                   8
   5.6 pnfs_layout4                                       8
   6. pNFS File Attributes                                9
   6.1 pnfs_layoutclass4<> LAYOUT_CLASSES                 9
   6.2 pnfs_layouttype4 LAYOUT_TYPE                       9
   6.3 pnfs_layouttype4 LAYOUT_HINT                       9
   7. pNFS Error Definitions                              9
   8. pNFS Operations                                     9
   8.1 LAYOUTGET - Get Layout Information                 9
   8.2 LAYOUTCOMMIT - Commit writes made using a layout  11
   8.3 LAYOUTRETURN - Release Layout Information         13
   8.4 GETDEVICEINFO - Get Device Information            14
   8.5 GETDEVICELIST - Get List of Devices               15
   9. Callback Operations                                16
   9.1 CB_LAYOUTRECALL                                   16
   10. Usage Scenarios                                   17
   10.1 Basic Read Scenario                              17
   10.2 Multiple Reads to a File                         17
   10.3 Multiple Reads to a File with Delegations        17
   10.4 Read with existing writers                       18
   10.5 Read with later conflict                         18
   10.6 Basic Write Case                                 18
   10.7 Large Write Case                                 19
   10.8 Create with special layout                       19
   11. Layouts and Aggregation                           19
   11.1 Simple Map                                       19
   11.2 Block Map                                        19
   11.3 Striped Map (RAID 0)                             20
   11.4 Replicated Map                                   20
   11.5 Concatenated Map                                 20
   11.6 Nested Map                                       20
   12. Issues                                            21
   12.1 Storage Protocol Negotiation                     21
   12.2 Crash recovery                                   21
   12.3 Storage Errors                                   21
   13. References                                        22
   14. Acknowledgments                                   22
   15. Author's Addresses                                22
   16. Full Copyright Notice                             23

1.
Introduction

The pNFS extension to NFSv4 takes the form of new operations that return data location information called a "layout". The layout is protected by layout delegations. When a client holds a layout delegation, it has the right to access the data directly using the location information in the layout. There are both read and write layouts, and they may apply only to a sub-range of the file's contents. Layout delegations are managed in a fashion similar to NFSv4 data delegations (e.g., they are recallable and revocable), but they are distinct abstractions and are manipulated with new operations as described below. To avoid confusion between the existing NFSv4 data delegations and layout delegations, the term "layout" is used below to mean "layout delegation".

There are new attributes that describe general layout characteristics. However, attributes alone do not provide everything needed to support layouts, hence the use of operations instead. Finally, there are issues about how layout delegations interact with the existing NFSv4 abstractions of data delegations and byte-range locking. These issues (and more) are also discussed here.

2. General Definitions

This protocol extension partitions the file system protocol into two parts, the control path and the data path. The control path is implemented by the extended (p)NFSv4 file server, while the data path may be implemented by direct communication between the file system client and the storage devices. This leads to a few new terms used to describe the protocol extension.

2.1 Metadata

This is information about a file, such as its name, owner, where it is stored, and so forth. The information is managed by the File server (sometimes called the metadata manager). Metadata also includes lower-level information such as block addresses and indirect block pointers.
Depending on the storage protocol, block-level metadata may be managed by the File server, or it may instead be managed by Object Storage Devices or by other File servers acting as Storage Devices.

2.2 Storage Device

This is a device, or server, that controls the file's data, but leaves other metadata management up to the file server (i.e., the metadata manager). A Storage Device could be another NFS server, an Object Storage Device (OSD), or a block device accessed over a SAN (either Fibre Channel or iSCSI). The goal of this extension is to allow direct communication between clients and storage devices.

2.3 Storage Protocol

This is the protocol between the client and the storage device used to access the file data. There are three primary types: file protocols (such as NFSv4 or NFSv3), object protocols (OSD), and block protocols (SCSI block commands, or "SBC"). These protocols are in turn layered over transport protocols such as RPC/TCP/IP, iSCSI/TCP/IP, or FC/SCSI. We anticipate there will be variations on these storage protocols, including new protocols that are unknown at this time or experimental in nature. The details of the storage protocols will be described in other documents so that pNFS clients can be written to use them.

2.4 Management Protocol

This is the protocol between the File server and the Storage devices. This protocol is outside the scope of this draft and is used for various management activities, including storage allocation and deallocation. For example, the regular NFSv4 OPEN operation is used to create a new file. The OPEN is applied to the File server, which in turn uses the management protocol to allocate storage on the storage devices. The file server returns a layout for the new file that the client uses to access the new file directly.
The management protocol could be entirely private to the File server and Storage devices, and need not be published in order to implement a pNFS client that uses the associated Storage protocol.

2.5 Layout

(Also, "map") A layout defines how a file's data is organized on one or more storage devices. There are many possible layout types. They vary in the storage protocol used to access the data and in the aggregation scheme that lays out the file data on the underlying storage devices. Layouts are described in more detail below.

3. Layouts and Aggregation

The layout, or "map", is a typed data structure that has variants to handle different storage protocols (block, object, and file). A layout describes a range of a file's contents. For example, a block layout might be an array of tuples that store (deviceID, block_number, block_count) along with information about block size and the file offset of the first block. An object layout is an array of tuples (deviceID, objectID) plus an additional structure (i.e., the aggregation map) that defines how the logical byte sequence of the file data is serialized into the different objects. A file layout is an array of tuples (deviceID, file_handle), along with a similar aggregation map.

The deviceID is a short name for a storage device. In practice, a significant amount of information may be required to fully identify a storage device. Instead of embedding all that information in a layout, a level of indirection is used. Layouts embed device IDs, and a new operation (GETDEVICEINFO) is used to retrieve the complete identity information about the storage device. For example, the identity of a file server or object server could be an IP address and port. The identity of a block device could be a volume label.
Due to multipath connectivity in a SAN environment, agreement on a volume label is considered the reliable way to locate a particular storage device.

Aggregation schemes can describe layouts such as a simple one-to-one mapping, concatenation, and striping. A general aggregation scheme allows nested maps so that more complex layouts can be compactly described. The canonical aggregation type for this extension is striping, which allows a client to access storage devices in parallel. Even a one-to-one mapping is useful for a file server that wishes to distribute its load among a set of other file servers. There are also experimental aggregation types such as writeable mirrors and RAID; however, these are outside the scope of this document.

The file server is in control of the layout for a file, but the client can provide hints to the server when a file is opened or created about preferred layout parameters. The pNFS extension introduces a LAYOUT_HINT attribute that the client can query at any time, and can set with a compound SETATTR after OPEN to provide a hint to the server for new files.

While not completely specified in this summary, there must be adjunct specifications that precisely define layout formats to allow interoperability among clients and metadata servers. The point is that the metadata server will give out layouts of a particular class (block, object, or file) and aggregation, and the client needs to select a "layout driver" that understands how to use that layout. The API used by the client to talk to its drivers is outside the scope of the pNFS extension, but it is an important notion to keep in mind when thinking about this work. The storage protocol between the client's layout driver and the actual storage is covered by other protocols such as SBC (block storage), OSD (object storage), or NFS (file storage).

4.
Security Information

All existing NFS security mechanisms apply to the operations added by this extension. However, this extension is used in conjunction with other storage protocols for client-to-storage access, and each storage protocol introduces its own security constraints. Clients may need security information in order to complete direct data access. The rest of this section gives an overview of the security schemes used by different storage protocols. However, the details are outside the scope of this protocol extension and private to the storage protocol. We only assume that the file server returns security tokens to the client, which uses them when accessing storage. The file server does permission checking before issuing the security tokens.

4.1 Object Storage Security

The object storage protocol relies on a cryptographically secure capability to control accesses at the object storage devices. Capabilities are generated by the metadata server, returned to the client, and passed to the object storage device, which verifies that the capability allows the requested operation. Each capability is specific to a particular object and an operation on that object, covers a byte range within the object, and has an explicit expiration time. The capabilities are signed with a secret key that is shared by the object storage devices (OSDs) and the metadata managers. Typically each OSD has a set of master keys and working keys, and the working keys are rotated periodically under the control of the metadata manager. Clients do not have device keys, so they are unable to forge capabilities. Capabilities need to be protected from snooping, which can be done by using facilities such as IPsec to create a secure VPN that contains the clients, the file server, and the storage devices.

4.2 File Security

The file storage protocol has the same security mechanism between the client and metadata server as between the client and data server.
This implies that the files that store the data need the same ACL as the metadata file that represents the "control point" for the file. This ensures that access control decisions are consistent between the metadata server and the data server.

One alternative that was briefly discussed was the introduction of special file handles that essentially have the properties of capabilities, so they can be generated by the metadata servers and checked by the data servers. (Peter Corbett described "one shot" file handles.) To be effective, these need all the properties of a capability so that the data server can efficiently and securely enforce the access control decisions made by the metadata manager.

[We need to elaborate on this section. We should be able to leverage the NFSv4 GSS context between the client and the NFSv4 "Storage Devices".]

4.3 Block Security

The block model relies on SAN-based security and trusts that clients will only access the blocks they have been directed to use. In these systems, there may not need to be any additional security information returned with the map. There are LUN masking/unmapping and zone-based security schemes that can be manipulated to fence clients from each other's data. These are fairly heavyweight operations that are not expected to be part of the normal execution path for pNFS. However, a metadata server can always fall back to these mechanisms if it needs to prevent a client from accessing storage (i.e., to "fence" the client).

5. pNFS Typed data structures

5.1 pnfs_layoutclass4

   uint16_t pnfs_layoutclass4;

A layout class specifies a family of layout types. The implication is that clients have "layout drivers" for one or more layout classes. The file server advertises the layout classes it supports through the LAYOUT_CLASSES file system attribute.
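As an illustration (hypothetical client-side structures only; the layout driver API is explicitly outside the scope of this extension), a client might keep a registry of its layout drivers keyed by layout class and intersect it with what the server advertises:

```python
# Hypothetical client-side layout-driver registry. Class numbers are
# placeholders; an IANA registry of layout classes is still a TODO.
class LayoutDriverRegistry:
    def __init__(self):
        self.drivers = {}                    # pnfs_layoutclass4 -> driver

    def register(self, layout_class, driver):
        self.drivers[layout_class] = driver

    def usable_classes(self, server_layout_classes):
        """Intersect the server's LAYOUT_CLASSES attribute with the
        classes this client has drivers for."""
        return sorted(c for c in server_layout_classes
                      if c in self.drivers)

registry = LayoutDriverRegistry()
registry.register(1, "file-driver")          # placeholder class numbers
registry.register(3, "object-driver")
```

A client encountering a new fsid would query LAYOUT_CLASSES and consult such a registry before deciding whether to attempt LAYOUTGET at all.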
A client asks for layouts of a particular class in LAYOUTGET and passes those layouts to its layout driver. A layout is further typed by a pnfs_layouttype4 that identifies a particular layout in the family of layouts of that class. Custom installations should be allowed to introduce new layout classes.

[There is an IANA issue here for the initial set of well-known layout classes. There should also be a reserved range for custom layout classes used in local installations.]

5.2 pnfs_deviceid4

   uint32_t pnfs_deviceid4;    /* 32-bit device ID */

Layout information includes device IDs that specify a data server with a compact handle. Addressing and type information is obtained with the GETDEVICEINFO operation.

5.3 pnfs_devaddr4

   struct pnfs_devaddr4 {
       uint16_t  type;
       string    r_netid<>;    /* network ID */
       string    r_addr<>;     /* universal address */
   };

This value is used to set up a communication channel with the storage device. For now we borrow the structure of a clientid4 and assume we will be able to specify SAN devices as well as TCP/IP devices using this format. The type is used to distinguish between known types.

[TODO: we need an enum of known device address types. These include IP+port for file servers and object storage devices. There may be several types for different variants on SAN volume labels. Do we need a concrete definition of volume labels for SAN block devices? We have discussed a scheme where the volume label is defined as a set of tuples that allow matching on the initial contents of a SAN volume in order to determine equality. If we do this, is this type a discriminated union with a fixed number of branches? One type would be an IP/port combination for an NFS or iSCSI device. Another type would be this volume label specification.]
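The deviceID indirection can be sketched as a client-side cache (illustrative only; getdeviceinfo here is a stand-in for the actual GETDEVICEINFO RPC):

```python
# Hypothetical client-side cache for the deviceID -> device address
# indirection. Layouts carry only compact device IDs; the full
# pnfs_devaddr4 is fetched (and cached) via GETDEVICEINFO.
class DeviceCache:
    def __init__(self, getdeviceinfo):
        # getdeviceinfo: callable mapping device_id -> (type, netid, addr)
        self.getdeviceinfo = getdeviceinfo
        self.cache = {}

    def lookup(self, device_id):
        if device_id not in self.cache:
            self.cache[device_id] = self.getdeviceinfo(device_id)
        return self.cache[device_id]

    def invalidate(self, device_id):
        # On storage-protocol errors the client drops its entry and
        # re-queries the server, since address information can change
        # after reconfiguration events.
        self.cache.pop(device_id, None)
```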
5.4 pnfs_devlist_item4

   struct pnfs_devlist_item4 {
       pnfs_deviceid4  id;
       pnfs_devaddr4   addr;
   };

An array of these values is returned by the GETDEVICELIST operation. They define the set of devices associated with a file system.

5.5 pnfs_layouttype4

   struct pnfs_layouttype4 {
       pnfs_layoutclass4  class;
       uint16_t           type;
   };

The protocol extension enumerates known layout types and their structure. Additional layout types may be added later. To allow for graceful extension of layout types, the type is broken into two fields.

[TODO: We should chart out the major layout classes and representative instances of them, then indicate how new layout classes can be introduced. Alternatively, we can put these definitions into the document that specifies the storage protocol.]

5.6 pnfs_layout4

   union pnfs_layout4 switch (pnfs_layouttype4 type) {
       default:
           opaque layout_data<>;
   };

This opaque type defines a layout. As noted, we need to flesh out this union with a number of "blessed" layouts for different storage protocols and aggregation types.

6. pNFS File Attributes

6.1 pnfs_layoutclass4<> LAYOUT_CLASSES

This attribute applies to a file system and indicates which layout classes are supported by the file system. We expect this attribute to be queried when a client encounters a new fsid. This attribute is used by the client to determine whether it has applicable layout drivers.

6.2 pnfs_layouttype4 LAYOUT_TYPE

This attribute indicates the particular layout type used for a file. It is for informational purposes only. The client needs to use the LAYOUTGET operation in order to get enough information (e.g., specific device information) to perform I/O.

6.3 pnfs_layouttype4 LAYOUT_HINT

This attribute is set on newly created files to influence the file server's choice for the file's layout.

7.
pNFS Error Definitions

   NFS4ERR_LAYOUTUNAVAILABLE   Layouts are not available for the file
                               or its containing file system.

   NFS4ERR_LAYOUTTRYLATER      Layouts are temporarily unavailable for
                               the file; the client should retry later.

8. pNFS Operations

8.1 LAYOUTGET - Get Layout Information

SYNOPSIS

   (cfh), layout_class, iomode, sharemode, offset, length ->
      layout_stateid, layout

ARGUMENT

   enum layoutget_iomode4 {
       LAYOUTGET_READ   = 1,
       LAYOUTGET_WRITE  = 2,
       LAYOUTGET_RW     = 3
   };

   enum layoutget_sharemode4 {
       LAYOUTGET_SHARED    = 1,
       LAYOUTGET_EXCLUSIVE = 2
   };

   struct LAYOUTGET4args {
       /* CURRENT_FH: file */
       pnfs_layoutclass4     layout_class;
       layoutget_iomode4     iomode;
       layoutget_sharemode4  sharemode;
       offset4               offset;
       length4               length;
   };

RESULT

   struct LAYOUTGET4resok {
       stateid4      layout_stateid;
       pnfs_layout4  layout;
   };

   union LAYOUTGET4res switch (nfsstat4 status) {
       case NFS4_OK:
           LAYOUTGET4resok resok4;
       default:
           void;
   };

DESCRIPTION

Requests a layout for reading or writing the file given by the filehandle at the byte range given by offset and length. The client requests either a shared or exclusive sharing mode for the layout to indicate whether it provides its own synchronization mechanism. A shared layout allows cooperating clients to perform direct I/O using a layout that potentially conflicts with other clients. The clients are asserting that they are aware of this issue and can coordinate via an external mechanism (either NFSv4 advisory locks or, e.g., an MPI-IO toolkit). An exclusive layout means that the client wants the server to prevent other clients from making conflicting changes to the part of the file covered by the layout. An exclusive read layout, for example, would not be granted while there was an outstanding write layout that overlapped the range. Multiple exclusive read layouts can be given out for the same file range.
An exclusive write layout can only be given out if there are no other outstanding layouts for the specified range.

Issue - there is some debate about the default value for sharemode in client implementations. One view is that the safest scheme is to require applications to request shared layouts explicitly via, e.g., an ioctl() operation. Another view is that shared layouts during concurrent access provide the same risks and guarantees that NFS does today (i.e., there are only open-to-close sharing semantics) and that applications "know" they should use advisory locking to serialize access when they anticipate sharing. By specifying the sharemode in the protocol, we support both points of view.

The LAYOUTGET operation returns layout information for the specified byte range. To get a layout from a specific offset through the end of file (no matter how long the file actually is), use a length field with all bits set to 1 (one). If the length is zero, or if a length that is not all bits set to one is specified and that length added to the offset exceeds the maximum 64-bit unsigned integer value, the error NFS4ERR_INVAL results.

The format of the returned layout is specific to the underlying file system and is specified outside of this document.

If layouts are not supported for the requested file or its containing file system, the server should return NFS4ERR_LAYOUTUNAVAILABLE. If a layout for the file is unavailable due to transient conditions, e.g., file sharing prohibits layouts, the server should return NFS4ERR_LAYOUTTRYLATER.

On success, the current filehandle retains its value.

IMPLEMENTATION

Typically, LAYOUTGET will be called as part of a compound RPC after an OPEN operation and results in the client having location information for the file. The client specifies a layout class that limits what kind of layout the server will return.
This prevents servers from issuing layouts that are unusable by the client.

ERRORS

   NFS4ERR_INVAL
   NFS4ERR_NOTSUPP
   NFS4ERR_LAYOUTUNAVAILABLE
   NFS4ERR_LAYOUTTRYLATER
   TBD

8.2 LAYOUTCOMMIT - Commit writes made using a layout

SYNOPSIS

   (cfh), layout_stateid, offset, length, neweof, newlayout ->
      layout_stateid

ARGUMENT

   union neweof4 switch (bool eofchanged) {
       case TRUE:
           length4 eof;
       case FALSE:
           void;
   };

   struct LAYOUTCOMMIT4args {
       /* CURRENT_FH: file */
       stateid4  layout_stateid;
       neweof4   neweof;
       offset4   offset;
       length4   length;
       opaque    newlayout<>;
   };

RESULT

   struct LAYOUTCOMMIT4resok {
       stateid4 layout_stateid;
   };

   union LAYOUTCOMMIT4res switch (nfsstat4 status) {
       case NFS4_OK:
           LAYOUTCOMMIT4resok resok4;
       default:
           void;
   };

DESCRIPTION

Commit changes in the layout represented by the current filehandle and stateid. The LAYOUTCOMMIT operation indicates that the client has completed writes using a layout obtained by a previous LAYOUTGET. The client may have written only a subset of the data range it previously requested. LAYOUTCOMMIT allows it to commit or discard provisionally allocated space and to update the server with a new end of file. The layout argument to LAYOUTCOMMIT describes what regions have been used and what regions can be deallocated. The resulting layout is still valid after LAYOUTCOMMIT and can be referenced by the returned stateid for future operations.

The layout information is more verbose for block devices than for objects and files because the latter hide the details of block allocation behind their storage protocols. At a minimum, the client needs to communicate changes to the end-of-file location back to the server, along with its view of the file modify and access times. For blocks, it needs to specify precisely which blocks have been used.

The client may use a SETATTR operation in a compound right after LAYOUTCOMMIT in order to set the access and modify times of the file.
Alternatively, the server could use the time of the LAYOUTCOMMIT operation as the file modify time.

On success, the current filehandle retains its value.

ERRORS

   TBD

8.3 LAYOUTRETURN - Release Layout Information

SYNOPSIS

   (cfh), layout_stateid ->

ARGUMENT

   struct LAYOUTRETURN4args {
       /* CURRENT_FH: file */
       stateid4 layout_stateid;
   };

RESULT

   struct LAYOUTRETURN4res {
       nfsstat4 status;
   };

DESCRIPTION

Returns the layout represented by the current filehandle and layout_stateid. After this call, the client must not use the layout or the associated storage protocol to access the file data. Before it can do so again, it must get a new layout delegation with LAYOUTGET.

Layouts may be returned when recalled or voluntarily (i.e., before the server has recalled them). In either case, the client must propagate any state changed under the context of the layout to storage or to the server before returning the layout.

On success, the current filehandle retains its value.

If a client fails to return a layout in a timely manner, the File server should use its management protocol with the storage devices to fence the client from accessing the data referenced by the layout.

[TODO: We need to work out how clients return error information if they encounter problems with storage. We could return a single OK bit, or we could return more extensive information from the layout driver that describes the error condition in more detail. It seems we need an opaque "layout_error" type that is defined by the storage protocol along with its layout types.]
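The recall-then-fence behavior described above might be sketched as server-side logic (illustrative only; the timeout value and the fence_client management-protocol call are placeholders, not defined by this extension):

```python
# Hypothetical server-side handling of an unreturned layout after a
# recall. RECALL_TIMEOUT and fence_client() are assumptions chosen for
# illustration; the management protocol is outside this draft's scope.
RECALL_TIMEOUT = 120.0  # seconds, implementation-chosen

def handle_recall_expiry(layout, now, fence_client):
    """Decide what to do about a recalled layout: wait for the client,
    note that it was returned, or fence the client from the storage
    devices the layout references."""
    if layout["returned"]:
        return "returned"
    if now - layout["recalled_at"] < RECALL_TIMEOUT:
        return "waiting"
    fence_client(layout["client_id"], layout["devices"])
    return "fenced"
```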
ERRORS

   TBD

8.4 GETDEVICEINFO - Get Device Information

SYNOPSIS

   (cfh), device_id -> device_addr

ARGUMENT

   struct GETDEVICEINFO4args {
       pnfs_deviceid4 device_id;
   };

RESULT

   struct GETDEVICEINFO4resok {
       pnfs_devaddr4 device_addr;
   };

   union GETDEVICEINFO4res switch (nfsstat4 status) {
       case NFS4_OK:
           GETDEVICEINFO4resok resok4;
       default:
           void;
   };

DESCRIPTION

Returns device type and device address information for a specified device. The returned device_addr includes a type that indicates how to interpret the addressing information for that device. [TODO: or, it is a discriminated union.] At this time we expect two main kinds of device addresses: either an IP address and port number, or a SCSI volume identifier. The final protocol specification will detail the allowed values for device_type and the format of their associated location information.

Note that it is possible for the address information for a deviceID to change dynamically due to various system reconfiguration events. Clients may get errors on their storage protocol that cause them to query the metadata server with GETDEVICEINFO and refresh their information about a device.

8.5 GETDEVICELIST - Get List of Devices

SYNOPSIS

   (cfh) -> device_addr<>

ARGUMENT

   /* Current file handle */

RESULT

   struct GETDEVICELIST4resok {
       pnfs_devlist_item4 device_addr_list<>;
   };

   union GETDEVICELIST4res switch (nfsstat4 status) {
       case NFS4_OK:
           GETDEVICELIST4resok resok4;
       default:
           void;
   };

DESCRIPTION

In some applications, especially SAN environments, it is convenient to find out about all the devices associated with a file system. This lets a client determine whether it has access to these devices, e.g., at mount time.
This operation returns a list of items that establish the association between the short pnfs_deviceid4 and the addressing information for that device.

9. Callback Operations

9.1 CB_LAYOUTRECALL

SYNOPSIS

   stateid, fh ->

ARGUMENT

   struct CB_LAYOUTRECALLargs {
       stateid4 stateid;
       nfs_fh4  fh;
   };

RESULT

   struct CB_LAYOUTRECALLres {
       nfsstat4 status;
   };

DESCRIPTION

The CB_LAYOUTRECALL operation is used to begin the process of recalling a layout and returning it to the server. If the handle specified is not one for which the client holds a layout, an NFS4ERR_BADHANDLE error is returned. If the stateid specified is not one corresponding to a valid layout for the file specified by the filehandle, an NFS4ERR_BAD_STATEID error is returned.

Issue: We have debated adding another kind of callback to push new EOF information to the client. It may not be necessary; the client could discover that by polling for attributes.

IMPLEMENTATION

The client should reply to the callback immediately. Replying does not complete the recall except when an error was returned. The recall is not complete until the layout is returned using a LAYOUTRETURN. The client should complete any in-flight I/O operations using the recalled layout before returning it via LAYOUTRETURN. If the client has buffered dirty data, it may choose to write it directly to storage before calling LAYOUTRETURN, or to write it later using normal NFSv4 WRITE operations.

ERRORS

   NFS4ERR_BADHANDLE
   NFS4ERR_BAD_STATEID
   TBD

10. Usage Scenarios

This section describes common open, close, read, and write interactions and how they work with layout delegations. [TODO: this section feels rough and I'm not sure it adds value in its present form.]

10.1 Basic Read Scenario

   Client does an OPEN to get a file handle.
   Client does a LAYOUTGET for a range of the file, gets back a layout.
   Client uses the storage protocol and the layout to access the file.
   Client returns the layout with LAYOUTRETURN.
   Client closes the stateID and open delegation with CLOSE.

This is rather boring, as the client is careful to clean up all server state after only a single use of the file.

10.2 Multiple Reads to a File

   Client does an OPEN to get a file handle.
   Client does a LAYOUTGET for a range of the file, gets back a layout.
   Client uses the storage protocol and the layout to access the file.
   Client closes the stateID with CLOSE.
   Client does an OPEN to get a file handle.
   Client finds the cached layout associated with the file handle.
   Client uses the storage protocol and the layout to access the file.
   Client closes the stateID with CLOSE.

A bit more interesting, as we have saved the LAYOUTGET operation, but we are still doing server round trips.

10.3 Multiple Reads to a File with Delegations

   Client does an OPEN to get a file handle and an open delegation.
   Client does a LAYOUTGET for a range of the file, gets back a layout.
   Client uses the storage protocol and the layout to access the file.
   Application does a close(), but the client keeps state under the
   delegation.
   (time passes)
   Application does another open(), which the client handles under the
   delegation.
   Client finds the cached layout associated with the file handle.
   Client uses the storage protocol and the layout to access the file.
   (pattern continues until the open delegation and/or layout is
   recalled)

This illustrates the efficiency of combining open delegations and layouts to eliminate interactions with the file server altogether. Of course, we assume the client's operating system is only allowing the local open() to succeed based on the file permissions. The use of layouts does not change anything about the semantics of open delegations.
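The delegation-plus-cached-layout pattern in 10.3 can be sketched as client-side logic (hypothetical structures; server_open and server_layoutget stand in for the real OPEN and LAYOUTGET compounds):

```python
# Hypothetical client open path combining an open delegation with a
# cached layout. A second open of the same file is handled entirely
# locally under the delegation, with no server round trip.
def client_open(state, path, server_open, server_layoutget):
    if path in state.get("delegations", {}):
        # Handled under the open delegation: reuse the cached layout.
        return state["layouts"][path]
    fh, delegation = server_open(path)            # OPEN compound
    layout = server_layoutget(fh)                 # LAYOUTGET compound
    state.setdefault("layouts", {})[path] = layout
    if delegation:
        state.setdefault("delegations", {})[path] = delegation
    return layout
```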
10.4 Read with existing writers

   NOTE: this scenario was under some debate, but we have resolved that
   the server is able to give out overlapping/conflicting layout
   information to different clients. In these cases we assume that
   clients use an external mechanism such as MPI-IO to synchronize and
   serialize access to shared data. One can argue that even
   unsynchronized clients get the same open-to-close consistency
   semantics that NFS already provides, even when going direct to
   storage.

   Client does an OPEN to get an open stateid. The file is open for
   writing elsewhere by different clients, so no open delegation is
   returned.
   Client does a LAYOUTGET and gets a layout from the server.
   Client accesses data via the layout and storage protocol, either
   synchronizing with the writers or not.

   There are no guarantees about when data written by the writer
   becomes visible to the reader. Once the writer has closed the file
   and flushed its updates to storage, they are visible to the client.
   [TODO: we really aren't explaining the sharemode field here.]

10.5 Read with later conflict

   ClientA does an OPEN to get an open stateid and open delegation.
   ClientA does a LAYOUTGET for a range of the file, gets back a layout
   and layout stateid.
   ClientA uses the storage protocol to access the file data.
   ClientB opens the file for WRITE.
   File server issues CB_RECALL to ClientA.
   ClientA issues DELEGRETURN.
   ClientA continues to use the storage protocol to access file data.
   If it is accessing data from its cache, it periodically checks that
   its data is still up to date, because it no longer holds an open
   delegation.

   [This is an odd scenario that mixes in open delegations for no real
   value. Basically, this is a "regular" writer being mixed with a pNFS
   reader. I guess this example shows that no particular semantics are
   provided during the simultaneous access.
   If the server so chose, it could also recall the layout with
   CB_LAYOUTRECALL to force the different clients to serialize at the
   file server.]

10.6 Basic Write Case

   Client does an OPEN to get a file handle.
   Client does a LAYOUTGET for a range of the file, gets back a layout
   and layout stateid.
   Client writes to the file using the storage protocol.
   Client uses LAYOUTCOMMIT to communicate the new EOF position.
   Client does a SETATTR to update timestamps.
   Client does a LAYOUTRETURN.
   Client does a CLOSE.

   Again, the boring case where the client cleans up all of its server
   state by returning the layout.

10.7 Large Write Case

   Client does an OPEN to get a file handle.
   (loop)
   Client does a LAYOUTGET for a range of the file, gets back a layout
   and layout stateid.
   Client writes to the file using the storage protocol.
   Client fills up the range covered by the layout.
   Client updates the server with LAYOUTCOMMIT, communicating the new
   EOF position.
   Client does a SETATTR to update timestamps.
   Client releases the layout with LAYOUTRETURN.
   (end loop)
   Client does a CLOSE.

10.8 Create with special layout

   Client does an OPEN and a SETATTR that specifies a particular layout
   type using the LAYOUT_HINT attribute.
   Client gets back an open stateid and file handle.
   (etc.)

11. Layouts and Aggregation

   This section describes several layout formats in a semi-formal way
   to provide context for the layout delegations. These definitions
   will be formalized in other protocols. However, the set of
   understood types is part of this protocol in order to provide for
   basic interoperability.

   The layout descriptions include <deviceID, objectID> tuples that
   identify some storage object on some storage device. The addressing
   information associated with the deviceID is obtained with
   GETDEVICEINFO. The interpretation of the objectID depends on the
   storage protocol. The objectID could be a filehandle for an NFSv4
   data server.
   It could be an OSD object ID for an object server. The layout for a
   block device generally includes additional block map information to
   enumerate the blocks or extents that are part of the layout.

11.1 Simple Map

   The data is located on a single storage device. In this case the
   file server can act as the front end for several storage devices and
   distribute files among them. Each file is limited in its size and
   performance characteristics by a single storage device. The simple
   map consists of a single <deviceID, objectID> tuple.

11.2 Block Map

   The data is located on a LUN in the SAN. The layout consists of an
   array of <deviceID, block extent, blocksize> tuples. Alternatively,
   the blocksize could be specified once and applied to all entries in
   the layout.

11.3 Striped Map (RAID 0)

   The data is striped across storage devices. The parameters of the
   stripe include the number of storage devices (N) and the size of
   each stripe unit (U). A full stripe of data is N * U bytes. The
   stripe map consists of an ordered list of <deviceID, objectID>
   tuples and the parameter value for U. The first stripe unit (the
   first U bytes) is stored on the first <deviceID, objectID> tuple,
   the second stripe unit on the second tuple, and so forth until the
   first complete stripe. The data layout then wraps around, so that
   byte (N*U) of the file is stored on the first <deviceID, objectID>
   tuple in the list, but starting at offset U within that object. The
   striped layout allows a client to read or write the component
   objects in parallel to achieve high bandwidth.

   The striped map for a block device would be slightly different: the
   map is an ordered list of tuples in which the deviceID is rotated
   among a set of devices to achieve striping.

11.4 Replicated Map

   The file data is replicated on N data servers. The map consists of N
   <deviceID, objectID> tuples. When data is written using this map, it
   should be written to all N objects in parallel. When data is read,
   any component object can be used. This map type is controversial
   because it highlights the issues with error recovery.
   Those issues get interesting with any scheme that employs
   redundancy. The handling of errors (e.g., only a subset of the
   replicas gets updated) is outside the scope of this protocol
   extension. Instead, it is a function of the storage protocol and the
   metadata management protocol.

11.5 Concatenated Map

   The map consists of an ordered set of N <deviceID, objectID> tuples.
   Each successive tuple describes the next segment of the file.

11.6 Nested Map

   The nested map is used to compose more complex maps out of simpler
   ones. The map format is an ordered set of M sub-maps; each sub-map
   applies to a byte range within the file and has its own type, such
   as one of the types introduced above. Any level of nesting is
   allowed in order to build up complex aggregation schemes.

12. Issues

12.1 Storage Protocol Negotiation

   Clients may want to negotiate with the metadata server about their
   preferred storage protocol, and to find out what storage protocols
   the server offers. Clients can do this by querying the
   LAYOUT_CLASSES file system attribute. They respond by specifying a
   particular layout class in their LAYOUTGET operations.

12.2 Crash recovery

   We use the existing client crash recovery and server state recovery
   mechanisms in NFSv4. In particular, layouts have associated layout
   stateids that "expire" along with the rest of the client state. The
   main new issue introduced by pNFS is that the client may have to do
   a lot of I/O in response to a layout recall. The client may need to
   send RENEW operations to the server during this period if it would
   otherwise risk doing nothing within the lease time. Of course, the
   client should reply with its LAYOUTRETURN only after it knows its
   I/O has completed.

12.3 Storage Errors

   As noted under LAYOUTRETURN, the client needs a way to communicate
   the errors it encounters when accessing storage directly.
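Tying the CB_LAYOUTRECALL implementation notes together with the crash recovery discussion above, the following is a hedged sketch of one way a client might sequence a recall: reply to the callback immediately, flush in-flight I/O, send RENEW when the flush risks running past the lease, and only then issue LAYOUTRETURN. All names are hypothetical, and the renew-at-half-lease policy is an assumption, not something this draft specifies.

```python
# Illustrative recall handler: the ordering (reply first,
# LAYOUTRETURN last, RENEW during long flushes) follows the text;
# the half-lease renewal threshold is an assumed policy.

def handle_layout_recall(stateid, pending_ops, server, lease_time):
    server.reply_to_callback(stateid)    # reply does not complete recall
    elapsed = 0
    for op in pending_ops:               # finish in-flight I/O first
        elapsed += op.flush()            # seconds spent flushing this op
        if elapsed > lease_time / 2:     # renew well before lease expiry
            server.renew()
            elapsed = 0
    server.layoutreturn(stateid)         # recall is complete only now

class Recorder:
    """Fake server that records the order of client operations."""
    def __init__(self):
        self.calls = []
    def reply_to_callback(self, stateid):
        self.calls.append("reply")
    def renew(self):
        self.calls.append("renew")
    def layoutreturn(self, stateid):
        self.calls.append("layoutreturn")

class Op:
    """In-flight I/O whose flush takes a known number of seconds."""
    def __init__(self, cost):
        self.cost = cost
    def flush(self):
        return self.cost
```

With three 3-second flushes against a 10-second lease, the handler replies first, renews once mid-flush, and returns the layout last.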
13. References

   1  Gibson et al., "pNFS Problem Statement",
      ftp://www.ietf.org/internet-drafts/draft-gibson-pnfs-problem-statement-01.txt,
      July 2004.

14. Acknowledgments

   Many members of the pNFS informal working group have helped
   considerably. The authors would like to thank Gary Grider, Peter
   Corbett, Dave Noveck, and Peter Honeyman. This work is inspired by
   the NASD and OSD work done by Garth Gibson. Gary Grider of Los
   Alamos National Laboratory (LANL) has been a champion of
   high-performance parallel I/O.

15. Author's Addresses

   Brent Welch
   Panasas, Inc.
   6520 Kaiser Drive
   Fremont, CA 94555 USA
   Phone: +1 (510) 608 7770
   Email: welch@panasas.com

   Benny Halevy
   Panasas, Inc.
   1501 Reedsdale St., #400
   Pittsburgh, PA 15233 USA
   Phone: +1 (412) 323 3500
   Email: bhalevy@panasas.com

   David L. Black
   EMC Corporation
   176 South Street
   Hopkinton, MA 01748
   Phone: +1 (508) 293-7953
   Email: black_david@emc.com

   Andy Adamson
   CITI, University of Michigan
   519 W. William
   Ann Arbor, MI 48103-4943 USA
   Phone: +1 (734) 764-9465
   Email: andros@umich.edu

   David Noveck
   Network Appliance
   375 Totten Pond Road
   Waltham, MA 02451 USA
   Phone: +1 (781) 768 5347
   Email: dnoveck@netapp.com

16. Full Copyright Notice

   Copyright (C) The Internet Society (2004). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.
   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
   INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.