INTERNET-DRAFT Brent Welch Panasas Inc. Benny Halevy Panasas Inc. David Black EMC Corporation Andy Adamson CITI University of Michigan Dave Noveck Network Appliance Document: draft-welch-pnfs-ops-01.txt 19 May 2005 Expires: November 2005 pNFS Operations Summary May 2005 Status of this Memo By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Copyright Notice Copyright (C) The Internet Society (2005). All Rights Reserved. Abstract This Internet-Draft provides a description of the pNFS extension for NFSv4. welch-pnfs-ops Expires - November 2005 [Page 1] Internet-Draft pNFS Operations Summary May 2005 The key feature of the protocol extension is the ability for clients to perform read and write operations that go directly from the client to individual storage system elements without funneling all such accesses through a single file server. Of course, the file server must coordinate the client I/O so that the file system retains its integrity. The extension adds operations that query and manage layout information that allows parallel I/O between clients and storage system elements. The layouts are managed in a similar way as delegations in that they have leases and can be recalled by the server, but layout information is independent of delegations. Table of Contents 1. Introduction 4 2. General Definitions 6 2.1 Metadata 6 2.2 Storage Device 6 2.3 Storage Protocol 6 2.4 Management Protocol 7 2.5 Layout 7 3. Layouts and Aggregation 7 4. Security Information 8 4.1 File Security 10 4.2 Object Storage Security 10 4.3 Block Security 11 5. pNFS Typed data structures 11 5.1 pnfs_layoutclass4 11 5.2 pnfs_deviceid4 12 5.3 pnfs_devaddr4 12 5.4 pnfs_devlist_item4 12 5.5 pnfs_layouttype4 13 5.6 pnfs_layout4 13 6. pNFS File Attributes 13 6.1 pnfs_layoutclass4<> LAYOUT_CLASSES 13 6.2 pnfs_layouttype4 LAYOUT_TYPE 13 6.3 pnfs_layouttype4 LAYOUT_HINT 13 7. pNFS Error Definitions 14 8. pNFS Operations 14 8.1 LAYOUTGET - Get Layout Information 14 8.2 LAYOUTCOMMIT - Commit writes made using a layout 16 8.3 LAYOUTRETURN - Release Layout Information 17 8.4 GETDEVICEINFO - Get Device Information 18 8.5 GETDEVICELIST - Get List of Devices 19 9. Callback Operations 20 9.1 CB_LAYOUTRECALL 20 10. Usage Scenarios 21 10.1 Basic Read Scenario 21 welch-pnfs-ops Expires - November 2005 [Page 2] Internet-Draft pNFS Operations Summary May 2005 10.2 Multiple Reads to a File 21 10.3 Multiple Reads to a File with Delegations 21 10.4 Read with existing writers 22 10.5 Read with later conflict 22 10.6 Basic Write Case 23 10.7 Large Write Case 23 10.8 Create with special layout 23 11. 
   Layouts and Aggregation                                      23
   11.1  Simple Map                                              24
   11.2  Block Map                                               24
   11.3  Striped Map (RAID 0)                                    24
   11.4  Replicated Map                                          24
   11.5  Concatenated Map                                        25
   11.6  Nested Map                                              25
   12.   Issues                                                  25
   12.1  Storage Protocol Negotiation                            25
   12.2  Crash recovery                                          25
   12.3  Storage Errors                                          25
   13.   References                                              26
   14.   Acknowledgments                                         26
   15.   Author's Addresses                                      26
   16.   Full Copyright Notice                                   27

1. Introduction

The NFSv4 protocol [RFC 3530] specifies the interaction between a Client that accesses files and a Server that provides access to files and is responsible for coordinating access by multiple clients. As described in the pNFS problem statement, this requires that all access to a set of files exported by a single NFSv4 Server be performed by that server; at high data rates the Server may become a bottleneck.

The parallel NFS (pNFS) extensions to NFSv4 allow data accesses to bypass this bottleneck by permitting direct Client access to the Storage containing the file data. When file data for a single NFSv4 Server is stored on multiple and/or higher throughput Storage Systems (by comparison to the Server's throughput capability), the result can be significantly better file access performance. The relationship among multiple Clients, a single Server, and multiple Storage Systems for pNFS (Server and Clients have access to all Storage Systems) is shown in this diagram:

      +-----------+
      |+-----------+                                 +-----------+
      ||+-----------+                                |           |
      |||           |         NFSv4 + pNFS           |           |
      +||  Clients  |<------------------------------>|   Server  |
       +|           |                                |           |
        +-----------+                                |           |
            |||                                      +-----------+
            |||                                           |
            |||                                           |
            |||                +-----------+              |
            |||                |+-----------+             |
            ||+----------------||+-----------+            |
            |+-----------------|||           |            |
            +------------------+||  Storage  |------------+
                                +|  Systems  |
                                 +-----------+

In this structure, the responsibility for coordination of file access by multiple clients is shared among the Server, Clients, and Storage Systems, in contrast to NFSv4 by itself, where this is primarily the Server's responsibility, some of which can be delegated to Clients under strictly specified conditions. pNFS specifies only the NFSv4 extensions required to distribute file access coordination between the Server and its Clients. The protocols used to access the Storage Systems are deliberately not specified, as a variety of such protocols can be used, including:

   - Block protocols such as iSCSI, parallel SCSI, and FCP (SCSI over
     Fibre Channel) [refs].  The block protocol support can be
     independent of the addressing structure of the block protocol
     used, allowing more than one protocol to access the same file
     data and enabling extensibility to other block protocols.

   - Object protocols such as OSD over iSCSI or Fibre Channel [ref].

   - File protocols such as NFSv4 itself.  In this case, the Storage
     Systems would also be NFSv4 Servers whose files could only be
     accessed by contacting the main Server (see above) first.

   - Other storage protocols, including PVFS and other file systems
     that are in use in HPC environments.

pNFS is designed to accommodate these protocols and be extensible to new classes of storage protocols that may be of interest.

The distribution of file access coordination between the Server and its Clients increases the level of responsibility placed on Clients.
Clients are already responsible for ensuring that suitable access checks are made to cached data and that attributes are suitably propagated to the server. Generally, a single-user Client's misbehavior can only impact files accessible to that single user. Misbehavior by a multi-user Client may impact files accessible to all of its users. Delegations increase the level of Client responsibility, as a client that carries out actions requiring a delegation without obtaining that delegation will cause its user(s) to see unexpected and/or incorrect behavior.

Some uses of pNFS extend the responsibility of Clients further: in some configurations, the Storage Systems cannot check at a fine grain that Clients are performing only those accesses permitted to them by the pNFS operations with the Server (e.g., the checks may be possible only at filesystem granularity rather than file granularity). In situations where this added responsibility placed on clients creates unacceptable security risks, pNFS configurations in which Storage Systems cannot perform fine-grained access checks SHOULD NOT be used. All pNFS implementations MUST support NFSv4 access to any file accessible via pNFS in order to provide an interoperable means of file access in such situations. See Section 4 on Security for further discussion.

The pNFS extension to NFSv4 takes the form of new operations that return data location information called a "layout". The layout is protected by layout delegations. When a client has a layout delegation, it has rights to access the data directly using the location information in the layout. There are both read and write layouts, and they may apply to only a sub-range of the file's contents. The layout delegations are managed in a fashion similar to NFSv4 data delegations (e.g., they are recallable and revocable), but they are distinct abstractions and are manipulated with new operations as described below. To avoid any confusion between the existing NFSv4 data delegations and layout delegations, the term "layout" is used below to mean "layout delegation".

There are new attributes that describe general layout characteristics. However, attributes do not provide everything needed to support layouts, hence the use of operations instead. Finally, there are issues about how layout delegations interact with the existing NFSv4 abstractions of data delegations and byte-range locking. These issues (and more) are also discussed here.

2. General Definitions

This protocol extension partitions the NFSv4 file system protocol into two parts, the control path and the data path. The control path is implemented by the extended (p)NFSv4 server. When the file system being exported by (p)NFSv4 employs storage devices, the data path may be implemented by direct communication between the extended (p)NFSv4 file system client and the storage devices. This leads to a few new terms used to describe the protocol extension.

2.1 Metadata

This is information about a file, such as its name, its owner, where it is stored, and so forth. The information is managed by the exported file system server (sometimes called the metadata server). Metadata also includes lower-level information like block addresses and indirect block pointers. Depending on the storage protocol, block-level metadata may be managed by the metadata server, or it may instead be managed by Object Storage Devices or other servers acting as a Storage Device.
2.2 Storage Device This is a device, or server, that controls the file's data, but leaves other metadata management up to the metadata server. A Storage Device could be another NFS server, or an Object Storage Device (OSD) or a block device accessed over a SAN (either FiberChannel or iSCSI SAN). The goal of this extension is to allow direct communication between pNFSv4 clients and storage devices. 2.3 Storage Protocol This is the protocol between the pNFSv4 client and the storage device used to access the file data. There are three primary types: file protocols (such as NFSv4 or NFSv3), object protocols (OSD), and block protocols (SCSI-block commands, or "SBC"). These protocols are in turn layered over transport protocols such as RPC/TCP/IP or welch-pnfs-ops Expires - November 2005 [Page 6] Internet-Draft pNFS Operations Summary May 2005 iSCSI/TCP/IP or FC/SCSI. We anticipate there will be variations on these storage protocols, including new protocols that are unknown at this time or experimental in nature. The details of the storage protocols will be described in other documents so that pNFS clients can be written to use these storage protocols. 2.4 Management Protocol This is the protocol used by the exported file system between the metadata server and storage devices. This protocol is outside the scope of this draft, and is used for various management activities that include storage allocation and deallocation. For example, the regular NFSv4 OPEN operation is used to create a new file over pNFSv4. The pNFSv4 server applies the open to the file system it is exporting, which in turn uses the management protocol to allocate storage on the storage devices. The pNFSv4 server receives a layout for the new file from the exported file system which it returns to the pNFSv4 client for direct access to the new file. The management protocol could be entirely private to the exported file system, and need not be published in order to implement a pNFSv4 client that uses the associated storage protocol. (Note: I think "Management Protocol" can easily be confused with protocols used to manage LUNs in a SAN and other sysadmin kinds of tasks. Here we really mean a much finer grain protocol between the file server and the storage devices for the purposes of implementing individual files.) 2.5 Layout (Also, "map") A layout defines how a file's data is organized on one or more storage devices. There are many possible layout types. They vary in the storage protocol used to access the data, and in the aggregation scheme that lays out the file data on the underlying storage devices. Layouts are described in more detail below. 3. Layouts and Aggregation The layout, or "map", is a typed data structure that has variants to handle different storage protocols (block, object, and file). A layout describes a range of a file's contents. For example, a block layout might be an array of tuples that store (deviceID, block_number, block count) along with information about block size and the file offset of the first block. An object layout is an array of tuples (deviceID, objectID) and an additional structure (i.e., the aggregation map) that defines how the logical byte sequence of the file data is serialized into the different objects. A file layout could be an array of tuples (deviceID, file_handle), along with a similar aggregation map. welch-pnfs-ops Expires - November 2005 [Page 7] Internet-Draft pNFS Operations Summary May 2005 The deviceID is a short name for a storage device. 
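The following C sketch is non-normative and purely illustrative: it shows one way a client-side layout driver might represent the three example layout flavors just described in memory. None of the type or field names below are defined by this extension, and on the wire a layout remains an opaque value (see the pnfs_layout4 union in Section 5.6).

   /* Illustrative only: possible in-memory forms of the example
    * layouts described above.  These names are NOT part of the
    * protocol; the wire format of a layout is opaque. */
   #include <stdint.h>

   typedef uint32_t deviceid_t;          /* compact ID, see pnfs_deviceid4 */

   struct block_extent {                 /* one entry of a block layout */
       deviceid_t dev;
       uint64_t   block_number;
       uint32_t   block_count;
   };

   struct block_layout {
       uint32_t   block_size;            /* bytes per block */
       uint64_t   first_block_offset;    /* file offset of the first block */
       uint32_t   nextents;
       struct block_extent *extents;
   };

   struct object_entry {                 /* one entry of an object layout */
       deviceid_t dev;
       uint64_t   object_id;
   };

   struct file_entry {                   /* one entry of a file layout */
       deviceid_t    dev;
       uint32_t      fh_len;
       unsigned char fh[128];            /* NFSv4 filehandle bytes */
   };

Object and file layouts would also carry the aggregation map mentioned above (e.g., the striping parameters of Section 11.3); that part is omitted here for brevity.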
In practice, a significant amount of information may be required to fully identify a storage device. Instead of embedding all that information in a layout, a level of indirection is used. Layouts embed device IDs, and a new operation (GETDEVICEINFO) is used to retrieve the complete identity information about the storage device. For example, the identity of a file server or object server could be an IP address and port. The identity of a block device could be a volume label. Due to multipath connectivity in a SAN environment, agreement on a volume label is considered the reliable way to locate a particular storage device.

Aggregation schemes can describe layouts like simple one-to-one mapping, concatenation, and striping. A general aggregation scheme allows nested maps so that more complex layouts can be compactly described. The canonical aggregation type for this extension is striping, which allows a client to access storage devices in parallel. Even a one-to-one mapping is useful for a file server that wishes to distribute its load among a set of other file servers. There are also experimental aggregation types such as writeable mirrors and RAID; however, these are outside the scope of this document.

The metadata server is in control of the layout for a file, but the pNFSv4 client can provide hints to the server about preferred layout parameters when a file is opened or created. The pNFSv4 extension introduces a LAYOUT_HINT attribute that the client can query at any time, and can set with a SETATTR that follows OPEN in a compound request to provide a hint to the server for new files.

While not completely specified in this summary, there must be adjunct specifications that precisely define layout formats to allow interoperability among clients and metadata servers. The point is that the metadata server will give out layouts of a particular class (block, object, or file) and aggregation, and the client needs to select a "layout driver" that understands how to use that layout. The API used by the client to talk to its drivers is outside the scope of the pNFS extension, but is an important notion to keep in mind when thinking about this work. The storage protocol between the client's layout driver and the actual storage is covered by other specifications such as SBC (block storage), OSD (object storage), or NFS (file storage).

4. Security Information

The pNFSv4 extension partitions the NFSv4 file system protocol into two parts, the control path and the data path. The control path contains all the new operations described by this extension; all existing NFSv4 security mechanisms and features apply to the control path. The combination of components in a pNFS system (see diagram in Section 1) is required to preserve the security properties of NFSv4 with respect to an entity accessing data via a Client, including security countermeasures to defend against threats for which NFSv4 provides defenses in environments where these threats are considered significant. In some cases, the security countermeasures for connections to Storage Systems may take the form of physical isolation or a recommendation not to use pNFS in an environment.
For example, it is currently infeasible to provide confidentiality protection for some Storage System access protocols to protect against eavesdropping; in environments where eavesdropping on such protocols is of sufficient concern to require countermeasures, physical isolation of the communication channel (e.g., via direct connection from Client(s) to Storage System(s)) and/or a decision to forego use of pNFS (e.g., and fall back to NFSv4) may be appropriate courses of action. In full generality where communication with Storage Systems is subject to the same threats as Client-Server communication, the protocols used for that communication need to provide security mechanisms comparable to those available via RPSEC_GSS for NFSv4. Many situations in which pNFS is likely to be used will not be subject to the overall threat profile for which NFSv4 is required to provide countermeasures. pNFS implementations MUST NOT remove NFSv4's access controls. The combination of Clients, Storage Systems, and the Server are responsible for ensuring that all Client to Storage System file data access respects NFSv4 ACLs and file open modes. This entails performing both of these checks on every access in the Client, the Storage System, or both. If a pNFS configuration performs these checks only in the Client, the risk of a misbehaving Client obtaining unauthorized access is an important consideration in determining when it is appropriate to use such a pNFS configuration. Such configurations SHOULD NOT be used when Client- only access checks do not provide sufficient assurance that NFSv4 access control is being applied correctly. The following subsections describe security considerations specifically applicable to each of the three major Storage System protocol classes supported for pNFS. [Additional security info - the object protocol needs this, but it may be out-of-band; the OSD experts will know for sure. For Block and File an approach of the Client being expected to know what it needs when it sees what it's being asked to access probably suffices, although we might be able to help (e.g., pass iSCSI CHAP authentication identities, but NOT secrets, via pNFS). For File in particular, defaulting to the NFSv4 principal is probably welch-pnfs-ops Expires - November 2005 [Page 9] Internet-Draft pNFS Operations Summary May 2005 a good idea, although it's not strictly necessary.] [Requiring strict equivalence to NFSv4 security mechanisms is the wrong approach. Will need to lay down a set of statements that each protocol has to make starting with access check location/properties.] [Not sure about "Security" as the last word of each subsection title.] 4.1 File Security NFSv4 can be used as a protocol for communication with Storage Systems. In this case, the Storage Systems would be NFSv4 Servers whose files could only be accessed via pNFS communication with the main server. If the Storage System NFSv4 Servers use the same principals as the main Server, and do not use a Storage System file to contain parts of multiple files (as seen by a Client) NFSv4 will make all of the required access checks. On the other hand, if the principals are different or Storage Server files contain portions of more than one file as seen by the Client, the Client may be required to make additional access checks. 4.2 Object Storage Security The object storage protocol relies on a cryptographically secure capability to control accesses at the object storage devices. 
Capabilities are generated by the metadata server, returned to the client, and used by the client as described below to authenticate its requests to the Object Storage Device (OSD). Capabilities therefore achieve the required access and open mode checking. They allow the file server to define and check a policy (e.g., open mode) and the OSD to check and enforce that policy without knowing the details (e.g., user IDs and ACLs).

Each capability is specific to a particular object, an operation on that object, and a byte range within the object, and it has an explicit expiration time. The capabilities are signed with a secret key that is shared by the object storage devices (OSDs) and the metadata managers. Clients do not have device keys, so they are unable to forge capabilities.

The details of the security and privacy model for Object Storage are outside the scope of this document and will be specified in the Object Storage version of the storage protocol definition. However, the following sketch of the algorithm should help the reader understand the basic model.

   LAYOUTGET returns {CapKey = MAC<SecretKey>(CapArgs), CapArgs}

The client uses CapKey to sign all the requests it issues for that object using the respective CapArgs. In other words, the CapArgs appears in the request to the Storage Device, and that request is signed with the CapKey as follows:

   ReqMAC = MAC<CapKey>(Req, Nonce)

The following is sent to the OSD: {CapArgs, Req, Nonce, ReqMAC}.

The OSD uses the SecretKey it shares with the metadata server to recompute CapKey = MAC<SecretKey>(CapArgs) and compares the ReqMAC the client sent with a locally computed MAC<CapKey>(Req, Nonce); if they match, the OSD assumes that the capability came from an authentic metadata server and allows access to the object, as permitted by the CapArgs.

Therefore, if the server LAYOUTGET reply, holding CapKey and CapArgs, is snooped by another client, it can be used to generate valid OSD requests (within the CapArgs access restrictions). To provide the required privacy for the capabilities returned by LAYOUTGET, the GSS-API can be used, e.g., by using a session key known to the file server and to the client to encrypt the whole layout or parts of it. Two general ways to provide privacy in the absence of GSS-API, independent of NFSv4, are an isolated network such as a VLAN or a secure channel provided by IPsec.
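The following self-contained C sketch is non-normative and only walks through the capability check described above. The mac() function is a toy placeholder (NOT a real MAC; a real implementation would use a keyed MAC such as HMAC-SHA1), and the string encodings of CapArgs and the request are invented for illustration; the real encodings belong to the Object Storage protocol specification.

   /* Non-normative sketch of the capability verification above. */
   #include <stdint.h>
   #include <stdio.h>
   #include <string.h>

   /* Placeholder keyed hash (FNV-1a over key||msg); NOT secure. */
   static uint64_t mac(const void *key, size_t klen,
                       const void *msg, size_t mlen)
   {
       uint64_t h = 1469598103934665603ULL;
       const unsigned char *p;
       size_t i;
       for (p = key, i = 0; i < klen; i++) h = (h ^ p[i]) * 1099511628211ULL;
       for (p = msg, i = 0; i < mlen; i++) h = (h ^ p[i]) * 1099511628211ULL;
       return h;
   }

   int main(void)
   {
       const char secret[]  = "shared-by-MDS-and-OSDs";   /* never sent to clients */
       const char capargs[] = "obj=17 op=READ range=0-65535 expiry=...";
       const char req[]     = "READ obj=17 off=0 len=4096";
       uint64_t nonce = 42;

       /* Metadata server: CapKey = MAC<SecretKey>(CapArgs), returned by LAYOUTGET. */
       uint64_t capkey = mac(secret, sizeof secret, capargs, sizeof capargs);

       /* Client: signs the request (plus nonce) with CapKey. */
       unsigned char buf[256];
       size_t n = 0;
       memcpy(buf + n, req, sizeof req);      n += sizeof req;
       memcpy(buf + n, &nonce, sizeof nonce); n += sizeof nonce;
       uint64_t reqmac = mac(&capkey, sizeof capkey, buf, n);

       /* OSD: recomputes CapKey from SecretKey and CapArgs, then checks ReqMAC. */
       uint64_t osd_capkey = mac(secret, sizeof secret, capargs, sizeof capargs);
       uint64_t check = mac(&osd_capkey, sizeof osd_capkey, buf, n);
       printf("request %s\n", check == reqmac ? "accepted" : "rejected");
       return 0;
   }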
4.3 Block Security

Block protocols rely on Clients to enforce file access checks, as the Storage Systems are generally unaware of the files they are storing (and in particular are unaware of which block belongs to which file). In environments where access control is important and Client-only access checks provide insufficient assurance of access control enforcement (e.g., there is concern about a malicious Client skipping the access check), the Storage Systems will generally be unable to compensate for these Client deficiencies. In such threat environments, Block protocols SHOULD NOT be used with pNFS; NFSv4 without pNFS may be a more suitable means of accessing files in the presence of such threats. Storage-System-specific mechanisms (e.g., LUN masking/mapping) may be available to prevent malicious or high-risk Clients from directly accessing Storage Systems.

5. pNFS Typed data structures

5.1 pnfs_layoutclass4

   uint16_t pnfs_layoutclass4;

A layout class specifies a family of layout types. The implication is that clients have "layout drivers" for one or more layout classes. The file server advertises the layout classes it supports through the LAYOUT_CLASSES file system attribute. A client asks for layouts of a particular class in LAYOUTGET, and passes those layouts to its layout driver. A layout is further typed by a pnfs_layouttype4 that identifies a particular layout in the family of layouts of that class. Custom installations should be allowed to introduce new layout classes.

[There is an IANA issue here for the initial set of well known layout classes. There should also be a reserved range for custom layout classes used in local installations.]

5.2 pnfs_deviceid4

   uint32_t pnfs_deviceid4;   /* 32-bit device ID */

Layout information includes device IDs that specify a data server with a compact handle. Addressing and type information is obtained with the GETDEVICEINFO operation.

5.3 pnfs_devaddr4

   struct pnfs_devaddr4 {
       uint16_t  type;
       string    r_netid<>;   /* network ID */
       string    r_addr<>;    /* Universal address */
   };

This value is used to set up a communication channel with the storage device. For now we borrow the structure of a clientaddr4, and assume we will be able to specify SAN devices as well as TCP/IP devices using this format. The type field is used to distinguish between known device types.

[TODO: we need an enum of known device address types. These include IP+port for file servers and object storage devices. There may be several types for different variants on SAN volume labels. Do we need a concrete definition of volume labels for SAN block devices? We have discussed a scheme where the volume label is defined as a set of tuples that allow matching on the initial contents of a SAN volume in order to determine equality. If we do this, is this type a discriminated union with a fixed number of branches? One type would be an IP/port combination for an NFS or iSCSI device. Another type would be this volume label specification.]

5.4 pnfs_devlist_item4

   struct pnfs_devlist_item4 {
       pnfs_deviceid4  id;
       pnfs_devaddr4   addr;
   };

An array of these values is returned by the GETDEVICELIST operation. They define the set of devices associated with a file system.

5.5 pnfs_layouttype4

   struct pnfs_layouttype4 {
       pnfs_layoutclass4  class;
       uint16_t           type;
   };

The protocol extension enumerates known layout types and their structure. Additional layout types may be added later. To allow for graceful extension of layout types, the type is broken into two fields.

[TODO: We should chart out the major layout classes and representative instances of them, then indicate how new layout classes can be introduced. Alternatively, we can put these definitions into the document that specifies the storage protocol.]

5.6 pnfs_layout4

   union pnfs_layout4 switch (pnfs_layouttype4 type) {
   default:
       opaque layout_data<>;
   };

This opaque type defines a layout. As noted, we need to flesh out this union with a number of "blessed" layouts for different storage protocols and aggregation types.
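To illustrate the "layout driver" notion, the following non-normative C sketch shows how a client might route the opaque layout_data<> returned by LAYOUTGET to a driver selected by layout class. The class values, driver table, and function names are hypothetical; the actual registry of classes is the open IANA issue noted above.

   /* Non-normative sketch: dispatching an opaque layout to a driver. */
   #include <stdint.h>
   #include <stddef.h>
   #include <stdio.h>

   typedef uint16_t pnfs_layoutclass4;

   /* Hypothetical class values; not assigned by this document. */
   enum { LAYOUT_CLASS_FILE = 1, LAYOUT_CLASS_OSD = 2, LAYOUT_CLASS_BLOCK = 3 };

   struct layout_driver {
       pnfs_layoutclass4 class;
       const char *name;
       /* Parse the opaque layout_data<> returned by LAYOUTGET. */
       int (*parse)(const unsigned char *data, size_t len, void **priv);
   };

   static int parse_file_layout(const unsigned char *d, size_t n, void **p)
   { (void)d; (void)n; *p = NULL; return 0; }            /* stub */

   static struct layout_driver drivers[] = {
       { LAYOUT_CLASS_FILE, "nfsv4-files", parse_file_layout },
   };

   /* A client would consult LAYOUT_CLASSES for the fsid and then ask
    * only for classes it has drivers for. */
   static struct layout_driver *find_driver(pnfs_layoutclass4 class)
   {
       size_t i;
       for (i = 0; i < sizeof drivers / sizeof drivers[0]; i++)
           if (drivers[i].class == class)
               return &drivers[i];
       return NULL;                                      /* no usable driver */
   }

   int main(void)
   {
       struct layout_driver *drv = find_driver(LAYOUT_CLASS_FILE);
       printf("driver: %s\n", drv ? drv->name : "none");
       return 0;
   }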
6. pNFS File Attributes

6.1 pnfs_layoutclass4<> LAYOUT_CLASSES

This attribute applies to a file system and indicates what layout classes are supported by the file system. We expect this attribute to be queried when a client encounters a new fsid. This attribute is used by the client to determine if it has applicable layout drivers.

6.2 pnfs_layouttype4 LAYOUT_TYPE

This attribute indicates the particular layout type used for a file. It is for informational purposes only. The client needs to use the LAYOUTGET operation to get enough information (e.g., specific device information) to perform I/O.

6.3 pnfs_layouttype4 LAYOUT_HINT

This attribute is set on newly created files to influence the file server's choice for the file's layout.

7. pNFS Error Definitions

   NFS4ERR_LAYOUTUNAVAILABLE   Layouts are not available for the file
                               or its containing file system.

   NFS4ERR_LAYOUTTRYLATER      Layouts are temporarily unavailable for
                               the file; the client should retry later.

8. pNFS Operations

8.1 LAYOUTGET - Get Layout Information

SYNOPSIS

   (cfh), layout_class, iomode, sharemode, offset, length ->
   layout_stateid, layout

ARGUMENT

   enum layoutget_iomode4 {
       LAYOUTGET_READ   = 1,
       LAYOUTGET_WRITE  = 2,
       LAYOUTGET_RW     = 3
   };

   enum layoutget_sharemode4 {
       LAYOUTGET_SHARED    = 1,
       LAYOUTGET_EXCLUSIVE = 2
   };

   struct LAYOUTGET4args {
       /* CURRENT_FH: file */
       pnfs_layoutclass4    layout_class;
       layoutget_iomode4    iomode;
       layoutget_sharemode4 sharemode;
       offset4              offset;
       length4              length;
   };

RESULT

   struct LAYOUTGET4resok {
       stateid4     layout_stateid;
       pnfs_layout4 layout;
   };

   union LAYOUTGET4res switch (nfsstat4 status) {
   case NFS4_OK:
       LAYOUTGET4resok resok4;
   default:
       void;
   };

DESCRIPTION

Requests a layout for reading or writing the file given by the filehandle at the byte range given by offset and length. The client requests either a shared or exclusive sharing mode for the layout to indicate whether it provides its own synchronization mechanism. A shared layout allows cooperating clients to perform direct I/O using a layout that potentially conflicts with other clients. The clients are asserting that they are aware of this issue and can coordinate via an external mechanism (either NFSv4 advisory locks or, e.g., an MPI-IO toolkit). An exclusive layout means that the client wants the server to prevent other clients from making conflicting changes to the part of the file covered by the layout. An exclusive read layout, for example, would not be granted while there was an outstanding write layout that overlapped the range. Multiple exclusive read layouts can be given out for the same file range. An exclusive write layout can only be given out if there are no other outstanding layouts for the specified range.

Issue - there is some debate about the default value for sharemode in client implementations. One view is that the safest scheme is to require applications to request shared layouts explicitly via, e.g., an ioctl() operation. Another view is that shared layouts during concurrent access provide the same risks and guarantees that NFS does today (i.e., there are only open-to-close sharing semantics) and that applications "know" they should use advisory locking to serialize access when they anticipate sharing. By specifying the sharemode in the protocol, we support both points of view.

The LAYOUTGET operation returns layout information for the specified byte range. To get a layout from a specific offset through the end-of-file (no matter how long the file actually is), use a length field with all bits set to 1 (one). If the length is zero, or if a length that is not all bits set to one is specified and, when added to the offset, exceeds the maximum 64-bit unsigned integer value, the error NFS4ERR_INVAL will result.
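A minimal C sketch of the offset/length rule just described follows; it is non-normative, the function name is illustrative, and only the NFS4ERR_INVAL value is taken from RFC 3530.

   /* Non-normative sketch of the LAYOUTGET range validation above. */
   #include <stdint.h>
   #include <stdio.h>

   #define NFS4_OK        0
   #define NFS4ERR_INVAL  22            /* value from RFC 3530 */
   #define LENGTH4_ALL    UINT64_MAX    /* all bits set: through end of file */

   static int check_layoutget_range(uint64_t offset, uint64_t length)
   {
       if (length == LENGTH4_ALL)
           return NFS4_OK;              /* from offset through end of file */
       if (length == 0)
           return NFS4ERR_INVAL;        /* zero length is invalid */
       if (length > UINT64_MAX - offset)
           return NFS4ERR_INVAL;        /* offset + length exceeds 2^64 - 1 */
       return NFS4_OK;
   }

   int main(void)
   {
       printf("%d\n", check_layoutget_range(0, LENGTH4_ALL));    /* 0: whole file */
       printf("%d\n", check_layoutget_range(100, 0));            /* 22 */
       printf("%d\n", check_layoutget_range(UINT64_MAX - 1, 2)); /* 22 */
       return 0;
   }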
The format of the returned layout is specific to the underlying file system and is specified outside of this document. If layouts are not supported for the requested file or its containing filesystem, the server should return NFS4ERR_LAYOUTUNAVAILABLE. If a layout for the file is unavailable due to transient conditions (e.g., file sharing prohibits layouts), the server should return NFS4ERR_LAYOUTTRYLATER.

On success, the current filehandle retains its value.

IMPLEMENTATION

Typically, LAYOUTGET will be called as part of a compound RPC after an OPEN operation and results in the client having location information for the file. The client specifies a layout class that limits what kind of layout the server will return. This prevents servers from issuing layouts that are unusable by the client.

ERRORS

   NFS4ERR_INVAL
   NFS4ERR_NOTSUPP
   NFS4ERR_LAYOUTUNAVAILABLE
   NFS4ERR_LAYOUTTRYLATER
   TBD

8.2 LAYOUTCOMMIT - Commit writes made using a layout

SYNOPSIS

   (cfh), layout_stateid, offset, length, neweof, newlayout ->
   layout_stateid

ARGUMENT

   union neweof4 switch (bool eofchanged) {
   case TRUE:
       length4 eof;
   case FALSE:
       void;
   };

   struct LAYOUTCOMMIT4args {
       /* CURRENT_FH: file */
       stateid4 layout_stateid;
       neweof4  neweof;
       offset4  offset;
       length4  length;
       opaque   newlayout<>;
   };

RESULT

   struct LAYOUTCOMMIT4resok {
       stateid4 layout_stateid;
   };

   union LAYOUTCOMMIT4res switch (nfsstat4 status) {
   case NFS4_OK:
       LAYOUTCOMMIT4resok resok4;
   default:
       void;
   };

DESCRIPTION

Commit changes in the layout represented by the current filehandle and stateid. The LAYOUTCOMMIT operation indicates that the client has completed writes using a layout obtained by a previous LAYOUTGET. The client may have written only a subset of the data range it previously requested. LAYOUTCOMMIT allows it to commit or discard provisionally allocated space and to update the server with a new end of file. The layout argument to LAYOUTCOMMIT describes what regions have been used and what regions can be deallocated. The resulting layout is still valid after LAYOUTCOMMIT and can be referenced by the returned stateid for future operations.

The layout information is more verbose for block devices than for objects and files because the latter hide the details of block allocation behind their storage protocols. At a minimum, the client needs to communicate changes to the end-of-file location back to the server, and its view of the file modify and access times. For blocks, it needs to specify precisely which blocks have been used.

The client may use a SETATTR operation in a compound right after LAYOUTCOMMIT in order to set the access and modify times of the file. Alternatively, the server could use the time of the LAYOUTCOMMIT operation as the file modify time.

On success, the current filehandle retains its value.

ERRORS

   TBD
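As an illustration of the LAYOUTCOMMIT exchange described above, the following non-normative C sketch shows the client-side bookkeeping it implies: remember the byte range actually written through the layout and report a new end-of-file only when the writes extended the file. The structure and helper names are invented for illustration and are not part of the protocol.

   /* Non-normative sketch of building LAYOUTCOMMIT arguments. */
   #include <stdint.h>
   #include <stdio.h>

   struct layoutcommit_args {
       int      eofchanged;   /* neweof4 discriminant */
       uint64_t eof;          /* meaningful only if eofchanged */
       uint64_t offset;       /* range actually written via the layout */
       uint64_t length;
   };

   static struct layoutcommit_args
   build_layoutcommit(uint64_t known_eof, uint64_t lo_written, uint64_t hi_written)
   {
       struct layoutcommit_args a;
       a.offset     = lo_written;
       a.length     = hi_written - lo_written;
       a.eofchanged = hi_written > known_eof;       /* file grew? */
       a.eof        = a.eofchanged ? hi_written : 0;
       return a;
   }

   int main(void)
   {
       /* File was 4 KiB; the client wrote bytes 0..65535 through the layout. */
       struct layoutcommit_args a = build_layoutcommit(4096, 0, 65536);
       printf("eofchanged=%d eof=%llu offset=%llu length=%llu\n",
              a.eofchanged, (unsigned long long)a.eof,
              (unsigned long long)a.offset, (unsigned long long)a.length);
       return 0;
   }

A SETATTR updating the access and modify times could follow in the same compound, as noted above.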
8.3 LAYOUTRETURN - Release Layout Information

SYNOPSIS

   (cfh), layout_stateid ->

ARGUMENT

   struct LAYOUTRETURN4args {
       /* CURRENT_FH: file */
       stateid4 layout_stateid;
   };

RESULT

   struct LAYOUTRETURN4res {
       nfsstat4 status;
   };

DESCRIPTION

Returns the layout represented by the current filehandle and layout_stateid. After this call, the client must not use the layout and the associated storage protocol to access the file data. Before it can do that again, it must get a new layout delegation with LAYOUTGET.

Layouts may be returned when recalled or voluntarily (i.e., before the server has recalled them). In either case the client must properly propagate any state changed under the context of the layout to storage or to the server before returning the layout.

On success, the current filehandle retains its value.

If a client fails to return a layout in a timely manner, the file server should use its management protocol with the storage devices to fence the client from accessing the data referenced by the layout.

[TODO: We need to work out how clients return error information if they encounter problems with storage. We could return a single OK bit, or we could return more extensive information from the layout driver that describes the error condition in more detail. It seems like we need an opaque "layout_error" type that is defined by the storage protocol along with its layout types.]

ERRORS

   TBD

8.4 GETDEVICEINFO - Get Device Information

SYNOPSIS

   (cfh), device_id -> device_addr

ARGUMENT

   struct GETDEVICEINFO4args {
       pnfs_deviceid4 device_id;
   };

RESULT

   struct GETDEVICEINFO4resok {
       pnfs_devaddr4 device_addr;
   };

   union GETDEVICEINFO4res switch (nfsstat4 status) {
   case NFS4_OK:
       GETDEVICEINFO4resok resok4;
   default:
       void;
   };

DESCRIPTION

Returns device type and device address information for a specified device. The returned device_addr includes a type that indicates how to interpret the addressing information for that device. [TODO: or, it is a discriminated union.] At this time we expect two main kinds of device addresses: either IP address and port numbers, or SCSI volume identifiers. The final protocol specification will detail the allowed values for device_type and the format of their associated location information.

Note that it is possible for the address information for a deviceID to change dynamically due to various system reconfiguration events. Clients may get errors on their storage protocol that cause them to query the metadata server with GETDEVICEINFO and refresh their information about a device.

8.5 GETDEVICELIST - Get List of Devices

SYNOPSIS

   (cfh) -> device_addr<>

ARGUMENT

   /* Current file handle */

RESULT

   struct GETDEVICELIST4resok {
       pnfs_devlist_item4 device_addr_list<>;
   };

   union GETDEVICELIST4res switch (nfsstat4 status) {
   case NFS4_OK:
       GETDEVICELIST4resok resok4;
   default:
       void;
   };

DESCRIPTION

In some applications, especially SAN environments, it is convenient to find out about all the devices associated with a file system. This lets a client determine if it has access to these devices, e.g., at mount time. This operation returns a list of items that establish the association between the short pnfs_deviceid4 and the addressing information for that device.

9. Callback Operations

9.1 CB_LAYOUTRECALL

SYNOPSIS

   stateid, fh ->

ARGUMENT

   struct CB_LAYOUTRECALLargs {
       stateid4 stateid;
       nfs_fh4  fh;
   };

RESULT

   struct CB_LAYOUTRECALLres {
       nfsstat4 status;
   };

DESCRIPTION

The CB_LAYOUTRECALL operation is used to begin the process of recalling a layout and returning it to the server. If the handle specified is not one for which the client holds a layout, an NFS4ERR_BADHANDLE error is returned. If the stateid specified is not one corresponding to a valid layout for the file specified by the filehandle, an NFS4ERR_BAD_STATEID is returned.

Issue: We have debated another kind of callback to push new EOF information to the client. It may not be necessary; the client could discover that via polling for attributes.

IMPLEMENTATION

The client should reply to the callback immediately. Replying does not complete the recall except when an error was returned. The recall is not complete until the layout is returned using a LAYOUTRETURN. The client should complete any in-flight I/O operations using the recalled layout before returning it via LAYOUTRETURN. If the client has buffered dirty data, it may choose to write it directly to storage before calling LAYOUTRETURN, or to write it later using normal NFSv4 WRITE operations.
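The following C sketch (non-normative, with all helpers stubbed out and invented names) summarizes the recall handling just described: acknowledge the callback, drain I/O that uses the recalled layout, deal with dirty data, and only then send LAYOUTRETURN.

   /* Non-normative sketch of client-side CB_LAYOUTRECALL handling. */
   #include <stdio.h>

   static int  reply_to_callback(void)       { return 0; } /* CB_LAYOUTRECALL reply */
   static void drain_inflight_io(void)       { }           /* finish I/O using layout */
   static int  have_dirty_data(void)         { return 1; }
   static void write_dirty_via_storage(void) { }           /* direct, before return */
   static void send_layoutreturn(void)       { }           /* LAYOUTRETURN to server */

   static void handle_cb_layoutrecall(void)
   {
       reply_to_callback();            /* replying does not complete the recall */
       drain_inflight_io();
       if (have_dirty_data())
           write_dirty_via_storage();  /* or keep it and use NFSv4 WRITE later */
       send_layoutreturn();            /* the recall completes only here */
   }

   int main(void) { handle_cb_layoutrecall(); printf("layout returned\n"); return 0; }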
ERRORS

   NFS4ERR_BADHANDLE
   NFS4ERR_BAD_STATEID
   TBD

10. Usage Scenarios

This section describes common open, close, read, and write interactions and how they work with layout delegations. [TODO: this section feels rough and I'm not sure it adds value in its present form.]

10.1 Basic Read Scenario

   Client does an OPEN to get a file handle.
   Client does a LAYOUTGET for a range of the file, gets back a layout.
   Client uses the storage protocol and the layout to access the file.
   Client returns the layout with LAYOUTRETURN.
   Client closes the stateID and open delegation with CLOSE.

This is rather boring, as the client is careful to clean up all server state after only a single use of the file.

10.2 Multiple Reads to a File

   Client does an OPEN to get a file handle.
   Client does a LAYOUTGET for a range of the file, gets back a layout.
   Client uses the storage protocol and the layout to access the file.
   Client closes the stateID with CLOSE.
   Client does an OPEN to get a file handle.
   Client finds the cached layout associated with the file handle.
   Client uses the storage protocol and the layout to access the file.
   Client closes the stateID with CLOSE.

A bit more interesting, as we've saved the LAYOUTGET operation, but we are still doing server round-trips.

10.3 Multiple Reads to a File with Delegations

   Client does an OPEN to get a file handle and an open delegation.
   Client does a LAYOUTGET for a range of the file, gets back a layout.
   Client uses the storage protocol and the layout to access the file.
   Application does a close(), but the client keeps state under the delegation.
   (time passes)
   Application does another open(), which the client handles under the delegation.
   Client finds the cached layout associated with the file handle.
   Client uses the storage protocol and the layout to access the file.
   (pattern continues until the open delegation and/or layout is recalled)

This illustrates the efficiency of combining open delegations and layouts to eliminate interactions with the file server altogether. Of course, we assume the client's operating system only allows the local open() to succeed based on the file permissions. The use of layouts does not change anything about the semantics of open delegations.

10.4 Read with existing writers

NOTE: This scenario was under some debate, but we have resolved that the server is able to give out overlapping/conflicting layout information to different clients. In these cases we assume that clients are using an external mechanism such as MPI-IO to synchronize and serialize access to shared data.
One can argue that even unsynchronized clients get the same open-to-close consistency semantics as NFS already provides, even when going direct to storage.

   Client does an OPEN to get an open stateID and open delegation.
   The file is open for writing elsewhere by different clients, so no open delegation is returned.
   Client does a LAYOUTGET and gets a layout from the server.
   Client either synchronizes with the writers, or not, and accesses data via the layout and storage protocol.

There are no guarantees about when data that is written by the writer is visible to the reader. Once the writer has closed the file and flushed updates to storage, they are visible to the client. [TODO: we really aren't explaining the sharemode field here.]

10.5 Read with later conflict

   ClientA does an OPEN to get an open stateID and open delegation.
   ClientA does a LAYOUTGET for a range of the file, gets back a map and layout stateid.
   ClientA uses the storage protocol to access the file data.
   ClientB opens the file for WRITE.
   File server issues CB_RECALL to ClientA.
   ClientA issues DELEGRETURN.
   ClientA continues to use the storage protocol to access file data.
   If it is accessing data from its cache, it will periodically check that its data is still up-to-date because it has no open delegation.

[This is an odd scenario that mixes in open delegations for no real value. Basically this is a "regular writer" being mixed with a pNFS reader. I guess this example shows that no particular semantics are provided during the simultaneous access. If the server so chose, it could also recall the layout with CB_LAYOUTRECALL to force the different clients to serialize at the file server.]

10.6 Basic Write Case

   Client does an OPEN to get a file handle.
   Client does a LAYOUTGET for a range of the file, gets back a layout and layout stateid.
   Client writes to the file using the storage protocol.
   Client uses LAYOUTCOMMIT to communicate the new EOF position.
   Client does a SETATTR to update timestamps.
   Client does a LAYOUTRETURN.
   Client does a CLOSE.

Again, the boring case where the client cleans up all of its server state by returning the layout.

10.7 Large Write Case

   Client does an OPEN to get a file handle.
   (loop)
      Client does a LAYOUTGET for a range of the file, gets back a layout and layout stateid.
      Client writes to the file using the storage protocol.
      Client fills up the range covered by the layout.
      Client updates the server with LAYOUTCOMMIT, communicating the new EOF position.
      Client does a SETATTR to update timestamps.
      Client releases the layout with LAYOUTRETURN.
   (end loop)
   Client does a CLOSE.

10.8 Create with special layout

   Client does an OPEN and a SETATTR that specifies a particular layout type using the LAYOUT_HINT attribute.
   Client gets back an open stateID and file handle.
   (etc.)

11. Layouts and Aggregation

This section describes several layout formats in a semi-formal way to provide context for the layout delegations. These definitions will be formalized in other protocols. However, the set of understood types is part of this protocol in order to provide for basic interoperability.

The layout descriptions include <deviceID, objectID> tuples that identify some storage object on some storage device. The addressing information associated with the deviceID is obtained with GETDEVICEINFO. The interpretation of the objectID depends on the storage protocol. The objectID could be a filehandle for an NFSv4 data server. It could be an OSD object ID for an object server. The layout for a block device generally includes additional block map information to enumerate blocks or extents that are part of the layout.

11.1 Simple Map

The data is located on a single storage device. In this case the file server can act as the front end for several storage devices and distribute files among them. Each file is limited in its size and performance characteristics by a single storage device. The simple map consists of a single <deviceID, objectID> tuple.

11.2 Block Map

The data is located on a LUN in the SAN. The layout consists of an array of <deviceID, block_number, block_count, blocksize> tuples. Alternatively, the blocksize could be specified once to apply to all entries in the layout.

11.3 Striped Map (RAID 0)

The data is striped across storage devices. The parameters of the stripe include the number of storage devices (N) and the size of each stripe unit (U). A full stripe of data is N * U bytes. The stripe map consists of an ordered list of <deviceID, objectID> tuples and the parameter value for U. The first stripe unit (the first U bytes) is stored on the first <deviceID, objectID>, the second stripe unit on the second <deviceID, objectID>, and so forth until the first complete stripe. The data layout then wraps around so that byte (N*U) of the file is stored on the first <deviceID, objectID> in the list, but starting at offset U within that object. The striped layout allows a client to read or write to the component objects in parallel to achieve high bandwidth.

The striped map for a block device would be slightly different. The map is an ordered list of <deviceID, block_number, block_count> tuples, where the deviceID is rotated among a set of devices to achieve striping.
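A short, non-normative C sketch of the striped-map arithmetic just described follows; the function and field names are illustrative only. It maps a file offset to a component index in the <deviceID, objectID> list and a byte offset within that component.

   /* Non-normative sketch of the RAID 0 mapping described above. */
   #include <stdint.h>
   #include <stdio.h>

   struct stripe_loc {
       uint32_t component;      /* index into the <deviceID, objectID> list */
       uint64_t object_offset;  /* byte offset within that object */
   };

   static struct stripe_loc stripe_map(uint64_t file_offset, uint32_t N, uint32_t U)
   {
       uint64_t su = file_offset / U;            /* stripe unit number */
       struct stripe_loc loc;
       loc.component     = (uint32_t)(su % N);
       loc.object_offset = (su / N) * U + file_offset % U;
       return loc;
   }

   int main(void)
   {
       /* N = 4 components, U = 64 KiB: byte N*U of the file lands on
        * component 0 at offset U, as described in the text. */
       struct stripe_loc l = stripe_map(4ULL * 65536, 4, 65536);
       printf("component=%u offset=%llu\n",
              l.component, (unsigned long long)l.object_offset);
       return 0;
   }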
11.4 Replicated Map

The file data is replicated on N data servers. The map consists of N <deviceID, objectID> tuples. When data is written using this map, it should be written to the N objects in parallel. When data is read, any component object can be used.

This map type is controversial because it highlights the issues with error recovery. Those issues get interesting with any scheme that employs redundancy. The handling of errors (e.g., only a subset of the replicas gets updated) is outside the scope of this protocol extension. Instead, it is a function of the storage protocol and the metadata management protocol.

11.5 Concatenated Map

The map consists of an ordered set of N <deviceID, objectID> tuples. Each successive tuple describes the next segment of the file.

11.6 Nested Map

The nested map is used to compose more complex maps out of simpler ones. The map format is an ordered set of M sub-maps; each sub-map applies to a byte range within the file and has its own type, such as the ones introduced above. Any level of nesting is allowed in order to build up complex aggregation schemes.

12. Issues

12.1 Storage Protocol Negotiation

Clients may want to negotiate with the metadata server about their preferred storage protocol, and to find out what storage protocols the server offers. Clients can do this by querying the LAYOUT_CLASSES file system attribute. They respond by specifying a particular layout class in their LAYOUTGET operations.

12.2 Crash recovery

We use the existing client crash recovery and server state recovery mechanisms in NFSv4. In particular, layouts have associated layout stateids that "expire" along with the rest of the client state. The main new issue introduced by pNFS is that the client may have to do a lot of I/O in response to a layout recall.
The client may need to remember to send RENEW ops to the server during this period if it were to risk not doing anything within the lease time. Of course, the client should only reply with its LAYOUTRETURN after it knows its I/O has completed. 12.3 Storage Errors welch-pnfs-ops Expires - November 2005 [Page 25] Internet-Draft pNFS Operations Summary May 2005 As noted under LAYOUTRETURN, there is a need for the client to communicate about errors it has when accessing storage directly. 13. References 1 Gibson et al, "pNFS Problem Statement", ftp://www.ietf.org/ /internet-drafts/draft-gibson-pnfs-problem-statement-01.txt, July 2004. 2 "Object-Based Storage Device Commands (OSD)", INCITS 400-2004, http://www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf, July 2004. 14. Acknowledgments Many members of the pNFS informal working group have helped considerably. The authors would like to thank Gary Grider, Peter Corbett, Dave Noveck, and Peter Honeyman. This work is inspired by the NASD and OSD work done by Garth Gibson. Gary Grider of the national labs (LANL) has been a champion of high-performance parallel I/O. 15. Author's Addresses Brent Welch Panasas,Inc. 6520 Kaiser Drive Fremont, CA 94555 USA Phone: +1 (510) 608 7770 Email: welch@panasas.com Benny Halevy Panasas, Inc. 1501 Reedsdale St., #400 Pittsburgh, PA 15233 USA Phone: +1 (412) 323 3500 Email: bhalevy@panasas.com David L. Black EMC Corporation 176 South Street Hopkinton, MA 01748 Phone: +1 (508) 293-7953 Email: black_david@emc.com Andy Adamson CITI University of Michigan 519 W. William Ann Arbor, MI 48103-4943 USA Phone: +1 (734) 764-9465 Email: andros@umich.edu welch-pnfs-ops Expires - November 2005 [Page 26] Internet-Draft pNFS Operations Summary May 2005 David Noveck Network Appliance 375 Totten Pond Road Waltham, MA 02451 USA Phone: +1 (781) 768 5347 Email: dnoveck@netapp.com 16. Full Copyright Notice Copyright (C) The Internet Society (2005). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights. This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Intellectual Property The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79. Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr. 
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf- ipr@ietf.org. Acknowledgement Funding for the RFC Editor function is currently provided by the Internet Society. welch-pnfs-ops Expires - November 2005 [Page 27]