Network                                                       J. Zelenka
Internet-Draft                                                  B. Welch
Expires: December 12, 2005                                       Panasas
                                                           June 10, 2005

                      Object-based pNFS Operations
                      draft-zelenka-pnfs-obj-00.txt

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on December 12, 2005.

Copyright Notice

Copyright (C) The Internet Society (2005).

Abstract

This Internet-Draft provides a description of the object-based pNFS extension for NFSv4. This is a companion to the main pnfs operations draft, which is currently draft-welch-pnfs-ops-02.txt.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [1].

Zelenka & Welch          Expires December 12, 2005          [Page 1]
Internet-Draft                   pnfs ops                   June 2005

Table of Contents

1.  Introduction
2.  Proposed change to the pNFS spec
3.  LAYOUTGET
4.  LAYOUTCOMMIT
5.  Mapping virtual object offsets to component object offsets
6.  Usage and implementation notes
7.  Security Considerations
7.1 Object Layout Security
8.  Normative References
Authors' Addresses
Intellectual Property and Copyright Statements

1. Introduction

In pNFS, the file server returns typed layout structures that describe where file data is located. There are different layouts for different storage systems and methods of arranging data on storage devices. This document describes several layouts to be used with object-based storage devices (OSD) that are accessed according to the iSCSI/OSD storage protocol standard.

An "object" is a container for data and attributes, and files are stored in one or more objects. The OSD protocol specifies several operations on objects, including READ, WRITE, FLUSH, GETATTR, SETATTR, CREATE and DELETE. However, in this proposal the client only uses the READ, WRITE, and FLUSH OSD commands. The other commands are only used by the pNFS server. The OSD protocol has a capability-based security scheme that allows the pNFS server to control what operations and what objects are used by clients. This scheme is described in more detail in the "Security Considerations" section.

An object-based layout for pNFS includes object identifiers, capabilities that allow clients to READ or WRITE those objects, and various parameters that control how file data is striped across the component objects.

2. Proposed change to the pNFS spec

In the past, we've talked about modifying LAYOUTCOMMIT to include parameters that are somewhat specialized to object-based layouts, such as new attribute values for space_used and timestamps.
Rather than commingle such specializations with the general-purpose LAYOUTCOMMIT operation, we instead propose replacing the newlayout4 argument with layoutupdate4, which is specialized to each layout. Example usage of this appears below.

[Note: this can also be achieved by defining new "layouts" that are only used during LAYOUTCOMMIT and contain additional information. However, it seems cleaner to introduce a new type.]

3. LAYOUTGET

The layouts defined here provide striped, mirrored, or stripe-mirrored data organizations. See "Usage and implementation notes" below for more details on usage of these layouts.

    struct pnfs_layout_object_id {
        uint64 device_id;
        uint64 partition_id;
        uint64 object_id;
    };

    struct pnfs_layout_object {
        pnfs_layout_object_id object_id;
        uint64 offset;
        uint64 length;
        opaque capability<>;
    };

    struct pnfs_object_striped_layouttype4 {
        uint64 stripe_unit;
        pnfs_layout_object objects<>;
    };

    struct pnfs_object_mirrored_layouttype4 {
        boolean stripe_locks_required;
        pnfs_layout_object objects<>;
    };

    struct pnfs_object_striped_mirrored_layouttype4 {
        boolean stripe_locks_required;
        uint64 stripe_unit;
        uint16 mirror_cnt;
        pnfs_layout_object objects<>;
    };

    union pnfs_layouttypees4 switch (pnfs_layouttype4 class) {
        case LAYOUT_FILES_NFSV4:
            pnfs_nfsv4_layouttype4 file_layout;
        case LAYOUT_OBJECTS_STRIPED:
            pnfs_object_striped_layouttype4 object_striped_layout;
        case LAYOUT_OBJECTS_MIRRORED:
            pnfs_object_mirrored_layouttype4 object_mirrored_layout;
        case LAYOUT_OBJECTS_STRIPED_MIRRORED:
            pnfs_object_striped_mirrored_layouttype4
                object_striped_mirrored_layout;
        default:
            opaque layout_data<>;
    };

                                Figure 1

4. LAYOUTCOMMIT

    struct panfs_object_ioerr4 {
        pnfs_layout_object_id obj_id;
        uint64 offset;
        uint64 length;
    };

    struct pnfs_object_layoutupdate4 {
        uint32 update_attr_flags;
        uint64 space_used;
        nfstime4 time_access_set;
        nfstime4 time_modify;
        nfstime4 time_metadata;
        panfs_object_ioerr4 ioerr<>;
    };

    union pnfs_layoutupdate4 switch (pnfs_layouttype4 class) {
        case LAYOUT_OBJECTS_STRIPED:
        case LAYOUT_OBJECTS_MIRRORED:
        case LAYOUT_OBJECTS_STRIPED_MIRRORED:
            pnfs_object_layoutupdate4 object_layout_update;
        default:
            opaque layout_data<>;
    };

    struct LAYOUTCOMMIT4args {
        pnfs_layoutid4 layoutid;
        neweof4 neweof;
        offset4 offset;
        length4 length;
        pnfs_layoutupdate4 layoutupdate;
    };

In pnfs_object_layoutupdate4, the update_attr_flags bits are:

    PNFS_OBJECT_LAYOUTUPDATE4_ATTR_FLAGS_SPACE_USED
    PNFS_OBJECT_LAYOUTUPDATE4_ATTR_FLAGS_TIME_ACCESS_SET
    PNFS_OBJECT_LAYOUTUPDATE4_ATTR_FLAGS_TIME_MODIFY
    PNFS_OBJECT_LAYOUTUPDATE4_ATTR_FLAGS_METADATA
    PNFS_OBJECT_LAYOUTUPDATE4_ATTR_FLAGS_LENGTH

                                Figure 2

5. Mapping virtual object offsets to component object offsets

For LAYOUT_OBJECTS_MIRRORED, bytes at offset N length L correspond to bytes at offset N length L in each component mirror. Note that the layout definition supports any number of mirrors.

For LAYOUT_OBJECTS_STRIPED and LAYOUT_OBJECTS_STRIPED_MIRRORED, the data is densely packed in component objects. The layout specifies a stripe_unit. In LAYOUT_OBJECTS_STRIPED, the number of devices in the stripe is equal to the length of the objects<> array. For LAYOUT_OBJECTS_STRIPED_MIRRORED, the number of devices in the stripe is equal to the length of the objects<> array divided by mirror_cnt. The objects<> array in LAYOUT_OBJECTS_STRIPED_MIRRORED is indexed such that mirrors appear adjacent to one another. Thus, an object with 6 items in the object array and a mirror_cnt of 2 would have an object array <D0a, D0b, D1a, D1b, D2a, D2b>. D0a and D0b are mirrors; D1a and D1b are mirrors; D2a and D2b are mirrors.
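
As a rough illustration of the objects<> indexing just described, combined with the offset mapping that the following paragraphs define, the sketch below computes where a virtual object offset lands. It is not part of the proposal: the names map_offset and ComponentLocation are hypothetical, and treating a plain striped layout as mirror_cnt = 1 is an assumption made only to keep the example short.

```python
from typing import List, NamedTuple


class ComponentLocation(NamedTuple):
    """Hypothetical result type; not defined by the draft."""
    stripe_number: int          # N in the text
    device_number: int          # D in the text
    component_offset: int       # o in the text
    object_indexes: List[int]   # indexes into objects<> for D and its mirrors


def map_offset(offset: int, stripe_unit: int, num_objects: int,
               mirror_cnt: int = 1) -> ComponentLocation:
    """Map a virtual object offset to a component object location.

    mirror_cnt = 1 models LAYOUT_OBJECTS_STRIPED (an assumption of this
    sketch); larger values model LAYOUT_OBJECTS_STRIPED_MIRRORED, where
    mirrors sit next to each other in objects<>.
    """
    if offset < 0 or stripe_unit <= 0 or mirror_cnt <= 0:
        raise ValueError("offset must be >= 0; stripe_unit and mirror_cnt > 0")
    if num_objects <= 0 or num_objects % mirror_cnt != 0:
        raise ValueError("objects<> length must be a multiple of mirror_cnt")

    devices = num_objects // mirror_cnt      # devices in the stripe
    stripe_width = stripe_unit * devices     # S
    n = offset // stripe_width               # N = L / S
    d = (offset - n * stripe_width) // stripe_unit
    o = n * stripe_unit + offset % stripe_unit

    # Mirrors of device D occupy mirror_cnt consecutive slots in objects<>.
    first = d * mirror_cnt
    return ComponentLocation(n, d, o, list(range(first, first + mirror_cnt)))
```

With six objects and a mirror_cnt of 2, device D1 resolves to indexes 2 and 3 of the array, which are D1a and D1b in the example above.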
In LAYOUT_OBJECTS_STRIPED and LAYOUT_OBJECTS_STRIPED_MIRRORED, the stripe width (S) is the stripe_unit times the number of devices in the stripe. To map offset L in the virtual object, one determines the stripe number N by computing N = L / S, using integer division. The device number is D = (L-(N*S)) / stripe_unit. The offset (o) within D's component object is (N*stripe_unit)+(L%stripe_unit).

For example, consider an object striped over four devices, <D0, D1, D2, D3>. The stripe_unit is 4096 bytes. The stripe width S is thus 4 * 4096 = 16384.

    Offset 0:
      N = 0 / 16384 = 0
      D = (0-(0*16384)) / 4096 = 0 (D0)
      o = (0*4096)+(0%4096) = 0

    Offset 4096:
      N = 4096 / 16384 = 0
      D = (4096-(0*16384)) / 4096 = 1 (D1)
      o = (0*4096)+(4096%4096) = 0

    Offset 9000:
      N = 9000 / 16384 = 0
      D = (9000-(0*16384)) / 4096 = 2 (D2)
      o = (0*4096)+(9000%4096) = 808

    Offset 132000:
      N = 132000 / 16384 = 8
      D = (132000-(8*16384)) / 4096 = 0 (D0)
      o = (8*4096)+(132000%4096) = 33696

                                Figure 3

6. Usage and implementation notes

When a client wishes to access storage directly, it issues a LAYOUTGET for the object. If it receives NFS4ERR_LAYOUTUNAVAILABLE, it remembers that layouts are not available for this object, and subsequent accesses are performed through the server using normal NFSv4 operations. If it receives NFS4ERR_LAYOUTTRYLATER, it satisfies its immediate I/O needs with normal NFSv4 operations, but after a short time may retry the LAYOUTGET.

The access to data objects given to clients via LAYOUTGET is strictly for the purpose of reading and writing data. Clients should always retrieve attributes by requesting them from the metadata server, and attribute updates should only be done through the metadata server. When ANSI/T10 objects are used for the backing store, the only T10 commands that pNFS clients SHOULD issue to storage are READ, WRITE, and FLUSH.

We expect clients to flush any cached writes before releasing locks or issuing CLOSEs.
When a client holds a layout delegation, this flush should include a LAYOUTCOMMIT.

Mirrored object types require additional serialization of updates to ensure correct operation. Otherwise, if two clients simultaneously write to the same logical range of an object, the result could include different data in the same ranges of mirrored tuples. The locks that provide this serialization (stripe locks) are distinct from application-visible locks, and they are implemented using a side protocol between the clients and the metadata server. In the case where only one client is accessing storage directly, the server may choose to allow it to proceed without acquiring and releasing these locks. The server does this by setting stripe_locks_required=0 in the layout parameters. If stripe_locks_required is nonzero, the client must take and release these locks when updating the object. We do not require these locks for reading. If multiple clients that are reading and writing the same region of an object wish to serialize their accesses, they may (and should) do so using the application-visible byte-range locks.

When the server receives a layout request that it cannot grant due to a sharing issue (for example, LAYOUTGET on a mirrored object, where one client holds a layout with stripe_locks_required=0), the server will issue a CB_LAYOUTRECALL to the client (or clients) holding conflicting layouts, and it will respond to the new request with NFS4ERR_LAYOUTTRYLATER. When the clients return their layouts, they may simply issue a LAYOUTRETURN and cease using the layout. Alternatively, they may issue LAYOUTRETURN and LAYOUTGET in the same compound operation, thus requesting a new (and possibly downgraded) layout. In this sharing case, the server would reply with a new layout where stripe_locks_required=1. This way, when the rejected client retries its LAYOUTGET, that operation may succeed.
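
The client-side rules given so far in this section can be summarized in a small sketch. It only tracks the decisions the text describes: which path to use after each LAYOUTGET result, when a retry is allowed, and whether a stripe lock is needed. The class and method names are hypothetical, and the five-second retry delay is an assumption, since the text says only "a short time".

```python
from enum import Enum


class LayoutStatus(Enum):
    OK = "NFS4_OK"
    UNAVAILABLE = "NFS4ERR_LAYOUTUNAVAILABLE"
    TRY_LATER = "NFS4ERR_LAYOUTTRYLATER"


class ClientLayoutState:
    """Hypothetical per-object bookkeeping kept by a pNFS client."""

    def __init__(self, retry_delay: float = 5.0):
        self.retry_delay = retry_delay        # assumed value for "a short time"
        self.layouts_unavailable = False      # set by NFS4ERR_LAYOUTUNAVAILABLE
        self.retry_not_before = 0.0           # earliest time for a new LAYOUTGET
        self.stripe_locks_required = None     # known once a layout is held

    def should_try_layoutget(self, now: float) -> bool:
        """True if issuing a LAYOUTGET at time `now` is worthwhile."""
        return (not self.layouts_unavailable) and now >= self.retry_not_before

    def on_layoutget_reply(self, status: LayoutStatus, now: float,
                           stripe_locks_required: bool = False) -> str:
        """Record the reply; return 'direct' or 'via_server' for pending I/O."""
        if status is LayoutStatus.OK:
            self.stripe_locks_required = stripe_locks_required
            return "direct"
        if status is LayoutStatus.UNAVAILABLE:
            # Remember that layouts are not available for this object.
            self.layouts_unavailable = True
        else:
            # Serve immediate I/O through the server; retry after a delay.
            self.retry_not_before = now + self.retry_delay
        return "via_server"

    def needs_stripe_lock(self, is_write: bool) -> bool:
        """Stripe locks apply to updates only, never to reads."""
        return bool(is_write and self.stripe_locks_required)
```

A client would consult should_try_layoutget before each direct-access attempt and fall back to normal NFSv4 operations whenever on_layoutget_reply returns "via_server".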

[welch: I think the stripe_locks should be managed transparently so that the server doesn't give out overlapping RW layouts to different clients to avoid the problem. Or, we just assume the client is serializing with other clients using byte-locks. If multiple clients are writing into the same region at the same time, the results could simply be "undefined".]

When a client issues a LAYOUTGET and it receives a layout that contains the beginning of the byte range it requested, it may immediately issue another LAYOUTGET for the subsequent byte range. When a layout is granted but the offset of the layout is past the beginning of the range requested, the client should not immediately re-request a layout for the non-granted range; instead, it should assume that such a request would fail with NFS4ERR_LAYOUTTRYLATER. When a server receives a request for a layout range which it cannot entirely grant, it should either fail the entire LAYOUTGET with NFS4ERR_LAYOUTTRYLATER, or it should grant the sub-range with the lowest offset.

At any time, a client that holds a layout may issue a LAYOUTCOMMIT. In the LAYOUTCOMMIT args, neweof represents the client's view of the current end-of-file marker. Offset and length represent the range of the file modified by the client. This range is inclusive; it is okay for a client to modify two nonadjacent regions of the file and issue a single LAYOUTCOMMIT with an offset and length covering both. It is also okay for a client to issue two LAYOUTCOMMITs in this situation.

The layoutupdate field of LAYOUTCOMMIT args allows the client to propagate new attributes to the server. Space_used, time_access_set, time_modify, and time_metadata reflect the values for these attributes currently known to the client, just as neweof reflects the client's knowledge of the logical length of the file.
For each of these attributes, update_attr_flags contains a bit corresponding to the attribute. If that bit is set, the client is indicating that the server should update this attribute on storage to reflect the passed-in value. If a client is implementing strict atime semantics, it may use LAYOUTCOMMIT to update time_access_set on an object.

7. Security Considerations

The pNFS extension partitions the NFSv4 file system protocol into two parts, the control path and the data path (storage protocol). The control path contains all the new operations described by this extension; all existing NFSv4 security mechanisms and features apply to the control path. The combination of components in a pNFS system is required to preserve the security properties of NFSv4 with respect to an entity accessing data via a client, including security countermeasures to defend against threats that NFSv4 provides defenses for in environments where these threats are considered significant.

7.1 Object Layout Security

The object storage protocol relies on a cryptographically secure capability to control accesses at the object storage devices. Capabilities are generated by the metadata server, returned to the client, and used by the client as described below to authenticate their requests to the Object Storage Device (OSD). Capabilities therefore achieve the required access and open mode checking. They allow the file server to define and check a policy (e.g., open mode) and the OSD to check and enforce that policy without knowing the details (e.g., user IDs and ACLs).

Each capability is specific to a particular object, an operation on that object, and a byte range within the object, and it has an explicit expiration time. The capabilities are signed with a secret key that is shared by the OSDs and the metadata managers. Clients do not have device keys, so they are unable to forge capabilities.
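
To make the signing chain concrete before the algorithm sketch that follows, here is a minimal Python illustration. It is not the OSD wire protocol: the draft does not name a MAC algorithm, so HMAC-SHA256 is an assumption, as are the function names, the byte encoding of CapArgs and Req, and the length prefix used to join Req and the nonce.

```python
import hashlib
import hmac


def mac(key: bytes, message: bytes) -> bytes:
    """Keyed MAC; HMAC-SHA256 is an assumption of this sketch."""
    return hmac.new(key, message, hashlib.sha256).digest()


def _request_bytes(req: bytes, nonce: bytes) -> bytes:
    # Length-prefix req so (req, nonce) pairs cannot collide (assumed encoding).
    return len(req).to_bytes(4, "big") + req + nonce


def issue_capability(secret_key: bytes, cap_args: bytes) -> bytes:
    """Metadata server: derive CapKey from the key it shares with the OSD."""
    return mac(secret_key, cap_args)


def sign_request(cap_key: bytes, req: bytes, nonce: bytes) -> bytes:
    """Client: compute ReqMAC over the request and nonce using CapKey."""
    return mac(cap_key, _request_bytes(req, nonce))


def osd_verify(secret_key: bytes, cap_args: bytes, req: bytes,
               nonce: bytes, req_mac: bytes) -> bool:
    """OSD: recompute CapKey from CapArgs, then check ReqMAC."""
    expected = mac(mac(secret_key, cap_args), _request_bytes(req, nonce))
    return hmac.compare_digest(expected, req_mac)
```

Because the OSD recomputes CapKey from the CapArgs carried in the request, a client that alters CapArgs, for example to widen the byte range, produces a ReqMAC that no longer verifies.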

The details of the security and privacy model for Object Storage are out of scope of this document and will be specified in the Object Storage version of the storage protocol definition. However, the following sketch of the algorithm should help the reader understand the basic model.

LAYOUTGET returns

    {CapKey = MAC<SecretKey>(CapArgs), CapArgs}

The client uses CapKey to sign all the requests it issues for that object using the respective CapArgs. In other words, the CapArgs appears in the request to the storage device, and that request is signed with the CapKey as follows:

    ReqMAC = MAC<CapKey>(Req, Nonceln)

The following is sent to the OSD: {CapArgs, Req, Nonceln, ReqMAC}.

The OSD uses the SecretKey it shares with the metadata server to compare the ReqMAC the client sent with a locally computed

    MAC<MAC<SecretKey>(CapArgs)>(Req, Nonceln)

and if they match the OSD assumes that the capabilities came from an authentic metadata server and allows access to the object, as allowed by the CapArgs. Therefore, if the server LAYOUTGET reply, holding CapKey and CapArgs, is snooped by another client, it can be used to generate valid OSD requests (within the CapArgs access restriction).

To provide the privacy required for the capabilities returned by LAYOUTGET, the GSS-API can be used, e.g. by using a session key known to the file server and to the client to encrypt the whole layout or parts of it. Two general ways to provide privacy in the absence of GSS-API that are independent of NFSv4 are either an isolated network such as a VLAN or a secure channel provided by IPsec.

8. Normative References

[1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[2]  Welch, B., "pNFS Operations", July 2005, .

[3]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, C., Eisler, M., and D. Noveck, "Network File System (NFS) version 4 Protocol", RFC 3530, April 2003.

Authors' Addresses

    Jim Zelenka
    Panasas, Inc.
    1501 Reedsdale St. Suite 400
    Pittsburgh, PA 15233
    USA

    Phone: +1-412-323-6402
    Email: jimz@panasas.com
    URI:   http://www.panasas.com/

    Brent Welch
    Panasas, Inc.
    6520 Kaiser Drive
    Fremont, CA 95444
    USA

    Phone: +1-650-608-7770
    Email: welch@panasas.com
    URI:   http://www.panasas.com/

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Disclaimer of Validity

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The Internet Society (2005). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

Acknowledgment

Funding for the RFC Editor function is currently provided by the Internet Society.