Network                                                       J. Zelenka
Internet-Draft                                                  B. Welch
Expires: December 12, 2005                                       Panasas
                                                           June 10, 2005


                      Object-based pNFS Operations
                     draft-zelenka-pnfs-obj-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on December 12, 2005.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   This Internet-Draft provides a description of the object-based pNFS
   extension for NFSv4.  This is a companion to the main pNFS operations
   draft, which is currently draft-welch-pnfs-ops-02.txt.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this


Zelenka & Welch         Expires December 12, 2005               [Page 1]

Internet-Draft                  pnfs ops                       June 2005


   document are to be interpreted as described in RFC 2119 [1].

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Proposed change to the pNFS spec . . . . . . . . . . . . . . .  3
   3.  LAYOUTGET  . . . . . . . . . . . . . . . . . . . . . . . . . .  3
   4.  LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . . . .  5
   5.  Mapping virtual object offsets to component object offsets . .  5
   6.  Usage and implementation notes . . . . . . . . . . . . . . . .  7
   7.  Security Considerations  . . . . . . . . . . . . . . . . . . .  9
     7.1   Object Layout Security . . . . . . . . . . . . . . . . . .  9
   8.  Normative References . . . . . . . . . . . . . . . . . . . . . 10
       Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 11
       Intellectual Property and Copyright Statements . . . . . . . . 12


1.  Introduction

   In pNFS, the file server returns typed layout structures that
   describe where file data is located.  There are different layouts for
   different storage systems and methods of arranging data on storage
   devices.  This document describes several layouts to be used with
   object-based storage devices (OSD) that are accessed according to the
   iSCSI/OSD storage protocol standard.

   An "object" is a container for data and attributes, and files are
   stored in one or more objects.  The OSD protocol specifies several
   operations on objects, including READ, WRITE, FLUSH, GETATTR,
   SETATTR, CREATE and DELETE.  However, in this proposal the client
   only uses the READ, WRITE, and FLUSH OSD commands.  The other
   commands are only used by the pNFS server.

   The OSD protocol has a capability-based security scheme that allows
   the pNFS server to control what operations and what objects are used
   by clients.  This scheme is described in more detail in the "Security
   Considerations" section.

   An object-based layout for pNFS includes object identifiers,
   capabilities that allow clients to READ or WRITE those objects, and
   various parameters that control how file data is striped across the
   component objects.

2.  Proposed change to the pNFS spec

   In the past, we've talked about modifying LAYOUTCOMMIT to include
   parameters that are somewhat specialized to object-based layouts,
   such as new attribute values for space_used and timestamps.  Rather
   than commingle such specializations with the general-purpose
   LAYOUTCOMMIT operation, we instead propose replacing the newlayout4
   argument with layoutupdate4, which is specialized to each layout.
   Example usage of this appears below.

   [Note: this can also be achieved by defining new "layouts" that are
   only used during LAYOUTCOMMIT and contain additional information.
   However, it seems cleaner to introduce a new type.]

3.  LAYOUTGET

   The layouts defined here provide striped, mirrored, or
   stripe-mirrored data organizations.  See Section 6 for more details
   on the usage of these layouts.


   struct pnfs_layout_object_id {
     uint64  device_id;
     uint64  partition_id;
     uint64  object_id;
   };

   struct pnfs_layout_object {
     pnfs_layout_object_id  object_id;
     uint64                 offset;
     uint64                 length;
     opaque                 capability<>;
   };

   struct pnfs_object_striped_layouttype4 {
     uint64              stripe_unit;
     pnfs_layout_object  objects<>;
   };

   struct pnfs_object_mirrored_layouttype4 {
     boolean             stripe_locks_required;
     pnfs_layout_object  objects<>;
   };

   struct pnfs_object_striped_mirrored_layouttype4 {
     boolean             stripe_locks_required;
     uint64              stripe_unit;
     uint16              mirror_cnt;
     pnfs_layout_object  objects<>;
   };

   union pnfs_layouttypees4 switch (pnfs_layouttype4 class) {
     case LAYOUT_FILES_NFSV4:
       pnfs_nfsv4_layouttype4                    file_layout;
     case LAYOUT_OBJECTS_STRIPED:
       pnfs_object_striped_layouttype4           object_striped_layout;
     case LAYOUT_OBJECTS_MIRRORED:
       pnfs_object_mirrored_layouttype4          object_mirrored_layout;
     case LAYOUT_OBJECTS_STRIPED_MIRRORED:
       pnfs_object_striped_mirrored_layouttype4
                                         object_striped_mirrored_layout;
     default:
       opaque                                    layout_data<>;
   };

                                 Figure 1


4.  LAYOUTCOMMIT

   struct panfs_object_ioerr4 {
     pnfs_layout_object_id  obj_id;
     uint64                 offset;
     uint64                 length;
   };

   struct pnfs_object_layoutupdate4 {
     uint32               update_attr_flags;
     uint64               space_used;
     nfstime4             time_access_set;
     nfstime4             time_modify;
     nfstime4             time_metadata;
     panfs_object_ioerr4  ioerr<>;
   };

   union pnfs_layoutupdate4 switch (pnfs_layouttype4 class) {
     case LAYOUT_OBJECTS_STRIPED:
     case LAYOUT_OBJECTS_MIRRORED:
     case LAYOUT_OBJECTS_STRIPED_MIRRORED:
       pnfs_object_layoutupdate4          object_layout_update;
     default:
       opaque                             layout_data<>;
   };

   struct LAYOUTCOMMIT4args {
     pnfs_layoutid4       layoutid;
     neweof4              neweof;
     offset4              offset;
     length4              length;
     pnfs_layoutupdate4   layoutupdate;
   };

   In pnfs_object_layoutupdate4, the update_attr_flags bits are:
     PNFS_OBJECT_LAYOUTUPDATE4_ATTR_FLAGS_SPACE_USED
     PNFS_OBJECT_LAYOUTUPDATE4_ATTR_FLAGS_TIME_ACCESS_SET
     PNFS_OBJECT_LAYOUTUPDATE4_ATTR_FLAGS_TIME_MODIFY
     PNFS_OBJECT_LAYOUTUPDATE4_ATTR_FLAGS_METADATA
     PNFS_OBJECT_LAYOUTUPDATE4_ATTR_FLAGS_LENGTH

                                 Figure 2


5.  Mapping virtual object offsets to component object offsets

   For LAYOUT_OBJECTS_MIRRORED, bytes at offset N length L correspond to
   bytes at offset N length L in each component mirror.  Note that the
   layout definition supports any number of mirrors.

   For LAYOUT_OBJECTS_STRIPED and LAYOUT_OBJECTS_STRIPED_MIRRORED, the
   data is densely packed in component objects.  The layout specifies a
   stripe_unit.

   In LAYOUT_OBJECTS_STRIPED, the number of devices in the stripe is
   equal to the length of the objects<> array.

   For LAYOUT_OBJECTS_STRIPED_MIRRORED, the number of devices in the
   stripe is equal to the length of the objects<> array divided by
   mirror_cnt.  The objects<> array in LAYOUT_OBJECTS_STRIPED_MIRRORED
   is indexed such that mirrors appear adjacent to one another.  Thus,
   an object with 6 items in the object array and a mirror_cnt of 2
   would have an object array <D0a D0b D1a D1b D2a D2b>.  D0a and D0b
   are mirrors; D1a and D1b are mirrors; D2a and D2b are mirrors.
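
   The adjacency rule above can be sketched as follows.  This is only
   an illustration: the helper name mirror_groups and the use of plain
   strings in place of pnfs_layout_object entries are assumptions made
   for the example, not part of the layout definition.

```python
def mirror_groups(objects, mirror_cnt):
    """Split a flat objects<> array into one mirror group per stripe device."""
    if mirror_cnt <= 0 or len(objects) % mirror_cnt != 0:
        raise ValueError("objects<> length must be a positive multiple of mirror_cnt")
    # Mirrors are adjacent, so each consecutive run of mirror_cnt entries
    # holds the copies of one stripe device.
    return [objects[i:i + mirror_cnt] for i in range(0, len(objects), mirror_cnt)]


groups = mirror_groups(["D0a", "D0b", "D1a", "D1b", "D2a", "D2b"], 2)
print(groups)       # [['D0a', 'D0b'], ['D1a', 'D1b'], ['D2a', 'D2b']]
print(len(groups))  # 3 devices in the stripe
```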

   In LAYOUT_OBJECTS_STRIPED and LAYOUT_OBJECTS_STRIPED_MIRRORED, the
   stripe width (S) is the stripe_unit times the number of devices in
   the stripe.  To map offset L in the virtual object, one determines
   the stripe number N by computing N = L / S, where every division
   here is an integer division that discards the remainder.  The
   device number is D = (L-(N*S)) / stripe_unit.  The offset (o)
   within device D's component object is
   (N*stripe_unit)+(L%stripe_unit).

   For example, consider an object striped over four devices, <D0 D1 D2
   D3>.  The stripe_unit is 4096 bytes.  The stripe width S is thus 4 *
   4096 = 16384.


   Offset 0:
     N = 0 / 16384 = 0
     D = (0-(0*16384)) / 4096 = 0 (D0)
     o = (0*4096)+(0%4096) = 0

   Offset 4096:
     N = 4096 / 16384 = 0
     D = (4096-(0*16384)) / 4096 = 1 (D1)
     o = (0*4096)+(4096%4096) = 0

   Offset 9000:
     N = 9000 / 16384 = 0
     D = (9000-(0*16384)) / 4096 = 2 (D2)
     o = (0*4096)+(9000%4096) = 808

   Offset 132000:
     N = 132000 / 16384 = 8
     D = (132000-(8*16384)) / 4096 = 0 (D0)
     o = (8*4096) + (132000%4096) = 33696

                                 Figure 3
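
   The mapping can also be written as a short routine.  The sketch
   below assumes integer (truncating) division throughout; the
   function name map_offset and its parameter names are illustrative
   and do not appear in the protocol.

```python
def map_offset(offset, stripe_unit, num_devices):
    """Map a virtual-object offset to (stripe number, device, component offset)."""
    stripe_width = stripe_unit * num_devices                       # S
    stripe_no = offset // stripe_width                             # N = L / S
    device = (offset - stripe_no * stripe_width) // stripe_unit    # D
    comp_offset = stripe_no * stripe_unit + offset % stripe_unit   # o
    return stripe_no, device, comp_offset


# Reproduce the four worked examples: four devices, 4096-byte stripe_unit.
for off in (0, 4096, 9000, 132000):
    print(off, map_offset(off, 4096, 4))
```

   For the four offsets above this yields (0, 0, 0), (0, 1, 0),
   (0, 2, 808), and (8, 0, 33696), matching Figure 3.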


6.  Usage and implementation notes

   When a client wishes to access storage directly, it issues a
   LAYOUTGET for the object.  If it receives NFS4ERR_LAYOUTUNAVAILABLE,
   it remembers that layouts are not available for this object, and
   subsequent accesses are performed through the server using normal
   NFSv4 operations.  If it receives NFS4ERR_LAYOUTTRYLATER, it
   satisfies its immediate I/O needs with normal NFSv4 operations, but
   after a short time may retry the LAYOUTGET.
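
   One way a client might organize this fallback is sketched below.
   The status values are handled as names rather than numeric codes,
   and the Access enum, the state dictionary, and the function name
   are all hypothetical; retry timing is left out entirely.

```python
from enum import Enum


class Access(Enum):
    DIRECT = "direct"          # use the layout to reach storage directly
    VIA_SERVER = "via_server"  # use normal NFSv4 operations through the server


def handle_layoutget_status(status, state):
    """Decide how to satisfy the current I/O and record what was learned.

    `state` is a hypothetical per-object dict kept by the client.
    """
    if status == "NFS4_OK":
        return Access.DIRECT
    if status == "NFS4ERR_LAYOUTUNAVAILABLE":
        state["layouts_available"] = False  # do not ask again for this object
        return Access.VIA_SERVER
    if status == "NFS4ERR_LAYOUTTRYLATER":
        state["retry_layoutget"] = True     # may retry after a short time
        return Access.VIA_SERVER
    raise ValueError("unexpected status: " + status)


state = {}
print(handle_layoutget_status("NFS4ERR_LAYOUTTRYLATER", state), state)
```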

   The access to data objects given to clients via LAYOUTGET is strictly
   for the purpose of reading and writing data.  Clients should always
   retrieve attributes by requesting them from the metadata server, and
   attribute updates should only be done through the metadata server.
   When ANSI/T10 objects are used for the backing store, the only T10
   commands that pNFS clients SHOULD issue to storage are READ, WRITE,
   and FLUSH.

   We expect clients to flush any cached writes before releasing locks
   or issuing CLOSEs.  When a client holds a layout delegation, this
   flush should include a LAYOUTCOMMIT.

   Mirrored object types require additional serialization of updates to
   ensure correct operation.  Otherwise, if two clients simultaneously
   write to the same logical range of an object, the result could
   include different data in the same ranges of mirrored tuples.  These
   locks are distinct from application-visible locks, and they are
   implemented using a side protocol between the clients and the
   metadata server.  In the case where only one client is accessing
   storage directly, the server may choose to allow it to proceed
   without acquiring and releasing these locks.  The server does this by
   setting stripe_locks_required=0 in the layout parameters.  If
   stripe_locks_required is nonzero, the client must take and release
   these locks when updating the object.  We do not require these locks
   for reading.  If multiple clients that are reading and writing the
   same region of an object wish to serialize their accesses, they may
   (and should) do so using the application-visible byte-range locks.

   When the server receives a layout request that it cannot grant due to
   a sharing issue (for example, LAYOUTGET on a mirrored object, where
   one client holds a layout with stripe_locks_required=0), the server
   will issue a CB_LAYOUTRECALL to the client (or clients) holding
   conflicting layouts, and it will respond to the new request with
   NFS4ERR_LAYOUTTRYLATER.  When the clients return their layouts, they
   may simply issue a LAYOUTRETURN and cease using the layout.
   Alternatively, they may issue LAYOUTRETURN and LAYOUTGET in the same
   compound operation, thus requesting a new (and possibly downgraded)
   layout.  In this sharing case, the server would reply with a new
   layout where stripe_locks_required=1.  This way, when the rejected
   client retries its LAYOUTGET, that operation may succeed.

   [welch: I think the stripe_locks should be managed transparently so
   that the server doesn't give out overlapping RW layouts to different
   clients to avoid the problem.  Or, we just assume the client is
   serializing with other clients using byte-locks.  If multiple clients
   are writing into the same region at the same time, the results could
   simply be "undefined".]

   When a client issues a LAYOUTGET and receives a layout that contains
   the beginning of the byte range it requested, it may immediately
   issue another LAYOUTGET for the subsequent byte range.  When a
   layout is granted but the offset of the layout is past the beginning
   of the range requested, the client should not immediately re-request
   a layout for the non-granted range; instead, it should assume that
   such a request would fail with NFS4ERR_LAYOUTTRYLATER.  When a
   server receives a request for a layout range that it cannot entirely
   grant, it should either fail the entire LAYOUTGET with
   NFS4ERR_LAYOUTTRYLATER, or it should grant the sub-range with the
   lowest offset.
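
   The client-side rule can be sketched as a small decision function.
   The function name next_layoutget and the "request", "wait", and
   "done" labels are assumptions made for this example; ranges are
   given as (offset, length) pairs.

```python
def next_layoutget(req_offset, req_length, granted_offset, granted_length):
    """Decide what to do after a LAYOUTGET returned a (possibly partial) grant."""
    req_end = req_offset + req_length
    granted_end = granted_offset + granted_length
    if granted_offset > req_offset or granted_end <= req_offset:
        # The grant does not contain the start of the request: assume a
        # re-request would fail with NFS4ERR_LAYOUTTRYLATER and hold off.
        return ("wait", None)
    if granted_end >= req_end:
        return ("done", None)  # the whole requested range is covered
    # The grant covers the beginning: ask for the remainder right away.
    return ("request", (granted_end, req_end - granted_end))


print(next_layoutget(0, 100, 0, 40))   # ('request', (40, 60))
print(next_layoutget(0, 100, 20, 80))  # ('wait', None)
```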

   At any time, a client that holds a layout may issue a LAYOUTCOMMIT.
   In the LAYOUTCOMMIT args, neweof represents the client's view of the
   current end-of-file marker.  Offset and length represent the range
   of the file modified by the client.  This range is inclusive;
   it is okay for a client to modify two nonadjacent regions of the file
   and issue a single LAYOUTCOMMIT with an offset and length covering
   both.  It is also okay for a client to issue two LAYOUTCOMMITs in
   this situation.
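
   As an illustration of the single-LAYOUTCOMMIT case, the following
   sketch computes one offset and length that cover a set of modified
   regions.  The helper name covering_range is hypothetical, and
   regions are represented as (offset, length) pairs.

```python
def covering_range(regions):
    """Return one (offset, length) spanning every (offset, length) in regions."""
    if not regions:
        raise ValueError("no modified regions")
    start = min(off for off, _ in regions)
    end = max(off + length for off, length in regions)
    return start, end - start


# Two nonadjacent writes reported with a single covering range.
print(covering_range([(0, 4096), (65536, 8192)]))  # (0, 73728)
```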

   The layoutupdate field of LAYOUTCOMMIT args allows the client to
   propagate new attributes to the server.  Space_used, time_access_set,
   time_modify, and time_metadata reflect the values for these
   attributes currently known to the client, just as neweof reflects the
   client's knowledge of the logical length of the file.  For each of
   these attributes, update_attr_flags contains a bit corresponding to
   the attribute.  If that bit is set, the client is indicating that the
   server should update this attribute on storage to reflect the
   passed-in value.  If a client is implementing strict atime semantics,
   it may use LAYOUTCOMMIT to update time_access_set on an object.
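
   The flag handling might look like the sketch below.  The draft
   names the flags but does not assign numeric values, so the bit
   values here are placeholders; the pairing of the METADATA flag with
   time_metadata and the use of plain integers for the time fields are
   likewise assumptions made for the example.

```python
from enum import IntFlag


class AttrFlags(IntFlag):
    # Placeholder bit values; the draft does not define them.
    SPACE_USED = 0x01
    TIME_ACCESS_SET = 0x02
    TIME_MODIFY = 0x04
    METADATA = 0x08
    LENGTH = 0x10


# Assumed pairing of layoutupdate fields with their flag bits.
FIELD_BITS = {
    "space_used": AttrFlags.SPACE_USED,
    "time_access_set": AttrFlags.TIME_ACCESS_SET,
    "time_modify": AttrFlags.TIME_MODIFY,
    "time_metadata": AttrFlags.METADATA,
}


def attrs_to_apply(update_attr_flags, fields):
    """Return only the fields whose flag bit is set in update_attr_flags."""
    return {name: fields[name]
            for name, bit in FIELD_BITS.items()
            if update_attr_flags & bit}


update = {"space_used": 8192, "time_access_set": 0,
          "time_modify": 1118361600, "time_metadata": 0}
flags = AttrFlags.SPACE_USED | AttrFlags.TIME_MODIFY
print(attrs_to_apply(flags, update))
```

   Fields whose bit is clear are ignored by the server even though
   they are present in the structure.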

7.  Security Considerations

   The pNFS extension partitions the NFSv4 file system protocol into two
   parts, the control path and the data path (storage protocol).  The
   control path contains all the new operations described by this
   extension; all existing NFSv4 security mechanisms and features apply
   to the control path.  The combination of components in a pNFS system
   is required to preserve the security properties of NFSv4 with respect
   to an entity accessing data via a client, including security
   countermeasures to defend against threats that NFSv4 provides
   defenses for in environments where these threats are considered
   significant.

7.1  Object Layout Security

   The object storage protocol relies on a cryptographically secure
   capability to control accesses at the object storage devices.
   Capabilities are generated by the metadata server, returned to the
   client, and used by the client as described below to authenticate
   its requests to the Object Storage Device (OSD).  Capabilities
   therefore achieve the required access and open mode checking.  They
   allow the file server to define and check a policy (e.g., open mode)
   and the OSD to check and enforce that policy without knowing the
   details (e.g., user IDs and ACLs).

   Each capability is specific to a particular object, an operation on
   that object, and a byte range within the object, and has an explicit
   expiration time.  The capabilities are signed with a secret key that
   is shared by the object storage devices (OSDs) and the metadata
   managers.  Clients do not have device keys, so they are unable to
   forge capabilities.


   The details of the security and privacy model for Object Storage are
   out of scope of this document and will be specified in the Object
   Storage version of the storage protocol definition.  However, the
   following sketch of the algorithm should help the reader understand
   the basic model.

   LAYOUTGET returns

     {CapKey = MAC<SecretKey>(CapArgs), CapArgs}

   The client uses CapKey to sign all the requests it issues for that
   object using the respective CapArgs.  In other words, the CapArgs
   appears in the request to the storage device, and that request is
   signed with the CapKey as follows:

     ReqMAC = MAC<CapKey>(Req, Nonceln)

   The following is sent to the OSD: {CapArgs, Req, Nonceln, ReqMAC}.
   The OSD uses the SecretKey it shares with the metadata server to
   compare the ReqMAC the client sent with a locally computed

     MAC<MAC<SecretKey>(CapArgs)>(Req, Nonceln)

   and if they match, the OSD assumes that the capabilities came from
   an authentic metadata server and allows access to the object, as
   allowed by the CapArgs.  Consequently, if the server's LAYOUTGET
   reply, holding CapKey and CapArgs, is snooped by another client,
   that client can use it to generate valid OSD requests (within the
   CapArgs access restriction).
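
   The exchange sketched above can be illustrated in a few lines.
   This is not the OSD wire format: the choice of HMAC-SHA1 as the MAC,
   the length-prefixed field encoding, and all function names are
   assumptions made so that the example is runnable.

```python
import hashlib
import hmac


def mac(key, *parts):
    """MAC<key>(parts...); HMAC-SHA1 is assumed here for illustration only."""
    h = hmac.new(key, digestmod=hashlib.sha1)
    for p in parts:
        h.update(len(p).to_bytes(4, "big"))  # length prefix keeps fields distinct
        h.update(p)
    return h.digest()


def issue_capability(secret_key, cap_args):
    """Metadata server: CapKey = MAC<SecretKey>(CapArgs)."""
    return mac(secret_key, cap_args)


def sign_request(cap_key, req, nonce):
    """Client: ReqMAC = MAC<CapKey>(Req, nonce)."""
    return mac(cap_key, req, nonce)


def osd_verify(secret_key, cap_args, req, nonce, req_mac):
    """OSD: recompute MAC<MAC<SecretKey>(CapArgs)>(Req, nonce) and compare."""
    expected = mac(mac(secret_key, cap_args), req, nonce)
    return hmac.compare_digest(expected, req_mac)


secret = b"shared by OSD and metadata server"
cap_args = b"object=42;op=READ;range=0-4095"
cap_key = issue_capability(secret, cap_args)
req_mac = sign_request(cap_key, b"READ 0 4096", b"nonce-1")
print(osd_verify(secret, cap_args, b"READ 0 4096", b"nonce-1", req_mac))  # True
```

   The OSD never needs to see CapKey on the wire; it derives the same
   value from SecretKey and the CapArgs carried in the request.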

   To meet the privacy requirements for the capabilities returned by
   LAYOUTGET, the GSS-API can be used, e.g., by using a session key
   known to the file server and to the client to encrypt the whole
   layout or parts of it.  In the absence of GSS-API, two general ways
   to provide privacy that are independent of NFSv4 are an isolated
   network such as a VLAN or a secure channel provided by IPsec.

8.  Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

   [2]  Welch, B., "pNFS Operations", July 2005,
        <ftp://www.ietf.org/internet-drafts/
        draft-welch-pnfs-ops-02.txt>.

   [3]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
        C., Eisler, M., and D. Noveck, "Network File System (NFS)
        version 4 Protocol", RFC 3530, April 2003.


Authors' Addresses

   Jim Zelenka
   Panasas, Inc.
   1501 Reedsdale St. Suite 400
   Pittsburgh, PA  15233
   USA

   Phone: +1-412-323-6402
   Email: jimz@panasas.com
   URI:   http://www.panasas.com/


   Brent Welch
   Panasas, Inc.
   6520 Kaiser Drive
   Fremont, CA  95444
   USA

   Phone: +1-650-608-7770
   Email: welch@panasas.com
   URI:   http://www.panasas.com/


Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.


Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Copyright Statement

   Copyright (C) The Internet Society (2005).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.


Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.

