Network Working Group                                          B. Halevy
Internet-Draft                                                  B. Welch
Expires: July 27, 2006                                        J. Zelenka
                                                                 Panasas
                                                                T. Pisek
                                                                     Sun
                                                        January 23, 2006


                      Object-based pNFS Operations
                     draft-ietf-nfsv4-pnfs-obj-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on July 27, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   This Internet-Draft provides a description of the object-based pNFS
   extension for NFSv4.  This is a companion to the main pNFS operations
   draft, which is currently draft-ietf-nfsv4-pnfs-00.txt.

Requirements Language

Halevy, et al.           Expires July 27, 2006                  [Page 1]
Internet-Draft               pnfs objects                   January 2006

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

Table of Contents

   1.  Introduction
   2.  Object Storage Device Addressing and Discovery
   3.  Object-Based Layout
     3.1  pnfs_osd_objid4
     3.2  pnfs_osd_layout4
     3.3  pnfs_osd_data_map4
       3.3.1  Simple Striping
       3.3.2  Nested Striping
       3.3.3  Mirroring
       3.3.4  RAID
       3.3.5  Usage and implementation notes
     3.4  pnfs_layoutupdate4
   4.  Security Considerations
     4.1  Security Data Types
     4.2  Security Protocol
     4.3  Revoking capabilities
   5.  Normative References
        Authors' Addresses
        Intellectual Property and Copyright Statements

1.  Introduction

   In pNFS, the file server returns typed layout structures that
   describe where file data is located.  There are different layouts for
   different storage systems and methods of arranging data on storage
   devices.  This document describes the layouts used with object-based
   storage devices (OSD) that are accessed according to the iSCSI/OSD
   storage protocol standard (SNIA T10/1355-D [2]).

   An "object" is a container for data and attributes, and files are
   stored in one or more objects.  The OSD protocol specifies several
   operations on objects, including READ, WRITE, FLUSH, GETATTR,
   SETATTR, CREATE and DELETE.  However, in this proposal the client
   only uses the READ, WRITE, GETATTR and FLUSH commands.  The other
   commands are only used by the pNFS server.
   An object-based layout for pNFS includes object identifiers,
   capabilities that allow clients to READ or WRITE those objects, and
   various parameters that control how file data is striped across the
   component objects.  The OSD protocol has a capability-based security
   scheme that allows the pNFS server to control which operations and
   which objects clients may use.  This scheme is described in more
   detail in the "Security Considerations" section.

2.  Object Storage Device Addressing and Discovery

   Data operations to an OSD require the client to know the "address" of
   each OSD's root object.  The root object is synonymous with a SCSI
   logical unit.  The client specifies SCSI logical units to its SCSI
   stack using a representation local to the client.  Because these
   representations are local, GETDEVICEINFO must return information that
   the client can use to select the correct local representation.

   In the block world, a fixed offset (logical block number or track/
   sector) contains a disk label that identifies the disk uniquely.  In
   contrast, an OSD has a standard set of attributes on its root object.
   For device identification purposes, the OSD name (root information
   attribute number 9) is used as the label.  It appears in the
   pnfs_obj_deviceaddr4 type below as the "root_id" field.

   In some situations, SCSI target discovery may need to be driven by
   information contained in the GETDEVICEINFO response.  One example of
   this is iSCSI targets that are not known to the client until a layout
   has been requested.  Eventually iSCSI will adopt ANSI T10 SAM-3, at
   which time the World Wide Name (WWN, a.k.a. EUI-64/EUI-128) naming
   conventions can be specified.  In addition, Fibre Channel (FC) SCSI
   targets have a unique WWN.  Although these FC targets have already
   been discovered, some implementations may want to specify the WWN in
   addition to the label.
   This information appears as the "target" and "lun" fields in the
   pnfs_obj_deviceaddr4 type described below.  The following enum
   specifies the manner in which a SCSI target can be specified.  The
   target can be specified as an IP address (v4 or v6), as an iSCSI
   Qualified Name (IQN), or by the WWN of the target.

   enum pnfs_obj_addr_type4 {
       OBJ_TARGET_IP_ADDR = 1,
       OBJ_TARGET_IQN     = 2,
       OBJ_TARGET_WWN     = 3
   };

                                Figure 1

   A device can be specified by the tuple <addr_type, target, lun>, or
   in the default case, just by the OSD name.  The following enum is
   used to select the format:

   enum pnfs_obj_dev_specifier4 {
       OBJ_DEV_SPEC_TARGET = 1
   };

                                Figure 2

   To summarize, device addressing is fundamentally done by specifying
   the OSD name (i.e., root_id).  To help the client's resource
   discovery process, physical address hints can also be provided.  The
   specification for an object device address is as follows:

   union pnfs_obj_deviceaddr4 switch (pnfs_obj_dev_specifier4 dev) {
   case OBJ_DEV_SPEC_TARGET:
       pnfs_obj_addr_type4 addr_type;
       string              target<>;
       uint64              lun;
       opaque              root_id<>;
   default:
       opaque              root_id<>;
   };

                                Figure 3

3.  Object-Based Layout

   This draft defines the structure associated with the pnfs_layouttype4
   value LAYOUT_OSD_OBJECTS.  The pNFS draft specifies the structure as
   an XDR type "opaque".  The opaque layout is uninterpreted by the
   generic pNFS client layers, but it must be interpreted by the
   object-storage layout driver.  This document defines the structure of
   this opaque value.  This is the pnfs_layoutdata4 type from the
   general pNFS specification:

   enum pnfs_layouttype4 {
       LAYOUT_NFSV4_FILES  = 1,
       LAYOUT_OSD_OBJECTS  = 2,
       LAYOUT_BLOCK_VOLUME = 3
   };

   struct pnfs_layoutdata4 {
       pnfs_layouttype4 layout_type;
       opaque           layout_data<>;
   };

                                Figure 4

3.1  pnfs_osd_objid4

   An object is identified by a number, somewhat like an inode number.
   The object storage model has a two-level scheme, in which the objects
   within an object storage device are grouped into partitions.

   struct pnfs_osd_objid4 {
       pnfs_deviceid4 device_id;
       uint64         partition_id;
       uint64         object_id;
   };

                                Figure 5

   The pnfs_osd_objid4 identifies an object within a partition on a
   specified object storage device.  The device_id selects the object
   storage device from the set of available storage devices.  The device
   is identified with the pnfs_deviceid4 type, which is an index into
   addressing information about that device returned by the
   GETDEVICEINFO pNFS operation.  Within an OSD, a partition is
   identified with a 64-bit number.  Within a partition, an object is
   identified with a 64-bit number.  Creation and management of
   partitions is outside the scope of this standard, and is a facility
   provided by the object storage file system.

3.2  pnfs_osd_layout4

   The pnfs_osd_layout4 specifies a layout over a set of component
   objects.  The components field is an array of object identifiers and
   security credentials that grant access to each object.  The
   organization of the data is defined by the pnfs_osd_data_map4 type,
   which specifies how the file's data is mapped onto the component
   objects (i.e., the striping pattern).  The data placement algorithm
   that maps file data onto component objects assumes that each
   component object occurs exactly once in the array of components.
   Therefore, component objects MUST appear in the components array only
   once.

   At this time the OSD standard is at version 1.0, and we anticipate a
   version 2.0 of the standard.  The second-generation OSD protocol has
   additional proposed features to support more robust error recovery,
   snapshots, and byte-range capabilities.  Therefore, the OSD version
   is explicitly called out in the information returned in the layout.
   (This information can also be deduced by looking inside the
   capability type at the format field, which is the first byte.  The
   format value is 0x1 for an OSD v1 capability.  However, it seems most
   robust to call out the version explicitly.)

   In addition, the osd_version field is used to indicate that an object
   may be missing (i.e., unavailable).  Some layout schemes encode
   redundant information and can compensate for missing components, but
   the data placement algorithm needs to know which parts are missing.

   enum pnfs_osd_version {
       PNFS_OSD_MISSING   = 0,
       PNFS_OSD_VERSION_1 = 1,
       PNFS_OSD_VERSION_2 = 2
   };

   struct pnfs_osd_object_cred4 {
       pnfs_osd_objid4  object_id;
       pnfs_osd_version osd_version;
       opaque           credential<>;
   };

   struct pnfs_osd_layout4 {
       pnfs_osd_object_cred4 components<>;
       pnfs_osd_data_map4    map;
   };

                                Figure 6

   Note that the layout depends on the file size, which the client
   learns from the generic return parameters of LAYOUTGET, by doing
   GETATTR commands to the metadata server, and by receiving
   CB_SIZE_CHANGED callbacks from the metadata server.  The client uses
   the file size to decide whether it should fill holes with zeros or
   return a short read.  Striping patterns can cause component objects
   to be shorter than other components because a hole happens to
   correspond to the last part of the component object.

3.3  pnfs_osd_data_map4

   The pnfs_osd_data_map4 parameterizes the algorithm that maps a file's
   contents onto the component objects.  Instead of limiting the system
   to a simple striping scheme, where loss of a single component object
   results in data loss, the map parameters support mirroring and more
   complicated schemes that protect against the loss of a component
   object.  The type is shown first, and then each parameter is
   explained.
   enum pnfs_osd_raid_algorithm4 {
       PNFS_OSD_RAID_0  = 1,
       PNFS_OSD_RAID_4  = 2,
       PNFS_OSD_RAID_5  = 3,
       PNFS_OSD_RAID_PQ = 4     /* Reed-Solomon P+Q */
   };

   struct pnfs_osd_data_map4 {
       length4                  stripe_unit;
       uint16                   group_width;
       uint16                   group_depth;
       uint16                   mirror_cnt;
       pnfs_osd_raid_algorithm4 raid_algorithm;
   };

                                Figure 7

3.3.1  Simple Striping

   The stripe_unit is the number of bytes placed on one component before
   advancing to the next one in the list of components.  The number of
   bytes in a full stripe is stripe_unit times the number of components.
   In some RAID schemes, a stripe includes redundant information (i.e.,
   parity) that lets the system recover from loss or damage to a
   component object.

   The object layout always uses a "dense" layout, as described in the
   pNFS document.  This means that the second stripe unit of the file
   starts at offset 0 of the second component, rather than at offset
   stripe_unit bytes.  After a full stripe has been written, the next
   stripe unit is appended to the first component object in the list
   without any holes in the component objects.  The mapping from the
   logical offset within a file (L) to the component object C and
   object-specific offset O is defined by the following equations:

   L = logical offset into the file
   W = total number of components
   S = W * stripe_unit
   N = L / S
   C = (L - (N * S)) / stripe_unit
   O = (N * stripe_unit) + (L % stripe_unit)

                                Figure 8

   In these equations, S is the number of bytes in a full stripe, and N
   is the stripe number.  C is an index into the array of components, so
   it selects a particular object storage device.  Both N and C count
   from zero.  O is the offset within the object that corresponds to the
   file offset.  Note that this computation does not accommodate the
   same object appearing in the components array multiple times.

   For example, consider an object striped over four devices, <D0, D1,
   D2, D3>.  The stripe_unit is 4096 bytes.  The stripe width S is thus
   4 * 4096 = 16384.
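   The equations of Figure 8 can be sketched in Python (the function
   name is illustrative, not part of the protocol; all divisions are
   integer divisions):

   ```python
   def map_simple(L, stripe_unit, num_components):
       """Dense simple striping (Figure 8): map a logical file offset
       L to (C, O), the component index and the offset within that
       component object."""
       S = num_components * stripe_unit       # bytes in a full stripe
       N = L // S                             # stripe number
       C = (L - N * S) // stripe_unit         # component index
       O = N * stripe_unit + L % stripe_unit  # offset in the component
       return C, O
   ```

   For the four-device example below (stripe_unit of 4096 bytes),
   map_simple(9000, 4096, 4) yields C = 2 (D2) and O = 808, matching
   the worked examples in Figure 9.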
   Offset 0:
       N = 0 / 16384 = 0
       C = (0 - (0 * 16384)) / 4096 = 0 (D0)
       O = (0 * 4096) + (0 % 4096) = 0

   Offset 4096:
       N = 4096 / 16384 = 0
       C = (4096 - (0 * 16384)) / 4096 = 1 (D1)
       O = (0 * 4096) + (4096 % 4096) = 0

   Offset 9000:
       N = 9000 / 16384 = 0
       C = (9000 - (0 * 16384)) / 4096 = 2 (D2)
       O = (0 * 4096) + (9000 % 4096) = 808

   Offset 132000:
       N = 132000 / 16384 = 8
       C = (132000 - (8 * 16384)) / 4096 = 0 (D0)
       O = (8 * 4096) + (132000 % 4096) = 33696

                                Figure 9

3.3.2  Nested Striping

   The group_width and group_depth parameters allow a nested striping
   pattern.  If there is no nesting, then group_width and group_depth
   MUST be zero.  Otherwise, group_width defines the width of a data
   stripe, and group_depth defines how many stripes are written before
   advancing to the next group of components in the list of component
   objects for the file.  The size of the components array MUST be a
   multiple of group_width.  The math used to map from a file offset to
   a component object and an offset within that object is shown below.
   The computations map from the logical offset L to the component index
   C and the relative offset O within that component object.

   L = logical offset into the file
   W = total number of components
   S = stripe_unit * group_depth * W
   T = stripe_unit * group_depth * group_width
   U = stripe_unit * group_width
   M = L / S
   G = (L - (M * S)) / T
   H = (L - (M * S)) % T
   N = H / U
   C = (H - (N * U)) / stripe_unit + G * group_width
   O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit

                               Figure 10

   In these equations, S is the number of bytes striped across all
   component objects before the pattern repeats.  T is the number of
   bytes striped within a group of component objects before advancing to
   the next group.  U is the number of bytes in a stripe within a group.
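   The two-level mapping of Figure 10 can be sketched as follows (a
   hypothetical helper; integer division assumed throughout):

   ```python
   def map_nested(L, stripe_unit, group_width, group_depth,
                  num_components):
       """Nested striping (Figure 10): map logical offset L to (C, O),
       the component index and offset within that component."""
       W = num_components
       S = stripe_unit * group_depth * W            # full pattern size
       T = stripe_unit * group_depth * group_width  # bytes per group
       U = stripe_unit * group_width                # bytes per group stripe
       M = L // S                                   # major stripe number
       G = (L - M * S) // T                         # group index
       H = (L - M * S) % T                          # byte offset in group
       N = H // U                                   # minor stripe number
       C = (H - N * U) // stripe_unit + G * group_width
       O = (L % stripe_unit + N * stripe_unit
            + M * group_depth * stripe_unit)
       return C, O
   ```

   With the 100-device example given below (group_width 10, group_depth
   50, stripe_unit 1 MB), map_nested(27 * 2**20, 2**20, 10, 50, 100)
   gives C = 7 and O = 2 MB, matching Figure 11.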
   M is the "major" (i.e., across all components) stripe number, and N
   is the "minor" (i.e., within the group) stripe number.  G counts the
   groups from the beginning of the major stripe, and H is the byte
   offset within the group.

   For example, consider an object striped over 100 devices with a
   group_width of 10, a group_depth of 50, and a stripe_unit of 1 MB.
   In this scheme, 500 MB are written to the first 10 components, and
   5000 MB are written before the pattern wraps back around to the first
   component in the array.

   Offset 0:
       W = 100
       S = 1 MB * 50 * 100 = 5000 MB
       T = 1 MB * 50 * 10 = 500 MB
       U = 1 MB * 10 = 10 MB
       M = 0 / 5000 MB = 0
       G = (0 - (0 * 5000 MB)) / 500 MB = 0
       H = (0 - (0 * 5000 MB)) % 500 MB = 0
       N = 0 / 10 MB = 0
       C = (0 - (0 * 10 MB)) / 1 MB + 0 * 10 = 0
       O = 0 % 1 MB + 0 * 1 MB + 0 * 50 * 1 MB = 0

   Offset 27 MB:
       M = 27 MB / 5000 MB = 0
       G = (27 MB - (0 * 5000 MB)) / 500 MB = 0
       H = (27 MB - (0 * 5000 MB)) % 500 MB = 27 MB
       N = 27 MB / 10 MB = 2
       C = (27 MB - (2 * 10 MB)) / 1 MB + 0 * 10 = 7
       O = 27 MB % 1 MB + 2 * 1 MB + 0 * 50 * 1 MB = 2 MB

   Offset 7232 MB:
       M = 7232 MB / 5000 MB = 1
       G = (7232 MB - (1 * 5000 MB)) / 500 MB = 4
       H = (7232 MB - (1 * 5000 MB)) % 500 MB = 232 MB
       N = 232 MB / 10 MB = 23
       C = (232 MB - (23 * 10 MB)) / 1 MB + 4 * 10 = 42
       O = 7232 MB % 1 MB + 23 * 1 MB + 1 * 50 * 1 MB = 73 MB

                               Figure 11

3.3.3  Mirroring

   The mirror_cnt is used to replicate a file by replicating its
   component objects.  If there is no mirroring, then mirror_cnt MUST be
   0.  If mirror_cnt is greater than zero, then the size of the
   components array MUST be a multiple of (mirror_cnt+1).  Thus, for a
   classic mirror on two objects, mirror_cnt is one.  If group_width is
   also non-zero, then the size MUST be a multiple of
   group_width * (mirror_cnt+1).  Replicas are adjacent in the
   components array, and the value C produced by the above equations is
   not a direct index into the components array.
   Instead, the following equations determine the replica component
   index RCi, where i ranges from 0 to mirror_cnt.

   C   = component index for striping or two-level striping
   i   ranges from 0 to mirror_cnt, inclusive
   RCi = C * (mirror_cnt+1) + i

                               Figure 12

3.3.4  RAID

   The raid_algorithm determines the algorithm and placement of
   redundant data.  PNFS_OSD_RAID_0 means there is no parity data, so
   all bytes in the component objects are data bytes located by the
   above equations for C and O.  If a component object is unavailable,
   the pNFS client can choose to return zeros for the missing data, to
   retry the READ against the pNFS server, or to return an EIO error.

   PNFS_OSD_RAID_4 means that the last component object, or the last in
   each group if group_width is greater than zero, contains parity
   information computed over the rest of the stripe with an XOR
   operation.  If a component object is unavailable, the client can read
   the rest of the stripe units in the damaged stripe and recompute the
   missing stripe unit by XORing the other stripe units in the stripe.
   Or the client can replay the READ against the pNFS server, which will
   presumably perform the reconstructed read on the client's behalf.

   When parity is present in the file, there is an additional
   computation to map from the file offset L to the offset L' that
   accounts for embedded parity.  First compute L', and then use L' in
   the above equations for C and O.

   L  = file offset, not accounting for parity
   P  = number of parity devices in each stripe
   W  = group_width, if not zero, else size of the components array
   N  = L / ((W - P) * stripe_unit)
   L' = N * (W * stripe_unit) + (L % ((W - P) * stripe_unit))

                               Figure 13

   PNFS_OSD_RAID_5 means that the position of the parity data is rotated
   on each stripe.  In the first stripe, the last component holds the
   parity.  In the second stripe, the next-to-last component holds the
   parity, and so on.
   In this scheme, all stripe units are rotated so that I/O is evenly
   spread across objects as the file is read sequentially.  The rotated
   parity layout is illustrated here, with numbers indicating the stripe
   unit:

       0 1 2 P
       4 5 P 3
       8 P 6 7
       P 9 a b

                               Figure 14

   To compute the component object index, first compute the offset L'
   that accounts for parity, and use that to compute C as described
   above.  Then rotate C with the stripe number to get C'.  Because L'
   packs the data units of each stripe into the first W-1 positions,
   the rotated data positions never coincide with the rotated parity
   position.  The following equations illustrate this by also computing
   I, the index of the component that contains parity for a given
   stripe.

   L = file offset, not accounting for parity
   W = group_width, if not zero, else size of the components array
   N = L / ((W - 1) * stripe_unit)
   (Compute L' as described above)
   (Compute C based on L' as described above)
   C' = (C - (N % W)) % W
   I  = W - (N % W) - 1

                               Figure 15

   PNFS_OSD_RAID_PQ is a double-parity scheme that uses the Reed-Solomon
   P+Q encoding scheme.  In this layout, the last two component objects
   hold the P and Q data, respectively.  P is parity computed with XOR,
   and Q is a more complex equation that is not described here.  The
   equations given above for embedded parity can be used to map a file
   offset to the correct component object by setting the number of
   parity components P to 2, instead of 1 as for RAID-4 or RAID-5.
   Clients may simply choose to read data through the metadata server if
   two components are missing or damaged.

   Issue: this scheme also has a RAID-4-like layout, where the ECC
   blocks are stored on the same components in every stripe, and a
   rotated, RAID-5-like layout, where the stripe units are rotated.
   Should we make the following properties orthogonal: RAID_4 or RAID_5
   (i.e., non-rotated or rotated), and then have the number of parity
   components and the associated algorithm be the orthogonal parameter?
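   As a check on the rotated-parity scheme, here is a sketch assuming a
   single parity unit per stripe (P = 1); the helper name is
   illustrative.  No extra adjustment for the parity position is applied
   to the data index, because L' already packs each stripe's data units
   below the parity slot, so the rotated data positions cannot collide
   with the rotated parity position.

   ```python
   def raid5_map(L, stripe_unit, W):
       """Rotated parity (RAID-5), one parity unit per stripe (P = 1).
       Returns (C_rot, O, I): the rotated component index holding the
       data at file offset L, the offset within that component, and
       the index of the component holding that stripe's parity."""
       P = 1
       data_per_stripe = (W - P) * stripe_unit
       N = L // data_per_stripe                       # stripe number
       # L' embeds a parity-sized hole in each stripe (Figure 13)
       Lp = N * W * stripe_unit + L % data_per_stripe
       C = (Lp - N * W * stripe_unit) // stripe_unit  # pre-rotation index
       O = N * stripe_unit + Lp % stripe_unit         # offset in component
       C_rot = (C - (N % W)) % W                      # rotate with stripe
       I = W - (N % W) - 1                            # parity component
       return C_rot, O, I
   ```

   With stripe_unit = 1 and W = 4, mapping data units 0 through 11
   reproduces the rotated layout of Figure 14: the parity component
   moves 3, 2, 1, 0 across the four stripes, and the data units land on
   the remaining components exactly as illustrated.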
3.3.5  Usage and implementation notes

   RAID layouts with redundant data in their stripes require additional
   serialization of updates to ensure correct operation.  Otherwise, if
   two clients simultaneously write to the same logical range of an
   object, the result could include different data in the same ranges of
   mirrored tuples, or corrupt parity information.  It is the
   responsibility of the metadata server to enforce serialization
   requirements such as this.  For example, the metadata server may do
   so by not granting overlapping write layouts within mirrored objects.

3.4  pnfs_layoutupdate4

   The pnfs_layoutupdate4 type is an opaque value at the generic pNFS
   client level.  If the type is LAYOUT_OSD_OBJECTS, then the opaque
   value is described by the pnfs_osd_layoutupdate4 type.  This type
   conveys error information, timestamp information, and capacity-used
   information back to the metadata server.

   struct pnfs_layoutupdate4 {
       pnfs_layouttype4 type;
       opaque           layoutupdate_data<>;
   };

   enum pnfs_osd_errno {
       PNFS_OBJ_NOT_FOUND   = 1,
       PNFS_OBJ_NO_SPACE    = 2,
       PNFS_OBJ_EIO         = 3,
       PNFS_OBJ_BAD_CRED    = 4,
       PNFS_OBJ_NO_ACCESS   = 5,
       PNFS_OBJ_UNREACHABLE = 6
   };

   struct pnfs_osd_ioerr4 {
       pnfs_osd_objid4 component;
       length4         offset;
       length4         length;
       pnfs_osd_errno  errno;
   };

   union deltaspaceused4 switch (bool valid) {
   case TRUE:
       length4 delta;    /* Bytes consumed by write activity */
   case FALSE:
       void;
   };

   struct pnfs_osd_layoutupdate4 {
       deltaspaceused4 delta_space_used;
       newtime4        time_metadata;
       pnfs_osd_ioerr4 ioerr<>;
   };

                               Figure 16

   The deltaspaceused4 type is used to convey space utilization
   information at the time of LAYOUTCOMMIT.  For the file system to
   properly maintain capacity-used information, it needs to track how
   much capacity was consumed by WRITE operations performed by the
   client.
   In this protocol, the OSD returns the capacity consumed by a write,
   which can differ from the number of bytes written because of internal
   overhead such as block-based allocation and indirect blocks, and the
   client reflects this back to the pNFS server so it can accurately
   track quota.  The pNFS server can choose to trust this information
   coming from the clients and thereby avoid querying the OSDs at the
   time of LAYOUTCOMMIT.  If the client is unable to obtain this
   information from the OSD, it simply returns an invalid
   deltaspaceused4 (i.e., the valid discriminator is FALSE).

   The time_metadata value indicates the new modify time of the file.
   The server can choose to trust the client's view of this attribute,
   or it can query storage to determine the actual modify time.  A
   file's modify time is the latest modify time among all components of
   the file.  A client can avoid returning time information by returning
   an invalid time_metadata (i.e., the newtime4 union discriminator is
   FALSE).

   The pnfs_osd_ioerr4 array returns error indications for objects that
   generated errors during data transfers.  These are hints to the
   metadata server that there are problems with those objects.

   PNFS_OBJ_NOT_FOUND indicates that the object ID specifies an object
   that does not exist on the Object Storage Device.

   PNFS_OBJ_NO_SPACE indicates that the operation failed because the
   Object Storage Device ran out of free capacity during the operation.

   PNFS_OBJ_EIO indicates that the operation failed because the Object
   Storage Device experienced a failure trying to access the object.
   The most common source of these errors is media errors, but other
   internal errors might cause this as well.  In this case, the metadata
   server should examine the broken object more closely.

   PNFS_OBJ_BAD_CRED indicates that the security parameters are not
   valid.  The primary causes are that the capability has expired, or
   that the security policy tag (i.e., the capability version number)
   has been changed to revoke capabilities.
   The client will need to return the layout and get a new one with
   fresh capabilities.

   PNFS_OBJ_NO_ACCESS indicates that the capability does not allow the
   requested operation.  This should not occur in normal operation
   because the metadata server should give out correct capabilities, or
   none at all.

   PNFS_OBJ_UNREACHABLE indicates that the client was unable to contact
   the Object Storage Device due to a communication failure.

4.  Security Considerations

   The pNFS extension partitions the NFSv4 file system protocol into two
   parts, the control path and the data path (storage protocol).  The
   control path contains all the new operations described by this
   extension; all existing NFSv4 security mechanisms and features apply
   to the control path.  The combination of components in a pNFS system
   is required to preserve the security properties of NFSv4 with respect
   to an entity accessing data via a client, including security
   countermeasures to defend against the threats that NFSv4 provides
   defenses for in environments where those threats are considered
   significant.

4.1  Security Data Types

   There are three main data types associated with object security: a
   capability, a credential, and security parameters.  The capability is
   a set of fields that specifies an object and what operations can be
   performed on it.  A credential is a signed capability.  Only a
   security manager that knows the secret device keys can correctly sign
   a capability to form a valid credential.  In pNFS, the file server
   acts as the security manager and returns signed capabilities (i.e.,
   credentials) to the pNFS client.  The security parameters are values
   computed by the issuer of OSD commands (i.e., the client) that prove
   it holds valid credentials.  The client uses the credential as a
   signing key to sign the requests it makes to the OSD, and puts the
   resulting signatures into the security_parameters field of the OSD
   command.
   The object storage device uses the secret keys it shares with the
   security manager to validate the signature values in the security
   parameters.  The security types are opaque to the generic layers of
   the pNFS client.  The credential is defined as opaque within the
   pnfs_osd_object_cred4 type.  Instead of repeating the definitions
   here, the reader is referred to section 4.9.2.2 of the OSD standard.

4.2  Security Protocol

   The object storage protocol relies on a cryptographically secure
   capability to control accesses at the object storage devices.
   Capabilities are generated by the metadata server, returned to the
   client, and used by the client as described below to authenticate its
   requests to the Object Storage Device (OSD).  Capabilities therefore
   achieve the required access and open-mode checking.  They allow the
   file server to define and check a policy (e.g., open mode) and the
   OSD to enforce that policy without knowing the details (e.g., user
   IDs and ACLs).

   Each capability is specific to a particular object and an operation
   on that object, may cover a byte range within the object (in OSDv2),
   and has an explicit expiration time.  The capabilities are signed
   with a secret key that is shared by the object storage devices (OSD)
   and the metadata managers.  Clients do not have device keys, so they
   are unable to forge the signatures in the security parameters.  The
   combination of a capability and its signature is called a
   "credential" in the OSD specification.

   The details of the security and privacy model for Object Storage are
   defined in the T10 OSD standard.  The following sketch of the
   algorithm should help the reader understand the basic model.

   LAYOUTGET returns a CapKey, which is also called a credential.  It is
   a capability and a signature over that capability:

       {CapKey = MAC<SecretKey>(CapArgs), CapArgs}

   The client uses the CapKey to sign all the requests it issues for
   that object using the respective CapArgs.
   In other words, the CapArgs appears in the request to the storage
   device, and that request is signed with the CapKey as follows:

       ReqMAC = MAC<CapKey>(Req, Nonce)

   The following is sent to the OSD: {CapArgs, Req, Nonce, ReqMAC}.  The
   OSD uses the SecretKey it shares with the metadata server to compare
   the ReqMAC the client sent with a locally computed

       MAC<MAC<SecretKey>(CapArgs)>(Req, Nonce)

   and, if they match, the OSD assumes that the capabilities came from
   an authentic metadata server and allows access to the object, as
   allowed by the CapArgs.

   Therefore, if the server's LAYOUTGET reply, holding the CapKey and
   CapArgs, is snooped by another client, it can be used to generate
   valid OSD requests (within the CapArgs access restrictions).  To
   provide the required privacy for the capabilities returned by
   LAYOUTGET, the GSS-API can be used, e.g., by using a session key
   known to the file server and to the client to encrypt the whole
   layout or parts of it.  Two general ways to provide privacy in the
   absence of the GSS-API that are independent of NFSv4 are an isolated
   network such as a VLAN, or a secure channel provided by IPsec.

4.3  Revoking capabilities

   At any time, the metadata server may invalidate all outstanding
   capabilities on an object by changing its capability version
   attribute.  There is also a "fence bit" attribute that the metadata
   server can toggle to temporarily block access without permanently
   revoking capabilities.  The values of the fence bit and the
   capability version are part of a capability, and they must match the
   state of the corresponding attributes.  If they do not match, the OSD
   rejects accesses to the object.  When a client attempts to use a
   capability and discovers a capability version mismatch, it should
   issue a LAYOUTRETURN for the object and specify PNFS_OBJ_BAD_CRED in
   the pnfs_osd_ioerr4 parameter.

   The client may elect to issue a compound LAYOUTRETURN/LAYOUTGET (or
   LAYOUTCOMMIT/LAYOUTRETURN/LAYOUTGET) to attempt to fetch a refreshed
   set of capabilities.  The metadata server may elect to change the
   capability version on an object at any time, for any reason (with the
   understanding that there is likely an associated performance penalty,
   especially if there are outstanding layouts for this object).  The
   metadata server MUST revoke outstanding capabilities when any one of
   the following occurs:

   (1)  the permissions on the object change, or

   (2)  a conflicting mandatory byte-range lock is granted.

   A pNFS client will typically hold one layout for each byte range, for
   either READ or READ/WRITE.  It is the pNFS client's responsibility to
   enforce access control among multiple users accessing the same file.
   It is neither required nor expected that the pNFS client will obtain
   a separate layout for each user accessing a shared object.  The
   client SHOULD use ACCESS calls to check user permissions when
   performing I/O so that the server's access control policies are
   correctly enforced.  The result of the ACCESS operation may be cached
   indefinitely, as the server is expected to recall layouts when the
   file's access permissions or ACL change.

5.  Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

   [2]  Weber, R., "SCSI Object-Based Storage Device Commands", SNIA
        T10/1355-D, July 2004.

   [3]  Goodson, G., "NFSv4 pNFS Extensions", October 2005.

   [4]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
        C., Eisler, M., and D. Noveck, "Network File System (NFS)
        version 4 Protocol", RFC 3530, April 2003.

Authors' Addresses

   Benny Halevy
   Panasas, Inc.
   1501 Reedsdale St.
   Suite 400
   Pittsburgh, PA 15233
   USA

   Phone: +1-412-323-3500
   Email: bhalevy@panasas.com
   URI:   http://www.panasas.com/

   Brent Welch
   Panasas, Inc.
   6520 Kaiser Drive
   Fremont, CA 95444
   USA

   Phone: +1-650-608-7770
   Email: welch@panasas.com
   URI:   http://www.panasas.com/

   Jim Zelenka
   Panasas, Inc.
   1501 Reedsdale St.
   Suite 400
   Pittsburgh, PA 15233
   USA

   Phone: +1-412-323-3500
   Email: jimz@panasas.com
   URI:   http://www.panasas.com/

   Todd Pisek
   Sun Microsystems, Inc.
   1270 Eagan Industrial Rd. - Suite 160
   Eagan, MN 55121-1231
   USA

   Phone: +1-651-552-6415
   Email: trp@sun.com
   URI:   http://www.sun.com/

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.
Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2006).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.