NFSv4 M. Benjamin Internet-Draft CohortFS, LLC Intended status: Standards Track Bodley Expires: September 12, 2013 Emerson Honeyman March 11, 2013 pNFS Metadata Striping draft-mbenjamin-nfsv4-pnfs-metastripe-00 Abstract This Internet-Draft describes a means to add metadata striping to pNFS. The text of this draft is substantially based on prior drafts by Eisler, M., with some departures. The current draft attempts to define a somewhat lighter-weight protocol, in particular, seeks to permit striping for ?filehandle only? operations such as LOCK and OPEN + CLAIM_FH, without clients having to obtain metadata layouts on regular files. We gratefully acknowledge the primary contributions of Mike Eisler, Pranoop Ersani, and others. Internet Draft Comments Comments regarding this draft are solicited. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on September 12, 2013. Copyright Notice Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved. Benjamin, et al. Expires September 12, 2013 [Page 1] Internet-Draft pNFS Metastripe March 2013 This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction and Motivation . . . . . . . . . . . . . . . . . 4 2. Short List of Protocol Changes from Previous Drafts . . . . . 4 2.1. File-system wide Striping for Filehandle-Only Operations . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2. Uniform Filehandles . . . . . . . . . . . . . . . . . . . 4 2.3. Simplified Multipath Device Model . . . . . . . . . . . . 4 2.4. Cookie Model . . . . . . . . . . . . . . . . . . . . . . . 5 2.5. Recommended Attributes . . . . . . . . . . . . . . . . . . 5 2.5.1. meta_stripe_deviceid (deviceid4) . . . . . . . . . . . 5 2.5.2. meta_stripe_count (uint32_t) . . . . . . . . . . . . . 5 2.6. PREADDIR (Operation) . . . . . . . . . . . . . . . . . . . 5 3. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 5 4. Scope of Metadata Layouts . . . . . . . . . . . . . . . . . . 6 4.1. Inode Striping . . . . . . . . . . . . . . . . . . . . . . 7 4.2. Dentry Striping . . . . . . . . . . . . . . . . . . . . . 7 4.2.1. Name-Based Operations . . . . . . . . . . . . . . . . 8 4.2.2. Directory Enumeration . . . . . . . . . . . . . . . . 8 5. The Metadata Striping Layout . . . . . . . . . . . . . . . . . 9 5.1. Name . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 5.2. Value . . . . . . . . . . . . . . . . . . . . . . . . . . 9 5.3. Data Type Definitions . . . . . . . . . . . . . . . . . . 9 5.3.1. Layout Hint . . . . . . . . . . . . . . . . . . . . . 9 5.3.2. Devices . . . . . . . . . . . . . . . . . . . . . . . 9 5.3.3. Metadata Layout . . . . . . . . . . . . . . . . . . . 10 5.3.4. Layoutupdate4 lou_body . . . . . . . . . . . . . . . . 10 5.4. Metadata Layout Semantics . . . . . . . . . . . . . . . . 11 5.4.1. LAYOUTGET Argument Conventions . . . . . . . . . . . . 11 5.4.2. Inode Striping Layouts . . . . . . . . . . . . . . . . 11 5.4.2.1. Inode Stripe Hints . . . . . . . . . . . . . . . . 11 5.4.3. Dentry Striping Layouts . . . . . . . . . . . . . . . 12 5.4.3.1. L-MDS Selection for Name-based Operations . . . . 12 5.4.3.2. Directory Enumeration . . . . . . . . . . . . . . 13 6. Further Considerations . . . . . . . . . . . . . . . . . . . . 14 6.1. Storage Access Protocols . . . . . . . . . . . . . . . . . 14 6.2. Revocation of Layouts . . . . . . . . . . . . . . . . . . 14 Benjamin, et al. Expires September 12, 2013 [Page 2] Internet-Draft pNFS Metastripe March 2013 6.3. Stateids . . . . . . . . . . . . . . . . . . . . . . . . . 15 6.4. Lease Terms . . . . . . . . . . . . . . . . . . . . . . . 15 6.5. Layout Operations Sent to an L-MDS . . . . . . . . . . . . 15 6.6. Filehandles in Metadata Layouts . . . . . . . . . . . . . 16 6.7. Restriping . . . . . . . . . . . . . . . . . . . . . . . . 16 6.7.1. Layout Recall Cases . . . . . . . . . . . . . . . . . 16 6.7.2. Hint Invalidation . . . . . . . . . . . . . . . . . . 16 6.8. Recovery . . . . . . . . . . . . . . . . . . . . . . . . . 16 6.9. Failure and Restart of Client . . . . . . . . . . . . . . 17 6.10. Failure and Restart of Server . . . . . . . . . . . . . . 17 7. Negotiation . . . . . . . . . . . . . . . . . . . . . . . . . 17 8. Operational Recommendation for Deployment . . . . . . . . . . 17 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 17 10. Security Considerations . . . . . . . . . . . . . . . . . . . 17 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 17 12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18 12.1. Normative References . . . . . . . . . . . . . . . . . . . 18 12.2. Informative References . . . . . . . . . . . . . . . . . . 19 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 19 Benjamin, et al. Expires September 12, 2013 [Page 3] Internet-Draft pNFS Metastripe March 2013 1. Introduction and Motivation The NFSv4.1 specification describes pNFS [NFSv4.1]. In NFSv4.1, parallel access is limited to the data contents of regular files: the content of regular files is distributed (striped) across multiple storage devices. Metadata is not distributed or striped: the model presented in the NFSv4.1 specification is that of a single metadata server. This document describes a means to add metadata striping to pNFS, which includes the notion of multiple metadata servers. Two methods are described. The first, called inode striping, directs metadata operations associated with a file handle to a preferred metadata server. The second, called dentry striping, distributes directory operations across a collection of metadata servers. With metadata striping, multiple metadata servers may work together to provide a higher parallel performance. 2. Short List of Protocol Changes from Previous Drafts 2.1. File-system wide Striping for Filehandle-Only Operations Stripe hints redirect clients to a preferred metadata server for filehandle-only operations (below), but are backed by a single layout per-file system, rather than per-file, as in [METASTRIPE]. The new model is lighter weight, but since it remains layout-based retains the advantages of pNFS device indirection and garbage collection. 2.2. Uniform Filehandles [METASTRIPE] offers implementations the option to propagate layout filehandles for all metadata layout types. Since it would be impossible to reasonably support this under the new proposed model for filehandle-only operations, we propose instead that L-MDS filehandles always be equivalent to I-MDS filehandles. 2.3. Simplified Multipath Device Model [METASTRIPE] defines two different methods for encoding metadata server locations, only the ?simple? model uses the pNFS device mechanism. For this draft, we propose a single model based on pNFS devices, in which there is a one-to-one mapping between devices and L-MDS servers. This approach facilitates sharing device addresses across layouts which have servers in common and also minimizes the difficulty of reclaiming devices no longer in use by any metadata layout. Benjamin, et al. Expires September 12, 2013 [Page 4] Internet-Draft pNFS Metastripe March 2013 2.4. Cookie Model NFSv4 associates with each entry in a directory a unique value of type cookie4, a 64-bit integer. [METASTRIPE] involves cookies in stripe selection, and imposes specific requirements on cookie values. In the current proposal we treat cookies as opaque values except as specified in ordinary NFSv4.1. We concur with [METASTRIPE] that cookies MUST be unique within any logical directory regardless of the striping pattern. As in ordinary NFSv4.1, the behavior of READDIR (or PREADDIR, below) when cookie has a value previously returned to a client by the same server, but no longer associated with any directory entry, is not defined. 2.5. Recommended Attributes We propose two new recommended attributes. meta_stripe_deviceid (deviceid4) meta_stripe_count (uint32_t) 2.5.1. meta_stripe_deviceid (deviceid4) An attribute of type meta_stripe_deviceid represents an inode stripe hint. This attribute MUST NOT be offered to clients unless they hold a valid inode striping layout on the containing file system. 2.5.2. meta_stripe_count (uint32_t) The meta_stripe_count attribute represents, for directory objects, the directory?s current stripe count, which may help the client decide if it will request a dentry striping layout on the directory. This attribute MAY be offered only to clients which hold an inode striping layout on the containing file system. 2.6. PREADDIR (Operation) The NFSv4.1 READDIR operation has insufficient information to perform all possible enumerations required in our proposed dentry striping model. We propose a new PREADDIR operation which takes, in addition to all the current READDIR operations, also a controlling metadata layout stateid and stripe number. 3. Terminology Benjamin, et al. Expires September 12, 2013 [Page 5] Internet-Draft pNFS Metastripe March 2013 Initial Metadata Server (I-MDS). The I-MDS is the metadata server from which the client obtains a filehandle prior to acquiring any layout on the file Layout Metadata Server (L-MDS). The L-MDS is the metadata server from which the client obtains a filehandle from after redirection from a layout. Regular file: An object of file type NF4REG or NF4NAMEDATTR. Inode striping. Hint-based indirection to a preferred MDS for filehandle-based operations, backed by a filesystem-wide metadata layout. Dentry striping. Fine-grained, layout-based indirection for parallel operations on directories, using a striping pattern. 4. Scope of Metadata Layouts This proposal assumes a model where there are two or more servers capable of supporting NFSv4.1 operations. At least one server is an I-MDS, and the I-MDS should be thought of as a normal NFSv4.1 server, with the additional capability of granting metadata layouts on demand. The I-MDS might also be capable of granting non-metadata layouts, but this is orthogonal to the scope of metadata striping. The model also requires at least one additional server, an L-MDS, that is capable of supporting NFSv4.1 operations that are directed to the server by the I-MDS. It is permissible for an I-MDS to also be an L-MDS, and an L-MDS to also be an I-MDS. Indeed, a simple submodel is for every NFSv4.1 server in a set to be both an I-MDS and L-MDS. For convenience, we divide NFSv4.1 metadata operations into three classes: Filehandle-only. These are operations that take only filehandles as arguments, i.e. the current filehandle, or both the current filehandle and the saved filehandle, and no component names of files (e.g., LOCK, LAYOUTGET). Name-based. These are operations that take one or two filehandles (i.e. the current filehandle, or both the current file handle and the saved filehandle) and one or two component names of files (e.g., LINK, RENAME). Benjamin, et al. Expires September 12, 2013 [Page 6] Internet-Draft pNFS Metastripe March 2013 Directory-enumeration. These are operations that take one filehandle and return the contents of a directory. Currently, NFSv4 has only one such operation, READDIR. This draft adds a section, PREADDIR. Metadata striping applies to all of the foregoing NFSv4.x operations, and is of two types: inode striping uses hints (attribute-based indications) backed by a filesystem-wide layout to direct clients to a preferred MDS on which to perform filehandle-only operations dentry striping uses fine-grained metadata layouts on directories to support execution of name-based operations (directory enumeration, creates) on a set of MDS servers in parallel 4.1. Inode Striping To avoid an explosion of new client state, a coarse-grained hinting mechanism is used to direct filehandle-only operations to a preferred metadata server. As specified in 5.12.1 of [NFSv4.1], when a client encounters file system which supports LAYOUT4_METADATA, it can obtain a metadata layout of subtype LAYOUTMETA4_INODE, whose scope is the entire file system, using the LAYOUTGET operation on any filehandle object in the file system which it is permitted to access. Then using ordinary READDIR and GETATTR requests, the client can obtain for any object in the file system a meta_stripe_deviceid attribute that indicates the preferred device to send filehandle-only or name-based operations for that object. For example, suppose that after obtaining an ordinary filehandle via OPEN, a LAYOUTMETA4_INODE layout on the containing file system, and a meta_stripe_deviceid hint from a previous GETATTR, READDIR, or PREADDIR,, the client wants to get a byte range lock on the file. The client sends the LOCK request to the network address (pNFS device, L-MDS) indicated by the meta_stripe_deviceid attribute. 4.2. Dentry Striping For name-based and directory enumeration operations, a more fine- grained, layout-based redirection mechanism is used. When a client obtains a filehandle for an object that is of type directory and wishes to take advantage of metadata striping, the client first obtains a metadata layout of subtype LAYOUTMETA4_DENTRY Benjamin, et al. Expires September 12, 2013 [Page 7] Internet-Draft pNFS Metastripe March 2013 on the directory. The client is provided with a directory-specific list of network addresses (devices) to which to send requests specific to objects in that directory. 4.2.1. Name-Based Operations For name-based operations, the metadata layout indicates the preferred destinations in the network to send name-based operations for that directory (e.g., CREATE). The preferred destinations MUST apply to the current filehandle that the operation uses. In other words, for LINK and RENAME, which take both the saved filehandle and the current filehandle as parameters, the pNFS client would use the stripe hint of the target directory (indicated in the current filehandle) for guidance where to send the operation. Note that if an L-MDS accepts a LINK or RENAME operation, the L-MDS MUST perform the operation atomically. If it cannot, then the L-MDS MUST return the error NFS4ERR_XDEV, and the client MUST send the operation to the I-MDS. The choice of destination is a function of the name the client is requesting. For example, after the client obtains the filehandle of a directory via LOOKUP and the metadata layout via LAYOUTGET, the client wants to open a regular file within the directory. As with the LAYOUT4_NFSV4_1_FILES layout type, the client has a list network addresses to which to send requests. With the LAYOUT4_NFSV4_1_FILES layout, the choice of the index in the list of network addresses was computed from the offset of the read or write request. With the metadata layout, the choice of the index is derived from the name (or some other method, such as the name and one or more attributes of the directory, such as the filehandle, fileid, as below.) passed to OPEN. 4.2.2. Directory Enumeration For directory-enumeration operations, the metadata layout indicates the preferred destination in the network to send directory reading operations for that directory. For example, after the client obtains the filehandle of a directory via LOOKUP and the metadata layout via LAYOUTGET, the client wants to read the directory. As with the LAYOUT4_NFSV4_1_FILES layout type, the client has a list network addresses to which to send requests. With the LAYOUT4_NFSV4_1_FILES layout, the choice of the index in list of network addresses was computed from the offset of the read or write request. For dentry striping layouts, the index merely counts from 0 to the dentry stripe count, less 1. Benjamin, et al. Expires September 12, 2013 [Page 8] Internet-Draft pNFS Metastripe March 2013 5. The Metadata Striping Layout 5.1. Name The name of the metadata striping layout type is LAYOUT4_METADATA. 5.2. Value The value of the metadata striping layout type is TBD1. 5.3. Data Type Definitions 5.3.1. Layout Hint /// % /// %/* Encoded in the loh_body field of type layouthint4: */ /// % /// struct md_dirsize_layouthint4 { /// uint64_t *mdlh_min_est; /// uint64_t *mdlh_avg_est; /// uint64_t *mdlh_max_est; /// uint32_t *mdlh_stripe_count; /// uint32_t *mdlh_stripe_modulus; /// }; Figure 1 The layout-type specific layouthint4 content for the LAYOUT4_METDATA layout type is composed of four fields, each optional. Using some combination of the mdlh_min_est, mdlh_avg_est, and mdlh_max_est fields, the client is enabled to give an indication of the dentry workload it expects for a new directory inode. The client also may suggest an explicit stripe count or modulus preference in mdlh_stripe_count or mdlh_stripe_modulus, which SHOULD be congruent if specified together. 5.3.2. Devices /// % /* /// % * Encoded in the da_addr_body field of data type /// % * device_addr4: /// % */ /// struct md_layout_addr4 { /// multipath_list4 mdla_multipath_list<>; /// }; Figure 2 Benjamin, et al. Expires September 12, 2013 [Page 9] Internet-Draft pNFS Metastripe March 2013 5.3.3. Metadata Layout /// enum md_layout_subtype4 { /// LAYOUTMETA4_INODE = 0, /// LAYOUTMETA4_DENTRY /// }; /// /// enum md_namebased_alg4 { /// MDN_ALG_CITYHASH64 = 0, /// /* XXX TBD2 */ /// }; /// /// struct md_layout_dentry { /// switch(uint32_t mdln_namebased_alg) { /// case MDN_ALG_CITYHASH64: /// uint32_t seed; /// }; /// /// deviceid4 mdln_devicelist<>; /// uint32_t mdln_stripe_pattern<>; /// }; /// struct md_layout4 { /// union md_layout_type /// switch (enum md_layout_subtype4 subtype) { /// case LAYOUTMETA4_INODE: /// void; /// case LAYOUTMETA4_DENTRY: /// md_layout_dentry mdl_layout; /// }; /// }; Figure 3 5.3.4. Layoutupdate4 lou_body /// %/* /// % * LAYOUT4_METADATA. /// % * Encoded in the lou_body field of type layoutupdate4: /// % * Nothing. lou_body is a zero length array of octets. /// % */ /// % Figure 4 The LAYOUT4_METADATA layout type has no content for lou_body member Benjamin, et al. Expires September 12, 2013 [Page 10] Internet-Draft pNFS Metastripe March 2013 of the layoutupdate4 data type. 5.4. Metadata Layout Semantics The reply to a successful LAYOUTGET request MUST contain exactly one element in logr_layout. The element contains the metadata layout. 5.4.1. LAYOUTGET Argument Conventions When a client requests a layout of type LAYOUT4_METADATA, it specifies the desired subtype, which must be one of LAYOUTMETA4_INODE or LAYOUTMETA4_DENTRY, as the value of the LAYOUTGET loga_iomode argument. 5.4.2. Inode Striping Layouts If the requested layout is of subtype LAYOUTMETA4_INODE, the value of the layout is void. The inode redirection information issued under auspices of the layout will be entirely in the form of inode striping attribute hints. As noted in Section 4, the scope of inode striping layouts is an entire file system. The client can acquire the (singleton) inode striping layout for a given file system using any corresponding file handle which it happens to hold, and whose object the client is permitted to access. For example, the client could use the file handle of the first directory it traverses on a given file system, provided the file server is an NFSv4.x file server that supports layouts of type LAYOUT4_METADATA. 5.4.2.1. Inode Stripe Hints Inode stripe hints are objects of type deviceid4, and are the value of a new recommended, get-only attribute meta_stripe_deviceid. A client may successfully obtain the meta_stripe_deviceid attribute on any file object if and only if it has successfully obtained an inode striping layout on the containing file system. Since the meta_stripe_deviceid hint is an ordinary NFSv4 attribute, the client may acquire it from a GETATTR, READDIR, or PREADDIR request. A server implementation SHOULD interpret a PREADDIR operation (which has a controlling metadata layout stateid) as a request for just those attributes that are appropriate for the layout stateid that has been presented. At all events, when a client holds an inode stripe hint for a file object, it uses the GETDEVICEINFO operation to map the hint value to a to a device address of data type md_layout_addr4 in the ordinary Benjamin, et al. Expires September 12, 2013 [Page 11] Internet-Draft pNFS Metastripe March 2013 pNFS manner. The server ensures that each such device remains accessible (unrecalled) for at least as long as any inode striping layout exists for which the device has been named in a hint. 5.4.3. Dentry Striping Layouts If the requested layout is of subtype LAYOUTMETA4_DENTRY, then the layout contains a triple enabling the client to perform both parallel directory enumeration operations and stripe-aware name-based operations, as outlined in Section 4. When the layout subtype is LAYOUTMETA4_DENTRY, the layout content provides an integer identifying a hashing algorithm, a list of deviceids, and a striping pattern. Then mdln_namebased_alg identifies an algorithm that maps a name, as a component4, to an integer. Each entry in the mdln_devicelist specifies a set of metadata servers that may be treated as equally valid for metadata requests to the same block in the partitioned namespace. Each entry in the stripe pattern is an index into the device list. To perform a name based operation, the client maps the name to a number with the name based algorithm, looks that number up in the stripe pattern (modulo the length of the stripe pattern), yielding a device id that may be interpreted with GETDEVICEINFO, in the ordinary pNFS manner. After resolving the device id as a device address of data type md_layout_addr4, the client sends the request to any of the devices specified in the corresponding entry in the device list. 5.4.3.1. L-MDS Selection for Name-based Operations Clients with layouts of type LAYOUTMETA4_DENTRY may use the algorithm supplied in field mdln_namebased_alg of the layout content to compute a preferred L-MDS to use when performing name-based operations, as follows: Benjamin, et al. Expires September 12, 2013 [Page 12] Internet-Draft pNFS Metastripe March 2013 Let F be the function specified in mdln_namebased_alg; Let X = (x1, x2, x3, ...) some set of inputs for function F, such that x1 SHOULD be the component name of the file, and x2, x3, ? any additional parameters required for the chosen F, their arguments asserted to be values available to the client. Let stripe_unit_number = F(X); Let stripe_count = number of elements in mdl_layout.mdln_stripe_pattern; Let idx = mdl_layout.mdln_stripe_pattern(stripe_unit_number % stripe_count); Let deviceid = mdl_layout.mdln_devicelist[idx]; pseudocode Figure 5 The client then selects an L-MDS indicated by the deviceid (using GETDEVICEINFO in the normal manner), and sends the name-based operation to that server. The current proposal defines a single algorithm, consisting of application of the 64-bit CityHash non-cryptographic hashing function [CITY], with x1 the desired component name, and x2 the 32-bit seed value returned in the layout. 5.4.3.2. Directory Enumeration Clients with layouts of type LAYOUTMETA4_DENTRY may use the following algorithm to perform enumeration of striped directories preferred metadata servers, in parallel: For stripe_number in 0 .. length(mdl_layout.mdln_stripe_pattern) -1 do Let stripe = mdl_layout.mdln_stripe_pattern[stripe_number]; Let device = mdl_layout.mdln_devicelist[stripe]; pseudocode Figure 6 That is, for each logical stripe in the directory, the client notes stripe number (merely the stripe?s offset in the sequence), and derives from it the corresponding index into mdln_devicelist by indirection on mdln_stripe_pattern. The object at mdln_devicelist[stripe_number] is a device id, which the client maps Benjamin, et al. Expires September 12, 2013 [Page 13] Internet-Draft pNFS Metastripe March 2013 to an L-MDS using GETDEVICEINFO, and performs a sequence of PREADDIR operations on that server. The PREADDIR operation behaves exactly as described in section 18.23.3 of [NFSv4.1], but takes in addition to the arguments of READDIR, a metadata layout stateid and stripe number. As in ordinary NFSv4.1, to perform a full enumeration of the directory entries at each component L-MDS, the client commences iteration by sending a cookie argument of zero for the first PREADDIR operation in the current stripe, and continues performing PREADDIR operations supplying for the cookie argument the value of last cookie value returned in the prior PREADDIR operation in the same logical (L-MDS) enumeration only, until a PREADDIR operation indicates that no further entries are available. The client and server behavior for subsequent re-traversals of a previously-enumerated logical directory are exactly as in ordinary NFSv4.1, except with respect to entry and cookie partitioning as described here. The client SHOULD present to a component L-MDS only cookie values previously returned to that client by that same L-MDS, or 0 to commence iteration. An L-MDS that is not also an I-MDS MAY reject with NFS4ERR_BADCOOKIE PREADDIR operations using cookie values that are valid cookies for the logical directory, but which are local to another L-MDS segment. 6. Further Considerations 6.1. Storage Access Protocols The LAYOUT4_METADATA layout type uses NFSv4.1 operations (and potentially, operations of higher minor versions of NFSv4, subject to the definition of a minor version of NFSv4) to access striped metadata. The LAYOUT4_METADATA does not affect access to storage devices, and indeed, in the protocol described here, layouts of type LAYOUT4_METADATA and ordinary pNFS layouts for parallel data access (e.g., LAYOUT4_NFSV4_1_FILES, LAYOUT4_OSD2_OBJECTS, or LAYOUT4_BLOCK_VOLUME, or a future flexible files layout), are orthogonal. 6.2. Revocation of Layouts Servers MAY revoke layouts of type LAYOUT4_METADATA. A client detects if layout has been revoked if the operation is rejected with NFS4ERR_PNFS_NO_LAYOUT. In NFSv4.1, the error NFS4ERR_PNFS_NO_LAYOUT could be returned only by READ and WRITE. When the server returns a layout of type LAYOUT4_METADATA, the set of operations that can return NFS4ERR_PNFS_NO_LAYOUT is: ACCESS, CLOSE, COMMIT, CREATE, DELEGRETURN, GETATTR, LINK, LOCK, LOCKT, LOCKU, LOOKUP, LOOKUPP, NVERIFY, OPEN, OPENATTR, OPEN_DOWNGRADE, PREADDIR, READ, READDIR, Benjamin, et al. Expires September 12, 2013 [Page 14] Internet-Draft pNFS Metastripe March 2013 READLINK, REMOVE, RENAME, SECINFO, SETATTR, VERIFY, WRITE, GET_DIR_DELEGATION, SECINFO, SECINFO_NO_NAME, and WANT_DELEGATION. 6.3. Stateids The pNFS specification for LAYOUT4_NFSV4_1_FILES states data servers MUST be aware of the stateids granted by MDS so that the stateids passed to READ and WRITE can be properly validated. This requirement extends to the LAYOUT4_METADATA layout type: the L-MDS MUST be aware of any non-layout stateids granted by the I-MDS, if and only if the client is in contact the L-MDS under direction of a metadata layout returned by the I-MDS, and the I-MDS has not recalled or revoked that layout. In addition, because an L-MDS can accept operations like OPEN and LOCK that create or modify stateids, the I-MDS MUST be aware of stateids that an L-MDS has returned to a client, if and only if the I-MDS granted the client a metadata layout that directed the client to the L-MDS. In some cases, one L-MDS MUST be aware of a stateid generated by another L-MDS. For example a client can obtain a stateid from the L-MDS serving as the destination of name-based operations, which includes OPEN. However, operations that use the stateid will be filehandle-only operations, and the L-MDS the OPEN operation is sent to might differ from the L-MDS the LOCK operation for the same target file is sent to. 6.4. Lease Terms Any state the client obtains from an I-MDS or L-MDS is guaranteed to last for an interval lasting as long as the maximum of the lease_time attribute of the the I-MDS, and any L-MDS the client is directed to as the result of a metadata layout. The client has a lease for each client ID it has with an I-MDS or L-MDS, and each lease MUST be renewed separately for each client ID. 6.5. Layout Operations Sent to an L-MDS An L-MDS MAY allow a LAYOUTGET operation of type LAYOUT4_METADATA. One reason the L-MDS might allow such a LAYOUTGET operation is to allow hierarchical striping. For example, for name-based operations, the pNFS server might use a radix tree, (which the field mdln_namebased_alg would indicate). The first four bytes of the component name would be combined to form a 32-bit stripe_unit_number. Once the client contacted the L-MDS, it would repeat the algorithm on the second four bytes of the component, and so on until the component name was exhausted. More typically, an L-MDS MAY allow a LAYOUTGET operation of type Benjamin, et al. Expires September 12, 2013 [Page 15] Internet-Draft pNFS Metastripe March 2013 LAYOUT4_NFSV4_1_FILES, LAYOUT4_OSD2_OBJECTS, or LAYOUT4_BLOCK_VOLUME. Naturally, a reason to allow this would be for increased pNFS MDS scalability. Once an L-MDS grants a layout, the client MUST use only the L-MDS that granted the layout to send LAYOUTUPDATE, LAYOUTCOMMIT, and LAYOUTRETURN. 6.6. Filehandles in Metadata Layouts Metadata layouts do not present filehandles. 6.7. Restriping 6.7.1. Layout Recall Cases When a server implementation intends to perform restriping, it MUST ensure that it has successfully recalled any metadata layout which would be invalidated by the restriping. If the implementation wishes to restripe a directory on which there are outstanding layouts of type LAYOUTMETA4_DENTRY, it must first successfully recall these layouts at their controlling I-MDS servers, as described in [NFSv4.1]. If the implementation wishes to perform inode restriping which would invalidate any inode stripe hint which it has issued to clients, it MUST successfully recall all controlling layouts of type LAYOUTMETA4_INODE which would conflict with the restriping Naturally, if a client requests an L-MDS to perform any operation under the auspices of a metadata layout which is no longer valid, the L-MDS is not required to perform it. The L-MDS SHOULD fail the operation with NFS4ERR_PNFS_NO_LAYOUT. 6.7.2. Hint Invalidation When an implementation wishes to perform inode restriping that would invalidate an inode stripe hint or hints it has issued to clients, it can use ordinary NFSv4.1 invalidation to reclaim the hints. Since inode stripe hints are recommended attributes, the controlling I-MDS or L-MDS does this by updating the change attribute on the inode being updated, as it would for any other inode update. 6.8. Recovery [[Comment.1: it is likely this section will follow that of the files layout type specified in the NFSv4.1 specification.]] Benjamin, et al. Expires September 12, 2013 [Page 16] Internet-Draft pNFS Metastripe March 2013 6.9. Failure and Restart of Client TBD 6.10. Failure and Restart of Server TBD 7. Negotiation n pNFS client sends a GETATTR operation for attribute fs_layout_type. If the reply contains the metadata layout type, then either or both of inode or dentry striping are supported, subject to further verification by subsequent LAYOUTGET operations. If not, the client cannot use metadata striping. 8. Operational Recommendation for Deployment Deploy the metadata striping layout when it is anticipated that the workload will involve a high fraction of non-I/O operations on filehandles. 9. Acknowledgements We gratefully acknowledge the primary contributions of Mike Eisler, Pranoop Ersani, and others, in [METASTRIPE]. From prior drafts, Brent Welch had the idea of returning a separate device ID for filehandle-only operations in the metadata layout. Pranoop Erasani, Dave Noveck, and Richard Jernigan provided valuable feedback. 10. Security Considerations The security considerations of Section 13.12 of [NFSv4.1] which are specific to data servers apply to l-MDSes. In addition, each l-MDS server and client are, respectively, a complete NFSv4.1 server and client, and so the security considerations of [NFSv4.1] apply to any client or server using the metadata layout type. 11. IANA Considerations This specification requires an addition to the Layout Types registry Benjamin, et al. Expires September 12, 2013 [Page 17] Internet-Draft pNFS Metastripe March 2013 described in Section 22.4 of [NFSv4.1]. The five fields added to the registy are: 1. Name of layout type: LAYOUT4_METADATA. 2. Value of layout type: TBD1. 3. Standards Track RFC that describes this layout: RFCTBD2, which would be the RFC of this document. 4. How the RFC Introduces the specification: minor revision (we believe). 5. Minor versions of NFSv4 that can use the layout type: [TBD]. This specification requires the creation of a registry of hash algorithms for supporting the field mdln_namebased_alg. Additional details TBD. This specification introduces two new recommended attributes (meta_stripe_deviceid and meta_stripe_count). This specification introduces a new operation (PREADDIR). 12. References 12.1. Normative References [CITY] Pike and Alakuijala, "Introducing CityHash", April 2011, < http://google-opensource.blogspot.com/2011/04/ introducing-cityhash.html>. [METASTRIPE] Eisler, "Metadata Striping for pNFS", October 2010, . [NFSv4.1] Shepler, Eisler, and Noveck, "Network File System (NFS) Version 4 Minor Version 1 Protocol", January 2010, . [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [RFC5226] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 5226, May 2008. Benjamin, et al. Expires September 12, 2013 [Page 18] Internet-Draft pNFS Metastripe March 2013 12.2. Informative References [RFC4506] Eisler, M., "XDR: External Data Representation Standard", STD 67, RFC 4506, May 2006. Authors' Addresses Matt Benjamin CohortFS, LLC 206 S. Fifth Ave, Suite 150 Ann Arbor, MI 48104 USA Phone: +1 734 761 4689 Email: matt@cohortfs.com Casey Bodley Email: casey@cohortfs.com Adam C. Emerson Email: aemerson@cohortfs.com Peter Honeyman Email: peter.honeyman@gmail.com Benjamin, et al. Expires September 12, 2013 [Page 19]