NFSv4                                                           C. Lever
Internet-Draft                                                    Oracle
Intended status: Experimental                           October 18, 2013
Expires: April 21, 2014


                  End-to-end Data Integrity For NFSv4
              draft-cel-nfsv4-end2end-data-protection-01

Abstract

   End-to-end data integrity protection provides a strong guarantee
   that data an application reads from durable storage is exactly the
   same data it wrote previously to durable storage.  This document
   specifies possible additions to the NFSv4 protocol enabling it to
   convey end-to-end data integrity information between client and
   server.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 21, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Scope Of This Document
     1.2.  Causes of Data Corruption
     1.3.  End-to-end Data Integrity
     1.4.  The Case For End-To-End Data Integrity Management
     1.5.  Terminology
   2.  Protocol
     2.1.  Protection types
       2.1.1.  Protection Type Table
     2.2.  GETATTR
     2.3.  INIT_PROT_INFO - Initialize Protection Information
       2.3.1.  ARGUMENTS
       2.3.2.  RESULTS
       2.3.3.  DESCRIPTION
     2.4.  New data content type
     2.5.  READ_PLUS
     2.6.  WRITE_PLUS
     2.7.  Error codes
   3.  Protocol Design Considerations
     3.1.  Protection Envelopes
     3.2.  Protecting Holes
     3.3.  Multi-server Considerations
       3.3.1.  pNFS and Protection Information
       3.3.2.  Server-to-server copy
   4.  Security Considerations
   5.  IANA Considerations
   6.  Acknowledgements
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Author's Address

1.  Introduction

1.1.  Scope Of This Document

   This document specifies a protocol based on NFSv4 minor version 2
   [PROVISIONAL-NFSV42] that enables per-I/O data integrity information
   to be conveyed between an NFS client and an NFS server.

   A key requirement is that data integrity verification is possible
   from application write to read.  This does not mean that a single
   protection envelope must exist from application to storage.
   However, it must be possible to perform integrity checking during
   each step of an I/O request's journey from application to storage
   and back.

   Therefore, the authors will not address how an NFSv4 client handles
   integrity-protected read and write requests from applications, nor
   how an NFSv4 server manages protection information on its durable
   storage.  We specify only a generic mechanism for transmitting
   integrity-protected read and write requests via the NFSv4 protocol,
   which client and server implementors may use as they see fit.

   A key interest in specifying and prototyping an integrity protection
   feature is exploring how I/O error handling and state recovery
   mechanisms in NFSv4 must be strengthened to guarantee the integrity
   of protected data.  Additionally, we want to identify exactly what
   modes of corruption are faced in environments where applications run
   on nodes separated from physical data storage.  Do we expect
   corruption that has never been seen in SAN or DAS environments,
   particularly failure modes in NAS clients that cannot be detected by
   traditional means (such as looking for misplaced block writes)?

   Finally, do we already have appropriate integrity protection
   mechanisms in the current protocol?  Network-layer integrity
   mechanisms such as an integrity-protecting RPCSEC_GSS service have
   been around for years, and might be adequate.  But do these
   mechanisms protect against CPU and memory corruption and application
   bugs, as well as malicious changes to data-at-rest?

1.2.  Causes of Data Corruption

   Data can be corrupted during transmission, during the act of
   recording, or during the act of retrieval.  Data can become corrupt
   while at rest on durable storage.  Either active corruption (e.g.,
   data is accidentally or maliciously overwritten) or passive
   corruption (e.g., storage device failure) can occur.

   Data storage systems must handle an increasingly large amount of
   data.  If the rate of corruption stays fixed while the amount of
   data stored increases, we expect corruption to become more common.
   To reduce failure rates and increase performance, data storage
   system complexity has increased.
   Complexity itself introduces the risk of corruption, since
   complexity can introduce bugs and make test coverage unacceptably
   sparse.  Diagnosing a failure in complex systems is an everyday
   challenge.

   Data corruption can be "detected" or "undetected" (silent).  The
   goal of data integrity protection is not to make corruption
   impossible, but rather to ensure corruption is detected before it
   can no longer be corrected, or at least before corrupt data is used
   by an application.

1.3.  End-to-end Data Integrity

   End-to-end data integrity is a class of operating system, file
   system, storage controller, and storage device features that provide
   broad protection against unwanted changes to or loss of data that
   resides on data storage devices.

   Typically, data integrity is verified at individual steps in a data
   flow using techniques such as parity.  This provides isolated
   protection during particular transfer operations, or at best between
   adjacent nodes in an I/O path.  In contrast, end-to-end protection
   guarantees data can be verified at every step as data flows from an
   application through a file system and storage controllers, via a
   variety of communication protocols, as it is stored on storage
   devices, and when it is read back from storage.

1.4.  The Case For End-To-End Data Integrity Management

   A modern NFSv4 deployment may already provide some degree of data
   protection to in-transit data.

   o  The use of RPCSEC_GSS Kerberos 5i and 5p [RFC2203] can protect
      NFSv4 requests from tampering or corruption during network
      transfer.

   o  An NFSv4 fileserver can employ RAID or block devices that store
      additional checksum data per logical block, in order to detect
      media failure.

   o  An advanced file system on an NFSv4 fileserver may protect data
      integrity by storing multiple copies of data or by separately
      storing additional checksums.

   To demonstrate why end-to-end data integrity protection provides a
   stronger integrity guarantee than protection provided by the single-
   domain mechanisms above, consider the following cases:

   o  On an NFSv4 fileserver, suppose a device driver bug causes a
      write operation to DMA the wrong memory pages to durable storage.
      The written data is incorrect, but the DMA transport checksum
      matches it.  The DMA operation completes without reporting an
      error, and upper layers discard the original copy of the data.

   o  Suppose an operating system or file system bug allows
      modifications to a page after it has been prepared for I/O and a
      checksum has been generated.  The page and checksum are then
      written to storage.  The written data does not represent the data
      originally written by the application, and the accompanying
      stored checksum does not match it.  The write operation completes
      without reporting an error, and upper layers discard the original
      copy of the data.

   o  Suppose a RAID array on an NFSv4 server receives incorrect data
      for some reason.  The array will generate RAID parity blocks from
      the incorrect data.  The data is incorrect, but the accompanying
      parity matches it.  The write operation completes without
      reporting an error, and upper layers discard the original copy of
      the data.

   o  Suppose an application is writing data repeatedly to the same
      area of a file stored on an NFSv4 fileserver.  Retransmits of an
      old write request become indistinguishable from new write
      requests to the same region.
      The written data always matches its application-generated
      checksum, but a replayed retransmission can overwrite newer data,
      and upper layers discard the original copy of the data.

   o  Suppose a middle box is caching NFSv4 write requests on behalf of
      a number of NFSv4 clients.  The wsize in effect for the clients
      does not have to match the wsize in effect between the middle box
      and the NFSv4 server.  If the middle box fragments and
      reassembles the write requests incorrectly, the write requests
      appear to complete, but incorrect data is written to the NFSv4
      server, and the clients discard the original copy of the data.

   In none of these cases is corruption identified while the original
   data remains available to correct the situation.  An end-to-end
   solution could have caught and reported each of these, allowing the
   data's originator to retry or report failure before the data loss is
   compounded.

1.5.  Terminology

   Buffer separation:  Protection information and the data it protects
      are contained in distinct buffers which have independent paths to
      durable storage.

   Checksum:  A value which is used to detect corruption in a
      collection of data.  It is usually computed by applying a simple
      operation (such as addition) to each element of the collection.
      Computing a checksum is a low-overhead operation, but it is less
      effective at helping detect and correct errors than a CRC.

   Cyclic Redundancy Check:  A value which is used to detect corruption
      in a collection of data.  It is based on a linear block error-
      correcting code.  The generator polynomial is chosen to maximize
      error detection, and is typically more successful than either
      simple parity or a checksum.  A CRC is efficient to compute with
      dedicated hardware, but can be expensive to compute in software.

   Data corruption:  Any undesired alteration of data.  Data corruption
      can be "detected" or "undetected" (silent).

   Data integrity:  A database term used here to mean that a collection
      of data is exactly the same before and after processing,
      transmission, or storage.

   Data integrity verification failure:  A node in an I/O path has
      failed to verify protection information associated with some
      data.  This can be because the data or the protection information
      has been corrupted, or because the node is malfunctioning.

   Integrity metadata:  See "Protection information."

   Latent corruption:  Data corruption that is discovered long after
      the data was originally recorded on a storage device.

   Lost write:  A write operation to a storage device which behaves as
      if the target data is stored durably, but in fact the data is
      never recorded.

   Misdirected write:  A write operation that causes the target data to
      be written to a different location on a storage device than was
      intended.

   Parity:  A single bit which represents the evenness or oddness of a
      collection of data.  Checking a parity bit can reveal and help
      correct data corruption.  Parity is easy to compute and requires
      little space to store, but is generally less effective than other
      methods of error correction.  "Parity" can also refer to checksum
      data in a RAID.

   Protection envelope:  A set of nodes in an I/O system which together
      guarantee data integrity from input to output.

   Protection information:  Information about a collection of
      application data that allows detection and possibly correction of
      corruption.  This can take the form of parity, a checksum, a CRC
      value, or something more complex.  Also the formal name of an
      end-to-end data integrity mechanism adopted by T10 for SCSI block
      storage devices.

   Protection interval:  A collection of application data that is
      protected from corruption.  The collection must be no larger or
      smaller than what can be written atomically to durable storage.
      Typically there is a one-to-one mapping between a protection
      interval and a logical block on a storage device.  However, a
      device with a large sector size may store multiple protection
      intervals per sector, to maintain adequate protection with
      limited protection information.

   Protection type:  An enumerated value that indicates the size,
      contents, and interpretation of fields containing protection
      information.
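   To make the relative strength of parity and checksums concrete, the
   following sketch (illustrative only; it is not part of the protocol,
   and all names are invented for this example) computes an even-parity
   bit and a simple additive checksum over a buffer.  Two corruptions
   that cancel arithmetically escape an additive checksum; a CRC's
   generator polynomial is chosen to catch such patterns.

      #include <stddef.h>
      #include <stdint.h>
      #include <stdio.h>

      /* Even parity: 1 if the total number of set bits is odd. */
      static uint8_t even_parity(const uint8_t *buf, size_t len)
      {
              uint8_t p = 0;

              for (size_t i = 0; i < len; i++) {
                      uint8_t b = buf[i];

                      b ^= b >> 4;
                      b ^= b >> 2;
                      b ^= b >> 1;
                      p ^= b & 1;
              }
              return p;
      }

      /* Additive checksum: cheap to compute, but corruptions that
       * offset one another sum to the same value and go undetected. */
      static uint32_t additive_checksum(const uint8_t *buf, size_t len)
      {
              uint32_t sum = 0;

              for (size_t i = 0; i < len; i++)
                      sum += buf[i];
              return sum;
      }

      int main(void)
      {
              uint8_t block[8] = { 'p', 'r', 'o', 't', 'e', 'c', 't', '!' };

              printf("parity=%u checksum=0x%08x\n",
                     even_parity(block, sizeof(block)),
                     additive_checksum(block, sizeof(block)));
              return 0;
      }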
2.  Protocol

   This section prescribes changes to the NFSv4 XDR specification
   [PROVISIONAL-NFSV42-XDR] to enable the conveyance of Protection
   Information via NFSv4.  Therefore, an NFSv4.2 implementation is a
   necessary starting point.  These changes are compatible with the
   NFSv4 minor versioning rules described in the NFSv4.2 specification.

   The RPC protocol used by NFSv4 is ONC RPC [RFC5531].  The data
   structures used for the parameters and return values of these
   procedures are expressed in this document in XDR [RFC4506].

2.1.  Protection types

   A new fixed-size structure is defined that encodes the format and
   content of Protection Information.  This includes the meaning of
   tags, the size of the protection interval, and so on.  For NFS, we
   need to go beyond existing SCSI protection types and consider
   cryptographic integrity types (i.e., the ability to guarantee
   integrity of data-at-rest over time by means of digital signature).

   To begin, we provide NFSv4 equivalents for a few typical T10 PI
   protection types [T10-SBC2], in addition to a few new protection
   types:

      enum nfs_protection_type4 {
              NFS_PI_TYPE1 = 1,
              NFS_PI_TYPE2 = 2,
              NFS_PI_TYPE3 = 3,
              NFS_PI_TYPE4 = 4,
              NFS_PI_TYPE5 = 5,
      };

      struct nfs_protection_info4 {
              nfs_protection_type4    pi_type;
              uint32_t                pi_intvl_size;
              uint64_t                pi_other_data;
      };

   The pi_type field reports the protection type.  The pi_intvl_size
   field reports the supported protection interval size, in octets.
   The meaning of the content of the pi_other_data field depends on the
   protection type.
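   The following fragment is a minimal sketch, not normative; the C
   names mirror the XDR above but are otherwise invented.  It shows how
   an implementation might represent this structure in memory and check
   that a protected I/O request is aligned to the protection interval,
   a condition the error codes in Section 2.7 assume:

      #include <stdbool.h>
      #include <stdint.h>

      /* C mirror of the XDR nfs_protection_type4 enum. */
      enum nfs_protection_type {
              NFS_PI_TYPE1 = 1,
              NFS_PI_TYPE2 = 2,
              NFS_PI_TYPE3 = 3,
              NFS_PI_TYPE4 = 4,
              NFS_PI_TYPE5 = 5,
      };

      /* C mirror of the XDR nfs_protection_info4 structure. */
      struct nfs_protection_info {
              enum nfs_protection_type pi_type;
              uint32_t pi_intvl_size;   /* octets per interval */
              uint64_t pi_other_data;   /* meaning is type-specific */
      };

      /* A protected I/O must begin and end on a protection interval
       * boundary; otherwise whole intervals cannot be verified.
       * Section 2.7 assigns NFS4ERR_PROT_INVAL to this failure. */
      static bool pi_range_is_aligned(const struct nfs_protection_info *pi,
                                      uint64_t offset, uint64_t count)
      {
              if (pi->pi_intvl_size == 0)
                      return false;
              return offset % pi->pi_intvl_size == 0 &&
                     count % pi->pi_intvl_size == 0;
      }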
2.1.1.  Protection Type Table

   The following table specifies tag sizes and contents, and other
   features of each protection type.

   +------------+----------------------+--------------------+----------+
   | NFS        | Description          | pi_other_data      | Comment  |
   | Protection |                      |                    |          |
   | Type       |                      |                    |          |
   +------------+----------------------+--------------------+----------+
   | 1          | PI field is          | Always zero        | NFS      |
   |            | application-owned;   |                    | "native" |
   |            | 8-byte protection    |                    | PI       |
   |            | information field    |                    |          |
   |            | containing a SHA-1   |                    |          |
   |            | hash of the          |                    |          |
   |            | protection interval  |                    |          |
   |            |                      |                    |          |
   | 2          | PI field is          | Zero means the     | NFS      |
   |            | application-owned;   | RSASSA-PKCS1-v1_5  | "native" |
   |            | 8-byte protection    | signing scheme     | PI       |
   |            | information field    | [RFC3447] is used  |          |
   |            | containing a hash of |                    |          |
   |            | the protection       |                    |          |
   |            | interval signed by a |                    |          |
   |            | private key.  A      |                    |          |
   |            | public key is        |                    |          |
   |            | provided separately  |                    |          |
   |            | so the server can    |                    |          |
   |            | verify incoming      |                    |          |
   |            | protection intervals |                    |          |
   |            |                      |                    |          |
   | 3          | 8-byte protection    | 1 if the PI field  | T10 PI   |
   |            | information field    | is                 | Type 1   |
   |            | containing 2-byte    | application-owned; |          |
   |            | guard tag (CRC-16    | otherwise zero     |          |
   |            | checksum of          |                    |          |
   |            | protection           |                    |          |
   |            | interval), 2-byte    |                    |          |
   |            | application tag      |                    |          |
   |            | (user defined), and  |                    |          |
   |            | 4-byte reference tag |                    |          |
   |            | (LO 32 bits of LBA)  |                    |          |
   |            |                      |                    |          |
   | 4          | 8-byte protection    | 1 if the PI field  | T10 PI   |
   |            | information field    | is                 | Type 2   |
   |            | containing 2-byte    | application-owned; |          |
   |            | guard tag (CRC-16    | otherwise zero     |          |
   |            | checksum of          |                    |          |
   |            | protection           |                    |          |
   |            | interval), 2-byte    |                    |          |
   |            | application tag      |                    |          |
   |            | (user defined), and  |                    |          |
   |            | 4-byte reference tag |                    |          |
   |            | (*)                  |                    |          |
   |            |                      |                    |          |
   | 5          | PI field is          | 1 if the PI field  | T10 PI   |
   |            | application-owned;   | is                 | Type 3   |
   |            | 8-byte protection    | application-owned; |          |
   |            | information field    | otherwise zero     |          |
   |            | containing 2-byte    |                    |          |
   |            | guard tag (CRC-16    |                    |          |
   |            | checksum of          |                    |          |
   |            | protection           |                    |          |
   |            | interval), 2-byte    |                    |          |
   |            | application tag      |                    |          |
   |            | (user defined), and  |                    |          |
   |            | 4-byte reference tag |                    |          |
   |            | (user defined)       |                    |          |
   +------------+----------------------+--------------------+----------+

   The protection type enumerator is key to the extensibility of the
   NFSv4 end-to-end data integrity feature.  A future specification can
   introduce new protection types that support Advanced Format drives,
   or types for storage that does not support application-owned
   Protection Information fields, for example.  To manage this ongoing
   process, the contents of this table should be administered by IANA.

   [*] T10 PI Type 2 uses an indirect LBA in its reference tag.  In
   this case, the I/O operation passes the reference tag value for the
   first protection interval in a separate operation.  The reference
   tag in the first protection field must match this value.  The
   reference tags in subsequent fields are this value plus (n-1).

   It is still not clear how Type 2 works without chaining read and
   write requests.  When an application writes a series of unrelated
   blocks, what should the reference LBNs be?  When an application
   reads randomly, what reference LBNs should it expect?
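   The guard tag used by the T10-style protection types (3, 4, and 5)
   is a CRC-16 of the protection interval.  The following sketch shows
   one way to compute these fields (illustrative only; it assumes the
   T10 DIF generator polynomial, 0x8BB7, and a host-order in-memory
   layout rather than the big-endian XDR encoding used on the wire):

      #include <stddef.h>
      #include <stdint.h>

      /* Bit-serial CRC-16 using the T10 DIF polynomial 0x8BB7. */
      static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
      {
              uint16_t crc = 0;

              for (size_t i = 0; i < len; i++) {
                      crc ^= (uint16_t)buf[i] << 8;
                      for (int bit = 0; bit < 8; bit++)
                              crc = (crc & 0x8000) ?
                                      (uint16_t)(crc << 1) ^ 0x8BB7 :
                                      (uint16_t)(crc << 1);
              }
              return crc;
      }

      /* In-memory form of one 8-byte protection information field. */
      struct t10_pi_tuple {
              uint16_t guard;    /* CRC-16 of the protection interval */
              uint16_t app_tag;  /* user defined */
              uint32_t ref_tag;  /* types 3 and 4: derived from the LBA */
      };

      /* Generate PI for n consecutive intervals.  The reference tags
       * follow the "first value plus (n-1)" rule described in the
       * note on indirect reference tags above. */
      static void generate_pi(const uint8_t *data, size_t intvl_size,
                              unsigned n, uint32_t first_ref,
                              struct t10_pi_tuple *out)
      {
              for (unsigned i = 0; i < n; i++) {
                      out[i].guard = crc16_t10dif(data + (size_t)i * intvl_size,
                                                  intvl_size);
                      out[i].app_tag = 0;
                      out[i].ref_tag = first_ref + i;
              }
      }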
2.2.  GETATTR

   A new read-only per-FSID GETATTR attribute is defined to request the
   list of protection types supported on a particular FSID.

      const FATTR4_PROTECTION_TYPES = 82;

   The reply data type follows.

      typedef nfs_protection_info4 fattr4_protection_info<>;

2.3.  INIT_PROT_INFO - Initialize Protection Information

   Some protection types require additional data in order for the
   storage to perform integrity verification.  This data is transmitted
   by a new operation.

2.3.1.  ARGUMENTS

      struct INITPROTINFO4args {
              nfs_protection_type4    ipi_type;
              opaque                  ipi_data<>;
      };

2.3.2.  RESULTS

      struct INITPROTINFO4res {
              nfsstat4        status;
      };

2.3.3.  DESCRIPTION

   This operation is used to transmit initialization data in
   preparation for a stream of integrity-protected I/O requests.  The
   exact content of the ipi_data field depends on the protection type
   specified in the ipi_type field.  For example, for NFS_PI_TYPE2, the
   ipi_data field might contain a binary format public key that can be
   used to validate the signature of incoming protection intervals.

2.4.  New data content type

   NFSv4.2 introduces a mechanism that can be used to extend the types
   of data that can be read and written by a client.  To convey
   protection information we extend the data_content4 enum.

      enum data_content4 {
              NFS4_CONTENT_DATA           = 0,
              NFS4_CONTENT_APP_DATA_HOLE  = 1,
              NFS4_CONTENT_HOLE           = 2,
              NFS4_CONTENT_PROTECTED_DATA = 3,
      };

      struct data_protected4 {
              nfs_protection_info4    pd_type;
              offset4                 pd_offset;
              bool                    pd_allocated;
              opaque                  pd_info<>;
              opaque                  pd_data<>;
      };

   The pd_offset field specifies the byte offset where data should be
   read or written.  The number of bytes to write is specified by the
   size of the pd_data array.  The pd_allocated field is equivalent to
   the d_allocated field in the data4 type specified in
   [PROVISIONAL-NFSV42].

   The opaque pd_info field contains a packed array of fixed-size
   protection fields.  The length of the array must be consistent with
   the pd_offset and count arguments specified for the data range of
   the operation.  The size and format of the contents of each field in
   the array are determined by the value of the pd_type field.

   The opaque pd_data field contains the normal data being conveyed in
   this operation.

2.5.  READ_PLUS

   The READ_PLUS operation reads protection information using the
   NFS4_CONTENT_PROTECTED_DATA content type.

      union read_plus_content switch (data_content4 rpc_content) {
      case NFS4_CONTENT_DATA:
              data4           rpc_data;
      case NFS4_CONTENT_APP_DATA_HOLE:
              app_data_hole4  rpc_adh;
      case NFS4_CONTENT_HOLE:
              data_info4      rpc_hole;
      case NFS4_CONTENT_PROTECTED_DATA:
              data_protected4 rpc_pdata;
      default:
              void;
      };

   The offset and length arguments of the READ_PLUS operation
   (rpa_offset and rpa_count) determine the data byte range covered by
   the protection information and normal data returned in each request.
   For example, suppose the protection type mandated 8-byte protection
   fields and a 512-byte protection interval.  A READ_PLUS requesting
   protection information for a 4096-byte range of a file would receive
   an array of eight 8-byte protection fields, or 64 bytes.

2.6.  WRITE_PLUS

   The WRITE_PLUS operation writes protection information using the
   NFS4_CONTENT_PROTECTED_DATA content type.

      union write_plus_arg4 switch (data_content4 wpa_content) {
      case NFS4_CONTENT_DATA:
              data4           wpa_data;
      case NFS4_CONTENT_APP_DATA_HOLE:
              app_data_hole4  wpa_adh;
      case NFS4_CONTENT_HOLE:
              data_info4      wpa_hole;
      case NFS4_CONTENT_PROTECTED_DATA:
              data_protected4 wpa_pdata;
      default:
              void;
      };

   The offset and length arguments of the WRITE_PLUS operation
   (pd_offset and the size of pd_data) determine the data byte range
   covered by the protection information.  For example, suppose the
   protection type mandated 8-byte protection fields and a 512-byte
   protection interval.  A WRITE_PLUS writing protection information to
   a 4096-byte range of a file would send an array of eight 8-byte
   protection fields, or 64 bytes.
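   The arithmetic relating a byte range to the size of the pd_info
   array applies identically to READ_PLUS and WRITE_PLUS.  A minimal
   sketch follows (the helper names are illustrative; the range is
   assumed to have already passed the alignment check shown in
   Section 2.1):

      #include <stdint.h>

      #define PI_FIELD_SIZE 8u  /* octets of PI per protection interval */

      /* Number of protection fields covering an interval-aligned
       * byte range of the given length. */
      static uint64_t pi_field_count(uint64_t count, uint32_t intvl_size)
      {
              return count / intvl_size;
      }

      /* Expected size of the packed pd_info array, in octets. */
      static uint64_t pi_info_bytes(uint64_t count, uint32_t intvl_size)
      {
              return pi_field_count(count, intvl_size) * PI_FIELD_SIZE;
      }

   With count = 4096 and pi_intvl_size = 512, pi_info_bytes() returns
   64, matching the examples above.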
2.7.  Error codes

   New error codes are introduced to allow an NFSv4 server to convey
   integrity-related failure modes to clients.  These new codes include
   (but are not limited to) the following:

      enum nfsstat4 {
              ...
              NFS4ERR_PROT_NOTSUPP  = 10200,
              NFS4ERR_PROT_INVAL    = 10201,
              NFS4ERR_PROT_FAIL     = 10202,
              NFS4ERR_PROT_LATFAIL  = 10203,
      };

   NFS4ERR_PROT_NOTSUPP:  The protection type specified in an operation
      is not supported for the FSID upon which the file resides.

   NFS4ERR_PROT_INVAL:  The protection information passed as an
      argument is garbled (cf. NFS4ERR_BADXDR).  This error code MUST
      be returned if the offset and length of read or written data do
      not align with the protection interval specified by the
      protection type.

   NFS4ERR_PROT_FAIL:  During a WRITE_PLUS operation, the protection
      information does not verify the written data.  If this was an
      UNSTABLE WRITE_PLUS, the client should retry the operation using
      FILE_SYNC so the server can report precisely where the data
      writes are failing.

   NFS4ERR_PROT_LATFAIL:  During a READ_PLUS operation, the protection
      information does not verify the read data.  This error code
      reports a verification failure that occurred before the data
      arrives at an NFSv4 client.  The client is not required to read
      protection information to see this error.

   If data integrity verification fails while a server is pre-fetching
   data, the failure cannot be reported until the client reads the
   section of the file where the failure occurs.  Pre-fetched data
   might never be read by a client, therefore a data integrity
   verification failure that occurred while pre-fetching may never be
   reported to an NFS client or an application.
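   The retry policy suggested for NFS4ERR_PROT_FAIL can be expressed
   compactly.  In the sketch below, nfs_do_write_plus() is a
   hypothetical client-internal helper, not an API defined by this
   document; the stable_how4 values are those of the base NFSv4
   protocol:

      /* The 102xx value is from this document; the stable_how4
       * values are from the base NFSv4 protocol. */
      #define NFS4ERR_PROT_FAIL  10202
      #define UNSTABLE4          0
      #define FILE_SYNC4         2

      /* Hypothetical client-internal WRITE_PLUS issuer. */
      extern int nfs_do_write_plus(void *req, int stable_how);

      /* If an UNSTABLE write fails verification, resend it FILE_SYNC
       * so the server can report precisely which writes fail. */
      static int write_plus_with_prot_retry(void *req)
      {
              int status = nfs_do_write_plus(req, UNSTABLE4);

              if (status == NFS4ERR_PROT_FAIL)
                      status = nfs_do_write_plus(req, FILE_SYNC4);
              return status;
      }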
3.  Protocol Design Considerations

3.1.  Protection Envelopes

   We explore protection envelopes that might appear in a typical NFSv4
   deployment, and design an architecture that guarantees unbroken data
   integrity protection through each of these envelopes.

   In addition, it is useful to permit varying degrees of server,
   client, and application participation in a data protection scheme.
   We can define protection envelopes of varying circumference that
   allow implementations and deployments to choose a level of
   complexity, data protection, and performance impact that suits their
   applications.

   The following are presented in order of smallest to largest
   circumference.  To enable end-to-end protection, each protection
   envelope in this list depends on having the previous envelope in
   place.

   Server storage:  The storage subsystem on an NFSv4 server is below
      the physical filesystems on that server.  If a data integrity
      mechanism is available on the block storage, the physical
      filesystem may or may not choose to use it.  Data integrity
      verification failures are reflected to NFS clients as simple I/O
      errors.

   Server filesystem:  The physical filesystem on an NFSv4 server may
      provide a data integrity mechanism based on its own checksumming
      scheme, or by using a standard block storage mechanism such as
      T10 PI/DIX [DIX].  The NFSv4 service on that system may or may
      not choose to use the filesystem's integrity service.  Data
      integrity verification failures are reflected to NFS clients as
      simple I/O errors.

   Server:  An NFSv4 server may choose to use the local filesystem's
      data integrity mechanism, but not to advertise a data integrity
      mechanism via NFSv4.  Data integrity verification failures are
      reflected to NFS clients as simple I/O errors.

   Client-server:  If an NFSv4 server advertises data integrity
      mechanisms via NFSv4, an NFSv4 client may choose to use NFSv4
      data integrity protection without advertising the capability to
      applications running on it.  It may also choose not to use NFSv4
      data integrity protection at all.  Data integrity verification
      failures are reflected to applications as simple I/O errors.

   Application-client-server:  Suppose that an NFSv4 client chooses to
      use data integrity protection via NFSv4 and that the capability
      is advertised to applications.  Applications may or may not
      choose to use the capability.  An NFSv4 client uses on-the-wire
      data integrity when an application chooses to use the capability,
      but may or may not use it when the application chooses not to.
      Data integrity verification failures are reflected to
      applications as is.  This is full end-to-end data integrity
      protection via NFSv4.

   Note that the "server" envelope is not externally distinguishable
   from a server that does not support data integrity protection at all
   (other than that it provides somewhat better data integrity
   guarantees than one that does not).  This is a way to introduce
   stronger data integrity without requiring a large deployment of
   NFSv4 clients capable of integrity verification.  Or, stronger data
   integrity can be introduced to legacy NFS environments that have no
   protocol mechanisms for extending the protection envelope past the
   server.

   The "application-client-server" envelope illustrates that, on a
   protection-enabled file system, data integrity verification can be
   used on a per-file basis.  Applications may choose to use protection
   for some files and not others.  Some applications may choose to use
   protection, and some applications may choose not to use it.

   Note that in each case, data integrity protection is available to
   the edge of the farthest protection envelope.  Data integrity is
   protected only after the data arrives at a protection envelope
   boundary, and before it leaves that boundary.  Legacy NFS clients
   continue to access protected data on a server, but are unaware of
   data integrity verification failures except as generic I/O errors.

   The client-cache-server case is considered separately.  The "cache"
   node in this case may be a dedicated NFSv4 cache, a caching peer-to-
   peer NFSv4 client, or a pNFS metadata server.  A separate protection
   envelope exists between an NFSv4 client and an intermediate cache,
   and another between that cache and the NFSv4 server where the
   protected data resides.

3.2.  Protecting Holes

   NFSv4 minor version 2 [PROVISIONAL-NFSV42] exposes clients to
   certain mechanics of the underlying file systems on servers which
   allow more direct control of the storage space utilized by files.
   The goal of these new features is to economize the transfer and
   storage of file data.  These new features include support for
   reading sparse files efficiently, space reservation, and punching
   holes in files (similar to a TRIM or DISCARD operation on a block
   device).

   A hole is an area of a file that can be represented in a file system
   by having no backing storage.  By definition any read of that region
   of the file returns a range of bytes containing zero.  Any write to
   that region allocates fresh backing storage normally.
   NFSv4.2 extends this notion to allow NFSv4 clients to specify a
   pattern containing non-zero bytes to be returned when reading that
   region of a file.  The protocol feature is independent of how an
   NFSv4 server's file system chooses to store this data.  In fact a
   server's file system is free to simply store zeroes or a byte
   pattern on disk as raw data rather than in some optimized fashion.

   If an NFSv4 server's file system does use an optimized storage
   method, a decision must be made about whether accompanying PI is
   needed.  For a plain hole (where zero is always returned by a raw
   data read operation) the intention is that there is no backing
   storage there, thus PI is not meaningful.  However a read operation
   that requests protection information must return something
   meaningful.  For protection types that mandate only a checksum guard
   tag (and do not store either reference or application tag data), a
   checksum for each protection interval can be generated on the server
   during a normal read operation, or on the client if a sparse read is
   used.

   For a data hole (where some non-zero pattern is returned by a raw
   read operation), storing PI is optional, and depends on whether the
   protection type requires the storage to return an intact application
   tag.  Without the requirement of storing the application tag, the
   file system could discard the PI after a write operation, and
   recompute it from the pattern on a read operation.  Or, it could
   store the PI information as part of the pattern metadata.
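   Generating guard tags for a plain hole on demand is inexpensive
   because every protection interval reads as zeroes.  A minimal sketch
   follows (it assumes a checksum-only protection type and a protection
   interval of at most 4096 octets; crc16_t10dif() is the routine from
   the sketch in Section 2.1.1, repeated so this fragment stands
   alone):

      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      /* Bit-serial CRC-16 using the T10 DIF polynomial 0x8BB7. */
      static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
      {
              uint16_t crc = 0;

              for (size_t i = 0; i < len; i++) {
                      crc ^= (uint16_t)buf[i] << 8;
                      for (int bit = 0; bit < 8; bit++)
                              crc = (crc & 0x8000) ?
                                      (uint16_t)(crc << 1) ^ 0x8BB7 :
                                      (uint16_t)(crc << 1);
              }
              return crc;
      }

      /* Every interval of a plain hole reads as zeroes, so its guard
       * tag is a constant for a given interval size; it can be
       * computed once and cached rather than stored with the hole. */
      static uint16_t hole_guard_tag(uint32_t intvl_size)
      {
              uint8_t zeroes[4096];  /* assumes intvl_size <= 4096 */

              memset(zeroes, 0, sizeof(zeroes));
              return crc16_t10dif(zeroes, intvl_size);
      }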
3.3.  Multi-server Considerations

   The NFSv4 protocol provides several mechanisms for NFSv4 servers to
   co-operate in ways that enhance performance scalability and data
   availability.  An NFSv4 client can access the same data serially on
   single NFSv4 servers when a file system is replicated.  A file
   system can be migrated between NFSv4 servers transparently to
   clients.  Or a file system can be constructed from files that reside
   in parts on several NFSv4 servers.

   To allow coherent use of a data integrity mechanism:

   o  Each NFSv4 Data Server hosting a particular file system MUST
      support the same protection types.

   o  Each replica of a file system MUST support the same protection
      types.

   o  The destination of a file system migration MUST support all
      protection types supported by the source, and the transitioned
      file system MUST use the same protection type it did on the
      source server.

   Enforcing these mandates is likely outside the purview of the NFSv4
   protocol, particularly because no mechanism for transitioning file
   systems is set out by any NFSv4 protocol specification.  However,
   enforcing such mandates could be built into administrative tools.

3.3.1.  pNFS and Protection Information

   There has been some uncertainty about whether Protection Information
   should be considered metadata or data.  pNFS has a convenient
   operational definition of data and metadata: if it's data, it goes
   to the Data Server; if it's metadata, it goes to the Metadata
   Server.

   Protection Information belongs with the data it protects, which is
   written to Data Servers.  Therefore Protection Information is data.
   If a client ever writes Protection Information to a Metadata Server,
   such Protection Information will be forwarded to an appropriate Data
   Server for storage.

   For the file layout type, which uses NFSv4 when communicating with
   Data Servers, all protection types have protocol support for
   Protection Information.  For other layout types, support may or may
   not be available in their respective data protocols.  Layout
   implementations are not guaranteed to support every protection type.

3.3.2.  Server-to-server copy

   NFSv4 minor version 2 [PROVISIONAL-NFSV42] introduces a new multi-
   server feature known as server-to-server copy.  Clients can offload
   the data copy portion of copying part or all of a file.  The
   destination file is recognized as a separate entity (i.e., it has a
   unique file handle), not as a replica of the original file.  As
   such, the destination file may be stored in a file system that has a
   different protection type than the source file, or may not be
   protected at all.

   If the destination filesystem supports the same protection type as
   the source filesystem, the copy offload operation MUST copy
   Protection Information associated with the source file to the
   destination file.  Server implementors MAY provide data integrity
   verification on both ends of the offloaded copy operation.  A server
   MUST report data integrity verification failures that occur during
   an offloaded copy operation.

4.  Security Considerations

   A man-in-the-middle attack can replace both the data and integrity
   metadata in any NFSv4 request that is sent in the clear.  Therefore,
   when a data integrity protection mechanism is deployed on an
   untrusted network, it is strongly urged that a cryptographically
   secure integrity-checking RPC transport, such as RPCSEC_GSS Kerberos
   5i [RFC2203], be used to convey NFSv4 traffic on open networks.

5.  IANA Considerations

   This document currently does not require actions by IANA.  However,
   see Section 2.1.

6.  Acknowledgements

   The author of this document gratefully acknowledges the
   contributions of Martin K. Petersen, David Noveck, and Spencer
   Shepler.  Bill Baker, Chris Mason, and Tom Haynes also provided
   guidance and suggestions.

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2203]  Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
              Specification", RFC 2203, September 1997.

   [RFC3447]  Jonsson, J. and B. Kaliski, "Public-Key Cryptography
              Standards (PKCS) #1: RSA Cryptography Specifications
              Version 2.1", RFC 3447, February 2003.

   [RFC4506]  Eisler, M., "XDR: External Data Representation Standard",
              STD 67, RFC 4506, May 2006.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
              Specification Version 2", RFC 5531, May 2009.

7.2.  Informative References

   [DIX]      Petersen, M., "I/O Controller Data Integrity Extensions",
              November 2009.

   [PROVISIONAL-NFSV42]
              Haynes, T., Ed., "NFS Version 4 Minor Version 2",
              March 2013.

   [PROVISIONAL-NFSV42-XDR]
              Haynes, T., Ed., "NFS Version 4 Minor Version 2 Protocol
              External Representation Standard (XDR) Description",
              March 2013.

   [T10-SBC2] Elliott, R., Ed., "ANSI INCITS 405-2005, Information
              Technology - SCSI Block Commands - 2 (SBC-2)",
              November 2004.
Author's Address

   Charles Lever
   Oracle Corporation
   1015 Granger Avenue
   Ann Arbor, MI  48104
   US

   Phone: +1 734 274 2396
   Email: chuck.lever@oracle.com