NFSv4                                                           C. Lever
Internet-Draft                                                    Oracle
Intended status: Experimental                           October 18, 2013
Expires: April 21, 2014
End-to-end Data Integrity For NFSv4
draft-cel-nfsv4-end2end-data-protection-01
Abstract
End-to-end data integrity protection provides a strong guarantee that
data an application reads from durable storage is exactly the same
data it wrote previously to durable storage. This document specifies
possible additions to the NFSv4 protocol enabling it to convey end-
to-end data integrity information between client and server.
Lever Expires April 21, 2014 [Page 1]
Internet-Draft NFSv4 End-to-end Data Integrity October 2013
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on April 21, 2014.
Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1. Scope Of This Document . . . . . . . . . . . . . . . . . . 4
1.2. Causes of Data Corruption . . . . . . . . . . . . . . . . 4
1.3. End-to-end Data Integrity . . . . . . . . . . . . . . . . 5
1.4. The Case For End-To-End Data Integrity Management . . . . 5
1.5. Terminology . . . . . . . . . . . . . . . . . . . . . . . 7
2. Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1. Protection types . . . . . . . . . . . . . . . . . . . . . 9
2.1.1. Protection Type Table . . . . . . . . . . . . . . . . 10
2.2. GETATTR . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3. INIT_PROT_INFO - Initialize Protection Information . . . . 12
2.3.1. ARGUMENTS . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2. RESULTS . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 12
2.4. New data content type . . . . . . . . . . . . . . . . . . 12
2.5. READ_PLUS . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6. WRITE_PLUS . . . . . . . . . . . . . . . . . . . . . . . . 14
2.7. Error codes . . . . . . . . . . . . . . . . . . . . . . . 14
3. Protocol Design Considerations . . . . . . . . . . . . . . . . 16
3.1. Protection Envelopes . . . . . . . . . . . . . . . . . . . 16
3.2. Protecting Holes . . . . . . . . . . . . . . . . . . . . . 17
3.3. Multi-server Considerations . . . . . . . . . . . . . . . 18
3.3.1. pNFS and Protection Information . . . . . . . . . . . 19
3.3.2. Server-to-server copy . . . . . . . . . . . . . . . . 19
4. Security Considerations . . . . . . . . . . . . . . . . . . . 21
5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 22
6. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 23
7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7.1. Normative References . . . . . . . . . . . . . . . . . . . 24
7.2. Informative References . . . . . . . . . . . . . . . . . . 24
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 25
1. Introduction
1.1. Scope Of This Document
This document specifies a protocol based on NFSv4 minor version 2
[PROVISIONAL-NFSV42] that enables per-I/O data integrity information
to be conveyed between an NFS client and an NFS server.
A key requirement is that data integrity verification is possible
from application write to read. This does not mean that a single
protection envelope must exist from application to storage. However,
it must be possible to perform integrity checking during each step of
an I/O request's journey from application to storage and back.
Therefore, the authors will not address how an NFSv4 client handles
integrity-protected read and write requests from applications, nor
how an NFSv4 server manages protection information on its
durable storage. We specify only a generic mechanism for
transmitting integrity-protected read and write requests via the
NFSv4 protocol, which client and server implementors may use as they
see fit.
A key interest in specifying and prototyping an integrity protection
feature is exploring how I/O error handling and state recovery
mechanisms in NFSv4 must be strengthened to guarantee the integrity
of protected data.
Additionally, we want to identify exactly what modes of corruption
are faced in environments where applications run on nodes separated
from physical data storage. Do we expect corruption that has never
been seen in SAN or DAS environments, particularly failure modes in
NAS clients that cannot be detected by traditional means (such as
looking for misplaced block writes)?
Finally, do we already have appropriate integrity protection
mechanisms in the current protocol? Network-layer integrity
mechanisms such as an integrity-protecting RPCSEC_GSS service have
been around for years, and might be adequate. But do these
mechanisms protect against CPU and memory corruption and application
bugs, as well as malicious changes to data-at-rest?
1.2. Causes of Data Corruption
Data can be corrupted during transmission, during the act of
recording, or during the act of retrieval. Data can become corrupt
while at rest on durable storage. Either active corruption (e.g.
data is accidentally or maliciously overwritten) or passive
corruption (e.g. storage device failure) can occur.
Data storage systems must handle an increasingly large amount of
data. If the rate of corruption stays fixed while the amount of data
stored increases, we expect corruption to become more common.
To reduce failure rate and increase performance, data storage system
complexity has increased. Complexity itself introduces the risk of
corruption, since complexity can introduce bugs and make test
coverage unacceptably sparse. Diagnosing a failure in complex
systems is an everyday challenge.
Data corruption can be "detected" or "undetected" (silent). The goal
of data integrity protection is not to make corruption impossible,
but rather to ensure corruption is detected before it can no longer
be corrected, or at least before corrupt data is used by an
application.
1.3. End-to-end Data Integrity
End-to-end data integrity is a class of operating system, file
system, storage controller, and storage device features that provide
broad protection against unwanted changes to or loss of data that
resides on data storage devices.
Typically, data integrity is verified at individual steps in a data
flow using techniques such as parity. This provides isolated
protection during particular transfer operations or at best between
adjacent nodes in an I/O path.
In contrast, end-to-end protection guarantees data can be verified at
every step as data flows from an application through a file system
and storage controllers, via a variety of communication protocols, as
it is stored on storage devices, and when it is read back from
storage.
1.4. The Case For End-To-End Data Integrity Management
A modern NFSv4 deployment may already provide some degree of data
protection to in-transit data.
o The use of RPCSEC GSS Kerberos 5i and 5p [RFC2203] can protect
NFSv4 requests from tampering or corruption during network
transfer.
o An NFSv4 fileserver can employ RAID or block devices that store
additional checksum data per logical block, in order to detect
media failure.
o An advanced file system on an NFSv4 fileserver may protect data
integrity by storing multiple copies of data or by separately
storing additional checksums.
To demonstrate why end-to-end data integrity protection provides a
stronger integrity guarantee than protection provided by the single-
domain mechanisms above, consider the following cases:
o On an NFSv4 fileserver, suppose a device driver bug causes a write
operation to DMA the wrong memory pages to durable storage. The
written data is incorrect, but the DMA transport checksum matches
it. The DMA operation completes without reporting an error, and
upper layers discard the original copy of the data.
o Suppose an operating system or file system bug allows
modifications to a page after it has been prepared for I/O and a
checksum has been generated. The page and checksum are then
written to storage. The written data does not represent the data
originally written by the application, and the accompanying stored
checksum does not match it. The write operation completes without
reporting an error, and upper layers discard the original copy of
the data.
o Suppose a RAID array on an NFSv4 server receives incorrect data
for some reason. The array will generate RAID parity blocks from
the incorrect data. The data is incorrect, but the accompanying
parity matches it. The write operation completes without
reporting an error, and upper layers discard the original copy of
the data.
o Suppose an application is writing data repeatedly to the same area
of a file stored on an NFSv4 fileserver. Retransmits of an old
write request become indistinguishable from new write requests to
the same region. The written data always matches its appliction-
generated checksum, but a replayed retransmission can overwrite
newer data, and upper layers discard the original copy of the
data.
o Suppose a middle box is caching NFSv4 write requests on behalf of
a number of NFSv4 clients. The wsize in effect for the clients
does not have to match the wsize in effect between the middle box
and the NFSv4 server. If the middle box fragments and reassembles
the write requests incorrectly, the write requests appear to
complete, but incorrect data is written to the NFSv4 server, and
the clients discard the original copy of the data.
In none of these cases is corruption identified while the original
data remains available to correct the situation. An end-to-end
solution could have caught and reported each of these, allowing the
data's originator to retry or report failure before the data loss is
compounded.
1.5. Terminology
Buffer separation: Protection information and the data it protects
is contained in distinct buffers which have independent paths to
durable storage.
Checksum: A value which is used to detect corruption in a collection
of data. It is usually computed by applying a simple operation
(such as addition) to each element of the collection. Computing a
checksum is a low-overhead operation, but is less effective at
helping detect and correct errors than a CRC.
Cyclic Redundancy Check: A value which is used to detect corruption
in a collection of data. It is based on a linear block error-
correcting code. The hash function's generator polynomial is
chosen to maximize error detection, and is typically more
successful than either simple parity or a checksum. A CRC is
efficient to compute with dedicated hardware, but can be expensive
to compute in software.
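The practical difference between these two mechanisms can be shown
with a short, non-normative Python sketch: a 16-bit additive checksum
does not notice two transposed octets, while a CRC (here the stock
CRC-32 from zlib, standing in for the CRCs discussed above) does.

```python
import zlib

def additive_checksum16(data):
    # Sum each octet, truncated to 16 bits: cheap to compute, but
    # insensitive to the order of the octets it covers.
    return sum(data) & 0xFFFF

original = b"protection interval payload"
# Transpose the first two octets to simulate a reordering corruption.
corrupted = bytearray(original)
corrupted[0], corrupted[1] = corrupted[1], corrupted[0]
corrupted = bytes(corrupted)

# The additive checksum is blind to the transposition...
assert additive_checksum16(original) == additive_checksum16(corrupted)
# ...while a CRC detects it.
assert zlib.crc32(original) != zlib.crc32(corrupted)
```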
Data corruption: Any undesired alteration of data. Data corruption
can be "detected" or "undetected" (silent).
Data integrity: A database term used here to mean that a collection
of data is exactly the same before and after processing,
transmission, or storage.
Data integrity verification failure: A node in an I/O path has
failed to verify protection information associated with some data.
This can be because the data or the protection information has
been corrupted, or the node is malfunctioning.
Integrity metadata: See "Protection information."
Latent corruption: Data corruption that is discovered long after
data was originally recorded on a storage device.
Lost write: A write operation to a storage device which behaves as
if the target data is stored durably, but in fact the data is
never recorded.
Misdirected write: A write operation that causes the target data to
be written to a different location on a storage device than was
intended.
Parity: A single bit which represents the evenness or oddness of a
collection of data. Checking a parity bit can reveal and help
correct data corruption. Parity is easy to compute and requires
little space to store, but is generally less effective than other
methods of error correction. "Parity" can also refer to checksum
data in a RAID.
Protection envelope: A set of nodes in an I/O system which together
guarantee data integrity from input to output.
Protection information: Information about a collection of
application data that allows detection and possibly correction of
corruption. This can take the form of parity, a checksum, a CRC
value, or something more complex. Also the formal name of an end-
to-end data integrity mechanism adopted by T10 for SCSI block
storage devices.
Protection interval: A collection of application data that is
protected from corruption. The collection must be neither larger
nor smaller than what can be written atomically to durable storage.
Typically there is a one-to-one mapping between a protection
interval and a logical block on a storage device. However, a
device with a large sector size may store multiple protection
intervals per sector, to maintain adequate protection with limited
protection information.
Protection type: An enumerated value that indicates the size,
contents, and interpretation of fields containing protection
information.
2. Protocol
This section prescribes changes to the NFSv4 XDR specification
[PROVISIONAL-NFSV42-XDR] to enable the conveyance of Protection
Information via NFSv4. Therefore, an NFSv4.2 implementation is a
necessary starting point. These changes are compatible with the
NFSv4 minor versioning rules described in the NFSv4.2 specification.
The RPC protocol used by NFSv4 is ONC RPC [RFC5531]. The data
structures used for the parameters and return values of these
procedures are expressed in this document in XDR [RFC4506].
2.1. Protection types
A new fixed-size structure is defined that encodes the format and
content of Protection Information. This includes the meaning of
tags, the size of the protection interval, and so on.
For NFS, we need to go beyond existing SCSI protection types and
consider cryptographic integrity types (i.e. the ability to guarantee
integrity of data-at-rest over time by means of digital signature).
To begin, we provide NFSv4 equivalents for a few typical T10 PI
protection types [T10-SBC2], in addition to a few new protection
types:
enum nfs_protection_type4 {
NFS_PI_TYPE1 = 1,
NFS_PI_TYPE2 = 2,
NFS_PI_TYPE3 = 3,
NFS_PI_TYPE4 = 4,
NFS_PI_TYPE5 = 5,
};
struct nfs_protection_info4 {
nfs_protection_type4 pi_type;
uint32_t pi_intvl_size;
uint64_t pi_other_data;
};
The pi_type field reports the protection type. The pi_intvl_size
field reports the supported protection interval size, in octets. The
meaning of the content of the pi_other_data field depends on the
protection type.
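As a non-normative illustration, the fixed sixteen-octet XDR encoding
of this structure can be sketched in Python (the function name is
illustrative only):

```python
import struct

NFS_PI_TYPE3 = 3   # value from the nfs_protection_type4 enum above

def encode_nfs_protection_info4(pi_type, pi_intvl_size, pi_other_data):
    # XDR rules: an enum encodes as a 4-octet big-endian integer, a
    # uint32 as 4 octets, and a uint64 (hyper) as 8 octets, giving a
    # fixed 16-octet encoding for the structure.
    return struct.pack(">iIQ", pi_type, pi_intvl_size, pi_other_data)

encoded = encode_nfs_protection_info4(NFS_PI_TYPE3, 512, 1)
assert len(encoded) == 16
```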
2.1.1. Protection Type Table
The following table specifies tag sizes and contents, and other
features of each protection type.
+------------+----------------------+--------------------+----------+
| NFS | Description | pi_other_data | Comment |
| Protection | | | |
| Type | | | |
+------------+----------------------+--------------------+----------+
| 1 | PI field is | Always zero | NFS |
| | application-owned; | | "native" |
| | 8-byte protection | | PI |
| | information field | | |
| | containing a SHA-1 | | |
| | hash of the | | |
| | protection interval | | |
| | | | |
| 2 | PI field is | Zero means the | NFS |
| | application-owned; | RSASSA-PKCS1-v1_5 | "native" |
| | 8-byte protection | signing scheme | PI |
| | information field | [RFC3447] is used | |
| | containing a hash of | | |
| | the protection | | |
| | interval signed by a | | |
| | private key. A | | |
| | public key is | | |
| | provided separately | | |
| | so the server can | | |
| | verify incoming | | |
| | protection intervals | | |
| | | | |
| 3 | 8-byte protection | 1 if the PI field | T10 PI |
| | information field | is | Type 1 |
| | containing 2-byte | application-owned; | |
| | guard tag (CRC-16 | otherwise zero | |
| | checksum of | | |
| | protection | | |
| | interval), 2-byte | | |
| | application tag | | |
| | (user defined), and | | |
| | 4-byte reference tag | | |
| | (LO 32-bits of LBA) | | |
| | | | |
| 4 | 8-byte protection | 1 if the PI field | T10 PI |
| | information field | is | Type 2 |
| | containing 2-byte | application-owned; | |
| | guard tag (CRC-16 | otherwise zero | |
| | checksum of | | |
| | protection | | |
| | interval), 2-byte | | |
| | application tag | | |
| | (user defined), and | | |
| | 4-byte reference tag | | |
| | (*) | | |
| | | | |
| 5 | PI field is | 1 if the PI field | T10 PI |
| | application-owned; | is | Type 3 |
| | 8-byte protection | application-owned; | |
| | information field | otherwise zero | |
| | containing 2-byte | | |
| | guard tag (CRC-16 | | |
| | checksum of | | |
| | protection | | |
| | interval), 2-byte | | |
| | application tag | | |
| | (user defined), and | | |
| | 4-byte reference tag | | |
| | (user defined) | | |
+------------+----------------------+--------------------+----------+
The protection type enumerator is key to the extensibility of the
NFSv4 end-to-end data integrity feature. A future specification can
introduce new protection types that support Advanced Format drives,
or types for storage that does not support application-owned
Protection Information fields, for example. To manage this ongoing
process, the contents of this table should be administered by IANA.
[*] T10 PI Type 2 (NFS protection type 4) uses an indirect LBA in
its reference tag. In this case, the I/O operation passes the
reference tag value for the first protection interval in a separate
operation. The reference tag in the first protection field must
match this value. The reference tags in subsequent fields are this
value plus (n-1).
It is not yet clear how type 2 works without chaining read and write
requests. When an application writes a series of unrelated blocks,
what should the reference LBNs be? When an application reads
randomly, what reference LBNs should it expect?
2.2. GETATTR
A new read-only per-FSID GETATTR attribute is defined to request the
list of protection types supported on a particular FSID.
const FATTR4_PROTECTION_TYPES = 82;
The reply data type follows.
typedef nfs_protection_info4 fattr4_protection_info<>;
2.3. INIT_PROT_INFO - Initialize Protection Information
Some protection types require additional data in order for the
storage to perform integrity verification. This data is transmitted
by a new operation.
2.3.1. ARGUMENTS
struct INITPROTINFO4args {
nfs_protection_type4 ipi_type;
opaque ipi_data<>;
};
2.3.2. RESULTS
struct INITPROTINFO4res {
nfsstat4 status;
};
2.3.3. DESCRIPTION
This operation is used to transmit initialization data in preparation
for a stream of integrity-protected I/O requests. The exact content
of the ipi_data field depends on the protection type specified in the
ipi_type field.
For example, for NFS_PI_TYPE2, the ipi_data field might contain a
binary format public key that can be used to validate the signature
of incoming protection intervals.
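As a non-normative illustration, the XDR encoding of these arguments
(an enum followed by a variable-length opaque, padded to a four-octet
boundary) might be sketched in Python as follows; the key blob shown
is a stand-in only:

```python
import struct

NFS_PI_TYPE2 = 2

def xdr_opaque(data):
    # XDR variable-length opaque: a 4-octet length, the octets
    # themselves, then zero padding to a 4-octet boundary.
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad

def encode_initprotinfo4args(ipi_type, ipi_data):
    # enum (4 octets, big-endian) followed by the opaque blob.
    return struct.pack(">i", ipi_type) + xdr_opaque(ipi_data)

# The blob below is illustrative; a real client might pass a
# DER-encoded public key here.
args = encode_initprotinfo4args(NFS_PI_TYPE2, b"\x00" * 14)
assert len(args) % 4 == 0
```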
2.4. New data content type
NFSv4.2 introduces a mechanism that can be used to extend the types
of data that can be read and written by a client. To convey
protection information we extend the data_content4 enum.
enum data_content4 {
NFS4_CONTENT_DATA = 0,
NFS4_CONTENT_APP_DATA_HOLE = 1,
NFS4_CONTENT_HOLE = 2,
NFS4_CONTENT_PROTECTED_DATA = 3,
};
struct data_protected4 {
nfs_protection_info4 pd_type;
offset4 pd_offset;
bool pd_allocated;
opaque pd_info<>;
opaque pd_data<>;
};
The pd_offset field specifies the byte offset where data should be
read or written. The number of bytes to write is specified by the
size of the pd_data array.
The pd_allocated field is equivalent to the d_allocated field in the
data4 type specified in [PROVISIONAL-NFSV42].
The opaque pd_info field contains a packed array of fixed-size
protection fields. The length of the array must be consistent with
the pd_offset and count arguments specified for the data range of the
operation. The size and format of the contents of each field in the
array is determined by the value of the pd_type field.
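The required consistency check can be sketched non-normatively in
Python, assuming the eight-octet protection fields used by the
protection types in this document (the function name is illustrative
only):

```python
def expected_pd_info_length(count, pi_intvl_size, pi_field_size=8):
    # The data range must cover whole protection intervals; each
    # interval contributes exactly one fixed-size field to pd_info.
    if count % pi_intvl_size != 0:
        raise ValueError("range is not aligned to the protection interval")
    return (count // pi_intvl_size) * pi_field_size

# A 4096-octet range of 512-octet intervals needs eight 8-octet
# fields, or 64 octets of protection information.
assert expected_pd_info_length(4096, 512) == 64
```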
The opaque pd_data field contains the normal data being conveyed in
this operation.
2.5. READ_PLUS
The READ_PLUS operation reads protection information using the
NFS4_CONTENT_PROTECTED_DATA content type.
union read_plus_content switch (data_content4 rpc_content) {
case NFS4_CONTENT_DATA:
data4 rpc_data;
case NFS4_CONTENT_APP_DATA_HOLE:
app_data_hole4 rpc_adh;
case NFS4_CONTENT_HOLE:
data_info4 rpc_hole;
case NFS4_CONTENT_PROTECTED_DATA:
data_protected4 rpc_pdata;
default:
void;
};
The offset and length arguments of the READ_PLUS operation
(rpa_offset and rpa_count) determine the data byte range covered by
the protection information and normal data returned in each request.
For example, suppose the protection type mandated 8-byte protection
fields and a 512-byte protection interval. A READ_PLUS requesting
protection information for a 4096-byte range of a file would receive
an array of eight 8-byte protection fields, or 64 bytes.
2.6. WRITE_PLUS
The WRITE_PLUS operation writes protection information using the
NFS4_CONTENT_PROTECTED_DATA content type.
union write_plus_arg4 switch (data_content4 wpa_content) {
case NFS4_CONTENT_DATA:
data4 wpa_data;
case NFS4_CONTENT_APP_DATA_HOLE:
app_data_hole4 wpa_adh;
case NFS4_CONTENT_HOLE:
data_info4 wpa_hole;
case NFS4_CONTENT_PROTECTED_DATA:
data_protected4 wpa_pdata;
default:
void;
};
The offset and length arguments of the WRITE_PLUS operation
(pd_offset and the size of pd_data) determine the data byte range
covered by the protection information.
For example, suppose the protection type mandated 8-byte protection
fields and a 512-byte protection interval. A WRITE_PLUS writing
protection information to a 4096-byte range of a file would send an
array of eight 8-byte protection fields, or 64 bytes.
2.7. Error codes
New error codes are introduced to allow an NFSv4 server to convey
integrity-related failure modes to clients. These new codes include
(but are not limited to) the following:
enum nfsstat4 {
    ...
    NFS4ERR_PROT_NOTSUPP = 10200,
    NFS4ERR_PROT_INVAL   = 10201,
    NFS4ERR_PROT_FAIL    = 10202,
    NFS4ERR_PROT_LATFAIL = 10203,
};
NFS4ERR_PROT_NOTSUPP: The protection type specified in an operation
is not supported for the FSID upon which the file resides.
NFS4ERR_PROT_INVAL: The protection information passed as an argument
is garbled (cf. NFS4ERR_BADXDR). This error code MUST be returned
if the offset and length of read or written data do not align with
the protection interval specified by the protection type.
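The argument checks a server might perform before verifying any
protection information can be sketched non-normatively in Python;
the function name and the ordering of the checks are illustrative
only:

```python
NFS4_OK = 0
NFS4ERR_PROT_NOTSUPP = 10200
NFS4ERR_PROT_INVAL = 10201

def check_protected_io(pi_type, supported_types, offset, count, intvl):
    # Reject unsupported protection types, then ranges that do not
    # align with the protection interval, before verifying any tags.
    if pi_type not in supported_types:
        return NFS4ERR_PROT_NOTSUPP
    if offset % intvl != 0 or count % intvl != 0:
        return NFS4ERR_PROT_INVAL
    return NFS4_OK

assert check_protected_io(3, {1, 3}, 0, 4096, 512) == NFS4_OK
assert check_protected_io(2, {1, 3}, 0, 4096, 512) == NFS4ERR_PROT_NOTSUPP
assert check_protected_io(3, {1, 3}, 100, 4096, 512) == NFS4ERR_PROT_INVAL
```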
NFS4ERR_PROT_FAIL: During a WRITE_PLUS operation, the protection
information does not verify the written data. If this was an
UNSTABLE WRITE_PLUS, the client should retry the operation using
FILE_SYNC so the server can report precisely where the data writes
are failing.
NFS4ERR_PROT_LATFAIL: During a READ_PLUS operation, the protection
information does not verify the read data. This error code reports
a verification failure that occurred before the data arrived at an
NFSv4 client. The client is not required to read protection
information to see this error.
If data integrity verification fails while a server is pre-
fetching data, the failure cannot be reported until the client
reads the section of the file where the failure occurs. Pre-
fetched data might never be read by a client, therefore a data
integrity verification failure that occurred while pre-fetching may
never be reported to an NFS client or an application.
3. Protocol Design Considerations
3.1. Protection Envelopes
We explore protection envelopes that might appear in a typical NFSv4
deployment, and design an architecture that guarantees unbroken data
integrity protection through each of these envelopes.
In addition, it is useful to permit varying degrees of server,
client, and application participation in a data protection scheme.
We can define protection envelopes of varying circumference that
allow implementations and deployments to choose a level of
complexity, data protection, and performance impact that suits their
applications.
The following are presented in order of smallest to largest
circumference. To enable end-to-end protection, each protection
envelope in this list depends on having the previous envelope in
place.
Server storage: The storage subsystem on an NFSv4 server is below
the physical filesystems on that server. If a data integrity
mechanism is available on the block storage, the physical
filesystem may or may not choose to use it. Data integrity
verification failures are reflected to NFS clients as simple I/O
errors.
Server filesystem: The physical filesystem on an NFSv4 server may
provide a data integrity mechanism based on its own checksumming
scheme, or by using a standard block storage mechanism such as T10
PI/DIX [DIX]. The NFSv4 service on that system may or may not
choose to use the filesystem's integrity service. Data integrity
verification failures are reflected to NFS clients as simple I/O
errors.
Server: An NFSv4 server may choose to use the local filesystem's
data integrity mechanism, but not to advertise a data integrity
mechanism via NFSv4. Data integrity verification failures are
reflected to NFS clients as simple I/O errors.
Client-server: If an NFSv4 server advertises data integrity
mechanisms via NFSv4, an NFSv4 client may choose to use NFSv4 data
integrity protection without advertising the capability to
applications running on it. It may also choose not to use NFSv4
data integrity protection at all. Data integrity verification
failures are reflected to applications as simple I/O errors.
Application-client-server: Suppose that an NFSv4 client chooses to
use data integrity protection via NFSv4 and that the capability
is advertised to applications. Applications may or may not choose
to use the capability. An NFSv4 client uses on-the-wire data
integrity when an application chooses to use the capability, but
may or may not use it when the application chooses not to use it.
Data integrity verification failures are reflected to applications
as is. This is full end-to-end data integrity protection via
NFSv4.
Note that the "server" envelope is not externally distinguishable
from a server that does not support data integrity protection at
all, although it provides somewhat stronger data integrity
guarantees than such a server.
This is a way to introduce stronger data integrity without requiring
a large deployment of NFSv4 clients capable of integrity
verification. Or, stronger data integrity can be introduced to
legacy NFS environments that have no protocol mechanisms for
extending the protection envelope past the server.
The "application-client-server" envelope illustrates that, on a
protection-enabled file system, data integrity verification can be
used on a per-file basis. Applications may choose to use protection
for some files and not others. Some applications may choose to use
protection, and some applications may choose not to use it.
Note that in each case, data integrity protection is available to the
edge of the farthest protection envelope. Data integrity is
protected only after the data arrives at a protection envelope
boundary, and before it leaves that boundary. Legacy NFS clients
continue to access protected data on a server, but are unaware of
data integrity verification failures except as generic I/O errors.
The client-cache-server case is considered separately. The "cache"
node in this case may be a dedicated NFSv4 cache, a caching peer-to-
peer NFSv4 client, or a pNFS metadata server. Separate protection
envelopes exist between an NFSv4 client and an intermediate cache,
and between that cache and the NFSv4 server where the protected data
resides.
3.2. Protecting Holes
NFSv4 minor version 2 [PROVISIONAL-NFSV42] exposes clients to certain
mechanics of the underlying file systems on servers which allow more
direct control of the storage space utilized by files. The goal of
these new features is to economize the transfer and storage of file
data. These new features include support for reading sparse files
efficiently, space reservation, and punching holes (similar to a TRIM
or DISCARD operation on a block device) in files.
A hole is an area of a file that can be represented in a file system
by having no backing storage. By definition any read of that region
of the file returns a range of bytes containing zero. Any write to
that region allocates fresh backing storage normally.
NFSv4.2 extends this notion to allow NFSv4 clients to specify a
pattern containing non-zero bytes to be returned when reading that
region of a file. The protocol feature is independent of how an
NFSv4 server's file system chooses to store this data. In fact a
server's file system is free to simply store zeroes or a byte pattern
on disk as raw data rather than in some optimized fashion.
If an NFSv4 server's file system does use an optimized storage
method, a decision must be made about whether accompanying PI is
needed. For a plain hole (where zero is always returned by a raw
data read operation) the intention is that there is no backing
storage there, thus PI is not meaningful. However a read operation
that requests protection information must return something
meaningful. For protection types that mandate only a checksum guard
tag (and do not store either reference or application tag data), a
checksum for each protection interval can be generated on the server
during a normal read operation, or on the client if a sparse read is
used.
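The guard computation described above can be sketched as follows.
This is a minimal illustration, assuming a T10 DIF-style 16-bit CRC
guard tag (polynomial 0x8BB7, as used by [T10-SBC2] and [DIX]);
function and parameter names are illustrative only. Note that the
CRC of an all-zero protection interval is zero, so guard tags for a
plain hole can be synthesized cheaply on either the server or the
client.

```python
def crc16_t10dif(data: bytes) -> int:
    """Compute the 16-bit CRC used as the T10 DIF guard tag.

    Parameters follow CRC-16/T10-DIF: polynomial 0x8BB7, initial
    value 0x0000, no bit reflection, no final XOR.  Names here are
    illustrative; a real implementation would use a table-driven or
    hardware-assisted CRC.
    """
    crc = 0x0000
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

# A plain hole reads back as zeroes, so its synthesized guard tag
# for a 512-byte protection interval is simply zero.
hole_guard = crc16_t10dif(bytes(512))
```

Because the guard for a zero-filled interval is a constant, a server
(or a client performing a sparse read) need not actually run the CRC
over each zeroed interval of a hole.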
For a data hole (where some non-zero pattern is returned by a raw
read operation), storing PI is optional, and depends on whether the
protection type requires the storage to return an intact application
tag. Without the requirement of storing the application tag, the
file system could discard the PI after a write operation, and
recompute it from the pattern on a read operation. Or, it could
store the PI information as part of the pattern metadata.
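Recomputing PI from the pattern might be sketched as below. This is
a hedged illustration: CRC-32 stands in for whichever guard function
the deployed protection type mandates, and the function name and
interval size are assumptions, not protocol elements.

```python
import zlib

def guards_for_data_hole(pattern: bytes, nbytes: int,
                         interval: int = 512) -> list:
    """Synthesize per-interval guard values for a data hole by tiling
    the stored pattern across the requested byte range, instead of
    reading PI from backing storage (which the file system may have
    discarded after the original write)."""
    # Tile the pattern to cover the full range, then truncate.
    tiled = (pattern * (nbytes // len(pattern) + 1))[:nbytes]
    return [zlib.crc32(tiled[off:off + interval])
            for off in range(0, nbytes, interval)]
```

When the pattern length divides the protection interval size, every
interval of the hole carries the same guard value, so the computation
can be done once and reused.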
3.3. Multi-server Considerations
The NFSv4 protocol provides several mechanisms for NFSv4 servers to
co-operate in ways that enhance performance, scalability, and data
availability. An NFSv4 client can access the same data serially from
individual NFSv4 servers when a file system is replicated. A file
system can be migrated between NFSv4 servers transparently to
clients. Or a file system can be constructed from files that reside
in parts on several NFSv4 servers.
To allow coherent use of a data integrity mechanism:
o Each NFSv4 Data Server hosting a particular file system MUST
support the same protection types.
o Each replica of a file system MUST support the same protection
types.
o The destination of a file system migration MUST support all
protection types supported by the source, and the transitioned
file system MUST use the same protection type it did on the source
server.
Enforcing these mandates is likely outside the purview of the NFSv4
protocol, particularly because no mechanism for transitioning file
systems is set out by any NFSv4 protocol specification. However,
enforcing such mandates could be built into administrative tools.
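An administrative tool might implement the checks behind these
mandates along the following lines. This is a minimal sketch under
assumed names; it is not part of the protocol.

```python
def migration_permitted(source_types: set, dest_types: set,
                        active_type: str) -> bool:
    """Check the migration mandates: the destination server must
    support every protection type the source supports, and the file
    system must be able to keep using its current protection type
    after the transition."""
    return (source_types.issubset(dest_types)
            and active_type in dest_types)

def replicas_coherent(replica_type_sets: list) -> bool:
    """Check the replication mandate: every replica (or every Data
    Server hosting the file system) must support the same set of
    protection types."""
    return len({frozenset(s) for s in replica_type_sets}) <= 1
```

Such checks would run before a migration or replica addition is
committed, failing the administrative operation rather than leaving
clients with an incoherent protection configuration.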
3.3.1. pNFS and Protection Information
There has been some uncertainty about whether Protection Information
should be considered metadata or data. pNFS has a convenient
operational definition of data and metadata: if it's data, it goes to
the Data Server; if it's metadata, it goes to the Metadata Server.
Protection Information belongs with the data it protects, which is
written to Data Servers. Therefore Protection Information is data.
If a client ever writes Protection Information to a Metadata Server,
such Protection Information will be forwarded to an appropriate Data
Server for storage.
For the file layout type, which uses NFSv4 when communicating with
Data Servers, all protection types have protocol support for
Protection Information. For other layout types, support may or may
not be available in their respective data protocols. Layout
implementations are not guaranteed to support every protection type.
3.3.2. Server-to-server copy
NFSv4 minor version 2 [PROVISIONAL-NFSV42] introduces a new multi-
server feature known as server-to-server copy. Clients can offload
to the servers the data movement involved in copying part or all of a
file. The destination file is recognized as a separate entity (i.e.,
it has a unique file handle), not as a replica of the original file.
As such, the destination file may be stored in a file system that has
a different protection type than the source file, or may not be
protected at all. If the destination file system supports the same
protection type as the source file system, the copy offload operation
MUST copy Protection Information associated with the source file to
the destination file.
Server implementors MAY provide data integrity verification on both
ends of the offloaded copy operation. A server MUST report data
integrity verification failures that occur during an offloaded copy
operation.
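Verification at either end of an offloaded copy might be sketched as
follows. This is an illustration only: CRC-32 stands in for the
guard function of the deployed protection type, and the function name
and interval size are assumptions.

```python
import zlib

def verify_copied_intervals(data: bytes, guards: list,
                            interval: int = 512) -> list:
    """Recompute a checksum over each protection interval of the
    copied data and compare it against the guard carried in the
    accompanying Protection Information.

    Returns the indices of intervals that failed verification; a
    server MUST report any such failures to the client."""
    failures = []
    for i, guard in enumerate(guards):
        chunk = data[i * interval:(i + 1) * interval]
        if zlib.crc32(chunk) != guard:
            failures.append(i)
    return failures
```

Running this on both the source (before transmission) and the
destination (after the data is durable) gives the end-to-end property
the offloaded copy would otherwise lose.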
4. Security Considerations
A man-in-the-middle attacker can replace both the data and the
integrity metadata in any NFSv4 request that is sent in the clear.
Therefore, when a data integrity protection mechanism is deployed on
an untrusted network, it is strongly urged that a cryptographically
secure integrity-checking RPC transport, such as RPCSEC_GSS with
Kerberos 5 integrity (krb5i) [RFC2203], be used to convey NFSv4
traffic on open networks.
5. IANA Considerations
This document currently does not require actions by IANA. However,
see Section 2.1.
6. Acknowledgements
The author of this document gratefully acknowledges the contributions
of Martin K. Petersen, David Noveck, and Spencer Shepler. Bill
Baker, Chris Mason, and Tom Haynes also provided guidance and
suggestions.
7. References
7.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2203] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
Specification", RFC 2203, September 1997.
[RFC3447] Jonsson, J. and B. Kaliski, "Public-Key Cryptography
Standards (PKCS) #1: RSA Cryptography Specifications
Version 2.1", RFC 3447, February 2003.
[RFC4506] Eisler, M., "XDR: External Data Representation Standard",
STD 67, RFC 4506, May 2006.
[RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol
Specification Version 2", RFC 5531, May 2009.
7.2. Informative References
[DIX] Petersen, M., "I/O Controller Data Integrity Extensions",
November 2009, <http://oss.oracle.com/~mkp/docs/dif.pdf>.
[PROVISIONAL-NFSV42]
Haynes, T., Ed., "NFS Version 4 Minor Version 2",
March 2013, <http://datatracker.ietf.org/doc/
draft-ietf-nfsv4-minorversion2>.
[PROVISIONAL-NFSV42-XDR]
Haynes, T., Ed., "NFS Version 4 Minor Version 2 Protocol
External Representation Standard (XDR) Description",
March 2013, <https://datatracker.ietf.org/doc/
draft-ietf-nfsv4-minorversion2-dot-x>.
[T10-SBC2]
Elliott, R., Ed., "ANSI INCITS 405-2005, Information
Technology - SCSI Block Commands - 2 (SBC-2)",
November 2004.
Author's Address
Charles Lever
Oracle Corporation
1015 Granger Avenue
Ann Arbor, MI 48104
US
Phone: +1 734 274 2396
Email: chuck.lever@oracle.com