NFSv4 Working Group D. Hildebrand Internet Draft M. Eshel Intended status: Standards Track IBM Almaden Expires: June 2011 December 6, 2010 Simple and Efficient Read Support for Sparse Files draft-hildebrand-nfsv4-read-sparse-02.txt Status of this Memo This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html This Internet-Draft will expire on June 6, 2011. Hildebrand, et al. Expires June 6, 2011 [Page 1] Internet-Draft Read Support for Sparse Files December 2010 Copyright Notice Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the BSD License. This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English. Abstract This document proposes a new READPLUS operation for NFSv4.2 to support efficient reading of sparse files, which are growing in the data center due to the increasing number of virtual disk images. READPLUS has all the features and functionality of READ, but has an extensible return value that includes an easy and efficient way for administrators to copy and manage sparse files without wasting disk space or transferring data unnecessarily. Table of Contents 1. Introduction...................................................3 1.1. Requirements Language.....................................4 2. Terminology....................................................4 3. Applications and Sparse Files..................................4 4. Overview of Sparse Files and NFSv4.............................5 5. Definition of READPLUS.........................................6 5.1. ARGUMENTS.................................................7 Hildebrand, et al. Expires June 6, 2011 [Page 2] Internet-Draft Read Support for Sparse Files December 2010 5.2. RESULTS...................................................7 5.3. DESCRIPTION...............................................8 5.4. IMPLEMENTATION............................................9 5.4.1. Additional pNFS Implementation Information..........10 5.5. READPLUS with Sparse Files Example.......................11 6. Related Work..................................................12 7. Security Considerations.......................................12 8. IANA Considerations...........................................12 9. References....................................................12 9.1. Normative References.....................................12 9.2. Informative References...................................13 10. Acknowledgments..............................................13 1. Introduction NFS is now used in many data centers as the sole or primary method of data access. Consequently, more types of applications are using NFS than ever before, each with their own requirements and generated workloads. As part of this, sparse files are increasing in number while NFS continues to lack any specific knowledge of a sparse file's layout. This document puts forth a proposal for the NFSv4.2 protocol to support efficient reading of sparse files. A sparse file is a common way of representing a large file without having to reserve disk space for it. Consequently, a sparse file uses less physical space than its size indicates. This means the file contains 'holes', byte ranges within the file that contain no data. Most modern file systems support sparse files, including most UNIX file systems and NTFS, but notably not Apple's HFS+. Common examples of sparse files include VM OS/disk images, database files, log files, and even checkpoint recovery files most commonly used by the HPC community. If an application reads a hole in a sparse file, the file system must returns all zeros to the application. For local data access there is little penalty, but with NFS these zeroes must be transferred back to the client. If an application uses the NFS client to read data into memory, this wastes time and bandwidth as the application waits for the zeroes to be transferred. Once the zeroes arrive, they then steal memory or cache space from real data. To make matters worse, if an application then proceeds to write data to another file system, the zeros are written into the file, expanding the sparse file into a full sized regular file. Beyond wasting disk space, this can actually prevent large sparse files from ever being copied to another storage location due to space limitations. Hildebrand, et al. Expires June 6, 2011 [Page 3] Internet-Draft Read Support for Sparse Files December 2010 This document adds a new READPLUS operation to efficiently read from sparse files by avoiding the transfer of all zero regions from the server to the client. READPLUS supports all the features of READ but includes a minimal extension to support sparse files. In addition, the return value of READPLUS is now compatible with NFSv4.1 minor versioning rules and could support other future extensions without requiring yet another operation. READPLUS is guaranteed to perform no worse than READ, and can dramatically improve performance with sparse files. READPLUS does not depend on pNFS protocol features, but can be used by pNFS to support sparse files. The XDR description is provided in this document in a way that makes it simple for the reader to extract into a ready to compile form. The reader can feed this document into the following shell script to produce the machine readable XDR description of the metadata layout: #!/bin/sh grep "^ *///" | sed 's?^ */// ??' | sed 's?^.*///??' I.e. if the above script is stored in a file called "extract.sh", and this document is in a file called "spec.txt", then the reader can do: sh extract.sh < spec.txt > md.x The effect of the script is to remove leading white space from each line of the specification, plus a sentinel sequence of "///". 1.1. Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [1]. 2. Terminology o Regular file: An object of file type NF4REG or NF4NAMEDATTR. o Sparse File. A Regular file that contains one or more Holes. o Hole. A byte range within a Sparse file that contains regions of all zeroes. For block-based file systems, this could also be an unallocated region of the file. 3. Applications and Sparse Files Applications may cause an NFS client to read holes in a file for several reasons. This section describes three different application Hildebrand, et al. Expires June 6, 2011 [Page 4] Internet-Draft Read Support for Sparse Files December 2010 workloads that cause the NFS client to transfer data unnecessarily. These workloads are simply examples, and there are probably many more workloads that are negatively impacted by sparse files. The first workload that can cause holes to be read is sequential reads within a sparse file. When this happens, the NFS client may perform read requests ("readahead") into sections of the file not explicitly requested by the application. Since the NFS client cannot differentiate between holes and non-holes, the NFS client may prefetch empty sections of the file. This workload is exemplified by Virtual Machines and their associated file system images, e.g., VMware .vmdk files, which are large sparse files encapsulating an entire operating system. If a VM reads files within the file system image, this will translate to sequential NFS read requests into the much larger file system image file. Since NFS does not understand the internals of the file system image, it ends up performing readahead file holes. The second workload is generated by copying a file from a directory in NFS to either the same NFS server, to another file system, e.g., another NFS or Samba server, to a local ext3 file system, or even a network socket. In this case, bandwidth and server resources are wasted as the entire file is transferred from the NFS server to the NFS client. Once a byte range of the file has been transferred to the client, it is up to the client application, e.g., rsync, cp, scp, on how it writes the data to the target location. For example, cp supports sparse files and will not write all zero regions, whereas scp does not support sparse files and will transfer every byte of the file. The third workload is generated by applications that do not utilize the NFS client cache, but instead use direct I/O and manage cached data independently, e.g., databases. These applications may perform whole file caching with sparse files, which would mean that even the holes will be transferred to the clients and cached. 4. Overview of Sparse Files and NFSv4 This proposal seeks to provide sparse file support to the largest number of NFS client and server implementations, and as such proposes to add a new return code to the mandatory NFSv4.1 READPLUS operation instead of proposing additions or extensions of new or existing optional features (such as pNFS). As well, this document seeks to ensure that the proposed extensions are simple and do not transfer data between the client and server Hildebrand, et al. Expires June 6, 2011 [Page 5] Internet-Draft Read Support for Sparse Files December 2010 unnecessarily. For example, one possible way to implement sparse file read support would be to have the client, on the first hole encountered or at OPEN time, request a Data Region Map from the server. A Data Region Map would specify all zero and non-zero regions in a file. While this option seems simple, it is less useful and can become inefficient and cumbersome for several reasons: o Data Region Maps can be large, and transferring them can reduce overall read performance. For example, VMware's .vmdk files can have a file size of over 100 GBs and have a map well over several MBs. o Data Region Maps can change frequently, and become invalidated on every write to the file. This can result the map being transferred multiple times with each update to the file. For example, a VM that updates a config file in its file system image would invalidate the Data Region Map not only for itself, but for all other clients accessing the same file system image. o Data Region Maps do not handle all zero-filled sections of the file, reducing the effectiveness of the solution. While it may be possible to modify the maps to handle zero-filled sections (at possibly great effort to the server), it is almost impossible with pNFS. With pNFS, the owner of the Data Region Map is the metadata server, which is not in the data path and has no knowledge of the contents of a data region. Another way to handle holes is compression, but this not ideal since it requires all implementations to agree on a single compression algorithm and requires a fair amount of computational overhead. Note that supporting writing to a sparse file does not require changes to the protocol. Applications and/or NFS implementations can choose to ignore WRITE requests of all zeroes to the NFS server without consequence. 5. Definition of READPLUS The section introduces a new read operation, named READPLUS, which allows NFS clients to avoid reading holes in a sparse file. READPLUS is guaranteed to perform no worse than READ, and can dramatically improve performance with sparse files. READPLUS supports all the features of the existing NFSv4.1 READ operation [3] and adds a simple yet significant extension to the format of its response. The change allows the client to avoid returning all zeroes from a file hole, wasting computational and Hildebrand, et al. Expires June 6, 2011 [Page 6] Internet-Draft Read Support for Sparse Files December 2010 network resources and reducing performance. READPLUS uses a new result structure that tells the client that the result is all zeroes AND the byte-range of the hole in which the request was made. Returning the hole's byte-range, and only upon request, avoids transferring large Data Region Maps that may be soon invalidated and contain information about a file that may not even be read in its entirely. A new read operation is required due to NFSv4.1 minor versioning rules that do not allow modification of existing operation's arguments or results. READPLUS is designed in such a way to allow future extensions to the result structure. The same approach could be taken to extend the argument structure, but a good use case is first required to make such a change. 5.1. ARGUMENTS struct READPLUS4args { /* CURRENT_FH: file */ stateid4 stateid; offset4 offset; count4 count; }; 5.2. RESULTS union nfs_readplusreshole switch (holeres4 resop) { CASE HOLE_NOINFO: void; CASE HOLE_INFO: offset4 hole_offset; length4 hole_length; }; union nfs_readplusresok4 switch (readplusrestype4 resop) { CASE READ_OK: opaque data<>; CASE READ_HOLE: nfs_readplusreshole reshole4; }; union READPLUS4res switch (nfsstat4 status) { case NFS4_OK: bool eof; nfs_readresok4 resok4; default: Hildebrand, et al. Expires June 6, 2011 [Page 7] Internet-Draft Read Support for Sparse Files December 2010 void; }; 5.3. DESCRIPTION The READPLUS operation is based upon the NFSv4.1 READ operation [3], and similarly reads data from the regular file identified by the current filehandle. The client provides an offset of where the READPLUS is to start and a count of how many bytes are to be read. An offset of zero means to read data starting at the beginning of the file. If offset is greater than or equal to the size of the file, the status NFS4_OK is returned with nfs_readplusrestype4 set to READ_OK, data length set to zero, and eof set to TRUE. The READPLUS is subject to access permissions checking. If the client specifies a count value of zero, the READPLUS succeeds and returns zero bytes of data, again subject to access permissions checking. In all situations, the server may choose to return fewer bytes than specified by the client. The client needs to check for this condition and handle the condition appropriately. If the client specifies an offset and count value that is entirely contained within a hole of the file, the status NFS4_OK is returned with nfs_readplusresok4 set to READ_HOLE, and if information is available regarding the hole, a nfs_readplusreshole structure containing the offset and range of the entire hole. The nfs_readplusreshole structure is considered valid until the file is changed (detected via the change attribute). The server MUST provide the same semantics for nfs_readplusreshole as if the client read the region and received zeroes; the implied holes contents lifetime MUST be exactly the same as any other read data. If the client specifies an offset and count value that begins in a non-hole of the file but extends into hole the server should return a short read with status NFS4_OK, nfs_readplusresok4 set to READ_OK, and data length set to the number of bytes returned. The client will then issue another READPLUS for the remaining bytes, which the server will respond with information about the hole in the file. If the server knows that the requested byte range is into a hole of the file, but has no further information regarding the hole, it returns a nfs_readplusreshole structure with holeres4 set to HOLE_NOINFO. Hildebrand, et al. Expires June 6, 2011 [Page 8] Internet-Draft Read Support for Sparse Files December 2010 If hole information is available on the server and can be returned to the client, the server returns a nfs_readplusreshole structure with the value of holeres4 to HOLE_INFO. The values of hole_offset and hole_length define the byte-range for the current hole in the file. These values represent the information known to the server and may describe a byte-range smaller than the true size of the hole. Except when special stateids are used, the stateid value for a READPLUS request represents a value returned from a previous byte- range lock or share reservation request or the stateid associated with a delegation. The stateid identifies the associated owners if any and is used by the server to verify that the associated locks are still valid (e.g., have not been revoked). If the read ended at the end-of-file (formally, in a correctly formed READPLUS operation, if offset + count is equal to the size of the file), or the READPLUS operation extends beyond the size of the file (if offset + count is greater than the size of the file), eof is returned as TRUE; otherwise, it is FALSE. A successful READPLUS of an empty file will always return eof as TRUE. If the current filehandle is not an ordinary file, an error will be returned to the client. In the case that the current filehandle represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. For a READPLUS with a stateid value of all bits equal to zero, the server MAY allow the READPLUS to be serviced subject to mandatory byte-range locks or the current share deny modes for the file. For a READPLUS with a stateid value of all bits equal to one, the server MAY allow READPLUS operations to bypass locking checks at the server. On success, the current filehandle retains its value. 5.4. IMPLEMENTATION If the server returns a "short read" (i.e., fewer data than requested and eof is set to FALSE), the client should send another READPLUS to get the remaining data. A server may return less data than requested under several circumstances. The file may have been truncated by another client or perhaps on the server itself, changing the file size from what the requesting client believes to be the case. This would reduce the actual amount of data available to the client. It is possible that the server reduce the transfer size and so return a short read result. Server resource exhaustion may also occur in a short read. Hildebrand, et al. Expires June 6, 2011 [Page 9] Internet-Draft Read Support for Sparse Files December 2010 If mandatory byte-range locking is in effect for the file, and if the byte-range corresponding to the data to be read from the file is WRITE_LT locked by an owner not associated with the stateid, the server will return the NFS4ERR_LOCKED error. The client should try to get the appropriate READ_LT via the LOCK operation before re- attempting the READPLUS. When the READPLUS completes, the client should release the byte-range lock via LOCKU. If another client has an OPEN_DELEGATE_WRITE delegation for the file being read, the delegation must be recalled, and the operation cannot proceed until that delegation is returned or revoked. Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while the delegation remains outstanding. Normally, delegations will not be recalled as a result of a READPLUS operation since the recall will occur as a result of an earlier OPEN. However, since it is possible for a READPLUS to be done with a special stateid, the server needs to check for this case even though the client should have done an OPEN previously. 5.4.1. Additional pNFS Implementation Information With pNFS, the semantics of using READPLUS remains the same. Any data server MAY return a READ_HOLE result for a READPLUS request that it receives. When a data server chooses to return a READ_HOLE result, it has a certain level of flexibility in how it fills out the nfs_readplusreshole structure. 1. For a data server that cannot determine any hole information, the data server SHOULD return HOLE_NOINFO. 2. For a data server that can only obtain hole information for the parts of the file stored on that data server, the data server SHOULD return HOLE_INFO and the byte range of the hole stored on that data server. 3. For a data server that can obtain hole information for the entire file without severe performance impact, it MAY return HOLE_INFO and the byte range of the entire file hole. In general, a data server should do its best to return as much information about a hole as is feasible. In general, pNFS server implementers should try ensure that data servers do not overload the metadata server with requests for information. Therefore, if supplying global sparse information for a file to data servers can Hildebrand, et al. Expires June 6, 2011 [Page 10] Internet-Draft Read Support for Sparse Files December 2010 overwhelm a metadata server, then data servers should use option 1 or 2 above. When a pNFS client receives a READ_HOLE result and a non-empty nfs_readplusreshole structure, it MAY use this information in conjunction with a valid layout for the file to determine the next data server for the next region of data that is not in a hole. 5.5. READPLUS with Sparse Files Example To see how the return value READ_HOLE will work, the following table describes a sparse file. For each byte range, the file contains either non-zero data or a hole. +-------------+-----------+ | Byte-Range | Contents | +-------------+-----------+ | 0-31999 | Non-Zero | | 32K-255999 | Hole | | 256K-287999 | Non-Zero | | 288K-353999 | Hole | | 354K-417999 | Non-Zero | +-------------+-----------+ Under the given circumstances, if a client was to read the file from beginning to end with a max read size of 64K, the following will be the result. This assumes the client has already opened the file and acquired a valid stateid and just needs to issue READPLUS requests. 1. READPLUS(s, 0, 64K) --> NFS_OK, readplusrestype4 = READ_OK, eof = false, data<>[32K]. Return a short read, as the last half of the request was all zeroes. 2. READPLUS(s, 32K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE, nfs_readplusreshole(HOLE_INFO)(32K, 224K). The requested range was all zeros, and the current hole begins at offset 32K and is 224K in length. 3. READPLUS(s, 256K, 64K) --> NFS_OK, readplusrestype4 = READ_OK, eof = false, data<>[32K]. Return a short read, as the last half of the request was all zeroes. 4. READPLUS(s, 288K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE, nfs_readplusreshole(HOLE_INFO)(288K, 66K). Hildebrand, et al. Expires June 6, 2011 [Page 11] Internet-Draft Read Support for Sparse Files December 2010 5. READPLUS(s, 354K, 64K) --> NFS_OK, readplusrestype4 = READ_OK, eof = true, data<>[64K]. 6. Related Work Solaris and ZFS support an extension to lseek(2) that allows applications to discover holes in a file. The values, SEEK_HOLE and SEEK_DATA, allow clients to seek to the next hole or beginning of data, respectively. XFS supports the XFS_IOC_GETBMAP extended attribute, which returns the Data Region Map for a file. Clients can then use this information to avoid reading holes in a file. NTFS and CIFS support the FSCTL_SET_SPARSE attribute, which allows applications to control whether empty regions of the file are preallocated and filled in with zeros or simply left unallocated. 7. Security Considerations The additions to the NFS protocol for supporting sparse file reads does not alter the security considerations of the NFSv4.1 protocol [3]. 8. IANA Considerations There are no IANA considerations in this document. All NFSv4.1 IANA considerations are covered in [3]. 9. References 9.1. Normative References [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [2] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, C., Eisler, M., and D. Noveck, "Network File System (NFS) version 4 Protocol", RFC 3530, April 2003. [3] Shepler, S., Eisler, M., and D. Noveck, "Network File System (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, January 2010. Hildebrand, et al. Expires June 6, 2011 [Page 12] Internet-Draft Read Support for Sparse Files December 2010 9.2. Informative References [4] Shepler, S., Eisler, M., and D. Noveck, "Network File System (NFS) Version 4 Minor Version 1 External Data Representation Standard (XDR) Description", RFC 5662, January 2010. [5] Nowicki, B., "NFS: Network File System Protocol specification", RFC 1094, March 1989. [6] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 Protocol Specification", RFC 1813, June 1995. 10. Acknowledgments This document was prepared using 2-Word-v2.0.template.dot. Valuable input and advice was received from Sorin Faibish, Bruce Fields, Benny Halevy, Trond Myklebust, and Richard Scheffenegger. Hildebrand, et al. Expires June 6, 2011 [Page 13] Internet-Draft Read Support for Sparse Files December 2010 Authors' Addresses Dean Hildebrand IBM Almaden 650 Harry Rd San Jose, CA 95120 Phone: +1 408-927-2013 Email: dhildeb@us.ibm.com Marc Eshel IBM Almaden 650 Harry Rd San Jose, CA 95120 Phone: +1 408-927-1894 Email: eshel@almaden.ibm.com Hildebrand, et al. Expires June 6, 2011 [Page 14]