NFSv4 Working Group                                       D. Hildebrand
Internet Draft                                                 M. Eshel
Intended status: Standards Track                            IBM Almaden
Expires: June 2011                                     December 6, 2010


             Simple and Efficient Read Support for Sparse Files
                 draft-hildebrand-nfsv4-read-sparse-02.txt


Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on June 6, 2011.


Hildebrand, et al.       Expires June 6, 2011                  [Page 1]

Internet-Draft      Read Support for Sparse Files         December 2010


Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the BSD License.

Abstract

   This document proposes a new READPLUS operation for NFSv4.2 to
   support efficient reading of sparse files, which are becoming more
   common in the data center due to the increasing number of virtual
   disk images.  READPLUS has all the features and functionality of
   READ, but has an extensible return value that identifies holes,
   giving administrators an easy and efficient way to copy and manage
   sparse files without wasting disk space or transferring data
   unnecessarily.

Table of Contents


   1. Introduction
      1.1. Requirements Language
   2. Terminology
   3. Applications and Sparse Files
   4. Overview of Sparse Files and NFSv4
   5. Definition of READPLUS
      5.1. ARGUMENTS
      5.2. RESULTS
      5.3. DESCRIPTION
      5.4. IMPLEMENTATION
         5.4.1. Additional pNFS Implementation Information
      5.5. READPLUS with Sparse Files Example
   6. Related Work
   7. Security Considerations
   8. IANA Considerations
   9. References
      9.1. Normative References
      9.2. Informative References
   10. Acknowledgments

1. Introduction

   NFS is now used in many data centers as the sole or primary method of
   data access.  Consequently, more types of applications are using NFS
   than ever before, each with its own requirements and generated
   workloads.  As part of this, sparse files are increasing in number
   while NFS continues to lack any specific knowledge of a sparse file's
   layout.  This document puts forth a proposal for the NFSv4.2 protocol
   to support efficient reading of sparse files.

   A sparse file is a common way of representing a large file without
   having to reserve disk space for it.  Consequently, a sparse file
   uses less physical space than its size indicates.  This means the
   file contains 'holes', byte ranges within the file that contain no
   data.  Most modern file systems support sparse files, including most
   UNIX file systems and NTFS, but notably not Apple's HFS+.  Common
   examples of sparse files include VM OS/disk images, database files,
   log files, and even checkpoint recovery files most commonly used by
   the HPC community.

   If an application reads a hole in a sparse file, the file system must
   return all zeroes to the application.  For local data access there
   is little penalty, but with NFS these zeroes must be transferred back
   to the client.  If an application uses the NFS client to read data
   into memory, this wastes time and bandwidth as the application waits
   for the zeroes to be transferred.  Once the zeroes arrive, they then
   steal memory or cache space from real data.  To make matters worse,
   if an application then proceeds to write data to another file system,
   the zeroes are written into the file, expanding the sparse file into
   a full-sized regular file.  Beyond wasting disk space, this can
   actually prevent large sparse files from ever being copied to another
   storage location due to space limitations.


   This document adds a new READPLUS operation to efficiently read from
   sparse files by avoiding the transfer of all-zero regions from the
   server to the client.  READPLUS supports all the features of READ but
   includes a minimal extension to support sparse files.  In addition,
   the return value of READPLUS is now compatible with NFSv4.1 minor
   versioning rules and could support other future extensions without
   requiring yet another operation.  READPLUS is guaranteed to perform
   no worse than READ, and can dramatically improve performance with
   sparse files.  READPLUS does not depend on pNFS protocol features,
   but can be used by pNFS to support sparse files.

   The XDR description is provided in this document in a way that makes
   it simple for the reader to extract into a ready-to-compile form.
   The reader can feed this document into the following shell script to
   produce the machine-readable XDR description of the READPLUS
   operation:

   #!/bin/sh
   grep "^  *///" | sed 's?^  *///  ??' | sed 's?^.*///??'

   That is, if the above script is stored in a file called "extract.sh"
   and this document is in a file called "spec.txt", then the reader can
   do:

    sh extract.sh < spec.txt > md.x

   The effect of the script is to remove leading white space from each
   line of the specification, plus a sentinel sequence of "///".
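
   For readers without a POSIX shell, the same extraction can be
   sketched in Python.  This is an illustrative equivalent of the
   grep/sed pipeline above, not part of the proposal; the function name
   extract_xdr is hypothetical.

```python
import re


def extract_xdr(text: str) -> str:
    """Mimic: grep "^  *///" | sed 's?^  *///  ??' | sed 's?^.*///??'"""
    out = []
    for line in text.splitlines():
        # grep: keep only lines that start with spaces and the sentinel.
        if not re.match(r"^ +///", line):
            continue
        # First sed: drop indentation, the sentinel, and two spaces.
        line = re.sub(r"^ +///  ", "", line, count=1)
        # Second sed: drop a sentinel that was not followed by two spaces
        # (for example a bare "///" line standing for a blank line).
        line = re.sub(r"^.*///", "", line, count=1)
        out.append(line)
    return "\n".join(out) + ("\n" if out else "")
```

   Calling extract_xdr() on the text of the specification and writing
   the result to "md.x" corresponds to the shell invocation shown
   above.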

1.1. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC-2119 [1].

2. Terminology

   o  Regular file: An object of file type NF4REG or NF4NAMEDATTR.

   o  Sparse file: A Regular file that contains one or more Holes.

   o  Hole: A byte range within a Sparse file that contains regions of
      all zeroes.  For block-based file systems, this could also be an
      unallocated region of the file.

3. Applications and Sparse Files

   Applications may cause an NFS client to read holes in a file for
   several reasons.  This section describes three different application
   workloads that cause the NFS client to transfer data unnecessarily.
   These workloads are simply examples, and there are probably many more
   workloads that are negatively impacted by sparse files.

   The first workload that can cause holes to be read is sequential
   reads within a sparse file.  When this happens, the NFS client may
   perform read requests ("readahead") into sections of the file not
   explicitly requested by the application.  Since the NFS client cannot
   differentiate between holes and non-holes, the NFS client may
   prefetch empty sections of the file.

   This workload is exemplified by Virtual Machines and their associated
   file system images, e.g., VMware .vmdk files, which are large sparse
   files encapsulating an entire operating system.  If a VM reads files
   within the file system image, this will translate to sequential NFS
   read requests into the much larger file system image file.  Since NFS
   does not understand the internals of the file system image, it ends
   up performing readahead on file holes.

   The second workload is generated by copying a file from a directory
   in NFS to the same NFS server, to another file system (e.g., another
   NFS or Samba server), to a local ext3 file system, or even to a
   network socket.  In this case, bandwidth and server resources are
   wasted as the entire file is transferred from the NFS server to the
   NFS client.  Once a byte range of the file has been transferred to
   the client, it is up to the client application (e.g., rsync, cp, or
   scp) how it writes the data to the target location.  For example, cp
   supports sparse files and will not write all-zero regions, whereas
   scp does not support sparse files and will transfer every byte of the
   file.
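
   The write-side behavior of a sparse-aware copy tool can be sketched
   as follows.  This is a minimal illustration of the general technique
   (seek over all-zero blocks instead of writing them), not the actual
   implementation of cp or rsync; the function name sparse_copy and the
   4096-byte block size are assumptions.

```python
import os

BLOCK = 4096  # assumed granularity for detecting all-zero blocks


def sparse_copy(src_path, dst_path, block=BLOCK):
    """Copy src to dst, leaving holes where src holds all-zero blocks.

    Returns the number of bytes actually written to dst.
    """
    zero = bytes(block)
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            buf = src.read(block)
            if not buf:
                break
            if buf == zero[:len(buf)]:
                # All zeroes: advance the file position without writing.
                dst.seek(len(buf), os.SEEK_CUR)
            else:
                dst.write(buf)
                written += len(buf)
        # Set the final size in case the file ends with a hole.
        dst.truncate()
    return written
```

   Note that the source side of this copy still reads every zero over
   the wire when the source is on NFS, which is the cost READPLUS is
   meant to remove.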

   The third workload is generated by applications that do not utilize
   the NFS client cache, but instead use direct I/O and manage cached
   data independently, e.g., databases.  These applications may perform
   whole file caching with sparse files, which would mean that even the
   holes will be transferred to the clients and cached.

4. Overview of Sparse Files and NFSv4

   This proposal seeks to provide sparse file support to the largest
   number of NFS client and server implementations, and as such proposes
   a new READPLUS operation, modeled on the mandatory NFSv4.1 READ
   operation, instead of proposing additions to or extensions of
   optional features (such as pNFS).

   This document also seeks to ensure that the proposed extensions are
   simple and do not transfer data between the client and server
   unnecessarily. For example, one possible way to implement sparse file
   read support would be to have the client, on the first hole
   encountered or at OPEN time, request a Data Region Map from the
   server.  A Data Region Map would specify all zero and non-zero
   regions in a file.  While this option seems simple, it is less useful
   and can become inefficient and cumbersome for several reasons:

   o  Data Region Maps can be large, and transferring them can reduce
      overall read performance.  For example, VMware's .vmdk files can
      have a file size of over 100 GB and a map of well over several
      MB.

   o  Data Region Maps can change frequently, and become invalidated on
      every write to the file.  This can result in the map being
      transferred multiple times with each update to the file.  For
      example, a VM that updates a config file in its file system image
      would invalidate the Data Region Map not only for itself, but for
      all other clients accessing the same file system image.

   o  Data Region Maps do not handle zero-filled sections of the file,
      reducing the effectiveness of the solution.  While it may be
      possible to modify the maps to handle zero-filled sections (at
      possibly great cost to the server), it is almost impossible with
      pNFS.  With pNFS, the owner of the Data Region Map is the metadata
      server, which is not in the data path and has no knowledge of the
      contents of a data region.

   Another way to handle holes is compression, but this is not ideal
   since it requires all implementations to agree on a single
   compression algorithm and requires a fair amount of computational
   overhead.

   Note that supporting writing to a sparse file does not require
   changes to the protocol.  Applications and/or NFS implementations can
   choose to ignore WRITE requests of all zeroes to the NFS server
   without consequence.

5. Definition of READPLUS

   This section introduces a new read operation, named READPLUS, which
   allows NFS clients to avoid reading holes in a sparse file. READPLUS
   is guaranteed to perform no worse than READ, and can dramatically
   improve performance with sparse files.

   READPLUS supports all the features of the existing NFSv4.1 READ
   operation [3] and adds a simple yet significant extension to the
   format of its response.  The change lets a read from a file hole
   complete without transferring all zeroes, which would waste
   computational and network resources and reduce performance.
   READPLUS uses a new result structure that tells the client that the
   result is all zeroes and gives the byte-range of the hole in which
   the request was made.  Returning the hole's byte-range, and only upon
   request, avoids transferring large Data Region Maps that may be soon
   invalidated and contain information about a file that may not even be
   read in its entirety.

   A new read operation is required due to NFSv4.1 minor versioning
   rules that do not allow modification of an existing operation's
   arguments or results.  READPLUS is designed to allow future
   extensions to the result structure.  The same approach could be taken
   to extend the argument structure, but a good use case is first
   required to make such a change.

5.1. ARGUMENTS

   struct READPLUS4args {
           /* CURRENT_FH: file */
           stateid4        stateid;
           offset4         offset;
           count4          count;
   };


5.2. RESULTS

   union nfs_readplusreshole switch (holeres4 resop) {
    case HOLE_NOINFO:
            void;
    case HOLE_INFO:
            offset4         hole_offset;
            length4         hole_length;
   };

   union nfs_readplusresok4 switch (readplusrestype4 resop) {
    case READ_OK:
            opaque          data<>;
    case READ_HOLE:
            nfs_readplusreshole   reshole4;
   };

   union READPLUS4res switch (nfsstat4 status) {
    case NFS4_OK:
            bool                eof;
            nfs_readplusresok4  resok4;
    default:
            void;
   };
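
   To make the shape of the result easier to follow, the NFS4_OK arm
   can be modeled in memory as below.  This is an illustrative sketch
   only: the Python class names and the describe() helper are
   hypothetical, and the numeric enum values are assumptions, since
   this document does not assign values to holeres4 or
   readplusrestype4.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class HoleRes(IntEnum):          # stands in for holeres4 (values assumed)
    HOLE_NOINFO = 0
    HOLE_INFO = 1


class ReadPlusResType(IntEnum):  # stands in for readplusrestype4 (assumed)
    READ_OK = 0
    READ_HOLE = 1


@dataclass
class HoleInfo:                  # stands in for nfs_readplusreshole
    resop: HoleRes
    hole_offset: Optional[int] = None   # only set for HOLE_INFO
    hole_length: Optional[int] = None   # only set for HOLE_INFO


@dataclass
class ReadPlusOk:                # NFS4_OK arm of READPLUS4res
    eof: bool
    resop: ReadPlusResType
    data: bytes = b""                   # only meaningful for READ_OK
    hole: Optional[HoleInfo] = None     # only meaningful for READ_HOLE


def describe(res: ReadPlusOk) -> str:
    """Render a result in the style used by the example in Section 5.5."""
    if res.resop is ReadPlusResType.READ_OK:
        return f"READ_OK data[{len(res.data)}] eof={res.eof}"
    if res.hole is None or res.hole.resop is HoleRes.HOLE_NOINFO:
        return f"READ_HOLE HOLE_NOINFO eof={res.eof}"
    return (f"READ_HOLE HOLE_INFO({res.hole.hole_offset}, "
            f"{res.hole.hole_length}) eof={res.eof}")
```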

5.3. DESCRIPTION

   The READPLUS operation is based upon the NFSv4.1 READ operation [3],
   and similarly reads data from the regular file identified by the
   current filehandle.

   The client provides an offset of where the READPLUS is to start and a
   count of how many bytes are to be read.  An offset of zero means to
   read data starting at the beginning of the file.  If offset is
   greater than or equal to the size of the file, the status NFS4_OK is
   returned with readplusrestype4 set to READ_OK, data length set to
   zero, and eof set to TRUE.  The READPLUS is subject to access
   permissions checking.

   If the client specifies a count value of zero, the READPLUS succeeds
   and returns zero bytes of data, again subject to access permissions
   checking.  In all situations, the server may choose to return fewer
   bytes than specified by the client.  The client needs to check for
   this condition and handle the condition appropriately.

   If the client specifies an offset and count value that is entirely
   contained within a hole of the file, the status NFS4_OK is returned
   with readplusrestype4 set to READ_HOLE and, if information is
   available regarding the hole, a nfs_readplusreshole structure
   containing the offset and length of the entire hole.  The
   nfs_readplusreshole structure is considered valid until the file is
   changed (detected via the change attribute).  The server MUST provide
   the same semantics for nfs_readplusreshole as if the client read the
   region and received zeroes; the lifetime of the implied hole contents
   MUST be exactly the same as that of any other read data.

   If the client specifies an offset and count value that begins in a
   non-hole of the file but extends into a hole, the server should
   return a short read with status NFS4_OK, readplusrestype4 set to
   READ_OK, and data length set to the number of bytes returned.  The
   client will then issue another READPLUS for the remaining bytes, to
   which the server will respond with information about the hole in the
   file.

   If the server knows that the requested byte range falls within a hole
   of the file, but has no further information regarding the hole, it
   returns a nfs_readplusreshole structure with holeres4 set to
   HOLE_NOINFO.


   If hole information is available on the server and can be returned to
   the client, the server returns a nfs_readplusreshole structure with
   holeres4 set to HOLE_INFO.  The values of hole_offset and
   hole_length define the byte-range for the current hole in the file.
   These values represent the information known to the server and may
   describe a byte-range smaller than the true size of the hole.

   Except when special stateids are used, the stateid value for a
   READPLUS request represents a value returned from a previous byte-
   range lock or share reservation request or the stateid associated
   with a delegation.  The stateid identifies the associated owners if
   any and is used by the server to verify that the associated locks are
   still valid (e.g., have not been revoked).

   If the read ended at the end-of-file (formally, in a correctly formed
   READPLUS operation, if offset + count is equal to the size of the
   file), or the READPLUS operation extends beyond the size of the file
   (if offset + count is greater than the size of the file), eof is
   returned as TRUE; otherwise, it is FALSE.  A successful READPLUS of
   an empty file will always return eof as TRUE.
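
   The hole, short-read, and eof rules described so far can be
   summarized in a small server-side sketch.  It is not a normative
   algorithm: the function name readplus, the extent-list
   representation of the file, and the tuple-shaped reply are
   hypothetical.  Two points are assumptions of the sketch rather than
   statements of this document: eof is computed from the bytes the
   reply actually covers, and a request that starts in a hole but runs
   into data is served like a plain READ.

```python
def readplus(data_extents, size, offset, count):
    """Decide the READPLUS reply for the range [offset, offset + count).

    data_extents: sorted, non-overlapping (start, end) byte ranges that
    hold data (end exclusive); every other byte below `size` is a hole.
    Returns (restype, payload, eof): payload is a byte count for READ_OK
    and a (hole_offset, hole_length) pair for READ_HOLE.
    """
    if offset >= size:
        return ("READ_OK", 0, True)        # at or past end of file
    if count == 0:
        return ("READ_OK", 0, False)       # zero-byte read succeeds
    end = min(offset + count, size)
    for start, stop in data_extents:
        if start <= offset < stop:
            # Begins in data: short read that stops at the extent end.
            n = min(end, stop) - offset
            return ("READ_OK", n, offset + n >= size)
    # Begins in a hole: locate the boundaries of that hole.
    hole_start = max((stop for _, stop in data_extents if stop <= offset),
                     default=0)
    hole_end = min((start for start, _ in data_extents if start > offset),
                   default=size)
    if end <= hole_end:
        # Entirely inside the hole: report the whole hole (HOLE_INFO).
        return ("READ_HOLE", (hole_start, hole_end - hole_start),
                end >= size)
    # Assumption: starts in a hole and runs into data, so fall back to
    # returning the zeroes and data as an ordinary READ_OK.
    return ("READ_OK", end - offset, end >= size)
```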

   If the current filehandle is not an ordinary file, an error will be
   returned to the client.  In the case that the current filehandle
   represents an object of type NF4DIR, NFS4ERR_ISDIR is returned.  If
   the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is
   returned.  In all other cases, NFS4ERR_WRONG_TYPE is returned.

   For a READPLUS with a stateid value of all bits equal to zero, the
   server MAY allow the READPLUS to be serviced subject to mandatory
   byte-range locks or the current share deny modes for the file.  For a
   READPLUS with a stateid value of all bits equal to one, the server
   MAY allow READPLUS operations to bypass locking checks at the server.

   On success, the current filehandle retains its value.

5.4. IMPLEMENTATION

   If the server returns a "short read" (i.e., less data than requested
   and eof is set to FALSE), the client should send another READPLUS to
   get the remaining data.  A server may return less data than requested
   under several circumstances.  The file may have been truncated by
   another client or perhaps on the server itself, changing the file
   size from what the requesting client believes to be the case.  This
   would reduce the actual amount of data available to the client.  The
   server may also reduce the transfer size and so return a short read
   result.  Server resource exhaustion may also result in a short read.
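
   A client-side loop that copes with short reads and hole replies
   might look like the following sketch.  The send_readplus callable is
   a hypothetical stand-in for the client's transport: it is assumed to
   take (offset, count) and return (restype, payload, eof), where
   payload is the data bytes for READ_OK, a (hole_offset, hole_length)
   pair for READ_HOLE with HOLE_INFO, or None for HOLE_NOINFO.  The
   sketch also assumes the client already knows the file size, for
   example from cached attributes.

```python
def read_file(send_readplus, size, max_read):
    """Sequentially read `size` bytes, reissuing READPLUS as needed."""
    out = bytearray()
    offset = 0
    while offset < size:
        want = min(max_read, size - offset)
        restype, payload, eof = send_readplus(offset, want)
        if restype == "READ_OK":
            if not payload:
                break  # guard against looping if nothing was returned
            out += payload
            offset += len(payload)  # short read: continue from here
        else:  # READ_HOLE
            # The requested range is known to be zero in either case.
            next_offset = offset + want
            if payload is not None:
                # HOLE_INFO: skip to the end of the reported hole.
                hole_offset, hole_length = payload
                next_offset = max(next_offset,
                                  min(hole_offset + hole_length, size))
            out += bytes(next_offset - offset)  # zeroes made locally
            offset = next_offset
        if eof:
            break
    return bytes(out)
```

   With HOLE_INFO the client jumps over the whole hole in one request;
   with HOLE_NOINFO it still avoids the transfer of zeroes but needs
   one request per max_read-sized piece of the hole.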


   If mandatory byte-range locking is in effect for the file, and if the
   byte-range corresponding to the data to be read from the file is
   WRITE_LT locked by an owner not associated with the stateid, the
   server will return the NFS4ERR_LOCKED error.  The client should try
   to get the appropriate READ_LT via the LOCK operation before re-
   attempting the READPLUS.  When the READPLUS completes, the client
   should release the byte-range lock via LOCKU.

   If another client has an OPEN_DELEGATE_WRITE delegation for the file
   being read, the delegation must be recalled, and the operation cannot
   proceed until that delegation is returned or revoked.  Except where
   this happens very quickly, one or more NFS4ERR_DELAY errors will be
   returned to requests made while the delegation remains outstanding.
   Normally, delegations will not be recalled as a result of a READPLUS
   operation since the recall will occur as a result of an earlier OPEN.
   However, since it is possible for a READPLUS to be done with a
   special stateid, the server needs to check for this case even though
   the client should have done an OPEN previously.

5.4.1. Additional pNFS Implementation Information

   With pNFS, the semantics of using READPLUS remain the same.  Any
   data server MAY return a READ_HOLE result for a READPLUS request that
   it receives.

   When a data server chooses to return a READ_HOLE result, it has a
   certain level of flexibility in how it fills out the
   nfs_readplusreshole structure.

   1. For a data server that cannot determine any hole information, the
      data server SHOULD return HOLE_NOINFO.

   2. For a data server that can only obtain hole information for the
      parts of the file stored on that data server, the data server
      SHOULD return HOLE_INFO and the byte range of the hole stored on
      that data server.

   3. For a data server that can obtain hole information for the entire
      file without severe performance impact, it MAY return HOLE_INFO
      and the byte range of the entire file hole.

   In general, a data server should do its best to return as much
   information about a hole as is feasible.  At the same time, pNFS
   server implementers should try to ensure that data servers do not
   overload the metadata server with requests for information.
   Therefore, if supplying global sparse information for a file to data
   servers can overwhelm a metadata server, then data servers should use
   option 1 or 2 above.

   When a pNFS client receives a READ_HOLE result and a non-empty
   nfs_readplusreshole structure, it MAY use this information in
   conjunction with a valid layout for the file to determine the next
   data server for the next region of data that is not in a hole.
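
   As a rough illustration of the client-side calculation, the sketch
   below assumes a deliberately simplified layout: fixed-size stripe
   units distributed round-robin over a fixed number of data servers.
   Real pNFS file layouts carry more parameters than this, and the
   function name and arguments are hypothetical.

```python
def next_read_target(hole_offset, hole_length, stripe_unit, num_servers):
    """Return (offset, data server index) for the first byte after a hole.

    Assumes simple round-robin striping: stripe unit i lives on data
    server i % num_servers.
    """
    next_offset = hole_offset + hole_length
    server_index = (next_offset // stripe_unit) % num_servers
    return next_offset, server_index
```

   If the data server reported only the part of the hole that it
   stores (option 2 above), the computed offset may still fall inside
   the hole; the client then simply receives another READ_HOLE result
   from the next data server.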

5.5. READPLUS with Sparse Files Example

   To see how the return value READ_HOLE will work, the following table
   describes a sparse file.  For each byte range, the file contains
   either non-zero data or a hole.  In this example, K denotes 1000
   bytes.


                       +-------------+-----------+
                       | Byte-Range  |  Contents |
                       +-------------+-----------+
                       | 0-31999     |  Non-Zero |
                       | 32K-255999  |  Hole     |
                       | 256K-287999 |  Non-Zero |
                       | 288K-353999 |  Hole     |
                       | 354K-417999 |  Non-Zero |
                       +-------------+-----------+

   Under the given circumstances, if a client were to read the file from
   beginning to end with a maximum read size of 64K, the following will
   be the result.  This assumes the client has already opened the file
   and acquired a valid stateid and just needs to issue READPLUS
   requests.

   1. READPLUS(s, 0, 64K) --> NFS4_OK, readplusrestype4 = READ_OK,
      eof = false, data<>[32K].  Return a short read, as the last half
      of the request was all zeroes.

   2. READPLUS(s, 32K, 64K) --> NFS4_OK, readplusrestype4 = READ_HOLE,
      nfs_readplusreshole(HOLE_INFO)(32K, 224K).  The requested range
      was all zeroes, and the current hole begins at offset 32K and is
      224K in length.

   3. READPLUS(s, 256K, 64K) --> NFS4_OK, readplusrestype4 = READ_OK,
      eof = false, data<>[32K].  Return a short read, as the last half
      of the request was all zeroes.

   4. READPLUS(s, 288K, 64K) --> NFS4_OK, readplusrestype4 = READ_HOLE,
      nfs_readplusreshole(HOLE_INFO)(288K, 66K).


   5. READPLUS(s, 354K, 64K) --> NFS4_OK, readplusrestype4 = READ_OK,
      eof = true, data<>[64K].
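
   The cost of this exchange can be compared with plain READ by a short
   tally over the table above.  The sketch assumes K = 1000 bytes (as
   the byte ranges in the table imply), that every hole is answered by
   one HOLE_INFO reply covering the whole hole, and that data regions
   are read in pieces of at most the maximum read size; the names
   REGIONS and tally are hypothetical.

```python
K = 1000  # the table's ranges (0-31999, 32K-...) imply K = 1000 bytes

# (start, end, holds_data) with end exclusive, taken from the table.
REGIONS = [
    (0, 32 * K, True),
    (32 * K, 256 * K, False),
    (256 * K, 288 * K, True),
    (288 * K, 354 * K, False),
    (354 * K, 418 * K, True),
]


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def tally(regions, max_read):
    """Count requests and payload bytes for READ versus READPLUS."""
    size = regions[-1][1]
    data_bytes = sum(end - start for start, end, is_data in regions
                     if is_data)
    readplus_requests = sum(
        ceil_div(end - start, max_read) if is_data else 1
        for start, end, is_data in regions)
    return {
        "read_requests": ceil_div(size, max_read),
        "read_bytes": size,              # READ returns the zeroes too
        "readplus_requests": readplus_requests,
        "readplus_bytes": data_bytes,    # only real data is transferred
    }
```

   For the file above and a 64K maximum read size, the tally gives 5
   READPLUS requests carrying 128K of data, against 7 READ requests
   carrying all 418K.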

6. Related Work

   Solaris and ZFS support an extension to lseek(2) that allows
   applications to discover holes in a file. The values, SEEK_HOLE and
   SEEK_DATA, allow clients to seek to the next hole or beginning of
   data, respectively.

   XFS supports the XFS_IOC_GETBMAP ioctl, which returns the Data
   Region Map for a file.  Clients can then use this information to
   avoid reading holes in a file.

   NTFS and CIFS support the FSCTL_SET_SPARSE control code, which
   allows applications to control whether empty regions of the file are
   preallocated and filled in with zeroes or simply left unallocated.

7. Security Considerations

   The additions to the NFS protocol for supporting sparse file reads
   do not alter the security considerations of the NFSv4.1 protocol
   [3].

8. IANA Considerations

   There are no IANA considerations in this document.  All NFSv4.1 IANA
   considerations are covered in [3].

9. References

9.1. Normative References

   [1]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
         Levels", BCP 14, RFC 2119, March 1997.

   [2]   Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
         C., Eisler, M., and D. Noveck, "Network File System (NFS)
         version 4 Protocol", RFC 3530, April 2003.

   [3]   Shepler, S., Eisler, M., and D. Noveck, "Network File System
         (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, January
         2010.


9.2. Informative References

   [4]   Shepler, S., Eisler, M., and D. Noveck, "Network File System
         (NFS) Version 4 Minor Version 1 External Data Representation
         Standard (XDR) Description", RFC 5662, January 2010.

   [5]   Nowicki, B., "NFS: Network File System Protocol specification",
         RFC 1094, March 1989.

   [6]   Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3
         Protocol Specification", RFC 1813, June 1995.

10. Acknowledgments

   This document was prepared using 2-Word-v2.0.template.dot. Valuable
   input and advice was received from Sorin Faibish, Bruce Fields, Benny
   Halevy, Trond Myklebust, and Richard Scheffenegger.


Authors' Addresses

   Dean Hildebrand
   IBM Almaden
   650 Harry Rd
   San Jose, CA 95120

   Phone: +1 408-927-2013
   Email: dhildeb@us.ibm.com

   Marc Eshel
   IBM Almaden
   650 Harry Rd
   San Jose, CA 95120

   Phone: +1 408-927-1894
   Email: eshel@almaden.ibm.com

