Internet-Draft                                 Brent Callaghan
Expires: June 2004                       Sun Microsystems, Inc.

                                                   Mark Wittle
                                        Network Appliance, Inc.

Document: draft-callaghan-nfsrdmareq-00.txt      December 2003


                         NFS RDMA Requirements


Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This memo provides information for the Internet community.  This memo
   does not specify an Internet standard of any kind.  Distribution of
   this memo is unlimited.


Copyright Notice

   Copyright (C) The Internet Society (2003).  All Rights Reserved.


Expires: June 2004        Callaghan and Wittle                  [Page 1]

Internet-Draft           NFS RDMA Requirements             December 2003


Abstract

   Remote Direct Memory Access (RDMA) technology provides an efficient
   memory-to-memory data path across a network.  This draft addresses
   the requirements that an RDMA provider must satisfy to meet the needs
   of the NFS protocol and its implementations.  It is a companion to
   the NFS RDMA Problem Statement which presents a case for applying
   RDMA technology to the NFS protocol.


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
   2.  RDMA Generic Requirements  . . . . . . . . . . . . . . . . . 3
   2.1.  RDMA features: SEND, READ and WRITE  . . . . . . . . . . . 3
   2.2.  Preserve ordering of RDMA SENDs  . . . . . . . . . . . . . 4
   2.3.  Support 32 bit Steering Tags . . . . . . . . . . . . . . . 4
   2.4.  Need to support RDMA ops with 64 bit virtual addresses . . 4
   2.5.  Integrity of transferred data. . . . . . . . . . . . . . . 4
   2.6.  Data Privacy . . . . . . . . . . . . . . . . . . . . . . . 4
   2.7.  Must support existing NFS protocols, 2, 3 and 4  . . . . . 5
   2.8.  Direct Data Placement  . . . . . . . . . . . . . . . . . . 5
   2.9.  Flow control . . . . . . . . . . . . . . . . . . . . . . . 5
   2.10.  Buffer size control . . . . . . . . . . . . . . . . . . . 6
   2.11.  Recovery from RDMA errors . . . . . . . . . . . . . . . . 6
   2.12.  Efficient use of RDMA services: RDMA READ vs WRITE  . . . 6
   3.  RDMA Specific Requirements . . . . . . . . . . . . . . . . . 7
   3.1.  IP based addressing  . . . . . . . . . . . . . . . . . . . 7
   3.2.  Minimal effect on latency or protocol overhead.  . . . . . 7
   3.3.  Should handle large transfers  . . . . . . . . . . . . . . 7
   4.  Security . . . . . . . . . . . . . . . . . . . . . . . . . . 8
   5.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 8
   6.  References . . . . . . . . . . . . . . . . . . . . . . . . . 8
   7.  Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 9
   8.  Full Copyright Statement . . . . . . . . . . . . . . . . .  10


1.  Introduction

   The NFS RDMA Problem Statement [NFSRDMAPS] describes RDMA technology
   and how it can benefit NFS implementations.  The NFS RDMA Problem
   Statement recommends the application of RDMA to the existing NFS
   protocol versions.  One example of this is presented in "RDMA
   Transport for ONC RPC" [RPCRDMA], which implements RDMA as a network
   transport at the RPC layer.  This approach preserves the semantics of
   RPC based protocols while providing data marshaling access for RDMA.
   Extensions to the NFS version 4 protocol are proposed in the draft
   "NFSv4 RDMA and Session Extensions" [NFSRDMA] which supply RDMA
   session negotiation and support for Exactly-Once Semantics in NFS
   version 4.

   Any use of RDMA by NFS necessarily affects the RPC layer that NFS
   uses.  Therefore, this document considers the problem from the
   perspective of RPC.

   This RDMA transport is assumed to be implemented in two parts: a
   generic upper half that assumes a general RDMA model, and a lower
   half that presents a particular implementation of RDMA, such as
   RDDP.  The requirements that follow are split accordingly into two
   groups: RDMA generic requirements, followed by requirements specific
   to a particular implementation of RDMA.


2.  RDMA Generic Requirements

   ONC RPC can utilize a number of different transports to move RPC
   messages.  Established transports are UDP (Connectionless) and TCP
   (Connection-oriented). This draft anticipates that RDMA will be made
   available as another ONC RPC transport and assumes a generic RDMA
   model, i.e. an application may specify "rdma" as a transport for its
   RPC messages without regard to the particular implementation of RDMA
   such as RDDP.  The following requirements apply to this generic RDMA
   transport layer for RPC.

2.1.  RDMA features: SEND, READ and WRITE

   The RDMA connection may support RDMA SEND, READ and WRITE operations.
   An RDMA SEND must move data to a pre-posted, untagged receive buffer
   on the peer and must signal a completion event to the receiver.  An
   RDMA WRITE is initiated on the sending system and must transfer data
   between registered buffers using a steering tag, offset and size to
   designate buffers on both the sending system (data source) and the
   receiving system (data sink).  A completion event on the data source
   must signal completion of the RDMA WRITE operation at the data
   source, though it does not necessarily signal the arrival of data at
   the sink.  Similarly, an RDMA READ is initiated on the data sink and
   must transfer data between registered buffers on peers using a
   steering tag, offset and size to designate buffers on both source and
   sink systems. A completion event on the data sink system must signal
   completion of the RDMA READ operation and the availability of data in
   the sink buffer.
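
   The semantics above can be illustrated with a toy in-memory model.
   This is only a sketch: the Endpoint class and the rdma_send,
   rdma_write and rdma_read functions are hypothetical names invented
   for illustration, not the interface of any RDMA provider.  It shows
   that a SEND consumes a pre-posted untagged buffer and signals the
   receiver, that a WRITE completes only at the data source, and that a
   READ completes at the data sink once the data is present.

```python
from collections import deque


class Endpoint:
    """Toy in-memory model of one RDMA peer (hypothetical, not a real API)."""

    def __init__(self):
        self.registered = {}       # STag -> bytearray of registered memory
        self.recv_queue = deque()  # pre-posted, untagged receive buffers
        self.completions = []      # completion events signalled to this peer
        self._next_stag = 1

    def register(self, size):
        """Register a buffer of `size` bytes and return its STag."""
        stag = self._next_stag
        self._next_stag += 1
        self.registered[stag] = bytearray(size)
        return stag

    def post_receive(self, size):
        """Pre-post one untagged receive buffer."""
        self.recv_queue.append(bytearray(size))

    def region(self, stag, offset, size):
        """Return a writable view of a registered region, checking bounds."""
        buf = self.registered[stag]
        if offset < 0 or size < 0 or offset + size > len(buf):
            raise ValueError("region outside registered buffer")
        return memoryview(buf)[offset:offset + size]


def rdma_send(receiver, payload):
    """SEND: fill the next pre-posted buffer and signal the receiver."""
    if not receiver.recv_queue:
        raise ConnectionError("no receive buffer posted at the peer")
    buf = receiver.recv_queue.popleft()
    if len(payload) > len(buf):
        raise ConnectionError("message exceeds posted receive buffer")
    buf[:len(payload)] = payload
    receiver.completions.append(("RECV", bytes(payload)))


def rdma_write(source, sink, src, dst, size):
    """WRITE, initiated at the data source; src and dst are (STag, offset)."""
    sink.region(*dst, size)[:] = bytes(source.region(*src, size))
    # Only the source sees a completion; the sink is not notified.
    source.completions.append(("WRITE_DONE", size))


def rdma_read(sink, source, src, dst, size):
    """READ, initiated at the data sink; src and dst are (STag, offset)."""
    sink.region(*dst, size)[:] = bytes(source.region(*src, size))
    # Completion at the sink means the data is available in its buffer.
    sink.completions.append(("READ_DONE", size))
```

   In this model a sender that issues a SEND when no receive buffer is
   posted gets a connection error, which is the situation the flow
   control requirement in section 2.9 is meant to prevent.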

2.2.  Preserve ordering of RDMA SENDs

   The RDMA transport must preserve the ordering of SENDs.  A sequence
   of SENDs must arrive at the receiver in the same order in which they
   were sent.  There is no requirement for ordering of READ or WRITE
   operations.

2.3.  Support 32 bit Steering Tags

   The remote addresses used by RDMA are qualified by a Steering Tag
   (STag) or memory handle that uniquely describes a region of
   registered memory on a host.  There are no requirements for the
   content of the steering tag. An STag must not exceed 32 bits.

2.4.  Need to support RDMA ops with 64 bit virtual addresses

   Support for 64 bit virtual addresses is universal in large operating
   environments.  The NFS protocol allows up to 64 bit file offsets.
   RDMA operations must support 64 bit virtual addresses.
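
   As an illustration of sections 2.3 and 2.4, the sketch below packs a
   remote buffer descriptor using a 32 bit STag and a 64 bit address.
   The field order, the 32 bit length field and the big-endian encoding
   are assumptions made for this example only; this document fixes only
   the widths of the STag and the address.

```python
import struct

# Hypothetical descriptor layout (an assumption, not taken from this draft):
#   32-bit STag, 64-bit virtual address/offset, 32-bit length, big-endian.
SEGMENT = struct.Struct(">IQI")


def pack_segment(stag, offset, length):
    """Encode one remote buffer descriptor, enforcing the field widths."""
    if not 0 <= stag < 2**32:
        raise ValueError("STag must not exceed 32 bits")
    if not 0 <= offset < 2**64:
        raise ValueError("address must fit in 64 bits")
    if not 0 <= length < 2**32:
        raise ValueError("length must fit in 32 bits in this sketch")
    return SEGMENT.pack(stag, offset, length)


def unpack_segment(data):
    """Decode a descriptor into a (stag, offset, length) tuple."""
    return SEGMENT.unpack(data)
```

   With this layout a descriptor occupies 16 bytes, and an address
   above 4 GB survives a round trip through pack_segment and
   unpack_segment unchanged.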

2.5.  Integrity of transferred data.

   The use of NFS exposes file and filesystem contents to a network,
   where data may be corrupted or modified by attackers, or privacy lost
   to a third party.  Network media like Ethernet provide CRC protection
   for data in an Ethernet frame.  Data is also protected by UDP or TCP
   checksums.  Finally, the use of an integrity or privacy mechanism in
   RPCSEC_GSS can protect data at the RPC layer - albeit with some loss
   of performance.  An RDMA transport layer must guarantee data
   security at least as strong as that already provided by existing
   transports.  Stronger protection of data integrity from the RDMA
   transport is desirable, particularly in a business-critical data
   center environment characterized by very high bit rates.

2.6.  Data Privacy

   Some may perceive RDMA as providing new opportunities for attack or
   disclosure via the exposure of registered buffers to external view or
   modification.  Since a peer system can obtain data from or place data
   directly into a remote buffer, a malicious third party might likewise
   read that data, or use direct placement to modify it, before it is
   used by the application.  While RPCSEC_GSS can provide protection,
   there should also be an RDMA-provided mechanism that disallows remote
   access to direct-placed data after the data has
   been placed in the receiving buffer and before it is checked for
   integrity or decrypted.

2.7.  Must support existing NFS protocols, 2, 3 and 4

   An RDMA transport for RPC must support existing versions of NFS.
   This isn't to say that an RDMA transport must be interoperable with
   old implementations that do not support RDMA.  We don't expect an
   NFS/UDP client to be interoperable with an NFS/TCP server.  However,
   we should expect the benefits of an RDMA transport to be made
   available to existing versions of the NFS protocol.  The availability
   of RDMA as a simple transport add-on can provide performance benefits
   to existing implementations while preserving their semantics
   ("bug-for-bug compatible").  This will reduce implementation,
   training and deployment cost.

2.8.  Direct Data Placement

   The NFS RDMA Problem Statement [NFSRDMAPS] describes the overhead in
   data copying suffered by existing NFS implementations.  The data
   copying is necessary, either to move data between an application-
   defined buffer and a network buffer, or to achieve correct data
   alignment.  An RDMA transport must support direct data placement and
   data alignment, sending data directly to a destination buffer on the
   remote system without any intermediate copies.

2.9.  Flow control

   While RPC protocols are typically call-response, connection sharing
   and use of read-ahead and write-behind can cause multiple RPC
   messages to be in flight concurrently on a single network connection.
   RPC provides no inherent flow control - there is nothing to prevent a
   client from initiating any number of concurrent RPC messages that can
   overwhelm an NFS server.  While TCP allows the server to push back on
   a client by advertising a window that limits the amount of data it is
   willing to accept, that window does not reflect the transaction
   capacity of the server, e.g. the number of service threads available
   to handle requests.  While the
   effect of a transaction overload on a TCP connection is just the
   server dropping requests, an overload on an RDMA transport can cause
   transport errors and loss of the connection.  RDMA transports that
   support the use of untagged, pre-posted receive buffers require that
   any message sent over the connection have a receive buffer posted at
   the transport endpoint.  A client that transmits a number of messages
   without acknowledgement must not send more messages than the number
   of receive buffers queued at the connection.  The client and server
   must be able to negotiate the number of posted receive buffers, i.e.
   maximum number of messages that can be in flight.  It's possible that
   the number of posted buffers may change over time depending on the
   memory or compute resources available to the server, so this
   negotiation should extend over the life of the connection.  Given
   that RPC is a call-response protocol, the number of messages in
   flight is determined by the client.  Hence the client needs to know
   the server's limits, but the client can guarantee sufficient posted
   receive buffers at its end simply by posting a new receive buffer to
   accept an RPC reply whenever it sends an RPC call message.
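
   One way to picture this requirement is a credit scheme on the client
   side.  The sketch below is an assumption about how an implementation
   might track the limit; the CreditedClient class and its method names
   are hypothetical.  The client never has more calls in flight than the
   server has receive buffers posted, it posts a reply buffer for every
   call it sends, and each reply may carry an updated limit so that the
   negotiation extends over the life of the connection.

```python
class CreditedClient:
    """Client-side view of a credit-limited connection (hypothetical)."""

    def __init__(self, server_credits):
        self.credits = server_credits  # receive buffers posted by the server
        self.in_flight = 0             # calls sent but not yet answered
        self.reply_buffers = 0         # receive buffers posted for replies

    def can_send(self):
        return self.in_flight < self.credits

    def send_call(self):
        if not self.can_send():
            raise RuntimeError("would exceed the server's posted buffers")
        # Post a receive buffer for the reply before sending the call.
        self.reply_buffers += 1
        self.in_flight += 1

    def receive_reply(self, new_credits=None):
        if self.in_flight == 0:
            raise RuntimeError("reply received with no call outstanding")
        self.in_flight -= 1
        self.reply_buffers -= 1        # the reply consumed one buffer
        if new_credits is not None:
            self.credits = new_credits  # server adjusted its limit
```

   A client that reaches the limit has to queue further calls locally
   until a reply arrives, rather than risk a transport error.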

2.10.  Buffer size control

   While the number of untagged receive buffers must be agreed upon by
   the client and server, the size of these buffers is also important.
   If an RPC message exceeds the size of the posted receive buffer then
   a network error will result and the connection may be lost.  The
   client and server must agree on the size of posted buffers.  The NFS
   protocol already allows some negotiation of buffer sizes in the
   protocol itself (the NFS version 3 FSINFO procedure provides maximum
   and preferred transfer sizes) so this buffer negotiation could be
   supported in an extended version of the NFS protocol, perhaps in an
   NFSv4 minor version.  Alternatively, buffer negotiation might be
   conducted in a protocol-generic fashion via a simple RPC protocol
   that negotiates receive buffer size.  This protocol could run over
   the RDMA connection while it is being initialized, bootstrapped by a
   requirement that both peers initially support a minimum receive
   buffer size, say 1 KB.
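
   A minimal sketch of such a negotiation follows.  It assumes that
   each peer simply advertises the size of the receive buffers it
   posts and that the other peer limits its messages accordingly; the
   function names and the returned dictionary are hypothetical, and
   the 1 KB floor is taken from the example value in the text.

```python
MIN_RECV_SIZE = 1024  # bootstrap minimum; the text suggests "say 1 KB"


def negotiate_recv_sizes(client_recv_size, server_recv_size):
    """Return the largest message each peer may send to the other."""
    for name, size in (("client", client_recv_size),
                       ("server", server_recv_size)):
        if size < MIN_RECV_SIZE:
            raise ValueError(f"{name} receive buffers are below the minimum")
    return {
        "client_max_send": server_recv_size,  # bounded by server buffers
        "server_max_send": client_recv_size,  # bounded by client buffers
    }


def check_message(length, max_send):
    """Refuse to send a message that would overrun the peer's buffer."""
    if length > max_send:
        raise ValueError("message would overrun the peer's receive buffer")
    return length
```

   A message that fails check_message would have to be conveyed some
   other way, for example by moving its bulk data with RDMA READ or
   WRITE, instead of being sent and causing a network error.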

2.11.  Recovery from RDMA errors

   NFS is recognized as a robust protocol that survives network errors
   and outages.  Even the loss of a connection or crash and recovery of
   an NFS server need not affect the persistence of an NFS mount. NFS
   users are familiar with the "NFS server not responding" message, and
   the ability of applications to carry on as if nothing had happened
   after an "NFS server OK" message. We should expect the same behavior
   from an NFS mount over an RDMA connection.  The transport must be
   able to detect and recover from RDMA errors, re-establishing the RDMA
   connection if necessary.
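
   The recovery behavior can be sketched as a retry loop around the
   connection.  The connect and do_call callables and the attempt limit
   are hypothetical and are supplied by the caller; a real NFS client
   with a hard mount would typically keep retrying rather than give up.

```python
def call_with_recovery(connect, do_call, max_attempts=3):
    """Issue one RPC, re-establishing the connection after RDMA errors."""
    conn = None
    for _ in range(max_attempts):
        try:
            if conn is None:
                conn = connect()       # (re-)establish the RDMA connection
            return do_call(conn)
        except ConnectionError:
            conn = None                # discard the broken connection
    raise TimeoutError("NFS server not responding")
```

   The caller sees either a normal reply or, after the attempts are
   exhausted, a timeout that corresponds to the familiar "NFS server
   not responding" condition.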

2.12.  Efficient use of RDMA services: RDMA READ vs WRITE

   An RDMA transport must allow the NFS implementation to take full
   advantage of RDMA services - especially RDMA READ and WRITE
   operations that perform efficient memory-to-memory data transfers.
   The transport must allow client and server to make best use of RDMA
   READ and WRITE to perform NFS data transfers either for aligned data
   movement such as NFS READ and WRITE, or for movement of large amounts
   of non-aligned data - such as READDIR results.

3.  RDMA Specific Requirements

   The following requirements are placed on a specific provider of RDMA
   service.  These are the features that it must provide or support to
   meet the needs of the generic RDMA model assumed by the RPC
   transport.

3.1.  IP based addressing

   Existing RPC networks are based on DNS provided hostnames and NICs
   identified with IP addresses.  Although an RDMA transport may have
   different addressing requirements, it must be possible to map those
   addresses to, or derive them from, IP addresses to preserve system
   administrators' sanity.  Similarly, we should expect the familiar
   properties of RPC protocols as implemented on IP to be
   preserved by an RDMA transport: for instance, some notion of a unique
   transport endpoint like the UDP or TCP port space, the ability to
   handle multiple client connections to a server endpoint, and the
   ability for multiple RPC protocols to share a single RDMA connection.

3.2.  Minimal effect on latency or protocol overhead.

   Although the NFS protocol is commonly used to transfer large chunks
   of file data, most of the protocol messages are small.  For instance,
   even though an NFS READ message may return a large amount of data in
   the reply, the call message is quite tiny: typically less than 100
   bytes.  NFS traffic mixes show that directory LOOKUP and fetching of
   file metadata with GETATTR, which account for the majority of NFS
   traffic, also have small RPC messages.  It is important that an RDMA
   transport feature a small transaction cost, i.e. small messages must
   be handled efficiently, with latency as low as or lower than that of
   an equivalent UDP or TCP connection.

3.3.  Should handle large transfers

   The performance of the NFS protocol is often determined by the
   network transfer size available.  NFS version 2 was limited by the
   protocol to 8 KB transfers, which was quite large back in the
   mid-1980s.  NFS version 3 allowed the client and server to negotiate
   transfer size.  While 32 KB transfers are common today, there is
   evidence to suggest that much larger transfer sizes can improve the
   performance of sequential access to large files.  An RDMA transport
   is expected to provide efficient transfer of large blocks of data
   using specialized network hardware.  NFS should expect an RDMA
   transport to support transfers of a megabyte or more.
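
   The effect of transfer size on the number of round trips is simple
   arithmetic.  The sketch below counts the READ calls needed to fetch
   a file sequentially at the transfer sizes mentioned above; the 1 GB
   file size and the rpc_count name are illustrative assumptions.

```python
def rpc_count(file_size, transfer_size):
    """Number of READ calls needed to move file_size bytes sequentially."""
    return -(-file_size // transfer_size)  # ceiling division


FILE_SIZE = 2**30  # a 1 GB file, chosen only for illustration

calls_8k = rpc_count(FILE_SIZE, 8 * 1024)       # 131072 calls
calls_32k = rpc_count(FILE_SIZE, 32 * 1024)     # 32768 calls
calls_1m = rpc_count(FILE_SIZE, 1024 * 1024)    # 1024 calls
```

   Moving from 32 KB to 1 MB transfers reduces the number of calls for
   this file by a factor of 32, which is why per-call overhead matters
   less as the transfer size grows.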

4.  Security

   ONC RPC makes no security assumptions about its transports, since it
   can already provide security for RPC messages through RPCSEC_GSS
   mechanisms [RFC2203].  However, link-level encryption or encryption
   via IPsec [RFC2401] may provide important performance advantages for
   an RPCSEC_GSS security mechanism [CCM].


5.  Acknowledgements

   The authors would like to thank Tom Talpey and Chet Juszczak for
   their contributions to this document.


6.  References


   [CCM]
      M. Eisler, N. Williams, "The Channel Conjunction
      Mechanism (CCM) for GSS"
      Internet Draft.
      http://www.ietf.org/internet-drafts/
         draft-ietf-nfsv4-ccm-01.txt

   [DAFS]
      Direct Access File System Specification version 1.0, available
      from http://www.dafscollaborative.org, September 2001

   [NFSRDMA]
      T. Talpey, S. Shepler, "NFSv4 RDMA and Session Extensions"
      http://www.ietf.org/internet-drafts/
         draft-talpey-nfsv4-rdma-sess-00.txt

   [NFSRDMAPS]
      Tom Talpey, Chet Juszczak, "NFS RDMA Problem Statement"
      http://www.ietf.org/internet-drafts/
         draft-talpey-nfs-rdma-problem-statement-00.txt

   [RFC1094]
      "NFS: Network File System Protocol Specification",
      (NFS version 2) Informational RFC,
      http://www.ietf.org/rfc/rfc1094.txt

   [RFC1813]
      B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol
      Specification",
      Informational RFC,
      http://www.ietf.org/rfc/rfc1813.txt

   [RFC1831]
      R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification
      Version 2",
      Standards Track RFC,
      http://www.ietf.org/rfc/rfc1831.txt

   [RFC1832]
      R. Srinivasan, "XDR: External Data Representation Standard",
      Standards Track RFC,
      http://www.ietf.org/rfc/rfc1832.txt

   [RFC2203]
      Eisler, M., Chiu, A. and L. Ling,
      "RPCSEC_GSS Protocol Specification",
      RFC 2203, September 1997.
      http://www.ietf.org/rfc/rfc2203.txt

   [RFC2401]
      S. Kent, R. Atkinson, "Security Architecture for the Internet
      Protocol", RFC 2401, November 1998.
      http://www.ietf.org/rfc/rfc2401.txt

   [RFC3530]
      S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M.
      Eisler, D. Noveck, "NFS version 4 Protocol",
      Standards Track RFC,
      http://www.ietf.org/rfc/rfc3530.txt

   [RPCRDMA]
     B. Callaghan, T. Talpey, "RDMA Transport for ONC RPC"
     http://www.ietf.org/internet-drafts/
        draft-callaghan-rpc-rdma-00.txt


7.  Authors' Addresses


           Brent Callaghan
           Sun Microsystems, Inc.
           17 Network Circle
           Menlo Park, California 94025 USA

           Phone: +1 650 786 5067
           EMail: brent.callaghan@sun.com


           Mark Wittle
           Network Appliance, Inc.
           375 Totten Pond Road
           Waltham, MA 02451 USA

           Phone: +1 919 993 5627
           EMail: mark.wittle@netapp.com


8.  Full Copyright Statement


   Copyright (C) The Internet Society (2003).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.