Internet-Draft                                 Brent Callaghan
Expires: November 2003                   Sun Microsystems, Inc.
                                                    Tom Talpey
                                        Network Appliance, Inc.

Document: draft-callaghan-rpcrdma-00.txt             May, 2003





                       RDMA Transport for ONC RPC


Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This memo provides information for the Internet community.  This memo
   does not specify an Internet standard of any kind.  Distribution of
   this memo is unlimited.


Copyright Notice

   Copyright (C) The Internet Society (2003).  All Rights Reserved.


Abstract

   A protocol is described providing RDMA as a new transport for ONC
   RPC.  The RDMA transport binding conveys the benefits of efficient,
   bulk data transport over high speed networks, while requiring
   minimal change to RPC applications and no revision of the
   application RPC protocol or of the RPC protocol itself.


Table of Contents

   1.  Introduction
   2.  Abstract RDMA Model
   3.  Protocol Outline
   3.1.  Short Messages
   3.2.  Data Chunks
   3.3.  Flow Control
   3.4.  XDR Encoding with Chunks
   3.5.  Padding
   3.6.  XDR Decoding with Read Chunks
   3.7.  XDR Decoding with Write Chunks
   3.8.  RPC Call and Reply
   4.  RPC RDMA Message Layout
   4.1.  RPC RDMA Transport Header
   4.2.  XDR Language Description
   5.  Large Chunkless Messages
   5.1.  Message as an RDMA Read Chunk
   5.2.  RDMA Write of Long Replies
   6.  Connection Configuration Protocol
   6.1.  Initial Connection State
   6.2.  Protocol Description
   7.  Memory Registration Overhead
   8.  Errors and Error Recovery
   9.  Node Addressing
   10.  RPC Binding
   11.  Security
   12.  IANA Considerations
   13.  Acknowledgements
   14.  References
   15.  Authors' Addresses
   16.  Full Copyright Statement


1.  Introduction

   RDMA is a technique for efficient movement of data over high speed
   transports.  It facilitates data movement via direct memory access by
   hardware, yielding faster transfers of data over a network while
   reducing host CPU overhead.

   ONC RPC [RFC1831] is a remote procedure call protocol that has been
   run over a variety of transports.  Most implementations today use UDP
   or TCP.  RPC messages are defined in terms of an eXternal Data
   Representation (XDR) [RFC1832] which provides a canonical data
   representation across a variety of host architectures.  An XDR data
   stream is conveyed differently on each type of transport.  On UDP,
   RPC messages are encapsulated inside datagrams, while on a TCP byte
   stream, RPC messages are delineated by a record marking protocol.  An
   RDMA transport also conveys RPC messages in a unique fashion that
   must be fully described if client and server implementations are to
   interoperate.

   RDMA transports present new semantics unlike the behaviors of either
   UDP or TCP.  They retain message delineations like UDP, while also
   providing reliable, sequenced data transfer like TCP.  In addition,
   they provide the efficient bulk transfer service of RDMA.  RDMA
   transports are therefore naturally viewed as a new transport type by
   ONC RPC.

   RDMA as a transport will benefit the performance of RPC protocols
   that move large "chunks" of data, since RDMA hardware excels at
   moving data efficiently between host memory and a high speed network
   with little or no host CPU involvement.  In this context, the NFS
   protocol, in all its versions, is an obvious beneficiary of RDMA.
   Many other RPC-based protocols will also benefit.

   Although the RDMA transport described here provides relatively
   transparent support for any RPC application, the proposal goes
   further in describing mechanisms that can optimize the use of RDMA
   with more active participation by the RPC application.


2.  Abstract RDMA Model

   An RPC transport is responsible for conveying an RPC message from a
   sender to a receiver. An RPC message is either an RPC call from a
   client to a server, or an RPC reply from the server back to the
   client. An RPC message contains an RPC call header followed by
   arguments if the message is an RPC call, or an RPC reply header
   followed by results if the message is an RPC reply.  The call header
   contains a transaction ID (XID) followed by the program and procedure
   number as well as a security credential.  An RPC reply header begins
   with an XID that matches that of the RPC call message, followed by a
   security verifier and results.  All data in an RPC message is XDR
   encoded. For a complete description of the RPC protocol and XDR
   encoding, see [RFC1831] and [RFC1832].

   This protocol assumes an abstract model for RDMA transports.  The
   following terms, common in the RDMA lexicon, are used in this
   document. A more complete glossary of RDMA terms can be found in
   [RDMA].


           o Registered Memory

             All data moved via RDMA must be resident in registered
             memory at its source and destination.  Each segment of
             registered memory must be identified with a Steering Tag
             (STag) of no more than 32 bits and memory addresses of up
             to 64 bits in length.


           o RDMA Send

             The RDMA provider supports an RDMA Send operation with
             completion signalled at the receiver when data is placed
             in a pre-posted buffer.  The amount of transferred data
             is limited only by the size of the receiver's buffer.
             Sends complete at the receiver in the order they were
             issued at the sender.


           o RDMA Write

             The RDMA provider supports an RDMA Write operation to
             directly place data in the receiver's buffer.  An RDMA
             Write is initiated by the sender and completion is
             signalled at the sender. No completion is signalled at
             the receiver. The sender uses a Steering Tag (STag),
             memory address and length of the remote destination
             buffer.  A subsequent completion, provided by RDMA Send,
             must be obtained at the receiver to guarantee that RDMA
             Write data has been successfully placed in the receiver's
              memory.


            o RDMA Read

             The RDMA provider supports an RDMA Read operation to
             directly place peer source data in the requester's buffer.
             An RDMA Read is initiated by the receiver and completion is
             signalled at the receiver.  The receiver provides
             Steering Tags, memory addresses and a length for the
             remote source and local destination buffers.
             Since the peer at the data source receives no notification
             of RDMA Read completion, there is an assumption that on
             receiving the data the receiver will signal completion
             with an RDMA Send message, so that the peer can free the
             source buffers.

   In its abstract form, this protocol is not an interoperable standard.
   It becomes a useful, implementable standard only when mapped onto a
   specific RDMA transport, like iWARP [RDDP] or Infiniband [IB].


3.  Protocol Outline

   An RPC message can be conveyed in identical fashion, whether it is a
   CALL or REPLY message.  In each case, the transmission of the message
   proper is preceded by transmission of a transport header for use by
   RPC over RDMA transports.  This header is analogous to the record
   marking used for RPC over TCP, but is more extensive: RDMA
   transports support several modes of data transfer, the client and
   server need to select the most efficient mode for any given
   transfer, and multiple pieces of a message may be transferred in
   different ways to different destinations.

   All transfers of a CALL or REPLY begin with an RDMA Send that
   carries at least the transport header, usually with all or part of
   the CALL or REPLY message appended.  Because the size of what may be
   transmitted via RDMA Send is limited by the size of the receiver's
   pre-posted buffer, the RPC over RDMA transport provides a number of
   methods to reduce the amount carried in the Send, when necessary, by
   transferring various parts of the message using RDMA Read and RDMA
   Write.


3.1.  Short Messages

   Many RPC messages are quite short.  For example, the NFS version 3
   GETATTR request is only 56 bytes: 20 bytes of RPC header plus a 32
   byte filehandle argument and 4 bytes of length.  The reply to this
   common request is about 100 bytes.

   There is no benefit in transferring such small messages with an RDMA
   Read or Write operation.  The overhead in transferring STags and
   memory addresses is justified only by large transfers.  The critical
   message size that justifies RDMA transfer will vary depending on the
   RDMA implementation and network, but is typically of the order of a
   few kilobytes.  It is appropriate to transfer a short message with an
   RDMA Send to a pre-posted buffer.  The transport header with the
   short message (CALL or REPLY) immediately following is transferred
   using a single RDMA send operation.

   Short RPC messages over an RDMA transport will look like this:

      Client                                Server
         |               RPC Call              |
    Send |   ------------------------------>   |
         |                                     |
         |               RPC Reply             |
         |   <------------------------------   | Send



3.2.  Data Chunks

   Some protocols, like NFS, have RPC procedures that can transfer very
   large "chunks" of data in the RPC call or reply and would cause the
   maximum send size to be exceeded if one tried to transfer them as
   part of the RDMA send.  These large chunks typically range from a
   kilobyte to a megabyte or more.  An RDMA transport can transfer large
   chunks of data more efficiently via the direct placement of an RDMA
   Read or RDMA Write operation.  Using direct placement instead of in-
   line transfer not only avoids expensive data copies, but provides
   correct data alignment at the destination.


3.3.  Flow Control

   It is critical to provide flow control for an RDMA connection.  RDMA
   receive operations will fail if a pre-posted receive buffer is not
   available to accept an incoming RDMA Send.  Such errors are fatal to
   the connection. This is a departure from conventional TCP/IP
   networking where buffers are allocated dynamically on an as-needed
   basis, and pre-posting is not required.

   It is not practical to provide for fixed credit limits at the RPC
   server.  Fixed limits scale poorly, since posted buffers are
   dedicated to the associated connection until consumed by receive
   operations.  Additionally for protocol correctness, the server must
   be able to reply whether or not a new buffer can be posted to accept
   future receives.

   Flow control is implemented as a simple request/grant protocol in the
   transport header associated with each RPC message.  The transport
   header for RPC CALL messages contains a requested credit value for
   the server, which may be dynamically adjusted by the caller to match
   its expected needs.  The transport header for an RPC REPLY message
   provides the granted result.  The granted value may be adjusted up or
   down at each opportunity to match the server's needs or policies, but
   it must not be zero when no operations are in progress at the server,
   since such a value would deadlock the connection.

   While RPC CALLs may complete in any order, the current flow control
   limit at the RPC server is known to the RPC client from the Send
   ordering properties.  It is always the most recently granted server
   credit value minus the number of requests in flight.
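
   As an illustration only, the client-side credit bookkeeping can be
   as simple as the following sketch in C; the structure and function
   names are assumptions of this example, not part of the protocol.

      /*
       * Illustrative client-side credit accounting (not part of the
       * protocol).  Sends are ordered, so the credit value in the
       * most recently received REPLY is the server's current grant.
       */
      #include <stdint.h>
      #include <stdbool.h>

      struct rdma_credits {
          uint32_t granted;    /* latest credit value from a REPLY */
          uint32_t in_flight;  /* CALLs sent but not yet replied to */
      };

      /* May another RPC CALL be issued without exceeding the grant? */
      static bool can_send_call(const struct rdma_credits *c)
      {
          return c->in_flight < c->granted;
      }

      static void on_call_sent(struct rdma_credits *c)
      {
          c->in_flight++;
      }

      static void on_reply_received(struct rdma_credits *c,
                                    uint32_t granted)
      {
          c->granted = granted;    /* adopt the newly granted limit */
          c->in_flight--;
      }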


3.4.  XDR Encoding with Chunks

   The data comprising an RPC call or reply message is marshaled or
   serialized into a contiguous stream by an XDR routine.  XDR data
   types such as integers, strings, arrays and linked lists are commonly
   implemented over two very simple functions that encode either an XDR
   data unit (32 bits) or an array of bytes.

   Normally, the separate data items in an XDR call or reply are encoded
   as a contiguous sequence of bytes for network transmission over UDP
   or TCP.  However, in the case of an RDMA transport, local routines
   such as XDR encode can determine that an opaque byte array is large
   enough to be more efficiently moved via an RDMA data transfer
   operation like RDMA Read or RDMA Write.

   When sending any message (request or reply) that contains a candidate
   large data chunk, the XDR encoding routine avoids moving the data
   into the XDR stream.  Instead of encoding the data, it records the
   address and size of each chunk in a separate "read chunk list"
   carried in the RPC RDMA transport header.  Such chunks will
   be transferred via RDMA Read operations initiated by the receiver.

   Since the chunks are to be moved via RDMA, the memory for each chunk
   must be registered.  This registration may take place within XDR
   itself, providing for full transparency to upper layers, or it may be
   performed by any other specific local implementation.
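
   The encode-time decision can be sketched as follows.  This is an
   illustration only: the threshold value, the structure layout and the
   register_memory() hook are assumptions of the example, and XDR
   length encoding and 4-byte padding of opaque data are omitted.

      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      #define CHUNK_THRESHOLD 4096    /* assumed minimum chunk size */

      /* Stand-in for whatever registration call the local RDMA
       * provider offers. */
      extern int register_memory(const void *addr, size_t len,
                                 uint32_t *stag);

      struct read_chunk {
          uint32_t stag;        /* from registration */
          uint32_t length;
          uint64_t offset;      /* memory address of the chunk */
          uint32_t position;    /* XDR stream position it replaces */
      };

      struct xdr_enc {
          unsigned char     *buf;        /* in-line XDR stream */
          uint32_t           pos;        /* current stream offset */
          struct read_chunk  chunks[8];  /* recorded read chunks */
          int                nchunks;
      };

      /* Encode an opaque byte array: small arrays are copied in-line,
       * large ones are recorded as read chunks to be fetched by the
       * receiver with RDMA Read.  (Bounds checks omitted.) */
      static int encode_opaque(struct xdr_enc *x, const void *data,
                               uint32_t len)
      {
          if (len >= CHUNK_THRESHOLD && x->nchunks < 8) {
              struct read_chunk *rc = &x->chunks[x->nchunks++];

              rc->length   = len;
              rc->offset   = (uint64_t)(uintptr_t)data;
              rc->position = x->pos; /* where the data would have gone */
              return register_memory(data, len, &rc->stag);
          }
          memcpy(x->buf + x->pos, data, len); /* small: encode in-line */
          x->pos += len;
          return 0;
      }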

   Additionally, when making an RPC call that can result in bulk data
   transferred in the reply, it is desirable to provide chunks to accept
   the data directly via RDMA Write.  These chunks will therefore be
   pre-filled by the server prior to responding, and XDR decode at the
   client will not be required.  These "write chunk lists" undergo a
   similar registration and advertisement to chunks built as a part of
   XDR encoding. Just as with an encoded read chunk list, the memory
   referenced in an encoded write chunk list must be pre-registered.  If
   the client chooses not to make a write chunk list available, then the
   server must return chunks in the reply via a read chunk list.

   The following items are contained in a chunk list entry.

           STag
                   Steering tag or handle obtained when the chunk
                   memory is registered for RDMA.
           Length
                   The length of the chunk in bytes.
           Offset
                   The offset or memory address of the chunk.
           Position
                   For data which is to be encoded, the position in
                   the XDR stream where the chunk would normally
                   reside.  It is possible that a contiguous sequence
                   of chunks might all have the same position.  For
                   data which is to be decoded, no "position" is
                   used.


   When XDR marshaling is complete, the chunk list is XDR encoded, then
   sent to the receiver prepended to the RPC message.  Any source data
   for a read chunk, or the destination of a write chunk, remain behind
   in the sender's registered memory.

   +----------------+----------------+-------------
   |                |                |
   | RDMA header w/ |   RPC Header   | Non-chunk args/results
   |     chunks     |                |
   +----------------+----------------+-------------

   Read chunk lists are structured differently from write chunk lists.
   This is due to the different usage - read chunks are decoded and
   indexed by their position in the XDR data stream, and may be used for
   both arguments and results.  Write chunks on the other hand are used
   only for results, and have no preassigned offset in the XDR stream
   until the results are produced. The mapping of Write chunks onto
   designated NFS procedures and results is described in [NFSDDP].

   Therefore, read chunks are encoded as a single array, with each entry
   tagged by its position in the XDR stream.  Write chunks are encoded
   as a list of arrays of RDMA buffers, with each list element providing
   buffers for a separate result.


3.5.  Padding

   Alignment of specific opaque data enables certain scatter/gather
   optimizations.  Padding leverages the useful property that RDMA
   transfers preserve alignment of data, even when they are placed into
   pre-posted receive buffers by Sends.

   Many servers can make good use of such padding. Padding allows the
   chaining of RDMA receive buffers such that any data transferred by
   RDMA on behalf of RPC requests will be placed into appropriately
   aligned buffers on the system that receives the transfer.  In this
   way, the need for servers to perform RDMA Read to satisfy all but the
   largest client writes is obviated.

   The effect of padding is demonstrated below showing prior bytes on an
   XDR stream (XXX) followed by an opaque field consisting of four
   length bytes (LLLL) followed by data bytes (DDDD).  The receiver of
   the RDMA Send has posted two chained receive buffers.  Without
   padding, the opaque data is split across the two buffers.  With the
   addition of padding bytes (ppp) prior to the first data byte, the
   data can be forced to align correctly in the second buffer.

                                            Buffer 1       Buffer 2
   Unpadded                              --------------  --------------

     XXXXXXXLLLLDDDDDDDDDDDDDD     --->  XXXXXXXLLLLDDD  DDDDDDDDDDD

   Padded

     XXXXXXXLLLLpppDDDDDDDDDDDDDD  --->  XXXXXXXLLLLppp  DDDDDDDDDDDDDD


   Padding is implemented completely within the RDMA transport encoding,
   flagged with a specific message type.  Where padding is applied, two
   values are passed to the peer:  an "rdma_align" which is the padding
   value used, and "rdma_thresh", which is the opaque data size at or
   above which padding is applied. For instance, if the server is using
   chained 4 KB receive buffers, then up to (4 KB - 1) padding bytes
   could be used to achieve alignment of the data.  If padding is to
   apply only to chunks at least 1 KB in size, then the threshold should
   be set to 1 KB.  The XDR routine at the peer will consult these
   values when decoding opaque values.  Where the decoded length exceeds
   the rdma_thresh, the XDR decode will skip over the appropriate
   padding as indicated by rdma_align and the current XDR stream
   position.
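
   The padding arithmetic itself is small.  The following sketch
   (illustrative only) computes the number of pad bytes from the
   rdma_align and rdma_thresh values described above; the receiver
   performs the same computation when it skips over the padding.

      #include <stdint.h>

      /* Pad bytes to insert before an opaque field whose data begins
       * at stream_offset, so that the data lands on an rdma_align
       * boundary in the receiver's chained buffers.  Fields shorter
       * than rdma_thresh are not padded. */
      static uint32_t pad_bytes(uint32_t stream_offset, uint32_t length,
                                uint32_t rdma_align,
                                uint32_t rdma_thresh)
      {
          if (rdma_align == 0 || length < rdma_thresh)
              return 0;
          return (rdma_align - (stream_offset % rdma_align)) %
                 rdma_align;
      }

   In the small figure above, with an alignment equal to the 14 byte
   buffer as drawn, the data would otherwise begin 11 bytes into the
   first buffer (7 prior bytes plus the 4 byte length), so 3 pad bytes
   push it to the start of the second buffer.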


3.6.  XDR Decoding with Read Chunks

   The XDR decode process moves data from an XDR stream into a data
   structure provided by the client or server application.  Where
   elements of the destination data structure are buffers or strings,
   the RPC application can either pre-allocate storage to receive the
   data, or leave the string or buffer fields null and allow the XDR
   decode to automatically allocate storage of sufficient size.

   When decoding a message from an RDMA transport, the receiver first
   XDR decodes the chunk lists from the RDMA transport header, then
   proceeds to decode the body of the RPC message (arguments or
   results).  Whenever the XDR offset in the decode stream matches that
   of a chunk in the read chunk list, the XDR routine registers the
   memory for the destination buffer, then initiates an RDMA Read to
   bring over the chunk data.  If an RPC client uses RDMA Read to fetch
   chunks in the reply then it must issue an RDMA_DONE message
   (described in Section 3.8) to notify the server that the source
   buffers can be freed.
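
   A sketch of this decode-time check follows; rdma_read() and
   register_memory() stand in for provider-specific calls and, like the
   structure layout, are assumptions of this example.

      #include <stddef.h>
      #include <stdint.h>

      struct read_chunk {
          uint32_t stag;        /* peer's steering tag */
          uint32_t length;
          uint64_t offset;      /* peer's memory address */
          uint32_t position;    /* XDR stream position of the chunk */
      };

      /* Stand-ins for provider-specific calls. */
      extern int register_memory(void *addr, size_t len,
                                 uint32_t *stag);
      extern int rdma_read(uint32_t remote_stag, uint64_t remote_offset,
                           uint32_t local_stag, void *local,
                           uint32_t len);

      /* Called when the decoder reaches stream position xdr_pos and
       * needs up to 'len' bytes into 'dest'.  Returns 0 after fetching
       * a chunk, 1 if no chunk is recorded at this position (decode
       * from the in-line stream), or -1 on error. */
      static int maybe_fetch_chunk(const struct read_chunk *chunks,
                                   int nchunks, uint32_t xdr_pos,
                                   void *dest, uint32_t len)
      {
          for (int i = 0; i < nchunks; i++) {
              uint32_t stag;

              if (chunks[i].position != xdr_pos ||
                  chunks[i].length > len)
                  continue;
              if (register_memory(dest, chunks[i].length, &stag) != 0)
                  return -1;
              return rdma_read(chunks[i].stag, chunks[i].offset,
                               stag, dest, chunks[i].length);
          }
          return 1;
      }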

   The read chunk list is constructed and used entirely within the
   RPC/XDR layer. Other than specifying the minimum chunk size, the
   management of the read chunk list is automatic and transparent to an
   RPC application.


3.7.  XDR Decoding with Write Chunks

   When a "write chunk list" is provided in the RPC CALL, the server
   must provide any corresponding data via RDMA Write to the memory
   referenced in the chunk list entries.  The RPC REPLY conveys this by
   returning the write chunk list to the client with the lengths
   rewritten to match the actual transfer.  The XDR "decode" of the
   reply therefore performs no local data transfer but merely returns
   the length obtained from the reply.

   Each decoded result consumes one entry in the write chunk list, which
   in turn consists of an array of RDMA segments.  The length is
   therefore the sum of all returned lengths in all segments comprising
   the corresponding list entry.  As each list entry is "decoded", the
   entire entry is consumed.
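
   For illustration, using the segment layout defined by the XDR in
   Section 4.2, the decoded length of one write chunk list entry is
   simply the sum over its segments:

      #include <stdint.h>

      /* Field names follow the XDR in Section 4.2; the server rewrites
       * 'length' to the number of bytes it actually RDMA wrote. */
      struct xdr_rdma_segment {
          uint32_t handle;      /* STag */
          uint32_t length;
          uint64_t offset;
      };

      /* Decoded length of a result returned via one write chunk list
       * entry: the sum of the rewritten lengths of its segments. */
      static uint32_t write_chunk_result_length(
              const struct xdr_rdma_segment *segs, uint32_t nsegs)
      {
          uint32_t total = 0;

          for (uint32_t i = 0; i < nsegs; i++)
              total += segs[i].length;
          return total;
      }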

   The write chunk list is constructed and used by the RPC application.
   The RPC/XDR layer simply conveys the list between client and server
   and initiates the RDMA Writes back to the client. The mapping of
   write chunk list entries to procedure arguments must be determined
   for each protocol.  An example of a mapping is described in [NFSDDP].


3.8.  RPC Call and Reply

   The RDMA transport for RPC provides three methods of moving data
   between client and server:

           In-line
              Data are moved between client and server
              within an RDMA Send.

            RDMA Read
               Data are moved between client and server
               via an RDMA Read operation, using the STag,
               address and offset obtained from a read chunk
               list.

            RDMA Write
               Result data are moved from server to client
               via an RDMA Write operation, using the STag,
               address and offset obtained from a write chunk
               list or reply chunk in the client's RPC call
               message.

   These methods of data movement may occur in combinations within a
   single RPC.  For instance, an RPC call may contain some in-line data
   along with some large chunks transferred via RDMA Read by the server.
   The reply to that call may have some result chunks that the server
   RDMA Writes back to the client.  The following protocol interactions
   illustrate RPC calls that use these methods to move RPC message data:


   An RPC with write chunks in the call message looks like this:

      Client                                Server
         |     RPC Call + Write Chunk list     |
    Send |   ------------------------------>   |
         |                                     |
         |               Chunk 1               |
         |   <------------------------------   | Write
         |                  :                  |
         |               Chunk n               |
         |   <------------------------------   | Write
         |                                     |
         |               RPC Reply             |
         |   <------------------------------   | Send





   An RPC with read chunks in the call message looks like this:

      Client                                Server
         |     RPC Call + Read Chunk list      |
    Send |   ------------------------------>   |
         |                                     |
         |               Chunk 1               |
         |   +------------------------------   | Read
         |   v----------------------------->   |
         |                  :                  |
         |               Chunk n               |
         |   +------------------------------   | Read
         |   v----------------------------->   |
         |                                     |
         |               RPC Reply             |
         |   <------------------------------   | Send


   And an RPC with read chunks in the reply message looks like this:

      Client                                Server
         |               RPC Call              |
    Send |   ------------------------------>   |
         |                                     |
         |     RPC Reply + Read Chunk list     |
         |   <------------------------------   | Send
         |                                     |
         |               Chunk 1               |
    Read |   ------------------------------+   |
         |   <-----------------------------v   |
         |                  :                  |
         |               Chunk n               |
    Read |   ------------------------------+   |
         |   <-----------------------------v   |
         |                                     |
         |               RPC Done              |
    Send |   ------------------------------>   |

   The final RPC Done message allows the client to signal the server
   that it has received the chunks, so the server can de-register and
   free the memory holding the chunks.  An RPC Done completion is not
   necessary for an RPC call, since the RPC reply Send is itself a
   receive completion notification.

   The RPC Done message has no effect on protocol latency since the
   client has no expectation of a reply from the server.  Nor does it
   adversely affect bandwidth since it is only 16 bytes in length.  In
   the event that the client fails to return the Done message, the
   server can proceed to de-register and free the chunk buffers after a
   time-out.

   Finally, it is possible to conceive of RPC exchanges that involve any
   or all combinations of write chunks in the RPC CALL, read chunks in
   the RPC CALL, and read chunks in the RPC REPLY.  Support for such
   exchanges is straightforward from a protocol perspective, but in
   practice such exchanges would be quite rare, limited to upper layer
   protocol exchanges which transferred bulk data in both the call and
   corresponding reply.


4.  RPC RDMA Message Layout

   RPC call and reply messages are conveyed across an RDMA transport
   with a prepended RDMA transport header.  The transport header
   includes data for RDMA flow control credits, padding parameters and
   lists of addresses that provide direct data placement via RDMA Read
   and Write operations.  The layout of the RPC message itself is
   unchanged from that described in [RFC1831] except for the possible
   exclusion of large data chunks that will be moved by RDMA Read or
   Write operations.  If the RPC message (along with the transport
   header) is too long for the posted receive buffer (even after any
   large chunks are removed), then the entire RPC message can be moved
   separately as a chunk, leaving just the transport header in the RDMA
   Send.


4.1.  RPC RDMA Transport Header

   The RPC RDMA transport header begins with four 32-bit fields that are
   always present and which control the RDMA interaction including
   RDMA-specific flow control.  These are then followed by a number of
   items such as chunk lists and padding which may or may not be present
   depending on the type of transmission.  The four fields which are
   always present are:

            1. Transaction ID (XID).
              The XID generated for the RPC call and reply.  Having
              the XID at the beginning of the message makes it easy to
              establish the message context.  This XID mirrors the XID
              in the RPC call header, and takes precedence.

           2. Version number.
              This version of the RPC RDMA message protocol is 1.
              The version number must be increased by one whenever the
              format of the RPC RDMA messages is changed.

           3. Flow control credit value.
              When sent in an RPC CALL message, the requested value is
              provided.  When sent in an RPC REPLY message, the
              granted value is returned.  RPC CALLs must not be sent
              in excess of the currently granted limit.

           4. Message type.
              RDMA_MSG = 0 indicates that chunk lists and RPC message
              follow.  RDMA_NOMSG = 1 indicates that after the chunk
              lists there is no RPC message.  In this case, the chunk
               lists provide information to allow the message proper to
               be transferred using RDMA Read or Write; it is therefore
               not appended to the RPC RDMA transport header.  RDMA_MSGP =
              2 indicates that a chunk list and RPC message with some
              padding follow.  RDMA_DONE = 3 indicates that the
              message signals the completion of a chunk transfer via
              RDMA Read.


   For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write
   chunk lists follow. If the Read chunk list is null (a 32 bit word of
   zeros), then there are no chunks to be transferred separately and the
   RPC message follows in its entirety.  If non-null, it marks the
   beginning of an XDR encoded sequence of Read chunk list entries.  If
   the Write chunk list is non-null, then an XDR encoded sequence of
   Write chunk entries follows.

   If the message type is RDMA_MSGP, then two additional fields that
   specify the padding alignment and threshold are inserted prior to the
   Read and Write chunk lists.

   A transport header of message type RDMA_MSG or RDMA_MSGP will be
   followed by the RPC call or reply message, beginning with the XID.
   This XID should match the one at the beginning of the RPC message
   header.




   +--------+---------+---------+-----------+-------------+----------
   |        |         |         | Message   |   NULLs     | RPC Call
   |  XID   | Version | Credits |  Type     |    or       |    or
   |        |         |         |           | Chunk Lists | Reply Msg
   +--------+---------+---------+-----------+-------------+----------

   Note that in the case of RDMA_DONE, no chunk list or RPC message
   follows.  As an implementation hint: a gather operation on the Send
   of the RDMA RPC message can be used to marshal the initial header,
   the chunk list, and the RPC message itself.
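
   A minimal sketch of that hint follows, using a POSIX iovec as the
   gather list; how the list is handed to the RDMA provider's Send
   operation is implementation specific and not shown.

      #include <stddef.h>
      #include <sys/uio.h>

      /* Build a three-element gather list so that one RDMA Send can
       * carry the fixed transport header, the encoded chunk lists, and
       * the RPC message without first copying them into one buffer. */
      static void build_send_gather(struct iovec iov[3],
                                    void *hdr, size_t hdrlen,
                                    void *lists, size_t listlen,
                                    void *rpcmsg, size_t rpclen)
      {
          iov[0].iov_base = hdr;      iov[0].iov_len = hdrlen;
          iov[1].iov_base = lists;    iov[1].iov_len = listlen;
          iov[2].iov_base = rpcmsg;   iov[2].iov_len = rpclen;
      }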


4.2.  XDR Language Description

   Here is the message layout in XDR language.

      struct xdr_rdma_segment {
         uint32 handle;    /* Registered memory handle */
         uint32 length;    /* Length of the chunk in bytes */
         uint64 offset;    /* Chunk virtual address or offset */
      };

      struct xdr_read_chunk {
         uint32 position;               /* Position in XDR stream */
         struct xdr_rdma_segment target;
      };

      struct xdr_read_list {
         struct xdr_read_chunk entry;
         struct xdr_read_list  *next;
      };

      struct xdr_write_chunk {
         struct xdr_rdma_segment target<>;
      };

      struct xdr_write_list {
         struct xdr_write_chunk entry;
         struct xdr_write_list  *next;
      };

      struct rdma_msg {
         uint32    rdma_xid;    /* Mirrors the RPC header xid */
         uint32    rdma_vers;   /* Version of this protocol */
         uint32    rdma_credit; /* Buffers requested/granted */
         rdma_body rdma_body;
      };

      enum rdma_proc {
         RDMA_MSG=0,   /* An RPC call or reply msg */
         RDMA_NOMSG=1, /* An RPC call or reply msg - separate body */
         RDMA_MSGP=2,  /* An RPC call or reply msg with padding */
         RDMA_DONE=3   /* Client signals reply completion */
      };

      union rdma_body switch (rdma_proc proc) {
         case RDMA_MSG:
           rpc_rdma_header rdma_msg;
         case RDMA_NOMSG:
           rpc_rdma_header_nomsg rdma_nomsg;
         case RDMA_MSGP:
           rpc_rdma_header_padded rdma_msgp;
         case RDMA_DONE:
           void;
      };

      struct rpc_rdma_header {
         struct xdr_read_list   *rdma_reads;
         struct xdr_write_list  *rdma_writes;
         struct xdr_write_chunk *rdma_reply;
         /* rpc body follows */
      };

      struct rpc_rdma_header_nomsg {
         struct xdr_read_list   *rdma_reads;
         struct xdr_write_list  *rdma_writes;
         struct xdr_write_chunk *rdma_reply;
      };

      struct rpc_rdma_header_padded {
         uint32                 rdma_align;   /* Padding alignment */
         uint32                 rdma_thresh;  /* Padding threshold */
         struct xdr_read_list   *rdma_reads;
         struct xdr_write_list  *rdma_writes;
         struct xdr_write_chunk *rdma_reply;
         /* rpc body follows */
      };


5.  Large Chunkless Messages

   The receiver of RDMA Send messages is required to have previously
   posted one or more correctly sized buffers.  The client can inform
   the server of the maximum size of its RDMA Send messages via the
   Connection Configuration Protocol described later in this document.

   Since RPC messages are frequently small, memory savings can be
   achieved by posting small buffers.  Even large messages like NFS READ
   or WRITE will be quite small once the chunks are removed from the
   message.  However, there may be large, chunkless messages that would
   demand a very large buffer be posted.  A good example is an NFS
   READDIR reply which may contain a large number of small filename
   strings.  Also, the NFS version 4 protocol [RFC3530] features
   COMPOUND request and reply messages of unbounded length.

   Ideally, each upper layer will negotiate these limits.  However, it
   is frequently necessary to provide a transparent solution.


5.1.  Message as an RDMA Read Chunk

   One relatively simple method is to have the client identify any RPC
   message that exceeds the server's posted buffer size and move it
   separately as a chunk, i.e. reference it as the first entry in the
   read chunk list with an XDR position of zero.

   Normal Message

   +--------+---------+---------+------------+-------------+----------
   |        |         |         |            |             | RPC Call
   |  XID   | Version | Credits |  RDMA_MSG  | Chunk Lists |    or
   |        |         |         |            |             | Reply Msg
   +--------+---------+---------+------------+-------------+----------
   Long Message

   +--------+---------+---------+------------+-------------+
   |        |         |         |            |             |
   |  XID   | Version | Credits | RDMA_NOMSG | Chunk Lists |
   |        |         |         |            |             |
   +--------+---------+---------+------------+-------------+
                                                |
                                                |  +----------
                                                |  | Long RPC Call
                                                +->|    or
                                                   | Reply Message
                                                   +----------

   If the receiver gets a transport header with a message type of
   RDMA_NOMSG and finds an initial read chunk list entry with a zero XDR
   position, it allocates a registered buffer and issues an RDMA Read of
   the long RPC message into it.  The receiver then proceeds to XDR
   decode the RPC message as if it had received it in-line with the Send
   data.  Further decoding may issue additional RDMA Reads to bring over
   additional chunks.

   Although the handling of long messages requires one extra network
   turnaround, in practice these messages should be rare if the posted
   receive buffers are correctly sized, and of course they will be non-
   existent for RDMA-aware upper layers.
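
   A sketch of the receiver-side handling described above follows;
   register_memory() and rdma_read() again stand in for provider calls,
   and the buffer handling is illustrative only.

      #include <stdlib.h>
      #include <stdint.h>

      /* Stand-ins for provider-specific calls. */
      extern int register_memory(void *addr, size_t len,
                                 uint32_t *stag);
      extern int rdma_read(uint32_t remote_stag, uint64_t remote_offset,
                           uint32_t local_stag, void *local,
                           uint32_t len);

      /* On RDMA_NOMSG with a first read chunk at XDR position zero,
       * fetch the whole RPC message before normal decoding begins.
       * The peer's STag, address and length come from that chunk. */
      static void *fetch_long_message(uint32_t peer_stag,
                                      uint64_t peer_offset,
                                      uint32_t msglen)
      {
          uint32_t stag;
          void *buf = malloc(msglen);

          if (buf == NULL)
              return NULL;
          if (register_memory(buf, msglen, &stag) != 0 ||
              rdma_read(peer_stag, peer_offset, stag,
                        buf, msglen) != 0) {
              free(buf);
              return NULL;
          }
          return buf;  /* decode as if received in-line with the Send */
      }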


   An RPC with long reply returned via RDMA Read looks
   like this:

      Client                                Server
         |             RPC Call                |
    Send |   ------------------------------>   |
         |                                     |
         |         RPC Transport Header        |
         |   <------------------------------   | Send
         |                                     |
         |          Long RPC Reply Msg         |
    Read |   ------------------------------+   |
         |   <-----------------------------v   |
         |                                     |
         |               RPC Done              |
    Send |   ------------------------------>   |



5.2.  RDMA Write of Long Replies

   An alternative method of handling long, chunkless RPC replies is to
   have the client post a large buffer into which the server can write a
   large RPC reply. This has the advantage that an RDMA Write may be
   slightly faster in network latency than an RDMA Read.  Additionally,
   it removes the need for the RDMA_DONE message that would be required
   if the large reply were instead returned as a Read chunk.

   This protocol supports direct return of a large reply via the
   inclusion of an optional rdma_reply write chunk after the read chunk
   list and the write chunk list.  The client allocates a buffer sized
   to receive a large reply and enters its STag, address and length in
   the rdma_reply write chunk.  If the reply message is too long to
   return in-line with an RDMA Send (exceeds the size of the client's
   posted receive buffer), even with read chunks removed, then the
   server RDMA writes the RPC reply message into the buffer indicated by
   the rdma_reply chunk.  If the client doesn't provide an rdma_reply
   chunk, or if it's too small, then the message must be returned as a
   Read chunk.

   An RPC with long reply returned via RDMA Write looks
   like this:

      Client                                Server
         |      RPC Call with rdma_reply       |
    Send |   ------------------------------>   |
         |                                     |
         |          Long RPC Reply Msg         |
         |   <------------------------------   | Write
         |                                     |
         |         RPC Transport Header        |
         |   <------------------------------   | Send

   The use of RDMA Write to return long replies requires that the client
   application anticipate a long reply and have some knowledge of its
   size so that a correctly sized buffer can be allocated.  This is
   certainly true of NFS READDIR replies, where the client already
   provides an upper bound on the size of the encoded directory fragment
   to be returned by the server.
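
   The server's choice among the three ways of returning a reply can be
   summarized by the following sketch; the parameter names are
   illustrative and not prescribed by the protocol.

      #include <stdint.h>
      #include <stdbool.h>

      enum reply_method {
          REPLY_INLINE,       /* reply fits in-line with the RDMA Send */
          REPLY_RDMA_WRITE,   /* write reply into the rdma_reply chunk */
          REPLY_READ_CHUNK    /* expose reply for client to RDMA Read */
      };

      /* reply_len is the encoded reply size after read chunks have
       * been removed; inline_limit is the size of the client's posted
       * receive buffer. */
      static enum reply_method choose_reply_method(uint32_t reply_len,
                                              uint32_t inline_limit,
                                              bool has_reply_chunk,
                                              uint32_t reply_chunk_len)
      {
          if (reply_len <= inline_limit)
              return REPLY_INLINE;
          if (has_reply_chunk && reply_len <= reply_chunk_len)
              return REPLY_RDMA_WRITE;
          return REPLY_READ_CHUNK;
      }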


6.  Connection Configuration Protocol

   RDMA Send operations require the receiver to post one or more buffers
   at the RDMA connection endpoint, each large enough to receive the
   largest Send message.  Buffers are consumed as Send messages are
   received.  If a buffer is too small, or if there are no buffers
   posted, the RDMA transport will return an error and break the RDMA
   connection.  The receiver must post sufficient, correctly sized
   buffers to avoid buffer overrun or capacity errors.

   The protocol described above includes only a mechanism for managing
   the number of such receive buffers; it provides no explicit features
   to allow the client and server to provision or control buffer sizing,
   nor any other session parameters.

   In the past, this type of connection management has not been
   necessary for RPC.  RPC over UDP or TCP does not have a protocol to
   negotiate the link.  The server can get a rough idea of the maximum
   size of messages from the server protocol code.  However, a protocol
   to negotiate transport features on a more dynamic basis is desirable.

   The Connection Configuration Protocol allows the client to pass its
   connection requirements to the server, and allows the server to
   inform the client of its connection limits.


6.1.  Initial Connection State

   This protocol will be used for connection setup prior to the use of
   another RPC protocol that uses the RDMA transport. It operates in-
   band, i.e. it uses the connection itself to negotiate the connection
   parameters. To provide a basis for connection negotiation, the
   connection is assumed to provide a basic level of interoperability:
   the ability to exchange at least one RPC message at a time that is at
   least 1 KB in size. The server may exceed this basic level of
   configuration, but the client must not assume it.


6.2.  Protocol Description

   Version 1 of the protocol consists of a single procedure that allows
   the client to inform the server of its connection requirements and
   the server to return connection information to the client.

   The maxcallsize argument is the maximum size of an RPC call message
   that the client will send in-line in an RDMA Send message to the
   server.  The server may return a maxcallsize value that is smaller or
   larger than the client's request.  The client must not send an in-
   line call message larger than what the server will accept.  The
   maxcallsize limits only the size of in-line RPC calls.  It does not
   limit the size of long RPC messages transferred as an initial chunk
   in the Read chunk list.

   The maxreplysize is the maximum size of an in-line RPC message that
   the client will accept from the server.

   The align value is the value recommended by the server for opaque
   data values such as strings and counted byte arrays.  The client can
   use this value to compute the number of prepended pad bytes when XDR
   encoding opaque values in the RPC call message.

      typedef unsigned int uint32;

      struct config_rdma_req {
           uint32  maxcallsize;  /* max size of in-line RPC call */
           uint32  maxreplysize; /* max size of in-line RPC reply */
      };

      struct config_rdma_reply {
           uint32  maxcallsize;  /* max call size accepted by server */
           uint32  align;        /* server's receive buffer alignment */
      };

      program CONFIG_RDMA_PROG {
         version VERS1 {
            /*
             * Config call/reply
             */
            config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
         } = 1;
      } = nnnnnn;  <-- Need program number assigned
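
   For illustration, a client might apply the negotiation result as in
   the sketch below.  The structures simply restate the XDR above in C
   form; how the CONF_RDMA call is actually issued (for example via an
   rpcgen generated stub) is outside this sketch.

      #include <stdint.h>
      #include <stdbool.h>

      struct config_rdma_req {
          uint32_t maxcallsize;   /* max in-line RPC call proposed */
          uint32_t maxreplysize;  /* max in-line RPC reply accepted */
      };

      struct config_rdma_reply {
          uint32_t maxcallsize;   /* max in-line call server accepts */
          uint32_t align;         /* server's receive buffer alignment */
      };

      /* After negotiation, decide whether an encoded call of calllen
       * bytes may go in-line or must be moved as a long message
       * (Section 5): the client must not exceed the server's returned
       * maxcallsize, whatever it originally proposed. */
      static bool must_send_as_long_message(uint32_t calllen,
                                   const struct config_rdma_reply *rep)
      {
          return calllen > rep->maxcallsize;
      }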



7.  Memory Registration Overhead

   RDMA requires that all data be transferred between registered memory
   regions at the source and destination.  All protocol headers as well
   as separately transferred data chunks must use registered memory.
   Since the cost of registering and de-registering memory can be a
   large proportion of the RDMA transaction cost, it is important to
   minimize registration activity.  This is easily achieved within RPC
   controlled memory by allocating chunk list data and RPC headers in a
   reusable way from pre-registered pools.

   The data chunks transferred via RDMA may occupy memory that persists
   outside the bounds of the RPC transaction.  Hence, the default
   behavior of an RDMA transport is to register and de-register these
   chunks on every transaction.  However, this is not a limitation of
   the protocol - only of the existing local RPC API.  The API is easily
   extended through such functions as rpc_control(3) to change the
   default behavior so that the application can assume responsibility
   for controlling memory registration through an RPC-provided
   registered memory allocator.
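
   A pre-registered pool can be as simple as the following sketch;
   register_memory() is a stand-in for the local provider's
   registration call, and the freelist layout is illustrative.

      #include <stdlib.h>
      #include <stdint.h>

      /* Stand-in for the local provider's registration call. */
      extern int register_memory(void *addr, size_t len,
                                 uint32_t *stag);

      struct reg_buf {
          void           *addr;
          uint32_t        stag;
          struct reg_buf *next;     /* freelist link */
      };

      static struct reg_buf *freelist;

      /* Register a buffer once, when the pool is populated... */
      static int pool_add(size_t len)
      {
          struct reg_buf *b = calloc(1, sizeof(*b));

          if (b == NULL)
              return -1;
          b->addr = malloc(len);
          if (b->addr == NULL ||
              register_memory(b->addr, len, &b->stag) != 0) {
              free(b->addr);
              free(b);
              return -1;
          }
          b->next = freelist;
          freelist = b;
          return 0;
      }

      /* ...then reuse it for many RPC headers and chunk lists without
       * registering and de-registering on every transaction. */
      static struct reg_buf *pool_get(void)
      {
          struct reg_buf *b = freelist;

          if (b != NULL)
              freelist = b->next;
          return b;
      }

      static void pool_put(struct reg_buf *b)
      {
          b->next = freelist;
          freelist = b;
      }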


8.  Errors and Error Recovery

   Error reporting and recovery is outside the scope of this protocol.
   It is assumed that the link itself will provide some degree of error
   detection and retransmission.  Additionally, the RPC layer itself can
   accept errors from the link level and recover via retransmission.
   RPC recovery can handle complete loss and re-establishment of the
   link.


9.  Node Addressing

   In setting up a new RDMA connection, the first action by an RPC
   client will be to obtain a transport address for the server.  The
   mechanism used to obtain this address, and to open an RDMA
   connection, is dependent on the type of RDMA transport and is
   outside the scope of this protocol.


10.  RPC Binding

   RPC services normally register with a portmap or rpcbind service,
   which associates an RPC program number with a service address.  In
   the case of UDP or TCP, the service address for NFS is normally port
   2049.  This policy should be no different with RDMA interconnects.

   One possibility is to have the server's portmapper register itself on
   the RDMA interconnect at a "well known" service address.  On UDP or
   TCP, this corresponds to port 111.  A client could connect to this
   service address and use the portmap protocol to obtain a service
   address in response to a program number, e.g. a VI discriminator or
   an Infiniband GID.


11.  Security

   ONC RPC provides its own security via the RPCSEC_GSS framework
   [RFC2203].  RPCSEC_GSS can provide message authentication, integrity
   checking, and privacy.  This security mechanism will be unaffected by
   the RDMA transport.  The data integrity and privacy features alter
   the body of the message, presenting it as a single chunk.  For large
   messages the chunk may be large enough to qualify for RDMA Read
   transfer.  However, there is much data movement associated with
   computation and verification of integrity, or encryption/decryption,
   so any performance advantage will be lost.

   Exposed addresses should raise no new issues.  The only addresses
   exposed are those in the chunk lists and in the transport packets
   generated by RDMA operations.  The data contained in these addresses
   is adequately protected by RPCSEC_GSS integrity and privacy.
   RPCSEC_GSS security mechanisms are typically implemented by the host
   CPU.  This additional data movement and CPU use may cancel out much
   of the RDMA direct placement and offload benefit.

   A more appropriate security mechanism for RDMA links may be link-
   level protection, like IPSec, which may be co-located in the RDMA
   link hardware. The use of link-level protection may be negotiated
   through the use of a new RPCSEC_GSS mechanism like the Credential
   Cache GSS Mechanism (CCM) [CCM].


12.  IANA Considerations

   As a new RPC transport, this protocol should have no effect on RPC
   program numbers or registered port numbers.  The new RPC transport
   should be assigned a new RPC "netid".


13.  Acknowledgements

   The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak,
   Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve
   Kleiman, Mike Eisler, Mark Wittle and Shantanu Mehendale for their
   contributions to this document.


14.  References

   [RDMA] R. Recio et al, "An RDMA Protocol Specification",
      Internet Draft, February 2003,
      http://www.ietf.org/internet-drafts/
         draft-ietf-rddp-rdmap-00.txt

   [CCM] M. Eisler, "CCM: The Credential Cache GSS Mechanism",
      Internet Draft, February 2003,
      http://www.ietf.org/internet-drafts/
         draft-eisler-nfsv4-ccm-00.txt

   [NFSRDMA]
      T. Talpey, S. Shepler, "NFSv4 RDMA and Session Extensions"
      http://www.ietf.org/internet-drafts/
         draft-talpey-nfsv4-rdma-sess-00.txt

   [NFSDDP]
      B. Callaghan, T. Talpey, "NFS Direct Data Placement"
      Internet Draft, May 2003,
      http://www.ietf.org/internet-drafts/
         draft-callaghan-nfsdirect-00.txt

   [RFC1831]
      R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification
      Version 2",
      Standards Track RFC,
      http://www.ietf.org/rfc/rfc1831.txt

   [RFC1832]
      R. Srinivasan, "XDR: External Data Representation Standard",
      Standards Track RFC,
      http://www.ietf.org/rfc/rfc1832.txt

   [RFC1813]
      B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol
      Specification",
      Informational RFC,
      http://www.ietf.org/rfc/rfc1813.txt

   [RFC3530]
      S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M.
      Eisler, D. Noveck, "NFS version 4 Protocol",
      Standards Track RFC,
      http://www.ietf.org/rfc/rfc3530.txt

   [RFC2203]
      M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol Specification",
      Standards Track RFC,
      http://www.ietf.org/rfc/rfc2203.txt

   [RDDP]
      Remote Direct Data Placement Working Group Charter,
      http://www.ietf.org/html.charters/rddp-charter.html

   [RDDPPS]
      Remote Direct Data Placement Working Group Problem Statement,
      A. Romanow, J. Mogul, T. Talpey, S. Bailey,
      http://www.ietf.org/internet-drafts/
         draft-ietf-rddp-problem-statement-00.txt

   [IB]
      Infiniband Architecture Specification,
      http://www.infinibandta.org


15.  Authors' Addresses



           Brent Callaghan
           Sun Microsystems, Inc.
           17 Network Circle
           Menlo Park, California 94025 USA

           Phone: +1 650 786 5067
           EMail: brent.callaghan@sun.com


           Tom Talpey
           Network Appliance, Inc.
           375 Totten Pond Road
           Waltham, MA 02451 USA

           Phone: +1 781 768 5329
           EMail: thomas.talpey@netapp.com



16.  Full Copyright Statement


   Copyright (C) The Internet Society (2003).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.











