                                               S. Bailey    (Sandburst)
Internet-draft Expires: July 2002

            The Remote Direct Memory Access Protocol (iWarp)
                        draft-bailey-roi-rdma-00


Status of this Memo

     This document is an Internet-Draft and is in full conformance with
     all provisions of Section 10 of RFC2026.

     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups.  Note that
     other groups may also distribute working documents as Internet-
     Drafts.

     Internet-Drafts are draft documents valid for a maximum of six
     months and may be updated, replaced, or obsoleted by other
     documents at any time.  It is inappropriate to use Internet-Drafts
     as reference material or to cite them other than as "work in
     progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.

Copyright Notice

     Copyright (C) The Internet Society (2002). All Rights Reserved.


Abstract


     This document defines a Remote Direct Memory Access Protocol
     (iWarp) to run on the Direct Data Placement Protocol (DDPP) [DDPP].
     This initial draft is an incomplete sketch of iWarp to be used only
     as the basis of discussion of protocol and architectural issues
     with DDPP and RDMA.

Table Of Contents

     1.    Introduction  . . . . . . . . . . . . . . . . . . . . . .   2
     2.    Flow Control in iWarp . . . . . . . . . . . . . . . . . .   2
     3.    Use of DDPP Message Identifiers In iWarp  . . . . . . . .   3



Bailey                      Expires July 2002                   [Page 1]

Internet-Draft            RDMA Protocol (iWarp)         12 February 2002


     4.    RDMA Write In iWarp . . . . . . . . . . . . . . . . . . .   4
     5.    RDMA Read In iWarp  . . . . . . . . . . . . . . . . . . .   4
     6.    Send In iWarp . . . . . . . . . . . . . . . . . . . . . .   6
     7.    Credit Return Message . . . . . . . . . . . . . . . . . .   8
     8.    Errors In iWarp . . . . . . . . . . . . . . . . . . . . .   9
     9.    Transport Characteristics In iWarp  . . . . . . . . . . .   9
     10.   Operation Ordering In iWarp . . . . . . . . . . . . . . .   9
     10.1. Ordering On Reliable, Ordered Transports  . . . . . . . .  10
     10.2. Ordering On Reliable, Unordered Transports  . . . . . . .  10
     10.3. Ordering On Unreliable, Ordered Transports  . . . . . . .  10
     10.4. Ordering On Unreliable, Unordered Transports  . . . . . .  10
     11.   Transport Topology In iWarp . . . . . . . . . . . . . . .  10
     12.   Negotiating iWarp . . . . . . . . . . . . . . . . . . . .  10
     13.   Security Considerations . . . . . . . . . . . . . . . . .  11
     14.   IANA Considerations . . . . . . . . . . . . . . . . . . .  11
           References  . . . . . . . . . . . . . . . . . . . . . . .  11
           Author's Address  . . . . . . . . . . . . . . . . . . . .  11
           Full Copyright Statement  . . . . . . . . . . . . . . . .  11



1.  Introduction

     This document defines a Remote Direct Memory Access Protocol
     (iWarp) to run on the Direct Data Placement Protocol (DDPP) [DDPP].
     This initial draft is an incomplete sketch of iWarp to be used only
     as the basis of discussion of protocol and architectural issues
     with DDPP and RDMA.

     iWarp follows the architecture and terminology of `The Architecture
     of Direct Data Placement (DDP) And Remote Direct Memory Access
     (RDMA) On Internet Protocols' (DRARCH) [DRARCH].  A thorough
     understanding of DRARCH is necessary to understand this document.

     iWarp defines three data transfer operations:

     o    RDMA Write

     o    RDMA Read

     o    Send (an undecorated message)


2.  Flow Control in iWarp

     While it is straightforward for client protocols to implement flow
     control over iWarp protocol resources, iWarp defines its own flow
     control because many client protocols prefer not to handle this
     detail.

     iWarp flow control is credit-based, with two distinct pools of
     credits:

     o    Send and Notifying RDMA Write credits,

     o    RDMA Read Request credits.

     iWarp MAY submit one complete client protocol Send or Notifying
     RDMA Write (an RDMA Write which requests a reception indication) to
     DDPP for each Send and Notifying RDMA Write credit.  Initial Send
     and Notifying RDMA Write credits are established when iWarp is
     enabled and may be returned in any iWarp message.  Client protocols
     MAY choose to use Send and Notifying Write flow control or not.

     iWarp MAY submit one RDMA Read Request to DDPP for each RDMA Read
     Request credit.  Initial RDMA Read Request credits are established
     when iWarp is enabled and one credit is returned by each completed
     RDMA Read.  Client protocols MAY choose to use RDMA Read Request
     flow control or not.
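
     As a rough illustration of the credit accounting described above,
     the following Python sketch models a single pool of credits.  The
     class name CreditPool and its methods are hypothetical; they are
     not defined by iWarp or DDPP.  An endpoint would keep one instance
     for Send and Notifying RDMA Write credits and another for RDMA
     Read Request credits.

```python
class CreditPool:
    """One pool of iWarp credits (hypothetical helper, not part of the draft)."""

    def __init__(self, initial, enabled=True):
        if initial < 0:
            raise ValueError("initial credits must be non-negative")
        self.enabled = enabled      # whether the client protocol uses this flow control
        self.available = initial    # credits established when iWarp is enabled

    def try_consume(self):
        """Return True if one operation may be submitted to DDPP now."""
        if not self.enabled:
            return True             # flow control not in use: never blocks
        if self.available == 0:
            return False            # caller must wait for credits to be returned
        self.available -= 1
        return True

    def grant(self, count):
        """Add credits returned by the peer."""
        if count < 0:
            raise ValueError("credit count must be non-negative")
        if self.enabled:
            self.available += count
```

     Under these assumptions, a sender calls try_consume() before
     submitting each Send, Notifying RDMA Write or RDMA Read Request,
     and calls grant() with the Credits value of each received message,
     or with 1 when an RDMA Read completes.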

3.  Use of DDPP Message Identifiers In iWarp

     iWarp uses the first 11 bits of DDPP's Message Identifier for its
     own purposes.  The remaining bits (at least 4, probably 20) are
     available to client protocols.  iWarp's 11 Message Identifier bits
     are:


     0 1 2 3 4 5 6 7 8 9 10
     +-+-+-+-+-+-+-+-+-+-+-+
     |R|F|N|    Credits    |
     +-+-+-+-+-+-+-+-+-+-+-+


     R - Read Reply Flag : 1 bit (boolean flag)

          if set to 1, the message is RDMA Read data, otherwise, it is
          RDMA Write data.

     F - Final RDMA Data Flag : 1 bit (boolean flag)

          if set to 1, the message is the last in a group for a single
          RDMA Write or RDMA Read Response.

     N - Notifying RDMA Write Flag : 1 bit (boolean flag)

          if set to 1, the message is the last in a group for a single,
          notifying RDMA Write.

     Credits : 8 bits (unsigned integer)

          Amount by which to increase the Send and Notifying RDMA Write
          credits.  MUST be 0 if Send and Notifying RDMA Write flow
          control is disabled.
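
     The 11-bit layout above can be sketched in Python as follows.
     This is only an illustration: it assumes that bit 0 in the diagram
     is the most significant bit, as is conventional in IETF diagrams,
     and the names IwarpMsgId, pack_msg_id_bits and unpack_msg_id_bits
     are hypothetical.  How these 11 bits are positioned inside DDPP's
     Message Identifier is left to DDPP and is not modelled here.

```python
from typing import NamedTuple


class IwarpMsgId(NamedTuple):
    read_reply: bool   # R flag
    final: bool        # F flag
    notify: bool       # N flag
    credits: int       # 8-bit credit increment


def pack_msg_id_bits(fields):
    """Pack the iWarp portion of the Message Identifier into an 11-bit integer."""
    if not 0 <= fields.credits <= 0xFF:
        raise ValueError("Credits must fit in 8 bits")
    return (
        (int(fields.read_reply) << 10)   # bit 0 of the diagram (assumed MSB)
        | (int(fields.final) << 9)       # bit 1
        | (int(fields.notify) << 8)      # bit 2
        | fields.credits                 # bits 3-10
    )


def unpack_msg_id_bits(value):
    """Inverse of pack_msg_id_bits."""
    if not 0 <= value < (1 << 11):
        raise ValueError("value must fit in 11 bits")
    return IwarpMsgId(
        read_reply=bool((value >> 10) & 1),
        final=bool((value >> 9) & 1),
        notify=bool((value >> 8) & 1),
        credits=value & 0xFF,
    )
```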

4.  RDMA Write In iWarp

     An iWarp RDMA Write operation is a group of one or more
     DDP-decorated messages with the Message Identifier field set as
     defined above.

     A DDP-decorated message that is part of an RDMA Write MUST have
     Notify set when:

     o    it is the final DDP-decorated message in an RDMA Write which
          is requesting a completion indication, or

     o    the Credits value is not zero.
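
     The two conditions above amount to a small predicate.  The sketch
     below is illustrative only; the function name and its parameters
     are hypothetical.

```python
def write_message_needs_notify(is_final, wants_indication, credits):
    """Decide whether a DDP-decorated message of an RDMA Write must have Notify set.

    is_final          -- this is the final DDP-decorated message of the RDMA Write
    wants_indication  -- the RDMA Write requests a completion indication
    credits           -- Credits value carried by this message
    """
    return (is_final and wants_indication) or credits != 0
```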

5.  RDMA Read In iWarp

     An iWarp RDMA Read operation is:

     o    an RDMA Read Request containing source and destination buffer
          addresses and RDMA Read size,

     o    an RDMA Read Response of one or more DDP-decorated messages
          targeting the destination buffer address with the data from
          the source buffer address.

     An RDMA Read Request is an undecorated message:

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |Tp=0x01|   R   |    Credits    |               R               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                          Source STag                          |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +                         Source Offset                         +
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                           Read Size                           |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                        Destination STag                       |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +                       Destination Offset                      +
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


     Tp - Message Type : 4 bit (unsigned integer)

          Undecorated message type.  Must be 0x01 for an RDMA Read
          Request.

     R - Reserved

          Sender SHOULD set to 0, receiver MUST ignore.

     Credits : 8 bit (unsigned integer)

          Amount by which to increase the Send and Notifying RDMA Write
          credits.

     Source STag : 32 bit (unsigned integer)

          The steering tag identifying the source buffer from which to
          retrieve the RDMA Read data.

     Source Offset : 64 bits (unsigned integer)

          The offset in the source buffer from which to retrieve the
          RDMA Read data.

     Read Size : 32 bits (unsigned integer)

          The number of octets of data to be read from the source
          address.

     Destination STag : 32 bit (unsigned integer)

          The steering tag identifying the destination buffer in which
          to place the RDMA Read data.

     Destination Offset: 64 bits (unsigned integer)

          The offset in the destination buffer at which to place the
          RDMA Read data.
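
     A minimal Python sketch of encoding and decoding the RDMA Read
     Request laid out above follows.  It assumes network (big-endian)
     byte order with bit 0 as the most significant bit, which this
     draft does not state explicitly, and the function names are
     hypothetical.

```python
import struct

RDMA_READ_REQUEST = 0x01

# Tp+R (1 octet), Credits (1), Reserved (2), Source STag (4), Source Offset (8),
# Read Size (4), Destination STag (4), Destination Offset (8) = 32 octets.
_READ_REQ = struct.Struct(">BBHIQIIQ")


def pack_read_request(credits, src_stag, src_offset, read_size,
                      dst_stag, dst_offset):
    """Build the 32-octet RDMA Read Request; reserved bits are sent as 0."""
    first_octet = RDMA_READ_REQUEST << 4   # Tp in the high 4 bits
    return _READ_REQ.pack(first_octet, credits, 0, src_stag, src_offset,
                          read_size, dst_stag, dst_offset)


def unpack_read_request(message):
    """Parse an RDMA Read Request; reserved bits are ignored on receipt."""
    (first_octet, credits, _reserved, src_stag, src_offset,
     read_size, dst_stag, dst_offset) = _READ_REQ.unpack(message)
    if first_octet >> 4 != RDMA_READ_REQUEST:
        raise ValueError("not an RDMA Read Request")
    return credits, src_stag, src_offset, read_size, dst_stag, dst_offset
```

     struct raises struct.error if a field value does not fit its width
     or if the message is not exactly 32 octets long.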

     An RDMA Read Response is a group of one or more DDP-decorated
     messages with the Message Identifier field set as defined above.  A
     DDP-decorated message that is part of an RDMA Read Response MUST
     have Notify set when:

     o    it is the final DDP-decorated message in an RDMA Read
          Response, or

     o    the Credits value is not zero.

     The client protocol portion of the Message Identifier field of the
     DDP-decorated messages in an RDMA Read Response may be chosen by
     the client protocol.  This allows the client protocol to
     distinguish among RDMA Read Responses for multiple outstanding RDMA
     Read Requests.  Allowing the client protocol to select a portion of
     the Message Identifier permits a different interface from DRARCH's
     synchronous rdma_read().  However, DRARCH's rdma_read() can be
     implemented in iWarp by having each outstanding call to rdma_read()
     automatically select a different client protocol portion of the
     Message Identifier.

     An RDMA Read Response MUST transfer exactly Read Size octets, or
     result in an error.
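
     One way an implementation might give each outstanding rdma_read()
     its own client protocol portion of the Message Identifier, and
     check that exactly Read Size octets arrive, is sketched below.
     The class OutstandingReads, its methods and the default of 4
     client bits are assumptions made for illustration.  The sketch
     also assumes a reliable, ordered transport, so that the message
     carrying the Final RDMA Data Flag arrives last.

```python
class OutstandingReads:
    """Track outstanding RDMA Reads by client-chosen identifier (hypothetical)."""

    def __init__(self, client_bits=4):
        self._free = list(range(1 << client_bits))  # unused identifiers
        self._pending = {}                          # id -> [read_size, received]

    def start(self, read_size):
        """Reserve a distinct identifier for a new RDMA Read Request."""
        if not self._free:
            raise RuntimeError("all client identifiers are in use")
        client_id = self._free.pop(0)
        self._pending[client_id] = [read_size, 0]
        return client_id

    def on_response(self, client_id, payload_len, final):
        """Account for one RDMA Read Response message; return True when complete."""
        entry = self._pending[client_id]
        entry[1] += payload_len
        read_size, received = entry
        if received > read_size or (final and received != read_size):
            self._release(client_id)
            raise ValueError("RDMA Read Response did not transfer "
                             "exactly Read Size octets")
        if final:
            self._release(client_id)
        return final

    def _release(self, client_id):
        del self._pending[client_id]
        self._free.append(client_id)
```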

6.  Send In iWarp

     An iWarp Send is an undecorated message of up to 2^31-1 octets.

     To permit efficient implementation, each Send is identified by a
     Send Sequence Number.  The Send Sequence Number is not visible to
     client protocols.  The first Send after iWarp is enabled MUST have
     Send Sequence Number 0.  Each subsequent Send MUST have a Send
     Sequence number of 1 + the Send Sequence Number of the previous
     Send.

     The data from a single client protocol-submitted Send is sent as a
     group of one or more Send messages where:

     o    Each Send message of the group MUST have the same Send
          Sequence Number.

     o    Each Send message of the group MUST have Send Offset equal to
          the offset in the client protocol-submitted Send of its first
          octet of data.

     o    Send messages other than the last of the group MUST NOT have
          the Final Message Flag set.

     o    The last Send message of the group MUST have the Final Message
          Flag set.
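
     The grouping rules above can be sketched in Python as follows.
     The function name segment_send and the max_payload parameter are
     hypothetical; this draft does not say how the per-message payload
     limit is chosen.  Each returned tuple holds the Send Sequence
     Number, Send Offset, Final Message Flag and data for one Send
     message.

```python
def segment_send(data, send_seq, max_payload):
    """Split one client protocol-submitted Send into a group of Send messages."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    segments = []
    offset = 0
    while True:
        chunk = data[offset:offset + max_payload]
        # Only the last message of the group carries the Final Message Flag.
        final = offset + len(chunk) >= len(data)
        segments.append((send_seq, offset, final, chunk))
        if final:
            break
        offset += len(chunk)
    return segments
```

     A zero-length Send yields a single message with an empty payload
     and the Final Message Flag set.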

     An individual Send message is an undecorated message:

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |Tp=0x02|F|  R  |    Credits    |               R               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                      Send Sequence Number                     |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                           Send Offset                         |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     ~                          Data Payload                         ~
     ~                                                               ~
     |               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |               |
     +-+-+-+-+-+-+-+-+


     Tp - Message Type : 4 bit (unsigned integer)

          Undecorated message type.  Must be 0x02 for a Send.

     F - Final Message Flag : 1 bit (boolean flag)

          if set to 1, this is the final Send message of a group
          carrying the data for a single client protocol-submitted Send

     R - Reserved

          Sender SHOULD set to 0, receiver MUST ignore.

     Credits : 8 bit (unsigned integer)

          Amount by which to increase the Send and Notifying RDMA Write
          credits.

     Send Sequence Number : 32 bit (unsigned integer)

          The sequence number of the client protocol-submitted Send.

     Send Offset : 32 bit (unsigned integer)

          The offset of Data Payload in the client protocol-submitted
          Send.

     Data Payload : 0-2^31-1 octets (opaque data)

          data from the client protocol-submitted Send.
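
     The Send message layout above might be encoded and decoded as in
     the following Python sketch.  It assumes network (big-endian) byte
     order with bit 0 as the most significant bit, which this draft
     does not state explicitly, and the function names are
     hypothetical.

```python
import struct

SEND = 0x02
FINAL_FLAG = 0x08   # F is bit 4 of the first octet, counting from the MSB

# Tp+F+R (1 octet), Credits (1), Reserved (2), Send Sequence Number (4),
# Send Offset (4) = 12 octets of header before the Data Payload.
_SEND_HDR = struct.Struct(">BBHII")


def pack_send_message(send_seq, send_offset, final, credits, payload):
    """Build one Send message; reserved bits are sent as 0."""
    first_octet = (SEND << 4) | (FINAL_FLAG if final else 0)
    header = _SEND_HDR.pack(first_octet, credits, 0, send_seq, send_offset)
    return header + payload


def unpack_send_message(message):
    """Parse one Send message; reserved bits are ignored on receipt."""
    first_octet, credits, _reserved, send_seq, send_offset = \
        _SEND_HDR.unpack_from(message)
    if first_octet >> 4 != SEND:
        raise ValueError("not a Send message")
    final = bool(first_octet & FINAL_FLAG)
    return send_seq, send_offset, final, credits, message[_SEND_HDR.size:]
```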

7.  Credit Return Message

     If Send and Notifying Write credits cannot be returned in a client
     protocol data transfer message, possibly because no client protocol
     data transfer is in progress, credits can be returned with a Credit
     Return Message.

     A Credit Return Message is an undecorated message:

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |Tp=0x03|   R   |    Credits    |               R               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


     Tp - Message Type : 4 bit (unsigned integer)

          Undecorated message type.  Must be 0x03 for a Credit Return
          Message.

     R - Reserved

          Sender SHOULD set to 0, receiver MUST ignore.

     Credits : 8 bit (unsigned integer)

          Amount by which to increase the Send and Notifying RDMA Write
          credits.
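
     Because Credits is an 8-bit field, a single Credit Return Message
     can carry at most 255 credits.  The Python sketch below builds as
     many messages as are needed to return a larger count; splitting
     the count this way is an inference from the field width rather
     than a rule stated in this draft.  It assumes bit 0 of the diagram
     is the most significant bit, and the function name is
     hypothetical.

```python
import struct

CREDIT_RETURN = 0x03
MAX_CREDITS_PER_MESSAGE = 0xFF   # Credits is an 8-bit unsigned field


def credit_return_messages(credits):
    """Return the Credit Return Messages needed to return `credits` credits."""
    if credits < 0:
        raise ValueError("credit count must be non-negative")
    messages = []
    while credits > 0:
        n = min(credits, MAX_CREDITS_PER_MESSAGE)
        # Tp in the high 4 bits of the first octet; reserved bits sent as 0.
        messages.append(struct.pack(">BBH", CREDIT_RETURN << 4, n, 0))
        credits -= n
    return messages
```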




8.  Errors In iWarp

     [TODO]

9.  Transport Characteristics In iWarp

     The effect of transport characteristics on operation ordering in
     iWarp is discussed below.

     In addition to operation ordering, transport characteristics also
     interact with iWarp in other ways:

     o    RDMA Write, RDMA Read and Sends larger than a single transport
          message don't work with unordered or unreliable transports.

     o    Flow control doesn't work with unreliable transports.

     o    Flow control doesn't work with multisource transports.

     o    RDMA Read flow control doesn't work with multidestination
          transports.

     o    [TODO] Others?

10.  Operation Ordering In iWarp

     The ordering among:

     o    set()s,

     o    get()s,

     o    Sends,

     o    RDMA Write reception indications, and

     o    RDMA Read completion indications

     and their relationship to corresponding operations on the sender is
     defined in iWarp according to underlying transport characteristics:

     o    reliable or unreliable, and

     o    ordered or unordered.

     TODO: Now complicated stuff, especially about get() ordering.

10.1.  Ordering On Reliable, Ordered Transports

     On a reliable, ordered transport, iWarp:

     o    [TODO]

10.2.  Ordering On Reliable, Unordered Transports

     On a reliable, unordered transport, iWarp:

     o    [TODO]

10.3.  Ordering On Unreliable, Ordered Transports

     On an unreliable, ordered transport, iWarp:

     o    [TODO]

10.4.  Ordering On Unreliable, Unordered Transports

     On an unreliable, unordered transport, in general, no additional,
     transport-dependent rules apply to iWarp. [TODO?]

11.  Transport Topology In iWarp

     Transports support some combination of:

     o    single source, or multisource, and

     o    single destination, or multidestination (multicast or
          anycast).

     When running iWarp on a multisource transport, flow control MUST
     NOT be enabled.

     When running iWarp on a multidestination transport, RDMA Read flow
     control MUST NOT be enabled.


12.  Negotiating iWarp

     Negotiating the use of iWarp is the sole responsibility of the
     client protocol.  iWarp is a duplex protocol, and must be enabled
     reciprocally in both directions by a pair of participants.  Some
     client protocols (e.g. RDMA) MAY choose to require iWarp a priori,
     while others MAY define an in- or out-of-band negotiation process
     to dynamically enable iWarp.  Whatever the case, a client protocol
     using iWarp MUST establish:

     o    Use of Send and Notifying RDMA Write flow control,

     o    Initial Send and Notifying RDMA Write credits (if enabled),

     o    Use of RDMA Read Request flow control,

     o    Initial RDMA Read Request credits (if enabled).
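
     The four items a client protocol must establish, together with the
     topology restrictions of Section 11, could be captured as in the
     Python sketch below.  The names IwarpParameters and validate are
     hypothetical; how the values are actually negotiated is left to
     the client protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IwarpParameters:
    send_flow_control: bool      # use of Send and Notifying RDMA Write flow control
    initial_send_credits: int    # meaningful only if send_flow_control is True
    read_flow_control: bool      # use of RDMA Read Request flow control
    initial_read_credits: int    # meaningful only if read_flow_control is True


def validate(params, multisource=False, multidestination=False):
    """Return a list of problems; an empty list means the settings are usable."""
    problems = []
    if multisource and (params.send_flow_control or params.read_flow_control):
        problems.append("flow control MUST NOT be enabled on a "
                        "multisource transport")
    if multidestination and params.read_flow_control:
        problems.append("RDMA Read flow control MUST NOT be enabled on a "
                        "multidestination transport")
    if params.send_flow_control and params.initial_send_credits < 0:
        problems.append("initial Send credits must be non-negative")
    if params.read_flow_control and params.initial_read_credits < 0:
        problems.append("initial RDMA Read Request credits must be non-negative")
    return problems
```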

13.  Security Considerations

     [TODO]

14.  IANA Considerations

     [TODO]

15.  References

     [DDPP]
          Bailey, S., "The Direct Data Placement Protocol (DDPP) Core",
          February 2002.
          http://www.cs.uchicago.edu/~steph/draft-bailey-roi-ddpp-core-00.txt


     [DRARCH]
          Bailey, S., "The Architecture of Direct Data Placement (DDP)
          And Remote Direct Memory Access (RDMA) On Internet Protocols",
          February 2002.
          http://www.cs.uchicago.edu/~steph/draft-bailey-roi-ddp-rdma-arch-00.txt

Author's Address


     Stephen Bailey
     Sandburst Corporation
     600 Federal Street
     Andover, MA  01810
     USA

     Email: steph@sandburst.com


Full Copyright Statement

     Copyright (C) The Internet Society (2002). All Rights Reserved.

     This document and translations of it may be copied and furnished to
     others, and derivative works that comment on or otherwise explain
     it or assist in its implementation may be prepared, copied,
     published and distributed, in whole or in part, without restriction
     of any kind, provided that the above copyright notice and this
     paragraph are included on all such copies and derivative works.
     However, this document itself may not be modified in any way, such
     as by removing the copyright notice or references to the Internet
     Society or other Internet organizations, except as needed for the
     purpose of developing Internet standards in which case the
     procedures for copyrights defined in the Internet Standards process
     must be followed, or as required to translate it into languages
     other than English.

     The limited permissions granted above are perpetual and will not be
     revoked by the Internet Society or its successors or assigns.

     This document and the information contained herein is provided on
     an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
     ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
     IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
     THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
     WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.