S. Bailey (Sandburst)
Internet-draft
Expires: July 2002

          The Remote Direct Memory Access Protocol (iWarp)
                      draft-bailey-roi-rdma-00

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2002).  All Rights Reserved.

Abstract

   This document defines a Remote Direct Memory Access Protocol (iWarp)
   to run on the Direct Data Placement Protocol (DDPP) [DDPP].  This
   initial draft is an incomplete sketch of iWarp to be used only as
   the basis of discussion of protocol and architectural issues with
   DDPP and RDMA.

Table Of Contents

   1.    Introduction
   2.    Flow Control in iWarp
   3.    Use of DDPP Message Identifiers In iWarp

Bailey                     Expires July 2002                    [Page 1]

Internet-Draft           RDMA Protocol (iWarp)          12 February 2002

   4.    RDMA Write In iWarp
   5.    RDMA Read In iWarp
   6.    Send In iWarp
   7.    Credit Return Message
   8.    Errors In iWarp
   9.    Transport Characteristics In iWarp
   10.   Operation Ordering In iWarp
   10.1. Ordering On Reliable, Ordered Transports
   10.2. Ordering On Reliable, Unordered Transports
   10.3. Ordering On Unreliable, Ordered Transports
   10.4. Ordering On Unreliable, Unordered Transports
   11.   Transport Topology In iWarp
   12.   Negotiating iWarp
   13.   Security Considerations
   14.   IANA Considerations
         References
         Author's Address
         Full Copyright Statement

1.  Introduction

   This document defines a Remote Direct Memory Access Protocol (iWarp)
   to run on the Direct Data Placement Protocol (DDPP) [DDPP].  This
   initial draft is an incomplete sketch of iWarp to be used only as
   the basis of discussion of protocol and architectural issues with
   DDPP and RDMA.

   iWarp follows the architecture and terminology of `The Architecture
   of Direct Data Placement (DDP) And Remote Direct Memory Access
   (RDMA) On Internet Protocols' (DRARCH) [DRARCH].  A thorough
   understanding of DRARCH is necessary to understand this document.

   iWarp defines three data transfer operations:

   o  RDMA Write

   o  RDMA Read

   o  Send (an undecorated message)

2.  Flow Control in iWarp

   While it is straightforward for client protocols to implement flow
   control over iWarp protocol resources, iWarp defines its own flow
   control because many client protocols prefer not to handle this
   detail.

   iWarp flow control is credit-based, with two distinct pools of
   credits:

   o  Send and Notifying RDMA Write credits,

   o  RDMA Read Request credits.

   iWarp MAY submit one complete client protocol Send or Notifying RDMA
   Write (an RDMA Write which requests a reception indication) to DDPP
   for each Send and Notifying RDMA Write credit.  Initial Send and
   Notifying RDMA Write credits are established when iWarp is enabled,
   and credits may be returned in any iWarp message.  Client protocols
   MAY choose to use Send and Notifying Write flow control or not.

   iWarp MAY submit one RDMA Read Request to DDPP for each RDMA Read
   Request credit.  Initial RDMA Read Request credits are established
   when iWarp is enabled, and one credit is returned by each completed
   RDMA Read.  Client protocols MAY choose to use RDMA Read Request
   flow control or not.

3.  Use of DDPP Message Identifiers In iWarp

   iWarp uses the first 11 bits of DDPP's Message Identifier for its
   own purposes.  The remaining bits (at least 4, probably 20) are
   available for use by client protocols.

   iWarp's 11 Message Identifier bits are:

    0 1 2 3 4 5 6 7 8 9 10
   +-+-+-+-+-+-+-+-+-+-+-+
   |R|F|N|    Credits    |
   +-+-+-+-+-+-+-+-+-+-+-+

   R - Read Reply Flag : 1 bit (boolean flag)

      if set to 1, the message is RDMA Read data, otherwise, it is
      RDMA Write data.

   F - Final RDMA Data Flag : 1 bit (boolean flag)

      if set to 1, the message is the last in a group for a single
      RDMA Write or RDMA Read Response.

   N - Notifying RDMA Write Flag : 1 bit (boolean flag)

      if set to 1, the message is the last in a group for a single,
      notifying RDMA Write.

   Credits : 8 bits (unsigned integer)

      Amount by which to increase the Send and Notifying RDMA Write
      credits.  MUST be 0 if Send and Notifying RDMA Write flow control
      is disabled.

4.  RDMA Write In iWarp

   An iWarp RDMA Write operation is a group of one or more
   DDP-decorated messages with the Message Identifier field set as
   defined above.
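
   As an illustration only, the 11 iWarp bits of the Message Identifier
   might be packed and unpacked as sketched below.  The function names
   are hypothetical, and the sketch assumes that bit 0 in the diagram
   is the most significant of the 11 bits, which this draft does not
   state explicitly.

```python
# Illustrative sketch, not part of the protocol definition.
# Assumption: bit 0 of the diagram (R) is the most significant of the
# 11 iWarp bits. All names here are hypothetical.

IWARP_BITS = 11
CREDITS_MAX = 0xFF  # Credits is an 8-bit unsigned integer


def pack_iwarp_bits(read_reply, final, notifying, credits):
    """Return the 11 iWarp Message Identifier bits as an int (0..2047)."""
    if not 0 <= credits <= CREDITS_MAX:
        raise ValueError("Credits must fit in 8 bits")
    return (
        (int(bool(read_reply)) << 10)   # R - Read Reply Flag
        | (int(bool(final)) << 9)       # F - Final RDMA Data Flag
        | (int(bool(notifying)) << 8)   # N - Notifying RDMA Write Flag
        | credits                       # Credits, low 8 bits
    )


def unpack_iwarp_bits(value):
    """Split an 11-bit value back into its four fields."""
    if not 0 <= value < (1 << IWARP_BITS):
        raise ValueError("value must fit in 11 bits")
    return {
        "read_reply": bool((value >> 10) & 1),
        "final": bool((value >> 9) & 1),
        "notifying": bool((value >> 8) & 1),
        "credits": value & CREDITS_MAX,
    }


if __name__ == "__main__":
    # Final message of a notifying RDMA Write that also returns 5 credits.
    bits = pack_iwarp_bits(False, True, True, 5)
    print(format(bits, "011b"), unpack_iwarp_bits(bits))
```

   The client protocol portion of the Message Identifier is left out of
   the sketch because its width depends on DDPP.
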

   A DDP-decorated message that is part of an RDMA Write MUST have
   Notify set when:

   o  it is the final DDP-decorated message in an RDMA Write which is
      requesting a completion indication, or

   o  the Credits value is not zero.

5.  RDMA Read In iWarp

   An iWarp RDMA Read operation is:

   o  an RDMA Read Request containing source and destination buffer
      addresses and RDMA Read size,

   o  an RDMA Read Response of one or more DDP-decorated messages
      targeting the destination buffer address with the data from the
      source buffer address.

   An RDMA Read Request is an undecorated message:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Tp=0x01|   R   |    Credits    |               R               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Source STag                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                         Source Offset                         +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Read Size                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Destination STag                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                      Destination Offset                       +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Tp - Message Type : 4 bit (unsigned integer)

      Undecorated message type.  Must be 0x01 for an RDMA Read Request.

   R - Reserved

      Sender SHOULD set to 0, receiver MUST ignore.

   Credits : 8 bit (unsigned integer)

      Amount by which to increase the Send and Notifying RDMA Write
      credits.

   Source STag : 32 bit (unsigned integer)

      The steering tag identifying the source buffer from which to
      retrieve the RDMA Read data.

   Source Offset : 64 bits (unsigned integer)

      The offset in the source buffer from which to retrieve the RDMA
      Read data.

   Read Size : 32 bits (unsigned integer)

      The number of octets of data to be read from the source address.

   Destination STag : 32 bit (unsigned integer)

      The steering tag identifying the destination buffer in which to
      place the RDMA Read data.

   Destination Offset : 64 bits (unsigned integer)

      The offset in the destination buffer at which to place the RDMA
      Read data.

   An RDMA Read Response is a group of one or more DDP-decorated
   messages with the Message Identifier field set as defined above.

   A DDP-decorated message that is part of an RDMA Read Response MUST
   have Notify set when:

   o  it is the final DDP-decorated message in an RDMA Read Response,
      or

   o  the Credits value is not zero.

   The client protocol portion of the Message Identifier field of the
   DDP-decorated messages in an RDMA Read Response may be chosen by the
   client protocol.  This allows the client protocol to distinguish
   among RDMA Read Responses for multiple outstanding RDMA Read
   Requests.

   Allowing the client protocol to select a portion of the Message
   Identifier permits a different interface from DRARCH's synchronous
   rdma_read().  However, DRARCH's rdma_read() can be implemented in
   iWarp by having each outstanding call to rdma_read() automatically
   select a different client protocol portion of the Message
   Identifier.

   An RDMA Read Response MUST transfer exactly Read Size octets, or
   result in an error.

6.  Send In iWarp

   An iWarp Send is an undecorated message of up to 2^31-1 octets.

   To permit efficient implementation, each Send is identified by a
   Send Sequence Number.  The Send Sequence Number is not visible to
   client protocols.  The first Send after iWarp is enabled MUST have
   Send Sequence Number 0.  Each subsequent Send MUST have a Send
   Sequence Number of 1 + the Send Sequence Number of the previous
   Send.
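
   The numbering rule can be sketched as a small counter kept by the
   sender.  This is a minimal illustration, not a definitive
   implementation: the class and method names are hypothetical, and
   wrapping the counter modulo 2^32 (to fit the 32 bit Send Sequence
   Number field) is an assumption, since this draft does not say what
   happens once the field is exhausted.

```python
# Illustrative sketch, not part of the protocol definition.
# SendSequencer and assign() are hypothetical names.

MAX_SEND_OCTETS = 2**31 - 1  # largest Send this draft permits
SEQ_MODULUS = 2**32          # ASSUMPTION: wrap to fit a 32-bit field


class SendSequencer:
    """Assigns one Send Sequence Number per client protocol Send."""

    def __init__(self):
        # The first Send after iWarp is enabled MUST use number 0.
        self._next = 0

    def assign(self, send_length):
        """Return the Send Sequence Number for a Send of send_length octets."""
        if not 0 <= send_length <= MAX_SEND_OCTETS:
            raise ValueError("a Send carries at most 2^31-1 octets")
        seq = self._next
        # Each subsequent Send uses 1 + the previous number.
        self._next = (seq + 1) % SEQ_MODULUS
        return seq


if __name__ == "__main__":
    sequencer = SendSequencer()
    print([sequencer.assign(n) for n in (10, 0, MAX_SEND_OCTETS)])
```

   Because the number is assigned per client protocol-submitted Send,
   it stays hidden from the client protocol, as required above.
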

   The data from a single client protocol-submitted Send is sent as a
   group of one or more Send messages where:

   o  Each Send message of the group MUST have the same Send Sequence
      Number.

   o  Each Send message of the group MUST have Send Offset equal to the
      offset in the client protocol-submitted Send of its first octet
      of data.

   o  Send messages other than the last of the group MUST NOT have the
      Final Message Flag set.

   o  The last Send message of the group MUST have the Final Message
      Flag set.

   An individual Send message is an undecorated message:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Tp=0x02|F|  R  |    Credits    |               R               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Send Sequence Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Send Offset                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                         Data Payload                          ~
   ~                                                               ~
   |               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               |
   +-+-+-+-+-+-+-+-+

   Tp - Message Type : 4 bit (unsigned integer)

      Undecorated message type.  Must be 0x02 for a Send.

   F - Final Message Flag : 1 bit (boolean flag)

      if set to 1, this is the final Send message of a group carrying
      the data for a single client protocol-submitted Send.

   R - Reserved

      Sender SHOULD set to 0, receiver MUST ignore.

   Credits : 8 bit (unsigned integer)

      Amount by which to increase the Send and Notifying RDMA Write
      credits.

   Send Sequence Number : 32 bit (unsigned integer)

      The sequence number of the client protocol-submitted Send.

   Send Offset : 32 bit (unsigned integer)

      The offset of Data Payload in the client protocol-submitted Send.

   Data Payload : 0 to 2^31-1 octets (opaque data)

      Data from the client protocol-submitted Send.

7.  Credit Return Message

   If Send and Notifying Write credits cannot be returned in a client
   protocol data transfer message, possibly because no client protocol
   data transfer is in progress, credits can be returned with a Credit
   Return Message.

   A Credit Return Message is an undecorated message:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Tp=0x03|   R   |    Credits    |               R               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Tp - Message Type : 4 bit (unsigned integer)

      Undecorated message type.  Must be 0x03 for a Credit Return
      Message.

   R - Reserved

      Sender SHOULD set to 0, receiver MUST ignore.

   Credits : 8 bit (unsigned integer)

      Amount by which to increase the Send and Notifying RDMA Write
      credits.

8.  Errors In iWarp

   [TODO]

9.  Transport Characteristics In iWarp

   The effect of transport characteristics on operation ordering in
   iWarp is discussed below.  In addition to operation ordering,
   transport characteristics also interact with iWarp in other ways:

   o  RDMA Write, RDMA Read and Sends larger than a single transport
      message don't work with unordered or unreliable transports.

   o  Flow control doesn't work with unreliable transports.

   o  Flow control doesn't work with multisource transports.

   o  RDMA Read flow control doesn't work with multidestination
      transports.

   o  [TODO] Others?

10.  Operation Ordering In iWarp

   The ordering among:

   o  set()s,

   o  get()s,

   o  Sends,

   o  RDMA Write reception indications, and

   o  RDMA Read completion indications

   and their relationship to corresponding operations on the sender is
   defined in iWarp according to underlying transport characteristics:

   o  reliable or unreliable, and

   o  ordered or unordered.

   TODO: Now complicated stuff, especially about get() ordering.

10.1.  Ordering On Reliable, Ordered Transports

   On a reliable, ordered transport, iWarp:

   o  [TODO]

10.2.  Ordering On Reliable, Unordered Transports

   On a reliable, unordered transport, iWarp:

   o  [TODO]

10.3.  Ordering On Unreliable, Ordered Transports

   On an unreliable, ordered transport, iWarp:

   o  [TODO]

10.4.  Ordering On Unreliable, Unordered Transports

   On an unreliable, unordered transport, in general, no additional,
   transport-dependent rules apply to iWarp.  [TODO?]

11.  Transport Topology In iWarp

   Transports support some combination of:

   o  single source, or multisource, and

   o  single destination, or multidestination (multicast or anycast).

   When running iWarp on a multisource transport, flow control MUST NOT
   be enabled.

   When running iWarp on a multidestination transport, RDMA Read flow
   control MUST NOT be enabled.

12.  Negotiating iWarp

   Negotiating the use of iWarp is the sole responsibility of the
   client protocol.  iWarp is a duplex protocol, and must be enabled
   reciprocally in both directions by a pair of participants.

   Some client protocols (e.g. RDMA) MAY choose to require iWarp a
   priori, while others MAY define an in- or out-of-band negotiation
   process to dynamically enable iWarp.

   Whatever the case, a client protocol using iWarp MUST establish:

   o  Use of Send and Notifying RDMA Write flow control,

   o  Initial Send and Notifying RDMA Write credits (if enabled),

   o  Use of RDMA Read Request flow control,

   o  Initial RDMA Read Request credits (if enabled).

13.  Security Considerations

   [TODO]

14.  IANA Considerations

   [TODO]

15.  References

   [DDPP]    Bailey, S., "The Direct Data Placement Protocol (DDPP)
             Core", February 2002.
             http://www.cs.uchicago.edu/~steph/
             draft-bailey-roi-ddpp-core-00.txt

   [DRARCH]  Bailey, S., "The Architecture of Direct Data Placement
             (DDP) And Remote Direct Memory Access (RDMA) On Internet
             Protocols", February 2002.
             http://www.cs.uchicago.edu/~steph/
             draft-bailey-roi-ddp-rdma-arch-00.txt

Author's Address

   Stephen Bailey
   Sandburst Corporation
   600 Federal Street
   Andover, MA  01810
   USA

   Email: steph@sandburst.com

Full Copyright Statement

   Copyright (C) The Internet Society (2002).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.