INTERNET-DRAFT
draft-jpink-warp-summary-00.txt
January 2001
Expires August 2001

   J. Pinkerton, Microsoft Corporation
   Steph Bailey, Genroco
   C. Sapuntzakis, Cisco Systems
   U. Elzur, Broadcom
   J. Williams, Giganet

WARP Architectural Requirements Summary

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Copyright Notice

Copyright (C) Internet Society (2001). All Rights Reserved.

Pinkerton                                                       [Page 1]
Internet-Draft            WARP Arch. Summary                January 2001

Abstract

This document defines the requirements and some architectural
implementation issues for an RDMA system called WARP.

Acknowledgements

The authors would like to acknowledge the many people who have
contributed to discussions of running RDMA over IP, including Allyn
Romanow, Jeff Chase, David Cheriton, and Julian Satran.

Introduction

This document defines the requirements and some architectural
implementation issues for a new protocol called WARP. WARP is designed
to provide bus oriented semantics across the network. Bus oriented
semantics are defined as:

   Remote DMA semantics - the destination address is present in the
   header.

   Send semantics - the destination address is not present in the
   header.

This data transfer model provides sequential in-order delivery of
messages with partial completions of receive messages (this is the same
as the SCTP completion model).

One of WARP's primary requirements is to solve the steering of received
data into receive buffers by making each segment self-describing.
Self-describing means that each receive DMA can place the data in the
correct location of the ULP buffer without any state from prior
packets. This means that network data can be received by an application
without any intermediate copying of the data.

By requiring each WARP PDU to be self-describing, the receive buffer
requirements move from an SCTP/TCP window per connection to simply a
WARP PDU. If a WARP PDU is simply a single frame, this has the effect
of reducing the receive NIC memory requirements from many megabytes to
a few kilobytes.

To ensure self-describing WARP PDUs, WARP shall:

-  work with out of order packets (including dropped packets).

-  for RDMA transfers, use a buffer-id allocated by the receiver
   together with an offset, so that packets are self-describing. The
   offset can be either zero-based or relative to a fixed base.

-  for Send transfers, use a Send Sequence Number and offset.

-  separate the data steering of the ULP header from that of the ULP
   data by splitting ("chunkifying") the header and data into two
   messages.

WARP shall provide a mapping for both SCTP and TCP. WARP will be
layered on top of TCP. WARP will be an extension to SCTP rather than
layered on top of it. The SCTP mapping will allow multiple simultaneous
in-order streams on a single connection. The TCP mapping will allow a
single in-order stream on a single connection. In both cases a single
stream can mix RDMA and Send semantics.

WARP will support an explicit sender managed ULP receive completion
model (a.k.a. notification) to allow a transmit ULP to "gather"
multiple ULP messages into one receive completion.
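
The self-describing placement requirement for RDMA transfers can be
illustrated with a minimal sketch. It assumes zero-based offsets; the
class and method names (RdmaSegment, Receiver, register, place) are
hypothetical and are not defined by WARP. The point it shows is that
each segment is placed using only the buffer id and offset it carries,
so arrival order does not matter and no reassembly buffer is needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RdmaSegment:
    """Hypothetical in-memory view of one self-describing RDMA segment."""
    rid: int        # buffer id allocated by the receiver
    offset: int     # offset into that buffer (zero-based in this sketch)
    payload: bytes


class Receiver:
    """Illustrative receiver; not an API defined by the draft."""

    def __init__(self):
        self.buffers = {}    # rid -> bytearray standing in for a ULP buffer
        self._next_rid = 1

    def register(self, size):
        """Allocate a buffer id for a ULP buffer of `size` bytes."""
        rid = self._next_rid
        self._next_rid += 1
        self.buffers[rid] = bytearray(size)
        return rid

    def place(self, seg):
        """Place one segment using only the fields the segment carries."""
        buf = self.buffers[seg.rid]
        end = seg.offset + len(seg.payload)
        if seg.offset < 0 or end > len(buf):
            raise ValueError("segment falls outside the registered buffer")
        buf[seg.offset:end] = seg.payload


def demo():
    rx = Receiver()
    rid = rx.register(12)
    # Segments arrive out of order; each is placed independently.
    arrivals = [
        RdmaSegment(rid, 8, b"IJKL"),
        RdmaSegment(rid, 0, b"ABCD"),
        RdmaSegment(rid, 4, b"EFGH"),
    ]
    for seg in arrivals:
        rx.place(seg)
    return bytes(rx.buffers[rid])
```

Calling demo() fills the 12-byte buffer correctly even though the last
segment arrives first, because no state from earlier packets is
consulted when placing a segment.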

WARP will allow transmission of a ULP PDU that is larger than a TCP
segment (i.e. it will fragment and reassemble the ULP PDU) while
maintaining the self-describing behavior defined above. In the SCTP
mapping WARP is integrated with SCTP, so it will simply utilize the
existing SCTP fragmentation algorithm with an extension to provide
enough information for the segment to be self-describing.

WARP will negotiate at connection setup time the following options:

-  framing mechanism.

-  whether in-order delivery of data is required, or only in-order
   completions.

WARP will define the variables needed to remotely access an RDMA memory
region, but will not define a packet format to exchange these
variables. The variables will be returned to the ULP upon memory region
initialization. It is expected that the ULP will exchange these
variables with the remote end to enable RDMA accesses. The RDMA memory
region variables are:

-  buffer id

-  size of memory region

-  base (zero based RDMA offset or non-zero based RDMA offset)

Self-describing segments require that the WARP header be aligned to a
lower layer frame. WARP will be defined to work over either a marker
based framing algorithm or a header alignment algorithm. Note that
header alignment is purely a performance issue. WARP will work
correctly without header alignment.

WARP shall specifically not recreate the following SCTP/TCP
functionality:

-  Congestion management

-  WARP assumes invalid segments will not be DMA'ed into host memory

-  reliable delivery

-  WARP assumes in-order completions (but not DMAs)

WARP has the following header variables in both the SCTP and TCP
mappings:

   Type code (8 bits)

   flags

      Completion bit (C) - send a completion event to the ULP

      Last bit (L) - the Last bit shall be set in the last WARP PDU of
      a ULP PDU.

      Transmit Error bit (E) - the transmitter encountered an error
      while sending this ULP PDU.

   The Completion bit shall only be set if the Last bit is also set.
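
The rule that the Completion bit may only accompany the Last bit can be
checked mechanically. The sketch below is illustrative only: the bit
positions are an assumption (they follow the E|C|L ordering drawn in
the Appendix A sketch, with L as the least significant bit), and the
function names are hypothetical.

```python
# Assumed bit positions, not fixed by this document.
FLAG_L = 0x01  # Last bit
FLAG_C = 0x02  # Completion bit
FLAG_E = 0x04  # Transmit Error bit


def make_flags(last=False, completion=False, error=False):
    """Build a flags value, refusing C without L on the transmit side."""
    if completion and not last:
        raise ValueError("Completion bit requires the Last bit")
    flags = 0
    if last:
        flags |= FLAG_L
    if completion:
        flags |= FLAG_C
    if error:
        flags |= FLAG_E
    return flags


def flags_valid(flags):
    """Receive-side check of the same rule; other bits are not examined."""
    return not (flags & FLAG_C and not flags & FLAG_L)
```

A transmitter would call make_flags(last=True, completion=True) on the
final WARP PDU of a ULP PDU when it wants a completion event raised;
intermediate PDUs carry neither bit.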

   Send messages require a 32 bit Send Sequence Number (SSN) and a
   32 bit Send Offset in each Send WARP fragment.

   RDMAs shall have a 64 bit offset. The offset can either be
   zero-based or non-zero based.

   Buffer ID (RID)

The TCP mapping of WARP shall include a 32 bit Adler checksum or
stronger.

The WARP protocol has a few limitations. RDMA transfers to the same
destination address that are delivered out of order are not guaranteed
to occur in order. If in-order writes to the same destination memory
location are required on a fabric that can provide out-of-order
delivery, the ULP must provide some synchronization method to flush the
prior writes. WARP shall not solve RDMA Read semantics, Atomics, or
flow control of Sends.

Appendix A: Sketch of Packet Format

The following is a proposed packet header for WARP on TCP and SCTP. It
is provided here as a sketch only. The WARP specification will override
this outline.

A.1 TCP Packet Format

A.1.1. RDMA Packet Format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 0 |     Type      |  Rsvd   |E|C|L|         Chunk length          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 4 |                              RID                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 8 |                         Buffer Offset                         |
   +                                                               +
12 |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Data payload                          |
   //                                                             //
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     32-bit Adler Checksum                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Type = packet type

   C = Create a Completion event for the current ULP PDU

   L = Last message of a ULP PDU.

   E = Transmit Error bit.

   Rsvd = Reserved field. Transmit as zeros, not checked on receive.

   RID = Receive Buffer ID

   Chunk Length = length of the current WARP chunk (fragment)

A.1.2. Send Packet Format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 0 |     Type      |  Rsvd   |E|C|L|         Chunk length          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 4 |                  Send Sequence Number (SSN)                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 8 |                         Buffer Offset                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Data payload                          |
   //                                                             //
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     32-bit Adler Checksum                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Send Sequence Number - the sequence number of the current Send
   operation. Note that Sends may be split into multiple WARP
   fragments. The offset is used to describe fragments within a single
   Send Sequence Number.

A.2 SCTP Chunk Formats

The SCTP chunks are identical to the TCP chunks except that they
include an SCTP Transmission Sequence Number (TSN) to leverage the
reliable chunk delivery mechanisms in SCTP.

A.2.1. SCTP SEND chunk

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 0 |  Type = TBD   |         |E|C|L|         Chunk length          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 4 |                              TSN                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 8 |                     Send Sequence Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
12 |                         Buffer Offset                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Data payload                          |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

A.2.2. SCTP RDMA chunk

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 0 |  Type = TBD   |         |E|C|L|         Chunk length          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 4 |                              TSN                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 8 |                              RID                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
12 |                         Buffer Offset                         |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Data payload                          |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Note that the Adler checksum is already in the SCTP header.

Appendix B: Mapping of iSCSI to WARP

B.1. Introduction

Mapping iSCSI onto the WARP data transfer mechanism consists primarily
of defining a connection setup procedure to enable WARP acceleration of
iSCSI and choosing a transfer mode, either Send or RDMA, for each PDU
defined by iSCSI.

B.2 iSCSI on WARP Connection Setup

The use of WARP in an iSCSI connection will be negotiated by attempting
to initiate a WARP connection indicating iSCSI as the upper layer
protocol. If this WARP connection succeeds, the iSCSI connection setup
behavior then proceeds as described in the `iSCSI Login' section of
iSCSI, mapping iSCSI PDUs to WARP packets as defined below. If the WARP
connection fails, the connection initiator MAY attempt to fall back to
forming a traditional iSCSI connection without WARP support, if
desired.

The use of WARP for an iSCSI connection MAY be enabled on a
per-connection basis for multiple connections within an iSCSI session.

B.3 WARP Mapping of iSCSI Control And Immediate Write Data PDUs

All iSCSI PDUs MUST be sent using WARP Send packets, with the exception
of:

-  Initiator opcode 0x05, SCSI Data (for WRITE operation), when data is
   in response to an iSCSI Ready To Transfer (R2T) PDU.
   PDUs with initiator opcode 0x05 which are not sent in response to an
   R2T PDU (immediate data) shall be sent using WARP Send packets.

-  Target opcode 0x45, SCSI Data (for READ operation).

B.4 WARP Mapping of iSCSI SCSI Data PDUs

iSCSI PDUs with initiator opcode 0x05, SCSI Data (for WRITE operation),
when data is in response to an iSCSI R2T PDU, and with target opcode
0x45, MUST be sent using a sequence of one or more WARP RDMA Write
packets containing the data payload of the iSCSI SCSI Data PDU,
followed by a WARP Send packet containing the iSCSI SCSI Data PDU
header.

The buffer offset for the initial RDMA Write packet associated with an
iSCSI SCSI Data PDU MUST be set to the Buffer Offset in the iSCSI SCSI
Data PDU header. RDMA Write packets associated with an iSCSI SCSI Data
PDU MAY be of any size allowed by WARP. However, for maximum
performance, the sender of the iSCSI SCSI Data PDU SHOULD attempt to
send RDMA Write packets of the maximum allowable size at all times.
Successive RDMA Write packets associated with iSCSI SCSI Data PDUs must
fill the destination buffer in contiguous, increasing order. In other
words, the Buffer Offset of each RDMA Write packet associated with an
iSCSI SCSI Data PDU MUST equal the Buffer Offset of the previous RDMA
Write packet plus the payload size of the previous RDMA Write packet.

The sender MUST set the completion bit to 0 in RDMA Write packets
associated with an iSCSI SCSI Data PDU. Notification of data delivery
is implied by notification of the Send containing the iSCSI SCSI Data
PDU header which follows the RDMA Write transfer of the actual data
payload.

B.4.1 WARP RID For iSCSI SCSI Data (for WRITE operation) PDUs

The iSCSI target MUST supply the WARP RID to be used for the WARP RDMA
Write packets associated with an iSCSI SCSI Data (for WRITE operation)
PDU in the "transfer tag" field.
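
The B.4 segmentation rules, applied to a WRITE whose RID was supplied
by the target in the transfer tag, can be sketched as follows. This is
an illustration under stated assumptions, not a wire format: the
RdmaWrite and Send classes, the function name map_scsi_data_pdu, and
the max_payload parameter (standing in for "the maximum allowable
size") are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RdmaWrite:
    """Hypothetical model of one WARP RDMA Write packet."""
    rid: int
    buffer_offset: int
    payload: bytes
    completion: bool = False   # B.4: completion bit is 0 on RDMA Writes


@dataclass(frozen=True)
class Send:
    """Hypothetical model of the trailing WARP Send packet."""
    payload: bytes             # the iSCSI SCSI Data PDU header


def map_scsi_data_pdu(rid, pdu_header, pdu_buffer_offset, data, max_payload):
    """Map one iSCSI SCSI Data PDU to RDMA Writes followed by one Send."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    packets = []
    for start in range(0, len(data), max_payload):
        chunk = data[start:start + max_payload]
        # The first offset equals the PDU's Buffer Offset; each later offset
        # is the previous offset plus the previous payload size.
        packets.append(RdmaWrite(rid, pdu_buffer_offset + start, chunk))
    # The Send carrying the PDU header follows the data and implies delivery.
    packets.append(Send(pdu_header))
    return packets
```

For a 2500-byte payload at Buffer Offset 4096 with max_payload set to
1024, this yields RDMA Writes at offsets 4096, 5120 and 6144, followed
by a single Send carrying the PDU header.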

B.4.2 WARP RID For iSCSI SCSI Data (for READ operation) PDUs

The iSCSI initiator MUST supply the WARP RID to be used for the WARP
RDMA Write packets associated with an iSCSI SCSI Data (for READ
operation) PDU as the initiator task tag on the iSCSI Command PDU that
initiates the READ operation.

Authors

   Jim Pinkerton
   1 Microsoft Way
   Bldg. 40
   Redmond, WA 98052-6399
   jpink@microsoft.com

   Stephen Bailey
   steph@cs.uchicago.edu

   Constantine Sapuntzakis
   csapuntz@cisco.com

   Jim Williams
   jimw@giganet.com

   Uri Elzur
   uelzur@broadcom.com

draft-jpink-warp-summary-00.txt                     Expires August 2001