INTERNET-DRAFT                                              J. Williams
draft-williams-iwarp-ift-01.txt                      Emulex Corporation
Expires: August 2003                                      February 2003

                         iWARP Framing for TCP

1. Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

2. Abstract

A framing protocol is defined for DDP over TCP that is fully compliant
with applicable TCP RFCs and fully interoperable with existing TCP
implementations. The protocol offers two things: definition of the DDP
record boundaries within the TCP stream, and added data integrity by
means of CRC protection. In addition, an adaption mechanism is defined
that confirms use of RDDP and negotiates parameters including use of
CRCs, markers, and enabling of speculative placement.

3. Acknowledgements

The detailed and painstaking effort of the RDMA Consortium members is
acknowledged, as well as that of others who have contributed to
defining an RDMA over IP protocol.

4. Adaption Indication

It is expected that RDDP mode may be used immediately at connection
setup time, or alternatively may be initiated at some time after the
connection has been set up. The latter case would be used when an
existing ULP negotiates an upgrade to RDDP mode.

The adaption indication is sent in band, but MUST NOT be the initial
offer to use RDDP. The ULP MUST first agree to use RDDP, or else the
port number MUST be one that is agreed to be used for RDDP. The sending
of the adaption indication is a confirmation of this agreement to use
RDDP, and also a means to negotiate the particular parameters of the
RDDP connection.

4.1 Adaption Indication Format

The adaption indication is exactly 64 bytes in length and has the
following format.

+--------------+---------------+---------------+---------------+
|                 Magic number ( 0x52444450 )                  |
+--------------+---------------+---------------+---------------+
|           version            |          BC version           |
+--------------+---------------+---------------+---------------+
|                            Flags                             |
+--------------+---------------+---------------+---------------+
|        Sender's VTag         |        marker interval        |
+--------------+---------------+---------------+---------------+
|                                                              |
|                     reserved (44 bytes)                      |
|                                                              |
|                                                              |
+--------------+---------------+---------------+---------------+
|                             CRC                              |
+--------------+---------------+---------------+---------------+

4.2 Field definitions

Version
    Defines the protocol version of the adaption indication. This
    document describes version 1. The field MUST have a value of one.

BcVersion
    Backward compatible version. For this version of the protocol,
    this field MUST be set to 1. In general it is set to the minimum
    version with which the sender is backward compatible. On receipt,
    any BC version greater than the receiver's current version MUST be
    rejected as incompatible. A received version greater than the
    receiver's current version need not be rejected, as long as the BC
    version is not greater than the current version.

The following flag bits are defined starting with the MSB of the flag
field and proceeding left to right. The remainder of the flag field is
reserved.

PH
    Adaption is a two-phase negotiation, with phases numbered 0 and 1.
    Phase 0 is sent from node A to node B.
    On receipt of phase 0, node B sends phase 1 back to node A. This
    bit indicates phase 0 or 1.

R
    Indicates the sender wants to use RDMA mode. If both ends set this
    bit, then the connection is in RDMA mode. If both ends clear this
    bit, then the connection is in DDP mode. If the ends disagree,
    then the connection MUST be closed in error.

HC
    Indicates that the sender will be including a header CRC. If the
    sender's choice to use or not use a header CRC is unacceptable to
    the receiver, the receiver SHOULD abort the connection.

PC
    Indicates that the sender will be including a payload CRC. If the
    sender's choice to use or not use a payload CRC is unacceptable to
    the receiver, the receiver SHOULD abort the connection.

IH
    Indicates that the sender will ignore and not check any header CRC
    which the receiver of the adaption indication sends. If this is
    unacceptable to the receiver of the adaption indication, the
    connection should be closed.

IP
    Indicates that the sender will ignore and not check any payload
    CRC which the receiver of the adaption indication sends. If this
    is unacceptable to the receiver of the adaption indication, the
    connection should be closed in error.

SP
    Indicates the sender is giving the receiver permission to do
    speculative placement. If this bit is set, the receiver MAY do
    speculative placement. If not set, the receiver MUST NOT do
    speculative placement. See Section 8 for a full description of
    speculative placement.

SM
    Indicates the sender of the adaption indication is willing to send
    periodic markers. Markers will be sent only if the SM bit is set
    and the other end of the connection sets the RM bit in its
    adaption indication.

RM
    Indicates the sender of the adaption indication wishes to receive
    periodic markers. Markers will be received if and only if the RM
    bit is set and the other end sets the SM bit in its adaption
    indication.

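The flag rules above can be illustrated with a minimal sketch. It
assumes the flag field is the full 32-bit word shown in Section 4.1 and
that the bits are assigned in the order listed, PH being the MSB. The
names AdaptionFlags and negotiate are hypothetical and are not defined
by this document; the sketch does not model the HC, PC, IH, and IP
acceptance decisions, which are local policy.

```python
from enum import IntFlag


class AdaptionFlags(IntFlag):
    """Flag bits of the 32-bit Flags field, assigned from the MSB downward."""
    PH = 1 << 31  # negotiation phase (0 or 1)
    R = 1 << 30   # sender wants RDMA mode
    HC = 1 << 29  # sender will include a header CRC
    PC = 1 << 28  # sender will include a payload CRC
    IH = 1 << 27  # sender will ignore header CRCs it receives
    IP = 1 << 26  # sender will ignore payload CRCs it receives
    SP = 1 << 25  # receiver is permitted to do speculative placement
    SM = 1 << 24  # sender is willing to send periodic markers
    RM = 1 << 23  # sender wishes to receive periodic markers


def negotiate(local: AdaptionFlags, remote: AdaptionFlags) -> dict:
    """Combine both ends' flags into the parameters seen by the local end.

    Hypothetical helper for illustration only.
    """
    local_rdma = bool(local & AdaptionFlags.R)
    remote_rdma = bool(remote & AdaptionFlags.R)
    if local_rdma != remote_rdma:
        # The ends disagree on the R bit.
        raise ConnectionError("R bit mismatch: connection MUST be closed in error")
    return {
        "mode": "RDMA" if local_rdma else "DDP",
        # Markers flow only when one end offers (SM) and the other asks (RM).
        "send_markers": bool(local & AdaptionFlags.SM) and bool(remote & AdaptionFlags.RM),
        "receive_markers": bool(local & AdaptionFlags.RM) and bool(remote & AdaptionFlags.SM),
        # Permission to place speculatively comes from the remote end's SP bit.
        "may_place_speculatively": bool(remote & AdaptionFlags.SP),
    }
```

For example, if node A sets R, SM, and SP while node B sets PH, R, and
RM, then from node A's point of view the connection is in RDMA mode,
node A sends markers, and node A may not place speculatively because
node B did not set SP.
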
Sender's VTag
    The sender's VTag is a 16 bit value that SHOULD be selected at
    random using a good random number generator. It is included in all
    outgoing RDDP PDUs sent by the receiver of this adaption
    indication. The main purpose is to provide padding to align the
    RDDP header; however, it also provides some added checking as to
    the validity of the IFT header, and some added confidence for
    doing speculative placement.

Marker Interval
    Used to negotiate use of periodic markers and determine the
    interval at which they will be sent. The exact method of
    negotiation is specified in Section 9.

CRC
    A CRC computed on the adaption indication is REQUIRED regardless
    of whether header or payload CRCs are negotiated in either
    direction.

reserved
    All reserved bits MUST be set to zero by the sender and ignored by
    the receiver.

5. Protocol definition

5.1 Frame format

+--------------+---------------+---------------+---------------+
|         header_size          |         payload_size          |
+--------------+---------------+---------------+---------------+
|       receiver's VTag        |                               |
+--------------+---------------+                               |
|                          DDP Header                          |
|                                                              |
|                                                              |
+--------------+---------------+---------------+---------------+
|                  Header CRC (if negotiated)                  |
+--------------+---------------+---------------+---------------+
|                                                              |
|                                                              |
|                                                              |
|                         DDP Payload                          |
|                                                              |
|                                                              |
|                                                              |
|                                                              |
|                                                              |
+--------------+---------------+---------------+---------------+
|                 Payload CRC (if negotiated)                  |
+--------------+---------------+---------------+---------------+

The header_size and payload_size are 16 bit fields containing the
number of bytes in the DDP header and payload respectively.

The header CRC covers both the IFT header (header_size and
payload_size fields) and the DDP header. The payload CRC covers only
the DDP payload. The CRC uses the CRC-32c algorithm as defined in
[iSCSI].

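The frame layout above can be sketched as follows. This is an
illustration under stated assumptions, not a definitive
implementation: all multi-byte fields, including the CRCs, are assumed
to be in network byte order; the header CRC is assumed to cover every
byte that precedes it (header_size, payload_size, VTag, and DDP
header); and the whole frame is assumed to be present in one buffer.
The function names build_frame and parse_frame are hypothetical. The
crc32c routine is the standard bitwise CRC-32C (Castagnoli) algorithm.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def build_frame(vtag, ddp_header, ddp_payload, header_crc=True, payload_crc=True):
    """Assemble one IFT frame; byte order and CRC coverage are assumptions."""
    prefix = struct.pack("!HHH", len(ddp_header), len(ddp_payload), vtag)
    out = prefix + ddp_header
    if header_crc:
        out += struct.pack("!I", crc32c(prefix + ddp_header))
    out += ddp_payload
    if payload_crc:
        out += struct.pack("!I", crc32c(ddp_payload))
    return out


def parse_frame(buf, header_crc=True, payload_crc=True):
    """Return (vtag, ddp_header, ddp_payload, bytes_consumed)."""
    if len(buf) < 6:
        raise ValueError("truncated frame")
    header_size, payload_size, vtag = struct.unpack_from("!HHH", buf, 0)
    total = 6 + header_size + payload_size + 4 * header_crc + 4 * payload_crc
    if len(buf) < total:
        raise ValueError("truncated frame")
    pos = 6
    ddp_header = buf[pos:pos + header_size]
    pos += header_size
    if header_crc:
        (received,) = struct.unpack_from("!I", buf, pos)
        if received != crc32c(buf[:pos]):
            raise ValueError("header CRC mismatch")
        pos += 4
    ddp_payload = buf[pos:pos + payload_size]
    pos += payload_size
    if payload_crc:
        (received,) = struct.unpack_from("!I", buf, pos)
        if received != crc32c(ddp_payload):
            raise ValueError("payload CRC mismatch")
        pos += 4
    return vtag, ddp_header, ddp_payload, pos
```

Because the two sizes alone determine where each field ends, the
parser needs no alignment padding; a 3-byte DDP header is handled the
same way as a 4-byte one.
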
Note that the RDMA header (if present) and headers associated with
higher level protocols are considered part of the DDP payload, and are
not covered by the IFT header CRC.

The frame format diagram above shows the DDP header and DDP payload
sizes as being a multiple of four bytes. This is not necessary,
however, and no padding is added to align the CRC. An example of an
unaligned frame is shown below.

+--------------+---------------+---------------+---------------+
|         header_size          |         payload_size          |
+--------------+---------------+---------------+---------------+
|       receiver's VTag        |                               |
+--------------+---------------+                               |
|                                                              |
|                          DDP Header                          |
|                              +---------------+---------------|
|                              |                               |
+--------------+---------------+---------------+---------------+
|  Header CRC (if negotiated)  |                               |
+--------------+---------------+                               |
|                                                              |
|                                                              |
|                                                              |
|                         DDP Payload                          |
|                                                              |
|                                                              |
|                                                              |
+              +---------------+---------------+---------------+
|              |          Payload CRC (if negotiated)          |
+--------------+---------------+---------------+---------------+
|              |
+--------------+

6. Ordering semantics

The IFT layer receives all data in order from the TCP layer and
delivers all data in order to the DDP layer. Note that this does not
preclude a merged layer implementation from placing data out of order,
but any such implementation MUST be functionally equivalent to a
layered implementation in which TCP delivers all data in order.

7. Motivation

The IFT header contains the size of the DDP header and DDP payload in
bytes. Assuming in-order processing of received TCP data, this is
fully sufficient to define the DDP PDU boundaries.

The header and payload CRCs are 32 bits each, and provide additional
protection against data corruption. In the event of a CRC error on
received data, the IFT layer will notify the next layer that the data
contains an error. That next layer will define the action to be taken.
The typical action is to close the connection with a fatal error.

7.1 Motivation for separating header and payload CRCs

There are three important reasons for this separation. First, because
of TCP segmentation, the entire payload may not be received together
with the header. The next protocol layer (DDP) may wish to place the
portion of the payload that has been received, but this can't be
safely done until the header, which indicates where the data should be
placed, has been verified.

The second important reason is that it makes hardware implementations
significantly more efficient, in that the payload CRC can be
calculated as the payload data is streamed from NIC memory to host
memory. This streaming can't take place until the header (and
therefore the destination host address) has been verified correct.

There are only three ways the payload CRC can be verified: on the way
into the NIC buffer, on the way out of the NIC buffer towards host
memory, or by making a separate access to the NIC buffer memory just
for the purpose of CRC verification. The third option causes
significant inefficiencies in terms of required memory bandwidth.
Verifying the CRC while the data is on the way into the NIC buffer is
great if it can be done, but is generally not practical if the CRC is
part of a layer above the transport (TCP in this case) layer. This is
because the transport processing must be done first, and the time
between receiving the packet on the link and writing it to buffer
memory is too brief to complete the transport processing. Therefore
the proposal requires checking only the header CRC with a separate
access to buffer memory, and allows the payload CRC to be verified as
the data is streamed from NIC buffer to host buffer.

The third reason is that some applications may require a header CRC
but not require a payload CRC.
This may be for a number of reasons, including the presence of added
end-to-end checks at the ULP level, or simply the ability of the ULP
to tolerate data errors (but not placement errors).

7.2 Motivation for not requiring padding to align CRCs

Experience building hardware for [iSCSI] has shown that the hardware
required to compute unaligned CRCs is trivial, requiring only a couple
of byte shifters. The hardware required to insert the padding is an
order of magnitude more complex, and affects control timing in ways
that require complex verification. Since the only claimed benefit of
padding insertion was to simplify hardware design, and since the
result was exactly the opposite, this proposal does not include
padding.

8. Speculative Placement

Speculative placement is done by the receiver of an RDDP PDU. On
receiving an out-of-order TCP segment, the receiver guesses at the
location of the RDDP PDU within the TCP segment. This guess is
confirmed by a number of checks; if the checks pass, the payload data
is placed directly. If the confirmation fails, speculative placement
MUST NOT be done. If the PDU identified appears to be in error, the
error MUST NOT be reported speculatively. In the case of either the
failed confirmation or PDU error, nothing may be done with the PDU
until it can be processed in order.

Prior to delivering the payload data to the ULP, the alignment is
verified by receiving all preceding PDUs and using the IFT length
fields to absolutely verify that the alignment was correct. If this
verification fails, the PDU MUST be placed again. If an implementation
has discarded the original after placing it (which it typically will
do), then it MUST withhold TCP acknowledgement of this segment and
force the remote end to retransmit it.

8.1 Confirmation checks

Before doing speculative placement, the IFT and DDP headers should be
checked.
Specific fields checked include the IFT header length, IFT payload
length, IFT VTag, IFT header CRC, DDP STag, DDP QN, DDP MSN, DDP TO,
and DDP DV. The subset of these which exists SHOULD all be checked,
and any field containing an invalid value disqualifies the PDU for
speculative placement.

8.2 Overwrite checks

Before speculative placement is done, the RDDP implementation MUST
ensure that no previously placed data is overwritten. This is
necessary to ensure that, if the speculative placement is being done
in error, the error is recoverable. This implies that the RDDP
implementation must track what portions of a buffer have been written
at any time, or, if the RDDP implementation loses track, do no further
speculative placements in that buffer. If an in-order placement would
overwrite a previously done speculative placement, then that in-order
placement should be done, and the preceding speculative placement is
regarded as invalid and needs to be redone in order.

8.3 Unwritten Portions of Buffers

Because speculative placement may write erroneously to unwritten
portions of buffers, applications that allow speculative placement
MUST assume that when a buffer is delivered to them by RDDP, any
unwritten portion of the buffer contains unpredictable data.
Applications MUST NOT assume that unwritten portions of buffers are
unmodified.

9. Periodic Markers

Periodic markers may be inserted in the data stream as a means of
locating the beginning of RDDP PDUs when TCP segments are received out
of order. Details of markers are TBD.

10. IFT and SCTP

IFT is a framing protocol for TCP only and does not address SCTP. It
is expected that IFT will be a temporary transitional solution in the
event that SCTP ultimately achieves widespread use. It would become a
long term solution only if SCTP fails to achieve widespread use and
acceptance.

11. Security Considerations

It is expected that IFT introduces no new security considerations. It
has all the strengths and weaknesses normally associated with TCP, and
creates no new weaknesses. All application related security issues are
the responsibility of higher layer protocols.

12. References

[MPA]    P. Culley et al., draft-culley-iwarp-mpa-00.txt, September
         16, 2001.

[iSCSI]  J. Satran, draft-ietf-iscsi-15.txt, July 30, 2002.

[TCP]    J. Postel, "Transmission Control Protocol - DARPA Internet
         Program Protocol Specification", RFC 793, September 1981.

[DDP]    H. Shah et al., "Direct Data Placement over Reliable
         Transports", RDMA Consortium Draft Specification
         draft-shah-rdmap-ddp-00.txt, September 2002.

[RDMA]   R. Recio et al., "RDMA Protocol Specification", RDMA
         Consortium Draft Specification draft-recio-rdmap-rdma-00.txt,
         September 2002.

[SCTP]   R. Stewart et al., "Stream Control Transmission Protocol",
         RFC 2960, October 2000.

13. Author's Addresses

Jim Williams
Emulex Corporation
580 Main Street
Bolton, MA 01740
USA

Phone: +1 978 779 7224
Email: jim.williams@emulex.com