INTERNET-DRAFT                                              J. Williams
draft-williams-iwarp-ift-00.txt                      Emulex Corporation
Expires: April 2003                                        October 2002

                        iWARP Framing for TCP

1 Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

2 Abstract

A framing protocol is defined for DDP over TCP that is fully compliant with applicable TCP RFCs and fully interoperable with existing TCP implementations. This protocol is offered as an alternative to the proposal in [MPA]. The protocol provides two things: definition of the DDP record boundaries within the TCP stream, and added data integrity by means of CRC protection.

3 Acknowledgements

The detailed and painstaking effort of the RDMA Consortium members is acknowledged, as is that of others who have contributed to defining an RDMA over IP protocol.

J. Williams                 Expires: April 2003                 [Page 1]

INTERNET-DRAFT            iWARP Framing for TCP           9 October 2002

3 Protocol definition

3.1 Frame format

+--------------+---------------+---------------+---------------+
|         header_size          |         payload_size          |
+--------------+---------------+---------------+---------------+
|                                                              |
|                          DDP Header                          |
|                                                              |
|                                                              |
+--------------+---------------+---------------+---------------+
|                          Header CRC                          |
+--------------+---------------+---------------+---------------+
|                                                              |
|                                                              |
|                                                              |
|                         DDP Payload                          |
|                                                              |
|                                                              |
|                                                              |
|                                                              |
|                                                              |
+--------------+---------------+---------------+---------------+
|                         Payload CRC                          |
+--------------+---------------+---------------+---------------+

The header_size and payload_size fields are 16 bits each and contain the number of bytes in the DDP header and DDP payload respectively.

The header CRC covers both the IFT header (the header_size and payload_size fields) and the DDP header. The payload CRC covers only the DDP payload. Both CRCs use the CRC-32c algorithm as defined in [iSCSI]. Note that the RDMA header (if present) and headers associated with higher level protocols are considered part of the DDP payload and are not covered by the IFT header CRC.

The above format shows the DDP header and DDP payload sizes as multiples of four bytes. This is not necessary, however, and no padding is added to align the CRCs. An example of an unaligned frame is shown below.

+--------------+---------------+---------------+---------------+
|         header_size          |         payload_size          |
+--------------+---------------+---------------+---------------+
|                                                              |
|                          DDP Header                          |
|                              +---------------+---------------|
|                              |                               |
+--------------+---------------+---------------+---------------+
|          Header CRC          |                               |
+--------------+---------------+                               |
|                                                              |
|                                                              |
|                                                              |
|                         DDP Payload                          |
|                                                              |
|                                                              |
|                                                              |
|                                                              |
|                                                              |
+              +---------------+---------------+---------------+
|              |                  Payload CRC                  |
+--------------+---------------+---------------+---------------+
|              |
+--------------+

4. Ordering semantics

The IFT layer receives all data in order from the TCP layer and delivers all data in order to the DDP layer.

5. Motivation

The IFT header contains the sizes of the DDP header and DDP payload in bytes. Assuming in-order processing of received TCP data, this is fully sufficient to define the DDP PDU boundaries.

The header and payload CRCs are 32 bits each and provide additional protection against data corruption. In the event of a CRC error on received data, the IFT layer will notify the next layer that the data contains an error. That next layer will define the action to be taken. A typical action may be to close the connection with a fatal error.

5.1 Motivation for separating header and payload CRCs

There are two critically important reasons for this separation. First, because of TCP segmentation, the entire payload may not be received together with the header. The next protocol layer (DDP) may wish to place the portion of the payload that has been received, but this can't be safely done until the header, which indicates where the data should be placed, has been verified.

The second important reason is that it makes hardware implementations significantly more efficient, in that the payload CRC can be calculated as the payload data is streamed from NIC memory to host memory.
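
As a rough illustration of the frame layout and of the separation between the header CRC and the payload CRC, the sketch below builds an IFT frame and then parses it in two stages: the header CRC is checked first, and only then is the payload "placed" chunk by chunk while a running payload CRC is accumulated. This is a minimal sketch under stated assumptions, not a definitive implementation. The names crc32c, build_frame, receive_frame and FrameError are hypothetical, the size fields and CRC fields are assumed to be big-endian (this document does not state a byte order), and the bitwise CRC loop is written for clarity rather than speed.

```python
import struct

CRC32C_POLY = 0x82F63B78  # CRC-32C (Castagnoli) polynomial, reflected form


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C. Pass a previous result as `crc` to continue a running CRC."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


class FrameError(Exception):
    """Raised when a CRC check fails (hypothetical error type)."""


def build_frame(ddp_header: bytes, ddp_payload: bytes) -> bytes:
    # IFT header: two 16-bit sizes (big-endian is an assumption).
    ift_header = struct.pack("!HH", len(ddp_header), len(ddp_payload))
    # Header CRC covers the IFT header and the DDP header; no padding is added.
    header_crc = crc32c(ift_header + ddp_header)
    # Payload CRC covers only the DDP payload.
    payload_crc = crc32c(ddp_payload)
    return (ift_header + ddp_header + struct.pack("!I", header_crc)
            + ddp_payload + struct.pack("!I", payload_crc))


def receive_frame(stream: bytes, chunk_size: int = 8):
    """Parse one frame from an in-order byte stream.

    Returns (ddp_header, ddp_payload, bytes_consumed).
    """
    header_size, payload_size = struct.unpack_from("!HH", stream, 0)
    header_end = 4 + header_size
    (header_crc,) = struct.unpack_from("!I", stream, header_end)
    # Stage 1: verify the header before any payload byte is placed.
    if crc32c(stream[:header_end]) != header_crc:
        raise FrameError("header CRC mismatch")
    ddp_header = stream[4:header_end]

    # Stage 2: "place" the payload in chunks, updating the CRC as data streams.
    payload_start = header_end + 4
    payload_end = payload_start + payload_size
    placed = bytearray()
    running = 0
    for offset in range(payload_start, payload_end, chunk_size):
        chunk = stream[offset:min(offset + chunk_size, payload_end)]
        placed += chunk
        running = crc32c(chunk, running)
    (payload_crc,) = struct.unpack_from("!I", stream, payload_end)
    if running != payload_crc:
        raise FrameError("payload CRC mismatch")
    return ddp_header, bytes(placed), payload_end + 4


# Unaligned example: a 6-byte DDP header and a 13-byte DDP payload.
frame = build_frame(b"\x01\x02\x03\x04\x05\x06", b"hello, world!")
header, payload, consumed = receive_frame(frame)
```

In this sketch a payload CRC failure is detected only after the payload bytes have already been copied, which mirrors the point made in the text: the layer above is notified of the error and decides what action to take.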
This streaming cannot take place until the header (and therefore the destination host address) has been verified as correct.

There are only three ways the payload CRC can be verified: on the way into the NIC buffer, on the way out of the NIC buffer towards host memory, or by making a separate access to the NIC buffer memory just for the purpose of CRC verification. The third option causes significant inefficiencies in terms of required memory bandwidth. Verifying the CRC while the data is on the way into the NIC buffer is ideal if it can be done, but is generally not practical if the CRC is part of a layer above the transport layer (TCP in this case). This is because the transport processing must be done first, and the time between receiving the packet on the link and writing it to buffer memory is too brief to complete the transport processing. Therefore the proposal requires checking only the header CRC with a separate access to buffer memory, and allows the payload CRC to be verified as the data is streamed from NIC buffer to host buffer.

5.2 Motivation for not requiring padding to align CRCs

Experience building hardware for [iSCSI] has shown that the hardware required to compute unaligned CRCs is trivial, requiring only a couple of byte shifters. The hardware required to insert the padding is an order of magnitude more complex, and affects control timing in ways that require complex verification. Since the only claimed benefit of padding insertion was to simplify hardware design, and since the result was exactly the opposite, this proposal does not include padding.

5.3 Motivation for not including markers and alignment with TCP segments

The argument for including markers and alignment, as proposed in [MPA], is that they allow optimized hardware to directly place data received out of order and so avoid the cost of buffering out-of-order data on the NIC.
This proposal takes the position that reasonable and robust hardware implementations will require significant buffer memory on the NIC anyway, and that protocol complexity designed to eliminate the need for such buffer memory will not be successful. Given the presence of sufficient buffer memory, there is little value in placing data out of order, and the addition of significant protocol complexity (and the associated implementation inefficiency) is not justified.

Proponents of [MPA] have suggested that eliminating the need for significant amounts of buffer memory is critical to reducing NIC hardware costs. Experience, however, has shown this to be a questionable assumption. It has proven to be the case that the primary cost driver for accelerated NICs is the peak packet processing rate that needs to be supported, not the buffer memory requirement. Supporting high peak packet processing rates adds cost in required NIC processing power, silicon area and power to support that processing power, and bandwidth to control memory.

Unfortunately, building a NIC with small payload buffering memory has exactly the opposite of the desired effect. The peak supported packet processing rate must go up significantly because there is little elasticity between the link, with its potential burst rate of small packets, and the packet processing engine. With large payload buffering memory, a much more modest peak packet processing rate can be supported without the need to drop large quantities of incoming packets during bursts.

A further problem with bufferless NICs will begin to occur at 10Gb link rates, when the host bus can't keep up with the link. Although raw bandwidth is manageable, the big problem comes with a burst of small packets and the limited transaction rate across actual host bus implementations. If the host bus can't keep up with the link, then the NIC must either have large amounts of buffer memory, or else large numbers of packets will be dropped.

If bufferless NIC designs start to proliferate, the result will be good behavior in some environments where burst rates are favorable, and terrible behavior in other environments where bursts are particularly heavy and large numbers of packets are dropped. Overall, this is an undesirable result, as net performance will be inconsistent and unpredictable.

6. Effect of Markers on RDMA/DDP adoption

Markers can be dealt with fairly efficiently on custom hardware designed to insert markers on transmit and remove markers on receive. However, it is likely that initial implementations of RDMA/DDP will be based on reprogrammed existing hardware that does not support MPA-style markers, or on general purpose IO processors. In this case the performance impact of markers is both large and negative. General purpose DMA hardware used to transfer data to and from the host is not able to insert and delete markers on the fly, so the only option is to fragment host DMA transfers into the small pieces between markers. With high performance host busses such as PCI-X, breaking transfers into small pieces will significantly degrade performance of the host bus. This in turn will significantly hinder initial adoption and deployment of RDMA/DDP.

For software implementations of the protocol, the insertion and deletion of markers is a very heavyweight operation requiring either the use of a long scatter-gather list with many segments, or an additional copy of all the payload data.

Although this is arguably a temporary problem until custom RDMA hardware is available, one should not underestimate the importance of the transition phase to ultimate success.

7. IFT and SCTP

IFT is a framing protocol for TCP only and does not address SCTP. It is expected that IFT will be a temporary transitional solution in the event that SCTP ultimately achieves widespread use.
It would become a long-term solution only if SCTP fails to achieve widespread use and acceptance.

8. Security Considerations

It is expected that IFT introduces no new security considerations. It has all the strengths and weaknesses normally associated with TCP, and creates no new weaknesses. All application related security issues are the responsibility of higher layer protocols.

9. References

[MPA]   P. Culley et al., draft-culley-iwarp-mpa-00.txt, September 16, 2001

[iSCSI] Satran, Julian, draft-ietf-iscsi-15.txt, July 30, 2002

[TCP]   Postel, J., "Transmission Control Protocol - DARPA Internet Program Protocol Specification", RFC 793, September 1981.

[DDP]   H. Shah et al., "Direct Data Placement over Reliable Transports", RDMA Consortium Draft Specification draft-shah-rdmap-ddp-00.txt, September 2002

[RDMA]  R. Recio et al., "RDMA Protocol Specification", RDMA Consortium Draft Specification draft-recio-rdmap-rdma-00.txt, September 2002

[SCTP]  R. Stewart et al., "Stream Control Transmission Protocol", RFC 2960, October 2000.

10 Author's Addresses

Jim Williams
Emulex Corporation
580 Main Street
Bolton, MA 01740 USA
Phone: +1 978 779 7224
Email: jim.williams@emulex.com