INTERNET-DRAFT                               J. Williams 
   draft-williams-iwarp-ift-00.txt              Emulex Corporation
   Expires:  April 2003
                                              
                                                October 2002 
                                      

             iWARP Framing for TCP

1  Status of this Memo 

     This document is an Internet-Draft and is in full conformance with
     all provisions of Section 10 of RFC2026.  Internet-Drafts are
     working documents of the Internet Engineering Task Force (IETF),
     its areas, and its working groups.  Note that other groups may also
     distribute working documents as Internet-Drafts.  Internet-Drafts
     are draft documents valid for a maximum of six months and may be
     updated, replaced, or obsoleted by other documents at any time.  It
     is inappropriate to use Internet-Drafts as reference material or to
     cite them other than as "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt.  The list of
     Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.


2    Abstract 

   A framing protocol is defined for DDP over TCP that is fully
   compliant with applicable TCP RFCs and fully interoperable with
   existing TCP implementations.  This protocol is offered as an
   alternative to the proposal in [MPA].

   The protocol provides two things: definition of the DDP record
   boundaries within the TCP stream, and added data integrity by
   means of CRC protection.


3   Acknowledgements

   The detailed and painstaking effort of the RDMA Consortium members
   is acknowledged, as is that of others who have contributed to
   defining an RDMA over IP protocol.


   J. Williams              Expires: April 2003                [Page 1] 

   INTERNET-DRAFT       iWARP Framing for TCP          9 October 2002 
    

3   Protocol definition


3.1 Frame format

     +--------------+---------------+---------------+---------------+
     |        header_size           |         payload_size          |
     +--------------+---------------+---------------+---------------+
     |                                                              |
     |                        DDP Header                            |
     |                                                              |
     |                                                              |
     +--------------+---------------+---------------+---------------+
     |                         Header CRC                           |
     +--------------+---------------+---------------+---------------+
     |                                                              |
     |                                                              |
     |                                                              |
     |                        DDP Payload                           |
     |                                                              |
     |                                                              |
     |                                                              |
     |                                                              |
     |                                                              |
     +--------------+---------------+---------------+---------------+
     |                        Payload CRC                           |
     +--------------+---------------+---------------+---------------+

   The header_size and payload_size are 16-bit fields containing the
   number of bytes in the DDP header and payload, respectively.
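
   As a minimal sketch, the two length fields could be encoded and
   decoded as shown below.  The function names are hypothetical, and
   the sketch assumes unsigned fields in network byte order
   (big-endian), which this document does not specify.

```python
import struct
from typing import Tuple

# Assumption: both 16-bit fields are unsigned and sent in network byte
# order (big-endian); the draft itself does not state the byte order.
IFT_LENGTHS = struct.Struct("!HH")


def pack_lengths(header_size: int, payload_size: int) -> bytes:
    """Encode the two 16-bit length fields that start every frame.

    struct raises struct.error if either value does not fit in 16 bits.
    """
    return IFT_LENGTHS.pack(header_size, payload_size)


def unpack_lengths(data: bytes) -> Tuple[int, int]:
    """Decode (header_size, payload_size) from the first four bytes."""
    return IFT_LENGTHS.unpack_from(data, 0)
```

   Under these assumptions, pack_lengths(14, 1000) yields the four
   bytes 00 0e 03 e8.  Because each field is 16 bits wide, neither
   the DDP header nor the DDP payload can exceed 65535 bytes.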

   The header CRC covers both the IFT (iWARP Framing for TCP) header,
   that is, the header_size and payload_size fields, and the DDP
   header.  The payload CRC covers only the DDP payload.  Both CRCs
   use the CRC-32c algorithm as defined in [iSCSI].  Note that the
   RDMA header (if present) and headers associated with higher level
   protocols are considered part of the DDP payload, and are covered
   by the payload CRC rather than the header CRC.
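
   The coverage of the two CRCs can be illustrated with a short
   sketch that assembles one frame.  The name build_frame is
   hypothetical.  The byte order of the length fields and of the CRC
   fields on the wire is an assumption (network byte order here);
   this document does not specify either.

```python
import struct


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32c (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def build_frame(ddp_header: bytes, ddp_payload: bytes) -> bytes:
    """Assemble lengths, DDP header, header CRC, payload, payload CRC."""
    # Assumption: length fields and CRC fields use network byte order.
    lengths = struct.pack("!HH", len(ddp_header), len(ddp_payload))
    # The header CRC covers the length fields and the DDP header.
    header_crc = crc32c(lengths + ddp_header)
    # The payload CRC covers only the DDP payload.
    payload_crc = crc32c(ddp_payload)
    return b"".join([
        lengths,
        ddp_header,
        struct.pack("!I", header_crc),
        ddp_payload,
        struct.pack("!I", payload_crc),
    ])
```

   The bitwise CRC loop is chosen for brevity, not speed.  A frame
   built this way is always 12 bytes longer than the DDP header and
   DDP payload combined: four bytes of length fields plus two
   four-byte CRCs.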




   The above format shows the DDP header and DDP payload size as
   being a multiple of four bytes.  This is not necessary, however,
   and no padding is added to align the CRC.  An example of 
   an unaligned frame is shown below.

     +--------------+---------------+---------------+---------------+
     |        header_size           |         payload_size          |
     +--------------+---------------+---------------+---------------+
     |                                                              |
     |                        DDP Header                            |
     |                              +---------------+---------------|
     |                              |                               |
     +--------------+---------------+---------------+---------------+
     |         Header CRC           |                               |
     +--------------+---------------+                               | 
     |                                                              |
     |                                                              |
     |                                                              |
     |                        DDP Payload                           |
     |                                                              |
     |                                                              |
     |                                                              |
     |                                                              |
     |                                                              |
     +              +---------------+---------------+---------------+
     |              |         Payload CRC                           |
     +--------------+---------------+---------------+---------------+
     |              |
     +--------------+
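
   Because no padding is inserted, the position of every field
   follows directly from the two length fields.  The sketch below
   computes those byte offsets; the names FrameLayout and
   frame_layout are hypothetical and are used only for illustration.

```python
from typing import NamedTuple

IFT_LENGTHS_SIZE = 4  # header_size and payload_size, 16 bits each
CRC_SIZE = 4          # each CRC is 32 bits


class FrameLayout(NamedTuple):
    """Byte offsets of each field, measured from the start of the frame."""
    ddp_header: int
    header_crc: int
    ddp_payload: int
    payload_crc: int
    total: int  # total frame length in bytes


def frame_layout(header_size: int, payload_size: int) -> FrameLayout:
    """Locate every field of a frame; no padding is ever added."""
    ddp_header = IFT_LENGTHS_SIZE
    header_crc = ddp_header + header_size
    ddp_payload = header_crc + CRC_SIZE
    payload_crc = ddp_payload + payload_size
    return FrameLayout(ddp_header, header_crc, ddp_payload,
                       payload_crc, payload_crc + CRC_SIZE)
```

   For example, a 14 byte DDP header and a 1001 byte DDP payload
   place the header CRC at offset 18 and the payload CRC at offset
   1023, neither of which is a multiple of four, for a total frame
   length of 1027 bytes.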

4.   Ordering semantics
  
   The IFT layer receives all data in order from the TCP layer and
   delivers all data in order to the DDP layer.


5.   Motivation 

   The IFT header contains the size of the DDP header and DDP payload
   in bytes.  Assuming in-order processing of received TCP data,
   this is sufficient to define the DDP PDU boundaries.

   The header and payload CRCs are 32 bits each, and provide
   additional protection against data corruption.  In the event
   of a CRC error on received data, the IFT layer will notify
   the next layer that the data contains an error.  That next
   layer will define the action to be taken.  A typical action
   is to close the connection with a fatal error.
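
   The way the length fields delimit PDUs in an in-order byte stream
   can be sketched as follows.  The Deframer class, the Pdu tuple and
   the crc_ok flag are hypothetical; the flag stands in for the
   notification to the next layer.  Network byte order is assumed for
   the length and CRC fields, and for simplicity the sketch waits for
   a complete frame before checking either CRC.

```python
import struct
from typing import List, NamedTuple


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32c (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


class Pdu(NamedTuple):
    ddp_header: bytes
    ddp_payload: bytes
    crc_ok: bool  # False tells the next layer the data contains an error


class Deframer:
    """Hypothetical receiver that recovers DDP PDUs from in-order TCP data."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> List[Pdu]:
        """Append in-order TCP bytes; return every PDU that is now complete."""
        self._buf += chunk
        pdus = []
        while len(self._buf) >= 4:
            # Assumption: length and CRC fields are in network byte order.
            header_size, payload_size = struct.unpack_from("!HH", self._buf, 0)
            header_end = 4 + header_size
            payload_start = header_end + 4
            payload_end = payload_start + payload_size
            if len(self._buf) < payload_end + 4:
                break  # frame not complete yet; wait for more TCP data
            frame = bytes(self._buf[:payload_end + 4])
            del self._buf[:payload_end + 4]
            (header_crc,) = struct.unpack_from("!I", frame, header_end)
            (payload_crc,) = struct.unpack_from("!I", frame, payload_end)
            payload = frame[payload_start:payload_end]
            ok = (crc32c(frame[:header_end]) == header_crc
                  and crc32c(payload) == payload_crc)
            pdus.append(Pdu(frame[4:header_end], payload, ok))
        return pdus
```

   TCP segment boundaries play no part in the result: feed() may be
   called with chunks of any size.  Note that once a header CRC
   fails, the length fields of that frame cannot be trusted, so the
   boundaries of later PDUs are no longer reliable.  This is
   consistent with treating a CRC error as fatal to the connection.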




5.1   Motivation for separating header and payload CRCs

   There are two important reasons for this separation.  First,
   because of TCP segmentation, the entire payload may not be
   received together with the header.  The next protocol layer (DDP)
   may wish to place the portion of the payload that has been
   received, but this can't be safely done until the header, which
   indicates where the data should be placed, has been verified.

   The second important reason is that it makes hardware
   implementations significantly more efficient in that the
   payload CRC can be calculated as the payload data is streamed
   from NIC memory to host memory.  This streaming can't take
   place until the header (and therefore the destination host address)
   has been verified correct.

   There are only three ways the payload CRC can be verified:
   on the way into the NIC buffer, on the way out of the NIC buffer
   towards host memory, or by making a separate access
   to the NIC buffer memory just for the purpose of CRC
   verification.  The third option causes significant
   inefficiencies in terms of required memory bandwidth.
   Verifying the CRC while the data is on the way into the
   NIC buffer is ideal when it can be done, but is generally
   not practical if the CRC is part of a layer above the
   transport layer (TCP in this case).  This is because the transport
   processing must be done first, and the time between receiving
   the packet on the link and writing it to buffer memory is too brief
   to complete the transport processing.

   Therefore the proposal requires checking only the header CRC with
   a separate access to buffer memory, and allows the payload CRC to
   be verified as the data is streamed from NIC buffer to host buffer.
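
   The two-stage check described above can be sketched in software
   as follows.  The function receive_frame and its place callback are
   hypothetical; place stands in for the transfer of payload bytes
   from the NIC buffer to host memory.  Network byte order is assumed
   for the length and CRC fields, and the input is assumed to carry
   exactly one frame.

```python
import itertools
import struct
from typing import Callable, Iterable


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32c; pass the previous result as crc to continue a run."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def receive_frame(segments: Iterable[bytes],
                  place: Callable[[bytes, int, bytes], None]) -> bool:
    """Receive one frame; return True only if both CRCs verify.

    place(ddp_header, offset, data) is a hypothetical callback that
    copies payload bytes to their destination.
    """
    it = iter(segments)
    buf = bytearray()
    header_end = None
    payload_size = 0
    # Stage 1: buffer only the lengths, the DDP header and the header CRC.
    while header_end is None or len(buf) < header_end + 4:
        seg = next(it, None)
        if seg is None:
            raise ValueError("stream ended before the header was complete")
        buf += seg
        if header_end is None and len(buf) >= 4:
            header_size, payload_size = struct.unpack_from("!HH", buf, 0)
            header_end = 4 + header_size
    (wire_crc,) = struct.unpack_from("!I", buf, header_end)
    if crc32c(bytes(buf[:header_end])) != wire_crc:
        return False  # header not trusted, so nothing is placed
    ddp_header = bytes(buf[4:header_end])

    # Stage 2: place payload bytes as they stream past, updating the CRC.
    running, placed, tail = 0, 0, bytearray()
    leftover = bytes(buf[header_end + 4:])
    for chunk in itertools.chain([leftover], it):
        part = chunk[:payload_size - placed]
        if part:
            place(ddp_header, placed, part)
            running = crc32c(part, running)
            placed += len(part)
        tail += chunk[len(part):]  # bytes past the payload: the payload CRC
        if len(tail) >= 4:
            break
    if placed < payload_size or len(tail) < 4:
        raise ValueError("stream ended before the frame was complete")
    return running == struct.unpack_from("!I", tail, 0)[0]
```

   In this sketch the payload is touched exactly once, on its way to
   the destination, and the payload CRC result is known only after
   the last payload byte has been placed.  A header CRC failure, by
   contrast, is detected before any byte is placed.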

5.2   Motivation for not requiring padding to align CRCs

   Experience building hardware for [iSCSI] has shown that the
   hardware required to compute unaligned CRCs is trivial, requiring
   only a couple of byte shifters.  The hardware required to insert
   the padding is an order of magnitude more complex, and affects
   control timing in ways that require complex verification.

   Since the only claimed benefit of padding insertion was to 
   simplify hardware design, and since the result was exactly
   the opposite, this proposal does not include padding.




5.3   Motivation for not including markers and alignment
      with TCP segments

   The argument for including markers and alignment as proposed
   in [MPA] is to allow optimized hardware to directly place
   data received out of order, and avoid the cost of buffering
   out of order received data on the NIC.

   This proposal takes the position that reasonable and robust
   hardware implementations will require significant buffer
   memory on the NIC anyway, and protocol complexity designed
   to eliminate the need for such buffer memory will not be
   successful.  Given the presence of sufficient buffer memory,
   there is little value in placing data out of order, and
   the addition of significant protocol complexity (and the
   associated implementation inefficiency) is not justified.

   Proponents of [MPA] have suggested that eliminating the need for
   significant amounts of buffer memory is critical to reducing NIC
   hardware costs.  Experience, however, has shown this to be a
   questionable assumption.  The primary cost driver for accelerated
   NICs has proven to be the peak packet processing rate that needs
   to be supported, not the buffer memory requirement.  Supporting
   high peak packet processing rates adds cost in required NIC
   processing power, in the silicon area and electrical power needed
   to support that processing power, and in bandwidth to control
   memory.

   Unfortunately, building a NIC with small payload buffering memory
   has exactly the opposite of the desired effect. The 
   peak supported packet processing rate must go up significantly
   because there is little elasticity between the link with its
   potential burst rate of small packets, and the packet processing
   engine.

   With large payload buffering memory, a much more modest peak
   packet processing rate can be supported, without the need to
   drop large quantities of incoming packets during bursts.

   A further problem with bufferless NICs will begin to occur at
   10Gb link rates.  This occurs when the host bus can't keep up
   with the link.  Although raw bandwidth is manageable, the big
   problem comes with a burst of small packets and the limited
   transaction rate across actual host bus implementations.  If
   the host bus can't keep up with the link, then the NIC must
   either have large amounts of buffer memory, or else large numbers
   of packets will be dropped.
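
   The difference between small and large packets can be quantified
   with a short calculation.  The sketch below assumes an Ethernet
   link, with 20 bytes of preamble and inter-frame gap per frame and
   a 64 byte minimum frame; neither the link type nor the drain rate
   used in the backlog function comes from this document.

```python
LINK_BPS = 10_000_000_000   # 10 Gb/s link
ETHERNET_OVERHEAD = 8 + 12  # assumption: preamble/SFD + inter-frame gap


def packets_per_second(frame_bytes: int, link_bps: int = LINK_BPS) -> float:
    """Peak frame arrival rate for back-to-back frames of one size."""
    return link_bps / ((frame_bytes + ETHERNET_OVERHEAD) * 8)


def backlog_bytes(frame_bytes: int, burst_seconds: float,
                  drain_pps: float) -> float:
    """Bytes a NIC would need to buffer to ride out a burst without drops.

    drain_pps is a hypothetical sustained rate at which the packet
    processing engine or the host bus can accept packets.
    """
    excess_pps = max(0.0, packets_per_second(frame_bytes) - drain_pps)
    return excess_pps * burst_seconds * frame_bytes
```

   Under these assumptions a burst of 64 byte frames arrives at
   about 14.88 million packets per second, while 1518 byte frames
   arrive at roughly 0.81 million per second.  Any gap between the
   small-packet arrival rate and the rate the rest of the system can
   sustain must be absorbed by buffer memory or by dropping packets.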




   If bufferless NIC designs start to proliferate, the result will
   be good behavior in some environments where burst rates are
   favorable, and terrible behavior in other environments where
   bursts are particularly heavy and large numbers of packets
   are dropped.  Overall, this is an undesirable result as net
   performance will be inconsistent and unpredictable.

6.   Effect of Markers on RDMA/DDP adoption

   Markers can be dealt with fairly efficiently on custom hardware
   designed to insert markers on transmit, and remove markers
   on receive.  However, it is likely that initial implementations of
   RDMA/DDP will be based on reprogrammed existing hardware that does
   not support MPA-style markers, or on general purpose IO processors.
   In this case the performance impact of markers is both large and
   negative.  General purpose DMA hardware used to transfer data to
   and from the host is not able to insert and delete markers on the
   fly, so the only option is to fragment host DMA transfers into the
   small pieces between markers.  With high performance host busses
   such as PCI-X, breaking transfers into small pieces will
   significantly degrade performance of the host bus.  This in turn
   will significantly hinder initial adoption and deployment of
   RDMA/DDP.
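
   The fragmentation can be illustrated with a short sketch.  The
   marker spacing of 512 bytes and marker size of 4 bytes are
   parameters chosen for illustration and are not taken from this
   document; the name dma_pieces is likewise hypothetical.  Markers
   are assumed to start at every multiple of the interval in the
   byte stream.

```python
from typing import List, Tuple

MARKER_INTERVAL = 512  # assumption: spacing of markers in the byte stream
MARKER_SIZE = 4        # assumption: bytes occupied by each marker


def dma_pieces(start: int, length: int,
               interval: int = MARKER_INTERVAL,
               marker: int = MARKER_SIZE) -> List[Tuple[int, int]]:
    """Split a stream region into (offset, size) runs free of markers.

    Each returned run is one separate host DMA transfer for hardware
    that cannot insert or delete markers on the fly.
    """
    pieces = []
    pos, end = start, start + length
    while pos < end:
        in_block = pos % interval
        if in_block < marker:
            pos += marker - in_block  # skip over the marker itself
            continue
        stop = min(end, pos - in_block + interval)  # next marker or end
        pieces.append((pos, stop - pos))
        pos = stop
    return pieces
```

   With these illustrative parameters, a 2048 byte region starting
   on a marker boundary becomes four transfers of 508 bytes each,
   where a marker-free stream would allow a single transfer.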

   For software implementations of the protocol, the insertion and
   deletion of markers is a very heavyweight operation requiring
   either the use of a long scatter-gather list with many
   segments, or an additional copy of all the payload data.

   Although this is arguably a temporary problem until custom RDMA
   hardware is available, one should not underestimate the importance
   of the transition phase to ultimate success.

7.   IFT and SCTP

   IFT is a framing protocol for TCP only and does not address SCTP.
   It is expected that IFT will be a temporary transitional solution
   in the event that SCTP ultimately achieves widespread use.
   It would become a long term solution only if SCTP fails to achieve
   widespread use and acceptance.

8.  Security Considerations 

   It is expected that IFT introduces no new security considerations.
   It has all the strengths and weaknesses normally associated with
   TCP, and creates no new weaknesses.  All application related
   security issues are the responsibility of higher layer protocols.



9.  References 

   [MPA] P. Culley et al., draft-culley-iwarp-mpa-00.txt,
         September 16, 2001

   [iSCSI] Satran, Julian, draft-ietf-iscsi-15.txt, July 30, 2002

   [TCP] Postel, J., "Transmission Control Protocol - DARPA Internet  
       Program Protocol Specification", RFC 793, September 1981.  

   [DDP] H. Shah et al., "Direct Data Placement over Reliable
       Transports", RDMA Consortium Draft Specification
       draft-shah-rdmap-ddp-00.txt, September 2002

   [RDMA] R. Recio et al., "RDMA Protocol Specification", RDMA 
       Consortium Draft Specification draft-recio-rdmap-rdma-00.txt, 
       September 2002 

   [SCTP] R. Stewart et al., "Stream Control Transmission Protocol", 
       RFC 2960, October 2000. 


10 Author's Addresses 

   Jim Williams 
       Emulex Corporation 
       580 Main Street 
       Bolton, MA 01740 USA 
       Phone: +1 978 779 7224 
       Email: jim.williams@emulex.com 

