


   INTERNET-DRAFT                               J. Williams 
   draft-williams-iwarp-ift-01.txt              Emulex Corporation
   Expires:  August 2003
                                              
                                                February 2003 
                                      

             iWARP Framing for TCP

1.  Status of this Memo

     This document is an Internet-Draft and is in full conformance with
     all provisions of Section 10 of RFC2026.  Internet-Drafts are work-
     ing documents of the Internet Engineering Task Force (IETF), its
     areas, and its working groups. Note that other groups may also dis-
     tribute working documents as Internet-Drafts.  Internet-Drafts are
     draft documents valid for a maximum of six months and may be
     updated, replaced, or obsoleted by other documents at any time. It
     is inappropriate to use Internet-Drafts as reference material or to
     cite them other than as "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt.  The list of
     Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.


2.  Abstract

   A framing protocol is defined for DDP over TCP that is fully
   compliant with applicable TCP RFCs and fully interoperable with
   existing TCP implementations.

   The protocol offers two things: definition of the DDP record
   boundaries within the TCP stream, and added data integrity
   by means of added CRC protection.  In addition, an adaption
   mechanism is defined that confirms use of RDDP and negotiates
   parameters including use of CRCs, markers, and enabling of
   speculative placement.


3.  Acknowledgements

   The detailed and painstaking effort of the RDMA Consortium members
   is acknowledged, as well as that of others who have contributed to
   defining an RDMA over IP protocol.









   J. Williams              Expires: August 2003               [Page 1] 


   INTERNET-DRAFT       iWARP Framing for TCP           February 2003 

4.  Adaption Indication

   It is expected that RDDP mode may be used immediately at connection
   setup time, or alternatively may be initiated at some time after
   the connection has been set up.  The latter case would be used
   when an existing ULP negotiates an upgrade to RDDP mode.

   The adaption indication is sent in band, but MUST NOT be the initial
   offer to use RDDP.  The ULP MUST first agree to use RDDP, or else
   the port number MUST be one that is agreed to be used for RDDP.
   The sending of the adaption indication is a confirmation of this
   agreement to use RDDP, and also a means to negotiate the particular
   parameters of the RDDP connection.

4.1 Adaption Indication Format

   The adaption indication is exactly 64 bytes in length and has
   the following format.

     +--------------+---------------+---------------+---------------+
     |                     Magic number ( 0x52444450 )              |
     +--------------+---------------+---------------+---------------+
     |          version             |         BC version            |
     +--------------+---------------+---------------+---------------+
     |                            Flags                             |
     +--------------+---------------+---------------+---------------+
     |       Sender's VTag          |         marker interval       |
     +--------------+---------------+---------------+---------------+
     |                                                              |
     |                          reserved  (44 bytes)                |
     |                                                              |
     |                                                              |
     +--------------+---------------+---------------+---------------+
     |                             CRC                              |
     +--------------+---------------+---------------+---------------+
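
   As an illustration only, the 64-byte layout above could be packed
   and parsed as in the following Python sketch.  The draft does not
   state a byte order, nor the exact coverage of the CRC field, so the
   sketch assumes network byte order and a CRC-32c computed over the
   first 60 bytes.  The function names are hypothetical.

```python
import struct

MAGIC = 0x52444450            # ASCII "RDDP"
_FMT = ">IHHIHH44s"           # assumption: network byte order; 60 bytes before the CRC

def crc32c(data, crc=0):
    """Bitwise CRC-32c (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def pack_indication(flags, vtag, marker_interval, version=1, bc_version=1):
    """Build a 64-byte adaption indication with reserved bytes set to zero."""
    body = struct.pack(_FMT, MAGIC, version, bc_version, flags,
                       vtag, marker_interval, bytes(44))
    # assumption: the CRC covers the preceding 60 bytes and is sent big-endian
    return body + struct.pack(">I", crc32c(body))

def parse_indication(buf):
    """Validate length, CRC and magic number, then return the fields."""
    if len(buf) != 64:
        raise ValueError("adaption indication must be exactly 64 bytes")
    body = buf[:60]
    (crc,) = struct.unpack(">I", buf[60:])
    if crc != crc32c(body):
        raise ValueError("bad CRC")
    magic, version, bc_version, flags, vtag, interval, _reserved = \
        struct.unpack(_FMT, body)
    if magic != MAGIC:
        raise ValueError("bad magic number")
    # reserved bytes are ignored on receipt
    return {"version": version, "bc_version": bc_version, "flags": flags,
            "vtag": vtag, "marker_interval": interval}
```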

4.2 Field definitions

   Version     Defines protocol version of the adaption indication.
               This document describes version 1.  The field
               MUST have value of one.

   BcVersion   Backward compatible version.  For this version of the
               protocol, this field MUST be set to 1.  In general it
               will be set to the minimum version with which the
               sender is backward compatible.  On receipt, a BC
               version greater than the receiver's current version
               MUST be rejected as incompatible.  A received version
               greater than the receiver's current version need not
               be rejected, as long as the BC version is not also
               greater.
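
   The acceptance rule for the two version fields can be sketched as
   follows.  This is a minimal illustration in Python; the function
   and constant names are hypothetical and not part of the protocol.

```python
CURRENT_VERSION = 1   # the version described by this document

def is_compatible(version, bc_version, current=CURRENT_VERSION):
    """Return True if a received adaption indication may be accepted.

    Only the BC version decides the outcome; `version` itself may
    exceed `current` without causing rejection.
    """
    return bc_version <= current
```

   For example, a peer announcing version 3 with BC version 1 is
   accepted by a version 1 receiver, while a peer announcing version 3
   with BC version 2 is rejected.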




   The following flag bits are defined starting with the MSB of
   the flag field and proceeding left to right.  The remainder of the
   flag field is reserved. 

   PH          Adaption is a two-phase negotiation, with phases
               numbered 0 and 1.  Phase 0 is sent from node A
               to node B.  On receipt of phase 0, node B sends phase
               1 back to node A.  This bit indicates phase 0 or 1.

   R           Indicates sender wants to use RDMA mode.  If both
               ends set this bit, then connection is in RDMA mode.
               If both ends clear this bit, then the connection is
               in DDP mode.  If ends disagree, then the connection
               MUST be closed in error.

   HC          Indicates that the sender will be including a header
               CRC.  If the sender's choice to use or not use a header
               CRC is unacceptable to the receiver, the receiver
               SHOULD abort the connection.

   PC          Indicates that the sender will be including a payload
               CRC.  If the sender's choice to use or not use a payload
               CRC is unacceptable to the receiver, the receiver
               SHOULD abort the connection.

   IH          Indicates that the sender will ignore and not check
               any header CRC which the receiver of the adaption
               indication sends.  If this is unacceptable to the
               receiver of the adaption indication, the connection
               should be closed.

   IP          Indicates that the sender will ignore and not check
               any payload CRC which the receiver of the adaption
               indication sends.  If this is unacceptable to the
               receiver of the adaption indication, the connection
               should be closed in error.

   SP          Indicates the sender is giving the receiver permission
               to do speculative placement.  If this bit is set, the
               receiver MAY do speculative placement.  If not set,
               the receiver MUST NOT do speculative placement.
               See Section 8 for a full description of
               speculative placement.

   SM          Indicates sender of adaption indication is willing to
               send periodic markers.  Markers will be sent only if
               the SM bit is set and other end of the connection sets
               the RM bit in its adaption indication.  






   RM          Indicates the sender of the adaption indication wishes
               to receive periodic markers.  Markers will be received
               if and only if the RM bit is set and the other end
               sets the SM bit in its adaption indication.

   Sender's VTag
               The sender's VTag is a 16 bit value that SHOULD be
               selected at random using a good random number
               generator.  It is included in all outgoing RDDP
               PDUs sent by the receiver of this adaption indication.
               The main purpose is to provide padding to align
               the RDDP header, however it also provides some added
               checking as to the validity of the IFT header, and
               some added confidence for doing speculative placement.

   Marker Interval
               Used to negotiate use of periodic markers and determine
               the interval at which they will be sent.  The exact
               method of negotiation is specified in Section 9
               (Periodic Markers).

   CRC         CRC computed on adaption indication is REQUIRED
               regardless of whether header or payload CRCs are
               negotiated in either direction.

   reserved    All reserved bits MUST be set to zero by the sender
               and ignored by the receiver.
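
   The way the flag bits of the two adaption indications combine can
   be sketched in Python as follows.  The bit positions are an
   assumption: the sketch assigns the flags to consecutive bits
   starting at the MSB (bit 31) of the 32-bit flag field, in the order
   in which they are defined above.  The names negotiate(), "local"
   and "remote" are hypothetical.

```python
# Assumption: flags occupy consecutive bits from the MSB, in document order.
PH, R, HC, PC, IH, IP, SP, SM, RM = (1 << (31 - i) for i in range(9))

def negotiate(local, remote):
    """Derive connection parameters from the two 32-bit flag words.

    `local` is the flag word this end sent, `remote` the one it received.
    Returns None when the R bits disagree, in which case the connection
    MUST be closed in error.
    """
    if (local & R) != (remote & R):
        return None
    return {
        "mode": "RDMA" if local & R else "DDP",
        # markers flow only if the sending end set SM and the other end set RM
        "send_markers": bool(local & SM and remote & RM),
        "recv_markers": bool(local & RM and remote & SM),
        # the peer's SP bit gives this end permission to place speculatively
        "may_speculate": bool(remote & SP),
        "send_header_crc": bool(local & HC),
        "send_payload_crc": bool(local & PC),
    }
```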




























5.  Protocol definition


5.1 Frame format

     +--------------+---------------+---------------+---------------+
     |        header_size           |         payload_size          |
     +--------------+---------------+---------------+---------------+
     |       receiver's VTag        |                               |
     +--------------+---------------+                               |
     |                        DDP Header                            |
     |                                                              |
     |                                                              |
     +--------------+---------------+---------------+---------------+
     |                         Header CRC (if negotiated)           |
     +--------------+---------------+---------------+---------------+
     |                                                              |
     |                                                              |
     |                                                              |
     |                        DDP Payload                           |
     |                                                              |
     |                                                              |
     |                                                              |
     |                                                              |
     |                                                              |
     +--------------+---------------+---------------+---------------+
     |                        Payload CRC (if negotiated)           |
     +--------------+---------------+---------------+---------------+

   The header_size and payload_size are 16 bit fields containing the
   number of bytes in the DDP header and payload respectively.

   The header CRC covers both the IFT header (header_size and
   payload_size fields) and the DDP header.  The payload CRC covers
   only the DDP payload.  The CRC uses the CRC-32c algorithm as 
   defined in [iSCSI].  Note that the RDMA header (if present) and
   headers associated with higher level protocols are considered
   as part of the DDP payload, and not covered by the IFT header CRC.
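
   A minimal Python sketch of a sender building one frame is given
   below.  It is illustrative only and rests on assumptions that this
   document does not settle: fields are in network byte order, and the
   header CRC is computed over every byte from header_size through the
   end of the DDP header, which includes the VTag.  The function name
   build_frame() is hypothetical.

```python
import struct

def crc32c(data, crc=0):
    """Bitwise CRC-32c (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def build_frame(vtag, ddp_header, payload, header_crc=True, payload_crc=True):
    """Return one IFT frame; CRCs are appended only if negotiated."""
    # header_size and payload_size count DDP header and payload bytes only
    frame = struct.pack(">HHH", len(ddp_header), len(payload), vtag)
    frame += ddp_header
    if header_crc:
        # assumption: coverage runs from header_size to the end of the DDP header
        frame += struct.pack(">I", crc32c(frame))
    frame += payload
    if payload_crc:
        frame += struct.pack(">I", crc32c(payload))   # payload only
    return frame
```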
















   The above format shows the DDP header and DDP payload size as
   being a multiple of four bytes.  This is not necessary, however,
   and no padding is added to align the CRC.  An example of 
   an unaligned frame is shown below.

     +--------------+---------------+---------------+---------------+
     |        header_size           |         payload_size          |
     +--------------+---------------+---------------+---------------+
     |       receiver's VTag        |                               |
     +--------------+---------------+                               |
     |                                                              |
     |                        DDP Header                            |
     |                              +---------------+---------------|
     |                              |                               |
     +--------------+---------------+---------------+---------------+
     |  Header CRC (if negotiated)  |                               |
     +--------------+---------------+                               | 
     |                                                              |
     |                                                              |
     |                                                              |
     |                        DDP Payload                           |
     |                                                              |
     |                                                              |
     |                                                              |
     +              +---------------+---------------+---------------+
     |              |         Payload CRC (if negotiated)           |
     +--------------+---------------+---------------+---------------+
     |              |
     +--------------+
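
   Because no padding is inserted, the position of each CRC follows
   directly from the two size fields.  The Python sketch below shows
   the arithmetic.  It assumes the fixed part of the IFT header is six
   bytes (two size fields plus the VTag); the helper names are
   hypothetical.

```python
IFT_FIXED = 6   # header_size + payload_size + VTag, two bytes each
CRC_LEN = 4     # each CRC is 32 bits

def header_crc_offset(header_size):
    """Byte offset of the header CRC from the start of the frame."""
    return IFT_FIXED + header_size

def frame_length(header_size, payload_size, header_crc, payload_crc):
    """Total frame length in bytes; no alignment padding is ever added."""
    return (IFT_FIXED + header_size + (CRC_LEN if header_crc else 0)
            + payload_size + (CRC_LEN if payload_crc else 0))
```

   With a 12 byte DDP header, for instance, the header CRC begins at
   byte offset 18, which is not a multiple of four.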

6.   Ordering semantics
  
   The IFT layer receives all data in order from the TCP layer and
   delivers all data in order to the DDP layer.  Note that this
   does not preclude a merged layer implementation from placing
   data out of order, but any such implementation MUST be functionally
   equivalent to a layered implementation in which TCP delivers
   all data in order.

7.   Motivation 

   The IFT header contains the size of the DDP Header and DDP payload
   in bytes.  Assuming in order processing of received TCP data, 
   this is fully sufficient to define the DDP PDU boundaries.
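
   The following Python sketch illustrates how the size fields alone
   delimit PDUs in an in-order byte stream.  It assumes network byte
   order and a six byte fixed IFT header; CRC fields are skipped
   rather than verified, and the name split_frames() is hypothetical.

```python
import struct

def split_frames(stream, header_crc=True, payload_crc=True):
    """Split in-order TCP data into (vtag, ddp_header, payload) tuples.

    Returns the list of complete frames and the unconsumed tail, which
    the caller keeps until more TCP data arrives.
    """
    frames, pos = [], 0
    hcrc = 4 if header_crc else 0
    pcrc = 4 if payload_crc else 0
    while len(stream) - pos >= 6:
        hsize, psize, vtag = struct.unpack_from(">HHH", stream, pos)
        total = 6 + hsize + hcrc + psize + pcrc
        if len(stream) - pos < total:
            break                          # frame is not complete yet
        h0 = pos + 6                       # start of DDP header
        p0 = h0 + hsize + hcrc             # start of DDP payload
        frames.append((vtag, stream[h0:h0 + hsize], stream[p0:p0 + psize]))
        pos += total
    return frames, stream[pos:]
```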

   The header and payload CRCs are 32 bits each, and provide
   additional protection against data corruption.  In the event
   of a CRC error on received data, the IFT layer will notify
   the next layer that the data contains an error.  That next
   layer will define the action to be taken.  Typical action
   is to close the connection with a fatal error.



7.1   Motivation for separating header and payload CRCs.

   There are three important reasons for this separation.
   First, because of TCP segmentation, the entire payload may
   not be received together with the header.  The next protocol
   layer (DDP) may wish to place the portion of the payload that has
   been received, but this can't be safely done until the header,
   which indicated where the data should be placed, has been
   verified.

   The second important reason is that it makes hardware
   implementations significantly more efficient in that the
   payload CRC can be calculated as the payload data is streamed
   from NIC memory to host memory.  This streaming can't take
   place until the header (and therefore the destination host address)
   has been verified correct.

   There are only three ways the payload CRC can be verified,
   on the way into the NIC buffer, on the way out of the NIC buffer
   towards host memory, or by making a separate access
   to the NIC buffer memory just for the purpose of CRC
   verification.  The third option causes significant 
   inefficiencies in terms of required memory bandwidth.
   Verifying the CRC while the data is on the way into the
   NIC buffer is great if it can be done, but is generally
   not practical if the CRC is part of a layer above the
   transport (TCP in this case) layer.  This is because the transport
   processing must be done first, and the time between receiving
   the packet on the link and writing it to buffer memory is too brief
   to complete the transport processing.

   Therefore the proposal requires checking only the header CRC with
   a separate access to buffer memory, and allows the payload CRC to
   be verified as the data is streamed from NIC buffer to host buffer.
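
   The streaming verification described above can be modelled in
   software as an incremental CRC that is updated as each chunk of
   payload is copied.  The Python sketch below is only an analogy for
   the hardware data path; the bytearray "sink" stands in for host
   memory, and stream_payload() is a hypothetical name.

```python
def crc32c(data, crc=0):
    """Bitwise CRC-32c; pass the previous result as `crc` to continue."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def stream_payload(chunks, expected_crc, sink):
    """Copy payload chunks to `sink` while accumulating the CRC.

    Returns True if the accumulated CRC matches `expected_crc`.  No
    second pass over the buffered data is needed.
    """
    crc = 0
    for chunk in chunks:
        sink.extend(chunk)          # stands in for the NIC-to-host transfer
        crc = crc32c(chunk, crc)    # updated in the same pass
    return crc == expected_crc
```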

   The third reason is that some applications may require a header
   CRC but not require a payload CRC.  This may be for a number
   of reasons, including the presence of added end to end checks
   at the ULP level, or simply the ability of the ULP to tolerate
   data errors (but not placement errors).













7.2   Motivation for not requiring padding to align CRCs.

   Experience building hardware for [iSCSI] has shown that the
   hardware required to compute unaligned CRCs is trivial, requiring
   only a couple of byte shifters.  The hardware required to insert
   the padding is an order of magnitude more complex, and affects
   control timing in ways that require complex verification.

   Since the only claimed benefit of padding insertion was to 
   simplify hardware design, and since the result was exactly
   the opposite, this proposal does not include padding.

8.   Speculative Placement

   Speculative placement is done by the receiver of a RDDP PDU.
   On receiving an out of order TCP segment, the receiver
   guesses at the location of the RDDP PDU within the
   TCP segment.  This guess is confirmed by doing a number
   of checks, and if the checks pass, directly placing
   the payload data.  If the confirmation fails,
   speculative placement MUST NOT be done.

   If the PDU identified appears to be in error, the error
   MUST NOT be reported speculatively.  In the case of either the
   failed confirmation or PDU error, nothing may be done with the
   PDU until it can be processed in order.

   Prior to delivering the payload data to the ULP, the
   alignment is verified by receiving all preceding PDUs and
   using the IFT length fields to absolutely verify that
   the alignment was correct.  If this verification fails,
   the PDU MUST be placed again.   If an implementation has
   discarded the original after placing it (which it typically
   will do), then it MUST withhold TCP acknowledgement of this
   segment and force the remote end to retransmit it.
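
   The final alignment verification can be sketched as follows: once
   all preceding PDUs have arrived in order, their frame lengths are
   summed to find the true PDU boundaries, and the speculative guess
   is confirmed only if it coincides with one of them.  This is a
   minimal Python illustration; the function name and the use of
   stream offsets relative to a known PDU boundary are assumptions.

```python
def alignment_confirmed(speculative_offset, preceding_frame_lengths,
                        first_offset=0):
    """Check a speculatively guessed PDU start against in-order framing.

    `first_offset` is a stream offset known to be a PDU boundary and
    `preceding_frame_lengths` are the total lengths of the frames
    received in order from that point on.
    """
    pos = first_offset
    for length in preceding_frame_lengths:
        if pos >= speculative_offset:
            break
        pos += length
    return pos == speculative_offset
```

   If the function returns False, the speculative placement was made
   at a wrong offset and the PDU has to be placed again.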

8.1   Confirmation checks

   Before doing speculative placement, the IFT and DDP headers
   should be checked.  The specific fields checked include the
   IFT header length, IFT payload length, IFT VTag, IFT header
   CRC, DDP STag, DDP QN, DDP MSN, DDP TO, and DDP DV.  The
   subset of these which exists SHOULD all be checked, and
   any field containing an invalid value disqualifies the PDU for
   speculative placement.
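
   A sketch of the IFT-level part of these checks is shown below in
   Python.  It is illustrative only and makes several assumptions:
   network byte order, a negotiated header CRC that covers the bytes
   from header_size through the end of the DDP header, and a
   hypothetical upper bound max_header on the DDP header size.  The
   DDP field checks are left to a caller-supplied function, since the
   DDP header layout is defined in [DDP] and not here.

```python
import struct

def crc32c(data, crc=0):
    """Bitwise CRC-32c (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def ift_checks_pass(segment, offset, expected_vtag, max_header=64,
                    ddp_check=None):
    """Confirmation checks for a PDU guessed to start at `offset`."""
    if len(segment) - offset < 6:
        return False
    hsize, psize, vtag = struct.unpack_from(">HHH", segment, offset)
    if vtag != expected_vtag or not 0 < hsize <= max_header:
        return False
    end_hdr = offset + 6 + hsize
    if len(segment) < end_hdr + 4:
        return False                       # header CRC not in this segment
    (crc,) = struct.unpack_from(">I", segment, end_hdr)
    if crc != crc32c(segment[offset:end_hdr]):
        return False
    if ddp_check is None:
        return True
    # hypothetical hook for the DDP STag, QN, MSN, TO and DV checks
    return ddp_check(segment[offset + 6:end_hdr], psize)
```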










8.2   Overwrite checks

   Before speculative placement is done, the RDDP implementation
   MUST ensure that no previously placed data is overwritten.  This
   is necessary to ensure that, if the speculative placement is
   being done in error, the error is recoverable.

   This implies that the RDDP implementation must track what
   portions of a buffer have been written at any time, or
   if the RDDP implementation loses track, then do no further
   speculative placements in that buffer.

   If an in-order placement would overwrite a previously done
   speculative placement, then that in-order placement should be
   done; the preceding speculative placement is regarded as
   invalid and needs to be redone in order.
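
   One possible way to track written portions of a buffer is a sorted
   list of byte ranges, as in the Python sketch below.  This is an
   illustration of the bookkeeping only, not a required data
   structure; the class and function names are hypothetical.

```python
class WrittenRanges:
    """Tracks which byte ranges [start, end) of a buffer have been placed."""

    def __init__(self):
        self._ranges = []          # sorted, non-overlapping (start, end) pairs

    def overlaps(self, start, end):
        """True if [start, end) touches any byte that was already written."""
        return any(s < end and start < e for s, e in self._ranges)

    def add(self, start, end):
        """Record [start, end) as written, merging adjacent ranges."""
        merged = []
        for s, e in self._ranges:
            if e < start or end < s:
                merged.append((s, e))
            else:                  # touching or overlapping: absorb it
                start, end = min(s, start), max(e, end)
        merged.append((start, end))
        self._ranges = sorted(merged)

def try_speculative_place(tracker, start, end):
    """Place speculatively only if no previously placed data is overwritten."""
    if tracker.overlaps(start, end):
        return False
    tracker.add(start, end)
    return True
```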

8.3   Unwritten Portions of Buffers

   Because speculative placement may write erroneously to unwritten
   portions of buffers, applications that allow speculative
   placement MUST assume that when a buffer is delivered to them
   by RDDP, any unwritten portion of the buffer contains unpredictable
   data.  Applications MUST NOT assume that unwritten portions
   of buffers are unmodified.

9.  Periodic Markers

   Periodic markers may be inserted in the data stream as a means
   of locating the beginning of RDDP PDUs when TCP segments are
   received out of order.

   Details of markers are TBD.



10.   IFT and SCTP

   IFT is a framing protocol for TCP only and does not address SCTP.
   It is expected that IFT will be a temporary transitional solution
   in the event that SCTP ultimately achieves widespread use.
   It would become a long term solution only if SCTP fails to achieve
   widespread use and acceptance.











11.  Security Considerations 

   It is expected that IFT introduces no new security considerations.
   It has all the strengths and weaknesses normally associated with
   TCP, and creates no new weaknesses.  All application related
   security issues are the responsibility of higher layer protocols.



12.  References 

   [MPA] P. Culley et al., draft-culley-iwarp-mpa-00.txt,
         September 16, 2001

   [iSCSI] Satran, Julian, draft-ietf-iscsi-15.txt, July 30, 2002

   [TCP] Postel, J., "Transmission Control Protocol - DARPA Internet  
       Program Protocol Specification", RFC 793, September 1981.  

   [DDP] H. Shah et al., "Direct Data Placement over Reliable 
       Transports", RDMA Consortium Draft Specification draft-shah-
       rdmap-ddp-00.txt, September 2002 

   [RDMA] R. Recio et al., "RDMA Protocol Specification", RDMA 
       Consortium Draft Specification draft-recio-rdmap-rdma-00.txt, 
       September 2002 

   [SCTP] R. Stewart et al., "Stream Control Transmission Protocol", 
       RFC 2960, October 2000. 



13. Author's Addresses 

   Jim Williams 
       Emulex Corporation 
       580 Main Street 
       Bolton, MA 01740 USA 
       Phone: +1 978 779 7224 
       Email: jim.williams@emulex.com 











