Internet-Draft                                    S. DiCecco
                                                  J. Williams
Expires May 2001                                  Giganet, Inc.
                                                  Bill Terrell
                                                  TROIKA Networks, Inc.
                                                  John Scott
                                                  Network Appliance, Inc.
                                                  C. Sapuntzakis
                                                  Cisco Systems
                                                  November 17, 2000

                      VI / TCP (Internet VI)

Status of this memo

This document is an Internet-Draft and is offered in full accordance with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Draft documents are valid for a maximum of six months and may be updated, replaced, or rendered obsolete by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress".

The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this memo are to be interpreted as described in RFC2119.
Table of Contents

1 Abstract
2 Overview
  2.1 VI Architectural Components
  2.2 VI/TCP
    2.2.1 Extensions to VI
    2.2.2 VI/TCP Overview
      2.2.2.1 Basic VI Components
      2.2.2.2 Introduction to VI/TCP
        2.2.2.2.1 VI/TCP Addressing
        2.2.2.2.2 VI/TCP Connection Management
        2.2.2.2.3 VI/TCP Protocol Messaging
        2.2.2.2.4 TCP/IP Options and VI/TCP
        2.2.2.2.5 VI/TCP Retransmissions
        2.2.2.2.6 Note on Outstanding RDMA Reads
3 The VI/TCP Protocol
  3.1 VI/TCP Segment Format
  3.2 VI/TCP Segment Header
  3.3 VI/TCP Connection Establishment (CE) Header
  3.4 VI/TCP RDMA Header
  3.5 VI Trailer
  3.6 CRC option
  3.7 Urgent Marker
4 VI/TCP Connection Establishment
  4.1 Basic Connection Establishment Timeline
  4.2 Connection Establishment - Active
  4.3 Connection Establishment - Passive
5 Security Considerations
6 Intellectual Property
7 References
8 Author's Addresses

1. Abstract

The Virtual Interface (VI) architecture [VIAR] describes a high performance design for interfacing distributed applications to accelerated protocol processing. VI seeks to improve the performance of such applications by reducing the latency and overheads associated with standard communications protocol stack processing. VI greatly reduces the processing overhead associated with traditional network architectures by providing applications a protected, directly accessible interface to network hardware - a Virtual Interface.

This memo describes extensions to the VI Architecture designed to facilitate operation over TCP/IP. These extensions take the form of enhancements to the VI Provider Library API defined in the VI Architecture Developer's Guide [VIDG], and a "VI Protocol" which supports VI functionality during operation over TCP/IP.

The extensions to the VI Architecture which support operation over TCP/IP are intended to be fully compliant with the VI Architecture [VIAR] and its associated Developer's Guide [VIDG].

2. Overview

This section contains a brief overview of VI components and a functional overview of VI operation over TCP/IP.

2.1. VI Architectural Components

VI comprises four architectural components - Virtual Interfaces, Completion Queues, VI Providers, and VI Consumers.

Virtual Interfaces (VIs) are the mechanisms that allow VI Consumers direct access to the data transfer services of VI Providers. VI Consumers post data transfer requests, in the form of Descriptors, directly to the VI Provider. Descriptors are structures that contain the information necessary for the VI Provider to process the data transfer (e.g., data location). Descriptors are posted to Work Queues (send and receive) associated with the VI. Facilities are provided to signal VI Descriptor postings to the network adapter. Processing of posted Descriptors is asynchronous and Descriptors are marked when processing completes. VI Consumers remove completed Descriptors from Work Queues for reuse in subsequent requests.

Completion Queues provide a facility whereby VI Consumers can create a single point of notification for processing completed Descriptors. Once a Work Queue is associated with a Completion Queue, all of its completions are handled via that Completion Queue.

The VI Provider consists of a physical network interface (NIC) and driver functionality. The VI NIC implements the Virtual Interfaces and Completion Queues, and directly performs data transfers. VI NIC drivers provide the control and resource management functions to maintain the VI between consumers and VI NICs.

VI Consumers are typically application programs and their supporting operating system functions.
VI Consumers represent the users of a Virtual Interface. Access to the Virtual Interface is through a library referred to as the VI Provider Library [VIDG]. The VI Provider Library provides an application programming interface for hardware connection, endpoint creation and destruction, connection management, memory handling, data transfer, queue management, informational queries, name services, and error handling.

2.2. VI/TCP

This section introduces the fundamentals of VI operation over TCP.

2.2.1. Extensions to VI

The proposed protocol supports the VI Architecture as currently defined. In addition, the protocol supports certain enhancements to VI. Extensions to the API defined in [VIDG] would be required to exploit such enhancements. Proposed enhancements are as follows:

- Descriptor Flow Control: Transmit descriptors may be posted in advance of the corresponding receive descriptors. The VI Provider will supply flow control.

- Attribute Negotiation: VI Architecture requires that incoming connection establishment attempts be rejected unless the calling and called VI Attributes match (e.g., Maximum Transfer Unit Size). The protocol permits downward negotiation of MTU sizes. The smaller of the two VI MTU sizes proposed by the two ends at connection setup is used for the connection.

2.2.2. VI/TCP Overview

This section provides an overview of how the components of a Virtual Interface are created, managed, and destroyed, and also introduces the data transfer models.

2.2.2.1. Basic VI Components

Operations on basic VI architectural components remain largely unchanged with VI/TCP. VI functionality is invoked by a VI Consumer through the API defined in [VIDG].

Access to a VI NIC is achieved by opening a handle to the driver representing the NIC. This handle is used in subsequent operations.

All memory used in data transfer is "registered" with the VI Provider.
Memory handles are used to identify the region and to qualify virtual memory addresses.

VIs are created by the VI Provider upon request by the VI Consumer. Connections are not established by creation of a VI and no data transfer can occur until the VI is connected to another.

VI Work Queues may be associated with Completion Queues to provide a single handling point for completed VI Descriptors.

VI provides a connection-oriented data transfer service. Newly created VIs are not pre-associated with other VIs; a VI must be explicitly connected to another to enter its data transfer phase.

VI provides two types of data transfers - traditional Send/Receive, and Remote Direct Memory Access (RDMA).

2.2.2.2. Introduction to VI/TCP

This section serves as an introduction to VI operation over TCP/IP.

2.2.2.2.1. VI/TCP Addressing

The VI Architecture defines a generic "VI Network Address" format consisting of an "address" portion and a "discriminator" portion. With VI/TCP, the address portion contains an IP address and the discriminator is as defined by the VI Architecture [VIAR].

One transport layer port is reserved for passive connection establishment. All incoming VI connections are through this port and VI applications distinguish themselves by the VI Network Address discriminator. For active connection establishment, multiple transport layer ports are used.

2.2.2.2.2. VI/TCP Connection Management

With VI/TCP, a VI connection is implemented over an underlying TCP connection. The VI/TCP connection establishment process requires an underlying TCP connection over which VI/TCP protocol messages may be exchanged.

VI connections have a one-to-one correspondence with TCP connections. This is referred to as the VI/TCP connection. When a VI connection is closed, the underlying TCP connection must be closed.
Similarly, when a TCP connection is closed, the associated VI connection must be closed.

VI Providers handling VipConnectRequest primitives [VIDG] first request TCP to establish its connection and then perform VI/TCP protocol messaging over this underlying connection. VI Providers must have accepted an underlying TCP connection before the associated VI connection is accepted. VI/TCP Providers MUST check that address, handles, and attributes are valid for the underlying connection.

From the perspective of a VI/TCP Provider, TCP connection setup is an atomic operation that either succeeds or fails. If the operation succeeds, VI connection establishment is initiated; otherwise, the VI connection is rejected.

2.2.2.2.3. VI/TCP Protocol Messaging

VI/TCP functionality is invoked by a VI Consumer through the API defined in [VIDG]. The VI Provider supplies this functionality through use of the VI Protocol.

The VI Protocol defines "messages" to implement these VI functions (e.g., connection establishment). Typically, there is one message per Transmit Descriptor. Each message has a type (e.g., RDMA Write).

VI messages are divided into "segments". These segments are sent, in order, over the associated TCP connection. It is recommended, but not required, that there be exactly zero or one VI segment for each TCP segment and that VI segments not be fragmented to span multiple TCP segments.

All segments for one VI message will be transmitted before the next message is started. An exception is provided in that RDMA Read Response segments may be interleaved with segments of any message type other than another RDMA Read Response.
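
The ordering rule above can be checked mechanically. The following Python fragment is a minimal sketch and not part of the protocol definition. It assumes each segment is summarized as a (type, message number, data offset) tuple, that a data offset of zero marks the first segment of a message, and that NOP segments (which are not messages) can simply be skipped. The function name and the tuple layout are illustrative assumptions.

```python
# Illustrative sketch only: the names and the tuple layout are assumptions,
# not part of VI/TCP.

RDMA_READ_RESPONSE = "RdmaReadResponse"
NOP = "NOP"


def is_valid_sequence(segments):
    """Check the interleaving rule for (type, message_number, offset) tuples.

    RDMA Read Responses form one lane and every other message type forms a
    second lane. Within a lane, a continuation segment must belong to the
    message most recently started in that lane.
    """
    current = {}  # lane name -> (type, message_number, last_offset)
    for seg_type, msg_num, offset in segments:
        if seg_type == NOP:
            continue  # NOPs are not messages and do not affect ordering
        lane = "read_response" if seg_type == RDMA_READ_RESPONSE else "other"
        if offset == 0:
            # Assumption: offset zero means a new message starts in this lane.
            current[lane] = (seg_type, msg_num, 0)
            continue
        started = current.get(lane)
        if started is None:
            return False  # continuation with no message in progress
        cur_type, cur_num, last_offset = started
        if (seg_type, msg_num) != (cur_type, cur_num):
            return False  # a different message interrupted this lane
        if offset <= last_offset:
            return False  # offsets must grow within a message
        current[lane] = (seg_type, msg_num, offset)
    return True
```

Under these assumptions the example sequence given in this section is accepted, while a sequence in which an RdmaWrite starts between two segments of the same Send is rejected.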
Example of a valid sequence of VI segments contained in the TCP stream:

type                 message number     data offset
----------------------------------------------------------
RdmaWrite            0x123              0x0
Send                 0x124              0x0
RdmaReadResponse     0x777660           0x0
Send                 0x124              0x500
RdmaReadResponse     0x777660           0x333
Send                 0x124              0xA00
RdmaReadResponse     0x777660           0x666
RdmaReadRequest      0x125              0x0
RdmaReadResponse     0x777660           0x999
Send                 0x126              0x0
RdmaReadResponse     0x777661           0x0

2.2.2.2.4. TCP/IP Options and VI/TCP

It is strongly recommended that TCP connections supporting VI/TCP implement the timestamp option for PAWS (protection against wrapped sequence numbers) as defined in RFC1323, TCP Extensions for High Performance [PAWS].

2.2.2.2.5. VI/TCP Retransmissions

VI/TCP will retransmit dropped segments, as required. All retransmission is handled at the TCP layer. It is recommended that retransmitted segments contain the same data as the original dropped segment. In certain circumstances, this will not be possible without undue burden on an implementation. The following exceptions are noted:

- Retransmission is required, but data access results in an access violation and retransmission cannot occur.

- Retransmission is required, but cannot occur because the VI connection has been closed.

- A posted application buffer has changed. This is not allowed per VI architecture and therefore constitutes an error.

If a VI NIC is unable to retransmit original data, it may pad (substituting zero or arbitrary data for the original but maintaining the correct size) and should set the "Transmit Error" bit in the "Type" field of the "VI Segment Header".

With the exception of these error cases, the retransmitted data MUST always be the same as the original data including all VI layer headers and trailers.

2.2.2.2.6. Note on Outstanding RDMA Reads

For each RDMA Read Request received, memory allocated for the request must be held until the response is acknowledged. The number of outstanding RDMA Reads must be limited to control resource exhaustion. Discarding excessive RDMA Reads pending completions of outstanding requests does not seem viable in the absence of a deadlock avoidance mechanism.

The VI/TCP Protocol provides negotiation of the number of outstanding RDMA Reads during connection establishment. This number represents a per-VI limit and the negotiated value remains for the lifetime of the VI/TCP connection.

3. The VI/TCP Protocol

This section provides the VI/TCP protocol data unit formats. All multibyte formats are to be represented in network byte order (i.e., big-endian).

Each VI PDU contains a VI Segment Header. Optionally, an RDMA Header or CE (connection establishment) Header may be present. The VI Segment Header provides sufficient features to support non-RDMA send/receives. The RDMA Header must be included for RDMA transfers. The CE Header must be included for connection establishment. The TCP layer provides a reliable data stream connection and the VI segments are placed in this stream.

3.1 VI Segment Format

+---------------+---------------+---------------+---------------+
|                                                               |
|                       VI Segment Header                       |
|                                                               |
+---------------+---------------+---------------+---------------+
|                                                               |
|                          RDMA Header                          |
|              (Included in RdmaRead and RdmaWrite              |
|                        segments only.)                        |
|                                                               |
+---------------+---------------+---------------+---------------+
|                                                               |
|                           CE Header                           |
|                (Included in ConnectRequest and                |
|                 ConnectAccept segments only.)                 |
|                                                               |
+---------------+---------------+---------------+---------------+
|                                                               |
|                        VI Payload Data                        |
|               (Included in Send, RdmaWrite and                |
|               RdmaReadResponse segments only.)                |
|                                                               |
+---------------+---------------+---------------+---------------+
|                                                               |
|                          VI Trailer                           |
|                                                               |
+---------------+---------------+---------------+---------------+

3.2. VI Segment Header

The VI Segment Header is defined as follows.

|    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+---------------+---------------+
|    Version    |  Type/Flags   |        Segment Length         |
+---------------+---------------+---------------+---------------+
|                          Data Offset                          |
+---------------+---------------+---------------+---------------+
|                        Immediate Data                         |
+---------------+---------------+---------------+---------------+
|                        Message Number                         |
+---------------+---------------+---------------+---------------+
|                          Message ACK                          |
+---------------+---------------+---------------+---------------+
|     Rx Descriptors Posted     |       Remote Error Code       |
+---------------+---------------+---------------+---------------+

Version

This is an 8-bit field indicating the VI/TCP version. This document describes version one, and this field should contain the value 0x1.

Type/Flags

This is an 8-bit field which contains a 5-bit field indicating the packet type and three one-bit flags.

The defined values of the type field are as follows:

- 0 : Send
- 1 : RdmaWrite
- 2 : RdmaReadRequest
- 3 : RdmaReadResponse
- 4 : NOP
- 5 : ConnectRequest
- 6 : ConnectAccept
- 7 : ConnectReject
- 8 : ConnectNoMatch

The flag bits are defined as follows (bit 7 is MSB, bit 0 is LSB):

- BIT 7 : End of Message
  Indicates the current segment is the last of a message.

- BIT 6 : Immediate Data Valid
  As defined by the VI Architecture [VIAR]. This bit MUST be correctly set in each segment of a message.
  If "Immediate Data Valid" does not apply to a particular message type, it MUST be set to zero by the sender and ignored by the receiver.

- BIT 5 : Transmit Error
  Indicates either transmit length or protection error. If the "Transmit Error" bit is set in any segment of a VI message, the receiver MUST regard the entire message as in error, and notify the receiving application accordingly.

   7     6     5     4     3     2     1     0
+-----+-----+-----+-----+-----+-----+-----+-----+
| Eom | IdV | TrE |            Type             |
+-----+-----+-----+-----+-----+-----+-----+-----+

Segment Length

Segment Length is a 16-bit field containing the length of the VI segment including the VI Segment Header and VI Segment trailer. This length can be added to the byte location (within the TCP stream) of the first byte of this segment to get the first byte position of the next segment. The total length exceeds the value contained in the Segment Length by the length of the trailer.

Data Offset

For the initial segment of any message, this 32-bit field will contain zero. For subsequent segments, it will contain the number of bytes already transferred for this message in prior segments. Only segment payload is included in this count; headers are specifically not included.

A Send, RdmaWrite, or RdmaReadResponse message may be divided into multiple VI segments, as these messages carry VI consumer data which may be up to 4GB in size. All other VI messages MUST consist of a single VI segment.

Immediate Data

May hold 32 bits of optional user data as described by the VI Architecture [VIAR]. Each segment of a message must contain the correct value for the immediate data.

If "Immediate Data Valid" is set to zero for a message, the sender MAY place the immediate data from the send (or RDMA write) descriptor in this field. Otherwise the sender MUST place zero in this field.
If the message type does not support immediate data, the sender must place zero in this field.

A receiving end point must ignore the contents of the Immediate Data field if the message type does not support immediate data. If the message type does support immediate data and the "Immediate Data Valid" bit is not set, then the receiver MAY deliver the contents of the Immediate Data field to the user. If the "Immediate Data Valid" bit is set, then the Immediate Data must be delivered to the user.

Message Number

Messages are sequentially numbered by the VI/TCP Provider. The initial Message Number may be varied by an implementation. For RDMA Read Responses, Message Number carries the message number of the corresponding RDMA Read Request. Any two segments with the same type are part of the same message if and only if their message numbers are equal.

Rx Descriptors Posted

Indicates the number of receive Descriptors, modulo 2^16, that have been posted during the lifetime of the VI/TCP connection. If Descriptor flow control is in effect, the VI/TCP provider must delay any transmission which would consume receive Descriptors until receive Descriptors complete and become available.

Notes on NOP

NOPs are used to send a Message ACK or update RxDescriptorsPosted when a VI has no data to transmit. If a VI has data to transmit, the Message ACK and RxDescriptorsPosted number are included in the transferred segments. However, if a transmitter is idle, a NOP is utilized to permit conveyance of this information.

An implementation need not send a NOP to notify the remote end of each Rx descriptor as it is posted; however, sufficient notification SHOULD be given so as not to unnecessarily impede the flow of data.

NOPs do not constitute a VI message and therefore do not occupy space in the message numbering sequence.
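
The header layout above maps directly onto a fixed 24-byte structure. The following Python fragment is a minimal sketch of packing and parsing it, assuming only the field order and widths shown in the VI Segment Header diagram and the Type/Flags bit assignments. The container type, the constant names, and the function names are hypothetical and chosen for this example.

```python
import struct
from collections import namedtuple

# Field order and widths follow the VI Segment Header diagram (24 bytes).
# "!" selects network (big-endian) byte order with no padding.
HEADER_STRUCT = struct.Struct("!BBHIIIIHH")

# Flag bits of the Type/Flags byte; the low five bits carry the type.
FLAG_END_OF_MESSAGE = 0x80        # bit 7
FLAG_IMMEDIATE_DATA_VALID = 0x40  # bit 6
FLAG_TRANSMIT_ERROR = 0x20        # bit 5
TYPE_MASK = 0x1F

TYPE_SEND = 0
TYPE_NOP = 4

# Hypothetical container; the field names are chosen for this example.
SegmentHeader = namedtuple(
    "SegmentHeader",
    "version seg_type flags segment_length data_offset immediate_data "
    "message_number message_ack rx_descriptors_posted remote_error_code",
)


def pack_header(h):
    """Serialize a SegmentHeader into its 24-byte wire form."""
    if h.seg_type & ~TYPE_MASK or h.flags & TYPE_MASK:
        raise ValueError("type must fit in 5 bits and flags in bits 7-5")
    return HEADER_STRUCT.pack(
        h.version, h.flags | h.seg_type, h.segment_length, h.data_offset,
        h.immediate_data, h.message_number, h.message_ack,
        h.rx_descriptors_posted, h.remote_error_code)


def unpack_header(data):
    """Parse the first 24 bytes of a VI segment into a SegmentHeader."""
    (version, type_flags, seg_len, offset, imm, msg_num, msg_ack,
     rx_posted, err) = HEADER_STRUCT.unpack_from(data)
    seg_type = type_flags & TYPE_MASK
    flags = type_flags & 0xE0  # keep only bits 7-5
    return SegmentHeader(version, seg_type, flags, seg_len, offset, imm,
                         msg_num, msg_ack, rx_posted, err)
```

Keeping the type and the flags as separate fields in the container avoids repeated masking at call sites; the two are recombined into the single Type/Flags byte only when the header is serialized.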
The message number field of a NOP must contain the message number of the last (non RDMA read response) message sent.

Message ACK

Message ACK is valid only for VI/TCP connections at the Reliable Reception Level [VIAR]. Message ACK is used in conjunction with Remote Error Code to provide information relating to memory protection or VI Descriptor errors and also to provide facilities for implementation specific error handling.

If the VI Error Type subfield of the Remote Error Code (see Remote Error Code below) indicates "No Error", then Message ACK contains the Message Number from the last VI Message received without error. If VI Error Type is other than "No Error", then Message ACK contains the Message Number of the message segment in error.

Message ACK should not be indicated for a message until both the message data and the associated completion information (if applicable) have been written to host memory. Message ACKs may be included in any VI Message Segment including that of a NOP.

When a VI/TCP connection is supporting the Reliable Reception level, the Message ACK field must be valid and will be used to determine when transmit Descriptors will be completed. RDMA Reads are completed upon receipt of a valid response.

Message ACKs are also indicated for messages received in error. In this case, the VI Error Type field of Remote Error Code is set to reflect the appropriate VI error. Remote Error Code is defined in the following paragraph. Message ACK is invalid on subsequent messages.

When a VI/TCP connection is supporting the Reliable Delivery or Unreliable Delivery level, the contents of Message ACK are undefined and must be ignored by a receiver. The sender may, for simplicity, choose to send ACKs in a manner identical to Reliable Reception. Otherwise the sender should set the value to zero.

Remote Error Code

Remote Error Code comprises two subfields - the VI Error Type and the IS Error Code.
Both VI Error Type and IS Error Code apply to the VI message identified by the Message ACK field. These are defined in the following paragraphs.

IS Error Code

IS Error Code is an Implementation Specific error code. Its semantics are implementation dependent and considered outside the scope of this document. If the IS Error Code is set, Message ACK must be set to indicate the VI Message on which the error occurred. Note that VI errors (see next paragraph) and local errors need not be mutually exclusive and this field may be used to provide supplemental status information.

VI Error Type

VI Error Type contains bits indicating specific error conditions.

- Bit 0 : RDMA Memory Protection Error
- Bit 1 : VI Descriptor Error
- Bit 2 : Unrecoverable Transport Error

When a VI/TCP connection is supporting the Reliable Reception level, the VI Error Type field must be valid and is used to update the Status of the VI Descriptor's Control Segment.

When a VI/TCP connection is supporting the Reliable Delivery or Unreliable Delivery level, VI Error Type is undefined and must be ignored by the receiver. The sender, for simplicity, may choose to set the VI Error Type as done for Reliable Reception, and should set it to zero otherwise.

 15  14  13  12  11  10   9   8   7   6   5   4   3   2   1   0
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|         IS error code         |     reserved      |UTE|VDE|MPE|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

3.3 VI/TCP Connection Establishment (CE) Header

The VI/TCP CE Header is defined as follows.
|    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+---------------+---------------+
|      Calling Attributes       | Calling Discriminator Length  |
+---------------+---------------+---------------+---------------+
|                           MTU Size                            |
+---------------+---------------+---------------+---------------+
|                                                               |
+-                                                             -+
|                                                               |
+-              Calling Discriminator (64 bytes)               -+
|                                                               |
+-                                                             -+
|                                                               |
+---------------+---------------+---------------+---------------+
|   Calling RDMA Read Window    |  Called Discriminator Length  |
+---------------+---------------+---------------+---------------+
|                                                               |
+-                                                             -+
|                                                               |
+-               Called Discriminator (64 bytes)               -+
|                                                               |
+-                                                             -+
|                                                               |
+---------------+---------------+---------------+---------------+
|                                                               |
+-                           Options                           -+
|                                                               |
+---------------+---------------+---------------+---------------+

Calling Attributes

The Calling Attributes field contains the following flag bits. Bit 15 is MSB, bit 0 is LSB.

- Bit 0 : Unreliable
- Bit 1 : Reliable Delivery
- Bit 2 : Reliable Reception
- Bit 3 : RDMA Write Enable
- Bit 4 : RDMA Read Enable
- Bit 5 : Descriptor Flow Control Enabled
- Bit 6 : Peer-to-peer Connection Establishment
- Bits 7-15 : reserved

 15  14  13  12  11  10   9   8   7   6   5   4   3   2   1   0
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|             Reserved              |P2P|DFC|RRE|RWE| RR| RD| UR|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

The reliability level of the two ends of a connection must match, and the peer-to-peer bit must match. However, the RDMA Write Enable, RDMA Read Enable and Descriptor Flow Control Enabled bits do not need to match and may be specified independently by each end of a connection.

If one end sets the "Descriptor Flow Control Enabled" bit, then that end expects to support flow control by not processing sends until the remote receive descriptor is available.
The end receiving the "Descriptor Flow Control Enabled" bit must use the "Rx Descriptors Posted" field in the VI header to notify the other end when and if sends (or RDMA writes with immediate data) may be done. If the "Descriptor Flow Control Enabled" bit is not set in a received CE header, the receiving end MAY, but is not required to, notify the other end of receive descriptors posted via the "Rx Descriptors Posted" field of the VI header.

Calling/Called Discriminator and Discriminator Lengths:

These fields are as defined by the VI Architecture [VIAR].

Calling/Called Discriminator:

These fields contain the discriminators as defined by the VI Architecture. Although the actual length of the discriminator is determined by the associated length field, a 64 byte field is used to hold each discriminator, thereby setting a maximum length of 64 bytes that may be used.

MTU Size

MTU Size is "proposed" in Connect Request PDUs and is considered an "agreed" value in a Connect Accept. The agreed value must be the lesser of the called and calling VI/TCP Providers' MTU capabilities.

Options:

The option field contains zero or more options. Each option has the following format:

  Byte   Byte   Byte   Byte   Byte   .........     Byte
|  0   |  1   |  2   |  3   |  4   |            |  N-1   |
+------+------+------+------+------+------------+--------+
|    Type     |   Length    |            Data            |
+------+------+------+------+------+------------+--------+

The options defined as of this revision are as follows:

End of Option List
   Type = 0
   (no length or data included)

CRC option:
   Type = 1
   Length = 4
   (no data)

Urgent Marker Option:
   Type = 2
   Length = 4
   (no data)

A receiver MUST ignore any unsupported or unknown options.

The option list is terminated by the end of the containing VI segment or by the "End of option list".
If the CRC option is specified, the segment must have a valid CRC and therefore the "End Of Option List" must be explicitly included so the CRC is not interpreted as an option.

3.4 VI/TCP RDMA Header

The VI/TCP RDMA Header is defined as follows.

|    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+---------------+---------------+
|                                                               |
+-                        RDMA Address                         -+
|                                                               |
+---------------+---------------+---------------+---------------+
|                   Registered Memory Handle                    |
+---------------+---------------+---------------+---------------+
|                          RDMA Length                          |
+---------------+---------------+---------------+---------------+

RDMA Address

The RDMA Address field contains the 64-bit data address of the first data segment from the VI Descriptor.

Registered Memory Handle

The Registered Memory Handle field contains the Memory Handle returned when the region of memory containing the data segment was registered with the VI Provider. This is the same memory handle required by the VI Descriptor.

RDMA Length

The RDMA Length field contains the length field from the VI Descriptor that indicates the total number of bytes to be transferred across all segments of a message.

3.5 VI Trailer

The format of the VI trailer is as follows. The CRC immediately follows the last byte of the VI segment payload, or header if there is no payload. It is not necessarily word aligned.

|    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+---------------+---------------+
|                              CRC                              |
+---------------+---------------+---------------+---------------+

The VI trailer is included in CE segments (segments of type ConnectRequest and ConnectAccept) if and only if the CRC option is included in that segment.
A trailer MAY be included in a ConnectReject or ConnectNoMatch VI segment at the option of the sender. All other segments MUST include a trailer CRC if and only if the ConnectRequest and ConnectAccept messages which established the connection both included the CRC option.

3.6 CRC Option

VI/TCP allows for an optional CRC to be included in each segment. In order for this option to be enabled, both ends of a connection must include the CRC option in the option field of the ConnectRequest and ConnectAccept messages.

A ConnectRequest segment will contain a CRC if and only if it contains the CRC option in the options section.

A ConnectAccept MAY contain the CRC option only if the associated ConnectRequest contained the CRC option, and MUST contain a computed CRC if and only if it contains the CRC option.

ConnectReject and ConnectNoMatch segments MAY contain a CRC. Since these types of segments contain no payload, a receiver can determine by means of the segment length if there is a CRC included.

All other segment types MUST contain a CRC if and only if the CRC option was specified in both the ConnectRequest and ConnectAccept segments which established the connection. Otherwise they MUST NOT contain a CRC.

The CRC-32 is calculated across the entire VI segment (but does not cover other segments of the same message, or lower level protocol headers such as TCP).

The algorithm used to calculate the CRC is exactly that used for the Ethernet CRC except that a different generator polynomial is used. The generator polynomial for the VI/TCP CRC is

x^32 + x^31 + x^30 + x^28 + x^27 + x^25 + x^24 + x^22 + x^21 + x^20 + x^16 + x^10 + x^9 + x^6 + 1.

This polynomial is the standard Ethernet polynomial with a left-right reversal. (Or mathematically, substitute y = x^-1 and multiply by y^32.) In hex format with the x^32 term removed, this is 0xDB710641.
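
The relationship between the two polynomials can be confirmed with a few lines of arithmetic. The following Python fragment is a small sketch, not part of the protocol: it reverses the 33 coefficients of the standard Ethernet generator and, separately, rebuilds the hex form from the exponents listed above. The Ethernet constant 0x04C11DB7 is not given in this memo; it is the commonly published value with the x^32 term omitted. The function names are hypothetical.

```python
# Standard Ethernet CRC-32 generator with the x^32 term omitted. This
# constant is not given in the text; it is the commonly published value.
ETHERNET_POLY = 0x04C11DB7

# Exponents of the VI/TCP generator polynomial as listed in the text.
VITCP_EXPONENTS = [32, 31, 30, 28, 27, 25, 24, 22, 21, 20, 16, 10, 9, 6, 0]


def reciprocal_poly32(poly):
    """Reverse the 33 coefficients of a degree-32 polynomial.

    `poly` omits the x^32 term, and so does the returned value. This is the
    "substitute y = x^-1 and multiply by y^32" operation described above.
    """
    full = (1 << 32) | poly                     # restore the x^32 term
    reversed_bits = format(full, "033b")[::-1]  # coefficient k -> 32 - k
    return int(reversed_bits, 2) & 0xFFFFFFFF   # drop the x^32 term again


def poly_from_exponents(exponents):
    """Build the 32-bit hex form from exponents, dropping the x^32 term."""
    return sum(1 << e for e in exponents if e < 32)
```

Both `reciprocal_poly32(ETHERNET_POLY)` and `poly_from_exponents(VITCP_EXPONENTS)` evaluate to 0xDB710641, matching the hex value stated above.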
The CRC computation is described mathematically as follows.

a) Start with the VI segment with zero inserted in the CRC field. This is the entire VI segment including all VI headers and trailers.

b) Complement the first 32 bits of the segment.

c) The n bits of the segment are then considered to be the coefficients of a polynomial M(x) of degree n-1.

d) M(x) is divided by G(x), the generator polynomial defined above, producing a remainder R(x) of degree less than or equal to 31.

e) The bit sequence of R(x) is complemented. The result is the CRC, which is placed in the CRC field of the VI segment.

If a VI segment is received with an incorrect value in the CRC field, one of the following two actions MUST be taken.

1. Drop the segment and do not send a TCP ack covering the bad data. The TCP layer will then attempt to retransmit. This can only be done if the implementation merges the VI and TCP layers.

2. Deliver the data to the application with status indicating transport error. In this case the connection must be closed immediately if the mode is Reliable Delivery or Reliable Reception. The connection may continue if in Unreliable mode. In Reliable Reception mode, a message ack MUST be sent to the remote end indicating an "Unrecoverable Transport Error".

3.7 Urgent Marker Option

Either end may specify the Urgent Marker Option. The end receiving the Urgent Marker Option MAY designate the first byte of any VI segment as TCP urgent data. As specified by RFC 1122, the urgent pointer in the TCP header must point to the urgent byte (the first byte of the VI segment) and not the byte following the urgent byte as some implementations mistakenly do. (If an implementation can't guarantee this, it MUST never designate any urgent data.)

A VI/TCP implementation MUST never designate any data other than the first byte of a VI segment as urgent.
Until the Urgent Marker Option has been received from the remote end
of the VI/TCP connection, no TCP data may be designated as urgent.

4. VI/TCP Connection Establishment

This section contains the state machines governing VI/TCP connection
establishment. Both active and passive (e.g., listen) scenarios are
presented.

For peer-to-peer connection establishment, the same connection
establishment mechanism is used, with one end using active connection
establishment and the other end using passive. The end of the
connection with the "higher" address does an active establishment and
sends the ConnectRequest message. Addresses are compared by treating
each end's IP address as an unsigned binary number (32 bits for IPv4,
128 bits for IPv6) and doing a normal numerical comparison. Receiving
a ConnectNoMatch VI message during peer connection establishment
results in repeated attempts for a period specified by the VI
Consumer's connection timeout value.

4.1. Basic Connection Establishment Timeline

    VIPL API [VIDG]      |    VI/TCP Protocol    |    VIPL API
----------------------------------------------------------------------
                         |                       |
                         |                       | VipConnectWait
  VipConnectRequest      |                       | <-----------------
  ----------------->     |                       |
                         | setup TCP connection  |
                         |                       |
                         |   Connect Request     |
                         | ------------------->  | VipConnectWait(ret)
                         |                       | ----------------->
                         |                       |
                         |                       | VipConnectAccept
                         |   Connect Accept      | <----------------
  VipConnectReq (ret)    | <-------------------  |
  <-----------------     | or Connect Reject     |
                         | or Connect No Match   |

4.2. Connection Establishment - Active

The state machine governing active VI/TCP connection establishment is
as follows:

         +----------------+        (Legend: event - action)
         |  Disconnected  | <---------------------------------+
         +----------------+                                   ^
                 | VipConnectRequest                          |
                 |  - Setup TCP connection                    |
                 |                                            |
                \|/                                           |
         +----------------+ TCP setup fail                    ^
 +------>|   Connecting   |---------------------------------->+
 |       +----------------+                                   ^
 | TCP Closes    | TCP connection established                 |
 |  -            |  - ConnectRequest                          |
 | Reestablish   |                                            |
 |              \|/         ConnectReject or                  |
 |       +----------------+ ConnectNoMatch or Timeout         ^
 +------<| Pending Accept |---------------------------------->+
         +----------------+ - close TCP connection            ^
                 |                                            |
                 | ConnectAccept                              |
                \|/                                           |
         +----------------+ Vip or TCP disconnect             ^
         |   Connected    |---------------------------------->+
         +----------------+ - close TCP connection

4.3. Connection Establishment - Passive

The state machine governing passive VI/TCP connection establishment
is as follows:

 +----------------+                                     +--------------+
 |  Listening on  |                                     | Disconnected |
 |     VI/TCP     |                                     +--------------+
 | Well Known Port|                                               ^
 +----------------+                                               |
         |                                                        |
         | incoming TCP connection - Accept TCP connection        |
         |        (Legend: event - action)                        |
        \|/                                                       |
 +----------------+ Timeout - close TCP connection                ^
 |    Incoming    |---------------------------------------------->+
 +----------------+ TCP connection closes                         ^
         |                                                        |
         | incoming Connect Request                               |
        \|/                                                       |
 +----------------+ No Matching Discriminator                     ^
 |    Matching    |---------------------------------------------->+
 +----------------+ - Send Connect NoMatch; close TCP connection  ^
         |                                                        |
         | Discriminator match                                    |
        \|/                                                       |
 +----------------+ VipConnectReject - ConnectReject, close TCP   ^
 | Pending Accept |---------------------------------------------->+
 +----------------+                                               ^
         |                                                        |
         | VipConnectAccept - ConnectAccept                       |
        \|/                                                       |
 +----------------+ Vip or TCP disconnect - close TCP connection  ^
 |   Connected    |---------------------------------------------->+
 +----------------+

5. Security Considerations

No special security considerations exist at this time.

6. Intellectual Property

The existence of the following US patents is acknowledged: 5,991,818
and 6,094,712. The authors offer no opinion regarding these patents.

7. References

[VIAR]  "Virtual Interface Architecture Specification", Compaq
        Computer Corp., Intel Corporation, Microsoft Corporation,
        1997.

[VIDG]  "Intel Virtual Interface (VI) Architecture Developer's
        Guide", Intel Corporation, September 1998.

[PAWS]  Jacobson, Braden, Borman, "TCP Extensions for High
        Performance", RFC 1323, May 1992.

8. Authors' Addresses

Stephen DiCecco
James Williams
Giganet, Inc.
Concord Office Center
2352 Main Street
Concord, Massachusetts 01742
978.461.0402 (tel)
978.461.0430 (fax)
www.giganet.com
Email: sdicecco@giganet.com
       jimw@giganet.com

Bill Terrell
TROIKA Networks, Inc.
2829 Townsgate Road, Suite 200
Westlake Village, CA 91361
805.370.2612 (tel)
805.371.1344 (fax)
www.TroikaNetworks.com
Email: terrell@TroikaNetworks.com

John A. Scott
627 Davis Drive, Suite 200
Morrisville, NC 27560
919.993.5626 (tel)
919.993.5604 (fax)
www.netapp.com
Email: jscott@netapp.com

Costa Sapuntzakis
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134, USA
Phone: +1 408 525 5497
www.cisco.com
Email: csapuntz@cisco.com