INTERNET DRAFT                                                V. Kashyap
                                                                     IBM
Expiration Date: June 15, 2002                         December 15, 2001


            IP over InfiniBand(IPoIB): Advanced Capabilities

Status of this memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as ``work in progress''.

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of this
memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

InfiniBand Architecture(IBA) defines a high speed, channel based
interconnect between systems and devices. The InfiniBand architecture
provides multiple modes of transport services with differing
characteristics. IPoIB may be implemented over any or multiple such
modes to take advantage of the underlying mode's features. IPoIB by
default is implemented over the UD mode of IBA. This document describes
IPoIB over IBA's non-UD transport modes.

Kashyap                                                         [Page 1]

INTERNET-DRAFT              Advanced IPoIB             December 15, 2001

Table of Contents

1.0 IPoIB advanced capability links
2.0 Capability flags and interoperability
3.0 Advanced capability IPoIB implementation
    3.1 IPoIB-QPN
    3.2 IPoIB-RC/UC/RD
    3.3 IP encapsulation
4.0 Service ID
    4.1 Protocol specific subfield
    4.2 Service ID examples
5.0 Security Considerations
6.0 References
7.0 Author's address

1.0 IPoIB advanced capability links

IBA provides a broad set of capabilities to choose from when
implementing IP over the InfiniBand architecture. The datagram modes
are limited by the link MTU. In contrast, the connected modes can offer
significant benefits by offering a large MTU. Reliability is also
enhanced if the automatic path migration feature of the connected modes
is utilised. An implementation MAY choose to provide IP over non-UD
transport modes in addition to the mandatory IP over UD function.

The InfiniBand architecture specification requires Unreliable Datagram
mode to be supported by all IB nodes. Host channel adapters (HCAs) are
additionally required to support the Reliable Connected and Unreliable
Connected modes; target channel adapters (TCAs) are not. Support for the
two Raw Datagram modes is entirely optional.

For the sake of simplicity, ease of implementation and integration with
existing stacks, it is desirable that the fabric support multicasting.
This is possible only in the Unreliable Datagram (UD) mode and IB's Raw
Datagram modes.

Given these conditions IPoIB over IBA's UD mode is mandatory. IPoIB
over other modes of IB transport is optional. IPoIB implementations
with any combination of capabilities MUST interoperate seamlessly.

An interface MAY be associated with multiple QPNs. This provides a mode
of implementation wherein a single IP address is associated with
different QPNs.
Such an association may be used to demultiplex the incoming packets in
the HCA hardware, based on the QPN, avoiding or reducing the upper-layer
port based lookup. This amounts to there being multiple MAC addresses
associated with an endpoint. Any process for providing resolution and
support of multiple QPNs per IP address must provide for
interoperability with the default version of a single QPN per IPoIB
interface. This document presents such a process.

An IP implementation will therefore always support IPoIB over
unreliable datagram mode (IPoIB-UD). It may choose to support IPoIB
over the other modes. These modes will be referred to as IPoIB-RC
(reliable connected), IPoIB-UC (unreliable connected), and IPoIB-RD
(reliable datagram). An implementation may also map different upper
layer services to different QPNs while supporting IPoIB-UD. Such an
implementation will be referred to as IPoIB-QPN.

These capabilities are indicated to the other end by way of capability
flags included in the address resolution packets [IPoIB_ENCAP]. Note
that address resolution is always attempted over IPoIB-UD since that is
the only mode that allows multicasting and is guaranteed to be
supported by all IPoIB stacks.

2.0 Capability flags and interoperability

[ Note to WG: The exact format of the link-layer address will depend on
the IP encapsulation draft.]

The link-layer address is defined as follows:

+-------+-------+--------+-------+
|      LID      | Flags  |  QPN  |
+-------+-------+--------+-------+
|      QPN      |      GID       |
+-------+-------+--------+-------+
|              GID               |
+-------+-------+--------+-------+
|              GID               |
+-------+-------+--------+-------+
|              GID               |
+-------+-------+--------+-------+
|      GID      |
+-------+-------+

The flags are defined as follows:

+-----+-----+-----+-----+---+---+---+---+
| QPN | RC  | UC  | RD  | 0 | 0 | 0 | 0 |
+-----+-----+-----+-----+---+---+---+---+

QPN: Indicates IPoIB-QPN capability
RC : Indicates IPoIB-RC capability
UC : Indicates IPoIB-UC capability
RD : Indicates IPoIB-RD capability

All implementations of IPoIB MUST be able to interoperate. This is
achieved by requiring an IPoIB stack to set only those flags that
correspond to the capabilities it supports. A stack must ignore
capability flags it doesn't support. Thus if a capability is not
supported the rule is 'transmit as zero, ignore on receive'.

The address resolution process between the peers therefore informs them
of each other's support of IP over specific modes of transport.

The default implementation, i.e. IPoIB-UD only, ignores all the flags
it receives and sets none of them in the packets it sends. Its peers
therefore always communicate with it using IPoIB-UD.

3.0 Advanced capability IPoIB implementation

Address resolution (ARP/ND) is always implemented over the IPoIB-UD
mode. As noted earlier, the capability flags are exchanged between the
peers as part of address resolution. Whether to accept the capabilities
offered by the peer is an implementation decision. The method by which
these decisions are made or conveyed to the IPoIB stack is beyond the
scope of this document.

Once the IP address is resolved, the sender requests the advanced
capability QPN using the relevant InfiniBand connection or service ID
resolution request.
Once the relevant QPN is received, communication can continue over the
specified QPN. It is an implementation choice how these connections are
reflected to the user and how or whether they are reflected through a
separate IPoIB interface.

3.1 IPoIB-QPN

The IPoIB-QPN method of communication is implemented over the UD mode
of IB. In IPoIB-QPN the two peers may associate separate QPNs with
different services. The IB specification associates a 'serviceID' with
the relevant service. This service ID is used in the IB specific
Service ID resolution (SIDR) protocol. See the IB specification for
more details [IB_ARCH]. See section 4.0 for the ServiceIDs IPoIB will
use.

The IB service ID relevant to IPoIB is derived from the protocol and
the port or other information relevant to the protocol or the service.
Thus the IPoIB-QPN interface resolves the relevant service ID and then
uses the QPN so determined for all IP communication to the service. The
QPN returned by address resolution may or may not be used for a
particular service. The implementation must internally manage the
multiple QPNs that may refer to the same IP address (but different
upper layer protocol ports).

InfiniBand's SIDR protocol includes an application specific opaque
block of data called the 'private data' field. This field MUST include
the following information:

+--------+--------+--------+--------+--------+--------+--------+--------+
|  QPN received in ARP/ND  |  Pad   |    QPN sent in ARP/ND    |  Pad   |
+--------+--------+--------+--------+--------+--------+--------+--------+

The QPNs exchanged are the values that were received/sent during IP
address resolution. These values will allow the recipient to
disambiguate the IP interface for which the resolution is needed. The
pad bytes must be set to 0.

[ A note to WG.
One could use the private data field to exchange large blocks/default
QPN values but I don't think the gain is much.]

3.2 IPoIB-RC/UC/RD

Similarly, as a result of the address resolution the two peers let each
other know of their support of one or more of IPoIB-RC/UC/RD. The
corresponding connection at the IB layer MAY then be set up using the
communication establishment messages defined by IB. The method by which
a host decides it wants to use the IPoIB-RC/UC/RD services is beyond
the scope of this specification.

The serviceID used in these messages is derived from the protocol and
the port to be connected to, as defined in section 4.0.

Every communication establishment message includes an application
specific 'private data' field. The private data field in IPoIB-RC/UC/RD
is defined as follows:

+--------+--------+--------+--------+--------+--------+--------+--------+
|  QPN received in ARP/ND  |  Pad   |    QPN sent in ARP/ND    |  Pad   |
+--------+--------+--------+--------+--------+--------+--------+--------+
|  Maximum IP packet size(IP MTU)   |
+--------+--------+--------+--------+

The IB specification allows a message size of up to 2^31 bytes. However
IPoIB implementations may set lower limits. This limit is an
implementation choice based on the maximum packet the CA can accept and
other factors.

On receiving the IB connection request the peer may accept or reject
the request based on IB rules and other implementation choices. If the
connection is accepted the reply message (as per the IB protocol) will
include the MTU that the receiver prefers. It may be less than, equal
to, or greater than the MTU present in the request. The two peers
however MUST use the smaller of the two values as the MTU of the newly
formed IPoIB-RC/UC/RD link.

The remote peer may reject the connection due to InfiniBand related
errors.
If the connection is being rejected because the MTU is required to have
a certain value, InfiniBand's reject (REJ) message is sent. The REJ
message must include the MTU that would be acceptable and use the error
code 3 (No resources available). In contrast, if the requester doesn't
find the resultant MTU suitable it can disconnect the IB connection.

3.3 IP encapsulation

The IP encapsulation will be done as defined in the IPoIB encapsulation
standard [IPoIB_ENCAP].

Packets destined for any upper layer endpoint MAY be sent over any of
the IPoIB links (UD/RC/RD/UC/QPN). Any IP packet received over an
IPoIB-UD link MUST be processed correctly by the receiving IPoIB stack
even if there is an advanced capability link associated with that
particular packet.

IP multicast cannot be carried over the advanced modes, which
inherently do not support it.

All Address Resolution Protocol (ARP) packets MUST only be exchanged
over IPoIB-UD mode.

4.0 Service ID

The InfiniBand specification defines a block of service IDs for IETF
use. The InfiniBand specification has left the definition and
management of this block to the IETF. The 64-bit block is:

+--------+--------+--------+--------+--------+--------+--------+--------+
|00000001|<-------------------IETF use--------------------------------->|
+--------+--------+--------+--------+--------+--------+--------+--------+

The ServiceIDs used by IPoIB will use the following format:

+--------+--------+--------+--------+--------+--------+--------+--------+
|00000001|Reserved|Reserved|Protocol|Reserved|Reserved|Protocol subfield|
+--------+--------+--------+--------+--------+--------+--------+--------+

The Reserved fields MUST be transmitted as zeroes. They are ignored on
reception.

The Protocol field takes the values that may be included in the
'protocol' field of IPv4.
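
The ServiceID layout above can be sketched in code as follows. This is
an illustration only and not part of the specification: the function
names ipoib_service_id and parse_ipoib_service_id are hypothetical, and
the sketch assumes the eight bytes appear on the wire in the
left-to-right order of the diagram (network byte order).

```python
import struct

# First byte of the block reserved for IETF use, per the layout above.
IETF_SERVICE_ID_PREFIX = 0x01

# Assumed wire layout: prefix, 2 reserved bytes, protocol,
# 2 reserved bytes, 16-bit protocol subfield (network byte order).
_SERVICE_ID_FORMAT = "!B2xB2xH"


def ipoib_service_id(protocol, subfield=0):
    """Pack an IPoIB ServiceID into 8 bytes (hypothetical helper)."""
    if not 0 <= protocol <= 0xFF:
        raise ValueError("protocol must fit in 8 bits")
    if not 0 <= subfield <= 0xFFFF:
        raise ValueError("protocol subfield must fit in 16 bits")
    # The 'x' pad bytes are written as zero: reserved fields are
    # transmitted as zeroes.
    return struct.pack(_SERVICE_ID_FORMAT, IETF_SERVICE_ID_PREFIX,
                       protocol, subfield)


def parse_ipoib_service_id(service_id):
    """Return (protocol, subfield); reserved bytes are ignored."""
    prefix, protocol, subfield = struct.unpack(_SERVICE_ID_FORMAT,
                                               service_id)
    if prefix != IETF_SERVICE_ID_PREFIX:
        raise ValueError("not in the IETF ServiceID block")
    return protocol, subfield


if __name__ == "__main__":
    # IPv4 protocol 6 (TCP) with a 16-bit subfield of 40.
    print(ipoib_service_id(6, 40).hex(":"))
```

The pad bytes in the format string are skipped on unpacking, so a
receiver built this way does not inspect the Reserved fields.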

4.1 Protocol specific subfield

The 16-bit field carries any information that uniquely identifies the
protocol specific endpoint. For example, for TCP this field holds the
TCP port number. This document specifies the protocol subfield for
TCP/SCTP/UDP only. If any other value is found useful it will be
described in a separate document.

Rule 1: The 'protocol subfield' may be set to zero. This corresponds to
        the wild-card or 'all' case unless the relevant protocol
        description says otherwise.

Rule 2: For the protocols of TCP/SCTP/UDP the protocol subfield
        corresponds to the port number associated with the service.

4.2 Service ID examples

1) Service ID for TCP port 40

   The Protocol value for TCP is 6. Port 40 corresponds to 0x28. The
   service ID, written as hexadecimal bytes, will therefore be:

      1:0:0:6:0:0:0:28

2) Service ID for all UDP communication

   The Protocol value for UDP is 17 (0x11) and the protocol subfield is
   zero (Rule 1). The service ID will therefore be:

      1:0:0:11:0:0:0:0

5.0 Security Considerations

A node may be returned a false set of capability flags by an imposter.
This may cause unnecessary attempts and some delay/disruption in IPoIB
communication. An imposter can do more damage by replying with its own
or otherwise wrong LID/GID. It is the responsibility of the higher
layers or applications to implement suitable counter-measures if this
is a problem.

6.0 References

[IB_ARCH]      InfiniBand Architecture Specification

[IPoIB_ARCH]   IPoIB Architecture

[IPoIB_ENCAP]  IP encapsulation over InfiniBand

7.0 Author's Address

Vivek Kashyap
15450, SW Koll Parkway
Beaverton, OR 97006

Phone: +1 503 578 3422
Email: vivk@us.ibm.com

Full Copyright Statement

Copyright (C) The Internet Society (2001). All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it or
assist in its implementation may be prepared, copied, published and
distributed, in whole or in part, without restriction of any kind,
provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing the
copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights defined
in the Internet Standards process must be followed, or as required to
translate it into languages other than English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT
NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL
NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR
FITNESS FOR A PARTICULAR PURPOSE.