INTERNET DRAFT                                                V. Kashyap
                                                                     IBM
Expiration Date: September 26, 2001                       March 26, 2001


      IP over InfiniBand (IPoIB) Overview, Issues and Requirements


Status of this memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as ``work in
   progress''.

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This memo provides information for the Internet community.  This
   memo does not specify an Internet standard of any kind.
   Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

Abstract

   InfiniBand architecture (IBA) specifies a high speed, channel based,
   switched fabric architecture to deliver scalable performance in data
   centers.  This memo, in a bid to facilitate discussion and achieve
   consensus on key issues, provides an overview of InfiniBand
   architecture and the requirements for a possible working group on IP
   over InfiniBand.

Kashyap                                                         [Page 1]

INTERNET-DRAFT    IPoIB overview, issues, requirements    March 26, 2001

Table of Contents

   1.0 Introduction to InfiniBand
      1.1 InfiniBand Architecture Specification
      1.2 InfiniBand Overview
   2.0 IPoIB Requirements
   3.0 IPoIB Issues
      3.1 InfiniBand as datalink
      3.2 Which transport to use?
      3.3 IP Encapsulation
      3.4 IP Multicast
      3.5 Path MTU
      3.6 Address Resolution
         3.6.1 Port Identifier
         3.6.2 Auxiliary Data
      3.7 Impact on host implementations
   4.0 Security Considerations
   5.0 References
   6.0 Author's Address
   7.0 Full Copyright Notice

1.0 Introduction to InfiniBand

   To meet the growing requirements of scalability, reliability,
   availability and performance of servers in data centers, a balanced
   system architecture with equally good performance in the memory,
   processor, and input/output (I/O) subsystems is required.  With this
   in view, an independent industry body called the InfiniBand Trade
   Association was formed to develop a new common I/O specification to
   deliver a channel based, switched fabric technology that the entire
   industry can adopt.

1.1 InfiniBand Architecture Specification

   The InfiniBand Trade Association specification, version 1.0, is
   available for download from http://www.infinibandta.org.

1.2 Technical overview

   For a more complete overview the reader is referred to chapter 3 of
   the InfiniBand specification.

   InfiniBand Architecture (IBA) defines a System Area Network (SAN)
   for connecting multiple independent processor platforms, I/O
   platforms and I/O devices.  The IBA SAN is a communications and
   management infrastructure supporting both I/O and inter-processor
   communications for one or more computer systems.

   An IBA SAN consists of processor nodes and I/O units connected
   through an IBA fabric made up of cascaded switches and IB routers
   (connecting IB subnets).  I/O units can range in complexity from
   single ASIC IBA attached devices such as a LAN adapter to a large
   memory rich RAID subsystem.

   An IBA network is subdivided into subnets interconnected by IB
   routers.  These are IB routers and IB subnets and not IP routers or
   IP subnets.

   Each IB node or switch may attach to one or more switches, or the
   nodes may attach directly to each other.
   Each node interfaces with the link by way of channel adapters (CAs).
   The architecture supports multiple CAs per unit with each CA
   providing one or more ports that connect to the fabric.  Each CA
   appears as a node to the fabric.

   The ports are the endpoints to which the data is sent.  However,
   each of the ports may include multiple QPs (queue pairs) that may be
   directly addressed from a remote peer.  From the point of view of
   data transfer the QP number (QPN) is part of the address.

   IBA supports both connection oriented and datagram service between
   the ports.  The peers are identified by QPN and the port identifier.
   In raw datagram mode the QPN is not used.

   A port may be identified by a local ID (LID) and optionally a Global
   ID (GID).  The GID is 128 bits long and is formed by the
   concatenation of a 64 bit subnet prefix and a 64 bit EUI-64
   compliant portion (GUID).  The LID is a 16 bit value that is
   assigned when the port becomes active.  Note that the GUID is the
   only persistent identifier of a port.  However, it cannot be used as
   an address in a packet.  If the prefix is modified then the GID may
   change.  The subnet manager (SM) may attempt to keep the LID values
   constant across shutdowns but that is not a requirement.

   The assignment of the GID and the LID is done by the subnet manager.
   Every IB subnet has at least one subnet manager component that
   controls the fabric.  It assigns the LIDs and GIDs and programs the
   switches so that they route packets between destinations.  The
   subnet manager and a related component, the subnet administrator
   (SA), are the central repository of all information that is required
   to set up and bring up the fabric.

   IB routers are components that route packets between IB subnets
   based on the GIDs.  Thus within an IB subnet a packet may or may not
   include a GID, but when crossing between IB subnets the GID must be
   included.
   A LID is always needed in a packet since the destination within a
   subnet is determined by it.

   A CA and a switch may have multiple ports.  Each CA port is assigned
   its own LID or a range of LIDs.  The ports of a switch are not
   addressable by LIDs/GIDs or, in other words, are transparent to
   other end nodes.

   Each port has its own set of buffers.  The buffering is channeled
   through virtual lanes (VLs) where each VL has its own flow control.
   There may be up to 16 VLs.  VLs provide a mechanism for creating
   multiple virtual links within a single physical link.  All ports,
   however, must support VL15, which is reserved exclusively for subnet
   management datagrams and hence doesn't concern the IPoIB
   discussions.  The actual VL that a port uses is configured by the SM
   and is based on the Service Level (SL) specified in every packet.
   There are 16 possible SLs.

   In addition to the features described above, viz. Queue Pairs (QPs),
   Service Levels (SLs) and addressing (GID/LID), IBA also defines the
   following:

   P_Keys or partition keys:

      Every packet, but for the raw datagrams, carries the partition
      key (P_Key).  These values are used for isolation in the fabric.
      A switch (this is an optional feature) may be programmed by the
      SM to drop packets not having a certain key.  The same is the
      case with the receiving CA.

   Q_Keys:

      These are used to enforce access rights for reliable and
      unreliable IB datagram services.  Raw datagram services don't
      require this value.  At communication establishment the endpoints
      exchange the Q_Keys and must always use the relevant Q_Keys when
      communicating with one another.

   Multicast support:

      A switch may support multicasting, i.e. replication of packets
      across multiple output ports.  This is an optional feature at the
      switches.  A multicast group is identified by a GID.  The GID
      format is as defined in RFC 2373 on IPv6 addressing.
      Thus from the point of view of IPv6 over IB the data link
      multicast address looks like the network address.  An IB node
      must explicitly join a multicast group by a request to the SM to
      receive packets.  A node may send packets to any multicast group.
      In both cases the multicast LID to be used in the packets is
      received from the SM.

   There are 6 transport types specified by the IB architecture.  These
   are:

   1. Unreliable Datagram (unacknowledged - connectionless)

      The UD service is connectionless and unacknowledged.  It allows
      the QP to communicate with any unreliable datagram QP on any
      node.

      The switches and hence each link can support only a certain MTU.
      The supported MTU values are 256, 512, 1024, 2048 and 4096 bytes.
      A UD packet cannot be larger than the smallest link MTU between
      the two peers.

   2. Reliable Datagram (acknowledged - multiplexed)

      The RD service is multiplexed over connections between nodes
      called End to End Contexts (EECs), which allows each RD QP to
      communicate with any RD QP on any node with an established EEC.
      Multiple QPs can use the same EEC and a single QP can use
      multiple EECs (one for each remote node per reliable datagram
      domain).

   3. Reliable Connected (acknowledged - connection oriented)

      The RC service associates a local QP with one and only one remote
      QP.  The message sizes may be as large as 2^31 bytes in length.
      The CA implementation takes care of segmentation and assembly.

   4. Unreliable Connected (unacknowledged - connection oriented)

      The UC service associates one local QP with one and only one
      remote QP.  There is no acknowledgment and hence no resend of
      lost or corrupted packets.  Such packets are therefore simply
      dropped.  It is similar to RC otherwise.

   5. Raw Ethertype (unacknowledged - connectionless)

      The Ethertype raw datagram packet contains a generic transport
      header that is not interpreted by the CA but it specifies the
      protocol type.
      The values for ethertype are the same as defined in RFC 1700 for
      ethertype.

   6. Raw IPv6 (unacknowledged - connectionless)

      Using IPv6 raw datagram service, the IBA CA can support standard
      protocol layers atop IPv6 (such as TCP/UDP).  Thus native IPv6
      packets can be bridged into the IBA SAN and delivered directly to
      a port and to its IPv6 raw datagram QP.

   The first 4 are referred to as IB transports.  The latter two are
   classified as Raw datagrams.  There is no indication of the QP
   number in the raw datagram packets.  The raw datagram packets are
   limited by the link MTU in size.

2.0 IP over IB requirements

   1. IP packet encapsulation in InfiniBand packets

      a) IPv4 Encapsulation
         i)  Unicast
         ii) Multicast
      b) IPv6 Encapsulation
         i)  Unicast
         ii) Multicast

   2. Address Resolution

      i)  IPv4 Address Resolution (IBARP)
      ii) IPv6 Neighbour Discovery

   3. 'Home' for IB features

      IB is a feature rich fabric that facilitates multiple
      optimisations in the endhosts.  It is desirable that the IP over
      IB specification is able to leverage these features without
      affecting the standard IP paradigm.

      i)   Link type (which transport to use?)
      ii)  Partitioning
      iii) QP Multiplexing
      iv)  Q_Keys/EE Context

3.0 IPoIB Issues

   This section discusses some of the key issues that have to be
   considered in IPoIB specifications.

3.1 InfiniBand as datalink

   For the IP layer IBA is the datalink.  The IB interface, the logical
   abstraction in an endhost with which the IP address will be
   associated, may be considered as having the following
   characteristics:

   1. Port identifier: LID/GID/GUID
   2. Auxiliary data:  QPN, P_Key, Q_Key/EEC, IB Path MTU, SL
   3. Characteristics: Multicast

   Transport     | Service | Link     | P_Key | QPN | EE context | IB
                 | Level   | Path MTU |       |     | / Q_Key    | Multicast
                 |         | Limit    |       |     |            |
   --------------+---------+----------+-------+-----+------------+----------
   Raw Ethertype | Yes     | Yes      | No    | No  | No         | Y
   --------------+---------+----------+-------+-----+------------+----------
   Raw IPv6      | Yes     | Yes      | No    | No  | No         | Y
   --------------+---------+----------+-------+-----+------------+----------
   Unreliable    | Yes     | Yes      | Yes   | Yes | Q_Key      | Y
   Datagram      |         |          |       |     |            |
   --------------+---------+----------+-------+-----+------------+----------
   Reliable      | Yes     | 2^31     | Yes   | Yes | EEC        | N
   Datagram      |         | bytes    |       |     |            |
   --------------+---------+----------+-------+-----+------------+----------
   Unreliable    | Yes     | 2^31     | Yes   | Yes | No         | N
   Connected     |         | bytes    |       |     |            |
   --------------+---------+----------+-------+-----+------------+----------
   Reliable      | Yes     | 2^31     | Yes   | Yes | No         | N
   Connected     |         | bytes    |       |     |            |
   --------------+---------+----------+-------+-----+------------+----------

                                  Table 1

   As Table 1 shows, each of the transport services offered on
   InfiniBand fabrics needs a different, and often large, set of
   auxiliary data.

3.2 Which transport to use?

   IB provides multiple transport methods, as described above, each
   with a varying degree of complexity.  It might be desirable to
   implement IPoIB over more than one, or maybe all, of the transports.
   Each of the transports provides its own advantages.

   Raw Datagrams:

      Raw Datagrams, able to work directly with the IP packets either
      by emulating Ethernet datagrams or by handling IPv6 packets
      directly, are an easy match.  However, the Raw datagram modes do
      not allow multiple QPs per port, which could be used for load
      balancing and faster endpoint lookups in end hosts.  There is
      only one QP per port per Raw datagram type.

      Additionally, Raw Datagrams lack the support for partition keys.
      These are deemed highly desirable in a data center SAN to provide
      isolation.  The Raw datagrams include only a 16 bit CRC and not a
      32 bit CRC.  On the positive side, other than being able to
      directly handle the IPv4/IPv6 packets, Raw datagram use would
      allow co-location of IB and IP routers since the format of the
      layer 3 addresses (IB layer 3 and IP layer 3) is the same.

   IB transports:

      All the IB transports provide partitioning, multiple QPs at a
      port, and 32-bit CRC protection of data.  All IB transports
      require a connection setup using IB packets.  In the case of
      Unreliable Datagrams the setup consists of determining the Q_Keys
      and QPs to be used.

   Unreliable Datagram:

      Unreliable Datagram provides, as do all IB transports, for
      multiple QPs and partitioning.  However, the presence of these
      features makes the task of address resolution that much more
      complicated.  Note that multicast and non-multicast QPs may be
      different if so desired.  This mode provides the desired IB
      features (QPs, P_Keys) along with multicasting support.

   Unreliable Connected:

      Unreliable Connected does not support multicast packets.  Since
      it is a connected service, a QP is tied to only one QP at the
      other end.  The message size, and hence the interface MTU between
      the connections, may be as large as 2^31 bytes.  The
      fragmentation and assembly is handled by the channel adapters.
      The service provides a remote DMA feature which could be
      leveraged, with corresponding sockets extensions, to provide
      zero-copy data transfers between user buffers.

   Reliable Datagram:

      This service provides reliable communication in a one to many
      paradigm.  A requestor may send sequential messages to different
      responders at different QPs.  There can only be one in-flight
      message between any two endpoints in Reliable Datagram mode.

   Reliable Connected:

      This service provides the same features as unreliable connected
      except for the additional feature of reliability.
      It provides for remote DMA to and from the remote peer's user
      space.  A suitable extension to the sockets API could be used to
      leverage this feature.

3.2 IP encapsulation

   A way must be specified to indicate that the encapsulated packet is
   an IPv4/IPv6 packet.  This could be accomplished by associating a QP
   number with the IP protocol traffic.  The QP association might be
   static (well known) or determined dynamically.  An alternative is to
   specify a field, analogous to 'ether type', to be added to every
   packet.

3.3 Multicasting

   Multicasting is an optional feature in InfiniBand fabrics.  Lack of
   multicast capable switches however doesn't mean that the endnodes
   cannot form a multicast tree from sender to the receivers.

   Issue: Describe IPoIB only on IBA fabrics that support multicast.

   Issue: Define IPv4/IPv6 mapping to IB multicast groups.

      It is desirable and straightforward to derive IB group GIDs from
      the IPv4/IPv6 addresses.  The alternative is to query the SM for
      the group IDs.

3.3.1 IB multicast GID

   The IB multicast GID is of the form FFxy::, exactly as is the case
   for IPv6 multicast addresses.  'x' indicates whether the GID is
   transient or permanent and 'y' has the same values as the scope
   nibble of an IPv6 multicast address.  The scope of an InfiniBand GID
   is defined with respect to the IB subnet.  Thus the global multicast
   GID (FF0E::, for example) signifies a GID that spans multiple
   contiguous IB subnets.  A local GID is limited to the IB subnet.

   Issue: Multiple IP subnets in an IB subnet require
   multicast/broadcast isolation.

3.4 IP subnet to IB subnet relationship

   As noted earlier IB fabrics may be joined together by IB routers to
   create a multi-subnet fabric.  This allows for better isolation and
   controlled growth.  The IB GIDs route on 64 bit prefixes.

   Issue: Does IPoIB specify multiple IP subnets in IB subnets?

   Issue: Does IPoIB include the case of an IP subnet crossing multiple
   IB subnets?

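   The FFxy:: layout described in 3.3.1 can be made concrete with a
   short sketch.  The Python below is illustrative only and is not part
   of any IPoIB specification: the names MulticastGid and
   parse_multicast_gid are hypothetical, and the reading of the low bit
   of 'x' as the transient flag, and of the scope values 0x2
   (link-local) and 0xE (global), is assumed from the IPv6 multicast
   address format of RFC 2373 that the GID format follows.

```python
import ipaddress
from dataclasses import dataclass

# Scope nibble values assumed from the IPv6 multicast format (RFC 2373).
SCOPE_LINK_LOCAL = 0x2   # taken here to mean "limited to the IB subnet"
SCOPE_GLOBAL = 0xE       # may span multiple contiguous IB subnets


@dataclass(frozen=True)
class MulticastGid:
    """Decoded 'x' and 'y' nibbles of an FFxy:: multicast GID (hypothetical type)."""
    transient: bool  # low bit of the flags nibble 'x' (assumption: 1 = transient)
    scope: int       # scope nibble 'y'


def parse_multicast_gid(text: str) -> MulticastGid:
    """Split a 128-bit multicast GID, written in IPv6 notation, into flags and scope."""
    raw = ipaddress.IPv6Address(text).packed  # 16 bytes, network byte order
    if raw[0] != 0xFF:
        raise ValueError(f"{text} is not a multicast GID (first byte is not FF)")
    flags = raw[1] >> 4      # 'x'
    scope = raw[1] & 0x0F    # 'y'
    return MulticastGid(transient=bool(flags & 0x1), scope=scope)


if __name__ == "__main__":
    for gid in ("ff0e::", "ff12::1"):
        print(gid, parse_multicast_gid(gid))
```

   Under these assumptions FF0E:: decodes as a permanent GID with
   global scope, while FF12::1 decodes as a transient GID confined to
   the local IB subnet.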
3.5 Path MTU

   In the case of some of the IB transports the segmentation and
   reassembly of packets is done by the firmware in the CAs.  However,
   in the case of unreliable datagrams and raw datagrams the link MTU
   is one of the following: 256, 512, 1024, 2048 or 4096 bytes.  For
   interoperability, a common MTU would be preferable.

   Issues:

      Does IPoIB specify per-IB path MTU determination?
      Does IPoIB specify a subnet wide MTU?
      Does IPoIB specify a mandatory minimum MTU for the subnet?
      How is the MTU determined by the IP nodes?

3.6 Address Resolution

   Address resolution determines the target link address given the
   target IP address.  An IB link address can be taken to be a
   combination of:

   1) Port Identifier
   2) Auxiliary Data

3.6.1 Port identifier

   IBA defines multiple port identifiers as noted above, viz. LID, GID
   and the GUID.  Of these, LID and GID are usable as destination
   addresses.  Of the two, LID is always needed and GID is needed for
   cross IB subnet communication.  Additionally, a GID is persistent
   but a LID could change after every shutdown and reboot of an end
   host.  The current value of the LID must therefore always be
   ascertained before any communication.

   Issue: What is the port identifier to be used?

      LID or
      LID + GUID or
      LID + GID index 0

      GID index 0 is the default GID that the subnet manager defines
      for every port.  A port could have multiple GIDs.

3.6.2 Auxiliary data

   Every IB packet requires the SL (service level).  Every endpoint
   must also know the IB link path MTU to be able to send the right
   size of packets across.  Depending on the transport mode other data
   is required such as QP number, Q_Key, P_Key, EE context etc.

   Issue: How is this data obtained?  Is it better to obtain all the
   information in one go via the address resolution protocol or is it
   preferable to ask the SM for the details?

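   To make the question concrete, the per-destination state that
   address resolution (or a query to the SM) would have to produce can
   be sketched as a record.  This is a hedged illustration, not a
   proposed format: the names IbLinkAddress, REQUIRED_AUX, missing_aux
   and ud_path_mtu are hypothetical, the per-transport requirements are
   read off Table 1, and the path MTU rule is the "smallest link MTU
   between the two peers" rule stated for unreliable datagrams.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Link MTU values listed in this memo.
IB_LINK_MTUS = (256, 512, 1024, 2048, 4096)

# Auxiliary data needed per transport, as read from Table 1 (SL and the
# path MTU are needed in every case, so they are mandatory fields below).
REQUIRED_AUX = {
    "raw": (),
    "UD": ("p_key", "qpn", "q_key"),
    "RD": ("p_key", "qpn", "eec"),
    "UC": ("p_key", "qpn"),
    "RC": ("p_key", "qpn"),
}


def ud_path_mtu(link_mtus: Iterable[int]) -> int:
    """Largest UD/raw packet a path can carry: the smallest link MTU on it."""
    mtus = list(link_mtus)
    if not mtus:
        raise ValueError("a path needs at least one link")
    bad = [m for m in mtus if m not in IB_LINK_MTUS]
    if bad:
        raise ValueError(f"not an IB link MTU: {bad}")
    return min(mtus)


@dataclass
class IbLinkAddress:
    """Port identifier plus auxiliary data for one destination (hypothetical)."""
    lid: int                     # always needed; 16 bit value
    sl: int                      # service level, one of 16 possible values
    path_mtu: int                # IB path MTU towards the destination
    gid: Optional[bytes] = None  # 128 bit GID, needed across IB subnets
    qpn: Optional[int] = None
    p_key: Optional[int] = None
    q_key: Optional[int] = None
    eec: Optional[int] = None

    def __post_init__(self) -> None:
        if not 0 <= self.lid <= 0xFFFF:
            raise ValueError("LID must fit in 16 bits")
        if not 0 <= self.sl <= 15:
            raise ValueError("SL must be in the range 0..15")
        if self.gid is not None and len(self.gid) != 16:
            raise ValueError("GID must be 128 bits long")


def missing_aux(addr: IbLinkAddress, transport: str) -> List[str]:
    """Names of the auxiliary fields still unknown for the given transport."""
    return [name for name in REQUIRED_AUX[transport] if getattr(addr, name) is None]


if __name__ == "__main__":
    addr = IbLinkAddress(lid=0x12, sl=0, path_mtu=ud_path_mtu([2048, 1024, 4096]))
    print(addr.path_mtu, missing_aux(addr, "UD"))
```

   A record of this kind shows why the choice of transport matters for
   address resolution: a raw datagram peer is fully described by the
   LID, SL and path MTU, whereas an unreliable datagram peer also needs
   the P_Key, QPN and Q_Key before the first packet can be sent.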
   Issue: Load balancing (internal for lookups or otherwise) could tie
   a protocol/port binding to a particular QP.  How is this information
   transmitted to the peers?

3.7 Impact on host implementations

   It is a requirement that the IPoIB modifications must be of a nature
   that does not require changes in IP and higher layer protocols.  Nor
   should it mandate requirements on IP stacks to implement special
   user level programs.

   It is an aim that the IPoIB changes be amenable to modularisation
   and incorporation into existing implementations at the same level as
   other media types.

4.0 Security Considerations

   There are no such issues in the context of this draft.

5.0 References

   InfiniBand Architecture Specification, Volume 1.0

6.0 Author's Address

   Vivek Kashyap
   IBM
   15450, SW Koll Parkway
   Beaverton, OR 97006

   Work: 503 578 3422
   Email: vivk@us.ibm.com

7.0 Full Copyright Statement

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.