INTERNET-DRAFT                                            H.K. Jerry Chu
                                                        Sun Microsystems
Expires: January, 2002                                         July 2001


         Transmission of IPv6 Packets over InfiniBand Networks


Status of this memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This document outlines all the necessary ingredients for running an
   IPv6 network on top of an InfiniBand (IB) network. It first describes
   how IP link segments can be constructed out of IB subnets and
   partitions, then specifies the method of forming IPv6 statelessly
   autoconfigured addresses, link-local addresses, and the link-layer
   address format used in the Source/Target Link-layer Address option.
   It then describes how an IPv6 packet can be encapsulated in an IB
   packet, both unicast and multicast.

1. Introduction

   InfiniBand (IB) defines four layers of network services, from the
   physical layer through the transport layer. IB unreliable datagram
   transport best matches the IP datagram paradigm, and is a natural
   choice for carrying IP datagrams through an IB network.

Jerry Chu                                                       [Page 1]

draft-hkchu-ipoib-ipv6oib-00.txt                               July 2001

   An IB unreliable datagram packet contains the following headers:

   o Local Route Header (LRH) - provides routing information for IB
     switches to relay packets within an IB subnet.
   o Global Route Header (GRH) - provides routing information for IB
     routers to relay packets between IB subnets inside an IB fabric.
   o Base Transport Header (BTH) - provides various information for the
     IB transport services, including the partition key (P_Key) and the
     destination queue pair number (QPN).
   o Datagram Extended Header (DETH) - provides additional IB specific
     information unique to datagram services.

   From the perspective of IP over IB encapsulation, all of IB's link,
   network, and transport layers are collapsed together and act as a
   single link layer to the IP stack.

   IPv6 architecture is designed to be easily adapted over a wide range
   of communication links [DISC]. Once the following link-specific
   entities are defined, the IPv6 protocol suite should run without any
   changes. The four entities that need to be defined are

   o link MTU
   o interface identifier
   o link layer address
   o IPv6 multicast to IB multicast mapping

   The rest of this document specifies the four entities for supporting
   IPv6 over IB. For more information on InfiniBand, readers are
   encouraged to review the specifications published by the InfiniBand
   Trade Association at www.infinibandta.org.

2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3. What Constitutes an IP Link in InfiniBand?

   A link segment on top of which an IP subnet can be configured is
   defined in [IPV6] as a communication facility or medium over which
   nodes can communicate at the "link" layer. For most types of
   communication media, the boundary between different data links is
   obvious. An IB fabric is itself a network providing full connectivity
   for all the nodes inside the IB cloud. In theory the boundary of an
   IP link can be drawn arbitrarily without regard to the boundary of IB
   subnets.
   In reality, in order to simplify the mapping of IP networks to IB
   networks, and to better take advantage of IB link layer facilities
   such as link multicast, some restriction should be imposed. The rest
   of this document confines an IP link to within a single IB subnet.

   Partitioning in IB offers an isolation mechanism among systems
   sharing an IB fabric, much like VLANs in an Ethernet network. Each
   port of an endnode contains a partition key (P_Key) table of all the
   valid P_Keys the port is allowed to use. The P_Key table is set up by
   the Subnet Manager (SM) of the local IB subnet. Each queue pair (QP)
   is programmed with a P_Key from the local P_Key table. This P_Key is
   carried in the BTH of all the outgoing packets from the QP, and is
   compared against the P_Key in the BTH of all the incoming packets to
   the QP. Reception of an invalid P_Key causes the packet to be
   discarded. IB switches and routers may optionally enforce partition
   checking too.

   Since all the ports attached to the same IP link must share a common
   P_Key in order to communicate successfully with one another, IB
   partitioning becomes a logical mechanism for constructing IP links.
   Within each IB subnet, system administrators can build "virtual"
   links through the SM by allocating and assigning a unique P_Key to an
   arbitrary set of nodes forming an IB partition. This is similar to
   the construction of VLANs in Ethernet. Each virtual link in an IB
   fabric is identified by a unique P_Key and a subnet number.

   Besides P_Key, the link MTU and Q_Key are two other per-link
   attributes, and will be discussed in the following sections.

4. Maximum Transmission Unit

   Once an IP link boundary is drawn, the next step is to determine a
   link MTU for it. IB defines five permissible maximum payload sizes.
   They are 256, 512, 1024, 2048 and 4096 bytes. [IPV6] requires a link
   MTU of 1280 bytes or greater. This leaves only 2048 and 4096 bytes as
   two acceptable choices.
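
   The selection just described can be checked mechanically. The
   following Python fragment is a minimal sketch, not part of any IB or
   IPv6 API: the names IB_PAYLOAD_SIZES, IPV6_MIN_LINK_MTU and
   acceptable_link_mtus are chosen here purely for illustration.

```python
# Illustrative sketch only; every name below is hypothetical.
IB_PAYLOAD_SIZES = (256, 512, 1024, 2048, 4096)  # bytes, as listed in section 4
IPV6_MIN_LINK_MTU = 1280                         # bytes, required by [IPV6]


def acceptable_link_mtus(sizes=IB_PAYLOAD_SIZES, minimum=IPV6_MIN_LINK_MTU):
    """Return the IB payload sizes usable directly as an IPv6 link MTU."""
    return [size for size in sizes if size >= minimum]


print(acceptable_link_mtus())  # prints [2048, 4096]
```

   The filter simply keeps every payload size that meets the IPv6
   minimum, which is why only the two largest sizes survive.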
   Channel adaptors (CAs) supporting a maximum payload size less than
   2048 bytes can still expose an acceptable link MTU size to IPv6
   through an adaptation layer that transparently fragments messages
   into smaller packets, and reassembles them on the receiving end.

   It is up to the implementation to decide which link MTU size to use,
   either 2048 or 4096 bytes. A larger link MTU can potentially offer
   better throughput performance. The caveat is that once the size of
   the link MTU for a given link is chosen, nodes with a smaller MTU
   won't be able to join the link without requiring all other nodes
   attached to the same link to reconfigure their MTU size.

   IPv6 nodes attached to the same link SHOULD use the link MTU recorded
   in the IB all-node multicast group described in section 10. They MUST
   accept a smaller MTU if one is advertised through the link MTU option
   of a router advertisement [DISC].

5. IPv6 Link Q_Keys

   Before running IPv6 on a link, a controlled Q_Key in the following
   format must be generated. This per-link Q_Key will be used when
   creating an IB multicast group, as described in section 10. All QPs
   configured for use on a given IPv6 link must be assigned the same
   per-link Q_Key.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1|0|0|1|1|0|1|1|0|0|0|1|1|0|1|1|     16-bit random number      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The upper 16 bits of an IPv6 Q_Key contain the signature 0x9B1B to
   help identify an IB multicast group created specifically for IPv6
   use. Controlled Q_Keys are used to prevent IPv6 QPs from receiving
   bogus IPv6 packets fabricated by non-privileged software. The
   generation of the 16-bit random number is implementation dependent.

6. Interface Identifier

   IPv6 stateless autoconfiguration [ACONF] requires an Interface
   Identifier of 64 bits that is (at least) unique per link. IB requires
   each port on a CA or router to be assigned an EUI-64 [EUI64] Globally
   Unique Identifier (GUID) by the manufacturer. This port GUID is a
   natural candidate for the Interface Identifier.

   The Interface Identifier is then formed from the EUI-64 by
   complementing the "Universal/Local" (U/L) bit, which is the
   next-to-lowest order bit of the first octet of the EUI-64. [AARCH]
   gives the rationale behind inverting the "u" bit when forming the
   Interface Identifier.

7. Link-Local Addresses

   The IPv6 link-local address [AARCH] for an IB CA port is formed by
   appending the Interface Identifier, as defined above, to the prefix
   FE80::/64.

     10 bits            54 bits                  64 bits
   +----------+-----------------------+----------------------------+
   |1111111010|        (zeros)        |    Interface Identifier    |
   +----------+-----------------------+----------------------------+

8. Address Mapping -- Unicast

   IPv6 defines a generic procedure [DISC] for mapping IPv6 unicast
   addresses into link-layer addresses. The procedure works for any link
   type as long as a link-layer address format is defined.

   The format of the link-layer address for IPv6 over IB is shown in the
   following diagram. All fields are in network byte order.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Reserved8   |                      QPN                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          GUID[63-32]                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          GUID[31-0]                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      QPN    24-bit Queue Pair Number of the QP for receiving IPv6
             traffic

      GUID   An EUI-64 compliant, Globally Unique Identifier for the CA
             port

   The QPN and port GUID constitute the minimal set of information that
   is required to uniquely identify a communication endpoint in an IB
   fabric. Note that to actually send an IB unreliable datagram out,
   many other pieces of information are needed. But they can all be
   derived from the port GUID through queries to the SM. Therefore they
   are not included in the link-layer address format. The resolution of
   this information is implementation dependent and does not affect
   interoperability. Therefore it is out of the scope of this document.

   The Source/Target Link-layer Address option used in [DISC] has the
   following format when the link layer is IB.

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Type      |    Length     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Reserved8   |  QPN[23-16]   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           QPN[15-0]           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          GUID[63-48]          |
   +-                             -+
   |          GUID[47-32]          |
   +-                             -+
   |          GUID[31-16]          |
   +-                             -+
   |          GUID[15-0]           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Unused             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Option fields:

      Type     1 for Source Link-layer address.
               2 for Target Link-layer address.

      Length   2 (in units of 8 octets).

9. Address Mapping -- Multicast

   IB defines two layers of multicast addressing. Its network layer uses
   multicast GIDs (MGIDs), which closely resemble IPv6 multicast
   addresses [AARCH], both in syntax and semantics.
   The IB link-layer defines multicast LIDs (MLIDs), which are used by
   IB switches to program their multicast forwarding tables. LIDs are
   16-bit identifiers serving as IB's own link-layer addresses. The
   range from 0xC000 to 0xFFFE is reserved for MLIDs (approximately
   16k). An IB switch may support far fewer MLIDs in its forwarding
   table though.

   Every IB multicast packet is required to carry both LRH and GRH.
   Therefore a valid MGID and a valid MLID are both needed when sending
   an IB multicast packet.

   Before an MGID can be used in an IB subnet, either as the destination
   address of a multicast packet or to represent a multicast group that
   a local IB node can join, an IB multicast group must first be created
   on that IB subnet. This is done through an explicit call to the SM.
   Besides the MGID, the caller must supply a per-IPv6-link Q_Key, as
   described in section 5 above, the link MTU, and the link P_Key as
   input parameters. In return, the SM will allocate an MLID to be used
   for the multicast group in the local IB subnet.

   Note that the allocation of MLIDs for MGIDs should be done
   efficiently in order to minimize MLID collision due to the limited
   number of usable MLIDs in an IB subnet. The allocation algorithm is
   implementation dependent, and is therefore outside the scope of this
   document.

   When a node wishes to join an IPv6 multicast group on a local link,
   the corresponding IPv6 multicast address is used as an MGID to look
   up a local IB multicast group from the SM with a matching MGID,
   Q_Key, and P_Key of the local link. If no matching group is found,
   one must be created with the above parameters. Once the right IB
   multicast group is identified on the local link, the node can then
   call the SM to join the group. The join call enables the SM to
   program local IB switches and routers with the new multicast
   information.
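
   The lookup, create, and join sequence described above can be sketched
   as a toy model. This document does not specify the SM interface, so
   the FakeSubnetManager class, its methods, the Group record, the
   join_ipv6_group helper, and the sample P_Key and LID values below are
   all hypothetical stand-ins; the sequential MLID allocation is likewise
   only a placeholder for an implementation-dependent policy.

```python
# Toy model of the lookup / create / join sequence. FakeSubnetManager is a
# hypothetical in-memory stand-in, not a real Subnet Manager interface.
import ipaddress
from dataclasses import dataclass, field

MLID_FIRST, MLID_LAST = 0xC000, 0xFFFE  # MLID range given in section 9


@dataclass
class Group:
    mgid: bytes
    q_key: int
    p_key: int
    mtu: int
    mlid: int
    members: set = field(default_factory=set)


class FakeSubnetManager:
    def __init__(self):
        self.groups = {}             # (mgid, q_key, p_key) -> Group
        self.next_mlid = MLID_FIRST

    def find_group(self, mgid, q_key, p_key):
        return self.groups.get((mgid, q_key, p_key))

    def create_group(self, mgid, q_key, p_key, mtu):
        if self.next_mlid > MLID_LAST:
            raise RuntimeError("no MLID left in this subnet")
        group = Group(mgid, q_key, p_key, mtu, self.next_mlid)
        self.next_mlid += 1          # naive allocation, for illustration only
        self.groups[(mgid, q_key, p_key)] = group
        return group

    def join(self, group, lid):
        group.members.add(lid)       # models adding the caller's LID for the MLID
        return group.mlid


def join_ipv6_group(sm, ipv6_addr, q_key, p_key, mtu, lid):
    """Use the IPv6 multicast address as the MGID; create the group if absent."""
    addr = ipaddress.IPv6Address(ipv6_addr)
    if not addr.is_multicast:
        raise ValueError("not an IPv6 multicast address")
    mgid = addr.packed               # 16 bytes, used unchanged as the MGID
    group = sm.find_group(mgid, q_key, p_key)
    if group is None:
        group = sm.create_group(mgid, q_key, p_key, mtu)
    return sm.join(group, lid)


# Two nodes on the same link join the same group and get the same MLID.
sm = FakeSubnetManager()
mlid_a = join_ipv6_group(sm, "ff02::1", 0x9B1B1234, 0x8001, 2048, lid=1)
mlid_b = join_ipv6_group(sm, "ff02::1", 0x9B1B1234, 0x8001, 2048, lid=2)
```

   In this model the first caller creates the group and the second one
   finds it, so both receive the same MLID and the group is keyed by the
   MGID, Q_Key, and P_Key together.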
   Specifically, the join call causes the switch to add the LID of the
   caller to its forwarding table entry corresponding to the MLID
   allocated for the group. It also causes the router to attach itself
   to the IB multicast tree corresponding to the MGID.

   In order to send a packet destined for an IPv6 multicast address, a
   node must first check whether an IB multicast group matching the
   equivalent MGID and the Q_Key and P_Key of the outbound link exists.
   If one already exists, the MLID from it is used as the destination
   LID (DLID) for the packet. Otherwise, no member exists on the local
   link, and the packet should be forwarded to the all-router multicast
   address. Note that the local node MUST be notified if an IB multicast
   group corresponding to the MGID comes into existence later. This
   signifies that an interested party just showed up on the local link
   and therefore must be copied.

10. The All-Node Multicast Group

   The steps following a join call to an IPv6 multicast group by a
   client can be best illustrated by a special multicast group - the
   group corresponding to the link-local, all-node multicast address
   (FF02::1). This special group must be joined by all IPv6 nodes coming
   up on a link.

   The first node coming up on a link must create a local IB multicast
   group with the per-link Q_Key, P_Key, and an MGID equal to FF02::1.
   The node must also supply a link MTU. If both 2048 and 4096 bytes are
   valid choices, the node must decide on one to use. The decision is
   implementation dependent and is outside the scope of this document.

   When an IPv6 node comes up on a link later, it must first look for
   the special, link-local, all-node IB multicast group with a P_Key
   matching the one for the local link. Then it MUST perform a sanity
   check on the Q_Key associated with the multicast group to verify the
   signature contained in the upper 16 bits. This is to make sure the IB
   multicast group is indeed there for IPv6 use.
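
   The Q_Key sanity check amounts to comparing the upper 16 bits against
   the signature 0x9B1B from section 5. The sketch below shows one way
   to build and verify such a Q_Key; the function names are hypothetical,
   and the use of Python's secrets module for the random half is only an
   assumption, since the generation method is implementation dependent.

```python
# Illustrative sketch only; function names are hypothetical.
import secrets

IPV6_QKEY_SIGNATURE = 0x9B1B  # upper 16 bits of a per-link Q_Key (section 5)


def make_ipv6_qkey(random16=None):
    """Build a 32-bit Q_Key: signature in the upper half, random lower half."""
    if random16 is None:
        random16 = secrets.randbits(16)  # one possible source of randomness
    if not 0 <= random16 <= 0xFFFF:
        raise ValueError("random part must fit in 16 bits")
    return (IPV6_QKEY_SIGNATURE << 16) | random16


def has_ipv6_signature(q_key):
    """Return True if the upper 16 bits of a 32-bit Q_Key match the signature."""
    return ((q_key >> 16) & 0xFFFF) == IPV6_QKEY_SIGNATURE


q_key = make_ipv6_qkey(0x1234)
print(hex(q_key), has_ipv6_signature(q_key))  # prints 0x9b1b1234 True
```

   Because the signature occupies the most significant bits, every Q_Key
   built this way also has its high-order bit set.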
   Finally, the node must check whether the link MTU size associated
   with the multicast group is supported by the local port. If not, it
   cannot join the local link as an IPv6 node. Otherwise, it may join
   the group and retrieve the MLID to use for local subnet broadcast.

11. Encapsulation of IP datagrams in IB Unreliable Datagrams

   The figure below illustrates the format of an IB Unreliable Datagram:

   +-----+-----+-----+------+----------------+-----+------+
   | LRH | GRH | BTH | DETH |    Payload     |ICRC | VCRC |
   | 8B  | 40B | 12B |  8B  |  0-2048/4096B  | 4B  |  2B  |
   +-----+-----+-----+------+----------------+-----+------+

   IPv6 datagrams SHALL be encapsulated in IB unreliable datagrams in
   the payload portion directly. The QPs created for receiving IPv6
   packets are for IPv6 uses only. They MUST NOT be used for receiving
   packets from other protocols.

12. References

   [AARCH]   Hinden, R. and S. Deering, "IP Version 6 Addressing
             Architecture", RFC 2373, July 1998.

   [ACONF]   Thomson, S. and T. Narten, "IPv6 Stateless Address
             Autoconfiguration", RFC 2462, December 1998.

   [DISC]    Narten, T., Nordmark, E. and W. Simpson, "Neighbor
             Discovery for IP Version 6 (IPv6)", RFC 2461, December
             1998.

   [EUI64]   "Guidelines For 64-bit Global Identifier (EUI-64)",
             http://standards.ieee.org/db/oui/tutorials/EUI64.html

   [IPV6]    Deering, S. and R. Hinden, "Internet Protocol, Version 6
             (IPv6) Specification", RFC 2460, December 1998.

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

13. Author's Address

   H.K. Jerry Chu
   901 San Antonio Road, UMPK17-201
   Palo Alto, CA 94303-4900
   USA

   Phone: +1 650 786-5146
   EMail: jerry.chu@eng.sun.com

14. Full Copyright Statement

   Copyright (C) The Internet Society (2001). All Rights Reserved.
   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.