INTERNET DRAFT                                             Vivek Kashyap
                                                                     IBM
Expiration Date: August 6, 2002                           H.K. Jerry Chu
                                                        Sun Microsystems
                                                        February 6, 2002


    IP encapsulation and address resolution over InfiniBand networks


Status of this memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This memo provides information for the Internet community. This memo
   does not specify an Internet standard of any kind. Distribution of
   this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

   This document specifies the frame format for transmission of IP and
   ARP packets over InfiniBand networks. Unless explicitly specified,
   the term 'IP' refers to both IPv4 and IPv6. The term 'ARP' refers to
   all the ARP protocols/op-codes such as ARP/RARP.

   This document also describes the method of forming IPv6 link-local
   addresses, and the content of the source/target link-layer address
   option used in Neighbor Solicitation, Neighbor Advertisement, Router
   Advertisement, Redirect and Router Solicitation messages on IPv6
   over InfiniBand.

Kashyap, Chu                                                    [Page 1]

INTERNET-DRAFT             IP over InfiniBand           February 6, 2002

Table of Contents

   1.0 Introduction
   2.0 InfiniBand Datalink
     2.1 IP Support on IPoIB Link
   3.0 Maximum Transmission Unit
   4.0 Frame Format
   5.0 IPv6 Stateless Autoconfiguration
     5.1 IPv6 Link Local Address
   6.0 Address Mapping - Unicast
     6.1 Link Information
       6.1.1 Link Layer Address/Hardware Address
       6.1.2 Auxiliary Link Information
     6.2 Address Resolution in IPv4 Subnets
     6.3 Link-Layer Address in IPv6
   7.0 IANA Considerations
   8.0 Security Considerations
   9.0 Acknowledgements
   10.0 References
   11.0 Authors' Addresses

1.0 Introduction

   The InfiniBand specification [IB_ARCH] can be found at
   www.infinibandta.org. The document [IPoIB_ARCH] provides a short
   overview of the InfiniBand architecture along with considerations
   for specifying IP over InfiniBand networks. The document
   [IPoIB_MCAST] defines the configuration of IPoIB links and the
   support of IP multicast over InfiniBand networks.

   The InfiniBand architecture (IBA) defines multiple modes of
   transport over which IP may be implemented. The unreliable datagram
   (UD) transport method best matches the needs of IP and the need for
   universality in general, as described in [IPoIB_ARCH].

   This document specifies IPoIB over IB's unreliable datagram (UD)
   mode. A separate document will describe the implementation of IP
   subnets over IB's other transport mechanisms.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

2.0 InfiniBand Datalink

   The document [IPoIB_MCAST] defines the IPoIB link, its setup, and IP
   multicast over InfiniBand in detail. The following discussion gives
   a short overview.

   An IB subnet is formed by a network of IB nodes interconnected
   either directly or via IB switches. IB subnets may be connected
   using IB routers to form a fabric made of multiple IB subnets.
   Multiple IP subnets may be overlaid over this IB cloud.
   The boundary of such an IP subnet is arbitrary and not associated
   with a physical demarcation. The IPoIB nodes that are members of
   the subnet are interconnected by an abstract 'link'. The link is
   defined by its members and by common characteristics such as the
   P_Key, link MTU and Q_Key, which are defined per 'link'.

   IPv4 defines a limited-broadcast address over the link. All IPv4
   hosts that are members of the IPv4 subnet are members of this
   address. IPv6 defines a multicast address referred to as the all-IP
   hosts address. IPoIB associates a multicast GID with these addresses
   [IPoIB_MCAST]. This multicast GID will henceforth be referred to as
   the broadcast-GID.

   The broadcast-GID must be set up for an IPoIB subnet to be formed.
   Every IPoIB interface MUST join the broadcast-GID. This operation
   returns the MTU and the Q_Key associated with the IPoIB link. Thus
   the IPoIB subnet (and the link) is formed by the IPoIB nodes joining
   the broadcast-GID. The P_Key is a configuration parameter that must
   be known before the broadcast-GID can be formed [IPoIB_MCAST].

2.1 IP Support on IPoIB Link

   The unreliable datagram (UD) mode of communication is supported by
   all IB elements, be they IB routers, HCAs or TCAs. In addition to
   being the only universal transmission method, it supports
   multicasting, partitioning and a 32-bit CRC [IB_ARCH]. Though
   multicasting support is optional in IB fabrics, the IPoIB
   architecture requires the participating components to support it
   [IPoIB_MCAST].

   All IPoIB implementations MUST support IP over the unreliable
   datagram (UD) transport mode of IBA.

   [ Note to WG:

   There is an ongoing discussion in the WG with respect to packet
   encapsulation. A consensus call by the chair on the 'ethertype'
   discussion is awaited. The final draft of this document will reflect
   the consensus.
   The decision on 'ethertype' may affect the following two sections:

      3.0 Maximum Transmission Unit
      4.0 Frame Format
   ]

3.0 Maximum Transmission Unit

   The IB architecture supports multiple MTU values: 256, 512, 1024,
   2048 and 4096 bytes.

   An implementation determines the IPoIB link MTU from the MTU listed
   in the MCGroupRecord of the broadcast-GID [IPoIB_MCAST]. The IPoIB
   link does not have a default MTU.

   It is RECOMMENDED that the IP MTU be set equal to the IPoIB link
   MTU.

   In IPv6 subnets the IP MTU derived from the IPoIB link MTU may be
   reduced by a Router Advertisement [RFC2461] containing an MTU option
   that specifies a smaller MTU, or by manual configuration of each
   node. If a Router Advertisement received on an IPoIB interface has
   an MTU option specifying an MTU larger than the link MTU or larger
   than a manually configured value, that MTU option may be logged to
   system management but must otherwise be ignored.

   Similarly, the IPv4 MTU may be reduced from the link MTU value by
   manual configuration of each node.

   For purposes of this document, information received from DHCP is
   considered "manually configured".

   Ethernet LANs, which are very common, support an MTU of 1500 bytes.
   The IPv6 specification further requires a minimum MTU of 1280 bytes.
   It is therefore appropriate to set the IP MTU to either of these
   values, depending on the networking needs.

   It must, however, be ensured that the IPoIB link MTU is at least
   2048 bytes. Smaller IBA MTUs are not optimal for internetworking
   with other IP subnets.

4.0 Frame Format

   The IP and ARP datagrams are directly encapsulated in the payload of
   IB's unreliable datagrams.

   |<------ IB Frame headers -------->|Payload|<-IB trailers -->|
   +-------+------+---------+---------+-------+---------+-------+
   |Local  |      |Base     |Datagram |       |Invariant|Variant|
   |Routing| GRH* |Transport|Extended |IPv4/v6|   CRC   |  CRC  |
   |Header |Header|Header   |Transport| /ARP  |         |       |
   |       |      |         |Header   |       |         |       |
   +-------+------+---------+---------+-------+---------+-------+

                              Figure 1

   The InfiniBand specification requires the use of the Global Routing
   Header (GRH) [IPoIB_ARCH] when multicasting or when an InfiniBand
   packet traverses from one IB subnet to another through an IB router.
   Its use is optional for unicast transmission between nodes within an
   IB subnet. The IPoIB implementation MUST be able to handle packets
   received with or without the GRH.

   The IP/ARP datagrams SHALL be encapsulated in the payload of IB
   unreliable datagrams.

   The QPs advertised for IP communication MUST NOT be used for other
   protocols.

5.0 IPv6 Stateless Autoconfiguration

   The IB architecture associates an EUI-64 identifier, termed the GUID
   (Globally Unique Identifier) [IPoIB_ARCH, IB_ARCH], with each port.
   The LID (16 bits) is unique within an IB subnet only.

   The interface identifier may be chosen from:

   1) The EUI-64 compliant globally unique identifier (GUID) assigned
      by the manufacturer.

   2) If the IPoIB subnet is fully contained within an IB subnet, any
      of the unique 16-bit LIDs of the port associated with the IPoIB
      interface.

   The LID values of a port may change after a reboot/power-cycle of
   the IB node. Therefore, if a persistent value is desired, it would
   be prudent not to use the LID to form the interface identifier.

   On the other hand, the LID provides an identifier that can be used
   to create a more anonymous IPv6 address, since the LID is not
   globally unique and is subject to change over time.

   It is RECOMMENDED that the link-local address be constructed from
   the port's EUI-64 identifier as per the rules specified in
   [RFC2373].
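
   The recommendation above can be illustrated with a minimal Python
   sketch. It assumes the EUI-64 rule of [RFC2373] Appendix A, in which
   the universal/local bit of the identifier is inverted, and the
   fe80::/10 link-local layout. The function name and the sample GUID
   are hypothetical and are not taken from this document.

```python
import ipaddress


def link_local_from_guid(guid: bytes) -> ipaddress.IPv6Address:
    """Form an IPv6 link-local address from an 8-octet port GUID.

    Illustrative sketch only; the name and signature are assumptions.
    """
    if len(guid) != 8:
        raise ValueError("port GUID must be 8 octets (EUI-64)")
    # Invert the universal/local bit (0x02 in the first octet), as
    # RFC 2373 Appendix A describes for EUI-64 identifiers.
    interface_id = bytes([guid[0] ^ 0x02]) + guid[1:]
    # 10-bit prefix 1111111010 followed by 54 zero bits: fe80::/64.
    prefix = bytes([0xFE, 0x80]) + bytes(6)
    return ipaddress.IPv6Address(prefix + interface_id)


# Hypothetical GUID used purely as sample input.
sample_guid = bytes.fromhex("0002c90300001e41")
addr = link_local_from_guid(sample_guid)
print(addr)
```

   An interface identifier derived from a LID or chosen by other means
   would replace only the low 64 bits; the prefix handling stays the
   same.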

   The interface identifier may also be chosen as per the guidelines
   specified in [RFC3041].

5.1 IPv6 Link Local Address

   The IPv6 link-local address for an IPoIB interface is formed in
   accordance with the guidelines in [RFC2373]. The link-local address
   has the following format:

     10 bits            54 bits                   64 bits
   +----------+-----------------------+----------------------------+
   |1111111010|        (zeros)        |    Interface Identifier    |
   +----------+-----------------------+----------------------------+

                              Figure 2

6.0 Address Mapping - Unicast

   Address resolution in IPv4 subnets is accomplished through the
   Address Resolution Protocol (ARP) [RFC826]. In IPv6 subnets it is
   accomplished using the Neighbor Discovery protocol [RFC2461].

6.1 Link Information

   An InfiniBand packet over the UD mode includes multiple headers such
   as the LRH (local route header), GRH (global route header), BTH
   (base transport header) and DETH (datagram extended transport
   header), as depicted in Figure 1 and specified in the InfiniBand
   architecture [IB_ARCH]. All these headers comprise the link layer in
   an IPoIB link.

   The parameters needed in these IBA headers constitute the link-layer
   information that needs to be determined before an IP packet may be
   transmitted across the IPoIB link.

   The parameters that need to be determined are:

   a) LID (local identifier)

      The LID is always needed. A packet always includes the LRH, which
      is targeted at the remote node's LID, or at an IB router's LID to
      get to the remote node in another IB subnet.

   b) GID (global identifier)

      The GID is not needed when exchanging information within an IB
      subnet, though it may be included in any packet. It is an
      absolute necessity when transmitting across IB subnets, since the
      IB routers use the GID to correctly forward the packets. The
      source and destination GIDs are fields included in the GRH.

      The GID, if formed using the GUID, can be used to unambiguously
      identify an endpoint.

   c) QPN (queue pair number)

      Every unicast UD communication is directed to a particular queue
      pair (QP) at the peer.

   d) Q_Key

      A Q_Key is associated with each unreliable datagram QPN. To be
      accepted, a received packet must contain a Q_Key that matches the
      QP's Q_Key.

   e) P_Key

      A successful communication between two IB nodes using UD mode can
      occur only if the two nodes have compatible P_Keys. This is
      referred to as being in the same partition [IB_ARCH]. P_Keys are
      checked at the receiving channel adapter and may optionally be
      checked at intermediate switches/IB routers. If the P_Key in the
      packet does not match the expected P_Key, the packet is dropped.

   f) SL (service level)

      Every IBA packet contains an SL value. A path in IBA is defined
      by the three-tuple (source LID, destination LID, SL). The SL in
      turn is mapped to a virtual lane (VL) at every xCA or switch that
      sends/forwards the packet [IPoIB_ARCH]. Multiple SLs may be used
      between two endpoints to provide for load-balancing; SLs may also
      be used to provide a QoS infrastructure or to avoid deadlocks in
      the IBA fabric.

   Another auxiliary piece of information, not included in the IBA
   headers, is:

   g) Path rate

      The InfiniBand architecture defines multiple link speeds. A
      higher speed transmitter can swamp switches/xCAs. To avoid such
      congestion, every source transmitting at greater than 1x speed is
      required to determine the 'path rate' before the data may be
      transmitted [IB_ARCH].

6.1.1 Link Layer Address/Hardware Address

   Though the list of information required for a successful transmittal
   of an IPoIB packet is large, not all of it need be determined during
   the IP address resolution process.

   The IPoIB link-layer address used in the source/target link-layer
   address option in IPv6 and the 'hardware address' in IPv4/ARP have
   the same format.

   The format is as described below:

   +--------+--------+--------+--------+
   |Reserved|        QPN[23-0]         |
   +--------+--------+--------+--------+
   |            GID[127-96]            |
   +                                   +
   |            GID[95-64]             |
   +                                   +
   |            GID[63-32]             |
   +                                   +
   |             GID[31-0]             |
   +--------+--------+--------+--------+

                Figure 3

   a) Reserved Flags

      These 8 bits are reserved for future use. These bits MUST be set
      to zero on send and ignored on receive unless specified
      differently in a future document.

   b) Queue Pair Number (QPN)

      Every unicast communication in the IB architecture is directed to
      a specific queue pair (QP) [IB_ARCH]. This QP number is included
      in the link description. All IP communication to the relevant
      IPoIB interface MUST be directed to this QPN. In the case of IPv4
      subnets, the Address Resolution Protocol (ARP) reply packets are
      also directed to the same QPN.

      The choice of the QPN for IP/ARP communication is up to the
      implementation.

   c) Global Identifier (GID)

      This is one of the Global Identifiers (GIDs) [IB_ARCH] of the
      port associated with the IPoIB interface. IB associates multiple
      GIDs with a port. It is RECOMMENDED that the GID formed by the
      combination of the IB subnet prefix and the port's GUID be
      included in the link-layer/hardware address.

6.1.2 Auxiliary Link Information

   The rest of the parameters are determined as follows:

   a) Local Identifier (LID)

      The method of determining the peer's LID is not defined in this
      document. It is up to the implementation to use any of the IBA
      approved methods to determine the destination LID. One such
      method is to use the GID determined during address resolution to
      retrieve the associated LID from the IB routing infrastructure or
      the SA.

      It is the responsibility of the administrator to ensure that the
      IB subnet(s) have unicast connectivity between the IPoIB nodes.
      The GID exchanged between two endpoints in a multicast message
      (ARP/ND) does not guarantee the existence of a unicast path
      between the two. This has to be ensured by the fabric
      administrator.

      There may be multiple LIDs, and hence paths, between the
      endpoints. The criteria for selection of the LIDs are beyond the
      scope of this document.

   b) Q_Key

      The Q_Key received on joining the broadcast-GID MUST be used for
      all IPoIB communication over the particular IPoIB link.

   c) P_Key

      The network administrator is required to set up an IPoIB link by
      setting up an IB partition and assigning it a unique P_Key
      [IPoIB_MCAST]. Thus the P_Key to be used in the IP subnet is not
      discovered but is a configuration parameter.

   d) Service Level (SL)

      The method of determining the SL is not defined in this document.
      The SL is determined by any of the IBA approved methods.

   e) Path rate

      The implementation must leverage IB methods to determine the path
      rate as required.

6.2 Address Resolution in IPv4 Subnets

   The ARP packet header is as defined in [RFC826]. The hardware type
   is set to 32 (decimal) as specified by the Internet Assigned Numbers
   Authority (IANA). The rest of the fields are used as per [RFC826].

      16 bits: hardware type
      16 bits: protocol
       8 bits: length of hardware address
       8 bits: length of protocol address
      16 bits: ARP operation

   The remaining fields in the packet hold the sender/target hardware
   and protocol addresses.

      [ sender hardware address ]
      [ sender protocol address ]
      [ target hardware address ]
      [ target protocol address ]

   The hardware address included in the ARP packet will be as specified
   in section 6.1.1 and depicted in Figure 3.

   The length of the hardware address used in the ARP packet header is
   therefore 20 octets.
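
   The layout described above can be sketched in Python as follows. This
   is a minimal illustration, not a reference implementation. The
   function names, the sample QPN, GID and IPv4 addresses are
   hypothetical. The protocol type 0x0800 for IPv4 and the all-zero
   target hardware address in a request are assumptions carried over
   from common ARP practice and are not stated in this document.

```python
import ipaddress
import struct

ARP_HRD_INFINIBAND = 32    # hardware type assigned by IANA (section 7.0)
ARP_PRO_IPV4 = 0x0800      # assumption: usual RFC 826 protocol type for IPv4
IPOIB_HLEN = 20            # 8 reserved bits + 24-bit QPN + 128-bit GID
ARP_OP_REQUEST = 1         # RFC 826 request op-code


def pack_hw_addr(qpn: int, gid: bytes) -> bytes:
    """Pack the 20-octet address of Figure 3 (hypothetical helper)."""
    if not 0 <= qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    if len(gid) != 16:
        raise ValueError("GID must be 16 octets")
    # A 32-bit big-endian word whose top octet is the reserved field,
    # sent as zero, followed by the GID in network byte order.
    return struct.pack("!I", qpn) + gid


def pack_arp(op: int, sha: bytes, spa: bytes, tha: bytes, tpa: bytes) -> bytes:
    """Pack an ARP packet for IPv4 over IPoIB (hypothetical helper)."""
    header = struct.pack("!HHBBH", ARP_HRD_INFINIBAND, ARP_PRO_IPV4,
                         IPOIB_HLEN, 4, op)
    return header + sha + spa + tha + tpa


# Sample values, chosen only for illustration.
gid = bytes.fromhex("fe80000000000000" "0002c90300001e41")
sender_hw = pack_hw_addr(0x000404, gid)
request = pack_arp(
    ARP_OP_REQUEST,
    sender_hw,
    ipaddress.IPv4Address("192.0.2.1").packed,
    bytes(IPOIB_HLEN),  # assumption: target hardware address unknown
    ipaddress.IPv4Address("192.0.2.2").packed,
)
print(len(sender_hw), len(request))
```

   With 20-octet hardware addresses and 4-octet IPv4 addresses the
   packet is 8 + 2 * (20 + 4) = 56 octets long.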

6.3 Link-Layer Address in IPv6

   The Source/Target Link-layer Address option is used in Router
   Solicitation, Router Advertisement, Redirect, Neighbor Solicitation
   and Neighbor Advertisement messages when such messages are
   transmitted on InfiniBand networks.

   The source/target address option is specified as follows:

      Type:
         Source Link-layer address   1
         Target Link-layer address   2

      Length:   3 (in units of 8 octets, per [RFC2461])

      Link-layer address:
         The link-layer address is as specified in section 6.1.1 and
         depicted in Figure 3.

7.0 IANA Considerations

   To support ARP over InfiniBand, a value for the Address Resolution
   Parameter 'Number Hardware Type (hrd)' is required. IANA has
   assigned the number '32' to indicate InfiniBand [IANA_ARP].

8.0 Security Considerations

   This document specifies IP transmission over a multicast network.
   Any network of this kind is vulnerable to a sender claiming
   another's identity in order to forge traffic or eavesdrop. It is the
   responsibility of the higher layers or applications to implement
   suitable counter-measures if this is a problem.

9.0 Acknowledgements

   The authors would like to thank Bruce Beukema, David Brean, Dan
   Cassiday, Yaron Haviv, Thomas Narten, Erik Nordmark, Greg Pfister,
   Jim Pinkerton, Renato Recio, Kevin Reilly, Madhu Talluri and Satya
   Sharma for their suggestions and many clarifications on the IBA
   specification.

10.0 References

   [IB_ARCH]      InfiniBand Architecture Specification, Volume 1.0a,
                  www.infinibandta.org

   [IPoIB_ARCH]   draft-ietf-ipoib-architecture-01.txt

   [IPoIB_MCAST]  draft-ietf-ipoib-link-multicast-00.txt

   [RFC2373]      IP Version 6 Addressing Architecture

   [RFC2375]      IPv6 Multicast Address Assignments

   [RFC826]       An Ethernet Address Resolution Protocol

   [RFC1700]      Assigned Numbers

   [RFC2434]      Guidelines for Writing an IANA Considerations Section
                  in RFCs

   [RFC2461]      Neighbor Discovery for IP version 6 (IPv6)

   [RFC3041]      Extensions to IPv6 Address Autoconfiguration

   [IANA]         Internet Assigned Numbers Authority, www.iana.org

   [IANA_ARP]     www.iana.org/assignments/arp-parameters

11.0 Authors' Addresses

   Vivek Kashyap
   15450, SW Koll Parkway
   Beaverton, OR 97006
   USA

   Phone: +1 503 578 3422
   Email: vivk@us.ibm.com

   H.K. Jerry Chu
   901 San Antonio Road, UMPK17-201
   Palo Alto, CA 94303-4900
   USA

   Phone: +1 650 786-5146
   Email: jerry.chu@sun.com

Full Copyright Statement

   Copyright (C) The Internet Society (2001). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.