INTERNET DRAFT                                                V. Kashyap
                                                                     IBM
Expiration Date: January 19, 2002                          July 19, 2001


                 IPv4 and ARP over InfiniBand networks


Status of this memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as Reference material or to cite them other than as ``work in
   progress''.

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This memo provides information for the Internet community. This
   memo does not specify an Internet standard of any kind.
   Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

   This document presents a way of encapsulating IPv4 and Address
   Resolution Protocol (ARP) packets over InfiniBand. It also
   describes a mechanism for IPv4 address resolution on InfiniBand
   fabrics.

Table of Contents

   1.0 Introduction
   2.0 InfiniBand Data link

Kashyap                                                         [Page 1]
INTERNET-DRAFT        IPv4 and ARP over InfiniBand         July 19, 2001

      2.1 UD packet format
         2.1.1 Local Routing header
         2.1.2 Global Routing header
         2.1.3 Base Transport Header
         2.1.4 Datagram Extended Transport Header
         2.1.5 IPv4 over UD requirements
   3.0 IPv4 Address resolution
      3.1 ARP and IPv4 subnet span over IB subnets
      3.2 InfiniBand ARP
         3.2.1 InfiniBand ARP header
         3.2.2 Hardware address format
            3.2.2.1 LID
            3.2.2.2 Capability flag
            3.2.2.3 QPN and Q_Key
            3.2.2.4 GUID
      3.3 InfiniBand ARP process
   4.0 IPv4 and ARP encapsulation in UD packets
      4.1 Protocol demultiplexing
   5.0 MTU
   6.0 Service Level
   7.0 P_Key
   8.0 Additional Features
   9.0 IANA Considerations
   10.0 Security Considerations
   11.0 References
   12.0 Author's Address
   13.0 APPENDIX A

1.0 Introduction

   The InfiniBand specification [1] can be found at
   www.infinibandta.org. The document 'IP over InfiniBand: Overview,
   issues and requirements' [2] provides a short overview of the
   InfiniBand architecture and of the issues in specifying IP over
   InfiniBand.

   This document restricts itself to IPv4 and ARP over InfiniBand. A
   subsequent document will define IPv6 encapsulation and address
   resolution over InfiniBand.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

2.0 InfiniBand Data link

   InfiniBand (IB) provides multiple methods of packet exchange between
   two endpoints. These are:

      Reliable Connected   (RC)
      Reliable Datagram    (RD)
      Unreliable Connected (UC)
      Unreliable Datagram  (UD)
      Raw Datagram : Raw IPv6      (R6)
                   : Raw Ethertype (RE)

   IPv4 and ARP can be specified over any one, several, or all of
   these methods. A case can be made for support on any of the methods
   depending on the desired parameters.

   The IB specification requires Unreliable Datagram mode to be
   supported by all IB nodes. Host channel adapters (HCAs) are
   additionally required to support the Reliable Connected and
   Unreliable Connected modes; target channel adapters (TCAs) are not.

   Additionally, for the sake of simplicity and ease of implementation
   and integration with existing stacks, it is desirable that the
   fabric support multicasting. This is possible only in the
   Unreliable Datagram (UD) and IB's Raw Datagram modes.

   Given the above conditions, this document specifies a method to
   encapsulate IPv4 and ARP over the UD mode of InfiniBand.

   An IPv4 over InfiniBand implementation MUST support IPv4 and ARP
   over the Unreliable Datagram mode of InfiniBand. The Address
   Resolution Protocol (ARP) MUST NOT be supported over any mode other
   than Unreliable Datagram. An implementation MAY additionally
   support IPv4 over any of the other modes.

2.1 UD packet format

   The UD packet may be transmitted in two ways:

   1) Local (within an IB subnet) packets

   +--------+---------+---------+-------+---------+---------+
   |Local   |Base     |Datagram |Packet |Invariant| Variant |
   |Routing |Transport|Extended |Payload|   CRC   |   CRC   |
   |Header  |Header   |Transport|       |         |         |
   |        |         |Header   |       |         |         |
   +--------+---------+---------+-------+---------+---------+

   2) Global (between IB subnets) packets

   +--------+-------+---------+---------+-------+---------+---------+
   |Local   |Global |Base     |Datagram |Packet |Invariant| Variant |
   |Routing |Routing|Transport|Extended |Payload|   CRC   |   CRC   |
   |Header  |Header |Header   |Transport|       |         |         |
   |        |       |         |Header   |       |         |         |
   +--------+-------+---------+---------+-------+---------+---------+

2.1.1 Local Routing header

   This header is always used. When communicating across IB subnets
   the destination LID is the LID of an IB router port. Otherwise it
   is the LID of the port to which the packet is addressed.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Virtual|Link   |Service|Rsr|LNH|      Destination Local ID     |
   | Lane  |Version| Level |vd |   |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Reserved |    Packet Length    |        Source Local ID        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Of the header elements, the sending node's IPv4 stack must know the
   Service Level, the destination LID and the source LID. In addition,
   the packet length cannot specify a payload of more than the path
   MTU between the source and the destination ports. The other values
   are either well known standard values or are determined from other
   known values. For example, the VL is determined from the SL.

2.1.2 Global Routing header

   This header is used when the packet must traverse IB subnet
   boundaries. It is also used for all multicast packets. The GRH
   looks like the IPv6 header. The GID looks like an IPv6 address.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version| Traffic Class |              Flow Label               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Payload Length         |  Next Header  |   Hop Limit   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Source GID                           |
   |                                                               |
   |                                                               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Destination GID                        |
   |                                                               |
   |                                                               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Note that from the point of view of the IPv4 layer the GID is
   another form of link-layer address, albeit an incomplete one since
   the LID is always needed for any communication.

   The version is always set to 6. The Traffic Class, Flow Label etc.
   are likely to be determined in response to a policy, or default
   values may be used.
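
   The GRH layout above can be sketched in code. The following is a
   minimal illustration only, assuming network byte order and the
   field widths shown in the diagram. The function names (pack_grh,
   unpack_grh) are hypothetical, and the Next Header and Hop Limit
   values are left to the caller because this document does not fix
   them.

```python
import struct

GRH_LEN = 40      # the GRH occupies 40 bytes (see section 3.2.2.4)
GRH_VERSION = 6   # "The version is always set to 6"


def pack_grh(src_gid: bytes, dst_gid: bytes, payload_len: int,
             next_header: int, hop_limit: int,
             traffic_class: int = 0, flow_label: int = 0) -> bytes:
    """Build a 40-byte GRH following the field layout in the diagram.

    The zero defaults for traffic_class and flow_label are placeholders
    (an assumption), standing in for policy or default values.
    """
    if len(src_gid) != 16 or len(dst_gid) != 16:
        raise ValueError("a GID is 128 bits (16 bytes)")
    if not 0 <= traffic_class <= 0xFF:
        raise ValueError("Traffic Class is an 8-bit field")
    if not 0 <= flow_label <= 0xFFFFF:
        raise ValueError("Flow Label is a 20-bit field")
    # First 32-bit word: Version (4 bits), Traffic Class (8), Flow Label (20).
    word0 = (GRH_VERSION << 28) | (traffic_class << 20) | flow_label
    # Second word: Payload Length (16), Next Header (8), Hop Limit (8).
    fixed = struct.pack("!IHBB", word0, payload_len, next_header, hop_limit)
    return fixed + src_gid + dst_gid


def unpack_grh(buf: bytes) -> dict:
    """Split the first 40 bytes of a buffer into the GRH fields."""
    if len(buf) < GRH_LEN:
        raise ValueError("buffer shorter than a GRH")
    word0, payload_len, next_header, hop_limit = struct.unpack("!IHBB", buf[:8])
    return {
        "version": word0 >> 28,
        "traffic_class": (word0 >> 20) & 0xFF,
        "flow_label": word0 & 0xFFFFF,
        "payload_len": payload_len,
        "next_header": next_header,
        "hop_limit": hop_limit,
        "src_gid": bytes(buf[8:24]),
        "dst_gid": bytes(buf[24:40]),
    }
```

   In practice the channel adapter, not the IPv4 stack, assembles
   this header; the sketch is only meant to make the field boundaries
   concrete.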

   The next header field is always the BTH (Base Transport Header).
   Among the GRH fields only the destination GID needs to be
   determined.

2.1.3 Base Transport Header

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    OpCode     |S|M|PC | Tver  |     Partition Key (P_Key)     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Reserved    |          Destination Queue Pair (QP)          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |A|  Reserved   |            Packet Sequence Number             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Of these, the P_Key and the destination QP must be determined as
   part of the IPv4 address resolution process. The rest of the
   fields are either not used by UD mode or are filled in by the
   channel adapter based on local conditions/values.

   The P_Key index in the P_Key table is attached to the QP used for
   transmission of packets. If the P_Key table on the port is more
   than one entry deep, the software needs to decide which P_Key to
   use.

   Note: The P_Key table can be written to only by the SM [1].

   When multicasting, the destination QP is always set to 0xFFFFFF.

2.1.4 Datagram Extended Transport Header

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Queue Key (Q_Key)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Reserved    |               Source Queue Pair               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   This header includes the sender's queue pair number and the Q_Key
   used in the communication.

2.1.5 IPv4 over UD requirements

   Based on the above headers, the IPv4 implementation must know the
   following information before it can send a packet to a peer:

   1. LID

   2. GID
      A GID is required only when crossing an IB subnet.
      A GID is also required, albeit the multicast group's GID, when
      sending multicast packets.

   3. Service Level

   4. Path MTU between the communicating ports

   5. Partition Key

   6. Queue Pair Number

   7. Q_Key

3.0 IPv4 Address resolution

   Address resolution in its most basic form requires a mapping from
   the IPv4 address to the link layer address. This is generally the
   port identifier. However, a packet in IB requires additional
   auxiliary information as noted above.

   Of the information noted in the previous section, the peer knows
   its own LID, GID, Partition Key, Queue Pair and Q_Key. It can
   therefore return these values in the ARP reply. The service level
   and path MTU can be determined by querying the SA once the port
   identifiers of both endpoints are known.

   Thus to get all the information that can comprise a link address
   in InfiniBand UD fabrics the subnet manager/subnet administrator
   (SM/SA) needs to be consulted. Such a setup however introduces
   unwanted complexity and possibly delay. Additionally it may impact
   the scalability of the IPv4 subnets in IB subnets.

   The solution proposed in this draft does away with consulting the
   SM/SA for the missing information. This is achieved by utilizing
   the subnet wide parameters that are configured for the IB
   multicast GID corresponding to IPv4 broadcast [5].

3.1 ARP and IPv4 subnet span over IB subnets

   Spanning IPv4 subnets across IB subnets is not straightforward
   when implementing multicasting [5].

   A link address in the case of IPv4 subnets spanning IB subnets
   requires two components. These are the GID and the LID. If the
   target is on a remote IB subnet, the sender cannot determine the
   right LID in a simple way. The sender doesn't know the receiver's
   location and hence the IB router to use. Therefore it cannot
   specify the right LID.

   The target is better placed in that the target will know the
   sender's location. It must then, based on the IB routing setup
   (which is not defined by IBTA yet), determine the LID it needs to
   use. It must therefore send a reply with the first hop IB router's
   LID and its own GID. Only on receiving the reply would the
   original sender be in a position to know the LID it must use to
   communicate with the target. The situation can get even more
   complex due to IB routing changes since they would affect the LIDs
   in the cached ARP entries.

   The target ARP code must, of course, use the received LID if the
   GID prefix is the same as its own. Only in the case where the
   prefixes are different must it determine the first hop IB router's
   LID.

   The process is complex and unnecessary. The additional complexity
   introduced by IPv4 subnets spanning IB subnets gains very little,
   if anything. IPv4 subnets therefore MUST be fully contained within
   an IB subnet.

3.2 InfiniBand ARP

   This document proposes to utilize the address resolution protocol
   as defined in RFC826 [6].

   The ARP request packet is broadcast to the IPv4 subnet. This
   packet includes the target IPv4 address and the sender's link
   layer address. The response is unicast to the sender with the
   target's link layer address.

3.2.1 InfiniBand ARP header

   The standard ARP packet header is of the form (as per RFC 826):

      16 bits: hardware type
      16 bits: protocol
       8 bits: length of hardware address
       8 bits: length of protocol address
      16 bits: ARP operation

   The hardware type will take the value corresponding to InfiniBand.
   A request to IANA will be made for this allocation. The rest of
   the fields will be used consistent with RFC 826.

   The remaining fields in the packets hold the sender/target
   hardware and protocol addresses.

      [ sender hardware address ]
      [ sender protocol address ]
      [ target hardware address ]
      [ target protocol address ]

3.2.2 Hardware address format

      16 bits : LID
       8 bits : Capability flags (UC|RC|RE|R6|QPN)
      24 bits : QPN
      32 bits : Q_Key
      64 bits : GUID

   Note that this is the packet on the wire. It does not imply the
   data structure used by the end hosts in their ARP caches.

3.2.2.1 LID

   This is the LID associated with the port to which the IPv4 address
   is attached by way of the logical interface.

3.2.2.2 Capability flag

   Only the first 5 bits are defined. The rest are for future use.
   The first 4 bits denote the InfiniBand modes over which IPv4 is
   supported.

      UC - unreliable connected
      RC - reliable connected
      RE - raw ethertype
      R6 - raw IPv6

   The support of IPv4 over UD is mandatory and therefore it need not
   be indicated in these bits. The rest are all optional. The
   implementation details of the other formats are beyond the scope
   of this document. The flags provide a way for IPv4 over IB
   implementations to advertise the modes they support to one
   another. Whether these capabilities are used is then a choice
   between the communicating endpoints.

   QPN flag:

   The QPN flag indicates that the endpoint supports applications
   that are tied to specific QPs. Since there may be a large number
   of QPs available at the endpoints (the QP number is 24 bits) an
   endpoint can choose to map various services (protocol and port
   pairs) to specific QPNs. This flag indicates the use of such
   demultiplexing.

   The flag will be set by hosts that want to advertise such a use.
   The endpoints that don't support QPN demultiplexing don't use this
   flag. The receiver is free to ignore this flag and continue to use
   the default QPN (described below) and not determine the service
   related QPN.
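
   The hardware address format of section 3.2.2 can be sketched as
   follows. This is an illustration under stated assumptions, not a
   normative encoding: the fields are packed in the order listed, in
   network byte order, and the five defined capability bits are
   assumed to occupy the most significant bits of the flag octet in
   the order UC, RC, RE, R6, QPN. The class and constant names are
   hypothetical.

```python
import struct
from typing import NamedTuple

# Assumed bit positions: "first" is read as most significant bit first.
FLAG_UC = 0x80   # unreliable connected
FLAG_RC = 0x40   # reliable connected
FLAG_RE = 0x20   # raw ethertype
FLAG_R6 = 0x10   # raw IPv6
FLAG_QPN = 0x08  # QPN based demultiplexing advertised

HW_ADDR_LEN = 18  # 16 + 8 + 24 + 32 + 64 bits


class IBHardwareAddress(NamedTuple):
    lid: int
    flags: int
    qpn: int
    q_key: int
    guid: int

    def pack(self) -> bytes:
        """Encode the address as it would appear on the wire."""
        if not 0 <= self.flags <= 0xFF:
            raise ValueError("capability flags are 8 bits")
        if not 0 <= self.qpn <= 0xFFFFFF:
            raise ValueError("QPN is 24 bits")
        # The flag octet and the 24-bit QPN share one 32-bit word.
        flags_qpn = (self.flags << 24) | self.qpn
        return struct.pack("!HIIQ", self.lid, flags_qpn, self.q_key, self.guid)

    @classmethod
    def unpack(cls, data: bytes) -> "IBHardwareAddress":
        """Decode an 18-byte hardware address."""
        if len(data) != HW_ADDR_LEN:
            raise ValueError("hardware address must be 18 bytes")
        lid, flags_qpn, q_key, guid = struct.unpack("!HIIQ", data)
        return cls(lid, flags_qpn >> 24, flags_qpn & 0xFFFFFF, q_key, guid)
```

   With this layout the ARP "length of hardware address" field would
   carry 18, and a receiver that does not implement QPN
   demultiplexing can simply ignore FLAG_QPN and use the qpn field as
   the default QPN.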

   By the same token, a host that implements QPN based demultiplexing
   MUST accept packets that are received on the default QPN even if
   it is demultiplexing the corresponding service by use of QPNs.

   The method of service resolution to the corresponding QPN is not
   defined in this document. The ARP packet MUST include only one
   QPN, the default QPN, and none other.

3.2.2.3 QPN and Q_Key

   This is the default QPN to be used to send IPv4 packets. The
   sender lists the QPN it expects the packets to be sent to and the
   target replies with its QPN. The Q_Key is the corresponding Q_Key
   the endpoints intend to use.

3.2.2.4 GUID

   The GUID (Globally Unique Identifier) is the EUI-64 identifier
   associated with the CA port [1]. The GUID identifies an interface
   uniquely. In contrast, the LID may change after a reboot. Thus the
   GUID value might be used when setting up static ARP entries. Note
   that to actually send a packet the LID must always be determined.
   Thus static ARP entries don't really exist in IB subnets.

   The GID formed from the GUID is always available to the receiver.
   The IB specification requires the 40 bytes of the GRH to be placed
   in the first 40 bytes of the receiver's buffer. Thus it is
   possible for the receiver to always get the GUID of the sender if
   a GRH is included. The GRH has to be used when sending an IB
   multicast packet [1]. Therefore there is really no need to include
   the GUID in the ARP packet. However, the GUID is included to allow
   for proxy ARP.

3.3 InfiniBand ARP process

   The source broadcasts the ARP_REQUEST packet to the IPv4 subnet.
   The IPv4 broadcast to IB multicast group mapping is defined in
   draft-kashyap-ipoib-ipv4-multicast-01.txt [5].

   The broadcast packet itself is a UD packet and hence requires the
   parameters listed in section 2.1.5.

   As defined in [5], the IPv4 subnet is set up with the IPv4
   broadcast address mapped to an IB multicast group.
   This address is registered with the IB subnet's SM/SA. Along with
   it are registered characteristics such as the:

      LID
      P_Key
      Q_Key
      Service Level
      MTU
      Traffic Class
      Hop Limit
      Flow Label

   Note that since these are applicable to the IPv4 broadcast
   address, the fabric administrator must ensure that these
   parameters are honoured across the IPv4 subnet. If this is not
   done, the broadcast cannot be sent to all the IPv4 hosts. The
   interfaces that do not honour these parameters will not be able to
   join the IPv4 broadcast address. Thus, for example, the service
   level must be supported across the multicast group and the MTU
   must be common across the subnet.

   All these values are returned to the node joining the group.
   Therefore the act of joining the IPv4 broadcast address resolves
   many of the parameters needed to send/receive packets in the IPv4
   subnet.

   Every IPv4 interface MUST join the IB multicast group
   corresponding to the IPv4 subnet broadcast address. This first
   step is a necessary step towards address resolution and general
   IPv4 and ARP support on InfiniBand subnets.

   These values act as the best case as well as the default case
   values. It is RECOMMENDED that implementations use the parameters
   returned by the joining of the IPv4 broadcast group in all
   communication. Implementations are free to utilize IB specific
   messages and methods to determine alternate values if they so
   desire.

   Thus the InfiniBand ARP broadcast packet utilizes the parameters
   received as a result of joining the IB broadcast group. The rest
   of the process is therefore identical to the standard
   implementation of ARP over Ethernet as described in RFC826.

   All multicast packets in IB use the QP number 0xFFFFFF, so it
   doesn't have to be determined for sending the ARP broadcast.
   During the ARP response/request the QPN and Q_Key being used by
   the two endpoints are exchanged along with the LID and the GID.
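
   The way the seven items of section 2.1.5 are filled in can be
   sketched as below. This is only an illustration of the division of
   labour described above, assuming the broadcast group join and the
   ARP exchange have already completed. All type and function names
   are hypothetical, and the unicast destination GID is formed from
   the local subnet prefix and the peer's GUID, as section 4.0
   recommends.

```python
from dataclasses import dataclass

MULTICAST_QPN = 0xFFFFFF  # destination QP used by all IB multicast packets


@dataclass(frozen=True)
class GroupParams:
    """Values returned on joining the IPv4 broadcast group."""
    mgid: bytes          # GID of the IB multicast group that was joined
    mlid: int            # multicast LID of the group
    p_key: int
    q_key: int
    service_level: int
    mtu: int


@dataclass(frozen=True)
class ArpEntry:
    """Values learnt about a peer from an ARP request or reply."""
    lid: int
    qpn: int
    q_key: int
    guid: int


@dataclass(frozen=True)
class SendParams:
    """The seven items of section 2.1.5."""
    dlid: int
    dgid: bytes
    service_level: int
    path_mtu: int
    p_key: int
    qpn: int
    q_key: int


def broadcast_send_params(group: GroupParams) -> SendParams:
    """Every parameter of an ARP broadcast comes from the group join."""
    return SendParams(group.mlid, group.mgid, group.service_level,
                      group.mtu, group.p_key, MULTICAST_QPN, group.q_key)


def unicast_send_params(group: GroupParams, peer: ArpEntry,
                        subnet_prefix: int) -> SendParams:
    """LID, QPN, Q_Key and GUID come from ARP; the rest from the group."""
    dgid = subnet_prefix.to_bytes(8, "big") + peer.guid.to_bytes(8, "big")
    return SendParams(peer.lid, dgid, group.service_level,
                      group.mtu, group.p_key, peer.qpn, peer.q_key)
```

   Note that neither function consults the SM/SA, which is the point
   of the proposal: the group join supplies the defaults and ARP
   supplies the per-peer values.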
   The rest of the values, as stated above, are determined at the
   time the broadcast group was joined.

4.0 IPv4 and ARP encapsulation in UD packets

   +-------+------+---------+---------+-------+---------+-------+
   |Local  |      |Base     |Datagram |       |Invariant|Variant|
   |Routing| GRH  |Transport|Extended |Payload|   CRC   |  CRC  |
   |Header |Header|Header   |Transport|       |         |       |
   |       |      |         |Header   |       |         |       |
   +-------+------+---------+---------+-------+---------+-------+

   The InfiniBand specification requires the use of a GRH when
   multicasting. A unicast packet need not include the GRH. It is
   RECOMMENDED that a GRH be included in the interest of symmetry.

   It is RECOMMENDED that for unicast packets the destination GID be
   formed from the GUID in the ARP cache and the local IB subnet
   prefix. The GID at GID index 0 MUST be used as the source GID.

   Note that the additional 40 bytes of the GRH do not reduce the MTU
   value since the IB multicast group's MTU already accounts for it.
   The MTU applies to the payload carried in the IB frame and doesn't
   include the headers (GRH/LRH etc.), optional or otherwise.

4.1 Protocol demultiplexing

   The IB frame does not indicate the payload type. The queue pairs
   (QP) at the endpoints are tied to specific 'users' who know the
   data they are likely to receive. The IB specification assumes such
   a setup.

   In this scenario, if a common QP is used to receive multiple
   protocols, intentionally or otherwise, there are two options:

   a) Introduce a protocol identifier header into the payload.

      This mode would translate to introducing a 4 byte field in the
      packet before the IPv4 header. This field would carry the
      assigned value to indicate IPv4.

      This option, however, has a flaw. The header is meant to protect
      against the reception of a random protocol or random data, yet
      such data has as much chance of looking like the new header as
      it has of looking like the IPv4 header.

   b) Determine the packet based on IPv4 header requirements

      The packet received can be checked for IPv4 signatures such as
      the value 4 in the first nibble.

      The likely conflict is with IPv6/ARP packets on the unicast
      connections. These can be distinguished by looking at the first
      nibble.

      Any intermediate IB switches/IB routers might as well look at
      the first nibble (and fields) directly rather than a header in
      the payload. They wouldn't know that the packet carries a
      special header without peeking anyway.

   Based on the above, no additional header is defined in the
   payload. An implementation may disambiguate its packets based on
   the signature of the protocols it supports.

   [ Opinions of the WG members are solicited on this. The author's
   preference is for not specifying an additional header. A payload
   identifier in the IB headers would have been nice.]

5.0 MTU

   The MTU associated with the IB multicast group corresponding to
   the limited broadcast address MUST be set as the MTU of the
   subnet.

   Every interface, on being set up, joins the broadcast group and
   gets the MTU value. This is the MTU value that is reflected to the
   IP layer as the interface MTU.

6.0 Service Level

   The IB specification requires the use of a service level (SL) in
   every packet. This value may be derived based on any QoS or other
   parameters. The only requirement is that it must be a valid SL
   between the two endpoints.

   The valid SLs between two ports are recorded in the SM/SA [1]. To
   determine the list the two endpoints must be known. This implies
   that after the ARP process is complete the SM/SA needs to be
   consulted. Even then the host must use some external parameters to
   determine the correct SL to use between the endpoints.

   It is RECOMMENDED that the SL returned on joining the IB broadcast
   group be chosen as the default SL for all communication in the IP
   subnet.

7.0 P_Key

   The partition key is a necessity in all IB communication.
   This document RECOMMENDS using the P_Key associated with the IPv4
   broadcast group (IB multicast GID FF12::255.255.255.255) for all
   IPv4 subnet related communication.

8.0 Additional Features

   This document has presented a simple, efficient, interoperable
   method of address resolution and of ARP and IPv4 encapsulation in
   InfiniBand packets.

   The basic desire of the author is to present a method that fits
   easily into existing implementations. Another strong desire is to
   ensure interoperability between implementations by requiring
   default values that are easy to set up, without inhibiting those
   implementations that need some additional features.

   There may be situations where implementations desire more
   'optimal' performance or features. Such implementations are
   dependent on the fabric administrator ensuring that the SM/SA and
   the fabric components are correctly set up for the desired
   features. These cases could be:

   1. Use of methods other than UD for IPv4 packets

      Example: One could utilize Unreliable Connected mode to get a
      higher MTU for TCP connectivity. The fragmentation and
      reassembly of packets is delegated to the hardware.

      The ARP proposal in this document allows for an indication of
      such a capability. It is up to the implementations to then
      utilize InfiniBand specific ways (use of the Connection Manager
      etc.) to set up the necessary communication. The IPv4
      encapsulation can stay the same except that the relevant IB
      headers will be used.

      The only requirement is that the Address Resolution Protocol
      MUST be implemented over the UD mode of InfiniBand. This allows
      for a common mode of determining the link address and link
      capabilities across the IPv4 subnet.

   2. Use of alternate SL

      Some implementations might want to determine alternate SL
      values from the SM/SA. This is a valid option but is unrelated
      to the IPv4 implementation.
      This document recommends that the SL of the IPv4 subnet
      broadcast address, i.e. of the corresponding IB multicast
      group, which by definition is valid for the whole of the IPv4
      subnet, be used by default.

      Alternate choices depend on the implementation consulting the
      SM/SA and the fabric administrator ensuring that such choices
      are valid and available. Such a choice could also depend on the
      quality of service policy mappings from IPv4 to InfiniBand
      implemented on the host.

      An implementation MUST provide configuration parameters to
      define the method of determining the SL. The default value MUST
      imply that the SL associated with the IPv4 broadcast group (IB
      multicast group FF12::255.255.255.255) is the best case SL to
      be used for all communication.

   3. Use of alternate TClass, FlowLabel

      The logic governing these parameters is the same as in the
      previous case for SL. Additionally these may be determined by a
      policy that is intertwined with IPv4 routing or IB routing or
      both. The discussion of such issues is not relevant to this
      document. However, it is RECOMMENDED that the broadcast group
      provide the default values.

   4. Use of QPN demultiplexing

      The QPN demultiplexing has been described in some detail in
      section 3.2.2.2. Implementations that desire the more 'optimal'
      behaviour can thus obtain it in an interoperable way. The
      method of determining the service bindings to QPs is beyond the
      scope of this document.

9.0 IANA Considerations

   To support ARP over InfiniBand, a value of the Address Resolution
   Parameter 'Number Hardware Type (hrd)' is required. A request to
   IANA will be made in this regard.

10.0 Security Considerations

   This document specifies IPv4 packet transmission over a broadcast
   network. Any network of this kind is vulnerable to a sender
   claiming another's identity and forging traffic or eavesdropping.
   It is the responsibility of the higher layers or applications to
   implement suitable counter-measures if this is a problem.

11.0 References

   [1] InfiniBand Architecture Specification, Volume 1, Release 1.0

   [2] draft-kashyap-ipoib_requirements-00.txt. V. Kashyap

   [3] RFC2373: IP Version 6 Addressing Architecture. R. Hinden,
       S. Deering.

   [4] RFC2375: IPv6 Multicast Address Assignments. R. Hinden,
       S. Deering.

   [5] draft-kashyap-ipoib-ipv4-multicast-01.txt. V. Kashyap

   [6] RFC826: An Ethernet Address Resolution Protocol. David C.
       Plummer

   [7] RFC1700: Assigned Numbers. J. Reynolds, J. Postel

   [8] RFC2434: Guidelines for Writing an IANA Considerations Section
       in RFCs. T. Narten, H. Alvestrand

12.0 Author's Address

   Vivek Kashyap
   IBM
   15450, SW Koll Parkway
   Beaverton, OR 97006

   Work: 503 578 3422
   Email: vivk@us.ibm.com

13.0 APPENDIX A: Introduction to InfiniBand

   For a more complete overview the reader is referred to chapter 3
   of the InfiniBand specification.

   InfiniBand Architecture (IBA) defines a System Area Network (SAN)
   for connecting multiple independent processor platforms, I/O
   platforms and I/O devices. The IBA SAN is a communications and
   management infrastructure supporting both I/O and inter-processor
   communications for one or more computer systems.

   An IBA SAN consists of processor nodes and I/O units connected
   through an IBA fabric made up of cascaded switches and IB routers
   (connecting IB subnets). I/O units can range in complexity from
   single ASIC IBA attached devices such as a LAN adapter to a large
   memory rich RAID subsystem.

   An IBA network is subdivided into subnets interconnected by IB
   routers. These are IB routers and IB subnets and not IP routers or
   IP subnets. Each IB node or switch may attach to a single switch
   or to multiple switches, or nodes may attach directly to each
   other. Each node interfaces with the link by way of channel
   adapters (CAs).
   The architecture supports multiple CAs per unit with each CA
   providing one or more ports that connect to the fabric. Each CA
   appears as a node to the fabric.

   The ports are the endpoints to which the data is sent. However,
   each of the ports may include multiple QPs (queue pairs) that may
   be directly addressed from a remote peer. From the point of view
   of data transfer the QP number (QPN) is part of the address.

   IBA supports both connection oriented and datagram service between
   the ports. The peers are identified by the QPN and the port
   identifier. In raw datagram mode the QPN is not used.

   A port may be identified by a local ID (LID) and optionally a
   Global ID (GID). The GID is 128 bits long and is formed by the
   concatenation of a 64 bit subnet prefix and a 64 bit EUI-64
   compliant portion (GUID). The LID is a 16 bit value that is
   assigned when the port becomes active. Note that the GUID is the
   only persistent identifier of a port. However, it cannot be used
   as an address in a packet. If the prefix is modified then the GID
   may change. The subnet manager may attempt to keep the LID values
   constant across shutdowns but that is not a requirement.

   The assignment of the GID and the LID is done by the subnet
   manager. Every IB subnet has at least one subnet manager component
   that controls the fabric. It assigns the LIDs and GIDs, and it
   programs the switches so that they route packets between
   destinations. The subnet manager and a related component, the
   subnet administrator (SA), are the central repository of all
   information that is required to set up and bring up the fabric.

   IB routers are components that route packets between IB subnets
   based on the GIDs. Thus within an IB subnet a packet may or may
   not include a GID, but when going across an IB subnet the GID must
   be included. A LID is always needed in a packet since the
   destination within a subnet is determined by it.
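
   The relationship between the subnet prefix, the GUID and the GID
   described above can be sketched as follows. The helper names are
   hypothetical; the only facts assumed are the 64 + 64 bit
   concatenation stated above and the observation in section 2.1.2
   that a GID looks like an IPv6 address, which is why the standard
   IPv6 text form is used for display.

```python
import ipaddress
from typing import Tuple


def make_gid(subnet_prefix: int, guid: int) -> bytes:
    """Concatenate a 64-bit subnet prefix and a 64-bit GUID into a GID."""
    return subnet_prefix.to_bytes(8, "big") + guid.to_bytes(8, "big")


def split_gid(gid: bytes) -> Tuple[int, int]:
    """Return (subnet_prefix, guid) for a 16-byte GID."""
    if len(gid) != 16:
        raise ValueError("a GID is 128 bits (16 bytes)")
    return int.from_bytes(gid[:8], "big"), int.from_bytes(gid[8:], "big")


def same_ib_subnet(gid_a: bytes, gid_b: bytes) -> bool:
    """True when both GIDs carry the same subnet prefix."""
    return split_gid(gid_a)[0] == split_gid(gid_b)[0]


def gid_to_text(gid: bytes) -> str:
    """Render a GID in IPv6 textual notation."""
    return str(ipaddress.IPv6Address(gid))
```

   A receiver holding the source GID from a GRH could use split_gid to
   recover the sender's GUID, and same_ib_subnet expresses the prefix
   comparison discussed in section 3.1.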

   A CA and a switch may have multiple ports. Each CA port is
   assigned its own LID or a range of LIDs. The ports of a switch are
   not addressable by LIDs/GIDs or, in other words, are transparent
   to other end nodes. Each port has its own set of buffers. The
   buffering is channeled through virtual lanes (VL) where each VL
   has its own flow control. There may be up to 16 VLs.

   VLs provide a mechanism for creating multiple virtual links within
   a single physical link. All ports however must support VL15 which
   is reserved exclusively for subnet management datagrams and hence
   doesn't concern the IPoIB discussions. The actual VL that a port
   uses is configured by the SM and is based on the Service Level
   (SL) specified in every packet. There are 16 possible SLs.

   In addition to the features described above, viz. Queue Pairs
   (QPs), Service Levels (SLs) and addressing (GID/LID), IBA also
   defines the following:

   P_Keys or partition keys:

      Every packet, except for the raw datagrams, carries the
      partition key (P_Key). These values are used for isolation in
      the fabric. A switch (this is an optional feature) may be
      programmed by the SM to drop packets not having a certain key.
      The same is the case with the receiving CA.

   Q_Keys:

      These are used to enforce access rights for reliable and
      unreliable IB datagram services. Raw datagram services don't
      require this value. At communication establishment the
      endpoints exchange the Q_Keys and must always use the relevant
      Q_Keys when communicating with one another.

   Multicast support:

      A switch may support multicasting, i.e. replication of packets
      across multiple output ports. This is an optional feature at
      the switches. A multicast group is identified by a GID. The GID
      format is as defined in RFC 2373 on IPv6 addressing. Thus from
      an IPv6 over IB's point of view the data link multicast address
      looks like the network address.
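
   Since a multicast GID follows the RFC 2373 address format, its
   leading octets can be inspected in the same way as those of an
   IPv6 multicast address. The sketch below assumes only that format
   (an octet of all ones, followed by a 4 bit flags field and a 4 bit
   scope field); the function names are hypothetical, and the example
   GID is the one quoted in sections 7.0 and 8.0.

```python
import ipaddress
from typing import Tuple


def is_multicast_gid(gid: bytes) -> bool:
    """A multicast GID begins with an octet of all ones (RFC 2373)."""
    return len(gid) == 16 and gid[0] == 0xFF


def multicast_flags_and_scope(gid: bytes) -> Tuple[int, int]:
    """Return the 4-bit flags and 4-bit scope of a multicast GID."""
    if not is_multicast_gid(gid):
        raise ValueError("not a multicast GID")
    return gid[1] >> 4, gid[1] & 0x0F


# The IPv4 broadcast group GID quoted in this document, as 16 bytes.
BROADCAST_GID = ipaddress.IPv6Address("ff12::255.255.255.255").packed
```

   For the broadcast group GID this yields flags 1 and scope 2, which
   RFC 2373 defines as a transient, link-local scope group.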
An IB node must explicitly join a multicast group by a request to the SM in order to receive packets. A node may send packets to any multicast group. In both cases the multicast LID to be used in the packets is received from the SM.

There are 6 transport types specified by the IB architecture. These are:

1. Unreliable Datagram (unacknowledged - connectionless)
   The UD service is connectionless and unacknowledged. It allows the QP to communicate with any unreliable datagram QP on any node. The switches, and hence each link, can support only a certain MTU. The permitted MTU values are 256 bytes, 512 bytes, 1024 bytes, 2048 bytes and 4096 bytes. A UD packet cannot be larger than the smallest link MTU between the two peers.

2. Reliable Datagram (acknowledged - multiplexed)
   The RD service is multiplexed over connections between nodes called End to End Contexts (EEC), which allow each RD QP to communicate with any RD QP on any node with an established EEC. Multiple QPs can use the same EEC and a single QP can use multiple EECs (one for each remote node per reliable datagram domain).

3. Reliable Connected (acknowledged - connection oriented)
   The RC service associates a local QP with one and only one remote QP. Messages may be as large as 2^31 bytes in length. The CA implementation takes care of segmentation and reassembly.

4. Unreliable Connected (unacknowledged - connection oriented)
   The UC service associates one local QP with one and only one remote QP. There is no acknowledgment and hence no resend of lost or corrupted packets. Such packets are therefore simply dropped. It is otherwise similar to RC.

5. Raw Ethertype (unacknowledged - connectionless)
   The Ethertype raw datagram packet contains a generic transport header that is not interpreted by the CA but that specifies the protocol type. The values for ethertype are the same as defined in RFC 1700 for ethertype.

6. Raw IPv6 (unacknowledged - connectionless)
   Using the IPv6 raw datagram service, the IBA CA can support standard protocol layers atop IPv6 (such as TCP/UDP). Thus native IPv6 packets can be bridged into the IBA SAN and delivered directly to a port and to its IPv6 raw datagram QP.

The first 4 are referred to as IB transports. The latter two are classified as Raw datagrams. There is no indication of the QP number in the raw datagram packets. The raw datagram packets are limited in size by the link MTU.

Full Copyright Statement

Copyright (C) The Internet Society (2001). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
-- Vivek Kashyap IBM viv@sequent.com vivk@us.ibm.com 503 578 3422 (o)