INTERNET DRAFT
V. Kashyap
IBM
Expiration Date: May 14, 2002
November 14, 2001

IP over InfiniBand (IPoIB) Architecture

Status of this memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress".

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This memo provides information for the Internet community. This memo does not specify an Internet standard of any kind. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

InfiniBand is a high speed, channel based interconnect between systems and devices. This memo presents an overview of the InfiniBand architecture. It further describes the requirements for transmission of IP over InfiniBand.

Kashyap [Page 1]
INTERNET-DRAFT IPoIB architecture November 14, 2001

Table of Contents

1.0 Introduction to InfiniBand
  1.1 InfiniBand Architecture Specification
  1.2 Overview of InfiniBand Architecture
    1.2.1 InfiniBand Addresses
      1.2.1.1 Unicast GIDs
      1.2.1.2 Multicast GIDs
    1.2.2 InfiniBand Multicast Groups
2.0 Management of InfiniBand subnet
3.0 IP over IB requirements
  3.1 InfiniBand as link layer
  3.2 Multicast support
    3.2.1 Mapping IP multicast to IB multicast
    3.2.2 Transient flag in IB MGIDs
  3.3 IP subnets across IB subnets ?
  3.4 Multicast address to LID mapping
  3.5 IP encapsulation
4.0 IP subnets in InfiniBand fabrics
  4.1 IPoIB VLANs
  4.2 Multicast in IPoIB subnets
    4.2.1 Sending IP multicast datagrams
    4.2.2 Receiving multicast packets
      4.2.2.1 IB_join of MGIDs by a listener
    4.2.3 Leaving/Deleting a multicast group
5.0 QoS and related issues
6.0 Security Considerations
7.0 References
8.0 Author's address

1.0 Introduction to InfiniBand

The InfiniBand Trade Association (IBTA) was formed to develop an I/O specification to deliver a channel based, switched fabric technology. The InfiniBand standard is aimed at meeting the requirements of scalability, reliability, availability and performance of servers in data centers.

1.1 InfiniBand Architecture Specification

The InfiniBand Trade Association specification, version 1.0, is available for download from http://www.infinibandta.org.

1.2 Overview of InfiniBand Architecture

For a more complete overview the reader is referred to chapter 3 of the InfiniBand specification.

InfiniBand Architecture (IBA) defines a System Area Network (SAN) for connecting multiple independent processor platforms, I/O platforms and I/O devices. The IBA SAN is a communications and management infrastructure supporting both I/O and inter-processor communications for one or more computer systems.

An IBA SAN consists of processor nodes and I/O units connected through an IBA fabric made up of cascaded switches and IB routers (connecting IB subnets). I/O units can range in complexity from single ASIC IBA attached devices such as a LAN adapter to a large memory rich RAID subsystem.

An IBA network is subdivided into subnets interconnected by IB routers. These are IB routers and IB subnets and not IP routers or IP subnets.

Each IB node or switch may attach to one or more switches, or directly to another node or switch. Each node interfaces with the link by way of channel adapters (CAs).
The architecture supports multiple CAs per unit with each CA providing one or more ports that connect to the fabric. Each CA appears as a node to the fabric.

The ports are the endpoints to which the data is sent. However, each of the ports may include multiple QPs (queue pairs) that may be directly addressed from a remote peer. From the point of view of data transfer the QP number (QPN) is part of the address.

IBA supports both connection oriented and datagram service between the ports. The peers are identified by QPN and the port identifier. In raw datagram mode the QPN is not used.

A port may be identified by a local ID (LID) and optionally a Global ID (GID). The GID is 128 bits long and is formed by the concatenation of a 64 bit IB subnet prefix and a 64 bit EUI-64 compliant portion (GUID). The LID is a 16 bit value that is assigned when the port becomes active. Note that the GUID is the only persistent identifier of a port. However, it cannot be used as an address in a packet. If the prefix is modified then the GID may change. The subnet manager may attempt to keep the LID values constant across reboots but that is not a requirement.

The assignment of the GID and the LID is done by the subnet manager. Every IB subnet has at least one subnet manager component that controls the fabric. It assigns the LIDs and GIDs. The subnet manager also programs the switches so that they route packets between destinations. The subnet manager and a related component, the subnet administrator (SA), are the central repository of all information that is required to set up and bring up the fabric.

IB routers are components that route packets between IB subnets based on the GIDs. Thus within an IB subnet a packet may or may not include a GID but when going across IB subnets the GID must be included. A LID is always needed in a packet since the destination within a subnet is determined by it.
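
To make the GID layout described in this overview concrete, the following is a minimal sketch, in Python, of concatenating a 64 bit subnet prefix and a 64 bit GUID into a 128 bit GID. The function name make_gid and the example GUID value are assumptions made for illustration only; they are not taken from the IB specification.

```python
import ipaddress

MASK_64 = (1 << 64) - 1


def make_gid(subnet_prefix: int, guid: int) -> ipaddress.IPv6Address:
    """Concatenate a 64 bit subnet prefix and a 64 bit GUID into a GID.

    The result is held in an IPv6Address only because a GID has the
    same 128 bit shape as an IPv6 address.
    """
    if not 0 <= subnet_prefix <= MASK_64:
        raise ValueError("subnet prefix must fit in 64 bits")
    if not 0 <= guid <= MASK_64:
        raise ValueError("GUID must fit in 64 bits")
    return ipaddress.IPv6Address((subnet_prefix << 64) | guid)


# FE80::/10 with the remaining prefix bits zero (link-local prefix).
LINK_LOCAL_PREFIX = 0xFE80_0000_0000_0000

# Made-up GUID, used purely as an example value.
example_guid = 0x0002_C903_0000_1234

gid = make_gid(LINK_LOCAL_PREFIX, example_guid)
print(gid)
```

Because the prefix occupies the upper 64 bits, a change of subnet prefix changes the GID while the GUID portion stays the same, which is the behaviour the text above describes.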
A CA and a switch may have multiple ports. Each CA port is assigned its own LID or a range of LIDs. The ports of a switch are not addressable by LIDs/GIDs or, in other words, are transparent to other end nodes. Each port has its own set of buffers. The buffering is channeled through virtual lanes (VLs) where each VL has its own flow control. There may be up to 16 VLs.

VLs provide a mechanism for creating multiple virtual links within a single physical link. All ports however must support VL15, which is reserved exclusively for subnet management datagrams and hence doesn't concern the IPoIB discussions. The actual VL that a port uses is configured by the SM and is based on the Service Level (SL) specified in every packet. There are 16 possible SLs.

In addition to the features described above viz. Queue Pairs (QPs), Service Levels (SLs) and addressing (GID/LID), IBA also defines the following:

P_Keys or partition keys:

   Every packet, except raw datagrams, carries the partition key (P_Key). These values are used for isolation in the fabric. A switch (this is an optional feature) may be programmed by the SM to drop packets not having a certain key. The CA ports always check for the P_Keys. A CA port may belong to multiple partitions.

Q_Keys:

   These are used to enforce access rights for reliable and unreliable IB datagram services. Raw datagram services don't require this value. At communication establishment the endpoints exchange the Q_Keys and must always use the relevant Q_Keys when communicating with one another.

Multicast support:

   A switch may support multicasting, i.e. replication of packets across multiple output ports. This is an optional feature. A multicast group is identified by a GID. The GID format is as defined in [RFC 2373] on IPv6 addressing. Thus from an IPv6 over IB's point of view the data link multicast address looks like the network address.
   An IB node must explicitly join a multicast group by a request to the SM to receive packets. A node may send packets to any multicast group. In both cases the multicast LID to be used in the packets is received from the SM.

There are 6 transport types specified by the IB architecture. These are:

1. Unreliable Datagram (unacknowledged - connectionless)

   The UD service is connectionless and unacknowledged. It allows the QP to communicate with any unreliable datagram QP on any node.

   The switches and hence each link can support only a certain MTU. The possible MTU values are 256 bytes, 512 bytes, 1024 bytes, 2048 bytes and 4096 bytes. A UD packet cannot be larger than the smallest link MTU between the two peers.

2. Reliable Datagram (acknowledged - multiplexed)

   The RD service is multiplexed over connections between nodes called end-to-end contexts (EECs), which allows each RD QP to communicate with any RD QP on any node with an established EEC. Multiple QPs can use the same EEC and a single QP can use multiple EECs (one for each remote node per reliable datagram domain).

3. Reliable Connected (acknowledged - connection oriented)

   The RC service associates a local QP with one and only one remote QP. The message sizes may be as large as 2^31 bytes in length. The CA implementation takes care of segmentation and assembly.

4. Unreliable Connected (unacknowledged - connection oriented)

   The UC service associates one local QP with one and only one remote QP. There is no acknowledgment and hence no resend of lost or corrupted packets. Such packets are therefore simply dropped. It is similar to RC otherwise.

5. Raw Ethertype (unacknowledged - connectionless)

   The Ethertype raw datagram packet contains a generic transport header that is not interpreted by the CA but it specifies the protocol type. The values for ethertype are the same as defined in RFC1700 for ethertype.

6.
Raw IPv6 (unacknowledged - connectionless)

   Using the IPv6 raw datagram service, the IBA CA can support standard protocol layers atop IPv6 (such as TCP/UDP). Thus native IPv6 packets can be bridged into the IBA SAN and delivered directly to a port and to its IPv6 raw datagram QP.

The first four are referred to as IB transports. The latter two are classified as Raw datagrams. There is no indication of the QP number in the raw datagram packets. The raw datagram packets are limited by the link MTU in size.

1.2.1 InfiniBand Addresses

The InfiniBand architecture borrows heavily from the IPv6 architecture in terms of the InfiniBand subnet structure and global identifiers (GIDs).

The InfiniBand architecture defines the global identifier associated with a port as follows:

   GID (Global Identifier): A 128-bit unicast or multicast identifier used to identify a port on a channel adapter, a port on a router, a switch, or a multicast group. A GID is a valid 128-bit IPv6 address (per RFC 2373) with additional properties/restrictions defined within IBA to facilitate efficient discovery, communication, and routing.

   Note: These rules apply only to IBA operation and do not apply to raw IPv6 operation unless specifically called out.

The raw IPv6 operation referred to in the note in the definition above is the IPv6 mode of InfiniBand's raw datagram service. It does not mean IPv6 itself. The routers and switches referred to in the above definition are the InfiniBand routers and switches.

The InfiniBand (IB) specification defines two types of GIDs: unicast and multicast.

1.2.1.1 Unicast GIDs

The unicast GIDs are defined, as in IPv6, with three scopes. The IB specification states:

a. link local: This is defined to be FE80/10. The IB routers will not forward packets with a link local address in source or destination beyond the IB subnet.

b.
site local: FEC0/10. A unicast GID used within a collection of subnets which is unique within that collection (e.g. a data center or campus) but is not necessarily globally unique. IB routers must not forward any packets with either a site-local Source GID or a site-local Destination GID outside of the site.

c. global: A unicast GID with a global prefix, i.e. an IB router may use this GID to route packets throughout an enterprise or internet.

1.2.1.2 Multicast GIDs

The multicast GIDs also parallel the IPv6 multicast addresses. The IB specification defines the multicast GIDs as follows:

   FFxy:<112 bits>

Flag bits:

   The nibble denoted by x above holds the 4 flag bits: 000T. The first three bits are reserved and are set to zero. The last bit is defined as follows:

   T=0: denotes a permanently assigned, i.e. well known, GID
   T=1: denotes a transient group

Scope bits:

   The 4 bits denoted by y in the GID above are the scope bits. These are defined as:

   scope value   Address value
   0             Reserved
   1             Unassigned
   2             Link-local
   3             Unassigned
   4             Unassigned
   5             Site-local
   6             Unassigned
   7             Unassigned
   8             Organization-local
   9             Unassigned
   0xA           Unassigned
   0xB           Unassigned
   0xC           Unassigned
   0xD           Unassigned
   0xE           Global
   0xF           Reserved

   Table 1

The IB specification further refers to [RFC_2373] and [RFC_2375] while defining the well known multicast addresses. However, it then states that the well known addresses apply to IB raw IPv6 datagrams only.

The IB unreliable datagram (UD) service recognises only one well known multicast address. This is the ALL_CHANNEL_ADAPTERS multicast address defined to be FF02::1. The scope of this address is limited to a single IB subnet.

It must be noted though that a multicast group can be associated with only a single MGID. Thus the same MGID cannot be associated with the UD mode and the raw datagram mode.

1.2.2 InfiniBand Multicast Groups

IB multicast groups (multicast GIDs) are managed by the subnet manager (SM).
The SM explicitly programs the IB switches in the fabric to ensure that the packets are received by all the members of the multicast group.

When the group is created a create request is sent to the SM. The subnet manager records the group GIDs and the associated characteristics. The group characteristics are defined by the group path MTU, whether the group will be used for raw datagrams or unreliable datagrams, the service level, the partition key associated with the group, the LID (local identifier) associated with the group etc. These characteristics are defined at the time of the group creation.

The LID is associated with the multicast group by the subnet manager (SM) at the time of the multicast group creation. An IB node may request that a specific LID be associated with a group. The SM determines the multicast tree based on all the group members and programs the relevant switches. The LID is used by the switches to route the packets.

Any IB node wanting to participate in the group must join the group. As part of the join operation the node is returned the group characteristics. At the same time the subnet manager ensures that the requester can indeed participate in the group by verifying that it can support the group MTU and that it has accessibility to the rest of the group members. Other group characteristics may need verification too.

The SM, for groups that span IB subnet boundaries, must interact with IB routers to determine the presence of this group in other IB subnets. If present the MTU must match across the IB subnets.

P_Key is another characteristic that must match across IB subnets since the P_Key inserted into a packet is not modified by the IB switches or IB routers. Thus if the P_Keys didn't match, the IB router(s) might drop the packets or destinations on other subnets might drop the packets.

These characteristics are returned to the IB endnode that joins the multicast group.
A join operation may cause the SM to reprogram the fabric so that the new member can participate in the multicast group.

2.0 Management of InfiniBand subnet

To aid in the monitoring and configuration of InfiniBand subnet components a set of MIBs MUST be defined. MIBs are needed for the channel adapters and baseboard management, to allow management of specified device properties and sample counters.

It must be noted that the management objects addressed in the IPoIB documents are for all of the IB subnet components and are not limited to IP (over IB).

3.0 IP over IB requirements

As described above, IB provides a broad set of capabilities to choose from when implementing IP over IB. It is a requirement that the IPoIB modifications must be of a nature that does not require changes in IP and higher layer protocols. Nor should they mandate requirements on IP stacks to implement special user level programs. It is an aim that the IPoIB changes be amenable to modularisation and incorporation into existing implementations at the same level as other media types.

3.1 InfiniBand as link layer

InfiniBand (IB) provides multiple methods of packet exchange between two endpoints as was noted above. These are:

   Reliable Connected (RC)
   Reliable Datagram (RD)
   Unreliable Connected (UC)
   Unreliable Datagram (UD)
   Raw Datagram: Raw IPv6 (R6)
                 Raw Ethertype (RE)

IPoIB can be implemented over any, multiple or all of these methods. A case can be made for support on any of the methods depending on the desired parameters.

Unreliable datagrams are limited by the link MTU. The connected modes, in contrast to this limitation, can offer significant benefit in terms of performance by utilising a larger MTU. Reliability is also enhanced if the underlying feature of automatic path migration of connected modes is utilised.
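
The MTU limitation on unreliable datagrams can be illustrated with a short sketch. It assumes, as section 1.2 states, that a UD packet cannot exceed the smallest link MTU between the two peers and that link MTUs take one of five values. The function name ud_path_mtu and the example link list are hypothetical and serve only this illustration.

```python
# Link MTU values listed in section 1.2 of this memo, in bytes.
IB_LINK_MTUS = (256, 512, 1024, 2048, 4096)

# Upper bound on an RC message quoted in section 1.2, in bytes.
RC_MAX_MESSAGE = 2 ** 31


def ud_path_mtu(link_mtus):
    """Return the largest UD packet size usable across the given links.

    link_mtus is the list of MTUs of every link on the path between
    the two peers; the smallest one bounds the UD packet size.
    """
    if not link_mtus:
        raise ValueError("a path needs at least one link")
    for mtu in link_mtus:
        if mtu not in IB_LINK_MTUS:
            raise ValueError(f"{mtu} is not a valid IB link MTU")
    return min(link_mtus)


# Hypothetical three-link path between two peers.
example_path = [2048, 4096, 1024]
limit = ud_path_mtu(example_path)
print(limit, RC_MAX_MESSAGE // limit)
```

The second printed figure shows how many packets of that size a maximum-length RC message would span, which gives a feel for the difference between the datagram and connected modes.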
An implementation MAY choose to provide IP over non-UD transport modes in addition to the mandatory IP over UD function.

The IB specification requires Unreliable Datagram mode to be supported by all the IB nodes. Host channel adapters (HCAs) are additionally required to support the Reliable Connected and Unreliable Connected modes; target channel adapters (TCAs) are not. Support for the two Raw Datagram modes is entirely optional.

For the sake of simplicity and ease of implementation and integration with existing stacks, it is desirable that the fabric support multicasting. This is possible only in Unreliable Datagram (UD) and IB's Raw datagram modes.

Given these conditions it is a MUST that an IP stack support IP over the UD transport mode of InfiniBand. Support for IP over the other modes of IB transport is optional.

InfiniBand communication is addressed to a QP at a port. Therefore the IPoIB interface is identified by the port identifier as well as a QP that is associated with it.

The address resolution process for IPoIB MUST also determine the associated QPN along with determining the port identifier.

An interface MAY be associated with multiple QPNs. This provides a mode of implementation wherein a single IP address is associated with different QPNs. Such an association may be used to demultiplex the incoming packets based on the QPN, avoiding or reducing the upper-layer port based lookup. This amounts to there being multiple MAC addresses associated with an endpoint.

Any process for providing resolution and support of multiple QPNs per IP address MUST provide for interoperability with the default version of a single QPN per IPoIB interface.

3.2 Multicast support

The InfiniBand specification makes support of multicasting in the switches optional. It is RECOMMENDED that multicast switches be used in IPoIB subnets.
Lack of multicast capable switches however doesn't mean that multicasting cannot be supported. In such IP subnets the multicast service may have to be implemented using a multicast server. The translation from IP addresses to IB MGIDs is independent of the IB fabric's multicast capability.

3.2.1 Mapping IP multicast to IB multicast

Well known IP multicast groups are defined for both IPv4 and IPv6 (RFC_1700, RFC_2373). Multicast groups may also be dynamically created at any time. To avoid creating unnecessary duplicates of multicast packets in the fabric, and to avoid unnecessary handling of such packets at the hosts, it is desirable to associate each of the IP multicast groups with a different IB multicast GID.

A process MUST be defined for mapping the IP multicast addresses to unique IB multicast addresses. Every IPoIB node MUST be capable of making this mapping decision independently.

3.2.2 Transient flag in IB MGIDs

The IB specification describes the flag bits as discussed in section 1.2.1.2. The IB specification also defines some well known IB MGIDs. Any mapping that is defined from IP multicast addresses therefore MUST NOT fall into IB's definition of a well-known address.

Therefore all IPoIB related multicast GIDs will always set the transient bit.

3.3 IP subnets across IB subnets ?

Some implementations may desire to support multiple clusters of machines in their own IB subnets but otherwise part of a common IP subnet. For such a solution the IB specification needs multiple upgrades:

1) A method for creating IB multicast GIDs that span multiple IB subnets. The partition keys and other parameters need to be consistent across IB subnets.

2) An IB routing protocol to determine the IB topology across IB subnets.

3) A definition of the process and protocols needed between IB nodes and IB routers.

Until the above conditions are met it is not possible to define IPoIB subnets that span IB subnets.
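
As an illustration of the FFxy MGID layout from section 1.2.1.2 and of the transient-bit rule in section 3.2.2, the sketch below extracts the flag and scope nibbles from a multicast GID. The helper name parse_mgid and the example transient MGID are assumptions made for this illustration; only FF02::1 comes from the text.

```python
import ipaddress

# Assigned scope values from Table 1; all others are unassigned or reserved.
SCOPE_NAMES = {
    0x2: "link-local",
    0x5: "site-local",
    0x8: "organization-local",
    0xE: "global",
}


def parse_mgid(text):
    """Return (transient, scope value, scope name) for a multicast GID."""
    raw = ipaddress.IPv6Address(text).packed
    if raw[0] != 0xFF:
        raise ValueError("not a multicast GID: first byte is not 0xFF")
    flags = raw[1] >> 4        # the x nibble: 000T
    scope = raw[1] & 0x0F      # the y nibble
    transient = bool(flags & 0x1)
    if scope in SCOPE_NAMES:
        name = SCOPE_NAMES[scope]
    elif scope in (0x0, 0xF):
        name = "reserved"
    else:
        name = "unassigned"
    return transient, scope, name


# The well known ALL_CHANNEL_ADAPTERS address: T=0, link-local scope.
print(parse_mgid("ff02::1"))

# A made-up transient MGID of the kind IPoIB mappings must produce (T=1).
print(parse_mgid("ff12::1234"))
```

A check of this kind could be used by an implementation to confirm that an MGID derived from an IP multicast address has the transient bit set before attempting to join it.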
The IPoIB architecture however is capable of providing IP subnets across IB subnets if the underlying IB fabric provides the infrastructure.

The scope bits for the IP to IB mapping will be chosen as follows:

The link-local scope bits will always be used in the mapping first. If the IB multicast group so formed cannot be joined at the SM, the site/organisation/global scope bits will be used in the order listed. The first multicast group to be joined by a host is always the one corresponding to the all-IP nodes in the subnet. The scope bits for the rest of the mappings will be the scope bits that provided a successful IB mapping for the broadcast/all-IP nodes multicast group.

3.4 Multicast address to LID mapping

In a generic LAN setup the IP multicast addresses are mapped to the destination link layer address directly. In the case of InfiniBand this is only partly true. A mapping of multicast IP to IB GIDs can be standardised. But the IPoIB driver on the host must determine the LID that needs to be used when sending to the particular multicast group.

A mapping from the IP multicast address or the corresponding IB multicast group to a LID is not required for the following reasons:

1) Sending/receiving IP multicast

   An IB node cannot be assured of its packets reaching all the multicast members without itself joining the IB multicast group. This is because the relevant switches are programmed by the IB subnet manager only on receiving a join request. Thus the sender/receiver will always have to join the IB multicast groups and keep track of the groups it has already joined. Mapping directly to the LID doesn't help if the group has not been joined.

   Thus the implementation is required to keep track of the IB groups joined. It can therefore also record the corresponding LID, removing the need to map the IP multicast address to the LID.
2) Reduction of LID conflicts

   The LIDs in the range 0xC000 to 0xFFFE are designated as the multicast LIDs by IBA. This limits the range to 2^14 - 1 entries (16383 entries). Since IPv4 provides 2^28 multicast group addresses, this implies that about 2^14 (16K) IPv4 multicast groups could map to a single LID.

   It is better to let the SM decide on a more efficient usage of the multicast LID space.

3) SM and IB architecture should stay unaffected.

   A mapping of the LIDs can conflict with the SM implementations. The SM is under no restrictions to choose a particular LID for any multicast group. Thus it could end up utilising a LID that maps from an IP multicast address for some other multicast group, since not everything on IB subnets is governed by the IPoIB rules.

4) No need to plan for LID conflicts

   Allowing the SM to decide on the LIDs also avoids having to come up with a solution to handle LID conflicts with other multicast groups.

Thus it is best to avoid such a mapping and leave it to the individual implementations to determine the LID from the SM. There is no extra work involved in this determination since the SM has to be contacted anyway for the IB multicast group join/create operations.

IPoIB WILL NOT standardise IP multicast addresses to LID mapping.

4.0 IP subnets in InfiniBand fabrics

The IPoIB subnet is overlaid over the IB subnet. The IPoIB subnet is brought up in the following steps:

Note: the join/leave operations at the IP level will be referred to as IP_join/IP_leave and the join/leave operations at the IB level will be referred to as IB_join/IB_leave in this document.

1. The all-IP nodes group MUST be created

   It is a MUST that the administrator set up the IB multicast group corresponding to all-IP nodes/IPv4 broadcast (henceforth called 'broadcast group') when the IP(v4/v6) subnet is set up. The method by which the broadcast group is set up is not defined by IPoIB.

2.
All IPoIB interfaces IB_join the broadcast group

   The administrator chooses the parameters that are valid for the multicast group: P_Key, Q_Key, Hop Limit, Flow ID, TClass and the MTU. All multicast transmissions in the IP subnet must use these values. Therefore any other multicast groups set up in the IPoIB subnet MUST be set up with these attributes. In the future, as the IB specification associates more meaning with the various values and defines IB QoS, different values for IP multicast traffic may be possible.

   The IB_join of the broadcast group by the IPoIB nodes builds the IPoIB subnet. The broadcast group defines the span and the members of the IPoIB subnet. The IB_join to the broadcast group has the additional benefit of distributing these values to all the members of the subnet.

   The IP interface MTU for the IP over Unreliable Datagram interface is the path MTU value returned when the broadcast MGID is joined. This is the largest MTU that can be used across the IPoIB subnet without fragmenting. The IPoIB specification for IP over non-UD modes of transmission MUST also define the MTU that can be used with it.

4.1 IPoIB VLANs

In an IB subnet, to communicate with one another, the endpoints must have compatible P_Keys. Thus the administrator when setting up an IP subnet over an IB subnet must ensure that all the members have compatible P_Keys. An endpoint may however have multiple P_Keys.

The IB architecture specifies that there can be only one MGID associated with a multicast group in the IB subnet. The P_Key can be included in the MGID mappings from the IP multicast addresses. If there is only one IPv4/v6 subnet in the IB subnet the P_Key value used in the mapping may be set to 0. Since the P_Key is unique in the IB subnet the inclusion of the P_Key in the IB MGIDs ensures unique MGID mappings are created.
Every unique broadcast group MGID so formed creates a separate abstract IPoIB link and hence an IPoIB VLAN.

It is an implementation choice how the P_Key related to the IPoIB subnet is determined by the IP stack. It could be a configuration parameter initialised by some means by the administrator. An implementation MAY choose to have the interface join all of the MGIDs possible using the P_Keys in the P_KeyTable of the associated port.

In the absence of multiple IPoIB VLANs (different partitions) a value of 0 for the P_Key in the MGID is a valid value. This does not imply the partition's P_Key is zero but that the value used in the translation to IB MGIDs is 0. In this case the P_Key is returned to the node on a successful IB_join of the broadcast group.

4.2 Multicast in IPoIB subnets

IP multicast on InfiniBand subnets follows the same concepts and rules as on any other media. However, unlike most other media, multicast over InfiniBand requires interaction with another entity, the IB subnet manager. This section describes the outline of the process and also suggests some guidelines.

IB architecture specifies the following format for IB multicast packets when used over unreliable datagram (UD) mode:

+--------+-------+---------+---------+-------+---------+---------+
|Local   |Global |Base     |Datagram |Packet |Invariant| Variant |
|Routing |Routing|Transport|Extended |Payload|   CRC   |   CRC   |
|Header  |Header |Header   |Transport| (IP)  |         |         |
|        |       |         |Header   |       |         |         |
+--------+-------+---------+---------+-------+---------+---------+

For details about the various headers please refer to the InfiniBand Architecture Specification.

The Global routing header (GRH) includes the IB multicast group GID. The Local routing header (LRH) includes the local identifier (LID). The IB switches in the fabric route the packet based on the LID.
The GID is made available to the receiving IB user (the IPoIB interface driver for example). The driver can therefore determine the IB group the packet belongs to.

IPv4 defines three levels of multicast support. These are:

   Level 0: No support for IP multicasting
   Level 1: Support for sending but not receiving multicasts
   Level 2: Full support for IP multicasting

In IPv6 there is no such distinction. Full multicast support is mandatory.

Additionally, all IPv4 subnets support broadcast (255.255.255.255) and there is no interface associated with broadcast reception. The standard case of broadcast is covered by the requirement that the multicast MGID must exist for an IPoIB subnet to be formed. Thus level 0 IPv4 multicast support is available by default.

4.2.1 Sending IP multicast datagrams

An IP host may send a multicast packet at any time to any multicast address.

The join/leave of IB groups will be referred to as IB_join/IB_leave in this document. The corresponding IP level join/leave will be referred to as IP_join/IP_leave.

The IP layer conveys the multicast packet to the IPoIB interface driver/module. This module attempts to IB_join the relevant IB multicast group. This is required since otherwise there is no guarantee that the packet will reach its destinations.

The IB_join could fail if the IB group has not been created. This could imply that there are no listeners on the subnet and the router doesn't expect to forward packets received on this group. In such a case the module would be justified in dropping the packet. However, this may not be the case. The IB group may not exist because the SM ran out of resources or the SM policy allows only a limited set of multicast groups to be created.

Additionally it is not reasonable to expect the router to create IB groups for all the IP multicast addresses that it may be called upon to forward.
Therefore, the multicast module of the IPoIB interface, when sending a multicast packet, MUST do one of the following:

1) Join the IB multicast group corresponding to the IP multicast address.

   This is the RECOMMENDED option for multicast if the sender is itself a member of the group.

   As noted earlier, a particular IB multicast group may not exist for some reason. In such a case the implementation MUST fall back to one of the following methods.

2) Send the multicast packet out with the IB MGID/MLID associated with the all-systems IP multicast address (224.0.0.1/FF02::1).

   An implementation implementing 1) described above must fall back to this condition or the condition given below on failure to join the IB group corresponding to the IPv4 multicast address being sent to.

3) In IPv4 subnets, if both the above conditions fail then the packet MUST be sent with the IB MGID/MLID corresponding to the IPv4 limited broadcast address (255.255.255.255).

4.2.2 Receiving multicast packets

An IP host sends an IGMP/MLD report to the router(s) when it wants to receive packets on a multicast group. The router could then create the IB group. However to receive the packets the IP host must join the corresponding IB multicast group. Therefore, it is simpler for the IB interface module on the IP host to first create the IB group and then send the IGMP message to the router. The router will then IB_join the specified IB group.

It may also be that an IPoIB subnet doesn't have any routers. In such a case the non-existent router cannot be relied on to create the IB groups.

The router MAY choose to create IB groups corresponding to the IP groups it expects to forward.

Thus the creation of IB groups is done by IP receivers or IP routers only and not by senders, thereby keeping things simple. The host must first try to join the group and only on failure attempt to create it.
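
The sender fallback order of section 4.2.1 and the join-before-create rule for receivers can be sketched as follows. This is a toy model under stated assumptions: ToySubnetManager is a hypothetical in-memory stand-in for the SM and is not a real IB interface, the MGID strings are placeholder labels because this memo does not define the IP to MGID mapping, and the final broadcast fallback applies to IPv4 subnets only.

```python
class ToySubnetManager:
    """Hypothetical in-memory stand-in for the SM; not a real IB API."""

    def __init__(self, existing=(), allow_create=True):
        self.groups = set(existing)
        self.allow_create = allow_create

    def join(self, mgid):
        # In this model an IB_join succeeds only if the group exists.
        return mgid in self.groups

    def create(self, mgid):
        # Creation may be refused, e.g. by SM policy or lack of resources.
        if not self.allow_create:
            return False
        self.groups.add(mgid)
        return True


def choose_send_mgid(sm, group_mgid, all_systems_mgid, broadcast_mgid):
    """Return the first MGID the sender can IB_join, in the memo's order."""
    for candidate in (group_mgid, all_systems_mgid, broadcast_mgid):
        if sm.join(candidate):
            return candidate
    return None  # not expected: the broadcast group must always exist


def listener_join(sm, group_mgid):
    """Try to IB_join first; create the group only if the join fails."""
    if sm.join(group_mgid):
        return "joined"
    if sm.create(group_mgid) and sm.join(group_mgid):
        return "created"
    return "fallback"


# Placeholder labels standing in for real MGIDs.
sm = ToySubnetManager(existing={"mgid-broadcast"})
before = choose_send_mgid(sm, "mgid-group", "mgid-all-systems", "mgid-broadcast")
status = listener_join(sm, "mgid-group")
after = choose_send_mgid(sm, "mgid-group", "mgid-all-systems", "mgid-broadcast")
print(before, status, after)
```

In this model the sender never creates a group: until a listener has created the group, the sender's packets go out on a fallback MGID, and once the group exists the sender joins it directly.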

4.2.2.1 IB_join of MGIDs by a listener

A multicast listener takes the following steps when it IP_joins the IP multicast group:

1) The IPoIB interface IB_joins the corresponding IB MGID.

2) If step 1) fails:
   The IPoIB interface creates the IB MGID group and IB_joins it.

3) If step 2) fails:
   The IPoIB interface records the IB MGID/MLID it will be using for the IP multicast group. This decision is based on the steps outlined in section 4.2.1. The IGMP/MLD report is then sent out. The MGID/MLID pair in the report therefore may not correspond to the IP multicast address.

4) It may be that the IB MGID could not be created/joined because of a transient error or resource constraint at the SM. It may also be created at a later point in time. The listener therefore would not be in the IB MGID corresponding to the IP address. Unfortunately there is no IB level support to let the listener know of the new IB MGID being created.

   If the underlying IB level indicated a transient failure, the listener periodically retries to join the IB group. Note that multicasting can still continue since the packets can be sent out on the broadcast MGID. A configuration parameter, dependent on the underlying IB subnet's requirements, MUST be set that determines how often the retries can be done.

4.2.3 Leaving/Deleting a multicast group

An IPv4 sender (level 1 compliance) IB_joins the IB multicast group only because that is the only way to guarantee reception of the packets by all the group recipients. The sender must however IB_leave the group at some time. It is RECOMMENDED that a sender, when not a receiver on the group, start a timer per multicast group sent to. The sender leaves the IB group when the timer goes off. It restarts the timer if another message is sent. It is RECOMMENDED that the duration of the timer be 1200 seconds. This recommendation doesn't apply to the IB broadcast group.
It also doesn't apply to the IB group corresponding to the all-hosts multicast group. An IPv4 host MUST always remain a member of the broadcast group. It MAY choose to remain a member of the all-hosts group. Thus a sender that chooses to always send to the broadcast group and not to the specific multicast group does not need to implement a timer.

An IP multicast receiver MUST IB_leave the corresponding IB multicast group when it IP_leaves the IP multicast group. In the case of an IPv4 implementation, the receiver may choose to continue to be a sender (level 1 compliance). It MAY choose not to IB_leave the IB group but to start a timer as explained above.

A router is RECOMMENDED to IB_leave the IB multicast group when there are no members of the IP multicast address in the subnet and it has no explicit knowledge of any need to forward such packets.

The router and the IP hosts MUST NOT IB_delete the IB multicast group when they IB_leave the group. It is possible for the same IB multicast group to be used by a non-IP protocol. The IB specification mentions an IB specific protocol that will delete the IB groups when it determines that there are no IB members of the group.

5.0 QoS and related issues

[ WG input is solicited on this issue ]

The IB specification suggests the use of service levels (SLs) for load balancing, QoS and deadlock avoidance within an IB subnet. But the IB specification leaves the usage and mode of determination of the SL for the application to decide. The SL and list of SLs are available in the SA, but it is up to the endnode's application to choose the 'right' value.

IP is one such IB application, and so IPoIB needs to define a set of rules on the choice of the SL.

The IPoIB implementations MUST map the QoS request to the right SL based on the IB's QoS policies. This mapping in itself is not an IPoIB issue.
However, a policy needs to be defined that lets an IPoIB node know the method to adopt to determine the SL. This is especially the case if the same IP subnet spans multiple IB subnets. The policy must address the issue of whether the SL must be mapped as per IB's QoS parameters (when they are defined), determined only from the SA, or determined in an implementation dependent way, etc. It must especially address the IP best-effort case.

6.0 Security Considerations

Any multicast/broadcast communication is inherently insecure since anyone can receive the data. The applications must implement appropriate authentication/encryption methods for data security.

The IP subnet communication can be disrupted by creating the IB broadcast/multicast groups with incompatible parameters. The implementations must leverage IB specific methods to protect against such situations.

7.0 References

[IB_ARCH]  InfiniBand Architecture Specification, Volume 1.0
[RFC_2373] IP Version 6 Addressing Architecture
[RFC_2375] IPv6 Multicast Address Assignments
[RFC_1700] Assigned Numbers
[RFC_1112] Host extensions for IP multicasting
[RFC_2236] Internet Group Management Protocol, Version 2
[RFC_2710] Multicast Listener Discovery

8.0 Author's Address

Vivek Kashyap
IBM
15450, SW Koll Parkway
Beaverton, OR 97006

Work: 503 578 3422
Email: vivk@us.ibm.com

Full Copyright Statement

Copyright (C) The Internet Society (2001). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works.
However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.