INTERNET DRAFT                                                       V.Kashyap
<draft-kashyap-ipoib-requirements-01.txt>                                  IBM
Expiration Date: May 14, 2002                                November 14, 2001

                IP over InfiniBand (IPoIB) Architecture


Status of this memo

        This document is an Internet-Draft and is in full conformance
        with all provisions of Section 10 of RFC 2026.

        Internet-Drafts are working documents of the Internet
        Engineering Task Force (IETF), its areas, and its working
        groups. Note that other groups may also distribute working
        documents as Internet-Drafts.

        Internet-Drafts are draft documents valid for a maximum of six
        months and may be updated, replaced, or obsoleted by other
        documents at any time. It is inappropriate to use
        Internet-Drafts as reference material or to cite them other
        than as ``work in progress''.

        The list of current Internet-Drafts can be accessed at
        http://www.ietf.org/ietf/1id-abstracts.txt

        The list of Internet-Draft Shadow Directories can be accessed
        at http://www.ietf.org/shadow.html

        This memo provides information for the Internet community.
        This memo does not specify an Internet standard of any kind.
        Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

Abstract

        InfiniBand is a high speed, channel based interconnect between
        systems and devices.

        This memo presents an overview of the InfiniBand architecture.
        It further describes the requirements for transmission of IP
        over InfiniBand.


Kashyap                                                         [Page 1]


INTERNET-DRAFT             IPoIB architecture          November 14, 2001


Table of Contents

        1.0     Introduction to InfiniBand
        1.1     InfiniBand Architecture Specification
        1.2     Overview of InfiniBand Architecture
        1.2.1   InfiniBand Addresses
        1.2.1.1 Unicast GIDs
        1.2.1.2 Multicast GIDs
        1.2.2   InfiniBand Multicast Groups
        2.0     Management of InfiniBand subnet
        3.0     IP over IB requirements
        3.1     InfiniBand as link layer
        3.2     Multicast support
        3.2.1   Mapping IP multicast to IB multicast
        3.2.2   Transient flag in IB MGIDs
        3.3     IP subnets across IB subnets ?
        3.4     Multicast address to LID mapping
        3.5     IP encapsulation
        4.0     IP subnets in InfiniBand fabrics
        4.1     IPoIB VLANs
        4.2     Multicast in IPoIB subnets
        4.2.1   Sending IP multicast datagrams
        4.2.2   Receiving multicast packets
        4.2.2.1 IB_join of MGIDs by a listener
        4.2.3   Leaving/Deleting a multicast group
        5.0     QoS and related issues
        6.0     Security Considerations
        7.0     References
        8.0     Author's address

1.0 Introduction to InfiniBand

        The InfiniBand Trade Association (IBTA) was formed to develop
        an I/O specification to deliver a channel based, switched
        fabric technology. The InfiniBand standard is aimed at meeting
        the requirements of scalability, reliability, availability and
        performance of servers in data centers.

1.1 InfiniBand Architecture Specification

        The InfiniBand Trade Association specification, version 1.0,
        is available for download from http://www.infinibandta.org.

1.2 Overview of InfiniBand Architecture

        For a more complete overview the reader is referred to
        chapter 3 of the InfiniBand specification.


        InfiniBand Architecture (IBA) defines a System Area Network
        (SAN) for connecting multiple independent processor platforms,
        I/O platforms and I/O devices. The IBA SAN is a communications
        and management infrastructure supporting both I/O and
        inter-processor communications for one or more computer
        systems.

        An IBA SAN consists of processor nodes and I/O units connected
        through an IBA fabric made up of cascaded switches and IB
        routers (connecting IB subnets). I/O units can range in
        complexity from single ASIC IBA attached devices such as a LAN
        adapter to a large memory rich RAID subsystem.

        An IBA network is subdivided into subnets interconnected by IB
        routers. These are IB routers and IB subnets, not IP routers
        or IP subnets.

        Each IB node or switch may attach to one or more switches, or
        directly to another node or switch. Each node interfaces
        with the link by way of channel adapters (CAs). The
        architecture supports multiple CAs per unit with each CA
        providing one or more ports that connect to the fabric. Each
        CA appears as a node to the fabric.

        The ports are the endpoints to which the data is sent.
        However, each of the ports may include multiple QPs (queue
        pairs) that may be directly addressed from a remote peer. From
        the point of view of data transfer the QP number (QPN) is part
        of the address.

        IBA supports both connection oriented and datagram service
        between the ports. The peers are identified by QPN and the
        port identifier. In raw datagram mode the QPN is not used.

        A port may be identified by a local ID (LID) and optionally a
        Global ID (GID).

        The GID is 128 bits long and is formed by the concatenation of
        a 64 bit IB subnet prefix and a 64 bit EUI-64 compliant
        portion (GUID). The LID is a 16 bit value that is assigned
        when the port becomes active. Note that the GUID is the only
        persistent identifier of a port. However, it cannot be used as
        an address in a packet. If the prefix is modified then the GID
        may change. The subnet manager may attempt to keep the LID
        values constant across reboots but that is not a requirement.
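
        As an illustration of the GID layout described above, the
        following Python sketch concatenates a 64 bit subnet prefix
        and a 64 bit GUID into a 128 bit GID. The function name and
        the sample GUID are hypothetical; they are not taken from the
        IB specification.

```python
import ipaddress


def make_gid(subnet_prefix: int, guid: int) -> ipaddress.IPv6Address:
    """Concatenate a 64 bit IB subnet prefix and a 64 bit GUID into a GID."""
    if not 0 <= subnet_prefix < 2**64:
        raise ValueError("subnet prefix must fit in 64 bits")
    if not 0 <= guid < 2**64:
        raise ValueError("GUID must fit in 64 bits")
    # The prefix occupies the upper 64 bits, the GUID the lower 64 bits.
    return ipaddress.IPv6Address((subnet_prefix << 64) | guid)


# Hypothetical GUID combined with a link-local subnet prefix.
EXAMPLE_GID = make_gid(0xFE80000000000000, 0x0002C90300001234)
```

        Because a GID is a valid IPv6 address, the sketch reuses the
        standard IPv6 address type. Changing the prefix changes the
        GID while the GUID portion stays the same.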

        The assignment of the GID and the LID is done by the subnet
        manager. Every IB subnet has at least one subnet manager
        component that controls the fabric. It assigns the LIDs and
        GIDs. The subnet manager also programs the switches so that
        they route packets between destinations. The subnet manager
        and a related component, the subnet administrator (SA) are the
        central repository of all information that is required to
        set up and bring up the fabric.

        IB routers are components that route packets between IB
        subnets based on the GIDs. Thus a packet that stays within an
        IB subnet may or may not include a GID, but a packet that
        crosses IB subnets must include one. A LID is always needed
        in a packet since the destination within a subnet is
        determined by it.

        A CA and a switch may have multiple ports. Each CA port is
        assigned its own LID or a range of LIDs. The ports of a switch
        are not addressable by LIDs/GIDs or, in other words, are
        transparent to other end nodes. Each port has its own set of
        buffers. The buffering is channeled through virtual lanes (VL)
        where each VL has its own flow control. There may be up to 16
        VLs.

        VLs provide a mechanism for creating multiple virtual links
        within a single physical link. All ports however must support
        VL15 which is reserved exclusively for subnet management
        datagrams and hence doesn't concern the IPoIB discussions. The
        actual VL that a port uses is configured by the SM and is
        based on the Service Level (SL) specified in every packet.
        There are 16 possible SLs.

        In addition to the features described above viz. Queue Pairs
        (QPs), Service Levels (SLs) and addressing (GID/LID), IBA also
        defines the following:

        P_Keys or partition keys: Every packet, except for raw
        datagrams, carries the partition key (P_Key). These values are
        used for isolation in the fabric. A switch may optionally be
        programmed by the SM to drop packets not having a certain
        key. The CA ports always check for the P_Keys. A CA port may
        belong to multiple partitions.


        Q_Keys: These are used to enforce access rights for reliable
        and unreliable IB datagram services. Raw datagram services
        don't require this value. At communication establishment the
        endpoints exchange the Q_Keys and must always use the relevant
        Q_Keys when communicating with one another.


        Multicast support: A switch may support multicasting, i.e.
        replication of packets across multiple output ports. This is
        an optional feature. A multicast group is identified by a GID.
        The GID format is as defined in [RFC 2373] on IPv6 addressing.
        Thus, from the point of view of IPv6 over IB, the data link
        multicast address looks like the network address. An IB node
        must explicitly join a multicast group by a request to the SM
        to receive packets. A node may send packets to any multicast
        group. In both cases the multicast LID to be used in the
        packets is received from the SM.

        There are 6 transport types specified by the IB architecture.
        These are:

        1. Unreliable Datagram (unacknowledged - connectionless)
                The UD service is connectionless and unacknowledged.
                It allows the QP to communicate with any unreliable
                datagram QP on any node.

                The switches and hence each link can support only a
                certain MTU. The MTU ranges are 256 bytes, 512 bytes,
                1024 bytes, 2048 bytes, 4096 bytes. A UD packet cannot
                be larger than the smallest link MTU between the two
                peers.

        2. Reliable Datagram    (acknowledged - multiplexed)
                The RD service is multiplexed over connections between
                nodes called End to end contexts (EEC) which allows
                each RD QP to communicate with any RD QP on any node
                with an established EEC. Multiple QPs can use the same
                EEC and a single QP can use multiple EECs (one for
                each remote node per reliable datagram domain).

        3. Reliable Connected (acknowledged - connection oriented)
                The RC service associates a local QP with one and only
                one remote QP. Messages may be as large as 2^31 bytes
                in length. The CA implementation takes care of
                segmentation and reassembly.

        4. Unreliable Connected (unacknowledged - connection oriented)
                The UC service associates one local QP with one and
                only one remote QP. There is no acknowledgment and
                hence no resend of lost or corrupted packets. Such
                packets are therefore simply dropped. It is similar to
                RC otherwise.

        5. Raw Ethertype (unacknowledged - connectionless)
                The Ethertype raw datagram packet contains a generic
                transport header that is not interpreted by the CA but
                it specifies the protocol type. The values for
                ethertype are the same as defined in RFC1700 for
                ethertype.

        6. Raw IPv6 (unacknowledged - connectionless)
                Using IPv6 raw datagram service, the IBA CA can
                support standard protocol layers atop IPv6 (such as
                TCP/UDP). Thus native IPv6 packets can be bridged into
                the IBA SAN and delivered directly to a port and to
                its IPv6 raw datagram QP.

        The first 4 are referred to as IB transports. The latter two
        are classified as Raw datagrams. There is no indication of the
        QP number in the raw datagram packets. The raw datagram
        packets are limited by the link MTU in size.
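
        The six transport types and the link MTU rule above can be
        summarised in a short Python sketch. The class, field and
        function names are hypothetical and chosen only for this
        illustration; the properties recorded are the ones stated in
        the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transport:
    name: str
    acknowledged: bool
    service: str  # "connectionless", "multiplexed" or "connection oriented"
    raw: bool     # True for the two raw datagram types


TRANSPORTS = (
    Transport("Unreliable Datagram", False, "connectionless", False),
    Transport("Reliable Datagram", True, "multiplexed", False),
    Transport("Reliable Connected", True, "connection oriented", False),
    Transport("Unreliable Connected", False, "connection oriented", False),
    Transport("Raw Ethertype", False, "connectionless", True),
    Transport("Raw IPv6", False, "connectionless", True),
)

# Link MTU values listed above, in bytes.
LINK_MTUS = (256, 512, 1024, 2048, 4096)


def max_datagram_size(path_link_mtus):
    """Largest UD or raw datagram packet that fits every link on the path."""
    mtus = list(path_link_mtus)
    if not mtus:
        raise ValueError("path must contain at least one link")
    for mtu in mtus:
        if mtu not in LINK_MTUS:
            raise ValueError(f"unsupported link MTU: {mtu}")
    # A datagram cannot exceed the smallest link MTU between the peers.
    return min(mtus)
```

        For example, a path crossing links with MTUs of 2048, 1024
        and 4096 bytes limits a UD packet to 1024 bytes.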

1.2.1 InfiniBand Addresses

        The InfiniBand architecture borrows heavily from the IPv6
        architecture in terms of the InfiniBand subnet structure and
        global identifiers (GIDs).

        The InfiniBand architecture defines the global identifier
        associated with a port as follows:

                GID (Global Identifier): A 128-bit unicast or
                multicast identifier used to identify a port on a
                channel adapter, a port on a router, a switch, or a
                multicast group. A GID is a valid 128-bit IPv6 address
                (per RFC 2373) with additional properties/restrictions
                defined within IBA to facilitate efficient discovery,
                communication, and routing. Note: These rules apply
                only to IBA operation and do not apply to raw IPv6
                operation unless specifically called out.

        The raw IPv6 operation referred to in the note in the
        definition above is the IPv6 mode of InfiniBand's raw datagram
        service. It does not mean IPv6 itself. The routers and
        switches referred to in the above definition are the
        InfiniBand routers and switches.

        The InfiniBand (IB) specification defines two types of GIDs:
        unicast and multicast.

1.2.1.1 Unicast GIDs

        The unicast GIDs are defined, as in IPv6, with three scopes.
        The IB specification states:

                a. link local: This is defined to be FE80/10.
                               The IB routers will not forward packets
                               with a link local address in source or
                               destination beyond the IB subnet.

                b. site local: FEC0/10
                               A unicast GID used within a collection
                               of subnets which is unique within that
                               collection (e.g. a data center or
                               campus) but is not necessarily globally
                               unique. IB routers must not forward any
                               packets with either a site-local Source
                               GID or a site-local Destination GID
                               outside of the site.

                c. global: A unicast GID with a global prefix, i.e. an
                           IB router may use this GID to route packets
                           throughout an enterprise or internet.
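
        A minimal Python sketch of the three unicast scopes follows.
        It assumes that any unicast GID outside FE80/10 and FEC0/10
        can be treated as global, which is a simplification made for
        this example. The function names are hypothetical.

```python
import ipaddress

LINK_LOCAL = ipaddress.ip_network("fe80::/10")
SITE_LOCAL = ipaddress.ip_network("fec0::/10")
MULTICAST = ipaddress.ip_network("ff00::/8")


def unicast_gid_scope(gid: str) -> str:
    """Classify a unicast GID as link-local, site-local or global."""
    addr = ipaddress.IPv6Address(gid)
    if addr in MULTICAST:
        raise ValueError("not a unicast GID")
    if addr in LINK_LOCAL:
        return "link-local"
    if addr in SITE_LOCAL:
        return "site-local"
    # Assumption of this sketch: everything else is treated as global.
    return "global"


def router_may_forward_off_subnet(src_gid: str, dst_gid: str) -> bool:
    """An IB router keeps packets with a link-local GID inside the subnet."""
    scopes = (unicast_gid_scope(src_gid), unicast_gid_scope(dst_gid))
    return "link-local" not in scopes
```

        The site boundary check for site-local GIDs is left out since
        it depends on how a site is configured.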

1.2.1.2  Multicast GIDs

        The multicast GIDs also parallel the IPv6 multicast addresses.
        The IB specification defines the multicast GIDs as follows:

                FFxy:<112 bits>

        Flag bits:

        The nibble denoted by x above holds the 4 flag bits: 000T. The
        first three bits are reserved and are set to zero. The last
        bit is defined as follows:

                T=0: denotes a permanently assigned i.e. well known GID
                T=1: denotes a transient group

        Scope bits:

        The 4 bits, denoted by y in the GID above, are the scope bits.
        These are defined as:

                Scope value                  Scope

                    0                        Reserved
                    1                        Unassigned
                    2                        Link-local
                    3                        Unassigned
                    4                        Unassigned
                    5                        Site-local
                    6                        Unassigned
                    7                        Unassigned
                    8                        Organization-local
                    9                        Unassigned
                    0xA                      Unassigned
                    0xB                      Unassigned
                    0xC                      Unassigned
                    0xD                      Unassigned
                    0xE                      Global
                    0xF                      Reserved

                                Table 1
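
        The flag and scope nibbles can be extracted as in the
        following Python sketch. The function name is hypothetical;
        the scope names are those of Table 1.

```python
import ipaddress

SCOPES = {
    0x2: "link-local",
    0x5: "site-local",
    0x8: "organization-local",
    0xE: "global",
}


def parse_mgid(mgid: str):
    """Return (transient, scope name) for a GID of the form FFxy:<112 bits>."""
    raw = ipaddress.IPv6Address(mgid).packed
    if raw[0] != 0xFF:
        raise ValueError("not a multicast GID")
    flags = raw[1] >> 4    # the nibble x: 000T
    scope = raw[1] & 0x0F  # the nibble y
    if flags & 0b1110:
        raise ValueError("reserved flag bits must be zero")
    if scope in (0x0, 0xF):
        name = "reserved"
    else:
        name = SCOPES.get(scope, "unassigned")
    return bool(flags & 0b0001), name
```

        With this sketch FF02::1 parses as a permanently assigned,
        link-local group, while FF12::1 parses as a transient,
        link-local group.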

        The IB specification further refers to [RFC_2373] and
        [RFC_2375] while defining the well known multicast addresses.
        However, it then states that the well known addresses apply to
        IB raw IPv6 datagrams only. The IB unreliable datagram (UD)
        service recognises only one well known multicast address. This
        is the ALL_CHANNEL_ADAPTERS multicast address defined to be
        FF02::1. The scope of this address is limited to a single IB
        subnet. It must be noted though that a multicast group can be
        associated with only a single MGID. Thus the same MGID cannot
        be associated with the UD mode and the raw datagram mode.

1.2.2 InfiniBand Multicast Groups

        IB multicast groups (multicast GIDs) are managed by the subnet
        manager (SM). The SM explicitly programs the IB switches in the
        fabric to ensure that the packets are received by all the
        members of the multicast group.

        When the group is created a create request is sent to the SM.
        The subnet manager records the group GIDs and the associated
        characteristics. The group characteristics are defined by the
        group path MTU, whether the group will be used for raw
        datagrams or unreliable datagrams, the service level, the
        partition key associated with the group, the LID (local
        identifier) associated with the group etc. These
        characteristics are defined at the time of the group
        creation.

        The LID is associated with the multicast group by the subnet
        manager (SM) at the time of the multicast group creation. An IB
        node may request a specific LID be associated with a group.
        The SM determines the multicast tree based on all the group
        members and programs the relevant switches. The LID is used by
        the switches to route the packets.

        Any member IB node wanting to participate in the group must
        join the group. As part of the join operation the node is
        returned the group characteristics. At the same time the
        subnet manager ensures that the requester can indeed participate
        in the group by verifying that it can support the group MTU
        and that it can reach the rest of the group members. Other
        group characteristics may need verification too.

        The SM, for groups that span IB subnet boundaries, must interact
        with IB routers to determine the presence of this group in other
        IB subnets. If present the MTU must match across the IB subnets.

        P_Key is another characteristic that must match across IB subnets
        since the P_Key inserted into a packet is not modified by the
        IB switches or IB routers. Thus if the P_Keys didn't match,
        the IB router(s) might drop the packets or destinations on
        other subnets might drop them.

        These characteristics are returned to the IB endnode that
        joins the multicast group. A join operation may cause the SM
        to reprogram the fabric so that the new member can participate
        in the multicast group.
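
        The checks made at join time can be sketched in Python as
        follows. This is not the SM interface defined by the IB
        specification: the record layout, the names and the use of a
        P_Key comparison as the reachability check are assumptions
        made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class MulticastGroup:
    mgid: str
    mlid: int           # multicast LID assigned by the SM
    mtu: int            # group path MTU in bytes
    p_key: int
    service_level: int
    members: set = field(default_factory=set)


@dataclass(frozen=True)
class Port:
    lid: int
    max_mtu: int
    p_keys: frozenset   # partitions the port belongs to


class JoinRefused(Exception):
    """Raised when a requester cannot participate in the group."""


def ib_join(group: MulticastGroup, port: Port) -> MulticastGroup:
    # The requester must be able to support the group MTU.
    if port.max_mtu < group.mtu:
        raise JoinRefused("port cannot support the group MTU")
    # Assumption: sharing the group's partition stands in for the
    # accessibility check to the rest of the members.
    if group.p_key not in port.p_keys:
        raise JoinRefused("port is not in the group's partition")
    group.members.add(port.lid)
    # The group characteristics are returned to the joining node.
    return group
```

        A real SM would also reprogram the switches after a
        successful join; that step is outside the scope of the sketch.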

2.0 Management of InfiniBand subnet

        To aid in the monitoring and configuration of InfiniBand
        subnet components a set of MIBs MUST be defined. MIBs are
        needed for the channel adapters and baseboard management to
        allow management of specified device properties and sample
        counters. It must be noted that the management objects
        addressed in the IPoIB documents are for all of the IB subnet
        components and are not limited to IP (over IB).

3.0 IP over IB requirements

        As described above, IB provides a broad set of capabilities to
        choose from when implementing IP over IB.

        It is a requirement that the IPoIB modifications must be of a
        nature that does not require changes in IP and higher layer
        protocols. Nor should it mandate requirements on IP stacks to
        implement special user level programs. It is an aim that the
        IPoIB changes be amenable to modularisation and incorporation
        into existing implementations at the same level as other media
        types.


3.1 InfiniBand as link layer

        InfiniBand (IB) provides multiple methods of packet exchange
        between two endpoints, as was noted above. These are:

                Reliable Connected (RC)
                Reliable Datagram  (RD)
                Unreliable Connected (UC)
                Unreliable Datagram (UD)
                Raw Datagram : Raw IPv6 (R6)
                             : Raw Ethertype (RE)

        IPoIB can be implemented over any, multiple or all of these
        methods. A case can be made for support on any of the methods
        depending on the desired parameters.

        Unreliable datagrams are limited by the link MTU. The
        connected modes, in contrast to this limitation, can offer
        significant benefit in terms of performance by utilising a
        larger MTU. Reliability is also enhanced if the underlying
        feature of automatic path migration of connected modes is
        utilised. An implementation MAY choose to provide IP over
        non-UD transport modes in addition to the mandatory IP over UD
        function.

        The IB specification requires Unreliable Datagram mode to be
        supported by all the IB nodes. The host channel adapters
        (HCAs) are additionally required to support Reliable
        Connected and Unreliable Connected modes; target channel
        adapters (TCAs) are not. Support for the two Raw Datagram
        modes is entirely optional.

        For the sake of simplicity and ease of implementation and
        integration with existing stacks, it is desirable that the
        fabric support multicasting. This is possible only in
        Unreliable datagram (UD) and IB's Raw datagram modes.

        Given these conditions it is a MUST that an IP stack support
        IP over the UD transport mode of InfiniBand. Support for IP
        over the other modes of IB transport is optional.

        InfiniBand communication is addressed to a QP at a port.
        Therefore the IPoIB interface is identified by the port
        identifier as well as a QP that is associated with it. The
        address resolution process for IPoIB MUST also determine the
        associated QPN along with determining the port identifier.


        An interface MAY be associated with multiple QPNs. This
        provides a mode of implementation wherein a single IP address
        is associated with different QPNs. Such an association may be
        used to demultiplex the incoming packets based on the QPN
        avoiding or reducing the upper-layer port based lookup. This
        amounts to there being multiple MAC addresses associated with
        an endpoint. Any process for providing resolution and support
        of multiple QPNs per IP address MUST provide for
        interoperability with the default version of a single QPN per
        IPoIB interface.
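
        The following Python sketch shows one way a neighbor table
        could hold a port identifier together with its QPN, with an
        optional list of additional QPNs. All names are hypothetical.
        The 24 bit width of the QPN comes from the IB architecture,
        not from this memo, and no encoding of the address is implied.

```python
from dataclasses import dataclass

QPN_BITS = 24  # IBA queue pair numbers are 24 bits wide


@dataclass(frozen=True)
class IPoIBLinkAddress:
    gid: str  # port identifier
    qpn: int  # queue pair receiving IP traffic at that port

    def __post_init__(self):
        if not 0 <= self.qpn < 2**QPN_BITS:
            raise ValueError("QPN out of range")


class NeighborTable:
    """Maps an IP address to one or more (port identifier, QPN) pairs."""

    def __init__(self):
        self._entries = {}

    def learn(self, ip: str, addr: IPoIBLinkAddress) -> None:
        # The first address learned acts as the default single QPN.
        self._entries.setdefault(ip, []).append(addr)

    def resolve(self, ip: str) -> IPoIBLinkAddress:
        """Return the default link address, as a single-QPN peer expects."""
        return self._entries[ip][0]

    def resolve_all(self, ip: str):
        """Return every link address known for ip."""
        return tuple(self._entries[ip])
```

        A peer that only understands one QPN per interface calls
        resolve() and ignores the rest, which keeps the two modes
        interoperable in this sketch.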

3.2 Multicast support

        The InfiniBand specification makes support of multicasting in
        the switches optional. It is RECOMMENDED that multicast
        capable switches be used in IPoIB subnets. Lack of multicast
        capable switches, however, doesn't mean that multicasting
        cannot be supported. In such IP subnets the multicast service
        may have to be implemented using a multicast server.

        The translation from IP addresses to IB MGIDs is independent
        of the IB fabric's multicast capability.

3.2.1 Mapping IP multicast to IB multicast

        Well known IP multicast groups are defined for both IPv4 and
        IPv6 (RFC_1700, RFC_2373). Multicast groups may also be
        dynamically created at any time. To avoid creating unnecessary
        duplicates of multicast packets in the fabric, and to avoid
        unnecessary handling of such packets at the hosts it is
        desirable to associate each of the IP multicast groups with a
        different IB multicast GID.

        A process MUST be defined for mapping the IP multicast
        addresses to unique IB multicast addresses. Every IPoIB node
        MUST be capable of making this mapping decision
        independently.

3.2.2 Transient flag in IB MGIDs

        The IB specification describes the flag bits as discussed in
        section 1.2.1.2. The IB specification also defines some well
        known IB MGIDs. Any mapping that is defined from IP multicast
        addresses therefore MUST NOT fall into IB's definition
        of a well-known address.

        Therefore all IPoIB related multicast GIDs will always set the
        transient bit.
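
        To make the two requirements concrete, the Python sketch
        below derives an MGID from an IPv4 multicast address with the
        transient bit set. The bit layout used here (group bits right
        aligned in the low 28 bits, all other bits zero) is purely an
        assumption for illustration; this memo does not define the
        mapping. The function name is hypothetical.

```python
import ipaddress

TRANSIENT_FLAG = 0x1    # flag nibble 000T with T=1
LINK_LOCAL_SCOPE = 0x2


def ipv4_multicast_to_mgid(group: str, scope: int = LINK_LOCAL_SCOPE):
    """Illustrative mapping of an IPv4 multicast group to a transient MGID."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError("not an IPv4 multicast address")
    if not 0 <= scope <= 0xF:
        raise ValueError("scope must fit in 4 bits")
    # The low 28 bits distinguish one IPv4 multicast group from another.
    low28 = int(addr) & 0x0FFFFFFF
    # First 16 bits: 0xFF, then the flag nibble, then the scope nibble.
    head = (0xFF << 8) | (TRANSIENT_FLAG << 4) | scope
    # ASSUMPTION: the remaining 112 bits carry only the group bits.
    return ipaddress.IPv6Address((head << 112) | low28)
```

        Every node can compute this value on its own, distinct IPv4
        groups yield distinct MGIDs, and the result can never equal a
        well known MGID such as FF02::1 because the transient bit is
        always set.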


3.3 IP subnets across IB subnets ?

        Some implementations may desire to support multiple clusters
        of machines in their own IB subnets but otherwise part of a
        common IP subnet. For such a solution the IB specification
        needs multiple upgrades:

        1) A method for creating IB multicast GIDs that span multiple
           IB subnets. The partition keys and other parameters need to
           be consistent across IB subnets.

        2) An IB routing protocol to determine the IB topology
           across IB subnets.

        3) A definition of the process and protocols needed between
           IB nodes and IB routers.

        Until the above conditions are met it is not possible to
        define IPoIB subnets that span IB subnets. The IPoIB
        architecture however is capable of providing IP subnets across
        IB subnets if the underlying IB fabric provides the
        infrastructure.

        The scope bits for the IP to IB mapping will be chosen as
        follows:

                The link-local scope bits will always be used in the
                mapping first. If the IB multicast group so formed
                cannot be joined at the SM, the
                site/organisation/global scope bits will be used in
                the order listed.

                The first multicast group to be joined by a host is
                always the one corresponding to the all-IP nodes in
                the subnet. The scope bits for the rest of the
                mappings will be the scope bits that provided a
                successful IB mapping for the broadcast/all-IP nodes
                multicast group.
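
        The scope selection rule above can be sketched in Python as
        follows. The callback try_join_broadcast is a hypothetical
        stand-in for an IB_join attempt of the broadcast group at the
        SM with the given scope bits.

```python
# Scope values in the order they are tried:
# link-local, site-local, organization-local, global.
SCOPE_ORDER = (0x2, 0x5, 0x8, 0xE)


def choose_subnet_scope(try_join_broadcast):
    """Return the first scope for which the broadcast group can be joined.

    try_join_broadcast(scope) must return True if the IB_join of the
    broadcast group built with that scope succeeds.
    """
    for scope in SCOPE_ORDER:
        if try_join_broadcast(scope):
            # This scope is then reused for all other IP to IB mappings.
            return scope
    raise RuntimeError("broadcast group could not be joined with any scope")
```

        The returned value would be stored per interface and used
        when building every later MGID for that IPoIB subnet.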

3.4 Multicast address to LID mapping

        In a generic LAN setup the IP multicast addresses are mapped
        to the destination link layer address directly. In the case of
        InfiniBand this is only partly true. A mapping of multicast IP
        to IB GIDs can be standardised. But the IPoIB driver on the
        host must determine the LID that needs to be used when sending
        to the particular multicast group.


        A mapping from the IP multicast address or the corresponding
        IB multicast group to a LID is not required because of the
        following reasons:

                1) Sending/receiving IP multicast

                   An IB node cannot be assured of its packets
                   reaching all the multicast members without itself
                   joining the IB multicast group. This is because the
                   relevant switches are programmed by the IB subnet
                   manager only on receiving a join request.

                   Thus the sender/receiver will always have to join
                   the IB multicast groups and keep track of the
                   groups it has already joined. Mapping directly to
                   the LID doesn't help if the group has not been
                   joined.

                   Thus the implementation is required to keep track
                   of the IB groups joined. It can therefore also
                   record the corresponding LID removing the need to
                   map the IP multicast address to the LID.

                2) Reduction of LID conflicts

                   The LIDs in the range 0xC000 to 0xFFFE are
                   designated as the multicast LIDs by IBA. This limits
                   the range to 2^14 - 1 entries (16383 entries). This
                   implies that 2^18 or 256K IPv4 multicast groups
                   could map to a single LID. It is better to let the
                   SM decide on a more efficient usage of the
                   multicast LID space.

                3) SM and IB architecture should stay unaffected.

                   A mapping of the LIDs can conflict with the SM
                   implementations. The SM is under no restrictions to
                   choose a particular LID for any multicast group.
                   Thus it could end up utilising a LID that
                   maps from an IP multicast address for some other
                   multicast group since not everything on IB subnets
                   is governed by the IPoIB rules.

                4) No need to plan for LID conflicts

                   Allowing the SM to decide on the LIDs also avoids
                   having to come up with a solution to handle LID
                   conflicts with other multicast groups.


        Thus it is best to avoid such a mapping and leave it to the
        individual implementations to determine the LID from the SM.
        There is no extra work involved in this determination since
        the SM has to be contacted anyway for the IB multicast group
        join/create operations.

        IPoIB WILL NOT standardise IP multicast addresses to LID
        mapping.
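
        A minimal Python sketch of the bookkeeping described in this
        section follows: the node records the LID returned by the SM
        for each group it has joined instead of computing it. The
        class name and the sm_join callback are hypothetical; the
        callback stands in for the IB multicast join/create request.

```python
MULTICAST_LID_FIRST = 0xC000
MULTICAST_LID_LAST = 0xFFFE
MULTICAST_LID_COUNT = MULTICAST_LID_LAST - MULTICAST_LID_FIRST + 1


def is_multicast_lid(lid: int) -> bool:
    return MULTICAST_LID_FIRST <= lid <= MULTICAST_LID_LAST


class JoinedGroups:
    """Remembers the multicast LID the SM assigned to each joined MGID."""

    def __init__(self, sm_join):
        self._sm_join = sm_join      # callable: mgid -> multicast LID
        self._mlid_by_mgid = {}

    def mlid_for(self, mgid: str) -> int:
        # Join once, then reuse the LID recorded at join time.
        if mgid not in self._mlid_by_mgid:
            mlid = self._sm_join(mgid)
            if not is_multicast_lid(mlid):
                raise ValueError("SM returned a LID outside the multicast range")
            self._mlid_by_mgid[mgid] = mlid
        return self._mlid_by_mgid[mgid]

    def forget(self, mgid: str) -> None:
        # Called when the group is left; a later send must join again.
        self._mlid_by_mgid.pop(mgid, None)
```

        Since the table is needed anyway to track joined groups,
        storing the LID in it adds no extra exchange with the SM.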

4.0 IP subnets in InfiniBand fabrics

        The IPoIB subnet is overlaid over the IB subnet. The IPoIB
        subnet is brought up in the following steps:

        Note: the join/leave operation at the IP level will be
              referred to as IP_join/IP_leave and the join/leave
              operations at the IB level will be referred to as
              IB_join/IB_leave in this document.

        1. The all-IP nodes group MUST be created

        The administrator MUST set up the IB multicast group
        corresponding to all-IP nodes/IPv4 broadcast (henceforth
        called 'broadcast group') when the IP(v4/v6) subnet is set
        up. The method by which the broadcast group is set up is not
        defined by IPoIB.

        2. All IPoIB interfaces IB_join the broadcast group

        The administrator chooses the parameters that are valid for
        the multicast group: P_Key, Q_Key, Hop Limit, Flow ID, TClass
        and the MTU. All multicast transmissions in the IP subnet
        must use these values. Therefore any other multicast groups
        set up in the IPoIB subnet MUST be set up with these
        attributes. In the future, as the IB specification associates
        more meaning with the various values and defines IB QoS,
        different values for IP multicast traffic may be possible.

        The IB_join of the broadcast group by the IPoIB nodes builds
        the IPoIB subnet. The broadcast group defines the span and the
        members of the IPoIB subnet. The IB_join to the broadcast
        group has the additional benefit of distributing these values
        to all the members of the subnet.
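
        The rule that every multicast group in the IPoIB subnet shares
        the broadcast group's attributes can be sketched as a simple
        comparison. This is an illustrative sketch only: the class and
        function names are hypothetical, the field widths are not
        modelled, and the values in the usage example are arbitrary
        placeholders.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GroupParams:
    """Attributes the administrator chooses for the broadcast group."""
    p_key: int
    q_key: int
    hop_limit: int
    flow_id: int
    tclass: int
    mtu: int


def matches_broadcast(broadcast: GroupParams, other: GroupParams) -> bool:
    """True if another multicast group in the IPoIB subnet was set up
    with the same attributes as the broadcast group."""
    return broadcast == other


if __name__ == "__main__":
    # Arbitrary placeholder values, as learned from the broadcast IB_join.
    bcast = GroupParams(p_key=0x8001, q_key=0x1, hop_limit=1,
                        flow_id=0, tclass=0, mtu=2048)
    print(matches_broadcast(bcast, replace(bcast)))            # same values
    print(matches_broadcast(bcast, replace(bcast, mtu=1024)))  # MTU differs
```

        A node could apply such a check before creating a further
        multicast group, since the values it needs are the ones
        distributed by the broadcast group join.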

        The IP interface MTU for the IP over Unreliable Datagram
        interface is the path MTU value returned when the broadcast
        MGID is joined. This is the largest MTU that can be used
        across the IPoIB subnet without fragmenting. The IPoIB
        specification for IP over non-UD modes of transmission MUST
        also define the MTU that can be used with it.

4.1 IPoIB VLANs

        In an IB subnet, endpoints must have compatible P_Keys to
        communicate with one another. Thus the administrator, when
        setting up an IP subnet over an IB subnet, must ensure that
        all the members have compatible P_Keys. An endpoint may,
        however, have multiple P_Keys.

        The IB architecture specifies that there can be only one MGID
        associated with a multicast group in the IB subnet. The P_Key
        can be included in the MGID mappings from the IP multicast
        addresses. If there is only one IPv4/v6 subnet in the IB
        subnet the P_Key value used in the mapping may be set to 0.
        Since the P_Key is unique in the IB subnet the inclusion of
        the P_Key in the IB MGIDs ensures unique MGID mappings are
        created. Every unique broadcast group MGID so formed creates a
        separate abstract IPoIB link and hence an IPoIB VLAN.

        How the IP stack determines the P_Key related to the IPoIB
        subnet is an implementation choice. It could be a
        configuration parameter initialised by some means by the
        administrator. An implementation MAY choose to have the
        interface join all of the MGIDs that can be formed from the
        P_Keys in the P_KeyTable of the associated port. In the
        absence of multiple IPoIB VLANs (different partitions) a value
        of 0 for the P_Key in the MGID is valid. This does not
        imply that the partition's P_Key is zero, only that the value
        used in the translation to IB MGIDs is 0. In this case the
        P_Key is returned to the node on a successful IB_join of the
        broadcast group.
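
        The relationship between the P_KeyTable and the set of
        candidate broadcast groups can be sketched as follows. This is
        a hypothetical illustration: the MGID bit layout is not
        modelled, so each group is represented by an abstract pair of
        (P_Key value used in the MGID, label), and the function name
        and the single_vlan flag are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical stand-in for a broadcast-group MGID. The real MGID bit
# layout is deliberately not modelled here.
GroupId = Tuple[int, str]


def candidate_broadcast_groups(p_key_table: List[int],
                               single_vlan: bool = False) -> List[GroupId]:
    """Return one abstract broadcast-group identifier per IPoIB VLAN the
    port could belong to."""
    if single_vlan:
        # Only one IP subnet in the IB subnet: the value used in the MGID
        # may be 0, whatever the partition's actual P_Key is.
        return [(0, "broadcast")]
    groups: List[GroupId] = []
    for p_key in p_key_table:
        gid = (p_key, "broadcast")
        if gid not in groups:  # distinct P_Keys give distinct identifiers
            groups.append(gid)
    return groups
```

        Because each identifier embeds a different P_Key value, two
        partitions never share a broadcast group, which is what makes
        each one a separate IPoIB VLAN.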

4.2 Multicast in IPoIB subnets

        IP multicast on InfiniBand subnets follows the same concepts
        and rules as on any other media. However, unlike most other
        media, multicast over InfiniBand requires interaction with
        another entity, the IB subnet manager. This section describes
        the outline of the process and also suggests some guidelines.

        IB architecture specifies the following format for IB
        multicast packets when used over unreliable datagram (UD)
        mode:

       +--------+-------+---------+---------+-------+---------+---------+
       |Local   |Global |Base     |Datagram |Packet |Invariant| Variant |
       |Routing |Routing|Transport|Extended |Payload| CRC     |  CRC    |
       |Header  |Header |Header   |Transport| (IP)  |         |         |
       |        |       |         |Header   |       |         |         |
       +--------+-------+---------+---------+-------+---------+---------+

       For details about the various headers please refer to the
       InfiniBand Architecture Specification [IB_ARCH].

       The Global routing header (GRH) includes the IB multicast group
       GID. The Local routing header (LRH) includes the local
       identifier (LID). The IB switches in the fabric route the
       packet based on the LID.

       The GID is made available to the receiving IB user (the IPoIB
       interface driver for example). The driver can therefore
       determine the IB group the packet belongs to.
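
       The receive-side use of the GID described above can be
       sketched as a table lookup. This is an illustrative sketch
       only: the class and method names are hypothetical, and the
       MGID value in the usage example is a placeholder rather than
       a real mapping. The only property relied upon is that a GID
       is 128 bits (16 bytes) long.

```python
from typing import Dict, Optional

GID_LENGTH = 16  # a GID is 128 bits


class GroupTable:
    """Maps the destination GID seen in the GRH to the IP multicast
    group the interface joined it for."""

    def __init__(self) -> None:
        self._by_mgid: Dict[bytes, str] = {}

    def add(self, mgid: bytes, ip_group: str) -> None:
        if len(mgid) != GID_LENGTH:
            raise ValueError("an MGID must be 16 bytes long")
        self._by_mgid[mgid] = ip_group

    def classify(self, grh_dgid: bytes) -> Optional[str]:
        """Return the IP group for a received packet, or None if the
        interface has no record of this MGID."""
        return self._by_mgid.get(grh_dgid)


if __name__ == "__main__":
    table = GroupTable()
    placeholder_mgid = bytes([0xFF]) + bytes(15)  # not a real mapping
    table.add(placeholder_mgid, "224.0.0.1")
    print(table.classify(placeholder_mgid))
```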

       IPv4 defines three levels of multicast support. These are:

                Level 0: No support for IP multicasting

                Level 1: Support for sending but not receiving multicasts

                Level 2: Full support for IP multicasting

        In IPv6 there is no such distinction. Full multicast support
        is mandatory. Additionally, all IPv4 subnets support broadcast
        (255.255.255.255) and there is no interface associated with
        broadcast reception.

        The standard case of broadcast is covered by the requirement
        that the broadcast group MGID must exist for an IPoIB subnet
        to be formed. Thus level 0 IPv4 multicast support is available
        by default.

4.2.1 Sending IP multicast datagrams

        An IP host may send a multicast packet at any time to any
        multicast address. The join/leave of IB groups will be
        referred to as IB_join/IB_leave in this document. The
        corresponding IP level join/leave will be referred to as
        IP_join/IP_leave.


        The IP layer conveys the multicast packet to the IPoIB
        interface driver/module. This module attempts to IB_join the
        relevant IB multicast group. This is required since otherwise
        there is no guarantee that the packet will reach its
        destinations.

        The IB_join could fail if the IB group has not been created.
        This could imply that there are no listeners on the subnet and
        the router doesn't expect to forward packets received on this
        group. In such a case the module would be justified in
        dropping the packet.

        However, this may not be the case. The IB group may not exist
        because the SM ran out of resources or the SM policy allows
        only a limited set of multicast groups to be created.
        Additionally it is not reasonable to expect the router to
        create IB groups for all the IP multicast addresses that it
        may be called upon to forward.

        Therefore, the multicast module of the IPoIB interface, when
        sending a multicast packet, MUST do one of the following:

                1) Join the IB multicast group corresponding to the IP
                   multicast address. This is the RECOMMENDED option
                   for multicast if the sender is itself a member of
                   the group.

                   As noted earlier, a particular IB multicast group
                   may not exist for some reason. In such a case the
                   implementation MUST fall back to one of the
                   following methods.

                2) Send the multicast packet out with the
                   IB MGID/MLID associated with the all-systems IP
                   multicast address (224.0.0.1/FF02::1).

                   An implementation implementing 1) described above
                   must fall back to this condition or the condition
                   given below on failure to join the IB group
                   corresponding to the IPv4 multicast address being
                   sent to.

                3) In IPv4 subnets if both the above conditions fail
                   then the packet MUST be sent with the IB MGID/MLID
                   corresponding to the IPv4 limited broadcast address
                   (255.255.255.255).
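
        The sender's order of preference described in options 1) to 3)
        can be sketched as follows. This is a minimal sketch under
        stated assumptions: StubSM is a hypothetical stand-in for the
        subnet manager, IB groups are keyed by the IP address string
        rather than by a real MGID, and the function name is
        illustrative.

```python
from typing import Optional, Set

ALL_SYSTEMS_V4 = "224.0.0.1"
ALL_SYSTEMS_V6 = "FF02::1"
LIMITED_BROADCAST_V4 = "255.255.255.255"


class StubSM:
    """Hypothetical subnet manager stub: it only knows which IB groups
    (keyed here by IP address string) currently exist."""

    def __init__(self, existing: Set[str]) -> None:
        self.existing = set(existing)

    def ib_join(self, ip_addr: str) -> bool:
        # A sender's join succeeds only if the group already exists;
        # senders do not create groups.
        return ip_addr in self.existing


def select_send_group(sm: StubSM, dest: str, ipv4: bool) -> Optional[str]:
    """Return the address whose IB MGID/MLID the packet is sent with."""
    # 1) Prefer the IB group corresponding to the destination address.
    if sm.ib_join(dest):
        return dest
    # 2) Otherwise use the group of the all-systems address.
    all_systems = ALL_SYSTEMS_V4 if ipv4 else ALL_SYSTEMS_V6
    if sm.ib_join(all_systems):
        return all_systems
    # 3) IPv4 only: use the group of the limited broadcast address,
    #    which must exist for the IPoIB subnet to have been formed.
    if ipv4:
        return LIMITED_BROADCAST_V4
    return None  # no further fallback is described for IPv6
```

        For example, with only the all-systems group present, a packet
        for 224.9.9.9 would be sent on the group for 224.0.0.1, and
        receivers would filter on the IP destination address.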


4.2.2 Receiving multicast packets

        An IP host sends an IGMP/MLD report to the router(s) when it
        wants to receive packets on a multicast group. The router
        could then create the IB group. However to receive the packets
        the IP host must join the corresponding IB multicast group.
        Therefore, it is simpler for the IB interface module on the IP
        host to first create the IB group and then send the IGMP
        message to the router. The router will then IB_join the
        specified IB group. It may also be that an IPoIB subnet
        doesn't have any routers. In such a case the non-existent
        router cannot be relied on to create the IB groups.

        The router MAY choose to create IB groups corresponding to the
        IP groups it expects to forward.

        Thus the creation of IB groups is done by IP receivers or IP
        routers only and not by senders thereby keeping things simple.
        The host must first try to join the group and only on failure
        attempt to create it.

4.2.2.1 IB join of MGIDs by a listener

        A multicast listener performs the following steps when it
        IP_joins the IP multicast group:

        1) The IPoIB interface IB_joins the corresponding IB MGID

        2) If step 1) fails

                The IPoIB interface creates the IB MGID group and IB_joins it

        3) If step 2) fails

                The IPoIB interface records the IB MGID/MLID it will
                be using for the IP multicast group. This decision is
                based on the steps outlined in section 4.2.1.

        The IGMP/MLD report is then sent out. The MGID/MLID pair in
        the report therefore may not correspond to the IP multicast
        address.

        4) It may be that the IB MGID could not be created/joined
           because of a transient error or resource constraint at the
           SM. It may also be created at a later point in time. The
           listener therefore would not be in the IB MGID
           corresponding to the IP address. Unfortunately there is no
           IB level support to let the listener know of the new IB
           MGID being created.

           If the underlying IB level indicated a transient failure
           the listener periodically retries to join the IB group.
           Note that multicasting can still continue since the packets
           can be sent out on the broadcast MGID.

           A configuration parameter dependent on the underlying IB
           subnet's requirements MUST be set that determines how often
           the retries can be done.
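
        The listener steps above (join, then create, then fall back,
        with periodic retries after a transient failure) can be
        sketched as follows. This is an illustrative sketch only:
        StubSM, JoinResult and the function names are hypothetical,
        and IB groups are keyed by a plain string rather than by a
        real MGID.

```python
from enum import Enum
from typing import Set


class JoinResult(Enum):
    JOINED = "joined"      # step 1 succeeded
    CREATED = "created"    # step 2 succeeded
    FALLBACK = "fallback"  # step 3: use another MGID/MLID for now


class StubSM:
    """Hypothetical subnet manager stub."""

    def __init__(self, existing: Set[str], can_create: bool = True) -> None:
        self.existing = set(existing)
        self.can_create = can_create

    def ib_join(self, group: str) -> bool:
        return group in self.existing

    def ib_create(self, group: str) -> bool:
        if not self.can_create:
            return False  # e.g. out of resources, or forbidden by policy
        self.existing.add(group)
        return True


def listener_join(sm: StubSM, group: str) -> JoinResult:
    """Try to join first; create the group only if the join fails."""
    if sm.ib_join(group):
        return JoinResult.JOINED
    if sm.ib_create(group) and sm.ib_join(group):
        return JoinResult.CREATED
    return JoinResult.FALLBACK


def retry_due(last_attempt: float, now: float, retry_interval: float) -> bool:
    """True once the configured retry interval has elapsed."""
    return now - last_attempt >= retry_interval
```

        A listener left in the FALLBACK state would call listener_join
        again whenever retry_due reports that the configured interval
        has passed, while continuing to receive on the fallback group
        in the meantime.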

4.2.3 Leaving/Deleting a multicast group

        An IPv4 sender (level 1 compliance) IB_joins the IB multicast
        group only because that is the only way to guarantee reception
        of the packets by all the group recipients. The sender must
        however IB_leave the group at some time. It is RECOMMENDED
        that a sender, when not a receiver on the group, start a timer
        per multicast group sent to. The sender leaves the IB group
        when the timer goes off. It restarts the timer if another
        message is sent. It is RECOMMENDED that the duration of the
        timer be 1200 seconds.

        This recommendation doesn't apply to the IB broadcast group. It
        also doesn't apply to the IB group corresponding to the
        all-hosts multicast group. An IPv4 host MUST always remain a member
        of the broadcast group. It MAY choose to remain a member of
        all-hosts group.

        Thus a sender that chooses to always send to the broadcast
        group and not to the specific multicast group does not need to
        implement a timer.
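
        The per-group sender timer recommended above can be sketched
        as follows. This is a minimal sketch, not a definitive
        implementation: the class and method names are hypothetical,
        time is passed in explicitly so the logic can be exercised
        without a real clock, and the caller is assumed to issue the
        IB_leave for each group that expire() returns.

```python
from typing import Dict, Iterable, List

SENDER_IDLE_TIMEOUT = 1200.0  # seconds, the RECOMMENDED timer duration


class SenderGroups:
    """Tracks IB groups joined only for sending and reports which ones
    have been idle long enough to leave."""

    def __init__(self, timeout: float = SENDER_IDLE_TIMEOUT,
                 exempt: Iterable[str] = ()) -> None:
        self.timeout = timeout
        # Groups that are never left on a timer, such as the broadcast
        # group (and, optionally, the all-hosts group).
        self.exempt = set(exempt)
        self._deadline: Dict[str, float] = {}

    def on_send(self, group: str, now: float) -> None:
        """Start or restart the timer for a group when a packet is sent."""
        if group not in self.exempt:
            self._deadline[group] = now + self.timeout

    def expire(self, now: float) -> List[str]:
        """Return the groups whose timer has gone off and forget them."""
        expired = sorted(g for g, d in self._deadline.items() if d <= now)
        for group in expired:
            del self._deadline[group]
        return expired
```

        A sender that always transmits on the broadcast group never
        adds an entry here, which matches the observation that such a
        sender needs no timer.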

        An IP multicast receiver MUST IB_leave the corresponding IB
        multicast group when it IP_leaves the IP multicast group. In
        the case of IPv4 implementation the receiver may choose to
        continue to be a sender (level 1 compliance). It MAY choose to
        not IB_leave the IB group but start a timer as explained
        above.

        A router is RECOMMENDED to IB_leave the IB multicast group
        when there are no members of the IP multicast address in the
        subnet and it has no explicit knowledge of any need to forward
        such packets.

        The router and the IP hosts MUST NOT IB_delete the IB
        multicast group when they IB_leave the group. It is possible
        for the same IB multicast group to be used by a non-IP protocol.
        The IB specification mentions an IB specific protocol that
        will delete the IB groups when it determines that there are no
        IB members of the group.

5.0 QoS and related issues

        [ WG input is solicited on this issue ]

        The IB specification suggests the use of service levels for
        load balancing, QoS and deadlock avoidance within an IB subnet.
        But the IB specification leaves the usage and mode of
        determination of the SL for the application to decide. The SL
        and list of SLs are available in the SA but it is up to the
        endnode's application to choose the 'right' value.

        IP is one such IB application and so IPoIB needs to define a
        set of rules on the choice of the SL. The IPoIB
        implementations MUST map the QoS request to the right SL based
        on the IB's QoS policies. This mapping in itself is not an
        IPoIB issue. However, a policy needs to be defined that lets an
        IPoIB node know the method to adopt to determine the SL. This
        is especially the case if the same IP subnet spans across
        multiple IB subnets.

        The policy must address the issue of whether the SL must be
        mapped as per IB's QoS parameters (when they are defined),
        determined only from the SA, or determined in an
        implementation dependent way etc. It must especially address
        the IP best-effort case.


6.0 Security Considerations

        Any multicast/broadcast communication is inherently insecure
        since anyone can receive the data. The applications must
        implement appropriate authentication/encryption methods for
        data security.

        The IP subnet communication can be disrupted by creating the
        IB broadcast/multicast groups with incompatible parameters.
        The implementations must leverage IB specific methods to
        protect against such situations.

7.0 References

[IB_ARCH]       InfiniBand Architecture Specification, Volume 1.0
[RFC_2373]      IP Version 6 Addressing Architecture
[RFC_2375]      IPv6 Multicast Address Assignments
[RFC_1700]      Assigned Numbers
[RFC_1112]      Host extensions for IP multicasting
[RFC_2236]      Internet Group Management Protocol, Version 2
[RFC_2710]      Multicast Listener Discovery

8.0 Author's Address

Vivek Kashyap

IBM
15450, SW Koll Parkway
Beaverton, OR 97006

Work: 503 578 3422
Email: vivk@us.ibm.com

Full Copyright Statement

        Copyright (C) The Internet Society (2001). All Rights Reserved.

        This document and translations of it may be copied and
        furnished to others, and derivative works that comment on or
        otherwise explain it or assist in its implementation may be
        prepared, copied, published and distributed, in whole or in
        part, without restriction of any kind, provided that the above
        copyright notice and this paragraph are included on all such
        copies and derivative works. However, this document itself may
        not be modified in any way, such as by removing the copyright
        notice or references to the Internet Society or other Internet
        organizations, except as needed for the purpose of developing
        Internet standards in which case the procedures for copyrights
        defined in the Internet Standards process must be followed, or
        as required to translate it into languages other than
        English.

        The limited permissions granted above are perpetual and will
        not be revoked by the Internet Society or its successors or
        assigns.

        This document and the information contained herein is provided
        on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
        ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE
        USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
        ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
        PARTICULAR PURPOSE.

