Internet Engineering Task Force                             Raj Yavatkar
INTERNET-DRAFT                                                 Ema Patki
draft-yavatkar-sbm-ethernet-00.txt                     Intel Corporation
                                                             Don Hoffman
                                                  Sun Microsystems, Inc.
                                                               June 1996
Expires: December 1, 1996

                   SBM (Subnet Bandwidth Manager):
           A Proposal for Admission Control over Ethernet

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as ``work in progress.''

To learn the current status of any Internet-Draft, please check the ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

This document is a product of the ISSLL subgroup of the Integrated Services working group of the Internet Engineering Task Force. Comments are solicited and should be addressed to the working group's mailing list at issll@mercury.lcs.mit.edu, and/or the author(s).

Abstract

This document outlines an architecture for RSVP-based admission control over an IEEE 802.3 Ethernet environment. The proposed architecture is designed to work with the current generation of Ethernet infrastructure (NICs, bridges, hubs, and switches) and should be considered a first step towards discovering solutions for implementing IntServ capabilities over Ethernet. This draft is intended as a starting point for discussions at the ISSLL meeting to be held at the Montreal IETF meeting in June 1996.

draft-yavatkar-sbm-ethernet-00.txt                              [Page 1]
INTERNET-DRAFT       SBM (Subnet Bandwidth Manager)           June, 1996

1. Introduction

Under RSVP, an end-to-end reservation of resources involves two steps. First, a sender specifies its traffic characteristics in a PATH message that traverses all the links along the path of a data flow. Second, the receiver(s), which are responsible for reserving resources, request a reservation using a RESERVE message that traverses, in reverse, the path taken by the PATH message. As the RESERVE message traverses the reverse path, each intermediate RSVP node (a host or a router) reserves resources on its downstream interface.

As part of the reservation procedure, the RSVP node invokes (and relies on) an admission control procedure that is specific to the underlying medium of the interface (e.g., a point-to-point link, Token Ring, or Ethernet) to ensure that adequate resources are available and can be reserved over that link. The RSVP protocol itself does not specify the admission control procedure; separate specifications are needed for each type of service and each type of link-level medium.

The purpose of this document is to propose an architecture for RSVP-based admission control over Ethernet. We make the following assumptions:

- No explicit support for integrated services is assumed from Ethernet NICs, switches, hubs, or bridges. However, we identify possible avenues for such support later in the document.

- In the absence of any policing or traffic shaping mechanism for limiting outgoing traffic on an end-system, our goal is to provide administrative control over the maximum amount of RSVP-capable traffic admitted on any segment of a LAN. Thus, we assume that all the RSVP nodes on a LAN will use the proposed admission control procedure to reserve bandwidth before sending any RSVP-enabled data flows and will not send or forward such traffic if the reservation request fails.
  Thus, if all the multimedia traffic on a LAN is sent using RSVP for resource reservation, the proposed architecture would keep the total multimedia traffic on any LAN segment within the bounds set by the LAN administrator.

- No prior limit is assumed on the amount of best-effort traffic on a LAN until additional mechanisms can be incorporated in end-systems to restrict the amount of best-effort traffic generated. However, we do assume that best-effort traffic is rate-adaptive and uses a "slow start" type congestion control mechanism to limit the amount of traffic sent.

- One of our goals is to propose a design that is stateless and fault tolerant. Therefore, we make use of soft state (state information that can be easily reconstructed in case of failure) and rely on IP multicast for state discovery and propagation.

2. Overview

2.1 Terminology

Our architecture is based on a logical entity called an SBM (Subnet Bandwidth Manager) that is responsible for handling admission control requests. We assume that a Designated SBM (DSBM) exists for each LAN segment (called a Managed Segment or MS) and services reservation requests (manages bandwidth) for that segment. An SBM is an application-level entity that uses UDP as its transport protocol and understands RSVP messages. The architecture makes no assumptions about the number of SBMs within a LAN; an SBM may act as a DSBM for one or more segments, or a single DSBM may exist for each LAN segment.

2.1 Basic Algorithm

Figure 1 - Example Managed Segment.

[Figure: hosts A, B, and C, an SBM, and routers R1 and R2, all attached to a single LAN]

Figure 1 shows an example topology with hosts and routers interconnected across a LAN.
For the purpose of this discussion, we ignore the actual physical topology of the LAN and assume that a single SBM is the DSBM for the entire LAN. The basic SBM algorithm works as follows:

1. As part of its initial configuration, the DSBM obtains information such as the maximum bandwidth that can be reserved on each LAN segment under its control. Configuration is likely to be static with current Ethernet devices. Future work may allow for dynamic discovery of this information. Section 3 discusses some of these issues in more detail.

2. At startup, an RSVP node (any RSVP-capable host or router) discovers and binds to its DSBM using an "SBM Discovery Protocol" (described in section 2.3).

3. As in conventional RSVP processing, PATH messages from a sender are sent or forwarded to potential receivers using the destination session address or using the standard RSVP encapsulation. For example, if the sender to a session is outside the LAN and router R1 (see Figure 1) is on the path to the receivers, R1 will forward the PATH message to the session address.

4. When a receiver (say, host A) wishes to make a reservation request for a session, it sends an RSVP RESERVE message to its DSBM using that DSBM's unicast address. The RESERVE message must contain an additional, new RSVP object called the LAN_PHOP object. This object specifies the IP address of the PHOP (previous hop) associated with the RESERVE message and has the same structure as the RSVP_HOP object.

5. The DSBM processes the RESERVE message based on the bandwidth available and returns an RSVP_ERROR to the requestor (host A) if the request cannot be granted. If the reservation succeeds, the DSBM forwards the RESERVE message towards the PHOP specified in the LAN_PHOP object.
   The DSBM merges reservation requests for the same session where possible, using rules similar to those of conventional RSVP processing.

6. The RESERVE message eventually reaches the original PHOP on that MS (as specified in the LAN_PHOP object) if all reservation requests within the MS succeed.

2.2 Changes to conventional RSVP operation

The SBM algorithm requires the following changes to RSVP operation on the part of RSVP end nodes:

- Outgoing RESERVE messages on an Ethernet interface are unicast to the DSBM.

- RESERVE messages sent to a DSBM contain a new, additional object called LAN_PHOP that specifies the IP address of the PHOP for the RESERVE message.

- RESERVE messages are sent from an RSVP node to the DSBM, and not to the PHOP.

2.3 Discovering and Binding to a DSBM

The DSBM listens on a well-known SBM multicast transport address (SBM_GRP -- a combination of a reserved UDP port and an IP multicast address). An RSVP node locates its DSBM by multicasting a LOCATE_SBM request to SBM_GRP with a restricted multicast scope (multicast TTL=1). After the initial handshake between the DSBM and an RSVP node, the RSVP node is considered bound to its DSBM.

2.4 More than one active SBM on a LAN segment

For the sake of redundancy and fault tolerance, it is possible to have more than one active SBM on any LAN segment, so that another SBM can act as a backup DSBM if necessary. In such a case, exactly one SBM acts as the DSBM at any time and is elected using a DSBM election algorithm. Once a DSBM is elected, the other SBMs stay around and step in to elect a new DSBM if the previously elected DSBM terminates or crashes for some reason.

The election algorithm works as follows. When an SBM comes up on an IP subnet, it announces its willingness to be a DSBM by periodically multicasting a DSBM WILLING message to the SBM_GRP address with restricted multicast scope (TTL=1).
If another SBM exists on the same subnet and considers itself the current DSBM or a better choice, it replies to the new SBM with a multicast I_AM_DSBM message to the same group. Ties between candidate DSBMs may be broken based on a priority field in the I_AM_DSBM message (priority specified by the network administrator) or simply based on IP address ordering (e.g., the candidate with the lower IP address wins). If the new SBM does not hear a response from a better choice within K periodic announcements, it declares itself to be the DSBM by multicasting an I_AM_DSBM message. To avoid SBMs sending simultaneous I_AM_DSBM messages, the SBM waits for a random time before declaring itself DSBM.

The designated SBM of the MS periodically multicasts the I_AM_DSBM message. Other SBMs in the MS listen to the periodic announcements and assume that the DSBM has stopped functioning if they miss K successive periodic announcements. In that case, all the SBMs initiate the election algorithm to elect a new DSBM.

2.5 Application Behavior

This proposal makes no assumptions about any traffic separation or policing mechanisms on the MS. Consequently, there are no network-enforced mechanisms to keep non-adaptive traffic that was intended to be part of a reserved flow, but that was refused a reservation by admission control, from interfering with traffic running with valid reservations. Instead, applications are expected to be well behaved and to observe the following rules in such environments:

- For both unicast and multicast flows, a sender is not to transmit any traffic on an RSVP flow until at least one RESERVE has been received for that flow. The outgoing flow should be policed at the sender to be less than or equal to the maximum flow reserved. RESERVE TEARS are indications that the sender should no longer send a flow.

- For multicast flows, the receiver is to leave the session multicast group if a reservation error (RESV_ERROR) or a Path Tear (PTEAR) is received.
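
The rules above can be sketched as a small sender-side gate plus a receiver-side check. This is an illustrative sketch only: the class FlowSender, the token-bucket style of policing, the 1500-byte burst size, and the function receiver_should_leave are assumptions made for the example and are not defined by this proposal or by RSVP.

```python
from dataclasses import dataclass


@dataclass
class FlowSender:
    """Sender-side gate for one RSVP flow (hypothetical helper)."""

    reserved_bps: float = 0.0    # maximum rate currently reserved, bits/s
    has_reservation: bool = False
    tokens: float = 0.0          # bytes the sender may still emit
    burst_bytes: float = 1500.0  # assumed bucket depth: one Ethernet frame

    def on_reserve(self, rate_bps: float) -> None:
        # At least one RESERVE has arrived, so the flow may start.
        self.has_reservation = True
        self.reserved_bps = max(self.reserved_bps, rate_bps)

    def on_reserve_tear(self) -> None:
        # A RESERVE TEAR tells the sender to stop sending the flow.
        self.has_reservation = False
        self.reserved_bps = 0.0
        self.tokens = 0.0

    def tick(self, seconds: float) -> None:
        # Refill the bucket at the reserved rate, capped at the burst size.
        refill = self.reserved_bps / 8.0 * seconds
        self.tokens = min(self.burst_bytes, self.tokens + refill)

    def may_send(self, nbytes: int) -> bool:
        # Police the outgoing flow: no reservation or no tokens, no traffic.
        if not self.has_reservation or nbytes > self.tokens:
            return False
        self.tokens -= nbytes
        return True


def receiver_should_leave(message_type: str) -> bool:
    """Multicast receiver rule: leave the group on RESV_ERROR or PTEAR."""
    return message_type in ("RESV_ERROR", "PTEAR")
```

A sender would call tick() from its transmit loop and drop or delay any packet for which may_send() returns False; the policing granularity is a local implementation choice.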
The final rule may create some difficulty in environments where source-specific multicast pruning is not implemented (the default case in the current MBONE). A reservation error from a path toward any one specific sender would result in the receiver dropping all senders, even those with fully reserved paths. Applications running in such environments should restrict sessions to a single sender if at all possible.

3. Handling Complex LAN Physical Topologies

The physical topology of a LAN may vary from a single, shared segment to a more complex, multi-hop topology (e.g., an interconnected network of bridges, hubs, and switches). In such a complex topology, the SBM algorithm should handle two cases that may arise:

- In a multi-hub (or bridged) LAN topology, unicast data flows between two entities on the subnet only traverse a subset of LAN segments. Similarly, if Ethernet switches/hubs support IP multicast traffic filtering, multicast data flows would also traverse a subset of segments based on the location of IP multicast group members. Thus, the SBM algorithm must reserve bandwidth only on affected segments in such cases.

- Instead of a single SBM acting as a centralized DSBM for the entire multi-hop LAN, it may be desirable to deploy many SBMs with each SBM responsible for managing bandwidth on a separate portion of the LAN. To handle such a case of many DSBMs within a LAN, our SBM algorithm must include mechanisms for discovering and communicating with peer DSBMs within a LAN.

In the remainder of this document, we address each of these issues in more detail.
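
The first case above, reserving bandwidth only on the affected segments, can be illustrated with a minimal per-segment admission check. The sketch below is not part of the proposal: the class SegmentAdmission, its method names, and the all-or-nothing commit are assumptions chosen for illustration, and the capacities are whatever the administrator configures.

```python
class SegmentAdmission:
    """Tracks reservable bandwidth per LAN segment (hypothetical helper)."""

    def __init__(self, capacity_bps):
        # capacity_bps: mapping of segment name -> reservable bits/s
        self.capacity = dict(capacity_bps)
        self.used = {seg: 0 for seg in self.capacity}
        self.flows = {}  # flow id -> (segments, rate)

    def reserve(self, flow_id, segments, rate_bps):
        """Admit the flow only if every affected segment has room."""
        if flow_id in self.flows:
            return False
        segs = list(dict.fromkeys(segments))  # drop duplicates, keep order
        # First pass: check every affected segment without changing state.
        for seg in segs:
            if seg not in self.capacity:
                return False
            if self.used[seg] + rate_bps > self.capacity[seg]:
                return False
        # Second pass: commit, so a refusal never leaves partial state.
        for seg in segs:
            self.used[seg] += rate_bps
        self.flows[flow_id] = (segs, rate_bps)
        return True

    def release(self, flow_id):
        """Return the flow's bandwidth to the segments it used."""
        segs, rate_bps = self.flows.pop(flow_id, ([], 0))
        for seg in segs:
            self.used[seg] -= rate_bps
```

Segments that the flow does not traverse are never touched, so a refusal on one congested segment does not consume capacity elsewhere.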
3.1 Discovering Physical Topology of a LAN

We assume that an SBM can discover topological information about the physical interconnections among hubs, switches, and bridges using a variety of methods, such as a static topology configuration database or the MIBs and techniques used by network management utilities. A static topology database is sufficient when the multi-hop LAN uses a star-based topology with no alternate, redundant interconnections between bridges and hubs. However, when alternate paths exist in a rich topology, bandwidth reservation optimizations require access to information that reveals the LAN segments traversed between two endpoints. Ideally, a standard interface to information such as the spanning tree configuration should be made available by IEEE 802.3 (or associated) working groups. Until such an interface is available, we propose a topology discovery protocol that relies on the "Topology Mapping" section (section 4.0) of the hub MIB WG specification being considered in the IETF [3].

3.1.1 Topology Discovery Protocol

Figure 2 -- Alternative representation of the LAN in Figure 1.

[Figure: host A attached to switch Sw1; Sw1 linked to Sw2; host C attached to Sw2; Sw2 linked to Sw3 and Sw4; host B attached to Sw5, which hangs below Sw3 and Sw4]

As described above, in a multi-hop LAN, a DSBM should identify the relevant physical segments on a path between a sender and a receiver and perform admission control over one or more such segments. Figure 2 shows an example multi-hop topology. Communication between endpoints (e.g., between hosts A and B, or between B and C) requires reservation of bandwidth across only the few LAN segments that lie on the path between the two endpoints.
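
To make the idea concrete, the following sketch finds the segments on the path between two endpoints once a loop-free (spanning tree) topology map is known. It is an illustration under stated assumptions: the function name segments_on_path is hypothetical, each link is treated as one LAN segment, and the link list only loosely follows Figure 2 (the Sw4-Sw5 link in particular is assumed).

```python
from collections import deque

# Assumed topology map; each pair is one LAN segment.
LINKS = [
    ("A", "Sw1"), ("Sw1", "Sw2"), ("Sw2", "C"),
    ("Sw2", "Sw3"), ("Sw2", "Sw4"), ("Sw4", "Sw5"), ("Sw5", "B"),
]


def segments_on_path(links, src, dst):
    """Return the list of (from, to) segments between src and dst."""
    # Build an undirected adjacency map from the link list.
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)

    # Breadth-first search from src, remembering each node's parent.
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)

    if dst not in parent:
        return []  # dst is not reachable on this map

    # Walk back from dst to src, collecting the segments crossed.
    path = []
    node = dst
    while parent[node] is not None:
        path.append((parent[node], node))
        node = parent[node]
    path.reverse()
    return path
```

With this map, a flow between B and C crosses four segments and never touches the Sw1 or Sw3 links, so a DSBM would perform admission control only on those four.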
Given a topology map, the DSBM needs to place the two endpoints on the map and identify the LAN segments on the path between them. An SBM takes the following steps to do so:

1. The SBM determines the MAC address of the endpoint it wishes to place on the map.

2. The SBM tells all managed hubs in the collision domain to watch for packets with that source MAC address.

3. The SBM sends an echo datagram ("ping") to the endpoint to cause it to transmit. The SBM then uses the hub MIB interface to read the group and port of the targeted MAC address from the managed hubs.

   The information obtained identifies the location of the endpoint on the map and the LAN segments currently used to reach the endpoint. Similar information can be gathered for the other endpoint.

4. Once the endpoints and the traversed hops are placed on the topology map, the SBM can identify the affected segments between the two endpoints, assuming a spanning tree topology.

3.2 Handling Multiple, Peer DSBMs in a LAN

We assume that each DSBM knows which LAN segments are under its control and needs to communicate with its peer DSBMs whenever admission control involves LAN segments that are not under its control. Thus, peer-to-peer DSBM communication involves two parts:

- Discovering peer DSBMs (and information on which segments they manage).

- Coordinating with peer DSBMs when a reservation request involves LAN segments under the control of more than one DSBM.

3.2.1 Discovery of Peer SBMs

All SBMs maintain a local cache of information on peer SBMs and the segments under their control. The cache is periodically updated to flush outdated entries. When an SBM processes a reservation request involving segment(s) that are not under its control, it checks its cache to locate the appropriate DSBMs for the segment(s) of interest.
If no DSBM is known, it multicasts (with TTL=1) a PEER_SEARCH query to the SBM_GRP specifying the segment(s) of interest. When the appropriate DSBM receives such a request, it multicasts its reply to the group, giving its unicast address and the list of segments under its control. The requesting SBM (and others) then add the information to their caches.

3.2.2 Peer-to-Peer Communication

When a reservation request involves segments outside the domain of a DSBM, the DSBM first performs admission control on the segments under its control. If the admission control succeeds, the DSBM forwards the RSVP RESERVE request to the peer DSBM responsible for the next segment(s) on the path towards the LAN_PHOP, using the peer's unicast address. The peer DSBM repeats a similar procedure and forwards the RESERVE to the next DSBM if necessary. Finally, the DSBM responsible for the final segment before the LAN_PHOP forwards the RESERVE to the LAN_PHOP if admission control succeeds up to and including the final segment.

If admission control fails at any intervening SBM, that SBM sends an RSVP_ERROR message back to its previous peer. The error message propagates hop-by-hop (using the information in the reservation state at intermediate SBMs) back to the first DSBM, which then communicates the RSVP_ERROR back to the RSVP node that initiated the RESERVE.

DSBMs merge reservation requests (and handle killer reservation problems) for the same session on segments under their control according to the rules specified in the RSVP specification.

4. Additional Notes

It should be noted that our design allows for one DSBM per switch in a LAN. However, our design will benefit from the definition of a standard interface for accessing routing (spanning tree) information available at switches and other 802.3 medium attachment units.
We also hope that this draft will serve as a starting point for discussion between the IETF and the IEEE 802.1P/Q subcommittee(s) to identify Level 2 mechanisms that will help in implementing integrated-services capabilities over Ethernets.

Acknowledgements

The authors wish to thank John Flick of HP for explaining the "Topology Mapping" section of the hub MIB specification.

5. References

[1] R. Braden, L. Zhang, S. Berson, S. Herzog, J. Wroclawski, "Resource Reservation Protocol", Internet Draft draft-ietf-rsvp-spec12.txt, May 1996.

[2] S. Shenker, "Specification of General Characterization Parameters", draft-ietf-intserv-charac-00.txt, Nov 1995.

[3] D. Romascanu, "Definitions of Managed Objects for IEEE 802.3 Repeater Devices", draft-ietf-hubmib-repeater-dev-02.txt, May 1996.

6. Authors' Addresses

Raj Yavatkar
Intel Corporation
MS: JF2-74
2111 N.E. 25th Avenue, Hillsboro, OR 97124 USA
phone: +1 503-264-9077
email: yavatkar@ibeam.intel.com

Ema Patki
Intel Corporation
MS: JF2-74
2111 N.E. 25th Avenue, Hillsboro, OR 97124 USA
phone: +1 503-264-0440
email: epatki@ibeam.intel.com

Don Hoffman
Sun Microsystems, Inc.
MS: UMPK14-305
2550 Garcia Avenue
Mountain View, California 94043-1100 USA
phone: +1 503-297-1580
email: don.hoffman@eng.sun.com