Network Working Group                                          L. Yong
Internet Draft                                                  W. Hao
Category: Standards Track                                  D. Eastlake
                                                                Huawei
                                                                 A. Qu
                                                              MediaTek
                                                             J. Hudson
                                                               Brocade
                                                           U. Chunduri
                                                              Ericsson
Expires: April 2015                                   October 27, 2014


                       IGP Multicast Architecture
                draft-yong-rtgwg-igp-multicast-arch-00


Abstract

This document specifies an Interior Gateway Protocol (IGP) network architecture to support multicast transport. It describes the architecture components and the algorithms to automatically build a distribution tree for transporting multicast traffic, and it provides a method of pruning that tree for improved efficiency.

Status of this document

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on April 27, 2015.

Copyright Notice

Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Table of Contents

   1. Introduction
      1.1. Motivation
      1.2. Conventions used in this document
   2. IGP Architecture for Multicast Transport
   3. Computation Algorithms in IGP Multicast Domain
      3.1. Automatic Tree Root Node Selection
      3.2. Distribution Tree Computation
         3.2.1. Parent Selection
         3.2.2. Parallel Local Link Selection
      3.3. Multiple Distribution Trees for a Multicast Group
      3.4. Pruning a Distribution Tree for a Group
   4. Router Forwarding Procedures
      4.1. Packet Forwarding Along a Pruned Distribution Tree
      4.2. Local Forwarding at Edge Router
         4.2.1. Overlay Multicast Transport
      4.3. Multi-homing Access Through Active-active MC-LAG
      4.4. Reverse Path Forwarding Check (RPFC)
   5. Security Considerations
   6. IANA Considerations
   7. Acknowledgements
   8. References
      8.1. Normative References
      8.2. Informative References
1. Introduction

This document specifies an Interior Gateway Protocol (IGP) network architecture to support multicast transport. It describes the architecture components and the algorithms to automatically build a distribution tree for transporting multicast traffic, and it provides a method of pruning that tree for improved efficiency.

An IGP network is built to transport unicast traffic. Transporting multicast traffic today relies on a protocol-independent mechanism provided by a separate protocol, i.e., PIM [RFC4601] [RFC5015]. PIM builds on top of the IGP network and maintains its own state. Data center infrastructure and advanced systems for cloud applications are looking for an IGP network that transports both unicast and multicast packets in a simpler and more efficient way than running a separate protocol beyond the IGP.

This draft proposes the architecture and algorithms for IGP-based multicast transport. The architecture and algorithms automatically build a bi-directional distribution tree, and a pruned bi-directional tree for each multicast group, without the use of PIM. The IGP protocol extensions for this architecture are addressed in [ISEXT].

1.1. Motivation

Network-as-a-service can technically be achieved by decoupling the network IP space from the service IP space, as with a VXLAN [RFC7348] based network overlay. Decoupling the network IP space from the service IP address space also provides network agility and programmability to applications. If the network IP space is decoupled from the service IP space, the network itself no longer needs manual configuration; an IP network fabric can be formed automatically. The resulting "plug and play" behavior can greatly simplify network operation.

With the goal of automation in forming a network fabric and support for any type of forwarding behavior the service layer requires, the IGP should be extended to support:

1. Network formation.

2. Multi-destination distribution tree computation.

Using an external PIM defeats this "automatic" requirement and results in a longer convergence time for multicast transport than for unicast transport, because the PIM convergence time is added to the basic IGP unicast route convergence time.

1.2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2. IGP Architecture for Multicast Transport

An IGP multicast domain defined in this document contains edge routers and transit routers. Multicast source(s) and receiver(s) attach locally to edge routers or connect to edge routers through a layer 2 or layer 3 network. When an ingress edge router receives a multicast packet from a local multicast source, it replicates the packet along a pruned tree in the IGP domain. When an egress edge router receives a multicast packet from the IGP domain, it forwards the packet to all local receivers and replicates the packet along the pruned tree in the domain. When a transit router receives a multicast packet from another router in the domain, it replicates the packet to its neighbor router(s) in the domain along a pruned tree.
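As a non-normative illustration of this replication behavior, the following Python sketch models a router that forwards a multi-destination packet onto every adjacency in the pruned tree for the packet's group, except the interface on which the packet arrived, and, if it is an egress edge router, also delivers the packet to its local receivers. The data structures and names used here (pruned_tree_ifs, local_receivers, send(), and so on) are illustrative assumptions, not part of this specification.

   # Non-normative sketch of pruned-tree replication at a router.
   # The data structures below are illustrative assumptions.

   def send(interface, packet):
       print("send on %s: %s" % (interface, packet))

   class Router:
       def __init__(self, name):
           self.name = name
           # group -> set of adjacency interfaces on the pruned tree
           self.pruned_tree_ifs = {}
           # group -> set of locally attached receiver ports (edge only)
           self.local_receivers = {}

       def forward(self, group, arrival_if, packet):
           """Replicate 'packet' for 'group' along the pruned tree."""
           tree_ifs = self.pruned_tree_ifs.get(group, set())
           # Replicate toward every pruned-tree adjacency except the
           # interface on which the packet arrived.
           for intf in sorted(tree_ifs - {arrival_if}):
               send(intf, packet)
           # An egress edge router also forwards to local receivers.
           for port in sorted(self.local_receivers.get(group, set())):
               send(port, packet)

   # Example: an edge router with two pruned-tree adjacencies and one
   # local receiver for group 239.1.1.1.
   r = Router("R1")
   r.pruned_tree_ifs["239.1.1.1"] = {"if1", "if2"}
   r.local_receivers["239.1.1.1"] = {"host-port3"}
   r.forward("239.1.1.1", arrival_if="if1", packet="payload")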
An IGP multicast domain can be used to carry overlay multicast traffic. Upon receiving a multicast packet from a source, the ingress edge router first encapsulates the packet, using its own IP address as the source address and the corresponding underlay multicast IP address as the destination address of the outer header, and then replicates the packet along a pruned tree. Egress edge router(s) decapsulate the packet before sending it to the receiver(s).

In an IGP multicast domain, each router has a unique IP address, and the router IP address is advertised as a host address by the IGP. An entire IGP domain can be an IGP multicast domain if all of its routers support the multicast capability described in this document; alternatively, a subset of an IGP domain can be an IGP multicast domain where only some edge routers and transit routers have the IGP multicast capability described in this draft and in [ISEXT]. In the case where the IGP multicast domain is a subset of an IGP domain, a router in the IGP multicast domain must have at least one adjacency (next hop) to another router in the IGP multicast domain; that is, the IGP multicast domain must be connected. Configuring an IP tunnel between two routers in the IGP multicast domain can achieve this. How to configure such a tunnel is outside the scope of this document.

In an IGP multicast domain, a default distribution tree is established automatically (see Section 3.1). Operators may also configure other distribution trees with different priorities in the domain and specify the associated multicast groups carried by these configured trees. By default, all multicast groups use the default distribution tree. The distribution tree computation algorithm is described in Section 3.2. Section 3.3 describes the use of multiple trees to support one multicast group. The pruning of a distribution tree for a particular multicast group is described in Section 3.4. Section 4 describes router forwarding procedures.

3. Computation Algorithms in IGP Multicast Domain

3.1. Automatic Tree Root Node Selection

By default, the tree root is the router with the largest magnitude Router ID, considering the Router ID, i.e., the router's IPv4 address, to be an unsigned integer. Note that the algorithms in the following sections use the Router ID as the router identifier, i.e., the unique IP address assigned to a router in an IGP multicast domain.

Operators may configure a default tree root node (based on the topology) that takes precedence over the auto-calculated default tree root. This configured tree root node advertises its IP address as the default tree root for all multicast groups that are not assigned to a configured distribution tree in the IGP multicast domain.

3.2. Distribution Tree Computation

The distribution tree computation algorithm uses the existing IGP Link State Database (LSDB). All routers in an IGP multicast domain calculate a path toward the default tree root node (router, see Section 3.1), based on the LSDB and the shortest path algorithm. As a result, a default distribution tree is formed in the domain. If an operator configures other distribution tree roots on other routers, the operator specifies which multicast groups use those trees, and the tree root routers advertise themselves as the tree root for those multicast groups by use of the new RTADDR TLV [ISEXT]. All routers in the domain track the tree root nodes and calculate the path toward each configured tree root node by using the shortest path algorithm, which forms multiple distribution trees.
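As a non-normative illustration, the Python sketch below shows one way such a tree could be computed from the LSDB: a shortest-path (Dijkstra) calculation rooted at the tree root that, because link costs can be asymmetric, uses the cost in the direction towards the root, and that yields for every router its parent (next hop toward the root) on the tree. The simplified LSDB representation is an assumption made only for this example; equal-cost tiebreaking is covered in Section 3.2.1.

   import heapq

   # Non-normative sketch: build a distribution tree by running a
   # shortest-path calculation rooted at the tree root.  'lsdb' maps
   # (from_router, to_router) -> cost and is an assumed, simplified
   # stand-in for the IGP Link State Database.

   def compute_tree(lsdb, root):
       neighbors = {}
       for (a, b) in lsdb:
           neighbors.setdefault(a, set()).add(b)
           neighbors.setdefault(b, set()).add(a)

       dist = {root: 0}
       parent = {root: None}
       pq = [(0, root)]
       while pq:
           d, u = heapq.heappop(pq)
           if d > dist.get(u, float("inf")):
               continue
           for v in neighbors.get(u, ()):
               # Use the cost towards the root, i.e. the v -> u
               # direction, since link costs can be asymmetric.
               cost = lsdb.get((v, u))
               if cost is None:
                   continue
               if d + cost < dist.get(v, float("inf")):
                   dist[v] = d + cost
                   parent[v] = u   # u is v's parent on this tree
                   heapq.heappush(pq, (d + cost, v))
       return parent

   # Example with asymmetric costs on the B-C link: C reaches the
   # root A through B because the C->B cost is low.
   lsdb = {("A", "B"): 1, ("B", "A"): 1,
           ("B", "C"): 10, ("C", "B"): 2,
           ("A", "C"): 5, ("C", "A"): 5}
   print(compute_tree(lsdb, root="A"))   # {'A': None, 'B': 'A', 'C': 'B'}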
It is important that all routers in an IGP multicast domain calculate identical branches for a tree. Sections 3.2.1 and 3.2.2 specify the tiebreaking rules for parent router selection in the case of equal-cost paths and for link selection in the case of multiple local links. Because link costs can be asymmetric, it is important for all tree construction calculations to use the cost towards the root.

3.2.1. Parent Selection

When there are equal costs from a potential child router to more than one possible parent router, all routers need to use the same tiebreakers. In such situations, it is desirable to allow traffic to be split over as many links as possible when multiple distribution trees are present. This document uses the following tiebreaker rule:

If there are k distribution trees in the domain, then, when each router computes these trees, the k calculated trees are ordered and numbered from 0 to k-1 in ascending order of root IP address. The tiebreaker rule is: when building tree number j, remember all possible equal-cost parents for router N. After calculating the entire "tree" (actually, a directed graph), for each router N, if N has p possible parents, order the parents in ascending order according to the 7-octet IS-IS System ID considered as an unsigned integer, and number them starting at zero. For tree j, choose N's parent as choice j mod p.

3.2.2. Parallel Local Link Selection

If there are parallel point-to-point links between two routers, say R1 and R2, these parallel links are visible to R1 and R2, but not to other routers. If this bundle of parallel links is included in a tree, it is important for R1 and R2 to decide which link to use; if the R1-R2 link is a branch of multiple trees, it is desirable to split traffic over as many links as possible. However, the local link selection for a tree is irrelevant to other routers. Therefore, the tiebreaking algorithm need not be visible to any routers other than R1 and R2.

Suppose there are L parallel links between R1 and R2 and both routers are on K trees. The L links are ordered from 0 to L-1 in ascending order of the Circuit ID associated with the adjacency by the router with the higher System ID, and the K trees are ordered from 0 to K-1 in ascending order of root IP address. The tiebreaker rule is: for tree k, select link k mod L.

Note that if multiple distribution trees are configured in a domain, better load balancing among parallel links can be achieved through the tiebreaking algorithm. Otherwise, if only one tree is configured, only one of the parallel links can be used for that distribution tree. However, calculating and maintaining many trees consumes resources; operators need to balance between the two.

An alternative is to use a lower-level link aggregation protocol, such as [802.1AX-2011], on the parallel point-to-point links between R1 and R2. They will then appear as a single link to the IGP, and it will be the link aggregation protocol that spreads traffic across the actual lower-level parallel links.
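The following non-normative Python sketch illustrates both tiebreakers under the numbering described above: equal-cost candidate parents are ordered by System ID and tree j takes parent choice j mod p, while parallel links are ordered by Circuit ID and tree k takes link choice k mod L. The identifier formats and example values are illustrative assumptions.

   # Non-normative sketch of the two tiebreaker rules.  The candidate
   # lists and identifier values below are illustrative assumptions.

   def select_parent(tree_index, candidate_parents):
       """Equal-cost parent selection for tree number 'tree_index'
       (trees numbered 0..k-1).  'candidate_parents' maps each
       candidate parent to its 7-octet System ID (as bytes)."""
       ordered = sorted(candidate_parents,
                        key=lambda r: int.from_bytes(
                            candidate_parents[r], "big"))
       return ordered[tree_index % len(ordered)]

   def select_parallel_link(tree_index, circuit_ids):
       """Parallel point-to-point link selection for tree number
       'tree_index'.  'circuit_ids' are the Circuit IDs assigned by
       the router with the higher System ID."""
       ordered = sorted(circuit_ids)
       return ordered[tree_index % len(ordered)]

   # Example: two equal-cost parents and three parallel links,
   # with two trees numbered 0 and 1.
   parents = {"P1": bytes.fromhex("00000000000002"),   # hypothetical
              "P2": bytes.fromhex("00000000000001")}
   for j in range(2):
       print("tree", j, "parent:", select_parent(j, parents))

   links = [0x11, 0x0a, 0x22]                          # hypothetical
   for k in range(2):
       print("tree", k, "link: 0x%02x" % select_parallel_link(k, links))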
3.3. Multiple Distribution Trees for a Multicast Group

It is possible for a multicast group to be associated with multiple trees that may have the same or different priorities. When a multicast group is associated with more than one tree, all routers have to select the same tree for the group. The tiebreaker rules specified in PIM [RFC4601] are used here. They are:

o  Perform a longest match on the group-range to get a list of trees.

o  Select the tree(s) with the highest priority.

o  If there is only one tree with the highest priority, select that tree for the group-range.

o  If multiple trees have the highest priority, use the PIM hash function to choose one. The PIM hash function is described in Section 4.7.2 of RFC 4601 [RFC4601].

3.4. Pruning a Distribution Tree for a Group

Routers prune the distribution tree for each associated multicast group, i.e., they eliminate branches that have no potential downstream receivers. Multi-destination packets SHOULD only be forwarded on branches that are not pruned. The assumption here is that a multicast source is also a multicast receiver, but a multicast receiver may not be a multicast source.

All routers in the domain receive LSP messages with the GRADDR TLV [RFC7176] from the edge routers, which indicate the multicast groups for which an edge router is a receiver. According to that, the routers prune the corresponding distribution tree for each multicast group and maintain a list of adjacency interfaces that are on the pruned tree for that multicast group. Among these interfaces, one interface will be toward the tree-root router (unless the router is the root) and zero or more interfaces will be toward some edge routers.

4. Router Forwarding Procedures

4.1. Packet Forwarding Along a Pruned Distribution Tree

Forwarding a multi-destination packet follows the pruned tree for the group to which the packet belongs. It is done as follows:

o  If the router receives a multi-destination packet with a group IP address that is not associated with any configured tree, the packet MUST be considered associated with the default tree.

o  Else, check whether the link on which the packet arrived is one of the ports in the pruned distribution tree. If not, the packet MUST be dropped.

o  Else, optionally perform RPF checking (Section 4.4). If the check is performed and it fails, the packet SHOULD be dropped.

o  Else, the packet is forwarded onto all the adjacency interfaces in the pruned tree for the group except the interface on which the packet was received.

4.2. Local Forwarding at Edge Router

Upon receiving a multicast packet, besides forwarding it along the pruned tree, an edge router may also need to forward the packet to the local hosts attached to it. This is referred to as local forwarding in this document. The local forwarding table and the multicast forwarding table in the IGP domain should be stitched together at each edge router. The local forwarding table can be generated by an IGMP/PIM protocol running in the network between the hosts and the edge router.

A local group database is needed to keep track of the group membership of attached hosts. Each entry in the local group database is a [group, host] pair, which indicates that the attached host belongs to the multicast group. When receiving a multicast packet, the edge router forwards the packet to the hosts that match a [group, host] pair in the local group database.

The local group database is built through the operation of IGMPv3 [RFC3376]. An edge router sends periodic IGMPv3 Membership Queries to attached hosts. Hosts then respond with IGMPv3 Membership Reports, one for each multicast group to which they belong.

Upon receiving a Membership Report for a multicast group A, the router updates its local group database by adding or refreshing the [group A, host] entry. If at a later time Reports for group A cease to be heard from the host, the entry is deleted from the local group database. The edge router also sends an LSP message with the GRADDR TLV to inform the other routers about the group memberships in its local group database.
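The following non-normative Python sketch models this local group database: Membership Reports add or refresh [group, host] entries, entries that are not refreshed within a hold time are removed, and a change in the set of locally joined groups triggers a (re)advertisement of the GRADDR TLV. The hold-time value and the advertise_graddr() hook are assumptions made only for this example; the actual timers and TLV encoding are defined elsewhere ([RFC3376], [ISEXT]).

   import time

   # Non-normative sketch of an edge router's local group database.
   # HOLD_TIME and advertise_graddr() are illustrative assumptions.

   HOLD_TIME = 260.0   # seconds, e.g. derived from the query interval

   class LocalGroupDatabase:
       def __init__(self):
           self.entries = {}          # (group, host) -> last refresh

       def joined_groups(self):
           return {g for (g, _h) in self.entries}

       def on_membership_report(self, group, host):
           newly_joined = group not in self.joined_groups()
           self.entries[(group, host)] = time.time()
           if newly_joined:
               self.advertise_graddr()     # membership set changed

       def expire(self):
           now = time.time()
           stale = [k for k, t in self.entries.items()
                    if now - t > HOLD_TIME]
           for key in stale:
               del self.entries[key]
           if stale:
               self.advertise_graddr()

       def advertise_graddr(self):
           # Placeholder: flood an LSP carrying the GRADDR TLV listing
           # the currently joined groups (encoding per [ISEXT]).
           print("advertise GRADDR TLV:", sorted(self.joined_groups()))

   db = LocalGroupDatabase()
   db.on_membership_report("239.1.1.1", "host-a")  # triggers advertisement
   db.on_membership_report("239.1.1.1", "host-b")  # group already joined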
4.2.1. Overlay Multicast Transport

An IGP multicast domain may be used to carry overlay multicast traffic [RFC7365]. There are two architecture scenarios:

1) The IGP multicast domain edge router is separate from the overlay network edge device [RFC7365]. Before multicast traffic is forwarded, the overlay network should trigger the underlay multicast domain to construct the multicast tree beforehand, using the IGMP protocol. The group address in the protocol is the underlay multicast group address. The outer-layer traffic encapsulation is performed on the overlay network edge device; the IGP multicast domain acts as a pure underlay network.

2) The IGP multicast domain edge router is collapsed with the overlay network edge device. Before multicast traffic is forwarded, the locally connected host should trigger the underlay multicast domain to construct the multicast tree beforehand, using an IGMP-like protocol. The group address in the protocol is the overlay multicast group address; the edge router should map that group address to an underlay multicast group address.

The IGP multicast domain can support both scenarios. To carry overlay multicast traffic, a (designated) edge router (see Section 4.3 on multi-homing access) must additionally maintain the mapping between an overlay multicast group and an underlay multicast group, and perform packet encapsulation/decapsulation upon receiving a packet from a host or from the underlay IGP network. The mapping between an overlay multicast group and an underlay multicast group can be manually configured or automatically generated by an algorithm at a (designated) edge router. The same edge router MUST be selected as the Designated Forwarder for an overlay multicast group and the underlay multicast group associated with it. If multiple overlay multicast groups attach to the same set of edge routers, these overlay multicast groups can be mapped to the same underlay multicast group to reduce the underlay multicast forwarding table size on each router. The mapping method is beyond the scope of this document.

4.3. Multi-homing Access Through Active-active MC-LAG

A multicast group receiver may attach to multiple edge routers through an active-active MC-LAG [802.1AX-2011] to enhance reliability. When a remote edge router ingresses a multicast packet with a multicast group address from its local multicast source, if all egress routers in an MC-LAG forward the packet to the local host (receiver), the host will receive multiple copies of the multicast frame from the remote multicast source. To avoid duplicate packets being delivered from the IGP domain to a local network, a Designated Forwarder (DF) mechanism is required.

All the edge routers associated with the same MC-LAG use the same algorithm to select one DF edge router for a multicast group. Each MC-LAG should be assigned a unique MC-LAG identifier in an IGP multicast domain, which may be manually configured or automatically provisioned.

When an edge router in an MC-LAG receives a multicast group receiver join message via an IGMP/PIM-like protocol, it announces the correspondence between its own MC-LAG ID and the multicast group to the other routers in its IGP LSP. After the network reaches steady state, all edge routers in an MC-LAG elect the same router as the DF for each multicast group. Upon receiving a multicast packet from the domain, only the DF edge router forwards the packet towards the receiver; non-DF edge routers do not. All edge routers, including DF and non-DF, can ingress traffic into the IGP domain as usual. The DF and non-DF state influences only the egress multicast traffic forwarding process.
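This document only requires that all edge routers in an MC-LAG apply the same deterministic election; it does not mandate a particular algorithm. As one possible, non-normative choice, the Python sketch below elects the DF for a group by hashing the MC-LAG identifier and group address and indexing into the candidate routers ordered by IP address, so that every member of the MC-LAG independently arrives at the same answer. The hash-based rule and the data layout are assumptions made for illustration only.

   import hashlib
   import ipaddress

   # Non-normative DF election sketch.  The hash-based rule below is
   # an illustrative assumption; any deterministic rule applied
   # consistently by all MC-LAG members would do.

   def elect_df(mc_lag_id, group, candidate_router_ips):
       """Return the router IP elected as DF for (mc_lag_id, group)."""
       # Order the candidates the same way on every router.
       ordered = sorted(candidate_router_ips,
                        key=lambda ip: int(ipaddress.ip_address(ip)))
       digest = hashlib.sha256(
           ("%s/%s" % (mc_lag_id, group)).encode()).digest()
       index = int.from_bytes(digest[:4], "big") % len(ordered)
       return ordered[index]

   # Example: two edge routers share MC-LAG 7; both compute the same
   # DF per multicast group, spreading groups across the members.
   members = ["192.0.2.1", "192.0.2.2"]
   for group in ["239.1.1.1", "239.1.1.2"]:
       print(group, "-> DF", elect_df(7, group, members))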
If a multicast group source host attaches to multiple edge routers through an active-active MC-LAG, loop prevention is necessary, i.e., a packet sent by the source host must be prevented from looping back to the source host via the edge routers in the MC-LAG. The solutions for the two scenarios are described below.

o  When the multicast IGP domain edge routers are separate from the overlay network edge devices that carry the overlay network traffic, these routers do not replace the traffic source IP address when they inject the traffic into the IGP domain. In this case, the edge routers should acquire the multicast source IP addresses beforehand, using a mechanism like IGMPv3 explicit tracking, and then synchronize the source IP addresses among the edge routers in the same MC-LAG. The same split-horizon mechanism (source-based filtering) described for the next scenario can then be used.

o  When the multicast IGP domain edge routers are collapsed with the overlay network edge devices, the edge router connected to the multicast source performs multicast encapsulation when it injects local multicast traffic into the IGP domain; the source IP address is the edge router's IP address. Each edge router tracks the IP address(es) associated with the other edge router(s) with which it shares an MC-LAG. When an edge router receives such a packet from the IGP domain, it examines the source IP address and filters the packet out on all local interfaces in the same MC-LAG. With this approach, local-bias forwarding is required on the ingress edge router: it performs replication locally to all directly attached receivers regardless of the DF or non-DF state of the outgoing interface connecting to each receiver.

4.4. Reverse Path Forwarding Check (RPFC)

Routing transients resulting from topology changes can cause temporary loops in distribution trees. If no precautions are taken, and there are fork points in such loops, it is possible for multiple copies of a packet to be forwarded. If this is a problem for a particular use, a Reverse Path Forwarding Check (RPFC) may be implemented.

In this case, the RPFC works by a router determining, for each port and based on the source and destination IP addresses of a multicast packet, whether the port is one on which that router expects to receive such a packet. In other words: is there an edge router with reachability to the source IP address such that, starting at that router and using the tree indicated by the destination IP address, the packet would have arrived at the port in question? If so, the packet is further distributed. If not, it is discarded. An RPFC can be implemented at some routers and not at others.
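As a non-normative illustration, the Python sketch below performs such a check using two tables assumed to be precomputed from the LSDB: a mapping from source IP address to the edge router that provides reachability to it, and, for each (ingress edge router, tree) pair, the port on which that ingress router's traffic is expected to arrive. Both tables and their contents are assumptions made only for this example.

   # Non-normative RPFC sketch.  'source_to_ingress' and
   # 'expected_port' are illustrative tables assumed to be
   # precomputed from the LSDB and the relevant distribution tree.

   def rpf_check(packet_src, tree, arrival_port,
                 source_to_ingress, expected_port):
       """Return True if the packet may be forwarded further."""
       ingress = source_to_ingress.get(packet_src)
       if ingress is None:
           return False    # no edge router reaches this source
       # Would a packet injected at 'ingress' and distributed on
       # 'tree' have arrived here on 'arrival_port'?
       return expected_port.get((ingress, tree)) == arrival_port

   # Example: source 10.0.0.5 sits behind edge router E1; on tree T0,
   # E1's traffic is expected to arrive on this router's port "if2".
   source_to_ingress = {"10.0.0.5": "E1"}
   expected_port = {("E1", "T0"): "if2"}
   print(rpf_check("10.0.0.5", "T0", "if2",
                   source_to_ingress, expected_port))   # True
   print(rpf_check("10.0.0.5", "T0", "if3",
                   source_to_ingress, expected_port))   # False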
5. Security Considerations

To come in a future version.

6. IANA Considerations

This document does not request any IANA action.

7. Acknowledgements

The authors would like to thank Mike McBride and Linda Dunbar for their valuable input.

8. References

8.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC3376] Cain, B., et al., "Internet Group Management Protocol, Version 3", RFC 3376, October 2002.

[RFC4601] Fenner, B., et al., "Protocol Independent Multicast - Sparse Mode (PIM-SM): Protocol Specification (Revised)", RFC 4601, August 2006.

[RFC5015] Handley, M., et al., "Bidirectional Protocol Independent Multicast (BIDIR-PIM)", RFC 5015, October 2007.

[ISEXT] Yong, L., et al., "IS-IS Extension For Building Distribution Tree", draft-yong-isis-ext-4-distribution-tree, work in progress.

[802.1AX-2011] IEEE, "IEEE Standard for Local and metropolitan area networks - Link Aggregation", IEEE 802.1AX, 2011.

8.2. Informative References

[RFC7176] Eastlake 3rd, D., et al., "Transparent Interconnection of Lots of Links (TRILL) Use of IS-IS", RFC 7176, May 2014.

[RFC7348] Mahalingam, M., Dutt, D., et al., "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, August 2014.

[RFC7365] Lasserre, M., et al., "Framework for Data Center (DC) Network Virtualization", RFC 7365, October 2014.

Authors' Addresses

Lucy Yong
Huawei
USA
Phone: 918-808-1918
Email: lucy.yong@huawei.com

Weiguo Hao
Huawei Technologies
101 Software Avenue, Nanjing 210012
China
Phone: +86-25-56623144
Email: haoweiguo@huawei.com

Donald Eastlake
Huawei
155 Beaver Street
Milford, MA 01757 USA
Phone: +1-508-333-2270
EMail: d3e3e3@gmail.com

Andrew Qu
MediaTek
San Jose, CA 95134 USA
Email: laodulaodu@gmail.com

Jon Hudson
Brocade
130 Holger Way
San Jose, CA 95134 USA
Phone: +1-408-333-4062
Email: jon.hudson@gmail.com

Uma Chunduri
Ericsson Inc.
300 Holger Way, San Jose, California 95134 USA
Phone: 408 750-5678
Email: uma.chunduri@ericsson.com