Internet Engineering Task Force R. Perlman INTERNET DRAFT Sun Microsystems C-Y Lee Nortel A. Ballardie Research Consultant J. Crowcroft UCL August 1998 A Design for Simple, Low-Overhead Multicast Status of This Memo This document is an Internet Draft, Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts. Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material, or to cite them other than as a ``working draft'' or ``work in progress.'' Please check the I-D abstract listing contained in each Internet Draft directory to learn the current status of this or any other Internet Draft. Abstract This paper describes how a lot of the complexity and overhead of multicast can go away if a slightly different approach is taken. Approaches like this have been proposed in the past, but the design has not been carried through to completion. This paper describes the approach and then compares it with other approaches. 1.0 Introduction The basic idea is that a multicast group is created by generating: - a name, suitable for lookup by humans, - a multicast address (i.e., a class D IP multicast address) Expires February 1999 [Page 1] Internet Draft A Design for Simple Multicast August 1998 - a distinguished node known as the "core" Endnodes look up the group through some sort of directory, e.g., DNS or SDR. The look-up is based on name, but the information retrieved is the multicast address and the core address. Endnodes then join the group by sending a special join message towards the core, creating state in the routers along the path. The result is a tree of shortest paths from the core to each member, where only the routers along the tree need to have any state about the group. 1.1 Very simplified Review of PIM-SM and BGMP Originally there was the idea, presented in CBT, of forming a multicast group by choosing a distinguished node, the "core", and having all members join by sending special join messages towards the core. The routers along the path keep state about which ports are in the group. If a router along the path of the join already has state about that group the join does not proceed further. Instead that router just "grafts" the new limb onto the tree. The result is a tree of shortest paths from the core, with only the routers along the path knowing anything about that group. Later modifications included: - "optimizing" the tree by forming a tree for each sender. The way this is done, each node could independently decide whether the volume of traffic from a particular source was worth joining a per-source tree. The result was that there were two possible trees for traffic from a particular source for group M, the shared tree and the source tree. To prevent loops, the shared tree had to be unidirectional, i.e., to send to the shared tree, you'd have to unicast to the core. - the assumption is that routers have to be able to figure out, solely based on the multicast address, who the core was. This resulted in a protocol whereby "core-capable" routers continuously advertise themselves, all routers keep track of the current set of live core-capable routers, and there is some sort of hash to map a multicast address to one of the set of core-capable routers. This advertisement protocol is confined to within a domain. - Because the core-capable advertisements (mercifully) do not travel through the entire Internet, but are instead confined to a domain, there needed to be a protocol whereby routers could know the mapping of multicast address to core, even in other domains. The proposal (BGMP) was to have each domain acquire (somehow, a current area for research) a set of addresses. These addresses would be advertised through the interdomain routing protocol. Therefore they had to be allocated in blocks so as to attempt not to place too much of a burden on the interdomain routing protocol. Expires February 1999 [Page 2] Internet Draft A Design for Simple Multicast August 1998 We use the term "aggregatable" to mean that the routing protocol can summarize large numbers of addresses in a small amount of space. 2.0 The Design The design we propose is most similar in spirit to the original idea of CBT, because it does not have: - per source trees - the necessity for routers to know the mapping between multicast addresses and cores 2.1 Address Allocation The only constraint on the multicast addresses in our proposal is that they be unique in time and space. There is no need for the addresses to be aggregatable, since they will not be advertised in routing protocols. It is far simpler to allocate addresses, and you can do with a smaller number of them, if there are no further constraints on the addresses than that they be unique in time (and space). The way to allocate addresses is to have a bunch of address allocation servers sprinkled throughout the Internet, each with a block of addresses. Anyone that wants to form a group finds any one of the address allocation servers, asks for an address and gives an amount of time, say 1 day. After 1 day (plus handwave for clock skew, etc.,) that address can be reallocated to someone else. There can be a hierarchy of these servers. At the top, there might be 10 of them, each with 1/10 of the address space. Anyone could ask one of them for a single address, or for a block of addresses. If you want to be a server handing out addresses, you can request a block of addresses, and then clients can request individual addresses from you. If you run out, you can ask for more. Having a hierarchy with lots of servers, rather than a few servers with millions of addresses each, has the following advantages: - no performance bottlenecks - it's likely a server will be nearby - servers don't have to keep track of millions of individual addresses, each one with a different timer as to when it can be re- allocated. Aggregation wastes addresses, since if you don't overestimate, you will run out and have to acquire more blocks, which will cause fragmentation of the space (more intervals to have to advertise). Since in our scheme the addresses will not get advertised in any routing protocol, there is no necessity to have them have any Expires February 1999 [Page 3] Internet Draft A Design for Simple Multicast August 1998 topological significance or any other constraint that would waste the addresses. Therefore there will effectively be more addresses, and less likelihood they will "run out". If we are worried about running out even though in our proposal they do not need to be aggregatable, then we could reserve some to be limited in scope, say within a domain. If you wanted to create a group where you know that all members are in that domain, you'd get one of the "local" addresses. Since the same block of local addresses can be used simultaneously within different domains, this makes a lot more addresses. Routers at the domain boundary should set up filters to make sure that packets with local addresses don't "leak" out. This concept is known as "administratively scoped addresses". 2.2 Creating the Multicast Group There should be a "multicast group creation" utility. To create a group you pick a name that humans will recognize the group by. You input that into the multicast group creation utility, along with information about how long the group will live, and a logical choice for the core address. If no core is specified, the default is to have the node that created the group be the core. The utility finds an address allocation server, gets an address, and then registers the group and core address into the directory. 2.3 Joining a Group To join a group, you browse the directory to find the appropriate name, and then tell a "multicast join" utility that you'd like to join that group. The utility looks up the address and core address. Then it sends a special control packet, known as a "join" message, from your node. This message gets sent in the same direction as a unicast message to the core address would be sent, but it creates state in the routers along the path, to know which ports are in the group. If a router receives a join for multicast address M, and it already has state for M, then it merely adds that port to its set of ports for M, and does not forward the join further. The result is a tree of shortest paths from the core to each member. Each router on the tree has a database of (M, {ports}) that tells it, for group M, which ports are in the tree. 2.4 Transmitting to multicast group M Expires February 1999 [Page 4] Internet Draft A Design for Simple Multicast August 1998 If you are a member of the group, you simply transmit an IP packet with destination address M. A router that receives a packet with multicast address M checks its multicast database. If it knows about M, it checks if the port it received it on is in its database. If not, it drops the packet. If so, it forwards the packet onto all the other ports listed in its database for M. The result is a bi- directional tree. If you are not a member of the group but want to transmit to the group, you unicast to the core. 2.5 Interdomain Groups Everything described above works just as well in the interdomain case. You join a group by sending a join message to the core address. No matter what domain the core is in, it has an IP address and IP unicast routing already knows how to reach it. 2.6 Backbones There might be concern about a backbone ISP that might need to keep state about billions of groups. It is extremely unlikely that so many groups would exist, especially over such a wide geographic area. However, if there really is that concern, the solution is tunnels between boundary routers in the backbone. A tunnel can be thought of as a point-to-point link between the two routers. Traffic is sent across the tunnel by adding an additional IP header, with the destination address in the outer header being the address of the other tunnel endpoint. The simplest strategy is to assume a full mesh of tunnels between every pair of boundary routers in that backbone. So for instance, if there are 100 boundary routers for that backbone, each would maintain 99 tunnels, to each of the 99 other backbone boundary routers. Each of those are treated as a "port". When receiving a join, backbone boundary router R1 checks its routing database to see which backbone boundary router leads to the core address, say R7. It then forwards the join along the tunnel it maintains between itself and R7. In its multicast group state, it makes an entry for the multicast address M, and the set of ports (including tunnel) included in the tree for M. In this way the interior nodes of the backbone do not keep any state about individual groups. It is not necessary to maintain all the tunnels. They can be formed on an as-needed basis. In fact, there is no reason for a tunnel to be Expires February 1999 [Page 5] Internet Draft A Design for Simple Multicast August 1998 "maintained" at all. If R1 determines that multicast M should be forwarded on a tunnel to R7, all it has to do is remember "R7", and when it receives a packet with destination address M, it tunnels it to R7. 2.7 Per-Source Trees Don't bother. The most logical thing to optimize is the overhead to the network to deliver a packet through the tree. The ideal tree is a minimum weight spanning tree, but no proposals have attempted to try to calculate a minimum weight spanning tree. It is most likely too difficult to do so, though a simple algorithm for creating a minimum weight or near-minimum weight tree would certainly be a welcome advance in IP multicast. A per source tree does not use less overhead to deliver the message than the tree rooted at the core. The only thing more optimal about a source tree than the shared core tree is the delay from the time the source transmits to the time everyone receives the message. It is unlikely that an application that is so delay sensitive that it would work with a per-source tree but not with the shared core tree would work at all if there are members located any significant distance away from the source. If there were an application that happens to be in a topology where all members are close enough if per-source trees are used, but the members are not close enough if the shared tree is used, then it is certainly possible to explicitly form multiple trees in that case. But this is an extremely rare scenario, and we should not require the network to keep n times as much state for all groups, with a mind- bogglingly complex protocol for switching from shared to per-source trees, for the convenience of the very rare case. 2.8 Detecting Failure of the tree This can be done with a simple scheme of keep-alive timers sent when there has been no traffic for some amount of time. That way failures can be detected and corrected when they occur. If a router stops receiving keep-alives from the port away from the core, it removes that port from the tree. If it stops receiving keep-alives from the port towards the core, it rejoins the tree. If a member stops receiving keep-alives, it rejoins the tree. 2.9 Detecting failures of the core Ideally, the node selected as the core should be robust or a redundant server may be installed as the hardware backup core. Expires February 1999 [Page 6] Internet Draft A Design for Simple Multicast August 1998 To deal with core failures there is no one best way because it depends on engineering tradeoffs: - speed of recovery after a core failure - state in routers - bandwidth use To give an example of tradeoffs, consider an extreme example of an application for which every packet must be delivered without delay. Such an application cannot tolerate waiting for unicast routing to notice a link failure and reroute, or react to lack of receipt of keep-alive messages. In this case, the ideal solution is to consciously create multiple groups and transmit every message on all of the trees. This uses a lot of bandwidth since all data is sent on multiple trees, but there is no other way to accommodate an application that must have that degree of availability. For applications that can tolerate some amount of time after failure to rebuild a new tree, there can be a backup group created, and nodes would join the backup group if the core for the original group is unreachable. In the case where it is reasonable to have state in the routers, the backup group could be formed proactively, but not used for data unless the first group fails. The routers would have to maintain state about two groups, and perhaps incur a modest bandwidth overhead for keepalives to maintain the health of the backup tree, but there would be no need to send all data on both trees as it would in the first example application. Decisions about whether to create multiple groups proactively can be made on a per-application basis. 3.0 FAQs and answers 3.1 What if the core is an endnode? Endnodes can't forward packets. A tree is a tree. It doesn't matter who the core is. Once the tree is formed, the traffic pattern is the same no matter which node is the core. If an endnode has only a single link to the network, it will never forward traffic. If it receives traffic on that link, it does not have any "other" links to forward the traffic to, so the single- link core just receives or generates multicast like any other endnode would do. 3.2 How should the core be chosen for an optimal tree? Expires February 1999 [Page 7] Internet Draft A Design for Simple Multicast August 1998 If the core is a member of the group, the tree will be reasonably optimal, and certainly from the viewpoint of how expensive it will be for the network to deliver the packet, the core tree is as likely to be good as any per-source tree. 3.3 IP is already deployed in the endnodes. How can we change all those kernels? There is no need to change the kernels. This can be deployed as an application layer process which sends the special IP packet which is the join message. There is no need to modify IGMP. As a matter of fact, IGMP is not needed for this proposal. IGMP is still useful, as is, for dense mode multicast, and this proposal does not require removing IGMP or modifying IGMP. So there are no difficult kernel modifications to the IP stack as a result of this proposal. 3.4 Won't BGMP allow policy control, somehow? There is the belief that since BGP allows routing policies, that BGMP will somehow allow policies, though what exactly it is supposed to be doing and how it would do it is not exactly well thought out at this point. We claim that each border router can be individually configured with policies which it can individually enforce, and accomplish anything that might have been accomplished with BGMP. This can be done with the same sorts of packet filters that are already done in firewalls. 4.0 Security/Policy considerations This section discusses various problems we might be trying to solve, and proposed solutions. 4.1 Hiding data from unauthorized receivers A motivation, perhaps, is to limit delivery to those receivers that have "paid their bill". One method of doing this is to try to ensure that the routing infrastructure does not deliver data to unauthorized receivers. This is not the right way to do this. The right way to hide data from the unauthorized receivers is to encrypt the data, making sure that only authorized receivers are given the key. Then there is no reason to have the routing infrastructure attempt to prevent unauthorized nodes from receiving the transmitted data, since it will do them no good (it will be encrypted). Expires February 1999 [Page 8] Internet Draft A Design for Simple Multicast August 1998 There has recently been work on key distribution for multicast, and there are schemes for efficiently changing the shared group key periodically, or when a new member joins (if it's not allowed to see previous data), or when a member leaves (if it is not allowed to see data after it leaves). These schemes are interesting, but rather than describing them here, for the purpose of this paper we can assume that distributing a shared secret key to all authorized receivers is a solved problem. The basic idea is that there is a group moderator that a member has to check in with, first authenticating and then proving somehow that he is authorized to join the group. Then the group moderator gives that new member the key. Making key changes very efficient is accomplished, in the recent work on multicast key distribution, by having a hierarchy of keys, and giving each member log(n) keys, so that changing the key only involves log(n) work rather than having to individually contact each member and give them a new group key. 4.2 Preventing unauthorized transmitters from cluttering the group with data The basic solution is to have authorized transmitters "sign" the packet (need not be a public key signature), and have various filter points reject unauthorized packets. There are three possible scenarios: a) all authorized group members are trusted to transmit, or at least not to transmit when they should not. In this case, the single shared group secret key will suffice. Each packet can include a MIC (message integrity code), for instance, a keyed MD with the group secret. Anyone that knows the group secret can verify the MIC and discard the packet. If routers don't check the MIC, then receivers will receive the spam, but will be able to recognize it as bogus and discard it. If some selected routers are given the shared key (say the firewall at the entrance to your domain), then it can discard bogus packets before they clutter the bandwidth of your domain. b) there are "receiver-only" members that are not trusted to refrain from transmitting, but they need to be able to recognize bogus packets injected by unauthorized transmitters In this case, we can't base the MIC on a secret key known to the receivers. Instead we'd use public key technology. All the members, Expires February 1999 [Page 9] Internet Draft A Design for Simple Multicast August 1998 and any routers that will be doing filtering, are given the "authorized transmitter public key". All authorized transmitters are given the authorized transmitter private key. They digitally sign packets, and receivers and selected routers can recognize and discard bogus packets. c) there are authorized transmitters, but you don't want them to impersonate each other In this case, we can't have them all use the same private key. Instead, we'd still have a single group public key that would need to be distributed to all members and selected routers, but each authorized transmitter would use an individual private key to sign packets. There would be a group moderator that would know the group private key. It would authorize a node to be a transmitter by signing a certificate authorizing that transmitter's key to transmit packets to this group. In all of these cases, packets can be filtered, either at the receivers, or earlier. If routers do not have the bandwidth to check every packet for digital signatures, then they can "spot check", and only start paying a lot of attention to a particular multicast group's messages if it starts seeing bogus packets. Also, if there are multiple routers they can share the load, and there are several variants of this: - each router checks a random percentage of packets. In this way, most bogus packets will have been discarded before they reach the receivers - the routers along the path coordinate as to which packets each will handle. For instance, there can be a simple hash function, and each of the k routers on the path takes the ones that hash to one of the values mod k. - at the entrance to the domain, a bit in the packet is flipped indicating the packet has not been verified. Then any router in the domain that has spare cycles can check packets that have the bit set, and either discard the packet (if it's bogus) or clear the bit so other routers (and receivers) need not waste cycles testing the packet. (assuming we're trying to protect ourselves from people outside our domain, so we trust all the nodes within the domain). 5.0 Overhead Compared to other schemes Address allocation is much simpler. There is no need for addresses to be aggregatable. There is no reason to have BGMP to pass multicast Expires February 1999 [Page 10] Internet Draft A Design for Simple Multicast August 1998 address blocks around in the interdomain protocol. The only routing is based on the core address, which is a unicast IP address, which IP already needs to be able to route towards. Passing around multicast address ranges through BGMP uses bandwidth, CPU, and memory in the routers, not to mention pages of specifications, implementation effort, and more things to read and understand. With our proposal none of that is necessary. There is no reason, as in PIM-SM, to have core-capable routers advertise themselves, and to have some sort of hash scheme so that every router can determine, based on the multicast address, which of the core-capable routers would be the core for that multicast address. It is expensive to have all core-capable routers advertising all the time. It is very complex to have an algorithm whereby a router from that set is chosen in an identical manner by all routers, and to hope that the same core will be chosen at all times by all routers even if the set of core capable routers is different when one makes a decision vs. another router making the decision. In PIM-SM, the core is basically a router chosen at random, and there is no reason to believe it will be conveniently located near the other members of the group. So therefore the shared tree is likely to be very suboptimal, leading to the desire for per-source trees. In this scheme, since the core will be a member of the group or a node consciously chosen by whoever was creating the group, the shared tree is likely to be as good a tree as any of the per-source trees, certainly according to the metric of total cost to the network of delivering the data. Using a single tree per group saves the network k times as much state, assuming on the average, k transmitters per group. Another possible scheme is Dave Cheriton's Express model, where every group is a combination of multicast address and source address. The multicast address, in other words, is unique to the source. Then address allocation becomes even simpler than in our scheme. However, the problem with this model is that for something like a conference call every member needs to know all the possible transmitters so they can join a separate group for each possible transmitter. If a new member joins the group, somehow all the other members have to be alerted in order to join the new tree. Also, this model requires the network to keep k times as much state, i.e., a separate tree for every possible transmitter. 6.0 Summary Although this proposal is similar to CBT and PIM-SM, it differs because: Expires February 1999 [Page 11] Internet Draft A Design for Simple Multicast August 1998 - routers do not need to figure out which core goes with which multicast address - the core is a member of the group, or explicitly chosen for that group, so the shared tree will be fairly good - use a single shared tree per group, though extra groups can be formed in the rare cases where a shared tree is not good enough (i.e., don't bother with dynamically formed per-source trees) - the shared tree can be bi-directional (and therefore more efficient) - no need for a protocol for core-capable routers to advertise - no need for an algorithm that will hash a multicast address to the choice among the set of core-capable routers - no need to pass multicast address ranges in the interdomain routing protocol 7. Acknowledgments We would like to thank Brad Cain and Ross Callon for their comments. Tony Ballardie would like to thank British Telecom Plc, and 3Com Corporation for assisting with funding his work. References [STATIC_MCAST] M. Ohta, J. Crowcroft Static Multicast, Internet-Draft, March 1998 [DNS_RP] DNS Based RP Placement scheme Dino Farinacci's presentation in the MBONED WG, 40th IETF Meeting [Cheriton] IDMR Mailing List discussion [IGMP] Cain, Deering, Thyagarajan Internet Group Management Protocol, Version 3, Internet-Draft, November 1998 [CBT] Ballardie, Cain, Zhang, Core Based Tree Multicast Routing, Internet-Draft, March 1998 [PIMSM] Estrin, Farinacci, Helmy, Thaler, Deering, Handley, Jacobson, Liu, Sharma, and Wei. Protocol independent multicast-sparse mode (PIM- SM) Specification, RFC-2117, June 1997 [BGMP] Thaler, Estrin, Meyers Expires February 1999 [Page 12] Internet Draft A Design for Simple Multicast August 1998 Border Gateway Multicast Protocol Specification, Internet-Draft, March 1998 [MASC] Estrin, Handley, Kumar, Thaler Multicast Address Set Clain Protocol Internet-Draft, November 1997 Authors' Addresses Radia Perlman Sun Microsystems Laboratories 2 Elizabeth Drive Chelmsford, MA 01824 Radia.Perlman@sun.com Cheng-Yin Lee Nortel (Northern Telecom), Ltd. PO Box 3511, Station C Ottawa, ON K1Y 4H7, Canada leecy@nortel.ca Tony Ballardie Research Consultant aballardie@acm.org Department of Computer Science University College London Gower Street London, WC1E 6BT UK J.Crowcroft@cs.ucl.ac.uk Expires February 1999 [Page 13]