Internet Engineering Task Force                     R. Perlman
INTERNET DRAFT                                      Sun Microsystems
                                                    C-Y Lee
                                                    Nortel
                                                    A. Ballardie
                                                    Research Consultant
                                                    J. Crowcroft
                                                    UCL
                                                    August 1998


              A Design for Simple, Low-Overhead Multicast
                <draft-perlman-simple-multicast-00.txt>


Status of This Memo

   This document is an Internet Draft, Internet Drafts are working
   documents of the Internet Engineering Task Force (IETF), its Areas,
   and its Working Groups.  Note that other groups may also distribute
   working documents as Internet Drafts.

   Internet Drafts are draft documents valid for a maximum of six
   months.  Internet Drafts may be updated, replaced, or obsoleted by
   other documents at any time.  It is not appropriate to use Internet
   Drafts as reference material, or to cite them other than as a
   ``working draft'' or ``work in progress.''

   Please check the I-D abstract listing contained in each Internet
   Draft directory to learn the current status of this or any other
   Internet Draft.


Abstract

   This paper describes how a lot of the complexity and overhead of
   multicast can go away if a slightly different approach is taken.
   Approaches like this have been proposed in the past, but the design
   has not been carried through to completion. This paper describes the
   approach and then compares it with other approaches.

1.0 Introduction

   The basic idea is that a multicast group is created by generating:

     - a name, suitable for lookup by humans,

     - a multicast address (i.e., a class D IP multicast address)


Expires February 1999                                           [Page 1]


Internet Draft       A Design for Simple Multicast           August 1998


     - a distinguished node known as the "core"

   Endnodes look up the group through some sort of directory, e.g., DNS
   or SDR. The look-up is based on name, but the information retrieved
   is the multicast address and the core address.

   Endnodes then join the group by sending a special join message
   towards the core, creating state in the routers along the path.  The
   result is a tree of shortest paths from the core to each member,
   where only the routers along the tree need to have any state about
   the group.

1.1 Very simplified Review of PIM-SM and BGMP

   Originally there was the idea, presented in CBT, of forming a
   multicast group by choosing a distinguished node, the "core", and
   having all members join by sending special join messages towards the
   core. The routers along the path keep state about which ports are in
   the group. If a router along the path of the join already has state
   about that group the join does not proceed further. Instead that
   router just "grafts" the new limb onto the tree. The result is a tree
   of shortest paths from the core, with only the routers along the path
   knowing anything about that group.

   Later modifications included:
     - "optimizing" the tree by forming a tree for each sender. The way
   this is done, each node could independently decide whether the volume
   of traffic from a particular source was worth joining a per-source
   tree. The result was that there were two possible trees for traffic
   from a particular source for group M, the shared tree and the source
   tree. To prevent loops, the shared tree had to be unidirectional,
   i.e., to send to the shared tree, you'd have to unicast to the core.
     -  the assumption is that routers have to be able to figure out,
   solely based on the multicast address, who the core was. This
   resulted in a protocol whereby "core-capable" routers continuously
   advertise themselves, all routers keep track of the current set of
   live core-capable routers, and there is some sort of hash to map a
   multicast address to one of the set of core-capable routers. This
   advertisement protocol is confined to within a domain.
     -  Because the core-capable advertisements (mercifully) do not
   travel through the entire Internet, but are instead confined to a
   domain, there needed to be a protocol whereby routers could know the
   mapping of multicast address to core, even in other domains. The
   proposal (BGMP) was to have each domain acquire (somehow, a current
   area for research) a set of addresses. These addresses would be
   advertised through the interdomain routing protocol. Therefore they
   had to be allocated in blocks so as to attempt not to place too much
   of a burden on the interdomain routing protocol.


Expires February 1999                                           [Page 2]


Internet Draft       A Design for Simple Multicast           August 1998


   We use the term "aggregatable" to mean that the routing protocol can
   summarize large numbers of addresses in a small amount of space.

2.0 The Design

   The design we propose is most similar in spirit to the original idea
   of CBT, because it does not have:
     - per source trees
     - the necessity for routers to know the mapping between multicast
   addresses and cores

2.1 Address Allocation

   The only constraint on the multicast addresses in our proposal is
   that they be unique in time and space. There is no need for the
   addresses to be aggregatable, since they will not be advertised in
   routing protocols. It is far simpler to allocate addresses, and you
   can do with a smaller number of them, if there are no further
   constraints on the addresses than that they be unique in time (and
   space).

   The way to allocate addresses is to have a bunch of address
   allocation servers sprinkled throughout the Internet, each with a
   block of addresses. Anyone that wants to form a group finds any one
   of the address allocation servers, asks for an address and gives an
   amount of time, say 1 day. After 1 day (plus handwave for clock skew,
   etc.,) that address can be reallocated to someone else.

   There can be a hierarchy of these servers. At the top, there might be
   10 of them, each with 1/10 of the address space. Anyone could ask one
   of them for a single address, or for a block of addresses.  If you
   want to be a server handing out addresses, you can request a block of
   addresses, and then clients can request individual addresses from
   you. If you run out, you can ask for more.

   Having a hierarchy with lots of servers, rather than a few servers
   with millions of addresses each, has the following advantages:
     - no performance bottlenecks
     - it's likely a server will be nearby
     -  servers don't have to keep track of millions of individual
   addresses, each one with a different timer as to when it can be re-
   allocated.

   Aggregation wastes addresses, since if you don't overestimate, you
   will run out and have to acquire more blocks, which will cause
   fragmentation of the space (more intervals to have to advertise).
   Since in our scheme the addresses will not get advertised in any
   routing protocol, there is no necessity to have them have any


Expires February 1999                                           [Page 3]


Internet Draft       A Design for Simple Multicast           August 1998


   topological significance or any other constraint that would waste the
   addresses. Therefore there will effectively be more addresses, and
   less likelihood they will "run out".

   If we are worried about running out even though in our proposal they
   do not need to be aggregatable, then we could reserve some to be
   limited in scope, say within a domain.  If you wanted to create a
   group where you know that all members are in that domain, you'd get
   one of the "local" addresses. Since the same block of local addresses
   can be used simultaneously within different domains, this makes a lot
   more addresses. Routers at the domain boundary should set up filters
   to make sure that packets with local addresses don't "leak" out. This
   concept is known as "administratively scoped addresses".


2.2 Creating the Multicast Group

   There should be a "multicast group creation" utility.

   To create a group you pick a name that humans will recognize the
   group by.  You input that into the multicast group creation utility,
   along with information about how long the group will live, and a
   logical choice for the core address. If no core is specified, the
   default is to have the node that created the group be the core. The
   utility finds an address allocation server, gets an address, and then
   registers the group and core address into the directory.

2.3 Joining a Group

   To join a group, you browse the directory to find the appropriate
   name, and then tell a "multicast join" utility that you'd like to
   join that group.  The utility looks up the address and core address.

   Then it sends a special control packet, known as a "join" message,
   from your node. This message gets sent in the same direction as a
   unicast message to the core address would be sent, but it creates
   state in the routers along the path, to know which ports are in the
   group. If a router receives a join for multicast address M, and it
   already has state for M, then it merely adds that port to its set of
   ports for M, and does not forward the join further.

   The result is a tree of shortest paths from the core to each member.
   Each router on the tree has a database of (M, {ports}) that tells it,
   for group M, which ports are in the tree.


2.4 Transmitting to multicast group M


Expires February 1999                                           [Page 4]


Internet Draft       A Design for Simple Multicast           August 1998


   If you are a member of the group, you simply transmit an IP packet
   with destination address M. A router that receives a packet with
   multicast address M checks its multicast database. If it knows about
   M, it checks if the port it received it on is in its database. If
   not, it drops the packet. If so, it forwards the packet onto all the
   other ports listed in its database for M. The result is a bi-
   directional tree.

   If you are not a member of the group but want to transmit to the
   group, you unicast to the core.

2.5 Interdomain Groups

   Everything described above works just as well in the interdomain
   case.  You join a group by sending a join message to the core
   address. No matter what domain the core is in, it has an IP address
   and IP unicast routing already knows how to reach it.

2.6 Backbones

   There might be concern about a backbone ISP that might need to keep
   state about billions of groups. It is extremely unlikely that so many
   groups would exist, especially over such a wide geographic area.
   However, if there really is that concern, the solution is tunnels
   between boundary routers in the backbone. A tunnel can be thought of
   as a point-to-point link between the two routers. Traffic is sent
   across the tunnel by adding an additional IP header, with the
   destination address in the outer header being the address of the
   other tunnel endpoint.

   The simplest strategy is to assume a full mesh of tunnels between
   every pair of boundary routers in that backbone.  So for instance, if
   there are 100 boundary routers for that backbone, each would maintain
   99 tunnels, to each of the 99 other backbone boundary routers. Each
   of those are treated as a "port".

   When receiving a join, backbone boundary router R1 checks its routing
   database to see which backbone boundary router leads to the core
   address, say R7. It then forwards the join along the tunnel it
   maintains between itself and R7. In its multicast group state, it
   makes an entry for the multicast address M, and the set of ports
   (including tunnel) included in the tree for M.

   In this way the interior nodes of the backbone do not keep any state
   about individual groups.

   It is not necessary to maintain all the tunnels. They can be formed
   on an as-needed basis. In fact, there is no reason for a tunnel to be


Expires February 1999                                           [Page 5]


Internet Draft       A Design for Simple Multicast           August 1998


   "maintained" at all. If R1 determines that multicast M should be
   forwarded on a tunnel to R7, all it has to do is remember "R7", and
   when it receives a packet with destination address M, it tunnels it
   to R7.

2.7 Per-Source Trees

   Don't bother. The most logical thing to optimize is the overhead to
   the network to deliver a packet through the tree. The ideal tree is a
   minimum weight spanning tree, but no proposals have attempted to try
   to calculate a minimum weight spanning tree. It is most likely too
   difficult to do so, though a simple algorithm for creating a minimum
   weight or near-minimum weight tree would certainly be a welcome
   advance in IP multicast.

   A per source tree does not use less overhead to deliver the message
   than the tree rooted at the core. The only thing more optimal about a
   source tree than the shared core tree is the delay from the time the
   source transmits to the time everyone receives the message. It is
   unlikely that an application that is so delay sensitive that it would
   work with a per-source tree but not with the shared core tree would
   work at all if there are members located any significant distance
   away from the source.

   If there were an application that happens to be in a topology where
   all members are close enough if per-source trees are used, but the
   members are not close enough if the shared tree is used, then it is
   certainly possible to explicitly form multiple trees in that case.
   But this is an extremely rare scenario, and we should not require the
   network to keep n times as much state for all groups, with a mind-
   bogglingly complex protocol for switching from shared to per-source
   trees, for the convenience of the very rare case.

2.8 Detecting Failure of the tree

   This can be done with a simple scheme of keep-alive timers sent when
   there has been no traffic for some amount of time. That way failures
   can be detected and corrected when they occur.

   If a router stops receiving keep-alives from the port away from the
   core, it removes that port from the tree. If it stops receiving
   keep-alives from the port towards the core, it rejoins the tree.  If
   a member stops receiving keep-alives, it rejoins the tree.

2.9 Detecting failures of the core

   Ideally, the node selected as the core should be robust or a
   redundant server may be installed as the hardware backup core.


Expires February 1999                                           [Page 6]


Internet Draft       A Design for Simple Multicast           August 1998


   To deal with core failures there is no one best way because it
   depends on engineering tradeoffs:

    - speed of recovery after a core failure

    - state in routers

    - bandwidth use

   To give an example of tradeoffs, consider an extreme example of an
   application for which every packet must be delivered without delay.
   Such an application cannot tolerate waiting for unicast routing to
   notice a link failure and reroute, or react to lack of receipt of
   keep-alive messages. In this case, the ideal solution is to
   consciously create multiple groups and transmit every message on all
   of the trees. This uses a lot of bandwidth since all data is sent on
   multiple trees, but there is no other way to accommodate an
   application that must have that degree of availability.

   For applications that can tolerate some amount of time after failure
   to rebuild a new tree, there can be a backup group created, and nodes
   would join the backup group if the core for the original group is
   unreachable.

   In the case where it is reasonable to have state in the routers, the
   backup group could be formed proactively, but not used for data
   unless the first group fails. The routers would have to maintain
   state about two groups, and perhaps incur a modest bandwidth overhead
   for keepalives to maintain the health of the backup tree, but there
   would be no need to send all data on both trees as it would in the
   first example application.

   Decisions about whether to create multiple groups proactively can be
   made on a per-application basis.

3.0 FAQs and answers

3.1 What if the core is an endnode? Endnodes can't forward packets.

   A tree is a tree. It doesn't matter who the core is. Once the tree is
   formed, the traffic pattern is the same no matter which node is the
   core. If an endnode has only a single link to the network, it will
   never forward traffic. If it receives traffic on that link, it does
   not have any "other" links to forward the traffic to, so the single-
   link core just receives or generates multicast like any other endnode
   would do.

3.2 How should the core be chosen for an optimal tree?


Expires February 1999                                           [Page 7]


Internet Draft       A Design for Simple Multicast           August 1998


   If the core is a member of the group, the tree will be reasonably
   optimal, and certainly from the viewpoint of how expensive it will be
   for the network to deliver the packet, the core tree is as likely to
   be good as any per-source tree.

3.3 IP is already deployed in the endnodes. How can we change all those
kernels?

   There is no need to change the kernels. This can be deployed as an
   application layer process which sends the special IP packet which is
   the join message. There is no need to modify IGMP. As a matter of
   fact, IGMP is not needed for this proposal.

   IGMP is still useful, as is, for dense mode multicast, and this
   proposal does not require removing IGMP or modifying IGMP. So there
   are no difficult kernel modifications to the IP stack as a result of
   this proposal.

3.4 Won't BGMP allow policy control, somehow?

   There is the belief that since BGP allows routing policies, that BGMP
   will somehow allow policies, though what exactly it is supposed to be
   doing and how it would do it is not exactly well thought out at this
   point.

   We claim that each border router can be individually configured with
   policies which it can individually enforce, and accomplish anything
   that might have been accomplished with BGMP. This can be done with
   the same sorts of packet filters that are already done in firewalls.

4.0 Security/Policy considerations

   This section discusses various problems we might be trying to solve,
   and proposed solutions.

4.1 Hiding data from unauthorized receivers

   A motivation, perhaps, is to limit delivery to those receivers that
   have "paid their bill".

   One method of doing this is to try to ensure that the routing
   infrastructure does not deliver data to unauthorized receivers. This
   is not the right way to do this. The right way to hide data from the
   unauthorized receivers is to encrypt the data, making sure that only
   authorized receivers are given the key. Then there is no reason to
   have the routing infrastructure attempt to prevent unauthorized nodes
   from receiving the transmitted data, since it will do them no good
   (it will be encrypted).


Expires February 1999                                           [Page 8]


Internet Draft       A Design for Simple Multicast           August 1998


   There has recently been work on key distribution for multicast, and
   there are schemes for efficiently changing the shared group key
   periodically, or when a new member joins (if it's not allowed to see
   previous data), or when a member leaves (if it is not allowed to see
   data after it leaves). These schemes are interesting, but rather than
   describing them here, for the purpose of this paper we can assume
   that distributing a shared secret key to all authorized receivers is
   a solved problem.

   The basic idea is that there is a group moderator that a member has
   to check in with, first authenticating and then proving somehow that
   he is authorized to join the group. Then the group moderator gives
   that new member the key.

   Making key changes very efficient is accomplished, in the recent work
   on multicast key distribution, by having a hierarchy of keys, and
   giving each member log(n) keys, so that changing the key only
   involves log(n) work rather than having to individually contact each
   member and give them a new group key.

4.2 Preventing unauthorized transmitters from cluttering the group with
data

   The basic solution is to have authorized transmitters "sign" the
   packet (need not be a public key signature), and have various filter
   points reject unauthorized packets.

   There are three possible scenarios:

   a) all authorized group members are trusted to transmit, or at least
   not to transmit when they should not.

   In this case, the single shared group secret key will suffice. Each
   packet can include a MIC (message integrity code), for instance, a
   keyed MD with the group secret. Anyone that knows the group secret
   can verify the MIC and discard the packet. If routers don't check the
   MIC, then receivers will receive the spam, but will be able to
   recognize it as bogus and discard it.  If some selected routers are
   given the shared key (say the firewall at the entrance to your
   domain), then it can discard bogus packets before they clutter the
   bandwidth of your domain.

   b) there are "receiver-only" members that are not trusted to refrain
   from transmitting, but they need to be able to recognize bogus
   packets injected by unauthorized transmitters

   In this case, we can't base the MIC on a secret key known to the
   receivers.  Instead we'd use public key technology. All the members,


Expires February 1999                                           [Page 9]


Internet Draft       A Design for Simple Multicast           August 1998


   and any routers that will be doing filtering, are given the
   "authorized transmitter public key". All authorized transmitters are
   given the authorized transmitter private key. They digitally sign
   packets, and receivers and selected routers can recognize and discard
   bogus packets.

   c) there are authorized transmitters, but you don't want them to
   impersonate each other

   In this case, we can't have them all use the same private key.
   Instead, we'd still have a single group public key that would need to
   be distributed to all members and selected routers, but each
   authorized transmitter would use an individual private key to sign
   packets. There would be a group moderator that would know the group
   private key. It would authorize a node to be a transmitter by signing
   a certificate authorizing that transmitter's key to transmit packets
   to this group.


   In all of these cases, packets can be filtered, either at the
   receivers, or earlier. If routers do not have the bandwidth to check
   every packet for digital signatures, then they can "spot check", and
   only start paying a lot of attention to a particular multicast
   group's messages if it starts seeing bogus packets. Also, if there
   are multiple routers they can share the load, and there are several
   variants of this:
     -  each router checks a random percentage of packets. In this way,
   most bogus packets will have been discarded before they reach the
   receivers
     - the routers along the path coordinate as to which packets each
   will handle. For instance, there can be a simple hash function, and
   each of the k routers on the path takes the ones that hash to one of
   the values mod k.
     - at the entrance to the domain, a bit in the packet is flipped
   indicating the packet has not been verified. Then any router in the
   domain that has spare cycles can check packets that have the bit set,
   and either discard the packet (if it's bogus) or clear the bit so
   other routers (and receivers) need not waste cycles testing the
   packet. (assuming we're trying to protect ourselves from people
   outside our domain, so we trust all the nodes within the domain).


5.0 Overhead Compared to other schemes


   Address allocation is much simpler. There is no need for addresses to
   be aggregatable. There is no reason to have BGMP to pass multicast


Expires February 1999                                          [Page 10]


Internet Draft       A Design for Simple Multicast           August 1998


   address blocks around in the interdomain protocol. The only routing
   is based on the core address, which is a unicast IP address, which IP
   already needs to be able to route towards. Passing around multicast
   address ranges through BGMP uses bandwidth, CPU, and memory in the
   routers, not to mention pages of specifications, implementation
   effort, and more things to read and understand. With our proposal
   none of that is necessary.

   There is no reason, as in PIM-SM, to have core-capable routers
   advertise themselves, and to have some sort of hash scheme so that
   every router can determine, based on the multicast address, which of
   the core-capable routers would be the core for that multicast
   address. It is expensive to have all core-capable routers advertising
   all the time. It is very complex to have an algorithm whereby a
   router from that set is chosen in an identical manner by all routers,
   and to hope that the same core will be chosen at all times by all
   routers even if the set of core capable routers is different when one
   makes a decision vs. another router making the decision.

   In PIM-SM, the core is basically a router chosen at random, and there
   is no reason to believe it will be conveniently located near the
   other members of the group. So therefore the shared tree is likely to
   be very suboptimal, leading to the desire for per-source trees. In
   this scheme, since the core will be a member of the group or a node
   consciously chosen by whoever was creating the group, the shared tree
   is likely to be as good a tree as any of the per-source trees,
   certainly according to the metric of total cost to the network of
   delivering the data.

   Using a single tree per group saves the network k times as much
   state, assuming on the average, k transmitters per group.

   Another possible scheme is Dave Cheriton's Express model, where every
   group is a combination of multicast address and source address. The
   multicast address, in other words, is unique to the source. Then
   address allocation becomes even simpler than in our scheme. However,
   the problem with this model is that for something like a conference
   call every member needs to know all the possible transmitters so they
   can join a separate group for each possible transmitter. If a new
   member joins the group, somehow all the other members have to be
   alerted in order to join the new tree.  Also, this model requires the
   network to keep k times as much state, i.e., a separate tree for
   every possible transmitter.

6.0 Summary

   Although this proposal is similar to CBT and PIM-SM, it differs
   because:


Expires February 1999                                          [Page 11]


Internet Draft       A Design for Simple Multicast           August 1998


     - routers do not need to figure out which core goes with which
   multicast address
     - the core is a member of the group, or explicitly chosen for that
   group, so the shared tree will be fairly good
     - use a single shared tree per group, though extra groups can be
   formed in the rare cases where a shared tree is not good enough
   (i.e., don't bother with dynamically formed per-source trees)
     - the shared tree can be bi-directional (and therefore more
   efficient)
     - no need for a protocol for core-capable routers to advertise
     - no need for an algorithm that will hash a multicast address to
   the choice among the set of core-capable routers
     - no need to pass multicast address ranges in the interdomain
   routing protocol

   7. Acknowledgments
   We would like to thank Brad Cain and Ross Callon for their comments.
   Tony Ballardie would like to thank British Telecom Plc, and 3Com
   Corporation for assisting with funding his work.


   References


      [STATIC_MCAST] M. Ohta, J. Crowcroft
                   Static Multicast, Internet-Draft, March 1998

      [DNS_RP]     DNS Based RP Placement scheme
                   Dino Farinacci's presentation in the MBONED WG,
                   40th IETF Meeting

      [Cheriton]   IDMR Mailing List discussion

      [IGMP]       Cain, Deering, Thyagarajan
                   Internet Group Management Protocol, Version 3,
                   Internet-Draft, November 1998

      [CBT]        Ballardie, Cain, Zhang, Core Based Tree Multicast
                   Routing, Internet-Draft, March 1998

      [PIMSM]      Estrin, Farinacci, Helmy, Thaler, Deering, Handley,
                   Jacobson, Liu, Sharma, and Wei.
                   Protocol independent multicast-sparse mode (PIM- SM)
                   Specification,  RFC-2117, June 1997

      [BGMP]       Thaler, Estrin, Meyers


Expires February 1999                                          [Page 12]


Internet Draft       A Design for Simple Multicast           August 1998


                   Border Gateway Multicast Protocol Specification,
                   Internet-Draft, March 1998

      [MASC]       Estrin, Handley, Kumar, Thaler
                   Multicast Address Set Clain Protocol
                   Internet-Draft, November 1997


   Authors' Addresses

   Radia Perlman
   Sun Microsystems Laboratories
   2 Elizabeth Drive
   Chelmsford, MA 01824
   Radia.Perlman@sun.com

   Cheng-Yin Lee
   Nortel (Northern Telecom), Ltd.
   PO Box 3511, Station C
   Ottawa, ON K1Y 4H7, Canada
   leecy@nortel.ca

   Tony Ballardie
   Research Consultant
   aballardie@acm.org

   Department of Computer Science
   University College London
   Gower Street
   London, WC1E 6BT
   UK
   J.Crowcroft@cs.ucl.ac.uk


Expires February 1999                                          [Page 13]