Internet DRAFT - draft-farinacci-msdp


Network Working Group                                     Dino Farinacci
INTERNET DRAFT                                             Yakov Rekhter
                                                           cisco Systems
                                                          Peter Lothberg
                                                             Hank Kilmer
                                                             Jeremy Hall
                                                           June 25, 1998

               Multicast Source Discovery Protocol (MSDP)

Status of this Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   To learn the current status of any Internet-Draft, please check the
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
   Directories on (Africa), (Europe), (Pacific Rim), (US East Coast), or (US West Coast).


Abstract

   This proposal describes a mechanism to connect multiple PIM-SM
   domains together. Each PIM-SM domain uses its own independent RP(s)
   and does not have to depend on RPs in other domains.

   This proposal is being submitted as a method for the initial phase of
   Inter-Domain Multicast deployment in the Internet and may be upward
   compatible with the IDMR protocols being proposed for subsequent

Farinacci, Rekhter, Lothberg, Kilmer, Hall                      [Page 1]
RFC DRAFT                                                      June 1998

1.0 Introduction

   This proposal describes a mechanism to connect multiple PIM-SM
   domains together. Each PIM-SM domain uses its own independent RP(s)
   and does not have to depend on RPs in other domains.

   Some advantages of this proposal:

      o PIM-SM domains can rely on their own RPs only.
      o Domains with only receivers get data without globally advertising
        group membership.
      o Global source state is not required.

2.0 Overview

   An RP in a PIM-SM domain will have an MSDP peering relationship with
   an RP in another domain. The peering relationship will be made up of
   a TCP connection over which primarily control information is
   exchanged. Each domain will have a connection to this virtual
   topology.

   The purpose of this topology is to have domains discover multicast
   sources from other domains. If the multicast sources are of interest
   to a domain which has receivers, the normal source-tree building
   mechanism in PIM-SM will be used to deliver multicast data over an
   inter-domain distribution tree.

   We envision this virtual topology will essentially be congruent to
   the existing BGP topology used in the unicast-based Internet today.
   That is, the TCP connections between RPs can be realized by the
   underlying BGP routing system.

3.0 Procedure

   A source in a PIM-SM domain originates traffic to a multicast group.
   The PIM DR which is directly connected to the source sends the data
   encapsulated in a PIM Register message to the RP in the domain.

   The RP will construct a "Source-Active" (SA) message and send it to
   its MSDP peers. The SA message contains the following fields:

      o Source address of the data source.
      o Group address the data source sends to.
      o IP address of the RP.

   Each MSDP peer receives and forwards the message away from the RP
   address in a "peer-RPF flooding" fashion.  Peer-RPF flooding applies
   to the forwarding of SA messages: the BGP routing table is examined
   to determine which peer is the next hop towards the originating RP
   of the SA message.  Such a peer is called an "RPF peer".

   If the MSDP peer receives the SA from a non-RPF peer towards the
   originating RP, it will drop the message. Otherwise, it forwards the
   message to all its MSDP peers.
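
   The accept-or-drop rule above can be sketched as follows. This is an
   illustration only, not part of the protocol definition: the names
   SourceActive, forward_sa and rpf_peer_for are hypothetical, the
   dictionary stands in for a real BGP table lookup, and excluding the
   peer the SA arrived from is an assumption drawn from "away from the
   RP address".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceActive:
    source: str  # address of the data source
    group: str   # group address the data source sends to
    rp: str      # IP address of the originating RP


def forward_sa(sa, received_from, peers, rpf_peer_for):
    """Return the peers an SA should be forwarded to (empty list = drop).

    rpf_peer_for maps an RP address to the MSDP peer that is the BGP
    next hop towards that RP (a stand-in for a BGP table lookup).
    """
    if rpf_peer_for.get(sa.rp) != received_from:
        return []  # arrived from a non-RPF peer: drop the message
    # Accepted: flood to the other peers, away from the originating RP.
    return [p for p in peers if p != received_from]
```

   With peers A, B and C, and A as the RPF peer for the originating RP,
   an SA received from A would be passed to B and C, while the same SA
   received from B would be dropped.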

   The flooding can be further constrained to children of the peer by
   interrogating BGP reachability information. That is, if a peer
   advertises a route (back to you) and you are the next to last AS in
   the AS-path, the peer is using you as the next-hop. In this case, you
   *should* forward an SA message (which was originated from the RP
   address covered by that route) to the peer. This is known in other
   circles as Split-Horizon with Poison Reverse.

   When an MSDP peer (which is also the RP for its own domain) receives
   an SA message, it determines whether it has any group members
   interested in the group the SA message describes. If the (*,G) entry
   exists with a non-empty outgoing interface list, the domain is
   interested in the group, and the RP triggers an (S,G) join towards
   the data source. This sets up a branch of the source-tree to this
   domain. Subsequent data packets arrive at the RP and are forwarded
   down the shared-tree inside the domain. Leaf routers may then join
   the source-tree according to existing PIM-SM conventions.
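
   The interest check performed by a receiving RP might look like the
   sketch below. The function name handle_sa, the star_g_oifs mapping
   and the send_join callback are hypothetical; a real router would
   consult its multicast routing table and PIM machinery instead.

```python
def handle_sa(source, group, star_g_oifs, send_join):
    """Join the source tree for (source, group) if the domain has receivers.

    star_g_oifs maps a group to the outgoing interface list of its
    (*,G) entry; send_join is a callback that triggers the (S,G) join.
    Returns True if a join was triggered.
    """
    oifs = star_g_oifs.get(group)
    if not oifs:      # no (*,G) entry, or an empty interface list
        return False  # the domain is not interested: ignore the SA
    send_join(source, group)
    return True
```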

   This procedure has been affectionately named flood-and-join because
   an RP that is not interested in the group can simply ignore the SA
   message. Otherwise, it joins a distribution tree.

4.0 Controlling State

   RPs which receive SA messages are not required to keep MSDP (S,G)
   state. However, if they do, newly formed MSDP peers can get MSDP
   (S,G) state sooner and therefore reduce join latency for new joiners.

   RPs which originate SA messages do so periodically as long as the
   source is sending data. An RP will not send more than one SA
   message for a given (S,G) within a 1-minute interval. Periodic
   origination of SA messages is important so that new receivers who
   join after a source has become active can get data quickly via their
   own RP when it is not caching SA state.
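
   One way to enforce the per-(S,G) origination limit is sketched
   below. The class name SaOriginator and the choice to pass the
   current time in as a parameter are assumptions made for
   illustration; only the 60-second interval comes from the text above.

```python
SA_INTERVAL = 60.0  # seconds: at most one SA per (S,G) per minute


class SaOriginator:
    def __init__(self):
        self._last_sent = {}  # (source, group) -> time of the last SA

    def should_send(self, source, group, now):
        """Return True, and record the send, if an SA may go out at time now."""
        key = (source, group)
        last = self._last_sent.get(key)
        if last is not None and now - last < SA_INTERVAL:
            return False  # an SA for this (S,G) went out too recently
        self._last_sent[key] = now
        return True
```

   Each (S,G) pair is tracked separately, so a new source becoming
   active is announced at once even if another source was announced
   moments earlier.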

   Intermediate RPs do not send periodic SA messages on behalf of
   sources in other domains. They do so only for their own sources.

   As the number of (source,group) pairs increases in the Internet, an
   RP may want to filter which sources it describes in SA messages.
   Filtering may also be used as a matter of policy, which at the same
   time can reduce state. Only an RP in the same domain as the source
   may restrict SA messages. Other RPs should not filter; otherwise the
   flood-and-join model breaks.

   If an MSDP peer decides to cache SA state, it may accept SA-Requests
   from other MSDP peers. When an MSDP peer receives an SA-Request for a
   group range, it will respond to the peer with a set of SA entries, in
   an SA-Response message, for all active sources sending to the group
   range requested in the SA-Request message. The peer that sends the
   request will not flood the resulting SA-Response message to other
   peers.
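
   A caching peer's selection of entries for an SA-Response could be
   sketched as follows. The function name answer_sa_request and the
   representation of the cache as a list of (source, group, rp) tuples
   are hypothetical; the group range is written in prefix/length form.

```python
import ipaddress


def answer_sa_request(cache, group_prefix):
    """Return the cached (source, group, rp) entries inside group_prefix."""
    wanted = ipaddress.ip_network(group_prefix)
    return [
        entry for entry in cache
        if ipaddress.ip_address(entry[1]) in wanted
    ]
```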

5.0 SA Encapsulated Data Packets

   For bursty sources, the SA message may contain multicast data from
   the source. Interested RPs can decapsulate the SA message and forward
   the original data packet down the shared-tree inside of a domain. We
   recommend this not be the default setting.

6.0 Auto-configuration versus Manual-configuration of MSDP Peers

   MSDP peers can be configured manually or can be learned
   automatically. Automatic learning can be achieved by either of two
   mechanisms:

      o PIM Query/Hello messages
      o BGP capability parameter negotiation

   In either case, each side of the peering relationship will indicate
   its desire to participate in the MSDP protocol. If so, the TCP peer
   relationship is set up.

7.0 Other Scenarios

   MSDP is not limited to deployment across different routing domains.
   It can be used within a routing domain when it is desired to deploy
   multiple RPs for different group ranges. As long as all RPs have an
   interconnected MSDP topology, each can learn about active sources as
   well as RPs in other domains.

   MSDP can be used in domains that operate a dense-mode multicast
   routing protocol. However, in some cases SA messages with
   encapsulated source data may be required.

8.0 Packet Formats

   MSDP messages will be carried over a TCP connection using well-known
   port 639. One side of the MSDP peering relationship will listen on
   the well-known port and the other side will do an active connect to
   it. The side with the higher IP address will do the listen. This
   connection establishment algorithm avoids call collision. Therefore,
   there is no need for a call collision procedure.
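
   The role selection can be expressed as a simple address comparison.
   In this sketch the name should_listen is hypothetical, and the
   addresses are compared numerically rather than as strings.

```python
import ipaddress

MSDP_PORT = 639  # well-known TCP port for MSDP


def should_listen(local_addr, peer_addr):
    """True if the local side has the higher IP address and so listens."""
    return ipaddress.ip_address(local_addr) > ipaddress.ip_address(peer_addr)
```

   Because both ends evaluate the same comparison with the arguments
   swapped, exactly one of them listens and the other connects.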

   MSDP messages will be encoded in TLV format:

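                         1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |    Type       |           Length              |  Value ....   |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+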

    Type (8 bits)
        Describes the format of the Value field.

    Length (16 bits)
        Length of Type, Length, and Value fields in octets. Minimum length
        required is 3 octets.

    Value (variable length)
        Format is based on the Type value. See below. The length of the
        value field is Length field minus 3.
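
    A minimal sketch of encoding and decoding this header follows. The
    function names are hypothetical, and network byte order for the
    16-bit Length field is an assumption; the text above does not state
    the byte order.

```python
import struct

# Type (8 bits) then Length (16 bits); "!" assumes network byte order.
TLV_HEADER = struct.Struct("!BH")


def encode_tlv(tlv_type, value):
    """Prepend the 3-octet header; Length covers Type, Length and Value."""
    return TLV_HEADER.pack(tlv_type, TLV_HEADER.size + len(value)) + value


def decode_tlv(buf):
    """Return (type, value, remaining bytes) for the first TLV in buf."""
    tlv_type, length = TLV_HEADER.unpack_from(buf)
    if length < TLV_HEADER.size or length > len(buf):
        raise ValueError("bad TLV length")
    return tlv_type, buf[TLV_HEADER.size:length], buf[length:]
```

    Returning the remaining bytes lets a caller walk several TLVs that
    arrive back to back on the same TCP stream.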

Documented Types:

IPv4 Source-Active TLV

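                         1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |       1       |           x + y               |  Entry Count  |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                           RP Address                          |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |            Reserved           |  Gprefix Len  |  Sprefix Len  | \
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  \
    |                      Group Address Prefix                     |   ) z
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  /
    |                      Source Address Prefix                    | /
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    Type
        IPv4 Source-Active TLV is type 1.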

    Length x
        Is the length of the control information in the message. x is 8
        octets (for the first two 32-bit quantities) plus 12 times Entry
        Count octets.

    Length y
        If 0, then there is no data encapsulated. Otherwise an IPv4 packet
        follows and y is the value of the Total Length field of the
        encapsulated IPv4 header. If there are multiple SA TLVs in a message,
        and data is also included, y must be 0 in all SA TLVs except the
        last one. The last SA TLV must reflect the source and destination
        addresses in the IP header of the encapsulated data.

    Entry Count
        Is the count of z entries (shown above) which follow the RP address
        field. This is so multiple (S,G)s from the same domain can be
        encoded efficiently for the same RP address.

    RP Address
        The address of the RP in the domain the source has become active in.

    Gprefix Len and Sprefix Len
         The route prefix length associated with the group address prefix
         and source address prefix, respectively.

    Group Address Prefix
        The group address the active source has sent data to.

    Source Address Prefix
         The route prefix associated with the active source.
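
    The layout above could be serialized as in the sketch below. The
    function name encode_sa_tlv and the tuple layout of entries are
    hypothetical, network byte order is assumed, and y is taken to be
    the length of the encapsulated packet passed in as data.

```python
import socket
import struct


def encode_sa_tlv(rp, entries, data=b""):
    """Build an IPv4 Source-Active TLV (type 1).

    entries is a list of (group, gprefix_len, source, sprefix_len);
    data is an optional encapsulated IPv4 packet, so y = len(data).
    """
    x = 8 + 12 * len(entries)  # control information, in octets
    out = struct.pack("!BHB4s", 1, x + len(data), len(entries),
                      socket.inet_aton(rp))
    for group, glen, source, slen in entries:
        # 16 reserved bits, the two prefix lengths, then both prefixes.
        out += struct.pack("!HBB4s4s", 0, glen, slen,
                           socket.inet_aton(group),
                           socket.inet_aton(source))
    return out + data
```

    For a single (S,G) entry with no encapsulated data, x is 8 + 12 = 20
    octets and y is 0, so the Length field carries 20.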

    Multiple SA TLVs can appear in the same message and can be batched for
    efficiency at the expense of data latency. This would typically occur
    on intermediate forwarding of SA messages.

IPv4 Source-Active Request TLV

    Used to request SA-state from a caching MSDP peer. If an RP in a domain
    receives a PIM Join message for a group, creates (*,G) state and wants to
    know all active sources for group G, and it has been configured to peer
    with an SA-state caching peer, it may send an SA-Request message
    for the group.

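                         1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |       2       |             8                 |  Gprefix Len  |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                      Group Address Prefix                     |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    Type
        IPv4 Source-Active Request TLV is type 2.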

    Gprefix Len
         The route prefix length associated with the group address prefix.

    Group Address Prefix
        The group address prefix the MSDP peer is requesting.
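
    Since the request has a fixed length of 8 octets, it can be packed
    in a single call. The function name encode_sa_request is
    hypothetical and network byte order is assumed.

```python
import socket
import struct


def encode_sa_request(group_prefix, gprefix_len):
    """Build an IPv4 Source-Active Request TLV (type 2, length 8)."""
    return struct.pack("!BHB4s", 2, 8, gprefix_len,
                       socket.inet_aton(group_prefix))
```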

IPv4 Source-Active Response TLV

    Sent in response to a Source-Active Request message. The Source-Active
    Response message has the same format as a Source-Active message but
    does not allow encapsulation of multicast data.

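                         1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |       3       |             x                 |     ....      |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    Type
        IPv4 Source-Active Response TLV is type 3.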

    Length x
        Is the length of the control information in the message. x is 8
        octets (for the first two 32-bit quantities) plus 12 times Entry
        Count octets.

9.0 Acknowledgements

   The authors would like to thank David Meyer, John Meylor, Liming Wei,
   Manoj Leelanivas, Mark Turner, and John Zwiebel for their design
   feedback and comments.

10.0 Authors' Addresses

   Dino Farinacci
   Cisco Systems, Inc.
   170 Tasman Drive
   San Jose, CA, 95134

   Yakov Rekhter
   Cisco Systems, Inc.
   170 Tasman Drive
   San Jose, CA, 95134

   Peter Lothberg
   12502 Sunrise Valley Drive
   Reston VA, 20196

   Hank Kilmer
   Digex Inc.
   One DIGEX Plaza
   Beltsville, Maryland 20705

   Jeremy Hall
   UUnet Technologies
   3060 Williams Drive
   Fairfax, VA 22031
