Independent Submission                                      J. Weinstein
Internet-Draft                                                 J. Keller
Intended status: Experimental                           BBN Technologies
Expires: December 6, 2013                                   June 4, 2013


                  PIM-DM Optimizations for Large LANs
                draft-weinstein-bbn-pim-dm-large-lan-00

Abstract

   Optimizations to the PIM-DM protocol [RFC3973] are proposed that
   dramatically reduce the PIM-DM routing overhead on local area
   networks containing large numbers of PIM routers.  These
   optimizations reduce the likelihood of two or more routers emitting
   redundant JOIN or PRUNE messages onto the same LAN.  While these
   optimizations are of general applicability to large LANs, they were
   motivated in particular by performance requirements for supporting
   PIM-DM across LANs emulated by IP tunneling over large wireless
   MANETs.

   PIM-DM routers supporting these optimizations will interoperate with
   standard PIM-DM as defined in RFC 3973, even on the same LAN.
   However, to obtain maximum benefit from these optimizations, all
   PIM-DM routers on the LAN SHOULD support them.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.  This document may not be modified,
   and derivative works of it may not be created, and it may not be
   published except as an Internet-Draft.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 6, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the


Weinstein & Keller      Expires December 6, 2013                [Page 1]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Conventions used in this document  . . . . . . . . . . . . . .  5
   3.  Optimized Redundant JOIN suppression . . . . . . . . . . . . .  5
     3.1.  Motivation . . . . . . . . . . . . . . . . . . . . . . . .  5
     3.2.  Modifications to PIM-DM Timers [RFC 3973 Sec. 4.8] . . . .  7
     3.3.  Other modifications  . . . . . . . . . . . . . . . . . . .  8
   4.  Redundant PRUNE suppression  . . . . . . . . . . . . . . . . .  8
     4.1.  Motivation . . . . . . . . . . . . . . . . . . . . . . . .  8
     4.2.  Modifications to the state-machine-specific timers
           [RFC3973 Sec. 4.4.1] . . . . . . . . . . . . . . . . . . . 10
     4.3.  Modifications to the tabular description of the state
           machine [RFC3973 Sec. 4.4.1] . . . . . . . . . . . . . . . 11
     4.4.  Modifications to Transitions from the Forwarding (F)
           State [RFC3973 Sec. 4.4.1] . . . . . . . . . . . . . . . . 12
     4.5.  Modifications to Transitions from the Pruned (P) State
           [RFC3973 Sec. 4.4.1] . . . . . . . . . . . . . . . . . . . 13
     4.6.  Modifications to PIM-DM Timers [RFC 3973 Sec. 4.8] . . . . 14
     4.7.  Other modifications  . . . . . . . . . . . . . . . . . . . 15
   5.  Compatibility with RFC 3973  . . . . . . . . . . . . . . . . . 15
   6.  Security Considerations  . . . . . . . . . . . . . . . . . . . 16
   7.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 16
   8.  Conclusions  . . . . . . . . . . . . . . . . . . . . . . . . . 16
   9.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 17
     9.1.  Normative References . . . . . . . . . . . . . . . . . . . 17
     9.2.  Informative References . . . . . . . . . . . . . . . . . . 17
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 18


Weinstein & Keller      Expires December 6, 2013                [Page 2]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


1.  Introduction

   PIM-DM [RFC3973] is an IP multicast routing protocol optimized for
   situations where group members (receivers) are fairly densely
   distributed throughout the multicast distribution domain and there
   are not too many sources per group.  It is based upon the reversed-
   shortest-path broadcast paradigm, in which multicast traffic from
   each source is distributed along a tree rooted at that source and
   consisting of the shortest paths from each potential destination in
   the multicast distribution domain back to that source.  It optimizes
   this reversed-shortest-path broadcast tree by pruning unneeded
   branches that do not lead to any actual group members.  Pruning of
   unneeded branches is accomplished through use of JOIN/PRUNE messages,
   sent by downstream routers towards the source in a direction opposite
   to that of data flow.

   When large numbers of PIM-DM routers are attached to a single LAN,
   PIM-DM as described in RFC 3973 becomes vulnerable to storms of
   redundant JOIN/PRUNE messages on the LAN.  These storms can arise
   when more than one of the routers on the LAN lack downstream members
   for a particular (S,G) pair, and each sends a separate PRUNE message
   to the upstream router towards the source S. Each of these PRUNE
   messages may, in turn, trigger an overriding JOIN from one or more
   routers that do have downstream members for the (S,G) pair.  Since IP
   multicast forwarding decisions are made on a per-interface basis, not
   a per-downstream-router basis, all except one of these PRUNE messages
   is unnecessary.  Likewise, all except one of the overriding JOIN
   messages, if any, are redundant.

   These excess JOIN/PRUNE messages are particularly problematic because
   they tend to be emitted in a synchronous burst by all downstream
   routers on the LAN when the first user data packet or STATE REFRESH
   message arrives, potentially causing a temporary overload on the LAN.
   Furthermore, these JOIN/PRUNE storms will repeat periodically if
   STATE REFRESH messages are lost or otherwise unavailable.  Since a
   separate set of JOIN/PRUNE messages is used for each (source, group)
   pair, the load from these redundant JOIN/PRUNE messages scales
   approximately as the product [number_of_groups] * [number_of_sources
   ] * [number_of_routers_per_LAN] * [rate_of_prune_cycles].  As such,
   this load becomes the key constraint limiting the number of sources
   and groups that can be supported within a single PIM-DM multicast
   domain.

   The PIM-DM specification [RFC3973] includes a mechanism for avoiding
   redundant overriding JOINs, but it is sub-optimal when the number of
   PIM routers on a single LAN is large or propagation delays on the LAN
   are non-negligible.  This mechanism is built into the upstream state
   machine [RFC3973 Sec. 4.4.1] through the Override Timer.  When a


Weinstein & Keller      Expires December 6, 2013                [Page 3]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


   router in the Forwarding state or AckPending state overhears an
   upstream PRUNE that must be overridden, it sets this timer to a
   random value between 0 and Override_Interval.  It then sends an
   overriding JOIN to the upstream router if and only if this timer
   expires before it overhears another overriding JOIN from some other
   router on the same interface to the same upstream router for the same
   (S,G) pair.  We shall refer to this mechanism as redundant JOIN
   suppression.

   As will be explained in Section 3 of this document, the effectiveness
   of redundant JOIN suppression is extremely sensitive to the algorithm
   used to determine the setting for the Override Timer.  RFC 3973
   specifies that this timer is to be set to a random value between 0
   and Override_Interval.  It does not specify the probability
   distribution to use, but most implementations probably use a linear
   distribution.  If the number of routers on the LAN is sufficiently
   large compared to the inverse propagation delay, large numbers of
   redundant messages may still be sent.  The likelihood of this
   occurring can be greatly reduced by employing a deterministic
   algorithm for setting the Override_Timer that approximates an
   exponential probability distribution.  Section 3 of this document
   specifies this algorithm.

   PIM-DM as specified in RFC 3973 does not, however, include any
   mechanism for avoiding redundant PRUNEs.  Section 4 of this document
   describes a proposed mechanism, which is very similar to the strategy
   used for avoiding redundant JOINs.  Instead of sending a PRUNE
   message upstream as soon as a router receives an incoming data packet
   from a LAN, the router may instead defer sending this PRUNE for a
   short period of time.  The router then sends this PRUNE if, and only
   if, it does not overhear a PRUNE message from another router on the
   same LAN to the same upstream router for the same (S,G) pair before
   this timer expires.  By using a deterministic algorithm for computing
   this delay, the delay before the first such router emits a PRUNE can
   be made effectively zero for the only case where the delay length
   impacts overall protocol performance -- those situation where there
   are no group members downstream of any router on the LAN.  We shall
   refer to this proposed mechanism as redundant PRUNE suppression.

   Although the mechanisms described in this document should be
   beneficial in any situation where there are a large number of PIM-DM
   routers on a single LAN, these optimizations were motivated by the
   need to support PIM-DM across LANs emulated by multicast-capable IP
   tunneling over wireless mobile ad-hoc networks (MANETs)
   [Weinstein_In_Progress_a].  LANs emulated in this way are motivated
   by the desire to hide most mobility-related wireless network topology
   changes from IP by handling such changes through a custom MANET
   routing protocol at a lower layer.  As the effectiveness of mobility


Weinstein & Keller      Expires December 6, 2013                [Page 4]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


   hiding is dependent upon including all mobile routers in the theater
   of operation within a single emulated LAN, the number of PIM-DM
   routers on such a LAN can become quite large.  At the same time, the
   capacity of such networks can be very limited, so that minimizing the
   overhead from PIM-DM can be crucially important.  Finally, the
   propagation delays for traffic across such a network may be
   significantly larger than those typical of high-capacity wired LANs,
   reducing the efficacy of the redundant JOIN suppression mechanisms
   specified by RFC 3973.  A separate document will deal with
   adaptations to PIM-DM necessitated by the multi-layer routing
   structure itself [Weinstein_In_Progress_b].


2.  Conventions used in this document

   This document describes optimizations to PIM-DM with reference to its
   current specification in RFC 3973.  The terminology of that reference
   is adopted in its entirety.  Concepts and terms not defined herein
   should be understood as having the meaning specified in RFC 3973.
   Algorithms and protocol elements not discussed herein should be
   understood to be identical to those specified in [RFC3973].

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC2119.

   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lower case uses of these words are not to be
   interpreted as carrying RFC2119 significance.


3.  Optimized Redundant JOIN suppression

3.1.  Motivation

   If a PIM-DM router with a non-null outgoing interface list (e.g.,
   possible downstream members) for some (S,G) pair overhears a PRUNE on
   its incoming interface from some other router for the same (S,G)
   pair, then it must cancel that PRUNE by sending an overriding (S,G)
   JOIN to the upstream router U towards S. If there is more than one
   PIM-DM router with a non-null outgoing interface list for (S,G) with
   its incoming interface on the same network, more than one such router
   may send an overriding (S,G) JOIN to the upstream router U. All
   except the first of these is unnecessary, as the receipt of even one
   overriding JOIN is sufficient to prevent the upstream router from
   pruning its outgoing interface onto the LAN.

   The PIM-DM specification includes a mechanism for avoiding redundant


Weinstein & Keller      Expires December 6, 2013                [Page 5]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


   overriding JOINs, but it is sub-optimal when the number of PIM
   routers on a single LAN is large and propagation delays on the LAN
   are non-trivial.  This mechanism is built into the upstream state
   machine [RFC3973 Sec. 4.4.1] through the Override Timer.  When a
   router in the Forwarding state or AckPending state overhears an
   upstream PRUNE that must be overridden, it sets this timer to a
   random value between 0 and Override_Interval.  It then sends an
   overriding JOIN to the upstream router if and only if this timer
   expires before it overhears another overriding JOIN from some other
   router on the same interface to the same upstream router for the same
   (S,G) pair.  RFC 3973 does not specify the probability distribution
   to use when setting the Override Timer for redundant JOIN
   suppression, although most implementations probably use a linear
   distribution.

   If the number of routers on the LAN is sufficiently large compared to
   the inverse propagation delay, redundant messages may still be sent.
   The likelihood of this occurring can be greatly reduced by employing
   a deterministic algorithm for setting the Override_Timer that
   approximates an exponential distribution.

   It is not difficult to estimate the expected number of redundant JOIN
   messages for any given algorithm from the propagation delay, the
   number of routers N on the LAN, and the number of routers on the LAN
   with downstream group members.  The worst case occurs when only one
   router lacks downstream members for some source and group, and hence
   sends a PRUNE message upstream.  In this case, N-2 other routers will
   overhear this PRUNE and schedule possible origination of an
   overriding JOIN message.  As PIM-DM is optimized for use in
   situations where group members are densely distributed throughout the
   multicast distribution domain, this worst-case scenario and/or near-
   worst-case scenarios will be common.

   Using a linear probability distribution for the Override Timer as
   implied by RFC 3973, the expected delay before the first overriding
   JOIN is sent by some router will be given by Override_Interval/(N-2).
   However, it will typically take Propagation_Delay seconds for other
   routers to overhear this overriding JOIN.  In the meantime, (N-3) *
   Propagation_Delay / Override_Interval other routers can be expected
   to emit overriding JOINs of their own.  If the Propagation Delay is
   non-trivial, the number of overriding JOINs sent will increase
   linearly with the number N of routers on the LAN, and can be quite
   large if the number of routers N on the LAN is also large.

   The number of redundant JOINs can be dramatically reduced by
   employing a deterministic algorithm for picking the delay, which
   mimics an exponential probability distribution.  We begin by
   considering an arbitrary index function J that assigns to each router


Weinstein & Keller      Expires December 6, 2013                [Page 6]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


   on the LAN (except the upstream router towards the source S) a unique
   integer in the closed interval [0,N-2], where N is the total number
   of PIM routers on the LAN (including the current router).  Such a
   function can be constructed, for example, by having each router count
   the number of other routers (other than the upstream router) with
   higher-addressed interfaces on the LAN, and using that count as the
   value of J. Then, when the Override Timer is to be set, consider
   setting it to the value
      Override_timer = Override_interval*ln(1+J)/ln(N-1)
   With this choice, the number of other routers that will also send a
   JOIN during the interval of length propagation_delay before they
   overhear this JOIN will be given by exp((ln(N)/
   Override_Interval)*propagation_delay)) - 1 = N ** (propagation_delay/
   override_interval).  For propagation_delay < override_interval, as it
   must be for redundant JOIN suppression to work, this grows less
   rapidly than N and hence results in a much smaller number of
   redundant JOIN messages when N is large.  This formula also
   guarantees that the Override Timer will never be set to a value
   greater than Override_interval, thus ensuring that an overriding JOIN
   is received by the upstream router before the upstream router prunes
   its outgoing interface.

3.2.  Modifications to PIM-DM Timers [RFC 3973 Sec. 4.8]

   In order to implement the optimizations to redundant JOIN suppression
   describe above, the subsection of RFC 3973 Sec. 4.8 entitled "Timer
   Name: Upstream Override Timer (OT(S,G))" is modified to read as
   follows:


      Timer Name: Upstream Override Timer (OT(S,G))

+------------+----------------+----------------------------------------+
| Value Name | Value          | Explanation                            |
+------------+----------------+----------------------------------------|
| t_override | OI(I)*         | Randomized delay to prevent response   |
|            |  ln(1+J)/      | implosion when sending a join message  |
|            |  ln(N-1)       | to override someone else's prune       |
+------------+----------------+----------------------------------------+

      t_override is a value between 0 and the incoming interface's
      Override_Interval (OI(I)) computed from the formula t_override =
      OI(I)*ln(1+J)/ln(N-1), where N is the total number of PIM routers
      on incoming interface I (including this router), and J is a count
      of the number of PIM routers on incoming interface I (other than
      the upstream router towards source S) with addresses greater than
      the address of this router.  If all routers on a LAN are using the
      LAN Prune Delay option, the Override_Interval (OI(I)) MUST be set


Weinstein & Keller      Expires December 6, 2013                [Page 7]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


      to the largest value on the LAN.  Otherwise, the Override_Interval
      (OI(I)) MUST be set to 2.5 seconds.
      Note that, if N <= 2, no router will ever need to send a join
      message to override someone else's prune, and so neither an
      attempt to take the logarithm of a negative number nor an attempt
      to divide by zero can ever occur in this computation.

3.3.  Other modifications

   No other sections of RFC 3973 need to be modified to support
   optimized redundant JOIN suppression.


4.  Redundant PRUNE suppression

4.1.  Motivation

   When a PIM-DM router with a null outgoing interface list for a source
   S and group G receives a data packet for that source and group from
   the upstream interface towards that source, and its Prune Limit timer
   is not running, then it sends a PRUNE for that source and group to
   the upstream router towards that source [RFC3973 Sec. 4.4.1].
   Similarly, if a PIM-DM router receives a STATE REFRESH message for a
   source S and group G from the upstream router towards that source,
   the Prune Indicator bit in that STATE REFRESH message is clear, the
   upstream interface is in the Pruned state, and its Prune Limit timer
   is not running, then the router sends a PRUNE for that source and
   group to the upstream router towards that source [RFC3973 Sec.
   4.4.1].

   Unfortunately, this behavior can lead to multiple redundant PRUNE
   messages being sent if there is more than one router on a LAN with a
   null outgoing interface list for the (S,G) pair.  Each of these
   routers will send its own PRUNE to the upstream router towards S. All
   except the first of these, is unnecessary.  Even worse, each of these
   PRUNEs can trigger a storm of overriding JOIN messages.

   These unnecessary PRUNEs can be avoided by employing a strategy
   similar to that which is employed for avoiding redundant JOINs.
   Instead of sending a PRUNE message upstream as soon as it receives an
   incoming data packet from a LAN as specified in RFC 3973, the router
   may instead defer sending this PRUNE for a short period of time.  The
   router then sends this PRUNE if, and only if, it does not overhear a
   PRUNE message from another router on the same LAN to the same
   upstream router for the same (S,G) pair before this timer expires.

   However, in the case where no downstream router has any downstream
   group members, it is absolutely essential that the time delay before


Weinstein & Keller      Expires December 6, 2013                [Page 8]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


   emission of the first PRUNE by some downstream router be kept to an
   absolute minimum.  Until that PRUNE is received by the upstream
   router, unnecessary user traffic will continue to be admitted to the
   LAN and, possibly, upstream networks as well, consuming unnecessary
   resources.  Forutunately, this time delay is relevant only if no
   downstream router has any downstream members.  Otherwise the PRUNE
   will be overridden anyway by a JOIN from one of the downstream
   routers with downstream members, and data will continue to flow.

   By using a deterministic algorithm for computing this delay, the
   delay before some router sends the first PRUNE can, as required, be
   kept to zero for the situation where there are no group members
   downstream of any router on the LAN.  This cannot be guaranteed by
   any probabilistic algorithm.

   Since a deterministic algorithm is required, we consider again the
   algorithm described in sec. 3.1. above, restated as:

   Prune_deferral_timer = Prune_deferral_interval*ln(1+J)/ln(N-1)

   As before, J is an index function that assigns to each router on the
   LAN (except the upstream router towards the source S) a unique
   integer in the interval [0,N-2], where N is the total number of PIM
   routers on the LAN.  Such a function can be constructed, for example,
   by having each router count the number of other routers (other than
   the upstream router) with higher-addressed interfaces on the LAN, and
   using that count as the value of J.

   Due to its deterministic character and the definition of the index
   function J, this algorithm guarantees that there will always be some
   router on the LAN for which J=0, so that Prune_deferral_timer is also
   zero.  Consequently, when every router (other than the upstream
   router) lacks downstream group members, this algorithm guarantees
   that some router will as required emit a PRUNE with zero delay.

   Since the time delay before some router emits a first PRUNE is
   irrelevant if one or more downstream routers have downstream group
   members, the value of Prune_deferral_interval can safely be made
   fairly large, thus minimizing the risk that redundant PRUNEs will be
   sent.  In fact, if there were no concern for the possibility that
   PRUNEs might be lost, there would be no need for any router other
   than the highest-addressed router to ever send a PRUNE.  If it had
   failed to send a PRUNE, then it could be concluded that it had
   downstream group members, and hence would itself generate an
   overriding JOIN if some other router then sent a PRUNE.  Using an
   algorithm that provides for other routers to send PRUNEs adds a
   measure of redundancy, that helps protect against the possibility
   that a PRUNE from the highest-addressed router might simply be lost.


Weinstein & Keller      Expires December 6, 2013                [Page 9]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


   The order in which a router sees an incoming (S,G) data packet, a
   PRUNE emitted by some other router in response to that packet, and an
   overriding JOIN from a third router, may vary depending upon
   propagation delays in the network.  In particular, one must also
   address the situation wherein a router sees an overriding JOIN from
   another before it sees either the PRUNE that triggered that
   overriding JOIN or the data packet that triggered the PRUNE.  This
   can be handled through judicious use of the Prune Limit Timer.

   Hence, to avoid unnecessary PRUNEs without increasing the time delay
   before pruning can occur, the operation of the Upstream Prune, Join,
   and Graft State Machine [RFC3973 sec. 4.4.1] SHOULD be modified as
   described in this section.

4.2.  Modifications to the state-machine-specific timers [RFC3973 Sec.
      4.4.1]

   In the description of the state-machine-specific timers [RFC3973 sec.
   4.4.1], the following timer is added:
      Prune Deferral Timer (PDT(S,G))
         This timer is used to support redundant PRUNE suppression on a
         LAN.  If a data packet arrives on the upstream interface
         towards S (RPF_Interface(S)) while the outgoing interface list
         olist(S,G) is NULL and the prune limit timer PLT(S,G) is not
         running, and redundant PRUNE suppression is enabled on that
         interface, then this timer is set to a value t_deferral between
         0 and Prune Deferral Interval.  If redundant PRUNE suppression
         is not enabled on that interface, then this timer is set to
         zero (0) and expires immediately.  This timer is also set if a
         STATE REFRESH message is received with its Prune Indicator is
         clear, and the upstream interface is in the Pruned state.  If
         an (S,G) PRUNE is seen on the upstream interface towards S
         (RPF_Interface(S)) and is directed towards RPF'(S), then this
         timer is cancelled.  If this timer expires, a PRUNE is sent to
         the upstream router towards S (RPF'(s)).
         The computation` of t_deferral is described in Sec. 4.8.


Weinstein & Keller      Expires December 6, 2013               [Page 10]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


4.3.  Modifications to the tabular description of the state machine
      [RFC3973 Sec. 4.4.1]

   In the tabular description of the state machine [RFC3973 sec. 4.4.1],
   the following entries are added:

+-------------------------------+--------------------------------------+
|                               |            Previous State            |
|                               +------------+------------+------------+
|            Event              | Forwarding |   Pruned   | AckPending |
+-------------------------------+------------+------------+------------+
| PDT(S,G) expires              | N/A        | ->P Send   | N/A        |
|                               |            | Prune(S,G) |            |
+-------------------------------+------------+------------+------------+


Weinstein & Keller      Expires December 6, 2013               [Page 11]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


   The following entries, identified by event and previous state, are
   modified:

+-------------------------------+--------------------------------------+
|                               |            Previous State            |
|                               +------------+------------+------------+
|            Event              | Forwarding |   Pruned   | AckPending |
+-------------------------------+------------+------------+------------+
| Data packet arrives on        | ->P Set    | ->P Set    | N/A        |
| RPF_Interface(S) AND          | PDT(S,G)   | PDT(S,G)   |            |
| olist(S,G) == NULL AND        |Set PLT(S,G)|Set PLT(S,G)|            |
| PLT(S,G) not running          |            |            |            |
+-------------------------------+------------+------------+------------+
| State Refresh(S,G) received   | ->F        | ->P Set    |->F Cancel  |
| from RPF`(S) AND              |            | PDT(S,G)   |  GRT(S,G)  |
| Prune Indicator == 0 AND      |            |Set PLT(S,G)|            |
| PLT(S,G) not running          |            |            |            |
+-------------------------------+------------+------------+------------+
| See Join(S,G) to RPF'(S)      | ->F Cancel | ->P Cancel |->AP Cancel |
|                               |    OT(S,G) | PDT(S,G)   |    OT(S,G) |
|                               |Set PLT(S,G)|Set PLT(S,G)|Set PLT(S,G)|
+-------------------------------+------------+------------+------------+
| See Prune(S,G)                | ->F Set    | ->P Cancel |->AP Set    |
|                               |    OT(S,G) | PDT(S,G)   |    OT(S,G) |
|                               |Set PLT(S,G)|Set PLT(S,G)|Set PLT(S,G)|
+-------------------------------+------------+------------+------------+
| olist(S,G)->non-NULL          | N/A        | ->AP Send  | N/A        |
|                               |            | Graft(S,G) |            |
|                               |            |Set GRT(S,G)|            |
|                               |            |Cancel      |            |
|                               |            |    PDT(S,G)|            |
+-------------------------------+------------+------------+------------+
| RPF'(S) Changes AND           | ->AP Send  | ->AP Send  |->AP Send   |
| olist(S,G) != NULL            | Graft(S,G) | Graft(S,G) | Graft(S,G) |
|                               |Set GRT(S,G)|Set GRT(S,G)|Set GRT(S,G)|
|                               |            |Cancel      |            |
|                               |            |    PDT(S,G)|            |
+-------------------------------+------------+------------+------------+

4.4.  Modifications to Transitions from the Forwarding (F) State
      [RFC3973 Sec. 4.4.1]

   In the description of Transitions from the Forwarding (F) State [RFC
   3973 Sec. 4.4.1.1], the following sections are modified as follows:
      Data Packet arrives on RPF_Interface(S) AND olist(S,G) == NULL AND
      S NOT directly connected


Weinstein & Keller      Expires December 6, 2013               [Page 12]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


         The Upstream(S,G) state machine MUST transition to the Pruned
         (P) state, set the Prune Deferall Timer PDT(S,G) to t_deferall
         seconds, and set PLT(S,G) to t_limit seconds.
      See Join(S,G) to RPF'(S)
         This event is only relevant if RPF_interface(S) is a shared
         medium.  This router sees another router on RPF_interface(S)
         send a Join(S,G) to RPF'(S,G).  If the OT(S,G) is running, then
         it means that the router had scheduled a Join to override a
         previously received Prune.  Another router has responded more
         quickly with a Join, so the local router SHOULD cancel its
         OT(S,G), if it is running.  The Upstream(S,G) state machine
         remains in the Forwarding (F) state.
         In addition, there is no sense in sending a PRUNE to RPF'(S) in
         the near-term, as that PRUNE would only be overridden by
         another JOIN.  Hence, the upstream (S,G) state machine SHOULD
         set the Prune Limit Timer PLT(S,G) to t_limit.
      See Prune(S,G) AND S NOT directly connected
         This event is only relevant if RPF_interface(S) is a shared
         medium.  This router sees another router on RPF_interface(S)
         send a Prune(S,G).  As this router is in Forwarding state, it
         must override the Prune after a short random interval.  If
         OT(S,G) is not running, the router MUST set OT(S,G) to
         t_override seconds.  The Upstream(S,G) state machine remains in
         Forwarding (F) state.  The router MAY reset its PLT(S,G) to the
         value in the Holdtime field of the received message if it is
         greater than the current value of the PLT(S,G).

4.5.  Modifications to Transitions from the Pruned (P) State [RFC3973
      Sec. 4.4.1]

   In the description of Transitions from the Pruned (P) State [RFC 3973
   Sec. 4.4.1.1], the following sections are added:
      Prune Deferral Timer PDT(S,G) expires
         Send a PRUNE(S,G) to RPF'(S)
      See Join(S,G) to RPF'(S)
         A Join(S,G) is seen on RPF_interface(S) to RPF'(S).  The
         Upstream(S,G) state machine stays in the Pruned (P) state.  If
         the Prune Deferral Timer PDT(S,G) is set, it SHOULD be
         cancelled.
         In addition, there is no sense in sending another PRUNE to
         RPF'(S) in the near-term, as that PRUNE would only be
         overridden by another JOIN.  Hence, the upstream (S,G) state
         machine SHOULD set the Prune Limit Timer PLT(S,G) to t_limit.

   In the description of Transitions from the Pruned (P) State [RFC 3973
   Sec. 4.4.1.1], the following sections are modified as follows:


Weinstein & Keller      Expires December 6, 2013               [Page 13]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


      Data arrives on RPF_interface(S) AND PLT(S,G) not running AND S
      NOT directly connected
         Either another router on the LAN desires traffic from S
         addressed to G or a previous Prune was lost.  To prevent
         generating a Prune(S,G) in response to every data packet, the
         Prune Limit Timer (PLT(S,G)) is used.  To prevent multiple
         routers from generating Prune(S,G) messages, the Prune Deferral
         Timer is used (PDT(S,G)).  Once the PLT(S,G) expires, some
         router needs to send another prune in response to a data packet
         not received directly from the source.  The Prune Deferral
         Timer PDT(S,G) MUST be set to t_deferral, and the PLT(S,G) MUST
         be set to t_limit.
      State Refresh(S,G) Received from RPF'(S)
         The Upstream(S,G) state machine remains in a Pruned state.  If
         the State Refresh has its Prune Indicator bit set to zero and
         PLT(S,G) is not running, The Prune Deferral Timer PDT(S,G) MUST
         be set to t_deferral, and the PLT(S,G) MUST be set to t_limit.
         If the State Refresh has its Prune Indicator bit set to one,
         the router MUST reset PLT(S,G) to t_limit.
      See Prune(S,G) to RPF'(S)
         A Prune(S,G) is seen on RPF_interface(S) to RPF'(S).  The
         Upstream(S,G) state machine stays in the Pruned (P) state.  If
         the Prune Deferral Timer PDT(S,G) is set, it SHOULD be
         cancelled.  The router MAY reset its PLT(S,G) to the value in
         the Holdtime field of the received message if it is greater
         than the current value of the PLT(S,G).
      olist(S,G)->non-NULL AND S NOT directly connected
         The set of interfaces defined by the olist(S,G) macro becomes
         non-empty, indicating that traffic from S addressed to group G
         must be forwarded.  The Upstream(S,G) state machine MUST cancel
         PLT(S,G), transition to the AckPending (AP) state and unicast a
         Graft message to RPF'(S).  The Graft Retry Timer (GRT(S,G))
         MUST be set to Graft_Retry_Period.  If the Prune Deferral timer
         PDT(S,G) is set, it MUST be cancelled.
      RPF'(S) Changes AND olist(S,G) == non-NULL AND S NOT directly
      connected
         Unicast routing or Assert state causes RPF'(S) to change,
         including changes to RPF_Interface(S).  The Upstream(S,G) state
         machine MUST cancel PLT(S,G), cancel the Prune Deferral Timer
         PDT(S,G), transition to the AckPending (AP)state, send a Graft
         unicast to the new RPF'(S), and set the GraftRetry Timer
         (GRT(S,G)) to Graft_Retry_Period.

4.6.  Modifications to PIM-DM Timers [RFC 3973 Sec. 4.8]

   A new timer must be added to the list of PIM-DM Timers at the head of
   RFC 3973 Sec. 4.8 [RFC3973]:


Weinstein & Keller      Expires December 6, 2013               [Page 14]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


      Per (S,G) Pair:
         ...
         (S,G) Prune Deferral Timer : PDT(S,G)

   A new subsection must be added to RFC 3973 Sec. 4.8 [RFC3973]:


      Timer Name: Prune Deferral Timer (PDT(S,G))

+------------+----------------+----------------------------------------+
| Value Name | Value          | Explanation                            |
+------------+----------------+----------------------------------------|
| t_deferral | PDI(I)*        | Randomized delay to prevent response   |
|            |  ln(1+J)/      | implosion when sending a prune in      |
|            |  ln(N-1)       | response to incoming user traffic or   |
|            |                | an incoming state refresh message      |
+------------+----------------+----------------------------------------+

      t_deferall is a value between 0 and the incoming interface's
      Prune_Deferall_Interval (PDT(I)) computed from the formula
      t_deferall = PDT(I)*ln(1+J)/ln(N-1), where N is the total number
      of PIM routers on incoming interface I (including this router),
      and J is a count of the number of PIM routers on incoming
      interface I (other than the upstream router towards source S) with
      addresses greater than the address of this router.  The
      Prune_Deferall_Interval SHOULD be set by configuration and SHOULD
      have a value significantly greater than the expected propagation
      delay (PD(I)) but significantly less than the Prune Limit Timer.
      All routers on the same LAN SHOULD be assigned the same value of
      Prune_Deferall_Interval.  Recommended default: 10 seconds.
      Note that, if N <= 1, there will be no router to send a prune on
      the incoming interface I, and so an attempt to take the logarithm
      of a negative number will never occur.  If N = 2, then the only
      possible value of J will be J = 0, so the formula given above will
      result in a divide of zero by zero (ln(1) / ln(1) ).  This should
      be resolved by letting t_deferral = 0 in this case.

4.7.  Other modifications

   No other modifications to RFC 3973 are required.


5.  Compatibility with RFC 3973

   PIM-DM routers may implement the optimizations described herein on
   certain interfaces, while implementing standard PIM-DM as described
   in RFC 3973 on others.  Hence, PIM-DM routers supporting the
   optimization described herein SHOULD provide a per-interface


Weinstein & Keller      Expires December 6, 2013               [Page 15]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


   configuration mechanism for selectively enabling or disabling these
   optimizations.

   PIM-DM routers implementing the optimizations described herein will
   interoperate with those implementing standard PIM-DM, as described in
   RFC 3973, even if intermixed on the same LAN.  However, to obtain the
   full benefit of these optimizations, all PIM routers on the LAN
   should implement these optimizations.  Mixed LANs may exhibit
   degraded performance, especially if several PIM-DM routers
   implementing only standard PIM are assigned higher-numbered addresses
   on the LAN than any of those implementing these optimizations.


6.  Security Considerations

   The optimizations described in this document should not have any
   impact upon the security of the PIM-DM protocol or the lack thereof.


7.  IANA Considerations

   There are no IANA considerations.


8.  Conclusions

   This document has described several optimizations to the PIM-DM
   protocol for LANs with large numbers of attached PIM-DM routers.
   These optimizations more fully exploit the potential for suppressing
   redundant JOIN/PRUNE messages inherent in the use of link-local
   multicast to transport these messages.  These optimizations provide
   maximum benefit when the number of PIM routers attached to a single
   LAN is large, and the propagation delay on the LAN is non-trivial but
   much shorter than the LAN Prune Delay.

   The optimized JOIN/PRUNE suppression described here will in most
   cases scale much better to large LANs than possible alternatives in
   which JOIN/PRUNE messages are unicast to the upstream router instead
   (as proposed by RFC6559 for PIM-SM).  The use of unicast obviates all
   possibility of redundant JOIN/PRUNE suppression, resulting in an
   implosion at the upstream router of N-1 unicast JOIN and PRUNE
   messages.  By contrast, the offered load from multicast JOIN/PRUNE
   messages with the optimizations described here should scale roughly
   as N ** (propagation_delay/override_interval), with propagation_delay
   << override_interval.

   On a classical Ethernet the resource consumption from unicast and
   multicast messages is the same.  Consequently, the optimized JOIN/


Weinstein & Keller      Expires December 6, 2013               [Page 16]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


   PRUNE suppression described here will almost always be preferable to
   use of unicast JOIN/PRUNE messages without redundant message
   suppression.

   For LANs emulated by multicast-capable IP tunneling across wireless
   mobile ad-hoc networks (MANETs) the resource consumption depends upon
   the lower-layer mechanisms used to implement multicast, the network
   topology, the MAC protocol, the routing protocol, and the
   opportunities for spatial reuse [BBN-TM-2031].  Nevertheless, as long
   as link-local multicast is implemented at the lower layer via
   reversed-shortest-path-forwarding or some other mechanism that is
   more efficient than packet replication at the source, the optimized
   JOIN/PRUNE suppression will still usually be preferable to use of
   unicast JOIN/PRUNE messages without redundant message suppression.
   However, exceptions can occur if propagation delay and/or packet loss
   on the emulated LAN are too high to permit effective suppression of
   redundant JOIN/PRUNE messages.

   The authors have examined the behavior of the modified protocol in
   LANs with up to 100 attached PIM routers, with propagation delays on
   the order of a few tens of milliseconds.  We have found that, with
   the optimizations proposed herein, it is unusual for any redundant
   JOINs or PRUNEs to be emitted.  In most cases, only one PRUNE and, if
   needed, one overriding JOIN is emitted onto the network in each prune
   cycle.


9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3973]  Adams, A., Nicholas, J., and W. Siadak, "Protocol
              Independent Multicast - Dense Mode (PIM-DM): Protocol
              Specification (Revised)", RFC 3973, January 2005.

9.2.  Informative References

   [BBN-TM-2031]
              Ramanathan, R., "An Approximate Model for Analyzing Real-
              World Wireless Network Scalability", BBM Technical
              Memo 2031, March 2012, <http://www.ir.bbn.com/~ramanath/
              pdf/symptotics-techreport.pdf>.

   [RFC6559]  Farinacci, D., Wijnands, IJ., Venaas, S., and M.
              Napierala, "A Reliable Transport Mechanism for PIM",


Weinstein & Keller      Expires December 6, 2013               [Page 17]

Internet-Draft     PIM-DM Optimizations for Large LANs         June 2013


              RFC 6559, March 2012.

   [Weinstein_In_Progress_a]
              Weinstein, J., "Mobility Hiding through IP Tunneling
              Across Wireless MANETs", Work in Progress a.

   [Weinstein_In_Progress_b]
              Weinstein, J., "PIM-DM over Multi-Layer Networks such as
              MANETs and VPLS emulated LANs", Work in Progress b.


Authors' Addresses

   Joseph Weinstein
   Raytheon BBN Technologies
   10 Moulton Street
   Cambridge, MA  02138
   USA

   Email: weinstein@bbn.com


   Joseph Keller
   Raytheon BBN Technologies
   10 Moulton Street
   Cambridge, MA  02138
   USA

   Email: jkeller@bbn.com


Weinstein & Keller      Expires December 6, 2013               [Page 18]