Independent Submission J. Weinstein Internet-Draft J. Keller Intended status: Experimental BBN Technologies Expires: December 6, 2013 June 4, 2013 PIM-DM Optimizations for Large LANs draft-weinstein-bbn-pim-dm-large-lan-00 Abstract Optimizations to the PIM-DM protocol [RFC3973] are proposed that dramatically reduce the PIM-DM routing overhead on local area networks containing large numbers of PIM routers. These optimizations reduce the likelihood of two or more routers emitting redundant JOIN or PRUNE messages onto the same LAN. While these optimizations are of general applicability to large LANs, they were motivated in particular by performance requirements for supporting PIM-DM across LANs emulated by IP tunneling over large wireless MANETs. PIM-DM routers supporting these optimizations will interoperate with standard PIM-DM as defined in RFC 3973, even on the same LAN. However, to obtain maximum benefit from these optimizations, all PIM-DM routers on the LAN SHOULD support them. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. This document may not be modified, and derivative works of it may not be created, and it may not be published except as an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on December 6, 2013. Copyright Notice Copyright (c) 2013 IETF Trust and the persons identified as the Weinstein & Keller Expires December 6, 2013 [Page 1] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Conventions used in this document . . . . . . . . . . . . . . 5 3. Optimized Redundant JOIN suppression . . . . . . . . . . . . . 5 3.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . 5 3.2. Modifications to PIM-DM Timers [RFC 3973 Sec. 4.8] . . . . 7 3.3. Other modifications . . . . . . . . . . . . . . . . . . . 8 4. Redundant PRUNE suppression . . . . . . . . . . . . . . . . . 8 4.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . 8 4.2. Modifications to the state-machine-specific timers [RFC3973 Sec. 4.4.1] . . . . . . . . . . . . . . . . . . . 10 4.3. Modifications to the tabular description of the state machine [RFC3973 Sec. 4.4.1] . . . . . . . . . . . . . . . 11 4.4. Modifications to Transitions from the Forwarding (F) State [RFC3973 Sec. 4.4.1] . . . . . . . . . . . . . . . . 12 4.5. Modifications to Transitions from the Pruned (P) State [RFC3973 Sec. 4.4.1] . . . . . . . . . . . . . . . . . . . 13 4.6. Modifications to PIM-DM Timers [RFC 3973 Sec. 4.8] . . . . 14 4.7. Other modifications . . . . . . . . . . . . . . . . . . . 15 5. Compatibility with RFC 3973 . . . . . . . . . . . . . . . . . 15 6. Security Considerations . . . . . . . . . . . . . . . . . . . 16 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 16 8. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 16 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 17 9.1. Normative References . . . . . . . . . . . . . . . . . . . 17 9.2. Informative References . . . . . . . . . . . . . . . . . . 17 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 18 Weinstein & Keller Expires December 6, 2013 [Page 2] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 1. Introduction PIM-DM [RFC3973] is an IP multicast routing protocol optimized for situations where group members (receivers) are fairly densely distributed throughout the multicast distribution domain and there are not too many sources per group. It is based upon the reversed- shortest-path broadcast paradigm, in which multicast traffic from each source is distributed along a tree rooted at that source and consisting of the shortest paths from each potential destination in the multicast distribution domain back to that source. It optimizes this reversed-shortest-path broadcast tree by pruning unneeded branches that do not lead to any actual group members. Pruning of unneeded branches is accomplished through use of JOIN/PRUNE messages, sent by downstream routers towards the source in a direction opposite to that of data flow. When large numbers of PIM-DM routers are attached to a single LAN, PIM-DM as described in RFC 3973 becomes vulnerable to storms of redundant JOIN/PRUNE messages on the LAN. These storms can arise when more than one of the routers on the LAN lack downstream members for a particular (S,G) pair, and each sends a separate PRUNE message to the upstream router towards the source S. Each of these PRUNE messages may, in turn, trigger an overriding JOIN from one or more routers that do have downstream members for the (S,G) pair. Since IP multicast forwarding decisions are made on a per-interface basis, not a per-downstream-router basis, all except one of these PRUNE messages is unnecessary. Likewise, all except one of the overriding JOIN messages, if any, are redundant. These excess JOIN/PRUNE messages are particularly problematic because they tend to be emitted in a synchronous burst by all downstream routers on the LAN when the first user data packet or STATE REFRESH message arrives, potentially causing a temporary overload on the LAN. Furthermore, these JOIN/PRUNE storms will repeat periodically if STATE REFRESH messages are lost or otherwise unavailable. Since a separate set of JOIN/PRUNE messages is used for each (source, group) pair, the load from these redundant JOIN/PRUNE messages scales approximately as the product [number_of_groups] * [number_of_sources ] * [number_of_routers_per_LAN] * [rate_of_prune_cycles]. As such, this load becomes the key constraint limiting the number of sources and groups that can be supported within a single PIM-DM multicast domain. The PIM-DM specification [RFC3973] includes a mechanism for avoiding redundant overriding JOINs, but it is sub-optimal when the number of PIM routers on a single LAN is large or propagation delays on the LAN are non-negligible. This mechanism is built into the upstream state machine [RFC3973 Sec. 4.4.1] through the Override Timer. When a Weinstein & Keller Expires December 6, 2013 [Page 3] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 router in the Forwarding state or AckPending state overhears an upstream PRUNE that must be overridden, it sets this timer to a random value between 0 and Override_Interval. It then sends an overriding JOIN to the upstream router if and only if this timer expires before it overhears another overriding JOIN from some other router on the same interface to the same upstream router for the same (S,G) pair. We shall refer to this mechanism as redundant JOIN suppression. As will be explained in Section 3 of this document, the effectiveness of redundant JOIN suppression is extremely sensitive to the algorithm used to determine the setting for the Override Timer. RFC 3973 specifies that this timer is to be set to a random value between 0 and Override_Interval. It does not specify the probability distribution to use, but most implementations probably use a linear distribution. If the number of routers on the LAN is sufficiently large compared to the inverse propagation delay, large numbers of redundant messages may still be sent. The likelihood of this occurring can be greatly reduced by employing a deterministic algorithm for setting the Override_Timer that approximates an exponential probability distribution. Section 3 of this document specifies this algorithm. PIM-DM as specified in RFC 3973 does not, however, include any mechanism for avoiding redundant PRUNEs. Section 4 of this document describes a proposed mechanism, which is very similar to the strategy used for avoiding redundant JOINs. Instead of sending a PRUNE message upstream as soon as a router receives an incoming data packet from a LAN, the router may instead defer sending this PRUNE for a short period of time. The router then sends this PRUNE if, and only if, it does not overhear a PRUNE message from another router on the same LAN to the same upstream router for the same (S,G) pair before this timer expires. By using a deterministic algorithm for computing this delay, the delay before the first such router emits a PRUNE can be made effectively zero for the only case where the delay length impacts overall protocol performance -- those situation where there are no group members downstream of any router on the LAN. We shall refer to this proposed mechanism as redundant PRUNE suppression. Although the mechanisms described in this document should be beneficial in any situation where there are a large number of PIM-DM routers on a single LAN, these optimizations were motivated by the need to support PIM-DM across LANs emulated by multicast-capable IP tunneling over wireless mobile ad-hoc networks (MANETs) [Weinstein_In_Progress_a]. LANs emulated in this way are motivated by the desire to hide most mobility-related wireless network topology changes from IP by handling such changes through a custom MANET routing protocol at a lower layer. As the effectiveness of mobility Weinstein & Keller Expires December 6, 2013 [Page 4] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 hiding is dependent upon including all mobile routers in the theater of operation within a single emulated LAN, the number of PIM-DM routers on such a LAN can become quite large. At the same time, the capacity of such networks can be very limited, so that minimizing the overhead from PIM-DM can be crucially important. Finally, the propagation delays for traffic across such a network may be significantly larger than those typical of high-capacity wired LANs, reducing the efficacy of the redundant JOIN suppression mechanisms specified by RFC 3973. A separate document will deal with adaptations to PIM-DM necessitated by the multi-layer routing structure itself [Weinstein_In_Progress_b]. 2. Conventions used in this document This document describes optimizations to PIM-DM with reference to its current specification in RFC 3973. The terminology of that reference is adopted in its entirety. Concepts and terms not defined herein should be understood as having the meaning specified in RFC 3973. Algorithms and protocol elements not discussed herein should be understood to be identical to those specified in [RFC3973]. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC2119. In this document, these words will appear with that interpretation only when in ALL CAPS. Lower case uses of these words are not to be interpreted as carrying RFC2119 significance. 3. Optimized Redundant JOIN suppression 3.1. Motivation If a PIM-DM router with a non-null outgoing interface list (e.g., possible downstream members) for some (S,G) pair overhears a PRUNE on its incoming interface from some other router for the same (S,G) pair, then it must cancel that PRUNE by sending an overriding (S,G) JOIN to the upstream router U towards S. If there is more than one PIM-DM router with a non-null outgoing interface list for (S,G) with its incoming interface on the same network, more than one such router may send an overriding (S,G) JOIN to the upstream router U. All except the first of these is unnecessary, as the receipt of even one overriding JOIN is sufficient to prevent the upstream router from pruning its outgoing interface onto the LAN. The PIM-DM specification includes a mechanism for avoiding redundant Weinstein & Keller Expires December 6, 2013 [Page 5] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 overriding JOINs, but it is sub-optimal when the number of PIM routers on a single LAN is large and propagation delays on the LAN are non-trivial. This mechanism is built into the upstream state machine [RFC3973 Sec. 4.4.1] through the Override Timer. When a router in the Forwarding state or AckPending state overhears an upstream PRUNE that must be overridden, it sets this timer to a random value between 0 and Override_Interval. It then sends an overriding JOIN to the upstream router if and only if this timer expires before it overhears another overriding JOIN from some other router on the same interface to the same upstream router for the same (S,G) pair. RFC 3973 does not specify the probability distribution to use when setting the Override Timer for redundant JOIN suppression, although most implementations probably use a linear distribution. If the number of routers on the LAN is sufficiently large compared to the inverse propagation delay, redundant messages may still be sent. The likelihood of this occurring can be greatly reduced by employing a deterministic algorithm for setting the Override_Timer that approximates an exponential distribution. It is not difficult to estimate the expected number of redundant JOIN messages for any given algorithm from the propagation delay, the number of routers N on the LAN, and the number of routers on the LAN with downstream group members. The worst case occurs when only one router lacks downstream members for some source and group, and hence sends a PRUNE message upstream. In this case, N-2 other routers will overhear this PRUNE and schedule possible origination of an overriding JOIN message. As PIM-DM is optimized for use in situations where group members are densely distributed throughout the multicast distribution domain, this worst-case scenario and/or near- worst-case scenarios will be common. Using a linear probability distribution for the Override Timer as implied by RFC 3973, the expected delay before the first overriding JOIN is sent by some router will be given by Override_Interval/(N-2). However, it will typically take Propagation_Delay seconds for other routers to overhear this overriding JOIN. In the meantime, (N-3) * Propagation_Delay / Override_Interval other routers can be expected to emit overriding JOINs of their own. If the Propagation Delay is non-trivial, the number of overriding JOINs sent will increase linearly with the number N of routers on the LAN, and can be quite large if the number of routers N on the LAN is also large. The number of redundant JOINs can be dramatically reduced by employing a deterministic algorithm for picking the delay, which mimics an exponential probability distribution. We begin by considering an arbitrary index function J that assigns to each router Weinstein & Keller Expires December 6, 2013 [Page 6] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 on the LAN (except the upstream router towards the source S) a unique integer in the closed interval [0,N-2], where N is the total number of PIM routers on the LAN (including the current router). Such a function can be constructed, for example, by having each router count the number of other routers (other than the upstream router) with higher-addressed interfaces on the LAN, and using that count as the value of J. Then, when the Override Timer is to be set, consider setting it to the value Override_timer = Override_interval*ln(1+J)/ln(N-1) With this choice, the number of other routers that will also send a JOIN during the interval of length propagation_delay before they overhear this JOIN will be given by exp((ln(N)/ Override_Interval)*propagation_delay)) - 1 = N ** (propagation_delay/ override_interval). For propagation_delay < override_interval, as it must be for redundant JOIN suppression to work, this grows less rapidly than N and hence results in a much smaller number of redundant JOIN messages when N is large. This formula also guarantees that the Override Timer will never be set to a value greater than Override_interval, thus ensuring that an overriding JOIN is received by the upstream router before the upstream router prunes its outgoing interface. 3.2. Modifications to PIM-DM Timers [RFC 3973 Sec. 4.8] In order to implement the optimizations to redundant JOIN suppression describe above, the subsection of RFC 3973 Sec. 4.8 entitled "Timer Name: Upstream Override Timer (OT(S,G))" is modified to read as follows: Timer Name: Upstream Override Timer (OT(S,G)) +------------+----------------+----------------------------------------+ | Value Name | Value | Explanation | +------------+----------------+----------------------------------------| | t_override | OI(I)* | Randomized delay to prevent response | | | ln(1+J)/ | implosion when sending a join message | | | ln(N-1) | to override someone else's prune | +------------+----------------+----------------------------------------+ t_override is a value between 0 and the incoming interface's Override_Interval (OI(I)) computed from the formula t_override = OI(I)*ln(1+J)/ln(N-1), where N is the total number of PIM routers on incoming interface I (including this router), and J is a count of the number of PIM routers on incoming interface I (other than the upstream router towards source S) with addresses greater than the address of this router. If all routers on a LAN are using the LAN Prune Delay option, the Override_Interval (OI(I)) MUST be set Weinstein & Keller Expires December 6, 2013 [Page 7] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 to the largest value on the LAN. Otherwise, the Override_Interval (OI(I)) MUST be set to 2.5 seconds. Note that, if N <= 2, no router will ever need to send a join message to override someone else's prune, and so neither an attempt to take the logarithm of a negative number nor an attempt to divide by zero can ever occur in this computation. 3.3. Other modifications No other sections of RFC 3973 need to be modified to support optimized redundant JOIN suppression. 4. Redundant PRUNE suppression 4.1. Motivation When a PIM-DM router with a null outgoing interface list for a source S and group G receives a data packet for that source and group from the upstream interface towards that source, and its Prune Limit timer is not running, then it sends a PRUNE for that source and group to the upstream router towards that source [RFC3973 Sec. 4.4.1]. Similarly, if a PIM-DM router receives a STATE REFRESH message for a source S and group G from the upstream router towards that source, the Prune Indicator bit in that STATE REFRESH message is clear, the upstream interface is in the Pruned state, and its Prune Limit timer is not running, then the router sends a PRUNE for that source and group to the upstream router towards that source [RFC3973 Sec. 4.4.1]. Unfortunately, this behavior can lead to multiple redundant PRUNE messages being sent if there is more than one router on a LAN with a null outgoing interface list for the (S,G) pair. Each of these routers will send its own PRUNE to the upstream router towards S. All except the first of these, is unnecessary. Even worse, each of these PRUNEs can trigger a storm of overriding JOIN messages. These unnecessary PRUNEs can be avoided by employing a strategy similar to that which is employed for avoiding redundant JOINs. Instead of sending a PRUNE message upstream as soon as it receives an incoming data packet from a LAN as specified in RFC 3973, the router may instead defer sending this PRUNE for a short period of time. The router then sends this PRUNE if, and only if, it does not overhear a PRUNE message from another router on the same LAN to the same upstream router for the same (S,G) pair before this timer expires. However, in the case where no downstream router has any downstream group members, it is absolutely essential that the time delay before Weinstein & Keller Expires December 6, 2013 [Page 8] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 emission of the first PRUNE by some downstream router be kept to an absolute minimum. Until that PRUNE is received by the upstream router, unnecessary user traffic will continue to be admitted to the LAN and, possibly, upstream networks as well, consuming unnecessary resources. Forutunately, this time delay is relevant only if no downstream router has any downstream members. Otherwise the PRUNE will be overridden anyway by a JOIN from one of the downstream routers with downstream members, and data will continue to flow. By using a deterministic algorithm for computing this delay, the delay before some router sends the first PRUNE can, as required, be kept to zero for the situation where there are no group members downstream of any router on the LAN. This cannot be guaranteed by any probabilistic algorithm. Since a deterministic algorithm is required, we consider again the algorithm described in sec. 3.1. above, restated as: Prune_deferral_timer = Prune_deferral_interval*ln(1+J)/ln(N-1) As before, J is an index function that assigns to each router on the LAN (except the upstream router towards the source S) a unique integer in the interval [0,N-2], where N is the total number of PIM routers on the LAN. Such a function can be constructed, for example, by having each router count the number of other routers (other than the upstream router) with higher-addressed interfaces on the LAN, and using that count as the value of J. Due to its deterministic character and the definition of the index function J, this algorithm guarantees that there will always be some router on the LAN for which J=0, so that Prune_deferral_timer is also zero. Consequently, when every router (other than the upstream router) lacks downstream group members, this algorithm guarantees that some router will as required emit a PRUNE with zero delay. Since the time delay before some router emits a first PRUNE is irrelevant if one or more downstream routers have downstream group members, the value of Prune_deferral_interval can safely be made fairly large, thus minimizing the risk that redundant PRUNEs will be sent. In fact, if there were no concern for the possibility that PRUNEs might be lost, there would be no need for any router other than the highest-addressed router to ever send a PRUNE. If it had failed to send a PRUNE, then it could be concluded that it had downstream group members, and hence would itself generate an overriding JOIN if some other router then sent a PRUNE. Using an algorithm that provides for other routers to send PRUNEs adds a measure of redundancy, that helps protect against the possibility that a PRUNE from the highest-addressed router might simply be lost. Weinstein & Keller Expires December 6, 2013 [Page 9] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 The order in which a router sees an incoming (S,G) data packet, a PRUNE emitted by some other router in response to that packet, and an overriding JOIN from a third router, may vary depending upon propagation delays in the network. In particular, one must also address the situation wherein a router sees an overriding JOIN from another before it sees either the PRUNE that triggered that overriding JOIN or the data packet that triggered the PRUNE. This can be handled through judicious use of the Prune Limit Timer. Hence, to avoid unnecessary PRUNEs without increasing the time delay before pruning can occur, the operation of the Upstream Prune, Join, and Graft State Machine [RFC3973 sec. 4.4.1] SHOULD be modified as described in this section. 4.2. Modifications to the state-machine-specific timers [RFC3973 Sec. 4.4.1] In the description of the state-machine-specific timers [RFC3973 sec. 4.4.1], the following timer is added: Prune Deferral Timer (PDT(S,G)) This timer is used to support redundant PRUNE suppression on a LAN. If a data packet arrives on the upstream interface towards S (RPF_Interface(S)) while the outgoing interface list olist(S,G) is NULL and the prune limit timer PLT(S,G) is not running, and redundant PRUNE suppression is enabled on that interface, then this timer is set to a value t_deferral between 0 and Prune Deferral Interval. If redundant PRUNE suppression is not enabled on that interface, then this timer is set to zero (0) and expires immediately. This timer is also set if a STATE REFRESH message is received with its Prune Indicator is clear, and the upstream interface is in the Pruned state. If an (S,G) PRUNE is seen on the upstream interface towards S (RPF_Interface(S)) and is directed towards RPF'(S), then this timer is cancelled. If this timer expires, a PRUNE is sent to the upstream router towards S (RPF'(s)). The computation` of t_deferral is described in Sec. 4.8. Weinstein & Keller Expires December 6, 2013 [Page 10] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 4.3. Modifications to the tabular description of the state machine [RFC3973 Sec. 4.4.1] In the tabular description of the state machine [RFC3973 sec. 4.4.1], the following entries are added: +-------------------------------+--------------------------------------+ | | Previous State | | +------------+------------+------------+ | Event | Forwarding | Pruned | AckPending | +-------------------------------+------------+------------+------------+ | PDT(S,G) expires | N/A | ->P Send | N/A | | | | Prune(S,G) | | +-------------------------------+------------+------------+------------+ Weinstein & Keller Expires December 6, 2013 [Page 11] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 The following entries, identified by event and previous state, are modified: +-------------------------------+--------------------------------------+ | | Previous State | | +------------+------------+------------+ | Event | Forwarding | Pruned | AckPending | +-------------------------------+------------+------------+------------+ | Data packet arrives on | ->P Set | ->P Set | N/A | | RPF_Interface(S) AND | PDT(S,G) | PDT(S,G) | | | olist(S,G) == NULL AND |Set PLT(S,G)|Set PLT(S,G)| | | PLT(S,G) not running | | | | +-------------------------------+------------+------------+------------+ | State Refresh(S,G) received | ->F | ->P Set |->F Cancel | | from RPF`(S) AND | | PDT(S,G) | GRT(S,G) | | Prune Indicator == 0 AND | |Set PLT(S,G)| | | PLT(S,G) not running | | | | +-------------------------------+------------+------------+------------+ | See Join(S,G) to RPF'(S) | ->F Cancel | ->P Cancel |->AP Cancel | | | OT(S,G) | PDT(S,G) | OT(S,G) | | |Set PLT(S,G)|Set PLT(S,G)|Set PLT(S,G)| +-------------------------------+------------+------------+------------+ | See Prune(S,G) | ->F Set | ->P Cancel |->AP Set | | | OT(S,G) | PDT(S,G) | OT(S,G) | | |Set PLT(S,G)|Set PLT(S,G)|Set PLT(S,G)| +-------------------------------+------------+------------+------------+ | olist(S,G)->non-NULL | N/A | ->AP Send | N/A | | | | Graft(S,G) | | | | |Set GRT(S,G)| | | | |Cancel | | | | | PDT(S,G)| | +-------------------------------+------------+------------+------------+ | RPF'(S) Changes AND | ->AP Send | ->AP Send |->AP Send | | olist(S,G) != NULL | Graft(S,G) | Graft(S,G) | Graft(S,G) | | |Set GRT(S,G)|Set GRT(S,G)|Set GRT(S,G)| | | |Cancel | | | | | PDT(S,G)| | +-------------------------------+------------+------------+------------+ 4.4. Modifications to Transitions from the Forwarding (F) State [RFC3973 Sec. 4.4.1] In the description of Transitions from the Forwarding (F) State [RFC 3973 Sec. 4.4.1.1], the following sections are modified as follows: Data Packet arrives on RPF_Interface(S) AND olist(S,G) == NULL AND S NOT directly connected Weinstein & Keller Expires December 6, 2013 [Page 12] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 The Upstream(S,G) state machine MUST transition to the Pruned (P) state, set the Prune Deferall Timer PDT(S,G) to t_deferall seconds, and set PLT(S,G) to t_limit seconds. See Join(S,G) to RPF'(S) This event is only relevant if RPF_interface(S) is a shared medium. This router sees another router on RPF_interface(S) send a Join(S,G) to RPF'(S,G). If the OT(S,G) is running, then it means that the router had scheduled a Join to override a previously received Prune. Another router has responded more quickly with a Join, so the local router SHOULD cancel its OT(S,G), if it is running. The Upstream(S,G) state machine remains in the Forwarding (F) state. In addition, there is no sense in sending a PRUNE to RPF'(S) in the near-term, as that PRUNE would only be overridden by another JOIN. Hence, the upstream (S,G) state machine SHOULD set the Prune Limit Timer PLT(S,G) to t_limit. See Prune(S,G) AND S NOT directly connected This event is only relevant if RPF_interface(S) is a shared medium. This router sees another router on RPF_interface(S) send a Prune(S,G). As this router is in Forwarding state, it must override the Prune after a short random interval. If OT(S,G) is not running, the router MUST set OT(S,G) to t_override seconds. The Upstream(S,G) state machine remains in Forwarding (F) state. The router MAY reset its PLT(S,G) to the value in the Holdtime field of the received message if it is greater than the current value of the PLT(S,G). 4.5. Modifications to Transitions from the Pruned (P) State [RFC3973 Sec. 4.4.1] In the description of Transitions from the Pruned (P) State [RFC 3973 Sec. 4.4.1.1], the following sections are added: Prune Deferral Timer PDT(S,G) expires Send a PRUNE(S,G) to RPF'(S) See Join(S,G) to RPF'(S) A Join(S,G) is seen on RPF_interface(S) to RPF'(S). The Upstream(S,G) state machine stays in the Pruned (P) state. If the Prune Deferral Timer PDT(S,G) is set, it SHOULD be cancelled. In addition, there is no sense in sending another PRUNE to RPF'(S) in the near-term, as that PRUNE would only be overridden by another JOIN. Hence, the upstream (S,G) state machine SHOULD set the Prune Limit Timer PLT(S,G) to t_limit. In the description of Transitions from the Pruned (P) State [RFC 3973 Sec. 4.4.1.1], the following sections are modified as follows: Weinstein & Keller Expires December 6, 2013 [Page 13] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 Data arrives on RPF_interface(S) AND PLT(S,G) not running AND S NOT directly connected Either another router on the LAN desires traffic from S addressed to G or a previous Prune was lost. To prevent generating a Prune(S,G) in response to every data packet, the Prune Limit Timer (PLT(S,G)) is used. To prevent multiple routers from generating Prune(S,G) messages, the Prune Deferral Timer is used (PDT(S,G)). Once the PLT(S,G) expires, some router needs to send another prune in response to a data packet not received directly from the source. The Prune Deferral Timer PDT(S,G) MUST be set to t_deferral, and the PLT(S,G) MUST be set to t_limit. State Refresh(S,G) Received from RPF'(S) The Upstream(S,G) state machine remains in a Pruned state. If the State Refresh has its Prune Indicator bit set to zero and PLT(S,G) is not running, The Prune Deferral Timer PDT(S,G) MUST be set to t_deferral, and the PLT(S,G) MUST be set to t_limit. If the State Refresh has its Prune Indicator bit set to one, the router MUST reset PLT(S,G) to t_limit. See Prune(S,G) to RPF'(S) A Prune(S,G) is seen on RPF_interface(S) to RPF'(S). The Upstream(S,G) state machine stays in the Pruned (P) state. If the Prune Deferral Timer PDT(S,G) is set, it SHOULD be cancelled. The router MAY reset its PLT(S,G) to the value in the Holdtime field of the received message if it is greater than the current value of the PLT(S,G). olist(S,G)->non-NULL AND S NOT directly connected The set of interfaces defined by the olist(S,G) macro becomes non-empty, indicating that traffic from S addressed to group G must be forwarded. The Upstream(S,G) state machine MUST cancel PLT(S,G), transition to the AckPending (AP) state and unicast a Graft message to RPF'(S). The Graft Retry Timer (GRT(S,G)) MUST be set to Graft_Retry_Period. If the Prune Deferral timer PDT(S,G) is set, it MUST be cancelled. RPF'(S) Changes AND olist(S,G) == non-NULL AND S NOT directly connected Unicast routing or Assert state causes RPF'(S) to change, including changes to RPF_Interface(S). The Upstream(S,G) state machine MUST cancel PLT(S,G), cancel the Prune Deferral Timer PDT(S,G), transition to the AckPending (AP)state, send a Graft unicast to the new RPF'(S), and set the GraftRetry Timer (GRT(S,G)) to Graft_Retry_Period. 4.6. Modifications to PIM-DM Timers [RFC 3973 Sec. 4.8] A new timer must be added to the list of PIM-DM Timers at the head of RFC 3973 Sec. 4.8 [RFC3973]: Weinstein & Keller Expires December 6, 2013 [Page 14] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 Per (S,G) Pair: ... (S,G) Prune Deferral Timer : PDT(S,G) A new subsection must be added to RFC 3973 Sec. 4.8 [RFC3973]: Timer Name: Prune Deferral Timer (PDT(S,G)) +------------+----------------+----------------------------------------+ | Value Name | Value | Explanation | +------------+----------------+----------------------------------------| | t_deferral | PDI(I)* | Randomized delay to prevent response | | | ln(1+J)/ | implosion when sending a prune in | | | ln(N-1) | response to incoming user traffic or | | | | an incoming state refresh message | +------------+----------------+----------------------------------------+ t_deferall is a value between 0 and the incoming interface's Prune_Deferall_Interval (PDT(I)) computed from the formula t_deferall = PDT(I)*ln(1+J)/ln(N-1), where N is the total number of PIM routers on incoming interface I (including this router), and J is a count of the number of PIM routers on incoming interface I (other than the upstream router towards source S) with addresses greater than the address of this router. The Prune_Deferall_Interval SHOULD be set by configuration and SHOULD have a value significantly greater than the expected propagation delay (PD(I)) but significantly less than the Prune Limit Timer. All routers on the same LAN SHOULD be assigned the same value of Prune_Deferall_Interval. Recommended default: 10 seconds. Note that, if N <= 1, there will be no router to send a prune on the incoming interface I, and so an attempt to take the logarithm of a negative number will never occur. If N = 2, then the only possible value of J will be J = 0, so the formula given above will result in a divide of zero by zero (ln(1) / ln(1) ). This should be resolved by letting t_deferral = 0 in this case. 4.7. Other modifications No other modifications to RFC 3973 are required. 5. Compatibility with RFC 3973 PIM-DM routers may implement the optimizations described herein on certain interfaces, while implementing standard PIM-DM as described in RFC 3973 on others. Hence, PIM-DM routers supporting the optimization described herein SHOULD provide a per-interface Weinstein & Keller Expires December 6, 2013 [Page 15] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 configuration mechanism for selectively enabling or disabling these optimizations. PIM-DM routers implementing the optimizations described herein will interoperate with those implementing standard PIM-DM, as described in RFC 3973, even if intermixed on the same LAN. However, to obtain the full benefit of these optimizations, all PIM routers on the LAN should implement these optimizations. Mixed LANs may exhibit degraded performance, especially if several PIM-DM routers implementing only standard PIM are assigned higher-numbered addresses on the LAN than any of those implementing these optimizations. 6. Security Considerations The optimizations described in this document should not have any impact upon the security of the PIM-DM protocol or the lack thereof. 7. IANA Considerations There are no IANA considerations. 8. Conclusions This document has described several optimizations to the PIM-DM protocol for LANs with large numbers of attached PIM-DM routers. These optimizations more fully exploit the potential for suppressing redundant JOIN/PRUNE messages inherent in the use of link-local multicast to transport these messages. These optimizations provide maximum benefit when the number of PIM routers attached to a single LAN is large, and the propagation delay on the LAN is non-trivial but much shorter than the LAN Prune Delay. The optimized JOIN/PRUNE suppression described here will in most cases scale much better to large LANs than possible alternatives in which JOIN/PRUNE messages are unicast to the upstream router instead (as proposed by RFC6559 for PIM-SM). The use of unicast obviates all possibility of redundant JOIN/PRUNE suppression, resulting in an implosion at the upstream router of N-1 unicast JOIN and PRUNE messages. By contrast, the offered load from multicast JOIN/PRUNE messages with the optimizations described here should scale roughly as N ** (propagation_delay/override_interval), with propagation_delay << override_interval. On a classical Ethernet the resource consumption from unicast and multicast messages is the same. Consequently, the optimized JOIN/ Weinstein & Keller Expires December 6, 2013 [Page 16] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 PRUNE suppression described here will almost always be preferable to use of unicast JOIN/PRUNE messages without redundant message suppression. For LANs emulated by multicast-capable IP tunneling across wireless mobile ad-hoc networks (MANETs) the resource consumption depends upon the lower-layer mechanisms used to implement multicast, the network topology, the MAC protocol, the routing protocol, and the opportunities for spatial reuse [BBN-TM-2031]. Nevertheless, as long as link-local multicast is implemented at the lower layer via reversed-shortest-path-forwarding or some other mechanism that is more efficient than packet replication at the source, the optimized JOIN/PRUNE suppression will still usually be preferable to use of unicast JOIN/PRUNE messages without redundant message suppression. However, exceptions can occur if propagation delay and/or packet loss on the emulated LAN are too high to permit effective suppression of redundant JOIN/PRUNE messages. The authors have examined the behavior of the modified protocol in LANs with up to 100 attached PIM routers, with propagation delays on the order of a few tens of milliseconds. We have found that, with the optimizations proposed herein, it is unusual for any redundant JOINs or PRUNEs to be emitted. In most cases, only one PRUNE and, if needed, one overriding JOIN is emitted onto the network in each prune cycle. 9. References 9.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [RFC3973] Adams, A., Nicholas, J., and W. Siadak, "Protocol Independent Multicast - Dense Mode (PIM-DM): Protocol Specification (Revised)", RFC 3973, January 2005. 9.2. Informative References [BBN-TM-2031] Ramanathan, R., "An Approximate Model for Analyzing Real- World Wireless Network Scalability", BBM Technical Memo 2031, March 2012, . [RFC6559] Farinacci, D., Wijnands, IJ., Venaas, S., and M. Napierala, "A Reliable Transport Mechanism for PIM", Weinstein & Keller Expires December 6, 2013 [Page 17] Internet-Draft PIM-DM Optimizations for Large LANs June 2013 RFC 6559, March 2012. [Weinstein_In_Progress_a] Weinstein, J., "Mobility Hiding through IP Tunneling Across Wireless MANETs", Work in Progress a. [Weinstein_In_Progress_b] Weinstein, J., "PIM-DM over Multi-Layer Networks such as MANETs and VPLS emulated LANs", Work in Progress b. Authors' Addresses Joseph Weinstein Raytheon BBN Technologies 10 Moulton Street Cambridge, MA 02138 USA Email: weinstein@bbn.com Joseph Keller Raytheon BBN Technologies 10 Moulton Street Cambridge, MA 02138 USA Email: jkeller@bbn.com Weinstein & Keller Expires December 6, 2013 [Page 18]