Network Working Group                                          Jerry Ash
Internet Draft                                        Gagan L. Choudhury

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   Based on overload and failure experience with link-state protocols,
   this draft identifies means to enable link-state protocols to:

   o  avoid getting into congestion states wherever possible
   o  respond gracefully to network overloads and failures
   o  gracefully recover from massive loss of topology database
      information

   The proposed mechanisms include the following:

   o  throttle setups of link adjacencies
   o  reduce the rate of flooding of link-state advertisements (LSAs)
   o  prioritized treatment of Hello & LSA Ack packets
   o  database backup & resynchronization for graceful recovery from
      loss of topology database information

Table of Contents

   1. Introduction & Problem Statement
   2. Failure Experience & Analysis
   3. Proposed Solution Methods
   4. OSPF Congestion Avoidance & Control
      4.1 OSPF Congestion Avoidance Mechanisms
          4.1.1 Throttle Setups of Link Adjacencies
          4.1.2 DSCP Marking of Hello & LSA Ack Packets
          4.1.3 Use of Other Control Packets to Detect Liveness of
                Adjacency
      4.2 OSPF Congestion Control Mechanisms
          4.2.1 Detecting Congestion & Notifying Neighbors
          4.2.2 Router Response When it Detects Congestion
                4.2.2.1 Congestion Notification
                4.2.2.2 Throttle Rate of LSA Flooding
                4.2.2.3 Increase Adjacency Break Interval
                4.2.2.4 Reduce the Rate of SPF Computation
          4.2.3 Router Response if Congestion Notification Received
          4.2.4 Graceful Recovery from Loss of Topology Database
                Information - Database Backup & Resynchronization
   5. Related Solution Methods
      5.1 Optimized Link State Routing (OLSR) Protocol
      5.2 Fisheye State Routing Protocol (FSR) for Ad Hoc Networks
      5.3 Landmark Routing Protocol (LANMAR) for Large Scale Ad Hoc
          Networks
      5.4 Topology Broadcast Based on Reverse-Path Forwarding (TBRPF)
   6. Security Considerations
   7. Acknowledgments
   8. References
   9. Appendix
      Appendix A - Dropping Link Adjacencies Under Extreme Congestion
      Appendix B - Vendor Analysis of LS Protocol Performance
      Appendix C - Analytic Modeling & Simulation Analysis of LS
                   Protocol Performance
   10. Authors' Addresses
   11. Copyright Statement

1. Introduction & Problem Statement

   Congestion in control protocols has been a problem in data networks
   many times in the past; [att, pappalardo1, pappalardo2] are good
   examples.  It can arise in data networks for many different reasons.
   There is evidence based on previous failures that link-state
   protocols such as OSPF cannot recover from large failures that
   result in widespread loss of topology database information.  The
   problem is aggravated when the number of routers in an area is
   large, which can then lead to an overload in the flooding of
   topology database information.
   Topology information in OSPF is distributed in Link State
   Advertisements (LSAs), which are encapsulated in link state update
   packets and periodically flooded to other routers in the domain
   through all available links.  In some instances of network overload,
   failure, and/or congestion, the redundancy of the flooding mechanism
   can overwhelm network processors and bring the network down.
   Earlier papers and contributions identified issues of congestion
   control and failure recovery for link-state networks such as OSPF
   [maunder, choudhury, pappalardo1, pappalardo2, atm01-0101].

   The problem addressed in this draft is to enable OSPF to

   o  avoid getting into congestion states wherever possible,
   o  respond gracefully to network overloads and failures, and
   o  gracefully recover from massive loss of topology database
      information.

   The need for graceful recovery mechanisms is illustrated in Section
   2, where we discuss failure experience in which all topology
   database information was lost in all routers in a large network,
   and the link-state protocol was unable to recover.

   If a router floods control packets to its neighbors and has no
   congestion avoidance and control mechanisms, it can overwhelm its
   neighbors and cause congestion.  To prevent this we need mechanisms
   both to avoid going into congestion and to recover when congestion
   occurs.  Congestion avoidance prevents the OSPF domain from entering
   a state of congestive collapse with control traffic.  Congestive
   collapse is a situation where, although the network links are
   heavily utilized with an overload of control messages, very little
   useful work is being done because the routers are overwhelmed with
   non-useful work.  Congestion control and avoidance consists of
   automatic mechanisms to reduce control traffic to avoid congestion,
   together with actions taken if congestion arises.  We define both in
   this draft and propose specific additional considerations for
   network congestion control/failure recovery for the OSPF protocol,
   which include:

   o  throttle setups of link adjacencies
   o  reduce the rate of flooding of link-state advertisements (LSAs)
   o  prioritized treatment of Hello & LSA Ack packets
   o  database backup & resynchronization for graceful recovery from
      loss of topology database information

   We also define a standard way to detect and notify congestion states
   between routers.

   Section 2 contains a discussion of data network outages in which
   these problems have manifested themselves.  As a result of these
   failures, a number of congestion control/failure recovery mechanisms
   were recommended, and these are reviewed in Section 3.  In Section
   4, we discuss how these candidate mechanisms for control of network
   congestion and failure recovery can be used.  Section 5 discusses
   related solution methods which have been implemented elsewhere to
   control congestion and may be worth considering for OSPF.

2. Failure Experience & Analysis

   AT&T has experienced serious data network outages in which recovery
   of the underlying LS protocols was inadequate.  For example, in the
   failure of the AT&T Frame Relay Network on April 13, 1998 [att], an
   initial procedural error triggered two undetected software bugs,
   leading to a huge overload of control messages in the network.  The
   result of this control overload was the loss of all topology
   information, which the LS protocol then attempted to recover using
   the usual Hello and LS update mechanisms.  However, the LS protocol
   was overwhelmed and unable to recover, and manual means had to be
   used to restart the network after a long outage.
   Analysis showed that several problems then combined to prevent the
   network from recovering properly:

   o  A very large number of LS updates was sent to every router for
      processing, causing general processor overload.
   o  Route computation was based on incomplete topology recovery,
      causing routes to be generated from transient, asynchronous
      topology information and then to need frequent re-computation.
   o  Work queue management was inadequate to allow processes to
      complete before more work was put into the process queue.
   o  The network could not be segmented (into smaller "areas") to aid
      the LS protocol recovery.
   o  The node processors could not be accessed with network management
      commands because these messages lacked the necessary priority.

   A more recent failure occurred on February 20, 2001 in the AT&T ATM
   Network; it resulted in a large overload of LS updates and a lengthy
   network outage [att, pappalardo1, pappalardo2].  Manual procedures
   were put in place to reduce LS update flooding, which worked to
   stabilize the network.  It is desirable that such flooding reduction
   be automatic under overload.  In general, there have been a number
   of major outages reported by most major carriers, and routing
   protocol issues have generally been involved.  Other relevant
   LS-network failures are reported in [cholewka, jander].

   Various networks employing LS protocols use various control messages
   and mechanisms to update the LS database, not necessarily LSAs,
   PTSEs, or flooding mechanisms.  Based on experience, however, LS
   protocols including OSPF are found to be vulnerable to loss of
   topology information, such as occurred in the scenario described
   above [att], to control overload when re-syncing databases, and to
   other failure/overload scenarios, all of which make such networks
   more vulnerable in the absence of adequate protection mechanisms.
   Hence we are addressing a generic problem of LS protocols across a
   variety of implementations, and the basic problem is prevalent in LS
   protocol networks employing frame-relay, ATM, and IP based
   technologies.  However, the solutions given here are for OSPF
   networks alone.

3. Proposed Solution Methods

   We propose enhancements and methods to prevent congestion arising
   out of any stimulus.  There are general congestion control/failure
   recovery methods that have been applied in many networks, and these
   are considered for enhancement to the OSPF protocol.  We propose
   enhancing OSPF using commonly known congestion control and recovery
   mechanisms.  Mechanisms of this type have been implemented in other
   protocols; for example, EIGRP (Enhanced Interior Gateway Routing
   Protocol) uses a backoff mechanism based on round-trip timeout.
   Handling congestion in a more intelligent manner would also allow
   OSPF areas to grow to larger scales: the flooding diameter as a
   limiting factor could be greatly reduced, because congestion
   reaction mechanisms within the protocol allow flooding to be paced
   to levels on par with the capabilities of each adjacent pair of
   routers in the area.

   The following mechanisms are proposed in this draft for congestion
   control in OSPF:

   o  throttle setups of link adjacencies, and reduce the rate of
      flooding of LSAs.  This is discussed in greater detail in this
      document.
   o  database backup, in which a topology database can be
      automatically recovered from loss based on local backup
      mechanisms.  This is discussed further in this document.
   o  special marking of critical control messages (e.g., Hello and
      LSA Ack packets) so that they may receive prioritized processing
      [maunder].
      This is dealt with in detail in [maunder]; with packets
      prioritized as described there, the chances of adjacencies
      breaking are lowered.  Failure to process Hellos, which can lead
      to adjacencies breaking down, and failure to process
      acknowledgements, which leads to retransmissions, should also be
      avoided.

   The following mechanisms also help in congestion control for OSPF
   (a sketch of the rate limiting mentioned in the last item follows
   this list):

   o  hitless restart, which allows routes to continue to be used if
      there is an uninterrupted data path, even if the control path is
      interrupted due to a failure [moy1, zinin1].  This is generally
      helpful in stable topologies; hitless restart prevents other
      routers in the domain from changing paths to destinations through
      the restarting router, and hence avoids the transient load caused
      by flooding of changed LSAs and by SPF calculations.
   o  give necessary priority to network management control messages so
      they are not locked out during times of failure.  This is because
      certain configurable parameters need to be updated to prevent
      congestion, and giving less priority to such messages can lead to
      further congestion.
   o  use flooding optimizations to prevent unnecessary flooding in
      densely meshed topologies.  Such mechanisms are specified in
      [moy2, zinin2], which prevent unnecessary flooding over parallel
      paths.  Further work for highly meshed topologies is planned to
      be taken up by the OSPF WG.
   o  mechanisms to automatically segment a large area into smaller
      areas, allow these segments to recover their topology and
      routing, then combine the smaller areas into the original large
      area and allow that to recover.  Such mechanisms are implemented
      in operational networks, for example through network management
      scripts that segment the network under severe overload.
      Currently these procedures are triggered manually; it would be
      desirable to automate such mechanisms within OSPF (see Appendix A
      for further discussion).
   o  provide bandwidth allocation mechanisms to limit the amount of
      link bandwidth allocated to control traffic versus data traffic.
      This is essentially a rate limiting mechanism, which can be and
      is used in implementations.  For example, this capability exists
      in PNNI: RCCs are provisioned with traffic parameters (default
      parameters are specified in [af-pnni-0055.000]) and traffic is
      shaped into the RCC VC at the configured traffic rate.
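   To illustrate the last item, the following is a minimal sketch of
   token-bucket rate limiting of control traffic.  The class name,
   parameters, and default values are illustrative assumptions, not
   taken from any implementation; PNNI RCC shaping, for example, uses
   provisioned traffic parameters instead.

      import time

      class TokenBucket:
          """Rate-limit control traffic to a configured share of link
          bandwidth.  'rate_bps' and 'burst_bytes' are illustrative
          knobs for the bandwidth allocated to control traffic."""

          def __init__(self, rate_bps: float, burst_bytes: int):
              self.rate = rate_bps / 8.0    # token refill rate, bytes/sec
              self.capacity = burst_bytes   # maximum bucket depth, bytes
              self.tokens = float(burst_bytes)
              self.last = time.monotonic()

          def allow(self, packet_len: int) -> bool:
              """Return True if a control packet may be sent now."""
              now = time.monotonic()
              # Refill tokens for the elapsed time, capped at capacity.
              self.tokens = min(self.capacity,
                                self.tokens + (now - self.last) * self.rate)
              self.last = now
              if self.tokens >= packet_len:
                  self.tokens -= packet_len
                  return True
              return False   # caller queues or delays the packet

   A sender would call allow() before transmitting each control packet,
   queuing or delaying packets whenever it returns False.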
4. OSPF Congestion Avoidance & Control

   Congestion avoidance and control mechanisms are described in this
   Section.  While we suggest example implementation methods, the exact
   implementation details are intentionally left optional and up to the
   implementer.  It is intended, however, that all of the methods
   suggested in the draft be implemented together rather than any
   subset of them.

4.1 OSPF Congestion Avoidance Mechanisms

4.1.1 Throttle Setups of Link Adjacencies

   When an OSPF process in a router goes down and comes up again (or
   loses a large number of adjacencies for any other reason), it will
   start forming adjacencies with other routers.  However, as the
   router reforms these lost adjacencies, it may start breaking already
   existing adjacencies due to processor overload or network
   congestion.  For example, some implementations may experience memory
   starvation when a large number of adjacencies are being established
   because, per the OSPF specification, the router is supposed to
   construct the LSDB summary list containing the headers of all LSAs
   for each adjacency when the adjacency transitions from ExStart to
   Exchange.  Building this LSDB summary list may cause OSPF to use
   more memory than it uses under normal operation, so more adjacencies
   can be formed and held over time than can be formed at once during a
   restart condition.  This type of problem can be prevented by
   prioritizing adjacencies, as further explained below.

   A count, MAX_ADJACENCY_BUILD_COUNT, of the maximum number of
   adjacencies that a router can bring up at any time SHOULD be kept in
   the OSPF main structure.  After the number of adjacencies in states
   between Exchange and Loading exceeds MAX_ADJACENCY_BUILD_COUNT, new
   adjacencies SHOULD NOT be brought up.  The methods which MAY be used
   to achieve this limiting of adjacencies include:

   o  Not sending Hellos on particular interfaces, which would bring
      the neighbors to the Down state.
   o  Sending Hellos, but not listing the neighbors, if the neighbors
      are in a state below Two-Way (note that not listing the neighbor
      causes the adjacency to go below Two-Way, to One-Way).
   o  Not letting the adjacency progress past the Two-Way state, by not
      sending any Database Description packets and ignoring any
      Database Description packet received.  The adjacency can also be
      parked in ExStart.

   Such a MAX_ADJACENCY_BUILD_COUNT mechanism has already been
   implemented in some routers.

4.1.2 DSCP Marking of Hello & LSA Ack Packets

   A special DSCP [RFC2474] marking MUST be given to Hello and LSA Ack
   packets.  These are critical control messages that must receive
   prioritized treatment.  Failure to process Hellos, which can lead to
   adjacencies breaking down and to inconsistent designated router
   elections, and failure to process acknowledgements, which leads to
   retransmissions, MUST be avoided.  By marking these packets and
   giving them priority treatment in their processing, the chances of
   adjacencies breaking are lowered.  The related goal is to achieve
   proper prioritization within routers' queues, both transmit and
   receive queues.  Note, however, that this interacts with
   cryptographic authentication, which has no tolerance for the packet
   reordering that priority queuing can introduce.  Maintaining
   separate sequence number counters for different packet types on the
   receiving side may solve this issue.

4.1.3 Use of Other Control Packets to Detect Liveness of Adjacency

   At present, if the control link between two routers is congested,
   for example due to a high volume of LSAs, it is possible for several
   consecutive Hellos to get dropped.  In that case, if the Inactivity
   Timer expires (i.e., the ROUTER_DEAD_INTERVAL is exceeded), the
   neighbor is assumed to be down, even though it is clearly up because
   LSAs are still being received from it.  To prevent this, a router
   SHOULD treat the receipt of any OSPF packet from a neighbor as an
   indication that the neighbor is up, so on receipt of any OSPF packet
   (not just a Hello) the Inactivity Timer is reset.  This behavior has
   been implemented in some routers.

4.2 OSPF Congestion Control Mechanisms

4.2.1 Detecting Congestion & Notifying Neighbors

   Router congestion can be detected at a router in various ways.  For
   example, congestion can be inferred from the following signals (a
   sketch combining them appears after this list):

   o  The length of internal work queues.
   o  Missing some percentage of Hellos; for instance, missing 15 of
      the last 20 Hellos might indicate that a router is regularly
      missing OSPF packets on a given link.
   o  Frequent retransmissions being required.
   o  Explicit indication of congestion from an adjacent router.
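   The following is a minimal sketch of how these signals might be
   combined into the three congestion levels used in this Section,
   together with the state-change damping discussed below.  All
   thresholds and the damping interval are illustrative assumptions;
   the draft leaves detection and damping details to the implementer.

      import time
      from enum import Enum

      class Congestion(Enum):
          NONE = 0
          LOW = 1
          HIGH = 2

      class CongestionDetector:
          """Combine the signals listed above into a damped congestion
          state.  Thresholds and DAMP_INTERVAL are assumptions."""

          DAMP_INTERVAL = 30.0   # min seconds between state changes

          def __init__(self, queue_low=500, queue_high=2000):
              self.queue_low = queue_low     # work-queue thresholds
              self.queue_high = queue_high
              self.state = Congestion.NONE
              self.last_change = time.monotonic()

          def classify(self, queue_len, hellos_missed, hellos_expected,
                       retransmit_ratio, neighbor_signaled_high):
              miss_ratio = hellos_missed / max(1, hellos_expected)
              # e.g., missing 15 of the last 20 Hellos gives ratio 0.75
              if (queue_len >= self.queue_high or miss_ratio >= 0.75
                      or neighbor_signaled_high):
                  return Congestion.HIGH
              if (queue_len >= self.queue_low or miss_ratio >= 0.25
                      or retransmit_ratio >= 0.5):
                  return Congestion.LOW
              return Congestion.NONE

          def update(self, **signals):
              """Return the damped state: hold each state for at least
              DAMP_INTERVAL to avoid oscillation (see below)."""
              new = self.classify(**signals)
              now = time.monotonic()
              if (new != self.state
                      and now - self.last_change >= self.DAMP_INTERVAL):
                  self.state = new
                  self.last_change = now
              return self.state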
   It is useful for explicit congestion notification to take place when
   a router notes that its input work queues are very long, or that it
   is losing Hellos from another router.  Detecting outbound congestion
   does not, by itself, require notification, since we may assume the
   adjacent router will detect that congestion on its own.

   Three levels of congestion MUST be defined: high, low, and no
   congestion.  As multiple congestion states are specified, the
   congestion state may change very often, which could lead to
   oscillations between the high, low, and no congestion states.  In
   order to damp such oscillation, the congestion state SHOULD NOT be
   changed unless some configurable time interval has elapsed since the
   last state change.  For example, any router that goes into the high
   congestion state MAY continue reporting this state for at least some
   interval High_Congestion_Interval, so as to avoid flapping of a
   router between the high and low congestion states.  The
   implementation details of congestion detection and damping are left
   to the implementer.

   When a router detects congestion, it MUST send notifications of its
   state of congestion to its neighbors.  The order in which it sends
   the notifications MUST follow ADJACENCY_PRIORITY, which has two
   levels: high and low.  The congestion signaling MUST be done by
   sending either a Choke LSA or LLS signaling (described in the
   following paragraph) to the neighbor, indicating the level of
   congestion at the node.

   If LLS signaling is used for congestion notification, then two new
   bits, called HCS (high congestion signal) and LCS (low congestion
   signal), are introduced into the Extended Options TLV in the
   Link-Local Signaling (LLS) block [zinin4].  The values of these bits
   are TBD (temporarily, the values are 0x00000004 and 0x00000008 in
   Figure 1 below).

   +---+---+---+---+---+---+---+- -+---+---+---+---+---+---+---+---+
   | * | * | * | * | * | * | * |...| * | * | * | * |HCS|LCS| RS| LR|
   +---+---+---+---+---+---+---+- -+---+---+---+---+---+---+---+---+

                Figure 1.  Bits in Extended Options TLV

   OSPF routers should set the HCS-bit in the Extended Options TLV
   attached to a Hello packet when the router is in the high congestion
   state, and the LCS-bit when it is in the low congestion state.  Only
   one of these two bits may be set in a Hello at one time.  When there
   is no congestion, neither bit is set.  For a definition of the LR
   bit see [zinin3], and for the RS bit see [zinin1].  'No' congestion
   is signaled, after 'high' or 'low' congestion has been signaled,
   once congestion subsides.

4.2.2 Router Response When it Detects Congestion

   When a router detects congestion, it SHOULD perform the actions
   described in the following Sections.  In addition, mechanisms are
   proposed in Appendix A for dropping adjacencies under extreme
   congestion; these may be controversial, so they are placed in an
   Appendix for now.

4.2.2.1 Congestion Notification

   Send congestion notification to neighbors using Link-Local
   Signaling, as described in Section 4.2.1 and sketched below.
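   A minimal sketch of setting the notification bits of Section 4.2.1
   in the Extended Options value carried in the LLS block of an
   outgoing Hello.  The HCS/LCS values are the temporary (TBD) values
   given above; the LR and RS values follow the LLS drafts; the
   function name and the string encoding of the three levels are
   illustrative.

      # Bit values for the Extended Options TLV (Section 4.2.1).
      LLS_EO_LR  = 0x00000001   # LSDB resynchronization, see [zinin3]
      LLS_EO_RS  = 0x00000002   # restart signaling, see [zinin1]
      LLS_EO_HCS = 0x00000004   # high congestion signal (temporary, TBD)
      LLS_EO_LCS = 0x00000008   # low congestion signal (temporary, TBD)

      def extended_options_bits(congestion: str, base_bits: int = 0) -> int:
          """Set at most one of HCS/LCS for an outgoing Hello.
          'congestion' is one of "high", "low", or "none"."""
          bits = base_bits & ~(LLS_EO_HCS | LLS_EO_LCS)  # clear both first
          if congestion == "high":
              bits |= LLS_EO_HCS
          elif congestion == "low":
              bits |= LLS_EO_LCS
          # "none": neither bit set, signaling that congestion subsided
          return bits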
4.2.2.2 Throttle Rate of LSA Flooding

   The rate at which an OSPF implementation floods information SHOULD
   be progressively limited based on network conditions.  Progressively
   limiting flooding rates can reduce packet drops and LSA storms.  The
   mechanism used to progressively rate limit LSA flooding can vary
   between implementations, so no single implementation mechanism is
   set forward in this document as required.  However, a mechanism is
   discussed here to illustrate how such throttling might work.

   A configurable variable MAX_LSA_FLOOD MAY be defined for each
   interface; it is the maximum number of LSAs flooded by a router, at
   any time, for which it has not yet received acknowledgement.  That
   is, in doing the rate limiting, the router should not send more than
   a given maximum rate of packets per second and should allow for
   interpacket gaps.  This mechanism is similar to the
   Min_LSP_Transmission_Interval mechanism in IS-IS.  It can use a
   sliding window protocol for rate control, which is how it is already
   implemented in some routers.

   A router SHOULD also flood LSAs in the order of the priority of the
   adjacency.  Intra-area LSAs that affect the SPF calculation MAY be
   given priority over inter-area LSAs.  Inter-area LSAs MAY be given
   higher priority than external LSAs, which in turn may be given
   priority over LSAs which do not result in any topology changes.

   When a router detects congestion from a neighbor, it MAY
   progressively decrease the flooding rate through some mechanism.
   For example, the router can increase its LSA_RETRANSMIT_INTERVAL by
   doubling it for low congestion and quadrupling it for high
   congestion.  Also, the congested router SHOULD a) reduce its own
   rate of flooding, and b) increase the shortest path computation
   interval.

4.2.2.3 Increase Adjacency Break Interval

   Under high congestion, if no packet is received from a neighbor
   within the ROUTER_DEAD_INTERVAL, the router SHOULD NOT break the
   adjacency immediately, but MUST wait for a greater interval,
   ADJACENCY_BREAK_INTERVAL, before it declares the adjacent router
   down.  The HELLO_INTERVAL and ROUTER_DEAD_INTERVAL can be increased
   at most once at each router.  The mechanism for changing these
   intervals needs to avoid the problem that a mismatch in
   ROUTER_DEAD_INTERVAL (as sent in Hello messages) can cause
   adjacencies to break, since Hellos are not processed when received
   with a Hello interval different from the configured one.  If the
   router detects a mismatch, the HELLO_INTERVAL or
   ROUTER_DEAD_INTERVAL should be appropriately incremented or
   decremented.  This mechanism would protect broadcast links on which
   a router has many neighbors, where perhaps one neighbor is
   dominating the use of bandwidth on the link.

4.2.2.4 Reduce the Rate of SPF Computation

   Under high congestion, reduce the rate of SPF computation and do not
   do any incremental SPF computation.  This reduces CPU load by not
   overloading the processor with SPF calculations.  Also, during high
   congestion the SPF computation would generally be based on
   incomplete topology information anyway, so convergence time is not
   affected in this case.

4.2.3 Router Response if Congestion Notification Received

   When a router receives a congestion notification message, it SHOULD
   perform the following actions (a combined sketch follows this list):

   o  Increase the retransmit interval to the congested router,
   o  Throttle the rate of LSA flooding toward the congested router
      (see Section 4.2.2.2), and
   o  Increase the adjacency break interval to the congested router
      (see Section 4.2.2.3).
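   The following sketch combines these per-neighbor responses.  The
   doubling/quadrupling of the retransmit interval follows Section
   4.2.2.2; the window and adjacency-break multipliers and all default
   values are illustrative assumptions.

      class NeighborFloodControl:
          """Per-neighbor response to a received congestion
          notification (Section 4.2.3)."""

          def __init__(self, retransmit_interval=5.0, flood_window=100,
                       router_dead_interval=40.0):
              self.base_rxmt = retransmit_interval   # LSA_RETRANSMIT_INTERVAL, s
              self.base_window = flood_window        # unacked LSAs allowed
              self.base_dead = router_dead_interval  # ROUTER_DEAD_INTERVAL, s
              self.apply_notification("none")

          def apply_notification(self, level):
              """'level' is "high", "low", or "none", as signaled via
              the HCS/LCS bits of Section 4.2.1."""
              if level == "high":
                  self.rxmt = 4 * self.base_rxmt     # quadruple (4.2.2.2)
                  self.window = max(1, self.base_window // 4)
                  self.adjacency_break = 4 * self.base_dead  # assumed factor
              elif level == "low":
                  self.rxmt = 2 * self.base_rxmt     # double (4.2.2.2)
                  self.window = max(1, self.base_window // 2)
                  self.adjacency_break = 2 * self.base_dead  # assumed factor
              else:
                  self.rxmt = self.base_rxmt
                  self.window = self.base_window
                  self.adjacency_break = self.base_dead

          def may_flood(self, unacked_lsas):
              """Sliding-window check in the spirit of MAX_LSA_FLOOD
              (Section 4.2.2.2): flood only while the number of LSAs
              sent but not yet acknowledged is below the window."""
              return unacked_lsas < self.window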
4.2.4 Graceful Recovery from Loss of Topology Database Information -
      Database Backup & Resynchronization

   There is a need to gracefully recover from massive loss of topology
   database information.  This need is illustrated in Section 2, where
   we discuss failure experience in which all topology database
   information was lost in all routers in a large network, and the
   link-state protocol was unable to recover.

   To address this problem, OSPF database backup and LS database
   resynchronization SHOULD be applied.  In essence, the database needs
   to be recovered; it can be backed up on the local node, and the
   recovery process can use [zinin3] to resynchronize the database.

   OSPF LS database synchronization is achieved via two methods:

   o  initial LSDB synchronization, when a router is first connected to
      the network, and
   o  asynchronous flooding, which ensures continuous LSDB
      synchronization in the presence of topology changes after the
      initial procedure has completed.

   After two routers have established an adjacency (their neighbor FSMs
   have reached the Full state), the routers announce the adjacency
   states in their router-LSAs.  Asynchronous flooding ensures that the
   routers' LS databases stay in sync in the presence of topology
   changes.  It is sometimes necessary for OSPF routers to
   resynchronize their LS databases, such as under failure when
   database information is lost.  Here we identify LS database backup
   mechanisms, and review resynchronization procedures.

   After a node fails and comes up, it needs to recover the current
   state of the topology database.  The database of the failed router
   SHOULD be backed up before it fails, and this backup SHOULD be used
   to restore the database.  A router SHOULD back up all the
   non-self-originated LSAs and the states of its interfaces.
   Accordingly, when the network is recovering, a router SHOULD
   generate all its own LSAs: router, network, and summary LSAs.
   Do_Not_Age (DNA) LSAs need not be regenerated.  The router SHOULD
   then only receive the LSAs that have changed between the time it
   failed and the current time.  If the database backup is completely
   up to date, none of the LSAs would need to be updated.

   After a failure, a router SHOULD base its routing on the current
   database, derived as above.  Some of the database information will
   be somewhat out of sync, and routing will not be optimal.  However,
   if the database is relatively up to date, the routing will suffice
   until the database re-syncs completely.  What is saved is a
   considerable amount of CPU processing of LS requests and LS updates.

   The approach of Zinin [zinin3] is relevant here when the router
   experiencing a failure uses graceful restart.  As described in "OSPF
   Out-of-band LSDB resynchronization" [zinin3], when database
   information is lost, the OSPF standard does not allow routers to
   resynchronize without changing the topology view of the network.
   That is, if routers need to resynchronize their LS databases, they
   cannot do so without putting the neighbor FSMs into the ExStart
   state.  This effectively causes adjacencies to be removed from the
   router-LSAs, which is not acceptable in some cases.  With
   out-of-band LSDB (OOB LSDB) resynchronization, routers can
   re-establish adjacencies after a reload, for example from a database
   backup as described above.  To do this, the LR-bit described in
   Section 4.2.1 indicates OOB LSDB resynchronization capability.
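   To illustrate the flow just described, a minimal sketch of backup
   and restore.  The file path, the JSON format, and the helpers
   originate_own_lsas() and request_oob_resync() are hypothetical; a
   real implementation would originate its own router-, network-, and
   summary-LSAs and run the OOB resynchronization of [zinin3] at the
   indicated points.

      import json
      import pathlib

      BACKUP = pathlib.Path("/var/lib/ospf/lsdb-backup.json")  # assumed

      def originate_own_lsas(my_router_id):
          """Hypothetical helper: build this router's own router-,
          network-, and summary-LSAs from current interface state."""
          return []

      def request_oob_resync(lsdb):
          """Hypothetical helper: run OOB LSDB resynchronization
          [zinin3] so only LSAs changed since the failure are fetched."""

      def backup_database(lsdb, interface_states, my_router_id):
          """Periodically save non-self-originated LSAs and interface
          states (Section 4.2.4)."""
          snapshot = {
              "interfaces": interface_states,
              "lsas": [l for l in lsdb
                       if l["adv_router"] != my_router_id],
          }
          BACKUP.write_text(json.dumps(snapshot))

      def restore_database(my_router_id):
          """On restart: reload the backup, regenerate self-originated
          LSAs, and resynchronize only the deltas."""
          snapshot = json.loads(BACKUP.read_text())
          # Backed-up LSAs, including DNA LSAs, are reused as-is.
          lsdb = list(snapshot["lsas"])
          lsdb += originate_own_lsas(my_router_id)
          request_oob_resync(lsdb)   # avoids a full ExStart exchange
          return lsdb, snapshot["interfaces"]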
5. Related Solution Methods

   The following is a summary of methods implemented elsewhere that
   help control congestion and may be worth considering here.

5.1 Optimized Link State Routing (OLSR) Protocol [jacquet]

   A key concept in Optimized Link State Routing (OLSR) is the use of
   multipoint relays (MPRs).  MPRs are selected routers which forward
   broadcast messages during the flooding process.  As such, OLSR
   substantially reduces the message overhead as compared to a pure
   flooding mechanism.  OLSR is designed to work in a distributed
   manner and thus does not depend on any central entity.  Note that
   MPR behavior in dynamic topologies (when the connectivity graph
   changes) is still under consideration.

5.2 Fisheye State Routing Protocol (FSR) for Ad Hoc Networks [gerla1]

   With Fisheye State Routing (FSR), a router periodically broadcasts
   LS updates for a destination to its neighbors with a frequency that
   depends on the hop distance to that destination.  LS updates
   corresponding to far away destinations are propagated with lower
   frequency than those for close by destinations.  FSR reduces the
   overhead caused by routing table updates.

5.3 Landmark Routing Protocol (LANMAR) for Large Scale Ad Hoc Networks
    [gerla2]

   With the Landmark Routing Protocol (LANMAR), a 'landmark' is
   dynamically elected in each group.  Routing to the landmark is
   propagated throughout the network.  A router directs a packet to the
   landmark and then to the destination.  LANMAR reduces routing table
   size and routing update overhead, since it a) summarizes the routing
   information of remote groups, and b) uses a truncated local routing
   table.

5.4 Topology Broadcast Based on Reverse-Path Forwarding (TBRPF) [sri]

   Topology Broadcast Based on Reverse-Path Forwarding (TBRPF)
   introduces an efficient Hello mechanism in which

   o  the TBRPF Neighbor Discovery (TND) protocol allows each router to
      quickly detect the neighboring nodes with which it has a direct
      link, and
   o  TND uses "differential" Hello messages which report only
      *changes* in the status of neighbors.

   This results in Hello messages that are much smaller than those of
   other LS routing protocols, which include the IDs of *all*
   neighbors.

6. Security Considerations

   There are no new security considerations based on the proposals in
   this draft.

7. Acknowledgments

   The authors gratefully acknowledge the comments and suggestions from
   many people.  We are very appreciative of the most helpful and
   constructive comments from Alex Zinin, Russ White (Cisco), and Dave
   Katz (Juniper).  At AT&T we thank Margaret Chiosi, Enrique Cuevas,
   Tom Helstern, Steve Holmgren, Raghuram Aswatnarayan, and Mike
   Shapiro.  We also thank Shawn McAllister at Alcatel, Marty Albright
   at Worldcom, Sohel Khan at Sprint, Ghassem Koleyni at Nortel
   Networks, Thomas Cornely and Greg Wetzel at Covad, Bob Dianda at
   Sprint, and Jeff Parker at Axiowave.

8. References

   [af-pnni-0055.000] "Private Network-Network Interface Specification
        Version 1.0 (PNNI 1.0)," March 1996.

   [ash] Ash, G. R., "Dynamic Routing in Telecommunications Networks,"
        McGraw-Hill.

   [atmf00-0249] "Scalability and Reliability of large ATM networks."

   [atm00-0257] "Signaling Congestion Control in PNNI Networks: The
        Need and Proposed Solution Outline."

   [atm00-0480] "Congestion Control/Failure Recovery in PNNI Networks."

   [atm01-0101] "Proposed Mechanisms for Congestion Control/Failure
        Recovery in PNNI Networks."

   [att] "AT&T announces cause of frame-relay network outage," AT&T
        Press Release, April 22, 1998.

   [btd-cs-congestion-02.00] "Signaling Congestion Control Version
        1.0," baseline text.

   [cholewka] Cholewka, K., "MCI Outage Has Domino Effect," Inter@ctive
        Week, August 20, 1999.

   [choudhury] Choudhury, G., Maunder, A. S., Sapozhnikova, V., "Faster
        Link-State Convergence and Improving Network Scalability and
        Stability," submitted for presentation at LCN 2001.

   [gerla1] Gerla, M., et al., "Fisheye State Routing Protocol (FSR)
        for Ad Hoc Networks," work in progress.

   [gerla2] Gerla, M., et al., "Landmark Routing Protocol (LANMAR) for
        Large Scale Ad Hoc Networks," work in progress.

   [hosein1] Hosein, P., "An Improved ACC Algorithm for
        Telecommunication Networks," Telecommunication Systems 0, 1998.
   [hosein2] Hosein, P., "Overload Control for Real-Time
        Telecommunication Databases," International Teletraffic
        Congress - 16, Edinburgh, Scotland, June 1999.

   [jacquet] Jacquet, et al., "Optimized Link State Routing Protocol,"
        work in progress.

   [jander] Jander, M., "In Qwest Outage, ATM Takes Some Heat," Light
        Reading, April 6, 2001.

   [maunder] Maunder, A. S., Choudhury, G., "Explicit Marking and
        Prioritized Treatment of Specific IGP Packets for Faster IGP
        Convergence and Improved Network Scalability and Stability,"
        work in progress.

   [mpls-architecture] Rosen, E., Viswanathan, A., Callon, R.,
        "Multiprotocol Label Switching Architecture," RFC 3031, January
        2001.

   [mummert] Mummert, V. S., "Network Management and its Implementation
        on the No. 4ESS," International Switching Symposium, Japan,
        1976.

   [moy1] Moy, J., "Hitless OSPF Restart," work in progress.

   [moy2] Moy, J., "Flooding over parallel point-to-point links," work
        in progress.

   [moy3] Moy, J., "Flooding Over a Subset Topology," work in progress.

   [murphy] Murphy, P., "OSPF Floodgates," work in progress.

   [pappalardo1] Pappalardo, D., "Can one rogue switch buckle AT&T's
        network?," Network World Fusion, February 23, 2001.

   [pappalardo2] Pappalardo, D., "AT&T, customers grapple with ATM net
        outage," Network World, February 26, 2001.

   [pillayesnault] Pillay-Esnault, P., "OSPF Refresh and flooding
        reduction in stable topologies," work in progress.

   [Q.764] "Signalling System No. 7 - ISDN user part signalling
        procedures," December 1999.

   [refresh-guide] Zinin, A., "Guidelines for Efficient LSA Refreshment
        in OSPF," work in progress.

   [RFC2474] Nichols, K., et al., "Definition of the Differentiated
        Services Field (DS Field) in the IPv4 and IPv6 Headers," RFC
        2474.

   [RFC2475] Blake, S., et al., "An Architecture for Differentiated
        Services," RFC 2475.

   [sri] SRI International authors, "Topology Broadcast Based on
        Reverse-Path Forwarding (TBRPF)," work in progress.

   [whitehead] Whitehead, M., "A class of overload controls based on
        controlling call reject rates," ITU-T contribution D.19,
        February 2001.

   [zinin1] Zinin, A., et al., "OSPF Restart Signaling," work in
        progress.

   [zinin2] Zinin, A., et al., "Flooding optimizations in link-state
        routing protocols," work in progress.

   [zinin3] Zinin, A., et al., "OSPF Out-of-band LSDB
        resynchronization," work in progress.

   [zinin4] Zinin, A., et al., "OSPF Link-Local Signaling," work in
        progress.

9. Appendix

Appendix A - Dropping Link Adjacencies Under Extreme Congestion

   The following mechanisms are proposed for dropping adjacencies under
   extreme congestion; they may be controversial, so they are placed in
   an Appendix for now.  As discussed below, there is experience in
   using these mechanisms in practice, and also experimental
   verification that the mechanisms are effective.

   In case a router is experiencing extremely high congestion, it
   SHOULD drop adjacencies on all links with low priority.  It is
   natural to think that dropping adjacencies under congestion itself
   causes some flooding and so may not help.  However, this flooding is
   a one-time event, and over the longer term the reduced adjacency
   helps: with lower adjacency, the node has to do less work for any
   LSA generated anywhere in the network.  This has been shown by the
   analytic model discussed in [choudhury].  In addition to the
   analysis and simulation results, there is implementation experience
   with this approach within large carrier networks.  In fact, a
   procedure for reducing adjacencies was used to recover the network
   after the 2/20/01 failure discussed in Section 2, and is also
   invoked before each major upgrade.
   If a router at any time has more adjacencies between Exchange and
   Loading than MAX_ADJACENCY_BUILD_COUNT, the router SHOULD drop the
   adjacencies in Exchange or Loading with lower priority.  In that
   way, routers within a router-group experience a lower level of LSA
   flooding and Hello processing.  When congestion on a router
   subsides, it SHOULD then gradually re-establish the adjacencies on
   links that were dropped.  When an adjacency is dropped, the router
   SHOULD list the link as a stub link in its router-LSA, and if the
   router is a Designated Router on a broadcast link, it SHOULD remove
   the neighbor from its network-LSA.  The Hello messages should still
   show the router as a neighbor.  If the router in this case receives
   a 1-way Hello message, it SHOULD behave as per RFC2328.

   Also proposed is the concept of forming smaller 'router-groups'
   within an area in order to stabilize the network under extreme
   congestion, as a last resort to avoid network collapse.  After the
   smaller router-groups recover, the router-groups should be
   re-constituted into the original area.  Here again, analysis and
   simulation models have shown the benefit.  There is also a current
   plan within a large carrier network to manually segment the network
   into multiple smaller sub-networks and then reconnect them after
   stabilization has been achieved.

   Priorities MAY be set on interfaces so as to help form smaller
   router-groups, or clusters.  That is, links between routers in
   different groups SHOULD be given low priority and those within a
   group high priority.  Links from a router to other router-groups in
   an area SHOULD be given low priority.  A link within a router-group
   MAY be given high priority if only one link exists between the
   routers.  With parallel point-to-point links, one link MAY be given
   higher priority than the others.  In this way, the adjacency
   limiting mechanism is confined to router-groups, and routers within
   a router-group experience a lower level of LSA flooding and Hello
   processing, thus helping to avoid the spread of congestion.

   Under high congestion, adjacencies form and then break again, which
   leads to LSA storms such that eventually none of the adjacencies
   form at all.  In this state, we instead drop adjacencies, let only a
   subset of the adjacencies form, and incrementally bring up further
   adjacencies, and in that way make the domain stable.  So dropping
   adjacencies may cause additional LSA flooding temporarily, but since
   routers now only have to flood to a subset of the topology, the
   network stabilizes sooner.

   Implementations should take care to damp link flaps, and provide
   interface and adjacency state dampening.  If an interface or a given
   adjacency keeps flapping, it should be suppressed for a period of
   time, and the worse the instability, the longer the hold-down period
   should be (a sketch of such dampening follows).
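   A minimal sketch of such adjacency dampening, in which each
   successive flap lengthens the hold-down period.  The penalty values
   and the exponential growth rule are illustrative assumptions; the
   draft only requires that worse instability yield a longer hold-down.

      import time

      class AdjacencyDamper:
          """Hold down a flapping interface or adjacency; the penalty
          constants below are assumptions, not from the draft."""

          BASE_HOLD = 30.0    # seconds of hold-down after first flap
          MAX_HOLD = 3600.0   # cap on the hold-down period
          FORGIVE = 600.0     # flaps older than this no longer count

          def __init__(self):
              self.flaps = []              # timestamps of recent flaps
              self.suppressed_until = 0.0

          def record_flap(self):
              now = time.monotonic()
              # Forget flaps that are old enough to be forgiven.
              self.flaps = [t for t in self.flaps
                            if now - t < self.FORGIVE]
              self.flaps.append(now)
              # The worse the instability, the longer the hold-down.
              hold = min(self.MAX_HOLD,
                         self.BASE_HOLD * 2 ** (len(self.flaps) - 1))
              self.suppressed_until = now + hold

          def may_reestablish(self):
              """True once the hold-down period has expired."""
              return time.monotonic() >= self.suppressed_until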
Appendix B - Vendor Analysis of LS Protocol Performance

   Various vendors and service providers were asked to analyze how
   their product's LS protocol recovers from a total network failure,
   that is, loss of all database information, in a specified scenario.
   The specification of the failure presented to the vendors made the
   following assumptions:

   1. A 400-node network is considered:
      a) 100 backbone nodes
      b) 3 edge nodes per backbone node (edge nodes single homed)
      c) each backbone node is connected to no more than 10 nodes
         (maximum node adjacency is 13, which defines a sparse network)
   2. There are 101 areas:
      a) 1 backbone area with 100 backbone nodes
      b) 100 edge areas, each with 3 nodes, all homed on the backbone
         peer group
   3. 1,000,000 addresses are advertised in the network.
   4. The maximum node adjacency is 13:
      a) a sparse network

   Assumptions on packet sizes and processing times varied according to
   the product capabilities associated with the individual estimates
   given below.  The projected recovery times from three
   vendor/service-provider analyses of the above scenario are as
   follows:

      Recovery Time Estimate A - 3.5 hours
      Recovery Time Estimate B - 5-15 minutes
      Recovery Time Estimate C - 5.5 hours

   Clearly there is large variability in the estimates; however, the
   expectation is that vendor equipment recovery is not adequate under
   a large failure scenario such as the one analyzed.

Appendix C - Analytic Modeling & Simulation Analysis of LS Protocol
             Performance

   Some analysis models have been published which reveal problematic
   performance of LS protocols under various failure conditions
   [maunder, atmf00-0249].  There are various simulation capabilities
   [NS, wandl, opnet, scalable-networks, makesystems] for analyzing LS
   protocol behavior under failure.  We report below an event
   simulation study [choudhury] of overloads in large-scale networks
   and the corresponding recovery times.  Emulation of large failure
   performance could also be obtained through laboratory testing of
   actual protocol performance of vendor products under failure
   conditions; however, to date no such emulation studies of failures
   in large scale networks and the corresponding recovery times have
   been published.  The results from such simulation and emulation
   studies are specific to the type of equipment and network
   configuration modeled, and therefore are not completely general.
   However, the results to date lend support for the existence of
   problems with LS protocols in the face of congestion and network
   failures.

   A network-wide event simulation model is reported in [choudhury] to
   study the impact of a TSU (topology state update) storm.  It
   captures the actual congestion seen at various nodes and accounts
   for propagation delay between nodes, retransmission in case a TSU is
   not acknowledged, failure of links when TSUs are delayed beyond the
   node-dead interval, and link recovery following database
   synchronization and TSU flooding once the TSU is processed.  It
   approximates a real network implementation and uses processing times
   that are roughly of the same order of magnitude as those measured in
   the real network (on the order of milliseconds).

   There are two categories of IGP messages processed at each node in
   the simulation.  Category 1 messages are triggered by a timer and
   include Hello refresh, TSU refresh, and retransmission packets.
   Category 2 messages are not triggered by a timer and include
   received Hellos, received TSUs, and received acknowledgements.
   Timer-triggered messages are given non-preemptive priority over the
   other type and are queued in the timer queue.  As a result, the
   received Hello packets and the received acknowledgement packets may
   see long queuing delays under intense CPU overload.

   Table 1 below shows sample results of the simulation study when
   applied to a network with about 300 nodes and 800 links.  The node
   adjacency varies from node to node, and the maximum node adjacency
   is 30.
   The Hello interval is assumed to be 5 seconds, the minimum interval
   between successive SPF (Shortest-Path-First) calculations is 1
   second, and the node-dead interval ("router-dead interval") is 15
   seconds, i.e., a link is declared down if no Hello packet is
   received for three successive Hello intervals.  During the study, a
   TSU storm of size X is created at time 100 seconds, where the storm
   size is defined as the number of TSUs generated during the storm.
   Three cases are considered, with X = 300, 600, and 900 respectively.
   Besides the storm, there are also the normal once-in-thirty-minutes
   TSU refreshes.  At any given point of time we define a quantity
   "dispersion" as the number of control packets already generated in
   the network but not yet received and processed in at least one node
   (each control packet is assumed to carry three TSUs).  Table 1 shows
   dispersion as a function of time and thereby illustrates the impact
   of a TSU storm on network stability.

   Before the TSU storm, the dispersion due to normal TSU refreshes
   remains small.  We expect the dispersion to jump to a high value
   right after the storm and then come down to the pre-storm level
   after some period of time (this happens with X = 300 and X = 600,
   but not with X = 900).  In Table 1, with a TSU storm of size 300,
   the "heavy dispersion period" lasted about 11 seconds and no link
   losses were observed.  With a TSU storm of size 600, the "heavy
   dispersion period" lasted about 40 seconds.  Some link losses were
   observed a little after 15 seconds within the "heavy dispersion
   period", but eventually all links recovered and the dispersion came
   down to the pre-storm level.  With a TSU storm of size 900, the
   "heavy dispersion period" lasted throughout the simulation period
   (6 minutes).

   ======|==============================================================
         |      Table 1: DISPERSION as a FUNCTION of TIME (in sec)
         |                 for different TSU Storm Sizes
   STORM |==============================================================
   SIZE  | 100s   106s   110s   115s   140s   170s   230s   330s   370s
   ======|==============================================================
    300  |    0     39      3      1      0      1      0      0      0
   ------|--------------------------------------------------------------
    600  |    0    133    120    100     12      1      0      0      0
   ------|--------------------------------------------------------------
    900  |    0    230    215    196    101    119    224    428    488
   ======|==============================================================

   The generic observations are as follows:

   *  If the initial TSU storm size (e.g., X = 300) is such that the
      delays experienced by Hello packets are not big enough to cause
      any link failures anywhere in the network, the network remains
      stable and quickly gets back to a period of "low dispersion".
      LSA storms of this type are observed quite frequently in
      operational networks, and the network easily recovers from them.

   *  If the initial TSU storm size (e.g., X = 600) is such that the
      delays experienced by a few Hello packets in a few nodes cause
      link failures, then some secondary TSU storms are generated.
      However, the secondary storms do not keep growing indefinitely,
      and the network remains stable and eventually gets back to a
      period of "low dispersion".  An LSA storm of this type was
      observed in an operational network, triggered by a network
      upgrade, from which the network recovered but with some
      difficulty.

   *  If the initial TSU storm size (e.g., X = 900) is such that the
      delays experienced by many Hello packets in many nodes cause link
      failures, then a wave of secondary TSU storms is generated.
      The network enters an unstable state and the secondary storms are
      sustained indefinitely, or for a very long period of time.  An
      LSA storm of this type was observed in an operational network,
      triggered by a network failure (the 2/20/01 AT&T ATM network
      failure discussed in Section 2), from which the network did not
      recover without corrective steps (manual procedures were used to
      reduce TSU flooding and stabilize the network).

   The results show that there is a TSU storm threshold above which the
   network shows unstable behavior.  It is desirable that TSU flooding
   reduction be automatic under overload, and the model could be used
   to assess the benefits of various control mechanisms, such as those
   discussed in this draft.

10. Authors' Addresses

   Jerry Ash
   AT&T, Room MT D5-2A01
   200 Laurel Avenue
   Middletown, NJ 07748, USA
   Phone: +1 732 420-4578
   Fax:   +1 732 368-8659
   E-mail: gash@att.com

   Gagan L. Choudhury
   AT&T, Room D5-3C21
   200 S. Laurel Avenue
   Middletown, NJ 07748, USA
   Phone: +1 732 420-3721
   Fax:   +1 732 368-1919
   E-mail: gchoudhury@att.com

   Vishwas Manral
   NetPlane Systems
   189 Prashasan Nagar, Road number 72
   Jubilee Hills
   Hyderabad, India
   E-mail: vishwasm@netplane.com

   Anurag S. Maunder
   Sanera Systems
   370 San Aleso Avenue
   Sunnyvale, CA 94085, USA
   Phone: +1 408 734-6123
   Fax:   +1 408 400-0151
   E-mail: amaunder@sanera.net

   Vera D. Sapozhnikova
   AT&T, Room C5-2C29
   200 S. Laurel Avenue
   Middletown, NJ 07748, USA
   Phone: +1 732 420-2653
   Fax:   +1 732 368-1774
   E-mail: sapozhnikova@att.com

   Mostafa Hashem Sherif
   AT&T, Room C5-2D18
   200 S. Laurel Avenue
   Middletown, NJ 07748, USA
   Phone: +1 732 420-2448
   Fax:   +1 732 371-7234
   E-mail: mhs@att.com

11. Copyright Statement

   Copyright (C) The Internet Society (1998).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.