Network Working Group                                          Jerry Ash
Internet Draft                                        Gagan L. Choudhury

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   Based on overload and failure experience with link-state protocols,
   this draft identifies means to enable link-state protocols to:

   o  avoid getting into congestion states wherever possible
   o  respond gracefully to network overloads and failures
   o  gracefully recover from massive loss of topology database
      information

   The proposed mechanisms include the following:

   o  throttle setups of link adjacencies
   o  reduce the rate of flooding of link-state advertisements (LSAs)
   o  prioritized treatment of Hello & LSA Ack packets
   o  database backup & resynchronization for graceful recovery from
      loss of topology database information

Table of Contents

   1. Introduction & Problem Statement
   2. Failure Experience & Analysis
   3. Proposed Solution Methods
   4. OSPF Congestion Avoidance & Control
      4.1 OSPF Congestion Avoidance Mechanisms
          4.1.1 Throttle Setups of Link Adjacencies
          4.1.2 DSCP Marking of Hello & LSA Ack Packets
          4.1.3 Use of Other Control Packets to Detect Liveness of
                Adjacency
      4.2 OSPF Congestion Control Mechanisms
          4.2.1 Detecting Congestion & Notifying Neighbors
          4.2.2 Router Response When it Detects Congestion
                4.2.2.1 Congestion Notification
                4.2.2.2 Throttle Rate of LSA Flooding
                4.2.2.3 Increase Adjacency Break Interval
                4.2.2.4 Reduce the Rate of SPF Computation
          4.2.3 Router Response if Congestion Notification Received
          4.2.4 Graceful Recovery from Loss of Topology Database
                Information - Database Backup & Resynchronization
   5. Related Solution Methods
      5.1 Optimized Link State Routing (OLSR) Protocol
      5.2 Fisheye State Routing Protocol (FSR) for Ad Hoc Networks
      5.3 Landmark Routing Protocol (LANMAR) for Large Scale Ad Hoc
          Networks
      5.4 Topology Broadcast Based on Reverse-Path Forwarding (TBRPF)
   6. Security Considerations
   7. Acknowledgments
   8. References
   9. Appendix
      Appendix A - Dropping Link Adjacencies Under Extreme Congestion
      Appendix B - Vendor Analysis of LS Protocol Performance
      Appendix C - Analytic Modeling & Simulation Analysis of LS
                   Protocol Performance
   10. Authors' Addresses
   11. Copyright Statement

1. Introduction & Problem Statement

   Congestion in control protocols has been a problem in data networks
   many times in the past; [att, pappalardo1, pappalardo2] are good
   examples.  It can arise in data networks for many different reasons.
   There is evidence based on previous failures that link-state
   protocols such as OSPF cannot recover from large failures that
   result in widespread loss of topology database information.  The
   problem is aggravated when the number of routers in an area is
   large, which can then lead to an overload in the flooding of
   topology database information.
   Topology information in OSPF is distributed in Link State
   Advertisements (LSAs), which are encapsulated in link state update
   packets and periodically flooded to other routers in the domain
   through all available links.  In some instances of network overload,
   failure, and/or congestion, the redundancy of the flooding mechanism
   can overwhelm network processors and bring the network down.
   Earlier papers and contributions identified issues of congestion
   control and failure recovery for link-state networks such as OSPF
   [maunder, choudhury, pappalardo1, pappalardo2, atm01-0101].

   The problem addressed in this draft is to enable OSPF to

   o  avoid getting into congestion states wherever possible,
   o  respond gracefully to network overloads and failures, and
   o  gracefully recover from massive loss of topology database
      information.

   The need for graceful recovery mechanisms is illustrated in Section
   2, where we discuss failure experience in which all topology
   database information was lost in all routers in a large network,
   and the link-state protocol was unable to recover.

   If a router floods control packets to its neighbors and has no
   congestion avoidance and control mechanisms, it can overwhelm its
   neighbors and cause congestion.  To prevent this we need mechanisms
   both to avoid going into congestion and to recover when congestion
   occurs.  Congestion avoidance prevents the OSPF domain from entering
   a state of congestive collapse with control traffic.  Congestive
   collapse is a situation where, although the network links are
   heavily utilized with an overload of control messages, very little
   useful work is being done because the routers are overwhelmed with
   non-useful work.  Congestion control and avoidance consists of
   automatic mechanisms to reduce control traffic to avoid congestion,
   together with actions taken if congestion arises.  We define both in
   this draft and propose specific additional considerations for
   network congestion control/failure recovery for the OSPF protocol,
   which include:

   o  throttle setups of link adjacencies
   o  reduce the rate of flooding of link-state advertisements (LSAs)
   o  prioritized treatment of Hello & LSA Ack packets
   o  database backup & resynchronization for graceful recovery from
      loss of topology database information

   We also define a standard way to detect and notify congestion states
   between routers.

   Section 2 contains a discussion of data network outages in which
   these problems have manifested themselves.  As a result of these
   failures, a number of congestion control/failure recovery mechanisms
   were recommended, and these are reviewed in Section 3.  In Section
   4, we discuss how these candidate mechanisms for control of network
   congestion and failure recovery can be used.  Section 5 discusses
   related solution methods which have been implemented elsewhere to
   control congestion and may be worth considering for OSPF.

2. Failure Experience & Analysis

   AT&T has experienced serious data network outages in which recovery
   of the underlying LS protocols was inadequate.  For example, in the
   failure of the AT&T Frame Relay Network on April 13, 1998 [att], an
   initial procedural error triggered two undetected software bugs,
   leading to a huge overload of control messages in the network.  The
   result of this control overload was the loss of all topology
   information, which the LS protocol then attempted to recover using
   the usual Hello and LS update mechanisms.  However, the LS protocol
   was overwhelmed and unable to recover, and manual means had to be
   used to restart the network after a long outage.
   Analysis showed that several problems then combined to prevent the
   network from recovering properly:

   o  A very large number of LS updates was sent to every router for
      processing, causing general processor overload.
   o  Route computation was based on incomplete topology recovery,
      causing routes to be generated from transient, asynchronous
      topology information and then to need frequent re-computation.
   o  Work queue management was inadequate to allow processes to
      complete before more work was put into the process queue.
   o  The network could not be segmented (into smaller "areas") to aid
      the LS protocol recovery.
   o  The node processors could not be accessed with network management
      commands because these messages lacked the necessary priority.

   A more recent failure occurred on February 20, 2001 in the AT&T ATM
   Network; it resulted in a large overload of LS updates and a lengthy
   network outage [att, pappalardo1, pappalardo2].  Manual procedures
   were put in place to reduce LS update flooding, which worked to
   stabilize the network.  It is desirable that such flooding reduction
   be automatic under overload.  In general, there have been a number
   of major outages reported by most major carriers, and routing
   protocol issues have generally been involved.  Other relevant
   LS-network failures are reported in [cholewka, jander].

   Various networks employing LS protocols use various control messages
   and mechanisms to update the LS database, not necessarily LSAs,
   PTSEs, or flooding mechanisms.  Based on experience, however, LS
   protocols including OSPF are found to be vulnerable to loss of
   topology information, such as occurred in the scenario described
   above [att], to control overload when re-syncing databases, and to
   other failure/overload scenarios, all of which make such networks
   more vulnerable in the absence of adequate protection mechanisms.
   Hence we are addressing a generic problem of LS protocols across a
   variety of implementations, and the basic problem is prevalent in LS
   protocol networks employing frame-relay, ATM, and IP based
   technologies.  However, the solutions given here are for OSPF
   networks alone.

3. Proposed Solution Methods

   We propose enhancements and methods to prevent congestion arising
   out of any stimulus.  There are general congestion control/failure
   recovery methods that have been applied in many networks, and these
   are considered for enhancement to the OSPF protocol.  We propose
   enhancing OSPF using commonly known congestion control and recovery
   mechanisms.  Mechanisms of this type have been implemented in other
   protocols; for example, EIGRP (Enhanced Interior Gateway Routing
   Protocol) uses a backoff mechanism based on round-trip timeout.
   Handling congestion in a more intelligent manner would also allow
   OSPF areas to grow to larger scales: the flooding diameter as a
   limiting factor could be greatly reduced, because congestion
   reaction mechanisms within the protocol allow flooding to be paced
   to levels on par with the capabilities of each adjacent pair of
   routers in the area.

   The following mechanisms are proposed in this draft for congestion
   control in OSPF:

   o  throttle setups of link adjacencies, and reduce the rate of
      flooding of LSAs.  This is discussed in greater detail in this
      document.
   o  database backup, in which a topology database can be
      automatically recovered from loss based on local backup
      mechanisms.  This is discussed further in this document.
   o  special marking of critical control messages (e.g., Hello and
      LSA Ack packets) so that they may receive prioritized processing
      [maunder].
      This is dealt with in detail in [maunder]; with packets
      prioritized as described there, the chances of adjacencies
      breaking are lowered.  Failure to process Hellos, which can lead
      to adjacencies breaking down, and failure to process
      acknowledgements, which leads to retransmissions, should also be
      avoided.

   The following mechanisms also help in congestion control for OSPF
   (a sketch of the rate limiting mentioned in the last item follows
   this list):

   o  hitless restart, which allows routes to continue to be used if
      there is an uninterrupted data path, even if the control path is
      interrupted due to a failure [moy1, zinin1].  This is generally
      helpful in stable topologies; hitless restart prevents other
      routers in the domain from changing paths to destinations through
      the restarting router, and hence avoids the transient load caused
      by flooding of changed LSAs and by SPF calculations.
   o  give necessary priority to network management control messages so
      they are not locked out during times of failure.  This is because
      certain configurable parameters need to be updated to prevent
      congestion, and giving less priority to such messages can lead to
      further congestion.
   o  use flooding optimizations to prevent unnecessary flooding in
      densely meshed topologies.  Such mechanisms are specified in
      [moy2, zinin2], which prevent unnecessary flooding over parallel
      paths.  Further work for highly meshed topologies is planned to
      be taken up by the OSPF WG.
   o  mechanisms to automatically segment a large area into smaller
      areas, allow these segments to recover their topology and
      routing, then combine the smaller areas into the original large
      area and allow that to recover.  Such mechanisms are implemented
      in operational networks, for example through network management
      scripts that segment the network under severe overload.
      Currently these procedures are triggered manually; it would be
      desirable to automate such mechanisms within OSPF (see Appendix A
      for further discussion).
   o  provide bandwidth allocation mechanisms to limit the amount of
      link bandwidth allocated to control traffic versus data traffic.
      This is essentially a rate limiting mechanism, which can be and
      is used in implementations.  For example, this capability exists
      in PNNI: RCCs are provisioned with traffic parameters (default
      parameters are specified in [af-pnni-0055.000]) and traffic is
      shaped into the RCC VC at the configured traffic rate.
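   To illustrate the last item, the following is a minimal sketch of
   token-bucket rate limiting of control traffic.  The class name,
   parameters, and default values are illustrative assumptions, not
   taken from any implementation; PNNI RCC shaping, for example, uses
   provisioned traffic parameters instead.

      import time

      class TokenBucket:
          """Rate-limit control traffic to a configured share of link
          bandwidth.  'rate_bps' and 'burst_bytes' are illustrative
          knobs for the bandwidth allocated to control traffic."""

          def __init__(self, rate_bps: float, burst_bytes: int):
              self.rate = rate_bps / 8.0    # token refill rate, bytes/sec
              self.capacity = burst_bytes   # maximum bucket depth, bytes
              self.tokens = float(burst_bytes)
              self.last = time.monotonic()

          def allow(self, packet_len: int) -> bool:
              """Return True if a control packet may be sent now."""
              now = time.monotonic()
              # Refill tokens for the elapsed time, capped at capacity.
              self.tokens = min(self.capacity,
                                self.tokens + (now - self.last) * self.rate)
              self.last = now
              if self.tokens >= packet_len:
                  self.tokens -= packet_len
                  return True
              return False   # caller queues or delays the packet

   A sender would call allow() before transmitting each control packet,
   queuing or delaying packets whenever it returns False.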
4. OSPF Congestion Avoidance & Control

   Congestion avoidance and control mechanisms are described in this
   Section.  While we suggest example implementation methods, the exact
   implementation details are intentionally left optional and up to the
   implementer.  It is intended, however, that all of the methods
   suggested in the draft be implemented together rather than any
   subset of them.

4.1 OSPF Congestion Avoidance Mechanisms

4.1.1 Throttle Setups of Link Adjacencies

   When an OSPF process in a router goes down and comes up again (or
   loses a large number of adjacencies for any other reason), it will
   start forming adjacencies with other routers.  However, as the
   router reforms these lost adjacencies, it may start breaking already
   existing adjacencies due to processor overload or network
   congestion.  For example, some implementations may experience memory
   starvation when a large number of adjacencies are being established
   because, per the OSPF specification, the router is supposed to
   construct the LSDB summary list containing the headers of all LSAs
   for each adjacency when the adjacency transitions from ExStart to
   Exchange.  Building this LSDB summary list may cause OSPF to use
   more memory than it uses under normal operation, so more adjacencies
   can be formed and held over time than can be formed at once during a
   restart condition.  This type of problem can be prevented by
   prioritizing adjacencies, as further explained below.

   A count, MAX_ADJACENCY_BUILD_COUNT, of the maximum number of
   adjacencies that a router can bring up at any time SHOULD be kept in
   the OSPF main structure.  After the number of adjacencies in states
   between Exchange and Loading exceeds MAX_ADJACENCY_BUILD_COUNT, new
   adjacencies SHOULD NOT be brought up.  The methods which MAY be used
   to achieve this limiting of adjacencies include:

   o  Not sending Hellos on particular interfaces, which would bring
      the neighbors to the Down state.
   o  Sending Hellos, but not listing the neighbors, if the neighbors
      are in a state below Two-Way (note that not listing the neighbor
      causes the adjacency to go below Two-Way, to One-Way).
   o  Not letting the adjacency progress past the Two-Way state, by not
      sending any Database Description packets and ignoring any
      Database Description packet received.  The adjacency can also be
      parked in ExStart.

   Such a MAX_ADJACENCY_BUILD_COUNT mechanism has already been
   implemented in some routers.

4.1.2 DSCP Marking of Hello & LSA Ack Packets

   A special DSCP [RFC2474] marking MUST be given to Hello and LSA Ack
   packets.  These are critical control messages that must receive
   prioritized treatment.  Failure to process Hellos, which can lead to
   adjacencies breaking down and to inconsistent designated router
   elections, and failure to process acknowledgements, which leads to
   retransmissions, MUST be avoided.  By marking these packets and
   giving them priority treatment in their processing, the chances of
   adjacencies breaking are lowered.  The related goal is to achieve
   proper prioritization within routers' queues, both transmit and
   receive queues.  Note, however, that this interacts with
   cryptographic authentication, which has no tolerance for the packet
   reordering that priority queuing can introduce.  Maintaining
   separate sequence number counters for different packet types on the
   receiving side may solve this issue.

4.1.3 Use of Other Control Packets to Detect Liveness of Adjacency

   At present, if the control link between two routers is congested,
   for example due to a high volume of LSAs, it is possible for several
   consecutive Hellos to get dropped.  In that case, if the Inactivity
   Timer expires (i.e., the ROUTER_DEAD_INTERVAL is exceeded), the
   neighbor is assumed to be down, even though it is clearly up because
   LSAs are still being received from it.  To prevent this, a router
   SHOULD treat the receipt of any OSPF packet from a neighbor as an
   indication that the neighbor is up, so on receipt of any OSPF packet
   (not just a Hello) the Inactivity Timer is reset.  This behavior has
   been implemented in some routers.

4.2 OSPF Congestion Control Mechanisms

4.2.1 Detecting Congestion & Notifying Neighbors

   Router congestion can be detected at a router in various ways.  For
   example, congestion can be inferred from the following signals (a
   sketch combining them appears after this list):

   o  The length of internal work queues.
   o  Missing some percentage of Hellos; for instance, missing 15 of
      the last 20 Hellos might indicate that a router is regularly
      missing OSPF packets on a given link.
   o  Frequent retransmissions being required.
   o  Explicit indication of congestion from an adjacent router.
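   The following is a minimal sketch of how these signals might be
   combined into the three congestion levels used in this Section,
   together with the state-change damping discussed below.  All
   thresholds and the damping interval are illustrative assumptions;
   the draft leaves detection and damping details to the implementer.

      import time
      from enum import Enum

      class Congestion(Enum):
          NONE = 0
          LOW = 1
          HIGH = 2

      class CongestionDetector:
          """Combine the signals listed above into a damped congestion
          state.  Thresholds and DAMP_INTERVAL are assumptions."""

          DAMP_INTERVAL = 30.0   # min seconds between state changes

          def __init__(self, queue_low=500, queue_high=2000):
              self.queue_low = queue_low     # work-queue thresholds
              self.queue_high = queue_high
              self.state = Congestion.NONE
              self.last_change = time.monotonic()

          def classify(self, queue_len, hellos_missed, hellos_expected,
                       retransmit_ratio, neighbor_signaled_high):
              miss_ratio = hellos_missed / max(1, hellos_expected)
              # e.g., missing 15 of the last 20 Hellos gives ratio 0.75
              if (queue_len >= self.queue_high or miss_ratio >= 0.75
                      or neighbor_signaled_high):
                  return Congestion.HIGH
              if (queue_len >= self.queue_low or miss_ratio >= 0.25
                      or retransmit_ratio >= 0.5):
                  return Congestion.LOW
              return Congestion.NONE

          def update(self, **signals):
              """Return the damped state: hold each state for at least
              DAMP_INTERVAL to avoid oscillation (see below)."""
              new = self.classify(**signals)
              now = time.monotonic()
              if (new != self.state
                      and now - self.last_change >= self.DAMP_INTERVAL):
                  self.state = new
                  self.last_change = now
              return self.state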
   It is useful for explicit congestion notification to take place when
   a router notes that its input work queues are very long, or that it
   is losing Hellos from another router.  Detecting outbound congestion
   does not, by itself, require notification, since we may assume the
   adjacent router will detect that congestion on its own.

   Three levels of congestion MUST be defined: high, low, and no
   congestion.  As multiple congestion states are specified, the
   congestion state may change very often, which could lead to
   oscillations between the high, low, and no congestion states.  In
   order to damp such oscillation, the congestion state SHOULD NOT be
   changed unless some configurable time interval has elapsed since the
   last state change.  For example, any router that goes into the high
   congestion state MAY continue reporting this state for at least some
   interval High_Congestion_Interval, so as to avoid flapping of a
   router between the high and low congestion states.  The
   implementation details of congestion detection and damping are left
   to the implementer.

   When a router detects congestion, it MUST send notifications of its
   state of congestion to its neighbors.  The order in which it sends
   the notifications MUST follow ADJACENCY_PRIORITY, which has two
   levels: high and low.  The congestion signaling MUST be done by
   sending either a Choke LSA or LLS signaling (described in the
   following paragraph) to the neighbor, indicating the level of
   congestion at the node.

   If LLS signaling is used for congestion notification, then two new
   bits, called HCS (high congestion signal) and LCS (low congestion
   signal), are introduced into the Extended Options TLV in the
   Link-Local Signaling (LLS) block [zinin4].  The values of these bits
   are TBD (temporarily, the values are 0x00000004 and 0x00000008 in
   Figure 1 below).

   +---+---+---+---+---+---+---+- -+---+---+---+---+---+---+---+---+
   | * | * | * | * | * | * | * |...| * | * | * | * |HCS|LCS| RS| LR|
   +---+---+---+---+---+---+---+- -+---+---+---+---+---+---+---+---+

                Figure 1.  Bits in Extended Options TLV

   OSPF routers should set the HCS-bit in the Extended Options TLV
   attached to a Hello packet when the router is in the high congestion
   state, and the LCS-bit when it is in the low congestion state.  Only
   one of these two bits may be set in a Hello at one time.  When there
   is no congestion, neither bit is set.  For a definition of the LR
   bit see [zinin3], and for the RS bit see [zinin1].  'No' congestion
   is signaled, after 'high' or 'low' congestion has been signaled,
   once congestion subsides.

4.2.2 Router Response When it Detects Congestion

   When a router detects congestion, it SHOULD perform the actions
   described in the following Sections.  In addition, mechanisms are
   proposed in Appendix A for dropping adjacencies under extreme
   congestion; these may be controversial, so they are placed in an
   Appendix for now.

4.2.2.1 Congestion Notification

   Send congestion notification to neighbors using Link-Local
   Signaling, as described in Section 4.2.1 and sketched below.
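   A minimal sketch of setting the notification bits of Section 4.2.1
   in the Extended Options value carried in the LLS block of an
   outgoing Hello.  The HCS/LCS values are the temporary (TBD) values
   given above; the LR and RS values follow the LLS drafts; the
   function name and the string encoding of the three levels are
   illustrative.

      # Bit values for the Extended Options TLV (Section 4.2.1).
      LLS_EO_LR  = 0x00000001   # LSDB resynchronization, see [zinin3]
      LLS_EO_RS  = 0x00000002   # restart signaling, see [zinin1]
      LLS_EO_HCS = 0x00000004   # high congestion signal (temporary, TBD)
      LLS_EO_LCS = 0x00000008   # low congestion signal (temporary, TBD)

      def extended_options_bits(congestion: str, base_bits: int = 0) -> int:
          """Set at most one of HCS/LCS for an outgoing Hello.
          'congestion' is one of "high", "low", or "none"."""
          bits = base_bits & ~(LLS_EO_HCS | LLS_EO_LCS)  # clear both first
          if congestion == "high":
              bits |= LLS_EO_HCS
          elif congestion == "low":
              bits |= LLS_EO_LCS
          # "none": neither bit set, signaling that congestion subsided
          return bits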
4.2.2.2 Throttle Rate of LSA Flooding

   The rate at which an OSPF implementation floods information SHOULD
   be progressively limited based on network conditions.  Progressively
   limiting flooding rates can reduce packet drops and LSA storms.  The
   mechanism used to progressively rate limit LSA flooding can vary
   between implementations, so no single implementation mechanism is
   set forward in this document as required.  However, a mechanism is
   discussed here to illustrate how such throttling might work.

   A configurable variable MAX_LSA_FLOOD MAY be defined for each
   interface; it is the maximum number of LSAs flooded by a router, at
   any time, for which it has not yet received acknowledgement.  That
   is, in doing the rate limiting, the router should not send more than
   a given maximum rate of packets per second and should allow for
   interpacket gaps.  This mechanism is similar to the
   Min_LSP_Transmission_Interval mechanism in IS-IS.  It can use a
   sliding window protocol for rate control, which is how it is already
   implemented in some routers.

   A router SHOULD also flood LSAs in the order of the priority of the
   adjacency.  Intra-area LSAs that affect the SPF calculation MAY be
   given priority over inter-area LSAs.  Inter-area LSAs MAY be given
   higher priority than external LSAs, which in turn may be given
   priority over LSAs which do not result in any topology changes.

   When a router detects congestion from a neighbor, it MAY
   progressively decrease the flooding rate through some mechanism.
   For example, the router can increase its LSA_RETRANSMIT_INTERVAL by
   doubling it for low congestion and quadrupling it for high
   congestion.  Also, the congested router SHOULD a) reduce its own
   rate of flooding, and b) increase the shortest path computation
   interval.

4.2.2.3 Increase Adjacency Break Interval

   Under high congestion, if no packet is received from a neighbor
   within the ROUTER_DEAD_INTERVAL, the router SHOULD NOT break the
   adjacency immediately, but MUST wait for a greater interval,
   ADJACENCY_BREAK_INTERVAL, before it declares the adjacent router
   down.  The HELLO_INTERVAL and ROUTER_DEAD_INTERVAL can be increased
   at most once at each router.  The mechanism for changing these
   intervals needs to avoid the problem that a mismatch in
   ROUTER_DEAD_INTERVAL (as sent in Hello messages) can cause
   adjacencies to break, since Hellos are not processed when received
   with a Hello interval different from the configured one.  If the
   router detects a mismatch, the HELLO_INTERVAL or
   ROUTER_DEAD_INTERVAL should be appropriately incremented or
   decremented.  This mechanism would protect broadcast links on which
   a router has many neighbors, where perhaps one neighbor is
   dominating the use of bandwidth on the link.

4.2.2.4 Reduce the Rate of SPF Computation

   Under high congestion, reduce the rate of SPF computation and do not
   do any incremental SPF computation.  This reduces CPU load by not
   overloading the processor with SPF calculations.  Also, during high
   congestion the SPF computation would generally be based on
   incomplete topology information anyway, so convergence time is not
   affected in this case.

4.2.3 Router Response if Congestion Notification Received

   When a router receives a congestion notification message, it SHOULD
   perform the following actions (a combined sketch follows this list):

   o  Increase the retransmit interval to the congested router,
   o  Throttle the rate of LSA flooding toward the congested router
      (see Section 4.2.2.2), and
   o  Increase the adjacency break interval to the congested router
      (see Section 4.2.2.3).
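   The following sketch combines these per-neighbor responses.  The
   doubling/quadrupling of the retransmit interval follows Section
   4.2.2.2; the window and adjacency-break multipliers and all default
   values are illustrative assumptions.

      class NeighborFloodControl:
          """Per-neighbor response to a received congestion
          notification (Section 4.2.3)."""

          def __init__(self, retransmit_interval=5.0, flood_window=100,
                       router_dead_interval=40.0):
              self.base_rxmt = retransmit_interval   # LSA_RETRANSMIT_INTERVAL, s
              self.base_window = flood_window        # unacked LSAs allowed
              self.base_dead = router_dead_interval  # ROUTER_DEAD_INTERVAL, s
              self.apply_notification("none")

          def apply_notification(self, level):
              """'level' is "high", "low", or "none", as signaled via
              the HCS/LCS bits of Section 4.2.1."""
              if level == "high":
                  self.rxmt = 4 * self.base_rxmt     # quadruple (4.2.2.2)
                  self.window = max(1, self.base_window // 4)
                  self.adjacency_break = 4 * self.base_dead  # assumed factor
              elif level == "low":
                  self.rxmt = 2 * self.base_rxmt     # double (4.2.2.2)
                  self.window = max(1, self.base_window // 2)
                  self.adjacency_break = 2 * self.base_dead  # assumed factor
              else:
                  self.rxmt = self.base_rxmt
                  self.window = self.base_window
                  self.adjacency_break = self.base_dead

          def may_flood(self, unacked_lsas):
              """Sliding-window check in the spirit of MAX_LSA_FLOOD
              (Section 4.2.2.2): flood only while the number of LSAs
              sent but not yet acknowledged is below the window."""
              return unacked_lsas < self.window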
4.2.4 Graceful Recovery from Loss of Topology Database Information -
      Database Backup & Resynchronization

   There is a need to gracefully recover from massive loss of topology
   database information.  This need is illustrated in Section 2, where
   we discuss failure experience in which all topology database
   information was lost in all routers in a large network, and the
   link-state protocol was unable to recover.

   To address this problem, OSPF database backup and LS database
   resynchronization SHOULD be applied.  In essence, the database needs
   to be recovered; it can be backed up on the local node, and the
   recovery process can use [zinin3] to resynchronize the database.

   OSPF LS database synchronization is achieved via two methods:

   o  initial LSDB synchronization, when a router is first connected to
      the network, and
   o  asynchronous flooding, which ensures continuous LSDB
      synchronization in the presence of topology changes after the
      initial procedure has completed.

   After two routers have established an adjacency (their neighbor FSMs
   have reached the Full state), the routers announce the adjacency
   states in their router-LSAs.  Asynchronous flooding ensures that the
   routers' LS databases stay in sync in the presence of topology
   changes.  It is sometimes necessary for OSPF routers to
   resynchronize their LS databases, such as under failure when
   database information is lost.  Here we identify LS database backup
   mechanisms, and review resynchronization procedures.

   After a node fails and comes up, it needs to recover the current
   state of the topology database.  The database of the failed router
   SHOULD be backed up before it fails, and this backup SHOULD be used
   to restore the database.  A router SHOULD back up all the
   non-self-originated LSAs and the states of its interfaces.
   Accordingly, when the network is recovering, a router SHOULD
   generate all its own LSAs: router, network, and summary LSAs.
   Do_Not_Age (DNA) LSAs need not be regenerated.  The router SHOULD
   then only receive the LSAs that have changed between the time it
   failed and the current time.  If the database backup is completely
   up to date, none of the LSAs would need to be updated.

   After a failure, a router SHOULD base its routing on the current
   database, derived as above.  Some of the database information will
   be somewhat out of sync, and routing will not be optimal.  However,
   if the database is relatively up to date, the routing will suffice
   until the database re-syncs completely.  What is saved is a
   considerable amount of CPU processing of LS requests and LS updates.

   The approach of Zinin [zinin3] is relevant here when the router
   experiencing a failure uses graceful restart.  As described in "OSPF
   Out-of-band LSDB resynchronization" [zinin3], when database
   information is lost, the OSPF standard does not allow routers to
   resynchronize without changing the topology view of the network.
   That is, if routers need to resynchronize their LS databases, they
   cannot do so without putting the neighbor FSMs into the ExStart
   state.  This effectively causes adjacencies to be removed from the
   router-LSAs, which is not acceptable in some cases.  With
   out-of-band LSDB (OOB LSDB) resynchronization, routers can
   re-establish adjacencies after a reload, for example from a database
   backup as described above.  To do this, the LR-bit described in
   Section 4.2.1 indicates OOB LSDB resynchronization capability.
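   To illustrate the flow just described, a minimal sketch of backup
   and restore.  The file path, the JSON format, and the helpers
   originate_own_lsas() and request_oob_resync() are hypothetical; a
   real implementation would originate its own router-, network-, and
   summary-LSAs and run the OOB resynchronization of [zinin3] at the
   indicated points.

      import json
      import pathlib

      BACKUP = pathlib.Path("/var/lib/ospf/lsdb-backup.json")  # assumed

      def originate_own_lsas(my_router_id):
          """Hypothetical helper: build this router's own router-,
          network-, and summary-LSAs from current interface state."""
          return []

      def request_oob_resync(lsdb):
          """Hypothetical helper: run OOB LSDB resynchronization
          [zinin3] so only LSAs changed since the failure are fetched."""

      def backup_database(lsdb, interface_states, my_router_id):
          """Periodically save non-self-originated LSAs and interface
          states (Section 4.2.4)."""
          snapshot = {
              "interfaces": interface_states,
              "lsas": [l for l in lsdb
                       if l["adv_router"] != my_router_id],
          }
          BACKUP.write_text(json.dumps(snapshot))

      def restore_database(my_router_id):
          """On restart: reload the backup, regenerate self-originated
          LSAs, and resynchronize only the deltas."""
          snapshot = json.loads(BACKUP.read_text())
          # Backed-up LSAs, including DNA LSAs, are reused as-is.
          lsdb = list(snapshot["lsas"])
          lsdb += originate_own_lsas(my_router_id)
          request_oob_resync(lsdb)   # avoids a full ExStart exchange
          return lsdb, snapshot["interfaces"]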
5. Related Solution Methods

   The following is a summary of methods implemented elsewhere that
   help control congestion and may be worth considering here.

5.1 Optimized Link State Routing (OLSR) Protocol [jacquet]

   A key concept in Optimized Link State Routing (OLSR) is the use of
   multipoint relays (MPRs).  MPRs are selected routers which forward
   broadcast messages during the flooding process.  As such, OLSR
   substantially reduces the message overhead as compared to a pure
   flooding mechanism.  OLSR is designed to work in a distributed
   manner and thus does not depend on any central entity.  Note that
   MPR behavior in dynamic topologies (when the connectivity graph
   changes) is still under consideration.

5.2 Fisheye State Routing Protocol (FSR) for Ad Hoc Networks [gerla1]

   With Fisheye State Routing (FSR), a router periodically broadcasts
   LS updates for a destination to its neighbors with a frequency that
   depends on the hop distance to that destination.  LS updates
   corresponding to far away destinations are propagated with lower
   frequency than those for close by destinations.  FSR reduces the
   overhead caused by routing table updates.

5.3 Landmark Routing Protocol (LANMAR) for Large Scale Ad Hoc Networks
    [gerla2]

   With the Landmark Routing Protocol (LANMAR), a 'landmark' is
   dynamically elected in each group.  Routing to the landmark is
   propagated throughout the network.  A router directs a packet to the
   landmark and then to the destination.  LANMAR reduces routing table
   size and routing update overhead, since it a) summarizes the routing
   information of remote groups, and b) uses a truncated local routing
   table.

5.4 Topology Broadcast Based on Reverse-Path Forwarding (TBRPF) [sri]

   Topology Broadcast Based on Reverse-Path Forwarding (TBRPF)
   introduces an efficient Hello mechanism in which

   o  the TBRPF Neighbor Discovery (TND) protocol allows each router to
      quickly detect the neighboring nodes with which it has a direct
      link, and
   o  TND uses "differential" Hello messages which report only
      *changes* in the status of neighbors.

   This results in Hello messages that are much smaller than those of
   other LS routing protocols, which include the IDs of *all*
   neighbors.

6. Security Considerations

   There are no new security considerations based on the proposals in
   this draft.

7. Acknowledgments

   The authors gratefully acknowledge the comments and suggestions from
   many people.  We are very appreciative of the most helpful and
   constructive comments from Alex Zinin, Russ White (Cisco), and Dave
   Katz (Juniper).  At AT&T we thank Margaret Chiosi, Enrique Cuevas,
   Tom Helstern, Steve Holmgren, Raghuram Aswatnarayan, and Mike
   Shapiro.  We also thank Shawn McAllister at Alcatel, Marty Albright
   at Worldcom, Sohel Khan at Sprint, Ghassem Koleyni at Nortel
   Networks, Thomas Cornely and Greg Wetzel at Covad, Bob Dianda at
   Sprint, and Jeff Parker at Axiowave.

8. References

   [af-pnni-0055.000] "Private Network-Network Interface Specification
        Version 1.0 (PNNI 1.0)," March 1996.

   [ash] Ash, G. R., "Dynamic Routing in Telecommunications Networks,"
        McGraw-Hill.

   [atmf00-0249] "Scalability and Reliability of large ATM networks."

   [atm00-0257] "Signaling Congestion Control in PNNI Networks: The
        Need and Proposed Solution Outline."

   [atm00-0480] "Congestion Control/Failure Recovery in PNNI Networks."

   [atm01-0101] "Proposed Mechanisms for Congestion Control/Failure
        Recovery in PNNI Networks."

   [att] "AT&T announces cause of frame-relay network outage," AT&T
        Press Release, April 22, 1998.

   [btd-cs-congestion-02.00] "Signaling Congestion Control Version
        1.0," baseline text.

   [cholewka] Cholewka, K., "MCI Outage Has Domino Effect," Inter@ctive
        Week, August 20, 1999.

   [choudhury] Choudhury, G., Maunder, A. S., Sapozhnikova, V., "Faster
        Link-State Convergence and Improving Network Scalability and
        Stability," submitted for presentation at LCN 2001.

   [gerla1] Gerla, M., et al., "Fisheye State Routing Protocol (FSR)
        for Ad Hoc Networks," work in progress.

   [gerla2] Gerla, M., et al., "Landmark Routing Protocol (LANMAR) for
        Large Scale Ad Hoc Networks," work in progress.

   [hosein1] Hosein, P., "An Improved ACC Algorithm for
        Telecommunication Networks," Telecommunication Systems 0, 1998.
   [hosein2] Hosein, P., "Overload Control for Real-Time
        Telecommunication Databases," International Teletraffic
        Congress - 16, Edinburgh, Scotland, June 1999.

   [jacquet] Jacquet, et al., "Optimized Link State Routing Protocol,"
        work in progress.

   [jander] Jander, M., "In Qwest Outage, ATM Takes Some Heat," Light
        Reading, April 6, 2001.

   [maunder] Maunder, A. S., Choudhury, G., "Explicit Marking and
        Prioritized Treatment of Specific IGP Packets for Faster IGP
        Convergence and Improved Network Scalability and Stability,"
        work in progress.

   [mpls-architecture] Rosen, E., Viswanathan, A., Callon, R.,
        "Multiprotocol Label Switching Architecture," RFC 3031, January
        2001.

   [mummert] Mummert, V. S., "Network Management and its Implementation
        on the No. 4ESS," International Switching Symposium, Japan,
        1976.

   [moy1] Moy, J., "Hitless OSPF Restart," work in progress.

   [moy2] Moy, J., "Flooding over parallel point-to-point links," work
        in progress.

   [moy3] Moy, J., "Flooding Over a Subset Topology," work in progress.

   [murphy] Murphy, P., "OSPF Floodgates," work in progress.

   [pappalardo1] Pappalardo, D., "Can one rogue switch buckle AT&T's
        network?," Network World Fusion, February 23, 2001.

   [pappalardo2] Pappalardo, D., "AT&T, customers grapple with ATM net
        outage," Network World, February 26, 2001.

   [pillayesnault] Pillay-Esnault, P., "OSPF Refresh and flooding
        reduction in stable topologies," work in progress.

   [Q.764] "Signalling System No. 7 - ISDN user part signalling
        procedures," December 1999.

   [refresh-guide] Zinin, A., "Guidelines for Efficient LSA Refreshment
        in OSPF," work in progress.

   [RFC2474] Nichols, K., et al., "Definition of the Differentiated
        Services Field (DS Field) in the IPv4 and IPv6 Headers," RFC
        2474.

   [RFC2475] Blake, S., et al., "An Architecture for Differentiated
        Services," RFC 2475.

   [sri] SRI International authors, "Topology Broadcast Based on
        Reverse-Path Forwarding (TBRPF)," work in progress.

   [whitehead] Whitehead, M., "A class of overload controls based on
        controlling call reject rates," ITU-T contribution D.19,
        February 2001.

   [zinin1] Zinin, A., et al., "OSPF Restart Signaling," work in
        progress.

   [zinin2] Zinin, A., et al., "Flooding optimizations in link-state
        routing protocols," work in progress.

   [zinin3] Zinin, A., et al., "OSPF Out-of-band LSDB
        resynchronization," work in progress.

   [zinin4] Zinin, A., et al., "OSPF Link-Local Signaling," work in
        progress.

9. Appendix

Appendix A - Dropping Link Adjacencies Under Extreme Congestion

   The following mechanisms are proposed for dropping adjacencies under
   extreme congestion; they may be controversial, so they are placed in
   an Appendix for now.  As discussed below, there is experience in
   using these mechanisms in practice, and also experimental
   verification that the mechanisms are effective.

   In case a router is experiencing extremely high congestion, it
   SHOULD drop adjacencies on all links with low priority.  It is
   natural to think that dropping adjacencies under congestion itself
   causes some flooding and so may not help.  However, this flooding is
   a one-time event, and over the longer term the reduced adjacency
   helps: with lower adjacency, the node has to do less work for any
   LSA generated anywhere in the network.  This has been shown by the
   analytic model discussed in [choudhury].  In addition to the
   analysis and simulation results, there is implementation experience
   with this approach within large carrier networks.  In fact, a
   procedure for reducing adjacencies was used to recover the network
   after the 2/20/01 failure discussed in Section 2, and is also
   invoked before each major upgrade.
   If a router at any time has more adjacencies between Exchange and
   Loading than MAX_ADJACENCY_BUILD_COUNT, the router SHOULD drop the
   adjacencies in Exchange or Loading with lower priority.  In that
   way, routers within a router-group experience a lower level of LSA
   flooding and Hello processing.  When congestion on a router
   subsides, it SHOULD then gradually re-establish the adjacencies on
   links that were dropped.  When an adjacency is dropped, the router
   SHOULD list the link as a stub link in its router-LSA, and if the
   router is a Designated Router on a broadcast link, it SHOULD remove
   the neighbor from its network-LSA.  The Hello messages should still
   show the router as a neighbor.  If the router in this case receives
   a 1-way Hello message, it SHOULD behave as per RFC2328.

   Also proposed is the concept of forming smaller 'router-groups'
   within an area in order to stabilize the network under extreme
   congestion, as a last resort to avoid network collapse.  After the
   smaller router-groups recover, the router-groups should be
   re-constituted into the original area.  Here again, analysis and
   simulation models have shown the benefit.  There is also a current
   plan within a large carrier network to manually segment the network
   into multiple smaller sub-networks and then reconnect them after
   stabilization has been achieved.

   Priorities MAY be set on interfaces so as to help form smaller
   router-groups, or clusters.  That is, links between routers in
   different groups SHOULD be given low priority and those within a
   group high priority.  Links from a router to other router-groups in
   an area SHOULD be given low priority.  A link within a router-group
   MAY be given high priority if only one link exists between the
   routers.  With parallel point-to-point links, one link MAY be given
   higher priority than the others.  In this way, the adjacency
   limiting mechanism is confined to router-groups, and routers within
   a router-group experience a lower level of LSA flooding and Hello
   processing, thus helping to avoid the spread of congestion.

   Under high congestion, adjacencies form and then break again, which
   leads to LSA storms such that eventually none of the adjacencies
   form at all.  In this state, we instead drop adjacencies, let only a
   subset of the adjacencies form, and incrementally bring up further
   adjacencies, and in that way make the domain stable.  So dropping
   adjacencies may cause additional LSA flooding temporarily, but since
   routers now only have to flood to a subset of the topology, the
   network stabilizes sooner.

   Implementations should take care to damp link flaps, and provide
   interface and adjacency state dampening.  If an interface or a given
   adjacency keeps flapping, it should be suppressed for a period of
   time, and the worse the instability, the longer the hold-down period
   should be (a sketch of such dampening follows).
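   A minimal sketch of such adjacency dampening, in which each
   successive flap lengthens the hold-down period.  The penalty values
   and the exponential growth rule are illustrative assumptions; the
   draft only requires that worse instability yield a longer hold-down.

      import time

      class AdjacencyDamper:
          """Hold down a flapping interface or adjacency; the penalty
          constants below are assumptions, not from the draft."""

          BASE_HOLD = 30.0    # seconds of hold-down after first flap
          MAX_HOLD = 3600.0   # cap on the hold-down period
          FORGIVE = 600.0     # flaps older than this no longer count

          def __init__(self):
              self.flaps = []              # timestamps of recent flaps
              self.suppressed_until = 0.0

          def record_flap(self):
              now = time.monotonic()
              # Forget flaps that are old enough to be forgiven.
              self.flaps = [t for t in self.flaps
                            if now - t < self.FORGIVE]
              self.flaps.append(now)
              # The worse the instability, the longer the hold-down.
              hold = min(self.MAX_HOLD,
                         self.BASE_HOLD * 2 ** (len(self.flaps) - 1))
              self.suppressed_until = now + hold

          def may_reestablish(self):
              """True once the hold-down period has expired."""
              return time.monotonic() >= self.suppressed_until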
Appendix B - Vendor Analysis of LS Protocol Performance

   Various vendors and service providers were asked to analyze how
   their product's LS protocol recovers from a total network failure,
   that is, loss of all database information, in a specified scenario.
   The specification of the failure presented to the vendors made the
   following assumptions:

   1. A 400-node network is considered:
      a) 100 backbone nodes
      b) 3 edge nodes per backbone node (edge nodes single homed)
      c) each backbone node is connected to no more than 10 nodes
         (maximum node adjacency is 13, which defines a sparse network)
   2. There are 101 areas:
      a) 1 backbone area with 100 backbone nodes
      b) 100 edge areas, each with 3 nodes, all homed on the backbone
         peer group
   3. 1,000,000 addresses are advertised in the network.
   4. The maximum node adjacency is 13:
      a) a sparse network

   Assumptions on packet sizes and processing times varied according to
   the product capabilities associated with the individual estimates
   given below.  The projected recovery times from three
   vendor/service-provider analyses of the above scenario are as
   follows:

      Recovery Time Estimate A - 3.5 hours
      Recovery Time Estimate B - 5-15 minutes
      Recovery Time Estimate C - 5.5 hours

   Clearly there is large variability in the estimates; however, the
   expectation is that vendor equipment recovery is not adequate under
   a large failure scenario such as the one analyzed.

Appendix C - Analytic Modeling & Simulation Analysis of LS Protocol
             Performance

   Some analysis models have been published which reveal problematic
   performance of LS protocols under various failure conditions
   [maunder, atmf00-0249].  There are various simulation capabilities
   [NS, wandl, opnet, scalable-networks, makesystems] for analyzing LS
   protocol behavior under failure.  We report below an event
   simulation study [choudhury] of overloads in large-scale networks
   and the corresponding recovery times.  Emulation of large failure
   performance could also be obtained through laboratory testing of
   actual protocol performance of vendor products under failure
   conditions; however, to date no such emulation studies of failures
   in large scale networks and the corresponding recovery times have
   been published.  The results from such simulation and emulation
   studies are specific to the type of equipment and network
   configuration modeled, and therefore are not completely general.
   However, the results to date lend support for the existence of
   problems with LS protocols in the face of congestion and network
   failures.

   A network-wide event simulation model is reported in [choudhury] to
   study the impact of a TSU (topology state update) storm.  It
   captures the actual congestion seen at various nodes and accounts
   for propagation delay between nodes, retransmission in case a TSU is
   not acknowledged, failure of links when TSUs are delayed beyond the
   node-dead interval, and link recovery following database
   synchronization and TSU flooding once the TSU is processed.  It
   approximates a real network implementation and uses processing times
   that are roughly of the same order of magnitude as those measured in
   the real network (on the order of milliseconds).

   There are two categories of IGP messages processed at each node in
   the simulation.  Category 1 messages are triggered by a timer and
   include Hello refresh, TSU refresh, and retransmission packets.
   Category 2 messages are not triggered by a timer and include
   received Hellos, received TSUs, and received acknowledgements.
   Timer-triggered messages are given non-preemptive priority over the
   other type and are queued in the timer queue.  As a result, the
   received Hello packets and the received acknowledgement packets may
   see long queuing delays under intense CPU overload.

   Table 1 below shows sample results of the simulation study when
   applied to a network with about 300 nodes and 800 links.  The node
   adjacency varies from node to node, and the maximum node adjacency
   is 30.
   The Hello interval is assumed to be 5 seconds, the minimum interval
   between successive SPF (Shortest-Path-First) calculations is 1
   second, and the node-dead interval ("router-dead interval") is 15
   seconds, i.e., a link is declared down if no Hello packet is
   received for three successive Hello intervals.  During the study, a
   TSU storm of size X is created at time 100 seconds, where the storm
   size is defined as the number of TSUs generated during the storm.
   Three cases are considered, with X = 300, 600, and 900 respectively.
   Besides the storm, there are also the normal once-in-thirty-minutes
   TSU refreshes.  At any given point of time we define a quantity
   "dispersion" as the number of control packets already generated in
   the network but not yet received and processed in at least one node
   (each control packet is assumed to carry three TSUs).  Table 1 shows
   dispersion as a function of time and thereby illustrates the impact
   of a TSU storm on network stability.

   Before the TSU storm, the dispersion due to normal TSU refreshes
   remains small.  We expect the dispersion to jump to a high value
   right after the storm and then come down to the pre-storm level
   after some period of time (this happens with X = 300 and X = 600,
   but not with X = 900).  In Table 1, with a TSU storm of size 300,
   the "heavy dispersion period" lasted about 11 seconds and no link
   losses were observed.  With a TSU storm of size 600, the "heavy
   dispersion period" lasted about 40 seconds.  Some link losses were
   observed a little after 15 seconds within the "heavy dispersion
   period", but eventually all links recovered and the dispersion came
   down to the pre-storm level.  With a TSU storm of size 900, the
   "heavy dispersion period" lasted throughout the simulation period
   (6 minutes).

   ======|==============================================================
         |      Table 1: DISPERSION as a FUNCTION of TIME (in sec)
         |                 for different TSU Storm Sizes
   STORM |==============================================================
   SIZE  | 100s   106s   110s   115s   140s   170s   230s   330s   370s
   ======|==============================================================
    300  |    0     39      3      1      0      1      0      0      0
   ------|--------------------------------------------------------------
    600  |    0    133    120    100     12      1      0      0      0
   ------|--------------------------------------------------------------
    900  |    0    230    215    196    101    119    224    428    488
   ======|==============================================================

   The generic observations are as follows:

   *  If the initial TSU storm size (e.g., X = 300) is such that the
      delays experienced by Hello packets are not big enough to cause
      any link failures anywhere in the network, the network remains
      stable and quickly gets back to a period of "low dispersion".
      LSA storms of this type are observed quite frequently in
      operational networks, and the network easily recovers from them.

   *  If the initial TSU storm size (e.g., X = 600) is such that the
      delays experienced by a few Hello packets in a few nodes cause
      link failures, then some secondary TSU storms are generated.
      However, the secondary storms do not keep growing indefinitely,
      and the network remains stable and eventually gets back to a
      period of "low dispersion".  An LSA storm of this type was
      observed in an operational network, triggered by a network
      upgrade, from which the network recovered but with some
      difficulty.

   *  If the initial TSU storm size (e.g., X = 900) is such that the
      delays experienced by many Hello packets in many nodes cause link
      failures, then a wave of secondary TSU storms is generated.
      The network enters an unstable state and the secondary storms are
      sustained indefinitely, or for a very long period of time.  An
      LSA storm of this type was observed in an operational network,
      triggered by a network failure (the 2/20/01 AT&T ATM network
      failure discussed in Section 2), from which the network did not
      recover without corrective steps (manual procedures were used to
      reduce TSU flooding and stabilize the network).

   The results show that there is a TSU storm threshold above which the
   network shows unstable behavior.  It is desirable that TSU flooding
   reduction be automatic under overload, and the model could be used
   to assess the benefits of various control mechanisms, such as those
   discussed in this draft.

10. Authors' Addresses

   Jerry Ash
   AT&T, Room MT D5-2A01
   200 Laurel Avenue
   Middletown, NJ 07748, USA
   Phone: +1 732 420-4578
   Fax:   +1 732 368-8659
   E-mail: gash@att.com

   Gagan L. Choudhury
   AT&T, Room D5-3C21
   200 S. Laurel Avenue
   Middletown, NJ 07748, USA
   Phone: +1 732 420-3721
   Fax:   +1 732 368-1919
   E-mail: gchoudhury@att.com

   Vishwas Manral
   NetPlane Systems
   189 Prashasan Nagar, Road number 72
   Jubilee Hills
   Hyderabad, India
   E-mail: vishwasm@netplane.com

   Anurag S. Maunder
   Sanera Systems
   370 San Aleso Avenue
   Sunnyvale, CA 94085, USA
   Phone: +1 408 734-6123
   Fax:   +1 408 400-0151
   E-mail: amaunder@sanera.net

   Vera D. Sapozhnikova
   AT&T, Room C5-2C29
   200 S. Laurel Avenue
   Middletown, NJ 07748, USA
   Phone: +1 732 420-2653
   Fax:   +1 732 368-1774
   E-mail: sapozhnikova@att.com

   Mostafa Hashem Sherif
   AT&T, Room C5-2D18
   200 S. Laurel Avenue
   Middletown, NJ 07748, USA
   Phone: +1 732 420-2448
   Fax:   +1 732 371-7234
   E-mail: mhs@att.com

11. Copyright Statement

   Copyright (C) The Internet Society (1998).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.