Network Working Group Alex Zinin Internet Draft Cisco Systems Expiration Date: January 2001 July 2000 File name: draft-ietf-ospf-refresh-guide-01.txt Guidelines for Efficient LSA Refreshment in OSPF draft-ietf-ospf-refresh-guide-01.txt Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts. Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress". The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract OSPF, an IGP widely deployed in IP networks today, requires each LSA to be refreshed by the originating router every 30 minutes. This method increases the protocol's robustness and solves potential problems that may be caused by software bugs, as well as some properties of the protocol itself. Though OSPF expects each LSA to be refreshed independently, ABRs and ASBRs tend to originate Type 3/4/5 LSAs within a short period of time, thus causing periodical network resource exhaustion by LSA refreshments. This document discusses the techniques that can be used to remedy this problem and provide smooth protocol operation. It must be noted that discussed methods can be applied to other link-state routing protocols, such as IS-IS. Zinin [Page 1] INTERNET DRAFT Guidelines for LSA Refreshment July 2000 1 Introduction The requirement for each LSA to be refreshed every 30 minutes is really important for OSPF [Ref1]. Not only because OSPF allows LSAs to live no longer than one hour, but also because LSA refreshment provides an automatic method for problem correction. Possible problems solved by periodical LSA refreshments include those caused by the software bugs and those caused by the features of OSPF itself. This document does not describe the first type of problems, the reader could easily think them out. Examples include lost LSAs, not installed routes, etc. The description of a protocol feature that can potentially cause problems follows. OSPF uses the Fletcher checksum (see Section 12.1.7 of [Ref1]) to insure that the information transferred and stored in LSAs is not corrupted. But this is not the only function of the LS checksum. It is also used during the LSDB synchronization and LSA flooding processes to understand whether the newly received LSA and an LSA with the same Link State ID and Advertising Router already installed in the LSDB are the same when other parameters (LS Sequence Number and LS Age) do not allow to make the right decision (see Section 13.1 of [Ref1]). Equal values of the LS checksum indicate that LSAs contain identical information, so the received LSA is not processed any further. In the overwhelming majority of cases this really indicates that LSAs are the same and that another copy of already installed LSA is received due to the redundancy of the flooding protocol. However, in some situations, for example when the originating router was reloaded, it is possible that the new LSA (with the same values of the LS ID and Advertising Router fields) will contain different information, but will still have the same checksum as the old one. This is because of the nature of the checksumming process---a small number of bits is used to characterize the contents of a big informational block. The probability of having the same checksum for two different data blocks depends on the mean size of the data blocks, the number of bits used to calculate and store the checksum, and (the last, but definitely not least) the algorithm used to calculate it. Since the probability of this event for OSPF Link State checksum is sufficiently low, the protocol itself relies on the periodical LSA refreshments to solve potential LSDB inconsistency. So, the periodical LSA refreshments insure that the protocol converges within a finite period of time. The most straightforward approach to refresh LSAs could be to have a single timer firing every 30 minutes and causing reorigination of all self-originated LSAs. This method proved itself as a non-scalable solution causing periodical CPU and link bandwidth exhaustion in the networks. Having a separate timer for each LSA does not mean solving Zinin [Page 2] INTERNET DRAFT Guidelines for LSA Refreshment July 2000 all problems. First of all, a router may have a lot of self-originated LSAs in its LSDB, so it may very often be not efficient to keep one timer per each LSA in terms of memory and CPU resources. Second, ABRs tend to originate Type-3 and 4 LSAs within a short period of time. This is caused by the fact that in some implementations Dijkstra is not scheduled every time a change in the LSDB is observed, but only after the LSDB has been quiet for some period of time, preventing excessive CPU load. This may very likely lead to the situation when all intra- area routes are announced in summary-LSAs at once. Even though these originated LSAs have separate timers, these timers fire approximately simultaneously. The same effect can be seen when external routing information is injected into an OSPF domain by an ASBR. If the external routes are available at the moment the ASBR is instructed to advertise them, LSAs will be originated almost at once, causing periodical refreshments to be grouped into big chunks and leading to periodical congestion. Similar logic can be applied to the NSSA ASBRs, NSSA translating ABRs, as well as OSPF routers originating Opaque-LSAs. The graph show in Figure 1 depicts the number of self-originated LSAs as a function of time. This graph is greatly simplified assuming that the originating router is an ASBR and injects static routing information with stable topology. Zinin [Page 3] INTERNET DRAFT Guidelines for LSA Refreshment July 2000 SelfLSAs() ^ | | | | | |- - - - - - - - - - - - - - - - ________..._________| | ^ | | | # of ASE-LSAs | | | | | | V | |- - - - _________...___________| | | | | | | | | | | | | | | ___| | / | | |__| | +------- ---------...-----------|--------...---------|----> t | | | LSRefreshTime | LSRefreshTime | | |<--------------------->|<------------------>| t1 t2 t3 t4 Figure 1. Graph of SelfLSAs(t) --- number of self-originated LSAs At moment t1, the router originates its router-LSA. Then, at t2, the router is instructed to inject external routing information into the OSPF domain. At t3 (t3 - t2 == LSRefreshTime) the router reoriginates the AS-external-LSAs and floods them to its neighbors (reorigination of the router-LSAs is not illustrated). It can be seen that the router introduces periodical traffic bursts into the network, also causing excessive CPU utilization in other OSPF routers. The customers of such a network can suffer from this behavior, experiencing periodical network congestion and increased response time. The following sections of this document discuss the methods of implementing OSPF LSA refreshment feature so that it does not disrupt network operation. Zinin [Page 4] INTERNET DRAFT Guidelines for LSA Refreshment July 2000 2 Consideration of possible approaches 2.1 Dispersing LSA origination One of the most obvious techniques to avoid LSA refresh grouping could be dispersion of LSA originations. If LSAs were originated with some interval, their refreshes wouldn't group. Though this method seems to be a very easy solution, it has several disadvantages. The most important one is that the time of routing information propagation is increased, i.e., it may take a lot of time (protocol- wise) until the last LSA in the origination queue gets served. However, introducing a minor dispersion in the initial origination process is encouraged. 2.2 Increasing LSA refresh period Another "easy" way to solve the problem is to increase the refresh interval up to some value near the MaxAge value. This would, on the one hand, decrease the frequency of the updates and, on the other hand, would let the protocol function properly without changes in the aging mechanism. This solution is not ideal, because it does not eliminate periodical LSA refresh bursts, though making them less frequent. This method also increases the time of protocol self- correction in case of problems described in Section 1. 2.3 Avoiding periodical refreshes Another proposal is to avoid periodic LSA refreshes at all by making all or some subset of LSAs be DoNotAge as in [Ref2]. This approach does solve the refresh burst problem (there are no refreshments at all), but it also decreases the protocol robustness by removing one of the automatic survival features, thus making the protocol less bullet-proof and attractive. 2.4 Further increasing the refresh period using DNA LSAs The DNA method mentioned above could be used to increase the LSA refresh interval beyond the MaxAge value. This is possible because DNA LSAs are not aged by the routers, so the originating router can refresh its LSAs at whatever rate it chooses. This seems to be a potentially advantageous possibility. However, neither DNA LSAs, nor the refresh interval increase per se can be considered as a sufficient method to prevent the periodical LSA burst problem. More than that, as it was already mentioned, increasing the refresh interval may lead to potentially longer nonoperational periods. A wide range of methods can be used to implement LSA refreshments efficiently. The following section describes an approach, called Zinin [Page 5] INTERNET DRAFT Guidelines for LSA Refreshment July 2000 "LSA Refreshment Dispersion", implemented in Zebra routing software, as an example of such a technique. 3 LSA Refreshment Dispersion Optimization of the LSA refreshment process described here is based on the observation that if the first refreshment interval is randomized rather than set exactly to LSRefreshTime, subsequent LSA refreshes will also be dispersed. If LSAs are refreshed using individual timers, randomization of the first interval for self- originated LSAs prevents LSA refreshment events from grouping and causing network congestion. Individual LSA refreshments are distributed within the LSRefreshTime period evenly. The described method results in distribution of LSA origination events shown in Figure 2. SelfLSAs() ^ | | __ | __| | __| | | __| | __| | | __| | __| | | __| | _| | | | | | | | | | | | | ___| | / | |__| | +------- ---------...-----------|-----> t | | | LSRefreshTime | | |<--------------------->| t1 t2 t3 Figure 1. Graph of SelfLSAs(t) when Refreshment Dispersion is used The effect of the LSA refreshment dispersion is that OSPF does not put high load on the networks, but instead uses network resources evenly, allowing routers to adapt to its traffic and network Zinin [Page 6] INTERNET DRAFT Guidelines for LSA Refreshment July 2000 administrators to predict necessary resources in more effective manner. In order to use this technique, the implementation should schedule the LSA refreshment as shown in pseudocode in Figure 3. if ((LSA_SEQ (lsa) == OSPF_INITIAL_SEQUENCE_NUMBER) && (LSA_AGE (lsa) == 0)) delay = OSPF_LS_REFRESH_SHIFT + (random () % OSPF_LS_REFRESH_TIME); else { delay = OSPF_LS_REFRESH_TIME - LSA_AGE(lsa); if (delay < 0) delay = 0; /* Add jitter to avoid syncing */ delay = delay + (random () % OSPF_LS_REFRESH_JITTER) +1; } schedule_lsa_refresh (lsa, delay); Figure 3. Pseudocode for refreshment delay calculation The constants used in the algorithm are defined as follows: OSPF_INITIAL_SEQUENCE_NUMBER is the InitialSequenceNumber constant as defined in [Ref1] OSPF_LS_REFRESH_TIME is the LSRefreshTime constant as defined in [Ref1] OSPF_LS_REFRESH_SHIFT is the length of the silent period, i.e., the minimum time that must elapse after LSA origination before the first LSA refresh. OSPF_LS_REFRESH_JITTER is the limit for the pseudo-random number added to the LSA refresh interval to avoid synchronization, i.e., to introduce additional minor dispersion. Zinin [Page 7] INTERNET DRAFT Guidelines for LSA Refreshment July 2000 As it was mentioned before, maintaining a separate timer for each LSA can be costly. A simple way of solving this problem is to group LSAs that must be refreshed approximately at the same time and maintain a single timer for them. This may be implemented by having an LSA accumulation group and a periodic timer. The LSA group is used to list all LSAs originated (or reoriginated) between two consecutive timer expirations (within OSPF_REFRESH_GROUP_TIME seconds). The group is flushed either on the timer expiration or when the size of the group reaches a predefined limit (OSPF_REFRESH_GROUP_LIMIT). When a group is flushed, a refresh event is scheduled as shown in the pseudocode in Figure 3. The event handling routine is supposed to submit the LSAs in the group for reorigination. Note that this is really necessary to limit the maximum number of LSAs in the LSA group and flush the group when this limit is reached in order to achieve acceptable results in the refresh event dispersion. Even when refresh events are dispersed using the method described above, there can still happen some situations where a big number of LSAs must be refreshed within a short period of time. This may lead to high CPU load in the originating router and protocol traffic bursts in the network. It is possible to avoid these problems if the OSPF implementation maintains an LSA refresh queue, which is served not faster than some predefined number of LSAs per second (OSPF_REFRESH_QUEUE_RATE). Note that the value of the OSPF_REFRESH_QUEUE_RATE should not be very low, because this may lead to the reorigination queue being never served completely and some LSAs remaining unrefreshed for a long period. This may also affect the initial dispersion in the refresh events introduced by randomizing the first refresh interval. The recommended value for the OSPF_REFRESH_QUEUE_RATE constant is several times the OSPF_REFRESH_GROUP_LIMIT per second. The described algorithm provides desired results in situations when a router receives self-originated LSAs during adjacency establishment after a reload and decides to accept them as correct (and consequently does not originate the new instances). In this case after the received LSAs are identified as self-originated and approved for future use, they are registered in the refreshment module as if they were originated. Such LSAs are also supposed to be bundled into refreshment groups, but in addition to the periodic timer and maximum group size the implementation should care about the ages of the LSAs being grouped. Generally, if the absolute difference between the age of an LSA being added to the group and the first LSA in the group is larger than a certain acceptable value (OSPF_REFRESH_GROUP_AGE_DIF), the group should be unconditionally flushed and the new LSA must start the new group. Note that it is perfectly possible for the received self-originated LSAs to be not Zinin [Page 8] INTERNET DRAFT Guidelines for LSA Refreshment July 2000 grouped by their age at all when they are transmitted in the database description and link-state update packets. In the worst case each refreshment group will contain a single LSA. However, even in this worst case, LSAs will be grouped by the age again after the first refreshment cycle, so the grouping algorithm presented above is self-optimizing. Below goes the functional specification of the refreshment algorithm. 1. Each originated (reoriginated) or received self-originated LSA is passed for registration to the LSA refreshment module 2. The LSA refreshment module bundles the LSAs submitted for registration into refresh groups. The LSAs in one group share the same timer. A group is flushed (see item 3) whenever one of the following three events happens: a) the periodical timer expires (every OSPF_REFRESH_GROUP_TIME seconds) b) after an LSA has been added to the group, the size of the group reaches OSPF_REFRESH_GROUP_LIMIT c) the age of the LSA to be added to the group differs from the age of the first LSA in the group more than OSPF_REFRESH_GROUP_AGE_DIF 3. when a group is flushed, perform the following action: a) calculate the delay for the refreshment event as presented in Figure 3 b) schedule the refreshment event with the delay calculated in the previous step, passing the LSA refresh group as the parameter to the event handling routine c) create a new LSA refresh group 4. When a refresh event is processed, put the LSAs in the refresh group associated with the event to the reorigination queue. 5 The reorigination queue is served not faster than OSPF_REFRESH_QUEUE_RATE LSAs per second The following table gives recommended values for the constants used in the algorithm description: Zinin [Page 9] INTERNET DRAFT Guidelines for LSA Refreshment July 2000 Constant Recommended Value _________________________________________________ OSPF_LS_REFRESH_SHIFT 60 OSPF_LS_REFRESH_JITTER 10 OSPF_REFRESH_GROUP_TIME 1 OSPF_REFRESH_GROUP_LIMIT 10 OSPF_REFRESH_GROUP_AGE_DIF 3 OSPF_REFRESH_QUEUE_RATE 70 Table 1. Recommended values for constants used in algorithm Note that the constants shown above can be implemented as configur- able variables. However, the default values should be taken from the above table. It is also worth noting that presented algorithm can be used along with the extended LSA refresh period described in subsection 2.4 of this document. However, this would require all routers in the net- work to support DoNotAge LSAs, which may not be the case for some already deployed networks. LSA Refreshment Dispersion per se, in contrast, will possibly require software upgrade only on ASBRS and other routers originating a lot of LSAs. For more information on the described algorithm, please refer to the source code of Zebra routing software (governed by GNU GPL), avail- able at http://www.zebra.org. 4 Security Considerations The algorithm specified in this document does not raise any security issues that are not already covered in [Ref1]. 5 References [Ref1] J. Moy. OSPF version 2. Technical Report RFC 2328, Internet Engineering Task Force, 1998. ftp://ftp.isi.edu/in-notes/rfc2328.txt. [Ref2] J. Moy. Extending OSPF to Support Demand Circuits, Standards Track RFC 1793, Internet Engineering Task Force, 1998. ftp://ftp.isi.edu/in-notes/rfc1793.txt. Zinin [Page 10] INTERNET DRAFT Guidelines for LSA Refreshment July 2000 6 Author's Address Alex D. Zinin Cisco Systems 150 West Tasman Dr. San Jose,CA 95134 E-mail: azinin@cisco.com Zinin [Page 11]