Network Working Group B. Zhang Internet-Draft Univ. of Arizona Intended status: Informational L. Wang Expires: April 22, 2010 Univ. of Memphis L. Zhang UCLA October 19, 2009 FIB Aggregation draft-zhang-fibaggregation-00.txt Status of this Memo This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on April 22, 2010. Copyright Notice Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents in effect on the date of publication of this document (http://trustee.ietf.org/license-info). Please review these documents carefully, as they describe your rights and restrictions with respect to this document. This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November Zhang, et al. Expires April 22, 2010 [Page 1] Internet-Draft FIB Aggregation October 2009 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English. Abstract The rapid growth of Forwarding Information Base (FIB) has raised concerns among many Internet Service Providers. One potential solution to this problem is FIB aggregation, i.e. aggregating FIB entries without affecting the forwarding paths taken by data traffic. It is a purely local software optimization limited within a router, requiring no changes to routing protocols or router hardware. To understand the feasibility of using FIB aggregation to extend router lifetime, we present several FIB aggregation algorithms and evaluate their performance using routing tables and updates from tens of networks. We find that FIB aggregation can reduce the FIB table size by as much as 70% with small computational overhead. We also show that the computational overhead can be controlled through various mechanisms. Zhang, et al. Expires April 22, 2010 [Page 2] Internet-Draft FIB Aggregation October 2009 Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. FIB Aggregation Overview . . . . . . . . . . . . . . . . . . . 6 3. FIB Aggregation Techniques and Algorithms . . . . . . . . . . 7 3.1. Level 1 Aggregation . . . . . . . . . . . . . . . . . . . 10 3.2. Level 2 Aggregation . . . . . . . . . . . . . . . . . . . 10 3.3. Level 3 Aggregation . . . . . . . . . . . . . . . . . . . 11 3.4. Level 4 Aggregation . . . . . . . . . . . . . . . . . . . 12 3.5. Extra Routable Space . . . . . . . . . . . . . . . . . . . 13 4. Updating Aggregated FIB . . . . . . . . . . . . . . . . . . . 14 5. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.1. Methodology . . . . . . . . . . . . . . . . . . . . . . . 15 5.2. Table Size Reduction and Overhead . . . . . . . . . . . . 16 5.3. Routing Update Handling . . . . . . . . . . . . . . . . . 17 6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 19 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 19 8. Security Considerations . . . . . . . . . . . . . . . . . . . 19 9. Informative References . . . . . . . . . . . . . . . . . . . . 19 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 20 Zhang, et al. Expires April 22, 2010 [Page 3] Internet-Draft FIB Aggregation October 2009 1. Introduction The Internet routing scalability problem has raised concerns in both industry and research communities, as documented in the report from the IAB Workshop on Routing and Addressing [RAWS]. Several solutions have been proposed under the IRTF RRG and IETF GROW working groups. In order to address the root cause of the scalability problem, we need fundamental changes to the Internet routing architecture and protocols. However, deploying these architectural changes may take a long time, as we have learned from the deployment of IPv6. While we believe that architectural changes will be beneficial to the Internet in the long run, we should not avoid short-term solutions that can help ISPs reduce their operational costs, which will in turn be beneficial to end users. In particular, ISPs urgently need a solution to reduce their forwarding table size. Forwarding tables are derived from routing tables and router configurations, so their size increases as routing tables grow. However, forwarding tables use high performance memory that is more expensive and more difficult to scale than the memory used to hold routing tables. Therefore, their size is a more immediate concern to ISPs and vendors. This work investigates the feasibility of a purely local solution: FIB aggregation, which is to combine multiple entries in the forwarding table without changing the next hops for data forwarding. This approach is particularly appealing because it can be done by a software upgrade at a router and its impact is limited within the router. It does not require changes to routing protocols or router hardware, nor does it affect multi-homing, traffic engineering, or other network-wide operations. It is important to note that FIB aggregation is not a replacement for the long-term architectural solutions because it does not address the root causes of the routing scalability problem. Instead, FIB aggregation is a local solution that can be quickly implemented and deployed in the short-term, and in the long run, it can co-exist and complement architectural solutions. The idea of FIB aggregation is not new. It was mentioned as a potential strategy in "Preliminary Recommendation for a Routing Architecture" [Li09RRG]. Through personal exchanges, we have learned that one vendor has implemented a simple FIB aggregation scheme (similar to our Level-1 aggregation). There is also a patent for a FIB aggregation algorithm [Cain02AUTOAGG]. In this work, we evaluate the benefits and costs of several FIB aggregation schemes. We recognize that FIB aggregation is an opportunistic technique -- its effectiveness depends on what prefixes are present in the table, how many of them can be numerically represented by a single prefix, and how many of them share the same next-hop in data forwarding. The benefits of FIB aggregation come with certain costs, such as extra Zhang, et al. Expires April 22, 2010 [Page 4] Internet-Draft FIB Aggregation October 2009 CPU overhead. The costs also depend on the actual aggregation algorithms, and how routing changes are handled to update the forwarding table. We design and implement five algorithms at different aggregation levels, each representing different tradeoffs between table size reduction and computation complexity. We evaluate them using publicly available routing tables from tens of networks. The results show that the lowest-level aggregation can reduce table size from 50\% to 70\%, equivalent to the full table size two and half years ago, while the highest-level aggregation can reduce the table size to 30\%, equivalent to the full table size eight years ago. The computation time of one aggregation run ranges from 50 milliseconds to 250 milliseconds on a commodity Linux machine. Although these numbers may not reflect the computation time on a router, they reflect the relative speed of different levels of aggregation. Moreover, such full aggregation is performed only when certain thresholds are reached, so the computation time is amortized over time. To handle routing changes, we design and implement algorithms to incrementally update the aggregated FIB upon a change. The full aggregation algorithm is invoked only when the router CPU load is low or the FIB size rises above a threshold. The computation time can be further reduced by not aggregating highly active prefixes, which represent a small fraction of the routing table but are responsible for many routing changes. The evaluation using one-month BGP routing updates shows that compared with un-aggregated FIB, the computation overhead of maintaining aggregated FIB over time is small. Overall, our conclusion is that FIB aggregation is a viable solution to the routing scalability problem, since it provides significant reduction in table size, and its extra computation can be controlled by various mechanisms, such as using different levels of aggregation, incremental update algorithms, on-deman invocation, and not aggregating highly active prefixes. Another proposed approach to reducing FIB size is Virtual Aggregation [Virtual_Aggregation]. Virtual Aggregation designates a small set of routers (APRs) that announce virtual prefixes, so that other routers do not need to install more specific prefixes under those virtual prefixes in their FIB -- other routers simply forward their packets to the APRs responsible for the corresponding virtual prefixes. It can be independently deployed by one ISP, and does not require changes to the routing architecture or protocols. However, Virtual Aggregation requires considerable changes to network-wide router configurations and specialized routers to announce virtual prefixes. Moreover, it introduces extra delays (stretch) in packet delivery. To some extent, the Level 4 FIB aggregation described in this draft Zhang, et al. Expires April 22, 2010 [Page 5] Internet-Draft FIB Aggregation October 2009 can be viewed as a local Virtual Aggregation limited within a router. 2. FIB Aggregation Overview FIB aggregation eliminates redundant entries in a FIB without affecting the delivery of data traffic. For example, it may remove an entry A from the FIB if another entry B with the same next-hop already covers A's address space. It may also introduce a new entry to the FIB after removing several entries that share the same next- hop. To ensure packet delivery and avoid changing the path that a packet will take, a FIB aggregation scheme should satisfy the property of forwarding correctness: after longest match lookup, if a destination address has a non-NULL next-hop in the un-aggregated FIB, it should have the same next-hop in the aggregated FIB. There are two main questions in designing FIB aggregation techniques: how to aggregate the full FIB and how to update an aggregated FIB upon a routing change. In the next two sections, we will describe five aggregation algorithms and one update handling algorithm. The effectiveness of FIB aggregation depends on which address prefixes are present in the routing table and how these prefixes are distributed over next-hop routers. Generally speaking, the fewer neighbors a router has, the better aggregation it may achieve. In the extreme case, all prefixes share the same next-hop and the aggregation is maximized. According to Li et. al [RouterTopology], although some routers have high degrees up to a couple of hundreds, these routers connect to a large number of end-customers, not transit neighbor routers. Therefore, they will still use only a small number of next-hops, i.e., the transit neighbors, to reach most of the address prefixes. Besides sharing the same next-hop, prefixes also need to be numerically aggregatable. This is possible due to two factors. First, in IP address allocation, large blocks of Internet addresses are first allocated to Regional Internet Registries and then they further allocate the addresses to networks within the same region. A router outside the region tends to use the same nexthop to reach these address prefixes, which can then be aggregated. Second, for prefixes split for traffic engineering or other purposes, a router near the origin network is likely to have different next-hops, but a router further away from the origin network is more likely to have the same next-hop towards these numerically aggregatable prefixes. Therefore, although FIB aggregation is opportunistic and the aggregation degree varies from router to router, there are some Zhang, et al. Expires April 22, 2010 [Page 6] Internet-Draft FIB Aggregation October 2009 inherent properties of the Internet that can make FIB aggregation effective. If FIB aggregation is effective in reducing table size, its most appealing feature is that the impact is limited within a router's data plane. It does not change any routing protocols, or any router's routing decisions. Data traffic still flows on the same router paths. Therefore, it can co-exist with almost any new routing protocols, including those architectural solutions to the routing scalability problem in the long run. 3. FIB Aggregation Techniques and Algorithms We present four levels of full FIB aggregation, each aggregating the table more but incuring more computation and other overhead. These algorithms assume that the routing table is stored in a tree structure. Though our implementation uses patricia trie, a tree-like data structure widely used for routing tables since the BSD implementation, the algorithms should apply to any tree data structure. Note that our algorithms do not build any additional trees just for aggregation; we simply use the existing trees that the RIB and FIB use. For a network device that uses non-tree data structure to implement routing tables, the general techniques discussed here still apply. In real networks, the algorithms and implementations can always be optimized for specific hardware, operating systems, and routing table data structures. Before describing each FIB aggregation algorithm, we first introduce the following terms. o Ancestor prefix: if P1 is a less specific address prefix compared to P2, P1 is an ancestor prefix of P2. For example, 12/8 is an ancestor prefix of 12.1/16. o Covering prefix and covered prefix: if an address prefix P1 has no ancestor prefix in a RIB/FIB, then it is a covering prefix. Otherwise, it is a covered prefix. Note that in a patricia trie implementation, a covering prefix may have ancestor nodes that do not have corresponding real prefixes. o Hole: if a RIB/FIB entry X has a more specific prefix than RIB/FIB entry Y does and they have different next-hops, then A is punching a hole in B. For example, the entry (prefix: 12.1/16, nexthop: A) punches a hole in (prefix: 12/8, nexthop: B). o Sibling prefix: if two prefixes of the same length are numerically consecutive and numerically aggregatable, we call them sibling prefixes. For example, 12.2/16 and 12.3/16 are sibling prefixes. In a patricia trie, sibling prefixes are children prefixes under Zhang, et al. Expires April 22, 2010 [Page 7] Internet-Draft FIB Aggregation October 2009 the same parent. o Adjacent prefixes: if two prefixes are numerically consecutive, they are called adjacent prefixes. Note that adjacent prefixes may not be numerically aggregatable. For example, 12.3/16 and 12.4/16 are numerically consecutive but not numerically aggregatable. o IN-FIB and NON-FIB: an aggregation algorithm determines whether a prefix should be placed in the FIB. If placed in the FIB, the prefix is labeled as IN-FIB. Otherwise, it is labeled as NON-FIB. o Postorder traversal: this is a recursive tree traversal algorithm that examines the children nodes before the parent nodes. Suppose in a given FIB tree, prefix P1 has two children prefixes P2 and P3 and prefix P2 has two children prefixes P4 and P5 (P3, P4, P5 are leaf nodes), then the algorithm will visit the prefixes in the following order: P4, P5, P2, P3, P1. o Preorder traversal: this is a recursive tree traversal algorithm that traverses each node, its left subtree and then its right subtee. In the above example, a pre-order traversal will visit the prefixes in the following order: P1, P2, P4, P5, P3. Zhang, et al. Expires April 22, 2010 [Page 8] Internet-Draft FIB Aggregation October 2009 Figure 1 illustrates four different types of aggregation, as we will describe in detail in the rest of this section. The binary tree represents part of the IP address space. Nodes labeled with letters are prefixes in the routing table, and the letter represents the next-hop for the prefix. Nodes without labels do not have their corresponding prefixes in the FIB tree. Filled nodes are extra routable space introduced by the aggregation. (a) Level-1: Removing covered prefixes (A) (A) / \ => / \ / \ / \ (A) ( ) ( ) ( ) (b) Level-2: Combining sibling prefixes ( ) (A) / \ => / \ / \ / \ (A) (A) ( ) ( ) (c) Level-3: Allowing extra space (A) (A) / \ / \ / \ / \ () () => () () /\ /\ /\ /\ / \ / \ / \ / \ (A) () () (A) () () () () (d) Level-4: Allowing holes in the aggregation (A) (A) / \ / \ / \ / \ () () => () () /\ /\ /\ /\ / \ / \ / \ / \ (A) (B)() (A) () (B)() () Figure 1 Our aggregation techniques ensure forwarding correctness by aggregating prefixes with the same next-hop. In reality, there are other types of information stored in a FIB entry, such as packet count. Aggregating two prefixes will probably lose such fine-grained Zhang, et al. Expires April 22, 2010 [Page 9] Internet-Draft FIB Aggregation October 2009 statistics, which also happens in all other routing scalability solutions, usually to a much wider extent. Losing more detailed information is a necessary cost we have to pay in order to improve overall system scalability. One way to mitigate the problem is having a router-wide packet counter instead of per-FIB packet counter, thus the aggregated information from all line cards are still kept for each prefix. Another way is for the operators to configure some important prefixes not be subject to FIB aggregation, thus fine-grained statistics of these important prefixes will be kept. 3.1. Level 1 Aggregation This technique is illustrated in Figure 1a. It removes a prefix if it shares the same next-hop with its immediate ancestor prefix. Addresses that previously match the removed prefix now will match the ancestor prefix and still get the same next-hop. Previously non- routable packets, whose table lookup ends up with NULL next-hop, will still be non-routable. In other words, this aggregation does not introduce any new prefix nor extra routable space into the table. The algorithm implementing this technique simply traverses the tree recursively from the root node in postorder. When it arrives at a node with a prefix, it compares this prefix's next-hop with its immediate ancestor prefix's next-hop. If they have the same next- hop, it labels the current node NON-FIB, otherwise labels it IN-FIB. The immediate ancestor prefix's next-hop is updated and remembered during the tree traversal. Eventually every prefix node is labeled as either NON-FIB or IN-FIB, and all IN-FIB prefixes comprise the aggregated FIB. The aggregation is done recursively throughout the entire table. The computation time is O(n), where n is the total number of nodes in the tree. 3.2. Level 2 Aggregation This technique is illustrated in Figure 1b. In addition to performing Level 1 aggregation, Level 2 combines sibling prefixes that share the same next-hop into a parent prefix. If the parent node already has a prefix with a different next-hop, then the aggregation cannot be done. Or if the parent node already has a prefix with the same next-hop, then it is part of Level 1 aggregation. Therefore, Level 2 is done when the parent node has no prefix. The net result is to introduce a new prefix to cover two existing prefixes, but there is no extra routable space introduced, i.e. the aggregated FIB covers the exact address space as the un- aggregated FIB. The algorithm implementing Level 2 aggregation traverses the tree Zhang, et al. Expires April 22, 2010 [Page 10] Internet-Draft FIB Aggregation October 2009 recursively from the root node in postorder. Besides doing Level 1 aggregation, when it arrives at a node without a prefix, it compares this node's two children. If both children have prefixes and use the same next-hop, then both children are labeled NON-FIB, and this current node is assigned the parent prefix and labeled IN-FIB. The aggregation is done recursively throughout the entire table. The computation time is O(n), where n is the total number of nodes in the tree. 3.3. Level 3 Aggregation This technique is illustrated in Figure 1c. In addition to performing the Level 1 and 2 aggregation, Level 3 aggregates a set of non-adjacent prefixes that have the same next-hop into a super prefix. Between these non-adjacent prefixes, non-routable space is allowed. For example, in Figure 1c, at the bottom level of the tree, there are two nodes with address prefixes (real nodes) sharing the same next hop. However, these two nodes are separated in the tree by two nodes without address prefixes. The prefixes of the two real nodes can be aggregated into a grandparent prefix. Level 3 aggregation must be implemented with care to ensure its forwarding correctness. For example, in Figure 1c, two grandchildren prefixes are aggregated into one grandparent prefix. This would be incorrect if there is already a great-grandparent prefix (not shown in the figure) covering the subtree with a different next-hop B, because that means the two middle nodes at the bottom level are not non-routable space and their next-hops would change from B to A after the aggregation. In order to handle this case without introducing much computation overhead, we decide that in our implementation we only apply this type of aggregation to prefixes that do not have any existing ancestor prefix, i.e. covering prefixes. In a typical DFZ routing table, about half of all the prefixes are covering prefixes and the other half are covered prefixes. The covered prefixes can be aggregated by Level 1 and Level 2, therefore our choice does not lose too much aggregation capability. The algorithm implementing Level 3 aggregation traverses the tree recursively in postorder. Besides doing Level 1 and Level 2 aggregation for all nodes, when it arrives at a covering prefix, it checks whether this prefix has a sibling node that does not have a prefix. If yes, it returns a pointer of this prefix node to its parent node, which will further pass this pointer up along the tree. Whenever an upper level node has two such pointers, one from a left descendant and another from a right descendant, and these two descendants have the same next-hop, then a new prefix is created at this upper level node and labeled IN-FIB, while the two descendant nodes are labeled NON-FIB. If the two descendants have different Zhang, et al. Expires April 22, 2010 [Page 11] Internet-Draft FIB Aggregation October 2009 next-hops, then aggregation cannot be done and they remain IN-FIB. The computation complexity is O(n), where n is the number of nodes in the tree. 3.4. Level 4 Aggregation This technique is illustrated in Figure 1d. In addition to performing Level 1, 2 and 3 aggregation, Level 4 aggregates a set of non-adjacent prefixes. The difference from Level 3 aggregation is that, in Level 4, between the non-adjacent aggregated prefixes with the same next-hops, other prefixes with different next-hops are allowed, while Level 3 only allows prefixes with the same next-hop. For example, in Figure 1d, an address prefix with next-hop B is between the prefixes being aggregated to a new prefix with next-hop A, punching a "hole'' in the aggregate prefix. This type of aggregation maintains forwarding correctness and may also introduce extra routable space as Level 3 does. For the same reason as in Level 3, our algorithm only applies this type of aggregation to prefixes that do not have ancestor prefixes. The seemingly trivial difference between Level 4 and Level 3 actually has significant implication to algorithm design. It allows the maximum flexibility for aggregation. However, taking advantage of it may also require significant computational time. For example, given a set of non-adjacent prefixes with different next-hops, which super- prefix should be inserted? Which next-hop should the super-prefix take? Finally, how should the decision be made without too much computational complexity? In this work, we present and evaluate two different Level 4 algorithms described as follows. The Level 4A algorithm traverses the tree recursively once in postorder. Besides doing Level 1, 2 and 3 aggregations, when it arrives at a covering prefix (our level 4 algorithm only applies to covering prefixes), it returns a pointer of this prefix node to its parent, which will further pass this pointer up along the tree. Whenever an upper level node receives two lists of its prefix descendants, one from its left child and the other from its right child, this node combines the two lists to get all its immediate prefix descendants and their next-hops, picks the most popular next- hop as its own next-hop and inserts a prefix at this node. All the immediate prefix descendants that use the most popular next-hop will be labeled NON-FIB, and other descendants are labeled IN-FIB. If there are multiple next-hops tie for the most popular, then one of them will be randomly selected. The computation time is O(n), where n is the number of nodes in the tree. The Level 4B algorithm is based on Herrin's proposal [Herrin]. It differs from the Level 4A algorithm mainly in how the popular next- Zhang, et al. Expires April 22, 2010 [Page 12] Internet-Draft FIB Aggregation October 2009 hop is calculated -- the 4A algorithm considers only the immediate prefix descendants of a node, while the 4B algorithm considers all the prefix descendants of a node (i.e. all the prefixes in the tree rooted at this node). It traverses the tree twice. The first step traverses the tree recursively in postorder, which is like sweeping all tree nodes from bottom up. During this process, the algorithm calculates the most popular next-hop among all descendant prefixes of a node and records this next-hop with the node unless this node already has a prefix with a different next-hop. The second step traverses the covering prefixes in preorder, which is like going through all tree nodes from top down. During this process, the algorithm tries to insert new prefixes with the most popular next-hop from all prefix descendants (not just immediate prefix descendants as in Level 4A), as calculated in the previous postorder tree traversal, and label the descendant prefixes NON-FIB or IN-FIB accordingly. When there are multiple equally popular next-hops, we randomly select one. The computation time is O(n), where n is the number of nodes in the tree. It tries to do a more thorough aggregation than Level 4A, but will take longer time since it traverse the tree twice. 3.5. Extra Routable Space A side effect of Level 3 and 4 aggregations is that they introduce new prefixes that cover previously non-routable space, therefore some previously non-routable traffic (which would have been dropped by this router) will be forwarded along the next-hop of the new prefixes. This does not violate the correctness requirement since all previously routable traffic is still routable and will follow the same path. The impact of introducing extra routable space into the FIB depends on how much traffic is destined to that address space. In normal operational conditions, the volume of such traffic should be negligible. However, malicious traffic such as port scanning usually explores such non-routable space and in certain cases it may become noticeable. Eventually these packets will be dropped, either because they arrive at a router that does not have a route for these packets, or because the packet's time-to-live expires. These packets will consume bandwidth during transit and that is a new type of overhead introduced by Level 3 aggregation. A good Level 3 or Level 4 algorithm should consider the tradeoff between table size reduction, extra routable space introduced, and computation time. In our evaluation, we limited the newly introduced prefixes to be no longer than a certain length and found that setting the prefix length limit to 15 results in a good trade-off between table size reduction and extra routable space. Zhang, et al. Expires April 22, 2010 [Page 13] Internet-Draft FIB Aggregation October 2009 4. Updating Aggregated FIB Internet routes change over time, thus the obvious question is how to update the aggregated FIB when there is a change. Re-run the full FIB aggregation will maintain the best aggregation all the time, but it will also incur significant computation overhead since the full FIB aggregation accesses every tree node and updating an un- aggregated FIB only looks up and updates a single prefix node. We use the combination of four mechanisms to make sure that the computation cost of updating aggregated FIBs is under the control of operators. First, operators can choose the level of full FIB aggregation that suits their routers the best. Routers with faster CPU and fewer routing updates can use higher level FIB aggregation, otherwise they can use lower level FIB aggregation. Second, we design an algorithm that updates the aggregated FIB incrementally. The algorithm tries to minimize the number of tree nodes that have to be accessed and changed to maintain forwarding correctness after the routing change. It does not attempt to keep table size small. Third, the full FIB aggregation is only invoked when needed, e.g., the table size has crossed a threshold after being incrementally updated for a while, or when the router has free CPU cycles to spare, \ie, the router load is under a threshold. Fourth, as shown in [HighlyActive], a small number of highly active prefixes are responsible for a large number of routing changes. Therefore we can keep these highly active prefixes in the FIB without aggregating them out. Then every time they have a routing change, the update process is the same as updating an un-aggregated FIB. Operators can configure the criteria for deciding which prefixes should be kept in FIB regardless of the aggregation process. The basic idea of the incremental update algorithm is as follows. For each routing change, the algorithm looks up the prefix in the RIB, and finds out the prefix node's nearest prefix ancestor, and all the nearest prefix descendants. The nearest prefix ancestor is the nearest ancestor node that has a prefix. Similarly, nearest prefix descendants are descendant nodes that have prefixes and there is no other prefix descendants in between. In the worst case, all the nearest prefix ancestor and nearest prefix descendant nodes need to be updated. For example, when an aggregate prefix in Level 1 aggregation is withdrawn, in the worst case, all of its nearest prefix descendants need to be converted from NON-FIB back to IN-FIB. But in many cases, we are able to narrow the changes to a smaller number of nodes. For example, when a new prefix is announced, any of its nearest prefix descendants who shares the same next-hop with the new prefix does not need any change. We do not even remove them from the FIB since the goal here is to minimize changes, not to maintain small table size. In certain cases, we can actually avoid changes Zhang, et al. Expires April 22, 2010 [Page 14] Internet-Draft FIB Aggregation October 2009 that would have happened to un-aggregated FIBs. For example, in Figure 1, if the covered prefix is withdrawn, it would require a change to the un-aggregated FIB, but the aggregated FIB actually does not need to be updated. Therefore, how many changes happen to the aggregated FIB and how many nodes each change affects depend on the content of the routing updates. In the next section, we use one month of BGP routing updates to evaluate the incremental update algorithms and find that the number of updated nodes per change is close to one, similar to updating an un-aggregated FIB. 5. Evaluation We use publicly available routing tables from tens of networks collected by the RouteViews Oregon monitor [routeviews] to evaluate the various FIB aggregation algorithms for their table size reduction, computing times, and extra routable space. We also use BGP routing updates to evaluate our FIB update algorithm in terms of the frequency of FIB updates, the scope of the updates, computing times, and how often the full algorithm may need to run. 5.1. Methodology We evaluate different FIB aggregation algorithms using publicly available BGP routing tables taken from the RouteViews Oregon monitor [routeviews]. Although these routing tables contain valid next AS hops, they either do not have next-hop router information or do not reflect the diversity of next-hops that an operational router typically has, since the route monitors are not operational routers. Therefore we need to generate realistic next-hops based on known information. Our guideline of this process is trying to overestimate the number of next-hops so that the table reduction results reflect the worst case scenario, and real routers are likely to have better aggregation ratio. The specifics of generating next-hops are as follows. Tables downloaded from RouteViews only contain the AS path for each prefix. From the AS path, we can extract the next-AS-hop for each prefix. We assume that prefixes sharing the same next-AS-hop share the same iBGP neighbor (and thus the same next-hop router at the IP level). We use routing tables from looking glass servers, which provider the IBGP neighbor information, to validate this assumption. For each next-AS-hop, if there is only one iBGP neighbor, then all the prefixes using this next-AS-hop satisfy the assumption. If there are multiple iBGP neighbors, the one that carries the most prefixes is called "popular,'' and the prefixes that use the popular iBGP neighbors satisfy the assumption. We found that most routing tables have more than 90% of prefixes satisfying the assumption. Based on Zhang, et al. Expires April 22, 2010 [Page 15] Internet-Draft FIB Aggregation October 2009 this assumption, we use next-AS-hop as the next-hop router in evaluation. This reflects the worst case scenario since large networks have hundreds to thousands of neighbor ASes, but the number of real next-hops should be much smaller. We verify the correctness of each aggregated FIB by dividing each original RIB prefix into multiple /24 sub prefixes and look up the /24 sub prefixes in the aggregated FIB, which should give the same next-hop as that in the RIB. All the results from our FIB aggregation algorithms and incremental update algorithms have been verified using this method. All the evaluation has been done on a Linux machine with an Intel Core 2 Quad 2.83GHz CPU using a single thread. The algorithms are implemented in C and no performance optimization techniques have been attempted. The Patricia trie implementation is taken from the C source code of Perl's Net::Patricia module [NetPatricia], which in turn was adapted from MRTD's [mrtd] source code. We use the public BGP routing tables to do the evaluation because these tables come from a diverse set of networks, from tier-1 ISPs to small networks. However, in operational networks, there are other types of routes, such as VPN routes, which can be of a large number too. The FIB aggregation algorithms can be applied to these other types of routes as well, even though this paper does not evaluation the effectiveness of doing so. We plan to obtain forwarding tables from operational ISP routers to validate our results. 5.2. Table Size Reduction and Overhead We applied our algorithms to 37 routing tables archived at RouteViews on Dec. 31, 2008 and then calculated the ratio between the aggregated FIB size and the original routing table size. We obtained the following results: (1) each level of aggregation can reduce the FIB size more than the previous level, which is as expected; (2) even with the simple Level 1 aggregation, the FIB size can be reduced by 30% to 50%; (3) Level 4 aggregation can reduce the FIB size from 60% to over 90% with a median around 70% -- some of the tables have almost all the prefixes sharing the same nexthop, so they can be aggregated into very few entries; and (4) the Level 4A algorithm is slightly better than the Level 4B algorithm, although the difference is almost negligible. To evaluate the effectiveness of our algorithms over a longer period of time, we applied them to the RouteViews routing tables from 2001 to 2008. For each year, we obtained the median aggregation ratio of all the routing tables. We observed an overall decreasing trend in the aggregation ratio (more table size reduction), suggesting that Zhang, et al. Expires April 22, 2010 [Page 16] Internet-Draft FIB Aggregation October 2009 the FIB has become more amenable to aggregation over the years. One possible explanation is that the increasing practice of prefix splitting due to multi-homing and traffic engineering has made a larger percentage of FIB entries aggregatable. We plan to further investigate this phenomenon in our future work. In order to understand the significance of the above results, we use the size of the routing table from 1994 to 2009 to translate the table size reduction into how many years we turn the clock back for a router. The data is obtained from bgp.potaroo.net, a site that tracks the growth of the BGP table size. We found that the FIB size in 2001 is around 30% of the FIB size on 12/31/2008, which means that if an ISP uses our Level 4 aggregation algorithm, it will still be able to use routers that were deployed in 2001. Assuming these routers were sufficiently provisioned when they were purchased seven or eight years ago, they should still be able to accommodate some further routing table growth. We measured the computing time incurred by our algorithms to aggregate each of the 37 routing tables. The Level 1 to 3 algorithms typically require tens of milliseconds, while the Level 4 algorithms consume at most a few hundreds milliseconds. An operational router may have slower CPU than our commodity Linux machine, but it has specialized hardware and software, thus it is hard to infer a router's computing time from what we report here. Nevertheless, the simplicity of the algorithms and the very short computing time suggest that the computational overhead in an operational router may be small. Moreover, our results can be used to compare the relative speed between different aggregation algorithms. For example, we can observe that the Level 4B algorithm is more computing intensive than the Level 4A algorithm. We also measured the amount of extra routable space (measured by the number of /8 prefixes) introduced by three of our algorithms. Since Level 1 and 2 algorithms do not introduce new routable space, we do not present them here. We found that the Level 3 algorithm is better than the Level 4 algorithms in this regard, while the Level 4B algorithm covers more routable space than the Level 4A algorithm. This is mainly because the 4B algorithm aggregates prefixes from top to bottom, which introduces more shorter prefixes than the 4A algorithm. 5.3. Routing Update Handling We use one month (December 2008) of BGP routing updates collected by RouteViews from the Level 3 Communications peer router to evaluate our incremental update handling algorithms. We summarize our results in Figure 2. There were over 7 million routing updates during this Zhang, et al. Expires April 22, 2010 [Page 17] Internet-Draft FIB Aggregation October 2009 month. For each unaggregated or aggregated FIB, we counted how many of these routing updates would cause a change to the FIB, i.e. an insertion, removal, or a change to the next-hop of a FIB entry. There were 3,048,038 changes to the unaggregated FIB. Note, however, that an aggregated FIB may experience fewer changes than the unaggregated FIB. For example, the aggregated FIB from our Level 4A algorithm had 157,722 fewer FIB updates than the unaggregated FIB. This is due to two reasons. First, some of the withdrawals may be for prefixes already removed from the FIB by the aggregation algorithm. Second, our update algorithm minimizes the number of FIB changes at the cost of slightly increased FIB size. On the other hand, each FIB change may affect slightly more prefixes in the case of the aggregated FIB, as shown in the last column of Figure 2. As a result, the processing time for aggregated and unaggregated FIBs are very close. In other words, our update handling algorithm for FIB aggregation has minimal impact on the processing time for routing updates. +==========+===========+===========+================+============+ |Algorithms|No. Updates|Total Proc.|Avg Proc Time(S)|Prefixes per| | | | Time (s) |per FIB update | FIB Update | +==========+===========+===========+================+============+ | RIB | 7254478 | N/A | N/A | N/A | +----------+-----------+-----------+----------------+------------+ | FIB | 3048111 | 10.98 | 3.6 | 1.0 | +----------+-----------+-----------+----------------+------------+ | Level-1 | 2904630 | 11.26 | 3.88 | 1.01 | +----------+-----------+-----------+----------------+------------+ | Level-2 | 2901446 | 11.24 | 3.87 | 1.01 | +----------+-----------+-----------+----------------+------------+ | Level-3 | 2900304 | 11.23 | 3.87 | 1.01 | +----------+-----------+-----------+----------------+------------+ | Level-4A | 2897385 | 11.11 | 3.83 | 1.02 | +----------+-----------+-----------+----------------+------------+ | Level-4B | 2913995 | 6.16 | 2.11 | 1.17 | +----------+-----------+-----------+----------------+------------+ Figure 2 Since our update handling algorithm trades off the FIB size for fewer changes, the FIB needs to be re-aggregated when its size reaches a certain threshold. To estimate how frequently the re-aggregation will be triggered, we measure the growth of the FIB size as our algorithm handles the BGP updates during the month of Dec. 2008. The Level 4 aggregated FIB had 99,676 entries on Dec. 1, 2008 (37.3% of the routing table size). If we set the FIB size threshold to be 150,000 entries (about 55% of the full routing table size), the FIB Zhang, et al. Expires April 22, 2010 [Page 18] Internet-Draft FIB Aggregation October 2009 would be re-aggregated five days later on Dec. 6, 2008. We modified our update handling algorithm to do a full aggregation when the FIB size reaches 150,000 and found that the full aggregation was indeed triggered every few days. Considering that each full aggregation run takes at most a few hundred milliseconds on our commodity PC (perhaps a little longer on a router), incurring this overhead every few days or so should not be a concern for an ISP. 6. Conclusion We have presented an in-depth analysis of FIB aggregation and our results suggest that it is a viable short-term solution to the problem of growing FIB table size. Our aggregation algorithms reduces the FIB size by as much as 70% and requires no hardware changes or network-wide software/configuration changes, thus reducing the need for ISP router upgrades in the short term. During this time, the research community and the industry can work on a long-term solution to reduce both the routing table and the FIB table. Moreover, FIB aggregation can co-exist with any long-term solution to further reduce ISPs' operational costs. 7. Acknowledgements The authors are part of the eFIT project team funded by the US National Science Foundation. 8. Security Considerations This draft is a discussion on the Internet's necessity to follow an evolutionary path towards the future. There is no direct impact on the Internet security. 9. Informative References [Cain02AUTOAGG] Cain, B., "Auto aggregation method for IP prefix/length pairs", Patent, June 2002, . [Herrin] Herrin, W., "Opportunistic Topological Aggregation in the RIB-FIB Calculation?", July 2008, . [HighlyActive] Zhang, et al. Expires April 22, 2010 [Page 19] Internet-Draft FIB Aggregation October 2009 Oliveira, R., Izhak-Ratzin, R., Zhang, B., and L. Zhang, "Measurement of Highly Active Prefixes in BGP". [Li09RRG] Li, T., "Preliminary Recommendation for a Routing Architecture", Internet Draft, March 2009, . [NetPatricia] Plonka, D. and P. Prindeville, "Net-Patricia Perl Module", . [RAWS] Meyer, D., Zhang, L., and K. Fall, "Report from the IAB Workshop on Routing and Addressing", Internet Draft, 2007, . [RFC4984] Meyer, D., Zhang, L., and K. Fall, "Report from the IAB Workshop on Routing and Addressing", RFC 4984, September 2007. [RouterTopology] Li, L., Alderson, D., Willinger, W., and J. Doyle, "A first-principles approach to understanding the Internet's router-level topology", ACM SIGCOMM 2004. [Virtual_Aggregation] Francis, P., Xu, X., and H. Billani, "FIB Suppression with Virtual Aggregation and Default Routes", draft-francis-idr-intra-va-01, September 2008. [mrtd] "MRTD: The Multi-Threaded Routing Toolkit", . [routeviews] Advanced Network Technology Center and University of Oregon, "The RouteViews Project", . Authors' Addresses Beichuan Zhang Univ. of Arizona Email: bzhang@arizona.edu Zhang, et al. Expires April 22, 2010 [Page 20] Internet-Draft FIB Aggregation October 2009 Lan Wang Univ. of Memphis Email: lanwang@memphis.edu Lixia Zhang UCLA Email: lixia@cs.ucla.edu Zhang, et al. Expires April 22, 2010 [Page 21]