Network Working Group                                           B. Zhang
Internet-Draft                                          Univ. of Arizona
Intended status: Informational                                   L. Wang
Expires: April 22, 2010                                 Univ. of Memphis
                                                                L. Zhang
                                                                    UCLA
                                                        October 19, 2009


                            FIB Aggregation
                   draft-zhang-fibaggregation-00.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 22, 2010.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November


Zhang, et al.            Expires April 22, 2010                 [Page 1]

Internet-Draft               FIB Aggregation                October 2009


   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Abstract

   The rapid growth of Forwarding Information Base (FIB) has raised
   concerns among many Internet Service Providers.  One potential
   solution to this problem is FIB aggregation, i.e. aggregating FIB
   entries without affecting the forwarding paths taken by data traffic.
   It is a purely local software optimization limited within a router,
   requiring no changes to routing protocols or router hardware.  To
   understand the feasibility of using FIB aggregation to extend router
   lifetime, we present several FIB aggregation algorithms and evaluate
   their performance using routing tables and updates from tens of
   networks.  We find that FIB aggregation can reduce the FIB table size
   by as much as 70% with small computational overhead.  We also show
   that the computational overhead can be controlled through various
   mechanisms.


Zhang, et al.            Expires April 22, 2010                 [Page 2]

Internet-Draft               FIB Aggregation                October 2009


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  4
   2.  FIB Aggregation Overview . . . . . . . . . . . . . . . . . . .  6
   3.  FIB Aggregation Techniques and Algorithms  . . . . . . . . . .  7
     3.1.  Level 1 Aggregation  . . . . . . . . . . . . . . . . . . . 10
     3.2.  Level 2 Aggregation  . . . . . . . . . . . . . . . . . . . 10
     3.3.  Level 3 Aggregation  . . . . . . . . . . . . . . . . . . . 11
     3.4.  Level 4 Aggregation  . . . . . . . . . . . . . . . . . . . 12
     3.5.  Extra Routable Space . . . . . . . . . . . . . . . . . . . 13
   4.  Updating Aggregated FIB  . . . . . . . . . . . . . . . . . . . 14
   5.  Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 15
     5.1.  Methodology  . . . . . . . . . . . . . . . . . . . . . . . 15
     5.2.  Table Size Reduction and Overhead  . . . . . . . . . . . . 16
     5.3.  Routing Update Handling  . . . . . . . . . . . . . . . . . 17
   6.  Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 19
   7.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 19
   8.  Security Considerations  . . . . . . . . . . . . . . . . . . . 19
   9.  Informative References . . . . . . . . . . . . . . . . . . . . 19
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 20


Zhang, et al.            Expires April 22, 2010                 [Page 3]

Internet-Draft               FIB Aggregation                October 2009


1.  Introduction

   The Internet routing scalability problem has raised concerns in both
   industry and research communities, as documented in the report from
   the IAB Workshop on Routing and Addressing [RAWS].  Several solutions
   have been proposed under the IRTF RRG and IETF GROW working groups.
   In order to address the root cause of the scalability problem, we
   need fundamental changes to the Internet routing architecture and
   protocols.  However, deploying these architectural changes may take a
   long time, as we have learned from the deployment of IPv6.  While we
   believe that architectural changes will be beneficial to the Internet
   in the long run, we should not avoid short-term solutions that can
   help ISPs reduce their operational costs, which will in turn be
   beneficial to end users.  In particular, ISPs urgently need a
   solution to reduce their forwarding table size.  Forwarding tables
   are derived from routing tables and router configurations, so their
   size increases as routing tables grow.  However, forwarding tables
   use high performance memory that is more expensive and more difficult
   to scale than the memory used to hold routing tables.  Therefore,
   their size is a more immediate concern to ISPs and vendors.

   This work investigates the feasibility of a purely local solution:
   FIB aggregation, which is to combine multiple entries in the
   forwarding table without changing the next hops for data forwarding.
   This approach is particularly appealing because it can be done by a
   software upgrade at a router and its impact is limited within the
   router.  It does not require changes to routing protocols or router
   hardware, nor does it affect multi-homing, traffic engineering, or
   other network-wide operations.  It is important to note that FIB
   aggregation is not a replacement for the long-term architectural
   solutions because it does not address the root causes of the routing
   scalability problem.  Instead, FIB aggregation is a local solution
   that can be quickly implemented and deployed in the short-term, and
   in the long run, it can co-exist and complement architectural
   solutions.

   The idea of FIB aggregation is not new.  It was mentioned as a
   potential strategy in "Preliminary Recommendation for a Routing
   Architecture" [Li09RRG].  Through personal exchanges, we have learned
   that one vendor has implemented a simple FIB aggregation scheme
   (similar to our Level-1 aggregation).  There is also a patent for a
   FIB aggregation algorithm [Cain02AUTOAGG].  In this work, we evaluate
   the benefits and costs of several FIB aggregation schemes.  We
   recognize that FIB aggregation is an opportunistic technique -- its
   effectiveness depends on what prefixes are present in the table, how
   many of them can be numerically represented by a single prefix, and
   how many of them share the same next-hop in data forwarding.  The
   benefits of FIB aggregation come with certain costs, such as extra


Zhang, et al.            Expires April 22, 2010                 [Page 4]

Internet-Draft               FIB Aggregation                October 2009


   CPU overhead.  The costs also depend on the actual aggregation
   algorithms, and how routing changes are handled to update the
   forwarding table.

   We design and implement five algorithms at different aggregation
   levels, each representing different tradeoffs between table size
   reduction and computation complexity.  We evaluate them using
   publicly available routing tables from tens of networks.  The results
   show that the lowest-level aggregation can reduce table size from
   50\% to 70\%, equivalent to the full table size two and half years
   ago, while the highest-level aggregation can reduce the table size to
   30\%, equivalent to the full table size eight years ago.  The
   computation time of one aggregation run ranges from 50 milliseconds
   to 250 milliseconds on a commodity Linux machine.  Although these
   numbers may not reflect the computation time on a router, they
   reflect the relative speed of different levels of aggregation.
   Moreover, such full aggregation is performed only when certain
   thresholds are reached, so the computation time is amortized over
   time.

   To handle routing changes, we design and implement algorithms to
   incrementally update the aggregated FIB upon a change.  The full
   aggregation algorithm is invoked only when the router CPU load is low
   or the FIB size rises above a threshold.  The computation time can be
   further reduced by not aggregating highly active prefixes, which
   represent a small fraction of the routing table but are responsible
   for many routing changes.  The evaluation using one-month BGP routing
   updates shows that compared with un-aggregated FIB, the computation
   overhead of maintaining aggregated FIB over time is small.  Overall,
   our conclusion is that FIB aggregation is a viable solution to the
   routing scalability problem, since it provides significant reduction
   in table size, and its extra computation can be controlled by various
   mechanisms, such as using different levels of aggregation,
   incremental update algorithms, on-deman invocation, and not
   aggregating highly active prefixes.

   Another proposed approach to reducing FIB size is Virtual Aggregation
   [Virtual_Aggregation].  Virtual Aggregation designates a small set of
   routers (APRs) that announce virtual prefixes, so that other routers
   do not need to install more specific prefixes under those virtual
   prefixes in their FIB -- other routers simply forward their packets
   to the APRs responsible for the corresponding virtual prefixes.  It
   can be independently deployed by one ISP, and does not require
   changes to the routing architecture or protocols.  However, Virtual
   Aggregation requires considerable changes to network-wide router
   configurations and specialized routers to announce virtual prefixes.
   Moreover, it introduces extra delays (stretch) in packet delivery.
   To some extent, the Level 4 FIB aggregation described in this draft


Zhang, et al.            Expires April 22, 2010                 [Page 5]

Internet-Draft               FIB Aggregation                October 2009


   can be viewed as a local Virtual Aggregation limited within a router.


2.  FIB Aggregation Overview

   FIB aggregation eliminates redundant entries in a FIB without
   affecting the delivery of data traffic.  For example, it may remove
   an entry A from the FIB if another entry B with the same next-hop
   already covers A's address space.  It may also introduce a new entry
   to the FIB after removing several entries that share the same next-
   hop.

   To ensure packet delivery and avoid changing the path that a packet
   will take, a FIB aggregation scheme should satisfy the property of
   forwarding correctness: after longest match lookup, if a destination
   address has a non-NULL next-hop in the un-aggregated FIB, it should
   have the same next-hop in the aggregated FIB.  There are two main
   questions in designing FIB aggregation techniques: how to aggregate
   the full FIB and how to update an aggregated FIB upon a routing
   change.  In the next two sections, we will describe five aggregation
   algorithms and one update handling algorithm.

   The effectiveness of FIB aggregation depends on which address
   prefixes are present in the routing table and how these prefixes are
   distributed over next-hop routers.  Generally speaking, the fewer
   neighbors a router has, the better aggregation it may achieve.  In
   the extreme case, all prefixes share the same next-hop and the
   aggregation is maximized.  According to Li et. al [RouterTopology],
   although some routers have high degrees up to a couple of hundreds,
   these routers connect to a large number of end-customers, not transit
   neighbor routers.  Therefore, they will still use only a small number
   of next-hops, i.e., the transit neighbors, to reach most of the
   address prefixes.

   Besides sharing the same next-hop, prefixes also need to be
   numerically aggregatable.  This is possible due to two factors.
   First, in IP address allocation, large blocks of Internet addresses
   are first allocated to Regional Internet Registries and then they
   further allocate the addresses to networks within the same region.  A
   router outside the region tends to use the same nexthop to reach
   these address prefixes, which can then be aggregated.  Second, for
   prefixes split for traffic engineering or other purposes, a router
   near the origin network is likely to have different next-hops, but a
   router further away from the origin network is more likely to have
   the same next-hop towards these numerically aggregatable prefixes.

   Therefore, although FIB aggregation is opportunistic and the
   aggregation degree varies from router to router, there are some


Zhang, et al.            Expires April 22, 2010                 [Page 6]

Internet-Draft               FIB Aggregation                October 2009


   inherent properties of the Internet that can make FIB aggregation
   effective.  If FIB aggregation is effective in reducing table size,
   its most appealing feature is that the impact is limited within a
   router's data plane.  It does not change any routing protocols, or
   any router's routing decisions.  Data traffic still flows on the same
   router paths.  Therefore, it can co-exist with almost any new routing
   protocols, including those architectural solutions to the routing
   scalability problem in the long run.


3.  FIB Aggregation Techniques and Algorithms

   We present four levels of full FIB aggregation, each aggregating the
   table more but incuring more computation and other overhead.  These
   algorithms assume that the routing table is stored in a tree
   structure.  Though our implementation uses patricia trie, a tree-like
   data structure widely used for routing tables since the BSD
   implementation, the algorithms should apply to any tree data
   structure.  Note that our algorithms do not build any additional
   trees just for aggregation; we simply use the existing trees that the
   RIB and FIB use.  For a network device that uses non-tree data
   structure to implement routing tables, the general techniques
   discussed here still apply.  In real networks, the algorithms and
   implementations can always be optimized for specific hardware,
   operating systems, and routing table data structures.

   Before describing each FIB aggregation algorithm, we first introduce
   the following terms.

   o  Ancestor prefix: if P1 is a less specific address prefix compared
      to P2, P1 is an ancestor prefix of P2.  For example, 12/8 is an
      ancestor prefix of 12.1/16.

   o  Covering prefix and covered prefix: if an address prefix P1 has no
      ancestor prefix in a RIB/FIB, then it is a covering prefix.
      Otherwise, it is a covered prefix.  Note that in a patricia trie
      implementation, a covering prefix may have ancestor nodes that do
      not have corresponding real prefixes.

   o  Hole: if a RIB/FIB entry X has a more specific prefix than RIB/FIB
      entry Y does and they have different next-hops, then A is punching
      a hole in B. For example, the entry (prefix: 12.1/16, nexthop: A)
      punches a hole in (prefix: 12/8, nexthop: B).

   o  Sibling prefix: if two prefixes of the same length are numerically
      consecutive and numerically aggregatable, we call them sibling
      prefixes.  For example, 12.2/16 and 12.3/16 are sibling prefixes.
      In a patricia trie, sibling prefixes are children prefixes under


Zhang, et al.            Expires April 22, 2010                 [Page 7]

Internet-Draft               FIB Aggregation                October 2009


      the same parent.

   o  Adjacent prefixes: if two prefixes are numerically consecutive,
      they are called adjacent prefixes.  Note that adjacent prefixes
      may not be numerically aggregatable.  For example, 12.3/16 and
      12.4/16 are numerically consecutive but not numerically
      aggregatable.

   o  IN-FIB and NON-FIB: an aggregation algorithm determines whether a
      prefix should be placed in the FIB.  If placed in the FIB, the
      prefix is labeled as IN-FIB.  Otherwise, it is labeled as NON-FIB.

   o  Postorder traversal: this is a recursive tree traversal algorithm
      that examines the children nodes before the parent nodes.  Suppose
      in a given FIB tree, prefix P1 has two children prefixes P2 and P3
      and prefix P2 has two children prefixes P4 and P5 (P3, P4, P5 are
      leaf nodes), then the algorithm will visit the prefixes in the
      following order: P4, P5, P2, P3, P1.

   o  Preorder traversal: this is a recursive tree traversal algorithm
      that traverses each node, its left subtree and then its right
      subtee.  In the above example, a pre-order traversal will visit
      the prefixes in the following order: P1, P2, P4, P5, P3.


Zhang, et al.            Expires April 22, 2010                 [Page 8]

Internet-Draft               FIB Aggregation                October 2009


   Figure 1 illustrates four different types of aggregation, as we will
   describe in detail in the rest of this section.  The binary tree
   represents part of the IP address space.  Nodes labeled with letters
   are prefixes in the routing table, and the letter represents the
   next-hop for the prefix.  Nodes without labels do not have their
   corresponding prefixes in the FIB tree.  Filled nodes are extra
   routable space introduced by the aggregation.

   (a) Level-1: Removing covered prefixes

          (A)         (A)
          / \    =>   / \
         /   \       /   \
       (A)   ( )   ( )   ( )

   (b) Level-2: Combining sibling prefixes

          ( )         (A)
          / \    =>   / \
         /   \       /   \
       (A)   (A)   ( )   ( )

   (c) Level-3: Allowing extra space

          (A)             (A)
          / \             / \
         /   \           /   \
       ()     ()   =>  ()     ()
       /\     /\       /\     /\
      /  \   /  \     /  \   /  \
    (A)  () ()   (A) ()  () ()   ()

   (d) Level-4: Allowing holes in the aggregation

          (A)             (A)
          / \             / \
         /   \           /   \
       ()     ()   =>  ()     ()
       /\     /\       /\     /\
      /  \   /  \     /  \   /  \
    (A)  (B)()   (A) ()  (B)()   ()

                                 Figure 1

   Our aggregation techniques ensure forwarding correctness by
   aggregating prefixes with the same next-hop.  In reality, there are
   other types of information stored in a FIB entry, such as packet
   count.  Aggregating two prefixes will probably lose such fine-grained


Zhang, et al.            Expires April 22, 2010                 [Page 9]

Internet-Draft               FIB Aggregation                October 2009


   statistics, which also happens in all other routing scalability
   solutions, usually to a much wider extent.  Losing more detailed
   information is a necessary cost we have to pay in order to improve
   overall system scalability.  One way to mitigate the problem is
   having a router-wide packet counter instead of per-FIB packet
   counter, thus the aggregated information from all line cards are
   still kept for each prefix.  Another way is for the operators to
   configure some important prefixes not be subject to FIB aggregation,
   thus fine-grained statistics of these important prefixes will be
   kept.

3.1.  Level 1 Aggregation

   This technique is illustrated in Figure 1a.  It removes a prefix if
   it shares the same next-hop with its immediate ancestor prefix.
   Addresses that previously match the removed prefix now will match the
   ancestor prefix and still get the same next-hop.  Previously non-
   routable packets, whose table lookup ends up with NULL next-hop, will
   still be non-routable.  In other words, this aggregation does not
   introduce any new prefix nor extra routable space into the table.

   The algorithm implementing this technique simply traverses the tree
   recursively from the root node in postorder.  When it arrives at a
   node with a prefix, it compares this prefix's next-hop with its
   immediate ancestor prefix's next-hop.  If they have the same next-
   hop, it labels the current node NON-FIB, otherwise labels it IN-FIB.
   The immediate ancestor prefix's next-hop is updated and remembered
   during the tree traversal.  Eventually every prefix node is labeled
   as either NON-FIB or IN-FIB, and all IN-FIB prefixes comprise the
   aggregated FIB.  The aggregation is done recursively throughout the
   entire table.  The computation time is O(n), where n is the total
   number of nodes in the tree.

3.2.  Level 2 Aggregation

   This technique is illustrated in Figure 1b.  In addition to
   performing Level 1 aggregation, Level 2 combines sibling prefixes
   that share the same next-hop into a parent prefix.  If the parent
   node already has a prefix with a different next-hop, then the
   aggregation cannot be done.  Or if the parent node already has a
   prefix with the same next-hop, then it is part of Level 1
   aggregation.  Therefore, Level 2 is done when the parent node has no
   prefix.  The net result is to introduce a new prefix to cover two
   existing prefixes, but there is no extra routable space introduced,
   i.e. the aggregated FIB covers the exact address space as the un-
   aggregated FIB.

   The algorithm implementing Level 2 aggregation traverses the tree


Zhang, et al.            Expires April 22, 2010                [Page 10]

Internet-Draft               FIB Aggregation                October 2009


   recursively from the root node in postorder.  Besides doing Level 1
   aggregation, when it arrives at a node without a prefix, it compares
   this node's two children.  If both children have prefixes and use the
   same next-hop, then both children are labeled NON-FIB, and this
   current node is assigned the parent prefix and labeled IN-FIB.  The
   aggregation is done recursively throughout the entire table.  The
   computation time is O(n), where n is the total number of nodes in the
   tree.

3.3.  Level 3 Aggregation

   This technique is illustrated in Figure 1c.  In addition to
   performing the Level 1 and 2 aggregation, Level 3 aggregates a set of
   non-adjacent prefixes that have the same next-hop into a super
   prefix.  Between these non-adjacent prefixes, non-routable space is
   allowed.  For example, in Figure 1c, at the bottom level of the tree,
   there are two nodes with address prefixes (real nodes) sharing the
   same next hop.  However, these two nodes are separated in the tree by
   two nodes without address prefixes.  The prefixes of the two real
   nodes can be aggregated into a grandparent prefix.

   Level 3 aggregation must be implemented with care to ensure its
   forwarding correctness.  For example, in Figure 1c, two grandchildren
   prefixes are aggregated into one grandparent prefix.  This would be
   incorrect if there is already a great-grandparent prefix (not shown
   in the figure) covering the subtree with a different next-hop B,
   because that means the two middle nodes at the bottom level are not
   non-routable space and their next-hops would change from B to A after
   the aggregation.  In order to handle this case without introducing
   much computation overhead, we decide that in our implementation we
   only apply this type of aggregation to prefixes that do not have any
   existing ancestor prefix, i.e. covering prefixes.  In a typical DFZ
   routing table, about half of all the prefixes are covering prefixes
   and the other half are covered prefixes.  The covered prefixes can be
   aggregated by Level 1 and Level 2, therefore our choice does not lose
   too much aggregation capability.

   The algorithm implementing Level 3 aggregation traverses the tree
   recursively in postorder.  Besides doing Level 1 and Level 2
   aggregation for all nodes, when it arrives at a covering prefix, it
   checks whether this prefix has a sibling node that does not have a
   prefix.  If yes, it returns a pointer of this prefix node to its
   parent node, which will further pass this pointer up along the tree.
   Whenever an upper level node has two such pointers, one from a left
   descendant and another from a right descendant, and these two
   descendants have the same next-hop, then a new prefix is created at
   this upper level node and labeled IN-FIB, while the two descendant
   nodes are labeled NON-FIB.  If the two descendants have different


Zhang, et al.            Expires April 22, 2010                [Page 11]

Internet-Draft               FIB Aggregation                October 2009


   next-hops, then aggregation cannot be done and they remain IN-FIB.
   The computation complexity is O(n), where n is the number of nodes in
   the tree.

3.4.  Level 4 Aggregation

   This technique is illustrated in Figure 1d.  In addition to
   performing Level 1, 2 and 3 aggregation, Level 4 aggregates a set of
   non-adjacent prefixes.  The difference from Level 3 aggregation is
   that, in Level 4, between the non-adjacent aggregated prefixes with
   the same next-hops, other prefixes with different next-hops are
   allowed, while Level 3 only allows prefixes with the same next-hop.
   For example, in Figure 1d, an address prefix with next-hop B is
   between the prefixes being aggregated to a new prefix with next-hop
   A, punching a "hole'' in the aggregate prefix.  This type of
   aggregation maintains forwarding correctness and may also introduce
   extra routable space as Level 3 does.  For the same reason as in
   Level 3, our algorithm only applies this type of aggregation to
   prefixes that do not have ancestor prefixes.

   The seemingly trivial difference between Level 4 and Level 3 actually
   has significant implication to algorithm design.  It allows the
   maximum flexibility for aggregation.  However, taking advantage of it
   may also require significant computational time.  For example, given
   a set of non-adjacent prefixes with different next-hops, which super-
   prefix should be inserted?  Which next-hop should the super-prefix
   take?  Finally, how should the decision be made without too much
   computational complexity?  In this work, we present and evaluate two
   different Level 4 algorithms described as follows.

   The Level 4A algorithm traverses the tree recursively once in
   postorder.  Besides doing Level 1, 2 and 3 aggregations, when it
   arrives at a covering prefix (our level 4 algorithm only applies to
   covering prefixes), it returns a pointer of this prefix node to its
   parent, which will further pass this pointer up along the tree.
   Whenever an upper level node receives two lists of its prefix
   descendants, one from its left child and the other from its right
   child, this node combines the two lists to get all its immediate
   prefix descendants and their next-hops, picks the most popular next-
   hop as its own next-hop and inserts a prefix at this node.  All the
   immediate prefix descendants that use the most popular next-hop will
   be labeled NON-FIB, and other descendants are labeled IN-FIB.  If
   there are multiple next-hops tie for the most popular, then one of
   them will be randomly selected.  The computation time is O(n), where
   n is the number of nodes in the tree.

   The Level 4B algorithm is based on Herrin's proposal [Herrin].  It
   differs from the Level 4A algorithm mainly in how the popular next-


Zhang, et al.            Expires April 22, 2010                [Page 12]

Internet-Draft               FIB Aggregation                October 2009


   hop is calculated -- the 4A algorithm considers only the immediate
   prefix descendants of a node, while the 4B algorithm considers all
   the prefix descendants of a node (i.e. all the prefixes in the tree
   rooted at this node).  It traverses the tree twice.  The first step
   traverses the tree recursively in postorder, which is like sweeping
   all tree nodes from bottom up.  During this process, the algorithm
   calculates the most popular next-hop among all descendant prefixes of
   a node and records this next-hop with the node unless this node
   already has a prefix with a different next-hop.  The second step
   traverses the covering prefixes in preorder, which is like going
   through all tree nodes from top down.  During this process, the
   algorithm tries to insert new prefixes with the most popular next-hop
   from all prefix descendants (not just immediate prefix descendants as
   in Level 4A), as calculated in the previous postorder tree traversal,
   and label the descendant prefixes NON-FIB or IN-FIB accordingly.
   When there are multiple equally popular next-hops, we randomly select
   one.  The computation time is O(n), where n is the number of nodes in
   the tree.  It tries to do a more thorough aggregation than Level 4A,
   but will take longer time since it traverse the tree twice.

3.5.  Extra Routable Space

   A side effect of Level 3 and 4 aggregations is that they introduce
   new prefixes that cover previously non-routable space, therefore some
   previously non-routable traffic (which would have been dropped by
   this router) will be forwarded along the next-hop of the new
   prefixes.  This does not violate the correctness requirement since
   all previously routable traffic is still routable and will follow the
   same path.  The impact of introducing extra routable space into the
   FIB depends on how much traffic is destined to that address space.
   In normal operational conditions, the volume of such traffic should
   be negligible.  However, malicious traffic such as port scanning
   usually explores such non-routable space and in certain cases it may
   become noticeable.  Eventually these packets will be dropped, either
   because they arrive at a router that does not have a route for these
   packets, or because the packet's time-to-live expires.  These packets
   will consume bandwidth during transit and that is a new type of
   overhead introduced by Level 3 aggregation.  A good Level 3 or Level
   4 algorithm should consider the tradeoff between table size
   reduction, extra routable space introduced, and computation time.

   In our evaluation, we limited the newly introduced prefixes to be no
   longer than a certain length and found that setting the prefix length
   limit to 15 results in a good trade-off between table size reduction
   and extra routable space.


Zhang, et al.            Expires April 22, 2010                [Page 13]

Internet-Draft               FIB Aggregation                October 2009


4.  Updating Aggregated FIB

   Internet routes change over time, thus the obvious question is how to
   update the aggregated FIB when there is a change.  Re-run the full
   FIB aggregation will maintain the best aggregation all the time, but
   it will also incur significant computation overhead since the full
   FIB aggregation accesses every tree node and updating an un-
   aggregated FIB only looks up and updates a single prefix node.

   We use the combination of four mechanisms to make sure that the
   computation cost of updating aggregated FIBs is under the control of
   operators.  First, operators can choose the level of full FIB
   aggregation that suits their routers the best.  Routers with faster
   CPU and fewer routing updates can use higher level FIB aggregation,
   otherwise they can use lower level FIB aggregation.  Second, we
   design an algorithm that updates the aggregated FIB incrementally.
   The algorithm tries to minimize the number of tree nodes that have to
   be accessed and changed to maintain forwarding correctness after the
   routing change.  It does not attempt to keep table size small.
   Third, the full FIB aggregation is only invoked when needed, e.g.,
   the table size has crossed a threshold after being incrementally
   updated for a while, or when the router has free CPU cycles to spare,
   \ie, the router load is under a threshold.  Fourth, as shown in
   [HighlyActive], a small number of highly active prefixes are
   responsible for a large number of routing changes.  Therefore we can
   keep these highly active prefixes in the FIB without aggregating them
   out.  Then every time they have a routing change, the update process
   is the same as updating an un-aggregated FIB.  Operators can
   configure the criteria for deciding which prefixes should be kept in
   FIB regardless of the aggregation process.

   The basic idea of the incremental update algorithm is as follows.
   For each routing change, the algorithm looks up the prefix in the
   RIB, and finds out the prefix node's nearest prefix ancestor, and all
   the nearest prefix descendants.  The nearest prefix ancestor is the
   nearest ancestor node that has a prefix.  Similarly, nearest prefix
   descendants are descendant nodes that have prefixes and there is no
   other prefix descendants in between.  In the worst case, all the
   nearest prefix ancestor and nearest prefix descendant nodes need to
   be updated.  For example, when an aggregate prefix in Level 1
   aggregation is withdrawn, in the worst case, all of its nearest
   prefix descendants need to be converted from NON-FIB back to IN-FIB.
   But in many cases, we are able to narrow the changes to a smaller
   number of nodes.  For example, when a new prefix is announced, any of
   its nearest prefix descendants who shares the same next-hop with the
   new prefix does not need any change.  We do not even remove them from
   the FIB since the goal here is to minimize changes, not to maintain
   small table size.  In certain cases, we can actually avoid changes


Zhang, et al.            Expires April 22, 2010                [Page 14]

Internet-Draft               FIB Aggregation                October 2009


   that would have happened to un-aggregated FIBs.  For example, in
   Figure 1, if the covered prefix is withdrawn, it would require a
   change to the un-aggregated FIB, but the aggregated FIB actually does
   not need to be updated.  Therefore, how many changes happen to the
   aggregated FIB and how many nodes each change affects depend on the
   content of the routing updates.  In the next section, we use one
   month of BGP routing updates to evaluate the incremental update
   algorithms and find that the number of updated nodes per change is
   close to one, similar to updating an un-aggregated FIB.


5.  Evaluation

   We use publicly available routing tables from tens of networks
   collected by the RouteViews Oregon monitor [routeviews] to evaluate
   the various FIB aggregation algorithms for their table size
   reduction, computing times, and extra routable space.  We also use
   BGP routing updates to evaluate our FIB update algorithm in terms of
   the frequency of FIB updates, the scope of the updates, computing
   times, and how often the full algorithm may need to run.

5.1.  Methodology

   We evaluate different FIB aggregation algorithms using publicly
   available BGP routing tables taken from the RouteViews Oregon monitor
   [routeviews].  Although these routing tables contain valid next AS
   hops, they either do not have next-hop router information or do not
   reflect the diversity of next-hops that an operational router
   typically has, since the route monitors are not operational routers.
   Therefore we need to generate realistic next-hops based on known
   information.  Our guideline of this process is trying to overestimate
   the number of next-hops so that the table reduction results reflect
   the worst case scenario, and real routers are likely to have better
   aggregation ratio.  The specifics of generating next-hops are as
   follows.

   Tables downloaded from RouteViews only contain the AS path for each
   prefix.  From the AS path, we can extract the next-AS-hop for each
   prefix.  We assume that prefixes sharing the same next-AS-hop share
   the same iBGP neighbor (and thus the same next-hop router at the IP
   level).  We use routing tables from looking glass servers, which
   provider the IBGP neighbor information, to validate this assumption.
   For each next-AS-hop, if there is only one iBGP neighbor, then all
   the prefixes using this next-AS-hop satisfy the assumption.  If there
   are multiple iBGP neighbors, the one that carries the most prefixes
   is called "popular,'' and the prefixes that use the popular iBGP
   neighbors satisfy the assumption.  We found that most routing tables
   have more than 90% of prefixes satisfying the assumption.  Based on


Zhang, et al.            Expires April 22, 2010                [Page 15]

Internet-Draft               FIB Aggregation                October 2009


   this assumption, we use next-AS-hop as the next-hop router in
   evaluation.  This reflects the worst case scenario since large
   networks have hundreds to thousands of neighbor ASes, but the number
   of real next-hops should be much smaller.

   We verify the correctness of each aggregated FIB by dividing each
   original RIB prefix into multiple /24 sub prefixes and look up the
   /24 sub prefixes in the aggregated FIB, which should give the same
   next-hop as that in the RIB.  All the results from our FIB
   aggregation algorithms and incremental update algorithms have been
   verified using this method.

   All the evaluation has been done on a Linux machine with an Intel
   Core 2 Quad 2.83GHz CPU using a single thread.  The algorithms are
   implemented in C and no performance optimization techniques have been
   attempted.  The Patricia trie implementation is taken from the C
   source code of Perl's Net::Patricia module [NetPatricia], which in
   turn was adapted from MRTD's [mrtd] source code.

   We use the public BGP routing tables to do the evaluation because
   these tables come from a diverse set of networks, from tier-1 ISPs to
   small networks.  However, in operational networks, there are other
   types of routes, such as VPN routes, which can be of a large number
   too.  The FIB aggregation algorithms can be applied to these other
   types of routes as well, even though this paper does not evaluation
   the effectiveness of doing so.  We plan to obtain forwarding tables
   from operational ISP routers to validate our results.

5.2.  Table Size Reduction and Overhead

   We applied our algorithms to 37 routing tables archived at RouteViews
   on Dec. 31, 2008 and then calculated the ratio between the aggregated
   FIB size and the original routing table size.  We obtained the
   following results: (1) each level of aggregation can reduce the FIB
   size more than the previous level, which is as expected; (2) even
   with the simple Level 1 aggregation, the FIB size can be reduced by
   30% to 50%; (3) Level 4 aggregation can reduce the FIB size from 60%
   to over 90% with a median around 70% -- some of the tables have
   almost all the prefixes sharing the same nexthop, so they can be
   aggregated into very few entries; and (4) the Level 4A algorithm is
   slightly better than the Level 4B algorithm, although the difference
   is almost negligible.

   To evaluate the effectiveness of our algorithms over a longer period
   of time, we applied them to the RouteViews routing tables from 2001
   to 2008.  For each year, we obtained the median aggregation ratio of
   all the routing tables.  We observed an overall decreasing trend in
   the aggregation ratio (more table size reduction), suggesting that


Zhang, et al.            Expires April 22, 2010                [Page 16]

Internet-Draft               FIB Aggregation                October 2009


   the FIB has become more amenable to aggregation over the years.  One
   possible explanation is that the increasing practice of prefix
   splitting due to multi-homing and traffic engineering has made a
   larger percentage of FIB entries aggregatable.  We plan to further
   investigate this phenomenon in our future work.

   In order to understand the significance of the above results, we use
   the size of the routing table from 1994 to 2009 to translate the
   table size reduction into how many years we turn the clock back for a
   router.  The data is obtained from bgp.potaroo.net, a site that
   tracks the growth of the BGP table size.  We found that the FIB size
   in 2001 is around 30% of the FIB size on 12/31/2008, which means that
   if an ISP uses our Level 4 aggregation algorithm, it will still be
   able to use routers that were deployed in 2001.  Assuming these
   routers were sufficiently provisioned when they were purchased seven
   or eight years ago, they should still be able to accommodate some
   further routing table growth.

   We measured the computing time incurred by our algorithms to
   aggregate each of the 37 routing tables.  The Level 1 to 3 algorithms
   typically require tens of milliseconds, while the Level 4 algorithms
   consume at most a few hundreds milliseconds.  An operational router
   may have slower CPU than our commodity Linux machine, but it has
   specialized hardware and software, thus it is hard to infer a
   router's computing time from what we report here.  Nevertheless, the
   simplicity of the algorithms and the very short computing time
   suggest that the computational overhead in an operational router may
   be small.  Moreover, our results can be used to compare the relative
   speed between different aggregation algorithms.  For example, we can
   observe that the Level 4B algorithm is more computing intensive than
   the Level 4A algorithm.

   We also measured the amount of extra routable space (measured by the
   number of /8 prefixes) introduced by three of our algorithms.  Since
   Level 1 and 2 algorithms do not introduce new routable space, we do
   not present them here.  We found that the Level 3 algorithm is better
   than the Level 4 algorithms in this regard, while the Level 4B
   algorithm covers more routable space than the Level 4A algorithm.
   This is mainly because the 4B algorithm aggregates prefixes from top
   to bottom, which introduces more shorter prefixes than the 4A
   algorithm.

5.3.  Routing Update Handling

   We use one month (December 2008) of BGP routing updates collected by
   RouteViews from the Level 3 Communications peer router to evaluate
   our incremental update handling algorithms.  We summarize our results
   in Figure 2.  There were over 7 million routing updates during this


Zhang, et al.            Expires April 22, 2010                [Page 17]

Internet-Draft               FIB Aggregation                October 2009


   month.  For each unaggregated or aggregated FIB, we counted how many
   of these routing updates would cause a change to the FIB, i.e. an
   insertion, removal, or a change to the next-hop of a FIB entry.
   There were 3,048,038 changes to the unaggregated FIB.  Note, however,
   that an aggregated FIB may experience fewer changes than the
   unaggregated FIB.  For example, the aggregated FIB from our Level 4A
   algorithm had 157,722 fewer FIB updates than the unaggregated FIB.
   This is due to two reasons.  First, some of the withdrawals may be
   for prefixes already removed from the FIB by the aggregation
   algorithm.  Second, our update algorithm minimizes the number of FIB
   changes at the cost of slightly increased FIB size.  On the other
   hand, each FIB change may affect slightly more prefixes in the case
   of the aggregated FIB, as shown in the last column of Figure 2.  As a
   result, the processing time for aggregated and unaggregated FIBs are
   very close.  In other words, our update handling algorithm for FIB
   aggregation has minimal impact on the processing time for routing
   updates.


   +==========+===========+===========+================+============+
   |Algorithms|No. Updates|Total Proc.|Avg Proc Time(S)|Prefixes per|
   |          |           | Time (s)  |per FIB update  | FIB Update |
   +==========+===========+===========+================+============+
   | RIB      |   7254478 |     N/A   |   N/A          |   N/A      |
   +----------+-----------+-----------+----------------+------------+
   | FIB      |   3048111 |     10.98 |   3.6          |   1.0      |
   +----------+-----------+-----------+----------------+------------+
   | Level-1  |   2904630 |     11.26 |   3.88         |   1.01     |
   +----------+-----------+-----------+----------------+------------+
   | Level-2  |   2901446 |     11.24 |   3.87         |   1.01     |
   +----------+-----------+-----------+----------------+------------+
   | Level-3  |   2900304 |     11.23 |   3.87         |   1.01     |
   +----------+-----------+-----------+----------------+------------+
   | Level-4A |   2897385 |     11.11 |   3.83         |   1.02     |
   +----------+-----------+-----------+----------------+------------+
   | Level-4B |   2913995 |     6.16  |   2.11         |   1.17     |
   +----------+-----------+-----------+----------------+------------+

                                 Figure 2

   Since our update handling algorithm trades off the FIB size for fewer
   changes, the FIB needs to be re-aggregated when its size reaches a
   certain threshold.  To estimate how frequently the re-aggregation
   will be triggered, we measure the growth of the FIB size as our
   algorithm handles the BGP updates during the month of Dec. 2008.  The
   Level 4 aggregated FIB had 99,676 entries on Dec. 1, 2008 (37.3% of
   the routing table size).  If we set the FIB size threshold to be
   150,000 entries (about 55% of the full routing table size), the FIB


Zhang, et al.            Expires April 22, 2010                [Page 18]

Internet-Draft               FIB Aggregation                October 2009


   would be re-aggregated five days later on Dec. 6, 2008.  We modified
   our update handling algorithm to do a full aggregation when the FIB
   size reaches 150,000 and found that the full aggregation was indeed
   triggered every few days.  Considering that each full aggregation run
   takes at most a few hundred milliseconds on our commodity PC (perhaps
   a little longer on a router), incurring this overhead every few days
   or so should not be a concern for an ISP.


6.  Conclusion

   We have presented an in-depth analysis of FIB aggregation and our
   results suggest that it is a viable short-term solution to the
   problem of growing FIB table size.  Our aggregation algorithms
   reduces the FIB size by as much as 70% and requires no hardware
   changes or network-wide software/configuration changes, thus reducing
   the need for ISP router upgrades in the short term.  During this
   time, the research community and the industry can work on a long-term
   solution to reduce both the routing table and the FIB table.
   Moreover, FIB aggregation can co-exist with any long-term solution to
   further reduce ISPs' operational costs.


7.  Acknowledgements

   The authors are part of the eFIT project team funded by the US
   National Science Foundation.


8.  Security Considerations

   This draft is a discussion on the Internet's necessity to follow an
   evolutionary path towards the future.  There is no direct impact on
   the Internet security.


9.  Informative References

   [Cain02AUTOAGG]
              Cain, B., "Auto aggregation method for IP prefix/length
              pairs", Patent, June 2002,
              <http://www.freepatentsonline.com/6401130.html>.

   [Herrin]   Herrin, W., "Opportunistic Topological Aggregation in the
              RIB-FIB Calculation?", July 2008, <http://
              www.ops.ietf.org/lists/rrg/2008/threads.html#01880>.

   [HighlyActive]


Zhang, et al.            Expires April 22, 2010                [Page 19]

Internet-Draft               FIB Aggregation                October 2009


              Oliveira, R., Izhak-Ratzin, R., Zhang, B., and L. Zhang,
              "Measurement of Highly Active Prefixes in BGP".

   [Li09RRG]  Li, T., "Preliminary Recommendation for a Routing
              Architecture", Internet Draft, March 2009, <http://
              tools.ietf.org/html/draft-irtf-rrg-recommendation-02>.

   [NetPatricia]
              Plonka, D. and P. Prindeville, "Net-Patricia Perl Module",
              <http://search.cpan.org/dist/Net-Patricia/>.

   [RAWS]     Meyer, D., Zhang, L., and K. Fall, "Report from the IAB
              Workshop on Routing and Addressing", Internet Draft, 2007,
              <http://www.ietf.org/internet-drafts/
              draft-iab-raws-report-02.txt>.

   [RFC4984]  Meyer, D., Zhang, L., and K. Fall, "Report from the IAB
              Workshop on Routing and Addressing", RFC 4984,
              September 2007.

   [RouterTopology]
              Li, L., Alderson, D., Willinger, W., and J. Doyle, "A
              first-principles approach to understanding the Internet's
              router-level topology", ACM SIGCOMM 2004.

   [Virtual_Aggregation]
              Francis, P., Xu, X., and H. Billani, "FIB Suppression with
              Virtual Aggregation and Default Routes",
              draft-francis-idr-intra-va-01, September 2008.

   [mrtd]     "MRTD: The Multi-Threaded Routing Toolkit",
              <http://www.mrtd.net/>.

   [routeviews]
              Advanced Network Technology Center and University of
              Oregon, "The RouteViews Project",
              <http://www.routeviews.org/>.


Authors' Addresses

   Beichuan Zhang
   Univ. of Arizona

   Email: bzhang@arizona.edu


Zhang, et al.            Expires April 22, 2010                [Page 20]

Internet-Draft               FIB Aggregation                October 2009


   Lan Wang
   Univ. of Memphis

   Email: lanwang@memphis.edu


   Lixia Zhang
   UCLA

   Email: lixia@cs.ucla.edu


Zhang, et al.            Expires April 22, 2010                [Page 21]