Network Working Group B. Zhang Internet-Draft Univ. of Arizona Intended status: Informational L. Zhang Expires: September 5, 2009 UCLA March 4, 2009 Evolution Towards Global Routing Scalability draft-zhang-evolution-00.txt Status of this Memo This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on September 5, 2009. Copyright Notice Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved. Zhang & Zhang Expires September 5, 2009 [Page 1] Internet-Draft Scaling BGP March 2009 This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents in effect on the date of publication of this document (http://trustee.ietf.org/license-info). Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Abstract Internet routing scalability has long been considered a serious problem. Over the years many efforts have been devoted to address this problem, however the IETF community as a whole is yet to achieve a shared understanding on what is the best way forward. We step up a level to re-examine the problem and the ongoing efforts, and conclude, to effectively solve the routing scalability problem, we first need a clear understanding on how to introduce solutions to the Internet, which is a global scale deployed system. In this draft we sketch out our reasoning on the need for an evolutionary path towards scaling the global routing system, instead of attempting a new design. Zhang & Zhang Expires September 5, 2009 [Page 2] Internet-Draft Scaling BGP March 2009 Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Difficulties in Deploying New Solutions . . . . . . . . . . . 4 3. An Evolutionary Path towards Scalable Routing . . . . . . . . 6 3.1. Stage One: Reducing FIB Size . . . . . . . . . . . . . . . 7 3.2. Stage Two: Reducing Multi-AS Virtual Aggregation Overhead . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.3. Stage Three: Reducing RIB Size . . . . . . . . . . . . . . 9 3.4. Stage Four: Insulating the Core from Edge Churns . . . . . 10 3.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . 11 4. Evolution versus Incremental Deployability . . . . . . . . . . 12 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 13 6. Security Considerations . . . . . . . . . . . . . . . . . . . 13 7. Informative References . . . . . . . . . . . . . . . . . . . . 13 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 14 Zhang & Zhang Expires September 5, 2009 [Page 3] Internet-Draft Scaling BGP March 2009 1. Introduction Internet routing scalability has been a long outstanding problem. Over the years many efforts have been devoted to solve this problem; for the last five years we have also been working on a new design to solve this problem [SIRA]. Since the 2006 IAB Workshop on Internet Routing and Addressing [RFC4984], new IRTF/IETF efforts have been devoted to developing a new routing architecture that can provide effective control over the routing system growth. A number of proposals have been put on the table; some proposal even developed running code [LISP]. Yet no clear consensus has emerged in the community as which proposal(s) may have a good chance to be deployed, or what is the best way forward. Assuming the routing scalability problem is real and we can find a new design that is technically sound, why is it so difficult to agree on deploying a new design that can solve the problem? We put in the effort to understand fundamental roadblocks in rolling out our own design to scalable Inter-domain routing [APT], and came to a new understanding of the problem at hand: when facing a problem, the natural approach one tends to take is to develop a new design and roll it out to replace the old problematic system. That can be an effective way to solve problems in small scale systems, but it does not work for the Internet. Instead, the Internet-scale system needs to resolve problems through an evolutionary path, not a revolutionary new design. In this draft we first explain the major difficulties in rolling out a new design to solve the global routing scalability problem. These difficulties taught us why the Internet infrastructure needs an evolutionary path to move forward. We then sketch out a solution scenario towards resolving routing scalability problem. We also draw a distinction between an evolution towards final solution direction versus "a deployable new design." 2. Difficulties in Deploying New Solutions Two of the few fundamental properties of the Internet are its distributed governance and its diversity along multiple dimensions. The Internet is an interconnect of tens of thousands independently administrated networks, each with its own budget, planning, business models and operational practices. As a result, not everyone shares the same view as far as routing scalability is concerned. For example many customer networks and small regional providers do not carry the full BGP routing table internally; instead they only propagate internal routes inside their networks and use default routing to reach the rest of the Internet through one or a few exit Zhang & Zhang Expires September 5, 2009 [Page 4] Internet-Draft Scaling BGP March 2009 points. On the other hand, large networks in general carry the full routing table internally for efficient delivery to a large number of destinations. In this case, the former may not feel the pain of routing table growth but the latter may do. Even among the networks that do carry the full routing table inside, some (such as content providers) are able to upgrade their routing infrastructure every few years to keep up with the demand of ever growing BGP table; others may not be able to afford doing so. For example, we have learned from a few large ISPs that, although they may be able to upgrade their (relatively small number of) core routers with the latest technology that can handle a million or more routes, they could not afford upgrading all their edge routers which may count up to a thousand or more, even though some of them have been 10 or more years old. As a result, some networks may encounter the routing scalability problem years earlier than others, some may experience severe problems while others may not see the problem at all. Even within the same network, some routers can handle the increasing routing table size while others cannot. Several incidents have occurred recently that were caused by edge router RIB overflow. Although these incidents may be due to other problems (e.g., route leak-out) that led to inflation of the RIB size, they did show the fact that a large RIB size can easily push old edge routers to fall off the cliff. Therefore, although finding a way to put routing table size under control may be viewed as needed in the long run, different network can have different degrees of incentive to solve the problem, and some may not see a need to take any action towards fixing the problem for the time being. Yet another important issue is network economics. A new solution design usually calls for software upgrade or even new hardware, both require additional investment as well as new expertise in managing and troubleshooting the new technology. The affordability associated with deploying a new design varies greatly among different networks. Even if a network may suffer pain from the growing routing table size, it still may not be able to deploy a solution if the cost is considered prohibitively high. Each network makes its own business decision on whether to deploy a new design or not, based on its evaluation of the severity of the problem and the cost of deploying the solution. Given the scale and diversity of the Internet, it is certain that the buy-in of any new solution will not be harmonious. Even for those networks that require a solution to handle routing scalability, the deployment will likely be a gradual process with several stages. Furthermore, the day for *all* the networks to deploy a new solution may take forever Zhang & Zhang Expires September 5, 2009 [Page 5] Internet-Draft Scaling BGP March 2009 to come. To summarize: we see that o Different parties have different perceptions regarding the routing scalability problem, both due to different operations and different affordability; some are yet to be convinced that the routing scalability problem is serious [BGP2008]. o Even for networks where the routing scalability problem shows up, there are different severity at different routers. o If any new solution gets rolled out, it is certain to start from one or a few parties first, and may or may not ever reach the whole Internet. The above argues that we should attack the routing scalability problem with an evolutionary approach. By evolution we mean that the solution should allow table size reduction to be done only for those routers whose capacity fall behind the the FIB or RIB growth, and that the solution should be built on top of the existing system, rather than bringing in a replacement of it. Building a solution on top of the existing system makes it much easier to work transparently with the rest part that does not make the changes or does not make the change at the same time. 3. An Evolutionary Path towards Scalable Routing Based on our current understanding of the problem and the solution space, in this section we sketch out an evolutionary path towards scalable Inter-domain routing. As the Internet continues to evolve over time, it is possible that our understanding may also evolve, thus the path we sketched out in this draft may change. The main point we want to make is not any particular evolution path, but rather to show evidence that such an evolutionary path both exists and is feasible, and that efforts towards solving an existing problem should aim for an evolutionary path towards architectural change, rather than attempting a brand new design. At this time we can see several stages in evolving today's BGP routing system towards a controllable growth of the core routing table size. We divide this evolution into stages by identifying potentially most severe pain at the time that seems warranting a fix, and we identify a fix that has a reasonable cost, can be carried out by individual network, and can built on top of the existing operations, so that it does not break any other parts of the routing system. Note that such a simple fix necessarily has its limitations. As the fix gets widely deployed, its limitations are likely to become more pronounced, and can become the next problem to address, which Zhang & Zhang Expires September 5, 2009 [Page 6] Internet-Draft Scaling BGP March 2009 will lead to the next stage of evolving the system forward. 3.1. Stage One: Reducing FIB Size Over the last month or so we conducted a quick survey on routing scalability among a small group of people with operational expertise. The results identified the fast growing FIB size as the highest priority concern in routing scalability; this is also consistent with the results from the IAB 2006 workshop on Routing and Addressing [RFC4984]. Therefore, we consider reducing FIB size is the first issue to resolve towards scalable BGP routing, and we believe there is no major disagreement regarding this problem statement. The proposed solutions for resolving this FIB scalability problem, on the other hand, differ significantly. Most of the proposals presented to the IRTF Routing Research Group (including our own, APT) took on the direction of a basic architectural change. Not only is an architectural change likely to take long to go through the IETF standardization process as well as costly to roll out, but also a more fundamental problem is that it is difficult to make it both show some immediate benefits to first movers and be compatible with today's deployed base. We will discuss more about the issues in introducing a new design in a later section. A very different solution, Virtual Aggregation (VA), has been proposed by Francis and Xu [Virtual_Aggregation]. Briefly speaking, Virtual Aggregation works as follows. An ISP can reduce its routers FIB size by configuring a router to announce a short prefix, say 1.0.0.0/8 in place of multiple longer prefixes that fall within 1.0.0.0/8, into its own network. This router is called an Aggregation Point Router (APR) and this prefix is called a virtual prefix. The APR maintains FIB entries for all the longer prefixes (e.g., 1.1.0.0/16) covered by the virtual prefix, while other routers in the network only maintain one FIB entry for the virtual prefix 1.0.0.0/8. When a router receives a packet to be forwarded to 1.1.0.0/16, its FIB will direct the packet to the APR, which checks its FIB entry for 1.1.0.0./16 and then tunnel the packet to the egress router for this real prefix. We view Virtual Aggregation as an evolutionary fix to the FIB scalability problem, because it can be done by an individual ISP to effectively shrink the FIB of some of its routers, and the deployment only requires configuration changes to start with. It has no impact on the routing operation of any other networks. At the same time, since all packets destined to a prefix that has been aggregated will go through the APR, this step of evolutionary fix introduces both additional delivery delay (i.e., path stretch) and encapsulation cost. Furthermore, the APR can also become a Zhang & Zhang Expires September 5, 2009 [Page 7] Internet-Draft Scaling BGP March 2009 concentration point of traffic. Several operational steps can be applied to mitigate these problems. o Do not aggregate prefixes that carry heavy volumes of traffic to prevent the traffic from path stretch or contributing to the APR load. o One can adjust APR load by adjusting the number of virtual prefixes, using more APRs to share the load. o One may be able to configure an APR at the POP where adjacent prefixes are announced into one's network; properly positioning APRs can minimize the path stretch. o Finally, if an APR receives heaving volume of traffic from certain ingress routers, the APR can send to those ingress routers the FIB entries that their traffic are destined to, and if the ingress routers cache these entries, they can encapsulate the packets towards the egress router themselves. This last technique makes an APR perform more or less in the same way as a Default Mapper (DM) in our APT design [APT], however with one fundamental difference. Deploying an APR does not require any new protocol or a new functional box (the DM node) that the APT deployment would require. Instead, an operator can simply configure a router to be an APR, without needing any changes to other routers that benefit from reduced FIB size. Only when the APR rollout becomes successful and the APR load becomes an issue, then the operator may consider additional changes to make the ingress routers handle caching. We believe that the deployment of Virtual Aggregation (VA) can effectively reduce the FIB size at some routers. How many ISPs would deploy VA? How much time can VA buy us in curtailing the FIB size growth? It seems only time can tell. But if we look ahead one step, as the Internet continues to grow, and as IPv6 deployment starts rolling out, more networks may face the FIB size problem and adopt Virtual Aggregation as a solution. When two or more adjacent ASes all deploy virtual aggregation, packets that traverse these ASes will experience the cumulated path stretch and encapsulation cost of all the ASes along their paths. The need to resolve this new problem (of cumulated path stretch and cost) can naturally lead to the next step of evolution towards better routing scalability. 3.2. Stage Two: Reducing Multi-AS Virtual Aggregation Overhead Assuming the AS path a packet takes is W-X-Y-Z, and both X and Y have deployed Virtual Aggregation. In this case we would like to see that X's APR encapsulates the packet directly to the egress router of Y instead of X's own. This will minimize the path stretch and the packet will only need to be encapsulated/decapsulated once. Zhang & Zhang Expires September 5, 2009 [Page 8] Internet-Draft Scaling BGP March 2009 To enable such inter-AS Virtual Aggregation, X's APR needs to know Y's egress router for the destination prefix. This mapping information (i.e., mapping from a destination prefix to an egress router) needs to be propagated somehow from Y to X. The least resistant approach is to piggyback such mapping information on existing BGP announcements. Francis and Xu have proposed such an extension to BGP, which carries the mapping information in a new BGP attribute [InterDomainVA]; the APT team was also looking into more or less the same design when the above mentioned draft was published. We argue that this second step is feasible by the following reasoning. First, this second step towards better routing scalability will take effect only after at least two networks (X and Y) have deployed VA and benefited from it. Therefore we reason they would not want to move away from VA but would like to minimize its cost in path stretch and encapsulation, to improve the traffic performance for their customers. Second, the required BGP implementation changes are backward compatible, meaning that networks that have deployed this solution and networks that have not deployed this solution can still communicate without problems. As a side note we would also like to point out that this virtual aggregation mapping exchange *closely* resembles the early design of mapping information exchange between Default Mappers in APT that the APT team proposed earlier [APT-00]. Again a fundamental difference between what we discussed in this section and that early design back in 2007 is that this draft sketches out an evolutionary path forward, which does not require a protocol change as a starting point, nor any information exchange across multiple ASes. Rather, the need for mapping exchange arises only after the FIB size reduction has been achieved, and the mapping exchange can start with two adjacent ASes after each of them has deployed virtual aggregation. 3.3. Stage Three: Reducing RIB Size Piggybacking the virtual aggregation mapping information on BGP can work well when the mapping table is small, i.e., the number of networks that have adopted virtual aggregation is small. When more networks have adopted virtual aggregation, the mapping table will grow big, which may make it no longer feasible to piggyback all the mapping information on the existing BGP sessions. The main problem, as we can perceive now, would be the RIB size growth: A BGP router will receive the same mapping information from every neighboring BGP router, and store all of it in its Adj-RIBs-IN and Local-RIB. Thus every BGP router will have to store multiple copies of the mapping table. This issue was pointed out back in 2007 when the early APT design was discussed, and one suggestion to get around the problem is to use separate BGP sessions for mapping information exchange. Zhang & Zhang Expires September 5, 2009 [Page 9] Internet-Draft Scaling BGP March 2009 Another factor is that, after a network X has deployed virtual aggregation for a while and has gained sufficient operational experience, it may become clear that many routers no longer need to keep the full RIB table. If an internal router has small FIB and relies on APRs to route packets towards all other destinations, it does not need a full RIB to build its FIB. Theoretically speaking, all border routers of X that connect to legacy networks (i.e., those that have not deployed VA) would still need to keep the full RIB in order to make BGP announcements into the legacy neighbors. However in practice, only the customer-facing border routers need a full RIB. The other border routers, those that face either peer or provider legacy neighbors, only need to announce X's own customer prefixes to them. Again, careful engineering analysis and configuration can eliminate the need for many routers to keep full RIB, and those which keep the full RIB will be the ones serving as APRs. As we perceive what may happen further into the future, the picture becomes more blurry, hence what we try to forecast here may or may not bear great accuracy for what may happen in the future. Having said that, we perceive that the combination of the aforementioned two factors (relieving regular routers from storing mapping table and full RIB table) would lead to moving the mapping dissemination from the regular BGP instance (which is used for inter-domain routing) to a separate BGP instance only between APRs via multi-hop BGP sessions. Though the protocol is still BGP for the ease of deployment, APRs run a different session (e.g., a different TCP port) for mapping dissemination purpose only. Other regular routers run regular BGP instance for inter-domain routing purpose, but are relieved from bearing the overhead of storing and propagating mapping information or the full RIB table. Again we cannot help but to point out the close resemblance between the system we depicted above and the original APT design. On the surface, it seems the only noticeable difference is just the names: here we have APRs instead of DMs that use BGP to exchange mapping information. But once again we must not forget an essential difference: we reach this perceived stage three towards scalable routing through an evolutionary path, instead of requiring installation of a new design from day one. 3.4. Stage Four: Insulating the Core from Edge Churns In the current Internet, flaps of customer prefixes are propagated to the rest of the Internet in the form of BGP updates, i.e., routing churns. With virtual aggregation and mapping exchange, these churns can be reflected as mapping updates, which are disseminated through the interconnects of APRs. We perceive this as a benefits, as other non-APR routers can be sheltered from updates due to edge Zhang & Zhang Expires September 5, 2009 [Page 10] Internet-Draft Scaling BGP March 2009 instabilities. Our earlier measurement and analysis study [TopologyGrowth] has shown that most Internet topology growth comes from the addition of customer edge ASes. It is conceivable that as the number of customer sites continues to increase, the amount of churns may become too much to handle in a cost-effective way. A solution to this edge churn problem is to insulate the edge dynamics from the mapping dissemination system. Based on the current BGP data, our estimation shows that, if we could remove BGP updates induced by customer prefix instabilities, we would have reduced the total amount of routing churns by an order of magnitude [eFIT_IPv6]. Ideally, when the link connecting a customer site to a provider fails, the mapping system should propagate this failure information only when the failure has a long duration, so that every network will be aware of this failure and choose an alternative path. But long lasting failures probably do not happen frequently. Short failures, which are frequent, should not be propagated through the mapping system. Instead, they should be handled by other means. For example, APT has a failure handling mechanism in which the failure handling actions are data-driven, i.e., a link failure to an edge network is not responded to unless and until there are data packets that are heading towards the failed link customer site. We are actively working on an evolutionary solution that can provide equivalent data-driven handling of edge failures as APT does. 3.5. Summary If we imagine a picture where all the networks in the Internet had deployed all the stages of routing scalability improvement we sketched above, then the Internet routing would have deployed a new map-encap routing architecture like APT. The prefixes that got aggregated out of the core routing system would be those that belong to the edge ASes, as the ISPs still must exchange routing reachability among themselves to be able to tunnel packets toward their egress routes. As such, the separation of transit networks from the edge sites, as the APT design had proposed, would have been achieved. Then what is the new contribution of this draft? First, we emphasize that our goal is to reduce the FIB/RIB table size; the separation itself should *not* be a required starting point. Second, we show an evolutionary path towards resolving routing scalability through several stages, with clearly identified benefits and minimal cost at each stage. Furthermore, we show that the evolutionary path, as we sketched out in this draft, can naturally converge towards the separation as a result. We make two points from this last statements: (1) This could also be used as an evidence that, if we Zhang & Zhang Expires September 5, 2009 [Page 11] Internet-Draft Scaling BGP March 2009 could afford starting anew, then a separation design would lead to scalable routing! (2) We used the phrase "converge towards separation", rather than "achieving separation", because we believe that, even after a long time and many networks have adopted the solution, it is most likely that some networks will remain at various early stages, some may not have made a single change. This is the nature of the Internet, due to its two properties that we mentioned at the beginning: its distributed governance, and its diversity along multiple dimensions 4. Evolution versus Incremental Deployability Many new designs have plans for incremental deployment. So what is the difference between incremental deployability and evolution? A new design requires that the whole system eventually make the change, and the benefit of the new design will be achieved only after a significant portion of the system has deployed the design. Incremental deployment techniques are used to glue the part that has made the change and the part that has not. It makes the system function at the intermediate stage, but it does not provide any incentive for individual networks to make the change. A typical example is to use MBone tunnels to incrementally deploy IP multicast. Although tunneling provides a technique to connect multicast islands together, the whole deployment plan does not have enough incentive for networks to join the MBone. The main benefit of deploying IP multicast is to improve the performance and scalability of large scale group communication applications, but this benefit exists only after the majority of the Internet has deployed IP multicast. Early adopters would not see enough deployment benefits to justify the cost. The evolutionary approach recognizes that changes to the Internet is a gradual process with possibly several stages, and at each stage, networks that make the changes must have the incentive to do so. More specifically, 1. Each stage focuses on an immediate problem with enough economic impact that warrants a fix. 2. Each stage offers a solution that solves the problem, does not break other parts of the Internet, and can be deployed with a reasonable cost considering the specific problem. 3. As the solution is being deployed by more and more networks, its downside may become more pronounced and eventually requires a fix, which leads to the next stage of the evolution. Zhang & Zhang Expires September 5, 2009 [Page 12] Internet-Draft Scaling BGP March 2009 An evolutionary design emphasizes the evolution path that the Internet would take, while a revolutionary new design focuses on the final outcome once the changes are done by all participants. We believe that we must take an evolutionary approach towards a scalable Internet routing architecture. 5. Acknowledgements The authors are part of the APT team. The APT effort is funded by NSF. 6. Security Considerations This draft is a discussion on the Internet's necessity to follow an evolutionary path towards the future. There is no direct impact on the Internet security. 7. Informative References [APT] Jen, D., Meisel, M., Massey, D., Wang, L., Zhang, B., and L. Zhang, "APT: A Practical Transit Mapping Service", draft-jen-apt-01, November 2007. [APT-00] Jen, D., Meisel, M., Massey, D., Wang, L., Zhang, B., and L. Zhang, "APT: A Practical Transit Mapping Service", draft-jen-apt-00, July 2007. [BGP2008] Huston, G., "BGP IN 2008 - what's changed", APRICOT presentation, 2009, . [InterDomainVA] Xu, X. and P. Francis, "Simple Tunnel Endpoint Signaling in BGP", draft-xu-tunnel-00, February 2009. [LISP] Farinacci, D., Fuller, V., Meyer, D., and D. Lewis, "Location/ID Separation Protocol (LISP)", draft-farinacci-lisp-12, March 2009. [RFC4984] Meyer, D., Zhang, L., and K. Fall, "Report from the IAB Workshop on Routing and Addressing", RFC 4984, September 2007. [SIRA] Zhang, B. and et. al., "A Secure and Scalable Internet Routing Architecture", ACM SIGCOMM 2006 Poster Session. Zhang & Zhang Expires September 5, 2009 [Page 13] Internet-Draft Scaling BGP March 2009 [TopologyGrowth] Oliveira, R., Zhang, B., and L. Zhang, "Observing the Evolution of Internet AS Topology", ACM SIGCOMM 2007. [Virtual_Aggregation] Francis, P., Xu, X., and H. Billani, "FIB Suppression with Virtual Aggregation and Default Routes", draft-francis-idr-intra-va-01, September 2008. [eFIT_IPv6] Massey, D. and et. al., "A Scalable Routing System Design for Future Internet", ACM SIGCOMM 2007 IPv6 Workshop. Authors' Addresses Beichuan Zhang Univ. of Arizona Email: bzhang@arizona.edu Lixia Zhang UCLA Email: lixia@cs.ucla.edu Zhang & Zhang Expires September 5, 2009 [Page 14]