Network Working Group                                         P. Francis
Internet-Draft                                                Cornell U.
Intended status: Informational                                     X. Xu
Expires: April 29, 2009                                           Huawei
                                                              H. Ballani
                                                              Cornell U.
                                                        October 26, 2008


                           Mapped BGP Design
                 draft-francis-mapped-bgp-design-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 29, 2009.

Abstract

   This draft introduces Mapped-BGP, a routing protocol that uses BGP to
   distribute tunnel endpoint-to-prefix mappings.  The goals of this
   draft are to present preliminary concepts and to get feedback.  It is
   not meant to be a fully-formed proposal.

   The goals of Mapped-BGP are: 1) to reduce the processing required to
   run BGP, 2) to speed up inter-domain convergence, 3) to improve the
   cross-ISP load balancing capabilities of BGP, and where possible, 4)
   to enable forms of address aggregation like geographical addressing
   (e.g., for IPv6).  Improved address aggregation is unlikely to be
   very useful for IPv4, because most addresses have already been
   assigned.
   This design takes the position that Mapped-BGP is useful even without
   better aggregation, because 1) FIB size can be reduced through FIB
   suppression with Virtual Aggregation, and 2) RIB size per se is not
   the growth bottleneck.

Table of Contents

   1.  Introduction
   2.  Terms and concepts
   3.  Description of Mapped-BGP
     3.1.  Structure of new attributes
     3.2.  Map-RIB data structure
     3.3.  Tunnel Endpoints (TE)
     3.4.  Rules for advertising maps
       3.4.1.  Rules for initiating a map
       3.4.2.  Transposing Maps and Routes
       3.4.3.  Authenticating updates
       3.4.4.  Longest-prefix map selection rules and aggregation
       3.4.5.  Changing maps
       3.4.6.  Propagating and activating maps
       3.4.7.  Changing TE-route
     3.5.  Load Balancing in Mapped-BGP
       3.5.1.  Incoming Load Balance at Sites
       3.5.2.  Incoming Load Balance at Lower-tier ISPs
       3.5.3.  Multi-exit discrimination with Mapped-BGP
     3.6.  Aggregation in Mapped-BGP
       3.6.1.  Geographic or Metro Addressing
       3.6.2.  Opportunistic AS aggregation clusters
       3.6.3.  Generalized Inter-domain Virtual Aggregation
   4.  Performance Benefits
   5.  Normative References
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Introduction

   The basic idea behind Mapped-BGP is quite simple.  Rather than
   distribute routes to reachable prefixes, BGP distributes routes to
   tunnel endpoints (TE) and distributes maps that associate reachable
   prefixes with TEs.  Otherwise, BGP runs in much the same way that it
   runs today.  Indeed, in Mapped-BGP it is possible to transpose TE
   routes and their associated maps back into routes to prefixes.  This
   transposition is used to allow ISPs running Mapped-BGP to interface
   with legacy ISPs that do not run Mapped-BGP.  The transposition also
   allows us to reuse the security mechanisms of BGP, especially prefix
   filtering.

   The maps in Mapped-BGP are, for the most part, policy-free.  By this
   we mean that the types of policies normally applied to routes (the
   seven-step best path computation, the assignment of weights and local
   preferences, the addition or deletion of attributes including path
   prepending, and decisions about where to advertise routes) are not
   applied to maps.  Rather, maps are blindly distributed along the
   routes traced out by their associated TEs.  Since the majority of
   prefixes would be distributed by maps rather than by routes, the cost
   of processing BGP updates would be significantly decreased.  Note
   that RIB and FIB size would not be reduced with this approach.
   However, FIB size can be reduced with FIB suppression associated with
   Virtual Aggregation [I-D.francis-idr-intra-va], and we doubt that RIB
   size per se is a serious bottleneck in BGP (this needs to be
   validated).

   A natural question to ask is: if policies are not being applied to
   maps, how are BGP policies applied to prefixes advertised in maps?
   Since maps are distributed along the reverse best paths of their
   associated TEs, policies that apply to the TE routes are
   automatically grandfathered onto the map prefixes.
   This works well for policies that are used to control which routes
   pass through an ISP, for instance to configure valley-free routing.

   This does not work as well, however, for policies used for load
   balance across ASes.  This is because BGP load balancing mechanisms
   operate at the granularity of routes, which in the absence of maps
   operate at the granularity of prefixes.  With Mapped-BGP, a TE-route
   originated by an ISP will apply to all of the ISP's prefixes.  In
   other words, the ISP only originates a single route, and so there
   isn't enough route granularity on which to apply BGP load balancing
   policies.

   To make up for this shortcoming, Mapped-BGP introduces a parameter
   called a Tunnel Endpoint Discriminator (TED).  This is a
   parameterless value that a remote router uses to decide the relative
   probability with which it will use the different TEs that apply to a
   given prefix.  TEDs allow both multi-homed sites and lower-tier
   multi-homed ISPs to load balance at relatively fine granularity.

   The tunnels in Mapped-BGP provide a simple mechanism to produce
   virtual topologies across ASes.  If used in concert with aggregatable
   address assignment policies like geographical addressing, Mapped-BGP
   provides significant new opportunities for aggregation without the
   need for careful physical topology management across ISPs (for
   instance within a geographical area).

2.  Terms and concepts

   FIB-install and FIB-suppress:  These two terms refer to the act of
      installing a route into the FIB, and not installing a route into
      the FIB, respectively.  Note that the mechanism for not installing
      a route into the FIB may be simply not putting it into the routing
      table (defined below).

   Head-end and tail-end:  Head-end generally refers to the start of the
      tunnel.  For instance, the head-end router is the router that
      starts the tunnel, the head-end ISP is the ISP that contains the
      head-end router, and so on.
      Tail-end generally refers to the terminating point of the tunnel.
      The term tunnel endpoint (TE) is generally synonymous with
      tail-end.

   Legacy:  Refers to something that does not operate Mapped-BGP (for
      instance, a legacy AS or ISP, a legacy router, etc.).  Anything
      that is not labeled as legacy is assumed to be operating
      Mapped-BGP.

   Map:  The term "map" refers to a single prefix-TE mapping.  It may
      also refer to the "map attribute" in a BGP update.  Note that in
      general, however, a map attribute will contain multiple individual
      maps.

   Routing Table:  The term Routing Table is defined here the same way
      as in Section 3.2 of RFC4271: "Routing information that the BGP
      speaker uses to forward packets (or to construct the forwarding
      table used for packet forwarding) is maintained in the Routing
      Table."  As such, FIB Suppression can be achieved by not
      installing a route into the Routing Table.

   Tunnel endpoint (TE):  The term TE typically refers to the router or
      AS that detunnels the packet.  The term can also refer to the TE
      address.  Tunnel endpoints (TE) should be anycasted across some or
      all routers in the AS.

   TE-route:  This is a normal BGP route whose NLRI contains one or more
      TEs.

   TE-block, TE-subblock, and Path Splitting:  Typically a TE will be
      defined by a CIDR block of addresses (as opposed to a single
      address).  This is done to enable upstream load balance through a
      mechanism called Path Splitting (see Section 3.5), whereby the
      route for the entire TE-block is split into multiple routes, each
      to a sub-block within the block.  These routes are advertised to
      different neighbors, giving upstream ASes multiple paths to choose
      from to get to a given destination prefix.  The term TE-block
      refers to the entire block of addresses that comprise a TE, and
      the term TE-subblock refers to a sub-block within the block.

   Tunnel endpoint discriminator (TED):  A map may have a TED associated
      with it for the purpose of incoming load balancing.  This is used
      when an AS is multi-homed to multiple providers, and each provider
      serves a TE.  A split also has TEDs associated with it, which are
      used by an ISP to load balance incoming traffic among its AS
      peering links.  The TED is a parameterless indication of the
      proportion of traffic that should be sent to each TE or AS-link.
      Note that head-end ISPs are not required to honor the TED.  Note
      also that TED info in maps is lost when maps are aggregated.

3.  Description of Mapped-BGP

3.1.  Structure of new attributes

   There are two new attributes associated with Mapped-BGP.  One is the
   "map", which is used to associate a reachable address prefix with a
   Tunnel Endpoint Block (TE-block).  The other is the "split", which is
   used to associate a TED value with segments of multiple paths to TEs.

   The contents of the map attribute are as follows:

      [TE-list]
      List of one or more address targets, each consisting of:
         [prefix], action, [TED]

   where:

      TE      = a CIDR block of one or more addresses
      TE-list = a list of TEs
      action  = add, remove
      prefix  = a CIDR block of one or more addresses
      TED     = value between 0 - 255 (or smaller range)
      []      = optional (note that either the TE-list, or the
                prefix, or both must be present)

   The format of the split attribute is:

      Downstream AS
      List of two or more Upstream ASes, each consisting of:
         Upstream AS
         TED

3.2.  Map-RIB data structure

   We assume a new data structure called the map-RIB.  For each eBGP
   neighbor, there is conceptually a map-RIB-in and a map-RIB-out, which
   contain the maps received from and sent to the neighbor respectively.
   Normally the same map (i.e. same TE, TED, and action) will have been
   received from each peer and sent to each peer.
   During a change (a map going from add to remove, or a change in TED),
   however, there will be a brief convergence period during which the
   maps received from different peers will differ.  The map-RIB data
   structure can be substantially compressed to exploit the fact that
   the maps are normally identical.  In other words, most map-RIB
   entries can simply have a flag indicating that all received and sent
   maps are the same, and avoid listing them explicitly.

3.3.  Tunnel Endpoints (TE)

   TEs are typically anycasted across multiple routers, both for the
   sake of resilience and to allow for aggregation.  When a TE is
   associated with a single AS, then all routers in the AS will be
   anycasted with the TE address.  A TE may be associated with multiple
   ASes (e.g., for aggregation), in which case all routers in all the
   ASes will be anycasted.  It may also be possible to assign a TE to a
   metro or geographical area.  In this case, the TE address is
   anycasted across at least all routers within the area, but not
   necessarily all routers in all ASes that have a presence in the area.

   A "TE" can in fact be composed of a CIDR block.  In other words, a
   group of addresses can all act as the TE (i.e. all cause the router
   to detunnel the packet).  From the point of view of the TE router,
   all addresses in the block are treated identically: it doesn't matter
   which TE address was used to tunnel a received packet.

   The purpose of allowing a block of addresses to be a TE is to allow
   for load balancing.  Different sub-blocks within a TE-block may
   follow different paths to the TE (path splitting), thus allowing the
   head-end router to select a path by virtue of selecting different TE
   addresses within the block.  This path selection can be loosely
   influenced by downstream ASes through the use of TEDs.

   Because a router may participate in multiple levels of aggregation
   (e.g., AS-level and geographical-level), a given router may advertise
   multiple TE-blocks in its maps.  There should not, however, be more
   than two or at most three TE-blocks in a given map.
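
   The relationship between a TE-block and its sub-blocks can be
   illustrated with a minimal sketch.  It assumes the /28 block that the
   example in Section 3.4.1 calls TEw; the helper names (is_te_address,
   split_te_block) are hypothetical and are not part of Mapped-BGP.

```python
import ipaddress

# Assumed TE-block: the TEw block used in the Section 3.4.1 example.
TE_BLOCK = ipaddress.ip_network("40.1.1.0/28")


def is_te_address(addr, te_block=TE_BLOCK):
    """Return True if a packet tunneled to `addr` would be detunneled.

    Every address in the TE-block is treated identically by the TE router.
    """
    return ipaddress.ip_address(addr) in te_block


def split_te_block(te_block, extra_bits=1):
    """Split a TE-block into 2**extra_bits equally sized TE-subblocks."""
    return list(te_block.subnets(prefixlen_diff=extra_bits))


# Two sub-blocks, each of which could be advertised to a different neighbor.
subblocks = split_te_block(TE_BLOCK)
for sub in subblocks:
    print(sub)
```

   A head-end router that tunnels to an address in one sub-block rather
   than the other would, under path splitting, be selecting one of the
   two advertised paths.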

3.4.  Rules for advertising maps

3.4.1.  Rules for initiating a map

   An ISP will initiate a map on behalf of its stub-AS customers.  This
   is illustrated in the following example.  It shows a network of stub
   ASes, A, B, C, and D, and ISPs (all other ASes).  The prefixes
   associated with the stub ASes are as shown.  B, C, and D are
   single-homed customers of W, and A is a multihomed customer of W and
   Z.  W is a customer of X and Y.

                        J
                       / \
                      /   \
              I------X     Y
             /        \   /
            /          \ /
           Z            W  TEw=40.1.1.0/28
            \           |
             \   ----------------------
              \ /       |      |      |
    Pa=20.1/16 A        B      C      D Pd=30.1/16
                  Pb=20.2/16  Pc=20.3/16

   Given this configuration, AS W would initiate the following updates
   to non-legacy ASes:

      Route: AS-path=(W), NLRI=(40.1.1.0/28)
      Map:   TE=(40.1.1.0/28),
             AT=(<20.0/14>, <30.1/16>, <20.1/16,TED=20>)

   which for the sake of readability we can rewrite as:

      Route: AS-path=(W), NLRI=(TEw)
      Map:   TE=(TEw), AT=(<Pw-agg>, <Pd>, <Pa,TED=TEDaw>)

   where Pw-agg is an aggregate consisting of Pa, Pb, and Pc.

   The first update (the route) is a normal BGP advertisement with
   AS-path = W and NLRI=TEw=40.1.1.0/28 (other attributes left off for
   simplicity).  We call this update a TE-route, since it is a route to
   a TE.

   The second update is a map.  The TE is TEw, which is the same as the
   NLRI of the TE-route.  Also included are three address targets.  In
   each, the "action" is assumed to be "add", and is not otherwise
   shown.  The first, Pw-agg=20.0/14, is the aggregate of Pa=20.1/16,
   Pb=20.2/16, and Pc=20.3/16.  The second is Pd=30.1/16, which is not
   aggregatable and so is given separately.  Note that D is not
   multihomed, and so has no need for a TED.  The third is for A.  Even
   though A's prefix falls within the 20.0/14 aggregate, it is also
   individually listed in order to convey its TED, value TEDaw=20.
   AS Z would also advertise a separate map with another TED value, thus
   giving A some control over the volume of incoming traffic on its two
   access links (see Section 3.5).

   There are three ways that AS W could have learned the value TEDaw.
   One is to have statically configured it.  A second is for A to convey
   it via BGP in an Extended Communities Attribute.  This would
   especially be useful if A is running legacy BGP.  The third would be
   for A to advertise a map to W, but keeping the TE field as NULL:

      Map:   TE=(NULL), AT=(<Pa,TED=TEDaw>)

   This would effectively signal to W that A wants to have its prefix Pa
   advertised individually with the associated TED.

3.4.1.1.  More flexible aggregation

   Nominally it appears in the above example that we are doing the same
   amount of aggregation as with legacy BGP today.  This is because Pa
   is advertised individually because of multihoming.  Section 3.6
   describes how Mapped-BGP provides additional opportunities for
   aggregation.

3.4.2.  Transposing Maps and Routes

   A key aspect of Mapped-BGP is that the combination of route+map can
   be transposed into a route.  This is important for the simple
   pragmatic reason that it allows an AS to speak BGP with legacy ASes.
   It is also important because it allows certain existing BGP
   mechanisms that operate on routes, like filtering incoming updates,
   to be applied to route+maps.

   As an example of this transposition, the updates advertised by W in
   the above example can be transposed into the following route:

      Route: AS-path=(W), NLRI=(40.1.1.0/28, 20.0/14, 30.1/16, 20.1/16)

   or equivalently:

      Route: AS-path=(W), NLRI=(TEw, Pw-agg, Pd, Pa)

   This is possible because the prefixes in the map (20.0/14, 30.1/16,
   20.1/16) can be associated to the AS-path in the route (W) by virtue
   of TEw matching the NLRI of the TE-route.

   To continue the example, the BGP updates advertised by AS X would be:

      Route: AS-path=(W,X), NLRI=(TEw)
      Map:   TE=(TEw), AT=(<Pw-agg>, <Pd>, <Pa,TED=TEDaw>)

   AS X adds itself to the route, but does not change the map.  Indeed
   maps never change as they propagate through the Internet (though they
   can be dropped during aggregation).

   These two updates can be transposed into:

      Route: AS-path=(W,X), NLRI=(TEw, Pw-agg, Pd, Pa)

   In fact, if AS I were a legacy AS, then AS X would give this update
   to AS I.  This allows legacy ASes to coexist with updated ASes.  This
   document does not address the issue of having legacy routers and
   Mapped-BGP routers coexist within the same AS.

3.4.3.  Authenticating updates

   One of the challenges of any tunneled routing system is that of
   authenticating maps.  Mapped-BGP exploits the fact that route+map is
   transposable to route to achieve authentication equivalent to that of
   BGP, and indeed to mostly reuse the authentication mechanisms and
   configuration of BGP.

   Conceptually authentication can be seen as operating as follows:
   When a router receives an eBGP route+map, it converts it to the
   equivalent route.  It then applies its existing filtering mechanisms
   to the route.  If the route is acceptable, then the route+map is also
   acceptable.  If the route is not acceptable, then the route+map is
   likewise not accepted.

3.4.4.  Longest-prefix map selection rules and aggregation

   Mapped-BGP uses longest-prefix selection on maps in much the same way
   that legacy BGP uses longest-prefix selection on routes.  In the
   following discussion, assume that Pl and Ps are two prefixes that
   overlap.  Pl has a longer mask (it is the more specific prefix), and
   Ps has a shorter mask (i.e. Pl falls within Ps).

   If an AS receives maps for Pl and Ps with different TEs, then the Pl
   map must be used to route packets to addresses within Pl.  This is
   similar to legacy BGP, where if an AS has different routes to Pl and
   Ps, the route to Pl must be used.
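
   The different-TE rule above amounts to a longest-prefix match over
   the received maps.  The following is a minimal sketch under stated
   assumptions: the table layout (MAPS) and the function name
   (select_tes) are hypothetical, and the short-form prefixes of the
   running example are written out as full dotted-quad CIDR blocks.

```python
import ipaddress

# Hypothetical map table at a head-end AS: prefix -> set of TE names.
MAPS = {
    ipaddress.ip_network("20.0.0.0/14"): {"TEw"},          # Pw-agg, via W
    ipaddress.ip_network("20.1.0.0/16"): {"TEw", "TEz"},   # Pa, via W and Z
    ipaddress.ip_network("30.1.0.0/16"): {"TEw"},          # Pd, via W
}


def select_tes(dest, maps=MAPS):
    """Return (prefix, TEs) for the most specific map covering `dest`.

    Returns None when no map covers the destination.
    """
    addr = ipaddress.ip_address(dest)
    matches = [prefix for prefix in maps if addr in prefix]
    if not matches:
        return None
    best = max(matches, key=lambda prefix: prefix.prefixlen)
    return best, maps[best]


# An address inside Pa matches both Pw-agg and Pa; the longer prefix wins.
print(select_tes("20.1.2.3"))
# An address in Pb is covered only by the aggregate.
print(select_tes("20.2.0.1"))
```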
   The reason in both legacy BGP and Mapped-BGP is the same: it is not
   clear whether addresses in Pl are reachable in the AS originating the
   route to Ps.

   If the maps for Pl and Ps have the same TE, then either may be used
   to route packets within Pl.  However, in this case there is an
   important difference with legacy BGP.  In legacy BGP, if Ps is
   selected (i.e. aggregation takes place), then Ps is advertised
   upstream and upstream ASes never learn about Pl.  With Mapped-BGP,
   all maps may, and typically will, still be advertised upstream, and
   upstream ASes may in fact make a different choice.

   Why this matters can be illustrated using the example network above.
   With legacy BGP, AS X would receive the following two routes from AS
   W: Pa=20.1/16 and Pw-agg=20.0/14.  If AS X decides to aggregate these
   two into the single route Pw-agg, then AS I will receive Pw-agg from
   X, and Pa from Z.  Now, AS I has no choice but to accept the route to
   Pa via Z, because it does not know that Pa is reachable via W.  On
   the other hand, if AS X chooses to forward both routes to AS I, then
   AS I receives from X Pa=20.1/16 and Pw-agg=20.0/14, and from Z a
   route to Pa.  Now I may choose between the route via Z and the route
   via W, but once the choice is made, ASes upstream of I are forced
   into the same choice.

   By contrast, with Mapped-BGP, all of the maps (Pa and Pw-agg via W,
   and Pa via Z) would be propagated, along with TE-routes to Z and W.
   In this way, AS I can choose one route, and ASes upstream of I can
   choose different routes.  Furthermore, this choice can include
   installing only the aggregate prefix Pw-agg into router FIBs if so
   desired.  With legacy BGP, this choice often doesn't exist.  Indeed,
   different routers in AS I could use different TEs (TEz or TEw), or
   even multipath to both TEs (that is, use both TEs simultaneously).

   Of course, the cost of doing this is that both of A's maps must be
   propagated everywhere.
   We defend this with two arguments here.  First, the cost of
   propagating a map is expected to be relatively small.  If an AS
   chooses to load only the aggregate in its FIB, then the cost of the
   unused maps is limited to receiving them, deciding to suppress them
   from the FIB, storing them in the RIB, and passing them on.  Though
   we need to run benchmarks to measure this cost, intuitively we
   believe that this is significantly less expensive than processing a
   full-blown route and entering it in the FIB.

   Second, ASes still have the option of dropping the maps altogether if
   they can't deal with them.  Doing so results in the same sorts of
   inflexibility we see today in BGP, but nevertheless the option
   exists.  For instance, if in the above example AS X decided to simply
   drop the map for Pa altogether, then AS I would receive the aggregate
   map Pw-agg from X, and the Pa map from Z.  AS I would have to choose
   the route via Z here, because it would not be able to tell that A is
   connected to W.  The bottom line is that there is considerably more
   flexibility with Mapped-BGP in making the tradeoff between overhead
   and routing granularity.

   More broadly, this example illustrates one of the core design
   principles of Mapped-BGP: by both making the processing of routing
   information cheaper, and providing considerable flexibility as to
   what to do with that routing information, we have the option of
   propagating much more detailed information in the global routing
   system than we are able to do today.  At the same time, individual
   ISPs have the option of ignoring the details if they so choose, and
   are less constrained by the decisions made by downstream ISPs.

   This principle does in fact result in a shift of power among ASes.
   Today, upstream ASes are held hostage to the decisions of downstream
   ASes.
   In Mapped-BGP, however, downstream ASes lose some control of packet
   forwarding (or, at least, that control becomes more expensive to
   achieve).  For instance, in the above example topology, let's imagine
   that AS X decides that AS B is under attack and wants to drop (or
   identify and scrub) those packets.  Unfortunately all packets to B
   are tunneled to AS W along with packets to A, C, and D.  If X wants
   to distinguish these, it must look deeper into all packets going to
   W.  We believe that this shift in power is probably good overall, but
   more thought and experimentation is required to understand this.

3.4.5.  Changing maps

   There are three things that can change on a map: the set of TEs it is
   associated with, its prefix, and its TED.  There are two actions
   associated with maps: add and remove.  Changes in TE or prefix are
   done using add and remove.  For instance, in the above figure, if the
   link between A and W goes down, W would advertise:

      Map:   TE=TEw, AT=(<Pa,remove>)

   Because A is multihomed, this update causes ASes to use the TE
   associated with Z exclusively.  In other words, this effectively
   disassociates Pa from TEw.  The TED does not need to be included with
   a remove update, nor does the route to the TE.  If the map is
   subsequently added again (because the link comes back up), then the
   TED would of course have to be included, but the TE-route would still
   not have to be repeated.

   If the link between D and W goes down, then W would advertise:

      Map:   TE=TEw, AT=(<Pd,remove>)

   Since this is the only TE associated with Pd, this update would
   effectively remove Pd from routers everywhere.  It is worth noting
   that W can tell that D is single-homed because W does not receive any
   other maps associated with D.  Because of this, W might very
   reasonably decide not to advertise D's unreachability, thus saving
   some control processing overhead on the rest of the Internet.

   If the link between W and B goes down, then W does not need to
   advertise anything via eBGP, because B's prefix is aggregated and B
   is not multihomed.

   TED values may be modified from time to time even though no other
   aspect of the map (its TE or add/remove status) changes.  TED changes
   are advertised by simply repeating the complete map with the new TED
   value.  It is worth noting that if a map advertises only a TED
   change, other ASes do not need to process the change right away.  For
   instance, they could wait until they recompute traffic engineering.

3.4.6.  Propagating and activating maps

   Maps are similar to link-state updates in that each effectively
   describes a "link" somewhere in the Internet (i.e. that an AS with a
   given prefix is attached to an AS with another given prefix).  As
   such, as with link-state updates, maps have the potential to be
   interpreted out of order.  For example, an ISP might advertise an
   "add" map after a "remove" map, but the "remove" could well be
   received after the "add" at some remote ISP, thus installing the
   wrong state.

   OSPF solves this problem using sequence numbers and a set of rules on
   how to interpret them.  Mapped-BGP can exploit the trees generated by
   routes, combined with the fact that BGP speakers send updates in
   order, to solve the same problem without the need for sequence
   numbers.

   Specifically, what Mapped-BGP does is to require that maps are
   distributed along the trees created by routes.  This prevents old
   maps from looping around on themselves and incorrectly voiding more
   recent updates.  While an older map heard from one AS neighbor may
   temporarily be used in preference to a newer map heard from another
   AS neighbor, the fact that maps must follow the tree (in order) means
   that eventually the newer update will overtake the older one.  (As of
   this writing, we don't have a formal proof of this.)

   In particular, maps are distributed according to the following rules:

   1.  Each router remembers, for every map prefix, the latest map
       received from every eBGP peer, and the latest map sent to every
       eBGP peer.  Note that most of the time these will all be the
       same, and so the data structures can be compressed to exploit
       this.

   2.  The map used by an AS, and advertised to other ASes, is that
       received from the next-hop AS on the associated TE-route.  If the
       next-hop AS changes a map, then this changed map is used and
       advertised.  If a different next-hop AS is selected, then the
       maps advertised by that AS are used.  If this causes any maps to
       change, then the changes are used and advertised.

   3.  A map is never advertised to the next-hop AS on the TE-route.

3.4.6.1.  Peering sessions and maps

   As an optimization to speed up the establishment of a peering session
   between eBGP speakers, we exploit the fact that maps are usually the
   same for all peers, and "guess" the value of a map before a peer
   advertises it.  Specifically, when a peering session first comes up,
   the peers exchange all routes before exchanging any maps.  When a
   peer learns a route (probably a TE-route) and selects it as a
   next-hop, it immediately uses any maps associated with the TE.  In
   other words, it continues to use whatever TEs it was already using.
   Subsequently, when the peer starts advertising maps, the BGP speaker
   responds accordingly.

3.4.7.  Changing TE-route

   Routes, including TE-routes, are handled as with normal BGP.  They
   are handled independently of maps.  In other words, if a BGP speaker
   advertises a change of route to its peer, it does not need to
   re-advertise the associated maps.  Assume, for instance, that AS J
   uses AS X as the next hop to TEw, and the link between X and W goes
   down.  X will withdraw the route to TEw, but does not need to
   withdraw the maps with TE=TEw as the TE.
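
   The independence of TE-routes and maps in this scenario can be
   sketched as follows.  This is an illustrative model of AS J's state
   only: the table names (te_route_next_hops, map_rib_in) and the
   functions are hypothetical, and the sketch assumes that both X and Y
   have already advertised the same maps to J.

```python
# Hypothetical state at AS J: TE-routes and maps live in separate tables.
te_route_next_hops = {"TEw": ["X", "Y"]}   # neighbors with a route to TEw, best first
map_rib_in = {                             # maps learned from each neighbor
    "X": {"Pw-agg": "TEw", "Pd": "TEw", "Pa": "TEw"},
    "Y": {"Pw-agg": "TEw", "Pd": "TEw", "Pa": "TEw"},
}


def active_maps(te):
    """Maps in use for `te`: those received from the next-hop AS on its TE-route."""
    next_hops = te_route_next_hops.get(te, [])
    if not next_hops:
        return {}                          # no TE-route, so no usable maps
    learned = map_rib_in.get(next_hops[0], {})
    return {prefix: t for prefix, t in learned.items() if t == te}


def withdraw_te_route(te, neighbor):
    """Handle a TE-route withdrawal; the map tables are deliberately untouched."""
    if neighbor in te_route_next_hops.get(te, []):
        te_route_next_hops[te].remove(neighbor)


before = active_maps("TEw")
withdraw_te_route("TEw", "X")              # the link between X and W goes down
after = active_maps("TEw")                 # now taken from Y's maps
print(before == after)
```

   Only the route table changes on withdrawal, which is why no map
   updates are needed in this sketch.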
   These maps will have been previously advertised by Y, and so the
   alternate path through Y can be used right away.  When the link
   between X and W is restored, then X need only advertise the route to
   W again; the previously advertised maps are still valid and can be
   used immediately.

3.5.  Load Balancing in Mapped-BGP

   A cornerstone of the performance benefits of Mapped-BGP is the fact
   that there are no policies associated with maps per se (other than
   the fact that they need to be source filtered to prevent hijacking,
   just as routes are source filtered today).  In other words, policies
   are applied to routes, but not to maps.  This raises the obvious
   question: which policies are we giving up because of this, and how do
   we get them back?

   We can divide policies into two types: policies that act at the
   granularity of ASes, and those that act at the granularity of
   prefixes.  AS-level policies are preserved in Mapped-BGP, because
   these policies can be applied to TE-routes (of which there is,
   roughly, one per AS).  Indeed, AS-granularity policies become cheaper
   to enact, because there are fewer routes to deal with.  Examples of
   AS-based policies are: prefer routes to customers over other routes;
   prefer routes to peers over routes to providers; do not export routes
   received from peers to non-customers.

   Most policies that act at the granularity of prefixes are for the
   purpose of traffic engineering.  There are a number of such examples.
   For instance, one ISP may give per-prefix MEDs to its neighbor in
   order to influence how packets enter each peering point.  This may be
   done either for the purpose of load balance, or in order to minimize
   the distance that packets need to travel within the receiving ISP.
   Likewise an ISP might set loc-prefs on a per-prefix basis to
   influence the outgoing load on each peering point.
   A multi-homed site might deaggregate its prefix, and then use
   community attributes offered by its provider to do per-prefix path
   prepending or route filtering to influence the load on its incoming
   access links.

   Mapped-BGP improves upon existing inter-domain traffic engineering
   through two mechanisms: the Tunnel Endpoint Discriminator (TED), and
   "path splitting".  These mechanisms are simpler, more scalable, and
   expected to be more effective than the current set of BGP mechanisms.
   It is hoped that the use of this simpler approach would simplify BGP
   configuration overall.

   Before discussing these mechanisms, we should point out the obvious,
   which is that traffic engineering requirements necessarily put ASes
   in conflict with each other.  A simple example of this is illustrated
   below.  Say that A wants to send half of its traffic to B and half to
   C, and D wants to receive 25% of its traffic from B and 75% from C.
   It may not be possible to satisfy both A's and D's requirements.

                         A
                        / \
                       /   \
                      B     C
                       \   /
                        \ /
                         D

   Mapped-BGP's approach to this conflict is to provide a mechanism
   whereby the receiver can convey to the sender what its traffic
   engineering needs are, and the sender can honor or ignore the
   receiver's wishes.  This is similar in spirit to how MEDs work in
   legacy BGP.  Other legacy mechanisms, however, like path prepending,
   attempt to "force" the sender into honoring the receiver's incoming
   traffic engineering requirements by manipulating its next-hop
   selection algorithm.

3.5.1.  Incoming Load Balance at Sites

   Mapped-BGP's TED mechanism has already been partially described.  In
   the example of Section 3.4.1, AS A conveys TED values to ASes W and
   Z, which in turn attach these TED values to the corresponding maps:

      From W:  Map: TE=(TEw), AT=(..., <Pa,TED=TEDaw>)
      From Z:  Map: TE=(TEz), AT=(..., <Pa,TED=TEDaz>)

   Assuming that these TEDs aren't suppressed through some aggregation
   somewhere, they are conveyed to all ASes in the Internet.
The TEDs are parameterless. They are interpreted at each AS as an indication that more or less traffic should be directed to the associated TE. This interpretation, however, is entirely up to the head-end AS.

AS A uses TED values to control the volume of incoming traffic from Z and W as follows. AS A sets some initial TED values, say TEDaw=50 and TEDaz=50, and over some period of time (days or a couple of weeks) measures the incoming volume of traffic. If the volume is not as desired, for instance too much traffic from W and not enough from Z, then AS A can modify the TED values, say to TEDaw=40 and TEDaz=60. Over time, AS A can determine what the appropriate values are, as well as gain a sense of how future changes in TED values are likely to affect traffic load.

How a head-end AS (one that transmits traffic to addresses in Pa) interprets the TED values is up to it. The TED values may or may not have any effect on the AS's traffic engineering decisions. The basic idea here is that if an AS cannot satisfy its own traffic engineering requirements while honoring the TED values, then it will ignore the TED values. If on the other hand the head-end AS can both satisfy its traffic engineering requirements and honor the TED values, it will do so. The hope is that enough head-end ASes will be able to honor the TED values to allow receiving ASes to control their incoming traffic.

While exactly how an AS determines how to process TEDs is for further study, we can imagine a sequence of steps whereby the AS first determines which map+routes must ignore TEDs. For instance, the AS might be doing hot-potato routing, and there are simply some destinations where one TE is more hot-potato than another and therefore preferred. Assuming that there are remaining destinations for which the choice of TE is roughly equivalent, the AS can then look at the TED and select on that basis.
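The selection step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not part of the Mapped-BGP design: the function names (ted_probabilities, choose_te), the normalization of TED values into probabilities, and the uniform fallback when no TED is positive are all hypothetical. The "eligible" argument stands for the TEs that remain roughly equivalent after the head-end AS has applied its own traffic engineering constraints, and is assumed to be non-empty.

```python
import random


def ted_probabilities(teds):
    """Normalize TED values into per-TE selection probabilities."""
    total = sum(teds.values())
    if total <= 0:
        # Assumption: with no usable TEDs, treat all TEs as equal.
        return {te: 1.0 / len(teds) for te in teds}
    return {te: value / total for te, value in teds.items()}


def choose_te(teds, eligible, rng):
    """Pick one TE among those the head-end AS considers equivalent.

    teds     -- mapping of TE name to the TED value attached to its map
    eligible -- non-empty list of TEs left after the AS's own policy
    rng      -- a random.Random instance (hypothetical design choice)
    """
    candidates = {te: teds[te] for te in eligible}
    if len(candidates) == 1:
        # Local policy (e.g. hot-potato) already decided; TEDs are ignored.
        return next(iter(candidates))
    probs = ted_probabilities(candidates)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


# TED values from the running example: TEDaw=40 (via TEw), TEDaz=60 (via TEz).
example_teds = {"TEw": 40, "TEz": 60}
example_choice = choose_te(example_teds, ["TEw", "TEz"], random.Random())
```

A head-end AS could equally apply such a choice per router or per destination prefix; the sketch takes no position on that.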
For instance, if as in the previous example the TEDs are TEDaw=40 and TEDaz=60, the AS might select the TE associated with TEDaz (TEz) with probability 0.6, and the TE associated with TEDaw (TEw) with probability 0.4 (or each router could make its own selection on this basis).

3.5.2.  Incoming Load Balance at Lower-tier ISPs

The example above shows how a multihomed stub AS site can use TEDs to do incoming traffic engineering across multiple ISPs. This is not the only case, however, where incoming traffic engineering across different ASes is required. For instance, using the same figure, assume that AS W is a lower-tier ISP that is a customer of provider ISPs X and Y, and that wishes to balance traffic arriving from X and Y.

This is made possible in Mapped-BGP using path splitting. Recall that TEs actually consist of a block of addresses. Path splitting operates by splitting a TE-block into multiple sub-blocks (as many as there are links to balance over), and advertising each sub-block to a different neighbor AS. This creates multiple paths that can be used to reach the same TE. By associating a destination prefix with one path or another, a head-end AS can influence which path is used, and therefore the volume of traffic on a given link.

For example, assume that AS W wants to control the incoming volume of traffic from ASes X and Y. Recall that in the earlier example (without path splitting), AS W advertised the following:

   Route: AS-path=(W), NLRI=(TEw=40.1.1.0/28)
   Map:   TE=(TEw), AT=(, , )

To do path splitting, what AS W instead advertises is the following:

   Route To X: AS-path=(W), NLRI=(TEwx=40.1.1.0/29)
   Route To Y: AS-path=(W), NLRI=(TEwy=40.1.1.8/29)
   Map:        TE=(TEw), AT=(, , )
   Split:      DS-AS=W, US-AS=(, )

   where: DS-AS = Downstream AS
          US-AS = Upstream AS

The first thing to note about this is that the original map goes unchanged.
The second thing to note is that AS W has split its TE-block in half, and is advertising a separate route to each sub-block (shown as TEwx and TEwy). To be clear, the map still associates the reachable prefixes with the full original TE-block, but here that block is reachable via two paths. What's more, AS W advertises one such route to X only, and the other to Y only. This effectively gives upstream ASes some path control. If they select a TE address within the TEwx sub-block, then packets will get to W via X. Likewise, if they select a TE address within the TEwy sub-block, then packets will get to W via Y.

Finally, W has generated a split attribute in order to convey the relative volume that should enter via the two neighbors. The Downstream-AS (DS-AS) is W itself. There are two records for the Upstream ASes (US-AS), one for X and one for Y, each associated with a separate TED (TEDwx and TEDwy respectively).

Given this, consider the behavior of AS J. J will receive a route to TEwx from X, and to TEwy from Y. J will also propagate these routes to its neighbors. Now J, as a head-end AS, can choose between these two routes, on a router-by-router or even packet-by-packet basis if it wishes, to send packets to destinations in A, B, C, and D. Mechanistically, it makes this choice by selecting a TE address from either the TEwx or TEwy sub-block. Assuming that J is willing to honor W's TEDs in making the choice, it would send more or less traffic along each route according to the value of the TEDs. Indeed, in this particular example, J can choose first how to send traffic to A (i.e. via TEz or TEw). Of the traffic that J chooses to send to W, it can then choose how to split the traffic between X and Y.

Now consider the behavior of AS I. I will receive two TE-routes from X, one to TEwx with AS-path X-W, and the other to TEwy with AS-path X-J-Y-W.
AS I, as a head-end AS, can strictly speaking choose between these two paths based on which TE address it uses to tunnel packets to A, B, C, and D. In this particular example, the choice does not affect I's traffic (it all goes to X either way). The choice does of course affect W's incoming load balance, as well as the length of the paths and the amount of traffic load at J and Y. In this case, I should almost certainly favor efficiency (shorter path) over W's load-balancing needs. This is especially true considering that W may in any event satisfy its load balancing requirements even if I does send all packets on the shortest path, because other ASes can reasonably choose the path via Y.

Note that this approach is frugal in its overhead. W wants to balance two peering links, and so creates exactly two routes. Contrast this with today's situation, where an AS may need to deaggregate a prefix multiple times in order to get the granularity needed to effectively load balance. What's more, the multiple routes can be aggregated back together by any AS. This might well be done, for instance, by an AS that is relatively far from W, and that sends very little traffic to W. In doing this, the aggregating AS would also drop the split attribute.

3.5.3.  Multi-exit discrimination with Mapped-BGP

The above paragraphs describe how an AS can influence the volume of traffic entering from different ISPs. It is also important to be able to influence how traffic enters at multiple peering points with the same neighbor ISP. Today MEDs are used for this, where the MED value is set on a per-prefix basis.

In Mapped-BGP, an AS L will send two kinds of packets to its neighbor AS K: packets that are detunneled at K, and packets that are not detunneled at K. AS K can of course set MEDs on the TE-routes for packets that it does not detunnel. Normally, however, an AS K only advertises one TE-route per neighbor AS for its own TE.
As a result, there is no basis for discriminating packets addressed to K's TE.

One way to approach this problem might be to use TE-route splitting here as well. However, this approach leads to either a potentially large number of split TE-routes, or a large number of additional maps. (Explanation of why left out for now.) In general it seems inappropriate to burden the rest of the Internet for a routing matter that is strictly between two neighboring ASes.

As such, the solution is confined to the two neighbors. Specifically, when AS L wishes to let its neighbor AS K dictate which exit it should use on a per-prefix basis, AS L must detunnel packets otherwise destined for K. In other words, the routers in AS L are configured to detunnel packets with K's TE addresses. Once detunneled, routers in L route packets to K based on the inner header destination address.

There are two ways in which L could learn about the MEDs associated with K's prefixes. One is for K to simply advertise these prefixes to L as normal BGP routes with MEDs attached. Another would be for K to attach the MEDs to the maps it sends to L. L would strip these MEDs before forwarding the maps onwards. At this time we don't have a preference for one approach over the other.

Finally, it should be pointed out that Mapped-BGP in general creates more choices for path selection, and therefore more choices for traffic engineering (both outgoing and incoming) compared to legacy BGP. With legacy BGP, an AS makes one next-hop-AS choice per destination prefix. With Mapped-BGP, an AS can make multiple next-hop-AS choices per destination prefix.

3.6.  Aggregation in Mapped-BGP

Mapped-BGP has all of the aggregation features of legacy BGP (physical topological aggregation), as well as new opportunities for aggregation beyond what BGP offers in the form of inter-domain virtual aggregation.
Virtual aggregation can be used in a number of useful ways. It can be used in conjunction with geographical address assignment to provide a realistic way to implement geographical addressing. It can be used opportunistically to allow small groups of ISPs to aggregate some portion of the address space that is already mostly assigned to them. But it can also be used generally as a way of shrinking FIBs in the "core" of the network (i.e. the cores of tier-1 ISPs) and dramatically shrinking RIBs and FIBs everywhere else.

Section 3.4.1 describes how an ISP can aggregate the prefixes of its customers. As with legacy BGP, this is done by an ISP that "owns" the address space that it is aggregating. It is in this sense that Mapped-BGP aggregation is similar to BGP.

Mapped-BGP also has a mechanism that allows for inter-domain virtual aggregation similar in spirit to that described in the intra-domain virtual aggregation draft [I-D.francis-idr-intra-va]. This mechanism, especially when used in conjunction with appropriate address assignment policies, gives Mapped-BGP more opportunities for aggregation than legacy BGP.

The mechanism is this: any router can become an Aggregation Point Router (APR) for a Virtual Prefix (VP). A VP is a prefix that is not topologically aggregatable, and must be bigger (have a shorter mask) than any topological prefix. An APR for a given VP advertises that VP as a route in BGP. This route must be tagged with a transitive attribute that indicates that it is a route for a VP. This allows other routers to know that all sub-prefixes are reachable via the VP route. An APR must FIB-install every sub-prefix within the VP. These sub-prefixes may be reachable natively as routes, or through map tunnels, but they must be reachable.

We can illustrate this using the example topology above (though note that this example doesn't represent the best usage of this feature).
Imagine that AS Z has a single-homed customer AS E with prefix Pe=20.0/16. Note in particular that Pe comes out of the aggregate prefix that W advertises (Pw-agg=20.0/14). With both legacy BGP and Mapped-BGP, W could still advertise the aggregate Pw-agg: Pe would "punch a hole" in the aggregate and routes to Pe would go to Z rather than W. Note, however, that with Mapped-BGP, W receives Z's map for Pe:

   Map: TE=(TEz), AT=(..., )

As a result, W is able to tunnel packets destined to Pe to Z. If AS W is willing to do that on behalf of other ASes (i.e. act as a transit for packets to Pe even when W is not on the normal BGP path to Pe), then it can advertise Pw-agg as a VP. This would allow a remote AS to suppress loading the finer-grained prefix Pe into its FIB. The remote AS could forward all packets to Pw-agg towards W. These packets will either reach W, in which case W would in turn tunnel the packets to Z, or they would reach a router on the path to W that has installed Pe, in which case this router will tunnel the packets to Z.

Obviously in this case this results in added latency for packets that reach W, as well as extra load for W. As such, this mechanism should not be used indiscriminately, and in fact would probably not be used in this particular example. Situations where use of Virtual Aggregation is appropriate are described later on.

In the above example, the route for Pe could be suppressed from the FIB (or equivalently, the "routing table" as defined by BGP), but it is still necessary to keep the maps in the map-RIB. Keeping maps in the map-RIB is unlikely to become a scaling problem. The reason is that it doesn't take very much processing to distribute a map that is not loaded into the FIB. A router needs to determine that a map is valid, and then determine that the map can be FIB-suppressed. But once that is done, the map only needs to be stored and transmitted to neighbors. It seems reasonable to expect a router to be able to store and process millions of FIB-suppressed maps.

As an aside, even though it should be possible to distribute a very large number of FIB-suppressed maps, it is in fact possible to not require many ASes to store the maps at all. This is because in principle the only ASes that have to keep the maps are those that need to distribute the maps to where they need to go. In the above example, ASes I and X need to keep the map for Pe, because they convey it from Z to W. But ASes J and Y can in principle ignore the Pe map altogether. Unfortunately there is no simple way, other than static configuration, to tell an AS whether or not it needs to distribute a given set of maps. On the other hand, in many cases this configuration will be relatively straightforward, as discussed later.

Unless protected against, the use of VPs creates a possibility for transient loops. The problem is illustrated using the figure below. This figure is a blow-up of AS Z from the example just described. It shows two border routers in AS Z (z1 and z2). z1 is connected to the border router e1 in customer AS E, and z2 is connected to the border router i1 in AS I. Note that, as the router with the customer interface, z1 is responsible for advertising maps about Pe (either add or remove). As the router with an ISP interface, z2 will be the TE for packets tunneled to TEz and destined for Pe.

      +--------+   +-------+
      |        |   |       |
      |     z2-+---+-i1    |
      |        |   |       |
      |  z1    |   |       |
      | /      |   |       |
      +-+------+   +-------+
       /  AS Z       AS I
   +---+-+
   |  /  |
   | e1  |
   |     |AS E (with prefix Pe=20.0/16)
   +-----+

Now imagine that the z1-e1 link goes down, and that z1 detects this. We can divide subsequent behavior into three time periods:

1.  Only z1 knows that the link has failed.

2.  z1 and z2 know that the link has failed but no routers outside of AS Z know this.

3.  All routers, in particular routers in AS W, know that the link has failed.

Consider the behavior of z1 during the first two periods. It has a route to prefix Pw-agg. Therefore, if it receives a packet destined to Pe, it would be expected to forward the packet towards AS W. Indeed, routers in AS W might have another route to Pe that z1 is unaware of. On the other hand, AS W may not have another route, and so would simply forward any packets received for Pe back through the tunnel to z2, thus forming a loop. In other words, z1 doesn't know if a loop has formed or not. If a loop has formed, then z1 would not want to forward packets to Pw-agg, and if a loop has not formed, then z1 would want to forward packets to Pw-agg. The same holds for other routers in AS Z.

The behavior we would like is for z1 and z2 to recognize whether any given packet destined to Pe has looped or not. If it has, it should be dropped. If it hasn't, then it should be forwarded to Pw-agg. During period 2, z2 can tell that packets received via the TEz tunnel may be looping and must therefore drop those packets. But during the first period, z2 will forward any received packets towards z1. Therefore, z1 needs to be able to tell whether a packet it receives arrived via z2's tunnel or not. The way to do this is to have z2 tunnel the packets toward e1, for instance over an MPLS LSP with e1 as its target, as described in the Intra-domain Virtual Aggregation draft.

Note also that it is possible to distribute VPs as maps rather than as routes. We currently see no advantage to this, but leave it for further study nonetheless.

In the following paragraphs, we outline various ways in which VPs may be used in Mapped-BGP.

3.6.1.
Geographic or Metro Addressing

There have been many proposals in the past to deploy geographic addressing. The basic idea is simple: if a site accesses the Internet within a particular geographic area, then it is assigned addresses from a prefix dedicated to that area. This makes both multihoming and changing providers easier, because the site is likely to multihome to providers in the same area, or to switch to another provider in the same area. This allows ISPs serving that area to aggregate the area prefix. The criticism of geographic addressing has always stemmed from the fact that existing routing algorithms require physical connectivity within the aggregate topology. There is no regulatory structure in place today to ensure that that physical connectivity is created and maintained.

With Mapped-BGP, the need for intra-area physical connectivity is not as critical. Of course, it is still important, because to the extent that such physical connectivity does not exist, paths will be longer. Mapped-BGP, however, allows for a great deal more flexibility as to how much physical connectivity needs to exist, and provides a scalable re-routing mechanism for when intra-area links do fail.

To operate geographic addressing with Mapped-BGP, an area first needs to be identified, and an address space reserved for it. Call this address space the area-prefix. It is too late to do this for IPv4, but it could certainly be done for IPv6 (indeed, such addresses have already been defined).

Of the ISPs that have a presence in the area, some will be willing to provide general transit and others will only provide service for their customers. These are referred to here as transits and non-transits, and the routers in these ASes are called transit routers and non-transit routers respectively.

Transit routers within the area (i.e. those that provide access for sites within the area) are configured to advertise a VP-route for the area-prefix. These "area routers" must also FIB-install maps for all sub-prefixes within the area-prefix. Note that a given ISP may span multiple areas. Only the routers within a given area need advertise the VP-route and FIB-install the sub-prefixes. Other routers in the ISP but not in the area may FIB-suppress those sub-prefixes. The area routers would separately advertise maps for the individual sub-prefixes, using the TE assigned to the AS (i.e. as normal).

Customers of non-transit routers within the area would still be assigned area-prefix addresses, but non-transit routers would not advertise the VP-route. Rather, they would only advertise maps and TE-routes for their customers' individual prefixes as normal. The following figure illustrates this.

   [Figure: transit ISPs X, Y, and Z in the upper row; non-transit ISPs I, J, K, and L in the lower row; Y, Z, J, K, and L are in the area.]

Here we see three transit ISPs (X, Y, and Z), and four non-transit ISPs (I-L), some of which are in the area and some of which are out of the area, as shown. The non-transit ISPs are customers of the transit ISPs as shown, and peer with each other as shown. Assume that Y and Z span many areas, have multiple peering points with each other, and peer with other ISPs (X and others not shown).

A remote AS would receive the following maps and routes from area ASes:

   Map: TE=(TEy), AT=()
   Map: TE=(TEz), AT=()
   Map: TE=(TEj), AT=()
   Map: TE=(TEk), AT=()
   Map: TE=(TEl), AT=()
   (.... TE-routes for all of the above maps ....)
one or both of the following VP routes:

   Route: AS-path=(Y...), NLRI=Pa=20.0/16
   Route: AS-path=(Z...), NLRI=Pa=20.0/16

What's more, although the routes are not shown, assume that the routes to TEj and TEk are split for load balance. (Of course there are likely to be more prefixes advertised from each TE. One is enough to illustrate the technique.)

Assume for now that there is no RIB suppression of maps: all maps are distributed to all ASes globally. The first thing to note is that any AS could choose to FIB-suppress any of the /24 maps and still be able to deliver packets to the destinations. However, in each of these cases, FIB-suppression would have a greater or lesser impact on traffic to the destination.

In the case of Pj, remote ASes that choose to FIB-install Pj can use the TEDs in the split route (as well as their own traffic engineering considerations) to decide whether to route via X or Y. Packets from remote ASes that FIB-suppress Pj will be routed to Y or Z, depending on which is the better route. Packets to Y will reach J through the Y-J link. Packets to Z may be routed to J via X, but the fact that Y and Z have POPs in the area, and X doesn't, suggests that a larger proportion of Z's packets may reach J through Y. What's more, packets may reach Z via X, only to be tunneled to J back through X. Having said that, a remote AS whose route to Z or Y is via X can tell that FIB suppression is likely to result in a longer path, and so may be less likely to FIB-suppress.

Ultimately, FIB-suppressing Pj is likely to produce significantly more traffic on the Y-J link compared to the X-J link. To some extent J might be able to counter this imbalance with TEDs. Another option, however, could be for X and Y to offer explicit load balancing services to J. In this case, J could supply a separate pair of locally advertised TEDs that X and Y use to balance traffic to J.
For instance, if the Y-J load is too heavy, and the X-J load too light, J could ask Y to divert some of its traffic via X, using the split route to TEj that traverses X.

Note, however, that if the Y-J link goes down, all traffic will successfully reach J through X, even if Pj is suppressed. This is because Y and Z will find that the only route to J is via X, and tunnel packets accordingly. Ultimately J might easily find that multihoming to an ISP not in its area is worth doing.

Note also that this scenario creates the possibility of transient loops, similar to those described in Section 3.6 between ASes Z and W. For instance, if the X-J link goes down, but AS Y doesn't know it yet and continues to tunnel packets to X (either for load balancing or because the Y-J link has gone down), then AS X would just route the packets back to the VP area aggregate. As described previously, the solution is for routers in X to drop packets when they know they have been received via the TEx tunnel, but to forward them to the VP otherwise.

In the case of Pk, FIB-suppression of its map by remote ASes will eliminate the ability for them to load balance traffic between Y and Z. From their perspective, all traffic to K would be routed to Pa, which reaches either Y or Z depending on which path is shorter. In this case, as with X and Y above, Y and Z could offer explicit load balancing services to K. As a result, K's multihoming could be hidden from the vast majority of routers' FIBs while still providing K with robust and load-balanced multihoming service.

In the case of Pl, FIB-suppression of its map by remote ASes may result in some packets taking a longer path than they otherwise might. For instance, X may choose to route some Pl-destined packets to Y even though a path to Z would be the shorter path.
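The effect of FIB suppression on forwarding at a remote AS can be sketched as a longest-prefix match that falls back to the VP route. This is a minimal illustration rather than the draft's forwarding algorithm: the lookup function, the FIB-as-dictionary layout, the action strings, and the /24 used for Pk are hypothetical. Only Pa=20.0/16 is taken from the example above.

```python
import ipaddress


def lookup(fib, dst):
    """Longest-prefix match: return the action of the most specific entry."""
    addr = ipaddress.ip_address(dst)
    best_net, best_action = None, None
    for prefix, action in fib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_action = net, action
    return best_action  # None when no entry covers the destination


PA = "20.0.0.0/16"   # area prefix Pa, advertised as a VP route by Y and/or Z
PK = "20.0.3.0/24"   # hypothetical /24 standing in for K's prefix Pk

# Remote AS that FIB-installs K's map: it can tunnel straight to TEk.
fib_installed = {PA: "forward along VP route", PK: "tunnel to TEk"}

# Remote AS that FIB-suppresses K's map: only the VP route remains.
fib_suppressed = {PA: "forward along VP route"}

dst = "20.0.3.7"  # hypothetical host address inside Pk
action_installed = lookup(fib_installed, dst)
action_suppressed = lookup(fib_suppressed, dst)
```

In the suppressed case the packet still makes progress toward an area router in Y or Z, which has FIB-installed the sub-prefix and can tunnel the packet onward.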
Note that in the above example, by not advertising a route to Pa, J, K, and L avoid becoming transits for other destinations in the area. Rather, these ASes can control the extent to which they do transit traffic through control of the routes they propagate. For instance, if J wants for some reason to transit traffic for its peer K, it would propagate the TE-route for TEk (as well as the map for TEk) as appropriate.

Now let's assume, in order to better scale RIBs, that we do not wish to propagate all maps to all Internet ASes. (Though once again we point out that due to the relative efficiency of map distribution, such scaling is unlikely to be necessary for the foreseeable future, if ever.) The fact that Y and Z can in principle load balance for their customers makes this option tenable. In this example, Y and Z are the only transit ASes participating in the geographic area. For now let's assume that Y and Z have multiple peering points in multiple geographic locations (i.e. both are national or multi-national ISPs with significant territorial overlap). In this case it is highly unlikely that Y and Z will become partitioned from each other. Given that, it might be deemed reasonable that Y and Z only need to distribute area sub-prefix maps for area ASes to each other. Thus all other ASes never get maps for sub-prefixes in the area.

In this example, J would still propagate its map (and TE-route) to X and I. Although I, as a non-transit, would likely not further propagate J's map, X most likely would. As a result, J's map would be propagated Internet-wide, but not K's or L's. As long as most multi-homing is "in area", most maps could be suppressed, resulting in greatly reduced RIB size as well as FIB size.

Now let's assume that Y and Z only peer in one place (i.e. between their POPs in the area). Assume further that they both peer with X in multiple places. In other words, X serves as a robust backup route between Y and Z should their single peering point fail.
In this case, X must be willing to propagate area maps between Y and Z (along with the TE-routes for all area ASes).

More generally, anywhere there is a desire to not propagate maps, the area ASes would need to evaluate the richness of paths and determine which additional ASes need to propagate maps. These additional ASes would need to agree to do so, and would need to be configured as to where to propagate the maps. It might also be desirable to have a flag associated with the area maps indicating that they don't need to be globally propagated. This way, if an AS does accidentally leak the maps, they don't get distributed everywhere.

3.6.2.  Opportunistic AS aggregation clusters

The previous section on geographic addressing assumes that addresses have been assigned with geographic aggregation in mind, and so doesn't apply to IPv4. However, IPv4 addresses have been assigned in regional blocks for some time now. For instance, IANA has assigned 11 prefixes to RIPE (5 /8's, 3 /7's, 2 /6's, and 1 /5), at least according to a RIPE database document. Presumably most of these have been assigned by ISPs to customers in Europe (though many of these may be multi-homed outside of Europe).

Given this, there may well be opportunities for "clusters" of richly inter-connected ISPs to advertise an aggregate, whereby most members of the aggregate are within those ISPs. The extent to which these opportunities exist is for further study. If they do exist, however, they can be exploited by Mapped-BGP in a fashion very similar to the geographic addressing described above. Specifically, a VP route for the prefix would be advertised by routers in the cluster, and these routers would FIB-install all sub-prefixes for the VP.
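As a rough sketch of the bookkeeping this implies, a cluster router acting as an APR needs to know which mapped prefixes fall inside the cluster VP, since those are the ones it must FIB-install. The function name partition_by_vp, the list-based map representation, and all addresses below are hypothetical placeholders chosen for illustration; none of them come from the draft.

```python
import ipaddress


def partition_by_vp(vp, mapped_prefixes):
    """Split mapped prefixes into those inside the VP and those outside it."""
    vp_net = ipaddress.ip_network(vp)
    inside, outside = [], []
    for prefix in mapped_prefixes:
        net = ipaddress.ip_network(prefix)
        # subnet_of() requires both networks to be the same IP version,
        # so the version check comes first.
        is_sub = (
            net.version == vp_net.version
            and net != vp_net
            and net.subnet_of(vp_net)
        )
        (inside if is_sub else outside).append(prefix)
    return inside, outside


# Placeholder values, not taken from any real assignment.
CLUSTER_VP = "10.0.0.0/8"
maps_in_rib = ["10.1.0.0/16", "192.0.2.0/24", "10.2.3.0/24"]

# An APR for CLUSTER_VP must FIB-install everything in must_install;
# prefixes in not_covered are unaffected by this VP.
must_install, not_covered = partition_by_vp(CLUSTER_VP, maps_in_rib)
```

The same partition tells a remote AS that holds the VP route which maps it could FIB-suppress without losing reachability.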
An important difference between engineered geographic addresses and opportunistic AS clusters is that, in the latter, there will be more "stray" sites: sites that have an address within the cluster VP but are not physically attached to any cluster ASes. Because of this, it typically won't be easy to identify a set of ASes that would be willing to suppress propagation of maps to ASes outside that set. So we should assume that maps will be propagated, and that the only scaling opportunity comes from FIB reduction.

When remote ASes do FIB suppression, they should prefer to suppress prefixes within the cluster to those outside the cluster, on the assumption that paths to prefixes outside the cluster will be longer. To do this, remote ASes obviously need to identify which prefixes are in and which are out. One way to do this would be to include all ASes in the cluster in the VP-route. TE-routes whose paths do not contain any of the cluster ASes would be considered to be outside the cluster.

3.6.3.  Generalized Inter-domain Virtual Aggregation

Virtual aggregation could also be deployed in a general fashion, whereby the global address space is carved up into VPs, and individual routers are assigned as APRs for different VPs. This is very much in the spirit of the Intra-domain VA draft, but with a couple of key differences. First, the extra hop suffered by VA paths would only occur in one ISP, the first one to tunnel the packet to the destination TE. As such, the load and latency penalty for Inter-domain VA is significantly less. Second, VA could be deployed in such a way that the Tier-1 ISPs maintain full routes (i.e. have APRs for all VPs), but lower-tier ISPs do not maintain any APRs. Rather, lower-tier ISPs keep VP-routes and any additional routes or maps that they wish to install.
As a result, RIBs and FIBs in lower-tier ISPs could be almost arbitrarily small while still having the ability to load balance both incoming and outgoing traffic.

4.  Performance Benefits

This section summarizes the performance benefits of Mapped-BGP. Note that none of the following stated benefits have been quantified.

1.  Mapped-BGP decreases the amount of processing needed to handle a prefix. This is primarily because most policies currently needed to compare and select paths and determine how to advertise routes are not required for processing maps. On the other hand, Mapped-BGP introduces a new policy decision, namely processing TEDs for the fraction of prefixes to which they apply. The majority of prefixes will be distributed as maps rather than routes.

2.  Mapped-BGP requires less RIB storage space, primarily because during steady state the map heard from any given neighbor is the same. Storage can be compressed by exploiting this.

3.  Peering sessions initialize faster in Mapped-BGP. This is because in general only the routes (of which we might expect a few tens of thousands at most) need to be conveyed before packets can start flowing (see Section 3.4.6.1).

4.  Mapped-BGP will have fewer "big" events. This is because route changes in Mapped-BGP affect only routes, not maps. What's more, if the FIB is organized in a tiered fashion (prefix points to a TE, which points to a next hop), then a change in TE next hop only requires a single update to the FIB, not one update for each impacted prefix. On the other hand, Mapped-BGP is likely to have more "small" events, because each map will be propagated both because of a change in add/remove status, and a change in TED. Indeed, with virtual aggregation, many or even most map updates don't even impact the FIB.

5.  Global convergence in Mapped-BGP will in general be faster. This is primarily because changes to maps can be distributed before any policy decisions are made on those changes.
This in turn is possible because maps don't change as they are propagated through the Internet. This allows an AS to first quickly distribute a received map and only afterwards process it. Indeed, map changes that involve only modifications to the TED can be processed much later in time (minutes).

6. Load balancing across ASes is both more accurate and more efficient in Mapped-BGP. This is because TEDs allow for a fine-grained description of how much load is desired. This is in contrast to legacy BGP, where granularity is limited by the number of prefixes available to select among.

7. With virtual aggregation, Mapped-BGP provides significant opportunities for new aggregation.

5. Normative References

[I-D.francis-idr-intra-va]
Francis, P., Xu, X., and H. Ballani, "FIB Suppression with Virtual Aggregation and Default Routes", draft-francis-idr-intra-va-01 (work in progress), September 2008.

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

Authors' Addresses

Paul Francis
Cornell University
4108 Upson Hall
Ithaca, NY 14853
US

Phone: +1 607 255 9223
Email: francis@cs.cornell.edu

Xiaohu Xu
Huawei Technologies
No.3 Xinxi Rd., Shang-Di Information Industry Base, Hai-Dian District
Beijing, Beijing 100085
P.R.China

Phone: +86 10 82836073
Email: xuxh@huawei.com

Hitesh Ballani
Cornell University
4130 Upson Hall
Ithaca, NY 14853
US

Phone: +1 607 279 6780
Email: hitesh@cs.cornell.edu

Full Copyright Statement

Copyright (C) The IETF Trust (2008).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.