Network Working Group                                         P. Francis
Internet-Draft                                                Cornell U.
Intended status: Informational                                     X. Xu
Expires: April 29, 2009                                           Huawei
                                                              H. Ballani
                                                              Cornell U.
                                                        October 26, 2008


                           Mapped BGP Design
                 draft-francis-mapped-bgp-design-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 29, 2009.

Abstract

   This draft introduces Mapped-BGP, a routing protocol that uses BGP to
   distribute tunnel endpoint-to-prefix mappings.  The goals of this
   draft are to present preliminary concepts and get feedback.  It is
   not meant to be a fully-formed proposal.  The goals of Mapped-BGP
   are: 1) to reduce the processing required to run BGP, 2) to speed up
   inter-domain convergence, 3) to improve the cross-ISP load balancing
   capabilities of BGP, and where possible, 4) to enable forms of
   address aggregation like geographical addressing (i.e. for IPv6).
   Improved address aggregation is unlikely to be very useful for IPv4,


Francis, et al.          Expires April 29, 2009                 [Page 1]

Internet-Draft                 Mapped BGP                   October 2008


   because most addresses have already been assigned.  This design takes
   the position that Mapped BGP is useful even without better
   aggregation, because 1) FIB size can be reduced through FIB
   suppression with Virtual Aggregation, and 2) RIB size per se is not
   the growth bottleneck.


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Terms and concepts . . . . . . . . . . . . . . . . . . . . . .  4
   3.  Description of Mapped-BGP  . . . . . . . . . . . . . . . . . .  5
     3.1.  Structure of new attributes  . . . . . . . . . . . . . . .  5
     3.2.  Map-RIB data structure . . . . . . . . . . . . . . . . . .  6
     3.3.  Tunnel Endpoints (TE)  . . . . . . . . . . . . . . . . . .  6
     3.4.  Rules for advertising maps . . . . . . . . . . . . . . . .  7
       3.4.1.  Rules for initiating a map . . . . . . . . . . . . . .  7
       3.4.2.  Transposing Maps and Routes  . . . . . . . . . . . . .  8
       3.4.3.  Authenticating updates . . . . . . . . . . . . . . . .  9
       3.4.4.  Longest-prefix map selection rules and aggregation . .  9
       3.4.5.  Changing maps  . . . . . . . . . . . . . . . . . . . . 11
       3.4.6.  Propagating and activating maps  . . . . . . . . . . . 12
       3.4.7.  Changing TE-route  . . . . . . . . . . . . . . . . . . 13
     3.5.  Load Balancing in Mapped-BGP . . . . . . . . . . . . . . . 13
       3.5.1.  Incoming Load Balance at Sites . . . . . . . . . . . . 14
       3.5.2.  Incoming Load Balance at Lower-tier ISPs . . . . . . . 15
       3.5.3.  Multi-exit discrimination with Mapped-BGP  . . . . . . 17
     3.6.  Aggregation in Mapped-BGP  . . . . . . . . . . . . . . . . 18
       3.6.1.  Geographic or Metro Addressing . . . . . . . . . . . . 21
       3.6.2.  Opportunistic AS aggregation clusters  . . . . . . . . 26
       3.6.3.  Generalized Inter-domain Virtual Aggregation . . . . . 26
   4.  Performance Benefits . . . . . . . . . . . . . . . . . . . . . 27
   5.  Normative References . . . . . . . . . . . . . . . . . . . . . 28
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 28
   Intellectual Property and Copyright Statements . . . . . . . . . . 30


1.  Introduction

   The basic idea behind Mapped-BGP is quite simple.  Rather than
   distributing routes to reachable prefixes, BGP distributes routes to
   tunnel endpoints (TEs) and distributes maps that associate reachable
   prefixes with TEs.  Otherwise, BGP runs in much the same way that it
   does today.  Indeed, in Mapped-BGP it is possible to transpose TE
   routes and their associated maps back into routes to prefixes.  This
   transposition is used to allow ISPs running Mapped-BGP to interface
   with legacy ISPs that do not run Mapped-BGP.  The transposition also
   allows us to reuse the security mechanisms of BGP, especially prefix
   filtering.

   The maps in Mapped-BGP are, for the most part, policy-free.  By this
   we mean that the types of policies normally applied to routes (the
   seven-step best path computation, the assignment of weights and local
   preferences, the addition or deletion of attributes including path
   prepending, and decisions about where to advertise routes) are not
   applied to maps.  Rather, maps are blindly distributed along the
   routes traced out by their associated TEs.  Since the majority of
   prefixes would be distributed by maps rather than by routes, the cost
   of processing BGP updates would be significantly decreased.  Note
   that RIB and FIB size would not be reduced with this approach.
   However, FIB size can be reduced with FIB suppression associated with
   Virtual Aggregation [I-D.francis-idr-intra-va], and we doubt that RIB
   size per se is a serious bottleneck in BGP (this needs to be
   validated).

   A natural question to ask is, if policies are not being applied to
   maps, how are BGP policies applied to prefixes advertised in maps?
   Since maps are distributed along the reverse best paths of their
   associated TEs, policies that apply to the TE routes are
   automatically inherited by the map prefixes.  This works well
   for policies that are used to control which routes pass through an
   ISP, for instance to configure valley-free routing.  This does not
   work as well, however, for policies used for load balancing across
   ASes.  This is because BGP load balancing mechanisms operate at the
   granularity of routes, which in the absence of maps operate at the
   granularity of prefixes.  With Mapped-BGP, a TE-route originated by
   an ISP will apply to all of the ISP's prefixes.  In other words, the
   ISP only originates a single route, and so there isn't enough route
   granularity on which to apply BGP load balancing policies.

   To make up for this shortcoming, Mapped-BGP introduces a parameter
   called a Tunnel Endpoint Discriminator (TED).  This is a unitless
   value that a remote router uses to decide the relative probability
   with which it will use the different TEs that apply to a
   given prefix.  TEDs allow both multi-homed sites and lower-tier
   multi-homed ISPs to load balance at relatively fine granularity.

   The tunnels in Mapped-BGP provide a simple mechanism to produce
   virtual topologies across ASes.  If used in concert with aggregatable
   address assignment policies like geographical addressing, Mapped-BGP
   provides significant new opportunities for aggregation without the
   need for careful physical topology management across ISPs (for
   instance within a geographical area).


2.  Terms and concepts

   FIB-install and FIB-suppress:  These two terms refer to the act of
      installing a route into the FIB, and not installing a route into
      the FIB, respectively.  Note that the mechanism for not installing
      a route into the FIB may be simply not putting it into the routing
      table (defined below).
   Head-end and tail-end:  Head-end generally refers to the start of the
      tunnel.  For instance, the head-end router is the router that
      starts the tunnel, the head-end ISP is the ISP that contains the
      head-end router, and so on.  Tail-end generally refers to the
      terminating point of the tunnel.  The term tunnel endpoint (TE) is
      generally synonymous with tail-end.
   Legacy:  Refers to something that does not operate Mapped-BGP (for
      instance, a legacy AS or ISP, a legacy router, etc.).  Anything
      that is not labeled as legacy is assumed to be operating Mapped-
      BGP.
   Map:  The term "map" refers to a single prefix-TE mapping.  It may
      also refer to the "map attribute" in a BGP update.  Note that in
      general, however, a map attribute will contain multiple individual
      maps.
   Routing Table:  The term Routing Table is defined here the same way
      as in Section 3.2 of RFC4271: "Routing information that the BGP
      speaker uses to forward packets (or to construct the forwarding
      table used for packet forwarding) is maintained in the Routing
      Table."  As such, FIB Suppression can be achieved by not
      installing a route into the Routing Table.
   Tunnel endpoint (TE):  The term TE typically refers to the router or
      AS that detunnels the packet.  The term can also refer to the TE
      address.  Tunnel endpoints (TE) should be anycasted across some or
      all routers in the AS.
   TE-route:  This is a normal BGP route whose NLRI contains one or more
      TEs.
   TE-block, TE-subblock, and Path Splitting:  Typically a TE will be
      defined by a CIDR block of addresses (as opposed to a single
      address).  This is done to enable upstream load balance through a
      mechanism called Path Splitting (see Section 3.5), whereby the
      route for the entire TE block is split into multiple routes, each
      to a sub-block within the block.  These routes are advertised to
      different neighbors, giving upstream ASes multiple paths to choose
      from to get to a given destination prefix.  The term TE-block
      refers to the entire block of addresses that comprise a TE, and
      the term TE-subblock refers to a sub-block within the block.
   Tunnel endpoint discriminator (TED):  A map may have a TED associated
      with it for the purpose of incoming load balancing.  This is used
      when an AS is multi-homed to multiple providers, and each provider
      serves a TE.  A split also has TEDs associated with it, which are
      used by an ISP to load balance incoming traffic among its AS
      peering links.  The TED is a unitless indication of the
      proportion of traffic that should be sent to each TE or AS-link.
      Note that head-end ISPs are not required to honor the TED.  Note
      also that TED info in maps is lost when maps are aggregated.
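
   As a rough illustration of how a head-end router might read TEDs as
   relative proportions, the following Python sketch normalizes the TED
   values of the TEs that map a prefix and picks a TE accordingly.  The
   function names and the TED value for TEz are assumptions made for
   illustration only; this draft does not define how a head-end uses
   TEDs, and head-end ISPs are not required to honor them.

```python
import random

def ted_shares(teds):
    """Convert per-TE TED values into traffic fractions that sum to 1."""
    total = sum(teds.values())
    if total == 0:
        # No preference expressed: split evenly (an assumption of this sketch).
        return {te: 1 / len(teds) for te in teds}
    return {te: ted / total for te, ted in teds.items()}

def pick_te(teds, rng=random):
    """Pick one TE with probability proportional to its TED."""
    shares = ted_shares(teds)
    tes = list(shares)
    return rng.choices(tes, weights=[shares[te] for te in tes], k=1)[0]

# Prefix Pa mapped to TEw with TED=20 and, hypothetically, to TEz with TED=60.
pa_teds = {"TEw": 20, "TEz": 60}
shares = ted_shares(pa_teds)
chosen = pick_te(pa_teds, random.Random(0))
```

   With these made-up values, one quarter of the traffic to Pa would be
   tunneled to TEw and three quarters to TEz.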


3.  Description of Mapped-BGP

3.1.  Structure of new attributes

   There are two new attributes associated with Mapped-BGP.  One is the
   "map", which is used to associate a reachable address prefix with a
   Tunnel Endpoint Block (TE-block).  The other is the "split", which is
   used to associate a TED value with segments of multiple paths to TEs.

   The contents of the map attribute are as follows:

   [TE-list]
   List of one or more address targets, consisting of:
           [prefix],
           action,
           [TED],

   where:
           TE = a CIDR block of one or more addresses,
           TE-list = a list of TEs
           action = add, remove
           prefix = a CIDR block of one or more addresses
           TED = value between 0 - 255 (or smaller range)
           [] = optional (note that either the TE-list, or the prefix, or
                both must be present)

   The format of the split attribute is:

   Downstream AS
   List of two or more Upstream ASes, consisting of:
           Upstream AS
           TED
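
   The two attributes can be pictured as simple records.  The Python
   sketch below is one possible in-memory representation, not a wire
   format; the class names, field names, and the validate() helpers are
   assumptions introduced here, and the TED values in the split example
   are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class AddressTarget:
    action: str = "add"           # "add" or "remove"
    prefix: Optional[str] = None  # CIDR block (optional)
    ted: Optional[int] = None     # 0..255 (optional)

@dataclass
class MapAttribute:
    te_list: List[str] = field(default_factory=list)  # optional TE CIDR blocks
    targets: List[AddressTarget] = field(default_factory=list)

    def validate(self) -> None:
        if not self.targets:
            raise ValueError("a map needs at least one address target")
        for t in self.targets:
            if t.action not in ("add", "remove"):
                raise ValueError(f"unknown action: {t.action}")
            if t.ted is not None and not 0 <= t.ted <= 255:
                raise ValueError(f"TED out of range: {t.ted}")
            if not self.te_list and t.prefix is None:
                raise ValueError("TE-list or prefix must be present")

@dataclass
class SplitAttribute:
    downstream_as: str
    upstream: List[Tuple[str, int]]  # (upstream AS, TED) pairs

    def validate(self) -> None:
        if len(self.upstream) < 2:
            raise ValueError("a split lists two or more upstream ASes")

# A map with one TE-block and three address targets.
example_map = MapAttribute(
    te_list=["40.1.1.0/28"],
    targets=[AddressTarget(prefix="20.0/14"),
             AddressTarget(prefix="30.1/16"),
             AddressTarget(prefix="20.1/16", ted=20)],
)
example_map.validate()

# A split for downstream AS W across upstream ASes X and Y (TEDs made up).
example_split = SplitAttribute("W", [("X", 30), ("Y", 70)])
example_split.validate()
```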


3.2.  Map-RIB data structure

   We assume a new data structure called the map-RIB.  For each eBGP
   neighbor, there is conceptually a map-RIB-in and a map-RIB-out, which
   contain the maps received from and sent to the neighbor,
   respectively.

   Normally the same map (i.e. same TE, TED, and action) will have been
   received from each peer and sent to each peer.  During a change (a
   map going from add to remove, or a change in TED), however, there
   will be a brief convergence period during which the maps received
   from different peers differ.  Because the common case is uniform,
   the map-RIB data structure can be substantially compressed: most
   map-RIB entries can simply carry a flag indicating that all received
   and sent maps are the same, and avoid listing them explicitly.
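
   One way to picture this compression is sketched below in Python for
   the receive side only.  Each entry stores a single common map plus a
   small table of exceptions that is non-empty only during convergence.
   The class and method names are hypothetical, a map is modeled as a
   (TE, TED, action) tuple, and the new TED value 35 is made up.

```python
class MapRibInEntry:
    """Received maps for one prefix, compressed around the common case."""

    def __init__(self, neighbors, common_map):
        self.neighbors = set(neighbors)
        self.common = common_map   # map shared by all neighbors
        self.differing = {}        # neighbor -> map, only while they disagree

    def is_uniform(self):
        """True when the 'all maps are the same' flag would be set."""
        return not self.differing

    def lookup(self, neighbor):
        return self.differing.get(neighbor, self.common)

    def receive(self, neighbor, new_map):
        if new_map == self.common:
            self.differing.pop(neighbor, None)
        else:
            self.differing[neighbor] = new_map
        # Once every neighbor reports the same new map, it becomes common.
        if (set(self.differing) == self.neighbors
                and len(set(self.differing.values())) == 1):
            self.common = new_map
            self.differing.clear()

old = ("TEw", 20, "add")   # (TE, TED, action)
new = ("TEw", 35, "add")   # a TED change
entry = MapRibInEntry(["X", "Y"], old)
entry.receive("X", new)
mid_convergence = (entry.is_uniform(), entry.lookup("X"), entry.lookup("Y"))
entry.receive("Y", new)
```

   After both neighbors have sent the new map, the entry is uniform
   again and no per-neighbor state remains.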

3.3.  Tunnel Endpoints (TE)

   TEs are typically anycasted across multiple routers for both the sake
   of resilience and to allow for aggregation.  When a TE is associated
   with a single AS, then all routers in the AS will be anycasted with
   the TE address.  A TE may be associated with multiple ASes (i.e. for
   aggregation), in which case all routers in all the ASes will be
   anycasted.  It may also be possible to assign a TE to a metro or
   geographical area.  In this case, the TE address is anycasted across
   at least all routers within the area, but not necessarily all routers
   in all ASes that have a presence in the area.

   A "TE" can in fact be composed of a CIDR block.  In other words, a
   group of addresses can all act as the TE (i.e. all cause the router
   to detunnel the packet).  From the point of view of the TE router,
   all addresses in the block are treated identically---it doesn't
   matter which TE address was used to tunnel a received packet.  The
   purpose of allowing a block of addresses to be a TE is to allow for
   load balancing.  Different sub-blocks within a TE-block may follow
   different paths to the TE (path splitting), thus allowing the head-
   end router to select a path by virtue of selecting different TE
   addresses within the block.  This path selection can be loosely
   influenced by downstream ASes through the use of TEDs.
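
   The following Python sketch illustrates the idea with the standard
   ipaddress module.  It divides a /28 TE-block into two /29 TE-
   subblocks and shows that the choice of TE address determines which
   sub-block, and therefore which path, a tunneled packet follows.  The
   block 40.1.1.0/28, the assignment of sub-blocks to paths, and the
   helper name are assumptions for illustration; how sub-blocks are
   actually advertised is the subject of Section 3.5.

```python
import ipaddress

te_block = ipaddress.ip_network("40.1.1.0/28")
subblocks = list(te_block.subnets(prefixlen_diff=1))   # two /29 sub-blocks

# Hypothetical assignment: one sub-block is routed via X, the other via Y.
path_for_subblock = {subblocks[0]: "via X", subblocks[1]: "via Y"}

def path_for_te_address(te_address):
    """Return the path implied by the sub-block containing te_address."""
    addr = ipaddress.ip_address(te_address)
    for subblock, path in path_for_subblock.items():
        if addr in subblock:
            return path
    return None   # not a TE address in this block

# The head-end picks the path by picking the tunnel destination address.
choice_a = path_for_te_address("40.1.1.3")
choice_b = path_for_te_address("40.1.1.12")
```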

   Because a router may participate in multiple levels of aggregation
   (i.e.  AS-level and geographical-level), a given router may advertise
   multiple TE-blocks in its maps.  There should not, however, be more
   than two or at most three TE-blocks in a given map.


3.4.  Rules for advertising maps

3.4.1.  Rules for initiating a map

   An ISP will initiate a map on behalf of its stub-AS customers.  This
   is illustrated in the following example.  It shows a network of stub
   ASes, A, B, C, and D, and ISPs (all other ASes).  The prefixes
   associated with the stub ASes are as shown.  B, C, and D are single-
   homed customers of W, and A is a multihomed customer of W and Z.  W
   is a customer of X and Y.

                          J
                         / \
                        /   \
                I------X     Y
               /        \   /
              /          \ /
             Z            W  TEw=40.1.1.0/28
              \           |
               \   ----------------------
                \ /    |         |      |
     Pa=20.1/16  A     B         C      D  Pd=30.1/16
                 Pb=20.2/16   Pc=20.3/16


   Given this configuration, AS W would initiate the following updates
   to non-legacy ASes:

   Route:  AS-path=(W), NLRI=(40.1.1.0/28)
   Map:    TE=(40.1.1.0/28), AT=(<20.0/14>, <30.1/16>, <20.1/16,TED=20>)

   which for the sake of readability we can rewrite as:

   Route:  AS-path=(W), NLRI=(TEw)
   Map:    TE=(TEw), AT=(<Pw-agg>, <Pd>, <Pa,TEDaw>)

   where Pw-agg is an aggregate consisting of Pa, Pb, and Pc.

   The first update (the route) is a normal BGP advertisement with AS-
   path = W and NLRI=TEw=40.1.1.0/28 (other attributes left off for
   simplicity).  We call this update a TE-route, since it is a route to
   a TE.  The second update is a map.  The TE is TEw, which is the same
   as the TE-route.  Also included are three address targets.  In each,
   the "action" is assumed to be "add", and is not otherwise shown.  The
   first, Pw-agg=20.0/14, is the aggregate of Pa=20.1/16, Pb=20.2/16,
   and Pc=20.3/16.  The second is Pd=30.1/16, which is not aggregatable
   and so given separately.  Note that D is not multihomed, and so has
   no need for a TED.  The third is for A. Even though A's prefix falls
   within the 20.0/14 aggregate, it is also individually listed in order
   to convey its TED, value TEDaw=20.  AS Z would also advertise a
   separate map with another TED value, thus giving A some control over
   the volume of incoming traffic on its two access links (see
   Section 3.5).
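
   The prefix relationships in this example can be checked with a few
   lines of Python.  The sketch expands the shorthand prefixes into
   full dotted form (for instance 20.1/16 becomes 20.1.0.0/16) and tests
   which customer prefixes fall inside the aggregate; the variable names
   are ours.

```python
import ipaddress

pw_agg = ipaddress.ip_network("20.0.0.0/14")
customer_prefixes = {
    "Pa": ipaddress.ip_network("20.1.0.0/16"),
    "Pb": ipaddress.ip_network("20.2.0.0/16"),
    "Pc": ipaddress.ip_network("20.3.0.0/16"),
    "Pd": ipaddress.ip_network("30.1.0.0/16"),
}

# True where the customer prefix is covered by the aggregate Pw-agg.
covered = {name: net.subnet_of(pw_agg)
           for name, net in customer_prefixes.items()}

# Prefixes outside the aggregate must appear as their own address target.
standalone = sorted(name for name, inside in covered.items() if not inside)
```

   Pa, Pb, and Pc are covered by Pw-agg, while Pd is not and therefore
   needs its own address target.  Pa is listed separately for a
   different reason, namely to carry its TED.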

   There are three ways that AS W could have learned the value TEDaw.
   One is to have statically configured it.  A second is for A to convey
   it via BGP in an Extended Communities Attribute.  This would
   especially be useful if A is running legacy BGP.  The third would be
   for A to advertise a map to W, but keeping the TE field as NULL:

   Map:    TE=(NULL), AT=(<Pa,TEDaw>)

   This would effectively signal to W that it wants to have its prefix
   Pa advertised individually with the associated TED.

3.4.1.1.  More flexible aggregation

   Nominally it appears in the above example that we are doing the same
   amount of aggregation as with legacy BGP today, because Pa is
   advertised individually on account of multihoming.  Section 3.6
   describes how Mapped-BGP provides additional opportunities for
   aggregation.

3.4.2.  Transposing Maps and Routes

   A key aspect of Mapped-BGP is that the combination of route+map can
   be transposed into a route.  This is important for the simple
   pragmatic reason that it allows an AS to speak BGP with legacy ASes.
   It is also important because it allows certain existing BGP
   mechanisms that operate on routes, like filtering incoming updates,
   to be applied to route+map pairs.  As an example of this
   transposition, the updates advertised by W in the above example can
   be transposed into the following route:

   Route:  AS-path=(W), NLRI=(40.1.1.0/28, 20.0/14, 30.1/16, 20.1/16)

   or equivalently:

   Route:  AS-path=(W), NLRI=(TEw, Pw-agg, Pd, Pa)

   This is possible because the prefixes in the map (20.0/14, 30.1/16,
   20.1/16) can be associated to the AS-path in the route (W) by virtue
   of TEw matching the NLRI of the TE-route.
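
   A minimal Python sketch of this transposition follows.  The
   dictionary layout and the function name are assumptions made for
   illustration; the sketch only shows that matching the map's TE
   against the TE-route's NLRI is enough to attach the map prefixes to
   the route's AS-path.  TEDs have no place in a legacy route and are
   dropped.

```python
def transpose(te_route, map_attr):
    """Fold a TE-route and its map into one legacy-style route."""
    if map_attr["te"] not in te_route["nlri"]:
        raise ValueError("map TE does not match the TE-route NLRI")
    prefixes = [prefix for prefix, _ted in map_attr["targets"]]
    return {
        "as_path": list(te_route["as_path"]),
        "nlri": list(te_route["nlri"]) + prefixes,
    }

te_route = {"as_path": ["W"], "nlri": ["40.1.1.0/28"]}
w_map = {
    "te": "40.1.1.0/28",
    # (prefix, TED) pairs; None means no TED was given.
    "targets": [("20.0/14", None), ("30.1/16", None), ("20.1/16", 20)],
}
legacy_route = transpose(te_route, w_map)
```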

   To continue the example, the BGP updates advertised by AS X would be:

   Route:  AS-path=(W,X), NLRI=(TEw)
   Map:    TE=(TEw), AT=(<Pw-agg>, <Pd>, <Pa,TEDaw>)

   AS X adds itself to the route, but does not change the map.  Indeed
   maps never change as they propagate through the Internet (though they
   can be dropped during aggregation).  These two updates can be
   transposed into:

   Route: AS-path=(W,X), NLRI=(TEw, Pw-agg, Pd, Pa)

   In fact, if AS I were a legacy AS, then AS X would give this update
   to AS I.  This allows legacy ASes to coexist with updated ASes.  This
   document does not address the issue of having legacy routers and
   Mapped-BGP routers coexist within the same AS.

3.4.3.  Authenticating updates

   One of the challenges of any tunneled routing system is that of
   authenticating maps.  Mapped-BGP exploits the fact that route+map is
   transposable to route to achieve authentication equivalent to that of
   BGP, and indeed to mostly reuse the authentication mechanisms and
   configuration of BGP.  Conceptually, authentication can be seen as
   operating as follows: When a router receives an eBGP route+map, it
   converts it to the equivalent route.  It then applies its existing
   filtering mechanisms to the route.  If the route is acceptable, then
   the route+map is also acceptable.  If the route is not acceptable,
   then the route+map is likewise not accepted.
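
   The conceptual procedure can be sketched in Python as follows.  The
   prefix filter shown here (every NLRI prefix must fall within a
   configured block) is only a stand-in for whatever inbound filtering
   an ISP already applies to a neighbor; the function names, the data
   layout, and the hijacked prefix 50.0.0.0/8 are assumptions.

```python
import ipaddress

def route_acceptable(route, allowed_blocks):
    """Stand-in for an existing inbound prefix filter."""
    allowed = [ipaddress.ip_network(b) for b in allowed_blocks]
    return all(
        any(ipaddress.ip_network(p).subnet_of(block) for block in allowed)
        for p in route["nlri"]
    )

def accept_route_and_map(te_route, map_prefixes, allowed_blocks):
    """Accept a route+map only if its transposed route passes the filter."""
    equivalent = {
        "as_path": list(te_route["as_path"]),
        "nlri": list(te_route["nlri"]) + list(map_prefixes),
    }
    return route_acceptable(equivalent, allowed_blocks)

# Blocks that a neighbor of W is configured to accept from W.
allowed_from_w = ["40.1.1.0/28", "20.0.0.0/14", "30.1.0.0/16"]
te_route = {"as_path": ["W"], "nlri": ["40.1.1.0/28"]}

good = accept_route_and_map(
    te_route, ["20.0.0.0/14", "30.1.0.0/16", "20.1.0.0/16"], allowed_from_w)
bad = accept_route_and_map(te_route, ["50.0.0.0/8"], allowed_from_w)
```

   The legitimate route+map is accepted, while a map that claims a
   prefix outside the configured blocks is rejected together with its
   route.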

3.4.4.  Longest-prefix map selection rules and aggregation

   Mapped-BGP uses longest-prefix selection on maps in much the same way
   that legacy BGP uses longest-prefix selection on routes.  In the
   following discussion, assume that Pl and Ps are two prefixes that
   overlap.  Pl has the longer mask (it is the more specific prefix),
   and Ps has the shorter mask (i.e. Pl falls within Ps).

   If an AS receives maps for Pl and Ps with different TEs, then the Pl
   map must be used to route packets to addresses within Pl.  This is
   similar to legacy BGP, where if an AS has different routes to Pl and
   Ps, the route to Pl must be used.  The reason in both legacy BGP and
   Mapped-BGP is the same: it is not clear whether addresses in Pl are
   reachable in the AS originating the route to Ps.

   If the maps for Pl and Ps have the same TE, then either may be used
   to route packets within Pl.  However, in this case there is an
   important difference with legacy BGP.  In legacy BGP, if Ps is
   selected (i.e. aggregation takes place), then Ps is advertised
   upstream and upstream ASes never learn about Pl.  With Mapped-BGP,
   all maps may, and typically will still be advertised upstream, and
   upstream ASes may in fact make a different choice.
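
   The selection rule can be sketched in Python as a longest-match
   lookup over maps that returns the set of TEs usable for a
   destination.  The function name and the list-of-pairs representation
   are assumptions; prefixes are written in full dotted form, and TEz
   denotes the TE of AS Z.

```python
import ipaddress

def usable_tes(dest, maps):
    """Return the TEs of the longest-prefix maps that cover dest.

    maps is a list of (prefix, TE) pairs.
    """
    addr = ipaddress.ip_address(dest)
    matching = [(ipaddress.ip_network(p), te) for p, te in maps
                if addr in ipaddress.ip_network(p)]
    if not matching:
        return set()
    longest = max(net.prefixlen for net, _te in matching)
    return {te for net, te in matching if net.prefixlen == longest}

all_maps = [
    ("20.0.0.0/14", "TEw"),   # Pw-agg via W
    ("20.1.0.0/16", "TEw"),   # Pa via W
    ("20.1.0.0/16", "TEz"),   # Pa via Z
]
# Same view after an intermediate AS has dropped the map for Pa via W.
pruned_maps = [("20.0.0.0/14", "TEw"), ("20.1.0.0/16", "TEz")]

full_choice = usable_tes("20.1.5.5", all_maps)
pruned_choice = usable_tes("20.1.5.5", pruned_maps)
```

   With all maps present, an address in Pa can be tunneled to either TE.
   Once the more specific map via W is dropped, only TEz remains,
   because the aggregate alone does not show that Pa is reachable
   through W.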

   Why this matters can be illustrated using the example network above.
   With legacy BGP, AS X would receive the following two routes from AS
   W: Pa=20.1/16 and Pw-agg=20.0/14.  If AS X decides to aggregate these
   two into the single route Pw-agg, then AS I will receive Pw-agg from
   X, and Pa from Z.  Now, AS I has no choice but to accept the route to
   Pa via Z, because it does not know that Pa is reachable via W.  On
   the other hand, if AS X chooses to forward both routes to AS I, then
   AS I receives from X Pa=20.1/16 and Pw-agg=20.0/14, and from Z a
   route to Pa.  Now I may choose between the route via Z and the route
   via W, but once the choice is made, ASes upstream of I are forced
   into the same choice.

   By contrast, with Mapped-BGP, all of the maps (Pa and Pw-agg via W,
   and Pa via Z) would be propagated, along with TE-routes to Z and W.
   In this way, AS I can choose one route, and ASes upstream of I can
   choose different routes.  Furthermore, this choice can include
   installing only the aggregate prefix Pw-agg into router FIBs if so
   desired.  With legacy BGP, this choice often doesn't exist.  Indeed,
   different routers in AS I could use different TEs (TEz or TEw), or
   even multipath to both TEs (that is, use both TEs simultaneously).

   Of course, the cost of doing this is that both of A's maps must be
   propagated everywhere.  We defend this with two arguments here.
   First, the cost of propagating a map is expected to be
   relatively small.  If an AS chooses to load only the aggregate in its
   FIB, then the cost of the unused maps is limited to receiving them,
   deciding to suppress them from the FIB, storing them in the RIB, and
   passing them on.  Though we need to run benchmarks to measure this
   cost, intuitively we believe that this is significantly less
   expensive than processing a full-blown route and entering it in the
   FIB.

   Second, ASes still have the option of dropping the maps altogether if
   they can't deal with them.  Doing so results in the same sorts of
   inflexibility we see today in BGP, but nevertheless the option
   exists.  For instance, if in the above example AS X decided to simply
   drop the map for Pa altogether, then AS I would receive the aggregate
   map Pw-agg from X, and the Pa map from Z.  AS I would have to choose
   the route via Z here, because it would not be able to tell that A is
   connected to W.  The bottom line is that there is considerably more
   flexibility with Mapped-BGP in making the overhead versus routing
   granularity tradeoff.

   More broadly, this example illustrates one of the core design
   principles of Mapped-BGP: by both making the processing of
   routing information cheaper, and providing considerable flexibility
   as to what to do with that routing information, we have the option of
   propagating much more detailed information in the global routing
   system than we are able to do today.  At the same time, individual
   ISPs have the option of ignoring the details if they so choose, and
   are less constrained by the decisions made by downstream ISPs.

   This principle does in fact result in a shift of power among ASes.
   Today, upstream ASes are held hostage to the decisions of downstream
   ASes.  In Mapped-BGP, however, downstream ASes lose some control of
   packet forwarding (or, at least, that control becomes more expensive
   to achieve).  For instance, in the above example topology, let's
   imagine that AS X decides that AS B is under attack and wants to drop
   (or identify and scrub) those packets.  Unfortunately all packets to
   B are tunneled to AS W along with packets to A, C, and D.  If X wants
   to distinguish these, it must look deeper into all packets going to
   W.  We believe that this shift in power is probably good overall, but
   more thought and experimentation are required to understand this.

3.4.5.  Changing maps

   There are three things that can change on a map: the set of TEs it is
   associated with, its prefix, and its TED.  There are two actions
   associated with maps: add and remove.  Changes in TE or prefix are
   done using add and remove.  For instance, in the above figure, if the
   link between A and W goes down, W would advertise:

   Map:    TE=TEw, AT=(<Pa,action=remove>)

   Because A is multihomed, this update causes ASes to use the TE
   associated with Z exclusively.  In other words, this effectively
   disassociates Pa from TEw.  The TED does not need to be included with
   a remove update, nor does the route to the TE.  If the map is
   subsequently added again (because the link comes back up), then the
   TED would of course have to be included, but the TE-route would still
   not have to be repeated.

   If the link between D and W goes down, then W would advertise:

   Map:    TE=TEw, AT=(<Pd,action=remove>)

   Since this is the only TE associated with Pd, this update would
   effectively remove Pd from routers everywhere.  It is worth noting
   that W can tell that D is single-homed because W does not receive any
   other maps associated with D.  Because of this, W might very
   reasonably decide not to advertise D's unreachability, thus saving
   some control processing overhead on the rest of the Internet.

   If the link between W and B goes down, then W does not need to
   advertise anything via eBGP, because B's prefix is aggregated and B
   is not multihomed.

   TED values may be modified from time to time even though no other
   aspects of the map (its TE or add/remove status) change.  TED
   changes are advertised by simply repeating the complete map with the
   new TED value.  It is worth noting that if a map advertises only a
   TED change, other ASes do not need to process the change right away.
   For instance, they could wait until they recompute traffic
   engineering.
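
   The effect of these updates on a receiving router's state can be
   sketched in Python.  The table layout (prefix to TE to TED) and the
   function name are assumptions, as is the TED value 60 for TEz; the
   sketch covers the add, remove, and TED-change cases described above.

```python
def apply_map_update(table, te, targets):
    """Apply one map update to table (prefix -> {TE: TED}).

    targets is a list of (prefix, action, ted) tuples.
    """
    for prefix, action, ted in targets:
        if action == "add":
            # Also covers a TED-only change: the complete map is repeated.
            table.setdefault(prefix, {})[te] = ted
        elif action == "remove":
            tes = table.get(prefix, {})
            tes.pop(te, None)
            if not tes:
                table.pop(prefix, None)   # no TE left for this prefix
        else:
            raise ValueError(f"unknown action: {action}")
    return table

table = {"Pa": {"TEw": 20, "TEz": 60}, "Pd": {"TEw": None}}

# Link A-W goes down: Pa stays reachable through TEz only.
apply_map_update(table, "TEw", [("Pa", "remove", None)])
after_a_down = {p: dict(t) for p, t in table.items()}

# Link D-W goes down: Pd has no TE left and disappears.
apply_map_update(table, "TEw", [("Pd", "remove", None)])
after_d_down = {p: dict(t) for p, t in table.items()}

# Link A-W comes back: the map is added again, TED included.
apply_map_update(table, "TEw", [("Pa", "add", 20)])
```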

3.4.6.  Propagating and activating maps

   Maps are similar to link-state updates in that each effectively
   describes a "link" somewhere in the Internet (i.e. that an AS with a
   given prefix is attached to an AS with another given prefix).  As
   such, as with link-state updates, maps have the potential to be
   interpreted out of order.  For example, an ISP might advertise an
   "add" map after a "remove" map, but the "remove" could well be
   received after the "add" at some remote ISP, thus installing the
   wrong state.  OSPF solves this problem using sequence numbers and a
   set of rules on how to interpret them.  Mapped BGP can exploit the
   trees generated by routes, combined with the fact that BGP speakers
   send updates in order, to solve the same problem without the need for
   sequence numbers.

   Specifically, what Mapped-BGP does is to require that maps are
   distributed along the trees created by routes.  This prevents old
   maps from looping around on themselves and incorrectly voiding more
   recent updates.  While an older map heard from one AS neighbor may
   temporarily be used in preference to a newer map heard from another
   AS neighbor, the fact that maps must follow the tree (in order) means
   that eventually the newer update will overtake the older one.  (As of
   this writing, we don't have a formal proof of this.)  In particular,
   maps are distributed according to the following rules:

   1.  Each router remembers, for every map prefix, the latest map
       received from every eBGP peer, and the latest map sent to every
       eBGP peer.  Note that most of the time these will all be the
       same, and so the data structures can be compressed to exploit
       this.
   2.  The map used by an AS, and advertised to other ASes, is that
       received from the next-hop AS on the associated TE-route.  If the
       next-hop AS changes a map, then this changed map is used and
       advertised.  If a different next-hop AS is selected, then the
       maps advertised by that AS are used.  If this causes any maps to
       change, then the changes are used and advertised.
   3.  A map is never advertised to the next-hop AS on the TE-route.
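
   The three rules above can be sketched in Python as follows.  This is
   a simplified model under stated assumptions: the class and method
   names are hypothetical, neighbors are named N1 to N3, a map is an
   (action, TED) tuple, and the export policy that decides which
   neighbors receive the TE-route in the first place is ignored.

```python
class MapSpeaker:
    def __init__(self, neighbors):
        self.neighbors = list(neighbors)
        self.next_hop = {}   # TE -> neighbor selected on the TE-route
        self.rib_in = {}     # (neighbor, TE, prefix) -> latest map received
        self.rib_out = {}    # (neighbor, TE, prefix) -> latest map sent

    def receive(self, neighbor, te, prefix, map_value):
        """Rule 1: remember the map. Rule 2: use it only from the next hop."""
        self.rib_in[(neighbor, te, prefix)] = map_value
        if self.next_hop.get(te) == neighbor:
            return self._advertise(te, prefix, map_value)
        return []

    def select_next_hop(self, te, neighbor):
        """Rule 2: a new next hop brings its own maps into use."""
        self.next_hop[te] = neighbor
        sent = []
        for (nbr, t, prefix), value in list(self.rib_in.items()):
            if nbr == neighbor and t == te:
                sent += self._advertise(te, prefix, value)
        return sent

    def _advertise(self, te, prefix, map_value):
        sent = []
        for nbr in self.neighbors:
            if nbr == self.next_hop[te]:
                continue   # Rule 3: never advertise toward the next hop
            if self.rib_out.get((nbr, te, prefix)) != map_value:
                self.rib_out[(nbr, te, prefix)] = map_value
                sent.append((nbr, prefix, map_value))
        return sent

speaker = MapSpeaker(["N1", "N2", "N3"])
speaker.select_next_hop("TEw", "N1")
first = speaker.receive("N1", "TEw", "Pa", ("add", 20))
ignored = speaker.receive("N2", "TEw", "Pa", ("remove", None))
```

   The map heard from N2 is stored but neither used nor advertised,
   because N2 is not the next hop on the TE-route.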

3.4.6.1.  Peering sessions and maps

   As an optimization to speed up the establishment of a peering session
   between eBGP speakers, we exploit the fact that maps are usually the
   same for all peers, and "guess" the value of a map before a peer
   advertises it.  Specifically, when a peering session first comes up,
   the peers exchange all routes before exchanging any maps.  When a
   peer learns a route (probably a TE-route) and selects the advertising
   neighbor as the next hop, it immediately uses any maps it already
   holds for that TE.  In other words, it continues to use whatever TEs
   it was already using.  Subsequently, when the neighbor starts
   advertising maps, the BGP speaker responds accordingly.

3.4.7.  Changing TE-route

   Routes, including TE-routes, are handled as with normal BGP.  They
   are handled independently of maps.  In other words, if a BGP speaker
   advertises a change of route to its peer, it does not need to re-
   advertise the associated maps.  Assume, for instance, that AS J uses
   AS X as the next hop to TEw, and the link between X and W goes down.
   X will withdraw the route to TEw, but does not need to withdraw the
   maps that have TEw as their TE.  These maps will have been
   previously advertised by Y, and so the alternate path through Y can
   be used right away.  When the link between X and W is restored, X
   need only advertise the route to TEw again; the previously
   advertised maps are still valid and can be used immediately.
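
   The independence of routes and maps can be pictured as two separate
   tables, seen here from AS J's point of view.  The table layout and
   function names below are assumptions made for illustration only.

```python
# Routes and maps live in separate tables (names are illustrative).
te_routes = {"TEw": {"X", "Y"}}      # TE -> next-hop ASes with a route to it
maps = {"Pa": "TEw", "Pd": "TEw"}    # destination prefix -> TE


def withdraw_route(te, next_hop):
    """Handle a route withdrawal; the map table is deliberately untouched."""
    te_routes.get(te, set()).discard(next_hop)


def announce_route(te, next_hop):
    """Handle a route announcement; again, no map is touched."""
    te_routes.setdefault(te, set()).add(next_hop)


def next_hops_for(prefix):
    """Look up the TE for a prefix, then the next hops with a route to it."""
    te = maps.get(prefix)
    return sorted(te_routes.get(te, set()))


before = next_hops_for("Pa")
withdraw_route("TEw", "X")    # the X-W link fails
during = next_hops_for("Pa")
announce_route("TEw", "X")    # the link is restored; no map is re-sent
after = next_hops_for("Pa")
```

   While the route via X is withdrawn, only Y remains as a next hop
   for Pa, and the map table is the same before and after the failure.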

3.5.  Load Balancing in Mapped-BGP

   A cornerstone of the performance benefits of Mapped-BGP is the fact
   that there are no policies associated with maps per se (other than
   the fact that they need to be source filtered to prevent hijacking,
   just as routes are source filtered today).  In other words, policies
   are applied to routes, but not to maps.  This raises the obvious
   question of which policies we give up because of this, and how we
   get them back.

   We can divide policies into two types: policies that act at the
   granularity of ASes, and those that act at the granularity of
   prefixes.  AS-level policies are preserved in Mapped-BGP, because
   these policies can be applied to TE-routes (of which there are,
   roughly, one per AS).  Indeed, AS-granularity policies become cheaper
   to enact, because there are fewer routes to deal with.  Examples of
   AS-based policies are: prefer routes to customers over other routes;
   prefer routes to peers over routes to providers; and do not export
   routes received from peers to non-customers.

   Most policies that act at the granularity of prefixes are for the
   purpose of traffic engineering.  There are a number of such examples.
   For instance, one ISP may give per-prefix MEDs to its neighbor in
   order to influence how packets enter each peering point.  This may
   be done either for the purpose of load balance, or in order to
   minimize the distance that packets need to travel within the
   receiving ISP.  Likewise, an ISP might set local preference values
   on a per-prefix basis to influence the outgoing load on each
   peering point.  A multi-homed site might deaggregate its prefix,
   and then use community attributes offered by its provider to do
   per-prefix path prepending or route filtering to influence the load
   on its incoming access links.

   Mapped-BGP improves upon existing inter-domain traffic engineering
   through two mechanisms: the Tunnel Endpoint Discriminator (TED), and
   "path splitting".  These mechanisms are simpler, more scalable, and
   expected to be more effective than the current set of BGP
   mechanisms.  It is hoped that the use of this simpler approach would
   simplify BGP configuration overall.

   Before discussing these mechanisms, we should point out the obvious:
   traffic engineering requirements necessarily put ASes in conflict
   with each other.  A simple example of this is illustrated below.
   Say that A wants to send half of its traffic to B and half to C,
   and D wants to receive 25% of its traffic from B and 75% from C.
   It may not be possible to satisfy both A's and D's requirements.

      A
     / \
    /   \
   B     C
    \   /
     \ /
      D

   Mapped-BGP's approach to this conflict is to provide a mechanism
   whereby the receiver can convey to the sender what its traffic
   engineering needs are, and the sender can honor or ignore the
   receiver's wishes.  This is similar in spirit to how MEDs work in
   legacy BGP.  Other legacy mechanisms, however, like path prepending,
   attempt to "force" the sender into honoring the receiver's incoming
   traffic engineering requirements by manipulating its next-hop
   selection algorithm.

3.5.1.  Incoming Load Balance at Sites

   Mapped-BGP's TED mechanism has already been partially described.  In
   the previous example, AS A conveys TED values to ASes W and Z, which
   in turn attach these TED values to the corresponding maps:

   From W:  Map:    TE=(TEw), AT=(..., <Pa,TEDaw>)
   From Z:  Map:    TE=(TEz), AT=(..., <Pa,TEDaz>)

   Assuming that these TEDs aren't suppressed through some aggregation
   somewhere, they are conveyed to all ASes in the Internet.  The TEDs
   are parameterless.  They are interpreted at each AS as an indication
   that more or less traffic should be directed to the associated TE.
   This interpretation, however, is entirely up to the head-end AS.  AS
   A uses TED values to control the volume of incoming traffic from Z
   and W as follows.  AS A sets some initial TED values, say TEDaw=50
   and TEDaz=50, and over some period of time (days or a couple of
   weeks) measures the incoming volume of traffic.  If the volume is
   not as desired, for instance too much traffic from W and not enough
   from Z, then AS A can modify the TED values, say to TEDaw=40 and
   TEDaz=60.  Over time, AS A can determine what the appropriate values
   are, as well as gain a sense of how future changes in TED values are
   likely to affect traffic load.

   How a head-end AS (one that transmits traffic to addresses in Pa)
   interprets the TED values is up to it.  The TED values may or may not
   have any effect on the AS's traffic engineering decisions.  The
   basic idea here is that if an AS cannot satisfy its own traffic
   engineering requirements while honoring the TED values, then it will
   ignore the TED values.  If on the other hand the head-end AS can both
   satisfy its traffic engineering requirements and honor the TED
   values, it will do so.  The hope is that enough head-end ASes will be
   able to honor the TED values to allow receiving ASes to control
   their incoming traffic.

   While exactly how an AS determines how to process TEDs is for further
   study, we can imagine a sequence of steps whereby the AS first
   determines which map+routes must ignore TEDs.  For instance, the AS
   might be doing hot-potato routing, and there are simply some
   destinations where one TE is more hot-potato than another and
   therefore preferred.  Assuming that there are remaining destinations
   for which the choice of TEs is roughly equivalent, the AS can then
   look at the TEDs and select on that basis.  For instance, if as in
   the previous example the TEDs are TEDaw=40 and TEDaz=60, the AS
   might choose TEz with probability 0.6 and TEw with probability 0.4
   (or each router could make its own selection on this basis).
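
   One possible realization of the probabilistic choice is sketched
   below.  How TEDs are interpreted is up to the head-end AS, so this
   is only one reading: the function name is hypothetical, and we
   assume that TED values act as relative weights.

```python
import random


def choose_te(candidates, rng=random):
    """Pick a TE with probability proportional to its TED value.

    candidates maps a TE name to the TED advertised for the destination
    prefix via that TE, e.g. {"TEw": 40, "TEz": 60}.
    """
    tes = list(candidates)
    weights = [candidates[te] for te in tes]
    return rng.choices(tes, weights=weights, k=1)[0]


rng = random.Random(1)  # seeded only to make the sketch repeatable
picks = [choose_te({"TEw": 40, "TEz": 60}, rng) for _ in range(10000)]
share_z = picks.count("TEz") / len(picks)
```

   Over many selections, share_z should be close to 0.6.  The same
   function could be run per router, per flow, or per packet.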

3.5.2.  Incoming Load Balance at Lower-tier ISPs

   The example above shows how a multihomed stub AS site can use TEDs to
   do incoming traffic engineering across multiple ISPs.  This is not
   the only case, however, where incoming traffic engineering across
   different ASes is required.  For instance, using the same figure,
   assume that AS W is a lower-tier ISP that is a customer of provider
   ISPs X and Y, and that W wishes to balance traffic arriving from X
   and Y.  This is made possible in Mapped-BGP using path splitting.

   Recall that TEs actually consist of a block of addresses.  Path
   splitting operates by splitting a TE-block into multiple sub-blocks
   (as many as there are links to balance over), and advertising each
   sub-block to a different neighbor AS.  This creates multiple paths
   that can be used to reach the same TE.  By associating a destination
   prefix with one path or another, a head-end AS can influence which
   path is used, and therefore the volume of traffic on a given link.

   For example, assume that AS W wants to control the incoming volume of
   traffic from ASes X and Y. Recall that in the earlier example
   (without path splitting), AS W advertised the following:

   Route:  AS-path=(W), NLRI=(TEw=40.1.1.0/28)
   Map:    TE=(TEw), AT=(<Pw-agg>, <Pd>, <Pa,TEDaw>)

   To do path splitting, what AS W instead advertises is the following:

   Route To X: AS-path=(W), NLRI=(TEwx=40.1.1.0/29)
   Route To Y: AS-path=(W), NLRI=(TEwy=40.1.1.8/29)
   Map:    TE=(TEw), AT=(<Pw-agg>, <Pd>, <Pa,TEDaw>)
   Split:  DS-AS=W, US-AS=(<X,TEDwx>, <Y,TEDwy>)
           where: DS-AS = Downstream AS
                  US-AS = Upstream AS

   The first thing to note about this is that the original map goes
   unchanged.  The second thing to note is that AS W has split its TE-
   block in half, and is advertising a separate route to each subblock
   (shown as TEwx and TEwy).  To be clear, the map still associates the
   reachable prefixes with the full original TE-block, but here that
   block is reachable via two paths.  What's more, AS W advertises one
   such route to X only, and the other to Y only.  This effectively
   gives upstream ASes some path control.  If they select a TE address
   within the TEwx subblock, then packets will get to W via X. Likewise
   if they select a TE address within the TEwy subblock, then packets
   will get to W via Y.
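
   The split of the TE-block in this example can be reproduced with
   Python's standard ipaddress module.  The helper names below are
   hypothetical, and the sketch assumes that the number of neighbors
   is a power of two so that the block divides evenly.

```python
import ipaddress


def split_te_block(te_block, neighbors):
    """Split a TE-block into equal sub-blocks, one per upstream neighbor.

    Assumes len(neighbors) is a power of two; other cases would need
    uneven sub-blocks, which this sketch does not attempt.
    """
    block = ipaddress.ip_network(te_block)
    extra_bits = (len(neighbors) - 1).bit_length()
    subblocks = list(block.subnets(prefixlen_diff=extra_bits))
    return dict(zip(neighbors, subblocks))


def upstream_for(te_address, split):
    """Return the neighbor whose sub-block contains the chosen TE address."""
    addr = ipaddress.ip_address(te_address)
    for neighbor, subblock in split.items():
        if addr in subblock:
            return neighbor
    return None


split = split_te_block("40.1.1.0/28", ["X", "Y"])
via_low = upstream_for("40.1.1.3", split)     # address inside TEwx
via_high = upstream_for("40.1.1.12", split)   # address inside TEwy
```

   Here split maps X to 40.1.1.0/29 (TEwx) and Y to 40.1.1.8/29
   (TEwy), so the TE address chosen by a head-end AS determines the
   neighbor through which packets enter W.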

   Finally, W has generated a split attribute in order to convey the
   relative volume that should enter via the two neighbors.  The
   Downstream-AS (DS-AS) is W itself.  There are two records for the
   Upstream ASes (US-AS), one for X and one for Y, each associated with
   a separate TED (TEDwx and TEDwy respectively).

   Given this, consider the behavior of AS J.  J will receive a route
   to TEwx from X, and to TEwy from Y.  J will also propagate these
   routes to its neighbors.  Now J, as a head-end AS, can choose
   between these
   two routes, on a router-by-router or even packet-by-packet basis if
   it wishes, to send packets to destinations in A, B, C, and D.
   Mechanistically it makes this choice by selecting a TE address from
   either the TEwx or TEwy subblock.  Assuming that J is willing to
   honor W's TEDs in making the choice, it would send more or less
   traffic along each route according to the value of the TEDs.  Indeed
   in this particular example, J can choose first how to send traffic to
   A (i.e. via TEz or TEw).  Of the traffic that J chooses to send to W,
   it can then choose how to split the traffic between X and Y.
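
   J's two-stage choice can be sketched as follows, assuming that J
   honors the TEDs and treats them as relative weights.  The function
   names and the values used for TEDwx and TEDwy are made up for this
   illustration.

```python
import random


def weighted_pick(options, rng):
    """Pick a key from {name: weight} with probability ~ weight."""
    names = list(options)
    return rng.choices(names, weights=[options[n] for n in names], k=1)[0]


def pick_tunnel(te_teds, split_teds, rng):
    """Stage 1: choose the TE using the TEDs carried in the maps.
    Stage 2: if that TE is split, choose the sub-block using the TEDs
    carried in the split attribute."""
    te = weighted_pick(te_teds, rng)
    if te in split_teds:
        return te, weighted_pick(split_teds[te], rng)
    return te, None


rng = random.Random(7)  # seeded only to make the sketch repeatable
te_teds = {"TEw": 40, "TEz": 60}                 # TEDaw, TEDaz
split_teds = {"TEw": {"TEwx": 30, "TEwy": 70}}   # made-up TEDwx, TEDwy
choices = [pick_tunnel(te_teds, split_teds, rng) for _ in range(1000)]
```

   Each entry in choices is a (TE, sub-block) pair.  The sub-block is
   None when TEz is chosen, because only TEw is split in this example.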

   Now consider the behavior of AS I. I will receive two TE-routes from
   X, one to TEwx with AS-path X-W, and the other to TEwy with AS-path
   X-J-Y-W.  AS I, as a head-end AS, can strictly speaking choose
   between these two paths based on which TE address it uses to tunnel
   packets to A, B, C, and D. In this particular example, the choice
   does not affect I's traffic (it all goes to X either way).  The
   choice does of course affect W's incoming load balance, as well as
   the length of the paths and the amount of traffic load at J and Y. In
   this case, I should almost certainly favor efficiency (shorter path)
   over W's load-balancing needs.  This is especially true considering
   that W may in any event satisfy its load balancing requirements even
   if I does send all packets on the shortest path, because other ASes
   can reasonably choose the path via Y.

   Note that this approach is frugal in its overhead.  W wants to
   balance two peering links, and so creates exactly two routes.
   Contrast this with today's situation, where an AS may need to
   deaggregate a prefix multiple times in order to get the granularity
   needed to effectively load balance.  What's more, the multiple routes
   can be aggregated back together by any AS.  This might well be done,
   for instance, by an AS that is relatively far from W, and that sends
   very little traffic to W. In doing this, the aggregating AS would
   also drop the split attribute.
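
   Re-aggregation of the split routes by a distant AS can be sketched
   with the standard ipaddress module.  The route representation (a
   dictionary with "nlri" and "split" keys) is an assumption made for
   this example, not a proposed encoding.

```python
import ipaddress


def aggregate_te_routes(routes):
    """Merge adjacent TE sub-block routes and drop the split attribute.

    routes is a list of dicts with an "nlri" prefix string and an
    optional "split" entry; this layout is illustrative only.
    """
    nets = [ipaddress.ip_network(r["nlri"]) for r in routes]
    merged = ipaddress.collapse_addresses(nets)
    return [{"nlri": str(n)} for n in merged]


learned = [
    {"nlri": "40.1.1.0/29", "split": {"X": "TEDwx"}},   # TEwx, via X
    {"nlri": "40.1.1.8/29", "split": {"Y": "TEDwy"}},   # TEwy, via Y
]
aggregated = aggregate_te_routes(learned)
```

   The two /29 routes collapse into a single route for 40.1.1.0/28,
   the original TE-block, with no split attribute attached.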

3.5.3.  Multi-exit discrimination with Mapped-BGP

   The above paragraphs describe how an AS can influence the volume of
   traffic entering from different ISPs.  It is also important to be
   able to influence how traffic enters at multiple peering points
   with the same neighbor ISP.  Today MEDs are used for this, where
   the MED value is set on a per-prefix basis.  In Mapped-BGP, an AS L
   will send two kinds of packets to its neighbor AS K: packets that are
   detunneled at K, and packets that are not detunneled at K. AS K can
   of course set MEDs on the TE-routes for packets that it does not
   detunnel.  Normally, however, an AS K only advertises one TE-route
   per neighbor AS for its own TE.  As a result, there is no basis for
   discriminating packets addressed to K's TE.

   One way to approach this problem might be to use TE-route splitting
   here as well.  However, this approach leads to either a potentially
   large number of split TE-routes, or a large number of additional
   maps.  (Explanation of why left out for now.)  In general it seems
   inappropriate to burden the rest of the Internet with a routing
   matter that is strictly between two neighboring ASes.  As such, the
   solution is confined to the two neighbors.  Specifically, when AS L
   wishes to let its neighbor AS K dictate which exit it should use on
   a per-prefix basis, AS L must detunnel packets otherwise destined
   for K.  In other words, the routers in AS L are configured to
   detunnel packets with K's TE addresses.  Once detunneled, routers
   in L route packets to K based on the inner header destination
   address.
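
   The detunnel-then-route behavior in AS L can be sketched as below.
   All addresses, exit names, and MED values are invented for this
   illustration, and we assume, as in legacy BGP, that the exit with
   the lowest MED is preferred.

```python
import ipaddress

# Per-prefix MEDs that K gives L for each peering point (made-up values).
meds = {
    ipaddress.ip_network("30.1.0.0/16"): {"exit-east": 10, "exit-west": 20},
    ipaddress.ip_network("30.2.0.0/16"): {"exit-east": 20, "exit-west": 10},
}
k_te_block = ipaddress.ip_network("40.2.2.0/28")  # hypothetical TE-block of K


def exit_for(outer_dst, inner_dst):
    """Return the peering point L uses, or None if L does not detunnel."""
    if ipaddress.ip_address(outer_dst) not in k_te_block:
        return None  # not tunneled to K: forward on the outer header
    inner = ipaddress.ip_address(inner_dst)
    matches = [p for p in meds if inner in p]
    if not matches:
        return None
    best = max(matches, key=lambda p: p.prefixlen)   # longest-prefix match
    return min(meds[best], key=meds[best].get)       # lowest MED wins


east = exit_for("40.2.2.1", "30.1.5.5")
west = exit_for("40.2.2.1", "30.2.0.9")
untouched = exit_for("50.0.0.1", "30.1.5.5")
```

   Packets tunneled to K's TE are routed on their inner destination
   address, so the two prefixes leave L through different exits, while
   packets tunneled elsewhere are not detunneled.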

   There are two ways in which L could learn about the MEDs associated
   with K's prefixes.  One is for K to simply advertise these prefixes
   to L as normal BGP routes with MEDs attached.  Another would be for K
   to attach the MEDs to the maps it sends to L. L would strip these
   MEDs before forwarding the maps onwards.  At this time we don't have
   a preference for one approach over the other.

   Finally, it should be pointed out that Mapped-BGP in general creates
   more choices for path selection, and therefore more choices for
   traffic engineering (both outgoing and incoming) compared to legacy
   BGP.  With legacy BGP, an AS makes one next-hop-AS choice per
   destination prefix.  With Mapped-BGP, an AS can make multiple next-
   hop-AS choices per destination prefix.

3.6.  Aggregation in Mapped-BGP

   Mapped-BGP has all of the aggregation features of legacy BGP
   (physical topological aggregation), as well as new opportunities for
   aggregation beyond what BGP offers in the form of inter-domain
   virtual aggregation.  Virtual aggregation can be used in a number of
   useful ways.  It can be used in conjunction with geographical address
   assignment to provide a realistic way to implement geographical
   addressing.  It can be used opportunistically to allow small groups
   of ISPs to aggregate some portion of the address space that is
   already mostly assigned to them.  But it can also be used generally
   as a way of shrinking FIBs in the "core" of the network (i.e. the
   cores of tier-1 ISPs) and dramatically shrinking RIBs and FIBs
   everywhere else.

   Section 3.4.1 describes how an ISP can aggregate the prefixes of its
   customers.  As with legacy BGP, this is done by an ISP that "owns"
   the address space that it is aggregating.  It is in this sense that
   Mapped-BGP aggregation is similar to BGP aggregation.

   Mapped-BGP also has a mechanism that allows for inter-domain virtual
   aggregation similar in spirit to that described in the intra-domain
   virtual aggregation draft [I-D.francis-idr-intra-va].  This
   mechanism, especially when used in conjunction with appropriate
   address assignment policies, gives Mapped-BGP more opportunities for
   aggregation than legacy BGP.  The mechanism is this: any router can
   become an Aggregation Point Router (APR) for a Virtual Prefix (VP).
   A VP is a prefix that is not topologically aggregatable, and must be
   larger (have a shorter prefix length) than any topological prefix.
   An APR for a given VP advertises that VP as a route in BGP.  This
   route must be tagged with a transitive attribute that indicates that
   it is a route for a VP.  This allows other routers to know that all
   subprefixes are reachable via the VP route.  An APR must FIB-install
   every subprefix within the VP.  These subprefixes may be reachable
   natively as routes, or through map tunnels, but they must be
   reachable.
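
   The two obligations described here can be expressed as simple
   checks.  This is a sketch under our own assumptions: the function
   names are hypothetical, and the prefixes are illustrative values.

```python
import ipaddress


def can_fib_suppress(prefix, vp_routes):
    """A non-APR router may suppress a prefix that lies inside a VP route."""
    net = ipaddress.ip_network(prefix)
    return any(net != vp and net.subnet_of(vp) for vp in vp_routes)


def apr_missing(vp, known_prefixes, fib):
    """Sub-prefixes of the VP that an APR has not FIB-installed.

    An APR must install all of them, so a correct APR yields [].
    """
    return [p for p in known_prefixes
            if p != vp and p.subnet_of(vp) and p not in fib]


vp = ipaddress.ip_network("20.0.0.0/14")
known = [
    ipaddress.ip_network("20.0.0.0/16"),
    ipaddress.ip_network("20.1.0.0/16"),
    ipaddress.ip_network("30.0.0.0/16"),   # outside the VP
]
apr_fib = {known[0]}                        # an incomplete APR FIB
missing = apr_missing(vp, known, apr_fib)
inside = can_fib_suppress("20.0.0.0/16", [vp])
outside = can_fib_suppress("30.0.0.0/16", [vp])
```

   In this example the APR still has to install 20.1.0.0/16 before it
   may advertise the VP, and only prefixes inside the VP are candidates
   for suppression elsewhere.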

   We can illustrate this using the example topology above (though note
   that this example doesn't represent the best usage of this feature).
   Imagine that AS Z has a single-homed customer AS E with prefix
   Pe=20.0/16.  Note in particular that Pe comes out of the aggregate
   prefix that W advertises (Pw-agg=20.0/14).  With both legacy- and
   Mapped-BGP, W could still advertise the aggregate Pw-agg: Pe would
   "punch a hole" in the aggregate and routes to Pe would go to Z rather
   than W. Note, however, that with Mapped-BGP, W receives Z's map for
   Pe:

   Map:    TE=(TEz), AT=(..., <Pe=20.0/16>)

   As a result, W is able to tunnel packets destined to Pe to Z. If AS W
   is willing to do that on behalf of other ASes (i.e. act as a transit
   for packets to Pe even when W is not on the normal BGP path to Pe),
   then it can advertise Pw-agg as a VP.  This would allow a remote AS
   to suppress loading the finer-grained prefix Pe into its FIB.  The
   remote AS could forward all packets to Pw-agg towards W. These
   packets will either reach W, in which case W would in turn tunnel the
   packets to Z, or they would reach a router on the path to W that has
   installed Pe, in which case this router will tunnel the packets to Z.
   Obviously in this case this results in added latency for packets that
   reach W, as well as extra load for W. As such, this mechanism should
   not be used willy-nilly, and in fact would probably not be used in
   this particular example.  Situations where use of Virtual
   Aggregation is appropriate are described later on.
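
   The forwarding behavior in this example follows from ordinary
   longest-prefix matching, as the sketch below illustrates.  The FIB
   is modeled as a dictionary and the action strings are descriptive
   labels of our own, not protocol elements.

```python
import ipaddress


def lookup(fib, dst):
    """Longest-prefix match over a {prefix: action} FIB."""
    addr = ipaddress.ip_address(dst)
    matches = [p for p in fib if addr in p]
    if not matches:
        return None
    return fib[max(matches, key=lambda p: p.prefixlen)]


PW_AGG = ipaddress.ip_network("20.0.0.0/14")
PE = ipaddress.ip_network("20.0.0.0/16")

# The remote AS has FIB-suppressed Pe and holds only the VP route.
remote_fib = {PW_AGG: "forward toward W"}
# W advertises the VP, so it must FIB-install Pe.
w_fib = {PW_AGG: "route within W", PE: "tunnel to TEz"}

hop1 = lookup(remote_fib, "20.0.3.4")   # a destination inside Pe
hop2 = lookup(w_fib, "20.0.3.4")
```

   The remote AS forwards the packet toward W on the strength of the
   VP route alone, and W, which has installed Pe, tunnels it to Z.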

   In the above example, the route for Pe could be suppressed from the
   FIB (or equivalently, the "routing table" as defined by BGP), but it
   is still necessary to keep the maps in the map-RIB.  Keeping maps in
   the map-RIB is unlikely to become a scaling problem.  The reason is
   that it doesn't take very much processing to distribute a map that is
   not loaded into the FIB.  A router needs to determine that a map is
   valid, and then determine that the map can be FIB-suppressed, but
   once that is done the map only needs to be stored and transmitted to
   neighbors.  It seems reasonable to expect a router to be able to
   store and process millions of FIB-suppressed maps.

   As an aside, even though it should be possible to distribute a very
   large number of FIB-suppressed maps, it is in fact possible to not
   require many ASes to store the maps at all.  This is because in
   principle the only ASes that have to keep the maps are those that
   are needed to distribute the maps to where they need to go.  In the
   above example, ASes I and X need to keep the map for Pe, because
   they convey it from Z to W.  But ASes J and Y can in principle
   ignore the Pe map altogether.  Unfortunately there is no simple way,
   other than static configuration, to tell an AS whether it needs to
   distribute a given set of maps.  On the other hand, in many cases
   this configuration will be relatively straightforward, as discussed
   later.

   Unless protected against, the use of VPs creates a possibility for
   transient loops.  The problem is illustrated using the figure below.
   This figure is a blow-up of AS Z from the example just described.
   It shows two border routers in AS Z (z1 and z2).  z1 is connected
   to the border router e1 in customer AS E, and z2 is connected to
   the border router i1 in AS I.  Note that, as the router with the
   customer interface, z1 is responsible for advertising maps about
   Pe (either add or remove).  As the router with an ISP interface,
   z2 will be the TE for packets tunneled to TEz and destined for Pe.

          +--------+   +-------+
          |        |   |       |
          |     z2-+---+-i1    |
          |        |   |       |
          |  z1    |   |       |
          |  /     |   |       |
          +-+------+   +-------+
           /  AS Z       AS I
      +---+-+
      |  /  |
      | e1  |
      |     |AS E (with prefix Pe=20.0/16)
      +-----+

   Now imagine that the z1-e1 link goes down, and that z1 detects this.
   We can divide subsequent behavior into three time periods:

   1.  Only z1 knows that the link has failed.
   2.  z1 and z2 know that the link has failed but no routers outside of
       AS Z know this.
   3.  All routers, in particular routers in AS W, know that the link
       has failed.

   Consider the behavior of z1 during the first two periods.  It has a
   route to prefix Pw-agg.  Therefore, if it receives a packet destined
   to Pe, it would be expected to forward the packet towards AS W.
   Indeed, routers in AS W might have another route to Pe that z1 is
   unaware of.  On the other hand, AS W may not have another route, and
   so would simply forward any packets received for Pe back through the
   tunnel to z2, thus forming a loop.  In other words, z1 doesn't know
   if a loop has formed or not.  If a loop has formed, then z1 would not
   want to forward packets to Pw-agg, and if a loop has not formed, then
   z1 would want to forward packets to Pw-agg.  The same holds for other
   routers in AS Z.

   The behavior we would like is for z1 and z2 to recognize whether any
   given packet destined to Pe has looped or not.  If it has, it should
   be dropped.  If it hasn't, then it should be forwarded to Pw-agg.
   During period 2, z2 can tell that packets received via the TEz tunnel
   may be looping and must therefore drop those packets.  But during the
   first period, z2 will forward any received packets towards z1.
   Therefore, z1 needs to be able to tell whether a packet it receives
   arrived via z2's tunnel or not.  The way to do this is to have z2
   tunnel packets destined for e1.  This could be for instance an MPLS
   LSP with e1 as its target, as described in the Intra-domain Virtual
   Aggregation draft.
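
   Our reading of the desired per-packet behavior at z1 and z2 can be
   summarized as two small decision functions.  The function names,
   arguments, and action strings are hypothetical, and the sketch
   covers only packets destined to Pe.

```python
def z2_action(via_tez_tunnel, knows_link_down):
    """Action at z2 for a packet destined to Pe."""
    if not knows_link_down:
        # Period 1: z2 still believes e1 is reachable, so it tunnels the
        # packet toward e1 (for instance over an MPLS LSP).
        return "tunnel to e1"
    # Periods 2 and 3: a packet that came in through the TEz tunnel may
    # be looping, so it is dropped; other packets follow the VP route.
    return "drop" if via_tez_tunnel else "forward to Pw-agg"


def z1_action(via_z2_tunnel, link_up):
    """Action at z1 for a packet destined to Pe."""
    if link_up:
        return "deliver to e1"
    # The link is down.  A packet that arrived through z2's tunnel
    # entered AS Z via TEz and may have looped, so it is dropped.
    return "drop" if via_z2_tunnel else "forward to Pw-agg"


period1 = (z2_action(True, False), z1_action(True, False))
period2 = z2_action(True, True)
native = z1_action(False, False)
```

   In period 1, z2 tunnels the packet and z1 drops it; in period 2, z2
   drops it directly.  A packet that never passed through the tunnel is
   forwarded toward Pw-agg in either period.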

   Note also that it is possible to distribute VPs as maps rather than
   as routes.  We currently see no advantage to this, but leave it for
   further study nonetheless.

   In the following paragraphs, we outline various ways in which VPs may
   be used in Mapped-BGP.

3.6.1.  Geographic or Metro Addressing

   There have been many proposals in the past to deploy geographic
   addressing.  The basic idea is simple: if a site accesses the
   Internet within a particular geographic area, then it is assigned
   addresses from a prefix dedicated to that area.  This makes both
   multihoming and changing providers easier, because the site is likely
   to multihome to providers in the same area, or to switch to another
   provider in the same area.  This allows ISPs serving that area to
   aggregate the area prefix.  The criticism of geographic addressing
   has always stemmed from the fact that existing routing algorithms
   require physical connectivity within the aggregate topology.  There
   is no regulatory structure in place today to ensure that this
   physical connectivity is created and maintained.

   With Mapped-BGP, the need for intra-area physical connectivity is not
   as critical.  Of course, it is still important, because to the extent
   that such physical connectivity does not exist, paths will be longer.
   Mapped-BGP, however, allows for a great deal more flexibility as to
   how much physical connectivity needs to exist, and provides a
   scalable re-routing mechanism for when intra-area links do fail.

   To operate geographic addressing with Mapped-BGP, of course first an
   area needs to be identified, and an address space reserved for it.
   Call this address space the area-prefix.  It is too late to do this
   for IPv4, but it could certainly be done for IPv6 (indeed, such
   addresses have already been defined).  Of the ISPs that have a
   presence in the area, some will be willing to provide general transit
   and others will only provide service for their customers.  These are
   referred to here as transits and non-transits, and the routers in
   these ASes are called transit routers and non-transit routers
   respectively.

   Transit routers within the area (i.e. those that provide access for
   sites within the area) are configured to advertise a VP-route for the
   area-prefix.  These "area routers" must also FIB-install maps for all
   sub-prefixes within the area-prefix.  Note that a given ISP may span
   multiple areas.  Only the routers within a given area need advertise
   the VP-route and FIB-install the sub-prefixes.  Other routers in the
   ISP but not in the area may FIB-suppress those sub-prefixes.  The
   area routers would separately advertise maps for the individual sub-
   prefixes, using the TE assigned to the AS (i.e. as normal).

   Customers of non-transit routers within the area would still be
   assigned area-prefix addresses, but non-transit routers would not
   advertise the VP-route.  Rather, they would only advertise maps and
   TE-routes for their customers' individual prefixes as normal.

   The following figure illustrates this.

                             ,------.
                            /        `.
                       ----'---------  `.
                      /  /           \   \ Area (Y, Z, J, K
      Transit ISPs   X--;-----Y-------Z   `.  and L are in
                    / \ ;    / \     / \    :   the area)
                   /   /    /   \   /   \   |
                  /   ; \  /     \ /     \  |
    Non-transit  I----|--J--------K-------L |
      ISPs        \   :          /          ;
                   ----\---------         ,'
                        `-.            ,-'
                           `----------'

   Here we see three transit ISPs (X, Y, and Z), and four non-transit
   ISPs (I-L), some of which are in the area and some of which are out
   of the area, as shown.  The non-transit ISPs are customers of the
   transit ISPs as shown, and peer with each other as shown.  Assume
   that Y and Z span many areas, have multiple peering points with each
   other, and peer with other ISPs (X and others not shown).

   A remote AS would receive the following maps and routes from area
   ASes:

   Map:    TE=(TEy), AT=(<Py=20.0.0/24>)
   Map:    TE=(TEz), AT=(<Pz=20.0.1/24>)
   Map:    TE=(TEj), AT=(<Pj=20.0.2/24>)
   Map:    TE=(TEk), AT=(<Pk=20.0.3/24>)
   Map:    TE=(TEl), AT=(<Pl=20.0.4/24>)
   (.... TE-routes for all of the above maps ....)
   one or both of the following VP routes:
   Route:  AS-path=(Y...), NLRI=Pa=20.0/16
   Route:  AS-path=(Z...), NLRI=Pa=20.0/16

   What's more, although the routes are not shown, assume that the
   routes to TEj and TEk are split for load balance.  (Of course there
   are likely to be more prefixes advertised from each TE.  One is
   enough to illustrate the technique.)

   Assume for now that there is no RIB suppression of maps: all maps are
   distributed to all ASes globally.  The first thing to note is that
   any AS could choose to FIB-suppress any of the /24 maps and still be
   able to deliver packets to the destinations.  However, in each of
   these cases, FIB-suppression would have a greater or lesser impact on
   traffic to the destination.

   In the case of Pj, remote ASes that choose to FIB-install Pj can use
   the TEDs in the split route (as well as their own traffic
   engineering considerations) to decide whether to route via X or Y.
   Packets from remote ASes that FIB-suppress Pj will be routed to Y or
   Z, depending on which is the better route.  Packets to Y will reach
   J through the Y-J link.  Packets to Z may be routed to J via X, but
   the fact that Y and Z have POPs in the area, and X doesn't, suggests
   that a larger proportion of Z's packets may reach J through Y.
   What's more, packets may reach Z via X, only to be tunneled to J
   back through X.  Having said that, a remote AS whose route to Z or Y
   is via X can tell that FIB suppression is likely to result in a
   longer path, and so may be less likely to FIB-suppress.  Ultimately,
   FIB-suppressing Pj is likely to produce significantly more traffic
   on the Y-J link compared to the X-J link.

   To some extent J might be able to counter this imbalance with TEDs.
   Another option, however, could be for X and Y to offer explicit load
   balancing services to J. In this case, J could supply a separate pair
   of locally advertised TEDs that X and Y use to balance traffic to J.
   For instance, if the Y-J load is too heavy, and the X-J load too
   light, J could ask Y to divert some of its traffic via X, using the
   split route to TEj that traverses X.

   Note, however, that if the Y-J link goes down, all traffic will
   successfully reach J through X, even if Pj is suppressed.  This is
   because Y and Z will find that the only route to J is via X, and
   tunnel packets accordingly.  Ultimately J might easily find that
   multihoming to an ISP not in its area is worth doing.

   Note also that this scenario creates the possibility of transient
   loops, similar to those described in Section 3.6 between ASes Z and
   W. For instance, if the X-J link goes down, but AS Y doesn't know it
   yet and continues to tunnel packets to X (either for load balancing
   or because the Y-J link has gone down), then AS X would just route
   the packets back to the VP area aggregate.  As described previously,
   the solution is for routers in X to drop packets when they know they
   have been received via the TEx tunnel, but to forward them to the VP
   otherwise.

   In the case of Pk, FIB-suppression of its map by remote ASes will
   eliminate their ability to load balance traffic between Y and Z.
   From their perspective, all traffic to K would be routed to Pa,
   which reaches either Y or Z depending on which path is shorter.  In
   this case, as with X and Y above, Y and Z could offer explicit load
   balancing services to K.  As a result, K's multihoming could be
   hidden from the vast majority of routers' FIBs while still providing
   K with robust and load-balanced multihoming service.

   In the case of Pl, FIB-suppression of its map by remote ASes may
   result in some packets taking a longer path than they otherwise
   might.  For instance, X may choose to route some Pl-destined packets
   to Y even though a path to Z would be the shorter path.

   Note that in the above example, by not advertising a route to Pa, J,
   K, and L avoid becoming transits for other destinations in the area.
   Rather, these ASes can control the extent to which they do transit
   traffic through control of the routes they propagate.  For instance,
   if J wants for some reason to transit traffic for its peer K, it
   would propagate the TE-route for TEk (as well as the map for TEk) as
   appropriate.

   Now let's assume, in order to better scale RIBs, that we do not wish
   to propagate all maps to all Internet ASes.  (Though once again we
   point out that due to the relative efficiency of map distribution,
   such scaling is unlikely to be necessary for the foreseeable future,
   if ever.)  The fact that Y and Z can in principle load balance for
   their customers makes this option tenable.  In this example, Y and Z
   are the only transit ASes participating in the geographic areas.  For
   now let's assume that Y and Z have multiple peering points in
   multiple geographic locations (i.e., both are national or
   multi-national ISPs with significant territorial overlap).  In this
   case it is highly unlikely that Y and Z will become partitioned from
   each other.  Given that, it might be deemed reasonable that Y and Z
   only need to distribute area subprefix maps for area ASes to each
   other.  Thus all other ASes never get maps for subprefixes in the
   area.

   In this example, J would still propagate its map (and TE-route) to X
   and I.  Although I, as a non-transit, would likely not further
   propagate J's map, X most likely would.  As a result, J's map would
   be propagated Internet-wide, but not K's or L's.  As long as most
   multi-homing is "in area", most maps could be suppressed, greatly
   reducing both RIB size and FIB size.

   Now let's assume that Y and Z only peer in one place (i.e., between
   their POPs in the area).  Assume further that they both peer with X
   in multiple places.  In other words, X serves as a robust backup
   route between Y and Z should their single peering point fail.  In
   this case, X must be willing to propagate area maps between Y and Z
   (along with the TE-routes for all area ASes).

   More generally, anywhere there is a desire to not propagate maps, the
   area ASes would need to evaluate the richness of paths and determine
   which additional ASes need to propagate maps.  These additional ASes
   would need to agree to do so, and would need to be configured as to
   where to propagate the maps.  It might also be desirable to have a
   flag associated with the area maps indicating that they don't need to
   be globally propagated.  This way, if an AS does accidentally leak
   the maps, they don't get distributed everywhere.
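
   A minimal sketch of how such a flag might be honored follows.  The
   AreaMap record, its area_only field, and the per-AS set of configured
   neighbors are all hypothetical; this draft does not define an
   encoding for the flag.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AreaMap:
    prefix: str           # mapped prefix
    tunnel_endpoint: str  # TE the prefix maps to
    area_only: bool       # hypothetical "not for global propagation" flag


def should_propagate(area_map: AreaMap, neighbor_as: str,
                     configured_neighbors: FrozenSet[str]) -> bool:
    """Return True if the map may be sent to neighbor_as.

    Unflagged maps are propagated as usual.  Flagged maps go only to
    neighbors this AS has been explicitly configured to send them to,
    so a leaked map stops at the first AS without such configuration.
    """
    if not area_map.area_only:
        return True
    return neighbor_as in configured_neighbors
```

   For example, X in the scenario above would be configured with Y and Z
   as the neighbors to which flagged area maps may be sent.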


3.6.2.  Opportunistic AS aggregation clusters

   The previous section on geographic addressing assumes that addresses
   have been assigned with geographic aggregation in mind, and so
   doesn't apply to IPv4.  However, IPv4 addresses have been assigned in
   regional blocks for some time now.  For instance, IANA has assigned
   11 prefixes to RIPE (5 /8's, 3 /7's, 2 /6's, and 1 /5), at least
   according to a RIPE database document.  Presumably most of these have
   been assigned by ISPs to customers in Europe (though many of these
   may be multi-homed outside of Europe).  Given this, there may well be
   opportunities for "clusters" of richly inter-connected ISPs to
   advertise an aggregate, whereby most members of the aggregate are
   within those ISPs.  The extent to which these opportunities exist is
   for further study.

   If they do exist, however, they can be exploited by Mapped-BGP in a
   fashion very similar to the geographic addressing described above.
   Specifically, a VP-route for the prefix would be advertised by
   routers in the cluster, and these routers would FIB-install all
   sub-prefixes for the VP.

   An important difference between engineered geographic addresses and
   opportunistic AS clusters is that, in the latter, there will be more
   "stray" sites: sites that have an address within the cluster VP but
   are not physically attached to any cluster ASes.  Because of this, it
   typically won't be easy to identify a set of ASes that would be
   willing to suppress propagation of maps to ASes outside that set.  So
   we should assume that maps will be propagated, and that the only
   scaling opportunity comes from FIB reduction.

   When remote ASes do FIB suppression, they should prefer to suppress
   prefixes within the cluster over those outside the cluster, on the
   assumption that paths to prefixes outside the cluster will be longer.
   To do this, remote ASes need to identify which prefixes are in the
   cluster and which are out.  One way to do this would be to include
   all ASes in the cluster in the VP-route.  TE-routes whose paths do
   not contain any of the cluster ASes would be considered to be
   outside the cluster.
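
   The in-or-out test described above could look roughly like the
   following.  The function names and data layout are hypothetical: each
   prefix is associated with the AS path of the TE-route to its tunnel
   endpoint, and the cluster membership is assumed to have been learned
   from the VP-route.

```python
from typing import Dict, List, Set


def in_cluster(as_path: List[int], cluster_ases: Set[int]) -> bool:
    """True if the TE-route's AS path contains any cluster AS."""
    return any(asn in cluster_ases for asn in as_path)


def suppression_order(te_paths: Dict[str, List[int]],
                      cluster_ases: Set[int]) -> List[str]:
    """Return prefixes with in-cluster ones first.

    A remote AS that must suppress some maps from its FIB would take
    candidates from the front of this list.  sorted() is stable, so
    prefixes in the same class keep their original relative order.
    """
    return sorted(
        te_paths,
        key=lambda prefix: not in_cluster(te_paths[prefix], cluster_ases),
    )
```

   This sketch only separates the two classes; how many prefixes to
   suppress, and how to rank prefixes within a class, is left open.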

3.6.3.  Generalized Inter-domain Virtual Aggregation

   Virtual aggregation could also be deployed in a general fashion,
   whereby the global address space is carved up into VPs, and
   individual routers are assigned as APRs for different VPs.  This is
   very much in the spirit of the Intra-domain VA draft
   [I-D.francis-idr-intra-va], but with a couple of key differences.
   First, the extra hop suffered by VA paths
   would only occur in one ISP, the first one to tunnel the packet to
   the destination TE.  As such, the load and latency penalty for Inter-
   domain VA is significantly less.  Second, VA could be deployed in
   such a way that the Tier-1 ISPs maintain full routes (i.e. have APRs
   for all VPs), but lower-tier ISPs do not maintain any APRs.  Rather,
   lower-tier ISPs keep VP-routes and any additional routes or maps that
   they wish to install.  As a result, RIBs and FIBs in lower-tier ISPs
   could be almost arbitrarily small while still having the ability to
   load balance both incoming and outgoing traffic.


4.  Performance Benefits

   This section summarizes the performance benefits of Mapped-BGP.  Note
   that none of the following stated benefits have been quantified.

   1.  Mapped-BGP decreases the amount of processing needed to handle a
       prefix.  This is primarily because most policies currently needed
       to compare and select paths and determine how to advertise routes
       are not required for processing maps, and the majority of
       prefixes will be distributed as maps rather than routes.  On the
       other hand, Mapped-BGP introduces a new policy decision, namely
       processing TEDs for that fraction of prefixes to which they
       apply.
   2.  Mapped-BGP requires less RIB storage space, primarily because
       during steady state the map for a given prefix is the same no
       matter which neighbor it is heard from.  Storage can be
       compressed by exploiting this.
   3.  Peering sessions initialize faster in Mapped-BGP.  This is
       because in general only the routes (of which we might expect a
       few tens of thousands at most) need to be conveyed before packets
       can start flowing (see Section 3.4.6.1).
   4.  Mapped-BGP will have fewer "big" events.  This is because route
       changes in Mapped-BGP affect only routes, not maps.  What's more,
       if the FIB is organized in a tiered fashion (prefix points to a
       TE, which points to a next hop), then a change in TE next hop
       only requires a single update to the FIB, not one update for each
       impacted prefix.  On the other hand, Mapped-BGP is likely to have
       more "small" events, because each map will be propagated both
       when its add/remove status changes and when its TED changes.
       That said, with virtual aggregation, many or even most map
       updates don't even impact the FIB.
   5.  Global convergence in Mapped-BGP will in general be faster.  This
       is primarily because changes to maps can be distributed before
       any policy decisions are made on those changes.  This in turn is
       possible because maps don't change as they are propagated through
       the Internet.  This allows an AS to first quickly distribute a
       received map and only afterwards process it.  Indeed, map changes
       that involve only modifications to the TED can be processed much
       later (on the order of minutes).
   6.  Load balancing across ASes is both more accurate and more
       efficient in Mapped-BGP.  This is because TEDs allow for a
       fine-grained description of how much load is desired.  This is in
       contrast to legacy BGP, where granularity depends on the number
       of prefixes available to select among.
   7.  With virtual aggregation, Mapped-BGP provides significant
       opportunities for new aggregation.
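
   The tiered FIB organization mentioned in item 4 can be illustrated
   with a small sketch.  The class and method names are hypothetical,
   and an exact-match dictionary lookup stands in for the longest-prefix
   match a real FIB would perform.

```python
from typing import Dict, Optional


class TieredFib:
    """Two-level FIB sketch: prefix -> TE, then TE -> next hop."""

    def __init__(self) -> None:
        self.prefix_to_te: Dict[str, str] = {}
        self.te_to_next_hop: Dict[str, str] = {}

    def install_map(self, prefix: str, te: str) -> None:
        """Install a map: the prefix is reached via tunnel endpoint te."""
        self.prefix_to_te[prefix] = te

    def set_te_next_hop(self, te: str, next_hop: str) -> None:
        """Install or change the route to a TE.

        This is a single write no matter how many prefixes map to te.
        """
        self.te_to_next_hop[te] = next_hop

    def lookup(self, prefix: str) -> Optional[str]:
        """Resolve prefix -> TE -> next hop, or None if either is absent."""
        te = self.prefix_to_te.get(prefix)
        if te is None:
            return None
        return self.te_to_next_hop.get(te)
```

   With three prefixes mapped to the same TE, a single call to
   set_te_next_hop() changes the result of lookup() for all three.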


5.  Normative References

   [I-D.francis-idr-intra-va]
              Francis, P., Xu, X., and H. Ballani, "FIB Suppression with
              Virtual Aggregation and Default Routes",
              draft-francis-idr-intra-va-01 (work in progress),
              September 2008.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.


Authors' Addresses

   Paul Francis
   Cornell University
   4108 Upson Hall
   Ithaca, NY  14853
   US

   Phone: +1 607 255 9223
   Email: francis@cs.cornell.edu


   Xiaohu Xu
   Huawei Technologies
   No.3 Xinxi Rd., Shang-Di Information Industry Base, Hai-Dian District
   Beijing, Beijing  100085
   P.R.China

   Phone: +86 10 82836073
   Email: xuxh@huawei.com


   Hitesh Ballani
   Cornell University
   4130 Upson Hall
   Ithaca, NY  14853
   US

   Phone: +1 607 279 6780
   Email: hitesh@cs.cornell.edu


Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

