Network Working Group                                        P. Lapukhov
Internet-Draft                                               E. Nkposong
Intended status: Informational                     Microsoft Corporation
Expires: March 06, 2014                               September 02, 2013


Centralized Routing Control in BGP Networks Using Link-State Abstraction
                       draft-lapukhov-bgp-sdn-00

Abstract

   Some operators deploy networks consisting of multiple BGP Autonomous
   Systems (ASNs) under the same administrative control.  There are also
   implementations which use only one routing protocol, namely BGP, as
   in [I-D.lapukhov-bgp-routing-large-dc], for example.  In such
   designs, inter-AS traffic engineering is commonly implemented using
   BGP policies, by configuring multiple routers at the ASN boundaries.
   This distributed policy model is difficult to manage and scale due to
   its dependency on complex routing policies and the need to develop
   and maintain a model for per-prefix path preference signaling.  One
   example of such models could be standard BGP community-based (see
   [RFC1997]) signaling, which requires careful documentation and
   consistent configuration.  Furthermore, automating such policy
   configuration changes for the purpose of centralized management
   requires additional effort and depends on a particular vendor's
   configuration management interface (CLI extensions, NETCONF
   [RFC6241], etc.).

   This document proposes a method for inter-AS traffic engineering for
   use with the kind of deployment scenarios outlined above.  No
   protocol changes or additional features are required to implement
   this method.  The key to the proposed methodology is a new software
   entity called "BGP Controller" - a special purpose application that
   peers with all eBGP speakers in the managed network.  This controller
   constructs the live state of the underlying BGP ASN graph and
   presents a multi-topology view of this graph via a simple API to
   third-party applications that perform network traffic engineering.
   An example application could be an operational tool used to drain
   traffic from network devices.  In response to changes in the logical
   network topology proposed by these applications, the controller
   computes new routing tables, and pushes them down to the network
   devices via the established BGP sessions.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.





Lapukhov & Nkposong      Expires March 06, 2014                 [Page 1]

Internet-Draft           draft-lapukhov-bgp-sdn           September 2013


   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 06, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Overview
     2.1.  Use Cases
     2.2.  Architectural Assumptions
     2.3.  BGP Controller
   3.  Link-State Abstraction and Multiple Topologies
     3.1.  Link-State Discovery Process
     3.2.  The Default Topology
     3.3.  Alternate Topologies
     3.4.  Overloading a Vertex
   4.  Implementation Details
     4.1.  Programming Next-Hops
     4.2.  Equal-Cost Multipath Routing
     4.3.  Prefix Discovery Process
     4.4.  Sequenced Device Programming
     4.5.  Mapping Prefixes to Topologies
     4.6.  Autonomous Systems with iBGP Peering Mesh
     4.7.  Minimizing Controller-Injected State
   5.  Handling Failure Scenarios
     5.1.  Underlying Network Failures
     5.2.  BGP Controller failures
     5.3.  Multiple BGP Controllers
     5.4.  Network Partitioning
   6.  Controller API
     6.1.  Pathnames and document names
     6.2.  Encoding of the documents and objects
     6.3.  Creating & Deleting State
     6.4.  Reading State
     6.5.  Writing State
     6.6.  Typical API Call Sequence
     6.7.  Limitations
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   BGP was intentionally designed as a path-vector protocol, since
   efficiently distributing link-state information for an Internet-sized
   graph is virtually impossible.  However, some network deployments
   leverage multiple BGP ASNs to separate IGP domains, or simply use BGP
   as the only routing protocol.  See, for example,
   [I-D.lapukhov-bgp-routing-large-dc], which proposes using a BGP AS
   either per network device or per "horizontal" device group within a
   data center.  In such cases, the number of BGP ASNs is very small
   compared to the Internet: on the order of a few thousand in the
   largest case.

   Under these assumptions, it becomes possible to build and maintain a
   link-state graph of the complete inter-AS topology and compute
   network paths based on this link-state information.  In accomplishing
   this, it is desirable to avoid adding any protocol extensions, such
   as those described in [RWHITE2005], so that current implementations
   can leverage the proposed method.  Instead, this document
   proposes the use of a centralized agent (referred to as "BGP
   Controller" or simply "the controller") that peers with all eBGP
   speakers in the underlying network.  The BGP Controller is
   responsible for constructing an up-to-date link-state view of the BGP
   inter-AS graph and pushing down routing information (prefixes and
   their associated next-hops) to the network devices via BGP updates.
   The new routing information reflects the results of link-state path
   computations performed by the controller.  Such routing information
   push is possible because BGP supports the next-hop attribute that
   could be recursively resolved via either IGP or BGP.  Notice that
   while the controller pushes routing information to the device, the
   underlying BGP processes also compute the best-paths for the same
   prefixes using the path-vector logic in the regular way.  However,
   the BGP Controller could override this information by manipulating
   BGP attributes of the injected routes, such as LOCAL_PREF, to make
   its own advertisements preferred.
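
   The override described above can be illustrated with a deliberately
   simplified model of BGP best-path selection.  The sketch below is
   not a full decision process: it models only the first two
   tie-breakers (highest LOCAL_PREF, then shortest AS_PATH).  The Route
   type, the LOCAL_PREF value of 200 for controller-injected routes, and
   the addresses and AS numbers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    local_pref: int            # higher value is preferred
    as_path: Tuple[int, ...]   # shorter path wins when LOCAL_PREF ties


def best_path(routes: List[Route]) -> Route:
    # Simplified: only LOCAL_PREF and AS_PATH length are compared.
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path)))


# Route learned through regular eBGP propagation (hypothetical values).
learned = Route("192.0.2.0/24", "10.0.0.1", local_pref=100,
                as_path=(65002, 65004))
# Route injected by the controller over iBGP with a raised LOCAL_PREF.
injected = Route("192.0.2.0/24", "10.0.0.5", local_pref=200, as_path=())

winner = best_path([learned, injected])
print(winner.next_hop)  # the controller-injected next-hop is selected
```

   If the controller withdraws its route, the device falls back to the
   path it learned in the regular way.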

   Third-party applications can influence routing computations by
   creating logical alterations of the network link-state graph, e.g.
   changing the cost of the links from the BGP Controller's point of
   view.  This document will refer to those constructs as "alternate
   topologies" (or simply "topologies" for short), while the original,
   unaltered link-state graph will be referred to as the "default
   topology".  The controller would use these alternate topologies to
   make routing decisions different from those that BGP would have made
   based on available information.  It is possible to create multiple
   alternate topologies and associate different prefixes with every
   topology, with the restriction that each prefix maps to one and only
   one topology.  Once this mapping is defined, the BGP Controller would
   operate autonomously, detecting network faults and re-computing
   routing information as needed, based on the effect that each failure
   has across all instantiated topologies.
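
   One possible way to hold this state is sketched below: alternate
   topologies are stored as metric overrides on top of the default
   topology, and a dictionary maps each prefix to exactly one topology.
   The class, method and link names are hypothetical and are not part
   of the controller API described in this document.

```python
from typing import Dict, FrozenSet

Link = FrozenSet[str]
DEFAULT_TOPOLOGY = 0


class TopologyStore:
    """Alternate topologies kept as metric overrides on the default graph."""

    def __init__(self, default_metrics: Dict[Link, int]) -> None:
        self.default_metrics = dict(default_metrics)
        self.overrides: Dict[int, Dict[Link, int]] = {DEFAULT_TOPOLOGY: {}}
        self.prefix_topology: Dict[str, int] = {}

    def create_topology(self, topo_id: int,
                        overrides: Dict[Link, int]) -> None:
        unknown = set(overrides) - set(self.default_metrics)
        if unknown:
            raise ValueError(f"links not in the default topology: {unknown}")
        self.overrides[topo_id] = dict(overrides)

    def map_prefix(self, prefix: str, topo_id: int) -> None:
        if topo_id not in self.overrides:
            raise KeyError(f"unknown topology {topo_id}")
        # A dict holds one value per key, so a prefix can never map to
        # more than one topology; re-mapping replaces the old entry.
        self.prefix_topology[prefix] = topo_id

    def metric(self, prefix: str, link: Link) -> int:
        # Unmapped prefixes use the default topology.
        topo_id = self.prefix_topology.get(prefix, DEFAULT_TOPOLOGY)
        return self.overrides[topo_id].get(link, self.default_metrics[link])


link = frozenset(("AS4", "AS5"))
store = TopologyStore({link: 1})
store.create_topology(1, {link: 100})
store.map_prefix("P5", 1)
print(store.metric("P5", link), store.metric("P3", link))
```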

   In many aspects, the proposed method was inspired by and is similar
   to the "Routing Control Platform" [RCP], but differs in the fact that
   link-state discovery is done using BGP mechanics only, and overall
   BGP is the only protocol used to build the system.

2.  Overview

2.1.  Use Cases

   The primary intended use case of the BGP Controller is inter-AS
   traffic engineering.  This includes, but is not limited to, the
   following:

   o  Link/device overloading for the purpose of draining traffic from
      a device.  A link, or group of links, connecting one ASN to
      another could be declared as having "infinite" cost from the
      controller's viewpoint, causing the latter to re-compute paths and
      instruct the network devices to bypass those links.  Notice that
      this does not include "internal" overload (inside an ASN), which
      may need to be done using IGP techniques.

   o  Traffic load-sharing among multiple links, e.g. links connecting
      two different ASNs.  Multiple alternate topologies could be
      created where the same link is given different costs in each
      topology.  These topologies will then have subsets of prefixes
      mapped to them, thus engineering different inter-AS paths for
      these prefixes.  Notice that accurate load-sharing may require
      knowledge of the traffic matrix, but this requirement applies
      equally to any traffic engineering solution.  Load-sharing could
      also be accomplished using weighted Equal-Cost Multipath (ECMP),
      treating link capacities as "weights" to distribute different
      proportions of egress traffic to the peering
      points.  See [KVALBEIN2007] for more information on the multi-
      topology techniques in general and [I-D.ietf-idr-link-bandwidth]
      for information on weighted ECMP signaling in BGP.
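
   The link overloading use case can be sketched as follows, assuming a
   hypothetical triangle of three vertices A, B and C with unit link
   costs.  Declaring the A-B link as having infinite cost makes a
   standard Dijkstra computation route around it.  The function names
   are illustrative only.

```python
import heapq
import math
from typing import Dict

Graph = Dict[str, Dict[str, float]]


def shortest_costs(graph: Graph, source: str) -> Dict[str, float]:
    """Dijkstra: cost of the cheapest path from source to every vertex."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, cost in graph[u].items():
            if d + cost < dist[v]:
                dist[v] = d + cost
                heapq.heappush(heap, (dist[v], v))
    return dist


def set_cost(graph: Graph, a: str, b: str, cost: float) -> None:
    """Set the cost of an undirected link in both directions."""
    graph[a][b] = cost
    graph[b][a] = cost


graph: Graph = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "C": 1},
    "C": {"A": 1, "B": 1},
}
before = shortest_costs(graph, "A")["B"]   # direct link is used
set_cost(graph, "A", "B", math.inf)        # overload (drain) the A-B link
after = shortest_costs(graph, "A")["B"]    # traffic now detours via C
print(before, after)
```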

   The main benefit of the proposed approach is centralized control of
   the above functions.  There is no need to configure policies on
   multiple devices, as all routing changes could be done using the
   uniform light-weight API to the controller.  This ensures ease of
   automation and consistent changes.  Furthermore, such a centralized
   model should be deployed to augment the classical distributed routing
   policy configuration.  The advantage is that centralized control
   could be disabled at any time, returning the network to the
   "traditional" BGP decision model and thus providing a safe state to
   roll back to.  Next, knowing the link-state of the network may allow
   avoiding the BGP path-hunting problem and improving global BGP
   convergence time in a large group of heavily meshed ASNs.
   Additionally, to avoid the phenomenon of routing micro-loops, the
   controller could enforce a certain ordering for the network device
   programming sequence.  Specifically, every time a link-state change
   is proposed to the controller, the devices in the network are
   programmed starting with those farther away from the change in terms
   of the metric of the existing graph.  The same logic applies to link-
   down conditions detected by the controller via the health probing
   mechanism described below.
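
   One possible reading of this ordering rule is sketched below.  The
   distance of each vertex from a changed link is computed on the
   existing graph with a multi-source Dijkstra started from both link
   endpoints, and vertices are then programmed farthest first.
   Measuring from the nearer endpoint and breaking ties by name are
   assumptions of this sketch, as are the function names.

```python
import heapq
from typing import Dict, Iterable, List

Graph = Dict[str, Dict[str, int]]


def distance_from_change(graph: Graph,
                         endpoints: Iterable[str]) -> Dict[str, int]:
    """Metric distance from each vertex to the nearest changed endpoint."""
    dist: Dict[str, int] = {}
    heap = [(0, v) for v in endpoints]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if u in dist:
            continue  # already settled with a smaller or equal distance
        dist[u] = d
        for v, cost in graph[u].items():
            if v not in dist:
                heapq.heappush(heap, (d + cost, v))
    return dist


def programming_order(graph: Graph, endpoints: Iterable[str]) -> List[str]:
    dist = distance_from_change(graph, endpoints)
    # Farthest vertices first; names break ties deterministically.
    return sorted(dist, key=lambda v: (-dist[v], v))


# Hypothetical chain A - B - C - D where the C-D link changes.
chain: Graph = {
    "A": {"B": 1},
    "B": {"A": 1, "C": 1},
    "C": {"B": 1, "D": 1},
    "D": {"C": 1},
}
print(programming_order(chain, ["C", "D"]))
```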

2.2.  Architectural Assumptions

   Firstly, the devices in the network are assumed (but not required) to
   have minimal BGP policy applied, enough for them to exchange routing
   information and compute best-paths based on shortest AS_PATH lengths.
   This means that the configured policy should not override best-path
   selection process using LOCAL_PREF or any other BGP attributes for
   enforcing a custom routing policy.  The assumption of the "minimal
   policy" makes the BGP Controller's update logic less intrusive, as
   described further in Section 4.7.  Next, every device is assumed to
   advertise a locally bound prefix into BGP for the purpose of BGP
   peering with the controller.  That is, the controller peers "inband"
   with the devices it controls - either by initiating iBGP sessions to
   all devices or by passively accepting the sessions from the devices.
   As will be shown in Section 5, the inband peering requirement is
   important to avoid inconsistencies between multiple controllers
   programming the same network.

   Another major assumption is how the link-state graph vertices are
   defined.  From the BGP Controller perspective, there are two types of
   vertices:

   o  Type 1, Individual Devices: BGP Speaker(s) that have the SAME BGP
      ASN configured, with the restriction that none of these speakers
      peers with each other, inside this ASN.  This could be a single
      speaker in its own ASN as well.  Each of these speakers is treated
      as a vertex on its own.  Peering with other ASNs is not
      restricted.  Notice how this is different from the traditional
      notion of BGP ASN, where all speakers are assumed to be part of
      the same iBGP mesh.

   o  Type 2, Complete BGP ASN: BGP Speakers in the SAME BGP ASN with
      the normal requirement that they ALL exchange their BGP views via
      iBGP, using either full-mesh or any other approach for full
      internal BGP state synchronization.  All of these BGP speakers are
      grouped into a single graph vertex.

   The following Figure 1 illustrates this concept:

               Legend
               ------- eBGP
               ....... iBGP

                                     eBGP Peering
                                           |
                                     +-----+-----+
                                     |     |     |
                                     |   +-+-+   |
                                     |   |R3 |   |
                                     |   +-+-+   |
                                     |     |     |
                                     +-----+-----+
                                           |

                                      eBGP Peering

                                     |            |
                           +---------+------------+----------+
                           |         |     AS1    |          |
                           |       +-+-+        +-+-+        |
                           |       |R1 |        |R2 |        |
                           |       +-+-+        +-+-+        |
                           |         |            |          |
                           +---------+------------+----------+
                                     |            |

                       Type 1: Each device is an individual graph vertex
                             (three vertices, each with two edges).

                                     |           |
                                +----+-----------+----+
                                |    |           |    |
                                |  +-+-+       +-+-+  |
                                |  |R1 |.......|R2 |  |
                                |  +---+.     .+---+  |
                                |    .    .  .   .    |
                                |    .     .     .    |
                                |    .    .  .   .    |
                                |    .  .     .  .    |
                                |  +---+       +---+  |
                                |  |R3 |.......|R4 |  |
                                |  +-+-+       +---+  |
                                |    |           |    |
                                +----+-----------+----+
                                     |           |

                       Type 2: All devices above are grouped into
                              a single vertex with four edges.

                         Figure 1: Graph Vertices

   Routing information could be associated with a graph vertex either by
   means of static binding or dynamic discovery: this process is
   described in detail in Section 4.3.  When programming the network
   prefixes into the devices, the controller does not inject a prefix
   back into the vertex the prefix is associated with.

   The BGP Controller decision logic is independent of the address
   family, and could apply to both IPv4 and IPv6 prefixes equally.  It
   is possible to run two independent controllers, one for each address
   family.  This allows for full "fate decoupling" between the address
   families, though it may result in duplication of the link-state
   information.

   The edges of the constructed link-state graph may have two
   attributes: metric, which is additive, and capacity (bandwidth) that
   is non-additive.  The former is used to compute shortest paths, and
   the latter could be used to compute ECMP weight values in cases where
   multiple equal-cost paths exist to the same vertex.  For every ECMP
   path, the minimum capacity value that occurs along that path will be
   used as its weight by the controller, if the underlying network
   supports weighted ECMP functionality.
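
   The weight rule above amounts to taking the bottleneck capacity of
   each path.  A minimal sketch follows, assuming two hypothetical
   equal-cost paths from S to T and capacities in arbitrary units.
   Keying the result by first hop assumes that each path leaves S
   through a different neighbor.

```python
from typing import Dict, FrozenSet, List

Link = FrozenSet[str]


def path_weight(path: List[str], capacity: Dict[Link, int]) -> int:
    """Weight of a path: the smallest link capacity found along it."""
    return min(capacity[frozenset(hop)] for hop in zip(path, path[1:]))


capacity: Dict[Link, int] = {
    frozenset(("S", "A")): 40,
    frozenset(("A", "T")): 10,
    frozenset(("S", "B")): 40,
    frozenset(("B", "T")): 40,
}
paths = [["S", "A", "T"], ["S", "B", "T"]]

# ECMP weight per first hop out of S.
weights = {path[1]: path_weight(path, capacity) for path in paths}
print(weights)  # {'A': 10, 'B': 40}
```

   With these weights, a device supporting weighted ECMP would send one
   fifth of the traffic toward T via A and four fifths via B.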





2.3.  BGP Controller

   Figure 2 shows the BGP Controller peering with the network devices.
   Multiple managed devices peer via eBGP following the traditional BGP
   design.  For simplicity, we assume that every device belongs to its
   own ASN - see Section 4.6 for more information on handling the
   "compound" Type-2 vertices consisting of multiple BGP speakers
   interconnected with an iBGP mesh.  Prefixes P3, P4 and P5 are
   associated with the devices (vertices) in ASNs 3, 4, and 5
   respectively, using techniques described in Section 4.3.  The
   remaining vertices are assumed to be purely transit for the purpose
   of this discussion.

   These devices exchange routing information in the usual manner and
   the BGP Controller establishes iBGP peering sessions with every
   device.  It uses the technique described in Section 3.1 to
   build the inter-AS link-state graph.  For now, it is sufficient to
   say that the discovery process uses special "beacon" prefixes
   dynamically injected into the network and relayed back to the
   controller to discover the state of the links interconnecting the
   graph vertices.

               Legend:

               ------- iBGP (controller to network)
               ....... eBGP (ASN to ASN)

                                    BGP Controller
                                      +-------+
                                      |       |
                                      +-------+
                                       || | ||
                                       || | ||
                         +-------------+| | |+--------------+
                         |         +----+ | +----+          |
                         |         |      |      |          |
                         |         v      |      v          |
                         |       +---+    |    +---+        |
                         |       |AS1|....|....|AS2|        |
                         v       +---+    |    +---+        v
                       +---+       .      |     . .       +---+
                    P3 |AS3|........      |     . ........|AS4| P4
                       +---+              |     .         +---+
                         .                V     .           .
                         .              +---+....           .
                         ...............|AS5|................
                                        +---+
                                          P5

                         Figure 2: BGP Controller

   At this point, the BGP Controller has knowledge of the link-state
   graph as well as the prefixes associated with every vertex, and can
   now run Dijkstra's SPF algorithm to compute shortest paths between
   vertices.  The result of this computation is a routing table for
   every vertex.  Section 3.2 below shows the adjacency list built by
   the controller for the above topology, as well as the routing tables
   computed for every vertex.  The next-hops in the routing tables
   presented in the figure are simply the vertices to send the packets
   to.  When programming the network devices, the actual IP addresses of
   the next-hops are computed as described in Section 4.1.  This routing
   state corresponds to the unaltered (default) topology.

3.  Link-State Abstraction and Multiple Topologies

   This section provides detailed information on the link-state
   abstractions used by the controller and how those are used to perform
   traffic engineering in the underlying network.

3.1.  Link-State Discovery Process

   The network devices that the controller peers with establish eBGP
   peering sessions with each other.  The fact that there is a
   one-to-one correspondence between eBGP sessions and underlying IP
   links allows using the state of an eBGP session as an indication of
   the health of the corresponding IP link.  Specifically, this is
   accomplished by injecting special "beacon" prefixes into every vertex
   (which could be a device or a collection of devices interconnected
   with an iBGP mesh) and expecting those beacons to be re-advertised
   back to the controller by every vertex adjacent to the point of
   injection.  If a particular BGP session is down, the injected prefix
   will not be re-advertised by the affected peer back to the
   controller, allowing the controller to conclude that the
   corresponding link is down.

   Figure 3 demonstrates this process.  For simplicity, we assume that
   every device belongs to its own BGP ASN.  The BGP Controller injects
   prefix X into device R1 and expects to hear this prefix from device
   R2.  At the same time, it is desirable to prevent this prefix from
   leaking any farther than one hop away from R1, i.e. to make sure it
   is not re-advertised to R3.  To accomplish this, prefix X could be
   tagged with a special community value, which is replaced with the
   well-known community "no-export" when the prefix is advertised over
   an eBGP session.  Because of this policy, R2 will announce the prefix
   back to the controller, since the controller peers over an iBGP
   session, but not any further to its own eBGP peers.  An alternative
   to standard BGP communities could be to use wide communities to
   limit the scope of the announced prefixes - see
   [I-D.raszuk-wide-bgp-communities] for more details on this technique.


               ------- iBGP (controller to network)
               ....... eBGP (ASN to ASN)

                               +------------+
                        +------| Controller |<------+
                        |      +------------+       |
                        X                           X
                        |                           |
                        V                           |
                      +---+                       +---+
                      |R1 |...........X..........>|R2 |
                      +---+                       +---+
                                                    .
                                    +---+           .
                                    |R3 |............
                                    +---+

                      Figure 3: Link-State Discovery

   Using this technique, the controller is able to build a view of the
   links connecting the graph vertices.  Notice that if two parallel
   links connect the same pair of vertices, this method will not be able
   to differentiate between them.  For simplicity, such parallel
   links should be grouped into a single logical IP link using, for
   example, [IEEE8023AD] technology.
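
   The inference step of the discovery process can be sketched as
   follows.  The input records, for each vertex a beacon was injected
   into, which vertices re-advertised that beacon to the controller.
   Requiring the beacon to be seen in both directions before declaring
   a link up is an assumption of this sketch, and the function name is
   hypothetical.

```python
from typing import Dict, FrozenSet, Set


def discover_links(heard: Dict[str, Set[str]]) -> Set[FrozenSet[str]]:
    """heard[u]: vertices that re-advertised the beacon injected into u."""
    links: Set[FrozenSet[str]] = set()
    for origin, reporters in heard.items():
        for reporter in reporters:
            # Assumption: a link is up only if each side relays the
            # other side's beacon back to the controller.
            if origin in heard.get(reporter, set()):
                links.add(frozenset((origin, reporter)))
    return links


# Topology of Figure 3: R1 - R2 - R3, all eBGP sessions up.
all_up = {"R1": {"R2"}, "R2": {"R1", "R3"}, "R3": {"R2"}}
# Same topology with the R2-R3 session down.
r2_r3_down = {"R1": {"R2"}, "R2": {"R1"}, "R3": set()}

print(sorted(sorted(pair) for pair in discover_links(all_up)))
print(sorted(sorted(pair) for pair in discover_links(r2_r3_down)))
```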

3.2.  The Default Topology

   When the controller starts, it discovers the current network graph
   and computes the routing table assuming that all links have the same
   metric value.  Figure 4 illustrates the adjacency list describing
   the graph taken from Figure 2 along with the routing table computed
   for every vertex/ASN.  The numbers on the graph edges designate the
   link costs.


                              Inter-AS Graph and Prefixes

                                 +---+         +---+
                                 |AS1|...(1)...|AS2|
                                 +---+         +---+
                        +---+      .            . .       +---+
                     P3 |AS3|..(1)..            . ...(1)..|AS4| P4
                        +---+                   .         +---+
                          .                    (1)          .
                          .                     .           .
                          .             +---+....           .
                          .......(1)....|AS5|......(1).......
                                        +---+
                                          P5

           Inter-AS Graph Adjacency List           Per-ASN Routing Table

              +-----+--------------+           +-----+----------------------------+
              | Src | Dst ASNs     |           | AS  | Prefix:Next-Hop(s)         |
              +-----+--------------+           +-----+----------------------------+
              | AS1 | AS2,AS3      |           | AS1 | P3:AS3,P4:AS2,P5:[AS2,AS3] |
              +-----+--------------+           +-----+----------------------------+
              | AS2 | AS1,AS4,AS5  |           | AS2 | P3:AS1,P4:AS4,P5:AS5       |
              +-----+--------------+           +-----+----------------------------+
              | AS3 | AS1,AS5      |           | AS3 | P3:Self,P4:AS5,P5:AS5      |
              +-----+--------------+           +-----+----------------------------+
              | AS4 | AS2,AS5      |           | AS4 | P3:AS5,P4:Self,P5:AS5      |
              +-----+--------------+           +-----+----------------------------+
              | AS5 | AS4,AS2,AS3  |           | AS5 | P3:AS3,P4:AS4,P5:Self      |
              +-----+--------------+           +-----+----------------------------+

                     Figure 4: Unaltered Routing State
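
   The computation behind Figure 4 can be sketched with a Dijkstra
   variant that keeps every equal-cost first hop.  The sketch below
   uses the adjacency list and prefix bindings from the figure; the
   function and variable names are illustrative.  Note that a strict
   equal-cost computation also finds two next-hops from AS2 toward P3
   (AS1 and AS5, both at cost 2), whereas the figure lists only AS1.

```python
import heapq
from typing import Dict, List, Set

LINKS = [("AS1", "AS2"), ("AS1", "AS3"), ("AS2", "AS4"),
         ("AS2", "AS5"), ("AS3", "AS5"), ("AS4", "AS5")]
PREFIXES = {"P3": "AS3", "P4": "AS4", "P5": "AS5"}

# Adjacency list with the same metric (1) on every link.
GRAPH: Dict[str, Dict[str, int]] = {}
for a, b in LINKS:
    GRAPH.setdefault(a, {})[b] = 1
    GRAPH.setdefault(b, {})[a] = 1


def first_hops(source: str) -> Dict[str, Set[str]]:
    """Dijkstra keeping all equal-cost first hops toward each vertex."""
    dist = {source: 0}
    hops: Dict[str, Set[str]] = {source: set()}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, cost in GRAPH[u].items():
            nd = d + cost
            via = {v} if u == source else hops[u]
            if v not in dist or nd < dist[v]:
                dist[v], hops[v] = nd, set(via)
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                hops[v] |= via  # another equal-cost path
    return hops


def routing_table(vertex: str) -> Dict[str, List[str]]:
    hops = first_hops(vertex)
    return {prefix: ["Self"] if owner == vertex else sorted(hops[owner])
            for prefix, owner in PREFIXES.items()}


for vertex in sorted(GRAPH):
    print(vertex, routing_table(vertex))
```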

3.3.  Alternate Topologies

   Assume the following TE requirements for illustrative purposes:

   o  Traffic from AS4 to P5 needs to traverse AS2.

   o  Traffic to P4 from AS5 needs to ECMP over two paths: direct and
      via AS2.

   o  Traffic from AS3 to P5 must not use the direct path.

   These requirements could be satisfied with two different topologies:

   o  Topology 1 has "very large" metric assigned to the links between
      AS4,AS5 and AS3,AS5.

   o  Topology 2 has metric value of 2 assigned to the link between AS4
      and AS5.

   The prefixes map to the topologies as follows: P5->Topology1 and
   P4->Topology2.  P3 retains its mapping to the default (unaltered)
   topology, which we call Topology 0 so that all topologies can be
   referred to by number.  Using a "very large" metric rather than an
   infinite one is important: a path containing such a link could still
   be used if all alternate paths are down because of physical
   failures.  For simplicity, we assume that "very large" equals 100 in
   the case under consideration.  The set of topologies and associated
   prefixes is shown in Figure 5, where the numbers on the links
   designate their metrics.

                               [Topology 0]

                           +---+         +---+
                           |AS1|...(1)...|AS2|
                           +---+         +---+
                  +---+      .            . .       +---+
               P3 |AS3|..(1)..            . ..(1)...|AS4|
                  +---+                   .         +---+
                    .                    (1)          .
                    .                     .           .
                    .             +---+....           .
                    .....(1)......|AS5|......(1).......
                                  +---+

                               [Topology 1]

                           +---+         +---+
                           |AS1|...(1)...|AS2|
                           +---+         +---+
                  +---+      .            . .       +---+
                  |AS3|..(1)..            . ..(1)...|AS4|
                  +---+                   .         +---+
                    .                    (1)          .
                    .                     .           .
                    .             +---+....           .
                    ....(100).....|AS5|.......(100)....
                                  +---+
                                    P5

                               [Topology 2]

                           +---+         +---+
                           |AS1|...(1)...|AS2|
                           +---+         +---+
                  +---+      .            . .       +---+
                  |AS3|..(1)..            . ..(1)...|AS4| P4
                  +---+                   .         +---+
                    .                    (1)          .
                    .                     .           .
                    .             +---+....           .
                    .....(1)......|AS5|.......(2)......
                                  +---+

                      Figure 5: Alternate Topologies

   Based on the set of topologies presented above, the BGP Controller
   will compute the routing tables shown in Figure 6, which reflect the
   desired traffic engineering goals defined previously.  The entries
   that differ from the routing decisions in the unaltered topology are
   highlighted with asterisk (*) characters.  Notice that AS3 now sees
   P4 as ECMP-reachable via AS1 and AS5, because of the metric change
   in Topology 2.  The original traffic engineering policy requirements
   did not call for that; this result is a side effect of the change
   made between AS4 and AS5, which is natural with shortest-path,
   destination-based forwarding techniques.

                       Per-ASN Routing Table

              +-----+--------------------------------+
              | AS  | Prefix:Next-Hop(s)             |
              +-----+--------------------------------+
              | AS1 | P3:AS3,P4:AS2,*P5:AS2*         |
              +-----+--------------------------------+
              | AS2 | P3:AS1,P4:AS4,P5:AS5           |
              +-----+--------------------------------+
              | AS3 | P3:Self,*P4:[AS5,AS1]*,P5:AS1  |
              +-----+--------------------------------+
              | AS4 | P3:AS5,P4:Self,*P5:AS2*        |
              +-----+--------------------------------+
              | AS5 | P3:AS3,P4:*[AS4,AS2]*,P5:Self  |
              +-----+--------------------------------+

                  Figure 6: Multi-Topology Routing Tables
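
   The per-topology computation described above can be sketched as
   follows.  This is a minimal illustration, not the authors'
   implementation: the link set is read off Figure 5, while the names
   BASE_LINKS, build_graph, distances and next_hops are hypothetical
   and introduced here only for the example.  Each topology is the
   base graph plus a set of metric overrides, and the next-hops for a
   prefix are the neighbors that lie on a shortest path toward the
   vertex originating that prefix.

```python
import heapq

# Base inter-AS graph as drawn in Figure 5 (Topology 0): all metrics are 1.
BASE_LINKS = {
    ("AS1", "AS2"): 1, ("AS1", "AS3"): 1, ("AS2", "AS4"): 1,
    ("AS2", "AS5"): 1, ("AS3", "AS5"): 1, ("AS4", "AS5"): 1,
}


def build_graph(links, overrides=None):
    """Return an undirected adjacency map with per-topology metric overrides."""
    merged = dict(links)
    for (a, b), metric in (overrides or {}).items():
        key = (b, a) if (b, a) in merged else (a, b)
        merged[key] = metric
    graph = {}
    for (a, b), metric in merged.items():
        graph.setdefault(a, {})[b] = metric
        graph.setdefault(b, {})[a] = metric
    return graph


def distances(graph, source):
    """Dijkstra's algorithm: shortest distance from source to each vertex."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, metric in graph[node].items():
            nd = d + metric
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(queue, (nd, neighbor))
    return dist


def next_hops(graph, source, destination):
    """Neighbors of source that lie on a shortest path to destination."""
    if source == destination:
        return ["Self"]
    to_dest = distances(graph, destination)  # graph is undirected
    best = to_dest[source]
    return sorted(n for n, metric in graph[source].items()
                  if metric + to_dest[n] == best)


# Topology 1 carries P5 (originated by AS5); Topology 2 carries P4 (AS4).
topo1 = build_graph(BASE_LINKS, {("AS3", "AS5"): 100, ("AS4", "AS5"): 100})
topo2 = build_graph(BASE_LINKS, {("AS4", "AS5"): 2})

print(next_hops(topo2, "AS3", "AS4"))  # ['AS1', 'AS5']
print(next_hops(topo1, "AS4", "AS5"))  # ['AS2']
```

   The two printed results correspond to the "*P4:[AS5,AS1]*" entry for
   AS3 and the "*P5:AS2*" entry for AS4 in Figure 6.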

   The controller will push the computed routing tables to the network
   devices using higher LOCAL_PREF values to ensure that the new
   information overrides the routing decision that "traditional" BGP
   processes running on the BGP speakers have already made.  It is
   possible to use other attributes to signal better preference, but
   LOCAL_PREF has the benefit of being used very early in the BGP tie-
   breaking process.

3.4.  Overloading a Vertex

   This section illustrates a special but practically important case
   of "overloading" a graph vertex, such that all traffic bypasses the
   vertex.  This operation could be used when a particular network
   device needs an upgrade and requires all traffic to be drained from
   it.  Figure 7 demonstrates the implementation of this policy with
   respect to the AS5 vertex.  Topology 0 has no prefixes mapped to it;
   all prefixes are mapped to Topology 2 instead.  This topology has
   the cost of 100 assigned to all links connected to AS5, which forces
   all traffic to avoid transiting AS5.

                             [Topology 0]

                         +---+         +---+
                         |AS1|...(1)...|AS2|
                         +---+         +---+
                +---+      .            . .       +---+
                |AS3|..(1)..            . ..(1)...|AS4|
                +---+                   .         +---+
                  .                    (1)          .
                  .                     .           .
                  .             +---+....           .
                  .....(1)......|AS5|......(1).......
                                +---+

                             [Topology 2]

                         +---+         +---+
                         |AS1|...(1)...|AS2|
                         +---+         +---+
                +---+      .            . .       +---+
             P3 |AS3|..(1)..            . ..(1)...|AS4| P4
                +---+                   .         +---+
                  .                   (100)         .
                  .                     .           .
                  .             +---+....           .
                  ....(100).....|AS5|......(100).....
                                +---+
                                  P5

                       Per-ASN Routing Table

              +-----+--------------------------------+
              | AS  | Prefix:Next-Hop(s)             |
              +-----+--------------------------------+
              | AS1 | P3:AS3,P4:AS2,*P5:[AS2,AS3]*   |
              +-----+--------------------------------+
              | AS2 | P3:AS1,P4:AS4,P5:AS5           |
              +-----+--------------------------------+
              | AS3 | P3:Self,*P4:AS1*,P5:AS5        |
              +-----+--------------------------------+
              | AS4 | P3:*AS2*,P4:Self,P5:AS5        |
              +-----+--------------------------------+
              | AS5 | P3:AS3,P4:AS4,P5:Self          |
              +-----+--------------------------------+

                      Figure 7: Overloading a Vertex
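
   Deriving the "overload" topology from the default one is a simple
   transformation of the link metrics.  The sketch below assumes links
   are stored as a mapping from vertex pairs to metrics; the function
   name overload_vertex and this data layout are illustrative
   assumptions, not part of the proposal.

```python
def overload_vertex(links, vertex, cost=100):
    """Return a copy of links with every link touching vertex set to cost."""
    return {
        edge: (cost if vertex in edge else metric)
        for edge, metric in links.items()
    }


# Default topology from Figure 7 (all metrics equal to 1).
default_links = {
    ("AS1", "AS2"): 1, ("AS1", "AS3"): 1, ("AS2", "AS4"): 1,
    ("AS2", "AS5"): 1, ("AS3", "AS5"): 1, ("AS4", "AS5"): 1,
}

overloaded = overload_vertex(default_links, "AS5")
print(overloaded[("AS2", "AS5")])  # 100
print(overloaded[("AS1", "AS2")])  # 1
```

   Because the metric is large but finite, the links to AS5 remain
   usable as a last resort, and prefixes originated by AS5 itself stay
   reachable.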

4.  Implementation Details

4.1.  Programming Next-Hops

   As mentioned previously, the prefixes that the controller injects
   into the network need to have their next-hops properly resolved.  In
   the simplest case, the next-hops could be the remote IP addresses of
   the links directly connected to the device programmed by the
   controller.  This, however, adds complexity due to the IP address
   variability on the point-to-point links connecting the network
   devices.  An alternative could be injecting pre-generated next-hops
   into the devices - one per device - and resolving them recursively
   via BGP.

   Specifically, every graph vertex would have a host route (either IPv4
   or IPv6) associated with it.  The controller would inject this prefix
   into the respective device(s) associated with this vertex (see
   Section 4.6), tagged with the special community value discussed in
   Section 3.1.  Moreover, for simplicity, it is possible to re-use the
   same prefix used for link-state discovery as the value of the
   next-hop attribute, thus reducing the amount of supplementary
   routing state injected by the controller.

   Note that using the special BGP community to limit the propagation
   of beacon/next-hop prefixes is not strictly necessary.  Indeed, the
   controller may simply discard all "special" prefixes whose AS_PATH
   contains more than one AS-hop.  However, this will result in
   unneeded routing state being propagated in the network, which is not
   desirable from a manageability perspective.

4.2.  Equal-Cost Multipath Routing

   In many practical topologies, the controller may find multiple
   equal-cost paths from one vertex to another.  It may then program
   multiple paths for the prefixes affected by this decision.  Either
   of the following two methods could satisfy the multiple-path
   programming requirement:

   o  Using the BGP Add-Path extension [I-D.ietf-idr-add-paths] to
      specify multiple next-hop values.

   o  Using the Diverse Path Advertisement method presented in [RFC6774]
      to inject multiple paths.




   Furthermore, it is possible to implement weighted ECMP functionality
   with this approach, relying on [I-D.ietf-idr-link-bandwidth] for
   weight signaling.  The graph edges could have weights associated with
   them, and a given path's weight computed as the minimum weight value
   along the path, as mentioned previously.  The logic behind the weight
   selection is outside the scope of this document.
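
   As an illustration of the "minimum weight along the path" rule, the
   sketch below computes path weights for two equal-cost paths and
   turns them into traffic shares.  The edge weights are made-up
   values, and normalizing the weights into shares is an assumption
   made for the example; the document leaves weight selection out of
   scope.

```python
def path_weight(path, edge_weights):
    """Weight of a path: the minimum weight of the edges it traverses."""
    hops = zip(path, path[1:])
    return min(edge_weights[frozenset(hop)] for hop in hops)


def ecmp_shares(paths, edge_weights):
    """Normalize the weights of equal-cost paths into traffic shares."""
    weights = [path_weight(p, edge_weights) for p in paths]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical capacities, keyed by unordered vertex pair.
edge_weights = {
    frozenset(("AS3", "AS1")): 10,
    frozenset(("AS1", "AS2")): 40,
    frozenset(("AS2", "AS4")): 40,
    frozenset(("AS3", "AS5")): 40,
    frozenset(("AS5", "AS4")): 30,
}

paths = [["AS3", "AS1", "AS2", "AS4"], ["AS3", "AS5", "AS4"]]
print(ecmp_shares(paths, edge_weights))  # [0.25, 0.75]
```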

4.3.  Prefix Discovery Process

   In order to build routing state information, the controller needs to
   know the "leaf" prefixes associated with the graph vertices.  There
   are two ways of accomplishing this: either by defining a static
   mapping of prefixes to vertices in the BGP Controller configuration,
   or by letting the controller learn those prefixes dynamically.  In
   both cases, the assumption is that the network reachability
   information is already advertised into BGP, such that the regular
   "in-band" routing model is working.

   The controller may dynamically associate a prefix with a vertex by
   applying two criteria: firstly, the prefix received from the managed
   device must have an empty AS_PATH; secondly, the prefix must not be
   one of those injected for the purpose of network health discovery
   and next-hop programming.  The controller treats everything that
   matches these two criteria as the routing information associated
   with the respective vertex.
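
   The two criteria can be expressed as a simple filter.  The sketch
   below is illustrative only: the Route record, its field names and
   the leaf_prefixes function are hypothetical, and the addresses are
   documentation values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str     # e.g. "10.3.0.0/16"
    as_path: tuple  # AS numbers as received by the controller; () if empty
    vertex: str     # graph vertex (managed device) that sent the route


def leaf_prefixes(routes, special_prefixes):
    """Map each vertex to the prefixes it originates.

    A route qualifies when its AS_PATH is empty and it is not one of
    the beacon/next-hop prefixes injected by the controller itself.
    """
    result = {}
    for route in routes:
        if len(route.as_path) == 0 and route.prefix not in special_prefixes:
            result.setdefault(route.vertex, set()).add(route.prefix)
    return result


routes = [
    Route("10.3.0.0/16", (), "AS3"),        # originated by AS3
    Route("10.4.0.0/16", (64504,), "AS3"),  # learned from another AS
    Route("192.0.2.3/32", (), "AS3"),       # beacon prefix, filtered out
]
print(leaf_prefixes(routes, {"192.0.2.3/32"}))  # {'AS3': {'10.3.0.0/16'}}
```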

4.4.  Sequenced Device Programming

   Distributed routing systems are susceptible to transient
   inconsistencies when the network state changes in a way that
   requires changing the best-path selection.  Since a topological
   event (e.g. a link flap) is not propagated in an instant, devices
   that are closer to the origin of the event would update their
   forwarding tables sooner than others.  The devices directly adjacent
   to those that have their tables already updated would still be using
   old forwarding state.  This would create transient routing loops for
   the time it takes to fully synchronize the forwarding state of all
   devices.

   Since the controller is aware of the full network topology, it may
   avoid the above scenario by pushing the routing updates in proper
   sequence - starting with the vertices that are farthest away from the
   location of the event.  This way the newly programmed state will
   "implode" toward the change, as opposed to "exploding" from the
   event's point of occurrence.  Such sequencing is similar to the
   process outlined in [RFC6976], but relies on centralized programming,
   which makes it very simple to implement.
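
   One possible way to derive such a sequence is a breadth-first
   search from the vertices adjacent to the event, followed by
   programming in order of decreasing distance.  The sketch below
   assumes hop count as the measure of "farthest away" and uses a
   hypothetical failure of the AS1-AS3 link in the example graph; the
   function name programming_order is likewise an assumption.

```python
from collections import deque


def programming_order(adjacency, event_vertices):
    """Order vertices farthest-first by hop distance from the event."""
    dist = {v: 0 for v in event_vertices}
    queue = deque(event_vertices)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    # Farthest vertices first; ties are broken by name for determinism.
    return sorted(dist, key=lambda v: (-dist[v], v))


# Example graph after the AS1-AS3 link has failed.
adjacency = {
    "AS1": ["AS2"],
    "AS2": ["AS1", "AS4", "AS5"],
    "AS3": ["AS5"],
    "AS4": ["AS2", "AS5"],
    "AS5": ["AS2", "AS3", "AS4"],
}

print(programming_order(adjacency, ["AS1", "AS3"]))
# ['AS4', 'AS2', 'AS5', 'AS1', 'AS3']
```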




4.5.  Mapping Prefixes to Topologies

   The controller needs a manageable way of associating discovered
   prefixes with any of the topologies defined by the third-party
   applications.  As mentioned previously, all prefixes are by default
   mapped to the default topology, which corresponds to the actual
   network state.  Once an alternate topology has been defined, prefixes
   could be mapped to this new topology.  One possible way of
   implementing such a mapping table is to maintain a radix tree data
   structure, which associates a prefix with the corresponding
   topology.  Using longest-match lookup in this table for each
   discovered prefix would then yield the topology that this prefix
   belongs to.  This allows for easy and natural grouping of
   prefix-to-topology mappings, while maintaining the familiar
   semantics of longest-match routing lookups.  To implement the
   default mapping, the prefixes 0.0.0.0/0 and ::/0 should always be in
   the radix tree, pointing to one of the defined topologies.  When
   those prefixes are deleted per application request, the BGP
   Controller would need to re-insert them, linking back to the default
   topology again.
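
   The lookup semantics can be sketched with the Python standard
   library.  For brevity, the example below uses a linear scan instead
   of a radix tree, so it illustrates the longest-match behavior and
   the handling of the default entries (0.0.0.0/0 and the IPv6 default
   ::/0) rather than an efficient data structure.  The class and
   method names are hypothetical.

```python
import ipaddress


class TopologyMap:
    """Prefix-to-topology table with longest-match lookup (linear scan)."""

    DEFAULTS = ("0.0.0.0/0", "::/0")

    def __init__(self, default_topology="0"):
        self.default_topology = default_topology
        self.entries = {
            ipaddress.ip_network(p): default_topology for p in self.DEFAULTS
        }

    def set(self, prefix, topology):
        self.entries[ipaddress.ip_network(prefix)] = topology

    def delete(self, prefix):
        net = ipaddress.ip_network(prefix)
        self.entries.pop(net, None)
        # A deleted default entry is re-inserted, linked to the default
        # topology again.
        if str(net) in self.DEFAULTS:
            self.entries[net] = self.default_topology

    def lookup(self, prefix):
        net = ipaddress.ip_network(prefix)
        matches = [n for n in self.entries
                   if n.version == net.version and net.subnet_of(n)]
        return self.entries[max(matches, key=lambda n: n.prefixlen)]


table = TopologyMap()
table.set("10.0.0.0/8", "1")
table.set("10.1.0.0/16", "2")
print(table.lookup("10.1.2.0/24"))   # 2
print(table.lookup("10.2.0.0/16"))   # 1
print(table.lookup("192.0.2.0/24"))  # 0
```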

4.6.  Autonomous Systems with iBGP Peering Mesh

   The BGP Controller treats BGP ASNs that have a form of internal BGP
   mesh differently from systems that do not peer over iBGP.  Such a
   system is perceived as an atomic, opaque graph vertex for the
   purpose of next-hop and beacon prefix injection.  The routing inside
   such an ASN is not defined by the controller, but rather relies on
   some other mechanism, such as an IGP.  The controller only defines
   egress points out of the ASN, and can possibly specify weights
   associated with exit points, to allow for weighted ECMP load
   distribution.  This treatment naturally arises from the fact that
   iBGP-injected beacon prefixes are not relayed to iBGP peers.
   Furthermore, the beacon prefixes learned from eBGP neighbors are
   propagated to all iBGP peers, but not relayed back to the BGP
   Controller when learned over an iBGP session.  Thus, the controller
   will discover the peering links of every "edge" router in such an
   ASN with all external peers, but will not be able to see the
   internal iBGP peering mesh.

   If the underlying ASN implements iBGP route reflection or BGP
   Confederations, only the routers that form eBGP sessions with
   external ASNs need to have the routing information injected into
   them.  The routing information will disseminate to the internal
   speakers by means of the normal BGP propagation process, with
   unmodified next-hops and LOCAL_PREF attribute value, thus ensuring
   that it overrides the normal "in-band" routing information.

   When programming ECMP paths, it may happen that the egress points
   specified by the controller do not satisfy the iBGP requirements for
   multipath (e.g. IGP costs to reach the egress points could be
   different).  In such a case, normal BGP tie-breaking will occur and
   only ECMP-equivalent paths will be installed in the RIB.
   Alternatively, if the underlying ASN implements tunneling
   techniques, it is possible to perform load sharing even if the IGP
   costs toward the BGP next-hops are different.

4.7.  Minimizing Controller-Injected State

   The BGP Controller can push down all of the prefixes it computes
   paths for: that is, all prefixes known in the network.  This means
   that for every prefix present in the "regular" eBGP interconnected
   topology the controller will inject the same prefix with different
   attributes.  It is also possible for the controller to push down only
   the "delta", that is, only the prefixes that need their
   next-hops/paths changed, based on the supplied policy.  This mode of
   operation requires that the underlying network finds the best paths
   between the graph vertices using "shortest-path logic", where the
   path length equals the length of the AS_PATH attribute.  This is
   equivalent to running Dijkstra's SPF algorithm on a graph with unit
   metric values assigned to the edges.  This is needed since the
   controller performs path computation using SPF logic, and BGP could
   elect different paths if some policies are present.  Ensuring that
   both the underlying network and the controller perform the same
   computations is what makes the "delta" mode of operation possible.

   Publishing only the "delta" state to the network means more
   "intelligent" work on the controller side and special requirements
   on the network policies.  However, the benefit is significantly
   reduced intervention in the regular forwarding, since the majority
   of the state is not likely to change in many cases.  Alternatively,
   it is always possible to implement the mode where the controller
   overrides all routing information.
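
   A minimal sketch of the "delta" computation is shown below.  It
   assumes that the controller holds two next-hop tables per vertex:
   one computed on the unaltered graph with unit metrics (the expected
   result of in-band BGP) and one computed according to the supplied
   policy.  The function name and the table layout are hypothetical,
   and the values are made up for the example.  Withdrawing entries
   that no longer need an override is not covered here.

```python
def delta(baseline, desired):
    """Return only the (vertex, prefix) entries whose next-hops differ."""
    changes = {}
    for vertex, table in desired.items():
        for prefix, hops in table.items():
            expected = baseline.get(vertex, {}).get(prefix, [])
            if set(hops) != set(expected):
                changes.setdefault(vertex, {})[prefix] = sorted(hops)
    return changes


# Next-hops expected from in-band BGP (unit metrics), illustrative values.
baseline = {
    "AS4": {"P3": ["AS5"], "P5": ["AS5"]},
    "AS5": {"P3": ["AS3"], "P4": ["AS4"]},
}
# Next-hops computed by the controller for the altered topologies.
desired = {
    "AS4": {"P3": ["AS5"], "P5": ["AS2"]},
    "AS5": {"P3": ["AS3"], "P4": ["AS4", "AS2"]},
}

print(delta(baseline, desired))
# {'AS4': {'P5': ['AS2']}, 'AS5': {'P4': ['AS2', 'AS4']}}
```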

5.  Handling Failure Scenarios

   This section reviews two types of failure scenarios: failures in the
   underlying network and failures of the controller.

5.1.  Underlying Network Failures

   Either a vertex (if it is a device) or a graph edge (network link)
   may fail.  For the BGP Controller, an underlying failure, be it edge
   or vertex, is visible only after all eBGP sessions interconnecting
   two vertices have failed.  This could be driven either by an event,
   such as a link-down condition, which is typically fast, or by BGP
   hold timer expiration, which is naturally slower.  When this
   happens, the BGP processes withdraw the corresponding beacon prefixes
   and the controller will declare the corresponding edge down.  This
   will result in a re-run of SPF for all active topologies and a push
   of new routing information down to the network.  Since the central
   controller is involved in reconvergence, the restoration time will be
   longer, compared to the restoration process driven purely by
   underlying BGP processes.  Indeed, the restoration time now includes
   failure detection time, SPF re-computation, and the push of new
   prefixes.  However, such centralized reconvergence is free from the
   BGP Path-Hunting problem, and hence improvements could be noticed in
   complex meshed topologies.

   Furthermore, recovery could be faster if multiple paths (ECMP) exist
   for a prefix, and only a single path fails.  In this case, the BGP
   process will simply invalidate the failed path even before the
   controller has signaled removal, and will continue using only the
   active paths.  The details of this reconvergence are complicated,
   as changing an ECMP group is a hardware-dependent operation.
   Furthermore, some implementations may support the "consistent
   hashing" technique that minimizes the impact of an ECMP group size
   change on flow affinities, as described in [RFC2992].

5.2.  BGP Controller failures

   Under normal circumstances, an operator may shut down a controller
   for maintenance or other reasons.  In this case, the BGP sessions
   are expected to be closed following the normal BGP procedure, that
   is, by sending a BGP NOTIFICATION message and terminating the TCP
   session.  As a result, all routers will withdraw the prefixes
   injected by the controller and recalculate the best path.

   If the controller fails abnormally, e.g. its process crashes, the
   TCP sessions that connect it to the underlying devices will either
   be torn down immediately or be closed upon expiration of the BGP
   hold timer.  The latter will cause some delay before the prefixes
   announced by the failed controller are withdrawn.  For the duration
   of that time, the network will be forwarding traffic using possibly
   stale information.  Link/device failures will be handled locally,
   and in some cases may cause traffic black-holes, if the only
   programmed path fails.  The duration of this "stale" time is equal
   to the time it takes to detect the controller failure and update the
   BGP Loc-RIB, followed by RIB/FIB reprogramming.

   It is possible to use a single BGP Controller along with the BGP
   route persistence feature, to maintain the injected paths even after
   a BGP Controller failure (see [I-D.uttaro-idr-bgp-persistence]).
   After the controller restarts, it will simply refresh the "stale"
   routing information.  In this scenario, forcing the network to
   revert to the traditional BGP-based routing could be accomplished by
   instructing the controller to inject its paths with a low LOCAL_PREF
   value, lower than the default used in the network.  The possible
   risk is that the controller may fail in such a fashion that it will
   not be able to inject any information into the network.

5.3.  Multiple BGP Controllers

   If a single BGP Controller is present in the network and BGP route
   persistence is not implemented, a controller failure would result
   in the network becoming unmanaged and falling back to traditional
   BGP routing.  To maintain resilience, it is possible to run multiple
   parallel BGP Controllers, assuming that they supply the network with
   the same routing information, and differentiate themselves as
   'primary' and 'backup'.  The latter property could be accomplished by
   using different LOCAL_PREF attribute values for primary and backup
   controllers - this allows having multiple controllers backing each
   other up.
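
   The effect of the LOCAL_PREF differentiation on a managed device
   can be illustrated as follows.  The sketch models only the first
   step of BGP best-path selection (highest LOCAL_PREF wins); the
   numeric values, including the assumed in-band default of 100, are
   examples rather than recommendations.

```python
IN_BAND_LOCAL_PREF = 100  # assumed default used by in-band BGP routing
BACKUP_LOCAL_PREF = 200   # assumed value used by the backup controller
PRIMARY_LOCAL_PREF = 300  # assumed value used by the primary controller


def best_path(paths):
    """Model only the first BGP decision step: highest LOCAL_PREF wins."""
    return max(paths, key=lambda p: p["local_pref"])


paths = [
    {"source": "in-band", "next_hop": "AS5", "local_pref": IN_BAND_LOCAL_PREF},
    {"source": "backup", "next_hop": "AS2", "local_pref": BACKUP_LOCAL_PREF},
    {"source": "primary", "next_hop": "AS2", "local_pref": PRIMARY_LOCAL_PREF},
]

print(best_path(paths)["source"])  # primary

# If the primary controller's paths are withdrawn, the backup takes over.
without_primary = [p for p in paths if p["source"] != "primary"]
print(best_path(without_primary)["source"])  # backup
```

   When both controllers have withdrawn their paths, only the in-band
   path remains and the device falls back to traditional BGP routing.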

   With multiple BGP Controllers, it becomes critical for all of them to
   perform the same routing decisions.  Even though only one controller
   is programming the network, the backup paths injected by the others
   must be consistent with the primary.  To accomplish that, all
   controllers must:

   o  Have the same view of the underlying network topology - i.e. build
      the same link-state graph.  In the simplest case, this could be
      accomplished by relying on eventual consistency, that is, assuming
      that in a non-partitioned scenario the controllers will
      eventually receive the same link-state probe prefixes and build
      the same resulting link-state database.  Alternatively, a
      consensus protocol, e.g. [PAXOS], could be executed amongst the
      members of the redundant group to synchronize the link-state
      database of the master process with the secondary processes.  This
      would ensure strong consistency of the link-state database, but
      could be burdensome in terms of the state that may need to be
      kept replicated reliably.

   o  Maintain the same topology definition database and
      prefix-to-topology mapping table - as commanded by external
      applications.  This is similar to the previous requirement, but
      would involve much less state to synchronize.  Specifically, the
      topology definitions (e.g. new link costs) and prefix-to-topology
      mapping information need to be distributed.  This state is
      submitted to the controllers via an API defined for the
      third-party applications.  It could be made the responsibility of
      an external application to program all controllers with the same
      state and ensure consistency.  Alternatively, another strongly
      consistent database could be used, leveraging the same consensus
      protocol.



5.4.  Network Partitioning

   This section reviews the possible "partitioning" scenarios, where
   parts of the network may become managed by different controllers.
   Such situations are possible if the controllers are deployed in
   diverse locations, and one or more of the controllers lose iBGP
   peering sessions with some network devices.  The main concern in
   such situations is programming the devices with inconsistent
   information that may cause routing loops.

   Firstly, notice that if device A can learn the "peering source"
   prefix announced by device B, and the BGP Controller can peer with A,
   then by transitivity the controller can also peer with B.  This means
   that the controller and device A can either both learn routing
   information from B, or neither of them can - excluding transient
   situations.  This property ensures that under proper configuration a
   set of devices is either completely managed, or completely unmanaged
   - that is, they share the same fate.  This eliminates the scenario
   where device A is programmed by controller X, device B is programmed
   by controller Y, and the devices can reach each other in-band.

   Secondly, consider the transient cases, when A and B have in-band
   connectivity, but for some time A is programmed by X and B is
   programmed by Y.  Recall that the absence of an iBGP session to a
   device means that this device is declared as having "infinite" costs
   in the link-state database.  Thus, X will always bypass B and Y will
   always bypass A, and hence a routing loop can never form between A
   and B.

6.  Controller API

   This section provides a set of requirements and guidance for the BGP
   Controller API.  The general recommendation is to base the API on
   stateless principles, such as those found in the [REST] model.  This
   approach is efficient since no real-time event passing between the
   controller and third-party applications is needed, e.g. for the
   purpose of active reaction to network failure events.  The proposed
   controller model assumes those events are handled by the message
   exchange in the network-controller loop.  The following sections are
   structured around "CRUD" - the Create, Read, Update, Delete
   operations commonly used in the REST model - and use HTTP verbs and
   pathnames for illustration.  In the text below, applications are
   referred to as clients and the BGP Controller as the server, though
   the API could be implemented by a module separate from the main BGP
   Controller logic.





6.1.  Pathnames and document names

   The server presents the following pathnames to group various objects:

   o  "/lsdb" - This is the document that stores the currently
      discovered inter-AS graph link state (link-state database).  This
      document cannot be modified, only read.  The LSDB data structure
      is a graph, represented in one of the common formats - e.g. as two
      collections: vertices and edges, where edges have associated
      states and weight (capacity).

   o  "/topologies/" - This is a directory that stores documents
      corresponding to different topologies.  Every document contains a
      topology definition.

   o  "/mappings/ipv4" - This is the document that stores the IPv4
      prefix mappings to the topologies.  Notice that if the 0.0.0.0/0
      prefix is not found in this document, it is implicitly mapped to
      the default topology.  Internally in the BGP Controller this is
      stored as an efficient radix tree, but the document represents the
      mappings as a collection of prefixes and associated topologies.

   o  "/mappings/ipv6" - This is the document that stores the IPv6
      prefix mappings to the topologies.  It is the same as the IPv4
      mappings document, except for the address family.  As with the
      IPv4 case, if the ::/0 prefix is not found in this document, it is
      implicitly mapped to the default topology.

6.2.  Encoding of the documents and objects

   Either JSON or XML is an acceptable format for encoding the document
   contents for programmability.  JSON is preferred due to its
   lightweight nature and simpler semantics for transporting data
   structures.  The documents passed with RESTful calls will contain
   logical descriptions of the graph vertices and edges.  A vertex is
   uniquely identified by an opaque name, e.g. a text string.  The
   mapping between this identifier and the underlying network devices is
   to be done elsewhere in the controller data structures, and does not
   need to be exposed to the applications.
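
   As an illustration of what such documents might look like, the
   sketch below builds an LSDB document, a topology definition and an
   IPv4 mapping document, then serializes them as JSON.  All field
   names ("vertices", "edges", "a", "b", "state", "weight", "links",
   "cost", "mappings") are assumptions made for the example; this
   document does not define a schema.

```python
import json

# Link-state database: two collections, vertices and edges, where the
# edges carry a state and a weight (capacity).
lsdb = {
    "vertices": ["AS1", "AS2", "AS3", "AS4", "AS5"],
    "edges": [
        {"a": "AS1", "b": "AS2", "state": "up", "weight": 10},
        {"a": "AS4", "b": "AS5", "state": "up", "weight": 10},
    ],
}

# Topology definition: the link costs that define an alternate topology.
topology_2 = {"links": [{"a": "AS4", "b": "AS5", "cost": 2}]}

# Prefix-to-topology mappings for one address family.
mappings_ipv4 = {"mappings": [{"prefix": "10.4.0.0/16", "topology": "2"}]}


def encode(document):
    """Serialize a document deterministically for transport."""
    return json.dumps(document, sort_keys=True, indent=2)


def decode(text):
    """Parse a JSON document received from the peer."""
    return json.loads(text)


print(encode(topology_2))
```

   Vertices appear only as opaque names; nothing in these documents
   identifies the underlying network devices.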

6.3.  Creating & Deleting State

   The only state that could be created is the collection of topology
   definitions, under the "/topologies/" directory.  The topology
   objects are to be created using the HTTP POST operation - supplying
   some basic content, e.g. an empty set of links and associated costs,
   using the appropriate encoding.  Correspondingly, a topology could
   be deleted using the DELETE operation.  Notice that the default
   topology is not present in this directory, and thus can never be
   deleted.  Notice also that the separate "mapping" documents
   reference the topology names, and when a topology is deleted such
   mappings become invalid.  It is up to the implementation to handle
   such referential integrity - e.g. by ignoring such entries in the
   mapping document, or by disallowing the topology document to be
   deleted as long as active references are present.
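
   The two referential integrity policies mentioned above could be
   sketched as follows.  The mapping layout (prefix to topology name)
   and the function names are assumptions made for the example.

```python
def valid_mappings(mappings, topologies, default="0"):
    """Policy 1: ignore entries that reference a missing topology."""
    return {
        prefix: topology
        for prefix, topology in mappings.items()
        if topology == default or topology in topologies
    }


def can_delete(topology, mappings):
    """Policy 2: refuse deletion while the topology is still referenced."""
    return topology not in mappings.values()


mappings = {"10.4.0.0/16": "2", "10.5.0.0/16": "1"}
existing = {"1"}  # topology "2" has already been deleted

print(valid_mappings(mappings, existing))  # {'10.5.0.0/16': '1'}
print(can_delete("1", mappings))           # False
```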

6.4.  Reading State

   Every document described above could be read and transported to the
   client using the HTTP GET request.  The document is transported
   in full in the corresponding encoding.  It is up to the controller
   to implement proper read/write locking to avoid inconsistencies in
   data when multiple clients are present.  No locking API should ever
   be exposed to the client, since that would affect the stateless
   nature of the communications.  Notice that reading the link-state
   database is mostly informative to the client, since handling of
   network failures is performed by the BGP Controller.

6.5.  Writing State

   The topology definition documents and the IPv4/IPv6 mapping tables
   could be fully re-written using the HTTP PUT verb.  This means that
   with every operation, the client must supply the full new document,
   not an incremental change.  It is up to the client to merge the new
   change with the already existing information.  If consistency
   across multiple writers is required, it should be implemented by the
   clients, possibly via the use of an external shared locking API.
   Referential integrity checks could be implemented in the controller,
   e.g. to validate that the topology references in the mapping
   actually exist, or alternatively could be left to the client.

   It is possible to implement incremental changes using the HTTP PATCH
   verb semantics (see [RFC5789]) in the server.  In this case, it is
   up to the server to properly merge the incremental change and ensure
   there are no conflicts or duplicates.  This is a more complex model
   as compared to the simple "PUT" logic.

6.6.  Typical API Call Sequence

   A typical sequence of actions for a client willing to perform traffic
   engineering could be as follows (assuming absence of the PATCH
   operation):

   o  Decide which prefixes are to be affected by this operation.




   o  Create a topology to perform the link-state operation, or re-use
      the one previously created by this application.  Verify topology
      existence using the GET operation in the "/topologies" directory.

   o  Add new links with the desired costs to the topology.  If the
      topology already exists, read it first using the GET operation,
      then perform the merge on the client side, and finally submit the
      updated topology using the PUT operation.

   o  Obtain current prefix mappings for the desired address family
      using the GET operation.  Parse the mappings and perform any
      consistency checks required, followed by adding the entries for
      prefixes to act upon, mapping them to the topology created/updated
      above.

   o  HTTP PUT the new mappings document, replacing the one that exists
      on the server as a whole.
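
   The sequence above can be sketched in client code.  To keep the
   example self-contained and runnable without a network, the server
   is replaced by a hypothetical in-memory stand-in whose get, post
   and put methods mirror the HTTP verbs; the document layouts and all
   names are assumptions.  As a simplification, topology existence is
   checked by reading the topology document directly.

```python
import copy


class InMemoryController:
    """Hypothetical stand-in for the server side of the API (no HTTP)."""

    def __init__(self):
        self.docs = {"/mappings/ipv4": {}, "/mappings/ipv6": {}}

    def get(self, path):
        # Returns None when the document does not exist (cf. HTTP 404).
        return copy.deepcopy(self.docs.get(path))

    def post(self, path, body):
        self.docs[path] = copy.deepcopy(body)

    def put(self, path, body):
        self.docs[path] = copy.deepcopy(body)


def engineer_traffic(api, topology, links, prefixes, family="ipv4"):
    """Map prefixes onto a topology with the given link costs."""
    topo_path = "/topologies/" + topology
    current = api.get(topo_path)
    if current is None:
        # Create the topology with basic (empty) content.
        api.post(topo_path, {"links": {}})
        current = {"links": {}}
    current["links"].update(links)  # merge on the client side
    api.put(topo_path, current)     # rewrite the full document

    map_path = "/mappings/" + family
    mappings = api.get(map_path)    # read the current mappings
    for prefix in prefixes:
        mappings[prefix] = topology
    api.put(map_path, mappings)     # replace the document as a whole
    return mappings


controller = InMemoryController()
engineer_traffic(controller, "2", {"AS4-AS5": 2}, ["10.4.0.0/16"])
print(controller.docs["/topologies/2"])   # {'links': {'AS4-AS5': 2}}
print(controller.docs["/mappings/ipv4"])  # {'10.4.0.0/16': '2'}
```

   A real client would issue the corresponding HTTP requests and would
   also need to handle concurrent writers, as discussed in Section 6.5.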

6.7.  Limitations

   The API is purposely focused only on routing information
   manipulation, and does not provide any way to verify that the
   requested operation has been accomplished.  Such monitoring should be
   done separately, using either mechanisms available in BGP (e.g. by
   learning the prefixes' new paths via a separate session) or
   mechanisms outside of BGP, e.g. the BGP Monitoring Protocol
   ([I-D.ietf-grow-bmp]) or the Multi-Threaded Routing Toolkit
   ([RFC6396]).

7.  Security Considerations

   The design of the BGP Controller in its simplest form assumes that
   the API presents no access control to the third-party applications.
   Access could be limited at the transport level, e.g. by using
   protocol (HTTP) authentication or access control capabilities, but
   the API itself does not provide any logic to segregate applications
   - i.e. there is currently no way to limit an application to
   manipulating only a certain subset of the IP address space.

8.  Acknowledgements

   The authors would like to thank Robert Raszuk for reviewing the
   document and providing valuable feedback.

9.  References

9.1.  Normative References

   [RFC4271]  Rekhter, Y., Li, T., and S. Hares, "A Border Gateway
              Protocol 4 (BGP-4)", RFC 4271, January 2006.

   [RFC5789]  Dusseault, L. and J. Snell, "PATCH Method for HTTP", RFC
              5789, March 2010.

   [RFC1997]  Chandrasekeran, R., Traina, P., and T. Li, "BGP
              Communities Attribute", RFC 1997, August 1996.

9.2.  Informative References

   [I-D.lapukhov-bgp-routing-large-dc]
              Lapukhov, P., Premji, A., and J. Mitchell, "Use of BGP for
              routing in large-scale data centers", draft-lapukhov-bgp-
              routing-large-dc-06 (work in progress), August 2013.

   [I-D.ietf-grow-bmp]
              Scudder, J., Fernando, R., and S. Stuart, "BGP Monitoring
              Protocol", draft-ietf-grow-bmp-07 (work in progress),
              October 2012.

   [RFC4786]  Abley, J. and K. Lindqvist, "Operation of Anycast
              Services", BCP 126, RFC 4786, December 2006.

   [RFC6774]  Raszuk, R., Fernando, R., Patel, K., McPherson, D., and K.
              Kumaki, "Distribution of Diverse BGP Paths", RFC 6774,
              November 2012.

   [RFC6976]  Shand, M., Bryant, S., Previdi, S., Filsfils, C.,
              Francois, P., and O. Bonaventure, "Framework for Loop-Free
              Convergence Using the Ordered Forwarding Information Base
              (oFIB) Approach", RFC 6976, July 2013.

   [RFC2992]  Hopps, C., "Analysis of an Equal-Cost Multi-Path
              Algorithm", RFC 2992, November 2000.

   [RFC6241]  Enns, R., Bjorklund, M., Schoenwaelder, J., and A.
              Bierman, "Network Configuration Protocol (NETCONF)", RFC
              6241, June 2011.

   [RFC6396]  Blunk, L., Karir, M., and C. Labovitz, "Multi-Threaded
              Routing Toolkit (MRT) Routing Information Export Format",
              RFC 6396, October 2011.

   [I-D.ietf-idr-add-paths]
              Walton, D., Retana, A., Chen, E., and J. Scudder,
              "Advertisement of Multiple Paths in BGP", draft-ietf-idr-
              add-paths-08 (work in progress), December 2012.

   [I-D.ietf-idr-link-bandwidth]
              Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
              Extended Community", draft-ietf-idr-link-bandwidth-06
              (work in progress), January 2013.

   [I-D.raszuk-wide-bgp-communities]
              Raszuk, R., Haas, J., Amante, S., Steenbergen, R.,
              Decraene, B., and P. Jakma, "Wide BGP Communities
              Attribute", draft-raszuk-wide-bgp-communities-03 (work in
              progress), July 2012.

   [I-D.uttaro-idr-bgp-persistence]
              Uttaro, J., Chen, E., Decraene, B., and J. Scudder,
              "Support for Long-lived BGP Graceful Restart", draft-
              uttaro-idr-bgp-persistence-02 (work in progress), July
              2013.

   [JAKMA2008]
              Jakma, P., "BGP Path Hunting", 2008,
              <https://blogs.oracle.com/paulj/entry/bgp_path_hunting>.

   [PAXOS]    Wikipedia, "Paxos",
              <http://en.wikipedia.org/wiki/Paxos_(computer_science)>.

   [REST]     Wikipedia, "Representational state transfer",
              <http://en.wikipedia.org/wiki/Representational_state_transfer>.

   [RWHITE2005]
              White, R., "Graph Overlays on Path Vector: A Possible Next
              Step in BGP", June 2005,
              <http://www.cisco.com/web/about/ac123/ac147/archived_issues/ipj_8-2/graph_overlays.html>.

   [KVALBEIN2007]
              Kvalbein, A. and O. Lysne, "How can Multi-Topology Routing
              be used for Intradomain Traffic Engineering?", 2007.

   [IEEE8023AD]
              IEEE, "IEEE Standard for Link aggregation for parallel
              links", IEEE 802.3ad, October 2000.

   [RCP]      Caesar, M., Caldwell, D., Feamster, N., and J. Rexford,
              "Design and Implementation of a Routing Control Platform",
              March 2005,
              <http://www.cs.princeton.edu/~jrex/papers/rcp-nsdi.pdf>.

Authors' Addresses

   Petr Lapukhov
   Microsoft Corporation
   One Microsoft Way
   Redmond, WA  98052
   US

   Phone: +1 425 7032723
   Email: petrlapu@microsoft.com
   URI:   http://microsoft.com/


   Edet Nkposong
   Microsoft Corporation
   One Microsoft Way
   Redmond, WA  98052
   US

   Phone: +1 425 7071045
   Email: edetn@microsoft.com
   URI:   http://microsoft.com/