Internet Exchange Route Server

The growing popularity of Internet exchange points (IXPs) brings a new set of requirements to interconnect participating networks. While bilateral exterior BGP sessions between exchange participants were previously the most common means of exchanging reachability information, the overhead associated with dense interconnection has caused substantial operational scaling problems for Internet exchange point participants.

This document outlines a specification for multilateral interconnections at IXPs. Multilateral interconnection is a method of exchanging routing information between three or more BGP speakers using a single intermediate broker system, referred to as a route server. Route servers are typically used on shared access media networks such as Internet exchange points (IXPs), to facilitate simplified interconnection between multiple Internet routers on such a network.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction to Multilateral Interconnection
    1.1. Specification of Requirements
2. Bilateral Interconnection
3. Multilateral Interconnection
4. Technical Considerations for Route Server Implementations
    4.1. Client UPDATE Messages
    4.2. Attribute Transparency
        4.2.1. NEXT_HOP Attribute
        4.2.2. AS_PATH Attribute
        4.2.3. MULTI_EXIT_DISC Attribute
        4.2.4. Communities Attributes
    4.3. Per-Client Prefix Filtering
        4.3.1. Prefix Hiding on a Route Server
        4.3.2. Mitigation Techniques
            4.3.2.1. Multiple Route Server RIBs
            4.3.2.2. Advertising Multiple Paths
5. Operational Considerations for Route Server Installations
    5.1. Route Server Scaling
        5.1.1. Tackling Scaling Issues
            5.1.1.1. View Merging and Decomposition
            5.1.1.2. Destination Splitting
            5.1.1.3. NEXT_HOP Resolution
    5.2. NLRI Leakage Mitigation
    5.3. Route Server Redundancy
    5.4. AS_PATH Consistency Check
    5.5. Implementing Routing Policies
        5.5.1. Communities
        5.5.2. Internet Routing Registry
6. Security Considerations
7. IANA Considerations
8. Acknowledgments
9. References
    9.1. Normative References
    9.2. Informative References
§ Authors' Addresses

1. Introduction to Multilateral Interconnection

Internet exchange points (IXPs) provide IP data interconnection facilities for their participants, typically using shared Layer-2 networking media such as Ethernet. The Border Gateway Protocol (BGP) [RFC4271] (Rekhter, Y., Li, T., and S. Hares, “A Border Gateway Protocol 4 (BGP-4),” January 2006.), an inter-Autonomous System routing protocol, is commonly used to facilitate exchange of network reachability information over such media.

In the case of bilateral interconnection between two exchange participant routers, each router must be configured with a BGP session to the other. At IXPs with many participants who wish to implement dense interconnection, this requirement can lead both to large router configurations and high administrative overhead. Given the growth in the number of participants at many IXPs, it has become operationally troublesome to implement densely meshed interconnections at these IXPs.

Multilateral interconnection is a method of interconnecting BGP speaking routers using a third party brokering system, commonly referred to as a route server and typically managed by the IXP operator. Each of the multilateral interconnection participants (usually referred to as route server clients) announces network reachability information to the route server using exterior BGP, and the route server in turn forwards this information to each other route server client connected to it, according to its configuration. Although a route server uses BGP to exchange reachability information with each of its clients, it does not forward traffic itself and is therefore not a router.

The term "route server" is often in a different context used to describe a BGP node whose purpose is to accept BGP feeds from multiple clients for the purpose of operational analysis and troubleshooting. A system of this form may alternatively be known as a "route collector" or a "route-views server". This document uses the term "route server" exclusively to describe multilateral peering brokerage systems.

1.1. Specification of Requirements

2. Bilateral Interconnection

Bilateral interconnection is a method of interconnecting routers using individual BGP sessions between each participant router on an IXP in order to exchange reachability information. While interconnection policies vary from participant to participant, most IXPs have significant numbers of participants who see value in interconnecting with as many other exchange participants as possible. In order for an IXP participant to implement a dense interconnection policy, it is necessary for the participant to liaise with each of their intended interconnection partners and if this partner agrees to interconnect, then both participants' routers must be configured with a BGP session to exchange network reachability information. If each exchange participant interconnects with each other participant, a full mesh of BGP sessions is needed, as detailed in Figure 1 (Full-Mesh Interconnection at an IXP).

     ___      ___
    /   \    /   \
 ..| AS1 |..| AS2 |..
:   \___/____\___/   :
:     | \    / |     :
:     |  \  /  |     :
: IXP |   \/   |     :
:     |   /\   |     :
:     |  /  \  |     :
:    _|_/____\_|_    :
:   /   \    /   \   :
 ..| AS3 |..| AS4 |..
    \___/    \___/

Figure 1 (Full-Mesh Interconnection at an IXP) depicts an IXP platform with four connected routers, administered by four separate exchange participants, each of them with a locally unique autonomous system number: AS1, AS2, AS3 and AS4. Each of these four participants wishes to exchange traffic with all other participants; this is accomplished by configuring a full mesh of BGP sessions on each router connected to the exchange, resulting in 6 BGP sessions across the IXP fabric.

The number of BGP sessions at an exchange has an upper bound of n*(n-1)/2, where n is the number of routers at the exchange. As many exchanges have relatively large numbers of participating networks, the quadratic scaling requirements of dense interconnection tend to cause operational and administrative overhead at large IXPs. Consequently, new participants to an IXP require significant initial resourcing in order to gain value from their IXP connection, while existing exchange participants need to commit ongoing resources in order to benefit from interconnecting with these new participants.

3. Multilateral Interconnection

Multilateral interconnection is implemented using a route server configured to use BGP to distribute network layer reachability information (NLRI) among all client routers. The route server preserves the BGP NEXT_HOP attribute from all received NLRI UPDATE messages, and passes these messages with unchanged NEXT_HOP to its route server clients, according to its configured routing policy. Using this method of exchanging NLRI messages, an IXP participant router can receive an aggregated list of prefixes from all other route server clients using a single BGP session to the route server instead of depending on BGP sessions with each other router at the exchange. This reduces the overall number of BGP sessions at an Internet exchange from n*(n-1)/2 to n, where n is the number of routers at the exchange.

In practical terms, this allows dense interconnection between IXP participants with low administrative overhead and significantly simpler and smaller router configurations. In particular, new IXP participants benefit from immediate and extensive interconnection, while existing route server participants receive reachability information from these new participants without necessarily having to adapt their configurations.

     ___      ___
    /   \    /   \
 ..| AS1 |..| AS2 |..
:   \___/    \___/   :
:      \      /      :
:       \    /       :
:        \__/        :
: IXP   /    \       :
:      |  RS  |      :
:       \____/       :
:        /  \        :
:       /    \       :
:    __/      \__    :
:   /   \    /   \   :
 ..| AS3 |..| AS4 |..
    \___/    \___/

As illustrated in Figure 2 (IXP-based Interconnection with Route Server), each router on the IXP fabric requires only a single BGP session to the route server, from which it can receive reachability information for all other routers on the IXP which also connect to the route server.

4. Technical Considerations for Route Server Implementations

4.1. Client UPDATE Messages

A route server MUST accept all UPDATE messages received from each of its clients for inclusion in its Adj-RIB-In. These UPDATE messages MAY by omitted from the route server's Loc-RIB or Loc-RIBs, due to filters configured for the purposes of implementing routing policy. The route server SHOULD perform one or more BGP Decision Processes to select routes for subsequent advertisement to its clients, taking into account possible configuration to provide multiple NLRI paths to a particular client as described in Section 4.3.2.2 (Advertising Multiple Paths) or multiple Loc-RIBs as described in Section 4.3.2.1 (Multiple Route Server RIBs). The route server SHOULD forward UPDATE messages where appropriate from its Loc-RIB or Loc-RIBs to its clients.

4.2. Attribute Transparency

As a route server primarily performs a brokering service, modification of attributes could cause route server clients to alter their BGP best-path selection process for received prefix reachability information, thereby changing the intended routing policies of exchange participants. Therefore, contrary to what is specified in section 5. of [RFC4271] (Rekhter, Y., Li, T., and S. Hares, “A Border Gateway Protocol 4 (BGP-4),” January 2006.), route servers SHOULD NOT update well-known BGP attributes received from route server clients before redistributing them to their other route server clients. Optional recognized and unrecognized BGP attributes, whether transitive or non-transitive, SHOULD NOT be updated by the route server and SHOULD be passed on to other route server clients.

4.2.1. NEXT_HOP Attribute

The NEXT_HOP, a well-known mandatory BGP attribute, defines the IP address of the router used as the next hop to the destinations listed in the Network Layer Reachability Information field of the UPDATE message. As the route server does not participate in the actual routing of traffic, the NEXT_HOP attribute MUST be passed unmodified to the route server clients, similar to the "third party" next hop feature described in section 5.1.3. of [RFC4271] (Rekhter, Y., Li, T., and S. Hares, “A Border Gateway Protocol 4 (BGP-4),” January 2006.).

4.2.2. AS_PATH Attribute

AS_PATH is a well-known mandatory attribute which identifies the autonomous systems through which routing information carried in the UPDATE message has passed.

As a route server does not participate in the process of forwarding data between client routers, and because modification of the AS_PATH attribute could affect route server client best-path calculations, the route server SHOULD NOT prepend its own AS number to the AS_PATH segment nor modify the AS_PATH segment in any other way.

4.2.3. MULTI_EXIT_DISC Attribute

MULTI_EXIT_DISC is an optional non-transitive attribute intended to be used on external (inter-AS) links to discriminate among multiple exit or entry points to the same neighboring AS. If applied to an NLRI UPDATE sent to a route server, the attribute (contrary to section 5.1.4 of [RFC4271] (Rekhter, Y., Li, T., and S. Hares, “A Border Gateway Protocol 4 (BGP-4),” January 2006.)) SHOULD be propagated to other route server clients and the route server SHOULD NOT modify the value of this attribute.

4.2.4. Communities Attributes

4.3. Per-Client Prefix Filtering

4.3.1. Prefix Hiding on a Route Server

While IXP participants often use route servers with the intention of interconnecting with as many other route server participants as possible, there are several circumstances where control of prefix distribution on a per-client basis is important for ensuring that desired interconnection policies are met.

     ___      ___
    /   \    /   \
 ..| AS1 |..| AS2 |..
:   \___/    \___/   :
:       \    / |     :
:        \  /  |     :
: IXP     \/   |     :
:         /\   |     :
:        /  \  |     :
:    ___/____\_|_    :
:   /   \    /   \   :
 ..| AS3 |..| AS4 |..
    \___/    \___/

Using the example in Figure 3 (Filtered Interconnection at an IXP), AS1 does not directly exchange prefix information with either AS2 or AS3 at the IXP, but only interconnects with AS4.

In the traditional bilateral interconnection model, prefix filtering to a third party exchange participant is accomplished either by not engaging in a bilateral interconnection with that participant or else by implementing outbound prefix filtering on the BGP session towards that participant. However, in a multilateral interconnection environment, only the route server can perform outbound prefix filtering in the direction of the route server client; route server clients depend on the route server to perform their filtering for them.

If the same prefix is sent to a route server from multiple route server clients with different BGP attributes, and traditional best-path route selection is performed on that list of prefixes, then the route server will select a single best-path prefix for propagation to all connected clients. If, however, the route server has been configured to filter the calculated best-path prefix from reaching a particular route server client, then that client will receive no reachability information for that prefix from the route server, despite the fact that the route server has received alternative reachability information for that prefix from other route server clients. This phenomenon is referred to as "prefix hiding".

For example, in Figure 3 (Filtered Interconnection at an IXP), if the same prefix were sent to the route server via AS2 and AS4, and the route via AS2 was preferred according to BGP's traditional best-path selection, but AS2 was filtered by AS1, then AS1 would never receive this prefix, even though the route server had previously received a valid alternative path via AS4. This happens because the best-path selection is performed only once on the route server for all clients.

It should be noted that prefix hiding will only occur on route servers which employ per-client prefix filtering; if an IXP operator deploys a route server without prefix filtering, then prefix hiding does not occur, as all paths are considered equally valid from the point of view of the route server.

There are several techniques which may be employed to prevent the prefix hiding problem from occurring. Route server implementations SHOULD implement at least one method to prevent prefix hiding.

4.3.2. Mitigation Techniques

4.3.2.1. Multiple Route Server RIBs

The most portable means of preventing the route server prefix hiding problem is by using a route server BGP implementation which performs the per-client best-path calculation for each set of prefixes which results after the route server's client filtering policies have been taken into consideration. This can be implemented by using per-client Loc-RIBs, with prefix filtering implemented between the Adj-RIB-In and the per-client Loc-RIB. Implementations MAY optimize this by maintaining prefixes not subject to filtering policies in a global Loc-RIB, with per-client Loc-RIBs stored as deltas.

This problem mitigation technique is highly portable, as it makes no assumptions about the feature capabilities of the route server clients.

4.3.2.2. Advertising Multiple Paths

The prefix distribution model described above assumes standard BGP session encoding where the route server sends a single path to its client for any given prefix. This path is selected using the BGP path selection decision process described in [RFC4271] (Rekhter, Y., Li, T., and S. Hares, “A Border Gateway Protocol 4 (BGP-4),” January 2006.). If, however, it were possible for the route server to send more than a single path to a route server client, then route server clients would no longer depend on receiving a single best path to a particular prefix; consequently, the prefix hiding problem described in Section 4.3.1 (Prefix Hiding on a Route Server) would disappear.

We present two methods which describe how such increased path diversity could be implemented.

4.3.2.2.1. Diverse BGP Path Approach

The number of paths which may be distributed to a client is constrained by the number of BGP sessions which the server and the client are willing to establish with each other. The distributed paths may be established from the global BGP Loc-RIB on the route server in addition to any per-client Loc-RIB. As there may be more potential paths to a given prefix than configured BGP sessions, this method is not guaranteed to eliminate the prefix hiding problem in all situations. Furthermore, this method may significantly increase the number of BGP sessions handled by the route server, which may negatively impact its performance.

4.3.2.2.2. BGP ADD-PATH Approach

If the ADD-PATH capability is negotiated bidirectionally between the route server and a route server client, and the route server client propagates multiple paths for the same prefix to the route server, then this could potentially cause the propagation of inactive, invalid or suboptimal paths to the route server, thereby causing loss of reachability to other route server clients. For this reason, ADD-PATH implementations on a route server SHOULD enforce send-only mode with the route server clients, which would result in negotiating receive-only mode from the client to the route server.

5. Operational Considerations for Route Server Installations

5.1. Route Server Scaling

While deployment of multiple Loc-RIBs on the route server presents a simple way to avoid the prefix hiding problem noted in Section 4.3.1 (Prefix Hiding on a Route Server), this approach requires significantly more computing resources on the route server than where a single Loc-RIB is deployed for all clients. As the [RFC4271] (Rekhter, Y., Li, T., and S. Hares, “A Border Gateway Protocol 4 (BGP-4),” January 2006.) Decision Process must be applied to all Loc-RIBs deployed on the route server, both CPU and memory requirements on the host computer scale approximately according to O(P * N), where P is the total number of unique prefixes received by the route server and N is the number of route server clients which require a unique Loc-RIB. As this is a super-linear scaling relationship, large route servers may derive benefit from deploying per-client Loc-RIBs only where they are required.

Regardless of any Loc-RIB optimization implemented, the route server's control plane bandwidth requirements will scale according to O(P * N), where P is the total number of unique prefixes received by the route server and N is the total number of route server clients. In the case where P_avg (the arithmetic mean number of unique prefixes received per route server client) remains roughly constant even as the number of connected clients increases, this relationship can be rewritten as O((P_avg * N) * N) or O(N^2). This quadratic upper bound on the network traffic requirements indicates that the route server model will not scale to arbitrarily large sizes.

5.1.1. Tackling Scaling Issues

The network traffic scaling issue presents significant difficulties with no clear solution - ultimately, each client must receive a UPDATE for each unique prefix received by the route server. However, there are several potential methods for dealing with the CPU and memory resource requirements of route servers.

5.1.1.1. View Merging and Decomposition

A variation of this approach may be implemented on route servers by ensuring that separate Loc-RIBs are only configured for route server clients with unique export peering policies.

5.1.1.2. Destination Splitting

Destination splitting, also described in [RS‑ARCH] (Govindan, R., Alaettinoglu, C., Varadhan, K., and D. Estrin, “A Route Server Architecture for Inter-Domain Routing,” 1995.), describes a method for route server clients to connect to multiple route servers and to send non-overlapping sets of prefixes to each route server. As each route server computes the best path for its own set of prefixes, the quadratic scaling requirement operates on multiple smaller sets of prefixes. This reduces the overall computational and memory requirements for managing multiple Loc-RIBs and performing the best-path calculation on each. In order for this method to perform well, destination splitting would require significant co-ordination between the route server operator and each route server client. In practice, such levels of co-ordination are unlikely to work successfully, thereby diminishing the value of this approach.

5.1.1.3. NEXT_HOP Resolution

As route servers are usually deployed at IXPs which use flat layer 2 networks, recursive resolution of the NEXT_HOP attribute is generally not required, and can be replaced by a simple check to ensure that the NEXT_HOP value for each prefix is a network address on the IXP LAN's IP address range.

5.2. NLRI Leakage Mitigation

NLRI leakage occurs when a BGP client unintentionally distributes NLRI UPDATE messages to one or more neighboring BGP routers. NLRI leakage of this form to a route server can cause connectivity problems at an IXP if each route server client is configured to accept all prefix UPDATE messages from the route server. It is therefore RECOMMENDED when deploying route servers that, due to the potential for collateral damage caused by NLRI leakage, route server operators deploy NLRI leakage mitigation measures in order to prevent unintentional prefix announcements or else limit the scale of any such leak. Although not foolproof, per-client inbound prefix limits can restrict the damage caused by prefix leakage in many cases. Per-client inbound prefix filtering on the route server is a more deterministic and usually more reliable means of preventing prefix leakage, but requires more administrative resources to maintain properly.

5.3. Route Server Redundancy

As the purpose of an IXP route server implementation is to provide a reliable reachability brokerage service, it is RECOMMENDED that exchange operators who implement route server systems provision multiple route servers on each shared Layer-2 domain. There is no requirement to use the same BGP implementation or operating system for each route server on the IXP fabric; however, it is RECOMMENDED that where an operator provisions more than a single server on the same shared Layer-2 domain, each route server implementation be configured equivalently and in such a manner that the path reachability information from each system is identical.

5.4. AS_PATH Consistency Check

Route servers do not modify the AS_PATH attribute (as described in Section 4.2.2 (AS_PATH Attribute)), since they do not participate in the traffic exchange. Therefore a consistency check on the AS_PATH of an UPDATE received by a route server client would fail. It is therefore RECOMMENDED that route server clients disable the AS_PATH consistency check towards the route server.

5.5. Implementing Routing Policies

Prefix filtering is commonly implemented on route servers to provide prefix distribution control mechanisms for route server clients. There are a few commonly used strategies available.

5.5.1. Communities

Prefixes sent to the route server are tagged with certain COMMUNITIES attributes agreed upon beforehand between the operator and all participants. Based on the values, routes are propagated to all other participants, a subset of participants, or none. This allows for one-way filtering policies to be implemented on the route server; if a participant chooses not to exchange routes with a certain other participant, he will have to instruct the route server to not announce his own routes and filter incoming routes on his own router.

5.5.2. Internet Routing Registry

6. Security Considerations

On route server installations which do not employ prefix-hiding mitigation techniques, the prefix hiding problem outlined in section Section 4.3.1 (Prefix Hiding on a Route Server) can be used in certain circumstances to proactively block third party prefix announcements from other route server clients.

7. IANA Considerations

The new set of mechanism for route servers does not require any new allocations from IANA.

8. Acknowledgments

The authors would like to thank Chris Hall, Ryan Bickhart and Steven Bakker for their valuable input.

In addition, the authors would like to acknowledge the developers of BIRD, OpenBGPD and Quagga, whose open source BGP implementations include route server capabilities which are compliant with this document.

[I-D.ietf-grow-diverse-bgp-path-dist]	Raszuk, R., Fernando, R., Patel, K., McPherson, D., and K. Kumaki, “Distribution of diverse BGP paths.,” draft-ietf-grow-diverse-bgp-path-dist-02 (work in progress), July 2010 (TXT).
[I-D.ietf-idr-add-paths]	Walton, D., Retana, A., Chen, E., and J. Scudder, “Advertisement of Multiple Paths in BGP,” draft-ietf-idr-add-paths-04 (work in progress), August 2010 (TXT).
[RFC1997]	Chandrasekeran, R., Traina, P., and T. Li, “BGP Communities Attribute,” RFC 1997, August 1996 (TXT).
[RFC2119]	Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” BCP 14, RFC 2119, March 1997 (TXT, HTML, XML).
[RFC2622]	Alaettinoglu, C., Villamizar, C., Gerich, E., Kessens, D., Meyer, D., Bates, T., Karrenberg, D., and M. Terpstra, “Routing Policy Specification Language (RPSL),” RFC 2622, June 1999 (TXT).
[RFC4271]	Rekhter, Y., Li, T., and S. Hares, “A Border Gateway Protocol 4 (BGP-4),” RFC 4271, January 2006 (TXT).
[RFC4360]	Sangli, S., Tappan, D., and Y. Rekhter, “BGP Extended Communities Attribute,” RFC 4360, February 2006 (TXT).
[RFC4456]	Bates, T., Chen, E., and R. Chandra, “BGP Route Reflection: An Alternative to Full Mesh Internal BGP (IBGP),” RFC 4456, April 2006 (TXT).
[RS-ARCH]	Govindan, R., Alaettinoglu, C., Varadhan, K., and D. Estrin, “A Route Server Architecture for Inter-Domain Routing,” 1995.

[RFC1863]	Haskin, D., “A BGP/IDRP Route Server alternative to a full mesh routing,” RFC 1863, October 1995 (TXT).
[RFC3418]	Presuhn, R., “Management Information Base (MIB) for the Simple Network Management Protocol (SNMP),” STD 62, RFC 3418, December 2002 (TXT).
[RFC4223]	Savola, P., “Reclassification of RFC 1863 to Historic,” RFC 4223, October 2005 (TXT).
[RFC4760]	Bates, T., Chandra, R., Katz, D., and Y. Rekhter, “Multiprotocol Extensions for BGP-4,” RFC 4760, January 2007 (TXT).
[RFC5065]	Traina, P., McPherson, D., and J. Scudder, “Autonomous System Confederations for BGP,” RFC 5065, August 2007 (TXT).
[RFC5226]	Narten, T. and H. Alvestrand, “Guidelines for Writing an IANA Considerations Section in RFCs,” BCP 26, RFC 5226, May 2008 (TXT).

	Elisa Jasinska
	Limelight Networks
	2220 W 14th St
	Tempe, AZ 85281
	US
Email:	elisa@llnw.com

	Nick Hilliard
	INEX
	4027 Kingswood Road
	Dublin 24
	IE
Email:	nick@inex.ie

	Robert Raszuk
	Cisco Systems
	170 West Tasman Drive
	San Jose, CA 95134
	US
Email:	raszuk@cisco.com

	Niels Bakker
	AMS-IX B.V.
	Westeinde 12
	Amsterdam, NH 1017 ZN
	NL
Email:	niels.bakker@ams-ix.net

Internet Exchange Route Serverdraft-jasinska-ix-bgp-route-server-01

Abstract