Network Working Group                                       Manav Bhatia
Internet Draft                                       Riverstone Networks
Expires: March 2005                                      Joel M. Halpern
                                                         Megisto Systems


             Advertising Equal Cost MultiPath routes in BGP

                 draft-bhatia-ecmp-routes-in-bgp-01.txt


Status of this Memo

   By submitting this Internet-Draft, I certify that any applicable
   patent or other IPR claims of which I am aware have been disclosed,
   or will be disclosed, and any of which I become aware will be
   disclosed, in accordance with RFC 3668.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.


Abstract

   This document describes an extensible mechanism that will allow a
   BGP [BGP4] speaker to advertise equal cost multiple BGP routes for a
   destination to its peers. A new BGP capability [BGP-CAP], termed
   "Equal Cost Multipath Capability", is defined, which would allow a
   local BGP speaker to express its ability to support advertisement of
   such multiple paths to its peers.

   A new BGP attribute is introduced that will be used to advertise and
   withdraw multiple paths for the feasible and the un-feasible BGP
   routes to the remote peers.

   The mechanisms described in this document are applicable to all
   routers, both those with the ability to inject multiple routing
   entries in their forwarding table and those without.
Bhatia & Halpern                                                [Page 1]

Internet Draft   draft-bhatia-ecmp-routes-in-bgp-01.txt   September 2004


Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [KEYWORDS].


Table of Contents

   1. Introduction
   2. Scenarios where advertising BGP ECMP routes may be useful
   3. Equal Cost Multipath Capability
   4. Operation when both peers are ECMP capable
   5. Procedures for the Local Speaker
   6. Advertisement of ECMP BGP routes
      6.1 Equal Cost Multi-Path Next Hop - ECMP_NEXT_HOP
   7. Procedures for the Receiving Speaker
   8. Working with Non ECMP capable/EBGP peers
   9. Configuring BGP ECMP Support
   10. Working with ECMP capable IBGP peers
   11. Confederations
   12. Multiprotocol Extensions to BGP
   13. Security Considerations
   14. Acknowledgements
   15. IANA Considerations
   16. Appendix A
      16.1 Constructing AS_PATHs
      16.2 Advertising synthetic AS_PATHs
   17. References
   18. Author's Addresses
   19. Intellectual Property Notice
   20. Full Copyright Notice


1. Introduction

   Currently BGP speakers cannot announce ECMP routes, even if they
   have some.
   This is because the BGP specification allows only one "best" route
   to be inserted into the Loc-RIB and to be announced to other BGP
   speakers. If another route for that same destination is received by
   a BGP receiver, then it is taken as an implicit withdraw for the
   previous route. Because of this limitation, a BGP speaker is never
   able to advertise equal cost multipath routes for a destination to
   its peers.

   In some cases, the most that a current implementation can do when it
   receives multiple equal cost BGP routes is to insert all of them (or
   a subset of them based on its local policies) in its forwarding
   table and locally load balance for the destination. However, only
   one "best" BGP path is announced to the peers. The "best" path
   selection could be based either on the lower Router ID or on the
   route which has been received first, if everything else is the same.
   Selecting the best path based on the Router ID (or higher loopback
   address) is deterministic, but can cause MED churn in some
   topologies, while the latter selection criterion is
   non-deterministic.

   This document refers to all the candidate paths that remain after
   the tie breaking procedure, described in sec. 9.1.2.2 [BGP4],
   reaches step (f) as "ECMP BGP paths/routes". It should be noted that
   these paths shall have the same AS_PATH length, though the
   individual AS_PATHs could differ.

   Advertising BGP ECMP paths with different NEXT_HOPs holds value in
   IBGP scenarios because each IBGP peer can reach the NEXT_HOP on its
   own. However, information about the multiple NEXT_HOPs is not useful
   for EBGP peers, because (i) the receiving EBGP peer will not be
   aware of the NEXT_HOP information inside some other AS and (ii) the
   BGP speaker always resets the NEXT_HOP to itself when announcing
   routes to an EBGP peer. Because of this, individual equal cost BGP
   paths are not announced to EBGP peers.
   Only one path is announced, which is an 'aggregate' of all the
   individual equal cost BGP paths for that destination. However, care
   must be taken to ensure that the AS_PATH length of the individual
   contributing AS_PATHs is retained in this 'aggregated' path and that
   enough information is present to enable the receiving peer (and the
   downstream peers) to detect AS loops.

   The use of BGP ECMP routes is most prevalent inside an AS to
   identify its local BGP routes that represent load balanced links.
   This is useful for applications that want to use the BGP protocol as
   a mechanism for propagating this information for load balancing
   across multiple IBGP paths.

   As a side effect, advertising BGP ECMP routes can also help solve
   some cases of persistent MED [MED] oscillations.

   Any effort to modify the way information flows through BGP runs the
   risk of introducing new oscillation conditions, even if it addresses
   existing conditions. The changes proposed in this draft preserve
   additional path information farther into an AS. In order to allow
   gradual deployment, the changes also specify how to compress that
   additional information when talking to non-upgraded nodes. These
   changes have been designed to preserve the stability of the decision
   process when working with nodes that use the standard algorithms for
   processing and comparing path information.


2. Scenarios where advertising BGP ECMP routes may be useful

   o  Load Splitting when receiving BGP routes

         A (AS X)
          \
           \
            \
             C (AS Y)
            /
           /
          /
         B (AS Z)

   A, B and C are BGP speakers in AS X, AS Z and AS Y respectively.
   Assume that C is peered up with similar sized ISPs A and B and
   accepts the entire Internet feed from each one of these. It is
   common in such scenarios for the ISP C to receive multiple routes of
   equal cost from both A and B. Ordinarily, in order to use what it
   advertises, C can use only one "best" route learnt from either of A
   or B.
   Configuring C for load balancing involves a lot of prepending,
   modifying routes, splitting prefixes received from A and B, etc.
   Even then, it is difficult to have a 50/50 split across A and B as
   the load can only be statically split. The best and most obvious
   solution is to let C install multipath routes for the common
   destinations learnt from each of A and B.

   This document describes an extension to BGP that makes it possible
   for C to install equal cost multipath BGP routes and to advertise
   those to its peers in a manner that makes it possible for each of
   the downstream peers to run their AS loop detection algorithm.

   o  Suboptimal Routing in Route Reflector clients

   Route Reflection [RR] can result in suboptimal routing due to the
   client not having full visibility to all the BGP paths in the AS.
   This is because the RR selects the best path and reflects only that
   best path to its clients. In case the RR has equal cost BGP routes,
   it shall select the one with the lower Router ID. As a result, the
   clients do not receive the full view of the available paths, or at
   least the paths that are equidistant from the RR. This can result in
   suboptimal routing from the client's perspective: a client may have
   selected a different best path if more paths had been made visible
   to it.

   With BGP ECMP, the RR can at least advertise all the equal cost BGP
   routes that it has to its client, giving the client more options to
   choose from.
   o  Avoiding Persistent Route Oscillations

         +--------------------------------------+
         |                 AS X                 |
         |                                      |
         |                  RR                  |
         |             c1 /    \ c2             |
         |               /      \               |
         |             Ra        Rb             |
         +------------/--\--------\-------------+
                     /    \        \
                    /      \        \
             +-----/--+  +--\--------\--------+
             |   R2   |  |   R1       R3      |
             |  AS Y  |  |        AS Z        |
             +--------+  +--------------------+

   Consider the topology shown in the above figure. Say, AS X consists
   of a Route Reflector (RR) and two clients Ra and Rb. Ra is connected
   to R2 in AS Y and R1 in AS Z. Rb is connected to R3 in AS Z. Assume
   that the Router ID of R1 < R2 and IGP cost c1 < c2. The lines
   between the routers show BGP peering.

   Assume that the BGP speakers in AS Y and AS Z receive a BGP UPDATE
   for 10.0.0.0/8 from AS W. Assume that they advertise the following
   path attributes to BGP speakers in AS X.

   R2:
      NLRI 10.0.0.0/8, AS_PATH Y W, MED 100, NEXT_HOP R2

   R1:
      NLRI 10.0.0.0/8, AS_PATH Z W, MED 300, NEXT_HOP R1

   R3:
      NLRI 10.0.0.0/8, AS_PATH Z W, MED 200, NEXT_HOP R3

   The following events happen:

   [1] Ra receives UPDATEs from R2 and R1. Since they are from
   different ASes, MEDs are not compared and the tie breaks on the
   lower Router ID. Since R1 < R2, the route from R1 is selected and
   advertised to the RR. Ra thus has the following path as the best one
   for 10.0.0.0/8:

      AS_PATH Z W, MED 300, NEXT_HOP R1

   [2] Rb receives the UPDATE from R3, installs this and advertises the
   same to the RR. Rb thus has the following path for 10.0.0.0/8:

      AS_PATH Z W, MED 200, NEXT_HOP R3

   [3] RR receives two UPDATEs from its clients. Since the neighboring
   AS is the same in both of them, the tie breaks on the route having
   the lower value of MED. It thus selects the route it learns from Rb
   as the best one and advertises this to Ra.

   [4] Ra now has all three paths. The route learnt from Rb wins over
   the route learnt from R1 (lower MED) and the route learnt from R2
   wins over the route learnt from Rb (EBGP > IBGP).

   [5] Ra thus sends an implicit WITHDRAW to the RR, replacing the
   earlier announcement with the route learnt from R2.

   [6] RR thus has the following paths for 10.0.0.0/8:

      i)  AS_PATH Z W, MED 200, NEXT_HOP R3
      ii) AS_PATH Y W, MED 100, NEXT_HOP R2

   It selects the first path because the IGP cost to reach the NEXT_HOP
   is lower for the first one. It thus advertises this path to Rb and
   sends a WITHDRAW message to Ra, removing the path it had initially
   announced (the one learnt from Rb).

   [7] Ra receives the WITHDRAW message from the RR and removes the
   path. Nothing else is done, as it is currently not the best path.

   [8] Rb receives the advertisement from RR, but doesn't do anything,
   as the path learnt from R3 is better (EBGP > IBGP).

   [9] Ra at this time has only two routes: one learnt from R1 and the
   other learnt from R2. It has selected the route learnt from R2.
   After some time, this router runs its scanner process for validating
   the NEXT_HOPs. There it runs the best path algorithm and finds that
   the route learnt from R1 is better than the route learnt from R2,
   because of the lower Router ID.

   [10] Ra sends an implicit WITHDRAW to RR, replacing the earlier
   announcement with the route learnt from R1.

   The same sequence then repeats, and the routes cycle again and
   again.

   Scenario 2: ECMP BGP is implemented on routers in AS X

   [1] If everything happens in a similar way as in the preceding
   example then Ra will have two paths to reach 10.0.0.0/8. Since
   everything else is the same, it will advertise both these routes to
   the RR.
   Note that Ra will not look at the Router ID, etc. for tie breaking
   if ECMP capabilities are implemented.

   [2] RR will now have three paths for 10.0.0.0/8: Path 3 from Rb, and
   Paths 1 and 2 from Ra.

      Path 1: AS_PATH Y W, MED 100, NEXT_HOP R2
      Path 2: AS_PATH Z W, MED 300, NEXT_HOP R1
      Path 3: AS_PATH Z W, MED 200, NEXT_HOP R3

   Out of Path 2 and Path 3, it will select Path 3 (lower MED). From
   Path 1 and Path 3, it will select Path 1, based on the lower IGP
   cost.

   [3] RR will advertise the new path to Rb. Nothing will happen on Rb,
   and it will continue using the same path as before.

   The network stabilizes and there are no more route oscillations.


3. Equal Cost Multipath Capability

   To advertise the ECMP Capability to a peer, a BGP speaker uses BGP
   Capabilities Advertisement [BGP-CAP]. This capability is advertised
   using some Capability code (TBD) and Capability length 0.

   By advertising the ECMP Capability to a peer, a BGP speaker conveys
   to the peer that the speaker is capable of receiving and properly
   handling the ECMP Updates from that peer.


4. Operation when both peers are ECMP capable

   In the following sections, "Local Speaker" refers to a router which
   is advertising the BGP ECMP routes, and "Receiving Speaker" refers
   to a router that peers with the former to accept multiple BGP routes
   for a destination.

   Consider that the ECMP Capability has been exchanged between the
   Local Speaker and the Receiving Speaker, and a BGP session between
   them is established. The following sections detail the procedures
   that shall be followed by the Local Speaker as well as the Receiving
   Speaker once the ECMP capability has been exchanged and the Local
   Speaker holds some ECMP BGP paths.


5. Procedures for the Local Speaker

   Once the Local Speaker receives multiple BGP paths for the same
   destination from different peers (or from the same peer, in case it
   is peered up with an 'ecmp-capable' router), it shall run its
   decision process to select the best BGP routes. It will inject the
   best ones in its forwarding table and advertise those to its peers
   that have exchanged the ECMP capability.

   Section 9.1.2.2 of [BGP4] explains the tie breaking procedure for
   selecting only one of the routes, from the multiple routes present
   in Adj-RIBs-In, for inclusion in the associated Loc-RIB. This
   document modifies this algorithm to support inclusion of multiple
   routes in the Loc-RIB and, subsequently, advertisement of multiple
   ECMP routes to the peers.

   The change introduced is as follows:

   After step (e) in sec. 9.1.2.2, whatever candidate BGP routes remain
   are all considered for inclusion in the Loc-RIB and are announced to
   the remote BGP speaker supporting this capability. The following
   sections describe how this is done.


6. Advertisement of ECMP BGP routes

   To provide backward compatibility, as well as to simplify
   introduction of the ECMP capabilities into BGP, a new BGP attribute,
   Equal Cost Multi-Path Next Hop (ECMP_NEXT_HOP), is introduced. This
   may be used in addition to the existing NEXT_HOP attribute to
   announce multiple next-hops for the destinations listed in the
   Network Layer Reachability Information of the UPDATE message.

   Prefixes announced using this attribute MUST NOT replace the
   previous advertisements; thus multiple BGP paths for a prefix can be
   advertised by the Local Speaker. If the same prefix is later
   announced with only the NEXT_HOP attribute then it MUST be taken as
   an implicit withdraw for all the previous paths advertised by that
   peer for that destination.
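   The add-versus-replace rule above can be illustrated with a minimal
   sketch of how a Receiving Speaker might track the paths learned from
   one ECMP-capable peer. Everything here is an assumption made for
   illustration: the class name AdjRibIn, its method names, and the
   reduction of a path to its next-hop address alone (a real
   implementation would store the full set of path attributes with
   each path). It is not intended to specify any particular
   implementation.

```python
from typing import Dict, Iterable, Set


class AdjRibIn:
    """Hypothetical per-peer store: prefix -> next hops learned from that peer."""

    def __init__(self) -> None:
        self._next_hops: Dict[str, Set[str]] = {}

    def announce_with_ecmp_next_hop(self, prefix: str,
                                    next_hops: Iterable[str]) -> None:
        # ECMP_NEXT_HOP present: the new paths are added alongside any
        # paths already held for the prefix; nothing is replaced.
        self._next_hops.setdefault(prefix, set()).update(next_hops)

    def announce_with_next_hop_only(self, prefix: str, next_hop: str) -> None:
        # Only NEXT_HOP present: implicit withdraw of every path this
        # peer previously advertised for the prefix.
        self._next_hops[prefix] = {next_hop}

    def next_hops(self, prefix: str) -> Set[str]:
        # Return a copy so callers cannot mutate the stored set.
        return set(self._next_hops.get(prefix, ()))


rib = AdjRibIn()
rib.announce_with_ecmp_next_hop("10.0.0.0/8", ["R1", "R2"])
rib.announce_with_ecmp_next_hop("10.0.0.0/8", ["R3"])
paths_after_ecmp = rib.next_hops("10.0.0.0/8")      # R1, R2 and R3 are all kept
rib.announce_with_next_hop_only("10.0.0.0/8", "R4")
paths_after_plain = rib.next_hops("10.0.0.0/8")     # only R4 remains
```

   Keeping one such store per peer mirrors the wording "all the
   previous paths advertised by that peer": a NEXT_HOP-only
   announcement from one peer does not disturb paths learned from
   others.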
   An UPDATE message that contains feasible routes and carries
   ECMP_NEXT_HOP and no NEXT_HOP attribute MUST NOT be considered as an
   implicit withdrawal. The Receiving Speaker MUST simply append these
   routes in its Adj-RIBs-In, as additional paths to that destination.

   If some attributes (LocPref, MED, etc.) change for a previously
   announced BGP ECMP route, causing it to no longer be among the best
   routes, then an explicit withdraw message MUST be sent to all the
   peers to whom this route had been earlier announced.

6.1 Equal Cost Multi-Path Next Hop - ECMP_NEXT_HOP

   This is an optional non-transitive attribute that can be used for
   advertising multiple next-hops associated with a NLRI.

   The attribute contains one or more triples, where each triple is
   encoded as shown below:

   +---------------------------------------------------+
   | Address Family Identifier (2 octets)              |
   +---------------------------------------------------+
   | Subsequent Address Family Identifier (1 octet)    |
   +---------------------------------------------------+
   | Number of Next Hops (1 octet)                     |
   +---------------------------------------------------+
   | Length of the First Next Hop (1 octet)            |
   +---------------------------------------------------+
   | Network Address of First Next Hop (variable)      |
   +---------------------------------------------------+
   | Length of the Second Next Hop (1 octet)           |
   +---------------------------------------------------+
   | Network Address of Second Next Hop (variable)     |
   +---------------------------------------------------+
   | . . .                                             |
   +---------------------------------------------------+
   | . . .                                             |
   +---------------------------------------------------+
   | Length of the Nth Next Hop (1 octet)              |
   +---------------------------------------------------+
   | Network Address of Nth Next Hop (variable)        |
   +---------------------------------------------------+

   The use and meaning of these fields are as follows:

   Address Family Identifier:

      This field carries the identity of the Network Layer protocol
      associated with the Network Address that follows. Presently
      defined values for this field are specified in RFC1700 [AFI].

   Subsequent Address Family Identifier:

      This field in combination with the Address Family Identifier
      field identifies the Network Layer protocol associated with the
      Network Address of the Next Hop(s) [SAFI].

   Number of Next-Hops:

      This field carries the total number of ECMP BGP routes for the
      given NLRI.

   Length of Nth Next Hop Network Address:

      A 1 octet field whose value expresses the length of the "Network
      Address of Next Hop" field as measured in octets. For IPv6 routes
      the value shall be set to 16, when only a global address is
      present, or 32 if a link-local address is also included in the
      Next Hop field [BGP-IPv6].

   Network Address of Nth Next Hop:

      This is a variable length field that contains the Network Address
      of the next router on the path to the destination.

   The N next-hops listed in the ECMP_NEXT_HOP path attribute define
   the Network Layer addresses of the routers that should be used as
   next-hops to the destinations listed in the UPDATE message.

   The receiver MUST NOT remove any previous routes and MUST add the
   route received with an ECMP_NEXT_HOP attribute rather than replace
   the previous routes.

   When advertising more than one ECMP hop with identical attributes
   the sender SHALL send a single update with multiple next-hops listed
   in this attribute.
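   The field layout described above can be sketched as a small encoder
   and decoder for a single triple. This is a minimal sketch under
   stated assumptions: the function names are hypothetical, only one
   triple is handled, and the enclosing path attribute header (flags,
   type code, length) is omitted because the attribute type code is not
   assigned in this document. AFI 1 (IPv4) and SAFI 1 (unicast) are the
   IANA-registered values used in the example.

```python
import ipaddress
import struct
from typing import List, Tuple


def encode_ecmp_triple(afi: int, safi: int, next_hops: List[bytes]) -> bytes:
    """Encode AFI (2 octets), SAFI (1), count (1), then length-prefixed hops."""
    if not 1 <= len(next_hops) <= 255:
        raise ValueError("next-hop count must fit in one octet")
    out = struct.pack("!HBB", afi, safi, len(next_hops))
    for hop in next_hops:
        if len(hop) > 255:
            raise ValueError("next-hop length must fit in one octet")
        out += struct.pack("!B", len(hop)) + hop
    return out


def decode_ecmp_triple(data: bytes) -> Tuple[int, int, List[bytes]]:
    """Inverse of encode_ecmp_triple; raises ValueError on truncated input."""
    if len(data) < 4:
        raise ValueError("truncated header")
    afi, safi, count = struct.unpack_from("!HBB", data, 0)
    offset = 4
    hops: List[bytes] = []
    for _ in range(count):
        if offset >= len(data):
            raise ValueError("truncated next-hop length")
        length = data[offset]
        offset += 1
        hop = data[offset:offset + length]
        if len(hop) != length:
            raise ValueError("truncated next-hop address")
        hops.append(hop)
        offset += length
    return afi, safi, hops


# Two IPv4 next hops taken from the documentation range 192.0.2.0/24.
hops = [ipaddress.IPv4Address("192.0.2.1").packed,
        ipaddress.IPv4Address("192.0.2.2").packed]
wire = encode_ecmp_triple(1, 1, hops)
decoded = decode_ecmp_triple(wire)
```

   For the two IPv4 next hops in the example, the encoded triple is 14
   octets long: 4 octets of AFI, SAFI and count, followed by two
   next-hop entries of 1 length octet plus 4 address octets each.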
   When advertising more than one ECMP hop whose attributes are not
   identical, multiple BGP updates must be sent, each with the
   ECMP_NEXT_HOP attribute included to suppress route replacement.


7. Procedures for the Receiving Speaker

   The Receiving Speaker, upon receiving the ECMP_NEXT_HOP attribute,
   will understand that the Local Speaker has advertised ECMP BGP
   routes. In a single UPDATE message all the prefixes will have
   identical attributes, except for the next-hops, which will be
   carried in the ECMP_NEXT_HOP attribute. It will run the modified
   decision process as explained in Section 5 and, depending upon the
   result, will either

   -  inject multiple routes into its Loc-RIB and advertise multiple
      paths to its peers

   OR

   -  inject a single route which has better path attributes than the
      other routes that it has just received.

   If the Receiving Speaker receives some withdrawn routes along with
   the other path attributes and the ECMP_NEXT_HOP attribute then it
   shall understand that some of the previously advertised ECMP BGP
   routes have been removed, and an implementation MUST proceed with
   removing all such paths.

   If a peer wants to withdraw all the ECMP BGP routes for some
   particular destination then it can send a normal BGP UPDATE message
   listing the NLRI in the WITHDRAWN Routes field. An implementation on
   the Receiving Speaker MUST then remove all the ECMP routes for that
   destination which it heard from the Local Speaker.

   If the Receiving Speaker receives an UPDATE message with the
   ECMP_NEXT_HOP attribute containing both feasible and unfeasible
   routes, then it MUST consider these attributes for the feasible
   routes. All the destinations listed in the withdrawn routes shall be
   removed as per [BGP4].


8. Working with Non ECMP capable/EBGP peers

   This section discusses how BGP ECMP routes are advertised to a non
   ECMP capable peer or an EBGP peer.
   If an ECMP capable BGP speaker has installed multiple BGP paths in
   its forwarding table then it must advertise the AS_PATH for each one
   of these routes in a manner that makes it possible for the
   downstream peers to run the AS Path loop detection algorithm.
   However, care must be taken to ensure that the AS_PATH length
   remains unchanged. In such scenarios, the speaker must send out a
   synthetic AS_PATH to the non-ECMP BGP receiver (or an EBGP peer),
   where each element of the AS_PATH is an AS_SET built with the AS
   values corresponding to each segment in each AS_PATH. This is
   possible since the individual contributing AS_PATHs are of the same
   length.

   The above procedure is described below as pseudo-code. Note that the
   pseudo-code shown was chosen for clarity, not efficiency. It is not
   intended to specify any particular implementation. BGP
   implementations MAY use any algorithm which produces the same
   results as those described here.

   Given two AS_PATHs X and Y, each with N segments, which we wish to
   merge into a new combined AS_PATH Z of N segments:

      Expand every AS_SEQUENCE segment in X and Y which contains
      multiple AS values into single-valued segments, such that the
      number of segments is equivalent to the path length, with order
      preserved.

      for every segment, n, from 0 to N-1
         create a segment Z(n) of type AS_SET
         for every AS value in segments X(n) and Y(n)
            add the AS value into Z(n)

   The resulting AS_PATH Z will consist of N AS_SETs, each AS_SET
   segment Z(n) having all AS values in segments X(n) and Y(n).
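   The pseudo-code above can be sketched in runnable form as follows.
   This is an illustrative sketch only, in the same spirit as the
   pseudo-code: the representation of a segment as a (type, values)
   pair and the function names expand and merge_as_paths are
   assumptions made for this example. Duplicate AS values are kept
   exactly as the pseudo-code adds them; no clean-up of the merged sets
   is attempted here.

```python
from typing import List, Tuple

# A segment is (segment type, AS values); the type is "AS_SEQUENCE" or "AS_SET".
Segment = Tuple[str, List[str]]


def expand(path: List[Segment]) -> List[List[str]]:
    """Split multi-valued AS_SEQUENCE segments into single-valued slots.

    After expansion every slot counts as exactly one toward the AS_PATH
    length: one slot per AS in a sequence, one slot per whole AS_SET.
    """
    slots: List[List[str]] = []
    for seg_type, values in path:
        if seg_type == "AS_SEQUENCE":
            slots.extend([value] for value in values)
        else:
            slots.append(list(values))
    return slots


def merge_as_paths(x: List[Segment], y: List[Segment]) -> List[Segment]:
    """Merge two equal-length AS_PATHs into a synthetic path of AS_SETs."""
    xs, ys = expand(x), expand(y)
    if len(xs) != len(ys):
        raise ValueError("contributing AS_PATHs must have the same length")
    # Z(n) is an AS_SET holding every AS value from X(n) and Y(n).
    return [("AS_SET", a + b) for a, b in zip(xs, ys)]


simple = merge_as_paths([("AS_SEQUENCE", ["a", "b", "c"])],
                        [("AS_SEQUENCE", ["x", "y", "z"])])

mixed = merge_as_paths(
    [("AS_SEQUENCE", ["a", "b", "c"]), ("AS_SET", ["p1", "p2"])],
    [("AS_SEQUENCE", ["x", "y"]), ("AS_SET", ["q1", "q2", "q3"]),
     ("AS_SET", ["q4"])],
)
```

   In the second call the two inputs have different segment counts but
   the same path length of four, so the merge succeeds: the third slot
   combines the single AS "c" with the set "q1 q2 q3", and the fourth
   combines "p1 p2" with "q4".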
   To cite an example, consider a BGP speaker (say in AS A1) having the
   following paths for a destination D1:

   Path 1:
      AS_PATH "a b c", Origin IGP, MED 10, NEXT_HOP N1

   Path 2:
      AS_PATH "x y z", Origin IGP, MED 20, NEXT_HOP N2

   It inserts these two paths in its forwarding table, and announces
   the following to its non-ecmp capable peer:

      AS_PATH: AS_SET (a,x) AS_SET (b,y) AS_SET (c,z)
      Origin IGP, MED 10, NEXT_HOP [N1 or N2]

   The AS_PATH constructed for advertisement to an EBGP peer is

      AS_PATH: AS_SEQUENCE "A1" AS_SET (a,x) AS_SET (b,y) AS_SET (c,z)

   Refer to Appendix A for more complex AS_PATH scenarios.

   The advantage of this new AS_PATH structure is that it retains the
   AS_PATH length of the original contributing paths and also retains
   enough AS information for the receiving peer to do loop prevention.
   For instance, if the router that has been used in the above example
   is peered up with another router R2 in AS "c", then R2 will reject
   all UPDATEs that have AS "c" in the AS_PATH.

   The ECMP capable router, when advertising routes to a non-ecmp
   capable IBGP peer, can pick any one of the NEXT_HOPs from the
   available list. This will not create any problems because this
   NEXT_HOP will actually fall within the AS_PATH set-sequence that is
   being advertised. For EBGP peers, the ECMP capable router will, as
   usual, put itself as the NEXT_HOP.


9. Configuring BGP ECMP Support

   An implementation MUST provide a configuration option to set and
   unset this feature irrespective of whether it is capable of
   injecting multiple routes into its Loc-RIB or not. It is recommended
   to advertise BGP ECMP routes to the peers even if the Local Speaker
   cannot insert multiple entries in its forwarding table. This way it
   can help its other IBGP peers make optimal decisions (especially if
   it is a RR), can help with MED oscillations, etc.
   The administrator should ensure that the maximum number of multipath
   routes that the routers install in their FIB remains consistent
   inside an AS.


10. Working with ECMP capable IBGP peers

   This section explains how the ECMP feature works in normal
   scenarios. Assume that two IBGP speakers A and B exchange this
   capability.

   Consider a case where A receives multiple updates for NLRI N' with
   Nexthops N0, .. Ni, .. Nm. Say it runs its decision process and
   finds that the routes with the Nexthops Nj, Nk and Nl are equal and
   that it needs to advertise all three of them to B. Also assume that
   Nj and Nk share the same path attributes (Origin, AS Path, Local
   Pref, etc).

   A makes an UPDATE message and uses the ECMP_NEXT_HOP path attribute.
   It fills in the AFI, the number of next-hops as 2, the length of the
   first next-hop (Nj), the network address of Nj, the length of Nk and
   the network address of Nk.

   When this UPDATE message is received by B, it looks at the
   ECMP_NEXT_HOP path attribute and understands that there are multiple
   routes to reach N'. It inserts two routes for N' with the next-hops
   Nj and Nk.

   A also needs to announce N' with some other path attributes and the
   next-hop Nl. It makes an UPDATE message, puts in the path
   attributes, and adds the ECMP_NEXT_HOP path attribute. It fills in
   the AFI, the number of next-hops as 1, the length of the first
   next-hop Nl and the network address of Nl. This UPDATE message is
   sent to B.

   When B receives this UPDATE message it knows that this is not an
   implicit WITHDRAW for N', as it comes with the ECMP_NEXT_HOP path
   attribute. It simply appends this new route in its BGP database,
   runs the decision process, and proceeds as normal.

   Assume that at some point later, A needs to withdraw the route
   associated with the tuple [N', Nk].
   It makes an UPDATE message, puts N' in the unfeasible routes and
   inserts the path attributes and the ECMP_NEXT_HOP path attribute,
   keeping the next-hop inside as Nk.

   When B receives this UPDATE message it understands that A now wants
   to remove a route associated with N'. It looks at ECMP_NEXT_HOP and
   finds the next-hop Nk. It thus removes only the route associated
   with Nk.


11. Confederations

   Individual BGP ECMP routes with the ECMP_NEXT_HOP path attribute
   MUST be announced to confederation IBGP and EBGP peers, if they are
   ecmp-capable. If not, then the treatment shall be the same as that
   for non-ecmp capable routers (as described in Section 8) [CONFED].


12. Multiprotocol Extensions to BGP

   Since the ECMP_NEXT_HOP includes both the AFI and SAFI, it is
   possible to advertise MPBGP ECMP routes. In this case, the
   MP_REACH_NLRI [MPBGP] path attribute shall carry the NLRI
   information and ECMP_NEXT_HOP the information about the additional
   NEXT_HOPs.


13. Security Considerations

   This extension to BGP does not change the underlying security issues
   inherent in the existing BGP.


14. Acknowledgements

   The authors would like to thank Paul Jakma, Tony Li, Greg Hankins,
   Abarbanel Benjamin and Curtis Villamizar for their valuable comments
   and suggestions.


15. IANA Considerations

   This document uses an attribute type to indicate additional
   next-hops for the BGP paths. This must be assigned by IANA as per
   RFC 2842.


16. Appendix A

16.1 Constructing AS_PATHs

   This section deals with some scenarios that could occur.

   Consider that ecmp capable Router R1 has received multiple paths for
   a destination D1 and it is connected to a non-ecmp capable/EBGP
   router R2. R1 thus cannot use the ECMP_NEXT_HOP attribute to
   announce these routes.

   [Scenario 1]

   Say, R1 has the following BGP paths that it installs in its FIB.
   Path 1:
      AS_PATH: AS_SEQ "a b", AS_SET "p1 p2", AS_SET "p3 p4",
      NEXT_HOP N1

   Path 2:
      AS_PATH: AS_SEQ "x y", AS_SET "q1 q2 q3", AS_SET "q4",
      NEXT_HOP N2

   Note that when R1 runs its decision process in this case, the AS
   Path lengths are the same, because when counting this number an
   AS_SET counts as 1, no matter how many ASes are in the SET.

   The AS_PATH that R1 thus constructs when announcing the UPDATE to R2
   (assuming R2 is an IBGP peer) is:

      AS_PATH: AS_SET "a x", AS_SET "b y", AS_SET "p1 p2 q1 q2 q3",
      AS_SET "p3 p4 q4"

   It will create a new AS_SEQ segment if R2 is an EBGP peer.

   [Scenario 2]

   Say, R1 has the following BGP paths that it installs in its FIB.

   Path 1:
      AS_PATH: AS_SEQ "a b c", AS_SET "p1 p2", NEXT_HOP N1

   Path 2:
      AS_PATH: AS_SEQ "x y", AS_SET "q1 q2 q3", AS_SET "q4",
      NEXT_HOP N2

   The AS_PATH that R1 thus constructs when announcing the UPDATE to R2
   (assuming R2 is an IBGP peer) is:

      AS_PATH: AS_SET "a x", AS_SET "b y", AS_SET "c q1 q2 q3",
      AS_SET "p1 p2 q4"

16.2 Advertising synthetic AS_PATHs

   This section discusses some optimizations/cleanups that can be done
   by a BGP speaker when constructing the synthetic AS_PATH to
   advertise to a non-ecmp capable or an EBGP peer.

   X and Y are two AS_PATHs, each with N segments X(n) and Y(n), which
   will be merged into a new combined synthetic AS_PATH Z of N
   segments:

   -  If X(n) and Y(n) are both of type AS_SEQUENCE and contain the
      same value, the type of Z(n) MUST be set to AS_SEQUENCE, the
      value being the single AS value common to X(n) and Y(n).

   -  Duplicate AS values MUST be removed from Z(n) once all values
      have been added to it, if the implementation has not already
      discarded duplicate values while iterating through X(n) and Y(n)
      when constructing segment Z(n).


17. References

   [BGP4]     Y. Rekhter, T. Li, and S. Hares, "A Border Gateway
              Protocol 4 (BGP-4)", draft-ietf-idr-bgp4-24.txt, May
              2004.

   [RR]       T. Bates, R. Chandra, E. Chen, "BGP Route Reflection - An
              alternative to Full Mesh IBGP",
              draft-ietf-idr-rfc2796bis-01.txt.

   [BGP-CAP]  R. Chandra, J. Scudder, "Capabilities Advertisement with
              BGP-4", RFC 2842, May 2000.

   [MED]      D. McPherson, V. Gill, D. Walton, and A. Retana, "Border
              Gateway Protocol (BGP) Persistent Route Oscillation
              Condition", RFC 3345, August 2002.

   [BGP-IPv6] P. Marques and F. Dupont, "Use of BGP-4 Multiprotocol
              Extensions for IPv6 Inter-Domain Routing", RFC 2545,
              March 1999.

   [KEYWORDS] S. Bradner, "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [AFI]      http://www.iana.org/assignments/address-family-numbers

   [SAFI]     http://www.iana.org/assignments/safi-namespace

   [CONFED]   P. Traina, D. McPherson, and J. Scudder, "Autonomous
              System Confederations for BGP",
              draft-ietf-idr-rfc3065bis-02.txt, May 2004.

   [MPBGP]    T. Bates, R. Chandra, D. Katz, and Y. Rekhter,
              "Multiprotocol Extension for BGP-4",
              draft-ietf-idr-rfc2858bis-06.txt, November 2004.


18. Author's Addresses

   Joel M. Halpern
   Megisto Systems, Inc.
   jhalpern@megisto.com
   +1-301-444-1783

   Manav Bhatia
   Riverstone Networks, Inc.
   manav@riverstonenet.com


19. Intellectual Property Notice

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights. Information on the
   IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP-11.
   Copies of claims of rights made available for publication and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementors or users of this
   specification can be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard. Please address the information to the IETF Executive
   Director.


20. Full Copyright Notice

   Copyright (C) The Internet Society (2004). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.