Network Working Group                                Srihari Ramachandra
Internet Draft                                             Yakov Rekhter
Expiration Date: February 2001                              Rex Fernando
                                                         John G. Scudder
                                                           Cisco Systems

                                                               Enke Chen
                                                        Redback Networks


                   Graceful Restart mechanism for BGP

                  draft-ramachandra-bgp-restart-00.txt


1. Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as ``work in progress.''

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.


2. Abstract

   Usually when a router restarts, or when BGP on a router restarts, all
   the peers of that router detect that the peering has gone down and
   then come up.  This "down/up" transition results in a "routing flap".
   The flap could spread across multiple routing domains.  Such a flap
   has nothing but negative implications for the overall network
   performance.  This document outlines an approach in BGP that allows
   such routing flaps to be suppressed.


draft-ramachandra-BGP-restart-00.txt                            [Page 1]

Internet Draft    draft-ramachandra-BGP-restart-00.txt       August 2000


3. Introduction

   Usually when a router restarts, or when BGP on a router restarts, all
   the BGP peers detect that the session went down and then came up.
   This "down/up" transition results in a "routing flap" and causes BGP
   route re-computation, generation of BGP routing updates, and flapping
   of the forwarding tables.  The flap could spread across multiple
   routing domains.
   Such routing flaps may create transient forwarding blackholes and/or
   transient forwarding loops.  They also consume resources on the
   control plane of the routers affected by the flap.  As such they are
   detrimental to the overall network performance.

   This document outlines an approach in BGP [BGP-4] that makes it
   possible to minimize the negative effects on routing caused by a BGP
   restart.  It introduces:

      - a new BGP capability, Graceful-Restart, which announces the
        session restart timer and, for each address-family, whether the
        FIB is preserved (i.e. whether the speaker is able to keep
        forwarding packets) across BGP control plane restarts;

      - a mechanism to indicate the End-of-RIB after the BGP updates
        for an address-family have been sent; and

      - a mechanism for BGP peers to temporarily retain the Adj-RIBs-In
        across TCP transport resets.


4. Overview

   If BGP restarts on a router, it can request "Graceful-Restart"
   treatment from its peers which support this capability.  The peers
   then retain the Adj-RIB-In previously learned from the restarted
   router.  Over the new BGP session, they send their Adj-RIB-Out to the
   router and indicate when they have finished.

   When all peers have sent their Adj-RIB-Out, the restarted router runs
   its decision process, populates its Loc-RIB, updates its FIB, and
   advertises its Adj-RIB-Out to all the peers, indicating when it has
   finished.

   Routes in the Adj-RIB-Out advertised by the router replace the routes
   its peers had temporarily retained in their Adj-RIB-In.  Any routes
   that are not replaced are deleted as if they had been withdrawn.

   When two or more adjacent BGP speakers restart and carry out the
   above procedure at the same time, they advertise the "Don't wait for
   my End-of-RIB" bit (the Restart-State bit of section 5) in the
   capability, so that no BGP speaker waits indefinitely for routes from
   another.

   At the completion of this procedure, the router and its peers are
   resynchronized, with no or minimal route flap having been propagated
   and most user traffic having been unaffected.
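
   The retain, replace, and delete steps that a peer applies to its
   Adj-RIB-In, as outlined above, can be illustrated with a minimal
   sketch.  It is not part of the specification: the class AdjRibIn and
   its methods mark_stale, update, and sweep are hypothetical names
   chosen for this illustration, and routes are reduced to a prefix
   string mapped to an opaque attribute value.

```python
from dataclasses import dataclass, field


@dataclass
class AdjRibIn:
    """Hypothetical per-peer, per-address-family Adj-RIB-In (illustrative only)."""
    routes: dict = field(default_factory=dict)  # prefix -> path attributes
    stale: set = field(default_factory=set)     # prefixes retained across the restart

    def mark_stale(self):
        # The session to the restarting router was lost: keep every route,
        # but remember that none of them has been re-advertised yet.
        self.stale = set(self.routes)

    def update(self, prefix, attrs):
        # A route received on the new session replaces the retained one.
        self.routes[prefix] = attrs
        self.stale.discard(prefix)

    def sweep(self):
        # End-of-RIB received: routes that were not replaced are deleted
        # as if they had been withdrawn. Returns the deleted prefixes.
        removed = sorted(self.stale)
        for prefix in removed:
            del self.routes[prefix]
        self.stale.clear()
        return removed


rib = AdjRibIn({"10.0.0.0/8": "path-A", "192.0.2.0/24": "path-B"})
rib.mark_stale()                    # peer restarted, routes retained
rib.update("10.0.0.0/8", "path-C")  # re-advertised on the new session
removed = rib.sweep()               # End-of-RIB: drop what was not replaced
```

   In this sketch only the prefix that was not re-advertised is removed
   by sweep; the re-advertised prefix stays in place throughout, which
   is what keeps the flap from propagating.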


5. Graceful-Restart Capability

   A BGP Capability [BGP-CAP] is used to indicate that a router can
   support Graceful-Restart, with or without preservation of the FIB,
   for an address-family.  The capability is encoded as follows:

      +------------------------------+
      | Capability Code (1 octet)    |
      +------------------------------+
      | Capability Length (1 octet)  |
      +------------------------------+
      | First tuple (6 octets)       |
      +------------------------------+
      | Second tuple (6 octets)      |
      +------------------------------+
                  .....
      +------------------------------+
      | Nth tuple (6 octets)         |
      +------------------------------+

   Each tuple is encoded as follows:

      +------------------------------------------+
      | Address-Family-Identifier (2 octets)     |
      +------------------------------------------+
      | Sub-Address-Family-Identifier (1 octet)  |
      +------------------------------------------+
      | Restart flag (1 octet)                   |
      +------------------------------------------+
      | Restart time (2 octets)                  |
      +------------------------------------------+

   The use and meaning of the fields are as follows:

   Address Family Identifier (AFI):

      This field carries the identity of the Network Layer protocol
      associated with the Network Address that follows.  Presently
      defined values for this field are specified in RFC1700 (see the
      Address Family Numbers section).

   Subsequent Address Family Identifier (SAFI):

      This field provides additional information about the type of the
      Network Layer Reachability Information carried in the attribute.

   Restart Flags:

      This field contains bit flags that may be combined.  Two flags
      (starting from the least significant bit) are defined: the
      Restart-State bit and the Forwarding-State bit.

      The "Restart-State" bit can be used to avoid a possible deadlock
      caused by waiting for End-of-RIB when multiple BGP speakers
      restart.  A value of 1 indicates that the BGP speaker has
      restarted, and that its peer should not wait for the End-of-RIB
      from the speaker before advertising its routes to the speaker.

      The "Forwarding-State" bit can be used to indicate whether the
      forwarding state has indeed been preserved during the previous
      BGP restart.  A value of 1 means that it has been preserved.

   Restart Time:

      This is the estimated time it would take for the BGP session to
      be re-established after a restart.  The peer can use this value
      to speed up routing convergence in case the BGP speaker does not
      come back after a restart.

   If a router supports graceful restart for BGP, it MUST send the
   Graceful-Restart capability listing the address-families it exchanges
   with its peer, indicating for each whether the FIB is preserved or
   not.

   The Restart time SHOULD NOT be greater than the BGP HOLDTIME used in
   the OPEN.  It SHOULD be set to the smallest possible value sufficient
   to allow the BGP speaker to restart and attempt reconnection to its
   peer before the old routes of the corresponding address-family
   expire.


6. End-of-RIB indication

   Routers using these procedures need to be able to detect when a peer
   has transferred its complete Adj-RIB-Out.

   Although this document specifies the End-of-RIB indication for use
   when recovering from a BGP restart, we note that it may be useful to
   generate the End-of-RIB indication at the end of any initial RIB
   exchange, even when not recovering from a restart.

   [Two options have been identified for the End-of-RIB indication.  We
   need to select one.]

   Option 1: End-of-RIB is indicated by reception of an empty
   address-family withdraw UPDATE message.  For IPv4 unicast NLRI, an
   UPDATE message with an empty withdrawn routes field acts as
   End-of-RIB.  For other address-families, an UPDATE message containing
   only an empty MP_UNREACH [BGP-MP] attribute for that address-family
   acts as End-of-RIB.
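
   For IPv4 unicast, Option 1 can be sketched as follows.  This is one
   possible reading, not a definitive encoding: it assumes the "empty
   withdrawn message" is a BGP-4 UPDATE whose withdrawn routes length
   and total path attribute length are both zero and which carries no
   NLRI, and it assumes an all-ones marker (no authentication).  The
   function names build_ipv4_end_of_rib and is_ipv4_end_of_rib are
   hypothetical.

```python
import struct

BGP_MARKER = b"\xff" * 16   # all-ones marker (assumes no authentication in use)
BGP_TYPE_UPDATE = 2         # UPDATE message type code in BGP-4
MIN_UPDATE_LEN = 23         # 19-octet header + two 2-octet length fields


def build_ipv4_end_of_rib() -> bytes:
    # Withdrawn routes length = 0, total path attribute length = 0, no NLRI.
    body = struct.pack("!HH", 0, 0)
    header = BGP_MARKER + struct.pack("!HB", 19 + len(body), BGP_TYPE_UPDATE)
    return header + body


def is_ipv4_end_of_rib(message: bytes) -> bool:
    if len(message) < MIN_UPDATE_LEN:
        return False
    length, msg_type = struct.unpack("!HB", message[16:19])
    if msg_type != BGP_TYPE_UPDATE or length != len(message):
        return False
    (withdrawn_len,) = struct.unpack("!H", message[19:21])
    if withdrawn_len != 0:
        return False
    (attrs_len,) = struct.unpack("!H", message[21:23])
    # Nothing withdrawn, no attributes, and nothing after the fixed fields.
    return attrs_len == 0 and length == MIN_UPDATE_LEN


eor = build_ipv4_end_of_rib()
keepalive = BGP_MARKER + struct.pack("!HB", 19, 4)
```

   A receiver using this sketch would call is_ipv4_end_of_rib on each
   complete message read from the session; the empty MP_UNREACH form
   used by other address-families is not covered here.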

   Option 2: End-of-RIB is indicated by reception of a new BGP message
   type defined as follows:

      Type: code TBD - RIB-END

      Message Format:

   This has the following implication: a new capability is required to
   support End-of-RIB alone, without Graceful-Restart.


7. Operation

   In this section, "restarting router" refers to a router whose BGP has
   been restarted and which initiates a new connection using the
   Graceful-Restart capability, and "receiving router" refers to a
   router receiving such a connection.


7.1. Procedures for restarting BGP speaker

   If a BGP speaker wishes to restart without causing route flapping or
   interrupting forwarding, it MUST have established its previous
   peering sessions using the BGP Graceful-Restart capability.  In the
   following subsections "the BGP speaker" refers to the restarting
   speaker.

   If the BGP speaker wishes to use Graceful-Restart with a particular
   neighbor, it MUST NOT send a NOTIFICATION to that neighbor.  The TCP
   session may or may not be terminated during the BGP speaker's
   restart.


7.1.1. Establishing BGP connections after restart

   The BGP speaker MUST set Restart-State in the Restart flag for each
   address-family listed in the Graceful-Restart capability.  This
   indicates to its peers that the BGP speaker has restarted.

   The BGP speaker MAY set Forwarding-State in the Restart flag for an
   address-family if one of these conditions is satisfied:

      i.  The BGP speaker has reliably determined that its forwarding
          plane has retained the FIB information for that address-family
          related to the prior connection, or

      ii. The BGP speaker has been explicitly configured to set it even
          though it may not have preserved its forwarding state.

   The BGP speaker MUST NOT set Forwarding-State in the Restart flag for
   an address-family if:

      i.  The BGP speaker has reliably determined that its forwarding
          plane has NOT retained the FIB information for that
          address-family.


7.1.2. RIB Acquisition

   The BGP speaker, upon restart, processes the messages from its peers
   as normal.  The BGP speaker should reacquire the Adj-RIBs-In for each
   address-family it exchanged from all peers and run the decision
   process before advertising routes to any prior peers.

   When the BGP speaker has received the End-of-RIB indication for an
   address-family from all ESTABLISHED peers, it generates the
   GR_converged event for that address-family.  The BGP speaker MAY
   delay generating the GR_converged event to allow it to complete
   acquiring routes from new peers as well.

   The implementation may maintain an upper bound on the time the BGP
   speaker listens for End-of-RIB from all peers.  The expiration of
   such a timer before all peers have sent End-of-RIB generates the
   GR_converge_timeout event.


7.1.3. IGP Convergence

   In situations where both the IGP and BGP have restarted, it would be
   advantageous to wait for the IGP to converge before the BGP speaker
   runs the decision process during the RIB acquisition explained in
   7.1.2.


7.1.4. GR_converged or GR_converge_timeout events

   When either of the above named events has occurred for an
   address-family, the BGP speaker has completed the RIB acquisition
   phase.  It runs the decision process, populates its Loc-RIB and FIB
   (removing old Loc-RIB or FIB routes of that address-family which were
   temporarily retained), and updates its peers with its Adj-RIBs-Out as
   specified by the BGP protocol.  It also deletes the
   GR_converge_timeout timer.

   When the BGP speaker has finished sending its Adj-RIB-Out to a peer,
   it MUST send the End-of-RIB indication for that address-family.


7.2. Procedures for BGP peers receiving Graceful-Restart request

   The following subsections discuss procedures for peers which have not
   restarted, i.e. the peers of a router which has restarted according
   to the procedures in section 7.1.


7.2.1. Termination of TCP session

   When BGP restarts, the TCP session between the BGP speaker and each
   of its peers may or may not be terminated, depending on the
   underlying TCP implementation used, whether or not [BGP-AUTH] is in
   use, and the specific circumstances of the restart.

   If the TCP session terminates due to a NOTIFICATION, i.e. either the
   reception of a NOTIFICATION or the detection of a protocol error
   leading to the transmission of one, then normal BGP procedures MUST
   be followed.

   When the close of the TCP session to a neighbor which had advertised
   the Graceful-Restart capability is detected, the following procedure
   MUST be followed.

   For each address-family which had Forwarding-State set in the Restart
   flag, the Adj-RIB-In of that address-family is marked for deletion
   after the GR_Session_Restart_time timer expires.

   For each address-family which had Forwarding-State not set, the
   Adj-RIB-In of that address-family is immediately deleted.  The latter
   case must be treated as if a non-Graceful-Restart capable peer had
   closed its session.


7.2.2. Reception of new BGP connection

   When a router receives a new BGP connection it acts as follows:

   If the router has an ESTABLISHED BGP connection with the restarting
   peer, and if that connection has announced support for BGP
   Graceful-Restart using the BGP Graceful-Restart Capability; or if the
   router has no BGP connection but is holding an Adj-RIB-In which is
   associated with the initiating peer according to the procedure in
   7.2.1, then the new connection is tentatively accepted.

      a. If the restarted peer does not advertise the BGP
         Graceful-Restart Capability, the GR_cant_recover event is
         generated for the Adj-RIBs-In for all address-families
         exchanged with the peer.

      b. If the restarted peer does not set Forwarding-State in the
         Restart flag for an address-family, the GR_cant_recover event
         is generated for the associated Adj-RIB-In.

      c. If the restarted peer does set Forwarding-State in the Restart
         flag for an address-family, the GR_Session_Restart_time timer,
         if any, is deleted.

   When and if the connection reaches ESTABLISHED state, the procedure
   given in 7.2.3 is followed.  If the connection is terminated without
   reaching ESTABLISHED state, the GR_error event is generated for the
   Adj-RIBs-In for all address-families exchanged.


7.2.3. Recovery phase

   When a connection has been accepted from a peer which has advertised
   the Graceful-Restart capability with FIB preservation for some
   address-families, that peer is said to be recovering for those
   address-families.  When the connection enters ESTABLISHED state, a
   timer is started for each such address-family.  The purpose of the
   timer is to bound the length of time the peer's routes for that
   address-family are retained.  The duration of the timer is
   implementation-specific.

   Once the restarted router converges, it will send UPDATES
   representing its Adj-RIBs-Out for the various address-families.
   Routes from the UPDATES replace routes from the Adj-RIB-In which was
   retained according to 7.2.1.  The replacement routes are not marked
   for deletion.

   When the peer indicates the End-of-RIB of an address-family based on
   Option 1 or 2 of section 6, the GR_recovery_completed event is
   generated for the associated Adj-RIB-In of that address-family.

   If End-of-RIB is not heard for an address-family before the timer
   expires, the GR_recovery_expired event is generated for the
   associated Adj-RIB-In of that address-family.


7.2.4. GR_recovery_expired, GR_recovery_completed, GR_error,
       GR_cant_recover events

   When any of these events occurs, the timers, if any, are stopped.
   Any routes in the Adj-RIB-In for the address-family associated with
   the event which are still marked for deletion are deleted.  This is
   to be treated as if they had been withdrawn by the peer.  Normal BGP
   procedures apply and the router is no longer in recovery phase for
   that address-family.


8. Deployment Considerations

   We note that when a BGP Graceful-Restart capable router restarts,
   there is a potential for transient routing loops or blackholes in the
   network if routing information changes before the router has
   completed its recovery phase.  If no routing information changes,
   there is no issue as long as all BGP speakers are BGP
   Graceful-Restart capable.  If not all BGP speakers are BGP
   Graceful-Restart capable, there is an increased exposure to transient
   routing loops or blackholes, even if the network is in steady state.

   We note that if the potential for transients is a concern, the
   GR_recovery_expired timer provides a bound on the duration of
   transients caused by RIB desynchronization.  The lower the timer is
   set, the less time desynchronization will endure; however, the lower
   the timer is set, the likelier it is that a spurious routing flap
   will occur.  Note that this routing flap can also cause transients
   which would not have occurred had the routing flap been avoided by
   setting the timer to a higher value.  We anticipate that in general,
   it is preferable to avoid flap, even though it increases the
   potential length of RIB desynchronization.  That is, we anticipate
   that the potential transients caused by RIB desynchronization will
   have less impact than those caused by routing flap.

   Finally, we note that there is little benefit to deploying BGP
   Graceful-Restart in an AS whose IGPs have no similar Graceful-Restart
   capability.


9. Security Considerations

   Since with this proposal a new connection can cause an old one to be
   terminated, it might seem to open the door to denial of service
   attacks.  However, we note that unauthenticated BGP is already known
   to be vulnerable to denial of service through attacks on the TCP
   transport.  The TCP transport is commonly protected through use of
   [BGP-AUTH]; such authentication will equally protect against denial
   of service through spurious new connections.  Thus we conclude that
   this proposal does not change the underlying security model (and
   issues) of BGP-4.


10. Acknowledgments

   To be supplied.


11. References

   [BGP-4]    Rekhter, Y., and T. Li, 'A Border Gateway Protocol 4
              (BGP-4)', RFC 1771, March 1995.

   [BGP-MP]   Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
              'Multiprotocol Extensions for BGP-4', RFC 2283, March
              1998.

   [BGP-CAP]  Chandra, R., and J. Scudder, 'Capabilities Advertisement
              with BGP-4', RFC 2842, May 2000.

   [BGP-AUTH] Heffernan, A., 'Protection of BGP Sessions via the TCP MD5
              Signature Option', RFC 2385, August 1998.


12. Author Information

   Srihari Ramachandra
   Cisco Systems, Inc.
   170 West Tasman Drive
   San Jose, CA 95134
   e-mail: rsrihari@cisco.com

   Yakov Rekhter
   Cisco Systems, Inc.
   170 Tasman Drive
   San Jose, CA, 95134
   e-mail: yakov@cisco.com

   John G. Scudder
   Cisco Systems, Inc.
   170 West Tasman Drive
   San Jose, CA 95134
   e-mail: jgs@cisco.com

   Rex Fernando
   Cisco Systems, Inc.
   170 West Tasman Drive
   San Jose, CA 95134
   e-mail: rex@cisco.com

   Enke Chen
   Redback Networks, Inc.
   350 Holger Way
   San Jose, CA 95134
   e-mail: enke@redback.com