Network Working Group                             Srihari Ramachandra
Internet Draft                                    Yakov Rekhter
Expiration Date: February 2001                    Rex Fernando
                                                  John G. Scudder
                                                  Cisco Systems
                                                  Enke Chen
                                                  Redback Networks


                   Graceful Restart mechanism for BGP

                  draft-ramachandra-bgp-restart-00.txt


1. Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as ``work in progress.''

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.


2. Abstract

   Usually when a router restarts or when BGP on a router restarts, all
   the peers of that router detect that the peering has gone down and
   then come back up. This "down/up" transition results in a "routing
   flap". The flap could spread across multiple routing domains. Such a
   flap has nothing but negative implications for the overall network
   performance. This document outlines an approach in BGP that allows
   such routing flaps to be suppressed.






draft-ramachandra-BGP-restart-00.txt                            [Page 1]





Internet Draft    draft-ramachandra-BGP-restart-00.txt       August 2000


3. Introduction

   Usually when a router restarts, or when BGP on a router restarts, all
   the BGP peers detect that the session went down and then came up.
   This "down/up" transition results in a "routing flap" and causes BGP
   route re-computation, generation of BGP routing updates, and flapping
   of the forwarding tables. The flap could spread across multiple
   routing domains. Such routing flaps may create transient forwarding
   blackholes and/or transient forwarding loops. They also consume
   resources on the control plane of the routers affected by the flap.
   As such they are detrimental to the overall network performance.

   This document outlines an approach in BGP [BGP-4] that minimizes the
   negative effects on routing caused by BGP restart. It introduces a
   new BGP capability, Graceful-Restart, which announces the session
   restart timer and, for each address-family, the preservation of the
   FIB (i.e. the ability to forward packets) across BGP control plane
   restarts. It also introduces a mechanism to indicate the End-of-RIB
   after sending the BGP updates of an address-family, and a mechanism
   for BGP peers to temporarily retain the Adj-RIBs-In across TCP
   transport resets.


4. Overview

   If BGP restarts on a router, it can request "Graceful-Restart"
   treatment from its peers which support this capability. The peers
   then retain the Adj-RIB-In previously learned from the restarted
   router. Over the new BGP session, they send Adj-RIB-Out to the router
   and indicate when they have finished.

   When all peers have sent their Adj-RIB-Out, the restarted router runs
   its decision process, populates its Loc-RIB, updates its FIB, and
   advertises its Adj-RIB-Out to all the peers, indicating when it
   finishes. Routes in the Adj-RIB-Out advertised by the router replace
   the routes its peers had temporarily retained in their Adj-RIB-In.
   Any routes that are not replaced are deleted as if they had been
   withdrawn.

   When two or more adjacent BGP speakers restart and follow the above
   procedure at the same time, each advertises the "Don't wait for my
   End-of-RIB" bit (the Restart-State bit of section 5) in the
   capability, so that no BGP speaker waits indefinitely for routes
   from another.

   At the completion of this procedure, the router and its peers are
   resynchronized, with no or minimal route flap having been propagated
   and most user traffic having been unaffected.


5. Graceful-Restart Capability

   A BGP Capability [BGP-CAP] is used to indicate that a router can
   support Graceful-Restart, with or without preservation of FIB, for an
   address-family.

   The capability is encoded as follows:

                   +------------------------------+
                   | Capability Code (1 octet)    |
                   +------------------------------+
                   | Capability Length (1 octet)  |
                   +------------------------------+
                   | First tuple (6 octets)       |
                   +------------------------------+
                   | Second tuple (6 octets)      |
                   +------------------------------+
                   .....
                   +------------------------------+
                   | Nth tuple (6 octets)         |
                   +------------------------------+

           Each tuple is encoded as follows:

                   +------------------------------------------+
                   | Address-Family-Identifier (2 octets)     |
                   +------------------------------------------+
                   | Sub-Address-Family-Identifier (1 octet)  |
                   +------------------------------------------+
                   | Restart flag (1 octet)                   |
                   +------------------------------------------+
                   | Restart time (2 octets)                  |
                   +------------------------------------------+

   The use and meaning of the fields are as follows:

     Address Family Identifier (AFI):

       This field carries the identity of the Network Layer protocol
       to which the tuple applies. Presently defined values for this
       field are specified in RFC1700 (see the Address Family Numbers
       section).


     Subsequent Address Family Identifier (SAFI):

       This field provides additional information about the type of
       the Network Layer Reachability Information carried in the
       attribute.


     Restart Flags:

       This field contains accumulative bit flags. Two flags (starting
       from the least significant bit) are defined: Restart-State bit,
       and Forwarding-State bit.

       The "Restart-State" bit can be used to avoid possible deadlock
       caused by waiting for End-of-RIB when multiple BGP speakers
       restart. Value 1 indicates that the BGP speaker has restarted,
       and its peer should not wait for the End-of-RIB from the speaker
       before advertising its routes to the speaker.

       The "Forwarding-State" bit can be used to indicate if the
       forwarding state has indeed been preserved during the previous
       BGP restart. Value of 1 means "Yes".


     Restart Time:

       This is the estimated time it would take for the BGP session to
       be re-established after a restart. Its peer can use this value
       to speed up routing convergence in case the BGP speaker does
       not come back after a restart.
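
   As a non-normative illustration, the encoding above can be sketched
   in Python as follows. Several details are assumptions rather than
   part of this document: the capability code is not assigned here, so
   the caller supplies it; the flag bit positions (Restart-State = 0x01,
   Forwarding-State = 0x02) are one reading of "starting from the least
   significant bit"; and the names GRTuple, encode_capability and
   decode_capability are hypothetical.

```python
import struct
from dataclasses import dataclass
from typing import List, Tuple

# Assumed bit positions: the text lists Restart-State first,
# "starting from the least significant bit".
RESTART_STATE = 0x01
FORWARDING_STATE = 0x02

# AFI (2 octets), SAFI (1), Restart flag (1), Restart time (2),
# in network byte order with no padding: 6 octets per tuple.
TUPLE = struct.Struct("!HBBH")


@dataclass(frozen=True)
class GRTuple:
    afi: int
    safi: int
    flags: int
    restart_time: int  # units are not stated in this document


def encode_capability(code: int, tuples: List[GRTuple]) -> bytes:
    """Return Capability Code, Capability Length, then the tuples."""
    body = b"".join(
        TUPLE.pack(t.afi, t.safi, t.flags, t.restart_time) for t in tuples
    )
    if len(body) > 255:
        raise ValueError("tuples do not fit the 1-octet length field")
    return struct.pack("!BB", code, len(body)) + body


def decode_capability(data: bytes) -> Tuple[int, List[GRTuple]]:
    """Inverse of encode_capability; rejects truncated or ragged input."""
    if len(data) < 2:
        raise ValueError("capability shorter than its 2-octet header")
    code, length = struct.unpack_from("!BB", data)
    body = data[2:2 + length]
    if len(body) != length or length % TUPLE.size:
        raise ValueError("malformed Graceful-Restart capability")
    tuples = [
        GRTuple(*TUPLE.unpack_from(body, offset))
        for offset in range(0, length, TUPLE.size)
    ]
    return code, tuples
```

   With a placeholder code of 200, a single IPv4 unicast tuple (AFI 1,
   SAFI 1) with both flags set and a Restart time of 120 encodes to the
   eight octets C8 06 00 01 01 03 00 78.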

   If a router supports graceful restart for BGP, it MUST send the
   Graceful-Restart capability listing each <AFI, SAFI> it exchanges
   with its peer and indicating whether the FIB is preserved or not.

   The Session Restart time SHOULD NOT be greater than the BGP HOLDTIME
   used in the OPEN. This SHOULD be set to the smallest possible value
   sufficient to allow the BGP speaker to restart and attempt
   reconnection to its peer before the old routes of the corresponding
   address-family expire.


6. End-of-RIB indication

   Routers using these procedures need to be able to detect when a peer
   has transferred its complete Adj-RIB-Out.

   Although this document specifies the use of the End-of-RIB indication
   for use when recovering from a BGP restart, we note that it may be
   useful to generate the End-of-RIB indication at the end of any
   preliminary RIB exchange, even when not recovering from a restart.

   [Two options have been identified for the end-of-RIB indication.  We
   need to select one.]

   Option 1:

     End-of-RIB is indicated by reception of an empty address-family
     withdraw UPDATE message. For IPv4 unicast NLRI, an UPDATE message
     with empty withdrawn routes acts as End-of-RIB. For other
     address-families, a message carrying only an empty MP_UNREACH
     [BGP-MP] for that <AFI, SAFI> acts as End-of-RIB.

   Option 2:

     End-of-RIB is indicated by reception of a new BGP message type
     defined as follows:

             Type: code TBD - RIB-END

             Message Format: <AFI, SAFI>

     This has the following implication: a new capability is required
     in order to support End-of-RIB alone, without Graceful-Restart.
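
   As a non-normative illustration of Option 1, the following Python
   sketch builds the two kinds of End-of-RIB message. It assumes that
   "empty" means an UPDATE with a zero-length withdrawn routes field, no
   NLRI, and either no path attributes (IPv4 unicast) or a single
   MP_UNREACH attribute carrying only the AFI and SAFI (other
   address-families). The header layout follows [BGP-4] and the
   attribute type code follows [BGP-MP]; the function names are
   hypothetical.

```python
import struct

MARKER = b"\xff" * 16    # all-ones marker from the BGP-4 message header
UPDATE = 2               # BGP-4 message type code for UPDATE
MP_UNREACH = 15          # attribute type code defined in [BGP-MP]
OPTIONAL = 0x80          # attribute flags: optional, non-transitive
HEADER_LEN = 19          # marker (16) + length (2) + type (1)


def _update(withdrawn: bytes, attrs: bytes) -> bytes:
    """Assemble an UPDATE with the given fields and no NLRI."""
    body = (
        struct.pack("!H", len(withdrawn)) + withdrawn
        + struct.pack("!H", len(attrs)) + attrs
    )
    return MARKER + struct.pack("!HB", HEADER_LEN + len(body), UPDATE) + body


def end_of_rib(afi: int, safi: int) -> bytes:
    """Option 1 End-of-RIB for one <AFI, SAFI> (assumed reading)."""
    if (afi, safi) == (1, 1):
        # IPv4 unicast: nothing withdrawn, no attributes, no NLRI.
        return _update(b"", b"")
    # Other families: MP_UNREACH holding only AFI and SAFI (3 octets).
    attr = struct.pack("!BBBHB", OPTIONAL, MP_UNREACH, 3, afi, safi)
    return _update(b"", attr)
```

   Under these assumptions the IPv4 unicast indication is the minimum
   23-octet UPDATE, and the indication for any other address-family is
   29 octets long.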


7. Operation

   In this section, "restarting router" refers to a router whose BGP has
   been restarted and which initiates a new connection using the
   Graceful-Restart capability, and "receiving router" refers to a
   router receiving such a connection.


7.1. Procedures for restarting BGP speaker

   If a BGP speaker wishes to restart without causing route flapping or
   interrupting forwarding, it MUST have established its previous
   peering sessions using the BGP Graceful-Restart capability. In the
   following subsections "the BGP speaker" refers to the restarting
   speaker.

   If the BGP speaker wishes to use Graceful-Restart with a particular
   neighbor, it MUST NOT send a NOTIFICATION to that neighbor. The TCP
   session may or may not be terminated during the BGP speaker's
   restart.


7.1.1. Establishing BGP connections after restart

   The BGP speaker MUST set Restart-State in the Restart flag for each
   address-family listed in the Graceful-Restart capability. This will
   indicate to the other peers that the BGP speaker has restarted.

   The BGP speaker MAY set Forwarding-State in the Restart flag for an
   address-family if one of these conditions is satisfied:

     i.  The BGP speaker has reliably determined that its forwarding
         plane has retained the FIB information for that
         address-family related to the prior connection, or

     ii. The BGP speaker has been explicitly configured to set it even
         though it may not have preserved its forwarding state.

   The BGP speaker MUST NOT set Forwarding-State in the Restart flag for
   an address-family if the BGP speaker has reliably determined that its
   forwarding plane has NOT retained the FIB information for that
   address-family.
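
   As a non-normative illustration, the rules of this section can be
   sketched in Python as follows. The bit positions are the same
   assumption as in section 5, the function name is hypothetical, and
   the sketch assumes that the MUST NOT rule takes precedence over
   explicit configuration when the two conflict, which this document
   does not state.

```python
from typing import Optional

RESTART_STATE = 0x01     # assumed bit position (least significant bit)
FORWARDING_STATE = 0x02  # assumed bit position (next bit)


def restart_flags(fib_retained: Optional[bool],
                  forced_by_config: bool = False) -> int:
    """Restart flag octet for one address-family after a restart.

    fib_retained is True or False when the speaker has reliably
    determined the state of its forwarding plane, and None when it
    has not been able to determine it.
    """
    flags = RESTART_STATE            # MUST be set after a restart
    if fib_retained is False:
        return flags                 # MUST NOT set Forwarding-State
    if fib_retained or forced_by_config:
        flags |= FORWARDING_STATE    # condition i or ii: MAY set it
    return flags
```

   For example, restart_flags(None) leaves Forwarding-State clear
   because neither condition holds, while restart_flags(None,
   forced_by_config=True) sets it under condition ii.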


7.1.2. RIB Acquisition

   The BGP speaker upon restart processes the messages from its peers as
   normal.  The BGP speaker should reacquire the Adj-RIBs-In for each
   address-family it exchanged from all peers and run the decision
   process before advertising routes to any prior peers.

   When the BGP speaker has received the End-of-RIB indication for an
   address-family from all ESTABLISHED peers, it generates the
   GR_converged event for that address-family.

   The BGP speaker MAY delay generating the GR_converged event to allow
   it to complete acquiring routes from new peers as well.

   The implementation may maintain an upper bound on the time the BGP
   speaker listens for End-of-RIB from all peers. The expiration of such
   a timer before all peers send End-of-RIB generates the
   GR_converge_timeout event.
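
   As a non-normative illustration, the bookkeeping described in this
   section can be sketched in Python as follows. The class and method
   names are hypothetical, peers are identified by arbitrary strings,
   and a single upper bound shared by all address-families is an
   assumption; this document leaves the timer to the implementation.

```python
import time
from typing import Dict, List, Optional, Set, Tuple

Family = Tuple[int, int]  # (AFI, SAFI)


class RibAcquisition:
    """Tracks which peers still owe an End-of-RIB per address-family."""

    def __init__(self, peers_by_family: Dict[Family, Set[str]],
                 timeout: float, now: Optional[float] = None) -> None:
        start = time.monotonic() if now is None else now
        self.deadline = start + timeout
        self.pending = {f: set(p) for f, p in peers_by_family.items()}

    def end_of_rib(self, peer: str, family: Family) -> Optional[str]:
        """Record an End-of-RIB; return the event it triggers, if any."""
        waiting = self.pending.get(family)
        if waiting is None:
            return None                # family already finished
        waiting.discard(peer)
        if waiting:
            return None                # other peers still outstanding
        del self.pending[family]
        return "GR_converged"

    def expire(self, now: Optional[float] = None) -> List[Family]:
        """Return the families for which GR_converge_timeout fires."""
        current = time.monotonic() if now is None else now
        if current < self.deadline:
            return []
        timed_out = sorted(self.pending)
        self.pending.clear()
        return timed_out
```

   A caller would invoke end_of_rib() as indications arrive and expire()
   from its timer loop, then run the procedure of 7.1.4 for every
   address-family reported by either call.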


7.1.3. IGP Convergence

   In situations where both IGP and BGP have restarted, it would be
   advantageous to wait for IGP to converge before the BGP speaker runs
   the decision process during the RIB acquisition as explained in
   7.1.2.


7.1.4. (GR_converged OR GR_converge_timeout events)

   When either of the above named events has occurred for an
   address-family, the BGP speaker has completed the RIB acquisition
   phase. It runs the decision process, populates its Loc-RIB and FIB
   (removing old Loc-RIB or FIB routes of that address-family which
   were temporarily retained), and updates its peers with its
   Adj-RIBs-Out as specified by the BGP protocol. It also deletes the
   GR_converge_timeout timer.

   When the BGP speaker has finished sending its Adj-RIB-Out to a peer,
   it MUST send the End-of-RIB indication for that address-family.


7.2. Procedures for BGP peers receiving Graceful-Restart request

   The following subsections discuss procedures for peers which have not
   restarted, i.e. the peers of a router which has restarted according
   to the procedures in section 7.1.


7.2.1. Termination of TCP session

   When BGP restarts, the TCP session between the BGP speaker and each
   of its peers may or may not be terminated, depending on the
   underlying TCP implementation used, whether or not [BGP-AUTH] is in
   use, and the specific circumstances of the restart.

   If the TCP session terminates due to a NOTIFICATION, i.e. either the
   reception of one or the detection of a protocol error leading to the
   transmission of a NOTIFICATION, then normal BGP procedures MUST be
   followed.

   When the close of the TCP session to a neighbor which had advertised
   the Graceful-Restart capability is detected, the following procedure
   MUST be followed.

   For each address-family which had Forwarding-State set in the Restart
   flag, the Adj-RIB-In of that address-family is marked for deletion
   after the GR_Session_Restart time expires. For each address-family
   which did not have Forwarding-State set, the Adj-RIB-In of that
   address-family is immediately deleted. The latter case must be
   treated as if a non-Graceful-Restart capable peer had closed its
   session.
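
   As a non-normative illustration, the per-address-family decision
   above can be sketched in Python as follows. The function name and
   data shapes are hypothetical: the Adj-RIB-In is modelled as a mapping
   from address-family to a set of prefixes, and peer_flags holds the
   Restart flag octets the neighbor advertised on the session that just
   closed. Starting the GR_Session_Restart timer is left to the caller.

```python
from typing import Dict, Set, Tuple

Family = Tuple[int, int]  # (AFI, SAFI)
FORWARDING_STATE = 0x02   # assumed bit position, as in section 5


def on_tcp_close(adj_rib_in: Dict[Family, Set[str]],
                 peer_flags: Dict[Family, int]):
    """Split a neighbor's Adj-RIB-In after its TCP session closes.

    Returns (marked, deleted): the families retained and marked for
    deletion, and the routes removed at once, which the caller then
    treats as ordinary withdrawals.
    """
    marked = set()
    deleted = {}
    for family in list(adj_rib_in):
        if peer_flags.get(family, 0) & FORWARDING_STATE:
            marked.add(family)                        # keep until timer expiry
        else:
            deleted[family] = adj_rib_in.pop(family)  # delete immediately
    return marked, deleted
```

   An address-family that the neighbor did not list at all is treated
   here as if Forwarding-State were clear, which is an assumption of the
   sketch.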


7.2.2. Reception of new BGP connection

   When a router receives a new BGP connection it acts as follows:

   If the router has an ESTABLISHED BGP connection with the restarting
   peer, and if that connection has announced support for
   Graceful-Restart using the Graceful-Restart capability; or if the
   router has no BGP connection but is holding an Adj-RIB-In which is
   associated with the initiating peer according to the procedure in
   7.2.1, then the new connection is tentatively accepted.

     a. If the restarted peer does not advertise the BGP
        Graceful-Restart Capability, the GR_cant_recover event is
        generated for Adj-RIBs-In for all address-families exchanged
        with the peer.

     b. If the restarted peer does not set Forwarding-State in the
        Restart flag for an address-family, the GR_cant_recover event
        is generated for the associated Adj-RIB-In.

     c. If the restarted peer does set Forwarding-State in the Restart
        flag for an address-family, the GR_Session_Restart_time timer,
        if any, is deleted.

   When and if the connection reaches ESTABLISHED state, the procedure
   given in 7.2.3 is followed.  If the connection is terminated without
   reaching ESTABLISHED state, the GR_Error event is generated for the
   Adj-RIBs-In for all address-families exchanged.
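
   As a non-normative illustration, cases a, b and c above can be
   sketched in Python as follows. The function name, the action strings
   and the data shapes are hypothetical. peer_tuples is None when the
   new OPEN carries no Graceful-Restart capability, and otherwise maps
   each advertised address-family to its Restart flag octet. An
   address-family missing from the capability is treated like case b,
   which is an assumption of the sketch.

```python
from typing import Dict, Iterable, Optional, Tuple

Family = Tuple[int, int]  # (AFI, SAFI)
FORWARDING_STATE = 0x02   # assumed bit position, as in section 5

CANT_RECOVER = "GR_cant_recover"
CANCEL_TIMER = "delete GR_Session_Restart_time timer"


def on_new_connection(retained: Iterable[Family],
                      peer_tuples: Optional[Dict[Family, int]]
                      ) -> Dict[Family, str]:
    """Decide, per retained Adj-RIB-In, what the new OPEN triggers."""
    outcome = {}
    for family in retained:
        if peer_tuples is None:
            outcome[family] = CANT_RECOVER      # case a
        elif not (peer_tuples.get(family, 0) & FORWARDING_STATE):
            outcome[family] = CANT_RECOVER      # case b
        else:
            outcome[family] = CANCEL_TIMER      # case c
    return outcome
```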


7.2.3. Recovery phase

   When a connection has been accepted from a peer which has advertised
   Graceful-Restart with FIB preservation for some address-families,
   that peer is said to be recovering for those address-families.

   When the connection enters ESTABLISHED state, a timer is started for
   each such address-family. The purpose of the timer is to bound the
   length of time the peer's routes for that address-family are
   retained. The duration of the timer is implementation-specific.

   Once the restarted router converges, it will send UPDATEs
   representing its Adj-RIBs-Out for various address-families. Routes
   from the UPDATEs replace routes from the Adj-RIB-In which was
   retained according to 7.2.1. The replacement routes are not marked
   for deletion.

   When the peer indicates the End-of-RIB of an address-family based on
   Option 1 or 2 of section 6, the GR_recovery_completed event is
   generated for the associated Adj-RIB-In of that address-family. If
   End-of-RIB is not heard for an address-family before the timer
   expires, the GR_recovery_expired event is generated for the
   associated Adj-RIB-In of that address-family.


7.2.4. GR_recovery_expired, GR_recovery_completed, GR_Error,
   GR_cant_recover events

   When any of these events occurs, the timers, if any, are stopped.

   Any routes in the Adj-RIB-In for the address-family associated with
   the event which are still marked for deletion are deleted. This is to
   be treated as if they had been withdrawn by the peer.

   Normal BGP procedures apply and the router is no longer in recovery
   phase for that address-family.
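
   As a non-normative illustration, the handling of a retained
   Adj-RIB-In for one address-family across 7.2.3 and 7.2.4 can be
   sketched in Python as follows. The class and method names are
   hypothetical, and routes are modelled as a mapping from prefix to an
   opaque attribute value.

```python
from typing import Dict, List


class RecoveringAdjRibIn:
    """Retained Adj-RIB-In of one address-family during recovery."""

    def __init__(self, retained: Dict[str, str]) -> None:
        self.routes = dict(retained)
        self.marked = set(retained)    # every retained route starts marked

    def update(self, prefix: str, attrs: str) -> None:
        """A route from a new UPDATE replaces the retained one."""
        self.routes[prefix] = attrs
        self.marked.discard(prefix)    # replacement routes are not marked

    def withdraw(self, prefix: str) -> None:
        """An explicit withdrawal removes the route as usual."""
        self.routes.pop(prefix, None)
        self.marked.discard(prefix)

    def finish(self) -> List[str]:
        """Run on any of the four events; returns the prefixes deleted."""
        removed = sorted(self.marked)
        for prefix in removed:
            del self.routes[prefix]
        self.marked.clear()
        return removed
```

   The prefixes returned by finish() are the ones the caller would then
   treat as withdrawn by the peer.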


8. Deployment Considerations

   We note that when a BGP Graceful-Restart capable router restarts,
   there is a potential for transient routing loops or blackholes in the
   network if routing information changes before the router has
   completed its recovery phase. If no routing information changes,
   there is no issue as long as all BGP speakers are BGP Graceful-
   Restart capable.

   If not all BGP speakers are BGP Graceful-Restart capable, there is an
   increased exposure to transient routing loops or blackholes, even if
   the network is in steady state.

   We note that if the potential for transients is a concern, the
   GR_recovery_expired timer provides a bound on the duration of
   transients caused by RIB de-synchronization.  The lower the timer is
   set, the less time desynchronization will endure; however, the lower
   the timer is set, the likelier it is that spurious routing flap will
   occur.  Note that this routing flap can also cause transients which
   would not have occurred had the routing flap been avoided by setting
   the timer to a higher value.  We anticipate that in general, it is
   preferable to avoid flap, even though it increases the potential
   length of RIB desynchronization.  That is, we anticipate that the
   potential transients caused by RIB desynchronization will have less
   impact than those caused by routing flap.

   Finally, we note that there is little benefit to deploying BGP
   Graceful-Restart in an AS whose IGPs have no similar Graceful-Restart
   capability.


9. Security Considerations

   Since with this proposal a new connection can cause an old one to be
   terminated, it might seem to open the door to denial of service
   attacks.  However, we note that unauthenticated BGP is already known
   to be vulnerable to denials of service through attacks on the TCP
   transport. The TCP transport is commonly protected through use of
   [BGP-AUTH]; such authentication will equally protect against denials
   of service through spurious new connections.

   Thus we conclude that this proposal does not change the underlying
   security model (and issues) of BGP-4.


10. Acknowledgments

   To be supplied.


11. References

   [BGP-4]   Rekhter, Y., and T. Li, 'A Border Gateway Protocol 4 (BGP-
   4)', RFC 1771, March 1995.

   [BGP-MP] Bates, T., Chandra, R., Katz, D., and Rekhter, Y.,
   'Multiprotocol Extensions for BGP-4', RFC 2283, March 1998.

   [BGP-CAP] Chandra, R., Scudder, J., 'Capabilities Advertisement with
   BGP-4', RFC 2842, May 2000.

   [BGP-AUTH] Heffernan A. 'Protection of BGP Sessions via the TCP MD5
   Signature Option.', RFC 2385, August 1998.


12. Author Information


   Srihari Ramachandra
   Cisco Systems, Inc.
   170 West Tasman Drive
   San Jose, CA 95134
   e-mail: rsrihari@cisco.com

   Yakov Rekhter
   Cisco Systems, Inc.
   170 West Tasman Drive
   San Jose, CA 95134
   e-mail: yakov@cisco.com

   John G. Scudder
   Cisco Systems, Inc.
   170 West Tasman Drive
   San Jose, CA 95134
   e-mail: jgs@cisco.com

   Rex Fernando
   Cisco Systems, Inc.
   170 West Tasman Drive
   San Jose, CA 95134
   e-mail: rex@cisco.com

   Enke Chen
   Redback Networks, Inc.
   350 Holger Way
   San Jose, CA 95134
   e-mail: enke@redback.com