Network Working Group                                          R. Raszuk
Internet-Draft                                                  K. Patel
Expires: January 3, 2006                                     R. Fernando
                                                           Cisco Systems
                                                            July 2, 2005


                     External failure propagation
                draft-raszuk-ext-failure-propagation-00

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 3, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   The current BGP specification calls for sending prefix-based routing
   information to all other peers when a BGP peer fails, so that they
   can converge using the new information.  Certain network events could
   instead be communicated to BGP speakers in an aggregated fashion.
   This not only minimizes control plane traffic but, more importantly,
   reduces the time the network takes to react to these events and
   consequently reduces the time to converge.  This draft suggests
   extensions to the protocol to react to such events in a concise
   manner.  In this version of the document the scope of the propagation
   is contained to a single domain.

Table of Contents

   1.  Terminology
   2.  Introduction
   3.  Specification of Requirements
   4.  Applicability
   5.  VLID allocation
   6.  VLID association to routes and their propagation
   7.  VLID signaling
     7.1  Analysis of propagation options
     7.2  Encoding
     7.3  Message ordering protection
   8.  Operation
     8.1  BGP paths redundancy requirement
     8.2  VLID propagation via route reflectors
     8.3  Sequence of events
   9.  Security Considerations
   10. IANA Consideration
   11. Acknowledgements
   12. References
     12.1  Normative References
     12.2  Informative References
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Terminology

   The following list describes acronyms and definitions for terms used
   throughout this document:

   o  2547 - RFC2547 - MPLS based Virtual Private Networks [4]
   o  ADD-PATH - Advertisement of Multiple Paths in BGP [10]
   o  AS - Autonomous System
   o  ASBR - Autonomous System Border Router
   o  BFD - Bidirectional Forwarding Detection [8]
   o  BGP - Border Gateway Protocol [5]
   o  CE - Customer Edge.  A customer-owned device that has as its next
      hop a service provider device.
   o  IGP - Interior Gateway Protocol
   o  IPv4 - Internet Protocol version 4
   o  IPv6 - Internet Protocol version 6
   o  EFP - External failure propagation: The out-of-band failure
      signaling which is the subject of this specification.
   o  MPLS - Multi Protocol Label Switching
   o  PE - Provider Edge.  A service provider device that has as its
      next hop one or more customer devices.
   o  VLID - Virtual Link ID: A value assigned by the border router
      (PE/ASBR) indicating the state of the peering device or the state
      of a link to such a peer.
   o  VPNv4 - Virtual Private Network for IPv4
   o  VPNv6 - Virtual Private Network for IPv6

2.  Introduction

   In most of today's BGP deployments the failure of an external peer
   results (on the ASBR or PE node) in a best path calculation followed
   by per-prefix native BGP signaling of a new path, or of a withdraw
   message if no other path is available.

   This proposal recommends an enhancement to this traditional paradigm:
   in parallel to per-prefix signaling, it creates a hierarchy.  While
   the traditional per-prefix BGP mechanism proceeds at its own pace, we
   define an abstraction value (VLID) which is assigned on a per-peer
   basis and signaled immediately following the failure of a peer's link
   or of the peer node.

   The most important benefit of this behavior is to trigger the
   alternate path computation on all BGP speakers domain wide.  This
   reduces the time taken to flood the event to a single message
   propagation delay in the network and makes the protocol messaging as
   well as convergence independent of the number of prefixes involved.

3.  Specification of Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

4.  Applicability

   EFP provides a toolset to speed up detection of remote critical
   failures that impact current data paths, without the need to modify
   the forwarding plane.  It also eliminates even temporary occurrences
   of sub-optimal routing, which cannot be avoided in tunnel-based
   protection solutions.  That is achieved by triggering switchover to
   alternative exit points on all BGP speakers, including the ingress
   nodes to the domain, thus assuring execution of the optimal BGP path
   selection.

   This solution should be able to address all of the below operational
   and deployment scenarios:

   o  IPv4/IPv6 forwarding without next hop self on ASBRs
   o  IPv4/IPv6 forwarding with next hop self on ASBRs
   o  IPv4/IPv6 forwarding with native or tunneled core
   o  VPNv4/VPNv6 remote CE failure (directly connected p2p to PE)
   o  VPNv4/VPNv6 remote CE failure (multihop connected p2p to PE)
   o  VPNv4/VPNv6 remote CE failure (connected via multi-access to PE)
   o  VPNv4/VPNv6 PE-CE link failure (point to point or multi-access)
   o  VPNv4/VPNv6 PE failure

   The solution should work equally well for routes learned via BGP and
   for routes redistributed locally by the border edge router.

5.  VLID allocation

   In order to later detect and map various failure scenarios to BGP
   routing information, routes must be marked ahead of failure time,
   either when received via EBGP or at redistribution from any other
   routing protocol.

   We first examine the following failure types:

   a.  CE node failure
   b.  PE-CE link failure
   c.  PE/ASBR failure

   The link/node liveness of the peer can be detected using IGP hellos,
   BFD, physical link failures, or even a low BGP keepalive interval,
   which is highly discouraged but still used in practice.  This
   document does not mandate the use of any of them, leaving the trigger
   itself to the implementation or customer choice.  All of the above
   triggers should be supported with the proposed extension.

   In today's networks next hop handling for external routes can be
   divided into two operational scenarios:

   o  Setting next hop self on the PE/ASBR
   o  Not setting next hop self on the PE/ASBR and redistributing the
      peer's next hop into the local IGP

   Note that some applications may force the user to set next hop self
   on the edge of the autonomous system (for example 2547).

   In the case of PE/ASBR failure, when next hop self has been set on
   ingress to the AS, the failure of the entire node can be propagated
   by IGP flooding.  In some IGP topologies next hop leaking between IGP
   areas/levels is necessary.  The same IGP-based event propagation can
   also be used to signal external next hop liveness when next hop self
   was not set at the AS boundary.

   In all other failure cases there needs to be a functional component
   which is responsible for associating, on a per BGP process basis (per
   each independent BGP instance with its own BGP router ID), a unique
   value, here called the VLID, with received routes.  Such a value
   represents remote links and peering devices.  The term link here
   represents a virtual link and not a physical one.  For point to point
   interface types a virtual link may map directly to a physical/logical
   link, but on multi-access interfaces an abstraction layer is required
   which maps each CE node to its own virtual link value even if the
   physical medium is shared between a number of CEs connected to the
   PE.  Such an abstraction also addresses all flavors of multihop
   access techniques, as long as proper detection is in place to notice
   a failure in a timely fashion.

6.  VLID association to routes and their propagation

   The meaning of a Virtual Link ID is valid only in conjunction with
   the BGP router ID of the VL's originator within a given Autonomous
   System.  BGP Virtual Link IDs should not be propagated via EBGP
   sessions, unless the operator allows propagation of unchanged next
   hops for a given EBGP peering.

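   The per-peer virtual link abstraction described in Section 5 can be
   illustrated with a small allocator.  This is only a sketch under
   stated assumptions: the class name VlidAllocator, the method
   vlid_for, the (interface, peer address) key and the simple
   incrementing 6-octet value are all hypothetical, since the draft
   leaves the assignment scheme to the implementation.

```python
import itertools
from typing import Dict, Tuple


class VlidAllocator:
    """Hypothetical per-BGP-instance allocator of 6-octet Virtual Link IDs.

    One allocator exists per BGP process (per BGP router ID). A virtual
    link is keyed by (interface, peer address), so several CEs sharing
    one multi-access interface still receive distinct VLIDs.
    """

    def __init__(self, router_id: str) -> None:
        self.router_id = router_id
        self._next = itertools.count(1)  # assumption: sequential values
        self._by_link: Dict[Tuple[str, str], bytes] = {}

    def vlid_for(self, interface: str, peer: str) -> bytes:
        """Return the VLID for this virtual link, allocating it on first use."""
        key = (interface, peer)
        if key not in self._by_link:
            self._by_link[key] = next(self._next).to_bytes(6, "big")
        return self._by_link[key]
```

   Because the VLID is only meaningful together with the originator's
   BGP router ID, the sketch keeps one table per router ID rather than
   a table shared between BGP instances.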
   BGP Virtual Link IDs shall not be propagated to those BGP speakers
   who did not indicate the EFP BGP capability.

   A new BGP Virtual Link Attribute is defined to carry information
   about the allocated Virtual Link ID from a BGP speaker to all NLRIs
   present in the corresponding MP_REACH_NLRI attribute.

   The format of the new BGP Virtual Link Attribute is defined in
   Figure 1:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Attr flags   | Attr Type code|  Attr Length  |   Reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          BGP Rtr_ID                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Virtual Link ID ...                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      ... Virtual Link ID      |           Reserved            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 Figure 1  BGP Virtual Link Attribute

   Attribute Flags & Type code fields:

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1|0|0|0|0|0|0|0|      TBD      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Figure 2  BGP Virtual Link Attribute Flags

   o  Bit 0 - Optional attribute (value 1)
   o  Bit 1 - Non-transitive attribute (value 0)
   o  Bit 2 - Partial bit (0 for optional and non-transitive)
   o  Bit 3 - Attribute length of one octet (value 0)
   o  Bit 4-7 - Unused (value all zeros)
   o  Type code - Attribute type code (TBD)
   o  Length - 16 octets

   The VL ID assignment scheme can be as flexible as an implementation
   allows.  In particular an implementation may choose to define its own
   internal format for the 6-octet VL ID value such that the octets
   represent various node failure scenarios.  Since the VLIDs have only
   local significance, the specification of many flavors of their
   definitions is not necessary for proper protocol operation.

7.  VLID signaling

7.1  Analysis of propagation options

   There are multiple means by which external events encoded in VLIDs
   could be propagated through an AS to other BGP speakers.  We
   considered using IGPs for flooding, forms of reliable multicast
   flooding, and a new BGP sub-address family.  The last one has been
   chosen for the following reasons:

   o  Selective reception of external failure state by only those BGP
      speakers who require this type of information
   o  Easy propagation through the entire domain, including transit via
      different IGP areas or levels
   o  Elimination of unnecessary transit points to avoid increased
      propagation delays
   o  Containment of the solution within a single protocol, thereby
      eliminating the need for multiple protocols (and hence multiple
      components within a router) to interact with each other to
      implement this scheme.  It keeps the solution and its
      implementation simple.

7.2  Encoding

   Virtual link state information will be propagated across a given
   domain with a new SAFI.  New BGP peering sessions, created manually
   or automatically, will need to be established.  The type code for the
   new EFP SAFI will be assigned (TBD).

   The NLRI format for the new EFP SAFI is represented as
   [BGP_Rtr_ID:VL_ID], where BGP_Rtr_ID is a 4-octet value indicating
   the BGP router ID of the BGP speaker that originated the VLs, and
   VL_ID is a 6-octet value representing the allocated identifier for an
   external link or peering node.  The minimum-length EFP NLRI contains
   just the BGP_Rtr_ID value (length of 4 octets), indicating that any
   prefixes originated by this node need to be invalidated regardless of
   the VL_ID value they carry (application example: controlled reload of
   one of the BGP processes during planned maintenance without impacting
   the IGP).  The maximum NLRI length is 10 octets.

   A BGP capability is used to signal EFP support between BGP speakers.
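   The [BGP_Rtr_ID:VL_ID] layout and the router-ID-only form can be
   sketched as follows.  This is an illustration rather than a wire
   format definition: the function names are hypothetical, only the
   4-octet and 10-octet forms are accepted (an assumption, since the
   draft states only a minimum and a maximum), and any length prefix
   that the MP_REACH_NLRI or MP_UNREACH_NLRI encoding may add is left
   out.

```python
import ipaddress
from typing import Optional, Tuple

RTR_ID_LEN = 4  # BGP router ID, 4 octets
VL_ID_LEN = 6   # Virtual Link ID, 6 octets


def encode_efp_nlri(router_id: str, vl_id: Optional[bytes] = None) -> bytes:
    """Build the NLRI value; omitting vl_id gives the 4-octet form."""
    rtr = ipaddress.IPv4Address(router_id).packed
    if vl_id is None:
        return rtr
    if len(vl_id) != VL_ID_LEN:
        raise ValueError("VL_ID must be exactly 6 octets")
    return rtr + vl_id


def decode_efp_nlri(data: bytes) -> Tuple[str, Optional[bytes]]:
    """Split an NLRI value into (router ID, VL_ID or None)."""
    if len(data) == RTR_ID_LEN:
        return str(ipaddress.IPv4Address(data)), None
    if len(data) == RTR_ID_LEN + VL_ID_LEN:
        return str(ipaddress.IPv4Address(data[:RTR_ID_LEN])), data[RTR_ID_LEN:]
    raise ValueError("EFP NLRI value must be 4 or 10 octets in this sketch")


def nlri_invalidates(nlri: bytes, path_router_id: str, path_vl_id: bytes) -> bool:
    """True if a withdrawn EFP NLRI covers a path marked (router ID, VL_ID).

    A router-ID-only NLRI covers every path from that originator,
    whatever VL_ID the path carries.
    """
    rtr, vl = decode_efp_nlri(nlri)
    if rtr != path_router_id:
        return False
    return vl is None or vl == path_vl_id
```

   A receiver would run a check like nlri_invalidates against the
   Virtual Link Attribute stored with each path to find the paths
   affected by a single EFP message.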
   Each BGP speaker that wishes to participate in the new EFP address
   family must use the Multiprotocol Extensions Capability Code as
   defined in [3] to advertise the new EFP (AFI, SAFI) pair.

   A BGP speaker participating in the distribution of EFP information
   and configured as a Route Reflector should prioritize distribution of
   the VL information over its other BGP data processing, so that remote
   peers receive the convergence-critical information in a timely
   fashion.

   When an implementation supports processing prioritization on a per
   BGP address family basis, the EFP address family should have the
   highest priority.  This is recommended mostly for two reasons:

   o  Any other AF may depend on its information
   o  The amount of information required to be sent should be much
      smaller than the amount of corresponding prefixes to be processed
      and propagated.

   Support of the new EFP address family shall automatically indicate
   support for handling the BGP optional non-transitive EFP Virtual Link
   Attribute.

7.3  Message ordering protection

   To support the deployment model of propagating the new EFP AFI/SAFI
   via existing route reflectors, and to accommodate possible BGP
   message propagation delays and update reordering, a 2-octet counter
   has been defined.  Its main role is to assure that BGP reacts only to
   the latest event and not to a delayed (out of sequence) one caused by
   propagation problems in the network.  When the maximum value has been
   reached, the counter shall restart from the value of 1.  The counter
   should only be consulted while processing updates for VL NLRIs
   already existing in the BGP RIB.  When the value is lower than the
   one already stored for a given VL NLRI, the incoming update should be
   dropped and no new BGP action taken.  The exception to this rule is
   when the incoming message contains the value of 1 and the value of
   the previous message has not reached the two-octet maximum.
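   The counter comparison can be sketched as a small classification
   function.  This is one reading of the rule above, not a definitive
   implementation: the names Verdict, next_counter and classify_update
   are hypothetical, an incoming value equal to the stored one is
   treated as not lower and therefore accepted, and a value of 1
   arriving before the stored value reached the maximum is reported
   separately so the caller can handle it as an originator restart.

```python
from enum import Enum

COUNTER_MAX = 0xFFFF  # two-octet maximum


class Verdict(Enum):
    ACCEPT = "accept"                          # process the update normally
    DROP = "drop"                              # stale update, take no action
    ORIGINATOR_RESTART = "originator_restart"  # counter restarted early


def next_counter(current: int) -> int:
    """Originator side: advance the counter, restarting from 1 after the max."""
    return 1 if current >= COUNTER_MAX else current + 1


def classify_update(stored: int, incoming: int) -> Verdict:
    """Receiver side: compare an incoming counter with the stored one."""
    for value in (stored, incoming):
        if not 1 <= value <= COUNTER_MAX:
            raise ValueError("counter must be in the range 1..65535")
    if incoming == 1 and stored == COUNTER_MAX:
        return Verdict.ACCEPT  # normal wrap-around after the maximum
    if incoming == 1:
        return Verdict.ORIGINATOR_RESTART  # exception to the drop rule
    if incoming < stored:
        return Verdict.DROP
    return Verdict.ACCEPT
```

   For example, classify_update(5, 4) yields Verdict.DROP, while
   classify_update(0xFFFF, 1) yields Verdict.ACCEPT because the
   originator has wrapped the counter in the normal way.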
   A counter value of 1 received before the previous value has reached
   the maximum occurs when the originator restarts.  In this case
   already advertised prefixes should not be impacted before the normal
   BGP route propagation has taken place.

   We define a new BGP community type called the Virtual Links Counter
   community to carry the associated virtual link counter value
   (Figure 3).  The Virtual Links Counter community is a new type of BGP
   Extended Community Attribute.  It is non-transitive in the AS scope
   and at the same time transitive in a per BGP speaker scope within the
   domain.  In analogy to the non-transitive EFP Attribute, it is
   allowed to be propagated across EBGP sessions only when next-hop
   preservation has been configured on such a session by the operator.
   It carries a two-octet counter for the associated VL NLRIs in the
   corresponding MP_REACH_NLRI attribute and is only allocated on the
   ASBR/PE nodes.

   The value of the high-order octet of the Type field is 0x40.  The
   value of the low-order octet of the Type field for this community is
   (TBD).  The Global Administrator sub-field (2 octets) is used to
   carry the VL counter.  The Local Administrator sub-field is reserved
   for further use.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     0x40      |      TBD      |     Virtual Link Counter      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Reserved                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

         Figure 3  Virtual Links Counter BGP Extended Community

8.  Operation

8.1  BGP paths redundancy requirement

   In order to make an instant switching decision at the egress nodes
   the ingress node has to propagate the best external path to all of
   its IBGP peers along with the associated VL IDs.
   In the case where there is more than one path in the BGP VRF table
   (including VRF zero) received from different peers, it is advised
   that a second best path be propagated along with its own VL ID.  In
   such a case, even with different RDs per VRF, RRs would need to
   support ADD-PATH and not eliminate path distribution by propagating
   only the best path to their clients.  Not propagating the second best
   path may result in unnecessary loss of connectivity, no greater than
   occurs today, but one that is easily minimized by employing the
   second best path and EFP approach.

8.2  VLID propagation via route reflectors

   In traditional IP-based switching, route reflectors, if deployed,
   need to propagate more than a single best path.  That can be
   accomplished with the use of the ADD-PATH scheme.  For some address
   families, in particular 2547, the same can be accomplished by
   configuring a different RD per VRF on all PEs.  No additional changes
   will be required on RRs.

8.3  Sequence of events

   The following events are expected to happen in the failure scenario:

   a.  Each PE allocates a VLID to each path received from remote
       CEs/ASBRs.
   b.  BGP marks the routes received from external peers by adding the
       new Virtual Link Attribute to the update messages.
   c.  Each PE advertises via IBGP the best external path for a given
       VRF (including VRF0).
   d.  For non-2547 networks, route reflectors, if used, need to be able
       to forward more than only the best path received from the ASBRs
       (use of ADD-PATH [10] recommended).  For the 2547 application
       that extension to route reflectors is not required when the
       recommendation of a different RD allocation per VRF is in place.
       That extension is also not required when route reflectors are
       not used for a given AFI/SAFI.
   e.  On the event of any PE-CE link failure or CE node failure, the
       PE/ASBR signals the state of the transitioned VL IDs in
       MP_UNREACH_NLRI and propagates them to its peers via iBGP in the
       new EFP AFI/SAFI.
   f.  The BGP Rtr ID + VL ID pair uniquely identifies invalid paths and
       triggers local switchover to other paths for a given prefix.
   g.  For the transition period the sender will also follow up with the
       traditional per-prefix withdraw or update message.  When network
       wide deployment of routers supporting EFP is assured, the need
       for origination and propagation of per-prefix withdraws following
       EFP signaling could be eliminated.

9.  Security Considerations

   This extension to BGP does not change the underlying security issues
   inherent in the existing IBGP [2].

10.  IANA Consideration

   The following type codes have to be allocated by the current
   allocation rules:

   o  New attribute type code for the BGP Virtual Link Attribute
   o  New SAFI value for the new EFP SAFI
   o  New type code for the Virtual Link Counter BGP Extended Community

11.  Acknowledgements

   The authors would like to express a special thanks to the following
   individuals for contributing their ideas and support for writing this
   specification: Tony Li, Yakov Rekhter, David Ward, Russ White, Enke
   Chen.

12.  References

12.1  Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

   [2]  Heffernan, A., "Protection of BGP Sessions via the TCP MD5
        Signature Option", RFC 2385, August 1998.

   [3]  Bates, T., Rekhter, Y., Chandra, R., and D. Katz,
        "Multiprotocol Extensions for BGP-4", RFC 2858, June 2000.

   [4]  Rosen, E., "BGP/MPLS IP VPNs", draft-ietf-l3vpn-rfc2547bis-03
        (work in progress), October 2004.

   [5]  Rekhter, Y., "A Border Gateway Protocol 4 (BGP-4)",
        draft-ietf-idr-bgp4-26 (work in progress), October 2004.

   [6]  Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended
        Communities Attribute", draft-ietf-idr-bgp-ext-communities-08
        (work in progress), February 2005.

   [7]  Sangli, S., Rekhter, Y., Fernando, R., Scudder, J., and E. Chen,
        "Graceful Restart Mechanism for BGP", draft-ietf-idr-restart-10
        (work in progress), June 2004.

   [8]  Katz, D. and D. Ward, "Bidirectional Forwarding Detection",
        draft-ietf-bfd-base-02 (work in progress), March 2005.

12.2  Informative References

   [9]  Rosen, E. and Y. Rekhter, "BGP/MPLS VPNs", RFC 2547, March 1999.

   [10] Walton, D., Cook, D., Retana, A., and J. Scudder, "Advertisement
        of Multiple Paths in BGP", draft-walton-bgp-add-paths-00 (work
        in progress), May 2002.

Authors' Addresses

   Robert Raszuk
   Cisco Systems Inc.
   170 West Tasman Dr
   San Jose, CA  95134
   US

   Phone: (408)525-7588
   Email: raszuk@cisco.com


   Keyur Patel
   Cisco Systems Inc.
   170 West Tasman Dr
   San Jose, CA  95134
   US

   Phone: (408)526-7183
   Email: keyupate@cisco.com


   Rex Fernando
   Cisco Systems Inc.
   170 West Tasman Dr
   San Jose, CA  95134
   US

   Phone: (408)525-1253
   Email: rex@cisco.com

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2005).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.