INTERNET-DRAFT                                               Ali Sajassi
Intended Status: Standards Track                             Samer Salam
                                                            Sami Boutros
                                                             Keyur Patel
                                                                   Cisco
Expires: January 4, 2011                                    July 4, 2011


                      E-VPN Ethernet Segment Route
              draft-sajassi-l2vpn-evpn-segment-route-00.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Sajassi et al.         Expires December 19, 2011                [Page 1]
INTERNET DRAFT  draft-sajassi-l2vpn-evpn-segment-route     June 17, 2011

Abstract

   [E-VPN] defines a solution and architecture for BGP MPLS-based
   Ethernet VPNs.
   This document describes an additional BGP route and associated route
   attributes that enhance the multi-homing capabilities of the
   solution. These are: the Ethernet Segment Route, the ES-Import
   Extended Community, the DF Election Attribute and the Inter-chassis
   Communication Attribute. This draft describes their usage,
   advantages and encoding.

Table of Contents

   1  Introduction
     1.1  Terminology
   2  Motivation and Usage
     2.1  Preventing Transient Loops and Packet Duplication
     2.2  Support of Multi-Chassis Ethernet Bundles
     2.3  Designated Forwarder (DF) Election with VLAN Carving
     2.4  Route Scalability with Granular DF Election
     2.5  Avoiding Relearning of Subscriber/Session State
   3  BGP Encoding
     3.1  Ethernet Segment Route
     3.2  ES-Import Extended Community
     3.3  DF Election Attribute
     3.4  Inter-chassis Communication Attribute
   4  DF Election with Paxos Algorithm
   5  LACP State Synchronization
   6  VLAN Carving
   7  Subscriber/Session State Synchronization
   8  Security Considerations
   9  IANA Considerations
   10  References
     10.1  Normative References
     10.2  Informative References
   Author's Addresses
1  Introduction

   [E-VPN] defines a solution and architecture for BGP MPLS-based
   Ethernet L2VPN services with advanced multi-homing capabilities. To
   that end, [E-VPN] defines a new BGP NLRI with 5 route types:

      1. Ethernet Auto-Discovery (A-D) route
      2. MAC advertisement route
      3. Inclusive Multicast Route
      5. Selective Multicast Auto-Discovery (A-D) Route
      6. Leaf Auto-Discovery (A-D) Route

   In this draft, we define one additional route type:

      4. Ethernet Segment Route

   This route primarily enhances the multi-homing capabilities of the
   E-VPN solution in the following areas:

      - Preventing transient loops and packet duplication
      - Support of multi-chassis Ethernet bundles
      - Designated Forwarder election with VLAN carving
      - Avoiding relearning of subscriber/session state

   In addition to the above route, three new BGP route attributes are
   defined: the ES-Import Extended Community attribute, the DF Election
   attribute and the Inter-chassis Communication attribute.

   Section 2 discusses the motivation and usage of the new route and
   attributes. Section 3 describes the BGP encoding.

1.1  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2  Motivation and Usage

   This section focuses on the reasons for defining the Ethernet Segment
   route and its three associated BGP attributes, and describes their
   usage in E-VPN.

2.1  Preventing Transient Loops and Packet Duplication

   The Designated Forwarder (DF) election procedures defined in [E-VPN]
   require that each MES construct a candidate list of DFs from the
   received Ethernet A-D routes. By default, each MES then independently
   chooses the MES with the highest IP address as the elected DF. There
   is no handshake mechanism between the MESes that are connected to the
   same Ethernet Segment. As a result, during routing transients,
   different MESes may end up electing different DFs for the same
   Ethernet Segment due to inconsistent views of the network. If the
   Ethernet Segment is a multi-homed device, this may lead to transient
   packet duplication. If the Ethernet Segment is a multi-homed network,
   the presence of multiple DFs may lead to transient forwarding loops
   in addition to potential packet duplication.

   To eliminate these issues, a handshake mechanism is required between
   the MES nodes connected to the same Ethernet Segment, to ensure a
   common view of the network among them. This handshake is performed
   using the DF Election attribute carried in the Ethernet Segment
   route, as discussed in the 'DF Election with Paxos Algorithm'
   section.

2.2  Support of Multi-Chassis Ethernet Bundles

   When a CE is multi-homed to a set of MES nodes using the [802.1AX]
   Link Aggregation Control Protocol (LACP), the MESes must act as if
   they were a single LACP speaker for the Ethernet links to form a
   bundle, and operate correctly as a Link Aggregation Group (LAG). To
   achieve this, the MESes connected to the same multi-homed CE must
   synchronize LACP configuration and operational data among them.
   The synchronization is required for the following reasons:

      - to determine if the links in the Ethernet bundle are to operate
        in all-active or hot-standby resiliency mode

      - to detect and handle CE mis-configuration when LACP Port Key is
        configured on the MES

      - to detect and handle mis-wiring between CE and MES when LACP
        Port Key is configured on the MES

      - to deterministically agree on which link(s) should join a bundle
        based on port and system priorities, especially when the number
        of links exceeds the aggregation capacity of the MESes, and the
        MES LACP System Priority is higher than the CE's

      - to detect and react to actor/partner churn where the LACP
        speakers are not able to converge

   Synchronization of LACP state between MESes is performed using the
   Inter-chassis Communication attribute carried in the Ethernet Segment
   route, as described in the 'LACP State Synchronization' section
   below.

2.3  Designated Forwarder (DF) Election with VLAN Carving

   In the case where multiple MES nodes offer redundant connectivity for
   an Ethernet Segment, it is preferred to elect multiple DFs (one DF
   per VLAN) in order to distribute the traffic among the redundancy
   group members. This process of electing different DFs for different
   VLANs on an Ethernet Segment, for the purpose of load balancing, is
   referred to as 'VLAN Carving'.

   The VLAN carving algorithm must ensure even distribution of VLANs
   among the MES nodes servicing the same Ethernet Segment. As new MES
   devices get commissioned or decommissioned, the VLANs must be
   redistributed over the available devices for even load balancing.
   However, in the case of link, port or node failure, the VLAN carving
   algorithm should ensure that only the affected VLANs are reassigned
   to different MES(es), and none of the other active VLANs are
   shuffled.
   Otherwise, the fault decoupling capability of the redundancy group
   would be compromised.

   VLAN carving requires exchange of information among the MES nodes
   connected to an Ethernet Segment in order to agree upon how the VLANs
   will be distributed. Since this information is only relevant to the
   MES nodes that are directly connected to a specific Ethernet Segment,
   the exchanges and associated processing should be localized to the
   redundancy group members.

   DF Election with VLAN carving is performed using the DF Election
   attribute carried in the Ethernet Segment route, as described in the
   "VLAN Carving" section below.

2.4  Route Scalability with Granular DF Election

   [E-VPN] allows for DF election to be performed at the granularity of
   either an Ethernet Segment or the combination of an Ethernet Segment
   and a VLAN on that segment. In the latter case, an Ethernet A-D route
   per (ESI, VLAN) must be advertised by the MES regardless of whether
   the service interface is port-based, VLAN-based, VLAN bundling-based
   or VLAN aware bundling-based. In the case of port-based and VLAN
   bundling-based services, these routes are only required for DF
   election and not for advertising forwarding labels.

   By using the Ethernet Segment route instead of the Ethernet A-D route
   for DF election, it is still possible to have per-VLAN DF granularity
   while significantly reducing the number of BGP routes advertised. For
   example, consider an Ethernet Segment ESI1 used for a port-based
   service. With the Ethernet A-D route used for per (ESI, VLAN) DF
   election, 4095 routes are needed, whereas with the Ethernet Segment
   route only a single route is required.

2.5  Avoiding Relearning of Subscriber/Session State

   For certain applications, the MES builds and maintains per subscriber
   or per session 'soft' state that is used for either optimizing the
   traffic forwarding or enforcing security.
   Examples of such per subscriber/session state include:

      - multicast state derived from IGMP or PIM snooping

      - IP address to MAC address bindings gleaned from snooping ARP
        and/or DHCP packets, and used to prevent address spoofing or
        masquerading

   When a set of MES nodes provides multi-homed connectivity for an
   Ethernet Segment, this 'soft' state is built on the active MES node
   that forwards and snoops the relevant protocol packets. In case of a
   link or node failure, the state must be reconstructed on the backup
   MES (e.g. by waiting for the next IGMP query or ARP message or by
   issuing unsolicited queries). This may cause traffic disruption and
   affect the availability of the service. Alternatively, the state can
   be synchronized among the MES nodes via BGP, which shortens the
   convergence of the service after a failure.

   Synchronization of subscriber/session state between MES nodes is
   performed using the Inter-chassis Communication attribute carried in
   the Ethernet Segment route, as described in the 'Subscriber/Session
   State Synchronization' section below.

3  BGP Encoding

   This section defines the encoding of the BGP route and attributes.

3.1  Ethernet Segment Route

   The Ethernet Segment Route is encoded in the E-VPN NLRI defined in
   [E-VPN] using the Route Type value of 4. The Route Type Specific
   field of the NLRI is formatted as follows:

      +---------------------------------------+
      |             RD (8 octets)             |
      +---------------------------------------+
      |Ethernet Segment Identifier (10 octets)|
      +---------------------------------------+

3.2  ES-Import Extended Community

   This is a new transitive extended community carried with the Ethernet
   Segment route. When used, it enables all the MESes connected to the
   same multi-homed site to import the Ethernet Segment routes.
   The value is derived automatically from the ESI by encoding the
   6-byte MAC address portion of the ESI in the ES-Import Extended
   Community. The format of this extended community is as follows:

       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | 0x44          |   Sub-Type    |           ES-Import           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                       ES-Import Cont'd                        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3.3  DF Election Attribute

      +---------------------------------------+
      |  State (2 octets)                     |
      +---------------------------------------+
      |  Sequence No. (4 octets)              |
      +---------------------------------------+
      |  Local No. of links (2 octets)        |
      +---------------------------------------+
      |  Total No. of links (2 octets)        |
      +---------------------------------------+
      |  Flags (1 octet)                      |
      +---------------------------------------+
      |  No. of IP addresses (1 octet)        |
      +---------------------------------------+
      |  Ordered list of tuples:              |
      |    [IP address Length (1 octet),      |
      |     IP Address (4 or 16 bytes)]       |
      |                                       |
      +---------------------------------------+

   State field can take one of the following values:

      0x0000  Initializing
      0x0001  Proposal Pending
      0x0002  Promise Pending
      0x0003  Active

   Flags field is encoded as follows:

      7 bits: reserved
      Least significant bit: Protecting flag

3.4  Inter-chassis Communication Attribute

      +---------------------------------------+
      |  Type (2 octets)                      |
      +---------------------------------------+
      |  Length (1 or 2 octets)               |
      +---------------------------------------+
      |  Opaque (var)                         |
      +---------------------------------------+

4  DF Election with Paxos Algorithm

   The procedures in this section guarantee that all MES nodes in a
   given redundancy group agree on a unique DF for a given Ethernet
   Segment.
   This eliminates the problem of transient forwarding loops and
   transient packet duplicates described above. The procedures can be
   broken down to the following steps:

   1. When a MES discovers the ESI of the attached Ethernet Segment, it
      advertises an Ethernet Segment route with the associated ES-Import
      extended community attribute and with the 'Initializing' code in
      the State field of the DF Election attribute.

   2. The MES then starts a timer to allow the reception of Ethernet
      Segment routes from other MES nodes in the same redundancy group.

   3. When the timer expires, each MES builds an ordered list of the IP
      addresses of all the MES nodes connected to the Ethernet Segment
      (including itself), in increasing numeric value.

   4. The first MES in the ordered list then elects itself as the
      Arbiter Node (AN). It initiates the handshake by sending an
      Ethernet Segment route with 'Proposal Pending' code in the State
      field of the DF Election attribute.

   5. When a MES node receives an Ethernet Segment route with the
      'Proposal Pending' code, it takes one of the following options:

      a. If the receiving MES ranks the transmitting MES's IP address as
         the top entry in its local ordered list, it acknowledges the
         handshake by responding with an Ethernet Segment route with the
         'Promise Pending' code in the State field of the DF Election
         attribute. This includes the scenario where the receiving MES
         forfeits the AN role to another advertising MES with a
         numerically lower IP address.

      b. If the receiving MES does not rank the transmitting MES's IP
         address as the top entry in its local ordered list, and the
         receiving MES had advertised an Ethernet Segment route with the
         'Initializing' code or with the 'Proposal Pending' code, then
         the MES takes no further action.

   6.
      When the AN receives 'Promise Pending' from all of the MES nodes
      in the ordered list, it sends an updated Ethernet Segment route
      with the 'Active' code in the DF Election attribute.

   7. When the other MES nodes in the redundancy group receive the
      'Active' code from the AN, they respond with an updated Ethernet
      Segment route with the 'Active' code in the DF Election attribute.
      This concludes the handshake.

   In the case where the DF election is performed at the granularity of
   an Ethernet Segment, i.e. there is a single DF for all VLANs on the
   segment, the Arbiter Node is effectively the Designated Forwarder for
   the segment. In Step 1, all the MES nodes start off with their ports
   that are connected to the segment blocked (for multi-destination
   traffic from the core). In Step 6, the MES confirmed as the AN (i.e.
   the DF) unblocks its port towards the Ethernet Segment.

   DF election at the granularity of (Ethernet Segment, VLAN) is
   discussed in the "VLAN Carving" section below.

5  LACP State Synchronization

   To support CE multi-homing with multi-chassis Ethernet bundles, the
   MES nodes connected to a given CE should synchronize [802.1AX] LACP
   state amongst each other. This includes the following LACP specific
   configuration parameters:

      - System Identifier (MAC Address): uniquely identifies a LACP
        speaker.
      - System Priority: determines which LACP speaker's port priorities
        are used in the Selection logic.
      - Aggregator Identifier: uniquely identifies a bundle within a
        LACP speaker.
      - Aggregator MAC Address: identifies the MAC address of the
        bundle.
      - Aggregator Key: used to determine which ports can join an
        Aggregator.
      - Port Number: uniquely identifies an interface within a LACP
        speaker.
      - Port Key: determines the set of ports that can be bundled.
      - Port Priority: determines a port's precedence level to join a
        bundle in case the number of eligible ports exceeds the maximum
        number of links allowed in a bundle.
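   As an illustration only, the configuration parameters listed above
   could be grouped into a record such as the following before being
   placed in the opaque part of the Inter-chassis Communication
   attribute. The class and function names are hypothetical, the field
   types are assumptions, and the ordering rule (a numerically lower
   Port Priority wins, ties broken by lower Port Number) follows the
   usual LACP convention rather than anything stated in this draft.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LacpPortConfig:
    """Per-port configuration (hypothetical container)."""
    port_number: int    # Port Number
    port_key: int       # Port Key
    port_priority: int  # Port Priority


@dataclass(frozen=True)
class LacpConfig:
    """Per-bundle configuration (hypothetical container)."""
    system_id: str        # System Identifier (MAC address)
    system_priority: int  # System Priority
    aggregator_id: int    # Aggregator Identifier
    aggregator_mac: str   # Aggregator MAC Address
    aggregator_key: int   # Aggregator Key
    ports: tuple          # tuple of LacpPortConfig


def eligible_ports(cfg, max_links):
    """Ports that may join the bundle, best candidates first.

    Assumption: a port is eligible when its Port Key equals the
    Aggregator Key; lower priority value means higher precedence.
    """
    matching = [p for p in cfg.ports if p.port_key == cfg.aggregator_key]
    ranked = sorted(matching, key=lambda p: (p.port_priority, p.port_number))
    return ranked[:max_links]


cfg = LacpConfig(
    system_id="00:00:5e:00:53:01",
    system_priority=100,
    aggregator_id=1,
    aggregator_mac="00:00:5e:00:53:02",
    aggregator_key=10,
    ports=(
        LacpPortConfig(port_number=1, port_key=10, port_priority=200),
        LacpPortConfig(port_number=2, port_key=10, port_priority=100),
        LacpPortConfig(port_number=3, port_key=20, port_priority=50),
    ),
)
selected = [p.port_number for p in eligible_ports(cfg, max_links=2)]
```

   Port 3 is left out because its Port Key differs from the Aggregator
   Key, which is the kind of mismatch the synchronized data allows the
   MES nodes to detect.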
   The above information must be synchronized between the MES nodes
   wishing to form a multi-chassis bundle with a given CE, so that they
   present a single LACP peer to that CE. This is required for initial
   system bring-up and upon any configuration change.

   Furthermore, the MESes must also synchronize operational (run-time)
   data, in order for the LACP Selection logic state-machines to
   execute. This operational data includes the following LACP
   operational parameters, on a per port basis:

      - Partner System Identifier: the CE System MAC address.
      - Partner System Priority: the CE LACP System Priority.
      - Partner Port Number: the CE's attachment circuit (AC) port
        number.
      - Partner Port Priority: the CE's AC Port Priority.
      - Partner Key: the CE's key for this AC.
      - Partner State: the CE's LACP State for the AC.
      - Actor State: the MES's LACP State for the AC.
      - Port State: the MES's AC port status.

   The above state needs to be communicated between MESes forming a
   multi-chassis bundle during LACP initial bring-up, upon any
   configuration change and upon the occurrence of a failure.

   It should be noted that the above configuration and operational state
   is localized in scope and is only relevant to the MESes within a
   given Redundancy Group, i.e. those which connect to the same Ethernet
   Segment over a given Ethernet bundle.

   Furthermore, the communication of state changes, upon failures, must
   occur with minimal latency, in order to minimize the switchover time
   and consequent service disruption.

   Without synchronization of the above parameters, the system is
   subject to the issues outlined in section 2.2 above.

6  VLAN Carving

   It is possible to elect multiple DFs per Ethernet Segment (one per
   VLAN) by using a slightly modified version of the procedures
   described in the "DF Election with Paxos Algorithm" section above.
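   For reference, the part of that base procedure that the
   modifications build on (steps 3 to 5: building the ordered list,
   picking the Arbiter Node and deciding whether to answer a 'Proposal
   Pending') can be sketched as follows. This is a minimal sketch, not
   part of the specification: the function names are hypothetical, and
   ordering IPv4 addresses before IPv6 addresses is an assumption, since
   the procedure only says "increasing numeric value".

```python
import ipaddress


def ordered_candidates(addresses):
    """Step 3: MES addresses in increasing numeric order."""
    parsed = {ipaddress.ip_address(a) for a in addresses}
    # Assumption: when families are mixed, IPv4 sorts before IPv6.
    return sorted(parsed, key=lambda ip: (ip.version, int(ip)))


def arbiter_node(addresses):
    """Step 4: the first entry of the ordered list is the AN."""
    return ordered_candidates(addresses)[0]


def should_promise(local_view, sender):
    """Step 5a: answer 'Promise Pending' only if the sender of the
    'Proposal Pending' route is the top entry of the local list."""
    return arbiter_node(local_view) == ipaddress.ip_address(sender)


group = ["192.0.2.7", "192.0.2.3", "192.0.2.10"]
order = [str(ip) for ip in ordered_candidates(group)]
```

   Comparing the addresses numerically rather than as strings matters:
   a string sort would place 192.0.2.10 ahead of 192.0.2.3 and the
   nodes could disagree on the AN.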
   In step 3, each of the MES nodes assigns an ordinal for itself based
   on the order of its IP address in the list. The first MES in the list
   (the one with the numerically lowest IP address) is given an ordinal
   of 0. The ordinals are used to determine which MES node will be the
   DF for a given VLAN on the Ethernet Segment using the following rule:

      Assuming a redundancy group of N MES nodes, the MES with ordinal i
      is the DF for VLAN V when (V MOD N) = i.

   In step 6, the AN unblocks only the VLANs for which it is a DF for
   the Ethernet Segment.

   In step 7, each MES node unblocks only the VLANs for which it is a DF
   for the Ethernet Segment.

   In the case of a port, link or node failure, the AN takes over the
   forwarding for the affected VLANs on the segment and advertises an
   updated Ethernet Segment route with the 'Active' code and
   'Protecting' flag set in the DF Election attribute. Therefore, when
   VLAN carving is used, the AN acts as the Backup DF (BDF) for the
   Ethernet Segment. This ensures that only the affected VLANs are
   failed over, and none of the other VLANs are shuffled.

   When the fault clears, the following procedure is followed to revert
   the VLANs to the recovering MES:

   1. The recovering MES advertises an Ethernet Segment route with the
      'Initializing' code in the State field of the DF Election
      attribute.

   2. The recovering MES receives from the other MES nodes Ethernet
      Segment routes with the 'Active' code in the DF Election
      attribute. The MES can then build its ordered list.

   3. The recovering MES advertises an Ethernet Segment route with the
      'Proposal Pending' code in the DF Election attribute. This is
      meant to indicate to the AN that the recovering MES is ready to
      take over its VLANs.

   4. Upon receiving the route with the 'Proposal Pending' code, the AN
      blocks all the VLANs that belong to the recovering MES.
      The AN then advertises an updated Ethernet Segment route with the
      'Protecting' flag cleared.

   5. Upon receiving the above route from the AN, the recovering MES
      unblocks the VLANs for which it is the DF. The recovering MES then
      transmits an Ethernet Segment route with the 'Active' code. This
      completes the reversion.

   If the failed MES is the AN, then the MES node with the second best
   claim to be AN (i.e. whose IP address is the second in the ordered
   list) takes over the failed VLANs and advertises an updated Ethernet
   Segment route with the 'Active' code and 'Protecting' flag set in the
   DF Election attribute. The procedures for reversion, in this case,
   are as follows:

   1. The recovering AN advertises an Ethernet Segment route with the
      'Initializing' code in the State field of the DF Election
      attribute.

   2. The recovering AN receives from the other MES nodes Ethernet
      Segment routes with the 'Active' code in the DF Election
      attribute.

   3. The recovering AN advertises an Ethernet Segment route with the
      'Proposal Pending' code in the DF Election attribute.

   4. The other MES nodes respond to that advertisement with Ethernet
      Segment routes with the 'Promise Pending' code in the DF Election
      attribute. At this point, the BDF blocks all the VLANs that belong
      to the recovering AN before advertising its Ethernet Segment
      route, with the 'Promise Pending' code and 'Protecting' flag
      cleared.

   5. The recovering AN unblocks the VLANs for which it is the DF upon
      receiving the 'Promise Pending' advertisement from the BDF. The AN
      then advertises an Ethernet Segment route with the 'Active' code
      once it receives the Ethernet Segment route with 'Promise Pending'
      code from all of the MES nodes in the redundancy group.

   6. The other MES nodes respond with Ethernet Segment routes with the
      'Active' code. This marks the end of the reversion.
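   The carving rule and the failover behaviour of this section can be
   illustrated with a short sketch. It is not normative: the function
   names and the dictionary representation are hypothetical, the list of
   nodes is assumed to be already ordered as in step 3 of the DF
   election procedure, and the group is assumed to have at least two
   members.

```python
def carve(vlans, ordered_nodes):
    """Assign each VLAN V to the MES with ordinal (V MOD N)."""
    n = len(ordered_nodes)
    return {v: ordered_nodes[v % n] for v in vlans}


def fail_over(assignment, ordered_nodes, failed):
    """Move only the failed node's VLANs to the backup DF.

    The AN (ordinal 0) is the backup DF; if the AN itself failed,
    the second node in the ordered list takes over.
    """
    if failed == ordered_nodes[0]:
        backup = ordered_nodes[1]
    else:
        backup = ordered_nodes[0]
    return {
        vlan: (backup if owner == failed else owner)
        for vlan, owner in assignment.items()
    }


nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # already in numeric order
before = carve(range(1, 7), nodes)
after_node_failure = fail_over(before, nodes, failed="10.0.0.2")
after_an_failure = fail_over(before, nodes, failed="10.0.0.1")

# VLANs whose DF changed when 10.0.0.2 failed
moved = sorted(v for v in before if before[v] != after_node_failure[v])
```

   With three nodes and VLANs 1 to 6, only VLANs 1 and 4 (those with
   V MOD 3 = 1) change owner when 10.0.0.2 fails; all other VLANs keep
   their DF, which is the fault decoupling property required in
   Section 2.3.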
7  Subscriber/Session State Synchronization

   Synchronization of subscriber/session state between MES nodes is
   performed using the Inter-chassis Communication attribute carried in
   the Ethernet Segment route. The various applications are responsible
   for the encoding and decoding of the relevant data, and this is
   outside the scope of this draft. BGP provides a reliable transport
   service in this case.

8  Security Considerations

   There are no additional security aspects beyond those of VPLS/H-VPLS
   that need to be considered.

9  IANA Considerations

   To be added in a later revision.

10  References

10.1  Normative References

   [RFC2119]   S. Bradner, "Key words for use in RFCs to Indicate
               Requirement Levels", BCP 14, RFC 2119, March 1997.

10.2  Informative References

   [E-VPN]     Aggarwal et al., "BGP MPLS Based Ethernet VPN",
               draft-raggarwa-sajassi-l2vpn-evpn-02.txt, work in
               progress, March, 2011.

   [EVPN-REQ]  Sajassi et al., "Requirements for Ethernet VPN (E-VPN)",
               draft-sajassi-raggarwa-l2vpn-evpn-req-00.txt, work in
               progress, October, 2010.

Author's Addresses

   Ali Sajassi
   Cisco
   170 West Tasman Drive
   San Jose, CA 95134, US
   Email: sajassi@cisco.com

   Samer Salam
   Cisco
   595 Burrard Street, Suite 2123
   Vancouver, BC V7X 1J1, Canada
   Email: ssalam@cisco.com

   Sami Boutros
   Cisco
   170 West Tasman Drive
   San Jose, CA 95134, US
   Email: sboutros@cisco.com

   Keyur Patel
   Cisco
   170 West Tasman Drive
   San Jose, CA 95134, US
   Email: keyupate@cisco.com