IPv6 Maintenance Working Group (6man)                          M. Zhang
Internet-Draft                                               S. Kapadia
Intended status: Standards Track                                L. Dong
Expires: September 10, 2015                               Cisco Systems
                                                           March 9, 2015

    Improving Scalability of Switching Systems in Large Data Centers
               draft-zhang-6man-scale-large-datacenter-00

Abstract

   Server virtualization has been overwhelmingly accepted, especially in cloud-based data centers. Accompanied by the expansion of services and by technology advancements, the size of a data center has increased significantly. There could be hundreds or thousands of physical servers installed in a single large data center, which implies that the number of Virtual Machines (VMs) could be in the order of millions. Effectively supporting millions of VMs with limited hardware resources becomes a real challenge for networking vendors. This document describes a method to scale a switching system with limited hardware resources using IPv6 in large data center environments.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 10, 2015.

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.html. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   Abstract
   1. Introduction
      1.1 Specification of Requirements
   2. Terminology
   3. Large Data Center Requirements
   4. Scaling Through Aggregation
   5. SSP Aggregation
   6. Programming in FIB CAM with Special Mask
   7. VM Mobility
   8. Scaling Neighbor Discovery
   9. DHCPv6
   10. BGP
   11. Scalability
   12. DC Edge Router/Switch
       12.1 DC Cluster Interconnect
   13. Multiple VRFs and Multiple Tenancies
       13.1 Resource Allocation and Scalability with VRFs
   14. Security
   15. References
   Authors' Addresses
1. Introduction

   Server virtualization is extremely common in large data centers and is realized with a large number of Virtual Machines (VMs) or containers. Typically, multiple VMs share the resources of a physical server. Accompanied by the expansion of services and by technology advancements, the size of a data center has increased significantly. There could be hundreds or thousands of physical servers in a single large data center, which implies that the number of VMs could be in the order of millions. Such a large number of VMs poses a challenge to network equipment providers: how to effectively support millions of VMs with limited hardware resources.

   The Clos-based spine-leaf topology has become the de facto standard of choice for data center deployments. A typical data center topology consists of two tiers of switches: an Aggregation or spine tier and an Access/Edge or leaf tier. Figure 1 depicts a two-tier network topology in a data center cluster. S1 to Sn are spine switches. L1 to Lm are leaf switches. Every leaf switch has at least one direct connection to every spine switch. H1 to Hz are hosts/VMs attached to leaf switches, either directly or indirectly through L2 switches. E1 is an edge router/switch. Multiple data center clusters are interconnected by edge routers/switches.

         +---+     +---+           +---+
         |S1 |     |S2 |    ...    |Sn |
         +-+-+     +-+-+           +-+-+
           |         |               |
         +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+
         |            Link Connections            |
         |   Every Spine connects to every Leaf   |
         +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+
           |         |            |        |
         +-+-+     +-+-+        +-+-+    +-+-+    connect
         |L1 |     |L2 |   ...  |Lm |    |E1 +-->> to other
         +---+     +---+        +---+    +---+    cluster
         /   \        \           |
        /     \        \          |
      +-+-+   +-+-+    +-+-+    +-+-+
      |H1 |   |H2 |    |H3 | ...|Hz |
      +---+   +---+    +---+    +---+

      Figure 1: Typical two-tier network topology in a DC cluster

   Switches at the aggregation tier are large, expensive entities with many ports that interconnect multiple access switches and provide fast switching between them. Switches at the access tier are smaller, low-cost, low-latency switches connected to the physical servers; they switch traffic among local servers and, through the aggregation switches, between local servers and servers connected to other access switches. To maximize profit, low-cost, low-latency ASICs, most commonly SoCs (systems-on-chip), are selected when designing access switches. In these types of ASICs, the Layer 3 hardware Forwarding Information Base (FIB) is typically split into two tables: 1) a Host Route Table (HRT) for host routes (/32 for IPv4 and /128 for IPv6), typically implemented as a hash table; and 2) a Longest Prefix Match (LPM) table for prefix routes. Due to the high cost of implementing a large LPM table in an ASIC, whether with traditional Ternary Content Addressable Memory [TCAM] or with other alternatives, the LPM table size in hardware is restricted to a few tens of thousands of entries (16k to 64k for IPv4) on access switches. Note that with an IPv6 address being four times as long as an IPv4 address, the effective number of FIB LPM entries available for IPv6 is essentially one-fourth (or one-half, depending on the width of an LPM entry). Note also that the same tables need to be shared by all IPv4, IPv6, unicast, and multicast traffic.
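   As a rough illustration of this constraint, the following Python sketch models the split FIB of such an ASIC: an exact-match HRT for /128 host routes and a small LPM table for prefix routes. The table sizes, the lookup order, and the next-hop labels are assumptions for illustration only, not a description of any particular chip.

      import ipaddress

      class SplitFib:
          """Toy model of an access-switch FIB: exact-match HRT plus a small LPM table."""
          def __init__(self, hrt_size=8192, lpm_size=16384):
              self.hrt = {}            # /128 host routes, hash table
              self.lpm = []            # (prefix, next_hop) pairs, limited in number
              self.hrt_size, self.lpm_size = hrt_size, lpm_size

          def add_host_route(self, addr, next_hop):
              if len(self.hrt) >= self.hrt_size:
                  raise RuntimeError("HRT full")
              self.hrt[ipaddress.ip_address(addr)] = next_hop

          def add_prefix(self, prefix, next_hop):
              if len(self.lpm) >= self.lpm_size:
                  raise RuntimeError("LPM full")
              self.lpm.append((ipaddress.ip_network(prefix), next_hop))
              # Keep longer prefixes first so the first containing prefix is the longest match.
              self.lpm.sort(key=lambda entry: entry[0].prefixlen, reverse=True)

          def lookup(self, dst):
              dst = ipaddress.ip_address(dst)
              if dst in self.hrt:                  # exact /128 match first
                  return self.hrt[dst]
              for prefix, next_hop in self.lpm:    # then longest prefix match
                  if dst in prefix:
                      return next_hop
              return None

      fib = SplitFib()
      fib.add_host_route("2001:db8:a:a:c5:1:0:1", "local-port-7")
      fib.add_prefix("2001:db8:a::/48", "spine-uplink")
      print(fib.lookup("2001:db8:a:a:c5:1:0:1"))   # local-port-7 (HRT hit)
      print(fib.lookup("2001:db8:a:b::99"))        # spine-uplink (LPM hit)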
   For years, people have been looking for solutions for super-scale data centers, but there have been no major breakthroughs. Overlay-based [OVERLAYS] approaches using VXLAN, TRILL, FabricPath, LISP, etc. have certainly allowed separation of the tenant end-host address space from the topology address space, thereby providing a level of indirection that aids scalability by reducing the requirements on the aggregation switches. However, the scale requirements on the access switches remain high, since they need to be aware of all the tenant end-host addresses to support the any-to-any communication requirement in large data centers (both East-West and North-South traffic).

   With Software-Defined Networking controllers gaining a lot of traction, there has been a push toward a God-box-like model in which all the information about all the end hosts is known. In this model, if an access switch does not know how to reach the destination of an incoming packet, it queries the God-box and locally caches the information (the vanilla OpenFlow model). The inherent latency of this approach, together with the single point of failure presented by the centralized model, means that such systems will not scale beyond a point. Alternatively, the access switch can forward unknown traffic toward a set of Oracle boxes (typically one or more aggregation switches with huge tables that know about all end hosts), which in turn take care of forwarding the traffic to its destination. As scale increases, throwing more silicon at the problem is never a good idea; the cost of building such large systems would be prohibitively high, making it impractical to deploy them in the field.

   This document describes an innovative approach to improve the scalability of switching systems for large data centers with IPv6-based end hosts or VMs. The major improvements include:

   1) Reduced FIB table usage in hardware on access switches and almost no FIB resource usage on aggregation switches. A single cluster can support millions of hosts/VMs.

   2) Elimination of L2 flooding and L3 multicast for NDP packets between access switches.

   3) Reduction in the control plane processing on the access switches.

1.1 Specification of Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2. Terminology

   HRT: Host Route Table in a packet forwarding ASIC

   LPM: Longest Prefix Match Table in a packet forwarding ASIC

   Switch ID: A unique ID for a switch in a DC cluster

   Cluster ID: A unique ID for a DC cluster in a data center

   VRF: Virtual Routing and Forwarding instance

   Switch Subnet (SS): The subnet of a VLAN on an access switch in a data center cluster.

   Switch Subnet Prefix (SSP): An IPv6 prefix assigned to a switch subnet. It consists of the Subnet Prefix, the Cluster ID, and the Switch ID. In a VRF, there can be one SSP per VLAN per access switch.
   Aggregated Switch Subnet Prefix (ASSP): Equal to the SSP excluding the Subnet ID. For better scalability, the SSPs in a VRF on an access switch can be aggregated into a single ASSP. It is used for hardware programming and IPv6 forwarding.

   Cluster Subnet Prefix (CSP): A subnet prefix used for forwarding between DC clusters. It consists of the Subnet Prefix and the Cluster ID.

   DC Cluster Prefix: A common IPv6 prefix used by all hosts/VMs in a DC cluster.

   Subnet ID: The ID for a subnet in a data center. It equals the Subnet Prefix excluding the DC Cluster Prefix.

3. Large Data Center Requirements

   These are the major requirements for large data centers:

      Any subnet, anywhere, any time
      Multi-million hosts/VMs
      Any-to-any communication
      VLANs (aka subnets) that span access switches
      VM mobility
      Control plane scalability
      Easy management, troubleshooting, and debuggability
      Scale-out model

4. Scaling Through Aggregation

   The proposed architecture employs a distributed gateway approach at the access layer. A distributed gateway allows localization of the failure domains as well as distributed processing of ARP, DHCP, and similar messages, thereby allowing a scale-out model without any restriction on host placement (any subnet, anywhere). Forwarding within the same subnet adheres to bridging semantics, while forwarding across subnets is achieved via routing. For communication between end hosts in different subnets below the same access switch, routing is performed locally at that access switch. For communication between end hosts in different subnets on different access switches, routing lookups are performed both on the ingress access switch and on the egress access switch. With distributed subnets and a distributed gateway deployment, host (/128) addresses need to be propagated between the access switches using a routing protocol such as MP-BGP. As the number of hosts in the data center grows, this places a huge burden on the control plane in terms of advertising every single host address prefix. The problem is further exacerbated by the fact that a host can have multiple addresses. Our proposal shows how this problem can be solved via flexible address assignment and intelligent control and data plane processing.

   A Data Center Cluster (DCC) is a data center network that consists of a cluster of aggregation switches and access switches for switching traffic among all servers connected to the access switches in the cluster. A data center can include multiple DCCs. One unique DC Cluster Prefix (DCCP) MUST be assigned to a DCC. The DC Cluster Prefix can be locally unique if the prefix is not exposed to the external Internet, or globally unique otherwise. A public IPv6 address block can be procured from an Internet registry. With the assigned address block, a service provider or enterprise can subdivide the block into multiple prefixes for their networks and Data Center Clusters (DCCs). A DCCP length SHOULD be less than 64 bits. With the bits left between the DCCP and the 64-bit Subnet Prefix boundary, many subnet prefixes can be allocated. All subnet prefixes in the DC cluster SHOULD share the common DCCP.
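   As a purely arithmetic illustration (not part of the protocol), the short sketch below derives /64 subnet prefixes from the /48 DC Cluster Prefix used in the example later in this section; with 16 Subnet ID bits, 65536 subnet prefixes are available in the cluster.

      import ipaddress
      from itertools import islice

      dccp = ipaddress.ip_network("2001:db8:a::/48")   # example DC Cluster Prefix
      subnet_prefixes = dccp.subnets(new_prefix=64)    # 2^(64-48) = 65536 /64 prefixes

      for prefix in islice(subnet_prefixes, 3):
          print(prefix)
      # 2001:db8:a::/64
      # 2001:db8:a:1::/64
      # 2001:db8:a:2::/64
      print(2 ** (64 - dccp.prefixlen), "subnet prefixes available")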
   A new term, the Switch Subnet Prefix (SSP), is introduced in this document and is defined as follows. [RFC4291] defines the 128-bit unicast IPv6 address format, which consists of two portions: the Subnet Prefix and the Interface ID. A 64-bit Subnet Prefix is most common and highly recommended. For this scaling method, we subdivide the Interface ID of the IPv6 address: the first 8 bits are zero, followed by an 8-bit Cluster ID, a 16-bit Switch ID, and a variable-length Host ID (32 bits when a /64 Subnet Prefix is used).

   Interface ID format

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |0 0 0 0 0 0 0 0|   Cluster ID  |           Switch ID           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      .                    Host ID (variable length)                  .
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   An SSP is assigned to a VLAN on an access switch. The SSP includes the Subnet Prefix assigned to the VLAN, the Switch ID of the access switch, and the Cluster ID of the cluster.

   Each access switch MUST have a unique Switch ID in a DC cluster. The Switch ID is assigned by a user or by a management tool. Because the Switch ID is a portion of every IPv6 address assigned to hosts attached to the same access switch, it is recommended to assign Switch IDs with certain characteristics, for example encoding the switch location, so that they can be helpful when troubleshooting traffic-loss issues in large data centers where millions of VMs are hosted. Each cluster MUST have a unique Cluster ID in a data center campus. The Cluster ID is used for routing traffic across DC clusters.

   Switch Subnet Prefix example

      |      48       | 16  | 8 | 8 | 16  |    32   |
      +---------------+-----+---+---+-----+---------+
      |2001:0DB8:000A:|000A:|00:|C5:|0001:|0000:0001|
      +---------------+-----+---+---+-----+---------+

      Cluster ID:            C5
      Switch ID:             1
      VLAN:                  100
      DC Cluster Prefix:     2001:DB8:A::/48
      Subnet ID:             A
      Subnet Prefix:         2001:DB8:A:A::/64
      Cluster Subnet Prefix: 2001:DB8:A:A:C5::/80
      Switch Subnet Prefix:  2001:DB8:A:A:C5:1::/96
      Host Address:          2001:DB8:A:A:C5:1:0:1/128

   In this example, the DC Cluster Prefix 2001:DB8:A::/48 is the common prefix for the cluster. Within the Cluster Prefix block, there is plenty of address space (a 16-bit Subnet ID) available for subnet prefixes. 2001:DB8:A:A::/64 is a subnet prefix assigned to a subnet, which in this example is assigned to VLAN 100. Note that for the purpose of exposition, we assume a 1:1 correspondence between a VLAN and a subnet. However, the proposal imposes no restriction if multiple subnets are assigned to the same VLAN or vice versa. The subnet prefix is for a logical L3 interface/VLAN typically referred to as an Integrated Routing and Bridging (IRB) interface. The subnet or VLAN spans multiple access switches, thereby allowing placement of any host anywhere within the cluster.

   On each access switch, there is one Switch Subnet Prefix (SSP) per subnet or VLAN. 2001:DB8:A:A:C5:1::/96 is the SSP for VLAN 100 on switch 1. It is a combination of the Subnet Prefix, the Cluster ID, and the Switch ID. A host/VM address provisioned to a host/VM connected to this access switch MUST include the SSP associated with the VLAN on the switch. In this example, 2001:DB8:A:A:C5:1:0:1/128 is a host/VM address assigned to a host/VM connected to the access switch.

   Host/VM addresses can be configured using stateful DHCPv6 or other network management tools. In this document, DHCPv6 is chosen to illustrate how IPv6 host addresses are assigned from a DHCPv6 server; similar implementations can be done with other protocols or tools. Section 9 describes how address pools are configured on the DHCPv6 server and how information is exchanged between the switches and the DHCP server using DHCPv6 messages, allowing seamless address assignment based on the proposed scheme. This makes the scheme completely transparent to the end user and relieves the network administrator of address management problems.
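   The worked example above can be reproduced mechanically. The following sketch is only an illustration of the addressing scheme, not an implementation requirement; it packs the Subnet Prefix, the zero bits, the Cluster ID, the Switch ID, and the Host ID into a 128-bit address and prints the resulting host address and SSP.

      import ipaddress

      def build_host_address(subnet_prefix, cluster_id, switch_id, host_id):
          # Interface ID layout: 8 zero bits | Cluster ID (8) | Switch ID (16) | Host ID (32)
          net = ipaddress.IPv6Network(subnet_prefix)
          assert net.prefixlen == 64, "a /64 Subnet Prefix is assumed"
          iid = (cluster_id << 48) | (switch_id << 32) | host_id
          return ipaddress.IPv6Address(int(net.network_address) | iid)

      # Values from the Switch Subnet Prefix example above.
      host = build_host_address("2001:db8:a:a::/64", cluster_id=0xC5, switch_id=0x1, host_id=0x1)
      ssp = ipaddress.IPv6Network(((int(host) >> 32) << 32, 96))   # drop the Host ID bits

      print(host)   # 2001:db8:a:a:c5:1:0:1
      print(ssp)    # 2001:db8:a:a:c5:1::/96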
5. SSP Aggregation

   Typically, a routing domain is identified by a Virtual Routing and Forwarding (VRF) instance. Reachability within a VRF is achieved via regular Layer 3 forwarding or routing. By default, reachability from within a VRF to the outside, as well as vice versa, is restricted; in that sense, a VRF provides isolation for a routing domain. A tenant can be associated with multiple VRFs, and each VRF can be associated with multiple subnets/VLANs. IP addresses can overlap across VRFs, allowing address reuse.

   To simplify implementation, reduce software processing, and improve scalability, all SSPs in a VRF on an access switch can be aggregated into a single Aggregated SSP (ASSP). Only one ASSP is needed per switch for a VRF in a DC cluster. ASSPs simplify processing both in the control plane and in the data plane. Typically, for every subnet instantiated on an access switch, a corresponding subnet prefix needs to be installed in the LPM table pointing to the glean adjacency. With an ASSP, only a single entry needs to be installed in the LPM table, irrespective of the number of subnets instantiated on the access switch. In addition, the same benefit is leveraged at the remote access switches, where a single ASSP needs to be installed for every other access switch, independent of what subnets are instantiated at the remote switches. More details on how this FIB programming is achieved are presented in the next section.

   ASSP entries on an access switch MUST be distributed to all other access switches in a cluster through a routing protocol such as BGP. When an ASSP entry is learned through a routing protocol, an LPM entry SHOULD be installed. Because of its better scalability in large data center environments (a BGP Route Reflector can be used to reduce the number of peers a BGP node communicates with), BGP is recommended for this forwarding model. In this document, we describe how BGP can be used for ASSP and CSP distribution. A new BGP Opaque Extended Community is specified in Section 10 for this solution.

   As mentioned earlier, in modern data centers, overlay networks are typically used for forwarding data traffic between access switches. On aggregation switches, a very small number of FIB entries are needed for underlay reachability, since the aggregation switches are oblivious to the tenant host addresses. Aggregation switching platforms can therefore be designed to be simple, low latency, high port density, and low cost.

   ASSP entries programmed in the LPM table are used for forwarding data traffic between access switches. The rewrite information in the corresponding next-hop (or adjacency) entry SHOULD include the information needed to forward packets to the egress access switch corresponding to the Switch ID. One local ASSP FIB CAM entry is also needed; the rewrite information in its corresponding next-hop (or adjacency) entry SHOULD include the information needed to punt packets to the local processor. This local FIB CAM entry is used for triggering address resolution when a destination host/VM is not yet in the IPv6 Neighbor Cache (equivalent to a glean entry).

   Host/VM addresses (/128) discovered through the IPv6 ND protocol are installed in the Host Route Table (HRT) on the access switch, and only on that access switch. Host routes learned through a routing protocol MUST NOT be programmed into the HRT table in hardware. Note that an exception can occur if a VM moves across an access switch boundary; such moves require special handling that will be discussed in a separate draft on VM mobility.

   An IPv6 unicast data packet from a host/VM connected to an ingress switch, destined to another host on an egress switch, is forwarded in the following steps (illustrated in the sketch at the end of this section):

   1) The packet arrives at the ingress switch;

   2) An L3 lookup in the FIB (LPM) CAM table hits an entry because the packet's destination address includes the Switch Subnet Prefix;

   3) The packet is forwarded to the egress switch based on the FIB CAM entry and the corresponding adjacency entry;

   4) The packet is forwarded to its destination host by the egress switch.

   For forwarding packets outside of the DC cluster, a default route ::/0 SHOULD be installed in the FIB CAM that routes packets to one of the DC edge routers/switches, which provide reachability both to other data center sites and to the external world (Internet).

   To summarize, in this forwarding model only local host/VM routes are installed in the HRT table, which greatly reduces the number of HRT entries required at an access switch. ASSP routes are installed in the LPM table for forwarding traffic between access switches. Because ASSPs are independent of subnets/VLANs, the total number of LPM entries required is also greatly reduced. These reduced requirements on the HRT and LPM tables allow access switches with much smaller hardware FIB tables to support a very large number of VMs.

   A similar forwarding model SHOULD be implemented in software. For example, if the special mask discussed in Section 6 is used, then when forwarding an IPv6 packet in an SSP-enabled VRF, the SSP subnet bits can be masked to 0s when doing the lookup in the software FIB. If this results in a match with an ASSP entry, the packet is forwarded to the egress access switch using the adjacency attached to the ASSP.
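   The following sketch is a simplified software model of the per-switch state this section describes, not ASIC code: one local ASSP entry that punts to the CPU, one ASSP entry per remote access switch, and a default route toward a DC edge switch. The switch names and the bit layout (taken from the Section 4 example) are illustrative assumptions.

      import ipaddress

      # Assumed layout from Section 4: /48 DCCP, 16-bit Subnet ID, 8 zero bits,
      # 8-bit Cluster ID, 16-bit Switch ID, 32-bit Host ID.
      SUBNET_MASK = 0xFFFF << 64          # bits occupied by the Subnet ID
      HOST_MASK = (1 << 32) - 1           # bits occupied by the Host ID

      def assp_key(addr):
          """Zero out the Subnet ID and Host ID bits, leaving DCCP + Cluster ID + Switch ID."""
          return int(ipaddress.IPv6Address(addr)) & ~(SUBNET_MASK | HOST_MASK)

      # LPM-equivalent state on access switch 0x0001 (Cluster ID 0xC5):
      local_assp = assp_key("2001:db8:a:0:c5:1::")          # punt/glean for local subnets
      remote_assp = {assp_key("2001:db8:a:0:c5:2::"): "switch-2",
                     assp_key("2001:db8:a:0:c5:3::"): "switch-3"}

      def forward(dst):
          key = assp_key(dst)
          if key == local_assp:
              return "punt to CPU / local HRT"            # trigger ND if host unknown
          if key in remote_assp:
              return "forward to " + remote_assp[key]     # egress access switch
          return "default route to DC edge"

      print(forward("2001:db8:a:a:c5:2:0:9"))    # forward to switch-2, whatever the subnet
      print(forward("2001:db8:a:ff:c5:1:0:1"))   # punt to CPU / local HRT
      print(forward("2001:4860::1"))             # default route to DC edge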
6. Programming in FIB CAM with Special Mask

   Typically, a FIB lookup requires a longest prefix match (LPM), for which a CAM is utilized. A CAM in an ASIC is implemented with value bits and mask bits for each of its entries. The value bits are the bit values (0 or 1) stored in memory that the L3 forwarding lookup compares against a lookup key; the lookup key is typically (VPN ID corresponding to the VRF, destination IP address) of the data packet to be forwarded. The mask bits are used to include or exclude each bit of the value field of a CAM entry when deciding whether a match has occurred. Mask bit = 1, or mask-in, means include the value bit; mask bit = 0, or mask-out, means exclude the value bit, i.e., it is a DON'T-CARE (the corresponding key bit can be 1 or 0).

   When programming the FIB CAM for all Switch Subnet Prefixes from an access switch, only one entry is installed in the FIB CAM per destination access switch: mask in all DC Cluster Prefix bits, mask out all bits after the DC Cluster Prefix and before the Cluster ID bits, mask in both the Cluster ID bits and the Switch ID bits, and mask out the remaining bits. For example:

      DC Cluster Prefix:  2001:0DB8:000A::/48
      Cluster ID:         0xC5
      Switch ID (hex):    0x1234

      FIB CAM programming
      Value: 2001:0DB8:000A:0000:00C5:1234:0000:0000
      Mask:  FFFF:FFFF:FFFF:0000:00FF:FFFF:0000:0000

   With one such FIB CAM entry, the switch can match all Switch Subnet Prefixes that include the DC Cluster Prefix 2001:0DB8:000A::/48, Cluster ID 0xC5, and Switch ID 0x1234, no matter what values the bits between the DC Cluster Prefix and the Cluster ID take. This means that only a single FIB CAM entry is needed for all packets destined to hosts connected to a given switch, no matter what subnet prefixes are configured on the VLANs of that switch. On a given switch, one FIB CAM entry is required for each of the other access switches in the DC cluster.
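   The value/mask behavior described above can be checked with a few lines of Python; this is only a simulation of the CAM comparison using the numbers from the example, not vendor ASIC code.

      import ipaddress

      def cam_match(dst, value, mask):
          """A CAM entry matches when all mask-in (1) bits of the key equal the value bits."""
          to_int = lambda a: int(ipaddress.IPv6Address(a))
          return (to_int(dst) & to_int(mask)) == (to_int(value) & to_int(mask))

      value = "2001:0db8:000a:0000:00c5:1234:0000:0000"
      mask = "ffff:ffff:ffff:0000:00ff:ffff:0000:0000"

      # Any subnet under 2001:db8:a::/48 behind Cluster 0xC5 / Switch 0x1234 matches:
      print(cam_match("2001:db8:a:1:c5:1234:0:1", value, mask))       # True
      print(cam_match("2001:db8:a:ffff:c5:1234:abcd:1", value, mask)) # True
      # A host behind a different switch does not:
      print(cam_match("2001:db8:a:1:c5:99:0:1", value, mask))         # False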
   In case the LPM table is not implemented as a true CAM but instead as an algorithmic CAM, as is the case with some ASICs, an alternative approach can be employed: set all subnet bits to 0s when programming an ASSP entry in the LPM table, and likewise clear the subnet bits of the key when doing a lookup in the LPM table. This approach requires certain changes in the lookup logic of the ASIC.

   Note that the above explanation applies on a per-VRF basis, since the FIB lookup is always based on (VRF, destination IP). For example, in a data center with 100 access switches, if a VRF spans 10 access switches, then the number of LPM entries on those 10 access switches for this VRF is equal to 10 (1 local entry plus 9 entries, one for each of the remote switches). Section 11 provides additional details on scalability in terms of the number of entries required to support a large multi-tenant data center with millions of VMs.

7. VM Mobility

   VM mobility will be discussed in a separate IETF draft.

8. Scaling Neighbor Discovery

   Another major issue with the traditional forwarding model is the scalability of processing Neighbor Discovery protocol (NDP) messages. In a data center cluster with a large number of VLANs, many of which span multiple access switches, the volume of NDP messages handled in software on an access switch can be huge and can easily overwhelm the CPU. In addition, the large number of entries in the neighbor cache on an access switch can cause the HRT table to overflow. In our proposed forwarding model, Neighbor Discovery can be distributed to the access switches as described below. Please note that the descriptions in this section apply only to ND operation for global unicast targets; no ND operation change is required for link-local targets.

   All NDP messages from hosts/VMs are restricted to the local access switch. Multicast NDP messages are flooded to all local switch ports on a VLAN and also copied to the local CPU. They SHOULD NOT be sent on the link(s) connected to aggregation switches.

   When a multicast NS message is received, if its target matches the local ASSP, the message can be ignored, because the target host/VM is also locally attached to the access switch and will reply to the NS itself; otherwise, a unicast NA message MUST be sent by the switch with the link-layer address set to the switch's MAC (aka Router MAC). This decision is summarized in the sketch at the end of this section.

   When a unicast data packet is received, if the destination address belongs to a remote switch, it matches the ASSP for that remote switch in the FIB table and is forwarded to the remote switch. On the remote switch, if the destination host/VM has not yet been discovered, the data packet is punted to the CPU and an ND exchange is triggered for host discovery in software.

   The distributed ND model can substantially reduce software processing on the CPU. It also takes much less space in the hardware HRT table. Most importantly, there is no flooding in either L2 or L3. Flooding is a major concern in large data centers, so it SHOULD be avoided as much as possible.

   A subnet prefix and a unique address are configured on an L3 logical interface on an access switch. When the L3 logical interface has member ports on multiple switches, the same subnet prefix and address MUST be configured on the L3 logical interface on all those switches. ND operation on hosts/VMs remains the same, without any change.
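   A minimal sketch of the per-NS decision described above, for global unicast targets only; the Router MAC value is a placeholder, and the Interface ID layout is reused from the Section 4 example.

      import ipaddress

      # Strip the Subnet ID and Host ID bits, as in Section 4.
      WILDCARD = (0xFFFF << 64) | ((1 << 32) - 1)

      def assp_key(addr):
          return int(ipaddress.IPv6Address(addr)) & ~WILDCARD

      LOCAL_ASSP = assp_key("2001:db8:a:0:c5:1::")   # this access switch (0xC5 / 0x0001)
      SWITCH_MAC = "0a:00:c5:00:00:01"               # placeholder Router MAC

      def on_multicast_ns(target):
          """Handle a multicast Neighbor Solicitation received from a local host/VM."""
          if assp_key(target) == LOCAL_ASSP:
              # Target lives on this switch: let the target host/VM answer itself.
              return "ignore"
          # Target is behind some other access switch: proxy-reply with the Router MAC
          # and never flood the NS toward the aggregation layer.
          return "send unicast NA with link-layer address " + SWITCH_MAC

      print(on_multicast_ns("2001:db8:a:a:c5:1:0:2"))   # ignore (local target)
      print(on_multicast_ns("2001:db8:a:a:c5:2:0:7"))   # proxy NA (remote target)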
9. DHCPv6

   This section describes the host address assignment model with the DHCPv6 protocol. Similar implementations can be done with other protocols and management tools.

   A DHCPv6 Relay Agent [RFC3315] SHOULD be supported on access switches for this address assignment proposal. [draft-ietf-dhc-topo-conf-04] gives recommendations for real-world DHCPv6 Relay Agent deployments. For the forwarding model described in this document, the method of using the link-address described in Section 3.2 of [draft-ietf-dhc-topo-conf-04] SHOULD be implemented as follows:

   The Switch Subnet Prefix (SSP) for the subnet on the switch SHOULD be used as the link-address in the Relay-Forward message sent from the switch. On the DHCPv6 server, the link-address is used to identify the link. A prefix or address range should be configured on the DHCPv6 server for the link, and that prefix or address range MUST match the SSP on the switch. This guarantees that addresses assigned by the DHCPv6 server always include the SSP for the interface on the switch.

   The number of SSP address pools on the DHCP server could be very large. This can be alleviated by employing a cluster of DHCP servers to ensure effective load distribution of client DHCPv6 requests.
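   A rough sketch of the server-side pool selection implied above: the relay on the access switch sets the link-address to the SSP for the client-facing subnet, and the server picks the pool whose configured prefix contains that link-address. The pool names and the second prefix are illustrative assumptions, not values defined by this document.

      import ipaddress

      # Address pools configured on the DHCPv6 server, one per SSP (/96).
      POOLS = {
          ipaddress.IPv6Network("2001:db8:a:a:c5:1::/96"): "pool-vlan100-switch1",
          ipaddress.IPv6Network("2001:db8:a:a:c5:2::/96"): "pool-vlan100-switch2",
      }

      def select_pool(link_address):
          """Pick the pool for the link identified by the relay's link-address."""
          addr = ipaddress.IPv6Address(link_address)
          for prefix, pool in POOLS.items():
              if addr in prefix:
                  return pool
          return None

      # The relay on switch 1 uses its SSP for VLAN 100 as the link-address:
      print(select_pool("2001:db8:a:a:c5:1::"))   # pool-vlan100-switch1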
10. BGP

   As mentioned earlier, ASSP entries are redistributed to all access switches through BGP. ASSP entries learned from BGP are inserted into the RIB; they are used for FIB CAM programming in hardware and for IPv6 forwarding in software. In this document, we define a BGP Opaque Extended Community that can be attached to BGP UPDATE messages to indicate the type of routes advertised in those messages. This is the IPv6 Route Type Community [RFC4360], with the following encoding:

      +-------------------------------------+
      | Type 0x3 or 0x43 (1 octet)          |
      +-------------------------------------+
      | Sub-type 0xE (1 octet)              |
      +-------------------------------------+
      | Route Type (1 octet)                |
      +-------------------------------------+
      | Subnet ID Length (1 octet)          |
      +-------------------------------------+
      | Reserved (4 octets)                 |
      +-------------------------------------+

   Type Field:

      The value of the high-order octet of this Opaque Extended Community is 0x03 or 0x43. The value of the low-order octet of the extended type field for this community is 0x0E (or another value allocated by IANA).

   Value Field:

      The 6-octet Value field contains three distinct sub-fields, described below.

      The Route Type sub-field defines the type of IPv6 routes carried in this BGP message. The following values are defined:

         1: ASSP_Route indicates that the routes carried in this BGP UPDATE message are ASSP routes.

         2: CSP_Route indicates that the routes carried in this BGP UPDATE message are CSP routes.

      The Subnet ID Length sub-field specifies the number of Subnet ID bits in an ASSP route. Those bits can be ignored in the FIB lookup, either via the special mask when a FIB lookup CAM is used or via the alternative approach described in Section 6. This field is only used when the Route Type is ASSP_Route.

      The 4-octet Reserved sub-field is for future use.

   The IPv6 Route Type Community does not need to be carried in BGP Withdraw messages. All operations SHOULD follow [RFC4360]; there is no exception for this forwarding model.
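   The 8-octet community defined above can be encoded directly. The sketch below packs and unpacks it with the field values from this section; it assumes the 0x03/0x0E code points shown here, which are subject to IANA allocation, and a 16-bit Subnet ID as in the Section 4 example.

      import struct

      ASSP_ROUTE, CSP_ROUTE = 1, 2

      def encode_route_type_community(route_type, subnet_id_len=0,
                                      type_field=0x03, sub_type=0x0E):
          """Type (1) | Sub-type (1) | Route Type (1) | Subnet ID Length (1) | Reserved (4)."""
          return struct.pack("!BBBB4s", type_field, sub_type, route_type,
                             subnet_id_len, b"\x00" * 4)

      def decode_route_type_community(data):
          type_field, sub_type, route_type, subnet_id_len, _ = struct.unpack("!BBBB4s", data)
          return type_field, sub_type, route_type, subnet_id_len

      community = encode_route_type_community(ASSP_ROUTE, subnet_id_len=16)
      print(community.hex())                          # 030e011000000000
      print(decode_route_type_community(community))   # (3, 14, 1, 16)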
11. Scalability

   With this forwarding model, the scalability of a data center switching system is improved significantly while still allowing any-to-any communication between all hosts, with no restriction on host placement or host mobility.

   FIB TCAM utilization on an access switch becomes independent of the number of VLANs/subnets instantiated on that switch. It is important to note that the number of host prefix routes (/128) depends only on the number of VMs that are local to an access switch. A network administrator can add as many access switches as needed with the same network design and never worry about running out of FIB HRT resources. This greatly simplifies network design for large data centers.

   The total number of VMs that can be supported in a data center cluster can be calculated as follows (assuming a single VRF):

   Number of LPM entries: Only one LPM entry per access switch is required for the local ASSP. The total number of LPM entries on an access switch is equal to the total number of access switches in the DC cluster plus 1 (for the default route).

   Number of HRT entries: There is one HRT entry for each directly connected host/VM.

   Scalability calculation

      H: max number of HRT entries
      V: number of VMs per port
      P: number of ports per access switch

      H = V x P

      For example, with 48 ports per access switch and 128 VMs per port:
      H = 48 x 128 = 6k HRT entries per access switch

      T: total number of hosts/VMs
      L: number of access switches

      T = H x L

      Example: with 200 access switches, 1.2 million (6k x 200) VMs can be supported in a large data center cluster.

12. DC Edge Router/Switch

   Multiple data center clusters can be interconnected with DC edge routers/switches. The same subnet can span multiple data center clusters. While each subnet has a unique subnet prefix, each cluster into which that subnet extends has a unique Cluster Subnet Prefix. This prefix is advertised over BGP to the edge routers, which in turn attract traffic for hosts that are part of that subnet in a given cluster. Again, the procedure for handling host mobility across clusters will be described separately in a different draft.

12.1 DC Cluster Interconnect

   This section describes a way to support a VLAN that spans DC clusters in this forwarding model.

   Subnet prefixes SHOULD be advertised by a routing protocol within a DC cluster, but subnet prefixes SHOULD NOT be installed in the hardware FIB table.

   On a DC edge router/switch, Cluster Subnet Prefixes (CSPs) can be configured, or auto-generated if SSP is enabled. A CSP is a special prefix used at a DC edge router/switch to forward traffic between directly connected DC clusters (see Section 2 for the definition and Section 4 for an example). There SHOULD be one CSP per subnet. A CSP SHOULD be advertised through a routing protocol between the DC edge routers/switches that connect the DC clusters. In Section 10, a BGP extended community is defined for advertising CSP routes. CSP routes SHOULD NOT be advertised into a DC cluster.

   CSP route messages SHOULD be handled as follows: on the CSP-originating DC edge router/switch, the CSP SHOULD NOT be installed in the hardware FIB table; on the receiving DC edge router/switch, the CSP SHOULD be installed in the hardware FIB table. All bits between the DCCP and the Cluster ID MUST be masked out if the special mask scheme can be implemented, or set to 0s if a FIB key mask is not supported.

   Because CSPs consume FIB CAM space, the user SHOULD determine whether there are enough FIB CAM resources on the DC edge router/switch before enabling this feature.

13. Multiple VRFs and Multiple Tenancies

   For flexibility, an implementation can let the user enable or disable this feature at the VRF level on one or more access switches. When the feature is enabled in a VRF, all functionality described in this document SHOULD be applied to that VRF on all those access switches. No behavior changes SHOULD occur in other VRFs that do not have this feature enabled.

   Multi-tenancy can be supported by employing multiple VRFs. A tenant can be allocated one or more VRFs.

13.1 Resource Allocation and Scalability with VRFs

   To support more VRFs in a DC cluster, a DC network administrator can enable this feature for a VRF on only a few access switches in the cluster. The maximum number of VRFs can be calculated with this formula:

      L: number of LPM entries
      V: number of VRFs
      P: number of access switches per VRF (average)

      L = V x (P + 1), or V = L / (P + 1)

      Example: with 8k LPM entries available per access switch and on average 9 access switches allocated per VRF, the number of VRFs that can be supported is V = 8000 / (9 + 1) = 800.

   More VRFs can be supported if the number of access switches per VRF is decreased. To support a large number of VRFs or tenants, larger LPM tables MAY be required; that SHOULD be considered at the ASIC design phase.
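   The arithmetic in Sections 11 and 13.1 is easy to verify; the short sketch below merely reproduces the 1.2 million VM and 800 VRF figures from the examples.

      # Section 11: hosts/VMs per cluster (single VRF).
      vms_per_port, ports_per_switch, access_switches = 128, 48, 200
      hrt_entries = vms_per_port * ports_per_switch       # H = V x P = 6144 (~6k)
      total_vms = hrt_entries * access_switches           # T = H x L = 1,228,800 (~1.2M)

      # Section 13.1: VRFs supportable per access switch.
      lpm_entries, switches_per_vrf = 8000, 9
      vrfs = lpm_entries // (switches_per_vrf + 1)         # V = L / (P + 1) = 800

      print(hrt_entries, total_vms, vrfs)   # 6144 1228800 800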
14. Security

   No new security threats are expected to be introduced by this proposal.

15. References

   [TCAM]     Kasnavi, S., Gaudet, V. C., Berube, P., and J. N. Amaral, "A Hardware-Based Longest Prefix Matching Scheme for TCAMs", IEEE, 2005.

   [OVERLAYS] Hooda, S., Kapadia, S., and P. Krishnan, "Using TRILL, FabricPath, and VXLAN: Designing Massively Scalable Data Centers (MSDC) with Overlays", ISBN 978-1587143939, 2014.

   [RFC2119]  "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC4291]  "IP Version 6 Addressing Architecture", RFC 4291, February 2006.

   [RFC4861]  "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, September 2007.

   [RFC3315]  "Dynamic Host Configuration Protocol for IPv6 (DHCPv6)", RFC 3315, July 2003.

   [draft-ietf-dhc-topo-conf-04]  "Customizing DHCP Configuration on the Basis of Network Topology", Work in Progress, draft-ietf-dhc-topo-conf-04.

   [RFC4271]  "A Border Gateway Protocol 4 (BGP-4)", RFC 4271, January 2006.

   [RFC4360]  "BGP Extended Communities Attribute", RFC 4360, February 2006.

Authors' Addresses

   Ming Zhang
   Cisco Systems
   170 West Tasman Dr
   San Jose, CA 95134
   USA

   Phone: +1 408 853 2419
   EMail: mzhang@cisco.com


   Shyam Kapadia
   Cisco Systems
   170 West Tasman Dr
   San Jose, CA 95134
   USA

   Phone: +1 408 527 8228
   EMail: shkapadi@cisco.com


   Liqin Dong
   Cisco Systems
   170 West Tasman Dr
   San Jose, CA 95134
   USA

   Phone: +1 408 527 1532
   EMail: liqin@cisco.com