
IPv6 maintenance Working Group (6man)                          M. Zhang
Internet-Draft                                               S. Kapadia  
Intended status: Standards Track                                L. Dong
Expires: September 10, 2015                               Cisco Systems  
                                                          March 9, 2015

													   
   Improving Scalability of Switching Systems in Large Data Centers
              draft-zhang-6man-scale-large-datacenter-00


Abstract

   Server virtualization has been widely adopted, especially in
   cloud-based data centers. Accompanied by the expansion of services
   and by technology advancements, the size of a data center has
   increased significantly. There could be hundreds or thousands of
   physical servers installed in a single large data center, which
   implies that the number of Virtual Machines (VMs) could be on the
   order of millions. Effectively supporting millions of VMs with
   limited hardware resources becomes a real challenge for networking
   vendors. This document describes a method to scale a switching
   system with limited hardware resources using IPv6 in large data
   center environments.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.  
   The list of current Internet-Drafts is at 
   http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 10, 2015.
   


Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   
Ming, et al.          Expires  September 10, 2015               [Page 1]

Internet-Draft    Scalability of Switching Systems in DC      March 2015

   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   
   
Table of Contents

Abstract  .........................................................   1
1. Introduction ...................................................   2
  1.1 Specification of Requirements ...............................   4
2. Terminology ....................................................   4
3. Large Data Center Requirements .................................   5
4. Scaling Through Aggregation ....................................   5
5. SSP Aggregation ................................................   8
6. Programming in FIB CAM with Special Mask .......................   9
7. VM Mobility ....................................................  11
8. Scaling Neighbor Discovery .....................................  11
9. DHCPv6 .........................................................  12
10. BGP ...........................................................  12
11. Scalability ...................................................  13
12. DC edge router/switch .........................................  14
  12.1 DC Cluster Interconnect ....................................  14
13. Multiple VRFs and Multiple Tenancies ..........................  15
  13.1 Resource Allocation and Scalability with VRFs ..............  15
14. Security ......................................................  16
15. References ....................................................  16
Authors' Addresses ................................................  16


1. Introduction

   Server virtualization is extremely common in large data centers and
   is realized with a large number of Virtual Machines (VMs) or
   containers. Typically, multiple VMs share the resources of a
   physical server. Accompanied by the expansion of services and by
   technology advancements, the size of a data center has increased
   significantly. There could be hundreds or thousands of physical
   servers in a single large data center, which implies that the number
   of VMs could be on the order of millions. Such a large number of VMs
   poses a challenge to network equipment providers: how to effectively
   support millions of VMs with limited hardware resources.

   The Clos-based spine-leaf topology has become the de facto standard
   for data center deployments. A typical data center topology consists
   of two tiers of switches: an Aggregation (spine) tier and an
   Access/Edge (leaf) tier.

   Figure 1 depicts a two-tier network topology in a data center
   cluster. S1 to Sn are spine switches. L1 to Lm are leaf switches.
   Every leaf switch has at least one direct connection to every
   spine switch. H1 to Hz are hosts/VMs attached to leaf switches
   directly or indirectly through L2 switches. E1 is an edge
   router/switch. Multiple data center clusters are interconnected by
   edge routers/switches.
 
Ming, et al.          Expires  September 10, 2015               [Page 2]

Internet-Draft    Scalability of Switching Systems in DC      March 2015


                  +---+      +---+               +---+
                  |S1 |      |S2 |    ...        |Sn |
                  +-+-+      +-+-+               +-+-+
                    |          |                   |
               +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+
               |            Link Connections            |
               |  Every Spine connects to every Leaf    |
               +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+
                |          |              |            |
              +-+-+      +-+-+          +-+-+        +-+-+     connect
              |L1 |      |L2 |  ...     |Lm |        |E1 +-->> to other
              +---+      +---+          +---+        +---+     cluster
              /  \          \             |
             /    \          \            |
          +-+-+  +-+-+     +-+-+        +-+-+        
          |H1 |  |H2 |     |H3 |  ...   |Hz |        
          +---+  +---+     +---+        +---+        
	
         Figure 1: Typical two-tier network topology in a DC cluster

   Switches at the aggregation tier are large, expensive entities with
   many ports that interconnect multiple access switches and provide
   fast switching between access switches. Switches at the access tier
   are low-cost, low-latency, smaller switches that are connected to
   physical servers for switching traffic among local servers and
   servers connected to other access switches through the aggregation
   switches. To maximize profit, low-cost, low-latency ASICs, most
   commonly Systems-on-Chip (SoCs), are selected when designing access
   switches. In these types of ASICs, the Layer 3 hardware Forwarding
   Information Base (FIB) table is typically split into two tables:
   1) a Host Route Table (HRT) for host routes (/32 for IPv4 host
   routes and /128 for IPv6 host routes) that is typically implemented
   as a hash table; 2) a Longest Prefix Match (LPM) table for prefix
   routes. Due to the high cost of implementing a large LPM table in an
   ASIC, either with traditional Ternary Content Addressable Memory
   [TCAM] or other alternatives, the LPM table size in hardware is
   restricted to a few tens of thousands of entries (from 16k to 64k
   for IPv4) on access switches. Note that because an IPv6 address is
   four times as long as an IPv4 address, the effective number of FIB
   LPM entries available for IPv6 is essentially one-fourth (or
   one-half, depending on the width of the LPM entry). Note also that
   the same tables need to be shared by all IPv4, IPv6, unicast, and
   multicast traffic.

   For years, people have been looking for solutions for super-scale
   data centers, but there have been no major breakthroughs.
   Overlay-based [OVERLAYS] approaches using VXLAN, TRILL, FabricPath,
   LISP, etc. have certainly allowed for separation of the tenant
   end-host address space from the topology address space, thereby
   providing a level of indirection that aids scalability by reducing
   the requirements on the aggregation switches. However, the scale
   requirements on the access switches still remain high since they
   need to be aware of all the
 
Ming, et al.          Expires  September 10, 2015               [Page 3]

Internet-Draft    Scalability of Switching Systems in DC      March 2015

   tenant end-host addresses to support the any-to-any communication
   requirement in large data centers (both East-West and North-South
   traffic).

   With Software-Defined Networking (SDN) controllers gaining a lot of
   traction, there has been a push toward a God-box-like model in which
   all the information about all the end hosts is known centrally. In
   this model, if an access switch does not know how to reach the
   destination of an incoming packet, it queries the God-box and
   locally caches the information (the vanilla OpenFlow model). The
   inherent latency of this approach, together with the single point of
   failure presented by the centralized model, means that such systems
   will not scale beyond a point. Alternatively, the access switch can
   forward unknown traffic toward a set of Oracle boxes (typically one
   or more aggregation switches with huge tables that know about all
   end-hosts), which in turn take care of forwarding the traffic to its
   destination. As scale increases, throwing more silicon at the
   solution is never a good idea; the cost of building such large
   systems would be prohibitively high, making it impractical to deploy
   them in the field.

   This document describes an approach to improve the scalability of
   switching systems for large data centers with IPv6-based end-hosts
   or VMs. The major improvements are: 1) reduced FIB table usage in
   hardware on access switches and almost no FIB resource allocation on
   aggregation switches, so that a single cluster can support multiple
   millions of hosts/VMs; 2) elimination of L2 flooding and L3
   multicast for NDP packets between access switches; 3) a reduction in
   control plane processing on the access switches.


1.1 Specification of Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 
   document are to be interpreted as described in RFC 2119 [RFC2119].


2. Terminology

   HRT:
     Host Route Table in packet forwarding ASIC

   LPM:
     Longest Prefix Match Table in packet forwarding ASIC

   Switch ID:
     A unique ID for a switch in a DC cluster

   Cluster ID:
     A unique ID for a DC cluster in a data center

   VRF:
     Virtual Routing and Forwarding Instance
 
Ming, et al.          Expires  September 10, 2015               [Page 4]

Internet-Draft    Scalability of Switching Systems in DC      March 2015

   Switch Subnet (SS):
    Subnet of a VLAN on an access switch in a data center cluster.  

   Switch Subnet Prefix (SSP): 
     An IPv6 prefix assigned to a switch subnet. It consists of Subnet 
     Prefix, Cluster ID, and Switch ID. In a VRF, there could be one SSP
     per VLAN per access switch.

   Aggregated Switch Subnet Prefix (ASSP):
     It is equal to the SSP excluding the Subnet ID. For better
     scalability, the SSPs in a VRF on an access switch can be
     aggregated into a single ASSP. It is used for hardware programming
     and IPv6 forwarding.

   Cluster Subnet Prefix (CSP):
     A subnet prefix used for forwarding between DC clusters. It
     consists of the Subnet Prefix and the Cluster ID.

   DC Cluster Prefix: 
     A common IPv6 prefix used by all hosts/VMs in a DC Cluster. 

   Subnet ID:
     The ID for a subnet in a data center. It is equal to the Subnet
     Prefix excluding the DC Cluster Prefix.

	
3. Large Data Center Requirements

   These are the major requirements for large data centers:

     Any subnet, anywhere, any time
     Multi-million hosts/VMs
     Any to Any communication 
     VLANs (aka subnets) span across access switches
     VM Mobility
     Control plane scalability
     Easy management, trouble-shooting, debug-ability
     Scale-out model


4. Scaling Through Aggregation

   The proposed architecture employs a distributed gateway approach at
   the access layer. A distributed gateway allows localization of
   failure domains as well as distributed processing of ARP/ND, DHCP,
   etc. messages, thereby allowing for a scale-out model without any
   restriction on host placement (any subnet, anywhere). Forwarding
   within the same subnet adheres to bridging semantics, while
   forwarding across subnets is achieved via routing. For communication
   between end-hosts in different subnets below the same access switch,
   routing is performed locally at that access switch. For
   communication between end-hosts in different subnets on different
   access switches, routing lookups are performed on both the ingress
   access switch and the egress access switch. With distributed subnets
   and a distributed gateway deployment, host (/128)
 
Ming, et al.          Expires  September 10, 2015               [Page 5]

Internet-Draft    Scalability of Switching Systems in DC      March 2015

   addresses need to be propagated between the access switches using a
   routing protocol such as MP-BGP. As the number of hosts in the data
   center grows, this places a huge burden on the control plane in
   terms of advertising every single host address prefix. The problem
   is further exacerbated by the fact that a host can have multiple
   addresses. Our proposal describes how this problem can be solved via
   flexible address assignment and intelligent control plane and data
   plane processing.

   A Data Center Cluster (DCC) is a data center network that consists
   of a cluster of aggregation switches and access switches for
   switching traffic among all servers connected to the access switches
   in the cluster. A data center can include multiple DCCs. One unique
   DC Cluster Prefix (DCCP) MUST be assigned to each DCC. The DC
   Cluster Prefix can be locally unique if the prefix is not exposed to
   the external Internet, or globally unique otherwise.

   A public IPv6 address block can be procured from a Regional Internet
   Registry. With the assigned address block, a service provider or
   enterprise can subdivide the block into multiple prefixes for its
   networks and Data Center Clusters (DCCs). A DCCP length SHOULD be
   less than 64 bits. With the bits left between the DCCP and the
   64-bit Subnet Prefix boundary, many subnet prefixes can be
   allocated. All subnet prefixes in the DC cluster SHOULD share the
   common DCCP.


   A new term, the Switch Subnet Prefix (SSP), is introduced in this
   document and is defined as follows:

   [RFC 4291] defines the 128-bit unicast IPv6 address format. It
   consists of two portions: the Subnet Prefix and the Interface ID. A
   64-bit Subnet Prefix is most common and highly recommended. For this
   scaling method, we subdivide the Interface ID of the IPv6 address:
   8 zero bits, 8 bits for the Cluster ID, 16 bits for the Switch ID,
   and N bits for the Host ID, as shown below.

   Interface ID format

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |0 0 0 0 0 0 0 0|   Cluster ID  |      Switch ID                | 
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      .                Host ID (variable length)                      .
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      


   An SSP is assigned to a VLAN on an access switch. The SSP includes
   the Subnet Prefix assigned to the VLAN, the Switch ID of the access
   switch, and the Cluster ID of the cluster.

   Each access switch MUST have a unique Switch ID in a DC cluster. The
   Switch ID is assigned by a user or by a management tool. Because the
 
Ming, et al.          Expires  September 10, 2015               [Page 6]

Internet-Draft    Scalability of Switching Systems in DC      March 2015

   Switch ID is a portion of the IPv6 address of every host address
   assigned to hosts attached to the same access switch, it is
   recommended to assign Switch IDs with certain characteristics, for
   example encoding the switch's location, so that they can be helpful
   when troubleshooting traffic-loss issues in large data centers where
   millions of VMs are hosted.

   Each cluster MUST have a unique Cluster ID within a data center
   campus. The Cluster ID is used for routing traffic across DC
   clusters.


   Switch Subnet Prefix Example

   |       48      | 16  | 8 | 8 | 16  |   32    |
   +---------------+-----+---+---+-----+---------+
   |2001:0DB8:000A:|000A:|00:|C5:|0001:|0000:0001|
   +---------------+-----+---+---+-----+---------+


   Cluster ID:                  C5
   Switch ID:                   1
   VLAN:                        100     
   DC Cluster Prefix:           2001:DB8:A::/48 
   Subnet ID:                   A
   Subnet Prefix:               2001:DB8:A:A::/64
   Cluster Subnet Prefix:       2001:DB8:A:A:C5::/80
   Switch Subnet Prefix:        2001:DB8:A:A:C5:1::/96
   Host Address:                2001:DB8:A:A:C5:1:0:1/128


   In this example, the DC Cluster Prefix 2001:DB8:A::/48 is a common 
   prefix for the cluster. From the Cluster Prefix block, there is plenty 
   of address space (16 bits Subnet ID) available for subnet prefixes. 
   2001:DB8:A:A::/64 is a subnet prefix assigned to a subnet in this 
   example that is assigned to VLAN 100. Note that for the purpose of 
   exposition, we assume a 1:1 correspondence between a VLAN and a subnet. 
   However, the proposal does not have any restriction if multiple subnets 
   are assigned to the same VLAN or vice-versa. The subnet prefix is for a 
   logical L3 interface/VLAN typically referred to as an Integrated Routing
   and Bridging (IRB) interface. The subnet or VLAN spans across multiple 
   access switches thereby allowing placement of any host any where within 
   the cluster. On each access switch, there is a Switch Subnet Prefix
   (SSP) per subnet or VLAN. 2001:DB8:A:A:C5:1::/96 is the SSP for VLAN
   100 on switch 1. It is the combination of the Subnet Prefix, the
   Cluster ID, and the Switch ID. A host/VM address provisioned to a
   host/VM connected to this access switch MUST include the SSP
   associated with the VLAN on the switch. In this example,
   2001:DB8:A:A:C5:1:0:1/128 is a host/VM address assigned to a host/VM
   connected to this access switch.
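
   As a non-normative illustration, the short Python sketch below
   rebuilds the prefixes and the host address of this example from
   their components, following the Interface ID format defined above
   (16-bit Subnet ID, 8 zero bits, 8-bit Cluster ID, 16-bit Switch ID,
   32-bit Host ID). The variable names are ours and are not part of
   the specification.

      # Illustrative sketch only: rebuild the example prefixes from
      # their parts, using the field layout shown above.
      from ipaddress import IPv6Address, IPv6Network

      dccp = IPv6Network("2001:DB8:A::/48")   # DC Cluster Prefix
      subnet_id  = 0x000A        # 16-bit Subnet ID
      cluster_id = 0xC5          # 8-bit Cluster ID
      switch_id  = 0x0001        # 16-bit Switch ID
      host_id    = 0x00000001    # 32-bit Host ID

      base          = int(dccp.network_address)
      subnet_prefix = base | (subnet_id << 64)            # bits 79..64
      csp           = subnet_prefix | (cluster_id << 48)  # bits 55..48
      ssp           = csp | (switch_id << 32)             # bits 47..32
      host_addr     = ssp | host_id                       # bits 31..0

      print(IPv6Network((subnet_prefix, 64)))  # 2001:db8:a:a::/64
      print(IPv6Network((csp, 80)))            # 2001:db8:a:a:c5::/80
      print(IPv6Network((ssp, 96)))            # 2001:db8:a:a:c5:1::/96
      print(IPv6Address(host_addr))            # 2001:db8:a:a:c5:1:0:1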

   Host/VM addresses can be configured using stateful DHCPv6 or other
   network management tools. In this document, DHCPv6 is chosen to
   illustrate how IPv6 host addresses are assigned from a DHCPv6
   server; similar implementations can be done with other protocols or
   tools. Section 9 describes how address pools are configured on the
 
Ming, et al.          Expires  September 10, 2015               [Page 7]

Internet-Draft    Scalability of Switching Systems in DC      March 2015

   DHCPv6 server and how information is exchanged between switches and
   the DHCP server using DHCPv6 messages, which allows seamless address
   assignment based on the proposed scheme. This makes the scheme
   completely transparent to the end-user and relieves the network
   administrator of most of the address management burden.


5. SSP Aggregation

   Typically, a routing domain is identified by a Virtual Routing and 
   Forwarding (VRF) instance. Reachability within a VRF is achieved via 
   regular layer-3 forwarding or routing. By default, reachability from 
   within a VRF to outside as well as vice-versa is restricted. In that 
   sense, a VRF provides isolation for a routing domain. A tenant can be 
   associated with multiple VRFs and each VRF can be associated with 
   multiple subnets/VLANs. There can be overlapping IP addressing across
   VRFs, allowing address reuse. To simplify implementation, reduce
   software processing, and improve scalability, all SSPs in a VRF on an 
   access switch can be aggregated into a single Aggregated SSP (ASSP). 
   Only one ASSP is needed per switch for a VRF in a DC cluster. ASSPs are 
   employed to aid simplified processing both in the control plane as well 
   as the data plane.

   Typically, for every subnet instantiated on an access switch, a 
   corresponding subnet prefix needs to be installed in the LPM that points
   to the glean adjacency. With ASSP, only a single entry needs to be 
   installed in the LPM irrespective of the number of subnets that are 
   instantiated on the access switch. In addition, the same benefit is 
   leveraged at the remote access switches where there needs to be a single
   ASSP installed for every other access switch independent of what subnets
   are instantiated at the remote switches. More details of how this FIB 
   programming is achieved are presented in the next section.

   ASSP entries on an access switch MUST be distributed to all other
   access switches in a cluster through a routing protocol such as BGP.
   When an ASSP entry is learned through an IGP or BGP, an LPM entry
   SHOULD be installed. Because of its better scalability in large data
   center environments (a BGP Route Reflector can be used to reduce the
   number of peers a BGP node communicates with), BGP is recommended
   for this forwarding model. In this document, we describe how BGP can
   be used for ASSP and CSP distribution. A new BGP Opaque Extended
   Community is specified in Section 10 for this solution.

   As mentioned earlier, in modern data centers, overlay networks are 
   typically used for forwarding data traffic between access switches. On 
   aggregation switches, a very small number of FIB entries are needed for 
   underlay reachability since the aggregation switches are oblivious to 
   the tenant host addresses. So aggregation switching platforms can be 
   designed to be simple, low latency, high port density, and low cost.

   ASSP entries programmed in the LPM table are used for forwarding
   data traffic between access switches. The rewrite information in the
   corresponding next-hop (or Adjacency) entry SHOULD include
   information to forward
 
Ming, et al.          Expires  September 10, 2015               [Page 8]

Internet-Draft    Scalability of Switching Systems in DC      March 2015

   packets to the egress access switch corresponding to the Switch ID. 

   One local ASSP FIB CAM entry is also needed. The rewrite information
   in the corresponding next-hop (or Adjacency) entry SHOULD include
   information to punt packets to the local processor. This local FIB
   CAM entry is used to trigger address resolution if a destination
   host/VM is not in the IPv6 Neighbor Cache (equivalent to a glean
   entry). Host/VM addresses (/128) discovered through the IPv6 ND
   protocol are installed in the Host Route Table (HRT) on the access
   switch, and only on that access switch. Host routes learned through
   a routing protocol MUST NOT be programmed into the HRT table in
   hardware. Note that an exception can occur if a VM moves across an
   access switch boundary; such moves require special handling that
   will be discussed in a separate draft on VM Mobility.

   An IPv6 unicast data packet from a host/VM connected to an ingress
   switch destined to a host on an egress switch is forwarded in the
   following steps: 1) it arrives at the ingress switch; 2) an L3
   lookup in the FIB (LPM) CAM table hits an entry because the packet's
   destination address includes the Switch Subnet Prefix; 3) the packet
   is forwarded to the egress switch based on the FIB CAM entry and the
   corresponding Adjacency entry; 4) the packet is forwarded to its
   destination host by the egress switch.

   For forwarding packets outside of the DC Cluster, a default route
   ::/0 SHOULD be installed in the FIB CAM that routes packets to one
   of the DC edge routers/switches, which provide reachability both to
   other data center sites and to the external world (Internet).

   To summarize this forwarding model: only local host/VM routes are
   installed in the HRT table, which greatly reduces the number of HRT
   entries required at an access switch. ASSP routes are installed in
   the LPM table for forwarding traffic between access switches.
   Because ASSPs are independent of subnets/VLANs, the total number of
   LPM entries required is greatly reduced. These reduced requirements
   on the HRT and LPM tables allow the access switches to support a
   very large number of VMs with much smaller hardware FIB tables.

   A similar forwarding model SHOULD be implemented in software. For
   example, if the special mask discussed in Section 6 is used, then
   when forwarding an IPv6 packet in an SSP-enabled VRF, the SSP subnet
   bits can be masked to 0s when doing the lookup in the software FIB.
   If this results in a match with an ASSP entry, the packet is
   forwarded to the egress access switch using the adjacency attached
   to the ASSP.
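
   The following Python sketch illustrates, in a simplified and purely
   illustrative way, the two-table software forwarding model described
   in this section: local /128 routes in the HRT, one ASSP entry per
   access switch with the Subnet ID bits set to 0, and a default route
   for traffic leaving the cluster. The table contents, adjacency
   strings, and addresses are assumptions made up for this example.

      # Sketch of the software forwarding model of Section 5.
      # Table contents and adjacency names are hypothetical.
      import ipaddress

      SUBNET_ID_BITS = 0xFFFF << 64   # Subnet ID occupies bits 79..64
      ASSP_LEN = 96                   # covers DCCP..Switch ID fields

      # HRT: exact-match /128 entries for locally attached hosts only.
      hrt = {
          ipaddress.IPv6Address("2001:db8:a:a:c5:1:0:1"): "Eth1/1",
      }

      # ASSP entries are installed with the Subnet ID bits zeroed
      # (DCCP 2001:db8:a::/48, Cluster ID 0xC5, Switch IDs 1 and 2).
      assp_fib = {
          int(ipaddress.IPv6Address("2001:db8:a:0:c5:1::")): "glean",
          int(ipaddress.IPv6Address("2001:db8:a:0:c5:2::")): "switch-2",
      }

      def forward(dst: ipaddress.IPv6Address) -> str:
          if dst in hrt:                        # local host route first
              return hrt[dst]
          key = int(dst) & ~SUBNET_ID_BITS      # ignore Subnet ID bits
          key = (key >> (128 - ASSP_LEN)) << (128 - ASSP_LEN)
          if key in assp_fib:                   # ASSP hit
              return assp_fib[key]
          return "default ::/0 to DC edge"

      # Any subnet behind switch 2 maps to the same ASSP entry:
      print(forward(ipaddress.IPv6Address("2001:db8:a:7:c5:2:0:9")))
      print(forward(ipaddress.IPv6Address("2001:db8:a:a:c5:1:0:1")))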


6. Programming in FIB CAM with Special Mask

   Typically, a FIB lookup requires a longest prefix match (LPM), for
   which a CAM is utilized. A CAM in an ASIC is implemented with value
   bits and mask bits for each of its entries. The value bits are the
   values (0 or 1) of the bits in memory that are compared, during the
   L3 forwarding lookup, against a lookup key that typically includes
   the destination address of the data
 
Ming, et al.          Expires  September 10, 2015               [Page 9]

Internet-Draft    Scalability of Switching Systems in DC      March 2015

   packet to be forwarded (the lookup key is typically (vpn-id
   corresponding to the VRF, destination IP)). The mask bits are used
   to include or exclude each bit of the value field of a CAM entry
   when deciding whether a match has occurred. Mask bit = 1, or
   mask-in, means include the value bit; mask bit = 0, or mask-out,
   means exclude the value bit, i.e., it is a DON'T-CARE (the
   corresponding value bit can be 1 or 0).

   When programming the FIB CAM for all Switch Subnet Prefixes from an
   access switch, only one entry is installed in the FIB CAM per
   destination access switch: mask in all DC Cluster Prefix bits, mask
   out all bits after the DC Cluster Prefix and before the Cluster ID
   bits, mask in both the Cluster ID bits and the Switch ID bits, and
   mask out the remaining bits.

   For example,

   DC Cluster Prefix:    2001:0DB8:000A::/48
   Cluster ID: 0xC5
   Switch ID in hex: 0x1234

   FIB CAM programming                   
      Value:    2001:0DB8:000A:0000:00C5:1234:0000:0000
      Mask:     FFFF:FFFF:FFFF:0000:00FF:FFFF:0000:0000


   With one such FIB CAM entry, the switch can match all Switch Subnet
   Prefixes that include the DC Cluster Prefix 2001:0DB8:000A::/48,
   Cluster ID 0xC5, and Switch ID 0x1234, no matter what values are in
   the bits between the DC Cluster Prefix and the Cluster ID. That
   means only a single FIB CAM entry is needed for all packets destined
   to hosts connected to a switch, no matter what subnet prefixes are
   configured on the VLANs of that switch. On a given switch, one FIB
   CAM entry is required for each of the other access switches in the
   DC Cluster.
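
   A minimal, non-normative Python sketch of this value/mask
   construction and the resulting ternary match is shown below, using
   the values from the example above. Real CAM programming is
   ASIC-specific; the function names here are purely illustrative.

      # Ternary (value/mask) matching for an ASSP FIB CAM entry.
      import ipaddress

      def assp_cam_entry(dccp, cluster_id, switch_id):
          net = ipaddress.IPv6Network(dccp)
          value = (int(net.network_address)
                   | (cluster_id << 48) | (switch_id << 32))
          mask = (int(net.netmask)     # mask in DC Cluster Prefix bits
                  | (0xFF << 48)       # mask in the 8 Cluster ID bits
                  | (0xFFFF << 32))    # mask in the 16 Switch ID bits
          return value, mask           # all other bits are DON'T-CARE

      def cam_match(dst, value, mask):
          key = int(ipaddress.IPv6Address(dst))
          return (key & mask) == (value & mask)

      value, mask = assp_cam_entry("2001:DB8:A::/48", 0xC5, 0x1234)
      print(ipaddress.IPv6Address(value))  # 2001:db8:a:0:c5:1234::
      print(ipaddress.IPv6Address(mask))   # ffff:ffff:ffff:0:ff:ffff::
      # Matches any SSP behind switch 0x1234, regardless of Subnet ID:
      print(cam_match("2001:db8:a:7:c5:1234:0:1", value, mask))  # True
      print(cam_match("2001:db8:a:7:c5:9999:0:1", value, mask))  # False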

   In case the LPM is not implemented as a true CAM but instead as an
   algorithmic CAM, as is the case with some ASICs, an alternative
   approach can be employed: set all subnet bits to 0s when programming
   an ASSP entry in the LPM table. The subnet bits SHOULD also be
   cleared in the lookup key when doing a lookup in the LPM table. This
   approach requires certain changes in the lookup logic of the ASIC.


   Note that the above explanation applies on a per VRF basis since the FIB
   lookup is always based on (VRF, Destination-IP). For example, in a data 
   center with 100 access switches, if a VRF spans 10 access switches, then
   the number of LPM entries on those 10 access switches for this VRF is 
   equal to 10 (1 local and 9, one for each of the remote switches). 
   Section 11 provides additional details on scalability in terms of the 
   number of entries required for supporting a large multi-tenant data 
   center with millions of VMs.



 
Ming, et al.          Expires  September 10, 2015              [Page 10]

Internet-Draft    Scalability of Switching Systems in DC      March 2015

7. VM Mobility

   VM mobility will be discussed in a separate IETF draft.


8. Scaling Neighbor Discovery

   Another major issue with the traditional forwarding model is the
   scalability of processing Neighbor Discovery Protocol (NDP)
   messages. In a data center cluster with a large number of VLANs,
   many of which span multiple access switches, the volume of NDP
   messages handled by software on an access switch could be huge and
   can easily overwhelm the CPU. In addition, the large number of
   entries in the neighbor cache on an access switch could cause HRT
   table overflow.

   In our proposed forwarding model, Neighbor Discovery can be
   distributed to the access switches as described below. Note that the
   following descriptions in this section apply only to ND operation
   for global unicast targets. No change to ND operation is required
   for link-local targets.

   All NDP messages from hosts/VMs are restricted to the local access
   switch.

   Multicast NDP messages are flooded to all local switch ports on a
   VLAN and also copied to the local CPU. They SHOULD NOT be sent on
   link(s) connected to aggregation switches.

   When a multicast NS message is received, if its target matches the
   local ASSP, it can be ignored, because the target host/VM SHOULD
   reply to the NS itself since the destination is also locally
   attached to the access switch; otherwise, a unicast NA message MUST
   be sent by the switch with the link-layer address set to the
   switch's MAC (aka the Router MAC).

   When a unicast data packet is received, if the destination address
   belongs to a remote switch, it will match the ASSP for the remote
   switch in the FIB table and be forwarded to the remote switch. On
   the remote switch, if the destination host/VM has not been
   discovered yet, the data packet will be punted to the CPU and an NS
   will be triggered for host discovery in software.
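
   The per-switch NS handling described above can be summarized with
   the following illustrative Python sketch. The ASSP value, MAC
   address, and helper names are assumptions used only for exposition;
   they are not part of the NDP specification or of this proposal.

      # Access-switch handling of a received multicast NS for a global
      # unicast target (illustrative only).
      import ipaddress

      def clear_subnet_bits(addr):
          return int(addr) & ~(0xFFFF << 64)   # ignore Subnet ID bits

      def handle_multicast_ns(target, local_assp, router_mac):
          # The NS is flooded only on local ports and copied to the
          # CPU; it is never sent toward the aggregation switches.
          prefix = int(local_assp.network_address)
          if (clear_subnet_bits(target)
                  & int(local_assp.netmask)) == prefix:
              return "no proxy: target is local, host replies itself"
          return "proxy unicast NA, link-layer address = " + router_mac

      local_assp = ipaddress.IPv6Network("2001:db8:a:0:c5:1::/96")
      mac = "02:00:c5:00:00:01"
      t1 = ipaddress.IPv6Address("2001:db8:a:7:c5:1:0:9")   # local
      t2 = ipaddress.IPv6Address("2001:db8:a:7:c5:2:0:9")   # remote
      print(handle_multicast_ns(t1, local_assp, mac))
      print(handle_multicast_ns(t2, local_assp, mac))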

   The distributed ND model can reduce software processing on the CPU
   substantially. It also takes much less space in the hardware HRT
   table. Most importantly, there is no flooding at either L2 or L3.
   Flooding is a major concern in large data centers, so it SHOULD be
   avoided as much as possible.

   A subnet prefix and a unique address are configured on an L3 logical
   interface on an access switch. When the L3 logical interface has
   member ports on multiple switches, the same subnet prefix and
   address MUST be configured on the L3 logical interface on all of
   those switches. ND operation on hosts/VMs remains the same without
   any change.
 
Ming, et al.          Expires  September 10, 2015              [Page 11]

Internet-Draft    Scalability of Switching Systems in DC      March 2015


9. DHCPv6

   This section describes the host address assignment model with DHCPv6 
   protocol. Similar implementations can be done with other protocols and 
   management tools. 

   DHCPv6 Relay Agent [RFC 3315] SHOULD be supported on access switches for
   this address assignment proposal. [draft-ietf-dhc-topo-conf-04] 
   specifies recommendations on real DHCPv6 Relay Agent deployments. For 
   the forwarding model described in this document, the method of using 
   link-address as described in section 3.2 of 
   [draft-ietf-dhc-topo-conf-04] SHOULD be implemented as follows:

   The Switch Subnet Prefix (SSP) for the subnet on the switch SHOULD
   be used as the link-address in the Relay-Forward message sent from
   the switch. On the DHCPv6 server, the link-address is used to
   identify the link. A prefix or address range should be configured on
   the DHCPv6 server for the link; the prefix or address range MUST
   match the SSP on the switch. This guarantees that addresses assigned
   by the DHCPv6 server always include the SSP for the interface on the
   switch.
 
   The number of SSP address pools could be very large on the DHCP server. 
   This can be alleviated by employing a cluster of DHCP servers to ensure 
   effective load distribution of client DHCPv6 requests.

10. BGP

   As mentioned earlier, ASSP entries are redistributed to all access
   switches through BGP. ASSP entries learned from BGP are inserted
   into the RIB. They are used for FIB CAM programming in hardware and
   for IPv6 forwarding in software.

   In this document, we define a BGP Opaque Extended Community that can
   be attached to BGP UPDATE messages to indicate the type of routes
   advertised in those messages. This is the IPv6 Route Type Community
   [RFC4360], with the following encoding:

                     +-------------------------------------+
                     | Type 0x3 or 0x43 (1 octet)          |
                     +-------------------------------------+
                     | Sub-type 0xe (1 octet)              |
                     +-------------------------------------+
                      | Route Type   (1 octet)              |
                     +-------------------------------------+
                     | Subnet ID Length (1 octet)          |
                     +-------------------------------------+
                     | Reserved (4 octets)                 |
                     +-------------------------------------+

   Type Field:
   The value of the high-order octet of this Opaque Extended Community is 
   0x03 or 0x43.  The value of the low-order octet of the extended type 
 
Ming, et al.          Expires  September 10, 2015              [Page 12]

Internet-Draft    Scalability of Switching Systems in DC      March 2015

   field for this community is 0x0E (or another value allocated by
   IANA).

   Value Field:
   The 6 octet Value field contains three distinct sub-fields, described
      below:

   The Route Type sub-field defines the type of IPv6 routes carried in
   this BGP message. The following values are defined:

       1: ASSP_Route indicates that the routes carried in this BGP
          UPDATE message are ASSP routes
       2: CSP_Route indicates that the routes carried in this BGP
          UPDATE message are CSP routes

   The Subnet ID Length specifies the number of Subnet ID bits in an
   ASSP route. Those bits can be ignored in the FIB lookup, either with
   the special mask when a FIB lookup CAM is used or with the
   alternative approach described in Section 6. This field is only used
   when the Route Type is ASSP_Route.

   The 4 octet reserved field is for future use.
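
   For illustration only, the 8-octet community described above could
   be encoded and decoded as in the following Python sketch. The
   constants follow the layout in this section; the helper functions
   are assumptions and do not represent any particular BGP
   implementation.

      # Encoding/decoding of the IPv6 Route Type Opaque Extended
      # Community (illustrative sketch only).
      import struct

      TYPE_OPAQUE     = 0x03   # or 0x43 for the other variant
      SUBTYPE         = 0x0E   # sub-type used by this document
      ROUTE_TYPE_ASSP = 1
      ROUTE_TYPE_CSP  = 2

      def encode(route_type, subnet_id_len=0):
          # type, sub-type, route type, Subnet ID Length, 4 reserved
          return struct.pack("!BBBB4s", TYPE_OPAQUE, SUBTYPE,
                             route_type, subnet_id_len, b"\x00" * 4)

      def decode(community):
          t, sub, rtype, sid_len, _ = struct.unpack("!BBBB4s",
                                                    community)
          assert t in (0x03, 0x43) and sub == SUBTYPE
          return rtype, sid_len

      ec = encode(ROUTE_TYPE_ASSP, subnet_id_len=16)  # 16 Subnet ID bits
      print(ec.hex())    # 030e011000000000
      print(decode(ec))  # (1, 16)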

   The IPv6 Route Type Community does not need to be carried in the BGP 
   Withdraw messages.

   All operations SHOULD follow [RFC4360]. There is no exception for this 
   forwarding model.


11. Scalability

   With this forwarding model, the scalability of a data center
   switching system is improved significantly while still allowing
   any-to-any communication between all hosts, with no restriction on
   host placement or host mobility.

   FIB TCAM utilization on an access switch becomes independent of the
   number of VLANs/subnets instantiated on that switch.

   It is important to note that the number of host prefix routes (/128)
   depends only on the number of VMs that are local to an access
   switch. A network administrator can add as many access switches as
   needed with the same network design and never worry about running
   out of FIB HRT resources. This greatly simplifies network design for
   large data centers.
 
Ming, et al.          Expires  September 10, 2015              [Page 13]

Internet-Draft    Scalability of Switching Systems in DC      March 2015

   The total number of VMs that can be supported in a data center
   cluster can be calculated as follows (assuming a single VRF):

   Number of LPM entries:
       Only one LPM entry is required per access switch in the cluster
       (for that switch's ASSP, including the local one). The total
       number of LPM entries on an access switch is therefore the total
       number of access switches in the DC cluster plus 1 (for the
       default route).

   Number of HRT entries:
   There will be one HRT entry for each directly connected host/VM.

   Scalability Calculation

   H: max number of HRT entries
   V: Number of VMs/port
   P: number of ports/access switch

     H = V x P

   For example,
     48 ports/access switch, 128 VMs/port
     H = 48 x 128 = 6k HRT entries/access switch

   T: total number of hosts/VMs
   L: number of access switches

     T = H x L

   Example: with 200 access switches,
     T = 6k x 200 = 1.2 million VMs can be supported in a large data
     center cluster.
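
   The arithmetic above can be checked with a short, illustrative
   Python snippet (the port and VM counts are just the example values
   used in this section):

      # Quick check of the scalability numbers above.
      vms_per_port     = 128
      ports_per_switch = 48
      access_switches  = 200

      hrt_per_switch = vms_per_port * ports_per_switch    # H = V x P
      lpm_per_switch = access_switches + 1                # ASSPs + default
      total_vms      = hrt_per_switch * access_switches   # T = H x L

      print(hrt_per_switch)   # 6144    (~6k HRT entries per switch)
      print(lpm_per_switch)   # 201     LPM entries per switch
      print(total_vms)        # 1228800 (~1.2 million VMs per cluster)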

12. DC edge router/switch 

   Multiple data center clusters can be interconnected with DC edge
   routers/switches. The same subnet can span multiple data center
   clusters. While each subnet has a unique subnet prefix, each cluster
   into which that subnet extends has a unique Cluster Subnet Prefix.
   This prefix is advertised over BGP to the edge routers, which in
   turn attract traffic for hosts that are part of that subnet in a
   given cluster. Again, the procedure to handle host mobility across
   clusters will be described separately in a different draft.
 
12.1 DC Cluster Interconnect

   This section describes a way to support a VLAN that spans DC
   clusters in this forwarding model.

   Subnet Prefixes SHOULD be advertised by a routing protocol within a
   DC Cluster, but subnet prefixes SHOULD NOT be installed in the
   hardware FIB table. On a DC edge router/switch, Cluster Subnet
   Prefixes (CSPs) can be configured, or auto-generated if SSP is
   enabled. A CSP is a special prefix used at a DC edge router/switch
   to forward traffic between directly
Ming, et al.          Expires  September 10, 2015              [Page 14]

Internet-Draft    Scalability of Switching Systems in DC      March 2015

   connected DC clusters. Please refer to Sections 2 and 4 for the CSP
   definition and an example. There SHOULD be one CSP per subnet.

   A CSP SHOULD be advertised through a routing protocol between the DC
   edge routers/switches that connect the DC Clusters. In Section 10, a
   special BGP extended community is defined for advertising CSP
   routes. CSP routes SHOULD NOT be advertised into a DC cluster.

   CSP route messages SHOULD be handled as follows:

   On the DC edge router/switch originating a CSP, the CSP SHOULD NOT
   be installed in the hardware FIB table. On a receiving DC edge
   router/switch, the CSP SHOULD be installed in the hardware FIB
   table. All bits between the DCCP and the Cluster ID MUST be masked
   out if the special mask scheme can be implemented, or set to 0s if a
   FIB key mask is not supported.

   Because CSPs consume FIB CAM space, the user SHOULD determine
   whether there is enough FIB CAM resource on the DC edge
   router/switch before enabling this feature.


13. Multiple VRFs and Multiple Tenancies

   For flexibility, an implementation can let the user enable or
   disable this feature at the VRF level on one or more access
   switches. When it is enabled in a VRF, all functionality described
   in this document SHOULD be applied to that VRF on all of those
   access switches. No behavior changes SHOULD occur in other VRFs that
   do not have this feature enabled.

   Multi-tenancy can be supported by employing multiple VRFs. A tenant
   can be allocated one or more VRFs.


13.1 Resource Allocation and Scalability with VRFs

   To support more VRFs in a DC cluster, a DC network administrator can
   enable this feature for a VRF on only a few access switches in the
   cluster. The maximum number of VRFs can be calculated with the
   following formula:

   Scalability Calculation
   L: number of LPM entries
   V: number of VRFs
   P: number of access switches per VRF (average)

        L = V x (P + 1)   or
        V = L/(P + 1)


   Example:
   With 8k LPM entries available per access switch and, on average, 9
   access switches allocated per VRF, the number of VRFs that can be
   supported is V = 8000/(9 + 1) = 800.
 
Ming, et al.          Expires  September 10, 2015              [Page 15]

Internet-Draft    Scalability of Switching Systems in DC      March 2015

   More VRFs can be supported if the number of access switches per VRF is 
   decreased.

   To support a large number of VRFs or tenants, larger LPM tables MAY
   be required. That SHOULD be considered during the ASIC design phase.


14. Security

   No new security threat is expected to be imposed by this proposal. 


15. References
 
   [TCAM] S. Kasnavi, V. C. Gaudet, P. Berube, and J. N. Amaral, "A
      Hardware-Based Longest Prefix Matching Scheme for TCAMs", IEEE,
      2005

   [OVERLAYS] S. Hooda, S. Kapadia, and P. Krishnan, "Using TRILL,
      FabricPath, and VXLAN: Designing Massively Scalable Data Centers
      (MSDC) with Overlays", ISBN 978-1587143939, 2014

   [RFC2119] Key words for use in RFCs to Indicate Requirement Levels

   [RFC 4291] IP Version 6 Addressing Architecture

   [RFC 4861] Neighbor Discovery for IP version 6 (IPv6)

   [RFC 3315] Dynamic Host Configuration Protocol for IPv6 (DHCPv6)

   [draft-ietf-dhc-topo-conf-04] Customizing DHCP Configuration on the
      Basis of Network Topology

   [RFC 4271] A Border Gateway Protocol 4 (BGP-4)

   [RFC4360] BGP Extended Communities Attribute




Authors' Addresses

   Ming Zhang
   Cisco Systems
   170 West Tasman Dr
   San Jose, CA 95134
   USA

   Phone: +1 408 853 2419
   EMail: mzhang@cisco.com


   Shyam Kapadia
   Cisco Systems
   170 West Tasman Dr
 
Ming, et al.          Expires  September 10, 2015              [Page 16]

Internet-Draft    Scalability of Switching Systems in DC      March 2015

   San Jose, CA 95134
   USA

   Phone: +1 408 527 8228
   EMail: shkapadi@cisco.com


   Liqin Dong
   Cisco Systems
   170 West Tasman Dr
   San Jose, CA 95134
   USA

   Phone: +1 408 527 1532
   EMail: liqin@cisco.com