Internet DRAFT - draft-dalela-dc-approaches

draft-dalela-dc-approaches



Network Working Group                                         A. Dalela
Internet Draft                                            Cisco Systems
Intended status: Standards Track                      December 30, 2011
Expires: June 2012



                      Datacenter Solution Approaches
                     draft-dalela-dc-approaches-00.txt


Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on June 30, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.






Dalela                  Expires June 30, 2012                  [Page 1]

Internet-Draft          Datacenter Approaches             December 2011


Abstract

   There are many approaches to addressing virtualized datacenter
   scaling problems. Examples of these approaches include L2 vs. L3
   forwarding, host-based vs. network-based solutions, fat-access and
   lean-core vs. fat-core and lean-access, flat addressing vs.
   encapsulation, protocol learning vs. directories for location
   discovery, APIs vs. protocols for orchestration, etc. Different
   solutions being proposed today take one or more of these approaches
   in combination, although sometimes the question of approach itself
   may not be settled. Given the multiple facets of the datacenter
   problem, and many approaches to solve each problem, it becomes hard
   to discuss a solution when some approaches may be acceptable while
   others are not. This document discusses the pros and cons of various
   approaches. The goal is not to describe a specific solution, but to
   evaluate the various approaches. This document concludes with a set
   of recommendations on which approaches are best suited for a
   holistic solution to the entire problem set.

Table of Contents

   1. Introduction...................................................3
   2. Conventions used in this document..............................3
   3. Terms and Acronyms.............................................4
   4. Problem Statement..............................................4
   5. Possible Solution Approaches...................................4
      5.1. Addressing Approaches.....................................4
         5.1.1. Mobile IP Approach...................................4
         5.1.2. Two Address Spaces...................................4
         5.1.3. Host Based Solutions.................................6
         5.1.4. Hierarchical Addressing..............................7
      5.2. Multi-Tenancy Approaches..................................7
         5.2.1. VLAN Based Approaches................................8
         5.2.2. GRE Encapsulation....................................8
         5.2.3. MPLS Header..........................................8
      5.3. Datacenter Interconnectivity Approaches...................9
         5.3.1. BGP MPLS VPN Approach................................9
         5.3.2. New Routing Protocol at Datacenter Edge.............10
         5.3.3. L2 Overlay Interconnects............................11
         5.3.4. Common Intra and Inter Datacenter Technology........12
      5.4. Forwarding Approaches....................................13
         5.4.1. L3 Forwarding.......................................13
         5.4.2. L2 Forwarding.......................................14
         5.4.3. Hybrid Approaches...................................15
      5.5. Discovery Approaches.....................................15
         5.5.1. Protocol Based Route Learning.......................16
         5.5.2. Address Location Registries.........................16


Dalela                  Expires June 30, 2012                  [Page 2]

Internet-Draft          Datacenter Approaches             December 2011


         5.5.3. Routing-Registry Hybrid Approach....................17
      5.6. Cloud Control Approaches.................................18
         5.6.1. Application APIs....................................18
         5.6.2. Network Protocol Approach...........................19
   6. Recommendations...............................................19
   7. Network Architecture..........................................22
   8. Security Considerations.......................................23
   9. IANA Considerations...........................................23
   10. Conclusions..................................................23
   11. References...................................................23
      11.1. Normative References....................................23
      11.2. Informative References..................................23
   12. Acknowledgments..............................................23



1. Introduction

   The problem statement [REQ] describes a set of problems that need to
   be collectively solved for datacenters. Many of these problems are
   inter-linked, and a solution to one problem that overlooks the others
   makes those other problems harder to solve. Any approach adopted to
   solve the datacenter problems should therefore be evaluated against
   the wider set of issues that need to be collectively addressed,
   rather than one issue at a time.

   Given a broader set of issues, this document tries to evaluate the
   various solution approaches against those issues. The goal here is
   not to propose a specific solution, but to understand the pros and
   cons of taking an approach with respect to the wider problem set.

   We conclude this document with a set of recommendations on the
   approaches that can be used in combination to address the entire
   problem set. This can then be used to devise specific solutions,
   whose discussion need not re-open questions about the approaches
   themselves.

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC-2119 [RFC2119].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS. Lower case uses of these words are not to be
   interpreted as carrying RFC-2119 significance.



Dalela                  Expires June 30, 2012                  [Page 3]

Internet-Draft          Datacenter Approaches             December 2011


3. Terms and Acronyms

   NA

4. Problem Statement

   This is described in the problem statement document [REQ].

5. Possible Solution Approaches

   This section discusses the various design approaches that can be
   adopted to solve the datacenter issues. These include approaches to
   solve mobility, inter-connectivity of datacenters, handling multi-
   paths to a destination, cloud orchestration, etc.

5.1. Addressing Approaches

   Addressing issues primarily arise due to mobility, and secondarily
   because of connecting public and private domains which might be using
   the same IP address range. Both issues are important for datacenters.

5.1.1. Mobile IP Approach

   In the Mobile IP approach a mobile node is assigned a location
   independent address whose routes are advertised by the Home Agent.
   The mobile node itself is bound at the link-level to the Foreign
   Agent. The traffic is then tunneled between Home and Foreign agents.
   The challenge here is that all packets must pass through the Home
   Agent (at least when going towards the mobile node), so traffic can
   use neither the shortest path nor multiple paths to the destination.

   Shortest paths and multiple paths to a destination are essential
   requirements for datacenter traffic. The mobile IP approach is
   therefore unsuited for datacenter traffic.

5.1.2. Two Address Spaces

   Many current approaches separate the location address space from the
   identifier address space. The location address space refers to the
   routers or switches while the identifier address space refers to
   hosts. The mapping between the location and identifier address spaces
   can be done by carrying host-routes within the native routing
   protocol, by a new routing protocol that carries host routes over the
   native protocol or by snooping existing protocol packets like ARP.
   Subsequently, packets are tunneled to the location switch or router,
   whose address is the outer address, decapsulated, and forwarded to
   the host.
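
   As an illustration only, the following Python sketch shows the map-
   and-encap idea; the binding table and packet representation are
   hypothetical stand-ins, not part of any specific proposal:

      # Sketch: locator-identifier split with map-and-encap.
      # bindings: host (identifier) IP -> locator (edge switch) IP.
      bindings = {"10.2.0.9": "192.0.2.33"}   # learned or snooped

      def encapsulate(inner_packet, dst_host_ip):
          locator = bindings[dst_host_ip]     # edge locator lookup
          return {"outer_dst": locator,       # tunnel header
                  "payload": inner_packet}    # original host packet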



Dalela                  Expires June 30, 2012                  [Page 4]

Internet-Draft          Datacenter Approaches             December 2011


   As VMs move, location independence requires host-level locator-
   identifier bindings to be pushed into the network.

   If these bindings are pushed everywhere using the native routing
   protocol, these bindings will be present in both the access and the
   core. The first bottleneck faced in this case will be in the core,
   which has to hold many host routes. As these hosts increase, this
   approach becomes impossible to scale in the network core.

   If however these bindings are created using a new routing protocol
   that runs between edges or by snooping existing protocol packets
   (such as ARP), at the edges, the location-identifier bindings are
   only present at the network edges and not the core. This approach is
   obviously an improvement over the native routing protocol approach.

   However, as host mobility increases, and the corresponding hosts are
   placed in different locations, the host routes at the edge begin to
   increase rapidly. For example, if a host has 25 VMs, each with 4
   virtual NICs, and an access switch connects to 48 such hosts, and
   each virtual NIC on a host corresponds with 50 other NICs that are
   situated in different locations, the total number of host routes
   needed at the access will be 25 * 4 * 50 * 48 = 240,000.
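
   The following Python fragment recomputes this purely illustrative
   sizing from the assumed numbers above:

      # Hypothetical sizing of the access-edge host-route table.
      vms_per_host     = 25    # VMs on one physical host
      vnics_per_vm     = 4     # virtual NICs per VM
      peers_per_vnic   = 50    # remote NICs each vNIC talks to
      hosts_per_access = 48    # physical hosts behind one access switch

      routes = (vms_per_host * vnics_per_vm *
                peers_per_vnic * hosts_per_access)
      print(routes)            # 240000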

   This number obviously depends on the application and network design.
   In some cases, a host may correspond with thousands of hosts, but may
   not be virtualized. In other cases, the number of VMs per physical
   host may be more although they don't have as many virtual interfaces.
   In the worst case, all the above numbers could be higher.

   Also note that the host routes are in addition to other things: (a)
   network routes, (b) local host-port bindings, (c) policies such as
   access lists, etc., which currently exist and will continue.
   Regardless of whether the number of host routes is large or not, what
   is undeniable is that these are additional entries.

   Experience shows that network sizes grow at an exponential rate, and
   the VM density per host, the distribution of compute across multiple
   nodes, and VM mobility are trends that will only increase with time.
   Each of these factors will increase host routes. The expectation is
   also that massively scaled datacenters should decrease the overall
   cost of infrastructure. The cost of compute will decrease as mobility
   and distribution are applied but the cost of the network will
   increase with growing table sizes. This puts compute and network at
   opposite ends of the cost trend, and long-term this is not viable.

   Encapsulated packets make the application of security and QoS
   policies considerably harder. The firewall, load-balancer, application


Dalela                  Expires June 30, 2012                  [Page 5]

Internet-Draft          Datacenter Approaches             December 2011


   optimizer, packet policers, or other kinds of network services have
   to be aware that packets have to be analyzed based upon the inner
   addresses and not based on the destination switch or router address.
   This is particularly true when the same destination has hosts
   belonging to many tenants each with different policies. This fact
   complicates the design of all network services, and may make existing
   hardware accelerated network service equipment obsolete. Different
   encapsulation techniques are further incompatible with each other and
   with network services that might be separately deployed.

5.1.3. Host Based Solutions

   Address space separation can be achieved in the host instead of the
   network. For instance, it is possible that a host is aware of two IP
   addresses, one that it exposes to the network and the other that it
   exposes to the applications. When an application needs to send a
   packet to another application, it would use the other application's
   address. But, the host operating system below the application will
   map the application address to the remote host address.

   This scheme becomes very intuitive with VMs. Now, a remote host is
   identified by the IP address of the VM hypervisor while the
   application is identified by the VM. When a VM sends a packet, the
   hypervisor will append its IP as the outer IP. It will also resolve
   the location of the remote application to a remote hypervisor's IP
   through a new protocol, and forward the packet. The network has a
   static IP configuration and is unaware of the existence of VMs. Since
   any VM can be on any hypervisor, VMs are location independent.

   Since each VM will periodically ARP for its destination, these ARPs
   also need to be trapped by the hypervisor (or the virtual switch
   inside the hypervisor). The switch can respond locally from a cache
   or emit another protocol query to a mapping database.
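
   A minimal Python sketch of this trap-and-resolve behavior is shown
   below; the mapping database is modeled as a local dictionary that a
   real deployment would replace with the new protocol query:

      # Sketch: the virtual switch traps a VM's ARP request and
      # resolves it from a cache or a (hypothetical) mapping service.
      mapping_db = {"10.0.0.5": ("52:54:00:aa:bb:cc", "192.0.2.7")}
      arp_cache = {}        # VM IP -> (VM MAC, remote hypervisor IP)

      def handle_arp_request(target_ip):
          if target_ip not in arp_cache:
              arp_cache[target_ip] = mapping_db[target_ip]
          return arp_cache[target_ip]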

   Since the rest of the network is unaware of the existence of the VM,
   the difficulties described above with respect to network services
   appear here as well. Security for example may also have to be
   implemented in the hypervisor based firewall. There are additional
   overheads in processing each packet and adding/removing headers.
   Since all this happens on the host CPU, a greater percentage of CPU
   time is spent on network processing. This is more expensive because
   hardware accelerated network processing will do that at much greater
   speeds and with much lower amounts of energy consumed.

   Another disadvantage of doing network functions in the host is that
   the total number of network devices to manage grows by a few orders
   of magnitude. For example, if this was applied to firewall management


Dalela                  Expires June 30, 2012                  [Page 6]

Internet-Draft          Datacenter Approaches             December 2011


   of each tenant's personal VM firewall, the total number of firewalls
   to be managed will be very high (of the order of physical hosts). The
   operator cannot have a single consolidated view of all the firewall
   rules in a single place. And if additional rules had to be installed,
   they would need to be propagated to many firewalls.

5.1.4. Hierarchical Addressing

   IP addressing is already hierarchical, so by this we mean use of
   Hierarchical MAC addresses. A hierarchical MAC address has "network
   bits" and "host bits", just like an IP address. The boundary between
   the network and host parts could be fixed or variable.

   As an example, a hierarchical MAC's higher-order bits could represent
   a "switch-id" while the lower order bits could represent the "host-
   id". Given that a MAC address has 48 bits as compared to the 32 bits
   in the IPv4 network, use of hierarchical MAC addresses implies that
   the a datacenter cluster cloud could be many times larger than the
   IPv4 Internet! If packets are forwarded using hierarchical MAC
   addresses, it brings L3 scaling properties to L2 networks.
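
   As an illustration, the Python sketch below packs a switch-id and a
   host-id into a 48-bit MAC; the 24/24 bit split is an assumption for
   the example, not a proposed standard:

      SWITCH_BITS, HOST_BITS = 24, 24      # assumed fixed boundary

      def make_mac(switch_id, host_id):
          value = (switch_id << HOST_BITS) | host_id
          return ":".join("%02x" % ((value >> s) & 0xff)
                          for s in range(40, -1, -8))

      def switch_of(mac):
          value = int(mac.replace(":", ""), 16)
          return value >> HOST_BITS        # forwarding uses only this

      print(make_mac(7, 42))                 # 00:00:07:00:00:2a
      print(switch_of("00:00:07:00:00:2a"))  # 7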

   Note that L2 networks already have mobility. In current L2, this
   mobility is possible based on a fixed MAC address whose location has
   to be detected through conversational learning on the L2 switches
   before packets are forwarded. Learning and broadcast in the network
   however make current L2 networks not scalable. Hierarchical MACs
   solve this issue. It is now not necessary to learn the full MAC
   address but only the higher-order bits. If the higher order bits
   represent switch-ids, then this learning never needs to be changed
   unless a switch is added or removed from the network. The total
   number of hardware entries anywhere in the network equals the total
   number of switches and remains agnostic of VM mobility.

   Note that the entries on switch-id's are in lieu of network routing
   entries and can be treated as network routing entries. Therefore the
   total entries required are nearly the same as those required currently
   for static L3 routing. The host still has two addresses (IP and MAC),
   but now the identifier is IP and the locator is MAC.

5.2. Multi-Tenancy Approaches

   Depending on the type of forwarding (L2 or L3) different types of
   multi-tenant segmentation can be applied. As described in the problem
   statement [REQ] both L2 and L3 segments can have issues.





Dalela                  Expires June 30, 2012                  [Page 7]

Internet-Draft          Datacenter Approaches             December 2011


5.2.1. VLAN Based Approaches

   There are only 4096 VLANs, so this approach cannot scale to many
   tenants. Further, a customer may need more than one VLAN, and may
   span these VLANs from private domains.

   To allow these scenarios, extensions of VLAN such as Q-in-Q could be
   used. The inner Q could represent the VLAN and the outer Q the
   customer, and this will allow 4096 customers, each with the full
   range of 4096 VLANs. This should accommodate each customer, but 4096
   customers may not be enough for a cloud.

   We might now use a Q-in-Q-in-Q to segment customers into customer
   classes (such as gold, silver, bronze, etc.). Alternately, we can
   treat the 36 bits (three 12-bit tags) as a contiguous VLAN space that
   can be allocated to users on demand. The latter has the issue that a
   mapping between
   private and public VLAN spaces will need to be done at the network
   edges. For instance if a private VLAN 10 corresponds to a public VLAN
   100, then a mapping between 10 and 100 must be maintained at the edge
   and the packet must be modified in both directions. The total number
   of such mappings may not be very high, and these may be distributed
   over many Provider Edge (PE) or Customer Edge (CE) routers.
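
   A sketch of such an edge mapping, with illustrative VLAN numbers,
   follows:

      # Sketch: per-edge translation between private and public VLANs.
      to_public  = {10: 100}   # provisioned per customer at the edge
      to_private = {v: k for k, v in to_public.items()}

      def on_egress(vlan):     # private site -> provider network
          return to_public[vlan]

      def on_ingress(vlan):    # provider network -> private site
          return to_private[vlan]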

5.2.2. GRE Encapsulation

   The GRE Key is 32 bits long and therefore can support a very large
   number of customer segments. However, GRE works only with L3
   forwarding, because the GRE header is carried inside an IP packet.
   This
   segmentation scheme has no scaling problems except that to support IP
   mobility, the mobility schemes themselves require encapsulation,
   which has the challenges as described above. The net result of this
   scheme is that there are two headers required - one for segmentation
   (GRE) and another for mobility. Since this is running over L3, all L2
   information (such as VLAN) would be lost.

5.2.3. MPLS Header

   MPLS has been used in the Internet to segment flows. The MPLS label
   is 20-bits long, which can be used to support over a million
   customers. Note that each customer could use a full range of 4096
   VLANs as well, so this does not overlap the L2 segments with tenant
   segments. This scheme works equally well with L2 and L3 networks, and
   affords a sufficient amount of scale in both cases.

   This scheme can also be used to give per-customer quality of service
   or other types of policies as the packets traverse the Internet. It



Dalela                  Expires June 30, 2012                  [Page 8]

Internet-Draft          Datacenter Approaches             December 2011


   is this ability to use MPLS labels across private, public and
   internet domains that makes it a very convenient option.

   This segment can be inserted in the packet at the access layer inside
   the datacenter (similar to how VLAN tags are inserted) and removed at
   the remote access layer. For remote connectivity with single tenant
   datacenters, the tenant id could be inserted and removed at the
   Customer Edge (CE) router. The cloud datacenter would transparently
   pass the packet in and remove the tenant id at the access.
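
   For illustration, the Python sketch below packs a tenant id into the
   32-bit MPLS shim header (20-bit label, 3-bit traffic class, 1-bit
   bottom-of-stack flag, 8-bit TTL) that sits just after the Ethernet
   header; the default field values are assumptions for the example:

      import struct

      def mpls_shim(tenant_id, tc=0, bottom=1, ttl=64):
          assert tenant_id < 2**20        # over a million tenant ids
          word = (tenant_id << 12) | (tc << 9) | (bottom << 8) | ttl
          return struct.pack("!I", word)  # 4-byte shim, network order

      shim = mpls_shim(tenant_id=123456)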

   The segmented packets can be transported over L2 VPNs. The
   authenticated VPN tunnel endpoints should be used to map (and drop)
   packets whose endpoint addresses don't match the segment. The L2
   VPN could, for example, be an EoMPLS VPN whose MPLS label stack
   should be matched against the tenant identifier in the Ethernet
   packet. The
   cloud can be treated as one more "site" for the cloud customer and
   MPLS VPN services can be extended to these customers.

5.3. Datacenter Interconnectivity Approaches

   Three broad approaches are possible for datacenter interconnectivity.
   First, push the datacenter routes into the Internet and let the
   Internet determine the right location of a host. Second, the location
   is determined at the edge of the datacenter, and packets are
   transported over the Internet but the mechanisms within and between
   the datacenters are different. Third, we use an overlay scheme
   between datacenter edges, but a common mechanism is used within and
   between datacenters. These approaches are described below.

5.3.1. BGP MPLS VPN Approach

   This approach involves flat addressing, which has traditionally been
   used for site-to-site connectivity. In this approach, the intranet
   routes are pushed into the Internet through BGP. Routing (unicast and
   multicast) between the sites is handled by the Internet core.
   However, traditionally there have been no mechanisms to support VM
   mobility. This mobility can cause address fragmentation and bloat the
   forwarding tables in the Internet. The advantage of this approach is
   that bandwidth and security are guaranteed.

   While this approach is not the preferred mechanism, in some cases
   (such as the Virtual Private Cloud, where an entire subnet is
   reserved for a customer at the provider's site) it might be used.
   Ideally, in this scenario, VM mobility would be restricted to within
   the site. If the subnet is provided by the customer itself, then the
   customer could potentially move the entire subnet from one provider
   to another in case of disaster (assuming that the services are


Dalela                  Expires June 30, 2012                  [Page 9]

Internet-Draft          Datacenter Approaches             December 2011


   recreated in the new location through automated schemes). The edge
   router at the new location would advertise the routes to the entire
   subnet and packets would be transparently routed.

5.3.2. New Routing Protocol at Datacenter Edge

   In this approach, a new routing protocol would propagate the routes
   of the moving hosts between the edge routers. Once routes to a host
   are known at the edge routers, the packets would be encapsulated into
   an IP header with the destination address of the destination router.
   This is similar to separating the identifier and locator address
   spaces as described for intra-datacenter mobility earlier.

   Before the location is propagated via the routing protocol, the
   location must be detected first. This has to be achieved by using
   conversation learning. This learning could be based on traditional L2
   learning, some variation of L2, or by running the new routing
   protocol end-to-end within and between datacenters. In each of these
   cases, some packet from the host with its IP address must be seen on
   the network. Once the host has been detected, its location can be
   propagated. If a host has never spoken, its location would not be
   known and the host is unreachable. This problem is avoided in L2
   networks where a source will broadcast ARP to force a response from a
   destination that is otherwise not sending any packets. Conversation
   learning of the host location is therefore absolutely necessary.

   This shows that L3 location independent schemes must use the L2 type
   conversation learning. The encapsulation scheme in the L3 case may be
   different, but the basic mechanisms are identical in the two cases.

   The routing tables need to be segmented into VRFs to identify
   different tenants. If two sites of a customer are connected to two
   sites of a provider, collectively these four sites form a VRF. The
   peers in one VRF will be different than the peers in another VRF. At
   this time if the protocol uses conversation learning to advertise
   routes, it needs to know ahead of time which VRF an IP should be
   advertised into. This is because the IP across these VRFs might be
   duplicated. That means that the VRF advertisements must depend on how
   the packets are segmented inside the datacenter.

   For instance, the VLANs, GRE keys or MPLS labels as described above
   should be mapped to VRFs. Since hosts are dynamically detected,
   location propagation from intra-datacenter to inter-datacenter must
   incorporate the segment as well. Similarly, traffic received from a
   far-end must also contain the appropriate segmentation technique
   (e.g. GRE, MPLS label, or some Route Identifier in header) to
   identify that the packet belongs to a particular VRF.


Dalela                  Expires June 30, 2012                 [Page 10]

Internet-Draft          Datacenter Approaches             December 2011


   If datacenters are relatively static, the signaling demands at the
   edge (to program new locator-identifier bindings) may be no worse
   than DNS resolution that is employed infrequently to resolve the name
   to IP binding before sending packets. The entries at the edge would
   be long-lived. However, if the datacenters are very dynamic and lots
   of resources are rapidly created this can become an overhead. Such
   issues may also arise in case of disaster recovery or site outage
   when resources are rapidly recreated in another site.

   The forwarding plane scale needs for inter-datacenter connectivity
   are identical to that in intra-datacenter encapsulation schemes. That
   is, host route entries are required for host mobility across sites.
   In the inter-datacenter case, because of fewer edge points, these
   entries will be concentrated at fewer points, and will require higher
   capacity routers at the edge. Note that inter-datacenter mobility is
   a key use-case in "follow the sun" models.

   Inter-datacenter connectivity also needs to build multicast
   distribution trees into the edge routers. This will require similar
   approaches as PIM for the intra-datacenter cases. Note that these
   trees may need to be optimized for workload placement such that the
   tree directly routes packets between sites that have the largest
   number of receivers for a given multicast group.

5.3.3. L2 Overlay Interconnects

   In some cases, it is necessary to span the VLAN across sites. For
   example, a web-server and application server may be located at one
   site while the database server and the storage are in another site.
   The application and database servers are within a VLAN.

   If the VLAN is spanned across multiple sites, there is need to
   control the broadcast at the edges. For example, this may involve
   using the discovered IP to MAC bindings to respond to periodic ARP
   broadcasts. Similar to multicast trees, VLAN spanning also involves
   construction of broadcast trees. And similar to how multicast routes
   are propagated between intra and inter-datacenter, a single per-VLAN
   spanning tree needs to be constructed for broadcast. The multicast
   and broadcast trees need to be aware of workload density between
   sites to optimize the broadcast and multicast traffic.

   There are significant challenges related to virtual MAC overlap when
   connecting multiple datacenters. Note that virtual MACs are assigned
   administratively and these can overlap when many sites are connected,
   especially when private and public domains that cross administrative
   boundaries are connected. These overlaps will cause traffic loss.



Dalela                  Expires June 30, 2012                 [Page 11]

Internet-Draft          Datacenter Approaches             December 2011


   The scaling issues with the L2 schemes are identical to those as seen
   within the datacenter or for L3 inter-datacenter interconnects. That
   is, host routes are required for VM mobility. In fact with L2, the
   scaling is worsened because L2 addresses can't be summarized like L3
   addresses. There will always be a per MAC entry even if the entire
   subnet is located at one site.

5.3.4. Common Intra and Inter Datacenter Technology

   This approach treats multiple interconnected datacenters as one huge
   domain. The interconnection between sites must of course take place
   over the L3 Internet, but the networking technology can just treat
   that as an overlay. That is, the remote location is determined
   according to intra-datacenter forwarding, and tunneled over L3. The
   scaling properties of this approach are identical to the scaling
   properties of the various intra-datacenter approaches.

   For example, if encapsulation is used within the datacenter for
   mobility, and there are N switches in the first datacenter and M
   switches in the second, then the first datacenter will need M
   mappings between remote switch addresses and the edge locator switch
   address, while the second datacenter will need N such mappings.

   This is much better than when we use a different technology within
   and between datacenters. For example, by extending the encapsulation
   scheme we don't need host routes, but only switch routes. This is a
   few orders of magnitude more scalable at the edge. But, note that if
   both datacenters are large, it may worsen the scaling at the access
   because a host in one location is talking to multiple hosts in
   another location. The encapsulation approach scales well in the core
   and this is true when the core includes a tunnel over the Internet.

   Similarly, if hierarchical MAC addresses were assigned within the
   datacenters, and the switch-ids across datacenters are mutually
   exclusive, then these two datacenters can be treated as one large
   datacenter. Each datacenter will need to store M and N bindings at
   the edge, similar to the encapsulation case above. This scheme scales
   well both at the access, in the core, and at the datacenter edges.

   While there are many advantages in using the same technology across
   datacenters, there can be challenges in managing these administrative
   domains in the same way. For instance, switch-ids across these
   networks must be non-overlapping. These problems are no worse than if
   different approaches are employed within and between datacenters
   because one has to ensure unique MAC and IP addressing anyway.
   Hierarchical addressing in fact reduces the overhead from unique host



Dalela                  Expires June 30, 2012                 [Page 12]

Internet-Draft          Datacenter Approaches             December 2011


   MACs to unique switch IDs. Protocols that assign switch-ids uniquely
   would further reduce the overhead to unique IP only.

5.4. Forwarding Approaches

   Industry opinion is divided on this and much has already been said.
   Rather than repeating those arguments, this section makes two
   additional points.

   First, datacenter traffic includes not just TCP/IP but also Fibre
   Channel and InfiniBand. These technologies were developed at a time
   when Ethernet did not provide high speeds. Now that Ethernet gives
   10G and 40G speeds, it is no longer necessary to maintain separate
   networks. These networks can be converged over L2 or L3, and this is
   an important consideration to keep in mind in deciding the right
   approach. Maintaining multiple parallel networks isn't practical.

   Second, there are scaling issues in L2 when the network size grows,
   aside from the issue that inter-VLAN (L3) traffic does not use ECMP,
   which constrains the cross-section bandwidth across VLANs. These
   scaling issues should be taken into account in deciding an approach.

5.4.1. L3 Forwarding

   Datacenters have a significant amount of non-TCP/IP traffic. In fact
   bandwidths on these links have traditionally been much higher than
   Ethernet (which is the reason they were designed: Ethernet could not
   deliver those speeds earlier). The bandwidth gap no longer
   exists, but it is important to continue using these technologies.
   Fibre Channel (FC) is used for SAN while InfiniBand (IB) is used for
   networked IPC. FC is used in most enterprise networks while IB is
   used in High Performance Computing (HPC) clusters.

   Mechanisms have been developed to converge non-TCP/IP traffic over
   TCP/IP.
   These mechanisms have two broad types of issues. First, if the TCP/IP
   runs in software, the overheads in TCP/IP consume a lot of CPU and
   deliver lower performance. Second, if TCP/IP runs in hardware, the
   cost of the NIC is very high given the complexity of doing TCP in
   hardware. The cost/performance of the TCP/IP based solutions is not
   at the desired level for FC and IB traffic types. However, if a
   provider does not have significant FC/IB traffic or is prepared to
   bear the cost of more expensive NICs, then TCP/IP based solutions -
   such as iSCSI for FC and iWARP for IB - can also be employed.

   As already discussed, L3 scales very well but does not natively
   support mobility. Encapsulations need to be used to support mobility
   but these create significant scaling issues at the access.


Dalela                  Expires June 30, 2012                 [Page 13]

Internet-Draft          Datacenter Approaches             December 2011


5.4.2. L2 Forwarding

   L2 forwarding simplifies network storage and IPC. Ethernet can be
   used to converge TCP/IP, FC and IB traffic onto the same physical
   link at the desired levels of cost and performance. This will lead to
   a reduction in the datacenter networking costs, by eliminating
   multiple types of NICs, cables and switches. The total number of
   ports can also be reduced, increasing port utilization.

   However, to support non-TCP/IP traffic, L2 networks also need to
   support the Datacenter Bridging (DCB) specifications. These include
   Congestion Notification, per-priority flow control, and DCBX. They
   require hardware changes at the access, so providers may prefer to
   use L3 in the short run.

   Traditional L2 forwarding further brings several scaling issues.

   First, when packets cross VLAN boundaries, they must use a default
   gateway. Inter-VLAN traffic passes through this default gateway and
   therefore cannot use multiple paths to a destination. As the inter-
   VLAN traffic grows, the chances of packet drops rise.

   Second, traditional L2 forwarding requires each MAC address to be
   learnt, and that is a scaling concern, especially in the core. This
   problem can be addressed by encapsulating packets into remote
   locators, only so long as the datacenter is not connected to the L3
   internet. When a datacenter is connected to L3 internet and hosts can
   be accessed from outside, per-host IP to MAC bindings are needed at
   the datacenter edge. This obviates the benefits of encapsulation in
   the core, because the core needs per host L3-L2 mappings.

   Third, if we solve the inter-VLAN traffic problem by distributing the
   default gateway across many such devices (to enable multi-path), it
   requires all the switches at the L2-L3 boundary to learn about all
   the IP-MAC bindings. Effectively, now we have multi-path but the
   original scaling problems with L2 are back because each network point
   in the core needs to know the MAC-IP binding for each host.

   Fourth, the problem of ARP broadcast in a VLAN and STP turning off
   ports is well-known. However note that ARP and STP are separate
   issues from the above scaling issues, which will exist even when STP
   is off or if ARP scaling issues have been addressed.






Dalela                  Expires June 30, 2012                 [Page 14]

Internet-Draft          Datacenter Approaches             December 2011


5.4.3. Hybrid Approaches

   Hybrid approaches bring L3 routing algorithms to L2. These turn off
   STP and enable multi-paths. However, this does not address the
   mobility problem. In the L2 network, this implies learning all MAC
   addresses in the core. To avoid this, encapsulation can be used,
   which simplifies the core, but makes the access much worse.
   Hierarchical MAC addresses can solve these scaling problems. They
   don't need encapsulation and hence they address scaling problems
   arising from host mobility at both access and in the core.

   Hierarchical MACs create a global address space for MAC addresses.
   Hence, these packets can cross VLAN boundaries easily. The trick
   required here is not to tag unicast packets with VLAN tags (L2
   multicast and broadcast packets must still be tagged with VLAN tags).
   The packets must however be marked with the appropriate tenant id of
   choice. The packet will be forwarded to destination using the MAC
   address and matched against the allowed tenant id on the destination
   port. The packet will be dropped at the destination port if the
   tenant id's at the source and destination ports do not match.

   When a L2 datacenter has to be connected to the L3 Internet, L2-L3
   mappings are required at the datacenter-Internet boundary. This is
   because inside the datacenter packets are switched based on MAC
   addresses while outside they are routed based on L3. This requires
   per-host entries to map each host IP to their hierarchical MAC, with
   one important difference. The difference is that these entries are
   required only for the north-south traffic and hence don't need to be
   present at every core switch. These per-host entries can therefore be
   distributed over multiple core switches, each of which advertises a
   per-tenant set of IP routes to the PE router. The default gateway for
   all internet routes can be pinned on one of the core routers and this
   will allow the distribution of L2-L3 entries.
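
   One way to picture this distribution is to spread the per-host
   entries deterministically over the boundary switches. The sketch
   below uses a hash for illustration; a real design might assign
   entries per tenant or per subnet instead (switch names assumed):

      import zlib

      # Sketch: spreading north-south IP-to-MAC entries over several
      # core switches at the L2-L3 boundary.
      boundary_switches = ["core-1", "core-2", "core-3", "core-4"]

      def owner_switch(host_ip):
          index = zlib.crc32(host_ip.encode()) % len(boundary_switches)
          return boundary_switches[index]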

   Note that these L2-L3 mappings will be created through ARP broadcast
   when hosts in datacenter converse with Internet hosts. If these
   conversations are few, then the L2-L3 entries are correspondingly
   reduced. The key mechanism for scaling however is distribution over
   multiple core switches, which will work in all cases.

5.5. Discovery Approaches

   Two broad discovery approaches are proposed today. First, address
   discovery can be based on traditional routing protocols that push the
   address location into the network. This has the potential of causing
   instability due to frequent device creation and mobility. Second,
   address discovery can be pushed into a central registry, from where


Dalela                  Expires June 30, 2012                 [Page 15]

Internet-Draft          Datacenter Approaches             December 2011


   it can be pulled or pushed on a need basis. This approach bypasses
   the update-everywhere model and updates only select locations.

5.5.1. Protocol Based Route Learning

   A traditional routing protocol will carry each subnet or individual
   host route at the control plane and propagate its location. The
   location would be known everywhere through the control plane and this
   can be programmed in hardware. We have seen that all host-route
   approaches are not scalable at the forwarding plane. Individual route
   updates are also heavy on the control plane. In fact frequent updates
   due to link toggling, resource creation and deletion, mobility, etc.
   will create serious convergence issues in the network.

   Traditional L3 networks have been based on static subnets that don't
   change frequently. This helps in scaling the network and keeping it
   converged. This property of networks needs to be preserved; the
   challenge with L3 remains mobility and the scaling issues it brings.

5.5.2. Address Location Registries

   An address or subnet is discovered (through conversation learning or
   static configuration) and propagated into a registry, along with the
   address of the location to reach it. Any network node that has to
   send traffic to this address can look up the registry to find the
   address location before transmission. Once looked up, the location
   can be cached for a long period of time. This has the advantage that
   it serves information on-demand. The disadvantage is that when the
   information changes, everyone will not be aware of the change. They
   will therefore continue to forward packets to the old location, and
   the packets will be black-holed. If however, every network entity is
   made aware of the change immediately through an update upon change,
   then this becomes similar to the routing update above.
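
   A minimal pull-model sketch in Python follows, with the registry
   modeled as a dictionary and an assumed cache lifetime; the comment
   marks the black-holing risk described above:

      import time

      registry = {"10.1.1.1": "locator-17"}   # address -> location
      cache = {}                              # address -> (loc, time)
      CACHE_TTL = 3600.0                      # assumed, in seconds

      def locate(address):
          entry = cache.get(address)
          if entry and time.time() - entry[1] < CACHE_TTL:
              # A stale entry here black-holes traffic until it
              # expires or an explicit update arrives.
              return entry[0]
          location = registry[address]        # protocol query in practice
          cache[address] = (location, time.time())
          return location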

   There are sometimes concerns expressed about learning the routes
   real-time after the packet arrives. In general, the number of such
   lookups is of the same order of magnitude as DNS lookups which are
   done at the host level. The signaling overheads are therefore not
   significant per se, if the flows are all legitimate.

   The difference between host DNS lookup and the real-time route lookup
   is that no packets are being sent before DNS lookup whereas a large
   packet burst could be sent in the route lookup case. The burst cannot
   be forwarded until a route has been received. This is a potential
   security issue, if users send spurious bursts to non-existent IP
   addresses. The router will buffer the packets and send queries which
   will fail. Meanwhile, legitimate packets would have been queued up


Dalela                  Expires June 30, 2012                 [Page 16]

Internet-Draft          Datacenter Approaches             December 2011


   and will result in tail drop. Spurious IP scanning attacks can be
   launched to try and reach non-existent addresses. These attacks can
   be used to significantly load the control plane as well.

5.5.3. Routing-Registry Hybrid Approach

   In the hybrid approach, a routing protocol discovers all network
   routes, while host locations are resolved through a Registry, as in
   the Registry based approach above. In the hierarchical MAC approach,
   the network routes form a route table of switch-ids. Packets are
   forwarded based upon these network routes. The trigger for location
   discovery is however tied to the ARP request. The ARP request must be
   trapped at the access switch and forwarded to a central Registry. The
   difference here is that the trigger for the Registry query is not the
   arrival of data traffic, but the arrival of an ARP request. This
   approach mimics the DNS behavior more accurately because during an
   ARP request, no packets are being sent. Note that this solution will
   work only in an L2 network.

   While IP scanning attacks will still load the control plane with
   location discovery, there is no issue of tail drops. Further, more
   sophisticated control plane mechanisms can be deployed to detect such
   IP scans, since the triggers are control plane messages.

   When the VM moves, two possible schemes can be adopted. First, the
   new MAC address can be flooded to all corresponding hosts, via a
   Gratuitous ARP. The access switches will trap the Gratuitous ARP and
   create a binding to the new location. If we are using hierarchical
   MACs then bear in mind that many hosts will reject a Gratuitous ARP
   to avoid MAC hijacking. This is thus not an optimal solution.

   Second, a temporary redirect entry at the earlier source may be
   installed to redirect packets from the old to the new location. Note
   that the ARP cache will be refreshed by each host periodically
   (typically 15-30 seconds), so the redirect is not permanent. The
   registry owns the installation of the temporary redirect. This
   creates a sub-optimal routing path for a short period of time, but it
   avoids the heavy control plane traffic to update every new source
   with the new location. In time, every host will ARP for the
   destination again and will learn about the new location. The
   temporary redirect can therefore be removed after 15-30 seconds,
   which is the time within which we can expect the host to re-ARP.
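
   A sketch of the redirect scheme, with an assumed 30-second lifetime
   matching the host re-ARP interval, is shown below:

      import time

      REDIRECT_LIFETIME = 30.0  # seconds; hosts re-ARP within this
      redirects = {}            # moved VM MAC -> (new locator, expiry)

      def on_vm_move(vm_mac, new_locator):
          expiry = time.time() + REDIRECT_LIFETIME
          redirects[vm_mac] = (new_locator, expiry)

      def resolve(vm_mac, cached_locator):
          entry = redirects.get(vm_mac)
          if entry and time.time() < entry[1]:
              return entry[0]           # follow the temporary redirect
          redirects.pop(vm_mac, None)   # expired; hosts have re-ARPed
          return cached_locator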

   This solution solves the sudden control plane burst on a move, but it
   introduces the problem that ARPs have to be periodically forwarded to
   the registry to resolve. This isn't a scaling problem for the router
   control plane, but a scaling issue for the central registry. Note
   that the ARP volume on an L2 network can be huge. Forwarding ARPs to
   a central


Dalela                  Expires June 30, 2012                 [Page 17]

Internet-Draft          Datacenter Approaches             December 2011


   registry therefore needs to be handled with care. Of course, this
   central registry can as well be a load-balanced cluster of many nodes
   that share the data between them. That way, the ARP load can be
   dynamically addressed as the scale increases.

5.6. Cloud Control Approaches

   Cloud control comprises several functions, including discovery
   within and across sites, orchestration of resources, debugging and
   analytics. There are two broad approaches to cloud control. First,
   the cloud control is built with application level APIs, such as HTTP
   based web-services (SOAP or REST). Second, the cloud control is
   embedded as a network protocol and closely tied to other network
   functions. These approaches are discussed below.

5.6.1. Application APIs

   An application API is a client-server model of communication. These
   APIs when run over HTTP have the advantage that they can cross
   firewalls. They are easy to implement and directly expose developer
   level constructs for software programming.

   There are however some limitations in the use of APIs. First, every
   API projects the application view of information into the network
   (the packet format is constructed from the API format). In the longer
   term, this means APIs will generally not interoperate, because of
   semantic and syntactical differences. If we converge upon a single
   API standard, services deployed using existing APIs will not work.
   Second, APIs as client-server constructs don't facilitate discovery,
   which depends on broadcast and solicitation, prior to knowing the IP
   or DNS of the endpoints. Third, APIs don't facilitate transactions:
   there is no ability to ask questions half-way through a transaction,
   or to commit or cancel it in case of failures. An API may hang, and
   closing the connection may result in leaked resources. Fourth, APIs
   don't facilitate policy control at the network edges, which is very
   important when connecting private and public domains or two public
   domains. Fifth, it is harder to build single sign-on capabilities
   with APIs, because API authentication depends on the server, which
   needs to have the user's credentials, although these credentials may
   not be shared across different administrative boundaries.

   Even more important than the above issues is that API orchestration
   is generally unaware of network topology. When orchestrating a
   distributed system it is very important to know the topology. For
   instance, if a VM is being allocated, bandwidth may need to be
   reserved on the path. Likewise, if a VM is being moved, appropriate


Dalela                  Expires June 30, 2012                 [Page 18]

Internet-Draft          Datacenter Approaches             December 2011


   policies like QoS and security need to be dragged along. Firewall
   rules may need to be installed in the path to the VM. In case of
   disaster recovery, it is important to know which paths packets will
   take to the new destination. All these things require a view of the
   network topology, both logical and physical. It isn't enough to know
   the IP addresses of the various devices; the paths must be known too.

5.6.2. Network Protocol Approach

   The network topology is known inside the network itself. A close
   coupling between
   the network state and the orchestration is needed for effective
   orchestration. A significant portion of orchestration is making the
   decision about the location of a service based on whether capacity is
   available. This includes compute, network, storage, security, etc.
   Orchestration across these multiple domains cannot be done without a
   good knowledge of network topology. A close coupling between network
   and orchestration is also needed to debug performance issues, or when
   services aren't being created in the desired manner.

   This close coupling between network and orchestration is easily
   achieved if the orchestration is embedded in the network because then
   it can easily access the network state such as the location of
   devices, the shortest paths, bandwidth availability, etc.

   To achieve this, a standard protocol is needed to orchestrate multi-
   domain services. This protocol can be used by all existing APIs or
   even new ones. The protocol will represent the network view of
   information while APIs represent the application view. Protocols have
   always been used in the Internet for interoperability. Using such
   protocols it would be possible to interoperate currently incompatible
   APIs. For instance, different APIs could be used in private and
   public domain as long as they exchange information using a common
   protocol. Protocols also facilitate easy discovery using mechanisms
   such as broadcast and multicast, reducing the configuration overhead.
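
   As a contrast with the client-server API model, the Python sketch
   below shows protocol-style discovery over IP multicast; the group
   address, port and payload are illustrative assumptions only:

      import socket
      import struct

      GROUP, PORT = "239.255.0.1", 30000   # assumed, unassigned values

      def announce(service):               # a node advertises itself
          s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
          s.sendto(service.encode(), (GROUP, PORT))

      def listen():                        # a peer discovers it
          s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          s.bind(("", PORT))
          mreq = struct.pack("4sl", socket.inet_aton(GROUP),
                             socket.INADDR_ANY)
          s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                       mreq)
          return s.recvfrom(1500)          # (payload, sender address)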

6. Recommendations

   Based on the discussion above, the scaling properties of various
   mobility solutions are listed below. There are four types of scaling
   issues discussed so far: (a) datacenter access, (b) datacenter
   interconnect, (c) datacenter-internet, and (d) datacenter core.

   These functions can be combined in the same network device, or may be
   kept separate. Logical separation allows for a clearer discussion of
   the scaling attributes of these functions. Further reasoning for
   keeping these functions separate is described in greater detail below.



Dalela                  Expires June 30, 2012                 [Page 19]

Internet-Draft          Datacenter Approaches             December 2011


   The table below summarizes the host-route issues in the various
   scenarios at the various points in the network.

   +-------------------------------------------------------------------+
   |        Switch Scaling Requirements for Datacenter Mobility        |
   +-------------------------------------------------------------------+
   |   Approach     |   Access  |   Core   | Interconnect |  Internet  |
   +===================================================================+
   |  Vanilla L2    |    HIGH   |  MASSIVE |    MASSIVE   |    HIGH    |
   +-------------------------------------------------------------------+
   | L2/L3 Encap    |    HIGH   |   LOW    |    MASSIVE   |    HIGH    |
   | w/ separate    |           |          |              |            |
   | DC and Inter-DC|           |          |              |            |
   | approaches     |           |          |              |            |
   +-------------------------------------------------------------------+
   | L2/L3 Encap    |    HIGH   |   LOW    |      LOW     |    HIGH    |
   | w/ identical   |           |          |              |            |
   | DC and Inter-DC|           |          |              |            |
   | approaches     |           |          |              |            |
   +-------------------------------------------------------------------+
   | Hierarchical   |    LOW    |   LOW    |      LOW     |    HIGH    |
   | MAC addressing |           |          |              |            |
   +-------------------------------------------------------------------+

           Table-1: Scaling Comparison of Datacenter Approaches

   From the above, we can see that Hierarchical MAC addressing fares
   better than all other approaches. The only place it has a high entry
   requirement is at the datacenter-internet boundary. This issue can be
   addressed by distributing the entries over multiple core switches,
   since the boundary only involves north-south traffic and does not
   need ECMP.

   Based on this analysis, the following conclusions can be arrived at,
   as recommendations for further work:

   -  It is important to distinguish the datacenter interconnect
      boundary, the datacenter-internet boundary, the datacenter core
      and the access from a scaling perspective. This is because
      private addresses can be advertised between datacenters, but they
      can't be advertised into the internet. At the internet boundary
      north-south traffic is required, but at the core, east-west
      traffic is required.

   -  The technology within and between datacenters should be identical.
     This allows us to treat datacenter interconnects similar to the
     datacenter core and interconnects can be scaled easily using
     common techniques. Interconnects can use MPLS VPNs and a cloud can
     be treated as a new "site" for private networks.


Dalela                  Expires June 30, 2012                 [Page 20]

Internet-Draft          Datacenter Approaches             December 2011


   -  Hierarchical MACs offer the best scaling and mobility properties.
     They will lead to the most scalable network designs. The scaling
     properties are particularly important at access because of the
     huge number of access devices in the datacenter.

   -  Hierarchical MAC assignments could be manual or could be done
     automatically using a new protocol. The new protocol could include
     just switch/router level or even host level assignments.

   -  Hierarchical MACs (when combined with DCB) can also be used to
     consolidate TCP/IP, Storage and IPC traffic over Ethernet. If DCB
     is not available, then iSCSI and iWARP can be used over L2
     forwarding. This affords the best scaling properties in the
      interim. Over time, when DCB is available, datacenters can move to
      consolidating FC and IB traffic over Ethernet.

   -  A hybrid discovery approach that separates host and network
     address discovery needs to be used to maintain network resiliency.
     Routing protocols will do network discovery while ARP should be
     used for host location discovery. This gives the best results for
     both the forwarding and control plane scale.

   -  ARP scaling is a control plane scaling issue and should be
     addressed through central registries. A new protocol is required
     to interact with the registry. This protocol must have mechanisms
     to query and update the registry. This protocol must also support
     installing temporary redirects (can be done through updates).

   -  Segmentation must involve an identifier orthogonal to the VLAN
     tag, because this can easily overlap across boundaries. Given the
     use of L2 networks, the tag should be just above the Ethernet
      layer. MPLS is a layer 2.5 technology that can be used. Note that
     it does not require label switching inside the datacenter to use
     these tags, because packets will still be forwarded using MAC
     addresses. MPLS tags will only identify various tenants, and are
     to be treated just like VLAN tags, although in a separate space.
     Full VLAN range (including Q-in-Q) will be available for each
     tenant. MPLS already segments customers in the Internet.

   -  Cloud control needs a protocol that runs parallel to other network
     protocols to facilitate discovery through broadcast or multicast.
     A close coupling between the orchestration and networking
     functions can be achieved if this protocol runs in the network.
     This does not hinder use of variety of API formats. But, it gives
     mechanisms to provide a better intelligence into orchestration.




Dalela                  Expires June 30, 2012                 [Page 21]

Internet-Draft          Datacenter Approaches             December 2011


7. Network Architecture

   This section is illustrative only. We have already shown that the
   different datacenter functions (access, core, interconnect and
   internet boundary) have different scaling properties, with different
   types of datacenter approaches. This section shows how these
   functions can be integrated together. Treating these functions
   separately allows independent assessment of scale needs.

       +--------+      +--------+       +--------+      +--------+
       |  Core  |      |  Core  |       |  Core  |      |  Core  |
       +--------+      +--------+       +--------+      +--------+

                            ....................
                                 ECMP Mesh
                            ....................

  +------+  +------+  +----+  +----+  +----+  +----+  +------+  +------+
  | DC-I |  | DC-I |  | AC |  | AC |  | AC |  | AC |  | L3-I |  | L3-I |
  +------+  +------+  +----+  +----+  +----+  +----+  +------+  +------+

                Figure-1: Illustrative Network Architecture

   In the above picture, "Core" represents the datacenter core with
   links to all DC-I, L3-I and AC. This allows any to any connectivity
   between Access, Interconnect and Internet boundaries. "DC-I" is the
   Datacenter Interconnect between various datacenters. "AC" represents
   all the access switches. An aggregation layer is not shown, but
   could be present depending on the scaling needs. The "L3-I"
   represents the L3 Internet termination at the datacenter boundary.

   Note that a large datacenter will have several thousand Access
   switches and a few dozen Core switches. The number of L3-I switches
   depends on the extent to which the network faces traffic from the
   Internet. If this was an HPC cloud, the Internet traffic would be
   very small. If this was a Web 2.0 cloud, the Internet traffic would
   be a higher percentage of the total traffic. If this was a hosted
   public cloud with small and medium sized applications, most of the
   traffic would be north-south and concentrated at L3-I. Accordingly
   the L3-I function needs to be scaled independently.

   Similarly, the extent of the DC-I function depends on the number of
   datacenters being connected and the inter-datacenter traffic. In case
   of extensive site-to-site mobility or in the case of hybrid cloud,
   this function would be heavily loaded. If there is no site-to-site
   mobility or no hybrid clouds, the traffic here would be low.



Dalela                  Expires June 30, 2012                 [Page 22]

Internet-Draft          Datacenter Approaches             December 2011


8. Security Considerations

   NA

9. IANA Considerations

   NA

10. Conclusions

   This document analyzed multiple approaches that can be adopted to
   address datacenter issues, and makes recommendations on a consistent
   overall approach. These recommendations can be used to further
   discuss and develop solutions to cloud datacenter problems in a
   holistic manner.

11. References

11.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

11.2. Informative References

   [REQ] Datacenter Network and Operations Requirements
             http://www.ietf.org/id/draft-dalela-dc-requirements-00.txt

12. Acknowledgments

   This document was prepared using 2-Word-v2.0.template.dot.



















Dalela                  Expires June 30, 2012                 [Page 23]

Internet-Draft          Datacenter Approaches             December 2011


   Authors' Addresses

   Ashish Dalela
   Cisco Systems
   Cessna Business Park
   Bangalore
   India 560037

   Email: adalela@cisco.com








































Dalela                  Expires June 30, 2012                 [Page 24]