Network Working Group                                             X. Xu
Internet-Draft                                      Huawei Technologies
Category: Standards Track                                  July 2, 2010
Expires: January 2, 2011


     Virtual Subnet: A Scalable Data Center Network Architecture
                      draft-xu-virtual-subnet-00

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 2, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Abstract

   This document proposes a scalable data center network architecture
   which, as an alternative to the Spanning Tree Protocol (STP) bridged
   network, uses a Layer 3 routing infrastructure to provide scalable
   virtual Layer 2 network connectivity services.

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Table of Contents

   1. Problem Statement
   2. Terminology
   3. Design Goals
   4. Architecture Description
      4.1. Unicast
         4.1.1. Communications within a Service Domain
         4.1.2. Communications between Service Domains
      4.2. Multicast/Broadcast
      4.3. ARP Cache
      4.4. ARP Proxy
      4.5. DHCP Relay Agent
   5. Conclusions
   6. Limitations
   7. Future Work
   8. Security Considerations
   9. IANA Considerations
   10. Acknowledgements
   11. References
      11.1. Normative References
      11.2. Informative References
   Authors' Addresses
1. Problem Statement

   With the popularity of cloud computing services, today's data
   centers keep growing in scale, and data centers containing tens to
   hundreds of thousands of servers are not uncommon.  In most data
   centers, server clusters consisting of large numbers of servers are
   widely used for load balancing.  Many popular server cluster
   applications currently require that the servers within a given
   cluster be located within the same IP subnet.  In addition, Virtual
   Machine (VM) migration technologies are widely used to achieve
   service agility, which requires a VM to be able to migrate to any
   physical server while keeping its IP address unchanged.  Moreover,
   many data center applications (e.g., server clustering and grid
   computing) result in increased server-to-server traffic.  Hence, a
   scalable and high-bandwidth Layer 2 network connectivity service is
   desired for the interconnection of servers within huge data centers.

   Unfortunately, today's data center network architecture, which
   relies on Spanning Tree Protocol (STP) bridging technology, cannot
   address the above challenges (i.e., very large numbers of servers
   and high-bandwidth demands for server-to-server interconnection)
   facing large-scale data centers.  First, STP cannot maximize the
   utilization of the total network bandwidth to provide enough
   capacity between servers, since it computes only a single forwarding
   tree for all connected servers and cannot support Equal-Cost
   Multipath (ECMP) forwarding.  Second, the scalability of the
   forwarding table becomes a major concern as an already large Layer 2
   network scales even further, since STP bridge forwarding depends on
   flat MAC addresses.  Third, the impact of broadcast storms on
   network performance becomes increasingly serious and unpredictable
   in a continually growing large-scale STP bridged network.

2. Terminology

   This memo makes use of the terms defined in [RFC4364], [MVPN],
   [RFC2236] and [RFC2131].  The following term is specific to this
   document:

   - Service Domain: A group of servers dedicated to a given service.
     In most cases, the servers of a service domain are located within
     a single IP subnet.

3. Design Goals

   To overcome the limitations of the STP bridged network, a new
   network architecture for data centers SHOULD meet the following
   design objectives:

   - Bandwidth Utilization Maximization

     To provide enough bandwidth between servers, server-to-server
     traffic SHOULD always travel along the shortest path, with Equal-
     Cost Multipath (ECMP) load-sharing wherever multiple shortest
     paths exist.

   - Layer 2 Semantics

     To remain backward compatible with current data center
     applications (e.g., clustering, VM migration, etc.), the servers
     of a given service domain SHOULD be connected as if they were on a
     Local Area Network (LAN) or an IP subnet.

   - Domain Isolation

     For performance isolation and security reasons, servers of
     different service domains SHOULD be isolated just as if they were
     on different VLANs.

   - Forwarding Table Scalability

     To accommodate tens to hundreds of thousands of servers within a
     given data center, the forwarding table of each forwarding element
     in the data center network SHOULD scale accordingly.

   - Broadcast Storm Suppression

     To reduce the impact of broadcast storms on network performance,
     broadcast/multicast traffic SHOULD be confined to as small a scope
     of the network as possible.
4. Architecture Description

   Generally speaking, this new data center network architecture
   partially takes advantage of BGP/MPLS VPN technology [RFC4364] to
   construct a scalable virtual IP subnet across the MPLS/IP backbone
   network of a data center.  The following sections describe the
   architecture in detail.

4.1. Unicast

4.1.1. Communications within a Service Domain

   BGP/MPLS VPN technology with some extensions is deployed in the data
   center network as an alternative to STP bridging.  Hosts, acting as
   Customer Edge (CE) devices, are attached to Provider Edge (PE)
   routers either directly or through a Layer 2 switch.  Here,
   different service domains are mapped to distinct VPNs so as to
   achieve domain isolation, and the different sites of a particular
   VPN are configured with an identical IP subnet to achieve service
   agility; that is to say, all PEs attached to the same VPN are
   configured with an identical IP subnet on the corresponding Virtual
   Routing and Forwarding (VRF) attachment circuits.  PEs automatically
   generate connected host routes for each VRF according to the Address
   Resolution Protocol (ARP) table of the corresponding VPN, and then
   exchange these connected host routes with each other via BGP.  ARP
   proxy is enabled for each VPN on the PEs; thus, upon receiving an
   ARP request from a local host for a remote host, the PE, acting as
   an ARP proxy, returns one of its own MAC addresses as the response.

   +-----+       +------+                        +------+       +-----+
   |  A  +-------+ PE-1 +------------------------+ PE-2 +-------+  B  |
   +-----+       +--+---+    IP/MPLS Backbone    +------+       +-----+
  10.1.1.1/32       |                                         10.1.1.2/32
   (VPN_A: 10/8)    |                                        (VPN_A: 10/8)
                    V
         +--------+-------------+----------+
         | VRF ID | Destination | Next Hop |
         +--------+-------------+----------+
         | VPN_A  | 10.1.1.1/32 | Local    |
         +--------+-------------+----------+
         | VPN_A  | 10.1.1.2/32 | PE-2     |
         +--------+-------------+----------+

          Figure 1: Intra-domain Communication Example

   As shown in Figure 1, host A sends an ARP request for host B before
   communicating with B.  Upon receiving this ARP request, PE-1 looks
   up the associated VRF for a host route to B.  If one is found, PE-1,
   acting as an ARP proxy, returns one of its own MAC addresses as a
   response to that ARP request; otherwise, no ARP response SHOULD be
   returned.  After obtaining the ARP response from PE-1, A sends an IP
   packet destined for B whose destination MAC address is PE-1's MAC
   address.  Upon receiving this IP packet, PE-1, acting as an ingress
   PE, tunnels the packet towards PE-2 according to the associated VRF.
   PE-2, as an egress PE, in turn forwards the packet to B according to
   the associated VRF.

   In short, this is a special BGP/MPLS VPN application scenario in
   which the connected host routes of each VRF are automatically
   generated according to the ARP table of the corresponding VPN and
   are exchanged among the PEs attached to the same VPN.
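   The per-PE behavior described above can be summarized in pseudocode.
   The following Python sketch is purely illustrative: the data
   structures (a route table and an ARP table per VPN) and all function
   names are assumptions made for this example, not part of any
   specified implementation.

      # Hypothetical sketch of the Section 4.1.1 PE behavior.

      vrf_routes = {}  # vrf_routes[vpn][prefix] -> "Local" or remote PE
      arp_tables = {}  # arp_tables[vpn][ip]     -> MAC of a local host

      def advertise_via_bgp(vpn, prefix):
          # Stub: a real PE would send a BGP UPDATE carrying this
          # VPN-IPv4 host route to the other PEs of the same VPN.
          print("BGP UPDATE:", vpn, prefix)

      def on_local_arp_entry(vpn, host_ip, host_mac):
          """A local host appears in the ARP table of a VPN: generate
          a connected /32 host route in the VRF and advertise it."""
          arp_tables.setdefault(vpn, {})[host_ip] = host_mac
          vrf_routes.setdefault(vpn, {})[host_ip + "/32"] = "Local"
          advertise_via_bgp(vpn, host_ip + "/32")

      def on_arp_request(vpn, requested_ip, pe_mac):
          """ARP proxy: reply with one of the PE's own MAC addresses
          only if a host route for the requested address points at a
          remote PE (see Section 4.4); otherwise stay silent."""
          next_hop = vrf_routes.get(vpn, {}).get(requested_ip + "/32")
          if next_hop is not None and next_hop != "Local":
              return pe_mac
          return None  # no route known, or host is local: no response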
4.1.2. Communications between Service Domains

   For hosts located in different VPNs (i.e., service domains) to
   communicate with each other, the VPNs SHOULD NOT have overlapping
   address spaces.  Besides, each VPN SHOULD be configured with at
   least one default route; e.g., the default gateway router of a given
   VPN is connected to a PE attached to that VPN, on which a default
   route SHOULD be configured in the associated VRF and then advertised
   to the other PEs of that VPN.

   As shown in Figure 2, PE-1 and PE-3 are attached to one VPN (i.e.,
   VPN_A) while PE-2 and PE-4 are attached to another VPN (i.e.,
   VPN_B).  Host A and its default gateway router (i.e., GW-1) are
   connected to PE-1 and PE-3, respectively.  Host B and its default
   gateway router (i.e., GW-2) are connected to PE-2 and PE-4,
   respectively.  A sends an ARP request for its default gateway (i.e.,
   10.1.1.1) before communicating with B.  Upon receiving this ARP
   request, PE-1 looks up the associated VRF for the host route of the
   default gateway.  If it is found, PE-1, acting as an ARP proxy,
   returns one of its own MAC addresses as the response.  After
   obtaining the ARP response, A constructs an IP packet destined for
   B, encapsulates it in an Ethernet frame whose destination MAC
   address is PE-1's MAC address, and sends it out.  Upon receiving
   this packet, PE-1, acting as an ingress PE, tunnels it towards PE-3
   according to the best-match route for that packet (i.e., the default
   route) in the associated VRF.  PE-3, as an egress PE, in turn
   forwards this packet towards the default gateway router (i.e., GW-1)
   according to the best-match route for that packet (i.e., the
   configured default route).  After the packet arrives at the default
   gateway router of B (i.e., GW-2) through hop-by-hop forwarding, GW-2
   sends an ARP request for B.  PE-4, acting as an ARP proxy, returns
   one of its own MAC addresses as the response.  GW-2 then
   encapsulates the IP packet within an Ethernet frame whose
   destination MAC address is PE-4's MAC address and sends it out.
   Upon receiving this packet, PE-4, acting as an ingress PE, tunnels
   it towards PE-2 according to the associated VRF.  PE-2, as an egress
   PE, in turn forwards it towards B according to the associated VRF.
 PE-3's VRF:                            PE-4's VRF:
 +--------+-------------+----------+    +--------+-------------+----------+
 | VRF ID | Destination | Next Hop |    | VRF ID | Destination | Next Hop |
 +--------+-------------+----------+    +--------+-------------+----------+
 | VPN_A  | 10.1.1.2/32 | PE-1     |    | VPN_B  | 20.1.1.2/32 | PE-2     |
 +--------+-------------+----------+    +--------+-------------+----------+
 | VPN_A  | 10.1.1.1/32 | Local    |    | VPN_B  | 20.1.1.1/32 | Local    |
 +--------+-------------+----------+    +--------+-------------+----------+
 | VPN_A  | 0.0.0.0/0   | 10.1.1.1 |    | VPN_B  | 0.0.0.0/0   | 20.1.1.1 |
 +--------+-------------+----------+    +--------+-------------+----------+

                     +--------------+
                     |  IP Network  |
                     +--+--------+--+
                        |        |
                    +---+--+  +--+---+
                    | GW-1 |  | GW-2 |
                    +---+--+  +--+---+
             10.1.1.1/32|        |20.1.1.1/32
                (VPN_A) |        | (VPN_B)
                    +---+--+  +--+---+
                    | PE-3 |  | PE-4 |
                    +---+--+  +--+---+
                        |        |
                    (IP/MPLS Backbone)
                        |        |
                    +---+--+  +--+---+
   +-----+          | PE-1 |  | PE-2 |          +-----+
   |  A  +----------+      |  |      +----------+  B  |
   +-----+          +------+  +------+          +-----+
  10.1.1.2/32                                 20.1.1.2/32
   (VPN_A)                                       (VPN_B)

 PE-1's VRF:                            PE-2's VRF:
 +--------+-------------+----------+    +--------+-------------+----------+
 | VRF ID | Destination | Next Hop |    | VRF ID | Destination | Next Hop |
 +--------+-------------+----------+    +--------+-------------+----------+
 | VPN_A  | 10.1.1.2/32 | Local    |    | VPN_B  | 20.1.1.2/32 | Local    |
 +--------+-------------+----------+    +--------+-------------+----------+
 | VPN_A  | 10.1.1.1/32 | PE-3     |    | VPN_B  | 20.1.1.1/32 | PE-4     |
 +--------+-------------+----------+    +--------+-------------+----------+
 | VPN_A  | 0.0.0.0/0   | PE-3     |    | VPN_B  | 0.0.0.0/0   | PE-4     |
 +--------+-------------+----------+    +--------+-------------+----------+

             Figure 2: Inter-domain Communication Example
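   The forwarding decision at each ingress PE in Figure 2 amounts to a
   longest-prefix match in the VRF: host routes win where they exist,
   and the default route catches everything else.  The following is a
   minimal Python sketch of that lookup, assuming the VRF is modeled as
   a plain dictionary; the function name and representation are
   assumptions of this example only.

      import ipaddress

      # PE-1's VRF for VPN_A, copied from Figure 2.
      pe1_vrf_vpn_a = {
          "10.1.1.2/32": "Local",
          "10.1.1.1/32": "PE-3",
          "0.0.0.0/0":   "PE-3",
      }

      def best_match(vrf, dst_ip):
          """Longest-prefix match, as an ingress PE would perform it
          when deciding where to tunnel a packet."""
          dst = ipaddress.ip_address(dst_ip)
          candidates = [ipaddress.ip_network(p) for p in vrf
                        if dst in ipaddress.ip_network(p)]
          longest = max(candidates, key=lambda n: n.prefixlen)
          return vrf[str(longest)]

      # A packet from host A (10.1.1.2) to host B (20.1.1.2) matches
      # only the default route, so PE-1 tunnels it to PE-3, where GW-1
      # routes it onwards; local traffic matches the /32 host route.
      assert best_match(pe1_vrf_vpn_a, "20.1.1.2") == "PE-3"
      assert best_match(pe1_vrf_vpn_a, "10.1.1.2") == "Local"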
4.2. Multicast/Broadcast

   The MVPN technology [MVPN], especially the Protocol Independent
   Multicast (PIM) tree option with some extensions, is partially
   reused here to support link-local multicast between the hosts of a
   given service domain (i.e., VPN).  That is to say, the customer
   multicast group addresses of a given VPN are mapped 1:1 or n:1 to
   provider multicast groups dedicated to that VPN when customer
   multicast traffic is transported across the backbone.  For
   broadcast, a dedicated provider multicast group is reserved for
   carrying broadcast traffic across the IP/MPLS backbone; in other
   words, customer broadcast is processed on the PEs as a special
   customer multicast group.  Unless otherwise stated, the term
   "customer multicast" below covers both customer multicast and
   customer broadcast.  All PEs attached to a given VPN SHOULD maintain
   identical mappings from customer multicast group addresses to
   provider multicast group addresses.  To isolate the customer
   multicast traffic of different VPNs traveling through the backbone,
   different VPNs SHOULD be assigned distinct, non-overlapping provider
   multicast group address ranges.

   +-----+    E0 +------+                        +------+       +-----+
   |  A  +-------+ PE-1 +------------------------+ PE-2 +-------+  B  |
   +-----+       +--+---+    IP/MPLS Backbone    +------+       +-----+
  10.1.1.1/32       |                                         10.1.1.2/32
   (VPN_A: 10/8)    |                                        (VPN_A: 10/8)
                    V
     +--------+-----------------+------------+-------+---------+
     | VRF ID | Customer G      | Provider G | To PE | From PE |
     +--------+-----------------+------------+-------+---------+
     | VPN_A  | 224.1.1.1/32    | 239.1.1.1  | True  | True    |
     +--------+-----------------+------------+-------+---------+
     | VPN_A  | 224.0.0.0/4     | 239.1.1.2  | True  | True    |
     +--------+-----------------+------------+-------+---------+
     | VPN_A  | 255.255.255.255 | 239.1.1.3  | True  | True    |
     +--------+-----------------+------------+-------+---------+

    Figure 3: Link-local Multicast/Broadcast Communication Example

   A multicast forwarding entry can be configured manually by the
   network operator or generated dynamically from the Internet Group
   Management Protocol (IGMP) Membership Report/Leave messages received
   from CEs or remote PEs.  An ingress PE forwards customer multicast
   packets to the other PEs (i.e., egress PEs) of the same VPN via a
   provider multicast distribution tree, according to the best-match
   multicast forwarding entry of the associated VRF, provided that the
   "To PE" field of that entry is set to True.  Otherwise (i.e., if
   that field is set to False), the ingress PE is not allowed to
   forward the customer multicast packets to remote egress PEs.  An
   egress PE forwards customer multicast packets received from the
   provider multicast distribution tree to CEs via VRF attachment
   circuits, according to the best-match multicast forwarding entry of
   the associated VRF, provided that the "From PE" field of that entry
   is set to True.  Otherwise (i.e., if that field is set to False),
   the egress PE is not allowed to forward the customer multicast
   packets to CEs.

   For IGMP messages to be conveyed successfully across the IP/MPLS
   backbone, multicast forwarding entries for certain special multicast
   groups, including the all-routers group (i.e., 224.0.0.2) and the
   all-systems group (i.e., 224.0.0.1), SHOULD be configured in the
   corresponding VRF in advance.  Besides, according to the IGMP
   specification [RFC2236], Group-Specific Query messages are sent to
   the group being queried and Membership Report messages are sent to
   the group being reported.  Upon receiving such packets from CEs, the
   PE SHOULD convey them over the provider multicast distribution tree
   dedicated to the all-systems group (224.0.0.1) of the given VRF.  To
   avoid IGMP Membership Report suppression, Membership Report messages
   received from PEs or CEs SHOULD NOT be forwarded to CEs.  As an
   alternative to conveying IGMP Report/Leave messages through the
   provider multicast distribution tree, customer multicast routing
   information can also be exchanged among PEs using the approaches
   defined in [MVPN-BGP].

   As shown in Figure 3, upon receiving a multicast/broadcast packet
   from a CE (e.g., host A), PE-1 proceeds as follows: if the packet is
   destined for 224.1.1.1, PE-1 encapsulates it into a provider
   multicast packet with a destination IP address of 239.1.1.1; if it
   is destined for an IP multicast address other than 224.1.1.1, PE-1
   encapsulates it into a provider multicast packet with a destination
   IP address of 239.1.1.2; and if it is a broadcast packet, PE-1
   encapsulates it into a provider multicast packet with a destination
   IP address of 239.1.1.3, which is dedicated to conveying the
   broadcast traffic of that VPN.
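   The encapsulation choice just described is a most-specific-match
   lookup in the mapping table of Figure 3.  The Python sketch below is
   illustrative only: the list layout and the helper name are
   assumptions for this example, not prescribed by this document.

      import ipaddress

      # Mapping table for VPN_A, copied from Figure 3.
      group_map_vpn_a = [
          # (customer group/range, provider group, to_pe, from_pe)
          ("224.1.1.1/32",       "239.1.1.1", True, True),
          ("224.0.0.0/4",        "239.1.1.2", True, True),
          ("255.255.255.255/32", "239.1.1.3", True, True),
      ]

      def provider_group(dst_ip):
          """Pick the provider group for a customer packet: the most
          specific matching entry wins, mirroring the best-match
          lookup described above."""
          dst = ipaddress.ip_address(dst_ip)
          matches = [(ipaddress.ip_network(c), p, to_pe)
                     for c, p, to_pe, _ in group_map_vpn_a
                     if dst in ipaddress.ip_network(c)]
          net, prov, to_pe = max(matches, key=lambda m: m[0].prefixlen)
          return prov if to_pe else None  # None: kept off the backbone

      assert provider_group("224.1.1.1") == "239.1.1.1"
      assert provider_group("224.5.5.5") == "239.1.1.2"
      assert provider_group("255.255.255.255") == "239.1.1.3"

   Under these assumptions, 224.1.1.1 hits its exact entry, any other
   multicast group falls into 224.0.0.0/4, and broadcast maps to the
   group reserved for it.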
   The customer multicast forwarding entries, whether configured
   manually or learnt dynamically from the IGMP Membership Reports sent
   by local CEs, automatically trigger the PEs to join the
   corresponding provider multicast groups in the MPLS/IP backbone.
   For example, if PE-2 receives an IGMP Membership Report for a given
   customer multicast group (e.g., 224.1.1.1) from a local CE (e.g.,
   host B), it SHOULD automatically join the provider multicast group
   (i.e., 239.1.1.1) corresponding to that customer multicast group.

4.3. ARP Cache

   After rebooting, a PE SHOULD send ARP requests for every IP address
   within the subnet of each VPN.  Thus, after one round of ARP
   requests, the PE learns all local hosts from the received ARP
   responses.  Afterwards, the PE sends unicast ARP requests to the
   already-learnt local hosts to keep the learnt ARP entries from
   expiring.  Upon receiving a gratuitous ARP from a local host, the PE
   SHOULD cache it in the ARP table of the corresponding VPN if such an
   entry does not exist yet.

   When a PE receives from a remote PE a host route for one of its own
   local hosts, it SHOULD immediately send a unicast ARP request to
   that local host to check whether the host is still attached locally.
   If no response is received (consider the VM migration scenario), the
   PE SHOULD delete the ARP entry for that host from its ARP table and
   then withdraw the corresponding connected host route.  If an ARP
   response is received (consider the host multi-homing scenario), the
   PE just updates the ARP entry for that local host as usual.
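   The consistency check described in the previous paragraph can be
   sketched as follows.  This Python fragment is purely illustrative:
   probe(), the table layouts, and the withdrawal step are placeholders
   for a PE's real ARP and BGP machinery, none of which this document
   specifies.

      import time

      def probe(vpn, host_ip):
          # Stub: send a unicast ARP request on the VRF attachment
          # circuit and wait briefly for a reply.  Here we pretend the
          # host did not answer, i.e., the VM has migrated away.
          return False

      def on_remote_host_route(arp_table, vrf, vpn, host_ip):
          """A remote PE advertised a host route for one of our local
          hosts: verify the host is still local, else clean up."""
          if probe(vpn, host_ip):
              # Multi-homing case: host is still here, refresh entry.
              arp_table[(vpn, host_ip)] = time.time()
          else:
              # Migration case: drop the stale ARP entry, remove the
              # connected /32, and withdraw it from BGP.
              arp_table.pop((vpn, host_ip), None)
              vrf.pop(host_ip + "/32", None)

      arp = {("VPN_A", "10.1.1.1"): time.time()}
      vrf = {"10.1.1.1/32": "Local"}
      on_remote_host_route(arp, vrf, "VPN_A", "10.1.1.1")
      assert "10.1.1.1/32" not in vrf  # route gone after failed probe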
4.4. ARP Proxy

   A PE, acting as an ARP proxy, only responds to ARP requests for the
   remote hosts that have been learnt from other PEs.  That is to say,
   the ARP proxy SHOULD NOT respond to ARP requests for local hosts.
   Otherwise, if the ARP response from the PE overrode the one from the
   actually requested host, packets destined for that local host would
   have to be relayed by the PE, which is the so-called hairpin issue.
   When VRRP [RFC2338] is enabled together with the ARP proxy, only the
   VRRP master is delegated to act as the ARP proxy, and it returns the
   VRRP virtual MAC address as the response.

4.5. DHCP Relay Agent

   To prevent Dynamic Host Configuration Protocol (DHCP) [RFC2131]
   broadcast messages from flooding the whole data center network, the
   DHCP relay agent function can be enabled on the PEs.  In this way,
   the DHCP broadcast messages from DHCP clients (i.e., local CE hosts)
   are transformed by the relay agents (i.e., the PEs) into unicast
   messages and then forwarded to the DHCP servers.

5. Conclusions

   By using Layer 3 routing instead of STP bridge forwarding in the
   backbone of the data center network, traffic between any two servers
   is forwarded along the shortest path between them, and ECMP is
   easily achieved.  Thus, the total network bandwidth of the data
   center network is utilized to the maximum extent.  By reusing
   BGP/MPLS VPN to exchange the host routes of a given VPN among PEs,
   the servers of that VPN can communicate with each other just as if
   they were located within a single LAN or IP subnet.

   Due to the tunnels used in MPLS/BGP VPN, the forwarding tables of P
   routers only need to hold reachability information for the tunnel
   endpoints (i.e., the PEs).  Meanwhile, the forwarding tables of PE
   routers can also be kept scalable by distributing VPNs among
   different PEs; that is to say, a given PE router only needs to hold
   the routing tables of those VPNs to which it is attached.  Thus, the
   forwarding table scalability issues of data center networks are
   largely alleviated.

   By enabling the ARP proxy function on PEs, ARP broadcast messages
   from local CE hosts are terminated on the attached PEs and are not
   flooded through the whole data center network.  Likewise, by
   enabling the DHCP relay agent function on PEs, DHCP broadcast
   messages from DHCP clients (i.e., local CE hosts) are transformed
   into unicast messages by the relay agents (i.e., the PEs) and then
   forwarded to the DHCP servers.  Thus, broadcast storms in the data
   center network are largely suppressed.

6. Limitations

   Since the data center network architecture described in this
   document partially reuses BGP/MPLS VPN technology to construct a
   large-scale IP subnet rather than a real LAN, non-IP traffic cannot
   be supported in this architecture.  However, we believe IP is the
   dominant communication protocol in today's data center networks,
   and that non-IP legacy applications will disappear from the data
   center network over time.

7. Future Work

   If necessary, IS-IS or OSPF could also be extended to support a
   function similar to the special BGP/MPLS VPN application described
   in this document.  In addition, IPv6 data center networks will be
   considered as part of future work.

8. Security Considerations

   TBD.

9. IANA Considerations

   This document makes no request of IANA.

10. Acknowledgements

   Thanks to Dacheng Zhang for his editorial review.

11. References

11.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

11.2. Informative References

   [RFC4364]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
              Networks (VPNs)", RFC 4364, February 2006.

   [MVPN]     Rosen, E. and R. Aggarwal, "Multicast in MPLS/BGP IP
              VPNs", draft-ietf-l3vpn-2547bis-mcast-10.txt (work in
              progress), January 2010.

   [MVPN-BGP] Aggarwal, R., Rosen, E., Morin, T., Rekhter, Y., and C.
              Kodeboniya, "BGP Encodings for Multicast in MPLS/BGP IP
              VPNs", draft-ietf-l3vpn-2547bis-mcast-bgp-08.txt (work
              in progress), September 2009.

   [RFC2338]  Knight, S., et al., "Virtual Router Redundancy Protocol",
              RFC 2338, April 1998.

   [RFC2131]  Droms, R., "Dynamic Host Configuration Protocol",
              RFC 2131, March 1997.

   [RFC2236]  Fenner, W., "Internet Group Management Protocol, Version
              2", RFC 2236, November 1997.

Authors' Addresses

   Xiaohu Xu
   Huawei Technologies
   No.3 Xinxi Rd., Shang-Di Information Industry Base,
   Hai-Dian District, Beijing 100085
   P.R. China

   Phone: +86 10 82836073
   Email: xuxh@huawei.com