Network Working Group                                             X. Xu
Internet-Draft                                      Huawei Technologies
Category: Standards Track                                August 24, 2010
Expires: February 2011


      Virtual Subnet: A Scalable Data Center Network Architecture
                       draft-xu-virtual-subnet-02

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups.  Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on February 24, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

   This document proposes a scalable data center network architecture that uses a Layer 3 routing infrastructure, rather than a Spanning Tree Protocol (STP) bridged network, to provide scalable virtual Layer 2 connectivity services.

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Table of Contents

   1. Problem Statement
   2. Terminology
   3. Design Goals
   4. Architecture Description
      4.1. Unicast
         4.1.1. Communications within a Service Domain
         4.1.2. Communications between Service Domains
      4.2. Multicast/Broadcast
      4.3. Host Discovery
      4.4. ARP Proxy
      4.5. DHCP Relay Agent
   5. Conclusions
   6. Limitations
   7. Future work
   8. Security Considerations
   9. IANA Considerations
   10. Acknowledgements
   11. References
      11.1. Normative References
      11.2. Informative References
   Authors' Addresses
1. Problem Statement

   With the popularity of cloud services, today's data centers are growing larger and larger. In addition, virtual machine migration technology, which allows a virtual machine to migrate to any physical server while keeping the same IP address, is becoming more and more prevalent as a means of achieving service agility in data centers. As a result, large Layer 2 networks are needed for server-to-server connectivity. Meanwhile, due to the huge volume of traffic exchanged between servers, these Layer 2 networks SHOULD provide enough capacity for server-to-server interconnections.

   Unfortunately, today's data center network architecture, which relies on Spanning Tree Protocol (STP) bridging, cannot address the above challenges facing large-scale data centers, i.e., a very large number of servers and high-bandwidth demands for server-to-server interconnections. First, STP computes only a single forwarding tree for all connected servers and cannot support multi-path routing, e.g., Equal Cost Multi-Path (ECMP); hence it cannot fully utilize the available network resources to provide enough bandwidth capacity between servers. Second, since bridge forwarding is based on flat MAC addresses, the scalability of the forwarding table becomes a serious issue, especially as an already large Layer 2 network grows even larger. Third, the impact of broadcast storms on network performance becomes much more serious and unpredictable in continually growing large-scale bridged networks.

2. Terminology

   This memo makes use of the terms defined in [RFC4364], [MVPN], [RFC2236] and [RFC2131]. The following term is specific to this document:

   - Service Domain: A group of servers which are dedicated to a given service and are usually located in a separate IP subnet.

3. Design Goals

   To overcome the limitations of STP bridged networks, this document proposes a new network architecture for data centers, called Virtual Subnet (VS), which aims to meet the following design objectives:

   - Bandwidth Utilization Maximization

   To provide enough bandwidth between servers, server-to-server traffic SHOULD always be delivered over the shortest path while achieving load balancing through multi-path routing.

   - Layer 2 Connectivity

   To be backward compatible with the applications currently running in data centers (e.g., virtual machine migration), the servers of a given service domain SHOULD be connected as if they were on a single Local Area Network (LAN) or IP subnet.

   - Domain Isolation

   For reasons of performance isolation and security, servers belonging to different service domains SHOULD be isolated just as if they were located on dedicated Virtual LANs (VLANs) or IP subnets.

   - Forwarding Table Scalability

   To accommodate tens to hundreds of thousands of servers within a single data center network, the forwarding tables of all forwarding devices (e.g., routers or switches) SHOULD scale accordingly.

   - Broadcast Storm Suppression

   To reduce the impact of broadcast storms on network performance, broadcast domains SHOULD be limited to their smallest possible scope.

4. Architecture Description

   VS uses BGP/MPLS VPN technology [RFC4364] with some extensions, together with other proven technologies including ARP proxy [RFC925][RFC1027], to build a scalable, large IP subnet across the MPLS/IP backbone of the data center network.
   As a result, VS can be deployed today as a scalable data center network. The following sections describe VS in detail.

4.1. Unicast

4.1.1. Communications within a Service Domain

   As shown in Figure 1, BGP/MPLS VPN technology with some extensions is deployed in the data center network as an alternative to STP bridging. To achieve service domain isolation, each service domain is mapped to a distinct VPN, and the servers of a given service domain, acting as Customer Edge (CE) hosts, are attached to the Provider Edge (PE) routers of the corresponding VPN, either directly or through one or more Ethernet switches. In addition, to build a large IP subnet across the MPLS/IP backbone, the different sites of a particular VPN are associated with one and the same IP subnet. That is to say, each PE attached to a given VPN is configured, on the corresponding Virtual Routing and Forwarding (VRF) attachment circuits, with a distinct IP address taken from that shared IP subnet. Each PE automatically creates connected host routes in each attached VRF according to the Address Resolution Protocol (ARP) table of the corresponding VPN. Instead of exchanging a route for the configured IP subnet, the PEs belonging to a given VPN exchange these connected host routes among themselves via BGP. In addition, ARP proxy is enabled on the PEs for each attached VPN; thus, upon receiving from a local CE host an ARP request for a remote CE host, the PE, acting as an ARP proxy, returns one of its own MAC addresses in the corresponding ARP reply.

    VPN_A: 10/8                                              VPN_A: 10/8
    +------+      +------+    IP/MPLS Backbone    +------+      +------+
    |Host A+------+ PE-1 +========================+ PE-2 +------+Host B|
    +------+      +------+                        +------+      +------+
     10.1.1.1/8       |                               |       10.1.1.2/8
                      V                               V
    +-------+------------+--------+     +-------+------------+--------+
    |VRF ID |Destination |Next Hop|     |VRF ID |Destination |Next Hop|
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.1/32 | Local  |     | VPN_A |10.1.1.2/32 | Local  |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.2/32 | PE-2   |     | VPN_A |10.1.1.1/32 | PE-1   |
    +-------+------------+--------+     +-------+------------+--------+

              Figure 1: Intra-domain Communication Example

   Now assume host A broadcasts an ARP request for host B before communicating with B. Upon receipt of this ARP request, PE-1 looks up the associated VRF to find a host route for B. If such a route is found and has been learned from a remote PE, PE-1, acting as an ARP proxy, returns one of its own MAC addresses in response to that ARP request. Otherwise, no ARP reply SHOULD be sent. After obtaining the ARP reply from PE-1, A sends an IP packet to B with the destination MAC address set to PE-1's MAC address. Upon receiving this packet, PE-1, acting as an ingress PE, tunnels the packet towards PE-2, which in turn, as an egress PE, forwards the packet to B.
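   The PE behavior described above can be summarized by the following Python sketch. It is purely illustrative: the data structures, function names, and MAC address value are assumptions made for this example and are not part of the VS specification.

      import ipaddress

      # Illustrative model of a per-VPN VRF holding /32 host routes, as in
      # Figure 1.  A next hop of "Local" means the host is attached to this
      # PE; any other value names the remote PE that advertised the route
      # via BGP.
      class Vrf:
          def __init__(self, vpn_id):
              self.vpn_id = vpn_id
              self.host_routes = {}    # ip address -> "Local" or remote PE name

          def add_host_route(self, ip, next_hop):
              self.host_routes[ipaddress.ip_address(ip)] = next_hop

      PE_MAC = "00:11:22:33:44:01"     # one of PE-1's own MACs (made-up value)

      def handle_arp_request(vrf, target_ip):
          """ARP-proxy decision for an ARP request received from a local CE."""
          next_hop = vrf.host_routes.get(ipaddress.ip_address(target_ip))
          if next_hop is not None and next_hop != "Local":
              return PE_MAC            # proxy-reply with the PE's own MAC
          return None                  # local or unknown target: stay silent

      def forward_ip_packet(vrf, dest_ip):
          """Ingress forwarding: tunnel towards the PE that advertised the route."""
          next_hop = vrf.host_routes.get(ipaddress.ip_address(dest_ip))
          if next_hop == "Local":
              return "deliver on the local attachment circuit"
          if next_hop is not None:
              return "tunnel over the MPLS/IP backbone to " + next_hop
          return "drop (no route)"

      # PE-1's VRF for VPN_A as shown in Figure 1.
      vrf_a = Vrf("VPN_A")
      vrf_a.add_host_route("10.1.1.1", "Local")   # Host A, attached locally
      vrf_a.add_host_route("10.1.1.2", "PE-2")    # Host B, learned from PE-2

      assert handle_arp_request(vrf_a, "10.1.1.2") == PE_MAC  # remote: proxy reply
      assert handle_arp_request(vrf_a, "10.1.1.1") is None    # local: no reply
      print(forward_ip_packet(vrf_a, "10.1.1.2"))             # tunnel ... to PE-2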
4.1.2. Communications between Service Domains

   For servers in different VPNs (i.e., service domains) to communicate with each other, these VPNs SHOULD NOT be configured with any overlapping addresses, and each VPN SHOULD be configured with a default route towards the corresponding default gateway (i.e., a CE router).

    +-------+------------+--------+     +-------+------------+--------+
    |VRF ID |Destination |Next Hop|     |VRF ID |Destination |Next Hop|
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.2/32 | PE-1   |     | VPN_B |20.1.1.2/32 | PE-2   |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.1/32 | Local  |     | VPN_B |20.1.1.1/32 | Local  |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A | 0.0.0.0/0  |10.1.1.1|     | VPN_B | 0.0.0.0/0  |20.1.1.1|
    +-------+------------+--------+     +-------+------------+--------+
        ^                                                       ^
        |               +--------------------+                  |
        |               |     IP Network     |                  |
        |               +---+------------+---+                  |
        |                   |            |                      |
        |               +---+--+      +--+---+                  |
        |               | GW-1 |      | GW-2 |                  |
        |               +---+--+      +--+---+                  |
        |     VPN A:        |            |       VPN B:         |
        |     10.1.1.1/8    |            |       20.1.1.1/8     |
        |               +---+--+      +--+---+                  |
        +---------------+ PE-3 |      | PE-4 +------------------+
                        +------+      +------+

    VPN A: 10/8               IP/MPLS Backbone            VPN B: 20/8
    +------+      +------+                        +------+      +------+
    |Host A+------+ PE-1 +========================+ PE-2 +------+Host B|
    +------+      +------+                        +------+      +------+
     10.1.1.2/8       |                               |       20.1.1.2/8
                      V                               V
    +-------+------------+--------+     +-------+------------+--------+
    |VRF ID |Destination |Next Hop|     |VRF ID |Destination |Next Hop|
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.2/32 | Local  |     | VPN_B |20.1.1.2/32 | Local  |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.1/32 | PE-3   |     | VPN_B |20.1.1.1/32 | PE-4   |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A | 0.0.0.0/0  | PE-3   |     | VPN_B | 0.0.0.0/0  | PE-4   |
    +-------+------------+--------+     +-------+------------+--------+

              Figure 2: Inter-domain Communication Example

   As shown in Figure 2, PE-1 and PE-3 are attached to one VPN (i.e., VPN A) while PE-2 and PE-4 are attached to another VPN (i.e., VPN B). Host A and its default gateway router (i.e., GW-1) are connected to PE-1 and PE-3, respectively. PE-3 is configured with a default route for VPN A, and this default route is advertised to the other PEs. Similarly, host B and its default gateway router (i.e., GW-2) are connected to PE-2 and PE-4, respectively. PE-4 is configured with a default route for VPN B, and this default route is advertised to the other PEs.

   Now A sends an ARP request for its default gateway (i.e., 10.1.1.1) before communicating with B. Upon receiving this ARP request, PE-1 looks up the associated VRF to find a host route for the default gateway. If such a route is found and has been learned from a remote PE, PE-1, acting as an ARP proxy, returns one of its own MAC addresses in the ARP reply. After obtaining the ARP reply, A sends an IP packet for B with the destination MAC address set to PE-1's MAC address. Upon receiving this packet, PE-1, acting as an ingress PE, tunnels it towards PE-3 according to the best-match route for that packet (i.e., the default route) in the associated VRF. PE-3, in turn, acting as an egress PE, forwards the packet towards the default gateway router (i.e., GW-1). After traveling through the IP network, the packet arrives at B's default gateway router (i.e., GW-2). If GW-2 has already learned an ARP entry for B from PE-4, it forwards the packet to PE-4 with the destination MAC address set to PE-4's MAC address; otherwise, GW-2 first broadcasts an ARP request for B. Upon receiving this packet, PE-4, acting as an ingress PE, tunnels it towards PE-2, which in turn forwards it towards B.
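   The forwarding decision taken by the ingress PE in this example can be illustrated with the short Python sketch below. It simply performs a longest-prefix match over the host routes and the per-VPN default route of Figure 2; the function name and route representation are assumptions made for illustration only.

      import ipaddress

      def best_match(vrf_routes, dest_ip):
          """Longest-prefix match over a VRF's routes (host routes plus default)."""
          dest = ipaddress.ip_address(dest_ip)
          matches = [(ipaddress.ip_network(prefix), next_hop)
                     for prefix, next_hop in vrf_routes.items()
                     if dest in ipaddress.ip_network(prefix)]
          if not matches:
              return None
          # The most specific prefix wins; /32 host routes beat 0.0.0.0/0.
          return max(matches, key=lambda m: m[0].prefixlen)[1]

      # PE-1's VRF for VPN A (bottom-left table of Figure 2).
      pe1_vpn_a = {
          "10.1.1.2/32": "Local",   # Host A, attached locally
          "10.1.1.1/32": "PE-3",    # default gateway GW-1, reachable via PE-3
          "0.0.0.0/0":   "PE-3",    # default route advertised by PE-3
      }

      print(best_match(pe1_vpn_a, "10.1.1.1"))  # PE-3 (host route for GW-1)
      print(best_match(pe1_vpn_a, "20.1.1.2"))  # PE-3 (falls back to the default)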
4.2. Multicast/Broadcast

   MVPN technology [MVPN], especially the Protocol Independent Multicast (PIM) tree option with some extensions, is partially reused here to support link-local multicast between servers of a given service domain (i.e., VPN). That is to say, the customer multicast group addresses of a given VPN are mapped 1:1 or N:1 to provider multicast groups dedicated to that VPN when the customer multicast traffic is transported across the backbone. For broadcast, a dedicated provider multicast group is reserved for carrying broadcast traffic across the IP/MPLS backbone. In other words, customer broadcast is processed on the PEs as a special customer multicast group. Unless otherwise mentioned, the term "customer multicast" below covers both customer multicast and customer broadcast.

   All PEs attached to a given VPN SHOULD maintain identical mappings from customer multicast group addresses to provider multicast group addresses. To isolate the customer multicast traffic of different VPNs traveling through the backbone, different VPNs SHOULD be assigned distinct, non-overlapping provider multicast group address ranges.

    VPN_A: 10/8                                              VPN_A: 10/8
    +------+  E0  +------+    IP/MPLS Backbone    +------+      +------+
    |Host A+------+ PE-1 +========================+ PE-2 +------+Host B|
    +------+      +------+                        +------+      +------+
     10.1.1.1/8       |                                       10.1.1.2/8
                      V
    +-------+---------------+----------+-------+--------+
    |VRF ID | Customer G    |Provider G| To PE | From PE|
    +-------+---------------+----------+-------+--------+
    | VPN_A | 224.1.1.1/32  | 239.1.1.1| True  | True   |
    +-------+---------------+----------+-------+--------+
    | VPN_A | 224.0.0.0/4   | 239.1.1.2| True  | True   |
    +-------+---------------+----------+-------+--------+
    | VPN_A |255.255.255.255| 239.1.1.3| True  | True   |
    +-------+---------------+----------+-------+--------+

      Figure 3: Link-local Multicast/Broadcast Communication Example

   A multicast forwarding entry can be configured manually by the network operator or generated dynamically according to the Internet Group Management Protocol (IGMP) Membership Report/Leave messages received from CE hosts or remote PEs. Ingress PEs forward customer multicast packets to the other PEs (i.e., egress PEs) of the same VPN via a provider multicast distribution tree, according to the best-match multicast forwarding entry of the associated VRF, provided that the "To PE" field of that entry is set to True. Otherwise (i.e., if that field is set to False), ingress PEs are not allowed to forward the customer multicast packets to remote egress PEs. Egress PEs forward customer multicast packets received from the provider multicast distribution tree to CE hosts via the VRF attachment circuits, according to the best-match multicast forwarding entry of the associated VRF, provided that the "From PE" field of that entry is set to True. Otherwise (i.e., if that field is set to False), egress PEs are not allowed to forward the customer multicast packets to CE hosts.
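   The best-match behavior of these multicast forwarding entries can be illustrated with the following Python sketch, using the VPN_A entries of Figure 3. The table representation and function name are assumptions made for this example only.

      import ipaddress

      # Multicast forwarding entries for VPN_A, mirroring Figure 3:
      # (customer group prefix, provider group, To PE, From PE).
      VPN_A_MCAST = [
          ("224.1.1.1/32",       "239.1.1.1", True, True),  # specific customer group
          ("224.0.0.0/4",        "239.1.1.2", True, True),  # any other multicast group
          ("255.255.255.255/32", "239.1.1.3", True, True),  # customer broadcast
      ]

      def provider_group(entries, customer_dest, direction):
          """Best-match lookup of the provider group carrying a customer packet.

          direction is "to_pe" on an ingress PE (towards the backbone) and
          "from_pe" on an egress PE (towards local CE hosts).  None means the
          packet must not be forwarded in that direction.
          """
          dest = ipaddress.ip_address(customer_dest)
          best = None
          for prefix, p_group, to_pe, from_pe in entries:
              net = ipaddress.ip_network(prefix)
              if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
                  allowed = to_pe if direction == "to_pe" else from_pe
                  best = (net, p_group if allowed else None)
          return best[1] if best else None

      print(provider_group(VPN_A_MCAST, "224.1.1.1", "to_pe"))        # 239.1.1.1
      print(provider_group(VPN_A_MCAST, "224.5.6.7", "to_pe"))        # 239.1.1.2
      print(provider_group(VPN_A_MCAST, "255.255.255.255", "to_pe"))  # 239.1.1.3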
   For IGMP messages to be conveyed successfully across the IP/MPLS backbone, multicast forwarding entries for certain special multicast groups, including the all-routers group (i.e., 224.0.0.2) and the all-systems group (i.e., 224.0.0.1), SHOULD be configured in the corresponding VRF in advance. Besides, according to the IGMP specification [RFC2236], Group-Specific Query messages are sent to the group being queried and Membership Report messages are sent to the group being reported. Upon receiving such messages from CE hosts, the PE SHOULD convey them over the provider multicast distribution tree dedicated to the all-systems group (224.0.0.1) of the given VRF. To avoid IGMP Membership Report suppression, Membership Report messages received from PEs or CE hosts SHOULD NOT be forwarded to CE hosts. As an alternative to conveying IGMP Report/Leave messages through the provider multicast distribution tree, customer multicast routing information can also be exchanged among PEs by using the approaches defined in [MVPN-BGP].

   As shown in Figure 3, upon receiving a multicast/broadcast packet from a CE (e.g., host A), PE-1 proceeds as follows: if the packet is destined for 224.1.1.1, PE-1 encapsulates it in a provider multicast packet with a destination IP address of 239.1.1.1; if it is destined for an IP multicast address other than 224.1.1.1, PE-1 encapsulates it in a provider multicast packet with a destination IP address of 239.1.1.2; and if it is a broadcast packet, PE-1 encapsulates it in a provider multicast packet with a destination IP address of 239.1.1.3, which is dedicated to conveying broadcast packets of that VPN.

   The customer multicast forwarding entries, whether configured manually or learned automatically from the IGMP Membership Reports sent by local CEs, automatically trigger PEs to join the corresponding provider multicast groups in the MPLS/IP backbone. For example, if PE-2 receives an IGMP Membership Report for a given customer multicast group (e.g., 224.1.1.1) from a local CE (e.g., host B), it SHOULD automatically join the provider multicast group (i.e., 239.1.1.1) corresponding to that customer multicast group.

4.3. Host Discovery

   To discover all local CE hosts, a PE SHOULD perform an ARP scan at least once after rebooting. For example, it broadcasts an ARP request for each IP address within the subnet of each attached VPN (including the network and broadcast addresses). Alternatively, it could broadcast an ARP request for the limited broadcast address (i.e., 255.255.255.255); upon receipt of such an ARP request, any host SHOULD respond with an ARP reply containing its IP and MAC addresses. After a round of ARP scanning, the PE will have discovered all local CE hosts and cached the corresponding ARP entries in its ARP table. After that, the PE can periodically send unicast ARP requests to each already-learned local CE host so as to keep the corresponding ARP entry from expiring. This is also useful for checking whether a given CE host with known IP and MAC addresses is still present on the subnet. Unicast ARP requests have the advantage of being quieter than broadcasts, since they are not received by every host on the subnet.

   When receiving a gratuitous ARP from a local host, the PE SHOULD immediately cache it in the ARP table if no ARP entry for that host exists yet; otherwise, the PE SHOULD simply update the corresponding ARP entry in its ARP table. Most operating systems generate a gratuitous ARP request when the host boots up, when the host's network interface or link comes up, or when an address assigned to the interface changes. In the rare scenarios where a host does not generate gratuitous ARPs, the PE would have to perform ARP scans periodically, even though doing so has side effects on network performance.
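   The host discovery procedure above can be summarized by the following illustrative Python sketch. The class, function names, and timer value are assumptions made for this example; they are not specified by this document.

      import time

      ARP_ENTRY_LIFETIME = 300      # seconds; an assumed value, not specified here

      class ArpTable:
          """Per-VRF ARP cache built from ARP scanning and gratuitous ARPs."""

          def __init__(self):
              self.entries = {}                      # ip -> (mac, last_seen)

          def learn(self, ip, mac):
              self.entries[ip] = (mac, time.time())  # add or refresh an entry

          def stale(self):
              now = time.time()
              return [ip for ip, (_, seen) in self.entries.items()
                      if now - seen > ARP_ENTRY_LIFETIME]

      def arp_scan(subnet_hosts, send_broadcast_arp):
          """Initial discovery after reboot: broadcast one ARP request per address."""
          for ip in subnet_hosts:
              send_broadcast_arp(ip)

      def refresh(arp_table, send_unicast_arp):
          """Periodic keepalive: quieter unicast requests to already-known hosts."""
          for ip, (mac, _) in arp_table.entries.items():
              send_unicast_arp(ip, mac)

      def on_gratuitous_arp(arp_table, ip, mac):
          """Cache or update an entry whenever a local host announces itself."""
          arp_table.learn(ip, mac)

      table = ArpTable()
      on_gratuitous_arp(table, "10.1.1.1", "aa:bb:cc:dd:ee:01")
      refresh(table, lambda ip, mac: print("unicast ARP request to", ip))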
   When a given PE receives, from a remote PE, a host route for one of its local CE hosts, it SHOULD immediately send an ARP request for that local CE host to check whether the host is still connected locally. If an ARP reply is received within a short period of time (consider the host multi-homing scenario), the PE just needs to update the ARP entry for that local host as normal. Otherwise (consider the virtual machine migration scenario), the PE SHOULD delete the ARP entry corresponding to that host from its ARP table. Meanwhile, the PE SHOULD send a gratuitous ARP on behalf of that host, with the sender hardware address set to one of its own MAC addresses, in order to update the ARP entry for that host cached on the other local hosts. As a result, subsequent packets destined for that host will be sent towards the PE by the other local CE hosts.

4.4. ARP Proxy

   A PE acting as an ARP proxy SHOULD only respond to ARP requests for remote hosts that have been learned via BGP from other PEs. That is to say, the ARP proxy SHOULD NOT respond to ARP requests for local hosts. Otherwise, if the ARP reply from the ARP proxy were to override the reply from the requested host itself, packets destined for that local host would have to be unnecessarily relayed by the PE.

   When the Virtual Router Redundancy Protocol (VRRP) [RFC2338] is enabled together with ARP proxy, only the VRRP master is delegated to act as the ARP proxy, and it SHOULD return the VRRP virtual MAC address in the ARP reply.
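   The ARP-proxy response policy of this section, including the VRRP case, is illustrated by the Python sketch below. The function signature and the MAC address values are assumptions made purely for illustration.

      def arp_proxy_reply(vrf_host_routes, target_ip, pe_mac,
                          vrrp_enabled=False, vrrp_master=False,
                          vrrp_virtual_mac=None):
          """Return the MAC to place in a proxy ARP reply, or None to stay silent.

          vrf_host_routes maps host IP addresses to next hops; "Local" marks
          locally attached hosts, anything else is a remote PE learned via BGP.
          """
          next_hop = vrf_host_routes.get(target_ip)
          if next_hop is None or next_hop == "Local":
              return None              # never answer for local or unknown hosts
          if vrrp_enabled:
              if not vrrp_master:
                  return None          # only the VRRP master acts as ARP proxy
              return vrrp_virtual_mac  # the master answers with the virtual MAC
          return pe_mac                # otherwise answer with one of the PE's MACs

      routes = {"10.1.1.1": "Local", "10.1.1.2": "PE-2"}
      print(arp_proxy_reply(routes, "10.1.1.2", pe_mac="00:11:22:33:44:01"))
      print(arp_proxy_reply(routes, "10.1.1.2", pe_mac="00:11:22:33:44:01",
                            vrrp_enabled=True, vrrp_master=True,
                            vrrp_virtual_mac="00:00:5e:00:01:01"))
      print(arp_proxy_reply(routes, "10.1.1.1", pe_mac="00:11:22:33:44:01"))  # None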
4.5. DHCP Relay Agent

   To prevent Dynamic Host Configuration Protocol (DHCP) [RFC2131] broadcast messages from flooding the whole data center network, the DHCP Relay Agent function can be enabled on the PEs. In this way, DHCP broadcast messages from DHCP clients (i.e., local CE hosts) are transformed into DHCP unicast messages by the DHCP Relay Agents (i.e., the PEs) and then forwarded to the DHCP servers in unicast.

5. Conclusions

   By using Layer 3 routing in the backbone of the data center network instead of STP bridge forwarding, the traffic between any two servers is forwarded along the shortest path between them. Besides, ECMP can easily be achieved in Layer 3 routed networks. Thus, the total network bandwidth of the data center network is utilized to the maximum extent.

   By reusing BGP/MPLS VPN technology to exchange the host routes of a given VPN among PEs, the servers of that VPN are allowed to communicate with each other just as if they were located on a single subnet. Due to the tunnels used in BGP/MPLS VPNs, the forwarding tables of the P routers only need to hold reachability information for the tunnel endpoints (i.e., the PEs). Meanwhile, the forwarding tables of the PE routers can also be kept scalable by distributing VPNs among different PEs; that is to say, thanks to the Outbound Route Filtering (ORF) capability of BGP, a given PE router only needs to hold the routing tables of those VPNs to which it is attached. Thus, the forwarding table scalability issues of data center networks are largely alleviated.

   By enabling the ARP proxy function on the PEs, ARP broadcast messages from local CE hosts are terminated on the attached PEs. Thus, ARP broadcast messages are not flooded throughout the whole data center network.

   Besides, by enabling the DHCP Relay Agent function on the PEs, DHCP broadcast messages from DHCP clients (i.e., local CE hosts) are transformed into unicast messages by the DHCP Relay Agents and then forwarded to the DHCP servers in unicast. Thus, broadcast storms in the data center network are largely suppressed.

6. Limitations

   Since the data center network architecture described in this document partially reuses BGP/MPLS VPN technology to construct a large-scale IP subnet rather than a real LAN, non-IP traffic cannot be supported in this architecture. However, we believe IP is the dominant communication protocol in today's data center networks, and the remaining non-IP legacy applications will disappear from data center networks over time.

7. Future work

   IPv6-based data center networks will be considered as part of future work.

8. Security Considerations

   TBD.

9. IANA Considerations

   This document makes no request of IANA.

10. Acknowledgements

   Thanks to Dino Farinacci for his valuable comments.

11. References

11.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

11.2. Informative References

   [RFC4364]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, February 2006.

   [MVPN]     Rosen, E. and R. Aggarwal, "Multicast in MPLS/BGP IP VPNs", draft-ietf-l3vpn-2547bis-mcast-10.txt (work in progress), January 2010.

   [MVPN-BGP] Aggarwal, R., Rosen, E., Morin, T., Rekhter, Y., and C. Kodeboniya, "BGP Encodings for Multicast in MPLS/BGP IP VPNs", draft-ietf-l3vpn-2547bis-mcast-bgp-08.txt (work in progress), September 2009.

   [RFC925]   Postel, J., "Multi-LAN Address Resolution", RFC 925, October 1984.

   [RFC1027]  Carl-Mitchell, S. and J. Quarterman, "Using ARP to Implement Transparent Subnet Gateways", RFC 1027, October 1987.

   [RFC2338]  Knight, S., et al., "Virtual Router Redundancy Protocol", RFC 2338, April 1998.

   [RFC2131]  Droms, R., "Dynamic Host Configuration Protocol", RFC 2131, March 1997.

   [RFC2236]  Fenner, W., "Internet Group Management Protocol, Version 2", RFC 2236, November 1997.

Authors' Addresses

   Xiaohu Xu
   Huawei Technologies
   No.3 Xinxi Rd., Shang-Di Information Industry Base,
   Hai-Dian District, Beijing 100085, P.R. China

   Phone: +86 10 82836073
   Email: xuxh@huawei.com