Network working group X. Xu Internet Draft Huawei Technologies Category: Standard Track Expires: March 2011 September 1, 2010 Virtual Subnet: A Scalable Data Center Network Architecture draft-xu-virtual-subnet-03 Status of this Memo This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on March 1, 2011. Copyright Notice Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Xu Expires March 1, 2011 [Page 1] Internet-Draft Virtual Subnet September 2010 Abstract This document proposes a scalable data center network architecture which, as an alternative to the Spanning Tree Protocol Bridge network, uses a Layer 3 routing infrastructure based on BGP/MPLS IP VPN technology [RFC4364] with some extensions, together with some other proven technologies including ARP proxy [RFC925][RFC1027] to provide scalable virtual Layer 2 network connectivity services. Conventions used in this document The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [RFC2119]. Table of Contents 1. Problem Statement...........................................3 2. Terminology.................................................3 3. Design Goals................................................3 4. Architecture Description....................................4 4.1. Unicast IP traffic.....................................5 4.1.1. Unicast IP Traffic inside a Service Domain........5 4.1.2. Unicast IP Traffic between Service Domains........6 4.2. Multicast/Broadcast IP Traffic.........................7 4.3. CE Host Discovery......................................8 4.4. CE Host Multi-homing and Mobility......................9 4.5. APR Proxy..............................................9 4.6. DHCP Relay Agent.......................................9 5. VS vs VPLS..................................................9 6. VS vs IPLS.................................................10 7. Conclusion.................................................11 8. Future work................................................11 9. Security Considerations....................................11 10. IANA Considerations.......................................11 11. Acknowledgements..........................................12 12. References................................................12 12.1. Normative References.................................12 12.2. Informative References...............................12 Authors' Addresses............................................13 Xu Expires March 1, 2011 [Page 2] Internet-Draft Virtual Subnet September 2010 1. Problem Statement With the popularity of cloud services, the scale of today's data centers expands larger and larger. In addition, virtual machine migration technology, which allows a virtual machine to be able to migrate to any physical server while keeping the same IP address, is becoming more and more prevalent for achieving service agility in data centers. As a result, large Layer 2 networks are needed for server-to-server connectivity. Meanwhile, due to the huge-volume traffic exchanged between servers, the Layer 2 networks SHOULD provide enough capacity for server-to-server interconnections. Unfortunately, today's data center network using the Spanning-Tree Protocol (STP) Bridge technology, can not address the above challenges facing today's large-scale data centers in several ways. First, STP can calculate out only one single forwarding tree for all connected servers of a particular Virtual Local Area Network (VLAN) and it can not support multi-path routing, e.g., Equal Cost Multi- Path (ECMP), hence the available network capacity in data center networks can't be highly utilized so as to provide enough bandwidth between servers; Second, since the Bridge forwarding is based on the flat MAC addresses, the scalability of the Bridge forwarding table would become a big issue, especially when the existing large Layer 2 network scales even larger; Third, broadcast storm impacts imposed by some protocols, e.g., Address Resolution Protocol (ARP) and the flooding of unknown destination unicast frames become much more serious and unpredictable in the continually growing large-scale STP Bridge networks. 2. Terminology This memo makes use of the terms defined in [RFC4364], [MVPN], [RFC2236] and [RFC2131]. Below are provided terms specific to this document: - Service Domain: A group of servers which are dedicated for a given service and are usually located on a separate IP subnet. 3. Design Goals To overcome the limitations of the STP Bridge networks as mentioned above, this document describes Virtual Subnet (VS), a practical data center network architecture, which aims to meet the following objectives: - Bandwidth Utilization Maximization Xu Expires March 1, 2011 [Page 3] Internet-Draft Virtual Subnet September 2010 To provide enough bandwidth between servers, the server-to-server traffic SHOULD always be delivered along the shortest path while multi-path routing is used for load-balancing purpose. - Layer 2 Connectivity To be compatible with the applications running in today's data centers (e.g., virtual machine migration), servers of a given service domain SHOULD be connected as if they were on a Local Area Network (LAN) or an IP subnet. - Domain Isolation To achieve performance and security isolation, servers belonging to different service domains SHOULD be isolated just as if they were located on separate Virtual LANs (VLAN) or IP subnets. - Forwarding Table Scalability To accommodate tens to hundreds of thousands of servers in a single data center network, the forwarding tables of those forwarding devices (e.g., routers or switches) SHOULD be scalable enough. - Broadcast Storm Suppression To alleviate the serious impacts on network performances which are imposed by broadcast storms, broadcast domains SHOULD be limited to their smallest scopes. 4. Architecture Description VS actually uses BGP/MPLS IP VPN technology [RFC4364] with some extensions, together with other proven technologies including ARP proxy [RFC925][RFC1027] to build scalable large IP subnets across the MPLS/IP backbone of the data center network. As a result, VS can be deployed today as a practical and scalable data center network. Since VS constructs large-scale IP subnets, rather than real LANs, the non-IP traffic would not be supported in VS anymore. However, given that IP traffic is the predominant type of traffic in today's data center networks and the non-IP traffic will disappear from the data center networks with the elapse of time, we believe that VS can be used as a practical data center network solution in most cases. The following sections describe VS in detail. Xu Expires March 1, 2011 [Page 4] Internet-Draft Virtual Subnet September 2010 4.1. Unicast IP traffic 4.1.1. Unicast IP Traffic inside a Service Domain +--------------------+ +-----------------+ | | +------------------+ |VPN_A:10/8 | | | |VPN_A:10/8 | | | | | | | | +------+ ++---+-+ +-+---++ +------+ | | |Host A+----+ PE-1 | | PE-2 +----+Host B| | | +------+ ++-+-+-+ +-+-+-++ +------+ | | 10.1.1.1/8 | | | IP/MPLS Backbone | | | 10.1.1.2/8 | +-----------------+ | | | | +------------------+ | +--------------------+ | | | | | V V +-------+------------+--------+ +-------+------------+--------+ |VRF ID |Destination |Next Hop| |VRF ID |Destination |Next Hop| +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.1.1.1/32 | Local | | VPN_A |10.1.1.2/32 | Local | +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.1.1.2/32 | PE-2 | | VPN_A |10.1.1.1/32 | PE-1 | +-------+------------+--------+ +-------+------------+--------+ Figure 1: Unicast IP Traffic inside a Service Domain As shown in Figure 1, BGP/MPLS IP VPN technology with some extensions is deployed in a data center network. To maintain proper isolation of one service domain from another, each service domain is mapped to a distinct VPN and servers of a given service domain, as Customer Edge (CE) hosts, are attached to Provider Edge (PE) routers directly or through one or more Ethernet switches. In addition, to build large IP subnets across the MPLS/IP backbone, different sites of a particular VPN are associated with an identical IP subnet, in another words, PE routers attached to a given VPN are configured with distinct IP addresses of the same IP subnet on their VRF attachment circuits. PE routers create host routes for local CE hosts automatically according to their corresponding ARP entries. Instead of distributing the routes for the configured VPN subnets, PE routers distribute host routes for local CE hosts to each other. In addition, APR proxy is implemented on PE routers for every attached VPN, thus, upon receiving from a local CE host an ARP request for a remote CE host, the PE as an ARP proxy returns its own MAC address as a response. Xu Expires March 1, 2011 [Page 5] Internet-Draft Virtual Subnet September 2010 Assume host A broadcasts an ARP request for host B before communicating with B, upon the receipt of this ARP request, PE-1 lookups the associated VRF to find the host route for B. If found and the route is learnt from a remote PE, PE-1 acting as an ARP proxy returns its own MAC address in the response to that ARP request. Otherwise, no ARP reply SHOULD be sent. After obtaining the ARP reply from PE-1, A sends an IP packet to B with destination MAC address of PE-1's MAC address. Upon receiving this packet, PE-1 acting as an ingress PE, tunnels the packet towards PE-2 which in turn, as an egress PE, forwards the packet to B. 4.1.2. Unicast IP Traffic between Service Domains +-------+------------+--------+ +-------+------------+--------+ |VRF_ID |Destination |Next Hop| |VRF_ID |Destination |Next Hop| +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.1.1.2/32 | PE-1 | | VPN_B |20.1.1.2/32 | PE-2 | +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.1.1.1/32 | Local | | VPN_B |20.1.1.1/32 | Local | +-------+------------+--------+ +-------+------------+--------+ | VPN_A | 0.0.0.0/0 |10.1.1.1| | VPN_B | 0.0.0.0/0 |20.1.1.1| +-------+------------+--------+ +-------+------------+--------+ ^ ^ | +--------------------+ | | | IP Network | | | +----+-----------+---+ | | +---+--+ +---+--+ | | | GW-1 | | GW-2 | | | +---+--+ +--+---+ | |VPN A:10.1.1.1/8 | |VPN B:20.1.1.1/8 | | | | | +-------------+---+--+ +--+---+-------------+ +-+ PE-3 +----+ PE-4 +-+ +-----------------+ | +------+ +------+ | +------------------+ |VPN A:10/8 | | | |VPN_B:20/8 | | | | | | | | +------+ ++--+--+ +--+--++ +------+ | | |Host A+----+ PE-1 | | PE-2 +----+Host B| | | +------+ ++-++--+ +--++-++ +------+ | | 10.1.1.2/8 | || IP/MPLS Backbone || | 20.1.1.2/8 | +-----------------+ || || +------------------+ |+----------------------+| | | V V +-------+------------+--------+ +-------+------------+--------+ |VRF ID |Destination |Next Hop| |VRF ID |Destination |Next Hop| +-------+------------+--------+ +-------+------------+--------+ Xu Expires March 1, 2011 [Page 6] Internet-Draft Virtual Subnet September 2010 | VPN_A |10.1.1.2/32 | Local | | VPN_B |20.1.1.2/32 | Local | +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.1.1.1/32 | PE-3 | | VPN_B |20.1.1.1/32 | PE-4 | +-------+------------+--------+ +-------+------------+--------+ | VPN_A | 0.0.0.0/0 | PE-3 | | VPN_B | 0.0.0.0/0 | PE-4 | +-------+------------+--------+ +-------+------------+--------+ Figure 2: Unicast IP Traffic between Service Domains For servers of different VPNs (i.e., service domains) to communicate with each other, these VPNs SHOULD not be configured with any overlapping address spaces, besides, each VPN SHOULD be configured with at least one default route towards the gateway router (i.e. a CE router). As shown in Figure 2, PE-1 and PE-3 are attached to one VPN (i.e. VPN A) while PE-2 and PE-4 are attached to another VPN (i.e., VPN B). Host A and its gateway router (i.e., GW-1) are connected to PE-1 and PE-3, respectively. PE-3 is configured with a default route for VPN A and this default route is advertised to other PE routers. Similarly, host B and its gateway router (i.e., GW-2) are connected to PE-2 and PE-4, respectively. PE-4 is configured with a default route for VPN B and this default route is advertised to other PE routers. Now A sends an ARP request for its gateway (i.e., 10.1.1.1) before communicating with B. Upon receiving this ARP request, PE-1 lookups the associated VRF to find the host route for the gateway. If found and the route is learnt from a remote PE, PE-1 as an ARP proxy, returns its own MAC address in the ARP reply. After obtaining the ARP reply, A sends an IP packet for B with destination MAC address of PE-1's MAC. Upon receiving this packet, PE-1 as an ingress PE, tunnels it towards PE-3 according to the best-match route for that packet (i.e., the default route). PE-3 as an egress PE, in turn, forwards this packet towards the gateway router (i.e., GW-1). After the packet arrives at the gateway router for B (i.e., GW-2) after traversing through an IP network, GW-2 forwards the packet to PE-4 with destination MAC address of PE-4's MAC address if it has learnt an ARP for B from PE-4. Otherwise, GW-2 SHOULD broadcast an APR request for B. Upon receiving this packet, PE-4 as an ingress PE, tunnels it towards PE-2 which in turn, forwards it towards B. 4.2. Multicast/Broadcast IP Traffic The MVPN technology [MVPN], in particular, the Protocol-Independent- Multicast (PIM) tree option with some extensions, is partially reused here to support IP multicast and broadcast between CE hosts of the same VPN. For example, PE routers attached to a given VPN join a default provider multicast distribution tree which is dedicated for that VPN. PE routers receiving customer multicast or Xu Expires March 1, 2011 [Page 7] Internet-Draft Virtual Subnet September 2010 broadcast traffic from local CE hosts forward such traffic to other remote PE routers over the corresponding default provider multicast distribution tree. When customer multicast or broadcast traffic is received from a provider multicast distribution tree, PE routers forward such traffic to the associated VRF attachment circuits. For the customer multicast group of a particular VPN which carries high-volume traffic and not all sites of that VPN need the traffic of that customer multicast group, a dedicated provider multicast distribution tree other than the default provider multicast distribution tree for that VPN can be assigned optionally. As a result, those PE routers of that VPN that have no local CE hosts interested in that customer multicast group will not receive such traffic from remote PE routers anymore. More details about how to support multicast and broadcast traffic in VS will be explored in a later version of this document. 4.3. CE Host Discovery To discover all local CE hosts including gateway routers, PE routers SHOULD perform at least once ARP scan on the attached VPN subnet after rebooting. For example, a PE broadcasts an ARP request for each IP address within the subnet of each attached VPN. Alternatively, this PE could also broadcast an ARP request for a directed broadcast address (i.e., 255.255.255.255) or an ALL-Systems multicast group address (i.e., 224.0.0.1), that is to say, the target protocol address field is filled with 2555.255.255.255 or 224.0.0.1. Any CE host receiving this ARP request SHOULD respond with an ARP reply containing its IP and MAC addresses. After a round of such ARP scan, the PE will discover all local CE hosts and cache their ARP entries in its ARP table. After that, the PE could send ARP requests in unicast to each already-learnt local CE host periodically so as to check whether the CE host is still present on the subnet. Using unicast ARP requests has the advantage that it is quieter than using the broadcast because it won't be received by all CE hosts on the subnet. When receiving a gratuitous ARP from a local CE host, the PE SHOULD cache the ARP entry of that CE host in its ARP table immediately if no ARP entry for that CE host exists yet. Otherwise, the PE SHOULD just update the corresponding ARP entry of that CE host. Most operating systems generate a gratuitous ARP request when the host boots up, the host's network interface or links comes up, or an address assigned to the interface changes. In the scarce scenarios where a host does not generate a gratuitous ARP, the PE would have to perform ARP scan periodically. Xu Expires March 1, 2011 [Page 8] Internet-Draft Virtual Subnet September 2010 4.4. CE Host Multi-homing and Mobility When a given PE receives a host route for one of its local CE hosts from a remote PE, it SHOULD immediately send an ARP request for that CE host to the attached VPN subnet so as to determine whether that CE host is still connected locally. If an ARP reply is received in a short amount of time (imaging the CE host multi-homing scenario), the PE just needs to update the ARP entry for that CE host as normal. Otherwise (considering the virtual machine migration scenario), the PE SHOULD delete the ARP entry corresponding to that host from its APR table. Meanwhile, the PE SHOULD broadcast a gratuitous ARP on the attached VPN subnet on behalf of that CE host, with the sender hardware address field being filled with one of its own MAC addresses. As a result, the ARP entry for that CE host which has been cached on other local CE hosts is updated. 4.5. APR Proxy The PE, acting as an ARP proxy, SHOULD only respond to the ARP requests for those CE hosts which have been learnt from other remote PE routers. Especially, the PE SHOULD not respond to ARP requests for local CE hosts. Otherwise, in case that the ARP reply from the PE covers that from the requested CE host, the packet for that local CE host which is sent from another local CE would be unnecessarily relayed by the PE. When Virtual Router Redundancy Protocol (VRRP) [RFC2338], together with ARP proxy is enabled on multiple PE routers which are attached to the same VPN site, only the PE acting as VRRP master is delegated to perform ARP proxy function on the shared VPN subnet. In addition, it SHOULD use the virtual MAC address of that VRRP group in any ARP packet it sends, e.g., an APR reply to the ARP request from a local CE hosts. 4.6. DHCP Relay Agent To avoid flooding Dynamic Host Configuration Protocol (DHCP) [RFC2131] broadcast messages through the data center network, DHCP Relay Agent can be implemented on PE routers for each attached VPN. Thus, DHCP broadcast messages received from DHCP clients on local CE hosts would be relayed by DHCP Relay Agents on PE routers to DHCP servers in unicast. 5. VS vs VPLS Virtual Private LAN Service (VPLS) [RFC4761, RFC4762] provides private LAN services for IP as well as other protocols. To some Xu Expires March 1, 2011 [Page 9] Internet-Draft Virtual Subnet September 2010 extent, PE routers in VPLS work much similar as STP Bridges. As a result, the broadcast storm issues are intactly inherited from traditional STP bridge networks to VPLS. At the cost of being lacking in support for non-IP traffic, VS alleviates the broadcast storm issues by using CE host route based Layer 3 routing and ARP proxy technologies on PE routers. In addition, if CE hosts of multiple VPNs are attached to a PE router through an intermediate Ethernet bridge, in VPLS, this intermediate bridge would have to learn the MAC addresses of both local CE hosts and remote CE hosts of these attached VPNs. However, in VS, such intermediate bridge only needs to learn MAC addresses of local CE hosts and local PE routers due to the ARP proxy implemented on PE routers. 6. VS vs IPLS Both VS and IP LAN Service (IPLS) [IPLS] are IP only L2VPN technologies. In IPLS, ARP packets even including the unicast ARP reply packets are forwarded from attachment circuits to "multicast" PWs (although ARP request broadcast packets can be suppressed by PEs on which there are matching ARP entries for the ARP requests in their ARP caches). Besides, the received APR packets from the "multicast" PWs will be flooded to all CEs. As a result, the broadcast storm imposed by ARP traffic is worsen rather than being alleviated. Besides, as said in IPLS, "An IP frame received over a unicast PW is prepended with a MAC header before transmitting it on the appropriate attachment circuits and the source MAC address is the PE's own local MAC address or a MAC address which has been specially configured on the PE for this use." As a result, the intermediary Ethernet switches between the PE and CEs can not keep the MAC entries of the remote CEs from expiring even there is continuous traffic between these CEs. Note that the destination MAC address of the packet to the remote CE which is sent from a local CE is the MAC of the remote CE, rather than the local PE's MAC. Thus, flooding unknown destination unicast frames would not be avoided anymore on the above Ethernet switches unless these intermediary switches are configured to not age out the learned MAC entries (whether such configuration has any side-effects is uncertain). Third, IPLS prohibits connection of a common LAN or VLAN to more than one PE. In other words, IPLS can not support CE hosts being multi-homed to multiple PE Routers to achieve redundancy and load-balancing. Xu Expires March 1, 2011 [Page 10] Internet-Draft Virtual Subnet September 2010 In contrast, all the above three issues with IPLS do not exist in VS while supporting IP only L2VPN services. 7. Conclusion By using Layer 3 routing on the backbone of the data center network to replace the STP Bridge forwarding, traffic between any two servers is forwarded along the shortest path between them and multi- path routing is easily achieved. Thus, the total network bandwidth of the data center network is utilized to maximum extent. By reusing the BGP/MPLS IP VPN technology to build large IP subnets across the backbones of data center networks, servers of a given VPN are allowed to communicate with each other just as if they are on the same subnet. Due to the BGP/MPLS IP VPN technology, forwarding tables of P routers is sized to the number of PE routers rather than the total number of servers. Meanwhile, forwarding tables of PE routers can also scale well by distributing VPN instances and their corresponding routing table entries among multiple PE routers. Especially, thanks to the Outbound Route Filtering (ORF) capability of BGP, PE routers only needs to maintain the routing tables of their attached VPNs. Thus, the forwarding table scalability issue with data center networks is largely alleviated. By enabling APR proxy function on PE routers, ARP broadcast messages from local CE hosts are blocked by local PE routers. Thus, APR broadcast messages will not flood the whole data center network. Besides, by enabling DHCP Relay Agent function on PE routers, DHCP broadcast messages from local CE hosts would be transformed into unicast messages by the DHCP Relay Agents and then be forwarded to DHCP servers in unicast. Thus, the broadcast storms in the data center networks are largely suppressed. 8. Future work How to support IPv6 CE hosts in VS is for future study. 9. Security Considerations TBD. 10. IANA Considerations There is no requirement for IANA. Xu Expires March 1, 2011 [Page 11] Internet-Draft Virtual Subnet September 2010 11. Acknowledgements Thanks to Dino Farinacci for his valuable comments. 12. References 12.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. 12.2. Informative References [RFC4364] Rosen. E and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, February 2006. [MVPN] Rosen. E and Aggarwal. R, "Multicast in MPLS/BGP IP VPNs", draft-ietf-l3vpn-2547bis-mcast-10.txt (work in progress), Janurary 2010. [MVPN-BGP] R. Aggarwal, E. Rosen, T. Morin, Y. Rekhter, C. Kodeboniya, "BGP Encodings for Multicast in MPLS/BGP IP VPNs", draft-ietf-l3vpn-2547bis-mcast-bgp-08.txt (work in progress), September 2009. [RFC826] Plummer, D., "An Ethernet Address Resolution Protocol or Converting Network Protocol Addresses to 48-bit Ethernet Addresses for Transmission on Ethernet Hardware", RFC-826, Symbolics, November 1982. [RFC925] Postel, J., "Multi-LAN Address Resolution", RFC-925, USC Information Sciences Institute, October 1984. [RFC1027] Smoot Carl-Mitchell, John S. Quarterman, "Using ARP to Implement Transparent Subnet Gateways", RFC 1027, October 1987. [RFC2338] Knight, S., et. al., "Virtual Router Redundancy Protocol", RFC 2338, April 1998. [RFC2131] Droms, R., "Dynamic Host Configuration Protocol", RFC 2131, March 1997. [RFC2236] Fenner, W., "Internet Group Management Protocol, Version 2", RFC 2236, November 1997. Xu Expires March 1, 2011 [Page 12] Internet-Draft Virtual Subnet September 2010 [RFC4761] Kompella, K. and Y. Rekhter, "Virtual Private LAN Service (VPLS) Using BGP for Auto-Discovery and Signaling", RFC 4761, January 2007. [RFC4762] Lasserre, M. and V. Kompella, "Virtual Private LAN Service (VPLS) Using Label Distribution Protocol (LDP) Signaling", RFC 4762, January 2007. [IPLS] H. Shah., et. al., "IP-Only LAN Service (IPLS)", draft-ietf- l2vpn-ipls-09.txt (work in progress), February 2010. Authors' Addresses Xiaohu Xu Huawei Technologies, No.3 Xinxi Rd., Shang-Di Information Industry Base, Hai-Dian District, Beijing 100085, P.R. China Phone: +86 10 82882573 Email: xuxh@huawei.com Xu Expires March 1, 2011 [Page 13]