Network working group X. Xu Internet Draft S. Hares Category: Informational Huawei Technologies Expires: September 2012 March 12, 2012 Virtual Subnet: A Host Route based Subnet Extension Solution draft-xu-virtual-subnet-07 Status of this Memo This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on September 12, 2012. Copyright Notice Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Xu & Hares Expires September 12, 2012 [Page 1] Internet-Draft Virtual Subnet March 2012 Abstract This document describes a host route based subnet extension solution referred to as Virtual Subnet, which mainly reuses existing BGP/MPLS IP VPN [RFC4364] and ARP proxy [RFC925][RFC1027] technologies. Virtual Subnet provides a scalable approach for interconnecting geographically dispersed cloud data centers. Conventions used in this document The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [RFC2119]. Table of Contents 1. Introduction ................................................ 3 2. Terminology ................................................. 5 3. Solution Description......................................... 5 3.1. Unicast ................................................ 5 3.1.1. Intra-subnet Unicast .............................. 5 3.1.2. Inter-subnet Unicast .............................. 6 3.2. Multicast/Broadcast .................................... 7 3.3. CE Host Discovery ...................................... 7 3.4. ARP Proxy .............................................. 8 3.5. CE Host Mobility ....................................... 8 3.6. Forwarding Table Scalability ........................... 8 3.6.1. MAC Table Reduction on DC Switches ................ 8 3.6.2. FIB Reduction on PE Routers ....................... 9 3.6.3. RIB Reduction on PE Routers ....................... 9 3.7. ARP Table Scalability on DC Gateways .................. 10 3.8. ARP/Unknown Uncast Flood Avoidance .................... 11 3.9. Active-active Multi-homing ............................ 11 3.10. Path Optimization .................................... 11 4. Future Work ................................................ 11 5. Security Considerations .................................... 11 6. IANA Considerations ........................................ 11 7. Acknowledgements ........................................... 12 8. References ................................................. 12 8.1. Normative References .................................. 12 8.2. Informative References ................................ 12 Authors' Addresses ............................................ 13 Xu & Hares Expires September 12, 2012 [Page 2] Internet-Draft Virtual Subnet March 2012 1. Introduction To achieve business continuance and disaster prevention, Virtual Machine (VM) migration across data centers is increasingly adopted in today's data centers. For instance, VM migration across data centers can be used in many scenarios such as data center maintenance, data center migration, data center consolidation, data center expansion, or data center disaster avoidance. For interconnecting multiple Infrastructure-as-a-Service (IaaS) cloud data centers in a scalable manner, the following requirements provided by data center and cloud operation groups SHOULD be considered for any solution: 1) Subnet Extension To allow seamlessly migration of a VM from one data center to another without IP renumbering, the subnet that the VM belongs to SHOULD be extended across those data centers. 2) VPN Instance Scalability In a modern cloud data center environment, thousands of even tens of thousands of tenants could be hosted over a shared network infrastructure. For security and performance isolation considerations, these tenants SHOULD be isolated from one another. Hence, the data center interconnect solution SHOULD be capable of providing a large enough VPN space, at least beyond 4K VPN instances, for tenant isolation. 3) Forwarding Table Scalability With the development of virtualization technologies, a single cloud data center containing millions of VMs is not uncommon today. [Gashinsky] noted in a NANOG 52 presentation there are 2.4 million VMs in a single data center. This number already implies a big challenge for data center switches from the perspective of forwarding table scalability. Hence an ideal data center interconnect solution SHOULD avoid the forwarding table size from growing by folds as the number of data centers to be interconnected increases. Furthermore, the use of existing L2VPN or L3VPN technology for interconnecting data centers needs to take into account the scale of forwarding tables of PE routers. 4) ARP Table Scalability on DC Gateways Xu & Hares Expires September 12, 2012 [Page 3] Internet-Draft Virtual Subnet March 2012 [NARTEN-ARMD] notes that the ARP table on data center gateway routers can be a "pain" point for cloud data centers that contains millions of VMs. Therefore, an ideal data center interconnection solution SHOULD avoid the ARP table size from growing by multiples as the number of data centers to be connected increases. 5) ARP/Unknown Unicast Flood Suppression or Avoidance It's well-known that the flooding of ARP broadcast and unknown unicast traffic within large Layer2 networks will lead to certain performance impact on both networks and hosts. As multiple data centers each containing millions of VMs are interconnected together across the WAN, and more importantly, the bandwidth of which is usually insufficient and costly, the problems mentioned above will become even worse. As such, how to suppress or even avoid the flooding of ARP broadcast and unknown unicast traffic across data centers becomes increasingly desirable for avoiding unnecessary consumption of the bandwidth resource and the CPU resource. 6) Active-active Multi-homing In order to utilize the bandwidth of all available upstream paths while providing resilient connectivity between the data center and the transport network, active-active multi-homing is increasingly advocated by data center operators as a replacement of the traditional active-standby multi-homing approach. 7) Path Optimization A subnet usually represents a location in the network. However, in a subnet that has been extended across multiple geographically dispersed data centers, the location semantics of subnets are lost. As a result, the client-to-server traffic between enterprise applications (i.e. cloud users), may travel via sub- optimal paths based on subnet routes. Furthermore, this traffic may travel across the expensive cross-data center links. This document describes a host route based subnet extension solution referred to as Virtual Subnet (VS), which meets the requirements for cloud data center interconnect as described above. Since VS mainly reuses existing technologies including BGP/MPLS IP VPN [RFC4364] and ARP proxy [RFC925][RFC1027], it allows service providers that are offering IaaS cloud services to realize a scalable data center interconnection based on their existing reliable MPLS/IP VPN infrastructures and provisioning experiences. Xu & Hares Expires September 12, 2012 [Page 4] Internet-Draft Virtual Subnet March 2012 Please note that VS is only targeted at scenarios where the traffic across data centers is just routable IP traffic. In such scenarios, data center operators could benefit from the advantages of this host route based data center interconnect solution over traditional VPLS solutions [RFC 4761, RFC4762]. For instance, in VS, the flooding domain associated with the IP subnet that has been extended across multiple data centers, is confined into each data center and therefore the performance impact on networks and servers imposed by the flooding of ARP broadcast and unknown unicast traffic is greatly alleviated. In addition, the forwarding table scalability of data center switches and PE routers, and the ARP table scalability of data center gateways are greatly improved as well. 2. Terminology This memo makes use of the terms defined in [RFC4364], [RFC2338] [MVPN] and [VA-AUTO]. 3. Solution Description 3.1. Unicast 3.1.1. Intra-subnet Unicast +--------------------+ +-----------------+ | | +------------------+ |VPN_A:10.0.0.0/8 | | | |VPN_A:10.0.0.0/8 | | | | | | | | +------+ ++---+-+ +-+---++ +------+ | | |Host A+----+ PE-1 | | PE-2 +----+Host B| | | +------+ ++-+-+-+ +-+-+-++ +------+ | | 10.1.1.1/8 | | | IP/MPLS Backbone | | | 10.1.1.2/8 | +-----------------+ | | | | +------------------+ | +--------------------+ | | | V V +-------+------------+--------+ +-------+------------+--------+ |VRF ID |Destination |Next Hop| |VRF ID |Destination |Next Hop| +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.1.1.1/32 | Local | | VPN_A |10.1.1.2/32 | Local | +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.1.1.2/32 | PE-2 | | VPN_A |10.1.1.1/32 | PE-1 | +-------+------------+--------+ +-------+------------+--------+ Figure 1: Intra-subnet Unicast Xu & Hares Expires September 12, 2012 [Page 5] Internet-Draft Virtual Subnet March 2012 As shown in Figure 1, CE hosts dispersed across different data centers are actually within a single subnet (e.g., 10.0.0.0/8). PE routers discover their locally connected CE hosts by some means such as ARP or ICMP scan, and then create host routes for these CE hosts. These host routes are distributed across PE routers by using the existing BGP/MPLS IP VPN signaling. Meanwhile, ARP proxy is enabled on each PE router. Thus, upon receiving from a local CE host an ARP request for a remote CE host, PE router would return its own MAC address as a response. Assume host A sends an ARP request for host B before communicating with each other. Upon receiving the ARP request, PE-1 lookups the associated VRF table to find the corresponding host route for the target host (i.e., host B). If found, PE-1 as an ARP proxy returns its own MAC address to the above ARP request. Host A will then send IP packets for host B towards PE-1, which in turn forwards the received packets towards PE-2 according to existing L3VPN forwarding procedures. PE-2 then forwards received packets to host B as normal. 3.1.2. Inter-subnet Unicast +--------------------+ +-----------------+ | | +-----------------+ |VPN_A:10.0.0.0/8 | | | |VPN_A:10.0.0.0/8 | | | | | | | | +------+ ++---+-+ +-+---++ +-------++ | |Host A+----+ PE-1 | | PE-2 +--------+ GW | | +------+ ++-+-+-+ +-+-+-++ +-------++ | 10.1.1.1/8 | | | IP/MPLS Backbone | | | 10.1.1.2/8 | +-----------------+ | | | | +-----------------+ | +--------------------+ | | | V V +-------+------------+--------+ +-------+------------+--------+ |VRF ID |Destination |Next Hop| |VRF ID |Destination |Next Hop| +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.1.1.1/32 | Local | | VPN_A |10.1.1.2/32 | Local | +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.1.1.2/32 | PE-2 | | VPN_A |10.1.1.1/32 | PE-1 | +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.0.0.0/8 | NULL | | VPN_A |10.0.0.0/8 | NULL | +-------+------------+--------+ +-------+------------+--------+ | VPN_A |0.0.0.0/0 | PE-2 | | VPN_A |0.0.0.0/0 | GW | +-------+------------+--------+ +-------+------------+--------+ Figure 2: Inter-subnet Unicast Xu & Hares Expires September 12, 2012 [Page 6] Internet-Draft Virtual Subnet March 2012 As shown in Figure 2, to allow a CE host (e.g., host A) to communicate with other hosts outside its own subnet, a PE router (e.g., PE-2) which is connected to a CE gateway router (e.g., GW) would either be configured with or learn from that CE gateway router at least one route for the outside networks (e.g., a default route) with next-hop pointing to that CE gateway router, and this route would be distributed to other PE routers. Now host A sends an ARP request for its default gateway (i.e., GW) before communicating with a destination host outside its subnet. Upon receiving this ARP request, PE-1 as an ARP proxy returns its own MAC address as a response according to the same procedure as described in the above section. Upon receiving a packet destined for the outside from host A, PE-1 forwards it towards PE-2 according to the default route, which in turn forwards that packet to GW according to the default route as well. To avoid forwarding packets destined for nonexistent hosts within the scope of their configured VPN subnet mistakenly according to the default route, PE routers each are suggested to be configured with a null route for that VPN subnet. 3.2. Multicast/Broadcast To support IP multicast and broadcast between CE hosts of the same VPN instance, the MVPN technology [MVPN] could be reused in VS. For example, PE routers attached to a given VPN join a default provider multicast distribution tree which is dedicated for that VPN. Ingress PE routers, upon receiving customer multicast or broadcast traffic from their local CE hosts, forward such customer traffic towards remote PE routers of the same VPN through the corresponding default provider multicast distribution tree. More details about how to support multicast and broadcast in VS will be explored in a later version of this document. 3.3. CE Host Discovery PE routers must be able to discovery their locally connected CE hosts and keep the list of local CE hosts up to date in a timely manner so as to keep the availability of the host route information, especially after its rebooting up. PE routers could accomplish local CE host discovery by some traditional host discovery means such as ARP scan or ICMP scan. To keep their ARP cache fresh while avoiding too frequent ARP/ICMP scan behaviors, PE routers could send unicast ARP requests to those already discovered local CE hosts at shorter intervals as a complement of the regular ARP or ICMP scan at longer intervals. Xu & Hares Expires September 12, 2012 [Page 7] Internet-Draft Virtual Subnet March 2012 Furthermore, Link Layer Discovery Protocol (LLDP) described in [802.1AB] or VSI Discovery and Configuration Protocol (VDP) described in [802.1Qbg], or even interaction with the data center orchestration system could also be considered as a means for local CE host discovery. 3.4. ARP Proxy A PE router as an ARP proxy SHOULD only respond to ARP requests for those CE hosts for which there are matching host routes and those host routes are learnt from remote PE routers. In other words, the PE router SHOULD not respond to ARP requests for its local CE hosts or those nonexistent CE hosts within the configured subnet (i.e., no matching host routes found). In the CE multi-homing scenario where VRRP is enabled on PE routers, only the PE router which has been elected as the VRRP master is entitled to perform the ARP proxy function. 3.5. CE Host Mobility Once a CE host (e.g., a VM) moves from one VPN site to another, it will usually send out a gratuitous ARP request or reply when attaching to the new VPN site. Upon receiving that gratuitous ARP packet, the PE router attached to the new VPN site will create a host route for that CE host and then advertise that route to remote PE routers. When the old PE router to which that CE host was previously attached, receives the route update for that CE host, it would immediately send an ARP request or ICMP echo request for that CE host locally to check whether or not that CE host is still locally connected to it. If no response is returned within a certain period of time, the PE router would delete the ARP cache entry of that CE host and withdraw the corresponding host route. Meanwhile, the PE router would broadcast a gratuitous ARP packet on behalf of that CE host, with the sender hardware address field being filled with its own MAC addresses. As such, the ARP entry for that CE host that is cached on any local CE host would be updated accordingly. 3.6. Forwarding Table Scalability 3.6.1. MAC Table Reduction on DC Switches Since the MAC learning domain associated with a given subnet that has been extended across multiple data centers, has been partitioned and confined within each data center location, CE switches in data centers only needs to learn MAC addresses of local CE hosts, without any need for learning MAC addresses of remote CE hosts. Xu & Hares Expires September 12, 2012 [Page 8] Internet-Draft Virtual Subnet March 2012 3.6.2. FIB Reduction on PE Routers To suppress the FIB of PE routers, VS reuses the Virtual Aggregation [VA-AUTO] technology to achieve on-demand FIB installation. The detailed procedures are described as follows: 1) The extended subnet is configured as a Virtual Prefix (VP) and a Route-Reflector (RR) is configured as an Aggregation Point Router (APR) for that VP. PE routers as RR clients advertise host routes for their local CE hosts to the RR which in turn, as an APR, installs those host routes into FIB and then attach the "can- suppress" tag to those host routes before reflecting them to its clients. Those host routes attached with that tag will not be installed into FIB by clients since they are not APRs for those host routes. In addition, the RR as an APR would advertise a VP route (i.e., a route for the extended subnet) to all of its clients which in turn would install this VP route into FIB as normal. Upon receiving a packet from a local CE host, if no matching host route found, the packet will be forwarded by the ingress PE router to the RR according to the subnet route learnt from the RR, which in turn forwards the packet to the egress PE router according to the matching host route learnt from that egress PE router. In a word, the FIB table size of PE routers can be greatly reduced at the cost of path stretch. 2) Provided an ARP request for a remote CE host within the extended subnet is received from a given local CE host, the ingress PE router will immediately install the corresponding host route for that remote CE host into FIB, in case that there exists a host route for that CE host in the RIB but it has not yet been installed into FIB. Therefore, the forwarding path for the traffic destined for that remote CE host is optimized. Note that the FIB entries of remote host routes could expire if there has been no traffic using them for a certain period of time. 3.6.3. RIB Reduction on PE Routers To suppress the RIB of PE routers, VS reuses the BGP Outbound Route Filtering (ORF) mechanism to realize on-demand route announcement. Note that to use this RIB reduction approach, the ARP proxy functionality described before SHOULD be changed as: a PE router as an ARP proxy SHOULD respond to the ARP request for any CE host within the extended subnet, as long as the requested CE host is not a local CE host. The detailed procedures are described as follows: Xu & Hares Expires September 12, 2012 [Page 9] Internet-Draft Virtual Subnet March 2012 1) PE routers as RR clients advertise host routes for their local CE hosts to a RR which in turn, however doesn't reflect these host routes by default unless it receives corresponding ORF requests for them from its clients. The RR is configured with a null route for the extended subnet and that route is then advertised to all of its clients. Upon receiving a packet from a local CE host, if no matching host route found, the packet will be forwarded by the ingress PE router to the RR according to the subnet route learnt from the RR, which in turn forwards the packet to the egress PE router according to the matching host route learnt from that egress PE router. In a word, the RIB table size of PE routers can be greatly reduced at the cost of path stretch. 2) Provided an ARP request for a remote CE host within the extended subnet is received from a given local CE host, the ingress PE router will request the corresponding host route from its RR by using ORF (e.g., a group ORF containing Route-Target (RT) and prefix information) in case that there is no host route for that CE host yet in its RIB. Once the host route for the remote CE host is received from the RR, the subsequent packets destined for that CE host would be forwarded directly from the ingress PE router to the egress PE router. Note that the RIB entries of remote host routes could expire if there has been no traffic using their corresponding FIB entries for a certain period of time. Once the expiration time for a given RIB entry is approaching, the PE router would notice its RR to withdraw the corresponding host route by sending an ORF message. Upon receiving the corresponding withdraw message from its RR, the PE router will delete that host route from its RIB as normal. 3.7. ARP Table Scalability on DC Gateways In VS, DC gateways are directly connected to PE routers of the VS. Due to the usage of ARP proxy on PE routers, all CE hosts of a given VPN subnet share one MAC address (i.e., the MAC address of the connected PE router) from the DC gateways' points of view. Therefore, ARP entries for all CE hosts of a given VPN subnet can be aggregated into a single ARP entry (e.g., 10.0.0.0/8-> the MAC address of the PE router). Accordingly, DC gateways are required to use the longest- matching algorithm for ARP cache lookup instead of the existing exact-matching algorithm. In this way, the ARP table size of DC gateways can be reduced by several orders of magnitude. The side-effect introduced by this approach is DC gateways could possibly send packets destined for non-existing CE hosts to their connected PE routers of the VS. Fortunately, once those packets Xu & Hares Expires September 12, 2012 [Page 10] Internet-Draft Virtual Subnet March 2012 arrive at the PE routers, which in turn will drop them due to route missing. 3.8. ARP/Unknown Uncast Flood Avoidance In VS, the flooding domain associated with a given subnet that has been extended across multiple data centers, has been partitioned and confined within each data center. Therefore, the performance impact on networks and servers caused by the flooding of ARP broadcast and unknown unicast traffic is alleviated for the IP-only traffic case. 3.9. Active-active Multi-homing For the PE router redundancy purpose, a VPN site could be connected to more than one PE router. In this case, VRRP [RFC2338] SHOULD be enabled on these PE routers and only the PE router which has been elected as the VRRP master could perform the ARP proxy functionality. However, all PE routers, either as a VRRP master or a VRRP slave, are allowed to perform the host discovery procedure and accordingly advertise host routes for their local CE hosts. Hence, from the perspective of remote PE routers, there will be multiple host routes for a given CE host located within that multi-homed VPN site. In other words, active-active multi-homing is available for the inbound traffic of a given multi-homed VPN site. 3.10. Path Optimization To optimize the path for traffic between enterprise VPN sites (e.g., cloud users) and cloud data centers, host routes for CE hosts within cloud data centers could be directly propagated to the enterprise VPN sites. 4. Future Work How to support IPv6 CE hosts in VS is for future study. 5. Security Considerations TBD. 6. IANA Considerations There is no requirement for IANA. Xu & Hares Expires September 12, 2012 [Page 11] Internet-Draft Virtual Subnet March 2012 7. Acknowledgements Thanks to Dino Farinacci, Himanshu Shah, Nabil Bitar, Giles Heron, Ronald Bonica and Monique Morrow for their valuable comments and suggestions on this document. 8. References 8.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. 8.2. Informative References [RFC4364] Rosen. E and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, February 2006. [MVPN] Rosen. E and Aggarwal. R, "Multicast in MPLS/BGP IP VPNs", draft-ietf-l3vpn-2547bis-mcast-10.txt, Work in Progress, Janurary 2010. [VA-AUTO] Francis, P., Xu, X., Ballani, H., Jen, D., Raszuk, R., and L. Zhang, "Auto-Configuration in Virtual Aggregation", draft-ietf-grow-va-auto-05.txt, Work in Progress, December 2011. [RFC925] Postel, J., "Multi-LAN Address Resolution", RFC-925, USC Information Sciences Institute, October 1984. [RFC1027] Smoot Carl-Mitchell, John S. Quarterman, "Using ARP to Implement Transparent Subnet Gateways", RFC 1027, October 1987. [RFC2338] Knight, S., et. al., "Virtual Router Redundancy Protocol", RFC 2338, April 1998. [RFC4761] Kompella, K. and Y. Rekhter, "Virtual Private LAN Service (VPLS) Using BGP for Auto-Discovery and Signaling", RFC 4761, January 2007. [RFC4762] Lasserre, M. and V. Kompella, "Virtual Private LAN Service (VPLS) Using Label Distribution Protocol (LDP) Signaling", RFC 4762, January 2007. [802.1AB] IEEE Standard 802.1AB-2009, "Station and Media Access Control Connectivity Discovery", September 17, 2009. Xu & Hares Expires September 12, 2012 [Page 12] Internet-Draft Virtual Subnet March 2012 [802.1Qbg] IEEE Draft Standard P802.1Qbg/D2.0, "Virtual Bridged Local Area Networks -Amendment XX: Edge Virtual Bridging", Work in Progress, December 1, 2011. [NARTEN-ARMD] Narten, T., Karir, M., and I. Foo, "Problem Statement for ARMD", draft-ietf-armd-problem-statement-01.txt, Work in Progress, February 2012. [Gashinsky] Igor Gashinsky, "Data Center Scalability Panel", http://www.nanog.org/meetings/nanog52/presentations/Tuesda y/Gashinsky-3-Y-Datacenter-scalability.pdf, June 14, 2010, Authors' Addresses Xiaohu Xu Huawei Technologies, Beijing, China. Phone: +86 10 60610041 Email: xuxiaohu@huawei.com Susan Hares Huawei Technologies (FutureWei group) 2330 Central Expressway Santa Clara, CA 95050 Phone: +1-734-604-0332 Email: Susan.Hares@huawei.com shares@ndzh.com Xu & Hares Expires September 12, 2012 [Page 13]