Network Working Group                                             X. Xu
Internet-Draft                                      Huawei Technologies
Category: Standards Track                                August 24, 2010
Expires: February 2011


      Virtual Subnet: A Scalable Data Center Network Architecture
                       draft-xu-virtual-subnet-02

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups.  Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on February 24, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

   This document proposes a scalable data center network architecture that uses a Layer 3 routing infrastructure, rather than a Spanning Tree Protocol (STP) bridged network, to provide scalable virtual Layer 2 connectivity services.

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Table of Contents

   1. Problem Statement
   2. Terminology
   3. Design Goals
   4. Architecture Description
      4.1. Unicast
         4.1.1. Communications within a Service Domain
         4.1.2. Communications between Service Domains
      4.2. Multicast/Broadcast
      4.3. Host Discovery
      4.4. ARP Proxy
      4.5. DHCP Relay Agent
   5. Conclusions
   6. Limitations
   7. Future work
   8. Security Considerations
   9. IANA Considerations
   10. Acknowledgements
   11. References
      11.1. Normative References
      11.2. Informative References
   Authors' Addresses
1. Problem Statement

   With the popularity of cloud services, today's data centers are growing larger and larger. In addition, virtual machine migration technology, which allows a virtual machine to migrate to any physical server while keeping the same IP address, is becoming more and more prevalent as a means of achieving service agility in data centers. As a result, large Layer 2 networks are needed for server-to-server connectivity. Meanwhile, due to the huge volume of traffic exchanged between servers, these Layer 2 networks SHOULD provide enough capacity for server-to-server interconnections.

   Unfortunately, today's data center network architecture, which relies on Spanning Tree Protocol (STP) bridging, cannot address the above challenges facing large-scale data centers, i.e., a very large number of servers and high-bandwidth demands for server-to-server interconnections. First, STP computes only a single forwarding tree for all connected servers and cannot support multi-path routing, e.g., Equal Cost Multi-Path (ECMP); hence it cannot fully utilize the available network resources to provide enough bandwidth capacity between servers. Second, since bridge forwarding is based on flat MAC addresses, the scalability of the forwarding table becomes a serious issue, especially as an already large Layer 2 network grows even larger. Third, the impact of broadcast storms on network performance becomes much more serious and unpredictable in continually growing large-scale bridged networks.

2. Terminology

   This memo makes use of the terms defined in [RFC4364], [MVPN], [RFC2236] and [RFC2131]. The following term is specific to this document:

   - Service Domain: A group of servers which are dedicated to a given service and are usually located in a separate IP subnet.

3. Design Goals

   To overcome the limitations of STP bridged networks, this document proposes a new network architecture for data centers, called Virtual Subnet (VS), which aims to meet the following design objectives:

   - Bandwidth Utilization Maximization

   To provide enough bandwidth between servers, server-to-server traffic SHOULD always be delivered over the shortest path while achieving load balancing through multi-path routing.

   - Layer 2 Connectivity

   To be backward compatible with the applications currently running in data centers (e.g., virtual machine migration), the servers of a given service domain SHOULD be connected as if they were on a single Local Area Network (LAN) or IP subnet.

   - Domain Isolation

   For reasons of performance isolation and security, servers belonging to different service domains SHOULD be isolated just as if they were located on dedicated Virtual LANs (VLANs) or IP subnets.

   - Forwarding Table Scalability

   To accommodate tens to hundreds of thousands of servers within a single data center network, the forwarding tables of all forwarding devices (e.g., routers or switches) SHOULD scale accordingly.

   - Broadcast Storm Suppression

   To reduce the impact of broadcast storms on network performance, broadcast domains SHOULD be limited to their smallest possible scope.

4. Architecture Description

   VS uses BGP/MPLS VPN technology [RFC4364] with some extensions, together with other proven technologies including ARP proxy [RFC925][RFC1027], to build a scalable, large IP subnet across the MPLS/IP backbone of the data center network.
   As a result, VS can be deployed today as a scalable data center network. The following sections describe VS in detail.

4.1. Unicast

4.1.1. Communications within a Service Domain

   As shown in Figure 1, BGP/MPLS VPN technology with some extensions is deployed in the data center network as an alternative to STP bridging. To achieve service domain isolation, each service domain is mapped to a distinct VPN, and the servers of a given service domain, acting as Customer Edge (CE) hosts, are attached to the Provider Edge (PE) routers of the corresponding VPN, either directly or through one or more Ethernet switches. In addition, to build a large IP subnet across the MPLS/IP backbone, the different sites of a particular VPN are associated with one and the same IP subnet. That is to say, each PE attached to a given VPN is configured, on the corresponding Virtual Routing and Forwarding (VRF) attachment circuits, with a distinct IP address taken from that shared IP subnet. Each PE automatically creates connected host routes in each attached VRF according to the Address Resolution Protocol (ARP) table of the corresponding VPN. Instead of exchanging a route for the configured IP subnet, the PEs belonging to a given VPN exchange these connected host routes among themselves via BGP. In addition, ARP proxy is enabled on the PEs for each attached VPN; thus, upon receiving from a local CE host an ARP request for a remote CE host, the PE, acting as an ARP proxy, returns one of its own MAC addresses in the corresponding ARP reply.

    VPN_A: 10/8                                              VPN_A: 10/8
    +------+      +------+    IP/MPLS Backbone    +------+      +------+
    |Host A+------+ PE-1 +========================+ PE-2 +------+Host B|
    +------+      +------+                        +------+      +------+
     10.1.1.1/8       |                               |       10.1.1.2/8
                      V                               V
    +-------+------------+--------+     +-------+------------+--------+
    |VRF ID |Destination |Next Hop|     |VRF ID |Destination |Next Hop|
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.1/32 | Local  |     | VPN_A |10.1.1.2/32 | Local  |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.2/32 | PE-2   |     | VPN_A |10.1.1.1/32 | PE-1   |
    +-------+------------+--------+     +-------+------------+--------+

              Figure 1: Intra-domain Communication Example

   Now assume host A broadcasts an ARP request for host B before communicating with B. Upon receipt of this ARP request, PE-1 looks up the associated VRF to find a host route for B. If such a route is found and has been learned from a remote PE, PE-1, acting as an ARP proxy, returns one of its own MAC addresses in response to that ARP request. Otherwise, no ARP reply SHOULD be sent. After obtaining the ARP reply from PE-1, A sends an IP packet to B with the destination MAC address set to PE-1's MAC address. Upon receiving this packet, PE-1, acting as an ingress PE, tunnels the packet towards PE-2, which in turn, as an egress PE, forwards the packet to B.
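   The PE behavior described above can be summarized by the following Python sketch. It is purely illustrative: the data structures, function names, and MAC address value are assumptions made for this example and are not part of the VS specification.

      import ipaddress

      # Illustrative model of a per-VPN VRF holding /32 host routes, as in
      # Figure 1.  A next hop of "Local" means the host is attached to this
      # PE; any other value names the remote PE that advertised the route
      # via BGP.
      class Vrf:
          def __init__(self, vpn_id):
              self.vpn_id = vpn_id
              self.host_routes = {}    # ip address -> "Local" or remote PE name

          def add_host_route(self, ip, next_hop):
              self.host_routes[ipaddress.ip_address(ip)] = next_hop

      PE_MAC = "00:11:22:33:44:01"     # one of PE-1's own MACs (made-up value)

      def handle_arp_request(vrf, target_ip):
          """ARP-proxy decision for an ARP request received from a local CE."""
          next_hop = vrf.host_routes.get(ipaddress.ip_address(target_ip))
          if next_hop is not None and next_hop != "Local":
              return PE_MAC            # proxy-reply with the PE's own MAC
          return None                  # local or unknown target: stay silent

      def forward_ip_packet(vrf, dest_ip):
          """Ingress forwarding: tunnel towards the PE that advertised the route."""
          next_hop = vrf.host_routes.get(ipaddress.ip_address(dest_ip))
          if next_hop == "Local":
              return "deliver on the local attachment circuit"
          if next_hop is not None:
              return "tunnel over the MPLS/IP backbone to " + next_hop
          return "drop (no route)"

      # PE-1's VRF for VPN_A as shown in Figure 1.
      vrf_a = Vrf("VPN_A")
      vrf_a.add_host_route("10.1.1.1", "Local")   # Host A, attached locally
      vrf_a.add_host_route("10.1.1.2", "PE-2")    # Host B, learned from PE-2

      assert handle_arp_request(vrf_a, "10.1.1.2") == PE_MAC  # remote: proxy reply
      assert handle_arp_request(vrf_a, "10.1.1.1") is None    # local: no reply
      print(forward_ip_packet(vrf_a, "10.1.1.2"))             # tunnel ... to PE-2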
4.1.2. Communications between Service Domains

   For servers in different VPNs (i.e., service domains) to communicate with each other, these VPNs SHOULD NOT be configured with any overlapping addresses, and each VPN SHOULD be configured with a default route towards the corresponding default gateway (i.e., a CE router).

    +-------+------------+--------+     +-------+------------+--------+
    |VRF ID |Destination |Next Hop|     |VRF ID |Destination |Next Hop|
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.2/32 | PE-1   |     | VPN_B |20.1.1.2/32 | PE-2   |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.1/32 | Local  |     | VPN_B |20.1.1.1/32 | Local  |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A | 0.0.0.0/0  |10.1.1.1|     | VPN_B | 0.0.0.0/0  |20.1.1.1|
    +-------+------------+--------+     +-------+------------+--------+
        ^                                                       ^
        |               +--------------------+                  |
        |               |     IP Network     |                  |
        |               +---+------------+---+                  |
        |                   |            |                      |
        |               +---+--+      +--+---+                  |
        |               | GW-1 |      | GW-2 |                  |
        |               +---+--+      +--+---+                  |
        |     VPN A:        |            |       VPN B:         |
        |     10.1.1.1/8    |            |       20.1.1.1/8     |
        |               +---+--+      +--+---+                  |
        +---------------+ PE-3 |      | PE-4 +------------------+
                        +------+      +------+

    VPN A: 10/8               IP/MPLS Backbone            VPN B: 20/8
    +------+      +------+                        +------+      +------+
    |Host A+------+ PE-1 +========================+ PE-2 +------+Host B|
    +------+      +------+                        +------+      +------+
     10.1.1.2/8       |                               |       20.1.1.2/8
                      V                               V
    +-------+------------+--------+     +-------+------------+--------+
    |VRF ID |Destination |Next Hop|     |VRF ID |Destination |Next Hop|
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.2/32 | Local  |     | VPN_B |20.1.1.2/32 | Local  |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.1/32 | PE-3   |     | VPN_B |20.1.1.1/32 | PE-4   |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A | 0.0.0.0/0  | PE-3   |     | VPN_B | 0.0.0.0/0  | PE-4   |
    +-------+------------+--------+     +-------+------------+--------+

              Figure 2: Inter-domain Communication Example

   As shown in Figure 2, PE-1 and PE-3 are attached to one VPN (i.e., VPN A) while PE-2 and PE-4 are attached to another VPN (i.e., VPN B). Host A and its default gateway router (i.e., GW-1) are connected to PE-1 and PE-3, respectively. PE-3 is configured with a default route for VPN A, and this default route is advertised to the other PEs. Similarly, host B and its default gateway router (i.e., GW-2) are connected to PE-2 and PE-4, respectively. PE-4 is configured with a default route for VPN B, and this default route is advertised to the other PEs.

   Now A sends an ARP request for its default gateway (i.e., 10.1.1.1) before communicating with B. Upon receiving this ARP request, PE-1 looks up the associated VRF to find a host route for the default gateway. If such a route is found and has been learned from a remote PE, PE-1, acting as an ARP proxy, returns one of its own MAC addresses in the ARP reply. After obtaining the ARP reply, A sends an IP packet for B with the destination MAC address set to PE-1's MAC address. Upon receiving this packet, PE-1, acting as an ingress PE, tunnels it towards PE-3 according to the best-match route for that packet (i.e., the default route) in the associated VRF. PE-3, in turn, acting as an egress PE, forwards the packet towards the default gateway router (i.e., GW-1). After traveling through the IP network, the packet arrives at B's default gateway router (i.e., GW-2). If GW-2 has already learned an ARP entry for B from PE-4, it forwards the packet to PE-4 with the destination MAC address set to PE-4's MAC address; otherwise, GW-2 first broadcasts an ARP request for B. Upon receiving this packet, PE-4, acting as an ingress PE, tunnels it towards PE-2, which in turn forwards it towards B.
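   The forwarding decision taken by the ingress PE in this example can be illustrated with the short Python sketch below. It simply performs a longest-prefix match over the host routes and the per-VPN default route of Figure 2; the function name and route representation are assumptions made for illustration only.

      import ipaddress

      def best_match(vrf_routes, dest_ip):
          """Longest-prefix match over a VRF's routes (host routes plus default)."""
          dest = ipaddress.ip_address(dest_ip)
          matches = [(ipaddress.ip_network(prefix), next_hop)
                     for prefix, next_hop in vrf_routes.items()
                     if dest in ipaddress.ip_network(prefix)]
          if not matches:
              return None
          # The most specific prefix wins; /32 host routes beat 0.0.0.0/0.
          return max(matches, key=lambda m: m[0].prefixlen)[1]

      # PE-1's VRF for VPN A (bottom-left table of Figure 2).
      pe1_vpn_a = {
          "10.1.1.2/32": "Local",   # Host A, attached locally
          "10.1.1.1/32": "PE-3",    # default gateway GW-1, reachable via PE-3
          "0.0.0.0/0":   "PE-3",    # default route advertised by PE-3
      }

      print(best_match(pe1_vpn_a, "10.1.1.1"))  # PE-3 (host route for GW-1)
      print(best_match(pe1_vpn_a, "20.1.1.2"))  # PE-3 (falls back to the default)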
4.2. Multicast/Broadcast

   MVPN technology [MVPN], especially the Protocol Independent Multicast (PIM) tree option with some extensions, is partially reused here to support link-local multicast between servers of a given service domain (i.e., VPN). That is to say, the customer multicast group addresses of a given VPN are mapped 1:1 or N:1 to provider multicast groups dedicated to that VPN when the customer multicast traffic is transported across the backbone. For broadcast, a dedicated provider multicast group is reserved for carrying broadcast traffic across the IP/MPLS backbone. In other words, customer broadcast is processed on the PEs as a special customer multicast group. Unless otherwise mentioned, the term "customer multicast" below covers both customer multicast and customer broadcast.

   All PEs attached to a given VPN SHOULD maintain identical mappings from customer multicast group addresses to provider multicast group addresses. To isolate the customer multicast traffic of different VPNs traveling through the backbone, different VPNs SHOULD be assigned distinct, non-overlapping provider multicast group address ranges.

    VPN_A: 10/8                                              VPN_A: 10/8
    +------+  E0  +------+    IP/MPLS Backbone    +------+      +------+
    |Host A+------+ PE-1 +========================+ PE-2 +------+Host B|
    +------+      +------+                        +------+      +------+
     10.1.1.1/8       |                                       10.1.1.2/8
                      V
    +-------+---------------+----------+-------+--------+
    |VRF ID | Customer G    |Provider G| To PE | From PE|
    +-------+---------------+----------+-------+--------+
    | VPN_A | 224.1.1.1/32  | 239.1.1.1| True  | True   |
    +-------+---------------+----------+-------+--------+
    | VPN_A | 224.0.0.0/4   | 239.1.1.2| True  | True   |
    +-------+---------------+----------+-------+--------+
    | VPN_A |255.255.255.255| 239.1.1.3| True  | True   |
    +-------+---------------+----------+-------+--------+

      Figure 3: Link-local Multicast/Broadcast Communication Example

   A multicast forwarding entry can be configured manually by the network operator or generated dynamically according to the Internet Group Management Protocol (IGMP) Membership Report/Leave messages received from CE hosts or remote PEs. Ingress PEs forward customer multicast packets to the other PEs (i.e., egress PEs) of the same VPN via a provider multicast distribution tree, according to the best-match multicast forwarding entry of the associated VRF, provided that the "To PE" field of that entry is set to True. Otherwise (i.e., if that field is set to False), ingress PEs are not allowed to forward the customer multicast packets to remote egress PEs. Egress PEs forward customer multicast packets received from the provider multicast distribution tree to CE hosts via the VRF attachment circuits, according to the best-match multicast forwarding entry of the associated VRF, provided that the "From PE" field of that entry is set to True. Otherwise (i.e., if that field is set to False), egress PEs are not allowed to forward the customer multicast packets to CE hosts.
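   The best-match behavior of these multicast forwarding entries can be illustrated with the following Python sketch, using the VPN_A entries of Figure 3. The table representation and function name are assumptions made for this example only.

      import ipaddress

      # Multicast forwarding entries for VPN_A, mirroring Figure 3:
      # (customer group prefix, provider group, To PE, From PE).
      VPN_A_MCAST = [
          ("224.1.1.1/32",       "239.1.1.1", True, True),  # specific customer group
          ("224.0.0.0/4",        "239.1.1.2", True, True),  # any other multicast group
          ("255.255.255.255/32", "239.1.1.3", True, True),  # customer broadcast
      ]

      def provider_group(entries, customer_dest, direction):
          """Best-match lookup of the provider group carrying a customer packet.

          direction is "to_pe" on an ingress PE (towards the backbone) and
          "from_pe" on an egress PE (towards local CE hosts).  None means the
          packet must not be forwarded in that direction.
          """
          dest = ipaddress.ip_address(customer_dest)
          best = None
          for prefix, p_group, to_pe, from_pe in entries:
              net = ipaddress.ip_network(prefix)
              if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
                  allowed = to_pe if direction == "to_pe" else from_pe
                  best = (net, p_group if allowed else None)
          return best[1] if best else None

      print(provider_group(VPN_A_MCAST, "224.1.1.1", "to_pe"))        # 239.1.1.1
      print(provider_group(VPN_A_MCAST, "224.5.6.7", "to_pe"))        # 239.1.1.2
      print(provider_group(VPN_A_MCAST, "255.255.255.255", "to_pe"))  # 239.1.1.3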
   For IGMP messages to be conveyed successfully across the IP/MPLS backbone, multicast forwarding entries for certain special multicast groups, including the all-routers group (i.e., 224.0.0.2) and the all-systems group (i.e., 224.0.0.1), SHOULD be configured in the corresponding VRF in advance. Besides, according to the IGMP specification [RFC2236], Group-Specific Query messages are sent to the group being queried and Membership Report messages are sent to the group being reported. Upon receiving such messages from CE hosts, the PE SHOULD convey them over the provider multicast distribution tree dedicated to the all-systems group (224.0.0.1) of the given VRF. To avoid IGMP Membership Report suppression, Membership Report messages received from PEs or CE hosts SHOULD NOT be forwarded to CE hosts. As an alternative to conveying IGMP Report/Leave messages through the provider multicast distribution tree, customer multicast routing information can also be exchanged among PEs by using the approaches defined in [MVPN-BGP].

   As shown in Figure 3, upon receiving a multicast/broadcast packet from a CE (e.g., host A), PE-1 proceeds as follows: if the packet is destined for 224.1.1.1, PE-1 encapsulates it in a provider multicast packet with a destination IP address of 239.1.1.1; if it is destined for an IP multicast address other than 224.1.1.1, PE-1 encapsulates it in a provider multicast packet with a destination IP address of 239.1.1.2; and if it is a broadcast packet, PE-1 encapsulates it in a provider multicast packet with a destination IP address of 239.1.1.3, which is dedicated to conveying broadcast packets of that VPN.

   The customer multicast forwarding entries, whether configured manually or learned automatically from the IGMP Membership Reports sent by local CEs, automatically trigger PEs to join the corresponding provider multicast groups in the MPLS/IP backbone. For example, if PE-2 receives an IGMP Membership Report for a given customer multicast group (e.g., 224.1.1.1) from a local CE (e.g., host B), it SHOULD automatically join the provider multicast group (i.e., 239.1.1.1) corresponding to that customer multicast group.

4.3. Host Discovery

   To discover all local CE hosts, a PE SHOULD perform an ARP scan at least once after rebooting. For example, it broadcasts an ARP request for each IP address within the subnet of each attached VPN (including the network and broadcast addresses). Alternatively, it could broadcast an ARP request for the limited broadcast address (i.e., 255.255.255.255); upon receipt of such an ARP request, any host SHOULD respond with an ARP reply containing its IP and MAC addresses. After a round of ARP scanning, the PE will have discovered all local CE hosts and cached the corresponding ARP entries in its ARP table. After that, the PE can periodically send unicast ARP requests to each already-learned local CE host so as to keep the corresponding ARP entry from expiring. This is also useful for checking whether a given CE host with known IP and MAC addresses is still present on the subnet. Unicast ARP requests have the advantage of being quieter than broadcasts, since they are not received by every host on the subnet.

   When receiving a gratuitous ARP from a local host, the PE SHOULD immediately cache it in the ARP table if no ARP entry for that host exists yet; otherwise, the PE SHOULD simply update the corresponding ARP entry in its ARP table. Most operating systems generate a gratuitous ARP request when the host boots up, when the host's network interface or link comes up, or when an address assigned to the interface changes. In the rare scenarios where a host does not generate gratuitous ARPs, the PE would have to perform ARP scans periodically, even though doing so has side effects on network performance.
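   The host discovery procedure above can be summarized by the following illustrative Python sketch. The class, function names, and timer value are assumptions made for this example; they are not specified by this document.

      import time

      ARP_ENTRY_LIFETIME = 300      # seconds; an assumed value, not specified here

      class ArpTable:
          """Per-VRF ARP cache built from ARP scanning and gratuitous ARPs."""

          def __init__(self):
              self.entries = {}                      # ip -> (mac, last_seen)

          def learn(self, ip, mac):
              self.entries[ip] = (mac, time.time())  # add or refresh an entry

          def stale(self):
              now = time.time()
              return [ip for ip, (_, seen) in self.entries.items()
                      if now - seen > ARP_ENTRY_LIFETIME]

      def arp_scan(subnet_hosts, send_broadcast_arp):
          """Initial discovery after reboot: broadcast one ARP request per address."""
          for ip in subnet_hosts:
              send_broadcast_arp(ip)

      def refresh(arp_table, send_unicast_arp):
          """Periodic keepalive: quieter unicast requests to already-known hosts."""
          for ip, (mac, _) in arp_table.entries.items():
              send_unicast_arp(ip, mac)

      def on_gratuitous_arp(arp_table, ip, mac):
          """Cache or update an entry whenever a local host announces itself."""
          arp_table.learn(ip, mac)

      table = ArpTable()
      on_gratuitous_arp(table, "10.1.1.1", "aa:bb:cc:dd:ee:01")
      refresh(table, lambda ip, mac: print("unicast ARP request to", ip))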
   When a given PE receives, from a remote PE, a host route for one of its local CE hosts, it SHOULD immediately send an ARP request for that local CE host to check whether the host is still connected locally. If an ARP reply is received within a short period of time (consider the host multi-homing scenario), the PE just needs to update the ARP entry for that local host as normal. Otherwise (consider the virtual machine migration scenario), the PE SHOULD delete the ARP entry corresponding to that host from its ARP table. Meanwhile, the PE SHOULD send a gratuitous ARP on behalf of that host, with the sender hardware address set to one of its own MAC addresses, in order to update the ARP entry for that host cached on the other local hosts. As a result, subsequent packets destined for that host will be sent towards the PE by the other local CE hosts.

4.4. ARP Proxy

   A PE acting as an ARP proxy SHOULD only respond to ARP requests for remote hosts that have been learned via BGP from other PEs. That is to say, the ARP proxy SHOULD NOT respond to ARP requests for local hosts. Otherwise, if the ARP reply from the ARP proxy were to override the reply from the requested host itself, packets destined for that local host would have to be unnecessarily relayed by the PE.

   When the Virtual Router Redundancy Protocol (VRRP) [RFC2338] is enabled together with ARP proxy, only the VRRP master is delegated to act as the ARP proxy, and it SHOULD return the VRRP virtual MAC address in the ARP reply.
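   The ARP-proxy response policy of this section, including the VRRP case, is illustrated by the Python sketch below. The function signature and the MAC address values are assumptions made purely for illustration.

      def arp_proxy_reply(vrf_host_routes, target_ip, pe_mac,
                          vrrp_enabled=False, vrrp_master=False,
                          vrrp_virtual_mac=None):
          """Return the MAC to place in a proxy ARP reply, or None to stay silent.

          vrf_host_routes maps host IP addresses to next hops; "Local" marks
          locally attached hosts, anything else is a remote PE learned via BGP.
          """
          next_hop = vrf_host_routes.get(target_ip)
          if next_hop is None or next_hop == "Local":
              return None              # never answer for local or unknown hosts
          if vrrp_enabled:
              if not vrrp_master:
                  return None          # only the VRRP master acts as ARP proxy
              return vrrp_virtual_mac  # the master answers with the virtual MAC
          return pe_mac                # otherwise answer with one of the PE's MACs

      routes = {"10.1.1.1": "Local", "10.1.1.2": "PE-2"}
      print(arp_proxy_reply(routes, "10.1.1.2", pe_mac="00:11:22:33:44:01"))
      print(arp_proxy_reply(routes, "10.1.1.2", pe_mac="00:11:22:33:44:01",
                            vrrp_enabled=True, vrrp_master=True,
                            vrrp_virtual_mac="00:00:5e:00:01:01"))
      print(arp_proxy_reply(routes, "10.1.1.1", pe_mac="00:11:22:33:44:01"))  # None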
4.5. DHCP Relay Agent

   To prevent Dynamic Host Configuration Protocol (DHCP) [RFC2131] broadcast messages from flooding the whole data center network, the DHCP Relay Agent function can be enabled on the PEs. In this way, DHCP broadcast messages from DHCP clients (i.e., local CE hosts) are transformed into DHCP unicast messages by the DHCP Relay Agents (i.e., the PEs) and then forwarded to the DHCP servers in unicast.

5. Conclusions

   By using Layer 3 routing in the backbone of the data center network instead of STP bridge forwarding, the traffic between any two servers is forwarded along the shortest path between them. Besides, ECMP can easily be achieved in Layer 3 routed networks. Thus, the total network bandwidth of the data center network is utilized to the maximum extent.

   By reusing BGP/MPLS VPN technology to exchange the host routes of a given VPN among PEs, the servers of that VPN are allowed to communicate with each other just as if they were located on a single subnet. Due to the tunnels used in BGP/MPLS VPNs, the forwarding tables of the P routers only need to hold reachability information for the tunnel endpoints (i.e., the PEs). Meanwhile, the forwarding tables of the PE routers can also be kept scalable by distributing VPNs among different PEs; that is to say, thanks to the Outbound Route Filtering (ORF) capability of BGP, a given PE router only needs to hold the routing tables of those VPNs to which it is attached. Thus, the forwarding table scalability issues of data center networks are largely alleviated.

   By enabling the ARP proxy function on the PEs, ARP broadcast messages from local CE hosts are terminated on the attached PEs. Thus, ARP broadcast messages are not flooded throughout the whole data center network.

   Besides, by enabling the DHCP Relay Agent function on the PEs, DHCP broadcast messages from DHCP clients (i.e., local CE hosts) are transformed into unicast messages by the DHCP Relay Agents and then forwarded to the DHCP servers in unicast. Thus, broadcast storms in the data center network are largely suppressed.

6. Limitations

   Since the data center network architecture described in this document partially reuses BGP/MPLS VPN technology to construct a large-scale IP subnet rather than a real LAN, non-IP traffic cannot be supported in this architecture. However, we believe IP is the dominant communication protocol in today's data center networks, and the remaining non-IP legacy applications will disappear from data center networks over time.

7. Future work

   IPv6-based data center networks will be considered as part of future work.

8. Security Considerations

   TBD.

9. IANA Considerations

   This document makes no request of IANA.

10. Acknowledgements

   Thanks to Dino Farinacci for his valuable comments.

11. References

11.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

11.2. Informative References

   [RFC4364]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, February 2006.

   [MVPN]     Rosen, E. and R. Aggarwal, "Multicast in MPLS/BGP IP VPNs", draft-ietf-l3vpn-2547bis-mcast-10.txt (work in progress), January 2010.

   [MVPN-BGP] Aggarwal, R., Rosen, E., Morin, T., Rekhter, Y., and C. Kodeboniya, "BGP Encodings for Multicast in MPLS/BGP IP VPNs", draft-ietf-l3vpn-2547bis-mcast-bgp-08.txt (work in progress), September 2009.

   [RFC925]   Postel, J., "Multi-LAN Address Resolution", RFC 925, October 1984.

   [RFC1027]  Carl-Mitchell, S. and J. Quarterman, "Using ARP to Implement Transparent Subnet Gateways", RFC 1027, October 1987.

   [RFC2338]  Knight, S., et al., "Virtual Router Redundancy Protocol", RFC 2338, April 1998.

   [RFC2131]  Droms, R., "Dynamic Host Configuration Protocol", RFC 2131, March 1997.

   [RFC2236]  Fenner, W., "Internet Group Management Protocol, Version 2", RFC 2236, November 1997.

Authors' Addresses

   Xiaohu Xu
   Huawei Technologies
   No.3 Xinxi Rd., Shang-Di Information Industry Base,
   Hai-Dian District, Beijing 100085, P.R. China

   Phone: +86 10 82836073
   Email: xuxh@huawei.com