Network Working Group P. Marques Internet-Draft L. Fang Intended status: Standards Track Cisco Systems Expires: September 12, 2012 P. Pan Infinera Corp A. Shukla Juniper Networks M. Napierala AT&T Labs March 2012 BGP-signaled end-system IP/VPNs. draft-marques-l3vpn-end-system-04 Abstract This document describes how the control plane specified by BGP/MPLS IP VPNs [RFC4364] can be used to provide a network virtualization solution for end-systems that meets the requirements of large scale data-centers. It specifies how the control and forwarding functions of a Provide Edge (PE) device described in [RFC4364] can be separated such that the forwarding function can be implemented in end-systems themselves. The solution is applicable to any encapsulation that can deliver packets across an IP network as a tunneled IP datagram plus a 20-bit label. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on September 12, 2012. Copyright Notice Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved. Marques, Fang, Pan, Shukla & Napiestda [Page 1] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/ license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1. Terminoloy . . . . . . . . . . . . . . . . . . . . . . . . 2 2. Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 3 3. Applicability of BGP IP VPNs . . . . . . . . . . . . . . . . . 5 4. Virtual network end-points . . . . . . . . . . . . . . . . . . 7 5. VPN forwarder . . . . . . . . . . . . . . . . . . . . . . . . 8 6. XMPP signaling protocol . . . . . . . . . . . . . . . . . . . 10 7. Signaling gateway behavior . . . . . . . . . . . . . . . . . . 14 8. Operational Model . . . . . . . . . . . . . . . . . . . . . . 15 9. Security Considerations . . . . . . . . . . . . . . . . . . . 17 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 18 11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18 11.1. Normative References . . . . . . . . . . . . . . . . . . 18 11.2. Informational References . . . . . . . . . . . . . . . . 19 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 19 1. Introduction This document describes the requirements for a network virtualization solution that satisfies the needs of large scale data-centers. It then discusses how the BGP IP VPNs [RFC4364] control plane can be used to provide a solution that meets these requirements. Subsequent sections provide a detailed discussion of the control and forwarding plane components. 1.1. Terminoloy This document makes use of the following terms: Data-center A logical set of compute, storage and network resources that may spread over multiple physical facilities. Its geographic scope is limited by a communication latency requirement. Typically, it represents an availability zone for applications. It may be the case that the same physical facilities support multiple logical data-centers. Signaling Gateway A software application that implements the control plane functionality of a BGP IP VPN PE device and a XMPP server that interacts with VPN forwarders. Virtual Interface An interface in an end-system that is used by a virtual machine or by applications. It performs the role of a CE interface in a BGP IP VPN network. Marques, Fang, Pan, Shukla & Napiestda [Page 2] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 VPN Forwarder The forwarding component of a BGP IP VPN PE device. 2. Requirements The main drivers for network virtualization in data-center solutions are network based access-control, multi-tenancy support and VM mobility. Data-centers often have the need to support one or more of these network functions which have often in the past been implemented using VLAN [IEEE.802-1Q] technologies. The prevalence of L2 based implementations in small scale deployments has created the perception that the network functionality itself requires L2 transparency and is best served by technologies that attempt to drive up the scope and scale of L2 broadcast domains. This document takes the view that the desired functionality requires a virtualization technology that provides an IP service, with layer 2 topology being irrelevant as long as the service goals are met. IP-based access control was the first of these functions to be commonly deployed in data-centers. Its goal is to provide a "closed user-group" among a set of end-points, often compute resources dedicated to one application, such that communication within that group occurs unfettered. This was accomplished by placing all these end-points on the same IP subnet. At the same time IP-based access control allows the use of traffic filtering polices to control/ restrict communication between the members of the group and other groups (inter-subnet communication). These "closed user-groups" are often used by IT organizations to both segregate applications as well a separate production, quality-assurance and development environments which often must be contained in different communication domains. The term "closed user-group" does not imply that there is no communication with external groups. Only that the membership in the group is administratively defined. Each "closed user-group" is a VPN in the terminology used by BGP IP VPNs. Traditionally this functionality has been implemented in data-center designs by using a VLAN between access and aggregation switches and controlling the VLAN-to-VLAN access control policies at the aggregation switches. This solution works well in an environment where the I/O bandwidth between compute resources is considered to be the resource to optimize for. In this scenario resources associated to a given application are dedicated to it and kept in physical proximity. Solutions that optimize for the maximum utilization of fungible compute and storage resources need a different approach. They require that the compute resources associated with an application (and thus a "closed user-group"/VPN) may be located anywhere in the data-center. And given that this physical spreading of resources already implies a very significant increase in data-center core bandwidth requirements, they must also make sure that packets only traverse the switching infrastructure once. Marques, Fang, Pan, Shukla & Napiestda [Page 3] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 If members of a "closed-user group" may be be present anywhere in the data-center, that implies a different operational environment for a VLAN based implementation. Either it would have to operate under the assumption that all switches belong to all VLANs or to be able to dynamically adjust the topology of each VLAN. The later implies that the dimension and scope of each broadcast domain is not know a-priori. In order to minimize core bandwidth and traffic latency the inter- subnet traffic exchange policies should be pushed as far as possible to the edge. The end-systems that source and receive the traffic are able to implement the virtualization and policy enforcement functionality while using the optimal path for traffic. With VLAN [IEEE.802-1Q] an ethernet end-point can only be a member of a single "closed user group". Ideally a single application end-point should be able to be a member of multiple "closed user groups". That is possible with L3 based technologies such as BGP IP VPNs. Multi-tenant support is an network function that is often combined with network based access control. Each tenant should be able to define multiple "closed user-groups", for instance on a per- application basis, and while "closed user-groups" from multiple tenants are often not allowed to interact directly, tenants are typically allowed to use common infrastructure services (e.g. storage, database services, application-services, etc). In both of these scenarios, the requirement is to control the IP traffic crossing multiple subnets (where an IP subnet is a "closed user group") such that it conforms to the defined traffic policies. The third requirement to be considered is the need to support "VM life-migration". This functionality requires a virtual IP topology by itself even in data-centers where network based access control or multi-tenancy are not a concern. Life-migration requires virtual machines to be moved from a physical server A to a physical server B such that the total migration time does not disrupt its communication sessions. This implies that transport sessions must answer keepalives before the user-specified timeout (often a few seconds). In some cases, the operation may also be constrained by the application latency requirements which are typically in the sub- second range. While the main bottleneck in this process is the process of saving and restoring the VM's modified memory pages, the connectivity restoration time is critical. Support for minimizing connectivity interruption and minimum latency impact would benefit from the ability for the same IP end-point to be received at the new physical location of the VM (B) as well as the ability to tunnel traffic from the old location (A) during the convergence period. It also requires a control plane that can minimize convergence time and that can decouple the instantiation of the end-point (i.e. the VM IP address becoming active) from its Marques, Fang, Pan, Shukla & Napiestda [Page 4] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 advertisement to the network as the preferred route to that IP address. It is important to note that, as with the previous network functions, we need to consider IP transport sessions with system both in the same "closed user group" (or subnet) as the VM as well as in different "closed user groups" (i.e. subnets). Any solution to this problem must be able to minimize connectivity restoration time across different IP subnets. 3. Applicability of BGP IP VPNs BGP IP VPNs [RFC4364] is the industry de-facto standard for providing "closed user group" functionality in WAN environments. It is used by service providers in environments where several millions of routes are present. It supports both isolated VPNs as well as overlapping VPNs (often referred to as "extranets"). In its traditional usage in Service Provider networks, BGP IP VPN functionality is implemented in a Provider Edge (PE) device that combines both BGP signaling as well a VRF-based forwarding functions. In practice, most PE devices in current use are multi-component systems with the signaling and forwarding functionality actually implemented in different processors attached to an internal network. This document assumes a similar separation of functionality in which signaling devices implement the control plane functionality of a PE device and a VPN forwarder (in the hypervisor/host OS or first-hop switch) implements the forwarding function usually found in a PE device "line-card". Operationally, BGP IP VPN technology has several important characteristics: It has a high-level of aggregation between customer interfaces and managed entities (Provider Edge devices). It defines VPNs as policies, allowing an interface to be a member of multiple VPNs and allowing for the topology of the virtual network to be modified by modifying the policy configuration. It scales horizontally in terms of event propagation. By increasing the number of signaling devices, implementing the PE control plane, it is possible to decrease the load on each signaling device when it comes to propagating events that originate in a specific location and must be propagated across the network. The last point is particularly relevant to the convergence characteristics required for large scale deployments. BGP's hierarchical route distribution capabilities allow a deployment to divide the workload by increasing the number of BGP signaling gateways. Marques, Fang, Pan, Shukla & Napiestda [Page 5] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 As an example consider a topology in which 100 BGP signaling gateways are deployed in a data-center each serving a subset of the VPN forwarding elements. The signaling gateways inter-connect to two top-level BGP Route Reflectors [RFC4456]. If an event (i.e. a VPN route change) needs to be propagated from a specific machine that activates a VM to 10.000 clients randomly distributed across a data-center, each of the BGP signaling gateways must generate 100 updates to its respective downstream clients. By modifying this topology such that another 100 signaling gateways are added, then each signaling gateway is now responsible to generate 50 client updates. This example illustrates the linear scaling properties of BGP: doubling the number of signaling gateways (i.e. the processing capacity) reduces in half the number of updates generated by each (i.e. load at each processing node). The same horizontal scaling techniques can be applied to the Route Reflector layer in the example above by subsetting the VPN Route space according to some pre-defined criteria (for instance VPN route target) and using a pair of Route Reflectors per subset. In the example above we assumed a dense membership in which all signaling gateways have local clients that are interested in a particular event. BGP also optimizes the route distribution for sparse events. The Route Target Constraint [RFC4684] extension, builds an optimal distribution tree for message propagation based on VPN membership. It ensures that only the Signaling Gateways with local receivers for a particular event do receive it also decreasing the total load on the upstream BGP speaker. In the WAN environment, BGP IP VPN control plane scaling is focused not primarily on route convergence times but on memory footprint of embedded devices. While memory footprint does not have a similar linear scaling behavior, memory technology in the data-center is often at 10x the scale of what is commonly found in WAN environments. The functionality present in the BGP IP VPN control plane addresses the requirements specified in the previous section. Specifically, it supports multiple potentially overlapping "groups", regular or "hub and spoke" topologies and the scaling characteristics necessary. The BGP IP VPN control plane supports not only the definition of "closed user-groups" (VPNs in its terminology) but also the propagation of inter-VPN traffic policies [RFC5575]. An application of that mechanism to "end-system" VPN is presented in [I-D.marques- sdnp-flow-spec]. Note that the signaling protocol itself is rather agnostic of the encapsulation used on the wire as long as this encapsulation has the ability to carry a 20 bit label. Marques, Fang, Pan, Shukla & Napiestda [Page 6] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 Several data-center deployments use a switching infrastructure that is only capable of providing an IP unicast service. In order to support them, implementations of this document should support the MPLS in GRE [RFC4023] encapsulation. Other encapsulations are possible, including UDP based encapsulations. 4. Virtual network end-points This document assumes that end-systems support one or more virtual network interfaces in addition to a physical interface that is associated with the data-center switching infrastructure. Virtual network interfaces can be associated with a VM or they can be used to provide network connectivity directly to applications in the same way that a "VPN tunnel" interface is used to provide access between an end-system (e.g. a laptop) and a remote corporate network. From an IP address assignment point of view, a virtual network interface is addressed out of the virtual IP topology and associated with a "closed user group" or VPN, while the physical interface of the machine is addressed in the network infrastructure topology. As a security measure, it is recommended that virtual and infrastructure topologies never be allowed to exchange traffic directly. In data-center environments, static IP address allocation is common since it is desirable to associate a permanent IP address to a VM and have that IP address remain constant even as the VM migrates. However dynamic address assignment through DHCP is also possible assuming that the VPN forwarder implements DHCP relay functionality. A virtual network interface is connected to a VPN forwarder. This VPN forwarder MAY be located on the hypervisor or host OS that co- resides on the same physical machine or it could be located in an external system, such as the first-hop switch. We refer to this second system as the external VPN forwarder. All traffic that ingresses or egresses through this virtual network interface is routed at the VPN forwarder which acts as the first-hop router (in the virtual topology). The IP configuration on the client side of this virtual network interface (e.g. in the guest OS) can follow one of two models: point-to-point interface model. multipoint interface model. In a point-to-point interface model, the system routing table (e.g. the guest OS) contains the following routing entires: a host route to the local IP address, a host route to the first-hop router via the virtual interface and a default route to the first-hop router. This is the model typically used in "VPN tunnel" configurations or other access technologies such as cable deployments or DSL. When this model is used, the first-hop router IP address is a link-local address that is the same on all first-hop routers across a specific deployment. This first-hop IP address should not change when a VM migrates between different machines. Marques, Fang, Pan, Shukla & Napiestda [Page 7] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 In a multi-point interface model, the system routing table (on the guest OS) contains the following routing entires: a host route to the local IP address, a subnet route to the local interface and optionally a default route to an specific router address within that subnet. In this model, the end-system will issue address resolution requests for any IP addresses it considers to be directly attached to the subnet. The VPN forwarder shall answer all address resolution requests with a virtual MAC address which SHOULD be the same across all VPN forwarders in a specific deployment. This virtual MAC address SHALL default to the VRRP [RFC5798] virtual router MAC address for VRID 1.. When the virtual topology first-hop router resides on the same physical machine, the host OS is responsible to map the virtual interface with a VPN specific routing table (without taking L2 addresses into consideration). In this case the mac-addresses known to the guest OS are not used on the wire. When the virtual topology first-hop router resides in an external system (e.g. the first hop-switch) the virtual interface shall be identified by the combination of the physical interface mac-address and a 802.1Q VLAN tag. The first-hop switch should use a virtual router MAC address to answer any address resolution queries. Whenever an external VPN forwarder is used and resiliency is desired, the external VPN forwarder should be redundant. It is desirable to use VRRP as a mechanism to control the flow of traffic between the end-system and the external VPN forwarder. VRRP already defines the necessary procedures to elect a single forwarder for a LAN. This specification uses the VRRP virtual router MAC address as the default L2 address for the VPN forwarder as a VM may move between locations where redundancy is and is not present. While the VRRP Virtual Router MAC will be used resolve any address resolution request made by the virtual interface client (e.g. the guest VM) this does not imply that a single default router is elected per virtual IP subnet. The ingress VPN forwarder will perform an IP forwarding decision based on the destination IP address of the (payload) traffic. VRRP router election is only relevant in selecting the VPN forwarder associated with a specific machine, when external forwarders are in use. 5. VPN forwarder In this solution, the Host OS/Hypervisor in the end-system must participate in the virtual network service. Given an end-system with multiple virtual interfaces, these virtual interfaces must be mapped onto the network by the guest OS such that applications on one virtual interface are not allowed to impersonate another virtual Marques, Fang, Pan, Shukla & Napiestda [Page 8] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 interface. When VPN forwarder functionality is implemented by the Host OS/ Hypervisor, intermediate systems in the network, be they access, aggregation or core switches, do not require any knowledge of the virtual network topology. This can simplify the design an operation of the data-center switching infrastructure. When it is not possible or desirable to add the VPN forwarding functionality to the end-system, it may be implemented by an external system, typically located as close as possible to the end-system itself. This could be a top-of-rack or aggregation switch. Both models, end-system and extern VPN forwarder can co-exist in a deployment. In order to implement the BGP IP VPN forwarder functionality a device MUST implement the following functionality: Support for multiple "Virtual Routing and Forwarding" (VRF) tables; VRF route entries map prefixes in the virtual network topology to a next-hop containing a infrastructure IP address and a 20-bit label allocated by the destination forwarder. The VRF table lookup follows the standard IP lookup (best-match) algorithm. Associate a end-system virtual interface with a specific VRF table; When the first-hop router is the Host OS/hypervisor this association is performed by an internal mechanism. When the first-hop router is external this association is performed using the mac-address of the end-system and a IEEE 802.1Q tag that identifies the virtual interface within the end-system. Encapsulate outgoing traffic (end-system to network) according to the result of the VRF lookup; Associate incoming packets (network to end-system) to a VRF according to the 20-bit label contained immediately after the GRE header; The VPN forwarder MAY support the ability to associate multiple virtual interfaces with the same VRF. When that is the case, locally originated routes, that is IP routes to the local virtual interfaces SHALL NOT be used to forward outbound traffic (from the virtual interfaces to the outside) unless a route advertisement has been received that matches that specific IP prefix and next-hop Marques, Fang, Pan, Shukla & Napiestda [Page 9] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 information. As an example, if a given VRF contains two virtual interfaces, "veth0" and "veth1", with the addresses 10.0.1.1/32 and 10.0.1.2/32 respectively, the initial forwarding state must be initialized such that traffic from either of these interfaces does not match the other's routing table entry. It may for instance match a default route advertised by a remote system. Traffic received from other VPN forwarders, however, must be delivered to the correct local interface. If at a subsequent stage a route is received from the signaling gateway such that 10.0.1.2/32 has a next-hop with the IP address of the local host and the correct label, the system may subsequently install a local routing table entry that delivers traffic directly to the "veth1" interface. The 20-bit label which is associated with a virtual-interface is of local significance only and SHOULD be allocated by the VPN forwarder. When an external VPN forwarder is used the end-system MUST associate each virtual interface with a VLAN [IEEE.802-1Q] that is unique on the end-system. The switching infrastructure MUST be configured such that multi-destination frames sourced from an end-system are only delivered to VPN forwarders used by this end-system and not to other end-systems. 6. XMPP signaling protocol Signaling Gateways must be made aware of virtual interface creation and deletion and of when IP addresses are added or removed from these virtual interfaces. VPN forwarders must receive VPN route information from which to populate their forwarding tables. When the VPN forwarder is a different system than the end-system where the virtual interface resides, it also needs to receive the interface and IP address events from the end-system. In this case it is the VPN forwarder that propagates these to the Signaling Gateway. When an external VPN forwarder is used, the end-system assigns the VLAN identifier used for each virtual interface. This information is conveyed by an optional parameter in the VPN subscription request. It is not propagated by the VPN forwarder. In order to exchange this information this specification uses the XMPP [RFC6120] protocol along with the PubSub Collection Nodes [pubsub] extension. Clients (end-systems and external VPN forwarders) establish persistent XMPP sessions. These sessions MUST use the XMPP Ping [xmpp-ping] extension in order to detect end-system failures. A client MAY connect to multiple servers (e.g. VPN-signaling gateways) for reliability. In this case it SHOULD publish its information to each of the gateways. It MAY choose to subscribe to VPN routing information once only from one of the available gateways. Marques, Fang, Pan, Shukla & Napiestda [Page 10] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 The information advertised by a client SHOULD be deleted after a configurable timeout, when the session closes. This timeout should default to 60 seconds. +---------+ +--------+ | gateway | ----------- | BGP | +---------+ +--------+ // \ / XMPP \ / // \ / +------------+ \ / | end-system | \ / +------------+ \/ \\ /\ XMPP / \ \\ / \ +---------+ / \ +--------+ | gateway | ----------- | BGP | +---------+ +--------+ The figure above represents a typical configuration in which an end- system (implementing the VPN forwarder functionality) is directly connected to two gateways, which are in turn connected to multiple BGP spakers which may be other BGP signaling gateways or BGP route reflectors. In deployment the number of gateways used will depend on the desired gateway to VPN forwarder ratio which affects the convergence time of event propagation. The XMPP "jid" used by the client shall be a 6-byte value that uniquely identifies it in the domain. This specification recommends the use of the MAC address of one of the physical ethernet interfaces. Each VPN shall be identified by a 64 ASCII character string. When external forwarders are used, its control software operates as a XMPP server processing requests from end-systems and as a client of one or more Signaling Gateways. The control software relays to the Signaling Gateways(s) messages it receives from the end-system. VPN routing information received from the Signaling Gateways(s) SHOULD NOT be propagated to the end-system. When a virtual interface is created, for instance as result of a Virtual Machine being instantiated on a end-system, the host operating-system software shall generate an XMPP Subscribe message to its server (the VPN-signaling gateway or external VPN forwarder). Subscription request from end-system to gateway (local VPN forwarder): Marques, Fang, Pan, Shukla & Napiestda [Page 11] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 The request above, instructs the signaling gateway to start populating the client's VRF table with any routing information that is available for this VPN. The XMPP node 'vpn-customer-name' is assumed to be a collection which is implicitly created by the VPN- signaling gateway. Creation of a virtual interface may precede any IP address becoming active on the interface, as it is the case with VM life migration. Subscription request from end-system to external VPN forwarder: vlan-id When an external VPN forwarder is used the end-system should include the VLAN identifier it assigned to the virtual interface as a subscription option. When a IP address is added to a virtual interface, the end-system will generate an XMPP Publish request. Publish request from end-system to gateway: Marques, Fang, Pan, Shukla & Napiestda [Page 12] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 to='network-control.domain.org' id='request1'> 'vpn-ip-address/32' 'infrastructure-ip-address' The VPN-signaling gateway will convert the information received in a the 'publish' request into the corresponding BGP route information such that:. It associates the specific request with a local VRF which it resolves by using a combination of the originator system-id and the collection 'node' attribute. It creates a BGP VPN route with a 'Route Distinguisher' (RD) which contains the the end-system's 'system-id' value and the specified IP prefix and 'label' as the Network Layer Reachability Information (NLRI) . The BGP next-hop address is set to the address of egress VPN forwarder. It associates the route with an extended community TDB containing the version number. Update notification from gateway to end-system: Marques, Fang, Pan, Shukla & Napiestda [Page 13] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 'vpn-ip-address>/32' 'infrastructure-ip-address' ... Notifications should be generated whenever a VPN route is added, modified or deleted. Note that the Update from the signaling gateway to the end-point does not contain the system-id of the destination end-point. When multiple possible routes exist for a given VPN IP address, for instance because the VM may be in the process of moving location, it is the responsibility of the gateway to select the best path to advertise to the end-system. When routes are withdrawn, the signaling gateway generates both a "collection disassociate" request as well as a node "delete" request. In situations where an automated system is controlling the instantiation of VMs it may be possible to have that system assign a non-decreasing version number for each instantiation of that particular VM. In that case, this number, carried in the 'version' field may be used to help gateways select the most recent instantiation of a VM during the interval of time where multiple routes are present. 7. Signaling gateway behavior BGP Signaling gateways SHALL support the address families: VPN-IPv4 (1, 128), VPN-IPv6 (2, 128) and RT-Constraint (1, 132) [RFC4684]. When a VPN-signaling gateway receives a request to create or modify a VPN route it SHALL generate a BGP VPN route advertisement with the corresponding information. It is assumed that the VPN-signaling gateways contain information regarding the mapping between end-system the tuple ('system-id', 'vpn-customer-names') and BGP Route Targets used to import and export information from the associated VRFs. This mapping is known via an out-of-band mechanism not specified in this document. Marques, Fang, Pan, Shukla & Napiestda [Page 14] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 Whenever the Signaling Gateway receives an XMPP subscription request, it SHALL consult its RT-Constraint Routing Information Base (RIB). If the Signaling Gateway does not already have locally originated route that corresponds to the route target being carried in the request, it SHALL create one and generate the corresponding BGP route advertisement. This route advertisement should only be withdrawn when there are no more downstream XMPP clients subscribed to the VPN. The 32bit route version number defined in the XML schema is advertised into BGP as a Extended community with type TBD. Signaling gateways SHOULD automatically assign a BGP route distinguisher per VPN routing table. 8. Operational Model In the simplest case, a VPN is a collection of systems that are allowed to exchange traffic with each other and only with each other. Since all the forwarding tables in this VPN have the same routing entires they are often referred to as symmetrical VPNs. In order to better illustrate the operation of the protocol we consider a simple example in which "host 1" and "host 2" both contain a VM that is a member of the same VPN. Each of these hosts has an XMPP session with a Signaling Gateway, SG1 and SG2 our example, and these Signaling Gateways are part of the same BGP mesh. When a virtual interface is created on "host 1", the local XMPP client generates a XMPP subscription message to its respective Signaling Gateway. This message contains a VPN identifier that has been assigned by the VM provisioning system. The Signaling Gateway maps that identifier to a BGP IP VPN configuration which contains the list of import and export route targets to be used for that particular VRF. Once the VM is operational, "host 1" will publish any IP addresses that are configured on the respective virtual interface. This will in turn cause the Signaling Gateway to advertise these to any other BGP speaker on the network which is connected to an attachement point of that VPN. +--------+ +-----------+ +----------+ | host 1 | <===> | signaling | <===> | BGP mesh | +--------+ | gateway | +----------+ +-----------+ +----------------+-------------+-------+-----------+ | VPN IP address | NEXT-HOP | label | Known via | +----------------+-------------+-------+-----------+ | 10.1.1.1/32 | 192.168.1.1 | 10000 | XMPP | | 10.1.1.2/32 | 192.168.2.1 | 20000 | BGP | +----------------+-------------+-------+-----------+ Marques, Fang, Pan, Shukla & Napiestda [Page 15] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 VPN Routing table on signaling gateway The figure above represents the contents of the VRF routing table on Signaling Gateway 1 after the IPv4 address 10.1.1.1 has been added to the virtual interface on host 1. It assumes that there is another attachement point for this VPN with the IPv4 address of 10.1.1.2. Host 1 has an infrastructure IP address of 192.168.1.1 configured on its physical interface while host 2 has IP address 192.168.2.1. The contents of the VRF routing table in the Signaling Gateway are advertised via XMPP Update notifications sent to host 1. This information is the used by the host to populate the forwarding table associated with that VPN. +--------+ +--------+ VM1 -- veth0 --| host 1 |=== [network] ===| host 2 |-- veth0 -- VM2 +--------+ +--------+ IP pkt ===> GRE encap ===> [IP net] ===> GRE decap ===> IP pkt [192.168.2.1, 20] map 20 to veth0 +----------------+--------------+-------+ | VPN IP address | Host address | label | +----------------+--------------+-------+ | 10.1.1.1/32 | localhost | 10000 | | 10.1.1.2/32 | 192.168.2.1 | 20000 | +----------------+--------------+-------+ VRF table on host1 When the VM on host 1 generates packets with a destination IP address of 10.1.1.2 these are routed by the VPN forwarder implemented in the Host OS. The packets are encapsulated with a GRE header that contains a 20-bit label assigned by host 2. When the VM on host 1 sends packets on the virtual interface it is using the Virtual Router MAC address as the destination MAC. When host 2 delivers packets to the remote VM it sets the Virtual Router MAC address as the source MAC address. This MAC address is not present on the GRE encapsulated packet. BGP/XMPP Signaling Gateways are software applications the implement both the BGP IP VPN PE control plane as well as XMPP server functionality. These application are not in the forwarding plane and do not need to be co-located with a network device. Network devices MAY have direct BGP sessions to the Signaling Gateways. For instance, a router or security appliance that supports BGP/MPLS IP VPNs over GRE may use its existing functionality to inter-operate directly with a collection of Virtual Machines. Marques, Fang, Pan, Shukla & Napiestda [Page 16] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 Signaling Gateways implement the VRF import policy and export policy functionality that is associated with PE routers in standard BGP IP/ VPN deployments. VPN forwarders receive forwarding information after policy and route selection is applied. These are unqualified routes in a specific VRF rather than VPN routing information qualified by a Route Distinguisher and with a set of Route Targets. A symmetrical VPN uses a vrf import and vrf export polices that contain a single route target, where the route target used for both import and export is the same. Different VPN topologies can be created by manipulating the vrf import and export configuration including "hub-and-spoke" topologies or overlapping VPNs. An example of a hub-and-spoke VPN configuration is one where all the traffic from VMs must be redirected though a middle-box (on a VM) for inspection. Assuming that the VMs of a particular user are configured to be in the VPN "tenant1". At an initial stage this "tenant1" VPN is symmetrical and uses a single Route Target in both its import and export policies. The middle-box functionality can be incrementally deployed by defining a new VPN, "tenant1-hub", and an associated Route Target. Accompanied with a change in the Signaling Gateway configuration such that VPN "tenant1" only imports routes with the Route Target associated with the hub. The "hub" VPN is assumed to advertise a prefix that covers all the VMs IP addresses. The "hub" VPN imports the VMs routes in order for it to be able to generate the XMPP updates to the "hub" end-system. This information is required for the return traffic from the hub to the spokes (the standard VMs). In such a scenario a single interface can connect the middle-box to the VMs in a given VPN which appear logically as downstream from it. Such a middle-box would often require connectivity to multiple VPNs, such as for instance an "outside" VPN which provides external connectivity to one or more "inside" VPNs. The functionality defined in this document in which the BGP IP VPN PE functionality is split into its control (Signaling Gateway) and forwarding (VPN forwarder) components is fully interoperable with existing BGP IP VPN PEs. This makes it possible to reuse existing systems. For example, at the edge of a data-center facility it may be desirable to use an existing router or appliance that aggregates IP VPN routing information and/or provides IP based services such as stateful packet inspection. Such a system can be configured, based on existing functionality, to suppress more specific routes than a specified aggregate while advertising the aggregate with a BGP NEXT_HOP containing the PE's IP address and a locally assigned label corresponding to a VRF where the more specific routes are present. 9. Security Considerations Marques, Fang, Pan, Shukla & Napiestda [Page 17] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 The signaling protocol defines the access control policies for each virtual interface and any VM associated with it. It is important to secure the end-system access to signaling gateways and the BGP infrastructure itself. The XMPP session between end-systems and the XMPP gateways MUST use mutual authentication. One possible strategy is to distribute pre- signed certificates to end-systems which are presented as proof of authorization to the signaling gateway. BGP sessions MUST be authenticated using a shared secret. This document recommends that BGP speaking systems filter traffic on port 179 such that only IP addresses which are known to participate in the BGP signaling protocol are allowed. 10. Acknowledgements Yakov Rekhter has contributed to this document by providing detailed feedback and suggestions. The authors would also like to thank Thomas Morin for his comments. 11. References 11.1. Normative References [RFC4023] Worster, T., Rekhter, Y. and E. Rosen, "Encapsulating MPLS in IP or Generic Routing Encapsulation (GRE)", RFC 4023, March 2005. [RFC4271] Rekhter, Y., Li, T. and S. Hares, "A Border Gateway Protocol 4 (BGP-4)", RFC 4271, January 2006. [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, February 2006. [RFC4456] Bates, T., Chen, E. and R. Chandra, "BGP Route Reflection: An Alternative to Full Mesh Internal BGP (IBGP)", RFC 4456, April 2006. [RFC4684] Marques, P., Bonica, R., Fang, L., Martini, L., Raszuk, R., Patel, K. and J. Guichard, "Constrained Route Distribution for Border Gateway Protocol/MultiProtocol Label Switching (BGP/MPLS) Internet Protocol (IP) Virtual Private Networks (VPNs)", RFC 4684, November 2006. [RFC5575] Marques, P., Sheth, N., Raszuk, R., Greene, B., Mauch, J. and D. McPherson, "Dissemination of Flow Specification Rules", RFC 5575, August 2009. [RFC5798] Nadas, S., "Virtual Router Redundancy Protocol (VRRP) Version 3 for IPv4 and IPv6", RFC 5798, March 2010. [RFC6120] Saint-Andre, P., "Extensible Messaging and Presence Protocol (XMPP): Core", RFC 6120, March 2011. Marques, Fang, Pan, Shukla & Napiestda [Page 18] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 [xmpp-ping] "XMPP Ping", XEP 0199, June 2009. [pubsub] "PubSub Collection Nodes", XEP 0248, September 2010. 11.2. Informational References [I-D.marques-sdnp-flow-spec] Marques, P, Fang, L, Pan, P and A Shukla, "Traffic classification, filtering and redirection for end-system IP VPNs.", Internet-Draft draft-marques-sdnp-flow-spec-00, October 2011. [IEEE.802-1Q] Institute of Electrical and Electronics Engineers, "Local and Metropolitan Area Networks: Virtual Bridged Local Area Networks", IEEE Std 802.1Q-2005, May 2006. Authors' Addresses Pedro Marques Email: pedro.r.marques@gmail.com Luyuan Fang Cisco Systems 111 Wood Avenue South Iselin, NJ 08830 Email: lufang@cisco.com Ping Pan Infinera Corp 140 Caspian Ct. Sunnyvale, CA 94089 Email: ppan@infinera.com Amit Shukla Juniper Networks 1194 N. Mathilda Av. Sunnyvale, CA 94089 Email: amit@juniper.net Maria Napierala AT&T Labs 200 Laurel Avenue Middletown, NJ 07748 Email: mnapierala@att.com Marques, Fang, Pan, Shukla & Napiestda [Page 19] Internet-Draft BGP-signaled end-system IP/VPNs March 2012 Marques, Fang, Pan, Shukla & Napiestda [Page 20]