Working Group: ARMD                                        Himanshu Shah
Intended Status: Proposed Standard                            Ciena Corp
Internet Draft
Expiration Date: May, 2011                              October 18, 2010


                      ARP Reduction in Data Center

                  draft-shah-armd-arp-reduction-00.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on May 18, 2011

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors. All rights reserved.

Shah, et al.                   Expires May 2011                        1
Internet Draft          draft-shah-arp-reduction-00.txt

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Abstract

   With the advent of virtual machine (VM) technologies, a single
   physical machine can host multiple VMs. Data center applications
   leverage this capability to instantiate tens to hundreds of VMs in a
   server.
   Each VM operates as an independent IP host with its own MAC address,
   associated with a virtual Network Interface Card (vNIC) that maps to
   a single physical Ethernet interface. The physical servers are
   typically stacked in a rack, with their Ethernet interfaces
   connected to a Top-of-Rack (ToR) switch. The ToR switches are
   interconnected through an End-of-Row (EoR) switch, which in turn is
   connected to the core switches.

   As discussed in [ARP-Problem], the VM hosts use ARP broadcasts to
   find other VM hosts and use periodic (broadcast) gratuitous ARPs to
   refresh their IP to MAC address binding in other VM hosts. In a
   large data center with potentially thousands of VM hosts in a
   layer 2 based topology, such broadcasts waste network bandwidth and
   host CPU resources.

   This document describes a solution whereby a ToR switch assumes the
   handling of the ARP broadcasts based on the ARP table that it
   maintains by gleaning information from passing ARP PDUs. Gratuitous
   ARP PDUs that carry no new information are dropped, and broadcast
   ARP requests from hosts are answered by the switch from the learned
   ARP information instead of being forwarded.

Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC 2119].

Table of Contents

   Copyright Notice
   Abstract
   1.0 Contributing Authors
   2.0 Overview
      2.1 Terminology
   3.0 Topology
   4.0 Configuration
   5.0 Building the ARP tables
      5.1 ARP Requests
      5.2 ARP Response
      5.3 Gratuitous ARP
      5.4 Host movement
      5.5 IPv6 Hosts
      5.6 External ARP servers
   6.0 Conclusion
   7.0 Security Considerations
   8.0 References
      8.1 Normative References
      8.2 Informative References
   9.0 Author's Address

1.0 Contributing Authors

   This document is the combined effort of the following individuals
   and many others who have carefully reviewed this document and
   provided technical clarifications.

   Linda Dunbar         Huawei
   Sue Hares            Huawei
   T Sridhar            Force10 Networks

2.0 Overview

   The following factors exacerbate the effect of ARP broadcasts in
   data centers.

   . Ever-increasing dependence on applications that run in the data
     center

   . Large number of physical server hosts in the server farms

   . Use of a large number of VMs in each physical server host

   . Each VM has its own IP and MAC address, and all VMs reside in the
     same subnet so that they can move between physical hosts based on
     fair resource distribution policies. This requires the data center
     network to be layer 2 based.

   . Each VM resorts to frequent ARP broadcasts, either as requests to
     find target VMs or as gratuitous ARPs to refresh its IP to MAC
     address binding in the peers with which it communicates most
     often.
     This also stems from the fact that VMs hold a relatively small ARP
     table and age entries out aggressively in order to keep only the
     most active peers in the table.

   Broadcasts in layer 2 networks have a far-reaching impact: they
   waste network bandwidth as well as the CPU resources that every VM
   spends processing superfluous ARP broadcasts.

   The ills of ARP broadcasts in the data center network can be
   minimized in a relatively simple fashion. The solution requires the
   first hop Ethernet switches, typically ToRs, to maintain an ARP
   table learned from passing ARP PDUs and to selectively propagate ARP
   PDUs and/or proxy on behalf of the remote peer. These ARP processing
   principles are well known and are used and described in L2VPN
   Working Group documents such as [ARP-Mediation] and [IPLS].

   The following sections describe the inner workings of ARP snooping,
   learning and maintaining ARP tables, and using the learned
   information to limit broadcast propagation and to proxy (the
   response) on behalf of remote peers.

2.1 Terminology

   ToR
      Top-of-Rack. An Ethernet switch at the top of a rack that
      provides network connectivity to the servers in the rack.

   Downlink
      In this document, a local host facing (servers in the rack)
      Ethernet connection on the ToR switch.

   Uplink
      In this document, a network facing Ethernet connection on the ToR
      switch. Typically, the uplinks from ToRs connect to End-of-Row
      switches.

   EoR
      End-of-Row Ethernet switch. This is more of an aggregation
      switch. Uplinks from ToRs connect to the EoR, and uplinks from
      the EoR connect to the core switch.

   Host/Server
      The term host or server is used in this document to refer to an
      IP host or server. An IP host could be a physical entity or a
      logical entity (a Virtual Machine) in a physical host. The term
      server refers to its application role in the data center.
      Both terms are used interchangeably or together and mean an IP
      end station.

   Local hosts
      In the context of a ToR switch, the (VM) hosts connected to the
      ToR on a downlink, i.e., directly connected hosts.

   Remote hosts
      In the context of a ToR, the hosts that are reachable through an
      uplink of the ToR.

   VM
      Virtual Machine. A logical instance of a host that operates
      independently within a physical host and has its own IP and MAC
      address. The VM architecture allows efficient use of physical
      host resources in data center applications.

3.0 Topology

   The example topology of a data center network referred to in this
   document is a hierarchy of low to high density Ethernet switches
   that provides a flat (common broadcast domain) layer 2 based network
   for the servers in the data center. Each server host so connected is
   on the same subnet and communicates directly using IP without having
   to go through a router or default gateway. In other words, an IPv4
   host (VM or otherwise) on this network can find another IPv4 host's
   MAC address using ARP.

4.0 Configuration

   It is assumed that the ARP reduction methodologies defined in this
   document are limited to ToR switches. We believe that the maximum
   benefit of restraining ARP broadcasts in the network can be achieved
   by the first hop switches (those directly connected to hosts)
   without placing additional burden on second or third tier switches.

   The ToR switches need to be configured with this feature enabled.
   Each Ethernet interface needs to be identified as either a downlink
   or an uplink within the context of this feature. The ARP reduction
   feature treats ARP frames received from a downlink differently from
   those received from an uplink, as described in the following
   sections.

   The operator can configure various ARP reduction related parameters,
   such as:

   . ARP aging timer

   . Size of the ARP table

   . Static entries of IP to MAC address

   There are situations where low cost ToR switches do not have the
   capacity needed to perform the ARP reduction functions. Under those
   circumstances, the external ARP server approach (described in
   Section 5.6) can be considered.

5.0 Building the ARP tables

   When the feature is enabled, the ToR switch starts monitoring data
   frames for ARP PDUs. It is recommended that ARP PDUs be handled in
   the following manner.

   . All ARP request PDUs should be redirected to the control plane CPU

   . All gratuitous ARP PDUs should be redirected to the control plane
     CPU

   . All ARP response PDUs should be bi-casted: one copy is sent to the
     control plane CPU and the other copy is forwarded normally.

   The ARP table can become large. Scale dictates that the table be
   maintained in control plane memory rather than in the hardware
   tables of the forwarding plane. In either case, when ARP table space
   is contested, it is prudent to prefer entries for local hosts over
   entries for remote hosts when placing IP to MAC address
   associations.

5.1 ARP Requests

   ARP requests are broadcast frames. The ToR gleans the IP and MAC
   addresses from the ARP PDU. The source IP and MAC address
   association is learned, or updated/refreshed if already learned. The
   destination IP address is then looked up in the ARP table. If an
   entry exists, the associated MAC address from the table is used to
   prepare a unicast ARP reply PDU. The same MAC address is also used
   as the source MAC address in the MAC header that is prepended to the
   unicast ARP reply PDU. If the destination IP address in the request
   is not present in the table, the original ARP request PDU is
   broadcast to all switch ports except the port on which the request
   was received.
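   The request handling described in this section can be sketched in a
   few lines of Python. This is a minimal illustration under stated
   assumptions, not a definitive implementation: the names ArpTable,
   ArpEntry, Action and handle_request are hypothetical, aging and
   table size limits are omitted, and the sketch assumes the proxied
   reply leaves on the port where the request arrived.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ArpEntry:
    mac: str   # MAC address bound to the IP address
    port: str  # switch port on which the binding was learned


@dataclass
class Action:
    kind: str                      # "reply" (proxy answer) or "flood"
    out_ports: List[str]           # ports the resulting frame is sent on
    eth_src: Optional[str] = None  # source MAC of a proxied reply
    eth_dst: Optional[str] = None  # destination MAC of a proxied reply


class ArpTable:
    """Hypothetical control-plane ARP table of a ToR switch."""

    def __init__(self, ports: List[str]) -> None:
        self.ports = list(ports)
        self.entries: Dict[str, ArpEntry] = {}

    def learn(self, ip: str, mac: str, port: str) -> None:
        # Learn a new association, or update/refresh an existing one.
        self.entries[ip] = ArpEntry(mac, port)

    def handle_request(self, in_port: str, sender_ip: str,
                       sender_mac: str, target_ip: str) -> Action:
        # Step 1: glean the source IP/MAC association from the request.
        self.learn(sender_ip, sender_mac, in_port)
        # Step 2: look up the destination (target) IP address.
        hit = self.entries.get(target_ip)
        if hit is None:
            # Miss: broadcast the original request to every port except
            # the one it was received on.
            others = [p for p in self.ports if p != in_port]
            return Action("flood", others)
        # Hit: answer with a unicast reply whose source MAC is the
        # learned MAC of the target; the original request is dropped.
        # Assumption: the reply is sent back out the ingress port.
        return Action("reply", [in_port],
                      eth_src=hit.mac, eth_dst=sender_mac)
```

   For example, on a switch with ports d1, d2 and u1, a first request
   arriving on d1 for an unknown address is flooded to d2 and u1, while
   a later request on d2 for the address learned from d1 is answered by
   the switch itself and goes no further.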
   However, if the requested (destination) IP address is present in the
   ARP table, the unicast ARP reply PDU is prepared as described above
   and sent out the egress port on which the target of the reply
   resides, and the original ARP request PDU is dropped.

   The intent is to prevent the propagation of broadcast ARP request
   PDUs as much as possible using the information present in the ARP
   table. The following observations can be made about this behavior.

   . Most ARP requests from local hosts for local hosts can be
     suppressed most of the time.

   . Most ARP requests from remote hosts for local hosts can be kept
     from being forwarded towards downlinks or other uplinks.

   . Many ARP requests from local hosts for remote hosts can be kept
     from being forwarded towards uplinks, if the remote host's IP to
     MAC association is known.

5.2 ARP Response

   The unicast ARP response is gleaned to learn or update the ARP table
   with the source and destination IP/MAC address associations, and is
   then forwarded as a normal frame.

5.3 Gratuitous ARP

   The gratuitous ARP reply is a broadcast ARP PDU whose destination IP
   address and MAC address are those of the sender. It is typically
   used by (VM) IP hosts to keep their association fresh in their
   peers' ARP caches. The ToR switch should process a gratuitous ARP in
   the following manner.

   . Learn, update, or refresh the ARP table entry.

   . If the ARP entry is new, or existed with different information,
     the gratuitous ARP PDU is forwarded; otherwise the PDU is dropped.

   The important goal in handling gratuitous ARP PDUs from the
   downlinks (i.e., from local hosts) is not to propagate them into the
   'network' (i.e., to uplinks) if the information is not new.

5.4 Host movement

   The VM architecture allows VMs to be moved to different physical
   servers based on optimum resource utilization policies. The act of
   movement is called vMotion, and this flexibility adds to the
   attraction of using VMs.
   A vMotion could be manual (operator initiated) or automatic, in
   reaction to demands placed by the application users. The important
   point is that in either case, vMotion is not transparent and is made
   known to the network.

   There is ongoing work in IEEE 802.1Qbg to coordinate/communicate the
   presence and capabilities of the VMs to the directly connected
   network switch. It is expected that the ToR would leverage the
   knowledge obtained about a newborn VM to update its local ARP table
   as well as to notify the network (other switches) via an unsolicited
   gratuitous ARP on behalf of the VM. The details of such procedures
   will be described in subsequent revisions of this document.

5.5 IPv6 Hosts

   IPv6 hosts use Neighbor Discovery procedures that are different from
   the ARP methodologies used by IPv4 hosts. The details of handling
   Neighbor Discovery procedures will be described in subsequent
   updates to this document.

5.6 External ARP servers

   In some configurations, the ToR switches may not be capable of
   handling the ARP reduction procedures. For such configurations, it
   is possible to outsource the ARP reduction procedures to one or more
   external ARP server hosts. The ToR switches are then configured as
   follows.

   . The interface(s) connected to the ARP servers are identified. Such
     interface(s) must be separate from the downlinks and uplinks that
     are connected to 'host reachable' (or native) networks as
     described above. This concept is similar to how switches treat the
     'management' network separately from the user data network.

   . All broadcast ARP PDUs are forwarded to the interface(s) where the
     ARP servers reside.

   . All unicast ARP PDUs received from 'native' interfaces are
     bi-casted: one copy of the PDU is forwarded to the ARP server and
     the other is forwarded normally.

   . All ARP PDUs received from the ARP server interface(s) are PDUs
     generated by the ARP servers as a result of the ARP reduction
     procedures. They are handled in the following manner.

     o The source MAC address is not learned.

     o Instead, the source MAC address is used to determine the 'real'
       native ingress interface. That is, the switch treats the packet
       as if it had been received from the interface where the source
       appears to reside, and makes the forwarding decision based on
       the destination MAC address and the newly determined ingress
       port.

   . If multiple ARP server interfaces exist (in order to avoid a
     single point of failure), an ARP PDU received from one ARP server
     interface is never forwarded out to the other ARP server
     interface(s) (i.e., a split horizon rule).

6.0 Conclusion

   Based on the procedures described in this document, it is possible
   for ToR switches in the data center to dampen ARP broadcasts
   significantly. The solution is not new: it is based on well-known
   procedures, is non-intrusive, and is low-hanging fruit for
   curtailing the broadcasts that are increasingly becoming a problem
   in data centers. In essence, the ToR switch offloads extended ARP
   table management from the IP hosts onto itself. The ARP table
   timeout can be tuned higher by the operator based on the available
   switch resources and network traffic behavior. A larger ARP table
   capacity translates directly into more effective suppression of ARP
   broadcasts. An additional approach offloads ARP table and PDU
   management to dedicated server(s), either for low-end ToRs with
   reduced capacity or as a cost-effective solution.

7.0 Security Considerations

   The details of the security aspects will be addressed in a future
   revision.

8.0 References

8.1 Normative References

   [ARP]           D. Plummer, "An Ethernet Address Resolution
                   Protocol: Or Converting Network Protocol Addresses
                   to 48.bit Ethernet Addresses for Transmission on
                   Ethernet Hardware", STD 37, RFC 826.

   [ARP-Problem]   L. Dunbar et al., "Scalable Address Resolution for
                   Large Data Center Problem Statements",
                   draft-dunbar-arp-for-large-dc-problem-statement-00.txt.

8.2 Informative References

   [ARP-Mediation] H. Shah et al., "ARP Mediation for IP interworking
                   in Layer 2 VPN",
                   draft-ietf-l2vpn-arp-mediation-13.txt.

   [IPLS]          H. Shah et al., "IP-only LAN service",
                   draft-ietf-l2vpn-ipls-09.txt.

   [PROXY-ARP]     J. Postel, "Multi-LAN Address Resolution", RFC 925.

9.0 Author's Address

   Himanshu Shah
   Ciena Corp
   Email: hshah@ciena.com