OPSAWG                                                      R. Krishnan
Internet Draft                                                S. Khanna
Intended status: Experimental                    Brocade Communications
Expires: July 2013                                              L. Yong
                                                             Huawei USA
                                                            A. Ghanwani
                                                                   Dell
                                                                Ning So
                                                    Tata Communications
                                                          B. Khasnabish
                                                        ZTE Corporation
                                                       January 12, 2013


     Best Practices for Optimal LAG/ECMP Component Link Utilization
                      in Provider Backbone Networks
          draft-krishnan-opsawg-large-flow-load-balancing-02.txt


Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on July 12, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Abstract

   Demands on networking infrastructure are growing exponentially,
   driven by bandwidth-hungry rich media applications, inter-data-
   center communications, and similar workloads.  In this context, it
   is important to optimally use the bandwidth in service provider
   backbone networks, which extensively use LAG/ECMP techniques for
   bandwidth scaling.  This draft describes the issues faced in service
   provider backbones in the context of LAG/ECMP and recommends some
   best practices for managing the bandwidth efficiently in such
   networks.

Table of Contents

   1. Introduction
      1.1. Conventions
   2. Hash-based Load Distribution in LAG/ECMP
   3. Best Practices for Optimal LAG/ECMP Component Link Utilization
      3.1. Large Flow Recognition
         3.1.1. Flow Identification
         3.1.2. Sampling Techniques - sFlow/PSAMP
         3.1.3. Automatic Hardware Recognition
      3.2. Load Re-Balancing Options
         3.2.1. Alternative Placement of Large Flows
         3.2.2. Redistributing Other Flows
            3.2.2.1. Redistributing All Other Flows
            3.2.2.2. Redistributing the Other Flows on the Congested
                     Link
         3.2.3. Component Link Protection Considerations
         3.2.4. Load Re-Balancing Example
   4. Operational Considerations
   5. Data Model Considerations
   6. IANA Considerations
   7. Security Considerations
   8. Acknowledgements
   9. References
      9.1. Normative References
      9.2. Informative References
   Appendix A. Internet Traffic Analysis and Load Balancing Simulation
1. Introduction

   Service provider backbone networks extensively use LAG/ECMP
   techniques for capacity scaling.  Network traffic can be
   predominantly categorized into two traffic types: long-lived large
   flows and other flows (which include long-lived small flows and
   short-lived small/large flows).  Stateless hash-based techniques
   [ITCOM], [RFC2991], [RFC2992], [RFC6790] are often used to
   distribute both long-lived large flows and other flows over the
   component links in a LAG/ECMP.  However, the traffic may not be
   evenly distributed over the component links because of the traffic
   pattern.

   This draft describes best practices for optimal LAG/ECMP component
   link utilization while using hash-based techniques.  These best
   practices comprise the following steps: recognizing long-lived large
   flows in a router, and assigning the long-lived large flows to
   specific LAG/ECMP component links or redistributing other flows when
   a component link on the router is congested.

1.1. Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   The following acronyms and terms are used in this document:

   COTS: Commercial Off-The-Shelf

   DoS: Denial of Service

   ECMP: Equal Cost Multi-Path

   GRE: Generic Routing Encapsulation

   LAG: Link Aggregation Group

   Large flow(s): long-lived large flow(s)

   MPLS: Multiprotocol Label Switching

   NVGRE: Network Virtualization using Generic Routing Encapsulation

   Other flows: long-lived small flows and short-lived small/large
   flows

   QoS: Quality of Service

   VXLAN: Virtual Extensible LAN

2. Hash-based Load Distribution in LAG/ECMP

   Hashing techniques are often used for flow-based load distribution
   [ITCOM].  A large flow identification space, i.e., a finer
   granularity of flows, results in a more random spreading of the
   flows over a set of component links.  The advantages of hash-based
   load distribution are that the packet sequence within a flow is
   preserved and that the distribution is performed in real time
   without keeping per-flow state.  If the traffic flows are randomly
   spread across the flow identification space, the flow rates are
   small compared to the link capacity, and the rate differences among
   flows are not dramatic, the hashing algorithm generally works very
   well.
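   For illustration only, below is a minimal sketch (in Python) of
   stateless hash-based member selection over the component links of a
   LAG.  It is not part of this draft's recommendations; the hash
   function and flow key shown are assumptions, and real routers use
   vendor-specific hardware hashing.

      # Minimal sketch of stateless hash-based load distribution: a
      # flow's 5-tuple is hashed and the result selects one component
      # link of a 3-member LAG, so all packets of a flow stay in order
      # on the same link.
      import hashlib

      NUM_COMPONENT_LINKS = 3  # LAG with component links (1), (2), (3)

      def select_component_link(src_ip, dst_ip, proto, sport, dport):
          """Map a flow's 5-tuple to a component link index (0-based)."""
          key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
          digest = hashlib.md5(key).digest()
          return int.from_bytes(digest[:4], "big") % NUM_COMPONENT_LINKS

      # Example: every packet of this flow maps to the same link.
      link = select_component_link("10.0.0.1", "10.0.0.2", 6, 12345, 443)
      print(f"Flow hashed to component link {link + 1}")

   Because the selection depends only on the flow key, packet order is
   preserved within each flow, but the per-link load depends entirely
   on which flows happen to hash together.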
   However, if one or more of the conditions above are not met, hashing
   may result in very unbalanced loads on individual component links.
   One example is illustrated in Figure 1.  There is a LAG between two
   routers R1 and R2.  This LAG has 3 component links (1), (2), (3).

   o  Component link (1) has 2 other flows and 1 large flow, and the
      link utilization is normal.

   o  Component link (2) has 3 other flows and no large flow, and the
      link utilization is light.

      *  The absence of any large flow leaves the component link
         under-utilized.

   o  Component link (3) has 2 other flows and 2 large flows, and its
      capacity is exceeded.

      *  The presence of 2 large flows causes the component link to be
         congested.

           +-----------+          +-----------+
           |           | -> ->    |           |
           |           | =====>   |           |
           |        (1)|--/---/---|(1)        |
           |           |          |           |
           |           |          |           |
           |   (R1)    | -> -> -> |   (R2)    |
           |        (2)|--/---/---|(2)        |
           |           |          |           |
           |           | -> ->    |           |
           |           | =====>   |           |
           |           | =====>   |           |
           |        (3)|--/---/---|(3)        |
           |           |          |           |
           +-----------+          +-----------+

           Where:  ->->  other flows
                   ===>  large flow

             Figure 1: Unevenly Utilized Component Links

   This document presents improved load distribution techniques based
   on large flow awareness.  The techniques compensate for the
   unbalanced load distribution that results from hashing and the
   traffic pattern.

3. Best Practices for Optimal LAG/ECMP Component Link Utilization

   The techniques suggested in this draft constitute a local
   optimization solution: both the measurement of large flows and the
   re-balancing of the load are performed at individual nodes in the
   network.  This approach does not yield a globally optimal placement
   of large flows across several nodes in the network, which some
   networks may desire or require.  On the other hand, it may be
   adequate for some operators for the following reasons:

   1) Different links in the network experience different levels of
      utilization and, thus, a more "targeted" solution is needed for
      those few hot-spots in the network;

   2) Some networks may lack end-to-end visibility, e.g., when a
      network carries traffic from multiple other networks.

   The various steps in achieving optimal LAG/ECMP component link
   utilization in backbone networks are detailed below:

   Step 1) Recognize large flows in the routers and maintain the
   mapping of each large flow to the component link that it uses.  The
   recognition of large flows is explained in Section 3.1.

   Step 2) Periodically scan the egress component links for link
   utilization.  If the utilization of an egress component link exceeds
   a pre-programmed threshold, an operator alert is generated.  The
   large flows mapped to the congested egress component link are
   exported to a central management entity.

   Step 3) On receiving the alert about the congested component link,
   the operator, through the central management entity, finds the large
   flows mapped to that component link and the LAG/ECMP group to which
   the component link belongs.

   Step 4) The operator can choose to re-balance the large flows onto
   lightly loaded component links of the LAG/ECMP group, or to
   redistribute all the other flows on the congested link to other
   component links of the group.  The operator, through the central
   management entity, can choose one of the following actions:

   1) Indicate specific large flows to re-balance;

   2) Let the router decide the best large flows to re-balance; or

   3) Let the router redistribute all the other flows on the congested
      link to other component links in the group.

   The central management entity conveys the above information to the
   router.  The load re-balancing options are explained in Section 3.2.

   Optionally, if desired, steps 2) to 4) could become an automated
   process.  A sketch of the scan described in Step 2 is given below.
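   The following is a minimal, illustrative sketch (not a vendor API)
   of the utilization scan and alert in Step 2.  The data structures,
   the 85% threshold, and the alert format are assumptions made purely
   for illustration; in practice the inputs would come from the
   router's counters and from the large flow recognition of
   Section 3.1, and the alert would be delivered to the central
   management entity.

      # Illustrative sketch of Step 2: scan the egress component links
      # of each LAG/ECMP group and, when a link's utilization exceeds a
      # pre-programmed threshold, record an alert together with the
      # large flows currently mapped to that link for export to the
      # central management entity.
      UTILIZATION_THRESHOLD = 0.85   # assumed pre-programmed threshold

      def scan_component_links(lag_groups, link_utilization, large_flow_map):
          """lag_groups: {lag_id: [link_id, ...]}
          link_utilization: {link_id: fraction of capacity in use}
          large_flow_map: {link_id: [flows recognized as large]}"""
          alerts = []
          for lag_id, links in lag_groups.items():
              for link in links:
                  if link_utilization.get(link, 0.0) > UTILIZATION_THRESHOLD:
                      alerts.append({
                          "lag": lag_id,
                          "congested_link": link,
                          "utilization": link_utilization[link],
                          "large_flows": large_flow_map.get(link, []),
                      })
          return alerts

      # Example corresponding to Figure 1: component link 3 is congested.
      print(scan_component_links(
          {"LAG-R1-R2": [1, 2, 3]},
          {1: 0.70, 2: 0.30, 3: 0.95},
          {3: ["large-flow-A", "large-flow-B"]},
      ))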
   The techniques described above are especially useful when bundling
   links of different bandwidths, e.g., 10 Gbps and 100 Gbps, as
   described in [I-D.ietf-rtgwg-cl-requirement].

3.1. Large Flow Recognition

3.1.1. Flow Identification

   A flow (large flow or other flow) can be defined as a sequence of
   packets for which ordered delivery should be maintained.  Flows are
   commonly identified using one of the following sets of fields in a
   packet header:

   o  Layer 2: source MAC address, destination MAC address, VLAN ID

   o  IP 5-tuple: IP protocol, IP source address, IP destination
      address, TCP/UDP source port, TCP/UDP destination port

   o  IP 3-tuple: IP protocol, IP source address, IP destination
      address

   o  MPLS labels

   o  IPv6: IP source address, IP destination address, and IPv6 flow
      label (RFC 6437)

   For tunneling protocols such as GRE, VXLAN, and NVGRE, flow
   identification is possible based on inner and/or outer headers.

   The above list is not exhaustive.  The best practices described in
   this document are agnostic to the fields that are used for flow
   identification.

3.1.2. Sampling Techniques - sFlow/PSAMP

   sFlow [RFC3176]/PSAMP [RFC5475] sampling is enabled on all the
   egress ports in the routers.  Through sample processing in an sFlow
   collector, an approximate indication of the large flows mapped to
   each of the component links in each LAG/ECMP group is obtained.  The
   advantages and disadvantages of sFlow/PSAMP are detailed below.

   Advantages:

   o  Supported in most routers.

   o  Requires minimal router resources.

   Disadvantages:

   o  Large flow recognition time is long, not instant.

   The time taken to determine a candidate large flow depends on the
   number of sFlow samples being generated and on the processing power
   of the external sFlow collector.
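   As an illustration of how an external collector might recognize
   large flows from packet samples, below is a minimal sketch.  The
   1-in-1000 sampling rate, 10-second measurement interval, 10 Gbps
   link capacity, and 10%-of-capacity large flow threshold are all
   assumptions for illustration; the sketch is not tied to any
   particular sFlow or PSAMP implementation.

      # Collector-side large flow recognition from packet samples: with
      # 1-in-N packet sampling, each sampled packet is scaled by N to
      # estimate the flow's byte rate over the measurement interval.
      from collections import defaultdict

      SAMPLING_RATE = 1000          # 1-in-1000 packet sampling (assumed)
      MEASUREMENT_INTERVAL = 10.0   # seconds (assumed)
      LINK_CAPACITY_BPS = 10e9      # 10 Gbps component link (assumed)
      LARGE_FLOW_FRACTION = 0.10    # flag flows above 10% of capacity

      def find_large_flows(samples):
          """samples: iterable of (flow_key, packet_length_bytes) taken
          on one egress component link during one interval."""
          estimated_bytes = defaultdict(int)
          for flow_key, pkt_len in samples:
              estimated_bytes[flow_key] += pkt_len * SAMPLING_RATE
          threshold_bps = LARGE_FLOW_FRACTION * LINK_CAPACITY_BPS
          return [key for key, total in estimated_bytes.items()
                  if total * 8 / MEASUREMENT_INTERVAL >= threshold_bps]

      # Example: one flow dominates the sampled traffic on the link.
      samples = [("flowA", 1500)] * 8000 + [("flowB", 1500)] * 300
      print(find_large_flows(samples))   # ['flowA']

   Because each sample is scaled up by the sampling rate, the estimate
   is approximate, which is why recognition is slower and less precise
   than the hardware-based approach of Section 3.1.3.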
3.1.3. Automatic Hardware Recognition

   Implementations may choose to recognize large flows automatically in
   the hardware of a router.  The characteristics of such an
   implementation would be:

   o  Inline solution

   o  Maintains line-rate performance

   o  Performs accounting of large flows with a high degree of accuracy

   Using automatic hardware recognition of large flows, an accurate
   indication of the large flows mapped to each of the component links
   in a LAG/ECMP group is available.  The advantages and disadvantages
   of automatic hardware recognition are:

   Advantages:

   o  Accurate and in real time

   Disadvantages:

   o  Not supported in many routers

   The measurement interval for determining a large flow and the
   bandwidth threshold of a large flow would be programmable parameters
   in the router.

   The implementation of automatic hardware recognition of large flows
   is vendor dependent.  Below is a suggested technique.

   Suggested Technique for Automatic Hardware Recognition

   Step 1) If the flow already exists in a hardware table resource such
   as a TCAM, increment the counter of the flow.  Else, proceed to
   Step 2.

   Step 2) There are multiple hash tables, each with a different hash
   function, and each hash table entry has an associated counter.  On
   packet arrival, a new flow is looked up in parallel in all the hash
   tables and the corresponding counter in each table is incremented.
   If, within a given time interval, the counter exceeds a programmed
   threshold in all of the hash tables, a candidate large flow is
   learned and programmed into a hardware table resource such as a
   TCAM.  There may be some false positives due to multiple other flows
   masquerading as a large flow; the number of false positives is
   reduced by the parallel hashing with different hash functions.
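   Below is a minimal software sketch of the suggested technique.  It
   is an illustration of the multiple-hash-table idea, not a vendor
   implementation; the table sizes, byte threshold, and use of a
   dictionary to stand in for the TCAM are assumptions.

      # Step 1: a flow already present in the "TCAM" is simply counted.
      # Step 2: a new flow increments one counter per hash table (each
      # table uses a different hash); the flow is promoted to the large
      # flow table only if the counter exceeds the threshold in all
      # tables, which limits false positives caused by other flows
      # colliding in any single table.
      import hashlib

      class LargeFlowDetector:
          def __init__(self, num_tables=4, table_size=1024,
                       threshold_bytes=10_000_000):
              self.num_tables = num_tables
              self.table_size = table_size
              self.threshold = threshold_bytes
              self.tables = [[0] * table_size for _ in range(num_tables)]
              self.large_flows = {}      # stands in for the TCAM resource

          def _index(self, flow_key, table_id):
              data = f"{table_id}:{flow_key}".encode()
              digest = hashlib.sha256(data).digest()
              return int.from_bytes(digest[:4], "big") % self.table_size

          def packet(self, flow_key, length):
              if flow_key in self.large_flows:       # Step 1
                  self.large_flows[flow_key] += length
                  return
              counts = []                            # Step 2
              for t in range(self.num_tables):
                  i = self._index(flow_key, t)
                  self.tables[t][i] += length
                  counts.append(self.tables[t][i])
              if min(counts) >= self.threshold:
                  self.large_flows[flow_key] = 0     # learned large flow

      # Example: a flow crossing the byte threshold is learned.
      det = LargeFlowDetector(threshold_bytes=1_000_000)
      for _ in range(800):
          det.packet(("10.0.0.1", "10.0.0.2", 6, 1234, 80), 1500)
      print(list(det.large_flows))

   A real implementation would also age or reset the hash table
   counters at each measurement interval; that housekeeping is omitted
   here.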
3.2. Load Re-Balancing Options

   Below are suggested techniques for load re-balancing.  Equipment
   vendors should implement all of these techniques and allow the
   operator to choose one or more of them based on their applications.

3.2.1. Alternative Placement of Large Flows

   Within the LAG/ECMP group, choose the other member component links
   with the least average port utilization.  Move some large flow(s)
   from the heavily loaded component link to those member component
   links using a Policy Based Routing (PBR) rule in the ingress
   processing element(s) in the routers.  The key aspects of this
   approach are:

   o  Other flows are not subjected to flow re-ordering.

   o  Only the large flows that are moved are subjected to momentary
      flow re-ordering.

   Note that perfect re-balancing of large flows may not be possible
   since flows arrive and depart at different times.

3.2.2. Redistributing Other Flows

   Some large flows may consume the entire bandwidth of the component
   link(s).  In this case, it would be desirable for the other flows to
   not use the congested component link(s).  This can be accomplished
   in one of the following ways.

3.2.2.1. Redistributing All Other Flows

   This works on existing router hardware.  The idea is to prevent the
   other flows from hashing onto the congested component link(s).

   o  Modify the LAG/ECMP table to include only the non-congested
      component link(s).  The other flows hash into this table to be
      mapped to a destination component link.

   o  All the other flows are subject to momentary flow re-ordering.

   o  The PBR rules for large flows (refer to Section 3.2.1) have
      strict precedence over the LAG/ECMP table lookup result.

3.2.2.2. Redistributing the Other Flows on the Congested Link

   This needs a switch/router hardware change.

   o  If a packet belongs to one of the other flows and is hashed to a
      congested component link, a second hash is applied to it, which
      maps the flow to one of the non-congested component links.

   o  The other flows originally directed to the congested link are
      re-directed to other non-congested component links.

   o  The other flows originally directed to a congested component link
      are subject to momentary flow re-ordering.

   A sketch combining the table modification of Section 3.2.2.1 with
   the PBR precedence of Section 3.2.1 is given below.
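   The following is an illustrative sketch only, not an actual router
   data plane: the LAG/ECMP selection table is rebuilt to contain only
   non-congested component links, so the other flows re-hash onto those
   links (Section 3.2.2.1), while PBR-style entries that pin specific
   large flows (Section 3.2.1) are consulted first and take strict
   precedence.  The table size, hash function, and flow keys are
   assumptions.

      import hashlib

      def build_selection_table(component_links, congested, table_size=64):
          """Return a hash-indexed table that excludes congested links."""
          usable = [l for l in component_links if l not in congested]
          return [usable[i % len(usable)] for i in range(table_size)]

      def forward(flow_key, pbr_overrides, selection_table):
          # Large flows pinned by the operator/management entity win first.
          if flow_key in pbr_overrides:
              return pbr_overrides[flow_key]
          digest = hashlib.md5(str(flow_key).encode()).digest()
          index = int.from_bytes(digest[:4], "big") % len(selection_table)
          return selection_table[index]

      # Example based on Figure 1: link 3 is congested; one large flow
      # is pinned to link 2, and other flows can now hash only onto the
      # non-congested links 1 and 2.
      table = build_selection_table([1, 2, 3], congested={3})
      pbr = {("10.1.1.1", "10.2.2.2", 6, 5001, 443): 2}
      print(forward(("10.1.1.1", "10.2.2.2", 6, 5001, 443), pbr, table))
      print(forward(("10.3.3.3", "10.4.4.4", 6, 6000, 80), pbr, table))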
3.2.3. Component Link Protection Considerations

   If desired, certain component links may be reserved for link
   protection.  These reserved component links are not used for any of
   the flows described in Section 3.2.  When component link(s) fail,
   all the flows on the failed component link(s) are moved to the
   reserved component link(s).  In the mapping table of large flows to
   component links, the reference pointer to the failed component link
   is simply replaced with a pointer to the reserved link; the same
   replacement is made in the LAG/ECMP hash table.

3.2.4. Load Re-Balancing Example

   Optimal LAG/ECMP component link utilization for the use case in
   Figure 1 is depicted below in Figure 2.  The large flow re-balancing
   explained in Section 3.2.1 is used.  The improved link utilization
   is as follows:

   o  Component link (1) has 2 other flows and 1 large flow, and the
      link utilization is normal.

   o  Component link (2) has 3 other flows and 1 large flow, and the
      link utilization is now normal.

   o  Component link (3) has 2 other flows and 1 large flow, and the
      link utilization is now normal.

           +-----------+          +-----------+
           |           | -> ->    |           |
           |           | =====>   |           |
           |        (1)|--/---/---|(1)        |
           |           |          |           |
           |           | =====>   |           |
           |   (R1)    | -> -> -> |   (R2)    |
           |        (2)|--/---/---|(2)        |
           |           |          |           |
           |           |          |           |
           |           | -> ->    |           |
           |           | =====>   |           |
           |        (3)|--/---/---|(3)        |
           |           |          |           |
           +-----------+          +-----------+

           Where:  ->->  other flows
                   ===>  large flow

              Figure 2: Evenly Utilized Component Links

4. Operational Considerations

   For future study.  Input from operators is solicited here.

5. Data Model Considerations

   For Step 2 in Section 3, the IETF could potentially consider a
   standards-based activity around, say, a data model used to convey
   the long-lived large flow information from the router to the central
   management entity.

   For Step 4 in Section 3, the IETF could potentially consider a
   standards-based activity around, say, a data model used to convey
   the long-lived large flow re-balancing information from the central
   management entity to the router.

6. IANA Considerations

   This memo includes no request to IANA.

7. Security Considerations

   This document does not directly impact the security of the Internet
   infrastructure or its applications.  In fact, it could help mitigate
   a DoS attack pattern that causes a hash imbalance resulting in heavy
   loading of large flows onto certain LAG/ECMP component links.

8. Acknowledgements

   The authors would like to thank Shane Amante for all the support and
   valuable input.  The authors would like to thank Curtis Villamizar
   for his valuable input.  The authors would also like to thank Fred
   Baker and Wes George for their input.

9. References

9.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2234]  Crocker, D. and P. Overell (Editors), "Augmented BNF for
              Syntax Specifications: ABNF", RFC 2234, November 1997.

9.2. Informative References

   [I-D.ietf-rtgwg-cl-requirement]
              Villamizar, C., et al., "Requirements for MPLS Over a
              Composite Link", June 2012.

   [RFC6790]  Kompella, K., et al., "The Use of Entropy Labels in MPLS
              Forwarding", RFC 6790, November 2012.

   [CAIDA]    CAIDA Internet Traffic Analysis, http://www.caida.org/home

   [YONG]     Yong, L., "Enhanced ECMP and Large Flow Aware Transport",
              draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010.

   [ITCOM]    Jo, J., et al., "Internet traffic load balancing using
              dynamic hashing with flow volume", SPIE ITCOM, 2002.

   [RFC2991]  Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
              Multicast Next-Hop Selection", RFC 2991, November 2000.

   [RFC2992]  Hopps, C., "Analysis of an Equal-Cost Multi-Path
              Algorithm", RFC 2992, November 2000.

   [RFC5475]  Zseby, T., et al., "Sampling and Filtering Techniques for
              IP Packet Selection", RFC 5475, March 2009.

   [RFC3176]  Phaal, P., et al., "InMon Corporation's sFlow: A Method
              for Monitoring Traffic in Switched and Routed Networks",
              RFC 3176, September 2001.

Appendix A. Internet Traffic Analysis and Load Balancing Simulation

   Internet traffic [CAIDA] has been analyzed with respect to the
   packet volume per flow.  The five-tuple in the packet header (IP
   addresses, TCP/UDP ports, and IP protocol) is used as the flow
   identification.  The analysis indicates that the top ~2% of flows
   ranked by rate account for about ~30% of the total traffic volume,
   while the remaining ~98% of flows contribute the other ~70% [YONG].

   Simulation has shown that, given this Internet traffic pattern, the
   hash method does not evenly distribute the flows over ECMP paths:
   some links may be more than 90% loaded while others are less than
   40% loaded.  The more ECMP paths there are, the more severe the
   imbalance.  This implies that hash-based distribution can congest
   some paths while other paths are only partially filled [YONG].

   The simulation also shows a substantial improvement from using the
   large-flow-aware distribution technique described in this document.
   With the same simulated traffic, the improved re-balancing can
   achieve load differences of less than 10% among the links.  This
   demonstrates how large-flow-aware distribution can effectively
   compensate for the uneven load balancing caused by hashing and the
   traffic pattern.

Authors' Addresses

   Ram Krishnan
   Brocade Communications
   San Jose, CA 95134, USA
   Phone: +001-408-406-7890
   Email: ramk@brocade.com

   Sanjay Khanna
   Brocade Communications
   San Jose, CA 95134, USA
   Phone: +001-408-333-4850
   Email: skhanna@brocade.com

   Lucy Yong
   Huawei USA
   5340 Legacy Drive
   Plano, TX 75025, USA
   Phone: 469-277-5837
   Email: lucy.yong@huawei.com

   Anoop Ghanwani
   Dell
   San Jose, CA 95134
   Phone: (408) 571-3228
   Email: anoop@alumni.duke.edu

   Ning So
   Tata Communications
   Plano, TX 75082, USA
   Phone: +001-972-955-0914
   Email: ning.so@tatacommunications.com

   Bhumip Khasnabish
   ZTE Corporation
   New Jersey, 07960, USA
   Phone: +001-781-752-8003
   Email: bhumip.khasnabish@zteusa.com