OPSAWG                                                      R. Krishnan
Internet Draft                                                S. Khanna
Intended status: Experimental                    Brocade Communications
Expires: July 2013                                              L. Yong
                                                             Huawei USA
                                                            A. Ghanwani
                                                                   Dell
                                                                Ning So
                                                    Tata Communications
                                                          B. Khasnabish
                                                        ZTE Corporation
                                                       January 12, 2013


     Best Practices for Optimal LAG/ECMP Component Link Utilization
                      in Provider Backbone Networks
          draft-krishnan-opsawg-large-flow-load-balancing-02.txt


Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on July 12, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Abstract

   Demands on networking infrastructure are growing exponentially,
   driven by bandwidth-hungry rich media applications, inter-data-
   center communications, and similar workloads.  In this context, it
   is important to optimally use the bandwidth in service provider
   backbone networks, which extensively use LAG/ECMP techniques for
   bandwidth scaling.  This draft describes the issues faced in service
   provider backbones in the context of LAG/ECMP and recommends some
   best practices for managing the bandwidth efficiently in such
   networks.

Table of Contents

   1. Introduction
      1.1. Conventions
   2. Hash-based Load Distribution in LAG/ECMP
   3. Best Practices for Optimal LAG/ECMP Component Link Utilization
      3.1. Large Flow Recognition
         3.1.1. Flow Identification
         3.1.2. Sampling Techniques - sFlow/PSAMP
         3.1.3. Automatic Hardware Recognition
      3.2. Load Re-Balancing Options
         3.2.1. Alternative Placement of Large Flows
         3.2.2. Redistributing Other Flows
            3.2.2.1. Redistributing All Other Flows
            3.2.2.2. Redistributing the Other Flows on the Congested
                     Link
         3.2.3. Component Link Protection Considerations
         3.2.4. Load Re-Balancing Example
   4. Operational Considerations
   5. Data Model Considerations
   6. IANA Considerations
   7. Security Considerations
   8. Acknowledgements
   9. References
      9.1. Normative References
      9.2. Informative References
   Appendix A. Internet Traffic Analysis and Load Balancing Simulation
1. Introduction

   Service provider backbone networks extensively use LAG/ECMP
   techniques for capacity scaling.  Network traffic can be
   predominantly categorized into two traffic types: long-lived large
   flows and other flows (which include long-lived small flows and
   short-lived small/large flows).  Stateless hash-based techniques
   [ITCOM], [RFC2991], [RFC2992], [RFC6790] are often used to
   distribute both long-lived large flows and other flows over the
   component links in a LAG/ECMP.  However, the traffic may not be
   evenly distributed over the component links because of the traffic
   pattern.

   This draft describes best practices for optimal LAG/ECMP component
   link utilization while using hash-based techniques.  These best
   practices comprise the following steps: recognizing long-lived large
   flows in a router, and assigning the long-lived large flows to
   specific LAG/ECMP component links or redistributing other flows when
   a component link on the router is congested.

1.1. Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   The following acronyms and terms are used in this document:

   COTS: Commercial Off-The-Shelf

   DoS: Denial of Service

   ECMP: Equal Cost Multi-Path

   GRE: Generic Routing Encapsulation

   LAG: Link Aggregation Group

   Large flow(s): long-lived large flow(s)

   MPLS: Multiprotocol Label Switching

   NVGRE: Network Virtualization using Generic Routing Encapsulation

   Other flows: long-lived small flows and short-lived small/large
   flows

   QoS: Quality of Service

   VXLAN: Virtual Extensible LAN

2. Hash-based Load Distribution in LAG/ECMP

   Hashing techniques are often used for flow-based load distribution
   [ITCOM].  A large flow identification space, i.e., a finer
   granularity of flows, results in a more random spreading of the
   flows over a set of component links.  The advantages of hash-based
   load distribution are that the packet sequence within a flow is
   preserved and that the distribution is performed in real time
   without keeping per-flow state.  If the traffic flows are randomly
   spread across the flow identification space, the flow rates are
   small compared to the link capacity, and the rate differences among
   flows are not dramatic, the hashing algorithm generally works very
   well.
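   For illustration only, below is a minimal sketch (in Python) of
   stateless hash-based member selection over the component links of a
   LAG.  It is not part of this draft's recommendations; the hash
   function and flow key shown are assumptions, and real routers use
   vendor-specific hardware hashing.

      # Minimal sketch of stateless hash-based load distribution: a
      # flow's 5-tuple is hashed and the result selects one component
      # link of a 3-member LAG, so all packets of a flow stay in order
      # on the same link.
      import hashlib

      NUM_COMPONENT_LINKS = 3  # LAG with component links (1), (2), (3)

      def select_component_link(src_ip, dst_ip, proto, sport, dport):
          """Map a flow's 5-tuple to a component link index (0-based)."""
          key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
          digest = hashlib.md5(key).digest()
          return int.from_bytes(digest[:4], "big") % NUM_COMPONENT_LINKS

      # Example: every packet of this flow maps to the same link.
      link = select_component_link("10.0.0.1", "10.0.0.2", 6, 12345, 443)
      print(f"Flow hashed to component link {link + 1}")

   Because the selection depends only on the flow key, packet order is
   preserved within each flow, but the per-link load depends entirely
   on which flows happen to hash together.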
   However, if one or more of the conditions above are not met, hashing
   may result in very unbalanced loads on individual component links.
   One example is illustrated in Figure 1.  There is a LAG between two
   routers R1 and R2.  This LAG has 3 component links (1), (2), (3).

   o  Component link (1) has 2 other flows and 1 large flow, and the
      link utilization is normal.

   o  Component link (2) has 3 other flows and no large flow, and the
      link utilization is light.

      *  The absence of any large flow leaves the component link
         under-utilized.

   o  Component link (3) has 2 other flows and 2 large flows, and its
      capacity is exceeded.

      *  The presence of 2 large flows causes the component link to be
         congested.

           +-----------+          +-----------+
           |           | -> ->    |           |
           |           | =====>   |           |
           |        (1)|--/---/---|(1)        |
           |           |          |           |
           |           |          |           |
           |   (R1)    | -> -> -> |   (R2)    |
           |        (2)|--/---/---|(2)        |
           |           |          |           |
           |           | -> ->    |           |
           |           | =====>   |           |
           |           | =====>   |           |
           |        (3)|--/---/---|(3)        |
           |           |          |           |
           +-----------+          +-----------+

           Where:  ->->  other flows
                   ===>  large flow

             Figure 1: Unevenly Utilized Component Links

   This document presents improved load distribution techniques based
   on large flow awareness.  The techniques compensate for the
   unbalanced load distribution that results from hashing and the
   traffic pattern.

3. Best Practices for Optimal LAG/ECMP Component Link Utilization

   The techniques suggested in this draft constitute a local
   optimization solution: both the measurement of large flows and the
   re-balancing of the load are performed at individual nodes in the
   network.  This approach does not yield a globally optimal placement
   of large flows across several nodes in the network, which some
   networks may desire or require.  On the other hand, it may be
   adequate for some operators for the following reasons:

   1) Different links in the network experience different levels of
      utilization and, thus, a more "targeted" solution is needed for
      those few hot-spots in the network;

   2) Some networks may lack end-to-end visibility, e.g., when a
      network carries traffic from multiple other networks.

   The various steps in achieving optimal LAG/ECMP component link
   utilization in backbone networks are detailed below:

   Step 1) Recognize large flows in the routers and maintain the
   mapping of each large flow to the component link that it uses.  The
   recognition of large flows is explained in Section 3.1.

   Step 2) Periodically scan the egress component links for link
   utilization.  If the utilization of an egress component link exceeds
   a pre-programmed threshold, an operator alert is generated.  The
   large flows mapped to the congested egress component link are
   exported to a central management entity.

   Step 3) On receiving the alert about the congested component link,
   the operator, through the central management entity, finds the large
   flows mapped to that component link and the LAG/ECMP group to which
   the component link belongs.

   Step 4) The operator can choose to re-balance the large flows onto
   lightly loaded component links of the LAG/ECMP group, or to
   redistribute all the other flows on the congested link to other
   component links of the group.  The operator, through the central
   management entity, can choose one of the following actions:

   1) Indicate specific large flows to re-balance;

   2) Let the router decide the best large flows to re-balance; or

   3) Let the router redistribute all the other flows on the congested
      link to other component links in the group.

   The central management entity conveys the above information to the
   router.  The load re-balancing options are explained in Section 3.2.

   Optionally, if desired, steps 2) to 4) could become an automated
   process.  A sketch of the scan described in Step 2 is given below.
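   The following is a minimal, illustrative sketch (not a vendor API)
   of the utilization scan and alert in Step 2.  The data structures,
   the 85% threshold, and the alert format are assumptions made purely
   for illustration; in practice the inputs would come from the
   router's counters and from the large flow recognition of
   Section 3.1, and the alert would be delivered to the central
   management entity.

      # Illustrative sketch of Step 2: scan the egress component links
      # of each LAG/ECMP group and, when a link's utilization exceeds a
      # pre-programmed threshold, record an alert together with the
      # large flows currently mapped to that link for export to the
      # central management entity.
      UTILIZATION_THRESHOLD = 0.85   # assumed pre-programmed threshold

      def scan_component_links(lag_groups, link_utilization, large_flow_map):
          """lag_groups: {lag_id: [link_id, ...]}
          link_utilization: {link_id: fraction of capacity in use}
          large_flow_map: {link_id: [flows recognized as large]}"""
          alerts = []
          for lag_id, links in lag_groups.items():
              for link in links:
                  if link_utilization.get(link, 0.0) > UTILIZATION_THRESHOLD:
                      alerts.append({
                          "lag": lag_id,
                          "congested_link": link,
                          "utilization": link_utilization[link],
                          "large_flows": large_flow_map.get(link, []),
                      })
          return alerts

      # Example corresponding to Figure 1: component link 3 is congested.
      print(scan_component_links(
          {"LAG-R1-R2": [1, 2, 3]},
          {1: 0.70, 2: 0.30, 3: 0.95},
          {3: ["large-flow-A", "large-flow-B"]},
      ))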
   The techniques described above are especially useful when bundling
   links of different bandwidths, e.g., 10 Gbps and 100 Gbps, as
   described in [I-D.ietf-rtgwg-cl-requirement].

3.1. Large Flow Recognition

3.1.1. Flow Identification

   A flow (large flow or other flow) can be defined as a sequence of
   packets for which ordered delivery should be maintained.  Flows are
   commonly identified using one of the following sets of fields in a
   packet header:

   o  Layer 2: source MAC address, destination MAC address, VLAN ID

   o  IP 5-tuple: IP protocol, IP source address, IP destination
      address, TCP/UDP source port, TCP/UDP destination port

   o  IP 3-tuple: IP protocol, IP source address, IP destination
      address

   o  MPLS labels

   o  IPv6: IP source address, IP destination address, and IPv6 flow
      label (RFC 6437)

   For tunneling protocols such as GRE, VXLAN, and NVGRE, flow
   identification is possible based on inner and/or outer headers.

   The above list is not exhaustive.  The best practices described in
   this document are agnostic to the fields that are used for flow
   identification.

3.1.2. Sampling Techniques - sFlow/PSAMP

   sFlow [RFC3176]/PSAMP [RFC5475] sampling is enabled on all the
   egress ports in the routers.  Through sample processing in an sFlow
   collector, an approximate indication of the large flows mapped to
   each of the component links in each LAG/ECMP group is obtained.  The
   advantages and disadvantages of sFlow/PSAMP are detailed below.

   Advantages:

   o  Supported in most routers.

   o  Requires minimal router resources.

   Disadvantages:

   o  Large flow recognition time is long, not instant.

   The time taken to determine a candidate large flow depends on the
   number of sFlow samples being generated and on the processing power
   of the external sFlow collector.
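   As an illustration of how an external collector might recognize
   large flows from packet samples, below is a minimal sketch.  The
   1-in-1000 sampling rate, 10-second measurement interval, 10 Gbps
   link capacity, and 10%-of-capacity large flow threshold are all
   assumptions for illustration; the sketch is not tied to any
   particular sFlow or PSAMP implementation.

      # Collector-side large flow recognition from packet samples: with
      # 1-in-N packet sampling, each sampled packet is scaled by N to
      # estimate the flow's byte rate over the measurement interval.
      from collections import defaultdict

      SAMPLING_RATE = 1000          # 1-in-1000 packet sampling (assumed)
      MEASUREMENT_INTERVAL = 10.0   # seconds (assumed)
      LINK_CAPACITY_BPS = 10e9      # 10 Gbps component link (assumed)
      LARGE_FLOW_FRACTION = 0.10    # flag flows above 10% of capacity

      def find_large_flows(samples):
          """samples: iterable of (flow_key, packet_length_bytes) taken
          on one egress component link during one interval."""
          estimated_bytes = defaultdict(int)
          for flow_key, pkt_len in samples:
              estimated_bytes[flow_key] += pkt_len * SAMPLING_RATE
          threshold_bps = LARGE_FLOW_FRACTION * LINK_CAPACITY_BPS
          return [key for key, total in estimated_bytes.items()
                  if total * 8 / MEASUREMENT_INTERVAL >= threshold_bps]

      # Example: one flow dominates the sampled traffic on the link.
      samples = [("flowA", 1500)] * 8000 + [("flowB", 1500)] * 300
      print(find_large_flows(samples))   # ['flowA']

   Because each sample is scaled up by the sampling rate, the estimate
   is approximate, which is why recognition is slower and less precise
   than the hardware-based approach of Section 3.1.3.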
3.1.3. Automatic Hardware Recognition

   Implementations may choose to recognize large flows automatically in
   the hardware of a router.  The characteristics of such an
   implementation would be:

   o  Inline solution

   o  Maintains line-rate performance

   o  Performs accounting of large flows with a high degree of accuracy

   Using automatic hardware recognition of large flows, an accurate
   indication of the large flows mapped to each of the component links
   in a LAG/ECMP group is available.  The advantages and disadvantages
   of automatic hardware recognition are:

   Advantages:

   o  Accurate and in real time

   Disadvantages:

   o  Not supported in many routers

   The measurement interval for determining a large flow and the
   bandwidth threshold of a large flow would be programmable parameters
   in the router.

   The implementation of automatic hardware recognition of large flows
   is vendor dependent.  Below is a suggested technique.

   Suggested Technique for Automatic Hardware Recognition

   Step 1) If the flow already exists in a hardware table resource such
   as a TCAM, increment the counter of the flow.  Else, proceed to
   Step 2.

   Step 2) There are multiple hash tables, each with a different hash
   function, and each hash table entry has an associated counter.  On
   packet arrival, a new flow is looked up in parallel in all the hash
   tables and the corresponding counter in each table is incremented.
   If, within a given time interval, the counter exceeds a programmed
   threshold in all of the hash tables, a candidate large flow is
   learned and programmed into a hardware table resource such as a
   TCAM.  There may be some false positives due to multiple other flows
   masquerading as a large flow; the number of false positives is
   reduced by the parallel hashing with different hash functions.
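   Below is a minimal software sketch of the suggested technique.  It
   is an illustration of the multiple-hash-table idea, not a vendor
   implementation; the table sizes, byte threshold, and use of a
   dictionary to stand in for the TCAM are assumptions.

      # Step 1: a flow already present in the "TCAM" is simply counted.
      # Step 2: a new flow increments one counter per hash table (each
      # table uses a different hash); the flow is promoted to the large
      # flow table only if the counter exceeds the threshold in all
      # tables, which limits false positives caused by other flows
      # colliding in any single table.
      import hashlib

      class LargeFlowDetector:
          def __init__(self, num_tables=4, table_size=1024,
                       threshold_bytes=10_000_000):
              self.num_tables = num_tables
              self.table_size = table_size
              self.threshold = threshold_bytes
              self.tables = [[0] * table_size for _ in range(num_tables)]
              self.large_flows = {}      # stands in for the TCAM resource

          def _index(self, flow_key, table_id):
              data = f"{table_id}:{flow_key}".encode()
              digest = hashlib.sha256(data).digest()
              return int.from_bytes(digest[:4], "big") % self.table_size

          def packet(self, flow_key, length):
              if flow_key in self.large_flows:       # Step 1
                  self.large_flows[flow_key] += length
                  return
              counts = []                            # Step 2
              for t in range(self.num_tables):
                  i = self._index(flow_key, t)
                  self.tables[t][i] += length
                  counts.append(self.tables[t][i])
              if min(counts) >= self.threshold:
                  self.large_flows[flow_key] = 0     # learned large flow

      # Example: a flow crossing the byte threshold is learned.
      det = LargeFlowDetector(threshold_bytes=1_000_000)
      for _ in range(800):
          det.packet(("10.0.0.1", "10.0.0.2", 6, 1234, 80), 1500)
      print(list(det.large_flows))

   A real implementation would also age or reset the hash table
   counters at each measurement interval; that housekeeping is omitted
   here.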
3.2. Load Re-Balancing Options

   Below are suggested techniques for load re-balancing.  Equipment
   vendors should implement all of these techniques and allow the
   operator to choose one or more of them based on their applications.

3.2.1. Alternative Placement of Large Flows

   Within the LAG/ECMP group, choose the other member component links
   with the least average port utilization.  Move some large flow(s)
   from the heavily loaded component link to those member component
   links using a Policy Based Routing (PBR) rule in the ingress
   processing element(s) in the routers.  The key aspects of this
   approach are:

   o  Other flows are not subjected to flow re-ordering.

   o  Only the large flows that are moved are subjected to momentary
      flow re-ordering.

   Note that perfect re-balancing of large flows may not be possible
   since flows arrive and depart at different times.

3.2.2. Redistributing Other Flows

   Some large flows may consume the entire bandwidth of the component
   link(s).  In this case, it would be desirable for the other flows to
   not use the congested component link(s).  This can be accomplished
   in one of the following ways.

3.2.2.1. Redistributing All Other Flows

   This works on existing router hardware.  The idea is to prevent the
   other flows from hashing onto the congested component link(s).

   o  Modify the LAG/ECMP table to include only the non-congested
      component link(s).  The other flows hash into this table to be
      mapped to a destination component link.

   o  All the other flows are subject to momentary flow re-ordering.

   o  The PBR rules for large flows (refer to Section 3.2.1) have
      strict precedence over the LAG/ECMP table lookup result.

3.2.2.2. Redistributing the Other Flows on the Congested Link

   This needs a switch/router hardware change.

   o  If a packet belongs to one of the other flows and is hashed to a
      congested component link, a second hash is applied to it, which
      maps the flow to one of the non-congested component links.

   o  The other flows originally directed to the congested link are
      re-directed to other non-congested component links.

   o  The other flows originally directed to a congested component link
      are subject to momentary flow re-ordering.

   A sketch combining the table modification of Section 3.2.2.1 with
   the PBR precedence of Section 3.2.1 is given below.
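   The following is an illustrative sketch only, not an actual router
   data plane: the LAG/ECMP selection table is rebuilt to contain only
   non-congested component links, so the other flows re-hash onto those
   links (Section 3.2.2.1), while PBR-style entries that pin specific
   large flows (Section 3.2.1) are consulted first and take strict
   precedence.  The table size, hash function, and flow keys are
   assumptions.

      import hashlib

      def build_selection_table(component_links, congested, table_size=64):
          """Return a hash-indexed table that excludes congested links."""
          usable = [l for l in component_links if l not in congested]
          return [usable[i % len(usable)] for i in range(table_size)]

      def forward(flow_key, pbr_overrides, selection_table):
          # Large flows pinned by the operator/management entity win first.
          if flow_key in pbr_overrides:
              return pbr_overrides[flow_key]
          digest = hashlib.md5(str(flow_key).encode()).digest()
          index = int.from_bytes(digest[:4], "big") % len(selection_table)
          return selection_table[index]

      # Example based on Figure 1: link 3 is congested; one large flow
      # is pinned to link 2, and other flows can now hash only onto the
      # non-congested links 1 and 2.
      table = build_selection_table([1, 2, 3], congested={3})
      pbr = {("10.1.1.1", "10.2.2.2", 6, 5001, 443): 2}
      print(forward(("10.1.1.1", "10.2.2.2", 6, 5001, 443), pbr, table))
      print(forward(("10.3.3.3", "10.4.4.4", 6, 6000, 80), pbr, table))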
3.2.3. Component Link Protection Considerations

   If desired, certain component links may be reserved for link
   protection.  These reserved component links are not used for any of
   the flows described in Section 3.2.  When component link(s) fail,
   all the flows on the failed component link(s) are moved to the
   reserved component link(s).  In the mapping table of large flows to
   component links, the reference pointer to the failed component link
   is simply replaced with a pointer to the reserved link; the same
   replacement is made in the LAG/ECMP hash table.

3.2.4. Load Re-Balancing Example

   Optimal LAG/ECMP component link utilization for the use case in
   Figure 1 is depicted below in Figure 2.  The large flow re-balancing
   explained in Section 3.2.1 is used.  The improved link utilization
   is as follows:

   o  Component link (1) has 2 other flows and 1 large flow, and the
      link utilization is normal.

   o  Component link (2) has 3 other flows and 1 large flow, and the
      link utilization is now normal.

   o  Component link (3) has 2 other flows and 1 large flow, and the
      link utilization is now normal.

           +-----------+          +-----------+
           |           | -> ->    |           |
           |           | =====>   |           |
           |        (1)|--/---/---|(1)        |
           |           |          |           |
           |           | =====>   |           |
           |   (R1)    | -> -> -> |   (R2)    |
           |        (2)|--/---/---|(2)        |
           |           |          |           |
           |           |          |           |
           |           | -> ->    |           |
           |           | =====>   |           |
           |        (3)|--/---/---|(3)        |
           |           |          |           |
           +-----------+          +-----------+

           Where:  ->->  other flows
                   ===>  large flow

              Figure 2: Evenly Utilized Component Links

4. Operational Considerations

   For future study.  Input from operators is solicited here.

5. Data Model Considerations

   For Step 2 in Section 3, the IETF could potentially consider a
   standards-based activity around, say, a data model used to convey
   the long-lived large flow information from the router to the central
   management entity.

   For Step 4 in Section 3, the IETF could potentially consider a
   standards-based activity around, say, a data model used to convey
   the long-lived large flow re-balancing information from the central
   management entity to the router.

6. IANA Considerations

   This memo includes no request to IANA.

7. Security Considerations

   This document does not directly impact the security of the Internet
   infrastructure or its applications.  In fact, it could help mitigate
   a DoS attack pattern that causes a hash imbalance resulting in heavy
   loading of large flows onto certain LAG/ECMP component links.

8. Acknowledgements

   The authors would like to thank Shane Amante for all the support and
   valuable input.  The authors would like to thank Curtis Villamizar
   for his valuable input.  The authors would also like to thank Fred
   Baker and Wes George for their input.

9. References

9.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2234]  Crocker, D. and P. Overell (Editors), "Augmented BNF for
              Syntax Specifications: ABNF", RFC 2234, November 1997.

9.2. Informative References

   [I-D.ietf-rtgwg-cl-requirement]
              Villamizar, C., et al., "Requirements for MPLS Over a
              Composite Link", June 2012.

   [RFC6790]  Kompella, K., et al., "The Use of Entropy Labels in MPLS
              Forwarding", RFC 6790, November 2012.

   [CAIDA]    CAIDA Internet Traffic Analysis, http://www.caida.org/home

   [YONG]     Yong, L., "Enhanced ECMP and Large Flow Aware Transport",
              draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010.

   [ITCOM]    Jo, J., et al., "Internet traffic load balancing using
              dynamic hashing with flow volume", SPIE ITCOM, 2002.

   [RFC2991]  Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
              Multicast Next-Hop Selection", RFC 2991, November 2000.

   [RFC2992]  Hopps, C., "Analysis of an Equal-Cost Multi-Path
              Algorithm", RFC 2992, November 2000.

   [RFC5475]  Zseby, T., et al., "Sampling and Filtering Techniques for
              IP Packet Selection", RFC 5475, March 2009.

   [RFC3176]  Phaal, P., et al., "InMon Corporation's sFlow: A Method
              for Monitoring Traffic in Switched and Routed Networks",
              RFC 3176, September 2001.

Appendix A. Internet Traffic Analysis and Load Balancing Simulation

   Internet traffic [CAIDA] has been analyzed with respect to the
   packet volume per flow.  The five-tuple in the packet header (IP
   addresses, TCP/UDP ports, and IP protocol) is used as the flow
   identification.  The analysis indicates that the top ~2% of flows
   ranked by rate account for about ~30% of the total traffic volume,
   while the remaining ~98% of flows contribute the other ~70% [YONG].

   Simulation has shown that, given this Internet traffic pattern, the
   hash method does not evenly distribute the flows over ECMP paths:
   some links may be more than 90% loaded while others are less than
   40% loaded.  The more ECMP paths there are, the more severe the
   imbalance.  This implies that hash-based distribution can congest
   some paths while other paths are only partially filled [YONG].

   The simulation also shows a substantial improvement from using the
   large-flow-aware distribution technique described in this document.
   With the same simulated traffic, the improved re-balancing can
   achieve load differences of less than 10% among the links.  This
   demonstrates how large-flow-aware distribution can effectively
   compensate for the uneven load balancing caused by hashing and the
   traffic pattern.

Authors' Addresses

   Ram Krishnan
   Brocade Communications
   San Jose, CA 95134, USA
   Phone: +001-408-406-7890
   Email: ramk@brocade.com

   Sanjay Khanna
   Brocade Communications
   San Jose, CA 95134, USA
   Phone: +001-408-333-4850
   Email: skhanna@brocade.com

   Lucy Yong
   Huawei USA
   5340 Legacy Drive
   Plano, TX 75025, USA
   Phone: 469-277-5837
   Email: lucy.yong@huawei.com

   Anoop Ghanwani
   Dell
   San Jose, CA 95134
   Phone: (408) 571-3228
   Email: anoop@alumni.duke.edu

   Ning So
   Tata Communications
   Plano, TX 75082, USA
   Phone: +001-972-955-0914
   Email: ning.so@tatacommunications.com

   Bhumip Khasnabish
   ZTE Corporation
   New Jersey, 07960, USA
   Phone: +001-781-752-8003
   Email: bhumip.khasnabish@zteusa.com