Internet Engineering Task Force Christian E. Hopps INTERNET-DRAFT Merit Network Expires September 1999 13 April 1999 Analysis of an Equal-Cost Multi-Path Algorithm Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Copyright Notice Copyright (C) The Internet Society (1999). All Rights Reserved. Expires September 1999 [Page 1] Draft Analysis of an ECMP Algorithm April 1999 Abstract Equal-cost multi-path (ECMP) is a routing technique for routing packets along multiple paths of equal cost. The forwarding engine identifies paths by next-hop. When forwarding a packet the router must decide which next-hop (path) to use. This document gives an analysis of one method for making that decision. The analysis includes the performance of the algorithm and the disruption caused by changes to the set of next-hops. 1. Hash-Threshold One method for determining which next-hop to use when routing with ECMP can be called hash-threshold. The router first selects a key by performing a hash (e.g., modulo-K where K is large, or CRC16) over the packet header fields that identify a flow. The N next-hops have been assigned unique regions in the key space. The router uses the key to determine which region and thus which next-hop to use. As an example of hash-threshold, upon receiving a packet the router performs a CRC16 on the packet's header fields that define the flow (e.g., the source and destination fields of the packet), this is the key. Say for this destination there are 4 next-hops to choose from. Each next-hop is assigned a region in 16 bit space (the key space). For equal usage the router may have chosen to divide it up evenly so each region is 65536/4 or 16k large. The next-hop is chosen by determining which region contains the key (i.e., the CRC result). 2. Analysis There are a few concerns when choosing an algorithm for deciding which next-hop to use. One is performance, the computational requirements to run the algorithm. Another is disruption (i.e., the changing of which path a flow uses). Balancing is a third concern; however since the algorithm's balancing characteristics are directly related to the chosen hash function this analysis does not treat this concern in depth. For this analysis we will assume regions of equal size. If the hash function is uniformly distributed the distribution of flows amongst paths will also be uniform. Expires September 1999 [Page 2] Draft Analysis of an ECMP Algorithm April 1999 2.1. Performance The performance of the hash-threshold algorithm can be broken down into three parts: selection of regions for the next-hops, obtaining the key and comparing the key to the regions to decide which next-hop to use. Since regions are restricted to be of equal size the calculation of region boundaries is trivial. The boundaries can be calculated as follows: i = 0; regionssize = keyspace.size / #{ next-hops } for n in { next-hops } n.start = i * regionsize; n.end = n.start + regionsize; i = i + 1 done This calculation is O(N). Furthermore the calculation can be done when the next-hops are added to or removed from the destination route. The algorithm doesn't specify the hash function used to obtain the key. Its performance in this area will be exactly the performance of the hash function. It is presumed that if this calculation proves to be a concern it can be done in hardware parallel to other operations that need to complete before deciding which next-hop to use. Finally the next-hop must be chosen. This is done by determining which region contains the key. The time required to do this is dependent on the way the next-hops are organized in memory. The problem reduces to a search. For example if the next-hops are stored as a linked list the time is O(N) as the router must traverse the list comparing each next-hop's region boundaries against the key. If the next-hops are stored as an ordered array a binary search can be used yielding O(logN). If the number of next-hops is limited to a fixed maximum the comparison can be done in parallel in hardware yielding O(1). Expires September 1999 [Page 3] Draft Analysis of an ECMP Algorithm April 1999 2.2. Disruption Protocols such as TCP perform better if the path they flow along does not change while the stream is connected. Disruption is the measurement of how many flows have their paths changed due to some change in the router. We measure disruption as the fraction of total flows whose path changes in response to some change in the router. Some algorithms such as round-robin (i.e., upon receiving a packet the least recently used next-hop is chosen) are disruptive regardless of any change in the router. Clearly this is not the case with hash- threshold. As long as the region boundaries remain unchanged the same next-hop will be chosen for a given flow. Because we have required regions to be equal in size the only reason for a change in region boundaries is the addition or removal of a next-hop. In this case the regions must all grow or shrink to fill the key space. The analysis begins with some examples of this. 0123456701234567012345670123456701234567 +-------+-------+-------+-------+-------+ | 1 | 2 | 3 | 4 | 5 | +-------+-+-----+---+---+-----+-+-------+ | 1 | 2 | 4 | 5 | +---------+---------+---------+---------+ 0123456789012345678901234567890123456789 Figure 1. Before and after deletion of region 3 In figure 1. region 3 has been deleted. The remaining regions grow equally and shift to compensate. In this case 1/4 of region 2 is now in region 1, 1/2 (2/4) of region 3 is in region 2, 1/2 of region 3 is in region 4 and 1/4 of region 4 is in region 5. Since each of the original regions represent 1/5 of the flows, the total disruption is 1/5*(1/4 + 1/2 + 1/2 + 1/4) or 3/10. Note that the disruption to flows when adding a region is equivalent to that of removing a region. That is, we are considering the fraction of total flows that changes regions when moving from N to N-1 regions, and that same fraction of flows will change when moving from N-1 to N regions. Expires September 1999 [Page 4] Draft Analysis of an ECMP Algorithm April 1999 0123456701234567012345670123456701234567 +-------+-------+-------+-------+-------+ | 1 | 2 | 3 | 4 | 5 | +-------+-+-----+---+---+-----+-+-------+ | 1 | 2 | 3 | 5 | +---------+---------+---------+---------+ 0123456789012345678901234567890123456789 Figure 2. Before and after deletion of region 4 In figure 2. region 4 has been deleted. Again the remaining regions grow equally and shift to compensate. 1/4 of region 2 is now in region 1, 1/2 of region 3 is in region 2, 3/4 of region 4 is in region 3 and 1/4 of region 4 is in region 5. Since each of the original regions represent 1/5 of the flows the, total disruption is 7/20. To generalize, upon removing a region K the remaining N-1 regions grow to fill the 1/N space. This growth is evenly divided between the N-1 regions and so the change in size for each region is 1/N/(N-1) or 1/(N(N-1)). This change in size causes non-end regions to move. The first region grows and so the second region is shifted towards K by the change in size of the first region. 1/(N(N-1)) of the flows from region 2 are subsumed by the change in region 1's size. 2/(N(N-1)) of the flows in region 3 are subsumed by region 2. This is because region 2 has shifted by 1/(N(N-1)) and grown by 1/(N(N-1)). This continues from both ends until you reach the regions that bordered K. The calculation for the number of flows subsumed from the Kth region into the bordering regions accounts for the removal of the Kth region. Thus we have the following equation. K-1 N --- i --- (i-K) disruption = \ --- + \ --- / (N)(N-1) / (N)(N-1) --- --- i=1 i=K+1 We can factor 1/((N)(N-1)) out as it is constant. Expires September 1999 [Page 5] Draft Analysis of an ECMP Algorithm April 1999 / K-1 N \ 1 | --- --- | = --- | \ i + \ (i-K) | (N)(N-1) | / / | \ --- --- / 1 i=K+1 We now use the the concrete formulas for the sum of integers. The first summation is (K)(K-1)/2. For the second summation notice that we are summing the integers from 1 to N-K, thus it is (N-K)(N-K+1)/2. (K-1)(K) + (N-K)(N-K+1) = ----------------------- 2(N)(N-1) Considering the summations, one can see that the least disruption is when K is as close to half way between 1 and N as possible. This can be proven by finding the minimum of the concrete formula for K holding N constant. First break apart the quantities and collect. 2K*K - 2K - 2NK + N*N + N = ------------------------- 2(N)(N-1) K*K - K - NK N + 1 = -------------- + ------- (N)(N-1) 2(N-1) Since we are minimizing for K the right side (N+1)/2(N-1) is constant as is the denominator (N)(N-1) so we can drop them. To minimize we take the derivative. Expires September 1999 [Page 6] Draft Analysis of an ECMP Algorithm April 1999 d -- (K*K - (N+1)K) dk = 2K - (N+1) Which is zero when K is (N+1)/2. The last thing to consider is that K must be an integer. When N is odd (N+1)/2 will yield an integer, however when N is even (N+1)/2 yields an integer + 1/2. In the case, because of symmetry, we get the least disruption when K is N/2 or N/2 + 1. Now since the formula is quadratic with a global minimum half way between 1 and N the maximum possible disruption must occur when edge regions (1 and N) are removed. If K is 1 or N the formula reduces to 1/2. The minimum possible disruption is obtained by letting K=(N+1)/2. In this case the formula reduces to 1/4 + 1/(4*N). So the range of possible disruption is (1/4, 1/2]. To minimize disruption we recommend adding new regions to the center rather than the ends. 3. Comparison to other algorithms Other algorithms exist to decide which next-hop to use. These algorithms all have different performance and disruptive characteristics. Of these algorithms we will only consider ones that are not disruptive by design (i.e., if no change to the set of next- hops occurs the path a flow takes remains the same). This will exclude round-robin and random choice. We will look at modulo-N and highest random weight. Modulo-N is a simpler form of hash-threshold. Given N next-hops the hash function includes a final modulo-N which directly maps the result to one of the next-hops. This operation is the fastest of the three we consider, however if a next-hop is added or removed the disruption is (N-1)/N. Expires September 1999 [Page 7] Draft Analysis of an ECMP Algorithm April 1999 Highest random weight (HRW) is another comparative method similar to hash-threshold. The router seeds a pseudo-random number generator with the packet header fields which describe the flow and the next- hop to obtain a weight. The next-hop which receives the highest weight is selected. The advantage with using HRW is that it has minimal disruption (i.e., disruption due to adding or removing a next-hop is always 1/N.) The disadvantage with HRW is an only slightly more complex function to choose the next-hop. A description of HRW along with comparisons to other methods can be found in [1]. Although not used for next-hop calculation an example usage of HRW can be found in [2]. If the complexity of HRW's next-hop selection processes is acceptable we think it should be considered as an alternative to hash-threshold. 4. Security Considerations This document is an analysis of an algorithm used to implement an ECMP routing decision. This analysis does not directly effect the security of the Internet Infrastructure. 5. References [1] Thaler, D., and C.V. Ravishankar, "Using Name-Based Mappings to Increase Hit Rates", IEEE/ACM Transactions on Networking, February 1998. [2] Estrin, D., Farinacci, D., Helmy, A., Thaler, D., Deering, S., Handley, M., Jacobson, V., Liu, C., Sharma, P., and L. Wei, "Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol Specification", RFC 2362, June 1998. 6. Author's Address Christian E. Hopps Merit Network 4251 Plymouth Road, Suite C. Ann Arbor, MI 48105 Phone: +1 734 936 0291 EMail: chopps@merit.edu Expires September 1999 [Page 8] Draft Analysis of an ECMP Algorithm April 1999 7. Full Copyright Statement Copyright (C) The Internet Society (1999). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet languages other than English. The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns. Expires September 1999 [Page 9]