Internet Engineering Task Force Indra Widjaja Fujitsu Network Communications INTERNET DRAFT Anwar Elwalid Expired in six months Bell Labs, Lucent Technologies August 1998 MATE: MPLS Adaptive Traffic Engineering Status of this Memo This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as ``work in progress.'' To learn the current status of any Internet-Draft, please check the ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast). Abstract This document describes an MPLS Adaptive Traffic Engineering scheme, called MATE. The main goal of MATE is to avoid network congestion by balancing the loads among the LSPs. MATE makes minimal assumptions in that the intermediate LSRs are not required to perform traffic engineering activities or measurements beside forwarding packets. Moreover, MATE does not impose any particular scheduling, buffer management, architecture, or a priori traffic characterization, on the LSRs. 1.0 Introduction One of the main advantages of MPLS is its efficient support of expli- cit routing through the use of Label Switched Paths (LSPs) [1][2]. With destination-based forwarding as in the conventional datagram networks, explicit routing is usually provided by attaching to each Widjaja & Elwalid Expired in six months [Page 1] Internet Draft MPLS Adaptive Traffic Engineering August 1998 packet the network-layer address of each node along the explicit path. This approach makes the overhead in the packet prohibitively expensive. Switched forwarding using label swapping in MPLS enables a label to identify an LSP regardless of whether the LSP is esta- blished through hop-by-hop or explicit routing. In other words, MPLS imposes the same amount of packet header overhead irrespective of the use of explicit routing. Explicit routing has useful applications such as Virtual Private Net- works (VPNs) and Traffic Engineering. This memo focuses on engineer- ing the traffic across multiple explicit routes. The purpose of traffic engineering is to manage network resources as efficiently as possible so that congestion is minimized. This may be done by avoid- ing links that are already heavily stressed. Traffic engineering typ- ically becomes more effective as the network provides more alternate paths. Traffic engineering also becomes more critical in large Auto- nomous Systems where maximal operational efficiency should be emphasized [3]. It is envisioned that traffic engineering is performed only for traffic that does not require resource reservation, but may need pro- visioning on an aggregated basis. Examples include best-effort and differentiated services. This memo proposes that traffic engineering be done by establishing multiple LSPs between a given ingress LSR and a given egress LSR in an MPLS domain. The main objective is to have the ingress LSR distribute traffic across the multiple LSPs effec- tively so that network resources among the LSPs are equitably util- ized. This memo describes a scheme called MPLS Adaptive Traffic Engineering (MATE). It is to be noted that MATE is intended for quasi-stationary situations where traffic statistics changes rela- tively slowly (much longer than the round-trip delay between the ingress and egress LSRs). Recent measurements in the Internet indi- cate traffic stationarity in at least 5-min intervals [4]. Some of the features of MATE include: o Traffic engineering on a per-class basis. o Distributed load-balancing algorithm. o End-to-end protocol between ingress and egress LSRs. o No new hardware or protocol requirement on intermediate LSRs, nor a priori traffic distributions. o No assumption on the scheduling or buffer management schemes at the LSR. o Optimization decision based on LSP congestion measure. o Minimal packet reordering due to traffic engineering. o No clock synchronization between two LSRs. Widjaja & Elwalid Expired in six months [Page 2] Internet Draft MPLS Adaptive Traffic Engineering August 1998 2.0 Traffic Engineering Using MATE This section provides the description of MATE. While the specifics of the scheme are tailored for best-effort service, the basic princi- ples are intended to be extensible to other services (e.g., differen- tiated service) as well. 2.1 Overview It is assumed that M explicit LSPs have been established between an ingress LSR and an egress LSR when traffic engineering is to be facilitated between these two LSRs. The value of M is typically between 2 and 5 (the case with M=1 is not interesting). The M LSPs may be chosen to be the "best" M paths calculated by a link-state protocol or configured manually. Explicit LSPs may be established by RSVP, LDP, or other means. The mechanism to select and establish the LSPs is beyond the scope of this document. Once the LSPs are setup, the ingress LSR tries to distribute the traffic across the LSPs so that the traffic loads are balanced and congestion is thus minimized. The traffic to be engineered at the ingress LSR is the aggregated flow (called traffic trunk in [5]) that shares the same Forwarding Equivalence Class. This document assumes that the traffic to be engineered consists of the best-effort ser- vice. Future work will consider differentiated service. In order to perform effective traffic distribution, the characteris- tics of the LSPs and the QoS requirements of the traffic must be known. In general, the LSP characteristics may include the average packet delay, packet delay variance, loading factor/utilization, packet loss rate, bottleneck bandwidth, available bandwidth, etc. For best-effort traffic, there is no explicit QoS requirement, except that it is desirable to have minimum packet loss rate. Since MATE is intended to be as flexible as possible, the pertinent LSP charac- teristics are not assumed to be given quantities, but must be gath- ered through some measurement. In MATE, the ingress LSR transmits probe packets periodically to the egress LSR which will return the probe packets back to the ingress LSR. Based on the information in the returning probe packets, the ingress LSR is able to compute the LSP characteristics. Intermediate LSRs are not required to modify the contents of the probe packets, but such optional capabilities may be used to refine the measurement process. MATE employs a four-phase algorithm for load balancing. The first phase initializes the congestion measure for each LSP. The conges- tion measure may be a function of delay derivative and packet loss. Widjaja & Elwalid Expired in six months [Page 3] Internet Draft MPLS Adaptive Traffic Engineering August 1998 In the second phase, the algorithm tries to equalize the congestion measure for each LSP. Once the measures are equalized, the algorithm moves to the third phase. The algorithm monitors each LSP in the third phase. If an appreciable change in the network state is detected, the algorithm moves to the fourth phase where the conges- tion measures are appropriate adjusted. Then, the algorithm goes to the second phase and the whole process repeats. 2.2 Traffic Filtering and Distribution MATE performs a two-stage traffic distribution. First, MATE distri- butes the engineered traffic for a given ingress-egress pair equally among N bins at the ingress LSR. If the total incoming traffic to be engineered is of rate R bps, each bin receives an amount of r = R/N bps. Then, the traffic from the N bins is mapped into M LSPs accord- ing to the rule defined below. The engineered traffic can be filtered and distributed into the N bins in a number of ways. A simple method is to distribute the traffic on a per-packet basis without filtering. For example, one may distribute incoming packets at the ingress LSR to the bins in a round-robin fashion. Although it does not have to maintain any per- flow state, the method suffers from potentially having to reorder an excessive amount of packets for a given flow which is undesirable for TCP applications. On the other extreme, one may filter the traffic on a per-flow basis (e.g., based on tuple), and distribute the flows to the bins such that the loads are similar. Although per-flow traffic filtering and distribution preserves packet sequencing, it has to maintain a large number of states to keep track of each active flow. Another method, which is described in [6], is to filter the incoming packets by using a hash function on the IP field(s). The fields can be based on the source and destination address pair, or other combi- nations. A typical hash function is based on CRC. The purpose of the hash function is to randomize the address space to prevent hot spots. Traffic can be distributed into the N bins by applying a modulo-N operation on the hash space. Note that packet sequence for each flow is maintained with this method. After the engineered traffic is distributed into the N bins, a second function maps each bin to the corresponding LSP. The rule for the second function is very simple. If LSP(i) is to receive twice as much traffic as LSP(j), then LSP(i) should receive traffic from twice Widjaja & Elwalid Expired in six months [Page 4] Internet Draft MPLS Adaptive Traffic Engineering August 1998 as many bins as LSP(j). The value N should be chosen so that the smallest traffic that can be shifted, which is equal to 1/N of the total incoming traffic, has reasonable granularity. 2.3 Traffic Measurement The efficacy of any traffic engineering scheme depends crucially on the traffic measurement process. MATE does not require each LSR to perform traffic measurement. Only the ingress and egress LSRs are required to participate in the measurement process. For the purpose of balancing the loads on each LSP, the available bandwidth appears to be a desirable metric to measure. The methods for measuring the available bandwidth of a given path have been described in the past (e.g., see [7] and [8]). Based on our experi- ence, this metric turns out to be difficult to measure accurately using minimal requirements assumed in MATE. To this end, we found that packet delays across the LSPs are the most reliable known metric. The delay of a packet on an LSP can be obtained by transmitting a probe packet from the ingress LSR to the egress LSR. The probe packet is time-stamped at the ingress LSR at time T1 and recorded at the egress LSR at time T2. If the ingress's clock is faster than the egress's clock by Td, then the total packet delay (i.e, queueing time, propagation time, and processing time) is T2-T1+Td. A bunch of probe packets can easily yield an estimate on the mean packet delay Tm+Td, where Tm is the long-term average of T2-T1. The reliability of the estimator can be evaluated by bootstrapping or jacknife to give the confidence interval for the mean delay. One important point to note is that the value of Td is not required when only the marginal delay is needed. MATE exploits this property by relying only on marginal delays rather than absolute delays. Therefore, clock synchronization is not necessary. Packet loss probability is another metric that can be estimated by a bunch of probe packets. In general, only reasonably high packet loss rates can be reliably observed. Packet loss probability can be estimated by encoding a sequence number in the probe packet to notify the egress LSR how many probe packets have been transmitted by the ingress LSR, and another field in the probe packet to indicate how many probe packets have been received by the egress LSR. When a probe packet returns, the ingress LSR is able to estimate the one-way packet loss probability based on the number of probe packets that has been transmitted and the number that has been received. The advan- tage of this approach is that it is resilient to losses in the reverse direction. Widjaja & Elwalid Expired in six months [Page 5] Internet Draft MPLS Adaptive Traffic Engineering August 1998 2.4 Objective Function For best-effort traffic, our objective is to avoid congested LSPs. The congestion measure can be characterized by the sensitivity of the LSP. An LSP is qualitatively said to be sensitive if a perturbation in the load changes the mean delay significantly. MATE computes the derivative of the mean packet delay with respect to the offered load to quantify the sensitivity of an LSP. For a given LSP, the deriva- tive increases as the load increases. This is evident since the mean delay is an increasing and convex function with respect to the load. The derivative can be derived by observing the mean delays at two different loads. One advantage for using the sensitivity measure is that the fixed propagation delay has no effect on the derivative. An LSP is said to be lossless if each LSR along the LSP never dis- cards packets. Otherwise, the LSP is said to be lossy. Although the lossless case may not be realistic, it is much simpler to understand. Here, MATE performs load balancing by simply equalizing the deriva- tives across the M LSPs. The implementation is much simplified since only the relative derivatives should be known. MATE essentially applies a Min-Max operation; that is minimizing the load on an LSP that has maximum congestion. The objective is also equivalent to minimizing the sum of all LSP delays. MATE can be easily made to respond to changes in network state adaptively as long as the traffic is quasi-stationary. Network changes can be tracked by periodically measuring packet delays. If the changes in delays exceed a certain threshold, the algorithm minimized the most congested LSP by re- equalizing the derivatives. For the lossy case, the packet loss information must be incorporated into the load balancing process. The lossy case will be treated in the later section. 2.5 State Table MATE keeps a state table for each traffic class that will be engineered between an ingress LSR and an egress LSR. The state table needs to maintain the mean delay and derivative of each LSP. One con- venient realization is depicted below: Widjaja & Elwalid Expired in six months [Page 6] Internet Draft MPLS Adaptive Traffic Engineering August 1998 +-----------------+--------------+-------------+---------+ | LSP | D_old | D_new | Inc | +-----------------+--------------+-------------+---------+ | 1 | | | | | 2 | | | | | 3 | | | | | ... | | | | | M | | | | +-----------------+--------------+-------------+---------+ D_old (D_new) denotes the mean packet delay before (after) a traffic shift is made to the LSP. Inc (for Increments) denotes the number of bins for which traffic was recently shifted to/from the LSP. The derivative for LSP(i) can be simply computed by |D_new(i) - D_old(i)|/Inc(i). 2.6 Algorithm: Lossless Case We now present the algorithm in an ideal case with quasi-stationary traffic, zero packet loss, and perfect measurements. Initially, the traffic to be engineered at the ingress LSR is sent through the shortest path, say through LSP(1). The algorithm can be broken into four phases: Initially the state table is empty, and phase 1 is run. Phase 1: Initialize delays and derivatives Inc(1) <- N for i from 2 to M Send probe packets on LSP(1) and LSP(i) Compute mean packet delays D_old(1) and D_old(i) Shift [N/M] bins of traffic from LSP(1) to LSP(i) Inc(1) <- N - [N/M] Inc(i) <- [N/M] Send probe packets on LSP(1) and LSP(i) Compute mean packet delays D_new(1) and D_new(i) Go to Phase 2 Here, [N/M] indicates the largest integer less than or equal to N/M. At the end of Phase 1, all columns of the state table are filled up. The traffic at the ingress LSR is distributed almost equally across all M LSPs. Phase 2: Equalize derivatives Widjaja & Elwalid Expired in six months [Page 7] Internet Draft MPLS Adaptive Traffic Engineering August 1998 do Let LSP(i) be the LSP that has the highest derivative Let LSP(j) be the LSP that has the lowest derivative D_old(i) <- D_new(i) D_old(j) <- D_new(j) Shift n bins of traffic from LSP(i) to LSP(j) Send probe packets on LSP(i) and LSP(j) Compute mean packet delays D_new(i) and D_new(j) Inc(i) <- n Inc(j) <- n while (|D_old(i) + D_old(j)| > |D_new(i) + D_new(j)|) Go to Phase 3 The main stopping rule for the optimization process is when the total LSP delay can no longer be decreased. This is achieved when all derivatives are approximately equal. Note that phase 2 actually stops when the total LSP delay overshoots. A compensation could be made by shifting back the traffic from LSP(j) to LSP(i) after the overshoot is detected. The parameter n (n = 1, 2, ..., N) determines the amount of traffic to be shifted. A small value of n may cause the "noise" to mask the traffic that is shifted. On the other hand, a large value of n may overshoot and overload the LSP. Thus, this value should be properly engineered, and can be made adaptive. It is to be understood that MATE cannot shift more bins than it currently has for a given LSP. The algorithm terminates if no bin is available from the LSP with the lowest derivative. By the mean value theorem, the computed derivative is equal to the actual derivative between the points before and after the traffic is shifted. Phase 3: Track network dynamics Send probe packets intermittently on each LSP(i) Compute running mean packet delay D(i) If (|D(i) - D_new(i)| > alpha * |D_new(i) - D_old(i)|) Go to Phase 4 The condition in Phase 3 basically detects a network change by moni- toring an appreciable change in an LSP load. The amount of probe packets sent at phase 3 should be negligible com- pared to the engineered traffic that is sent to the same LSP (say, less than 1 percent). The parameter alpha determines the threshold on Widjaja & Elwalid Expired in six months [Page 8] Internet Draft MPLS Adaptive Traffic Engineering August 1998 the change in load. The mean packet delay on LSP(i) is updated by a low-pass filter: D(i) <- beta * D(i) + (1 - beta) * Measured delay where beta is weighting constant between 0 and 1, and D(i) is ini- tialized to D_new(i). In order to minimize synchronization due to multiple ingress LSRs detecting the network change simultaneously, it may be useful to ran- domize the detection time with a random backoff time. There are other schemes to eliminate synchronization. Phase 4: Refresh state table LSP(i) detects a significant change in its marginal delay set Flag(i) if (D(i) > D_new(i)) D_old(i) <- D(i) - 2 * (D_new(i) - D_old(i)) D_new(i) <- D(i) Inc(i) <- 1 else D_old(i) <- D(i) - (D_new(i) - D_old(i)) / 2 D_new(i) <- D(i) Inc(i) <- 1 For each LSP k such that (k != i) D_old(k) <- D(k) - (D_new(k) - D_old(k)) D_new(k) <- D(k) Go to Phase 2 When LSP(i) detects a significant change in its marginal delay, it changes doubles or halfs its derivative depending on whether the delay has increased or decreased, respectively. The associated flag is set to ensure that LSP(i) will be considered in Phase 2. D_new's and D_old's of the rest of the LSPs are updated with the latest information. All computations should check for illegal values. 2.7 Time Scales Since MATE relies on measurements to estimate LSP characteristics, it is important to understand the time scale each phase typically operates. After the initialization in Phase 1, MATE alternates in Phase 2, Phase 3 and Phase 4. MATE is designed to execute in Phase 2 on the order of 1 minute. The effectiveness in this phase depends critically on the stationarity of the traffic. Recent measurements indicate that traffic in the wide area network is stationary in at least a 5-minute interval [4]. Phase 3 is intended to detect a Widjaja & Elwalid Expired in six months [Page 9] Internet Draft MPLS Adaptive Traffic Engineering August 1998 persistent and appreciable change in the network after the algorithm has stabilized in Phase 2. In general, Phase 3 operates at a much longer time scale (on the order of 30 minutes or longer). Phase 4 is executed to update the State Table immediately after Phase 3. Phase 4 typically operates on the order of a second. 2.8 Monotonous Property of Delay versus Load To cope with "noise" in the cross/background traffic and measurement errors, we describe some improvements to the above algorithm by using the property of delay-load curve. It is possible for the computed mean delay to violate the monotonous property of delay versus load. A violation may occur in two cases. In the first case, the mean delay decreases when traffic is shifted into an LSP. In another case, the mean delay increases when traffic is shifted out of an LSP. The violation can be noted by associating an attribute flag to each delay variable in the state table. In order to correct an anomaly due to violation, MATE always repeats the measurement on the pair that is involved in violation until there is no more violation detected. In the repetition, traffic is shifted back and forth between the pair until no more violation is detected in each LSP. 2.9 Inflection Another observation is that the mean delay is typically a convex function of the offered load, except beyond a certain point (called the inflection point) where the LSP becomes heavily congested. When MATE encounters an inflection point during the iteration pro- cess, it should adjust D_old so that the higher derivative between the previous and the current iterations is used. This decision makes the derivative estimate more conservative when the inflection point is due to heavy congestion. 2.10 Dynamic Shift The number of bins to be shifted (Inc) can be dynamically adjusted as follows. If one of the LSPs to be shifted consists of an LSP whose new mean delay was invalidated previously due to a violation, then the number of bins to be shifted will be doubled from the previous value, subject to the maximum number of bins that can be shifted. Otherwise, the number of bins to be shifted will be halfed from the Widjaja & Elwalid Expired in six months [Page 10] Internet Draft MPLS Adaptive Traffic Engineering August 1998 previous value, subject to the minimum number of bins that is allowed. 2.11 Lossy Case The base algorithm described in the previous section assumes an ideal case where there is no packet loss. This section incorporates the effect of packet loss on the algorithm by adjusting the derivative with a loss factor. MPLS traffic is likely to be dominated by TCP applications, in which case the packet loss inside an MPLS domain should be tightly con- trolled since TCP reacts to packet losses. It is well-known that TCP achieves bandwidth that is inversely proportional to the square root of the packet loss probability (p), under idealized condition. Note that this relationship is typically reasonable only in a certain regime of p (roughly, p is not too close to zero or one). We use a typical delay-load function which is convex and increasing in load. The delay-load function blows up when the arrival rate approaches the service rate, which can be approximated by the simple formula: D = A / (mu - lambda), (1) where mu is the offered traffic, lambda is the average service rate, and A is some constant. Most queueing systems would have the same factor in the denominator as the dominant one. The derivative of the delay with respect to the offered traffic can then be computed by D' = A / (mu - lambda)^2. (2) Since the available bandwidth for the lossless case is BW_lossless = mu - lambda, (3) we can relate the lossless available bandwidth to the derivative by BW_lossless^2 = A / D'. (4) Because the bandwidth achieved by TCP is reduced by a factor of square root of p when there are packet losses, the available bandwidth for the lossy case can be correspondingly reduced by the same factor. In other words, we can write BW_lossy^2 = BW_lossless^2 * C / p, (5) Widjaja & Elwalid Expired in six months [Page 11] Internet Draft MPLS Adaptive Traffic Engineering August 1998 where C is set to the lowest observable loss probability through the probe packets. If MATE cannot determine the packet loss probability because of its rare occurrences, it sets p to C. Comparing (4) and (5), the lossy available bandwidth can be related to the derivative by BW_lossy^2 = C / (p * D'). (6) Therefore, equalizing (p * D') is equivalent to equalizing the avail- able bandwidth for the lossy case. It is interesting to see that one may define another metric called Power which is defined as the ratio of throughput to delay. If the objective is to equalize power, one will find that this is equivalent to equalizing (p * D'). For the lossy case, Phase 2 of the algorithm stops when the weighted delay increases. In Phase 3, it is not possible to measure the packet loss rate since the probe traffic is low. Instead, a lost probe packet is accounted for by assigning D(i) to the maximum measurable delay. 3.0 Protocol Specification To be added. 4.0 Security Considerations Security considerations are not addressed in the present document. 5.0 Acknowledgement The authors thank Cheng Jin for implementing MATE and suggesting improvements, and Dimitri Stiliadis and Eric Gray for providing insightful comments. 6.0 References [1] R. Callon, P. Doolan, N. Feldman, A. Fredette, G. Swallow and A. Viswanathan, Widjaja & Elwalid Expired in six months [Page 12] Internet Draft MPLS Adaptive Traffic Engineering August 1998 "A Framework for Multiprotocol Label Switching," Internet Draft , Nov 1997. [2] E.C. Rosen, A. Viswanathan, and R. Callon, "Multiprotocol Label Switching Architecture," Internet Draft , Mar 1998. [3] D.O. Awduche et. al., "Requirements for Traffic Engineering over MPLS," Internet Draft , April 1998. [4] K. Thompson, G.J. Miller, and R. Wilder, "Wide-Area Internet Traffic Patterns and Characteristics, " IEEE Networks, vol. 6, no. 6, Dec. 1997. [5] T. Li and Y. Rekhter, "Provider Architecture for Differentiated Services and Traffic Engineering (PASTE)," Internet Draft , Jan 1998. [6] C. Villamizar, "OSPF Optimized Multipath (OSPF-OMP)," Internet Draft , Mar 1998. [7] S. Keshav, "A Control-Theoretic Approach to Flow Control," Proc. SIGCOMM'91, pp.3-15, Sep. 1991. [8] R.L. Carter and M.E. Crovella, "Measuring Bottleneck Link Speed in Packet-Switched Networks," Technical Report, BU-CS-96-006, Boston University, Mar. 1996. Author Information: Indra Widjaja Fujitsu Network Communications 4403 Bland Road Raleigh, NC 27609, USA Phone: 919 790 2037 Email: indra.widjaja@fnc.fujitsu.com Anwar Elwalid Bell Labs Lucent Technologies Murray Hill, NJ 07974, USA Phone: 908 582-7589 Email: anwar@lucent.com Widjaja & Elwalid Expired in six months [Page 13]