Internet Engineering Task Force Cheng Jin INTERNET DRAFT David X. Wei draft-jin-wei-low-tcp-fast-01.txt Steven H. Low Expiration Date: December 31, 2003 Caltech June, 2003 FAST TCP for High-Speed Long-Distance Networks Status of this Memo This document is an Internet-Draft and is subject to all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This document proposes FAST TCP, a modification to the TCP congestion control algorithm for high-speed long-distance networks. The AIMD congestion control algorithm in the current TCP has equilibrium and dynamic problems that become more and more severe as bandwidth-delay product increases. FAST TCP uses feedback information, in addition to packet loss, to address these problems in order to stabilize a network around a fair and efficient operating point. 1. Introduction This document proposes FAST TCP, a modification to the TCP congestion control algorithm for high-speed long-distance connections. Extensive simulations and live experiments have shown that the current TCP cannot sustain high throughput for extended periods of time in high-speed long-distance networks using standard packet size. By the current TCP, we mean a TCP implementation that is based on Reno TCP, with or without enhancements. Even though the original scheme has been improved by several enhancements, including RFC 1423 (high-performance extensions), RFC 2018 (SACK), RFC 2582 (New Reno), RFC 2883 (D-SACK), and RFC 2988 (RTO calculation), the basic FAST FAST TCP for High-Speed Long-Distance Networks [Page 1] draft-jwl-tcp-fast-01.txt July 2003 additive-increase-multiplicative-decrease (AIMD) algorithm of Reno remains unchanged. While it has performed remarkably well and is generally believed to have prevented congestion collapse as the Internet scaled up by orders of magnitude in size, speed, load, and connectivity, it is also well-known that as the bandwidth-delay product of a network continues to grow, the current TCP will eventually become a performance bottleneck. In this document, we explain the main difficulties of the current TCP at large windows that motivate FAST TCP. We specify the window control algorithm in FAST TCP and comment on parameter setting and fairness properties of FAST TCP. More details on FAST TCP, including experimental evaluation, are presented in [JWL03]. This document solicits comments and suggestions on FAST TCP from the IETF community. 2. Motivation A congestion control algorithm can be understood at both the packet and the flow level [JWL03]. Historically for Reno TCP, packet-level implementation (AIMD) was first designed. The resulting flow-level properties, such as fairness, performance, and stability, were then investigated as an afterthought. For instance, the well-known 1/sqrt(p) formula, which relates TCP's equilibrium window and loss probability, expresses an equilibrium property at the flow level. In contrast, the packet-level designs of HSTCP, STCP, and FAST TCP are explicitly guided by flow-level goals. There are four main difficulties with the current TCP at large window sizes, due to equilibrium and dynamic problems at both the packet and flow level. The current TCP response function requires that an end-to-end path maintains an exceedingly small equilibrium loss probability, in order to sustain a large window. Such a small loss probability, e.g., to sustain a throughput of 10Gbps or higher, may be difficult to achieve even in optical fiber links, and more so on paths that may drop or corrupt packets for reasons other than congestion. At large window sizes (in excess of 10,000 packets), halving window on a loss event is too drastic, and increasing window by one packet per Round-Trip Time (RTT) is too conservative. This can lead to oscillation and underutilization at a bottleneck link. Note that there are two sources of window oscillations. One is at the packet level where the use of binary congestion signal (packet loss) necessarily leads to window oscillation, and Reno's parameters worsen the situation as bandwidth-delay product increases. The packet-level oscillation is eliminated in equation-based approach, such as TFRC [FHPW00], which uses multi-bit feedback. FAST FAST TCP for High-Speed Long-Distance Networks [Page 2] draft-jwl-tcp-fast-01.txt July 2003 The second source of oscillation is due to instability at the flow level, and can be reduced only with accurate estimation of congestion measure and a stable design of the flow dynamics. Recent efforts try to address these problems at the TCP sender by increasing window more aggressively between consecutive packet-loss events and decreasing window less drastically on each packet-loss event. HSTCP [Flo02] uses AIMD algorithm with the AI and MD parameters that are increasing functions of the current window size. Scalable TCP [Kel02] uses MIMD with constant parameters. Both proposals rely only on packet loss to adjust congestion window. Another approach, taken by XCP [KHR02], uses explicit feedback from routers for congestion control. FAST TCP addresses these problems by using queueing delay, in addition to packet loss, to assess congestion and adjust window. When congestion is mild, queueing delay is the dominant congestion signal, and FAST TCP operates around an equilibrium queueing delay that is strictly less than the maximum queueing delay. When congestion becomes severe, packet loss becomes the dominant congestion signal, and FAST TCP reduces the congestion to bring the system back to a mildly congested regime where it can again stabilize around a (possibly different) target queueing delay. Queueing delay has two advantages as a congestion measure. First, it can be more accurately estimated than loss probability both because packet losses in networks with large bandwidth-delay product are rare events (probability on the order 10^{-8} or smaller), and because loss samples provide coarser information than delay samples. Indeed, each measurement of packet loss (whether a packet is lost or not) provides one bit of information, whereas each measurement of queueing delay provides multi-bit information. This allows an equation-based implementation to stabilize in some steady state with a target fairness and high utilization. Second, the dynamics of queueing delay seems to have the right scaling with respect to network capacity that helps maintain stability as a network scales up in capacity [PDL01]. When ECN becomes available, FAST TCP can be extended to use ECN marking to replace, or supplement, queueing delay and packet loss as the congestion measure. This permits stable operations with high utilization and negligible packet loss and queueing delay. 3. Design Guidelines of FAST FAST TCP aims to achieve the following goals: * Stable equilibrium whenever possible. * Well-defined fairness properties. * High throughput and link utilization. FAST FAST TCP for High-Speed Long-Distance Networks [Page 3] draft-jwl-tcp-fast-01.txt July 2003 * Making use of additional feedbacks such as ECN when available. subject to the following constraints: * Only modifying the TCP sender. * Friendliness to the current TCP. * Requiring no cooperation from routers or receivers. 4. Window Control Algorithm Without explicit feedback from routers, only two kinds of signals are available for congestion control: queueing delay and packet loss. FAST TCP uses equation-based control with queueing delay, and multiplicative decrease with packet loss. 4.1 Reacting to Delay The goal of responding to queueing delay is to maintain a stable queue in the bottleneck router, instead of periodically inducing packet losses by overflowing the queue, in order to sustain high link utilization. FAST TCP uses queueing delay for congestion control whenever reliable RTT measurements are available. Throughout our discussion in this section, we assume congestion windows are maintained in the unit of packet to simplify discussion. Congestion windows that are maintained in bytes can be converted into packet-based windows by dividing windows by path MTU (PMTU). We will first explain the variables used in our equation-based congestion control. For each source, let * w_current, w_old, w_new denote the current congestion window size, the congestion window size one avgRTT ago, and the new target congestion window size, respectively; * avgRTT denote the average RTT computed using exponential averaging with a weight described later in Section 5.2; * baseRTT denote the minimum of the observed instantaneous RTTs; * q = avgRTT - baseRTT denote an estimate of the current queueing delay; * alpha denote a parameter, which is a non-negative number of packets described later in Section 5.4. Then the new target window w_new is calculated every RTT as follows: w_new = 1/2{ [w_old * baseRTT / avgRTT] + alpha + w_current } Note that each update increases the window by at most alpha/2. FAST FAST TCP for High-Speed Long-Distance Networks [Page 4] draft-jwl-tcp-fast-01.txt July 2003 If pacing of packet transmission is unavailable, the implementation should increment or decrement the congestion window smoothly over one RTT to the new target window w_new. If pacing of packet transmission is available, the implementation may set congestion window to w_new directly. A reference implementation to update congestion window from w_current toward w_new is as follows: * set num_ack = abs( w_current / [w_new-w_current] ); * if w_new > w_current: during the next window, increase congestion window by one packet for every max{num_ack, 1} acked packet(s); * if w_new < w_current: during the next window, decrease congestion window by one packet for every max{num_ack, 2} acked packets; The implementation should always be less aggressive than slow start, and should always be less drastic than rate-halving. In [JWL03], we prove that this algorithm converges to an equilibrium exponentially fast in the absence of feedback delay, and present extensive experimental results on dummynet to illustrate its stability and performance in the presence of large delays. 4.2 Reacting to Loss The goal in reacting to packet loss is to back off packet transmission quickly when severe congestion occurs. Subsequently, a system can return to a regime where reliable RTT measurements are available for the window adjustment algorithm described in Section 4.1. When a source detects a packet loss, the source should follow RFC 2582 (New Reno) to enter Fast Retransmission/Fast Recovery. The source may use other compatible improvements, e.g., rate-halving, in an implementation. Before exiting Fast Recovery, the source should not react to queueing delay unless it is certain that the delay measurements are reliable. By reliable, we mean that the delay measurements are from newly transmitted packets that would be SACK'd during recovery, and furthermore, we will react to queueing delay only after we have collected enough samples, e.g., 30% of the congestion window size at the beginning of the current loss event. 5 Measurement and Parameter Setting The algorithm that responds to queueing delay requires the following quantities: w_old, baseRTT, and q. Since q = avgRTT - baseRTT, FAST FAST TCP for High-Speed Long-Distance Networks [Page 5] draft-jwl-tcp-fast-01.txt July 2003 we need to measure w_old, avgRTT, and baseRTT. 5.1 RTT and w_old Measurement The implementation is encouraged to use the most accurate time-stamp available on an implementation platform. For example, In Linux 2.4, gettimeofday() is used to obtain time resolution down to several microseconds. Clearly, there is a trade-off between time precision and the overhead needed to achieve such precision. When a packet is ready for transmission, the implementation should record a time-stamp in the copy of the packet in the retransmission queue. The implementation should also record the current congestion window in the copy of the packet, which we call cwnd_stamp. At the time when a packet is acknowledged, the implementation should compare the current time and the time-stamp in the copy in the retransmission queue, and compute the RTT sample as the difference between the two values. The implementation should also set w_old to cwnd_stamp for the acknowledged packet. 5.2 avgRTT Computation In [Jac88], the average weight for Exponential Weighted Moving Average (EWMA) is 1/8. For high-speed long-distance network, smaller average weight is recommended since, with a large window, successive RTT samples capture congestion information over a timescale that is much smaller than a RTT. Average RTT values should capture the state of a network during the last window. We suggest an average weight that is proportional to the reciprocal of cwnd: weight = min{ 3/cwnd, 1/8 } The average RTT is updated on the receipt of a new RTT sample, as follows: avgRTT_new = (1-weight)*avgRTT_old + weight*RTT 5.3 baseRTT Computation The implementation may use the minimum of observed RTT samples since the start of a connection as baseRTT, as in TCP Vegas [BPe95]. This solution is not robust against possible routing changes during the lifetime of a connection. It may also incur estimation errors in a dynamic sharing environment where the minimum observed RTT of a new FAST TCP flow may include queueing delay produced by packets of existing flows. [LPW02] shows that estimation errors in baseRTT will skew the overall fairness in a network, but estimation errors do not seem to upset network stability. Even though simulations using infinite router buffers indeed confirm the presence of such errors and the resulting skewed fairness, in a dynamic sharing environment, our experiments in real networks and in dummynet with finite buffer suggest that it may not be a serious problem in FAST FAST TCP for High-Speed Long-Distance Networks [Page 6] draft-jwl-tcp-fast-01.txt July 2003 practice [JWL03,JWL+03]. Further research is needed to fully address the issue of baseRTT computation when network condition changes. 5.4 Setting alpha The parameter alpha, which may be different for different TCP flows, affects both the equilibrium and dynamic behaviors of a network. It specifies the total number of packets a single FAST connection tries to maintain in one or more queues in its path. This defines the equilibrium for each FAST connection. For n FAST flows sharing the same bottleneck router using the same alpha value, each will get 1/n of the bottleneck bandwidth. A buffer of size at least n*alpha packets at the router is required to maintain equilibrium. The parameter alpha also determines the window increment when queueing delay is zero, so it may be used to tune the aggressiveness of the window update function. The implementation may use a mapping table from bandwidths (x = cwnd/RTT) to alpha values. For example, to maintain about 2.5 ms of queueing delay for each FAST connection, alpha may be set according to link capacity by the following table: x alpha ------------------ <= 0.1 Gbps 20 1 Gbps 200 2.5 Gbps 500 10 Gbps 2000 Clearly, to scale with network capacity, an automatic way to set the alpha value is necessary. We are researching various ways to adaptively tune the alpha parameter on time scales larger than one RTT. 6. Fairness 6.1 Fairness among FAST TCP flows It is shown in [JWL03] that FAST TCP has a log utility function: U(x) = alpha * log(x) Hence, a network with only FAST TCP flows achieve proportional fairness under no congestion or mildly congested situations--- when packet loss occurs infrequently. Under severely congested situations, packet loss becomes the dominant signal, and AIMD in FAST should provide comparable fairness as current TCP with identical flows. FAST FAST TCP for High-Speed Long-Distance Networks [Page 7] draft-jwl-tcp-fast-01.txt July 2003 6.2 Friendliness to Current TCP Flows Interaction between FAST TCP with the current TCP is an important issue that is still under study. The alpha parameter in FAST TCP determines both the equilibrium bandwidth share and the aggressiveness during the additive increase phase when queueing delay is zero. Friendliness of FAST TCP to the current TCP can be tuned by appropriately selecting the alpha value. The challenge is to tune it automatically and in a decentralized manner. Another possible feature is to make FAST TCP behave exactly like the current TCP when congestion window is small so that it is compatible with the current TCP in low speed networks, similar to the approach taken in HSTCP [Flo02]. In this case, the window adjustment algorithm of Section 4.1 is used only when the congestion window size of a connection exceeds a threshold. 5. Security Considerations This proposal makes no changes to the underlying security of TCP. 6. Acknowledgment We are grateful to Guy Almes, Allison Mankin, Allyn Romanow, Stanislav Shunalov, and Lixia Zhang for encouraging us to work on this Internet Draft and their helpful advices. Guy Almes also provided very useful feedbacks on an earlier draft. 7. References [BPe95] L. Brakmo and L. Peterson. "TCP Vegas: end to end congestion avoidance on a global Internet", IEEE Journal on Selected Areas in Communication, 13(8):1465-1480, October 1995 [FHPW00] S. Floyd, M. Handley, J. Padhye, and J. Widmer. "Equation-based congestion control for unicast applications". ACM SIGCOMM, September 2000. [Flo02] S. Floyd. HighSpeed TCP Web Page, http://www.icir.org/floyd/hstcp.html [Jac88] V. Jacobson. "Congestion Control and Avoidance", ACM SIGCOMM, August 1988. [JWL03] C. Jin, D. X. Wei and S. H. Low. "FAST TCP: motivation, architecture, algorithms, performance", submitted for publication, June 2003. [JWL+03] C. Jin, D. X. Wei, S. H. Low, G. Buhrmaster, J. Bunn, D. H. Choe, R. L. A. Cottrell, J. C. Doyle, W. Feng, O. Martin, H. Newman, F. Paganini, S. Ravot, S. Singh. "FAST TCP: from theory to experiments", submitted for publication, FAST FAST TCP for High-Speed Long-Distance Networks [Page 8] draft-jwl-tcp-fast-01.txt July 2003 April 1, 2003. http://netlab.caltech.edu/FAST/ [Kel02] Tom Kelly, "Scalable TCP: Improving Performance in HighSpeed Wide Area Networks", February 2003. http://www-lce.eng.cam.ac.uk/~ctk21/scalable/ [KHR02] D. Katabi, M. Handley and C. Rohrs. "Congestion Control for High Bandwidth-Delay Product Networks", ACM SIGCOMM 2002. [LPW02] S. H. Low, L. L. Peterson and L. Wang. "Understanding Vegas: a duality model", Journal of ACM, 49(2):207-235, March 2002 http://netlab.caltech.edu [PDL02] F. Paganini, J. C. Doyle and S. H. Low. "Scalable laws for stable network congestion control", Proceedings of Conference on Decision and Control, December 2001. http://www.ee.ucla.edu/~paganini 8. Author's Address Cheng Jin Phone: +1 (626) 395-8820 E-mail: chengjin@cs.caltech.edu David X. Wei Phone: +1 (626) 395-3555 E-mail: weixl@cs.caltech.edu Steven H. Low Phone: +1 (626) 395-6767 E-mail: slow@caltech.edu Caltech http://netlab.caltech.edu 9. Copyright Notice The IETF has been notified of intellectual property rights claimed in regard to some or all of the specification contained in this document. For more information consult the on-line list of claimed rights. 10. Expiration Date This document will expire on December 31, 2003. FAST FAST TCP for High-Speed Long-Distance Networks [Page 9]