Internet Engineering Task Force Cheng Jin INTERNET DRAFT David X. Wei draft-jin-wei-low-tcp-fast-00.txt Steven H. Low Expiration Date: December 23, 2003 Caltech June, 2003 FAST TCP for High-Speed Long-Distance Networks Status of this Memo This document is an Internet-Draft and is subject to all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This document proposes FAST TCP, a modification to the TCP congestion control algorithm for high-speed long-distance networks. The AIMD congestion control algorithm in the current TCP constrains the size of congestion window under realistic network conditions, resulting in low throughput in high-speed long-distance networks. FAST TCP uses additional feedback information to detect and control network congestion to achieve higher throughput. 1. Motivation This document proposes FAST TCP, a modification to the TCP congestion control algorithm for high-speed long-distance connections. Extensive simulations and live experiments have shown that the current TCP cannot sustain high throughput for extended periods of time in high-speed long-distance networks using standard packet size. By the current TCP, we mean a TCP implementation that is based on TCP Reno and includes RFC 1323 (high-performance extensions), RFC 2018 (SACK), RFC 2582 (New Reno), RFC 2883 (D-SACK), and RFC 2988 (RTO). In such networks, a large congestion window is necessary to achieve high link utilization. This is difficult for two reasons: one concerns with the steady-state property of the FAST FAST TCP for High-Speed Long-Distance Networks [Page 1] draft-jwl-tcp-fast-00.txt June 2003 current TCP and the other with its dynamic property. First, the current TCP response function requires that an end-to-end path maintains an exceedingly small steady-state loss probability, in order to sustain a large window. Such a small loss probability, e.g., to sustain a throughput of 10Gbps or higher, may be difficult to achieve even in optical fiber links, and more so on paths that may drop or corrupt packets for reasons other than congestion. Second, even if such a small loss probability is achieved, it generates packet losses too infrequently for the Additive Increase Multiplicative Decrease (AIMD) dynamics of the current TCP to work effectively. At large window sizes (in excess of 10,000 packets), halving window on a loss event is too drastic, and increasing window by one packet per Round-Trip Time (RTT) is too conservative. This may lead to oscillation and underutilization at a bottleneck link. Recent efforts try to solve this problem at the TCP sender by increasing window more aggressively between consecutive packet-loss events and decreasing window less drastically on each packet-loss event. HSTCP [Flo02] uses AIMD algorithm with the AI and MD parameters that are increasing functions of the current window size. Scalable TCP [Kel02] uses MIMD with constant parameters. Both proposals rely only on packet loss to adjust the window. Another approach, taken by XCP [KHR02], uses explicit feedback from routers for congestion control, while leaving the TCP sender unchanged. FAST TCP addresses the problem with large window by using queueing delay in addition to packet loss to assess congestion and adjust window. When congestion is mild, queueing delay is the dominant congestion signal, and FAST TCP operates around an equilibrium queueing delay that is strictly less than the maximum queueing delay. When congestion becomes severe, packet loss becomes the dominant congestion signal, and FAST TCP reduces the congestion to bring the system back to a mildly congested regime where it can again stabilize around a (possibly different) target queueing delay. When ECN becomes available, FAST TCP can use ECN marking to replace, or supplement, queueing delay and packet loss as the congestion measure. This permits stable operations with high utilization and negligible packet loss and queueing delay. This document solicits comments and suggestions on FAST TCP from the IETF community and associated researchers. 2. Design Guidelines of FAST The FAST TCP aims to achieve the following goals: * Stable equilibrium whenever possible. * Well-defined fairness properties. FAST FAST TCP for High-Speed Long-Distance Networks [Page 2] draft-jwl-tcp-fast-00.txt June 2003 * High throughput and link utilization. * Making use of additional feedbacks such as ECN when available. subject to the following constraints: * Only modifying the TCP sender. * Friendliness to the current TCP. * Requiring no cooperation from routers or receivers. 3. Window Control Algorithm Without explicit feedback from routers, only two kinds of signals are available for congestion control: queueing delay and packet loss. FAST TCP uses equation-based control with queueing delay, and multiplicative decrease with packet loss. 3.1 Reacting to Delay The goal of responding to queueing delay is to maintain a stable queue in the bottleneck router, instead of periodically inducing packet losses by overflowing the queue, in order to sustain high link utilization. FAST TCP uses queueing delay for congestion control whenever reliable RTT measurements are available. Throughout our discussion in this section, we assume congestion windows are maintained in the unit of packets to simplify discussion. Congestion windows that are maintained in bytes can be converted into packet-based windows by dividing the path MTU (PMTU). We will first explain the variables used in our equation-based congestion control. For each source, let * w_current, w_old, w_new denote the current congestion window size, the congestion window size one avgRTT ago, and the new target congestion window size, respectively; * avgRTT denote the average RTT computed using exponential averaging with a weight described later in Section 4.2; * baseRTT denote the minimum of the observed instantaneous RTTs; * q = avgRTT - baseRTT denote an estimate of the current queueing delay; * alpha denote a parameter, which is a non-negative number of packets described later in Section 4.4. Then the new target window w_new is calculated every RTT as follows: FAST FAST TCP for High-Speed Long-Distance Networks [Page 3] draft-jwl-tcp-fast-00.txt June 2003 w_new = 1/2{ [w_old * baseRTT / avgRTT] + alpha + w_current } Note that each update increases the window by at most alpha/2. If pacing of packet transmission is unavailable, the implementation SHOULD increment or decrement the congestion window smoothly over one RTT to the new target window w_new. If pacing of packet transmission is available, the implementation MAY set congestion window to w_new directly. A reference implementation to update congestion window from w_current toward w_new is as follows: * set num_ack = abs( w_current / [w_new-w_current] ); * if w_new > w_current: during the next window, increase congestion window by one packet for every max{num_ack, 1} acked packet(s); * if w_new < w_current: during the next window, decrease congestion window by one packet for every max{num_ack, 2} acked packets; The implementation SHOULD always be less aggressive than slow start, and SHOULD always be less drastic than rate-halving. In [CWL03], we prove that this algorithm converges to an equilibrium (exponentially fast, once avgRTT > baseRTT) and present experimental results on dummynet to illustrate the performance of FAST TCP. 3.2 Reacting to Loss The goal in reacting to packet loss is to back off packet transmission quickly when severe congestion occurs. Subsequently, a system can return to a regime where reliable RTT measurements are available for the window adjustment algorithm described in Section 3.1. When a source detects a packet loss, the source SHOULD follow RFC 2582 (New Reno) to enter Fast Retransmission/Fast Recovery. The source MAY use other compatible improvements, e.g., rate-halving, in an implementation. Before exiting Fast Recovery, the source SHOULD NOT react to queueing delay unless it is certain that the delay measurements are reliable. By reliable, we mean that the delay measurements are from newly transmitted packets that would be SACK'd during recovery, and furthermore, we will react to queueing delay only after we have collected enough samples, e.g., 30% of the congestion window size at the beginning of the current loss event. 4 Measurement and Parameter Setting FAST FAST TCP for High-Speed Long-Distance Networks [Page 4] draft-jwl-tcp-fast-00.txt June 2003 The algorithm that responds to queueing delay requires the following quantities: w_old, baseRTT, and q. Since q = avgRTT - baseRTT, we need to measure w_old, avgRTT, and baseRTT. 4.1 RTT and w_old Measurement The implementation is encouraged to use the most accurate time-stamp available on an implementation platform. For example, In Linux 2.4, gettimeofday() is used to obtain time resolution down to several microseconds. Clearly, there is a trade-off between time precision and the overhead needed to achieve such precision. When a packet is ready for transmission, the implementation SHOULD record a time-stamp in the copy of the packet in the retransmission queue. The implementation SHOULD also record the current congestion window in the copy of the packet, which we call cwnd_stamp. At the time when a packet is acknowledged, the implementation SHOULD compare the current time and the time-stamp in the copy in the retransmission queue, and compute the RTT sample as the difference between the two values. The implementation SHOULD also set w_old to cwnd_stamp for the acknowledged packet. 4.2 avgRTT Computation In [VJ88], the average weight for Exponential Weighted Moving Average (EWMA) is 1/8. For high-speed long-distance network, smaller average weight is recommended since, with a large window, successive RTT samples capture congestion information over a timescale that is much smaller than a RTT. Average RTT values should capture the state of a network during the last window. We suggest an average weight that is proportional to the reciprocal of cwnd: weight = min{ 3/cwnd, 1/8 } The average RTT is updated on the receipt of a new RTT sample, as follows: avgRTT_new = (1-weight)*avgRTT_old + weight*RTT 4.3 baseRTT Computation The implementation MAY use the minimum of observed RTT samples since the start of a connection as baseRTT, as in TCP Vegas [BPe95]. This solution is not robust against possible routing changes during the lifetime of a connection. It may also incur estimation errors in a dynamic sharing environment where the minimum observed RTT of a new FAST TCP flow may include queueing delay produced by packets of existing flows. [LPW02] shows that estimation errors in baseRTT will skew the overall fairness in a network, but estimation errors do not seem to upset network stability. Even though simulations using infinite router buffers indeed confirm the presence of such errors and the resulting skewed fairness, in a dynamic sharing FAST FAST TCP for High-Speed Long-Distance Networks [Page 5] draft-jwl-tcp-fast-00.txt June 2003 environment, our experiments in real networks and in dummynet with finite buffer suggest that it may not be a serious problem in practice [CWL03,CWL+03]. Further research is needed to fully address the issue of baseRTT computation when network condition changes. 4.4 Setting alpha The parameter alpha, which may be different for different TCP flows, affects both the equilibrium and dynamic behaviors of a network. It specifies the total number of packets a single FAST connection tries to maintain in one or more queues in its path. This defines the equilibrium for each FAST connection. For n FAST flows sharing the same bottleneck router using the same alpha value, each will get 1/n of the bottleneck bandwidth. Thus, a buffer of size at least n*alpha packets at the router is required to maintain equilibrium. The parameter alpha also determines the window increment when queueing delay is zero, so it may be used to tune the aggressiveness of the window update function. The implementation MAY use a mapping table from bandwidths (x = cwnd/RTT) to alpha values. For example, to maintain about 2.5 ms of queueing delay for each FAST connection, alpha may be set according to link capacity by the following table: x alpha ------------------ <= 0.1 Gbps 20 1 Gbps 200 2.5 Gbps 500 10 Gbps 2000 Clearly, to scale with network capacity, an automatic way to set the alpha value is necessary. We are researching various ways to adaptively tune the alpha parameter on time scales larger than one RTT. 5. Equilibrium and Dynamic Properties This will be updated in version 01 by June 30, 2003. 6. Fairness 6.1 Fairness among FAST TCP flows It is shown in [CWL03] that FAST TCP has a log utility function: U(x) = alpha * log(x) Hence, a network with only FAST TCP flows achieve proportional fairness under no congestion or mildly congested situations--- when packet loss occurs infrequently. FAST FAST TCP for High-Speed Long-Distance Networks [Page 6] draft-jwl-tcp-fast-00.txt June 2003 Under severely congested situations, packet loss becomes the dominant signal, and AIMD in FAST should provide comparable fairness as current TCP with identical flows. 6.2 Friendliness to Current TCP Flows Interaction between FAST TCP with the current TCP is an important issue that is still under study. The alpha parameter in FAST TCP determines both the equilibrium bandwidth share and the aggressiveness during the additive increase phase when queueing delay is zero. Friendliness of FAST TCP to the current TCP can be tuned by appropriately selecting the alpha value. The challenge is to tune it automatically and in a decentralized manner. Another possible feature is to make FAST TCP behave exactly like the current TCP when congestion window is small so that it is compatible with the current TCP in low speed networks, similar to the approach taken in HSTCP [Flo02]. In this case, the window adjustment algorithm of Section 3.1 is used only when the congestion window size of a connection exceeds a threshold. 5. Security Considerations This proposal makes no changes to the underlying security of TCP. 6. Acknowledgment We are grateful to Guy Almes, Allison Mankin, Allyn Romanow, Stanislav Shunalov, and Lixia Zhang for encouraging us to work on this Internet Draft and their helpful advices. 7. References [BPe95] L. Brakmo and L. Peterson. "TCP Vegas: end to end congestion avoidance on a global Internet", IEEE Journal on Selected Areas in Communication, 13(8):1465-1480, October 1995 [CWL03] C. Jin, D. X. Wei and S. H. Low. "FAST TCP: architecture, algorithms, performance", to be submitted for publication. [CWL+03] C. Jin, D. X. Wei, S. H. Low, G. Buhrmaster, J. Bunn, D. H. Choe, R. L. A. Cottrell, J. C. Doyle, W. Feng, O. Martin, H. Newman, F. Paganini, S. Ravot, S. Singh. "FAST TCP: from theory to experiments", submitted for publication, April 1, 2003. http://netlab.caltech.edu/FAST/ [Flo02] HighSpeed TCP Web Page, http://www.icir.org/floyd/hstcp.html [Jac88] Jacobson, V., "Congestion Control and Avoidance", SIGCOMM 1988, August 1988. FAST FAST TCP for High-Speed Long-Distance Networks [Page 7] draft-jwl-tcp-fast-00.txt June 2003 [Kel02] Tom Kelly, "Scalable TCP: Improving Performance in HighSpeed Wide Area Networks", February 2003. http://www-lce.eng.cam.ac.uk/~ctk21/scalable/ [KHR02] D. Katabi, M. Handley and C. Rohrs. "Congestion Control for High Bandwidth-Delay Product Networks", ACM SIGCOMM 2002. [LPW02] S. H. Low, L. L. Peterson and L. Wang. "Understanding Vegas: a duality model", Journal of ACM, 49(2):207-235, March 2002 8. Author's Address Cheng Jin Phone: +1 (626) 395-8820 E-mail: chengjin@cs.caltech.edu David X. Wei Phone: +1 (626) 395-3555 E-mail: weixl@cs.caltech.edu Steven H. Low Phone: +1 (626) 395-6767 E-mail: slow@cs.caltech.edu Caltech http://netlab.caltech.edu 9. Copyright Notice The IETF has been notified of intellectual property rights claimed in regard to some or all of the specification contained in this document. For more information consult the on-line list of claimed rights. 10. Expiration Date This document will expire on December 23, 2003. FAST FAST TCP for High-Speed Long-Distance Networks [Page 8]