Internet Engineering Task Force                                Cheng Jin
INTERNET DRAFT                                              David X. Wei
draft-jin-wei-low-tcp-fast-01.txt                                 Steven H. Low
Expiration Date: December 31, 2003                               Caltech
                                                              June, 2003


         FAST TCP for High-Speed Long-Distance Networks


                         Status of this Memo

   This document is an Internet-Draft and is subject to
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet- Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This document proposes FAST TCP, a modification to the TCP congestion
   control algorithm for high-speed long-distance networks. The 
   AIMD congestion control algorithm in the current TCP has 
   equilibrium and dynamic problems that become more and more severe 
   as bandwidth-delay product increases.  FAST TCP uses feedback
   information, in addition to packet loss, to address these problems in 
   order to stabilize a network around a fair and efficient
   operating point.
   
1. Introduction

   This document proposes FAST TCP, a modification to the TCP congestion 
   control algorithm for high-speed long-distance connections. 
   Extensive simulations and live experiments have shown that the
   current TCP cannot sustain high throughput for extended periods of
   time in high-speed long-distance networks using standard packet size.  
   By the current TCP, we mean a TCP implementation that is based on 
   Reno TCP, with or without enhancements.  Even though the original
   scheme has been improved by several enhancements, including RFC 1423
   (high-performance extensions), RFC 2018 (SACK), RFC 2582 (New Reno),
   RFC 2883 (D-SACK), and RFC 2988 (RTO calculation), the basic

FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 1]
draft-jwl-tcp-fast-01.txt                                      July 2003

   additive-increase-multiplicative-decrease (AIMD) algorithm of Reno
   remains unchanged.  While it has performed remarkably well and is
   generally believed to have prevented congestion collapse as the
   Internet scaled up by orders of magnitude in size, speed, load, and
   connectivity, it is also well-known that as the bandwidth-delay
   product of a network continues to grow, the current TCP will
   eventually become a performance bottleneck.

   In this document, we explain the main difficulties of the current TCP
   at large windows that motivate FAST TCP.   We specify the window
   control algorithm in FAST TCP and comment on parameter setting and
   fairness properties of FAST TCP.  More details on FAST TCP, including
   experimental evaluation, are presented in [JWL03].

   This document solicits comments and suggestions on FAST TCP from
   the IETF community.


2. Motivation

   A congestion control algorithm can be understood at both the packet
   and the flow level [JWL03].  Historically for Reno TCP, packet-level
   implementation (AIMD) was first designed.  The resulting flow-level
   properties, such as fairness, performance, and stability, were
   then investigated as an afterthought.  For instance, the well-known
   1/sqrt(p) formula, which relates TCP's equilibrium window and loss
   probability, expresses an equilibrium property at the flow level.
   In contrast, the packet-level designs of HSTCP, STCP, and FAST TCP
   are explicitly guided by flow-level goals.
   
   There are four main difficulties with the current TCP at large
   window sizes, due to equilibrium and dynamic problems at both the
   packet and flow level.

   The current TCP response function requires that an end-to-end
   path maintains an exceedingly small equilibrium loss probability, 
   in order to sustain a large window.  Such a small loss probability, 
   e.g., to sustain a throughput of 10Gbps or higher, may be difficult 
   to achieve even in optical fiber links, and more so on paths that 
   may drop or corrupt packets for reasons other than congestion.   

   At large window sizes (in excess of 10,000 packets), halving 
   window on a loss event is too drastic, and increasing window 
   by one packet per Round-Trip Time (RTT) is too conservative.  
   This can lead to oscillation and underutilization at a 
   bottleneck link.

   Note that there are two sources of window oscillations.  One
   is at the packet level where the use of binary congestion signal
   (packet loss) necessarily leads to window oscillation, and Reno's
   parameters worsen the situation as bandwidth-delay product increases.
   The packet-level oscillation is eliminated in equation-based
   approach, such as TFRC [FHPW00], which uses multi-bit feedback.
   

FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 2]
draft-jwl-tcp-fast-01.txt                                      July 2003

   The second source of oscillation is due to instability at the flow
   level, and can be reduced only with accurate estimation of congestion
   measure and a stable design of the flow dynamics.  
      
   Recent efforts try to address these problems at the TCP sender by
   increasing window more aggressively between consecutive packet-loss
   events and decreasing window less drastically on each packet-loss
   event.  HSTCP [Flo02] uses AIMD algorithm with the AI and MD
   parameters that are increasing functions of the current window size.
   Scalable TCP [Kel02] uses MIMD with constant parameters.  Both
   proposals rely only on packet loss to adjust congestion window.
   Another approach, taken by XCP [KHR02], uses explicit feedback from
   routers for congestion control.

   FAST TCP addresses these problems by using queueing delay, 
   in addition to packet loss, to assess congestion and adjust 
   window.  When congestion is mild, queueing delay is the dominant 
   congestion signal, and FAST TCP operates around an equilibrium 
   queueing delay that is strictly less than the maximum queueing 
   delay.  When congestion becomes severe, packet loss becomes the 
   dominant congestion signal, and FAST TCP reduces the congestion 
   to bring the system back to a mildly congested regime where it 
   can again stabilize around a (possibly different) target queueing 
   delay.

   Queueing delay has two advantages as a congestion measure.  First,
   it can be more accurately estimated than loss probability both
   because packet losses in networks with large bandwidth-delay product
   are rare events (probability on the order 10^{-8} or smaller), and
   because loss samples provide coarser information than delay samples.
   Indeed, each measurement of packet loss (whether a packet is lost or
   not) provides one bit of information, whereas each measurement of
   queueing delay provides multi-bit information.  This allows an
   equation-based implementation to stabilize in some steady state with
   a target fairness and high utilization.  Second, the dynamics of
   queueing delay seems to have the right scaling with respect to
   network capacity that helps maintain stability as a network scales
   up in capacity [PDL01].
   
   When ECN becomes available, FAST TCP can be extended to use ECN
   marking to replace, or supplement, queueing delay and packet loss as
   the congestion measure.  This permits stable operations with high
   utilization and negligible packet loss and queueing delay.

3. Design Guidelines of FAST

   FAST TCP aims to achieve the following goals:

   * Stable equilibrium whenever possible.

   * Well-defined fairness properties.
   
   * High throughput and link utilization.


FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 3]
draft-jwl-tcp-fast-01.txt                                      July 2003

   * Making use of additional feedbacks such as ECN when available.    

   subject to the following constraints:

   * Only modifying the TCP sender. 

   * Friendliness to the current TCP.

   * Requiring no cooperation from routers or receivers. 
     
4. Window Control Algorithm

   Without explicit feedback from routers, only two kinds of signals
   are available for congestion control: queueing delay and packet loss.
   FAST TCP uses equation-based control with queueing delay, and
   multiplicative decrease with packet loss. 
   
4.1 Reacting to Delay
   
   The goal of responding to queueing delay is to maintain a stable
   queue in the bottleneck router, instead of periodically inducing 
   packet losses by overflowing the queue, in order to sustain high
   link utilization.  FAST TCP uses queueing delay for congestion
   control whenever reliable RTT measurements are available.
   
   Throughout our discussion in this section, we assume congestion
   windows are maintained in the unit of packet to simplify
   discussion.  Congestion windows that are maintained in bytes can
   be converted into packet-based windows by dividing windows by
   path MTU (PMTU).

   We will first explain the variables used in our equation-based
   congestion control.  For each source, let 

      * w_current, w_old, w_new denote the current congestion window
        size, the congestion window size one avgRTT ago, and the new
        target congestion window size, respectively;

      * avgRTT denote the average RTT computed using exponential
        averaging with a weight described later in Section 5.2;

      * baseRTT denote the minimum of the observed instantaneous RTTs;

      * q = avgRTT - baseRTT denote an estimate of the current
        queueing delay;

      * alpha denote a parameter, which is a non-negative number of
        packets described later in Section 5.4.

   Then the new target window w_new is calculated every RTT as follows:
   
      w_new = 1/2{ [w_old * baseRTT / avgRTT] + alpha + w_current }
   
   Note that each update increases the window by at most alpha/2.

FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 4]
draft-jwl-tcp-fast-01.txt                                      July 2003


   If pacing of packet transmission is unavailable, the implementation
   should increment or decrement the congestion window smoothly 
   over one RTT to the new target window w_new.  If pacing of packet 
   transmission is available, the implementation may set congestion
   window to w_new directly.

   A reference implementation to update congestion window from w_current
   toward w_new is as follows:

   * set num_ack = abs( w_current / [w_new-w_current] );

   * if w_new > w_current:
     during the next window, increase congestion window by one packet 
     for every max{num_ack, 1} acked packet(s);

   * if w_new < w_current:
     during the next window, decrease congestion window by one packet 
     for every max{num_ack, 2} acked packets;

   The implementation should always be less aggressive than slow start,
   and should always be less drastic than rate-halving. 

   In [JWL03], we prove that this algorithm converges to an 
   equilibrium exponentially fast in the absence of feedback
   delay, and present extensive experimental results on 
   dummynet to illustrate its stability and performance in
   the presence of large delays.
   
4.2 Reacting to Loss
   
   The goal in reacting to packet loss is to back off packet
   transmission quickly when severe congestion occurs.  Subsequently,
   a system can return to a regime where reliable RTT measurements are
   available for the window adjustment algorithm described in
   Section 4.1.
   
   When a source detects a packet loss, the source should follow RFC
   2582 (New Reno) to enter Fast Retransmission/Fast Recovery. The
   source may use other compatible improvements, e.g., rate-halving, in
   an implementation.

   Before exiting Fast Recovery, the source should not react to
   queueing delay unless it is certain that the delay measurements are
   reliable.  By reliable, we mean that the delay measurements are from
   newly transmitted packets that would be SACK'd during recovery, and
   furthermore, we will react to queueing delay only after we have
   collected enough samples, e.g., 30% of the congestion window size at
   the beginning of the current loss event. 
   
5 Measurement and Parameter Setting
   
   The algorithm that responds to queueing delay requires the following 
   quantities: w_old, baseRTT, and q. Since q = avgRTT - baseRTT, 

FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 5]
draft-jwl-tcp-fast-01.txt                                      July 2003

   we need to measure w_old, avgRTT, and baseRTT.
   
5.1 RTT and w_old Measurement

   The implementation is encouraged to use the most accurate time-stamp 
   available on an implementation platform. For example, In Linux 2.4, 
   gettimeofday() is used to obtain time resolution down to several
   microseconds.  Clearly, there is a trade-off between time precision
   and the overhead needed to achieve such precision.
   
   When a packet is ready for transmission, the implementation should
   record a time-stamp in the copy of the packet in the retransmission
   queue. The implementation should also record the current congestion
   window in the copy of the packet, which we call cwnd_stamp.
   
   At the time when a packet is acknowledged, the implementation should 
   compare the current time and the time-stamp in the copy in the
   retransmission queue, and compute the RTT sample as the difference
   between the two values. The implementation should also set w_old to
   cwnd_stamp for the acknowledged packet.
   
5.2 avgRTT Computation

   In [Jac88], the average weight for Exponential Weighted Moving Average
   (EWMA) is 1/8. For high-speed long-distance network, smaller average
   weight is recommended since, with a large window, successive RTT
   samples capture congestion information over a timescale that is much
   smaller than a RTT.  Average RTT values should capture the state of a
   network during the last window.   We suggest an average weight that
   is proportional to the reciprocal of cwnd: 

      weight = min{ 3/cwnd, 1/8 }

   The average RTT is updated on the receipt of a new RTT sample,
   as follows:

      avgRTT_new = (1-weight)*avgRTT_old + weight*RTT

5.3 baseRTT Computation

   The implementation may use the minimum of observed RTT samples since
   the start of a connection as baseRTT, as in TCP Vegas [BPe95].

   This solution is not robust against possible routing changes during
   the lifetime of a connection.  It may also incur estimation errors
   in a dynamic sharing environment where the minimum observed RTT of
   a new FAST TCP flow may include queueing delay produced by packets
   of existing flows.   [LPW02] shows that estimation errors in baseRTT
   will skew the overall fairness in a network, but estimation errors
   do not seem to upset network stability.  Even though simulations
   using infinite router buffers indeed confirm the presence of such
   errors and the resulting skewed fairness, in a dynamic sharing
   environment, our experiments in real networks and in dummynet with
   finite buffer suggest that it may not be a serious problem in

FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 6]
draft-jwl-tcp-fast-01.txt                                      July 2003

   practice [JWL03,JWL+03].   Further research is needed to fully
   address the issue of baseRTT computation when network condition
   changes.
   
5.4 Setting alpha 

   The parameter alpha, which may be different for different TCP flows,
   affects both the equilibrium and dynamic behaviors of a network.  

   It specifies the total number of packets a single FAST connection
   tries to maintain in one or more queues in its path.  This defines
   the equilibrium for each FAST connection.  For n FAST flows sharing
   the same bottleneck router using the same alpha value, each will get
   1/n of the bottleneck bandwidth.  A buffer of size at least
   n*alpha packets at the router is required to maintain equilibrium.

   The parameter alpha also determines the window increment when
   queueing delay is zero, so it may be used to tune the aggressiveness
   of the window update function.

   The implementation may use a mapping table from bandwidths
   (x = cwnd/RTT) to alpha values.  For example, to maintain about
   2.5 ms of queueing delay for each FAST connection, alpha may be set
   according to link capacity by the following table:
   
        x       alpha
   ------------------
   <= 0.1 Gbps     20
        1 Gbps    200
      2.5 Gbps    500
       10 Gbps   2000

   Clearly, to scale with network capacity, an automatic way to set the
   alpha value is necessary.  We are researching various ways to
   adaptively tune the alpha parameter on time scales larger than
   one RTT.

6. Fairness

6.1 Fairness among FAST TCP flows


   It is shown in [JWL03] that FAST TCP has a log utility function:

      U(x) = alpha * log(x)

   Hence, a network with only FAST TCP flows achieve proportional
   fairness under no congestion or mildly congested situations---
   when packet loss occurs infrequently.

   Under severely congested situations, packet loss becomes 
   the dominant signal, and AIMD in FAST should provide 
   comparable fairness as current TCP with identical flows. 


FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 7]
draft-jwl-tcp-fast-01.txt                                      July 2003

6.2 Friendliness to Current TCP Flows

   Interaction between FAST TCP with the current TCP is an important
   issue that is still under study.

   The alpha parameter in FAST TCP determines both the equilibrium
   bandwidth share and the aggressiveness during the additive increase
   phase when queueing delay is zero.  Friendliness of FAST TCP to the
   current TCP can be tuned by appropriately selecting the alpha value.
   The challenge is to tune it automatically and in a decentralized
   manner.

   Another possible feature is to make FAST TCP behave exactly like the
   current TCP when congestion window is small so that it is compatible
   with the current TCP in low speed networks, similar to the approach
   taken in HSTCP [Flo02].  In this case, the window adjustment
   algorithm of Section 4.1 is used only when the congestion window size
   of a connection exceeds a threshold.

5. Security Considerations 

   This proposal makes no changes to the underlying security of TCP.

6. Acknowledgment

   We are grateful to Guy Almes, Allison Mankin, Allyn Romanow, 
   Stanislav Shunalov, and Lixia Zhang for encouraging us to
   work on this Internet Draft and their helpful advices.  Guy
   Almes also provided very useful feedbacks on an earlier draft.
	
7. References

   [BPe95] L. Brakmo and L. Peterson. "TCP Vegas: end to end congestion
   avoidance on a global Internet", IEEE Journal on Selected Areas in
   Communication, 13(8):1465-1480, October 1995

   [FHPW00] S. Floyd, M. Handley, J. Padhye, and J. Widmer.
   "Equation-based congestion control for unicast applications".
   ACM SIGCOMM, September 2000.

   [Flo02] S. Floyd. HighSpeed TCP Web Page,
   http://www.icir.org/floyd/hstcp.html

   [Jac88] V. Jacobson. "Congestion Control and Avoidance",
   ACM SIGCOMM, August 1988.

   [JWL03] C. Jin, D. X. Wei and S. H. Low. "FAST TCP: motivation,
   architecture, algorithms, performance", submitted for
   publication, June 2003.

   [JWL+03]  C. Jin, D. X. Wei, S. H. Low, G. Buhrmaster, J. Bunn, 
   D. H. Choe, R. L. A. Cottrell, J. C. Doyle, W. Feng, O. Martin,
   H. Newman, F. Paganini, S. Ravot, S. Singh.
   "FAST TCP: from theory to experiments", submitted for publication,

FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 8]
draft-jwl-tcp-fast-01.txt                                      July 2003

   April 1, 2003.  http://netlab.caltech.edu/FAST/

   [Kel02] Tom Kelly, "Scalable TCP: Improving Performance in HighSpeed
   Wide Area Networks", February 2003.  
   http://www-lce.eng.cam.ac.uk/~ctk21/scalable/
   
   [KHR02] D. Katabi, M. Handley and C. Rohrs.
   "Congestion Control for High Bandwidth-Delay Product Networks",
   ACM SIGCOMM 2002.
   
   [LPW02] S. H. Low, L. L. Peterson and L. Wang. "Understanding Vegas:
   a duality model", Journal of ACM, 49(2):207-235, March 2002
   http://netlab.caltech.edu
   
   [PDL02] F. Paganini, J. C. Doyle and S. H. Low.
   "Scalable laws for stable network congestion control",
   Proceedings of Conference on Decision and Control, December
   2001.  http://www.ee.ucla.edu/~paganini


8. Author's Address 

   Cheng Jin
   Phone: +1 (626) 395-8820
   E-mail: chengjin@cs.caltech.edu

   David X. Wei
   Phone: +1 (626) 395-3555
   E-mail: weixl@cs.caltech.edu

   Steven H. Low
   Phone: +1 (626) 395-6767
   E-mail: slow@caltech.edu

   Caltech
   http://netlab.caltech.edu

9. Copyright Notice

   The IETF has been notified of intellectual property rights
   claimed in regard to some or all of the specification contained
   in this document.  For more information consult the on-line list
   of claimed rights.

10. Expiration Date

   This document will expire on December 31, 2003.


FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 9]