Internet Engineering Task Force                                Cheng Jin
INTERNET DRAFT                                              David X. Wei
draft-jin-wei-low-tcp-fast-00.txt                          Steven H. Low
Expiration Date: December 23, 2003                               Caltech
                                                              June, 2003


         FAST TCP for High-Speed Long-Distance Networks


                         Status of this Memo

   This document is an Internet-Draft and is subject to
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet- Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This document proposes FAST TCP, a modification to the TCP congestion 
   control algorithm for high-speed long-distance networks. The 
   AIMD congestion control algorithm in the current TCP constrains the
   size of congestion window under realistic network conditions,
   resulting in low throughput in high-speed long-distance networks.
   FAST TCP uses additional feedback information to detect and control
   network congestion to achieve higher throughput.
   
1. Motivation

   This document proposes FAST TCP, a modification to the TCP congestion 
   control algorithm for high-speed long-distance connections. 
   Extensive simulations and live experiments have shown that the
   current TCP cannot sustain high throughput for extended periods of
   time in high-speed long-distance networks using standard packet size.  
   By the current TCP, we mean a TCP implementation that is based on TCP
   Reno and includes RFC 1323 (high-performance extensions),
   RFC 2018 (SACK), RFC 2582 (New Reno), RFC 2883 (D-SACK), and
   RFC 2988 (RTO).  In such networks, a large congestion window is
   necessary to achieve high link utilization.  This is difficult for
   two reasons: one concerns with the steady-state property of the

FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 1]
draft-jwl-tcp-fast-00.txt                                      June 2003

   current TCP and the other with its dynamic property.
 
   First, the current TCP response function requires that an end-to-end
   path maintains an exceedingly small steady-state loss probability, 
   in order to sustain a large window.  Such a small loss probability, 
   e.g., to sustain a throughput of 10Gbps or higher, may be difficult 
   to achieve even in optical fiber links, and more so on paths that 
   may drop or corrupt packets for reasons other than congestion.   

   Second, even if such a small loss probability is achieved, it
   generates packet losses too infrequently for the Additive Increase
   Multiplicative Decrease (AIMD) dynamics of the current TCP 
   to work effectively.  At large window sizes (in excess of 10,000
   packets), halving window on a loss event is too drastic, and
   increasing window by one packet per Round-Trip Time (RTT) is too
   conservative.  This may lead to oscillation and underutilization
   at a bottleneck link.
   
   Recent efforts try to solve this problem at the TCP sender by
   increasing window more aggressively between consecutive packet-loss
   events and decreasing window less drastically on each packet-loss
   event.  HSTCP [Flo02] uses AIMD algorithm with the AI and MD
   parameters that are increasing functions of the current window size.
   Scalable TCP [Kel02] uses MIMD with constant parameters.  Both
   proposals rely only on packet loss to adjust the window.  Another
   approach, taken by XCP [KHR02], uses explicit feedback from routers
   for congestion control, while leaving the TCP sender unchanged.

   FAST TCP addresses the problem with large window by using 
   queueing delay in addition to packet loss to assess congestion
   and adjust window.  When congestion is mild, queueing delay 
   is the dominant congestion signal, and FAST TCP operates 
   around an equilibrium queueing delay that is strictly less
   than the maximum queueing delay.  When congestion becomes severe,
   packet loss becomes the dominant congestion signal, and
   FAST TCP reduces the congestion to bring the system
   back to a mildly congested regime where it can again stabilize
   around a (possibly different) target queueing delay.
   
   When ECN becomes available, FAST TCP can use ECN marking to replace,
   or supplement, queueing delay and packet loss as the congestion
   measure.  This permits stable operations with high utilization and
   negligible packet loss and queueing delay.
   
   This document solicits comments and suggestions on FAST TCP from
   the IETF community and associated researchers.

2. Design Guidelines of FAST

   The FAST TCP aims to achieve the following goals:

   * Stable equilibrium whenever possible.

   * Well-defined fairness properties.

FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 2]
draft-jwl-tcp-fast-00.txt                                      June 2003

   
   * High throughput and link utilization.

   * Making use of additional feedbacks such as ECN when available.    

   subject to the following constraints:

   * Only modifying the TCP sender. 

   * Friendliness to the current TCP.

   * Requiring no cooperation from routers or receivers. 
     
3. Window Control Algorithm

   Without explicit feedback from routers, only two kinds of signals
   are available for congestion control: queueing delay and packet loss.
   FAST TCP uses equation-based control with queueing delay, and
   multiplicative decrease with packet loss. 
   
3.1 Reacting to Delay
   
   The goal of responding to queueing delay is to maintain a stable
   queue in the bottleneck router, instead of periodically inducing 
   packet losses by overflowing the queue, in order to sustain high
   link utilization.  FAST TCP uses queueing delay for congestion
   control whenever reliable RTT measurements are available.
   
   Throughout our discussion in this section, we assume congestion
   windows are maintained in the unit of packets to simplify
   discussion.  Congestion windows that are maintained in bytes can
   be converted into packet-based windows by dividing the
   path MTU (PMTU).

   We will first explain the variables used in our equation-based
   congestion control.  For each source, let 

      * w_current, w_old, w_new denote the current congestion window
        size, the congestion window size one avgRTT ago, and the new
        target congestion window size, respectively;

      * avgRTT denote the average RTT computed using exponential
        averaging with a weight described later in Section 4.2;

      * baseRTT denote the minimum of the observed instantaneous RTTs;

      * q = avgRTT - baseRTT denote an estimate of the current
        queueing delay;

      * alpha denote a parameter, which is a non-negative number of
        packets described later in Section 4.4.

   Then the new target window w_new is calculated every RTT as follows:
   

FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 3]
draft-jwl-tcp-fast-00.txt                                      June 2003

      w_new = 1/2{ [w_old * baseRTT / avgRTT] + alpha + w_current }
   
   Note that each update increases the window by at most alpha/2.

   If pacing of packet transmission is unavailable, the implementation
   SHOULD increment or decrement the congestion window smoothly 
   over one RTT to the new target window w_new.  If pacing of packet 
   transmission is available, the implementation MAY set congestion
   window to w_new directly.

   A reference implementation to update congestion window from w_current
   toward w_new is as follows:

   * set num_ack = abs( w_current / [w_new-w_current] );

   * if w_new > w_current:
     during the next window, increase congestion window by one packet 
     for every max{num_ack, 1} acked packet(s);

   * if w_new < w_current:
     during the next window, decrease congestion window by one packet 
     for every max{num_ack, 2} acked packets;

   The implementation SHOULD always be less aggressive than slow start,
   and SHOULD always be less drastic than rate-halving. 

   In [CWL03], we prove that this algorithm converges to an 
   equilibrium (exponentially fast, once avgRTT > baseRTT)
   and present experimental results on dummynet to illustrate 
   the performance of FAST TCP.
   
3.2 Reacting to Loss
   
   The goal in reacting to packet loss is to back off packet
   transmission quickly when severe congestion occurs.  Subsequently,
   a system can return to a regime where reliable RTT measurements are
   available for the window adjustment algorithm described in
   Section 3.1.
   
   When a source detects a packet loss, the source SHOULD follow RFC
   2582 (New Reno) to enter Fast Retransmission/Fast Recovery. The
   source MAY use other compatible improvements, e.g., rate-halving, in
   an implementation.
   
   Before exiting Fast Recovery, the source SHOULD NOT react to
   queueing delay unless it is certain that the delay measurements are
   reliable.  By reliable, we mean that the delay measurements are from
   newly transmitted packets that would be SACK'd during recovery, and
   furthermore, we will react to queueing delay only after we have
   collected enough samples, e.g., 30% of the congestion window size at
   the beginning of the current loss event. 
   
4 Measurement and Parameter Setting
   

FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 4]
draft-jwl-tcp-fast-00.txt                                      June 2003

   The algorithm that responds to queueing delay requires the following 
   quantities: w_old, baseRTT, and q. Since q = avgRTT - baseRTT, 
   we need to measure w_old, avgRTT, and baseRTT.
   
4.1 RTT and w_old Measurement

   The implementation is encouraged to use the most accurate time-stamp 
   available on an implementation platform. For example, In Linux 2.4, 
   gettimeofday() is used to obtain time resolution down to several
   microseconds.  Clearly, there is a trade-off between time precision
   and the overhead needed to achieve such precision.
   
   When a packet is ready for transmission, the implementation SHOULD
   record a time-stamp in the copy of the packet in the retransmission
   queue. The implementation SHOULD also record the current congestion
   window in the copy of the packet, which we call cwnd_stamp.
   
   At the time when a packet is acknowledged, the implementation SHOULD 
   compare the current time and the time-stamp in the copy in the
   retransmission queue, and compute the RTT sample as the difference
   between the two values. The implementation SHOULD also set w_old to
   cwnd_stamp for the acknowledged packet.
   
4.2 avgRTT Computation

   In [VJ88], the average weight for Exponential Weighted Moving Average
   (EWMA) is 1/8. For high-speed long-distance network, smaller average
   weight is recommended since, with a large window, successive RTT
   samples capture congestion information over a timescale that is much
   smaller than a RTT.  Average RTT values should capture the state of a
   network during the last window.   We suggest an average weight that
   is proportional to the reciprocal of cwnd: 

      weight = min{ 3/cwnd, 1/8 }

   The average RTT is updated on the receipt of a new RTT sample,
   as follows:

      avgRTT_new = (1-weight)*avgRTT_old + weight*RTT

4.3 baseRTT Computation

   The implementation MAY use the minimum of observed RTT samples since
   the start of a connection as baseRTT, as in TCP Vegas [BPe95].

   This solution is not robust against possible routing changes during
   the lifetime of a connection.  It may also incur estimation errors
   in a dynamic sharing environment where the minimum observed RTT of
   a new FAST TCP flow may include queueing delay produced by packets
   of existing flows.   [LPW02] shows that estimation errors in baseRTT
   will skew the overall fairness in a network, but estimation errors
   do not seem to upset network stability.  Even though simulations
   using infinite router buffers indeed confirm the presence of such
   errors and the resulting skewed fairness, in a dynamic sharing

FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 5]
draft-jwl-tcp-fast-00.txt                                      June 2003

   environment, our experiments in real networks and in dummynet with
   finite buffer suggest that it may not be a serious problem in
   practice [CWL03,CWL+03].   Further research is needed to fully
   address the issue of baseRTT computation when network condition
   changes.
   
4.4 Setting alpha 

   The parameter alpha, which may be different for different TCP flows,
   affects both the equilibrium and dynamic behaviors of a network.  

   It specifies the total number of packets a single FAST connection
   tries to maintain in one or more queues in its path.  This defines
   the equilibrium for each FAST connection.  For n FAST flows sharing
   the same bottleneck router using the same alpha value, each will get
   1/n of the bottleneck bandwidth.  Thus, a buffer of size at least
   n*alpha packets at the router is required to maintain equilibrium.

   The parameter alpha also determines the window increment when
   queueing delay is zero, so it may be used to tune the aggressiveness
   of the window update function.

   The implementation MAY use a mapping table from bandwidths
   (x = cwnd/RTT) to alpha values.  For example, to maintain about
   2.5 ms of queueing delay for each FAST connection, alpha may be set
   according to link capacity by the following table:
   
        x       alpha
   ------------------
   <= 0.1 Gbps     20
        1 Gbps    200
      2.5 Gbps    500
       10 Gbps   2000

   Clearly, to scale with network capacity, an automatic way to set the
   alpha value is necessary.  We are researching various ways to
   adaptively tune the alpha parameter on time scales larger than
   one RTT.

5. Equilibrium and Dynamic Properties

   This will be updated in version 01 by June 30, 2003.

6. Fairness

6.1 Fairness among FAST TCP flows

   It is shown in [CWL03] that FAST TCP has a log utility function:

      U(x) = alpha * log(x)

   Hence, a network with only FAST TCP flows achieve proportional
   fairness under no congestion or mildly congested situations---
   when packet loss occurs infrequently.

FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 6]
draft-jwl-tcp-fast-00.txt                                      June 2003


   Under severely congested situations, packet loss becomes 
   the dominant signal, and AIMD in FAST should provide 
   comparable fairness as current TCP with identical flows. 

6.2 Friendliness to Current TCP Flows

   Interaction between FAST TCP with the current TCP is an important
   issue that is still under study.

   The alpha parameter in FAST TCP determines both the equilibrium
   bandwidth share and the aggressiveness during the additive increase
   phase when queueing delay is zero.  Friendliness of FAST TCP to the
   current TCP can be tuned by appropriately selecting the alpha value.
   The challenge is to tune it automatically and in a decentralized
   manner.

   Another possible feature is to make FAST TCP behave exactly like the
   current TCP when congestion window is small so that it is compatible
   with the current TCP in low speed networks, similar to the approach
   taken in HSTCP [Flo02].  In this case, the window adjustment
   algorithm of Section 3.1 is used only when the congestion window size
   of a connection exceeds a threshold.

5. Security Considerations 

   This proposal makes no changes to the underlying security of TCP.

6. Acknowledgment

   We are grateful to Guy Almes, Allison Mankin, Allyn Romanow, 
   Stanislav Shunalov, and Lixia Zhang for encouraging us to
   work on this Internet Draft and their helpful advices.
	
7. References

[BPe95] L. Brakmo and L. Peterson. "TCP Vegas: end to end 
congestion avoidance on a global Internet", IEEE Journal 
on Selected Areas in Communication, 13(8):1465-1480, October 1995

[CWL03] C. Jin, D. X. Wei and S. H. Low. "FAST TCP: architecture, 
algorithms, performance", to be submitted for publication.

[CWL+03]  C. Jin, D. X. Wei, S. H. Low, G. Buhrmaster, J. Bunn, 
D. H. Choe, R. L. A. Cottrell, J. C. Doyle, W. Feng, O. Martin, 
H. Newman, F. Paganini, S. Ravot, S. Singh. "FAST TCP: from theory 
to experiments", submitted for publication, April 1, 2003.
http://netlab.caltech.edu/FAST/

[Flo02] HighSpeed TCP Web Page, 
http://www.icir.org/floyd/hstcp.html

[Jac88] Jacobson, V., "Congestion Control and Avoidance", SIGCOMM 1988,
August 1988.

FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 7]
draft-jwl-tcp-fast-00.txt                                      June 2003


[Kel02] Tom Kelly, "Scalable TCP: Improving Performance in HighSpeed
Wide Area Networks", February 2003.  
http://www-lce.eng.cam.ac.uk/~ctk21/scalable/

[KHR02] D. Katabi, M. Handley and C. Rohrs.
"Congestion Control for High Bandwidth-Delay Product Networks",
ACM SIGCOMM 2002.

[LPW02] S. H. Low, L. L. Peterson and L. Wang. "Understanding Vegas:
a duality model", Journal of ACM, 49(2):207-235, March 2002

8. Author's Address 

   Cheng Jin
   Phone: +1 (626) 395-8820
   E-mail: chengjin@cs.caltech.edu

   David X. Wei
   Phone: +1 (626) 395-3555
   E-mail: weixl@cs.caltech.edu

   Steven H. Low
   Phone: +1 (626) 395-6767
   E-mail: slow@cs.caltech.edu

   Caltech
   http://netlab.caltech.edu

9. Copyright Notice

   The IETF has been notified of intellectual property rights
   claimed in regard to some or all of the specification contained
   in this document.  For more information consult the on-line list
   of claimed rights.

10. Expiration Date

   This document will expire on December 23, 2003.


FAST       FAST TCP for High-Speed Long-Distance Networks       [Page 8]