Internet Engineering Task Force Ethan Blanton INTERNET DRAFT Purdue University File: draft-blanton-tcp-reordering-00.txt Robert Dimond Verizon/NASA GRC Mark Allman BBN/NASA GRC February, 2003 Expires: July, 2003 Practices for TCP Senders in the Face of Segment Reordering Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of [RFC2026]. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This document outlines actions a TCP sender can take after determining that a spurious fast retransmission has been sent due to packet reordering. The document outlines the method a TCP connection can use to "undo" needless changes made to the congestion control state. In addition, several methods to help prevent future fast retransmits are outlined. Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119]. 1 Introduction As outlined in [RFC793] TCP receivers always indicate the next in-order byte of data they expect to receive in the acknowledgments (ACKs) they send. Therefore, if a segment arrives that does not include the next expected byte of data, the receiver generates a Expires: July 2003 [Page 1] draft-blanton-tcp-reordering-00.txt February 2003 duplicate ACK. This situation can arise from either packet loss or from packet reordering in the network. Meanwhile, TCP's fast retransmit algorithm [Jac88,RFC2581] uses the arrival of three duplicate ACKs as an indication that the network lost the packet containing the indicated data byte. Upon detection of a lost segment via fast retransmit the sender resends the given data segment before the expiration of the retransmission timer (RTO). The sender waits for three duplicate ACKs in an attempt to distinguish between actual packet loss and small amounts of packet reordering. Several Internet measurement studies have shown that waiting for the arrival of three duplicate ACKs may not be sufficient to distinguish between packet loss and packet reordering over some network paths [Pax97,BPS99]. Therefore, in some cases a TCP sender may send a spurious fast retransmit. This spurious retransmit wastes network resources on needless work, as well as causing the TCP sender to infer that the network is experiencing congestion and thus the sender should reduce its sending rate. [BA02a] shows the performance impact of various levels of reordering on TCP connections. This document offers several experimental strategies for mitigating the impact of spurious fast retransmits. The first mitigation is to allow the sender to revert to its previous sending rate when spurious retransmissions are detected. The second mitigation is to dynamically vary the number of duplicate ACKs required to trigger fast retransmit in an attempt to prevent spurious retransmissions. Explicitly out of scope for this document is the subject of spurious RTOs. This subject is covered by others (as outlined in section 3 on related work). We hope the community will experiment with the mechanisms presented in this document and provide their experience and input to a future decision on which mechanisms (if any) may be appropriate to include as part of the TCP's standard congestion control algorithms. This document is organized as follows. Section 2 offers definitions. Section 3 outlines related work. Section 4 details how a TCP sender can revert to a previous congestion control state after a spurious retransmit has been detected. Section 5 outlines the methods for dynamically adjusting the number of duplicate ACKs required to trigger fast retransmit. Section 6 discusses interactions between adjusting the duplicate ACK threshold and the RTO. Section 7 sketches future work in this area. Section 8 outlines security considerations. 2 Definitions It is assumed that the reader is familiar with the congestion control concepts and terminology for TCP from [RFC2581]. Expires: July 2003 [Page 2] draft-blanton-tcp-reordering-00.txt February 2003 In addition, the following terms will be used: ``dupthresh'' represents the number of duplicate ACKs required to trigger a fast retransmit. [RFC2581] defines dupthresh to be 3 duplicate ACKs. 3 Related Work [Editor's note: Additional related work outlined in [SL02] will be included in a later version of this document.] A number of documents have been written on the subject of spurious retransmits. In this space, there are three categories of mechanisms: (1) Mechanisms for detecting spurious retransmits. The changes specified in this document require a mechanism to detect when a TCP sender has needlessly retransmitted. We do not specify any one algorithm that must be used with our mitigation strategies because the exact method employed does not have a direct impact on this specification. (That is not to say that the choice of detection mechanism is immaterial in a larger sense.) [BA02b] outlines a scheme that uses the DSACK TCP option [RFC2883] to detect spurious retransmits. [LM02] outlines the Eifel algorithm that uses the TCP timestamp option [RFC1323] to detect unneeded retransmissions. (An additional variant of the Eifel algorithm uses reserved bits from the TCP header rather than timestamps [LK00].) [SK02] outlines the F-RTO algorithm that uses a slight modification to TCP's sending pattern after a retransmission timeout (RTO) and the resulting ACKs to detect spurious RTOs. (2) Mechanisms for undoing needless changes to congestion control state. Section 4 of this document specifies the appropriate method for reverting to a previous set of congestion control parameters after reliably detecting a needless retransmit (and in the presence of no other congestion signals). The scheme specified in this document draws on the spirit outlined in [RFC2883]. [LG02] outlines the response to spurious retransmits (both spurious fast retransmits and spurious timeouts) after detection by the Eifel detection algorithm [LM02]. The mechanism for undoing needless changes to the congestion control state of the connection is similar to the mechanism we outline in section 4. (3) Mechanisms for making TCP more tolerant to reordering events such that future spurious retransmits are avoided. Expires: July 2003 [Page 3] draft-blanton-tcp-reordering-00.txt February 2003 [LG02] outlines one particular method for adjusting dupthresh in an attempt to prevent future needless fast retransmissions. (In addition, [LG02] provides a method for adjusting the RTO after a spurious RTO.) [ZKFP02] outlines a scheme whereby the TCP sender's SACK scoreboard data structure is annotated with information about packet reordering in an attempt to adjust the dupthresh to prevent packet reordering from causing needless retransmits. Finally, section 5 of this document outlines several schemes to avoid needlessly evoking the fast retransmit algorithm. The mechanisms in section 5 are based on [BA02a]. Each of the above proposals has its pros and cons. An in-depth analysis of each proposal is beyond the scope of this document. 4 Reverting Congestion Control State [RFC2883,LK00,BA02a] all mention the possibility of reverting congestion control state when a sender determines that the congestion window was reduced unnecessarily due to a spurious retransmission. This document specifies that a sender that reliably detects a spurious retransmission MAY ``undo'' the needless congestion control state changes. If the sender wishes to undo needless changes it MUST use the following algorithm: (1) When retransmitting a segment and before reducing cwnd or ssthresh, save the value of cwnd as cwnd_prev and the value of ssthresh as ssthresh_prev. (2) Upon the detection of a spurious retransmit, set ssthresh to cwnd_prev. This causes a 1 RTT slow start back to the cwnd in use before the sender needlessly changed the congestion control state. This rule causes a smooth increase of cwnd, whereas simply setting cwnd to cwnd_prev upon detection of a spurious retransmit would cause the transmission of a potentially large burst. (3) Optional. When cwnd reaches ssthresh set ssthresh to ssthresh_prev. This rule allows a TCP that is using slow start when a spurious retransmission is detected to resume that slow start after detection. Without this rule the cwnd would be reverted to its previous value, however further cwnd growth would be limited to the linear growth of congestion avoidance. If a sender uses the above algorithm, one of the algorithms specified in the next section MUST be used in an attempt to prevent further needless retransmissions. Expires: July 2003 [Page 4] draft-blanton-tcp-reordering-00.txt February 2003 5 Preventing Spurious Fast Retransmits Undoing congestion control changes caused by reordering events increases the overall performance of a TCP connection [BA02a]. However, [BA02a] shows that when the steps in section 4 are taken alone the number of unnecessary retransmits (and, therefore, wasted network resources) still remains significantly high. Therefore, further performance gains and network resource savings can be realized if we can prevent subsequent spurious fast retransmits. Simulations in [BA02a] show that a strategy of reversing erroneous congestion control decisions, coupled with a fast retransmit mitigation scheme, achieves good performance (within 1-2% of a transfer that experiences no reordering) while reducing the number of unnecessary retransmits by roughly a factor of 6. This document purposes several options that deal with postponing the fast retransmit algorithm to allow the reordered segments to be acknowledged by the receiver. This is achieved by increasing the duplicate acknowledgment threshold, or directly delaying fast retransmit by an increment of time when a spurious retransmission has occurred previously. Regardless of which approach is taken to avoid spurious retransmits, caution should be taken in their implementation. For example, more aggressive manipulations of dupthresh can result in faster convergence in the face of a connection facing persistent reordering, but may increase the chance of relying on the RTO to recover from real loss. The RTO is often lengthy and cuts cwnd to one segment and invokes slow start. 5.1 Increase dupthresh By K A TCP sender that implements this strategy MUST increase dupthresh by some small constant, K, each time a spurious fast retransmit is detected. In [BA02a] K is set to increment dupthresh by one after a spurious retransmission is detected. This method requires minimal state to implement, but a conservative increment of K may be slow to converge in connections where persistent large reordering events are present. In connections where the reordering event lengths are varied convergence becomes problematical at best. 5.2 Increase dupthresh Based On Reordering Length A TCP sender that implements this strategy MUST increase dupthresh based on the length of the most recent reordering event that causes a spurious fast retransmit. With this method a TCP sender determines the number of duplicate ACKs, C, that would have disambiguated the reordering event from the perceived loss. C is then averaged with the current value of dupthresh, per dupthresh = (C + dupthresh) / 2 (1) Expires: July 2003 [Page 5] draft-blanton-tcp-reordering-00.txt February 2003 If the new value for dupthresh is less than or equal to its predecessor, dupthresh SHOULD be incremented by one. This method allows for a varied growth of dupthresh, accommodating the varied reordering event lengths mentioned in the previous option. Unfortunately, reordering events significantly longer than the connection's average reordering events may result in skewing dupthresh to an overly optimistic value, preventing fast retransmit before the RTO timer expires. 5.3 Delay of Fast Retransmit As proposed in [Pax97], the fast retransmit can be delayed by some small amount of time after dupthresh duplicate ACKs have been received in an effort to distinguish between loss and reordering. A sender that implements this strategy MUST track of the duration in time, T, of the reordering event from the reception of the first duplicate ACK to the first cumulative or partial ACK. This represents the time necessary to disambiguate the longest reordering event in the current connection from its perceived loss. The sender MUST then, on subsequent perceived loss events, wait T seconds after the third duplicate acknowledgment for subsequent reordering/loss events before invoking the fast retransmit algorithm. Should a cumulative ACK be received before this timer expires, the sender MUST NOT invoke fast retransmit. Similar in approach to the scheme sketched in section 5.2, this method considers the length of the reordering event. In using the passage of time, as opposed to counting duplicate ACKs, the strategy is more robust where ACK loss is present. Unfortunately this potential advantage comes at a cost of requiring more overhead than in the previously outlined methods in the form of an additional timer for each TCP connection. 5.4 Average Duplicate ACK Tracking In an attempt to use a connection's history while still being adaptive to reordering events which are uncharacteristically divergent, an Exponentially Weighted Moving Average (EWMA) is tried in [BA02a]. The EWMA gives the values needed to mitigate recent spurious fast retransmits a higher weight over the values used in previous fast retransmit mitigation attempts in the connections history. Unfortunately, initial results for this mechanism yielded little or no advantages over TCP variants which did not use any mechanisms to mitigate spurious fast retransmits. Therefore, we do not recommend using an EWMA scheme in this document, but rather offer it as a future research area. 6 Preventing Spurious Timeouts In section 5, we outline several methods to increase dupthresh, or insert delay before fast retransmit, in hopes of maintaining cwnd to mitigate TCP performance issues in the face of reordering. Regardless of the method used, since reordering events may vary in Expires: July 2003 [Page 6] draft-blanton-tcp-reordering-00.txt February 2003 length and ACK loss will effect the number of duplicate ACKs returned to the sender there exists a possibility that dupthesh will be set to an overly optimistic value. In this case the RTO will expire before enough duplicate ACKs are received for fast retransmit. In addition to the often lengthy timeout imposed by the RTO [RFC2988] the TCP sender will reduce cwnd to one segment and restart transmission with slow start. This document sketches several additional methods for coping with RTOs caused by an overly optimistic dupthresh in the next subsections. In addition, the outline of future work below (section 7) offers some thoughts in this area. 6.1 Reducing dupthresh In addition to the performance issues involved with the expiration of the RTO, the sender should consider an RTO as an indication that the current dupthresh is no longer a valid estimate of the reordering conditions in the network. The TCP sender MUST reset dupthresh to the default value of from [RFC2581] (currently 3). This conservative approach guards against subsequent RTOs stemming from an optimistic dupthresh value, but may allow spurious retransmissions to resume until the sender's dupthresh estimate re-converges. 6.2 Bounding dupthresh This document specifies that the dupthresh MUST never be larger than cwnd-1 segments. Furthermore, to be more conservative we RECOMMEND that dupthresh never be more than 90% of cwnd. This means that whenever cwnd is reduced in the course of a TCP transfer the stack must also ensure that dupthresh is likewise reduced, if necessary. 6.3 Extending Limited Transmit A final strategy that MAY be used by a TCP sender in an attempt to prevent RTOs is to extend the limited transmit algorithm [RFC3042]. Limited transmit calls for a TCP sender to transmit one previously unsent data segment upon reception of each of the first two duplicate ACKs. This document allows that a a TCP sender MAY transmit a previously unsent data segment on every second duplicate ACK received after the first two and before dupthresh is met. So, when used in conjunction with [RFC3042], a TCP sender with a dupthresh of 7 could transmit new data segments on duplicate ACKs 1, 2, 4, and 6. This mechanism allows the TCP sender to sustain the ACK clock and possibly generate new duplicate ACKs that will help trigger fast retransmit and prevent an RTO. The mechanism is conservative because in the case when loss is the cause of the duplicate ACKs (rather than reordering) the rate the TCP sender is transmitting new segments into the network is gradually being reduced by sending only on every second duplicate ACK. This mechanism is similar to the rate-halving proposals []. Expires: July 2003 [Page 7] draft-blanton-tcp-reordering-00.txt February 2003 7 Future Work All the schemes for varying dupthresh in this document involve increasing the number of duplicate ACKs required to trigger fast retransmit. While we provide one scheme to reduce dupthresh (collapsing back to the default value after an RTO) [ZKFP02] shows that schemes that offer both dupthresh increase and decrease offer more performance benefit. One area for future work is to extend the strategies outlined in section 5 to allow for decreasing dupthresh. 8 Security Considerations Reversion of congestion control state has no obvious security considerations given an accurate signal that indicates a needless retransmit. Any security considerations pertaining to the method of determining that a fast retransmit was spurious are directly applicable to this document. The adjustment of dupthresh may be vulnerable to any third party who wishes to illegitimately gain network resources and/or cause a denial of service between the sender and receiver. The vulnerability exists when a third party sends fictitious duplicate acknowledgments to the sender with a sequence number within the sender's current cwnd. As a result, the sender may needlessly retransmit and also grow dupthresh to an overly optimistic value, resulting in future RTOs. Such RTOs cause the sender to drop cwnd to 1 segment, relinquishing network resources at the cost of the connection's performance. Acknowledgments The authors would like to thank Sally Floyd, Reiner Ludwig, Vern Paxson and Randall Stewart for spurring this work through their previous work for providing input during the formative stages of this draft. References [BA02a] E. Blanton, M. Allman. On Making TCP More Robust to Packet Reordering. ACM Computer Communication Review, 32(1), January 2002. [BA02b] E. Blanton, M. Allman. Using TCP DSACKs and SCTP Duplicate TSNs to Detect Spurious Retransmissions. Internet-Draft draft-blanton-dsack-use-02.txt, October 2002. Work in progress. [BPS99] J. Bennett, C. Partridge, N. Shectman. Packet Reordering is Not Pathological Network Behavior. IEEE/ACM Transactions on Networking, December 1999. [Jac88] Van Jacobson. Congestion Avoidance and Control. ACM SIGCOMM 1988. Expires: July 2003 [Page 8] draft-blanton-tcp-reordering-00.txt February 2003 [LG02] R. Ludwig, A. Gurtov. The Eifel Response Algorithm for TCP. Internet-Draft draft-ietf-tsvwg-tcp-eifel-response-01.txt, October 2002. Work in progress. [LK00] R. Ludwig, R. Katz. The Eifel Algorithm: Making TCP Robust Against Spurious Retransmissions. Computer Communication Review Volume 30 Number 1, January 2000. [LM02] R. Ludwig, M. Meyer. The Eifel Detection Algorithm for TCP. Internet-Draft draft-ietf-tsvwg-tcp-eifel-alg-06.txt, October 2002. Work in progress. [Pax97] V. Paxson. End-to-End Internet Packet Dynamics. Presented at ACM SIGCOMM, September 1997. [RFC1323] Van Jacobson, Robert Braden, David Borman. TCP Extensions for High Performance. RFC 1323. May 1992. [RFC2581] M. Allman, V. Paxson, W. Stevens, TCP Congestion Control, RFC 2581, April 1999. [RFC2883] S. Floyd, J. Mahdavi, M. Mathis, M. Podolsky. An Extension to the Selective Acknowledgement (SACK) Option for TCP. RFC 2883, July 2000. [RFC2988] Vern Paxson, Mark Allman. Computing TCP's Retransmission Timer, RFC 2988, November 2000. [RFC3042] Mark Allman, Hari Balkrishnan, Sally Floyd. Enhancing TCP's Loss Recovery Using Limited Transmit. RFC 3042, January 2001 [SL02] Yogesh Swami, Khiem Le. DCLOR: De-correlated Loss Recovery using SACK Option for Spurious Timeouts. Internet-Draft draft-swami-tsvwg-tcp-dclor-00.txt, November 2002. Work in progress. [SK02] P. Sarolahti, M. Kojo. F-RTO: A TCP RTO Recovery Algorithm for Avoiding Unnecessary Retransmissions. Internet-Draft draft-sarolahti-tsvwg-tcp-frto-02.txt, November 2002. Work in progress. [ZKFP02] M. Zhang, B. Karp, S. Floyd, L. Peterson. RR-TCP: A Reordering-Robust TCP with DSACK. ICSI Technical Report TR-02-006, July 2002. Authors' Addresses: Ethan Blanton Purdue University Computer Sciences 1398 Computer Science Building West Lafayette, IN 47907 eblanton@cs.purdue.edu Expires: July 2003 [Page 9] draft-blanton-tcp-reordering-00.txt February 2003 Robert Dimond Verizon FNS/NASA Glenn Research Center Lewis Field 21000 Brookpark Rd. MS 142-1 Cleveland, OH 44135 Phone: 216-433-8468 bdimond@grc.nasa.gov Mark Allman BBN Technologies/NASA Glenn Research Center Lewis Field 21000 Brookpark Rd. MS 54-5 Cleveland, OH 44135 Phone: 216-433-6586 Fax: 216-433-8705 mallman@bbn.com http://roland.grc.nasa.gov/~mallman Expires: July 2003 [Page 10]