Internet Engineering Task Force P. Sarolahti INTERNET DRAFT Nokia Research Center File: draft-sarolahti-tsvwg-tcp-frto-00.txt M. Kojo University of Helsinki June 24, 2002 Expires: December, 2002 F-RTO: A TCP RTO Recovery Algorithm for Avoiding Unnecessary Retransmissions Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of [RFC2026]. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract Spurious retransmission timeouts (RTOs) cause suboptimal TCP performance, because they often result in unnecessary retransmission of the last window of data. This document describes the "Forward RTO Recovery" (F-RTO) algorithm for recovering from the TCP RTOs efficiently even when the RTO was spurious. F-RTO is a TCP sender only algorithm that does not require any TCP options to operate. After retransmitting the first unacknowledged segment triggered by an RTO, the F-RTO algorithm at a TCP sender monitors the incoming acknowledgements to determine whether the timeout was spurious and to decide whether to send new segments or retransmit unacknowledged segments. The algorithm effectively avoids additional unnecessary retransmissions and thereby improves TCP performance in case of a Expires: December 2002 [Page 1] draft-sarolahti-tsvwg-tcp-frto-00.txt June 2002 spurious timeout. 1. Introduction The TCP protocol [Pos81] has two methods for triggering retransmissions. Primarily, the TCP sender relies on incoming duplicate ACKs, which indicate that the receiver is missing some of the data. After a required amount of successive duplicate ACKs have arrived at the sender, it retransmits the first unacknowledged segment [APS99]. Secondarily, the TCP sender maintains a retransmission timer which triggers retransmission of segments, if they have not been acknowledged within the retransmission timer expiration period. When the retransmission timer expires, the congestion window is initialized to one segment and unacknowledged segments are retransmitted using the slow-start algorithm. The retransmission timer is adjusted dynamically based on the measured round-trip times [PA00]. It has been pointed out that the retransmission timer can expire spuriously and trigger unnecessary retransmissions when no segments have been lost [GL02]. After a spurious RTO the acknowledgements of original segments arrive at the sender, usually triggering unnecessary retransmissions of whole window of segments. There are a number of potential reasons for spurious RTOs. First, some mobile networking technologies involve sudden delay peaks on transmission because of actions taken during a hand-off. Second, arrival of competing traffic of higher priority on a low-bandwidth link, or some other change in available bandwidth involves a sudden increase of round-trip time which may trigger a spurious retransmission timeout. A persistently reliable link layer can also cause a sudden delay when several data frames are lost for some reason. This document does not distinguish the different causes of such a delay, but discusses the spurious RTO caused by delay in general. This document describes an alternative algorithm called "Forward RTO- Recovery" (F-RTO) to be used for retransmitting segments after an RTO. The purpose of the F-RTO algorithm is to avoid most of the unnecessary retransmissions following a spurious timeout and thereby improve TCP performance. When the RTO is not spurious, the F-RTO algorithm reverts back to the conventional RTO recovery algorithm and should have similar performance. F-RTO does not require any TCP options in its operation, and it can be implemented by modifying only the TCP sender. This is different from alternative algorithms (Eifel [LK00] and D-SACK-based algorithms) that have been suggested for the same purpose. The Eifel algorithm uses TCP timestamps for detecting Expires: December 2002 [Page 2] draft-sarolahti-tsvwg-tcp-frto-00.txt June 2002 a spurious timeout and the D-SACK-based algorithms require that the SACK option with D-SACK extension [FMMP00] is in use. When a RTO occurs, the F-RTO sender retransmits the first unacknowledged segment normally. If the next two acknowledgements advance the window, the F-RTO sender continues sending new data and exits the recovery. However, if either of the next two acknowledgements is a duplicate ACK, the RTO was likely to be caused because of missing data, hence the F-RTO sender retransmits the unacknowledged segments in slow start similarly to the traditional algorithm. The F-RTO algorithm only attempts to avoid unnecessary retransmissions after an RTO. Eifel and D-SACK can also be used in avoiding unnecessary retransmissions in other events, for example due to packet reordering. In addition, F-RTO is not intended to be used for deciding whether to revert the congestion control parameters after a spurious timeout. However, F-RTO can be used jointly with either Eifel or D-SACK algorithms. In such a case the TCP sender may detect unnecessary retransmissions also due to other reasons than a spurious RTO and it can more safely revert the congestion control parameters when a spurious RTO is detected. 2. F-RTO Algorithm The F-RTO algorithm affects the TCP sender behavior only after a retransmission timeout. Otherwise the TCP behavior is similar to the conventional TCP. When the retransmission timer expires, the F-RTO algorithm takes the following steps at the TCP sender. 1) When RTO expires, retransmit first unacknowledged segment. The TCP sender must adjust ssthresh to FlightSize / 2, according to the TCP congestion control specifications. However, congestion window (cwnd) is not yet adjusted, but the TCP sender waits for the next incoming acknowledgements before deciding on the following actions. Leaving cwnd unadjusted in this stage does not cause TCP sender to inject any additional segments. 2) When the first acknowledgement after the RTO arrives at the sender, the sender chooses the following actions depending on whether the ACK advances the window or whether it is a duplicate ACK. a) If the acknowledgement is a duplicate ACK, revert to the conventional recovery and do not enter step 3 of this algorithm. Expires: December 2002 [Page 3] draft-sarolahti-tsvwg-tcp-frto-00.txt June 2002 The sender must set cwnd to 1 * MSS. This duplicate ACK is triggered by a segment that was sent before the retransmission triggered by the RTO. This is possible, for example, if the RTO is triggered during fast recovery and the forward transmissions have triggered duplicate ACKs. A common reason for entering this branch is that RTO triggered retransmission of a segment that was already retransmitted earlier by fast retransmit. b) If the acknowledgement advances the window, transmit two new segments. At this point the TCP sender should set cwnd <- ssthresh, thus halving the transmission rate. Sending two new segments at this point is equally aggressive to the conventional RTO recovery algorithm, which would have increased its cwnd to 2 * MSS when the first ACK arrives after RTO. 3) When the second acknowledgement after the RTO arrives at the sender, either continue transmitting new data, or start retransmitting the unacknowledged segments. a) If the acknowledgement is a duplicate ACK, set congestion window to three segments, continue with the slow start algorithm retransmitting unacknowledged segments. The duplicate ACK indicates that at least one segment other than the segment which triggered RTO is lost in the last window of data. There is no sufficient evidence that any of the segments was delayed. Therefore, the sender proceeds with retransmissions similarly to the conventional RTO recovery algorithm, with the send_high variable stored when the retransmission timer expired to avoid unnecessary fast retransmits. b) If the acknowledgement advances the window, continue transmitting new data following the congestion avoidance algorithm. Because the TCP sender has retransmitted only one segment after the RTO, this acknowledgement indicates that an originally transmitted segment has arrived at the receiver. This is regarded as a strong indication of a spurious RTO. However, since the TCP sender cannot surely know at this point whether the segment that triggered the RTO was actually lost, adjusting the congestion control parameters after the RTO is the correct action. From this point on, the TCP sender continues as in the normal congestion avoidance. Expires: December 2002 [Page 4] draft-sarolahti-tsvwg-tcp-frto-00.txt June 2002 If this algorithm branch is taken, the TCP sender ignores the send_high variable that indicates the highest sequence number transmitted so far [FH99]. The send_high variable was proposed as a "bugfix" for avoiding unnecessary multiple fast retransmits when RTO expires during fast recovery with NewReno TCP. As the sender has not retransmitted other segments but the one that triggered RTO, the problem addressed by the bugfix cannot occur. Therefore, if there are duplicate ACKs arriving at the sender after the RTO, they are likely to indicate a packet loss, hence fast retransmit should be used to allow efficient recovery. Alternatively, if there are not enough duplicate ACKs arriving at the sender after a packet loss, the retransmission timer expires another time and the sender enters step 1 of this algorithm. When algorithm branch (3b) is taken, the sender does not reduce the congestion window to one segment, but halves it to the level of ssthresh. Because the sender does not enter slow start, it increases congestion window only once in a round-trip time after RTO, and therefore is slightly more conservative than the conventional recovery algorithm. In fact, if the segment that triggered RTO was not lost, the correct behavior would have been to not decrease the congestion window at all. However, by using the D-SACK or TCP timestamp options the sender can more reliably detect whether the retransmission was unnecessary and revert the last adjustments on the congestion control parameters in such a case. Section 4 outlines the possible methods that can be used with F-RTO in reverting the congestion control state. Branch (3b) can also be taken when a retransmission triggered by delay is lost. This kind of loss cannot be detected at the sender. Therefore, we consider that reducing the congestion window to half of its previous size is an adequate action at this point, because a similar action is taken when TCP sender enters fast recovery. 3. Scenarios We now discuss different scenarios where RTOs typically may occur and discuss how the F-RTO algorithm would perform in those scenarios. The interesting scenarios are a sudden delay triggering RTO, loss of a retransmitted packet during fast recovery, link outage losing several packets, and packet reordering. A performance evaluation with a more thorough analysis on a real implementation of F-RTO is given in [SKR02]. Expires: December 2002 [Page 5] draft-sarolahti-tsvwg-tcp-frto-00.txt June 2002 3.1. Sudden delay An unexpectedly long delay can trigger RTO, should it occur on a single packet blocking the following packets, or appear as increased RTTs for several successive packets. The example below illustrates the sequence of packets and acknowledgements seen by the TCP sender that follows the F-RTO algorithm, when a sudden delay occurs triggering RTO. For simplicity, delayed acknowledgements are not used in the example. ... (cwnd = 6, ssthresh < 6, FlightSize = 5) 1. SEND(10) -> 2. ACK(6) <- 3. SEND(11) -> 4. (ssthresh <- 3) 5. SEND(6) -> 6. ACK(7) <- 7. SEND(12) -> 8. SEND(13) -> 9. ACK(8) <- (cwnd <- 3, FlightSize = 6) 10. ACK(9) <- (cwnd = 3, FlightSize = 5) 11. ACK(10) <- (cwnd = 3, FlightSize = 4) 12. ACK(11) <- (cwnd = 4, FlightSize = 3) 13. SEND(14) -> ... When a sudden delay long enough to trigger RTO occurs at step 4, the TCP sender retransmits the first unacknowledged segment (step 5). Because the next ACK advances the window, the TCP sender continues by sending two new data segments (steps 7, 8) and adjusts cwnd to 3 MSS. Because the second acknowledgement arriving after the RTO also advances the window the TCP sender exits the recovery and continues with the congestion avoidance. From this point on the retransmissions are invoked either by fast retransmit or when triggered by the retransmission timer. Because the TCP sender reduces cwnd when receiving the first ACK after RTO and sends the two new data segments at steps 7 and 8, it has to wait until the FlightSize is reduced to the level of congestion window before it can continue transmitting again at step 13. 3.2. Loss of a retransmission If a retransmitted segment is lost, the only way to retransmit it again is to wait for the RTO to trigger the retransmission. Once the segment is successfully received, the receiver usually acknowledges several segments cumulatively. The example below shows a scenario where retransmission (of segment 6) is lost, as well as a later Expires: December 2002 [Page 6] draft-sarolahti-tsvwg-tcp-frto-00.txt June 2002 segment (segment 9) in the same window. The limited transmit [ABF01] or SACK TCP [MMFR96] enhancements are not in use in this example. ... (cwnd = 6, ssthresh < 6, FlightSize = 5) 1. SEND(10) -> 2. ACK(6) <- 3. SEND(11) -> 4. ACK(6) <- 5. ACK(6) <- 6. ACK(6) <- 7. SEND(6) -> (cwnd <- 6, ssthresh <- 3, FlightSize = 6) 8. ACK(6) <- 9. (ssthresh <- 3) 10. SEND(6) -> 11. ACK(9) <- 12. SEND(12) -> 13. SEND(13) -> 14. ACK(9) <- (cwnd <- 3) 15. SEND(9) -> 16. SEND(10) -> 17. SEND(11) -> 18. ACK(11) <- ... In the example above, segment 6 is lost and the sender retransmits it after three duplicate ACKs in step 7. However, the retransmission is also lost, and the sender has to wait for the RTO to expire before retransmitting it again. Because the first ACK following the RTO advances the window (step 11), the sender transmits two new segments. The second ACK in step 14 does not advance the window, and the sender enters the slow start, sets cwnd to 3 * MSS, and retransmits the next three unacknowledged segments, as per the F-RTO algorithm description given in Section 2. After this the receiver acknowledges all segments transmitted prior to entering recovery and the sender can continue transmitting new data in congestion avoidance. 3.3. Link outage A performance study shows that F-RTO performs similarly to the regular recovery when consecutive packets are lost both up- and downstream as a result of link outage [SKR02]. Because one of the two next ACKs after RTO does not advance the window in case of data loss, the TCP sender start retransmitting unacknowledged segments in slow start. The abovementioned performance study recovered a problem if the decision to skip retransmissions after RTO is based on the Expires: December 2002 [Page 7] draft-sarolahti-tsvwg-tcp-frto-00.txt June 2002 information from the TCP timestamps only (i.e. by following the Eifel algorithm). When both data segments and acknowledgements are dropped in the same window, the TCP sender may end up in a wrong conclusion: in some occasions it continues transmitting new segments according to timestamp feedback, even though several data segments have been dropped. A simple example illustrating the problem is given below. In Section 4 we outline how this phenomenon can be avoided by using a retransmission logic of the F-RTO algorithm in conjunction with the timestamp information. ... (cwnd = 6, ssthresh = 4, FlightSize = 5) 1. SEND(10) -> (ts 110> 2. ACK(6) <- 3. SEND(11) -> 4. 5. (cwnd <- 1, ssthresh <- 3) 6. SEND(6) -> 7. ACK(7) <- (undo: cwnd <- 6, ssthresh <- 4) 8. SEND(12) -> 9. ACK(7) <- 10. 11. SEND(7) -> 12. ACK(8) -> ... The Eifel sender fails, because the duplicate ACKs caused by lost data segment 7 are required to echo the timestamp of the segment that was last to update the window [BBJ92]. Because the duplicate ACKs were dropped, the first ACK after the retransmission in step 6 appears as a new acknowledgement with a timestamp earlier than the timestamp of the retransmission. The sender reverts the adjustments on the congestion window and continues transmitting new data, which is a wrong action to take. This can lead to another RTO which could have been avoided with the regular TCP sender that retransmits the unacknowledged segments after the RTO. The F-RTO sender can avoid the problem presented above, because it starts retransmitting in slow start after getting the second acknowledgement not advancing the window at step 9. 3.4. Packet reordering Since F-RTO modifies the TCP sender behavior only after a retransmission timeout, we limit the discussion on the effects of packet reordering in F-RTO behavior to the cases where packet reordering occurs immediately after RTO. We consider the retransmission timeout due to packet reordering to be very rare case, since reordering often triggers fast retransmit due to duplicate ACKs Expires: December 2002 [Page 8] draft-sarolahti-tsvwg-tcp-frto-00.txt June 2002 caused by out-of-order segments. Should packet reordering occur after a RTO, duplicate ACKs arrive to the sender, taking the F-RTO algorithm to retransmit in slow start as a regular RTO recovery would do. Although this might not be the correct action, it is similar to the behavior of the regular TCP, making F-RTO a safe modification also in the presence of reordering. 4. Reverting the Congestion Control State The related work on concerning improving the TCP performance on spurious retransmission timeouts suggests that the congestion control state would be reverted to the situation preceding the spurious timeout. The justification for this is that since there are no packets actually lost, the congestion window should not be reduced either. The F-RTO algorithm itself does not provide capabilities for making the decision to revert the congestion control state, but it can be combined with TCP enhancements such as Eifel [LK00] or D-SACK [FMMP00] that can be used for reverting the congestion window. These enhancements, however, require the TCP timestamps or selective acknowledgements to be present. Reverting the congestion control state is not encouraged by the authors at the present, but it remains a research issue to study the performance impacts of such action. Although by increasing the congestion window and slow start threshold the available network bandwidth can be utilized more rapidly, reverting the congestion control state may cause additional congestion losses on a bottleneck link. However, the algorithms for reverting the congestion control state when SACK or TCP timestamps are available, are outlined briefly below. When TCP timestamps are enabled, the TCP sender should follow the retransmission algorithm given in Section 2. However, the TCP sender MAY revert the slow start threshold and cwnd to a value preceding the RTO, if the echoed TCP timestamp in the first and second acknowledgement following the RTO are earlier than the timestamp of the retransmission, and if the two next acknowledgements advance the window. When selective acknowledgements are available with the D-SACK enhancement, the TCP sender can employ the D-SACK enhancement for deciding when to revert the congestion control state. Again, the TCP sender should follow the retransmission algorithm given in Section 2. When RTO expires, the TCP sender stores the highest segment transmitted (send_high) and initializes the counter for retransmissions made due to RTO. The TCP sender counts the number of D-SACK blocks indicating false retransmissions received with the Expires: December 2002 [Page 9] draft-sarolahti-tsvwg-tcp-frto-00.txt June 2002 acknowledgements up to send_high. If all retransmissions were indicated as duplicate segments with D-SACK blocks and the next two acknowledgements following the RTO advanced the window, the slow start threshold and cwnd MAY be reverted to the values preceding the RTO. It can be noticed that by using Eifel the TCP sender can revert the congestion control state immediately after the two acknowledgements following the RTO, whereas using D-SACK the TCP sender must wait for the whole window of data transmitted before RTO to arrive before it can make decision on undoing the congestion control state. In addition, when reverting the congestion control state, the sender should avoid injecting burst of segments into the network with some burst avoidance algorithm. 5. Security Considerations No additional security threats on TCP due to the F-RTO algorithm are known. References [ABF01] M. Allman, H. Balakrishnan, and S. Floyd. Enhancing TCP's Loss Recovery Using Limited Transmit. RFC 3042, January 2001. [APS99] M. Allman, V. Paxson, and W. Stevens. TCP Congestion Con- trol. RFC 2581, April 1999. [BBJ92] D. Borman, R. Braden, and V. Jacobson. TCP Extensions for High Performance. RFC 1323, May 1992. [FH99] S. Floyd and T. Henderson. The NewReno Modification to TCP's Fast Recovery Algorithm. RFC 2582, April 1999. [FMMP00] S. Floyd, J. Mahdavi, M. Mathis, and M. Podolsky. An Exten- sion to the Selective Acknowledgement (SACK) Option to TCP. RFC 2883, July 2000. [GL02] A. Gurtov and R. Ludwig. Evaluating the Eifel Algorithm for TCP in a GPRS Network. In Proc. of European Wireless, Flo- rence, Italy, February 2002 [LK00] R. Ludwig and R.H. Katz. The Eifel Algorithm: Making TCP Robust Against Spurious Retransmissions. ACM Computer Expires: December 2002 [Page 10] draft-sarolahti-tsvwg-tcp-frto-00.txt June 2002 Communication Review, 30(1), January 2000. [MMFR96] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow. TCP Selec- tive Acknowledgement Options. RFC 2018, October 1996. [PA00] V. Paxson and M. Allman. Computing TCP's Retransmission Timer. RFC 2988, November 2000. [Pos81] J. Postel. Transmission Control Protocol. RFC 793, Septem- ber 1981. [SKR02] P. Sarolahti, M. Kojo, and K. Raatikainen. F-RTO: A New Recovery Algorithm for TCP Retransmission Timeouts. Univer- sity of Helsinki, Dept. of Computer Science. Series of Pub- lications C, No. C-2002-07. February 2002. Authors' Addresses Pasi Sarolahti Nokia Research Center P.O. Box 407 FIN-00045 NOKIA GROUP Finland Phone: +358 50 4876607 EMail: pasi.sarolahti@nokia.com Markku Kojo University of Helsinki Department of Computer Science P.O. Box 26 FIN-00014 UNIVERSITY OF HELSINKI Finland Phone: +358 9 1914 4179 EMail: markku.kojo@cs.helsinki.fi Expires: December 2002 [Page 11]