Internet Engineering Task Force Matt Mathis INTERNET DRAFT Jeff Semke draft-mathis-tcp-ratehalving-00.txt PSC Jamshid Mahdavi Novell August 1999 Expires: February 2000 The Rate-Halving Algorithm for TCP Congestion Control Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This draft provides a detailed description of the Rate-Halving algorithm. As specified by RFC2581, Fast Recovery adjusts the congestion window (cwnd) by transmitting new data only during the second half of the recovery interval. The Rate-Halving algorithm adjusts the congestion window by spacing transmissions at the rate of one data segment per two segments acknowledged over the entire recovery period, thereby sustaining the self-clocking of TCP and avoiding a burst. Since it is largely independent of the details of the data retransmission strategy, the Rate-Halving algorithm can be used with several standard and experimental TCP implementations: NewReno, SACK, and ECN. This algorithm was first proposed by Janey Hoe [Hoe95]. 1. Introduction =============== All "Reno TCP" implementations include TCP Fast Retransmit and Fast Recovery algorithms [RFC2581]. Fast retransmit relies on three duplicate acknowledgements to trigger the retransmission of a single lost segment. Once the Fast Retransmit has occurred, TCP then waits for enough additional duplicate ACKs to arrive, indicating that half of the data in flight has left the network. Only when this has occurred will TCP send additional new data. The consequence of this delay is that the entire new window of data is transmitted in one half of one Round Trip Time (RTT). This burst can cause repeated bursts in successive RTTs following the recovery, which can result in overall additional burstiness on the network. Hoe [Hoe95] suggested that during Fast Recovery the TCP data sender space out retransmissions and new data on alternate acknowledgements across the entire recovery RTT. (Note that this eliminates the half RTT lull in sending which occurs in Reno TCP.) This document describes a Rate-Halving algorithm which implements Hoe's idea. It explains how to implement the algorithm under NewReno, SACK, and ECN-style TCP implementations. This document does not discuss the other suggestions in [Hoe95] and [Hoe96], such as a change to the ssthresh parameter during Slow-Start, or the NewReno proposal [RFC2582]. Rate-Halving has a number of other useful properties as well. It results in a slightly lower final value for cwnd following recovery, which has been suggested by Floyd and others as the more correct value. The Rate-Halving algorithm provides proper adjustments to the congestion window in response to congestion signals such as a lost segment or an ECN-Echo bit [RFC2481]. These adjustments are largely independent of the strategy used to retransmit missing segments, allowing Rate-Halving to be extended for use in other TCP implementations or even non-TCP transport protocols. Definitions of terms and variables that are used throughout the rest of the document can be found in Section 2. Section 3 provides an overview of the Rate-Halving algorithm. Section 4 presents a detailed specification of the algorithm. Section 5 contains related commentary, and Section 6 discusses interactions between Rate-Halving and other components of TCP. 2. Definitions ============== The following terms are used throughout the document, and are defined here for clarity. The definitions within quotations in this section are included from the cited documents. SENDER MAXIMUM SEGMENT SIZE (SMSS): "The SMSS is the size of the largest segment that the sender can transmit. This value can be based on the maximum transmission unit of the network, the path MTU discovery [RFC1191] algorithm, RMSS, or other factors. The size does not include the TCP/IP headers and options." from [RFC2581] ADJUSTMENT INTERVAL: The time during which TCP reduces the congestion window after experiencing congestion. REPAIR INTERVAL: The time during which TCP retransmits lost segments. RATE-HALVING STATE (rh_state): A state variable indicating what form of adjustments are being performed during adjustment intervals. The Rate-Halving states are described in Section 4.1. CONGESTION WINDOW (cwnd): "A TCP state variable that limits the amount of data a TCP can send. At any given time, a TCP MUST NOT send data with a sequence number higher than the sum of the highest acknowledged sequence number and the minimum of cwnd and rwnd." from [RFC2581]. Cwnd, as used by Reno and NewReno TCPs, is used as an offset from snd_una (the highest-sequence segment acknowledged by the receiver, indicating that all segments less than snd_una have been received). During recovery, Reno and NewReno artificially inflate cwnd as dupacks are received. RATE-HALVING CONGESTION WINDOW (rhcwnd): In order to differentiate the usage of cwnd within Rate-Halving from the cwnd used by Reno and NewReno [RFC2582], we introduce a new but comparable congestion window variable called rhcwnd. Rhcwnd is not inflated by dupacks during congestion episodes. It represents the amount of data that the sender is allowed to have outstanding. It is not considered an offset to snd_una, but rather a count of the data in flight, regardless of whether the segments contain new or retransmitted data. PRIOR_RHCWND (prior_rhcwnd): The value of rhcwnd saved when the adjustment interval begins. LOSS WINDOW (LW): "The loss window is the size of the congestion window after a TCP sender detects loss using its retransmission timer." from [RFC2581] MAX_SEQ (max_seq): The maximum sequence number transmitted for the current connection. PRIOR_MAX_SEQ (prior_max_seq): The maximum sequence number transmitted before the start of an adjustment interval. FORWARD ACKNOWLEDGEMENT (fack): The highest sequence number known to have reached the receiver, plus one. (This includes SACK information, and is different than the variable snd.una when losses have occurred). The use of the fack state variable (and the retran_data state variable, below) was originally described by Mathis [MM96]. The details for setting fack can be found in Section 6.1.2. DUPLICATE ACKNOWLEDGMENTS (dupacks): Acknowledgments that carry the same sequence number as an earlier acknowledgment. In order to qualify as a dupack, the ack must carry no data, and must not be a window update. Dupacks provide an indication that the segment following the sequence number in the dupack may have been lost, but that some future segment has been received. The state variable "dupacks" is a count of the current number of "outstanding" dupacks. By "outstanding", we mean dupacks which have been received, and which represent data not yet covered by snd.una. PARTIAL ACKNOWLEDGMENTS: Following congestion, these are "ACKs which cover new data, but not all the data outstanding when loss was detected" from [RFC2582]. FULL ACKNOWLEDGMENT: Following congestion, an ACK which covers all data outstanding when a loss was detected (i.e. an ack that is greater than or equal to prior_max_seq.) RETRANSMITTED DATA (retran_data): The amount of retransmitted data outstanding in the network. NUM_RETRANS (num_retrans): The number of bytes retransmitted during the current repair interval. ECN-ECHO ACK PACKET: "If the sender receives an ECN-Echo ACK packet (that is, an ACK packet with the ECN-Echo flag set in the TCP header), then the sender knows that congestion was encountered in the network on the path from the sender to the receiver. The indication of congestion should be treated just as a congestion loss in non-ECN-Capable TCP." from [RFC2481]. In addition to the above definitions, we use the following commonly- used TCP state variables and terms: ssthresh [RFC2581], snd.una, snd.nxt [RFC793], and segment [RFC2581]. 3. Overview of the Rate-Halving Algorithm ========================================= The Rate-Halving (RH) algorithm is designed to reduce the congestion window following the detection of congestion in the network. Rate-Halving reduces rhcwnd over one round trip by transmitting one new segment for every two segments acknowledged by the receiver. The Rate-Halving algorithm also detects the end of this round trip, employing several different checks to perform this detection most accurately. The choice of which segment to send at any given point in time is completely independent of the Rate-Halving algorithm, allowing Rate-Halving to be used simultaneously with ECN (where no retransmissions occur) [RFC2481], NewReno (where retransmissions are spread out one per RTT) [RFC2582], or SACK (where holes are filled soon after they are detected) [RFC2018]. As a result of the separation of window reduction from data retransmission, the term "recovery" ceases to have a clear meaning. We replace it with two terms: "adjustment interval" (referring to the round trip containing the window adjustment) and "repair interval" (referring to the retransmission of missing data). The adjustment interval is entered immediately upon indication of the possible onset of congestion. Two possible indications of congestion are presence of the ECN-Echo bit, or indications of a packet loss via duplicate ACK or via presence of a SACK block. During the adjustment interval, the window is reduced by sending one data segment for each two segments which are acknowledged, as per the Rate-Halving algorithm (described in detail below). At the end of the adjustment interval, additional checks are made to update ssthresh and to ensure that the final value of rhcwnd is appropriate. Due to the additional window reduction for the lost data, the final value for rhcwnd is 1/2 of the correctly delivered portion of the window of data which was in the network at the time of detection. 4. Complete Rate-Halving Algorithm ================================== This section describes the complete Rate-Halving algorithm. Most of the complexity of the specification is a result of the need for backwards compatibility with hosts that do not support SACK or ECN. For all subsections of this section, additional discussion and explanation may be found in Section 5 under the same subsection number. 4.1 Rate-Halving states Rate-Halving is specified as a state machine with the following states: RH_INCR - Rate-Halving is not performing any downward window adjustment. This corresponds to TCP being in either Slow-start or the linear increase region. RH_EXACT - Rate-Halving is reducing the window with accurate knowledge of what data has left the network. This state normally follows congestion indicated by ECN or SACK. RH_EST - Rate-Halving is reducing the window while estimating the amount of data that has left the network. This state is analagous to Reno's Fast Recovery. RH_EST_REPAIR - Rate-Halving is holding the window constant while repairing multiple holes in a manner similar to NewReno on successive round trips after the window adjustment has completed. +----------------+ +----------------+ | RH_INCR |-------------------->| RH_EXACT | | | 4.4 New SACK or ECN | | | 4.2 Grow rhcwnd| | 4.6 Reduce | | |<--------------------| rhcwnd | | 4.3 Transmit | 4.8 Minor reorder | | | | | 4.3 Transmit | | |<--------------------| | | | 4.10 New data ACKed | | | | 4.14 Bound | | | | | | | |<--------------------| | | | 4.15 Retransmit | | | | Timeout | | +----------------+ +----------------+ ^ ^ ^ ^ | | | | | | | | | | | | | v | | | | | | 4.14| | | | v +-----------+ Bound| | | | | | 4.5 Dupacks | | | | +---------------------------+ 4.13| | | | 4.5 Dupacks | Adjust| | | +---------------+ | | | | | | 4.11| | +---------------+ | | Full| | | | | Ack| +---------------+ | | | | | | | v +----------------+ | | | +----------------+ | RH_EST_REPAIR | | | +-<---------| RH_EST | | | | | 4.8 Reorder | | | 4.9 Partial ACK| | | | 4.7 Reduce | | | | +-<------------| rhcwnd | | 4.3 Transmit | | 4.11 Full Ack| | | | | 4.13 Adjust | 4.3 Transmit | | | ^ 4.14 Bound | | | | | | 4.9 Partial ACK| | |->-+---------<-------| | | | 4.15 Timeout | | | | | | | |<--------------------| | | | 4.12 Exit if rhcwnd | | | | <= prior_rhcwnd/2 | | | | | | | |-------------------->| | | | 4.5 Re-enter | | +----------------+ +----------------+ 4.2 Increasing Rhcwnd Upon receipt of a new ACK in RH_INCR, apply Congestion Avoidance or Slow-start per RFC2581 only when: snd.nxt - fack + retran_data + SMSS >= rhcwnd 4.3 Data Transmission While honoring all other sending constraints (such as receiver window), in any state, transmit data of length "len" if (snd.nxt - fack + retran_data + len) < rhcwnd If the data is a retransmission, retran_data += len 4.4 Entering RH_EXACT State Transition from RH_INCR to RH_EXACT if the ACK either carries an ECN-Echo bit, or indicates a new hole via a SACK option. Store the following information: prior_rhcwnd = rhcwnd prior_max_seq = max_seq 4.5 Entering RH_EST State From either RH_INCR or RH_EXACT, transition to RH_EST upon receipt of a dupack with no SACK options. From the RH_EST_REPAIR state, transition to RH_EST upon receipt of an ACK with the ECN-Echo bit set. If the transition is from RH_INCR or RH_EST_REPAIR to RH_EST, store the following information: prior_rhcwnd = rhcwnd prior_max_seq = max_seq 4.6 Reduction of Rhcwnd in RH_EXACT State For each ACK received while in state RH_EXACT, reduce rhcwnd by 1/2 the distance fack advances plus 1/2 the size of any new holes. 4.7 Reduction of Rhcwnd in RH_EST State For each dupack received while in state RH_EST: rhcwnd -= SMSS/2 fack = min((fack + SMSS), max_seq) dupacks += 1 4.8 Detection of Reordering; Return to RH_INCR Reordering detection applies only if an ECN-Echo has not been received and a retransmission has not been sent since the adjustment interval began. In RH_EXACT, reordering has occurred if an ACK arrives carrying no ECN-Echo bit and no SACK blocks. In RH_EST, reordering has occurred if a partial or full ACK arrives. If reordering is detected, transition to RH_INCR and restore the value of rhcwnd, without performing the bounding checks of 4.14: rhcwnd = prior_rhcwnd; 4.9 Processing Partial Acks in RH_EST / RH_EST_REPAIR For each partial acknowledgement received in RH_EST or RH_EST_REPAIR: 1. Set retran_data to 0. 2. Reduce dupacks by [(bytes of data acked)/SMSS - 1]. 3. For the purpose of data transmission, note that a new hole exists at snd.una. The new hole is eligible for retransmission if dupacks > 3, after being reduced in step 2. 4.10 Completion of RH_EXACT; Return to RH_INCR In RH_EXACT, Rate Halving is complete if one of the following occurs (indicating that a round trip time has passed): 1. An ACK or SACK arrives which acknowledges data beyond prior_max_seq. 2. An ACK or SACK arrives which acknowledges a segment retransmitted after entering the RH_EXACT state. Transition to state RH_INCR and perform the bounding checks in 4.14. 4.11 Completion of RH_EST and RH_EST_REPAIR; Return to RH_INCR In state RH_EST or RH_EST_REPAIR, adjustment and repair are both complete when a full acknowledgement is received (i.e. snd.una >= prior_max_seq). When this occurs, transition to state RH_INCR, perform the estimation adjustment in 4.13, followed by the bounding checks in 4.14. 4.12 "NewReno" Case: Transition from RH_EST to RH_EST_REPAIR In state RH_EST, adjustment (but not repair) is complete when rhcwnd <= prior_rhcwnd / 2 Transition from state RH_EST to state RH_EST_REPAIR. 4.13 Corrections for Estimation upon return to RH_INCR When adjustment and repair are complete after an estimated recovery, perform the following adjustment to compute the correct value for rhcwnd: rhcwnd = (prior_rhcwnd - num_retrans) / 2; 4.14 Application of Bounding Paremeters When transitioning back to RH_INCR, update the TCP state variables as follows: if (rhcwnd > prior_rhcwnd/2) rhcwnd = prior_rhcwnd/2; ssthresh = rhcwnd; if (ssthresh < prior_rhcwnd/4) ssthresh = prior_rhcwnd/4; Check for a new condition that would cause entry into another adjustment interval (e.g. the presence of SACK blocks reporting a new hole (beyond prior_max_seq). 4.15 Retransmission Timeouts If a timeout occurs while in a non-RH_INCR state, set ssthresh = prior_rhcwnd / 2 rhcwnd = LW and return to the RH_INCR. Otherwise, if a timeout occurs while in RH_INCR, set ssthresh = rhcwnd /2 rhcwnd = LW 5. Discussion of Rate-Halving Algorithm ======================================= Each of the subsections of Section 5 provide discussion for the corresponding subsection of Section 4. 5.1 Rate-Halving States The Rate-Halving algorithm may be viewed as a finite state machine with four states. These states are independent of whether the TCP connection is in Congestion Avoidance or Slow-start. 5.2 Increasing Rhcwnd The left hand term in 4.2 includes all outstanding transmitted and retransmitted segments that are still in the network, plus one additional segment to account for the effects of rounding due to segmentation. This restriction assures that rhcwnd only grows as long as TCP actually succeeds in injecting enough data into the network to test the path. If TCP is throttled by something other than rhcwnd, then there is no assurance that rhcwnd of data will actually fit into the network. This test should be done prior to taking any data which was ACKed or SACKed in the incoming segment into account. This may be accomplished either by performing this check immediately upon receiving an ACK, or by remembering the total amount of new and retransmitted data ACKed or SACKed by the segment, and adding this to the term on the left hand side. Since it is never safe to open the window while estimating what data has left the network, Reno and NewReno can never open the window while dupacks are present - e.g during any adjustment or repair interval. 5.3 Data Transmission Rhcwnd always specifies an amount of data allowed to be in flight in the network. The expression (snd.nxt - fack + retran_data + len) is a measure of the outstanding data. Data may be transmitted whenever the amount of outstanding data is less than the amount allowed by rhcwnd. Contrast this to Reno, where cwnd is used as an offset to snd.una, specifying which sequence numbers may be sent. As a consequence the cwnd variable is overloaded during recovery (e.g. setting cwnd=SMSS to force Fast Retransmit.) To illustrate that the new semantics for rhcwnd in Rate-Halving are compatible with those of cwnd in Reno, observe that in Reno the boundary between being able to send data and not send data is: snd.nxt = snd.una + cwnd where cwnd = rhcwnd - retran_data + dupacks*SMSS and retran_data = 0 or 1 depending on whether a retransmission has been sent. Substituting and rearranging the Reno equations yields equation 1: (1) rhcwnd = snd.nxt - snd.una - dupacks*SMSS + retran_data In Rate-Halving, the boundary occurs at: awnd = rhcwnd, where awnd = snd.nxt - fack + retran_data and fack = snd.una + losses + SACKed data Substituting and rearranging the Rate-Halving equations yields: (2) rhcwnd = snd.nxt - snd.una - losses - SACKed data + retran_data It can be seen that equations 1 and 2 are quite similar. The losses known by Rate-Halving (via SACK options) are not known to Reno. In addition, the amount of data that has left the network (indicated by SACK options to Rate-Halving) must be estimated by Reno as dupacks*SMSS. Reno only allows one retransmitted segment per round trip, while Rate-Halving permits (and keeps track of) more. When SACK options are not used by the receiver, the Rate-Halving sender must estimate by using equation 1. Similar analysis can be applied to The "pipe" algorithm used for SACK in the "ns" [NS] simulator and some SACK implementations. 5.4 Entering RH_EXACT State Note that the sender can infer congestion by receiving a segment with the ECN-Echo bit set, or by receiving an Ack carrying a SACK option which indicates a new hole. In both of these cases, the amount of data that has left the network is known, either because no data was lost, or the amount of data that was lost is reported by SACK options. In the case of a suspected loss, enter the adjustment interval immediately. While this may seem to be a major change to TCP, in fact we retain the "3 Dupacks" check (and make it more robust) which is included in Reno TCP. Segment retransmissions are determined by the SACK scoreboard, as discussed in section 6.2.1. During the early portion of the adjustment interval, one new segment will generally be transmitted prior to retransmitting the lost segment which triggered the adjustment. The algorithm will detect if the segments have merely been reordered, and the adjustment is terminated (see section 5.8 below). Prior_rhcwnd and prior_max_seq will be used for end-of-adjustment checks. 5.5 Entering RH_EST State The significance of this state is that the amount of data in the network is estimated, since exact information is not available through SACK options. If the prior state was RH_EST_REPAIR, then this is a new reduction triggered by detecting a new ECN prior to the completion of a NewReno-style repair. The sender can also infer congestion by receiving duplicate acknowledgements. In this case, it is known that a segment was lost or misordered, but the number of missing segments is not known. The number can only be estimated by counting the number of duplicate ACKs received. Even if RH_EXACT was entered with an ECN-Echo, the RH_EST state can be entered from RH_EXACT if a dupack is later received. 5.6 Reduction of Rhcwnd in RH_EXACT State Rather than counting received ACKs, the distance (in bytes) that fack advances is explicitly used. This ensures that the window is properly reduced even when the receiver is sending delayed ACKs during congestion episodes (such as with ECN). Each SACK option that indicates a new hole triggers a reduction of rhcwnd by all of data that was lost plus 1/2 the data that was received. During RH, we reduce rhcwnd as data leaves the network. The net effect is to cause the transmission of one new segment for every two segments reported by the receiver. 5.7 Reduction of Rhcwnd in RH_EST State Assume that each dupack is for a segment of size SMSS, and that the rate is being halved. This estimate is designed to result in rhcwnd and fack values that are reasonable, given that more exact information is not available about which segments have left the network. These estimates are comparable to (and no worse than) the estimates that Reno uses during Fast Recovery. When the repair interval ends, final adjustments recalibrate rhcwnd and fack to their correct values. 5.8 Detection of Reordering; Return to RH_INCR Note that there is a choice of what do with the congestion window when segment reordering is detected. In principle, rhcwnd could be set to an arbitrary function of prior_rhcwnd. For example, one could choose to penalize networks which reorder segments by forcibly reducing rhcwnd: rhcwnd = prior_rhcwnd / 2; A gentler form of reordering penalty would be to leave rchwnd as updated by Rate-Halving (Reduce rhcwnd as indicated in section 4.6 or 4.7 until the reordering has been detected). The choice to restore rhcwnd results in no penalty for reordering: rhcwnd = prior_rhcwnd; This is functionally equivalent to today's Reno implementations, and may be less bursty in some cases. (RH is less bursty in cases in which a single new segment is transmitted prior to detecting the reordering, resulting in a burst which is one segment smaller than the equivalent Reno burst). The optimal action is probably between these extremes, and should be the subject of future research. 5.9 Processing Partial Acks in RH_EST / RH_EST_REPAIR A partial ACK indicates that one hole has been filled (since it is now covered by an ACK) and that another one exists. In step 1, retran_data is cleared since the retransmitted segments left the network when the partial ACK was generated. Some of the segments that generated duplicate ACKs are acknowledged by the partial ACK. These segments must be accounted for in the count of dupacks, so step 2 reduces dupacks by the amount of data that has just been acknowledged. A new hole is indicated by the partial ack one round trip after the retransmission for the previous hole was sent. The only way in which the segment indicated by the partial ack would be falsely retransmitted would be if the segment was reordered (delayed) by more than a round trip time. If the previous hole was caused by a reordered segment, then the partial ack may appear without sending a retransmission (and with dupacks < 3). In this case, the new hole is not eligible for retransmission until dupacks >= 3. 5.10 Completion of RH_EXACT; Return to RH_INCR Under severe reordering (after retransmission) condition 2 may cause a premature exit from the adjustment interval. This would leave rhcwnd too large. However the bounding check in 4.14 will force the required 1/2 window reduction. Case 2 might be particularly tricky to implement for cases in which SACK, ECN, and NewReno are all supported. Here are some guidelines for implementation: ACKs or SACKs which show the arrival of retransmitted data may not be reliable indicators of case 2. This is because it is possible (and even likely) for the data repair interval to continue well beyond the end of the adjustment interval. Should a second adjustment interval begin as a result of subsequent lost data, it is possible for holes to be filled which don't satisfy case 2. Therefore, a more rigorous test is needed for retransmitted segments. Assuming that there is a SACK scoreboard, one possible implementation is to store an adjustment interval ID number with each retransmission. This ID will clearly indicate during which adjustment interval a retransmission occurred. (The ID may also be "zero" indicating the retransmissions occurred outside of an adjustment interval). ACKs and SACKs which are received for retransmissions can then be definitively tested against condition 2. 5.11 Completion of RH_EST and RH_EST_REPAIR; Return to RH_INCR A full ack indicates that all segments that were outstanding at the first indication of congestion have been successfully received. Therefore, the adjustment and repair intervals should be completed, but final adjustments and checks must be made to ensure the validity of state variables. 5.12 "NewReno" Case: Transition from RH_EST to RH_EST_REPAIR When the window has been reduced by half, but a full ack has not yet been received, it is necessary to end the adjustment interval, holding the window constant while the repair interval completes. In the RH_EST_REPAIR state, rhcwnd is not increased or decreased. While in this state, retransmissions will be sent once per round trip time, as in NewReno until a full ACK ends the repair interval. 5.13 Corrections for Estimation upon return to RH_INCR The final adjustment ensures that the resulting congestion window is half of the data that successfully passed through the network during the congested round trip. 5.14 Application of Bounding Paremeters At the end of the RH_EXACT state, rhcwnd has been set to (prior_rhcwnd-loss)/2, where loss is the number of bytes of data that have been lost during the current adjustment interval. In some cases, it is possible for rhcwnd to be set to an invalid result (such as severe reordering mentioned above or improper acknowledgement strategies of the receiver). Rate-Halving includes a set of Bounding Parameters which are used to guarantee that the Rate-Halving algorithm will make appropriate adjustments to the window. The prior_rhcwnd/2 check guarantees that, no matter what else happens, rhcwnd will always be reduced by a factor of two. The prior_rhcwnd/4 check ensures that the TCP connection is not unduly harmed by extreme network conditions. The choice to set ssthresh (rather than rhcwnd) to prior_rhcwnd/4 prevents sending a burst into the network as a result of suddenly increasing rhcwnd. A short period of Slow-start will follow, allowing rhcwnd to quickly reach the bounding parameter of prior_rhcwnd/4. Note that it is possible to exit the adjustment interval on an ACK, and re-enter a new adjustment interval as a result of the same acknowledgement. For example, an ACK may carry both acknowledgement of retransmitted data, causing an exit per 4.10 case 2, and an ECN-Echo bit (indicating that the retransmitted segment was marked), causing a new adjustment interval per 4.4. 5.15 Retransmission Timeouts Timeout behavior is nearly identical to Reno. Since rhcwnd is reduced during non-RH_INCR states, it is necessary to use prior_rhcwnd to prevent an overly agressive adjustment in the ssthresh calculations. See section 6.2.2 for information about SACK reneging (SACK receiver discards data that it has already sent SACK blocks for [RFC2018]). Do not clear the exponential backoff multiplier after the timeout if the ECN-Echo bit is set on the ACK that results from the retransmission. 6. Additional Implementation Notes ================================== 6.1 ACK processing 6.1.1 Order of actions to be taken Upon receiving an acknowledgement, the following actions should be taken: 1. Save a copy of the following state variables for use during processing of this ACK: snd.una, fack, retran_data. 2. Update the scoreboard, snd.una, fack, and retran_data. 3. Note if the ECN-Echo bit is set. 4. Check for exit conditions from RH states (4.8, 4.10, 4.11). 5. Check for transition to RH_EST_REPAIR state (4.12). 6. Process partial ACKs (4.9). 7. Check for entrance conditions for RH (4.4, 4.5). 8. Adjust the window using state variables as they existed at the time the ACK was received (4.2, 4.6, 4.7). 9. If the window allows (4.3), send a new segment or a retransmission as determined by the scoreboard (6.2.1). 6.1.2 Setting FACK When in the RH_EXACT or RH_INCR state: When each ACK is received, set the fack state variable to be the maximum of: (1) the highest sequence number which has been acknowledged plus one, and (2) the highest sequence number which has appeared in a SACK block plus one. When in the RH_EST or RH_EST_REPAIR state: When each ACK is received, set the fack state variable to be snd_una + (1 + dupacks) * SMSS. 6.2 Data Recovery 6.2.1 SACK Scoreboard A SACK scoreboard keeps track of data blocks that have been received after some data has been lost, as indicated by SACK options. The data transmission rule of 4.3 indicates only WHEN to send data (based on congestion control), not WHAT data to send (based on reliable delivery). The scoreboard keeps track of what data has been received, and indicates what data is eligible for retransmission. Reno TCP uses a threshold of 3 dupacks to determine that a missing segment has been lost, not just reordered. By immediately entering the adjustment interval, not only can a Rate-Halving implementation apply the dupack threshold to the first loss, but also to all subsequent losses. Mathis and Mahdavi presented a method for implementing this logic, termed thresholded retransmissions, in the scoreboard [MM97]. A data block is eligible for retransmission only after the hole has been observed at least three times, as indicated by dupacks, or SACK options with sequence numbers that are higher than the data block in question. If no data is eligible for retransmission, new data may be sent when allowed by the congestion window and the receiver's advertised window (4.3). When no SACK information is available, such as in states RH_EST or RH_EST_REPAIR, a single-element scoreboard may be used to store loss and retransmission state information. 6.2.2 Retransmission timeout/SACK reneging RFC2018 recommends (with a SHOULD) that the scoreboard be cleared when a timeout is experienced, since the timeout might indicate that the SACK receiver may have reneged (discarded data that it has already sent a SACK option for). If this RFC2018 recommendation is followed, then perform the following actions in addition to the actions specified in 4.15: snd.nxt = snd.una fack = snd.una retran_data = 0 It is believed to be safe to relax the SHOULD in RFC2018, such that when a retransmission timeout is experienced, the scoreboard can be kept in order to retain the information about which segments have already been received. When allowing the scoreboard to be maintained through a timeout (conflicting with RFC2018), the following actions must be performed in addition to the actions specified in 4.15: snd.nxt = fack retran_data = 0 In this case, snd.nxt is pulled back to fack, which is the highest sequence number known to have reached the receiver. Normal transmissions will begin at this sequence number, and retransmissions will continue to be issued by the scoreboard. (Following the timeout, fack will advance to the highest ACK or SACK block received. snd.nxt should be pulled ahead any time that fack > snd.nxt.) If the scoreboard is not discarded during a timeout, the sender must be able to detect that the receiver has reneged. If snd.una advances to a block stored in the scoreboard, the sender knows that a SACK renege has occurred, and must clear the scoreboard, setting fack = snd.una snd.nxt = snd.una retran_data = 0 There are three options for what should happen to the window when the sender detects that the receiver has reneged: 1) force a retransmit timeout, 2) cut the window in half, or 3) do nothing to the window. More experimentation is needed to determine the correct approach. 7. Acknowledgements ==================== The authors would like to acknowledge Janie Hoe's work, which introduced the concepts of sending new data during recovery and spacing out data segments during the round trip following a loss. The authors would also like to thank Van Jacobson and Sally Floyd for inspirational conversations about many of the ideas in this document. In addition, the authors are grateful to Mark Allman, Jitendra Padhye, and Khrys Myrddin for their constructive feedback about earlier drafts of this document. 8. References ============== [Hoe95] J. Hoe, Startup Dynamics of TCP's Congestion Control and Avoidance Schemes. Master's Thesis, MIT, 1995. URL "http://ana- www.lcs.mit.edu/anaweb/ps-papers/hoe-thesis.ps". [Hoe96] J. Hoe, Improving the Start-up Behavior of a Congestion Control Scheme for TCP. In ACM SIGCOMM, August 1996. URL "http://www.acm.org/sigcomm/sigcomm96/program.html". [MM96] M. Mathis, J. Mahdavi, "Forward Acknowledgement: Refining TCP Congestion Control," SIGCOMM 96, August 1996. [MM97] M. Mathis, J. Mahdavi, "TCP Rate-Halving with Bounding Parameters," December 1997. URL "http://www.psc.edu/networking/papers/FACKnotes/current/". [NS] The UCB/LBNL/VINT Network Simulator (NS). URL "http://www- mash.cs.berkeley.edu/ns/". [RFC793] J. Postel, "Transmission Control Protocol," RFC793, September 1981. [RFC1191] J. Mogul, and S. Deering, "Path MTU Discovery," RFC1191, November 1990. [RFC2018] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, "TCP Selective Acknowledgment Options," RFC2018, October 1996. [RFC2481] K. Ramakrishnan, and S. Floyd, "A Proposal to add Explicit Congestion Notification (ECN) to IP," RFC2481, January 1999. [RFC2581] M. Allman, V. Paxson, and W. Stevens, "TCP Congestion Control," RFC2581, April 1999. [RFC2582] S. Floyd, and T. Henderson, "The NewReno Modification to TCP's Fast Recovery Algorithm," RFC2582, April 1999. 9. Security Considerations ========================== RFC 2581 discusses general security considerations concerning TCP congestion control. This document describes a specific algorithm that conforms with the congestion control requirements of RFC 2581, and so those considerations apply to this algorithm, too. There are no known additional security concerns for this specific algorithm. AUTHORS' ADDRESSES Matthew Mathis and Jeffrey Semke Pittsburgh Supercomputing Center 4400 Fifth Avenue Pittsburgh, PA 15213 EMail: mathis@psc.edu, semke@psc.edu Jamshid Mahdavi Novell, Inc. Mailstop: SJF-B492 2211 North First Street San Jose, CA 95131 EMail: mahdavi@novell.com, kml@novell.com