Network Working Group                                          N. Demizu
Internet-Draft                                                      NICT
Expires: June 29, 2006                                 December 29, 2005

TS2 --- A Modified TCP Timestamps Mechanism

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than a "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.html

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

Copyright Notice

Copyright (C) The Internet Society (2005). All Rights Reserved.

Demizu                     Expires June 2006                    [Page 1]
Internet-Draft                                             December 2005

Abstract

This memo proposes a modified TCP Timestamps mechanism called "TS2". It uses the existing TCP Timestamps option specified in RFC1323 and a new TCP option called "the TCP Old Timestamps option", which is specified in this memo. As a fallback, an RFC1323-compatible mode called "TS1" is also available.

The base mechanism of TS2 includes the definitions of those two TCP Timestamps options, mode negotiation to enable TS1 or TS2, and a rule for updating internal states.
The applied mechanisms of TS2 include an accurate RTT measurement mechanism that works even for duplicate ACK segments (RTTM/TS2), a reordering-robust mechanism to detect wrapped sequence numbers (PAWS/TS2), a lightweight mechanism to detect spoofed segments (PASA/TS2), a loss detection mechanism applicable to both original and retransmitted data segments (DLD/TS2), and a spurious retransmission detection mechanism that operates without waiting for one RTT by sending arbitrary in-window data (SRD/TS2).

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Two TCP Timestamps Options
   4.  Base Mechanism
   5.  RTTM (Round Trip Time Measurement)
   6.  PAWS (Protection Against Wrapped Sequence numbers)
   7.  PASA (Protection Against Spoofing Attacks)
   8.  DLD (Data Loss Detection)
   9.  SRD (Spurious Retransmission Detection)
   10. Security Considerations
   11. IANA Considerations
   12. Acknowledgements
   13. References
   Author's Address
   Appendix A: TS2 Reference
   Appendix B: Alternative Ideas
   Appendix C: Loss Detection With SACK and DLD/TS2
   Appendix D: Summary of TCP Timestamps Option in RFC1323
   Appendix E: Issues with TCP Timestamps Option in RFC1323
   Appendix F: Problem of PAWS in RFC1323 and Reordering
   Copyright Statement and Intellectual Property

1. Introduction

This memo proposes a modified TCP Timestamps mechanism called "TS2". It uses the existing TCP Timestamps option [RFC1323] and a new TCP option called the TCP Old Timestamps option, which is specified in this memo. The significant differences between TS2 and the TCP Timestamps option specified in [RFC1323] are the rule to determine which timestamp is echoed and the timestamp unit. In addition, TS2 solves the issues with the existing TCP Timestamps option specified in [RFC1323], as described in Appendix E.

As a fallback, an RFC1323-compatible mode called "TS1" is also available. The use of TS1 or TS2 is negotiated using the two options on SYN and SYN+ACK segments in the TCP three-way handshake phase.

TS2 enables several applied mechanisms, as follows. When TS2 is enabled on a TCP connection, a local node MAY enable one or more of these mechanisms on the TCP connection without additional negotiation with a remote node.

- RTTM/TS2 (Round Trip Time Measurement with TS2) enables accurate RTT measurements even when a duplicate ACK segment is received.

- PAWS/TS2 (Protection Against Wrapped Sequence numbers with TS2) is a reordering-robust protection mechanism for wrapped sequence numbers.

- PASA/TS2 (Protection Against Spoofing Attacks with TS2) is a lightweight protection mechanism against spoofing attacks that inject faked SYN, data, FIN, and RST segments.

- DLD/TS2 (Data Loss Detection with TS2) detects losses of both original and retransmitted data segments.

- SRD/TS2 (Spurious Retransmission Detection with TS2) detects spurious retransmission without waiting for one RTT, by sending arbitrary in-window data.

Note: The procedures described in this memo have been demonstrated neither by simulation nor by implementation.

2. Terminology

2.1 General

This memo uses the same variable names and TCP state names defined in section 3.2 of [RFC793].
In addition, it introduces the following variables and notations:

   SND.MAX holds the maximum value of SND.NXT;
   SSEG.XYZ means the XYZ field on the segment being sent; and
   RSEG.XYZ means the XYZ field on the segment just received.

The memo uses the following abbreviations defined in [RFC2988]: RTO (Retransmission TimeOut), RTT (Round-Trip Time), SRTT (Smoothed RTT), and RTTVAR (RTT VARiation).

The memo uses the following abbreviations defined in [RFC2581]: SMSS (Sender Maximum Segment Size), and RMSS (Receiver Maximum Segment Size).

According to [RFC793], SEG.LEN includes the SYN and FIN bits. Thus, segments satisfying (RSEG.LEN > 0) include data, SYN, and FIN segments. If a RST segment has data, this memo does not consider that it satisfies (RSEG.LEN > 0). For simplicity, the term "data segments" often means "data, SYN, and/or FIN segments" in this memo.

The memo refers to the initial transmission of an octet as the "original transmission", and to a subsequent transmission of the same octet as a "retransmission" [RFC3522][RFC4015]. A data segment for which part or all of its octets are sent by original transmission is referred to as an "original data segment". Other data segments are referred to as "retransmitted data segments".

All arithmetic dealing with TCP sequence numbers must be performed modulo 2^32.

2.2 Requirements

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

This memo makes use of conceptual variables to describe behavior. The specific variable names, and how their values are referred to and changed, are provided here to demonstrate behavior. Implementations are not required to follow the memo exactly, as long as their external behavior is consistent with that described here.

3. Two TCP Timestamps Options

TS2 uses two TCP options: the TCP Timestamps option specified in [RFC1323], and a new TCP option called the TCP Old Timestamps option, as specified in this section.

3.1 TCP Timestamps Option

Figure 3-1 shows the format of the TCP Timestamps option. For simplicity, it is hereafter called "the TS option".

                   +-------+-------+
                   |Kind=8 |Len=10 |
   +---------------+-------+-------+
   |       TSval (TS Value)        |
   +-------------------------------+
   |     TSecr (TS Echo Reply)     |
   +-------------------------------+

   Figure 3-1: The TS option

3.2 TCP Old Timestamps Option

The format of the TCP Old Timestamps option has two forms as given below. The option-kind value is <>.

On SYN and SYN+ACK segments, the TCP Old Timestamps option consists of only two octets (option-kind and option-length), as shown in Figure 3-2. The purpose of this form is to negotiate the use of TS2. For simplicity, this form is hereafter called "the OTS_OK option".

   +-------+-------+
   |Kind=??|Len=2  |
   +-------+-------+

   Figure 3-2: The OTS_OK option

On other segments, the format of the TCP Old Timestamps option is the same as that of the TS option, except for the option-kind value, as shown in Figure 3-3. The purpose of this form is to inform a remote node that the TSecr value is not fresh. In contrast, the TSecr value in the TS option is fresh. For simplicity, this form is hereafter called "the OTS option".

                   +-------+-------+
                   |Kind=??|Len=10 |
   +---------------+-------+-------+
   |       TSval (TS Value)        |
   +-------------------------------+
   |     TSecr (TS Echo Reply)     |
   +-------------------------------+

   Figure 3-3: The OTS option

3.3 TSval and TSecr Fields

In both the TS option and the OTS option, the TSval field contains the timestamp when the segment is sent, while the TSecr field contains the TS.Recent value, which is updated by the received TSval values, as specified in the base mechanism.
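As a rough illustration of the layouts in Figures 3-1 to 3-3, the following C sketch serializes the 10-octet TS/OTS form and the 2-octet OTS_OK form. The function names are hypothetical and not part of this memo. Because the option-kind value of the TCP Old Timestamps option is not assigned here, it is passed in by the caller as a placeholder. The sketch assumes that fields are written in network byte order, as is usual for TCP options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_TS_KIND   8   /* Kind of the TS option [RFC1323] */
#define TCPOPT_TS_LEN    10  /* Length of the TS and OTS options */
#define TCPOPT_OTSOK_LEN 2   /* Length of the OTS_OK option */

/* Store a 32-bit value in network byte order (big endian). */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Encode a TS or OTS option; only the kind octet differs between them.
 * 'out' must have room for 10 octets. Returns the octets written. */
static size_t encode_ts_option(uint8_t *out, uint8_t kind,
                               uint32_t tsval, uint32_t tsecr)
{
    assert(out != NULL);
    out[0] = kind;
    out[1] = TCPOPT_TS_LEN;
    put_be32(out + 2, tsval);   /* TSval (TS Value) */
    put_be32(out + 6, tsecr);   /* TSecr (TS Echo Reply) */
    return TCPOPT_TS_LEN;
}

/* Encode the OTS_OK option used on SYN and SYN+ACK segments.
 * 'ots_kind' is a placeholder for the unassigned option-kind value. */
static size_t encode_ots_ok_option(uint8_t *out, uint8_t ots_kind)
{
    assert(out != NULL);
    out[0] = ots_kind;
    out[1] = TCPOPT_OTSOK_LEN;
    return TCPOPT_OTSOK_LEN;
}
```

A caller would pass TCPOPT_TS_KIND to emit the TS option, or its chosen placeholder kind to emit the OTS option with the same body layout.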
When TS2 is enabled, the timestamp unit is fixed at 1 usec (10^-6 seconds). When TS1 is enabled, the timestamp unit is between 1 second and 1 ms (10^-3 seconds), in order to be compatible with [RFC1323]. All arithmetic dealing with timestamps must be performed modulo 2^32.

Since the external timestamp unit for TS2 is fine (i.e., 1 usec) compared to today's possible RTTs, some lower bits of an external timestamp might be usable as a cookie or nonce.

Note that the granularity of a timestamp is a different concept from the unit of a timestamp. Timestamps are internally generated from an internal tick count or a real time clock. The granularity of a timestamp is the interval at which the timestamp source is updated, so that the resulting timestamp changes. With TS1, since the timestamp unit can be chosen between 1 second and 1 ms, it would be simplest to make it the same as the granularity of a timestamp in many implementations. In contrast, with TS2, since the timestamp unit is fixed at 1 usec, the granularity will be much longer than the unit in most implementations. In any case, the granularity of a timestamp cannot be shorter than the unit of a timestamp.

In this memo, "internal timestamp" means a timestamp generated directly from a timestamp source such as an internal tick count or a real time clock. "External timestamp" means a timestamp exchanged in the TSval and TSecr fields. An external timestamp is calculated as an internal timestamp plus a variable offset, as specified in the base mechanism.

4. Base Mechanism

This section describes how the timestamp mode (i.e., none, TS1, or TS2) is negotiated, how the timestamps option kind is chosen (i.e., the TS option or the OTS option), and how the values of SSEG.TSval and SSEG.TSecr are computed.
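The relation between internal and external timestamps defined in Section 3.3, together with the modulo 2^32 arithmetic it requires, might be sketched in C as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and the timestamp source is assumed to be a tick counter that advances every 1 ms, so the TS2 granularity would be 1000 units of 1 usec.

```c
#include <stdint.h>

/* Assumed timestamp source: a tick counter that advances every 1 ms.
 * The TS2 unit is 1 usec, so one tick corresponds to 1000 units. */
static uint32_t ts2_internal_from_ms(uint64_t ticks_ms)
{
    /* The cast keeps the low 32 bits, i.e. the value modulo 2^32. */
    return (uint32_t)(ticks_ms * 1000u);
}

/* External timestamp = internal timestamp + offset, modulo 2^32.
 * Unsigned 32-bit addition wraps around, which gives the modulo. */
static uint32_t external_ts(uint32_t internal_ts, uint32_t snd_off)
{
    return internal_ts + snd_off;
}

/* Signed difference (a - b) modulo 2^32. The result is negative when
 * 'a' is older than 'b', provided the two are less than 2^31 apart.
 * Relies on the usual two's complement conversion to int32_t. */
static int32_t ts_diff(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b);
}
```

Comparisons such as (RSEG.TSval >= X) later in the memo can then be read as (ts_diff(RSEG.TSval, X) >= 0), which stays correct when the 32-bit timestamp wraps.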
4.1 Variables

The base mechanism uses the following variables: TS.Req (integer), TS.Mode (integer), TS.SndOff (32bit-timestamp), TS.SndAdj (32bit-timestamp), TS.Recent (32bit-timestamp), TS.RecentIsOld (boolean), and Last.Ack.Sent (32bit-sequence-number). Among these variables, only TS.Req, TS.Mode, and TS.SndOff are referred to by the applied mechanisms.

TS.Req contains the requested mode (0 = none, 1 = TS1, or 2 = TS2) of a TCP connection to be established.

TS.Mode records the result of the mode negotiation. Its initial value is negative, which means "negotiation has not been completed".

TS.SndOff holds an offset for converting an internal timestamp value to an external timestamp value. More specifically, the current external timestamp value is calculated as the current internal timestamp value plus TS.SndOff. It is also used to avoid reusing the same range of TSval values when a TCP Control Block is reused. In addition, it is used by PASA-DF/TS2 to randomize the initial timestamp values of TCP connections.

TS.SndAdj is used to adjust TS.SndOff when TS2 is enabled through the mode negotiation.

TS.Recent holds the value to be echoed in the TSecr fields of both the TS option and the OTS option. Its initial value is zero.

TS.RecentIsOld is accessed only when TS2 is enabled. It is true if no segment that satisfies (RSEG.LEN > 0) and carries the TS option or the OTS option has been received since the TS.Recent value was last echoed.

Last.Ack.Sent holds the last SSEG.ACK value sent, which is equal to the maximum SSEG.ACK value sent.

Note: TS.Recent and Last.Ack.Sent are inherited from [RFC1323].

4.2 Mode Negotiation

This subsection describes the procedure of mode negotiation using the two TCP Timestamps options on SYN and SYN+ACK segments in the TCP three-way handshake phase.
When a SYN segment is sent to establish a TCP connection, if TS2 is requested, the SYN segment SHOULD carry both the TS option and the OTS_OK option. If TS1 is requested, the SYN segment SHOULD carry the TS option, and it MUST NOT carry the OTS_OK option. If neither TS1 nor TS2 is requested, the SYN segment MUST NOT carry the TS option nor the OTS_OK option.

If a SYN (without ACK) segment is received in the LISTEN or SYN-SENT state, and if the received segment does not carry one or both of the TS option and the OTS_OK option, the SYN+ACK segment sent in reply MUST NOT carry the OTS_OK option. Similarly, if the received segment does not carry the TS option, the SYN+ACK segment sent in reply MUST NOT carry the TS option nor the OTS_OK option.

When a SYN segment is received in the LISTEN or SYN-SENT state, if TS2 is requested, the SYN+ACK segment sent in reply SHOULD carry both the TS option and the OTS_OK option as long as the above rule allows. If TS1 is requested, the SYN+ACK segment sent in reply SHOULD carry the TS option as long as the above rule allows, and it MUST NOT carry the OTS_OK option. If neither TS1 nor TS2 is requested, the SYN+ACK segment MUST NOT carry the TS option nor the OTS_OK option.

On SYN and SYN+ACK segments in the TCP three-way handshake phase, if both the TS option and the OTS_OK option are exchanged, TS2 is enabled. If the TS option is exchanged but the OTS_OK option is not exchanged, TS1 is enabled. If the TS option is not exchanged, neither TS1 nor TS2 is enabled. The result is recorded in TS.Mode. When TS2 is enabled, TS.RecentIsOld is set to false.

TS.SndAdj supports this negotiation. Since the timestamp unit is different between TS1 and TS2, the timestamp values of TS1 and TS2 are almost always different. To save the TCP option space on SYN and SYN+ACK segments, however, only the TS option carries the TSval field and the TSecr field.
These two fields contain the timestamps of TS1. The purpose of TS.SndAdj is to adjust TS.SndOff to generate correct external timestamps for TS2 when TS2 is enabled. When the first SYN segment is sent, TS.SndAdj holds the difference between the timestamps of TS1 and TS2. Then, when TS2 is enabled after mode negotiation, TS.SndAdj is added to TS.SndOff. TS.SndAdj is used only during mode negotiation.

4.3 Input Processing

This subsection describes the procedure for processing received segments.

If TS1 or TS2 is enabled, and if a received segment carrying the TS option or the OTS option satisfies at least one of inequalities (1) and (2) below, it SHOULD be processed by the base mechanism and the applied mechanisms to update their variables. Otherwise, those variables MUST NOT be updated, while the RSEG.TSval and RSEG.TSecr values on such a segment MAY be checked to test the received segment.

   (RSEG.LEN > 0
    && Last.Ack.Sent - max(RCV.WND) < RSEG.SEQ + RSEG.LEN
    && RSEG.SEQ < RCV.NXT + RCV.WND) ............................ (1)

   or

   (RSEG.LEN == 0
    && Last.Ack.Sent - max(RCV.WND) <= RSEG.SEQ
    && RSEG.SEQ < RCV.NXT + RCV.WND) ............................ (2)

If TS1 is enabled, and if a received segment other than a RST segment carries the TS option and satisfies all of inequality (1), (RSEG.SEQ <= Last.Ack.Sent), and (RSEG.TSval > TS.Recent), then RSEG.TSval is recorded in TS.Recent. In other words, TS.Recent holds the maximum RSEG.TSval value on duplicate data segments and in-sequence data segments. Note that TS.Recent is not updated by out-of-order segments. Also note that TS.Recent is monotonically nondecreasing.

If TS2 is enabled, and if a received segment other than a SYN or RST segment does not carry the TS option nor the OTS option, it MUST be dropped, and an ACK segment SHOULD be sent in reply.
Otherwise, if a received segment other than a RST segment carries the TS option or the OTS option and satisfies inequality (1), and if TS.RecentIsOld is true or (RSEG.TSval < TS.Recent) is satisfied, then RSEG.TSval is recorded in TS.Recent, and TS.RecentIsOld is set to false. In other words, TS.Recent holds the minimum RSEG.TSval value on data segments received after a segment has last been sent. Note that TS.Recent is updated by out-of-order data segments, as well as duplicate data segments and in-sequence data segments. Also note that TS.Recent is not monotonically nondecreasing.

Note: The reason why SYN and RST segments are handled specially is to disconnect half-open TCP connections.

4.4 Output Processing

This subsection describes the procedure for processing segments being sent.

If a segment carries the TS option or the OTS option, SSEG.TSval contains the current external timestamp value, and SSEG.TSecr contains the TS.Recent value unless otherwise specified below (i.e., ).

If TS1 is enabled, all segments other than RST segments SHOULD carry the TS option, while RST segments SHOULD NOT carry the TS option.

Note: The reason why RST segments SHOULD NOT carry the TS option is that section 4.2 of [RFC1323] states, "It is recommended that RST segments NOT carry timestamps, and that RST segments be acceptable regardless of their timestamp".

If TS2 is enabled, follow the rule below.

- When a segment other than a RST segment is sent, if TS.RecentIsOld is false, the segment MUST carry the TS option, and TS.RecentIsOld is set to true. If TS.RecentIsOld is true, the segment MUST carry the OTS option.

- When a RST segment is sent, it MUST carry the OTS option unless otherwise specified below.

- When a RST segment is sent in reply to a received segment because of [RFC793], follow the rule below. If the received segment carries the TS option or the OTS option, TS2 may be enabled on the remote node.
Thus, to make the RST segment sent in reply acceptable to PAWS/TS2 and PASA/TS2 at the remote node, the RST segment MUST carry the OTS option, where SSEG.TSval is the RSEG.TSecr value and SSEG.TSecr is the RSEG.TSval value. (i.e., ) On the other hand, if the received segment does not carry the TS option nor the OTS option, TS1 may be enabled at the remote node, or neither TS1 nor TS2 is enabled at the remote node. Thus, to make the RST segment sent in reply acceptable at the remote node in either case, the RST segment SHOULD NOT carry the TS option nor the OTS option.

Note: The reason why RST segments are handled specially is to disconnect half-open TCP connections.

5. RTTM (Round Trip Time Measurement)

RTTM is an RTT measurement mechanism making use of the RSEG.TSecr field in the TS option on received segments. It can be enabled when either TS1 or TS2 is enabled. If a received segment does not satisfy inequality (1) nor (2), RTTM MUST NOT be performed.

Both RTTM/TS1 (RTTM with TS1) and RTTM/TS2 (RTTM with TS2) can take RTT measurements even when a received ACK segment was sent in reply to a retransmitted data segment. Therefore, they replace Karn's algorithm [KP87].

The most significant difference between RTTM/TS1 and RTTM/TS2 is that RTTM/TS1 can take RTT measurements only when SND.UNA is advanced, while RTTM/TS2 can take RTT measurements whenever the TS option is received.

Note: If a smoothed RTT is computed from many RTT measurements per RTT, the resulting SRTT, RTTVAR, and RTO values [RFC2988] would become short-sighted. Implementations should take care of this issue. The question of how to compute an RTO value from many measured RTTs is outside the scope of this memo.

5.1 RTTM/TS1

If TS1 is enabled, when SND.UNA is advanced, the RTT can be calculated as (CurrentExternalTS(T1) - RSEG.TSecr + TS1_RTTM_G), where TS1_RTTM_G is the timestamp granularity of TS1.
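A minimal C sketch of the RTTM/TS1 expression above follows. The function name and the use of -1 as a "no sample" marker are illustrative assumptions, not part of this memo. The result is in TS1 timestamp units, and the subtraction is done in unsigned 32-bit arithmetic so that it remains correct when the timestamp wraps.

```c
#include <stdint.h>

/* Returns the RTT sample in TS1 timestamp units, or -1 when no sample
 * may be taken because SND.UNA was not advanced by the received ACK.
 *
 *   cur_ext_ts  current external timestamp of the local node
 *   rseg_tsecr  RSEG.TSecr from the TS option on the received segment
 *   ts1_rttm_g  TS1_RTTM_G, the timestamp granularity of TS1 (in units)
 */
static int64_t rttm_ts1(int snd_una_advanced, uint32_t cur_ext_ts,
                        uint32_t rseg_tsecr, uint32_t ts1_rttm_g)
{
    if (!snd_una_advanced)
        return -1;
    /* Unsigned arithmetic is modulo 2^32, as required for timestamps. */
    return (int64_t)(uint32_t)(cur_ext_ts - rseg_tsecr + ts1_rttm_g);
}
```

For example, with a current timestamp of 1050, an echoed timestamp of 1000, and a granularity of 10 units, the sample would be 60 units.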
Note: "TS1_RTTM_G" in the above expression SHOULD NOT be replaced with "TS1_RTTM_G/2" unless TS1_RTTM_G is much smaller than any possible RTT.

Since there is a possibility that the RSEG.TSecr value is very old, the measured RTT may be longer than the real RTT in some corner cases. See Appendix E.1 for more detail.

When the RSEG.TSecr value is zero, RTT SHOULD NOT be calculated. The reason is that some implementations of the TCP Timestamps option [RFC1323] send zero in the TSecr field in some scenarios where zero is apparently a bogus timestamp value.

5.2 RTTM/TS2

If TS2 is enabled, then whenever the TS option is received, the RTT can be calculated as (CurrentExternalTS(T2) - RSEG.TSecr + TS2_RTTM_G), where TS2_RTTM_G is the timestamp granularity of TS2.

Note: "TS2_RTTM_G" in the above expression SHOULD NOT be replaced with "TS2_RTTM_G/2" unless TS2_RTTM_G is much smaller than any possible RTT.

When the OTS option is received, RTT SHOULD NOT be calculated, because the RSEG.TSecr value in the OTS option may be very old.

6. PAWS (Protection Against Wrapped Sequence numbers)

PAWS is a mechanism for detecting old duplicate segments by making use of the RSEG.TSval field in the TS option and the OTS option on the received segment. It can be enabled when either TS1 or TS2 is enabled.

As described in Appendix F, there is a possibility that a legitimate data segment could be discarded by PAWS in RFC1323 when it is delayed because of reordering. In contrast, as described in Appendix E.2, PAWS/TS1 (PAWS with TS1) is somewhat robust against reordering, and PAWS/TS2 (PAWS with TS2) is robust against reordering, so that legitimate segments are unlikely to be discarded even when delayed because of reordering.

PAWS/TS1 and PAWS/TS2 use two variables: TS.RcvMin (32bit-timestamp) and TS.RcvMin_time (internal-time). TS.RcvMin holds the maximum value of the received RSEG.TSval values in both the TS option and the OTS option.
TS.RcvMin_time holds the last time when a segment satisfying (RSEG.TSval >= TS.RcvMin) was received. The value of TS.RcvMin is valid for a limited amount of time depending on TS.Mode. If a received segment does not satisfy inequality (1) nor (2), these variables MUST NOT be updated.

Note: TS.RcvMin is updated by any kind of segment, while TS.Recent is updated only by segments satisfying (RSEG.LEN > 0). In addition, TS.RcvMin is always monotonically nondecreasing, in contrast to TS.Recent.

When TS.RcvMin is valid, all received segments SHOULD be tested using TS.RcvMin, as described in the following subsections, before the acceptability test of [RFC793]. This test is called the PAWS test in this memo. To avoid discarding legitimate delayed segments due to reordering, the lower bound of acceptable RSEG.TSval values is chosen as slightly lower than TS.RcvMin, as suggested in the appendix.

Note: PAWS/TS1 and PAWS/TS2 use the dedicated variable TS.RcvMin for the PAWS test, while PAWS in [RFC1323] uses TS.Recent.

6.1 PAWS/TS1

When TS1 is enabled, the minimum acceptable RSEG.TSval value is (TS.RcvMin - TS1_PAWS_MARGIN), where TS1_PAWS_MARGIN is a margin. An appropriate value, such as the RTO value, cannot be computed, however, because the unit of the received timestamp is unknown. Hence, this memo recommends TS1_PAWS_MARGIN = 1 simply because it is better than zero.

Thus, when the value of TS.RcvMin is valid, if a received segment other than a SYN or RST segment carries the TS option, it MUST satisfy (RSEG.TSval >= TS.RcvMin - TS1_PAWS_MARGIN). If it does not satisfy this inequality, it MUST be dropped, and an ACK segment with the TS option SHOULD be sent in reply.

Note: The reason why SYN and RST segments are not tested is to disconnect half-open TCP connections.
Another reason why RST segments are not tested is that section 4.2 of [RFC1323] states, "It is recommended that RST segments NOT carry timestamps, and that RST segments be acceptable regardless of their timestamp".

The value of TS.RcvMin is valid until the internal clock reaches (TS.RcvMin_time + TS1_PAWS_IDLE). TS1_PAWS_IDLE should be longer than the longest timeout, and it should be reasonably less than 2^31. The default value of TS1_PAWS_IDLE is 24 days, which is the same value specified in [RFC1323].

6.2 PAWS/TS2

When TS2 is enabled, the minimum acceptable RSEG.TSval value is (TS.RcvMin - CurRTO), where CurRTO means the current RTO value.

Thus, when the value of TS.RcvMin is valid, all received legitimate segments must satisfy (RSEG.TSval >= TS.RcvMin - CurRTO). If a received segment other than a SYN or RST segment does not satisfy this inequality, it MUST be dropped, and an ACK segment SHOULD be sent in reply. In addition, if a received RST segment with the OTS option does not satisfy this inequality, it MUST be dropped silently.

Note: The reason why RST segments without the OTS option and SYN segments are not tested here is to disconnect half-open TCP connections.

The value of TS.RcvMin is valid until the internal clock reaches (TS.RcvMin_time + TS2_PAWS_IDLE). TS2_PAWS_IDLE should be longer than the longest timeout, and it should be reasonably less than 2^31. The default value of TS2_PAWS_IDLE is 20 minutes (= 1200 seconds). (Note: 2^31 / 1000000 = 2147 seconds.)

Note 1: PAWS/TS2 assumes that RTTs measured at a local node and RTTs measured at a remote node are almost the same. Since the timestamp unit is fixed, the RTO value in the inequality of PAWS/TS2 can be evaluated under this assumption. This assumption might be wrong in some asymmetric networks. In such cases, the robustness against reordering may be poor. Nevertheless, it would not be worse than that of PAWS in [RFC1323].
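The lower-bound test of PAWS/TS2 and the validity period of TS.RcvMin described above might be sketched in C as follows. This is an illustration only: the function names are hypothetical, the internal clock is assumed to be a 64-bit microsecond counter that does not wrap, and CurRTO is assumed to be already expressed in the TS2 unit (1 usec).

```c
#include <stdbool.h>
#include <stdint.h>

/* Default TS2_PAWS_IDLE from this memo: 20 minutes, in usec. */
#define TS2_PAWS_IDLE_USEC (1200ull * 1000000ull)

/* True while TS.RcvMin is valid, i.e. until the internal clock
 * reaches (TS.RcvMin_time + TS2_PAWS_IDLE). */
static bool paws_ts2_rcvmin_valid(uint64_t now_usec,
                                  uint64_t rcvmin_time_usec)
{
    return now_usec < rcvmin_time_usec + TS2_PAWS_IDLE_USEC;
}

/* True if (RSEG.TSval >= TS.RcvMin - CurRTO), evaluated modulo 2^32.
 * The signed view of the 32-bit difference handles timestamp wrap. */
static bool paws_ts2_lower_ok(uint32_t rseg_tsval, uint32_t ts_rcvmin,
                              uint32_t cur_rto_usec)
{
    uint32_t lower = ts_rcvmin - cur_rto_usec;
    return (int32_t)(rseg_tsval - lower) >= 0;
}
```

With TS.RcvMin at 1.2 seconds and an RTO of 0.3 seconds, a segment stamped 1.0 seconds would pass, while one stamped 0.8 seconds would fail the test.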
Note 2: An RTT may suddenly increase due to congestion, route changes, link-bandwidth changes, etc. Hence, the computation of RTO values should be done in a conservative manner.

If the value of TS.RcvMin is valid, since the timestamp unit is fixed, the expected RSEG.TSval value can be calculated using TS.RcvMin and TS.RcvMin_time. Let elapsed_ts be the elapsed time between TS.RcvMin_time and the current time in the units of the TS2 timestamp. Then, the expected RSEG.TSval value is calculated as (TS.RcvMin + elapsed_ts).

- By considering the possibility of delays, the maximum acceptable value of RSEG.TSval is (TS.RcvMin + elapsed_ts + TS2_PAWS_DEV), where TS2_PAWS_DEV is a margin. When a segment other than a RST segment without the OTS option or a SYN segment is received, if its RSEG.TSval value is greater than this maximum bound, the segment MAY be dropped. If it is dropped, and if it is not a RST segment, an ACK segment SHOULD be sent in reply. Note: If PASA-DF/TS2 is enabled, this test SHOULD be performed.

- Without considering PASA-DF/TS2, the minimum acceptable value of RSEG.TSval would be (TS.RcvMin + elapsed_ts - TS2_PAWS_DEV), where TS2_PAWS_DEV is a margin. If the remote node uses PASA-DF/TS2, however, the RSEG.TSval values may be less than this minimum value, because they may be tweaked. Therefore, the minimum bound SHOULD NOT be tested.

7. PASA (Protection Against Spoofing Attacks)

PASA is a lightweight mechanism for protecting TCP connections against spoofing attacks injecting faked SYN, data, FIN, and RST segments. PASA can be enabled when TS2 is enabled. PASA does not work with TS1.

PASA/TS2 (PASA with TS2) consists of two parts. One is called PASA-DF/TS2 (PASA for Data and FIN segments with TS2). It detects spoofed data and FIN segments with the TS option or the OTS option by making use of received RSEG.TSecr values.
It also detects spoofed RST segments with the OTS option by applying the same test. The other part is called PASA-SR/TS2 (PASA for SYN and RST segments with TS2). It enables both genuine RST segments without the OTS option and genuine SYN segments to trigger disconnection of their TCP connections, while spoofed segments are not allowed to trigger such disconnection.

Since PASA-DF/TS2 and PASA-SR/TS2 are independent of each other, an implementation MAY support one or both of them.

7.1 PASA-DF/TS2 (PASA for Data and FIN Segments with TS2)

This subsection describes a mechanism called PASA-DF/TS2 that detects spoofed data and FIN segments with the TS option or the OTS option by making use of received RSEG.TSecr values. It also detects spoofed RST segments with the OTS option by applying the same test.

If a received segment does not satisfy inequality (1) nor (2), it MUST NOT update the variables of PASA-DF/TS2. In contrast, if PASA-DF/TS2 is enabled, any received segment SHOULD be tested by PASA-DF/TS2 before the acceptability test of [RFC793].

PASA-DF/TS2 uses four variables: TS.SndMin (32bit-timestamp), TS.SndMax (32bit-timestamp), TS.SndMax_time (internal-time), and TS.PASADF_On (boolean).

TS.SndMin holds the maximum value of the received RSEG.TSecr field in the TS option and the OTS option on segments satisfying inequality (1) or (2), while TS.SndMax holds the maximum value of the SSEG.TSval field on sent segments satisfying (SSEG.LEN > 0). For the first SYN or SYN+ACK segment sent, SSEG.TSval is copied to both TS.SndMin and TS.SndMax. Consequently, the received RSEG.TSecr values of the established TCP connection should be in the range from around TS.SndMin to TS.SndMax. The test of whether the received segments fit in this range is called the PASA-DF test in this memo.

TS.SndMax_time holds the latest time when a segment satisfying (SSEG.LEN > 0) is sent or when TS.SndOff is updated.
TS.PASADF_On indicates whether the PASA-DF test can be performed. The default value of TS.PASADF_On is true.

Note: The reason why PASA-DF/TS2 does not test RST segments without the OTS option and SYN segments is to disconnect half-open TCP connections. They are tested by PASA-SR/TS2, as described later.

When PASA-DF/TS2 is enabled, the maximum acceptable value of RSEG.TSval SHOULD be tested by PAWS/TS2. Additionally, if the timestamp granularity is longer than 1 usec, some lower bits of external timestamps SHOULD be used as a nonce.

7.1.1 External Timestamp Values

When PASA-DF/TS2 is enabled, TS.SndOff is utilized to randomize the initial SSEG.TSval value in order to obfuscate external timestamp values. If a TCP control block is reused by a new TCP connection, TS.SndOff MUST be incremented by a random number in the range from 0 to about 10 minutes (e.g., 2^29 usec). In other cases, a newly generated random number MUST be copied to TS.SndOff.

TS.SndOff is also utilized to minimize the difference between TS.SndMin and TS.SndMax after a long idle period. Since the probability of accepting a spoofed segment is roughly this difference divided by 2^32, it is important to keep the difference small. Therefore, the advancement of TS.SndMax MUST be no greater than an upper bound. In addition, SSEG.TSval MUST be monotonically nondecreasing in order to make PAWS operable at a remote node. To satisfy these requirements, this memo proposes that the advancement of SSEG.TSval be no greater than the upper bound. Suppose that TS2_PASADF_MAXADV is the upper bound. Then, TS.SndOff MUST be tweaked as follows:

   over_time = (CurrentTime - TS.SndMax_time) - TS2_PASADF_MAXADV;
   if (over_time > 0) {
       TS.SndOff += - time2ts(over_time) + RandomNumber(2^26);
       TS.SndMax_time = CurrentTime;
   }

Note 1: This memo assumes that CurrentTime is not wrapped in the lifetime of any TCP connections.

Note 2: RandomNumber(2^26) above means a random number in the range from 0 to 2^26 usec (about 67 seconds).
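The TS.SndOff adjustment above can be written as compilable C roughly as follows. This is a sketch under stated assumptions rather than a definitive implementation: the struct and function names are hypothetical, the internal clock is assumed to be a non-wrapping 64-bit microsecond counter (so time2ts() is a truncation to 32 bits), TS2_PASADF_MAXADV is passed in because this memo does not fix its value here, and the random value is supplied by the caller.

```c
#include <stdint.h>

#define PASADF_RAND_RANGE (1u << 26)  /* RandomNumber(2^26) range */

struct pasadf_state {
    uint32_t ts_snd_off;      /* TS.SndOff, in usec, modulo 2^32 */
    uint64_t ts_sndmax_time;  /* TS.SndMax_time, internal time in usec */
};

/* Convert an internal time interval (usec) to TS2 timestamp units.
 * Both use 1 usec here, so this only reduces the value modulo 2^32. */
static uint32_t time2ts(uint64_t t_usec)
{
    return (uint32_t)t_usec;
}

/* Pull TS.SndOff back by the idle time that exceeds max_adv_usec
 * (TS2_PASADF_MAXADV), then add a random amount below 2^26 usec.
 * 'rnd' is a caller-supplied random value. */
static void pasadf_tweak_snd_off(struct pasadf_state *st,
                                 uint64_t now_usec,
                                 uint64_t max_adv_usec, uint32_t rnd)
{
    uint64_t idle = now_usec - st->ts_sndmax_time;

    if (idle > max_adv_usec) {
        uint64_t over_time = idle - max_adv_usec;
        st->ts_snd_off += (rnd % PASADF_RAND_RANGE) - time2ts(over_time);
        st->ts_sndmax_time = now_usec;
    }
}
```

Comparing the idle time with the bound before subtracting avoids a negative intermediate value, which the original pseudocode expresses as (over_time > 0).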
7.1.2 Temporary Suspension of Tests

There is a possibility that (TS.SndMax - TS.SndMin) becomes negative when a data segment is sent after a series of sporadic transmissions that do not elicit any segments from a remote node. For example, consider a case where a TCP stack has received a large amount of data, while its application reads the data very slowly. In this case, the TCP stack would send small window updates once in a while. During such a period, TS.SndMin and TS.SndMax are unchanged, while the SSEG.TSval values in those window updates are increasing. After a while, the difference between SSEG.TSval and TS.SndMin could be larger than 2^31. That is, (SSEG.TSval - TS.SndMin) could be negative. If a data segment is sent in that case, (TS.SndMax - TS.SndMin) also would be negative.

To avoid confusion, the PASA-DF test MUST NOT be performed in such cases. Thus, this memo proposes the following procedure:

- When TS.PASADF_On is true, if (TS.SndMax - TS.SndMin) becomes negative after a segment is sent, TS.PASADF_On is set to false.

- When TS.PASADF_On is false, the PASA-DF test MUST NOT be performed.

- When TS.PASADF_On is false, if a received segment satisfies the requirement that (TS.SndMax - RSEG.TSecr) be non-negative, RSEG.TSecr is copied to TS.SndMin, and TS.PASADF_On is set to true.

This procedure is incorporated in the following two subsections.

7.1.3 Input Processing

This subsection describes the procedure when a segment is received.

When TS.PASADF_On is true, follow the procedure below.

- In the CLOSED and LISTEN states, RSEG.TSecr MUST NOT be tested.

- In the SYN-SENT state, if a received SYN+ACK segment carries both the TS option and the OTS_OK option, it MUST satisfy (TS.SndMin <= RSEG.TSecr <= TS.SndMax). If it does not satisfy this inequality, it MUST be dropped silently. When another segment is received, RSEG.TSecr is not tested.
  Note: The reason why an ACK segment is not sent in reply here is that an ACK segment cannot be sent in the SYN-SENT state.

- In other states, all segments other than SYN and RST segments must satisfy (TS.SndMin - CurRTO <= RSEG.TSecr <= TS.SndMax), where CurRTO means the current RTO value. If a received segment other than a SYN or RST segment does not satisfy this inequality, it MUST be dropped, and an ACK segment SHOULD be sent in reply. In addition, if a received RST segment with the OTS option does not satisfy the inequality, it MUST be dropped silently. In both cases, if the received segment satisfies the inequality, and it satisfies (RSEG.TSecr > TS.SndMin), then RSEG.TSecr is copied to TS.SndMin.

  Note: The reason why RST segments without the OTS option and SYN segments are not tested here is to disconnect half-open TCP connections.

When TS.PASADF_On is false, if a received segment satisfies the requirement that (TS.SndMax - RSEG.TSecr) be non-negative, RSEG.TSecr is copied to TS.SndMin, and TS.PASADF_On is set to true.

7.1.4 Output Processing

This subsection describes the procedure when a segment is sent.

When a segment satisfying (SSEG.LEN > 0) is sent, SSEG.TSval is copied to TS.SndMax, and the current time is copied to TS.SndMax_time. If the segment is the first segment (e.g., the first SYN segment or the first SYN+ACK segment), SSEG.TSval is also copied to TS.SndMin.

When TS.PASADF_On is true, if (TS.SndMax - TS.SndMin) becomes negative after the above copying, TS.PASADF_On is set to false.

7.2 PASA-SR/TS2 (PASA for SYN and RST Segments with TS2)

This subsection describes a mechanism called PASA-SR/TS2, which enables both genuine RST segments without the OTS option and genuine SYN segments to trigger disconnection of their TCP connections, while spoofed segments are not allowed to trigger such disconnection.
If a received segment satisfies neither inequality (1) nor (2), it MUST NOT update the variables of PASA-SR/TS2. In contrast, if PASA-SR/TS2 is enabled, any received segment SHOULD be tested by PASA-SR/TS2 before the acceptability test of [RFC793].

   Note: RST segments without the OTS option and SYN segments are not dropped by the base mechanism of TS2, PAWS/TS2, and PASA-DF/TS2 in order to disconnect half-open TCP connections.

7.2.1 Procedure

PASA-SR/TS2 uses the following variables: TS.PASASR_On (boolean) and TS.PASASR_time (internal-time). The initial value of TS.PASASR_On is false. TS.PASASR_time is valid only when TS.PASASR_On is true.

The procedure is as follows:

- When a SYN segment is received against an established TCP connection, regardless of whether it has the TS option or the OTS option, it MUST be dropped, and an ACK segment without either the TS option or the OTS option SHOULD be sent in reply. The window size of the ACK segment is TS2_PASASR_WIN, which should be small enough (e.g., 1 RMSS). Then, TS.PASASR_On is set to true, and the current time is recorded in TS.PASASR_time.

- When TS.PASASR_On is true, if a received segment is not dropped by the base mechanism of TS2, the PAWS test, the PASA-DF test, the above two rules, and the acceptability test of [RFC793], then do the following: if (1) the received segment is not a RST segment, (2) it is a RST segment with the OTS option, or (3) a long time has passed since the last ACK segment sent in reply to a SYN segment by PAWS/TS2 or PASA-DF/TS2 was sent (i.e., CurrentTime - TS.PASASR_time >= TS2_PASASR_TIME), then TS.PASASR_On is set to false.

- If a RST segment without the OTS option is received, and if TS.PASASR_On is false, or the segment does not satisfy (RCV.NXT <= RSEG.SEQ < RCV.NXT + TS2_PASASR_WIN), it MUST be dropped silently.
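The three rules above can be sketched in C roughly as follows. This is a simplified illustration, not a reference implementation: it assumes that the caller has already applied the base mechanism of TS2, the PAWS test, the PASA-DF test, and the acceptability test of [RFC793], and that internal time is a 64-bit microsecond counter. The values chosen for TS2_PASASR_WIN and TS2_PASASR_TIME, as well as the names pasasr, seg, and pasasr_input, are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TS2_PASASR_WIN  1460u      /* hypothetical: about 1 RMSS */
#define TS2_PASASR_TIME 2000000LL  /* hypothetical: 2 s in usec */

enum pasasr_action {
    PASASR_PASS,          /* continue normal processing */
    PASASR_DROP_SEND_ACK, /* drop; reply with ACK, no TS/OTS option */
    PASASR_DROP_SILENT    /* drop without any reply */
};

struct pasasr {
    bool    on;    /* TS.PASASR_On */
    int64_t time;  /* TS.PASASR_time, valid only when 'on' is true */
};

struct seg {
    bool     syn, rst, has_ots;
    uint32_t seq;  /* RSEG.SEQ */
};

/* Input handling for an established connection. */
static enum pasasr_action pasasr_input(struct pasasr *p,
                                       const struct seg *s,
                                       uint32_t rcv_nxt, int64_t now)
{
    /* Rule 1: a SYN arms the mechanism and elicits a small-window ACK. */
    if (s->syn) {
        p->on = true;
        p->time = now;
        return PASASR_DROP_SEND_ACK;
    }

    /* Rule 2: disarm on any other kind of segment or after a timeout. */
    if (p->on &&
        (!s->rst || s->has_ots || now - p->time >= TS2_PASASR_TIME))
        p->on = false;

    /* Rule 3: a RST without OTS needs an armed state and a narrow window. */
    if (s->rst && !s->has_ots) {
        /* RCV.NXT <= RSEG.SEQ < RCV.NXT + WIN, evaluated modulo 2^32 */
        bool in_win = (uint32_t)(s->seq - rcv_nxt) < TS2_PASASR_WIN;
        if (!p->on || !in_win)
            return PASASR_DROP_SILENT;
    }
    return PASASR_PASS;
}
```

A RST segment without the OTS option therefore passes only shortly after a SYN segment has been answered, and only if its sequence number falls within TS2_PASASR_WIN octets of RCV.NXT.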
7.2.2 Examples If a SYN segment is received against an established TCP connection, there are two possible causes. The first is that the remote node has been rebooted or disconnected silently and is trying to establish a new TCP connection with the same quadruple by chance. The second cause is that a malicious node sent a spoofed SYN segment with the same quadruple by chance. If a RST segment without the OTS option is received against an established TCP connection, there are two possible causes. The first is that the remote node has been rebooted or disconnected silently and has sent a RST segment in reply to a segment sent by the local node. The second cause is that a malicious node sent a spoofed RST segment with the same quadruple by chance. In any case, genuine SYN and RST segments should cause the TCP connection to disconnect, while spoofed SYN and RST segments should not cause it to disconnect. The following four examples show how PASA-SR/TS2 would work against the four possible causes described above. Suppose that a local node Demizu Expires June 2006 [Page 18] Internet-Draft December 2005 and a remote node have an established TCP connection, and TS2 is enabled on it. Case 1: The remote node has been rebooted or disconnected silently, and it sent a SYN segment to establish a new TCP connection with the same quadruple by chance. When this genuine SYN segment is received, an ACK segment with window size = TS2_PASASR_WIN without either the TS option or the OTS option is sent in reply because of PASA-SR/TS2. When the remote node receives this ACK segment, it sends a RST segment without the OTS option because it is in the SYN-SENT state. Since the sequence number of this RST segment would satisfy (RCV.NXT <= RSEG.SEQ < RCV.NXT + TS2_PASASR_WIN), this RST segment would be accepted by the local node because of PASA-SR/TS2. Then, this RST segment would disconnect the existing TCP connection successfully. 
After a while, another SYN segment will be retransmitted by the remote node, and a new TCP connection will be established. Case 2: A malicious node sent a spoofed SYN segment with the same quadruple by chance. When this spoofed SYN segment is received, an ACK segment with window size = TS2_PASASR_WIN without either the TS option or the OTS option is sent in reply because of PASA-SR/TS2. When the remote node receives this ACK segment, since TS2 is enabled on the TCP connection, this ACK segment is dropped by the remote node, and an ACK segment is sent in reply. The ACK segment would be accepted by the local node. Fortunately, it would have no effect other than that the duplicate ACK counter could be falsely increased by one. Thus, the spoofed SYN segment would not disconnect the existing TCP connection. Note: If the duplicate ACK counter is increased only by an ACK segment with the TS option, it would not be falsely increased in this case. See Appendix C. If a spoofed RST segment is sent just after the SYN segment, the possibility of accepting the spoofed RST segment is TS2_PASASR_WIN in 2^32. This would be sufficiently small in most environments. Case 3: The remote node sent a genuine RST segment without the OTS option to disconnect a TCP connection. This case cannot happen if TS2 is enabled, because a legitimate Demizu Expires June 2006 [Page 19] Internet-Draft December 2005 RST segment MUST carry the OTS option. Case 4: A malicious node sent a spoofed RST segment without the OTS option with the same quadruple by chance. This spoofed RST segment is accepted only when TS.PASASR_On is true and (RCV.NXT <= RSEG.SEQ < RCV.NXT + TS2_PASASR_WIN) is satisfied. Thus, the possibility of accepting this spoofed RST segment is lower than TS2_PASASR_WIN in 2^32. This would be sufficiently small in most environments. 8. 
DLD (Data Loss Detection) DLD is a mechanism for detecting losses of original and retransmitted data segments by making use of the RSEG.TSecr field in the TS option on received segments. It improves overall throughput by reducing the number of retransmission timeouts under heavy loss rates. It can be enabled when TS2 is enabled. If a received segment does not satisfy inequality (1) nor (2), or it does not carry the TS option, DLD/TS2 (DLD with TS2) MUST NOT be performed. Note that if a received segment carries the OTS option, DLD/TS2 MUST NOT be performed. Ideally, two variables are associated with each sent octet: OD.SndTS (32bit-timestamp) and OD.SndRO (integer). OD.SndTS holds the SSEG.TSval value on the latest data segment containing the octet. The initial value of OD.SndRO is zero. When data is retransmitted, the variables of the retransmitted octets are reinitialized. If a segment with the TS option is received, every OD.SndRO of all octets satisfying (RSEG.TSecr > OD.SndTS) is increased by one. Then, all octets satisfying (OD.SndRO >= TS2_DLD_THRESH) are considered lost. A real-world implementation would likely prefer to manage the retransmitted octets as sequence number ranges. TS2_DLD_THRESH indicates the number of observed possible reorders required to declare a loss. The default value is 3, which is the same value as the so-called duplicate acknowledgement threshold specified in [RFC2581]. TS2_DLD_THRESH might be implemented as an adaptive variable in the future. The above algorithm, however, would not be easy to implement because of memory limitations. Therefore, this memo proposes the following space-optimized algorithms that do not require a huge memory space. (1) DLD-SEG/TS2 detects losses of any original and retransmitted data segments. It uses two variables for each data segment. (2) DLD-UNA/TS2 detects losses of original and retransmitted data at SND.UNA. 
It uses only two variables for each TCP Demizu Expires June 2006 [Page 20] Internet-Draft December 2005 connection. (3) DLD-SACK/TS2 detects losses of original and retransmitted data in SACK holes. It uses two variables for each SACK hole. These algorithms can be implemented independently. Since DLD-SEG/TS2 is sufficiently powerful, however, if it is implemented, DLD-UNA/TS2 and DLD-SACK/TS2 need not be implemented. DLD/TS2 uses the RSEG.TSecr field in the TS option only, because all segments sent in reply to data segments carry the TS option. That is, in the loss recovery phase, the number of data segments arriving at the remote node is basically equal to the number of segments with the TS option that are sent by the remote node. To count the number of observed possible reorders precisely, any segments with the TS option (including data segments, window updates, etc.) should be counted. Note 1: Fast Retransmit [RFC2581] and SACK [RFC2018][RFC3517] are helpful for detecting losses of original data segments. In contrast with DLD/TS2, however, they do not detect losses of retransmitted data segments. They are very helpful when DLD-UNA/TS2 is implemented but DLD-SEG/TS2 and DLD-SACK/TS2 are not implemented. Note 2: When some data have been retransmitted, losses of data between SND.UNA and the highest retransmitted data cannot be detected by IsLost() [RFC3517], while such losses can be detected by DLD/TS2. See Appendix C for more details. 8.1 DLD-SEG/TS2 (DLD for Data Segments) This subsection describes a mechanism called DLD-SEG/TS2 that detects losses of any original and retransmitted data segments. It is useful when each data segment is already managed by a data structure in an implementation. It can be enabled when TS2 is enabled. DLD-SEG/TS2 uses two variables for each sent or retransmitted data segment: DS.SndTS (32bit-timestamp) and DS.SndRO (integer). DS.SndTS holds the SSEG.TSval value on the latest data segment. 
DS.SndRO counts the number of segments with the TS option that satisfy (RSEG.TSecr > DS.SndTS). In other words, it counts the number of observed possible reorders in the view of the data segment. The procedure is as follows. When a data segment is sent or retransmitted, its SSEG.TSval value is recorded in DS.SndTS, and DS.SndRO is cleared. When a segment with the TS option is received, every DS.SndRO of all Demizu Expires June 2006 [Page 21] Internet-Draft December 2005 data segments satisfying (RSEG.TSecr > DS.SndTS) is increased by one. Then, all data segments satisfying (DS.SndRO >= TS2_DLD_THRESH) are considered lost. Implementation Hint: Prepare a chain of structures for data segments, sorted by DS.SndTS. When a new data segment is sent, a new structure for the data segment is allocated, SSEG.TSval is copied to DS.SndTS, and DS.SndRO is cleared; then the structure is inserted at the tail of the chain. When a data segment is retransmitted, SSEG.TSval is copied to DS.SndTS, and DS.SndRO is cleared; then the structure is moved to the tail of the chain. When a segment with the TS option is received, traverse the chain from the head while (RSEG.TSecr > DS.SndTS). For each structure satisfying this inequality, DS.SndRO is increased by one. Then, all structures satisfying (DS.SndRO >= TS2_DLD_THRESH) are marked as lost. 8.2 DLD-UNA/TS2 (DLD for Data at SND.UNA) This subsection describes a mechanism called DLD-UNA/TS2 that detects losses of original and retransmitted data at SND.UNA. It can be enabled when TS2 is enabled. DLD-UNA/TS2 uses only two variables: TS.SndUnaTS (32bit-timestamp) and TS.SndUnaRO (integer). TS.SndUnaTS holds the SSEG.TSval value on the latest data segment containing data at SND.UNA. TS.SndUnaRO counts the number of segments with the TS option that satisfy (RSEG.TSecr > TS.SndUnaTS). In other words, it counts the number of observed possible reorders in the view of data at SND.UNA. The values are valid only when (SND.UNA < SND.MAX) is true. 
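The counting rule used by DLD-SEG/TS2 in section 8.1, which the two DLD-UNA/TS2 variables apply to the data at SND.UNA only, can be sketched in C as follows. This is a minimal illustration under stated assumptions: the entries are kept in a plain array rather than the sorted chain suggested in the implementation hint, and the names dld_entry, ts_gt, dld_on_send, and dld_on_recv are hypothetical. The caller is assumed to invoke dld_on_recv() only for segments that carry the TS option and satisfy inequality (1) or (2).

```c
#include <stddef.h>
#include <stdint.h>

#define TS2_DLD_THRESH 3u  /* default threshold given in this memo */

struct dld_entry {
    uint32_t snd_ts;  /* DS.SndTS: SSEG.TSval of the latest transmission */
    unsigned snd_ro;  /* DS.SndRO: observed possible reorders */
};

/* TS_GT(a, b): true if a > b modulo 2^32. */
static int ts_gt(uint32_t a, uint32_t b)
{
    return a != b && (uint32_t)(a - b) < 0x80000000u;
}

/* Called when a data segment is sent or retransmitted. */
static void dld_on_send(struct dld_entry *e, uint32_t tsval)
{
    e->snd_ts = tsval;
    e->snd_ro = 0;
}

/*
 * Called for each received segment carrying the TS option.
 * Increments the counter of every entry sent before the echoed
 * timestamp and returns how many entries are now considered lost.
 */
static size_t dld_on_recv(struct dld_entry *v, size_t n, uint32_t tsecr)
{
    size_t lost = 0;

    for (size_t i = 0; i < n; i++) {
        if (ts_gt(tsecr, v[i].snd_ts))
            v[i].snd_ro++;
        if (v[i].snd_ro >= TS2_DLD_THRESH)
            lost++;
    }
    return lost;
}
```

With a single entry tracking the data at SND.UNA, the same two functions model the TS.SndUnaTS and TS.SndUnaRO bookkeeping, although DLD-UNA/TS2 additionally counts only segments that do not advance SND.UNA.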
The procedure is as follows. When data at SND.UNA is sent or retransmitted, the SSEG.TSval value on the data segment is recorded in TS.SndUnaTS, and TS.SndUnaRO is cleared. When (SND.UNA < SND.MAX) is true, if a received segment does not advance SND.UNA (i.e., RSEG.ACK <= SND.UNA), and if it carries the TS option and satisfies (RSEG.TSecr > TS.SndUnaTS), TS.SndUnaRO is increased by one. After that, if (TS.SndUnaRO >= TS2_DLD_THRESH), the data at SND.UNA is considered lost. Otherwise, if a received segment advances SND.UNA (i.e., SND.UNA < RSEG.ACK <= SND.MAX), the current external timestamp value is copied to TS.SndUnaTS, and TS.SndUnaRO is cleared. Implementation Note: If PASA-DF/TS2 is enabled, then when SND.UNA is advanced, for time-optimization TS.SndMax can be copied to TS.SndUnaTS, instead of the current external timestamp value. Demizu Expires June 2006 [Page 22] Internet-Draft December 2005 8.3 DLD-SACK/TS2 (DLD for Data in SACK Holes) This subsection describes a mechanism called DLD-SACK/TS2 that detects losses of original and retransmitted data in SACK holes. It can be enabled when both SACK and TS2 are enabled. DLD-SACK/TS2 uses two variables for each SACK hole: SH.SndTS (32bit-timestamp) and SH.SndRO (integer). SH.SndTS holds the SSEG.TSval value on the latest data segment containing data in the SACK hole. SH.SndRO counts the number of segments with the TS option that satisfy (RSEG.TSecr > SH.SndTS). In other words, it counts the number of observed possible reorders in the view of the data in the SACK hole since part or all of the data in the SACK hole were sent or retransmitted. The procedure is as follows. When an ACK segment with the TCP SACK option is received, if a data structure for a new SACK hole is allocated, the received RSEG.TSecr value is copied to SH.SndTS, and SH.SndRO is cleared. When data in a SACK hole is retransmitted, the SSEG.TSval value on the data segment is copied to SH.SndTS, and SH.SndRO is cleared. 
When a segment with the TS option is received, SH.SndRO for all SACK holes satisfying (RSEG.TSecr > SH.SndTS) is increased by one. Then, whole data in all SACK holes satisfying (SH.SndRO >= TS2_DLD_THRESH) are considered lost. Implementation Hint: Prepare a chain of SACK holes, sorted by SH.SndTS. When a new SACK hole is created, RSEG.TSecr is copied to SH.SndTS, SH.SndRO is cleared, and the SACK hole is inserted at the tail of the chain. When data in a SACK hole is retransmitted, SSEG.TSval is copied to SH.SndTS, and SH.SndRO is cleared; then the SACK hole is moved to the tail of the chain. When a segment with the TS option is received, traverse the chain from the head while (RSEG.TSecr > SH.SndTS). For each SACK hole satisfying this inequality, SH.SndRO is increased by one. Then, all SACK holes satisfying (SH.SndRO >= TS2_DLD_THRESH) are marked as lost. If a SACK hole is split by a received SACK block, the split SACK holes inherit SH.SndTS, SH.SndRO, and the position in the chain. If DLD-UNA/TS2 is enabled, DLD-SACK/TS2 should help the process as follows. When SND.UNA is advanced by a received segment, if the new SND.UNA is in an existing SACK hole, SH.SndTS and SH.SndRO of the SACK hole are copied to TS.SndUnaTS and TS.SndUnaRO, respectively. Otherwise, if the new SND.UNA is not in an existing SACK hole, the current external timestamp value is copied to TS.SndUnaTS, and TS.SndUnaRO is cleared. Demizu Expires June 2006 [Page 23] Internet-Draft December 2005 Implementation Note: If PASA-DF/TS2 is enabled, then when SND.UNA is advanced, TS.SndMax can be copied to TS.SndUnaTS, instead of the current external timestamp value for time-optimization. 9. SRD (Spurious Retransmission Detection) SRD is a mechanism to detect a posteriori spurious retransmission timeouts and spurious Fast Retransmits [RFC2581] by using the received RSEG.TSecr value in the TS option and the OTS option. 
The problems solved by SRD are nearly the same as those solved by the Eifel detection algorithm [RFC3522]. SRD can be enabled when TS2 is enabled. When a received segment does not satisfy inequality (1) nor (2), SRD MUST NOT be performed. If TS1 is enabled, the Eifel detection algorithm can be applied. If a remote node supports TS1, the issue discussed in section 3.3 of [RFC3522] will not occur. 9.1 Retransmission Timeout and Fast Retransmit SRD/TS2 (SRD with TS2) uses two variables: TS.SRD_Mode (enumeration) and TS.SRD_TS (32bit-timestamp). The value of TS.SRD_Mode is TS2_SRD_NO, TS2_SRD_TO, or TS2_SRD_FR. Its initial value is TS2_SRD_NO, which means that SRD is not running. TS2_SRD_TO means that SRD is running and was triggered by a retransmission timeout. TS2_SRD_FR means that SRD is running and was triggered by a Fast Retransmit. The value of TS.SRD_TS is valid only when TS.SRD_Mode is not TS2_SRD_NO. Note: Since SRD/TS2 sends an arbitrary in-window data segment upon a retransmission timeout, as described below, in the case where both DLD-UNA/TS2 and SRD/TS2 are implemented, TS.SndUnaTS of DLD-UNA/TS2 and TS.SRD_TS must be implemented as separate variables. Upon a retransmission timeout, standard TCP retransmits data starting at SND.UNA. In contrast, SRD/TS2 sends arbitrary in-window data, as follows. If new data can be sent, it is sent. If lost data in a SACK hole can be retransmitted, it is retransmitted. Otherwise, the data starting at SND.UNA is retransmitted, as in the case of standard TCP. In any case, if TS.SRD_Mode is TS2_SRD_NO, the SSEG.TSval value on the sent data segment is copied to TS.SRD_TS, and TS.SRD_Mode is set to TS2_SRD_TO. As a result, the problem of unnecessary data retransmission and unnecessary delayed ACK segments can be alleviated. Upon a Fast Retransmit, the data starting at SND.UNA is retransmitted as specified in [RFC2581]. 
If TS.SRD_Mode is TS2_SRD_NO, SSEG.TSval on the sent data segment is copied to TS.SRD_TS, and TS.SRD_Mode is set to TS2_SRD_FR.

   Note: Upon a Fast Retransmit, SRD/TS2 might be able to send arbitrary in-window data instead of the data at SND.UNA in order to probe whether the inferred loss is genuine or not. If DLD/TS2 is enabled, however, a Fast Retransmit will not be triggered spuriously in the sense of SRD/TS2. Further research on this might be necessary.

When TS.SRD_Mode is not TS2_SRD_NO, if an acceptable segment is received, SRD/TS2 checks whether the retransmission timeout or Fast Retransmit was genuine or spurious as follows. If the received segment satisfies (RSEG.TSecr >= TS.SRD_TS), the retransmission is considered genuine. In this case, if TS.SRD_Mode is TS2_SRD_TO, the congestion window is reduced to 1 SMSS, as for a normal retransmission timeout [RFC2581], and SND.NXT is set to SND.UNA if it is not equal to SND.UNA. Then, TS.SRD_Mode is set to TS2_SRD_NO. On the other hand, if the received segment does not satisfy (RSEG.TSecr >= TS.SRD_TS), but it advances SND.UNA, the retransmission is considered spurious. In this case, TS.SRD_Mode is set to TS2_SRD_NO, and a response algorithm such as [RFC4015] is executed. In other cases, nothing is done, and the next segment is awaited.

   Note: This input procedure assumes that the minimum possible RTT is longer than the timestamp granularity of TS2. If this is false, this procedure may consider a spurious retransmit as genuine. In that case, the inequality (RSEG.TSecr >= TS.SRD_TS) in the procedure SHOULD be replaced with (RSEG.TSecr > TS.SRD_TS).

9.2 SRD for Arbitrary In-Window Data

If SACK is enabled, a similar mechanism can detect a posteriori spurious retransmission of arbitrary octets in the window. Two variables are associated with each retransmitted octet: OD.SRD_Do (boolean) and OD.RxmtTS (32bit-timestamp).
Upon the first retransmission containing the octet, SSEG.TSval is copied to OD.RxmtTS, and OD.SRD_Do is set to true. When OD.SRD_Do is true, if an acceptable segment is received, the following procedure determines whether the first retransmission was genuine or spurious: If (RSEG.TSecr >= OD.RxmtTS) is satisfied, the first retransmission is considered genuine, and OD.SRD_Do is changed to false. On the other hand, if the received segment does not satisfy (RSEG.TSecr >= OD.RxmtTS) but it SACKs the octet, the first retransmission is considered spurious, and OD.SRD_Do is changed to false. In other cases, another segment is awaited. Demizu Expires June 2006 [Page 25] Internet-Draft December 2005 A real-world implementation would likely favor managing retransmitted octets as sequence number ranges. Note: In this procedure, OD.RxmtTS holds the SSEG.TSval value of the first retransmission. It is not updated by any succeeding retransmission. In contrast, OD.SndTS of DLD/TS2 holds the SSEG.TSval value of the latest transmission including the original transmission, and it is updated by any retransmission. Therefore, OD.SndTS of DLD/TS2 and OD.RxmtTS must be implemented as separate variables. 10. Security Considerations PASA/TS2 is a lightweight protection mechanism against spoofing attacks injecting faked SYN, data, FIN, and RST segments. The vulnerability described in [CVE05] and [CERT05] is mitigated in PAWS/TS1 and PAWS/TS2 because of inequalities (1) and (2). When TS2 is enabled, it is also mitigated by PASA/TS2. 11. IANA Considerations The option-kind value of the TCP Old Timestamps option needs to be assigned. 12. Acknowledgements The TCP Timestamps option was originally specified in [RFC1323] by Van Jacobson, Bob Braden, and Dave Borman. Many ideas in this memo are thus inherited from it. The TS.Recent update rule of TS2 was inspired by Reiner Ludwig [Lud03a][Lud03b]. 
The idea of detecting spurious timeouts by making use of the TSecr field was proposed by Reiner Ludwig. The idea of detecting spoofed segments by making use of the TSecr field was proposed by Kacheong Poon. He has given the author invaluable insights, ideas, and comments on timestamp handling through discussions on [PD04]. 13. References 13.1 Normative References [RFC793] J. Postel, "Transmission Control Protocol", STD7, RFC793, Demizu Expires June 2006 [Page 26] Internet-Draft December 2005 September 1981. [RFC1323] V. Jacobson, R. Braden, and D. Borman, "TCP Extensions for High Performance", RFC1323, May 1992. [RFC2119] S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", BCP14, RFC2119, March 1997. [RFC2581] M. Allman, V. Paxson, and W. Stevens, "TCP Congestion Control", RFC2581, April 1999. [RFC2988] V. Paxson, and M. Allman, "Computing TCP's Retransmission Timer", RFC2988, November 2000. [RFC3522] R. Ludwig and M. Meyer, "The Eifel Detection Algorithm for TCP", RFC3522, April 2003. 13.2 Informative References [RFC1122] R. Braden, Editor, "Requirements for Internet Hosts - Communication Layers", RFC1122, October 1989. [RFC2018] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, "TCP Selective Acknowledgement Options", RFC2018, October 1996. [RFC2385] A. Heffernan, "Protection of BGP Sessions via the TCP MD5 Signature Option", RFC2385, August 1998. [RFC2883] S. Floyd, J. Mahdavi, M. Mathis, and M. Podolsky, "An Extension to the Selective Acknowledgement (SACK) Option for TCP", RFC2883, July 2000. [RFC3042] M. Allman, H. Balakrishnan, and S. Floyd, "Enhancing TCP's Loss Recovery Using Limited Transmit", RFC3042, January 2001. [RFC3465] M. Allman, "TCP Congestion Control with Appropriate Byte Counting (ABC)", RFC3465, February 2003. [RFC3517] E. Blanton, M. Allman, K. Fall, and L. Wang, "A Conservative Selective Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP", RFC3517, April 2003. [RFC3708] E. Blanton, and M. 
Allman, "Using TCP Duplicate Selective Acknowledgement (DSACKs) and Stream Control Transmission Protocol (SCTP) Duplicate Transmission Sequence Numbers (TSNs) to Detect Spurious Retransmissions", RFC3708, February 2004.

[RFC3782] S. Floyd, T. Henderson, and A. Gurtov, "The NewReno Modification to TCP's Fast Recovery Algorithm", RFC3782, April 2004.

[RFC4015] R. Ludwig and A. Gurtov, "The Eifel Response Algorithm for TCP", RFC4015, February 2005.

[All04] M. Allman, "Re: [tcpm] long options draft revision", the IETF TCPM WG mailing list, September 2004. URL "http://www1.ietf.org/mail-archive/web/tcpm/current/msg00748.html"

[Bra93] R. Braden, "TCP Extensions for High Performance: An Update", (work in progress), Internet-Draft, June 1993. URL "http://www.kohala.com/start/tcplw-extensions.txt"

[CERT05] US-CERT, "Vulnerability Note VU#637934", May 2005. URL "http://www.kb.cert.org/vuls/id/637934"

[CVE05] CVE-2005-0356, May 2005. URL "http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=2005-0356"

[Duk03a] M. Duke, "[Tsvwg] Updating timestamps (ts_recent) in Linux", the IETF TSVWG WG mailing list, August 2003. URL "http://www1.ietf.org/mail-archive/web/tsvwg/current/msg04379.html"

[Duk03b] M. Duke, "RE: [Tsvwg] Updating timestamps (ts_recent) in Linux", the IETF TSVWG WG mailing list, August 2003. URL "http://www1.ietf.org/mail-archive/web/tsvwg/current/msg04391.html"

[JBB97] V. Jacobson, R. Braden, and D. Borman, "TCP Extensions for High Performance", (work in progress), Internet-Draft, February 1997.

[JBB03] V. Jacobson, R. Braden, and D. Borman, "TCP Extensions for High Performance", (work in progress), Internet-Draft, August 2003.

[KP87] P. Karn, and C. Partridge, "Estimating Round-Trip Times in Reliable Transport Protocols", Proceedings of SIGCOMM'87, August 1987.

[Lud03a] R. Ludwig, "RE: [Tsvwg] Updating timestamps (ts_recent) in Linux", the IETF TSVWG WG mailing list, August 2003.
URL "http://www1.ietf.org/mail-archive/web/tsvwg/current/ msg04389.html" Demizu Expires June 2006 [Page 28] Internet-Draft December 2005 [Lud03b] R. Ludwig, "[Tsvwg] RFC1323.bis [was: Updating timestamps (ts_recent)]", the IETF TSVWG WG mailing list, August 2003. URL "http://www1.ietf.org/mail-archive/web/tsvwg/ current/msg04397.html" [Mil98] D. Miller, "possible bug in PAWS", the IETF TCP-IMPL WG mailing list, March 1998. URL "http://tcp-impl.grc.nasa. gov/tcp-impl/list/archive/1035.html" [PD04] K. Poon and N. Demizu, "Use of TCP timestamp option to defend against blind spoofing attack", (work in progress), Internet-Draft , October 2004. Author's Address Noritoshi Demizu National Institute of Information and Communications Technology 4-2-1 Nukui-Kitamachi, Koganei, Tokyo 184-8795, Japan Phone: +81-42-327-7432 (Ex.5813) E-mail: demizu@nict.go.jp Appendix A: TS2 Reference This appendix gives the formal description of TS1 and TS2 by using C-like pseudocode as a reference. An implementation MAY support TS2. If an implementation supports TS2, it MUST implement the two TCP Timestamps options described in section 3 and the base mechanism described in section 4. In addition, it MAY implement one or more of RTTM, PAWS, PASA, DLD and SRD. A.1 Types The following types are used in this appendix: boolean, integer, 32bit-sequence-number, 32bit-timestamp, and internal-time. All arithmetic dealing with the 32bit-sequence-number and 32bit-timestamp types must be performed modulo 2^32. The format of internal-time depends on the implementation: for example, it may be an integer with unit = 1 second, or an OS-dependent structure. The following type conversion function is used: time2ts() converts an internal-time value to a 32bit-timestamp value. Demizu Expires June 2006 [Page 29] Internet-Draft December 2005 A.2 Boolean Macros The following boolean macros are defined. To compare sequence numbers: SEQ_GT(a, b) True if (a > b) in modulo 2^32. False, otherwise. 
   SEQ_GE(a, b)  True if (a >= b) in modulo 2^32. False, otherwise.
   SEQ_LT(a, b)  True if (a < b) in modulo 2^32. False, otherwise.
   SEQ_LE(a, b)  True if (a <= b) in modulo 2^32. False, otherwise.

To compare timestamps:

   TS_GT(a, b)   True if (a > b) in modulo 2^32. False, otherwise.
   TS_GE(a, b)   True if (a >= b) in modulo 2^32. False, otherwise.
   TS_LT(a, b)   True if (a < b) in modulo 2^32. False, otherwise.
   TS_LE(a, b)   True if (a <= b) in modulo 2^32. False, otherwise.

To compare times:

   TIME_GE(a, b) True if a is no earlier than b. False, otherwise.
   TIME_LT(a, b) True if a is earlier than b. False, otherwise.

A.3 Inequalities

The following inequalities are defined for testing received segments.

- Inequality (1) --- for data, SYN and FIN segments:

     (RSEG.LEN > 0 &&
      SEQ_LT(Last.Ack.Sent - max(RCV.WND), RSEG.SEQ + RSEG.LEN) &&
      SEQ_LT(RSEG.SEQ, RCV.NXT + RCV.WND))

- Inequality (2) --- for ACK segments:

     (RSEG.LEN == 0 &&
      SEQ_LE(Last.Ack.Sent - max(RCV.WND), RSEG.SEQ) &&
      SEQ_LT(RSEG.SEQ, RCV.NXT + RCV.WND))

The following boolean macro is defined for evaluating these inequalities with respect to a received segment.

   TS_ISLEG()    True if (1) or (2) is satisfied. False, otherwise.

A.4 Variables

If a received segment satisfies neither inequality (1) nor (2), the variables below MUST NOT be updated.

A.4.1 Variables for Base Mechanism

The following variables are defined for the base mechanism. (See section 4)

TS.Req (integer)
- This variable represents a user's request: 0 if neither TS1 nor TS2 is requested, 1 if TS1 is requested, or 2 if TS2 is requested.
- The initial value is given by the user.

TS.Mode (integer)
- This variable represents the result of the mode negotiation: negative if negotiation has not been completed, 0 if both TS1 and TS2 are disabled, 1 if TS1 is enabled, or 2 if TS2 is enabled.
- The initial value is negative.

TS.Recent (32bit-timestamp)
- This variable holds the value to be echoed in SSEG.TSecr.
     The initial value is zero.
   - If TS1 is enabled, this variable holds the maximum RSEG.TSval
     value received on segments satisfying
     SEQ_LE(RSEG.SEQ, Last.Ack.Sent).
   - If TS2 is enabled, this variable holds the minimum RSEG.TSval
     value on segments satisfying (RSEG.LEN > 0) received after a
     segment has last been sent.
   - It is similar to TS.Recent as defined in [RFC1323].

TS.RecentIsOld (boolean)
   - This variable is true when TS.Mode == 2 and no segment has been
     received since the last segment was sent. Therefore, this
     variable indicates whether the value in TS.Recent has been
     echoed to a remote node when TS.Mode == 2.
   - It is valid only when TS.Mode is 2. Its value is undefined in
     other modes.

Last.Ack.Sent (32bit-sequence-number)
   - This variable holds the last SSEG.ACK value sent.
   - It is the same as Last.Ack.Sent as defined in [RFC1323].

TS.SndOff (32bit-timestamp)
   - This variable holds an offset to convert an internal timestamp
     value to an external timestamp value.

TS.SndAdj (32bit-timestamp)
   - This variable holds the difference between the return values of
     GetIntTS1() and GetIntTS2() when the first SYN was sent.
   - It adjusts TS.SndOff in the TCP three-way handshake phase when a
     TCP connection is established with TS2 by a local node.

A.4.2 Variables for PAWS

The following variables are defined for PAWS. (See section 6)

TS.RcvMin (32bit-timestamp)
   - This variable holds the maximum received RSEG.TSval value in
     both the TS option and the OTS option.

TS.RcvMin_time (internal-time)
   - This variable holds the time when TS.RcvMin was last updated.

A.4.3 Variables for PASA

A.4.3.1 Variables for PASA-DF/TS2

The following variables are defined for PASA-DF/TS2.
(See section 7.1)

TS.SndMin (32bit-timestamp)
   - This variable holds the maximum received RSEG.TSecr value in
     both the TS option and the OTS option.

TS.SndMax (32bit-timestamp)
   - This variable holds the maximum SSEG.TSval value on sent
     segments satisfying (SSEG.LEN > 0).

TS.SndMax_time (internal-time)
   - This variable holds the time when TS.SndMax was last updated.
   - This variable is referred to in order to determine whether
     TS.SndOff MUST be tweaked. If there is another, simpler way to
     determine it, this variable can be omitted.

TS.PASADF_On (boolean)
   - This variable indicates whether the PASA-DF test can be
     performed.

A.4.3.2 Variables for PASA-SR/TS2

The following variables are defined for PASA-SR/TS2.
(See section 7.2)

TS.PASASR_On (boolean)
   - This variable is set to true when a SYN segment is received
     against an established TCP connection.
   - The initial value is false.

TS.PASASR_time (internal-time)
   - This variable holds the time when the last SYN segment was
     received.
   - The value of this variable is valid only when TS.PASASR_On is
     true.

A.4.4 Variables for DLD

A.4.4.1 Variables for DLD-SEG/TS2

The following variables are defined for DLD-SEG/TS2 for detecting
losses of data. They are associated with each data segment.
(See section 8.1)

DS.SndTS (32bit-timestamp) for each data segment
   - This variable holds the SSEG.TSval value on the latest data
     segment.

DS.SndRO (integer) for each data segment
   - This variable counts the number of received ACK segments
     satisfying TS_GT(RSEG.TSecr, DS.SndTS), which means the number
     of observed possible reorders.

Note: Since DLD-SEG/TS2 is powerful, if it is implemented,
DLD-UNA/TS2 and DLD-SACK/TS2 need not be implemented.

A.4.4.2 Variables for DLD-UNA/TS2

The following variables are defined for DLD-UNA/TS2 for detecting
losses of data at SND.UNA. (See section 8.2)

TS.SndUnaTS (32bit-timestamp)
   - This variable holds the SSEG.TSval value on the latest data
     segment containing data at SND.UNA.
   - The value of this variable is valid only when
     SEQ_LT(SND.UNA, SND.MAX).

TS.SndUnaRO (integer)
   - This variable counts the number of received ACK segments
     satisfying TS_GT(RSEG.TSecr, TS.SndUnaTS), which means the
     number of observed possible reorders.
   - The value of this variable is valid only when
     SEQ_LT(SND.UNA, SND.MAX).

A.4.4.3 Variables for DLD-SACK/TS2

The following variables are defined for DLD-SACK/TS2 for detecting
losses of data in SACK holes. They are associated with each SACK
hole. (See section 8.3)

SH.SndTS (32bit-timestamp) for each SACK hole
   - This variable holds the SSEG.TSval value on the latest data
     segment containing data in this SACK hole.

SH.SndRO (integer) for each SACK hole
   - This variable counts the number of received ACK segments
     satisfying TS_GT(RSEG.TSecr, SH.SndTS), which means the number
     of observed possible reorders.

The following variables are associated with each SACK hole. They
would be implemented in typical SACK implementations.

SH.Start (32bit-sequence-number) for each SACK hole
   - This variable holds the lowest sequence number.

SH.End (32bit-sequence-number) for each SACK hole
   - This variable holds the highest sequence number +1.

A.4.5 Variables for SRD

The following variables are defined for SRD. (See section 9)

TS.SRD_Mode (enumeration)
   - This variable indicates the mode of SRD. Its values are as
     follows:
        TS2_SRD_NO ... SRD is not running.
        TS2_SRD_TO ... SRD was triggered by a retransmission timeout.
        TS2_SRD_FR ... SRD was triggered by a Fast Retransmit.
   - The initial value is TS2_SRD_NO.

TS.SRD_TS (32bit-timestamp)
   - This variable holds the SSEG.TSval value on a retransmitted
     data segment when TS.SRD_Mode is TS2_SRD_NO.
   - Its value is valid only when TS.SRD_Mode is not TS2_SRD_NO. The
     value is not changed while it is valid.

A.5 Current Time

The following pseudocode functions are defined for getting the
current time or current timestamp.

GetTime() --- Get Current Time (internal-time)
   - Get the current time in an internal time format.
   - The time returned by this function MUST NOT be wrapped in the
     lifetime of any TCP connection.

GetIntTS1() --- Get Current Internal TS for TS1 (32bit-timestamp)
   - Get the current time in a 32-bit unsigned integer in the unit of
     the TS1 timestamp. Note that the actual SSEG.TSval value is
     calculated as GetIntTS1() + TS.SndOff.
   - The timestamp unit MUST be in the range of 1 sec to 1 ms.

GetIntTS2() --- Get Current Internal TS for TS2 (32bit-timestamp)
   - Get the current time in a 32-bit unsigned integer in the unit of
     the TS2 timestamp. Note that the actual SSEG.TSval value is
     calculated as GetIntTS2() + TS.SndOff.
   - The timestamp unit is fixed at 1 usec.

GetRTO2() --- Get Current RTO value for TS2 (32bit-timestamp)
   - Get the current RTO value in a 32-bit unsigned integer so that
     it can be used in PAWS/TS2 and PASA-DF/TS2.
   - The timestamp unit is the same as that of TS2.

A.6 Random Number Generator

The following pseudocode function is defined for getting random
numbers.

RandomNumber(max)
   - It returns a random number between 0 and max.

A.7 Constants

The following constants are defined.

TS1_RTTM_G (32bit-timestamp) for RTTM/TS1
   - This constant represents the granularity of GetIntTS1() in the
     unit of the TS1 timestamp.

TS2_RTTM_G (32bit-timestamp) for RTTM/TS2
   - This constant represents the granularity of GetIntTS2() in the
     unit of the TS2 timestamp.

TS1_PAWS_MARGIN (32bit-timestamp) for PAWS/TS1
   - When PAWS/TS1 is enabled, the minimum acceptable RSEG.TSval is
     calculated as (TS.RcvMin - TS1_PAWS_MARGIN).
   - The default value is 1.

TS1_PAWS_IDLE (internal-time) for PAWS/TS1
   - The value of TS.RcvMin is valid for this amount of time if TS1
     is enabled.
   - The default value is 24 days.

TS2_PAWS_IDLE (internal-time) for PAWS/TS2
   - The value of TS.RcvMin is valid for this amount of time if TS2
     is enabled.
   - The default value is 20 minutes.
     (This value should be longer than the longest timeout.)

TS2_PAWS_DEV (32bit-timestamp) for PAWS/TS2
   - This constant represents the acceptable deviation of received
     RSEG.TSval.
   - The default value is 1 minute.

TS2_PASADF_MAXADV (internal-time) for PASA-DF/TS2
   - This constant represents the maximum increase in SSEG.TSval.
   - The default value is 64 seconds.
     (This value SHOULD be greater than the maximum RTO value.)

TS2_PASASR_WIN (integer) for PASA-SR/TS2
   - This constant represents the window size of the ACK segments
     sent in reply to SYN segments.
   - The default value is the maximum default SMSS value on the box.
     The value MAY be RMSS on the TCP connection (in this case, it
     is not a constant, though).

TS2_PASASR_TIME (internal-time) for PASA-SR/TS2
   - This constant represents the time in which received RST
     segments should be specially handled.
   - The default value is 10 seconds.

TS2_DLD_THRESH (integer) for DLD/TS2
   - If the number of observed possible reorders in the view of a
     target segment reaches this value (i.e., is greater than or
     equal to it, as tested in Appendix A.9.2.2), the target segment
     is considered lost.
   - The default value is 3 (the same value as the so-called
     duplicate acknowledgement threshold specified in [RFC2581]).

A.8 Attributes of Received Segments

The following flags represent attributes of the received segment.

   isSYN          True only if it is a SYN segment.
   isRST          True only if it is an RST segment.
   isFirstSYN     True only if it is the first SYN segment.
   isFirstSYNACK  True only if it is the first SYN+ACK segment.
   withTS         True only if it carries the TS option.
   withOTS        True only if it carries the OTS option.
   withOTS_OK     True only if it carries the OTS_OK option.

A.9 Procedures

A.9.1 Initialization

When a TCP Control Block is created or reused, the procedure below
is followed to initialize the variables.

    /* Step 1: Base(TS1&TS2) */
    TS.Req = 0, 1 or 2;     /* Requested by user */
    TS.Mode = -1;           /* Not negotiated yet */
    TS.Recent = 0;
    TS.SndAdj = 0;
    if (TCP Control Block is reused) {
        /* To avoid reusing the same range of SSEG.TSval */
        if (TS.Req == 2) {
            TS.SndOff += (GetIntTS2() - GetIntTS1());
        }
    } else {
        TS.SndOff = 0;
    }

    /* Step 2: PASA-DF/TS2 */
    if (TS.Req == 2) {
        if (TCP Control Block is reused) {
            /* Add 0 to about 10 minutes */
            TS.SndOff += RandomNumber(2^29);
        } else {
            /* Randomize the initial timestamp */
            TS.SndOff = RandomNumber(2^32);
        }
    }

    /* Step 3: DLD-UNA/TS2 */
    TS.SndUnaRO = 0;

    /* Step 4: SRD/TS2 */
    TS.SRD_Mode = TS2_SRD_NO;

A.9.2 Input Processing

A.9.2.1 Input Processing of the First SYN Segment

When the first SYN segment is received, the procedure below is
followed. The first SYN segment means a SYN segment received in the
LISTEN state, or a SYN or SYN+ACK segment received in the SYN-SENT
state.

    /* Step 1: PASA-DF/TS2 */
    if (State == SYN-SENT && TS.Req == 2 && withTS && withOTS_OK) {
        if (TS_LT(RSEG.TSecr, TS.SndMin) ||
            TS_GT(RSEG.TSecr, TS.SndMax)) {
            /* This segment MUST be dropped. */
            /* ACK MUST NOT be sent. */
        }
    }

    /* Step 2: Base(TS1&TS2) */
    /* The negotiated mode cannot exceed either side's capability. */
    TS.Mode = min(TS.Req, (!withTS ? 0 : !withOTS_OK ? 1 : 2));
    if (TS.Mode > 0) {
        TS.Recent = RSEG.TSval;
        if (TS.Mode == 2) {
            TS.RecentIsOld = false;
            if (State == SYN-SENT) {
                TS.SndOff += TS.SndAdj;
            }
        }
    }

    /* Step 3: RTTM(TS1&TS2) */
    if (State == SYN-SENT && TS.Mode > 0) {
        if (TS.Mode == 1) {
            Measured_RTT = ((GetIntTS1() + TS.SndOff)
                            - RSEG.TSecr + TS1_RTTM_G);
        } else if (TS.Mode == 2) {
            Measured_RTT = ((GetIntTS2() + TS.SndOff)
                            - RSEG.TSecr + TS2_RTTM_G);
        }
    }

    /* Step 4: PAWS(TS1&TS2) */
    if (TS.Mode > 0) {
        TS.RcvMin = RSEG.TSval;
        TS.RcvMin_time = GetTime();
    }

A.9.2.2 Input Processing of Other Segments

When a segment other than the first SYN segment is received, the
procedure below is followed.
Such a segment is received in the SYN-RECEIVED state, the ESTABLISHED
state, or a later state. Note that TS.Mode is determined when the
first SYN segment is received.

    /*
     * Step 1: Check received segment.
     */
    if (TS.Mode == 1) {
        /* Step 1-1-1: PAWS/TS1 */
        elapsed_time = GetTime() - TS.RcvMin_time;
        if (!isSYN && !isRST &&
            TIME_LT(elapsed_time, TS1_PAWS_IDLE) &&
            TS_LT(RSEG.TSval, TS.RcvMin - TS1_PAWS_MARGIN)) {
            /* This segment MUST be dropped. */
            /* An ACK with TS SHOULD be sent. */
        }
    } else if (TS.Mode == 2) {
        /* Step 1-2-1: Base(TS2) */
        if (!isSYN && !isRST && !withTS && !withOTS) {
            /* This segment MUST be dropped. */
            /* An ACK SHOULD be sent in reply. */
        }

        /* Step 1-2-2: PAWS/TS2 */
        elapsed_time = GetTime() - TS.RcvMin_time;
        if (!isSYN && (!isRST || withOTS) &&
            TIME_LT(elapsed_time, TS2_PAWS_IDLE)) {
            if (TS_LT(RSEG.TSval, TS.RcvMin - GetRTO2())) {
                /* This segment MUST be dropped. */
                /* An ACK SHOULD be sent in reply. */
            }
            exp_ts = (TS.RcvMin + time2ts(elapsed_time));
            if (TS_GT(RSEG.TSval, exp_ts + TS2_PAWS_DEV)) {
                /*
                 * This segment MAY be dropped.
                 * If PASA-DF/TS2 is enabled,
                 * it SHOULD be dropped.
                 * If it is dropped,
                 * an ACK SHOULD be sent in reply.
                 */
            }
        }

        /* Step 1-2-3: PASA-DF/TS2 */
        if (!isSYN && (!isRST || withOTS) && TS.PASADF_On) {
            if (TS_LT(RSEG.TSecr, TS.SndMin - GetRTO2()) ||
                TS_GT(RSEG.TSecr, TS.SndMax)) {
                /* This segment MUST be dropped. */
                /* An ACK SHOULD be sent in reply. */
            }
        }

        /* Step 1-2-4: PASA-SR/TS2 */
        if (isSYN) {
            TS.PASASR_On = true;
            TS.PASASR_time = GetTime();
            /*
             * This segment MUST be dropped.
             * An ACK with win=TS2_PASASR_WIN
             * without TS nor OTS SHOULD be sent in reply.
             */
        }
        if (isRST && !withOTS && TS.PASASR_On) {
            if (TIME_GE(GetTime() - TS.PASASR_time,
                        TS2_PASASR_TIME)) {
                TS.PASASR_On = false;
            }
            if (TS.PASASR_On &&
                !(SEQ_LE(RCV.NXT, RSEG.SEQ) &&
                  SEQ_LT(RSEG.SEQ, RCV.NXT + TS2_PASASR_WIN))) {
                /* This segment MUST be dropped. */
                /* ACK MUST NOT be sent. */
            }
        }
    }

    /*
     * Step 2: Check acceptability by [RFC793].
     */

    /*
     * Step 3: Process received segment.
     *         It is assumed that SND.UNA has not been updated.
     *
     * Note: (RSEG.LEN > 0 && TS_ISLEG()) is equal to inequality (1)
     */
    if (TS.Mode == 1 && TS_ISLEG()) {
        /* Step 3-1-1: Base(TS1) */
        if (RSEG.LEN > 0 &&
            SEQ_LE(RSEG.SEQ, Last.Ack.Sent) &&
            TS_LT(TS.Recent, RSEG.TSval)) {
            TS.Recent = RSEG.TSval;
        }

        /* Step 3-1-2: RTTM/TS1 */
        if (withTS && RSEG.TSecr != 0) {
            Measured_RTT = ((GetIntTS1() + TS.SndOff)
                            - RSEG.TSecr + TS1_RTTM_G);
        }

        /* Step 3-1-3: PAWS/TS1 */
        if (TIME_GE(GetTime() - TS.RcvMin_time, TS1_PAWS_IDLE) ||
            TS_LT(TS.RcvMin, RSEG.TSval)) {
            TS.RcvMin = RSEG.TSval;
            TS.RcvMin_time = GetTime();
        }
    } else if (TS.Mode == 2 && TS_ISLEG()) {
        /* Step 3-2-1: Base(TS2) */
        if (RSEG.LEN > 0 &&
            (TS.RecentIsOld || TS_GT(TS.Recent, RSEG.TSval))) {
            TS.Recent = RSEG.TSval;
            TS.RecentIsOld = false;
        }

        /* Step 3-2-2: RTTM/TS2 */
        if (withTS) {
            Measured_RTT = ((GetIntTS2() + TS.SndOff)
                            - RSEG.TSecr + TS2_RTTM_G);
        }

        /* Step 3-2-3: PAWS/TS2 */
        if (TIME_GE(GetTime() - TS.RcvMin_time, TS2_PAWS_IDLE) ||
            TS_LT(TS.RcvMin, RSEG.TSval)) {
            TS.RcvMin = RSEG.TSval;
            TS.RcvMin_time = GetTime();
        }

        /* Step 3-2-4: PASA-DF/TS2 */
        if (TS.PASADF_On) {
            if (TS_LT(TS.SndMin, RSEG.TSecr)) {
                TS.SndMin = RSEG.TSecr;
            }
        } else {
            if (TS_GE(TS.SndMax, RSEG.TSecr)) {
                /* Restart the PASA-DF test.
                 */
                TS.SndMin = RSEG.TSecr;
                TS.PASADF_On = true;
            }
        }

        /* Step 3-2-5: PASA-SR/TS2 */
        if (TS.PASASR_On && (!isRST || withOTS)) {
            TS.PASASR_On = false;
        }

        /* Step 3-2-6: DLD-SEG/TS2 */
        foreach data segment {
            if (withTS && TS_GT(RSEG.TSecr, DS.SndTS) &&
                ++DS.SndRO >= TS2_DLD_THRESH) {
                /* This data segment is considered lost. */
            }
        }

        /* Step 3-2-7: DLD-UNA/TS2 */
        if (SEQ_LE(RSEG.ACK, SND.UNA)) {
            if (withTS && TS_GT(RSEG.TSecr, TS.SndUnaTS) &&
                ++TS.SndUnaRO >= TS2_DLD_THRESH) {
                /* Retransmit data at SND.UNA; */
            }
        } else {
            TS.SndUnaTS = (GetIntTS2() + TS.SndOff);
            TS.SndUnaRO = 0;
        }

        /* Step 3-2-8: DLD-SACK/TS2 */
        foreach SACK hole {
            if (This SACK hole is just allocated) {
                SH.SndTS = RSEG.TSecr;
                SH.SndRO = 0;
            }
            if (withTS && TS_GT(RSEG.TSecr, SH.SndTS) &&
                ++SH.SndRO >= TS2_DLD_THRESH) {
                /* All data in this SACK hole is considered lost. */
            }
            if (SEQ_LE(SH.Start, RSEG.ACK) &&
                SEQ_LT(RSEG.ACK, SH.End)) {
                /* New SND.UNA is in this SACK hole. */
                TS.SndUnaTS = SH.SndTS;
                TS.SndUnaRO = SH.SndRO;
            }
        }

        /* Step 3-2-9: SRD/TS2 */
        if (TS.SRD_Mode != TS2_SRD_NO) {
            /* Timestamps are compared modulo 2^32 (see A.1). */
            if (TS_GT(RSEG.TSecr, TS.SRD_TS)) {
                /*
                 * The previous retransmission was GENUINE.
                 */
                TS.SRD_Mode = TS2_SRD_NO;
            } else if (SEQ_LT(SND.UNA, RSEG.ACK)) {
                /* SND.UNA will be advanced. */
                /*
                 * The previous retransmission was SPURIOUS.
                 * Execute a response algorithm if necessary.
                 */
                TS.SRD_Mode = TS2_SRD_NO;
            }
        }
    }

A.9.3 Output Processing

When an RST segment is sent in reply to a received segment because of
[RFC793], the following processing is utilized.

- If the received segment carries the TS option or the OTS option,
  the RST segment MUST carry the OTS option with . Otherwise, the RST
  segment SHOULD NOT carry the TS option nor the OTS option.
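
As an illustration only, the RST rule above can be sketched in C as a
small selection function. The enum and function names are
hypothetical and do not appear in this memo; the sketch covers only
which timestamp option kind an RST sent in reply carries, not the
values placed in the option's fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names, introduced for illustration only. */
enum ts2_opt_kind {
    TS2_OPT_NONE,   /* neither the TS option nor the OTS option */
    TS2_OPT_OTS     /* the OTS option */
};

/*
 * Choose the timestamp option for an RST sent in reply to a received
 * segment: the OTS option if the received segment carried the TS or
 * the OTS option, otherwise no timestamp option at all.
 */
static inline enum ts2_opt_kind
ts2_rst_reply_option(bool with_ts, bool with_ots)
{
    if (with_ts || with_ots)
        return TS2_OPT_OTS;
    return TS2_OPT_NONE;
}
```

Note that an RST never carries the plain TS option under this rule,
which matches the (isRST || TS.RecentIsOld) test in the output
procedure of this appendix.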

When an ACK segment is sent in reply to a received SYN or SYN+ACK
segment because of [RFC793], the following procedure is performed:

    /* Step 1: PASA-SR/TS2 */
    TS.PASASR_On = true;
    TS.PASASR_time = GetTime();
    /*
     * This segment MUST be dropped.
     * An ACK with win=TS2_PASASR_WIN
     * without TS nor OTS SHOULD be sent in reply.
     */

In other cases, when a segment is sent, the procedure below is
followed:

    /* Step 1: PASA-DF/TS2 */
    /* To avoid advancing SSEG.TSval too much after an idle. */
    if (TS.Mode == 2) {
        over_time = ((GetTime() - TS.SndMax_time)
                     - TS2_PASADF_MAXADV);
        if (over_time > 0) {
            TS.SndOff -= time2ts(over_time);
            /* Add 0 - 1 minutes */
            TS.SndOff += RandomNumber(2^26);
            TS.SndMax_time = GetTime();
        }
    }

    /* Step 2: Base(TS1&TS2) */
    if (isSYN ? TS.Req > 0 : TS.Mode > 0) {
        /* TS1 or TS2 */
        if (isSYN && TS.Req == 2) {
            /* Put the OTS_OK option. */
        }
        if (TS.Mode == 2) {
            if (isRST || TS.RecentIsOld) {
                SSEG.TSkind = OTS;
            } else {
                SSEG.TSkind = TS;
                TS.RecentIsOld = true;
            }
            SSEG.TSval = GetIntTS2() + TS.SndOff;
            SSEG.TSecr = TS.Recent;
        } else {
            SSEG.TSkind = TS;
            SSEG.TSval = GetIntTS1() + TS.SndOff;
            SSEG.TSecr = TS.Recent;
        }
    }
    if (isFirstSYN && TS.Req == 2 && TS.Mode < 0) {
        TS.SndAdj = GetIntTS1() - GetIntTS2();
    }
    Last.Ack.Sent = SSEG.ACK;

    /* Step 3: PASA-DF/TS2 */
    if ((isSYN ? TS.Req == 2 : TS.Mode == 2) && SSEG.LEN > 0) {
        TS.SndMax = SSEG.TSval;
        TS.SndMax_time = GetTime();
        if (TS_GT(TS.SndMin, TS.SndMax)) {
            /* Stop the PASA-DF test */
            TS.PASADF_On = false;
        }
    }
    if ((isFirstSYN || isFirstSYNACK) && TS.Req == 2) {
        TS.SndMin = SSEG.TSval;
    }

    /* Step 4: DLD-SEG/TS2 */
    if (TS.Mode == 2 && SSEG.LEN > 0) {
        DS.SndTS = SSEG.TSval;
        DS.SndRO = 0;
    }

    /* Step 5: DLD-UNA/TS2 */
    if ((isSYN ?
         TS.Req == 2 : TS.Mode == 2) && SSEG.LEN > 0) {
        if (SEQ_LE(SSEG.SEQ, SND.UNA) &&
            SEQ_LT(SND.UNA, SSEG.SEQ + SSEG.LEN)) {
            TS.SndUnaTS = (GetIntTS2() + TS.SndOff);
            TS.SndUnaRO = 0;
        }
    }

    /* Step 6: DLD-SACK/TS2 */
    if (TS.Mode == 2 && SSEG.LEN > 0) {
        if (SSEG.SEQ is in a SACK hole) {
            SH.SndTS = SSEG.TSval;
            SH.SndRO = 0;
        }
    }

A.9.4 Retransmission Timeout

Upon a retransmission timeout, in addition to the procedure given in
Appendix A.9.3 for "Output Processing", the procedure below is
followed.

    /* Step 1: SRD/TS2 */
    /*
     * (1) If new data can be sent, send the new data.
     * (2) If lost data in a SACK hole can be retransmitted,
     *     retransmit the lost data.
     * (3) In other cases, retransmit the data at SND.UNA.
     */
    if (TS.SRD_Mode == TS2_SRD_NO) {
        TS.SRD_TS = SSEG.TSval;
        TS.SRD_Mode = TS2_SRD_TO;
    }

A.9.5 Fast Retransmit

Upon a Fast Retransmit, in addition to the procedure given in
Appendix A.9.3 for "Output Processing", the procedure below is
followed.

    /* Step 1: SRD/TS2 */
    /* Retransmit data at SND.UNA. */
    if (TS.SRD_Mode == TS2_SRD_NO) {
        TS.SRD_TS = SSEG.TSval;
        TS.SRD_Mode = TS2_SRD_FR;
    }

A.10 Formats

When both the OTS_OK option and the TS option are sent on SYN or
SYN+ACK segments, the following format is recommended.

    +-------+-------+-------+-------+
    |Kind=??|Len=2  |Kind=8 |Len=10 |
    +-------+-------+-------+-------+
    |        TSval (TS Value)       |
    +-------------------------------+
    |     TSecr (TS Echo Reply)     |
    +-------------------------------+

    Figure A-1: The OTS_OK option and the TS option

When either the TS option or the OTS option is sent, the following
format is recommended. Two NOPs may be replaced with another 2-octet
option.

    +-------+-------+-------+-------+
    |  NOP  |  NOP  |Kind=8 |Len=10 |
    +-------+-------+-------+-------+
    |        TSval (TS Value)       |
    +-------------------------------+
    |     TSecr (TS Echo Reply)     |
    +-------------------------------+

    Figure A-2: The TS option and NOPs


Appendix B: Alternative Ideas

B.1 TCP Feature Array Option

Since the purpose of the OTS_OK option (i.e., the OTS option with
option-length=2) is to negotiate the enabling of a feature, it could
be replaced with a bit in something like a "binary option negotiation
option" [All04]. The format would be like the following:

    +-------+-------+-------+
    |Kind=??|Len=3  | flags |
    +-------+-------+-------+

    Figure B-1: The TCP Feature Array option

This idea is not employed because it requires additional TCP option
space (i.e., at least 3 octets) and a new option-kind value.

Nevertheless, this new 3-octet option can be carried on a SYN segment
even in the following combination (40 octets total).

   -  4 octets: TCP MSS option [RFC793]
   -  3 octets: TCP Feature Array option
   -  3 octets: TCP Window Scale option [RFC1323]
   - 10 octets: TCP Timestamps option [RFC1323]
   -  2 octets: TCP SACK-PERMITTED option [RFC2018]
   - 18 octets: TCP MD5 Signature option [RFC2385]

Normally, for alignment at a 32-bit boundary, one NOP is put after
the TCP Window Scale option, and two NOPs are put before the TCP
Timestamps option, as described in Appendix A of [RFC1323]. If these
three NOPs are removed, the TCP Feature Array option can be inserted
as above without breaking the 32-bit timestamps alignment.

B.2 Timestamp Unit

In this memo, the timestamp unit for TS2 is fixed at 1 usec (10^-6).
This value is advantageous for taking accurate RTT measurements in
LAN environments. In addition, some lower bits of timestamps can be
used as a nonce.

An alternative idea would be to fix the unit at 1 ms (10^-3).
Since [RFC1323] specifies that the unit is in the range of 1 second
to 1 ms, the unit of 1 ms is compatible with [RFC1323]. As a result,
the variable TS.SndAdj can be removed. If the unit is 1 ms, the
default value of TS2_PAWS_IDLE is changed from 20 minutes to 24 days.
In addition, TS.PASADF_On of PASA-DF/TS2 can be removed. This idea is
not employed here, however, in order to obtain the advantages given
in the previous paragraph.

Another alternative idea would be to negotiate the timestamp unit by
using SYN segments within the range between, e.g., 1 sec and 1 nsec.
In this case, TS2_PAWS_IDLE should be replaced with a variable. This
idea is not employed here because such negotiation would not be
simple and would require additional TCP option space.

B.3 TS option without TSecr field

Since the value in the TSecr field in the OTS option may be very old
and useless, an alternative idea would be to replace the OTS option
(of option-length=10) with the TS option without the TSecr field
(i.e., option-length=6) [Duk03b].

This idea is not employed here because the TSecr field in the OTS
option is referred to by PASA-DF/TS2 and SRD/TS2. In addition, some
new mechanisms might use this field in the future.


Appendix C: Loss Detection With SACK and DLD/TS2

This appendix discusses a loss detection procedure making use of SACK
[RFC2018][RFC3517] and DLD/TS2. It does not discuss which data should
be transmitted when more than one data segment can be sent or
retransmitted. This topic is outside the scope of this appendix.

C.1 Highest Sequence Number of Retransmitted Data

This appendix uses one variable: SND.RTX (32bit-sequence-number).
RTX stands for retransmission. It holds the maximum sequence number
of acknowledged or retransmitted data, plus one octet.

The initial value of SND.RTX is SND.UNA.
When an acceptable segment is received, if SND.UNA is advanced and
(SND.RTX < new SND.UNA) is true, then the new SND.UNA is copied to
SND.RTX. When a segment is sent, if (SSEG.SEQ < SND.MAX && SND.RTX <
SSEG.SEQ + SSEG.LEN) is satisfied, then SSEG.SEQ + SSEG.LEN is copied
to SND.RTX. Note that SND.RTX is not rewound upon a retransmission
timeout.

As a result, all retransmitted but unacknowledged data satisfies
(SND.UNA <= data < SND.RTX), while all transmitted but unacknowledged
data satisfies (SND.UNA <= data < SND.MAX).

By the definition given in the terminology section, at least one
octet in an "original data segment" satisfies (SND.RTX <= octet <
SND.MAX), and all octets in a "retransmitted data segment" satisfy
(SND.UNA <= octet < SND.RTX).

C.2 Loss Detection

This subsection describes how losses are detected with or without
SACK and DLD/TS2.

When DLD/TS2 is enabled, it detects losses of both original and
retransmitted data segments in the range from SND.UNA to SND.MAX.
Since DLD/TS2 counts the number of received segments with the TS
option for detecting losses, it is not very robust against losses of
segments sent by a remote node.

When SACK is enabled, IsLost(), as specified in [RFC3517], detects
losses of original data segments. In other words, it can detect
losses of data satisfying (SND.RTX <= data < SND.MAX), but it cannot
detect losses of data satisfying (SND.UNA <= data < SND.RTX).
Nevertheless, it is more robust against losses of ACK segments than
DLD/TS2, because multiple SACK blocks can be sent on each segment.
Therefore, in spite of its limitations, IsLost() is helpful even when
DLD/TS2 is enabled.
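
The ranges used above depend on how SND.RTX is maintained (C.1). The
following C fragment is a minimal sketch of that bookkeeping under
stated assumptions: the structure and function names are
hypothetical, sequence numbers are compared modulo 2^32, and the
update of SND.MAX on transmission is an assumption added only to make
the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Modulo-2^32 "a < b": true when (a - b) lies in the upper half. */
static inline bool seq_lt(uint32_t a, uint32_t b)
{
    return (uint32_t)(a - b) >= 0x80000000u;
}

/* Hypothetical subset of the per-connection send state. */
struct snd_state {
    uint32_t una;   /* SND.UNA */
    uint32_t max;   /* SND.MAX */
    uint32_t rtx;   /* SND.RTX */
};

/* The initial value of SND.RTX is SND.UNA. */
static inline void snd_init(struct snd_state *s, uint32_t iss)
{
    s->una = iss;
    s->max = iss;
    s->rtx = iss;
}

/* An acceptable segment advanced SND.UNA to new_una. */
static inline void snd_on_una_advance(struct snd_state *s,
                                      uint32_t new_una)
{
    s->una = new_una;
    if (seq_lt(s->rtx, new_una))
        s->rtx = new_una;
}

/* A segment covering [seq, seq + len) is sent. */
static inline void snd_on_send(struct snd_state *s,
                               uint32_t seq, uint32_t len)
{
    uint32_t end = seq + len;

    /* Retransmission: starts below SND.MAX, ends above SND.RTX. */
    if (seq_lt(seq, s->max) && seq_lt(s->rtx, end))
        s->rtx = end;

    /* Assumption: SND.MAX tracks the highest sequence number sent. */
    if (seq_lt(s->max, end))
        s->max = end;
}
```

Because snd_on_send() never lowers SND.RTX, the variable is not
rewound when a retransmission timeout restarts transmission at
SND.UNA.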

When DLD/TS2 is disabled, if SACK is disabled or (SND.UNA < SND.RTX)
is true, duplicate ACK segments need to be counted to trigger a Fast
Retransmit (i.e., to detect the loss of data at SND.UNA), as follows:

- After a retransmission timeout, duplicate ACK segments SHOULD NOT
  be counted until the retransmitted data is acknowledged. The
  purpose is to avoid counting duplicate ACK segments sent in reply
  to data segments that were sent before the timeout. Such duplicate
  ACK segments are often observed when a retransmission timeout is
  triggered because of the loss of the data segment sent by a Fast
  Retransmit.

- Duplicate ACK segments for data below SND.UNA SHOULD NOT be
  counted. That is, if SACK is enabled, ACK segments with D-SACK
  [RFC2883] below RSEG.ACK and ACK segments without SACK blocks
  SHOULD NOT be counted.

- Duplicate ACK segments for data above SND.UNA SHOULD be counted.
  That is, if SACK is enabled, ACK segments with D-SACK above
  RSEG.ACK and ACK segments with SACK blocks but without D-SACK
  SHOULD be counted. If TS2 is enabled, however, segments without
  the TS option SHOULD NOT be counted for accuracy.

- When SND.UNA is advanced in the loss recovery phase, regardless of
  the number of received duplicate ACK segments, data starting at
  the new SND.UNA SHOULD be considered lost as with NewReno
  [RFC3782]. This is especially helpful when SACK is enabled and
  (new SND.UNA < SND.RTX) is true.

C.3 SACK Scoreboard

A data sender SHOULD maintain a SACK scoreboard carefully so that it
can effectively recover losses and transmit new data.

According to section 5.1 of [RFC2018], "When a retransmit timeout
occurs the data sender MUST ignore prior SACK information in
determining which data to retransmit". When TS2 is enabled, however,
this appendix recommends that the SACK scoreboard not be discarded
upon a retransmission timeout.
Instead, it recommends that existing SACK blocks in the SACK
scoreboard be updated by newly received SACK blocks if there are
conflicts, as follows.

- One variable is associated with each SACK block: SB.RcvTS
  (32bit-timestamp). It holds the RSEG.TSval value on the segment
  that last updated this SACK block.

- When a received SACK block other than a D-SACK block satisfies
  (RSEG.TSval > SB.RcvTS), where RSEG.TSval represents the
  RSEG.TSval value on the segment that carried the received SACK
  block, the corresponding existing SACK block SHOULD be overwritten
  by the received SACK block in order to avoid possible conflicts.
  Otherwise, if (RSEG.TSval == SB.RcvTS) is true, the corresponding
  existing SACK block MAY be expanded by the received SACK block.

- If RSEG.ACK points to the middle of an existing SACK block, the
  start sequence number of the existing SACK block is changed to
  new SND.UNA + 1 SMSS without updating SB.RcvTS. Then, the data at
  SND.UNA are marked as lost.

In the Slow Start and Congestion Avoidance phase [RFC2581], when
(SND.NXT < SND.MAX) is true (i.e., when SND.NXT has been rewound
because of a retransmission timeout), SND.NXT SHOULD skip the SACKed
data so as not to retransmit it. In addition, skipped SACKed data
SHOULD NOT be calculated as part of the flight size.

If ABC [RFC3465] is enabled, then when an ACK segment is received,
the number of octets acknowledged by the ACK segment needs to be
calculated. In this calculation, already SACKed data SHOULD be
omitted. Since the SACK information may not be fully synchronized
with the data receiver, the number of octets acknowledged by each ACK
segment SHOULD NOT exceed some upper bound (e.g., 2 SMSS).

Note: According to the fourth paragraph of section 2.3 in [RFC3465],
TCP stacks need to determine whether a TCP connection is "during a
slow start phase that follows a retransmission timeout". This
appendix recommends that (SND.NXT < SND.MAX) be used to determine
this.
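
The ABC-related recommendations above can be illustrated with a short
C sketch. It is not a definitive implementation: the function names
and the already_sacked parameter (the number of octets in the newly
acknowledged range that the scoreboard had already recorded as
SACKed) are assumptions made for this example, and the bound of
2 SMSS is simply the example value mentioned in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Modulo-2^32 "a < b": true when (a - b) lies in the upper half. */
static inline bool seq_lt(uint32_t a, uint32_t b)
{
    return (uint32_t)(a - b) >= 0x80000000u;
}

/*
 * (SND.NXT < SND.MAX): SND.NXT has been rewound, i.e., the
 * connection is in the slow start that follows a retransmission
 * timeout.
 */
static inline bool in_rto_slow_start(uint32_t snd_nxt, uint32_t snd_max)
{
    return seq_lt(snd_nxt, snd_max);
}

/*
 * Octets to credit to ABC for one ACK segment that advances SND.UNA
 * from old_una to ack.  Already SACKed octets are omitted, and the
 * result is capped at 2 SMSS.
 */
static inline uint32_t abc_octets_acked(uint32_t old_una, uint32_t ack,
                                        uint32_t already_sacked,
                                        uint32_t smss)
{
    uint32_t covered = ack - old_una;   /* modulo 2^32 */
    uint32_t limit = 2 * smss;          /* example upper bound */
    uint32_t fresh = 0;

    if (already_sacked < covered)
        fresh = covered - already_sacked;
    return (fresh < limit) ? fresh : limit;
}
```

For example, with SMSS = 1000, an ACK that advances SND.UNA by 3000
octets of which 1000 were already SACKed is credited with 2000
octets.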

C.4 SACK-LF (SACK Lowest First)

A data receiver SHOULD inform its data sender of appropriate SACK
information so that the sender can recover lost data effectively.

A data receiver maintains a queue of SACK blocks to be sent in the
TCP SACK option to the data sender. To comply with section 4 of
[RFC2018], when a SACK block is updated, it is typically moved to the
head of the queue. As a result, the most recently updated SACK blocks
are reported to the data sender using the TCP SACK option.

Suppose that some data segments are lost within an RTT. In this case,
a data receiver typically receives the out-of-order data segments in
ascending order. Therefore, SACK blocks sent in reply within the same
RTT (or the first RTT) are typically sorted in descending order. In
contrast, within the next RTT (or the second RTT), if the data
receiver receives all the lost data, the same SACK blocks (which
would be the highest SACK blocks) on the last ACK segment within the
first RTT, excluding cumulatively acknowledged SACK blocks, are sent
in reply, while RSEG.ACK is gradually advancing.

In general, by considering the possibility that some retransmitted
data segments are lost, the most recently updated SACK blocks (which
would be located far from SND.NXT) will be sent in reply within the
second or later RTTs, while the data sender would want to confirm the
SACK blocks just above SND.NXT.

In the current standard TCP, whenever a retransmitted data segment is
lost, a retransmission timeout is triggered in order to re-retransmit
the lost data. According to section 5.1 of [RFC2018], "When a
retransmit timeout occurs the data sender MUST ignore prior SACK
information in determining which data to retransmit". Thus, for the
same reason discussed in the previous paragraph, the data receiver
keeps sending the same SACK blocks, which likely would be the highest
SACK blocks.
As a result, the data sender will retransmit all data between SND.UNA
and the lowest reported SACK block. This retransmitted data will
include data that was SACKed before the retransmission timeout. That
is, bandwidth might be wasted if the data sender complies with
section 5.1 of [RFC2018].

To mitigate this problem, this subsection proposes SACK-LF, as
follows: When RCV.NXT is advanced at a data receiver, a certain
number of the lowest SACK blocks are moved to the head of the queue.
The number of SACK blocks to be moved is chosen so that all SACK
blocks are sent the same number of times, so as to make the SACK
information robust against losses of ACK segments.

This memo proposes that the number of SACK blocks to be moved to the
head of the queue be the sum of the following two numbers plus one
(i.e., num_removed + num_lowest + 1).

- num_removed: The number of SACK blocks that were sent in the
  previous TCP SACK option and are removed by the received RSEG.ACK.

- num_lowest: The number of SACK blocks that were sent in the
  previous TCP SACK option and are in the current lowest N SACK
  blocks, where N is the number of SACK blocks sent in the previous
  TCP SACK option.

If a data sender discards the SACK scoreboard upon a retransmission
timeout, SACK-LF performed at the data receiver will reduce the
number of unnecessary retransmissions. If D-SACK is not supported by
the data sender, SACK-LF will also reduce the number of spurious Fast
Retransmits. If the SACK information has not been fully synchronized
with the data receiver, SACK-LF will suppress unnecessary
retransmissions.

In addition to SACK-LF, this subsection proposes the following:

- If a data receiver discards part of an out-of-order consecutive
  data block that has been reported to the data sender by using the
  TCP SACK option, the shrunken SACK block SHOULD be moved to the
  head of the queue in order to report the change.
- When a data receiver receives a data segment, if it discards part or all of the data, the SACK blocks on the segment sent in reply SHOULD NOT include the discarded part of the data. Note that section 8 of [RFC2018] says "MUST" instead of "SHOULD NOT". Appendix D: Summary of TCP Timestamps Option in RFC1323 The TCP Timestamps option [RFC1323] is currently deployed widely. There is also a variant of the TCP Timestamps option, which probably is more prevalent than the option described in [RFC1323]. The variant is called "rfc1323bis" [JBB03] (see also [Bra93] and [JBB97]) in this appendix. For simplicity, the TCP Timestamps option is called "the TS option" here. This appendix describes the behaviors of the TCP Timestamps option specified in RFC1323 and rfc1323bis, by using C-like pseudocode. Some definitions are borrowed from the TS2 reference given in Appendix A. Demizu Expires June 2006 [Page 51] Internet-Draft December 2005 D.1 Types The following types are borrowed from the TS2 reference: boolean, integer, 32bit-sequence-number, 32bit-timestamp, and internal-time. D.2 Boolean Macros The following boolean macros are borrowed from the TS2 reference: SEQ_LE(), SEQ_LT(), TS_LT(), TS_LE(), and TIME_LT(). D.3 Inequalities The following inequalities are defined. - Inequality (A) ... RFC1323 (SEQ_LE(RSEG.SEQ, Last.ACK.sent) && SEQ_LT(Last.ACK.sent, RSEG.SEQ + RSEG.LEN)) - Inequality (B) ... rfc1323bis SEQ_LE(RSEG.SEQ, Last.ACK.sent) Note: (RSEG.TSval >= TS.Recent) is omitted in this inequality because it is part of the PAWS test. Only one of (A) or (B) SHOULD be implemented. A boolean macro called TS_ISLEG() returns true if the selected inequality is satisfied. Otherwise, it returns false. Note: In addition to the inequalities given above, this memo recommends that (Last.Ack.Sent - max(RCV.WND) <= RSEG.SEQ) also be checked, in addition to (A) or (B). D.4 Variables The following variables are defined in [RFC1323]. 
TS.Recent (32bit-timestamp)
   - This variable records the maximum RSEG.TSval value on the received segments satisfying TS_ISLEG(). It is echoed in SSEG.TSecr.

Last.ACK.sent (32bit-sequence-number)
   - This variable holds the last SSEG.ACK value sent.

The following variables are defined here to describe the behaviors.

TS.Req (boolean)
   - This variable represents a user's request: true if the TCP Timestamps option is requested, and false otherwise.
   - The initial value is given by the user.

TS.OK (boolean)
   - This variable is true if the TS option is enabled. The initial value is false. It is set to true if the TS option is exchanged on SYN and SYN+ACK segments in the TCP three-way handshake phase.

TS.Recent_time (internal-time)
   - This variable holds the time when TS.Recent was last updated.

D.5 Current Time

The following pseudocode functions are defined here for getting the current time or current timestamp.

GetTime() --- Get Current Time (internal-time)
   - Get the current time in an internal time format.
   - The time returned by this function MUST NOT wrap during the lifetime of any TCP connection.

GetTS() --- Get Current Timestamp (32bit-timestamp)
   - Get the current time as a 32-bit unsigned integer so that it can be sent in the TS option.
   - The timestamp unit MUST be in the range of 1 sec to 1 ms.

D.6 Constants

The following constants are defined here to describe the behaviors.

TS_RTTM_G (32bit-timestamp) for RTTM
   - This constant represents the granularity of GetTS(), expressed in the unit of the return value of GetTS().

TS_PAWS_IDLE (internal-time) for PAWS
   - The value of TS.Recent is valid for TS_PAWS_IDLE if TS.OK is true.
   - The default value is 24 days.

D.7 Attributes of Received Segments

The following flags are borrowed from the TS2 reference: isFirstSYN, isFirstSYNACK, isSYN, and withTS.
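
The boolean macros of D.2 and the inequalities of D.3 rely on comparisons that remain correct when 32-bit values wrap. The following C fragment is a minimal sketch of one conventional way to write them, assuming the signed-difference idiom; the definitions that this appendix actually borrows are those of the TS2 reference in Appendix A. The lower-case function names and the rseg structure are hypothetical and are introduced only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Wrap-safe comparisons on 32-bit values (assumed definitions, not
 * copied from Appendix A).  The cast relies on the usual
 * two's-complement conversion from uint32_t to int32_t.
 */
static inline bool seq_lt(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }
static inline bool seq_le(uint32_t a, uint32_t b) { return (int32_t)(a - b) <= 0; }
static inline bool ts_lt(uint32_t a, uint32_t b)  { return (int32_t)(a - b) < 0; }
static inline bool ts_le(uint32_t a, uint32_t b)  { return (int32_t)(a - b) <= 0; }

/* Hypothetical view of the received-segment fields used in D.3. */
struct rseg {
    uint32_t seq;  /* RSEG.SEQ */
    uint32_t len;  /* RSEG.LEN */
};

/* Inequality (A), RFC1323: the segment covers Last.ACK.sent. */
static inline bool ts_isleg_a(const struct rseg *r, uint32_t last_ack_sent)
{
    return seq_le(r->seq, last_ack_sent) &&
           seq_lt(last_ack_sent, r->seq + r->len);
}

/* Inequality (B), rfc1323bis: the segment starts at or below Last.ACK.sent. */
static inline bool ts_isleg_b(const struct rseg *r, uint32_t last_ack_sent)
{
    return seq_le(r->seq, last_ack_sent);
}
```

Under these assumptions, a segment that lies entirely below Last.ACK.sent satisfies (B) but not (A), which is the difference between the two variants that Appendix E discusses.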
D.8 Procedures

D.8.1 Initialization

When a TCP Control Block is created or reused, the procedure below is followed.

      TS.Req = true or false;     /* Requested by user */
      TS.OK = false;

D.8.2 Input Processing

When a segment is received, the procedure below is followed.

      if (isFirstSYN || isFirstSYNACK) {
         if (TS.Req && withTS) {
            TS.OK = true;
            TS.Recent = RSEG.TSval;
            TS.Recent_time = GetTime();
         }
      } else if (TS.OK) {
         /* (R1) PAWS */
         if (!isSYN &&
             TIME_LT(GetTime() - TS.Recent_time, TS_PAWS_IDLE) &&
             TS_LT(RSEG.TSval, TS.Recent)) {
            /* This segment MUST be dropped. */
            /* An ACK with TS SHOULD be sent. */
         }

         /* (R2) If it is outside the window, reject it. */

         /* (R3) Update TS.Recent */
         if (TS_ISLEG() && TS_LE(TS.Recent, RSEG.TSval)) {
            TS.Recent = RSEG.TSval;
            TS.Recent_time = GetTime();
         }

         /* RTTM: If it advances SND.UNA, do RTTM. */
         Measured_RTT = GetTS() - RSEG.TSecr + TS_RTTM_G;
      }

D.8.3 Output Processing

When a segment is sent, the procedure below is followed.

      if (TS.OK) {
         /* Put the TCP Timestamps option */
         SSEG.TSval = GetTS();
         SSEG.TSecr = TS.Recent;
      }
      Last.ACK.sent = SSEG.ACK;

Appendix E: Issues with TCP Timestamps Option in RFC1323

This appendix discusses the issues with both the TCP Timestamps option in [RFC1323] and rfc1323bis [JBB03]. It also discusses how these issues are handled in TS1 and TS2.

E.1 RTTM

Since the RTTMs in RFC1323, rfc1323bis, and TS1 take RTT measurements only when SND.UNA is advanced, they cannot take RTT measurements during the loss recovery phase, except when a partial or full acknowledgement is received. In contrast, RTTM/TS2 can take RTT measurements whenever it receives the TS option, even when SND.UNA is not advanced.

When a remote node is compliant with RFC1323, RTTM overestimates RTTs in the following scenario.
Assume that all data segments sent within an RTT arrive at the remote node but all ACK segments sent in reply are lost. Upon a retransmission timeout, the lowest lost data is retransmitted, and an ACK segment sent in reply is received. In this case, the received ACK segment carries the TSval value on the last original data segment that arrived at the remote node. Therefore, RTTM at the local node measures the time from when the last original data segment was sent until when an ACK segment sent in reply to the retransmitted data segment is received. Thus, the measured RTT is much larger than the real RTT and nearly equal to the RTO value [Duk03a].

In contrast, if the remote node complies with RTTM in rfc1323bis, RTTM/TS1, or RTTM/TS2, then the received ACK segment carries the TSval value on the retransmitted data segment. Therefore, RTTM at the local node takes an accurate RTT measurement, because it measures the time from when the lowest lost data is retransmitted until when its ACK segment is received.

Even when the remote node complies with rfc1323bis or RTTM/TS1, however, RTTM at the local node overestimates RTTs in the scenario described in [Duk03b], while RTTM/TS2 will not overestimate RTTs in the same scenario because of the OTS option.

E.2 PAWS and Reordering

As described in Appendix F, there is a possibility that a legitimate data segment could be discarded by PAWS in RFC1323 and rfc1323bis when it is delayed because of reordering. In addition, there is a possibility that a legitimate ACK segment in a unidirectional data flow could be discarded by PAWS in rfc1323bis when it is delayed because of reordering [Mil98].

In contrast, PAWS/TS1 is slightly more robust against reordering than PAWS in RFC1323 and rfc1323bis, because of TS1_PAWS_MARGIN. PAWS/TS2 is robust against reordering, and legitimate segments are unlikely to be discarded even when they are delayed because of reordering.
Note: Linux seems to comply with RFC1323, instead of rfc1323bis, and it appears to have implemented measures including the same idea as TS1_PAWS_MARGIN.

E.3 Spoofed Segment Detection

[PD04] proposes to detect spoofed segments by making use of the TSecr field. To achieve this goal, when an ACK segment is sent, its TSval value is the same value as the TSval value on the last data segment. Unfortunately, this mechanism makes it impossible to apply PAWS for ACK segments. In addition, there could be other unknown problems.

In contrast, PASA/TS2 detects spoofed segments without tweaking the TSval values. Thus, it does not have such problems.

E.4 Retransmitted Data Loss Detection

It has been said that if the TSval values on out-of-order data segments were echoed by a data receiver, the data sender would be able to detect losses of retransmitted data segments. The TCP Timestamps options in RFC1323, rfc1323bis, and TS1 cannot detect such losses.

In contrast, TS2 enables DLD/TS2 to detect losses of both retransmitted data segments and original data segments.

E.5 Corner Case of Eifel

According to section 3.3 of [RFC3522], if a remote node supports the TCP Timestamps option in RFC1323 and does not support D-SACK [RFC2883], then when all ACK segments within an RTT are lost, the Eifel Detection Algorithm [RFC3522] will misinterpret the consequent retransmission timeout as a spurious timeout.

In contrast, if a remote node supports the TCP Timestamps option in rfc1323bis or TS1, there is no such problem. SRD/TS2 also does not have this problem.

E.6 Vulnerability

If an implementation that complies with rfc1323bis overwrites TS.Recent with RSEG.TSval whenever it receives a segment satisfying (RSEG.TSval >= TS.Recent && RSEG.SEQ <= Last.ACK.sent), it has a vulnerability [CVE05][CERT05].
In contrast, implementations complying with RFC1323, TS1, and TS2 do not have such a vulnerability when the window size is not very large. If TS2 is enabled, PASA/TS2 combined with PAWS/TS2 will detect spoofed segments even when the window size is very large.

E.7 Summary

The table below summarizes the issues discussed in this appendix.

   +----------------------------+---------+------------+------+-----+
   |                            | RFC1323 | rfc1323bis | TS1  | TS2 |
   +----------------------------+---------+------------+------+-----+
   | RTTM: Dup-ACKs             |   NG    |     NG     |  NG  | OK  |
   | RTTM: Overestimation       |   NG    |    Fair    | Fair | OK  |
   | PAWS: Reordering           |   NG    |     NG     | Fair | OK  |
   | PASA: PAWS for ACKs        |   NG    |     NG     |  NG  | OK  |
   | DLD: Retransmitted Data    |   NG    |     NG     |  NG  | OK  |
   | Eifel: A Corner Case       |   NG    |     OK     |  OK  | OK  |
   | Vulnerability              |  Fair   |     NG     | Fair | OK  |
   +----------------------------+---------+------------+------+-----+

   Table E-1: Summary of Issues with TCP Timestamps Option

Appendix F: Problem of PAWS in RFC1323 and Reordering

There is a possibility that legitimate data segments could be discarded by PAWS in [RFC1323] when those segments are delayed because of reordering. This appendix shows some examples of this problem and describes a generic scenario and some possible negative effects.

F.1 Example 1: Reordering and Fast Retransmit with Limited Transmit

In this example, suppose that TCP A is sending data to TCP B. Assume that TCP A supports the TCP Timestamps option in [RFC1323], TCP Congestion Control [RFC2581], and Limited Transmit [RFC3042], and that TCP B supports the TCP Timestamps option with PAWS in [RFC1323].

Suppose that the data segment sequence W.1, X.2, Y.3, Z.4, S.5 is sent by TCP A, where the letter indicates the sequence number and the digit represents the timestamp in the TSval field.
In this data segment sequence, suppose that W.1 and X.2 are sent in the Congestion Avoidance phase, Y.3 and Z.4 are sent by Limited Transmit, and S.5 is sent by Fast Retransmit.

Figure F-1 illustrates the data segment sequence observed at TCP A. The x-axis represents time, and the y-axis represents the sequence number. W.1 through Z.4 and S.5 indicate the data segments sent. Each 'o' mark indicates a received ACK segment. Lines are drawn to connect the symbols between data segments and between ACK segments.

   Sequence number
    A                         Z.4
    |                    Y.3~~   \
    |               X.2~~         \
    |          W.1~~               \
    |        ~~                     \
    |                                S.5
    |               o____o____o____o
    |          o~~~~     1    2    3!! <-- dup-ACK count
    |     o~~~~
    +--------------------------------> Time

   Figure F-1: Time vs. sequence number at TCP A

Now, suppose that the data segment sequence W.1, X.2, Y.3, Z.4, S.5 sent by TCP A is reordered as W.1, X.2, Y.3, S.5, Z.4 (i.e., Z.4 and S.5 are exchanged) on the path to TCP B. Figure F-2 illustrates the resulting data segment sequence observed at TCP B. What happens at TCP B is described below.

0. Assume TS.Recent is valid and TS.Recent == 0. Assume RCV.NXT == S.

1. W.1 is received. PAWS accepts it because TS.Recent < 1. TS.Recent is not updated because RCV.NXT < W.

2. X.2 is received. PAWS accepts it because TS.Recent < 2. TS.Recent is not updated because RCV.NXT < X.

3. Y.3 is received. PAWS accepts it because TS.Recent < 3. TS.Recent is not updated because RCV.NXT < Y.

4. S.5 is received. PAWS accepts it because TS.Recent < 5. TS.Recent is updated because RCV.NXT == S and S.5 has data. Now, TS.Recent == 5 and RCV.NXT >= S + the data length of S.5. (The actual new value of RCV.NXT depends on the out-of-order data queue in TCP B.)

5. Z.4 is received. PAWS discards it because TS.Recent > 4.

In this example, the legitimate segment Z.4 is discarded by PAWS in step 5. Figure F-2 illustrates this scenario.
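
Steps 0 through 5 can be replayed with a short C sketch. It is a simplified model rather than a complete RFC1323 receiver: it assumes that TS.Recent is always valid, that Last.ACK.sent equals RCV.NXT, and that there is no out-of-order data queue, so RCV.NXT advances only by the length of the in-order segment. The structure and function names are hypothetical, and the concrete sequence numbers used with it (S = 1000, W = 2000, X = 3000, Y = 4000, Z = 5000, each segment 1000 octets long) are illustrative values that do not appear in the example itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical receiver state; only what the walk-through needs. */
struct rcv_state {
    uint32_t ts_recent;  /* TS.Recent */
    uint32_t rcv_nxt;    /* RCV.NXT (assumed equal to Last.ACK.sent) */
};

/* Hypothetical view of a received data segment. */
struct segment {
    uint32_t seq;    /* RSEG.SEQ */
    uint32_t len;    /* RSEG.LEN */
    uint32_t tsval;  /* RSEG.TSval */
};

/* Wrap-safe comparisons (assumed signed-difference idiom). */
static inline bool ts_lt(uint32_t a, uint32_t b)  { return (int32_t)(a - b) < 0; }
static inline bool seq_lt(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }
static inline bool seq_le(uint32_t a, uint32_t b) { return (int32_t)(a - b) <= 0; }

/*
 * Process one segment.  Returns true if PAWS accepts the segment and
 * false if PAWS discards it.
 */
static inline bool paws_input(struct rcv_state *st, const struct segment *sg)
{
    /* (R1) PAWS: discard a segment whose TSval is older than TS.Recent. */
    if (ts_lt(sg->tsval, st->ts_recent))
        return false;

    /* (R3) Update TS.Recent only if the segment covers RCV.NXT. */
    if (seq_le(sg->seq, st->rcv_nxt) &&
        seq_lt(st->rcv_nxt, sg->seq + sg->len)) {
        st->ts_recent = sg->tsval;
        st->rcv_nxt = sg->seq + sg->len;  /* no out-of-order queue modeled */
    }
    return true;
}
```

Starting from TS.Recent == 0 and RCV.NXT == S and feeding the segments in the arrival order W.1, X.2, Y.3, S.5, Z.4, this model accepts the first four segments, sets TS.Recent to 5 when S.5 arrives, and then rejects Z.4, which matches step 5.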
   Sequence number
    A                                 Z.4
    |                    Y.3         /
    |               X.2~~   \       /
    |          W.1~~         \     /
    |        ~~               \   /
    |                          S.5
    |
    +--------------------------------> Time

    +---------+-------------------------------+
    |Segment  |(prev) W.1  X.2  Y.3  S.5  Z.4 |
    +---------+-------------------------------+
    |PAWS     |   -   Pass Pass Pass Pass Fail|
    |TS.Recent|   0    0    0    0    5    5  |
    |RCV.NXT  |   S    S    S    S   >S   >S  |
    +---------+-------------------------------+

   Figure F-2: Time vs. sequence number at TCP B

Even in the case where TCP A does not support Limited Transmit (i.e., the case where Y.3 and Z.4 are not sent in the example above), if the data segment sequence W.1, X.2, S.5 sent by TCP A is reordered as W.1, S.5, X.2 (i.e., X.2 and S.5 are exchanged) on the path to TCP B, X.2 could be discarded by PAWS. Since there would be a small gap between the time when X.2 is sent and the time when S.5 is sent, the possibility of this problem occurring would be less than in the example above.

F.2 Example 2: Reordering and NewReno

In this example, suppose that TCP A is sending data to TCP B. Assume that TCP A supports the TCP Timestamps option in [RFC1323], TCP Congestion Control [RFC2581], and NewReno [RFC3782], and that TCP B supports the TCP Timestamps option with PAWS in [RFC1323].

Suppose that the data segment sequence W.1, X.2, Y.3, Z.4, S.5 is sent by TCP A, where the letter indicates the sequence number and the digit represents the timestamp in the TSval field. In the data segment sequence, suppose that W.1 through Z.4 are sent by Fast Recovery each time a duplicate ACK segment is received, and that S.5 is sent by NewReno.

Figure F-3 illustrates the data segment sequence observed at TCP A. This figure uses the same notation as Figure F-1.

   Sequence number
    A                              Z.4
    |                         Y.3~~   \
    |                    X.2~~         \
    |               W.1~~               \
    |             ~~                     \
    |                                     S.5
    |                                   o
    |                                  /
    |                                 /
    |                                /
    |                               /
    |             ..o____o____o____o
    |
    +--------------------------------> Time

   Figure F-3: Time vs. sequence number at TCP A

Now, suppose that the data segment sequence W.1, X.2, Y.3, Z.4, S.5 sent by TCP A is reordered as W.1, X.2, Y.3, S.5, Z.4 (i.e., Z.4 and S.5 are exchanged) on the path to TCP B. The resulting data segment sequence observed at TCP B is the same as that shown in Figure F-2. What happens at TCP B is also the same as in Example 1 above. Consequently, the legitimate segment Z.4 is discarded by PAWS.

F.3 Generic Scenario

In general, this problem occurs in the following scenario. Suppose that TCP A is sending data to TCP B, and consider the following steps:

1. Data segment Z.4 is sent by the sender (TCP A).

2. Data segment S.5 is sent by the sender (TCP A).

   Here, the sequence number of segment S.5 is lower than that of segment Z.4. The TSval value on segment S.5 is newer than that on segment Z.4.

   Note: Segment S.5 would be a retransmitted segment sent by Fast Retransmit, NewReno, SACK [RFC2018][RFC3517], or another mechanism that infers a segment loss and retransmits the lost data quickly. The sequence number of segment S.5 would be less than SND.NXT.

3. Segment S.5 arrives at the receiver earlier than segment Z.4.

   Suppose that segment S.5 satisfies (RSEG.SEQ <= RCV.NXT < RSEG.SEQ + RSEG.LEN), and that the TSval value on segment S.5 is not older than the TS.Recent value at the receiver (TCP B).

   Segment S.5 is accepted by PAWS at the receiver. TS.Recent at the receiver is updated with the TSval value on segment S.5 (i.e., TS.Recent = 5). RCV.NXT is also updated.

4. Segment Z.4 arrives at the receiver (TCP B).

   Segment Z.4 is discarded by PAWS because the TSval value (= 4) on segment Z.4 is older than the TS.Recent value (= 5) at the receiver.

In this scenario, the gap between the time when segment Z.4 is sent and the time when segment S.5 is sent should be small, so that reordering could exchange segments Z.4 and S.5.
F.4 Negative Effects

This problem would cause some negative effects on TCP performance. A data sender would spend additional time detecting a loss and recovering from it. Moreover, the sender would consider the loss to be a congestion indication, and the congestion window would needlessly be further reduced. In addition, discarding legitimate segments at a data receiver is a waste of bandwidth.

F.5 Possible Solution

A straightforward way to solve this problem would be to modify the rules of PAWS so that valid delayed segments are accepted. The new rule would be as follows:

- Change the inequality in R1) in section 4.2.1 of [RFC1323] as shown below:

     Current:  RSEG.TSval < TS.Recent
     Proposal: RSEG.TSval < TS.Recent - T1, where T1 = RTO value.

- In addition, to keep TS.Recent monotonically nondecreasing, in R3) in section 4.2.1 of [RFC1323], TS.Recent should be updated only when RSEG.TSval >= TS.Recent.

With this new rule, it would be very important to choose the value of T1 appropriately. This would be difficult for a data receiver, however, because it does not know the unit of the TSval values on the received segments.

Copyright Statement

Copyright (C) The Internet Society (2005). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgment

Funding for the RFC Editor function is currently provided by the Internet Society.