Network Working Group                                         M. Bagnulo
Internet-Draft                                                      UC3M
Intended status: Informational                                B. Briscoe
Expires: April 2, 2017                               Simula Research Lab
                                                      September 29, 2016

Adding Explicit Congestion Notification (ECN) to TCP control packets and TCP retransmissions
draft-bagnulo-tcpm-generalized-ecn-00

Abstract

This document describes an experimental modification to ECN that allows the use of ECN on the following TCP packets: SYNs, Pure ACKs, Window probes, FINs, RSTs and retransmissions.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on April 2, 2017.

Copyright Notice

Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Bagnulo & Briscoe Expires April 2, 2017 [Page 1]
Internet-Draft ECN and TCP control packets September 2016

Table of Contents

1. Introduction
   1.1. Motivation
   1.2. Experiment goals
   1.3. Document structure
2. Terminology
3. Specification
   3.1. Network behaviour
   3.2. Endpoint behaviour
      3.2.1. SYN
      3.2.2. Pure ACK
      3.2.3. Window Probe
      3.2.4. FIN
      3.2.5. RST
      3.2.6. Retransmissions
4. Discussion about the arguments in RFC3168
   4.1. The reliability argument
   4.2. TCP SYNs
   4.3. Pure ACKs.
   4.4. Retransmitted packets.
   4.5. Window probe packets
5. Security considerations
6. IANA Considerations
7. Acknowledgments
8. Informative References
Authors' Addresses

1. Introduction

RFC3168 [RFC3168] specifies the addition of Explicit Congestion Notification (ECN) to IP. By using the ECN capability, switches performing Active Queue Management (AQM) can use ECN marks instead of packet drops to signal congestion to the endpoints of a communication. This results in lower packet loss and increased performance.
However, RFC3168 specifies ECN support only for TCP data packets and precludes the use of ECN in TCP control packets (TCP SYN, TCP SYN/ACK, pure ACKs, Window probes) and in retransmitted packets. RFC3168 is silent about the use of ECN in RST and FIN packets. RFC 5562 [RFC5562] is an experimental extension to ECN that enables ECN support for TCP SYN/ACK packets. This document defines an experimental modification to ECN that enables ECN support in all the aforementioned packet types.

1.1. Motivation

The inability to use ECN in TCP control packets and retransmissions is potentially harmful, especially in environments where ECN support is pervasive. For example, [judd-nsdi] shows that in a data center (DC) environment where DCTCP is used (in conjunction with ECN), the probability of being able to establish a new connection using a non-ECT-marked SYN packet drops to close to 0 when there are 16 ongoing TCP flows transmitting at full speed. In this particular context of a data center using DCTCP, the issue is that the proposed AQM aggressively marks packets to keep the buffer queues small. This implies that non-ECT-marked packets are in turn dropped aggressively as well, making it nearly impossible to establish new connections when there is ongoing traffic.

These problems are not limited to the data center environment. In any ECN deployment, non-ECT-marked packets suffer a penalty when they traverse a congested bottleneck. For instance, with a drop probability of 1%, 1% of connection attempts suffer a timeout before the SYN is retransmitted, which is very detrimental to the performance of short flows. Dropping TCP control traffic, such as TCP SYNs and pure ACKs, has a negative effect on the overall performance of the communication, so it is beneficial to avoid it.
Finally, there are ongoing efforts to promote the adoption of DCTCP (and similar transports) over the Internet to achieve low latency for all communications [I-D.briscoe-tsvwg-aqm-tcpm-rmcat-l4s-problem]. In such an approach, ECN capable packets are treated more favorably, as they are likely to experience less delay and lower packet drop probability. Preventing TCP control packets, which are critical for TCP performance, from obtaining the benefits of ECN would result in degraded performance.

1.2. Experiment goals

The goal of the experimental extensions defined in this document is to allow the use of ECN (both ECT and CE codepoints) in the public Internet as well as in controlled environments, so that we can find out about the following issues:

   How SYNs, Window probes, pure ACKs, FINs, RSTs and retransmissions that carry the ECT(0), ECT(1) or CE codepoints are processed by the TCP endpoints and the network (including routers, firewalls and other middleboxes). In particular, we would like to learn whether these packets are frequently blocked or whether they are usually forwarded and processed. This will affect the design of the support of the different packet types considered.

   The scale of deployment of the different flavors of ECN, including [RFC3168], [RFC5562], [RFC3540] and [I-D.ietf-tcpm-accurate-ecn]. Depending on how pervasive the deployment of each option is, the design of adding ECN support to the different packet types considered in this document can vary greatly.

   How much the performance of TCP communications is improved by allowing the ECN marking of these packets, for each of the different packet types.

   Any issues (including security issues) that enabling the ECN marking of these packets may raise.
The data gathered through the experiments described in this document will help in the design of the final mechanism (if any) to add ECN support to the different packet types considered in this document. Whenever data input is needed to assist in a design choice, it is spelled out in the document.

Success criteria: If we manage to obtain enough data to have a clearer view of the deployability, the benefits and any other issues of ECN marking of the considered packets, the experiment will be a success. If the results of the experiment show that it is feasible to deploy such changes, that there are gains to be achieved through the changes described in this specification, and that no other major issues interfere with the deployment of the proposed changes, then it would be reasonable to attempt to update RFC3168 to adopt the proposed changes in a standards track specification.

1.3. Document structure

The remainder of this document is structured as follows. In Section 2, we present the terminology used in the rest of the document. In Section 3, we specify the modifications to provide ECN support to TCP SYNs, pure ACKs, Window probes, FINs, RSTs and retransmissions. We describe both the network behaviour and the endpoint behaviour (the latter detailed for both the TCP sender and the TCP receiver). RFC3168 does not preclude the use of ECN in TCP control packets lightly; it provides a number of specific reasons for each packet type. In Section 4, we revisit each of the arguments provided by RFC3168 and explore possibilities to enable the ECN capability in the different packet types.

2. Terminology

The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL, when they appear in this document, are to be interpreted as described in [RFC2119].
Pure ACK: A TCP segment with the ACK flag set and no data payload.

SYN: A TCP segment with the SYN flag set. It may carry data if TCP Fast Open is used.

Window probe: Defined in [RFC1122], a Window probe is a TCP segment with only one byte of data, sent to learn whether the receiver window is still zero.

FIN: A TCP segment with the FIN flag set.

RST: A TCP segment with the RST flag set.

Retransmission: A TCP segment that has been retransmitted by the TCP sender because it determined that the original segment was lost, which may or may not be the case.

3. Specification

3.1. Network behaviour

If a router or any other middlebox along the path receives a SYN, a Window probe, a Pure ACK, a FIN, a RST or a Retransmission with the ECT(0) or the ECT(1) codepoint set, then:

   if the router is not congested (i.e. if, in the same situation, it would forward the same packet carrying the non-ECT codepoint), the router SHOULD forward the packet.

   It could be possible to use a MUST instead of a SHOULD. The reason to use a SHOULD is that some firewalls or other middleboxes may decide to block some of these packets, such as ECT marked SYNs, if they detect an ongoing attack; the SHOULD allows these boxes to drop the packets. This issue is to be discussed.

   if the router is congested (i.e. if, in the same situation, it would drop the same packet carrying the non-ECT codepoint), then the router MAY set the CE codepoint in the packet instead of dropping the packet.

   If a behaviour alternative to drop is defined in another specification (e.g. for ECT(1)), then this text should be updated to allow this alternative behaviour with a proper link to that specification.
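The network behaviour above can be sketched as a small decision function. This is an illustrative sketch only, not part of the specification: the names Ecn, router_action and mark_instead_of_drop are hypothetical, and the two-bit codepoint values are those of RFC 3168. The TCP packet type is deliberately not an input, since the rule applies equally to SYNs, pure ACKs, Window probes, FINs, RSTs and retransmissions. Treating a packet that arrives already marked CE like an ECT packet is an assumption; the text above only covers ECT(0) and ECT(1).

```python
from enum import IntEnum
from typing import Optional, Tuple


class Ecn(IntEnum):
    """Two-bit ECN field values as defined in RFC 3168."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


def router_action(ecn: Ecn, congested: bool,
                  mark_instead_of_drop: bool = True) -> Tuple[str, Optional[Ecn]]:
    """Return (action, outgoing ECN field) for a single packet.

    `congested` means: the router would drop this packet if it carried
    the Not-ECT codepoint. `mark_instead_of_drop` models the MAY.
    """
    if not congested:
        # Same outcome as for a Not-ECT packet: forward it unchanged.
        return ("forward", ecn)
    if ecn != Ecn.NOT_ECT and mark_instead_of_drop:
        # Congested and ECN-capable: signal with CE instead of dropping.
        # (Assumption: a packet already marked CE is handled the same way.)
        return ("forward", Ecn.CE)
    # Congested and either Not-ECT or the router chooses not to mark.
    return ("drop", None)


print(router_action(Ecn.ECT_0, congested=True))
print(router_action(Ecn.NOT_ECT, congested=True))
```

Keeping the MAY as an explicit parameter makes it easy to model a middlebox that still drops ECT-marked control packets, which is one of the behaviours the experiment is meant to measure.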
3.2. Endpoint behaviour

3.2.1. SYN

A first design choice that needs to be made when proposing the experimental extensions to support ECN marking in SYN packets is whether the ECN marking of SYN packets is done only for AccECN endpoints [I-D.ietf-tcpm-accurate-ecn] or also for RFC3168 endpoints.

As far as AccECN capable endpoints are concerned, [I-D.ietf-tcpm-accurate-ecn] defines the wire protocol for supporting ECN marking of the SYN as well as for feeding back the congestion signal to the sender if a SYN packet with the CE codepoint was delivered to the receiver. The mechanism described next follows the wire format defined in [I-D.ietf-tcpm-accurate-ecn] and complements it with the ECT marking of the SYN and with the endpoint behaviour, both of which are left undefined in [I-D.ietf-tcpm-accurate-ecn].

The mechanism described below does not support ECN marking of SYNs with RFC3168 endpoints. Whether this is needed is left for discussion in the WG.

3.2.1.1. TCP client behaviour

In this section we specify the behaviour of a TCP client that wishes to support ECN marking in the SYN. The proposed behaviour is fully compliant with [I-D.ietf-tcpm-accurate-ecn].

We call W0 the default value for the Initial Window supported by the TCP endpoint. According to current specifications, W0 can be any value between 2 and 10.

A TCP endpoint supporting this specification SHOULD have a cache where it stores information about the type of ECN support (RFC3168 ECN, AccECN, ECN in the SYN, no-ECN) of the endpoints it has communicated with.

If a TCP client wishes to use ECN in a given connection and it also wishes to enable the ECN marking of the TCP SYN packet, then the TCP client MUST send the TCP SYN with the ECT(0) or ECT(1) codepoint in the IP header and the NS, the CWR and the ECE flags set to 1 in the TCP header.
The client SHOULD NOT set any of the ECT codepoints if the server is recorded in the cache as not supporting ECN in the SYN packet. Should we define the lifetime of the cache entries and how they are extended? The client SHOULD use an initial retransmission timeout value of 500ms or less for this connection.

DISCUSSION: Should (may?) the sender also send a SYN with the non-ECT codepoint, perhaps slightly delayed after this first one? This would reduce the penalty of adopting the ECN marking of SYNs when communicating with TCP receivers that silently drop ECT SYNs. In the text below, we first consider the case where the client only sends the ECT marked SYN and then we consider the case where the client first sends an ECT marked SYN followed by a regular SYN.

After sending the ECT marked SYN, the client may receive the following different replies and will react as follows:

   If the client receives a SYN/ACK with the CWR and the ECE flags set to 1 and the NS flag set to 0 (this means that the server supports AccECN and that the SYN was not marked with the CE codepoint while being forwarded through the network), then the client must continue with the connection establishment using an Initial Window of W0 and it must use AccECN for this connection (as defined in [I-D.ietf-tcpm-accurate-ecn]). The client SHOULD cache that this server supports ECN in the SYN packet and AccECN.

   If the client receives a SYN/ACK with the CWR, the ECE and the NS flags set to 1 (this means that the server supports AccECN and that the SYN was marked with the CE codepoint while being forwarded through the network), then the client must continue with the connection establishment using an Initial Window of 1 SMSS and it must use AccECN for this connection (as defined in [I-D.ietf-tcpm-accurate-ecn]). The client SHOULD cache that this server supports ECN in the SYN packet and AccECN.
   If the client receives a SYN/ACK with the ECE flag set to 1 and the CWR flag set to 0 (the value of the NS flag can be either 1 or 0) (this means that the server supports RFC3168 ECN but supports neither AccECN nor this specification), then the client must continue with the connection establishment and it must use RFC3168 ECN for this connection. The client SHOULD cache that this server supports RFC3168 ECN but supports neither ECN in the SYN packet nor AccECN.

   DISCUSSION: Initial Window. Because the server does not support ECT marking of the SYN message, it is not possible for the client to know whether the SYN was marked with the CE codepoint while transiting to the server. We can either assume that the SYN was marked with CE and impose an initial window of 1 SMSS, or we can assume it was not marked and use an initial window of W0, or something in between (e.g. W0/2). If we impose a smaller initial window, we are penalizing those that implement this specification, which is in itself an optimization. Also, caching will help in this case, as after the first contact a client will use the appropriate mode to contact a server.

   If the client receives a SYN/ACK with the ECE, the CWR and the NS flags set to 0 (this means that the server does not support ECN), then the client must continue with the connection establishment and it must not use ECN for this connection. The client SHOULD cache that this server does not support ECN.

   DISCUSSION: Same discussion about the Initial Window as in the previous case.

   If the client receives a SYN/ACK with the ECE, the CWR and the NS flags set to 1 (this means that the server does not support ECN and it is not compliant with current standards), then the client may continue with the connection establishment and it must not use ECN for this connection. The client SHOULD cache that this server does not support ECN.
   See Appendix B of [I-D.ietf-tcpm-accurate-ecn] for ideas about how these last two cases can be used.

   DISCUSSION: Same discussion about the Initial Window as in the previous case.

   If the client receives a SYN, then the client must behave as defined in [I-D.ietf-tcpm-accurate-ecn], which implies sending some form of SYN/ACK. After doing so, the client may receive some form of SYN/ACK back. The processing of the SYN/ACK will follow the rules specified above. This is the case of simultaneous open.

   If the retransmission timer times out, then the client SHOULD send a TCP SYN with the Not-ECT codepoint in the IP header. The initial retransmission timeout MUST be set to 1 sec.

3.2.1.2. TCP server

The TCP server behaviour is defined in [I-D.ietf-tcpm-accurate-ecn]. In addition, the TCP server SHOULD cache the type of endpoint of the client, for future connections.

3.2.2. Pure ACK

3.2.2.1. TCP sender behaviour

A TCP endpoint MAY set the ECT(0) or the ECT(1) codepoint in the IP header of packets carrying a TCP pure ACK.

If a TCP endpoint is only sending pure ACKs and it decides to mark them as ECT, the endpoint may receive a congestion signal back indicating that one or more of the pure ACKs it has sent have experienced congestion. When this happens, the endpoint will react in the same way as it would if any other packet had experienced congestion, i.e. it will reduce its congestion window accordingly. However, if the endpoint is only sending pure ACKs, this will have no effect on the load offered to the network and hence it will not help in reducing the congestion. It may be possible to explore some ways to help reduce congestion in this scenario. For instance, one possibility would be for the endpoint to increase the maximum number of data packets it can receive before sending a delayed ACK.
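Returning to the client behaviour of Section 3.2.1.1, the handling of the SYN/ACK that answers an ECT marked SYN can be sketched as a lookup on the NS, CWR and ECE flags. This is a hedged sketch rather than a normative algorithm: the names Reaction and react_to_synack, the cache labels and the value W0 = 10 are hypothetical. Where the text leaves the Initial Window open for discussion, the sketch returns None. The text lists the combination with all three flags set to 1 under two different cases; the sketch assumes the first listing (AccECN, SYN marked with CE). Combinations that the text does not list also return None.

```python
from typing import NamedTuple, Optional

W0 = 10  # assumed default Initial Window, in segments


class Reaction(NamedTuple):
    ecn_mode: str                  # "accecn", "rfc3168" or "none"
    initial_window: Optional[int]  # in segments; None where left for discussion
    cache_entry: str               # what the client records about the server


def react_to_synack(ns: int, cwr: int, ece: int,
                    w0: int = W0) -> Optional[Reaction]:
    """Map the SYN/ACK flags to the client's reaction, per Section 3.2.1.1."""
    if cwr and ece and not ns:
        # AccECN server, SYN was not CE-marked: full Initial Window.
        return Reaction("accecn", w0, "ecn-in-syn+accecn")
    if cwr and ece and ns:
        # AccECN server, SYN was CE-marked: Initial Window of 1 SMSS.
        # (Assumption: first of the two listings for the all-ones case.)
        return Reaction("accecn", 1, "ecn-in-syn+accecn")
    if ece and not cwr:
        # RFC3168-only server (NS may be 0 or 1); Initial Window is open.
        return Reaction("rfc3168", None, "rfc3168-only")
    if not (ns or cwr or ece):
        # Server does not support ECN; Initial Window is open.
        return Reaction("none", None, "no-ecn")
    return None  # combination not covered by the text


print(react_to_synack(ns=0, cwr=1, ece=1))
print(react_to_synack(ns=0, cwr=0, ece=0))
```

Returning the cache entry together with the ECN mode keeps the two obligations of each case (how to continue the connection, and what to remember about the server) in one place.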
Current specifications (RFC 1122 and RFC 5681) require that an ACK be sent for at least every second full-sized segment. Upon congestion notification, the endpoint could increase the number of segments required to send an ACK (while keeping the timeout value unchanged). This would reduce the number of ACKs sent by the endpoint and hence the offered load. Whether this is worthwhile is up for discussion in the WG. Also, it should be noted that if an ACK is dropped due to congestion, the sender of the ACK does not react by reducing the load in any way.

3.2.2.2. TCP receiver behaviour

Upon reception of a pure ACK with the ECT(0), ECT(1) or CE codepoint, the TCP receiver will process it as if it were any other legitimate packet (e.g. a data packet). The exact treatment depends on the flavour of ECN that the endpoint implements (either RFC3168 or AccECN). In particular, for AccECN, a CE marked pure ACK would increase the CE packet counter and would not increase the CE byte counter.

3.2.3. Window Probe

3.2.3.1. TCP sender behaviour

A TCP endpoint MAY set the ECT(0) or the ECT(1) codepoint in the IP header of a packet carrying a zero window probe (ZWP).

According to RFC793, a TCP endpoint that has an ongoing connection for which the other endpoint has announced a receiver window equal to zero will send a ZWP periodically, every two minutes, until the other endpoint of the connection announces a non-zero window.

If a TCP endpoint has an ongoing connection for which the other endpoint has announced a receiver window equal to zero, and it receives incoming packets that include a congestion notification signal, the only option for the endpoint to reduce the offered load is to increase the time between ZWP messages, e.g. by doing an exponential back-off or by incrementing the retransmission timer in some other way.
However, given that the current retransmission timer is already long, it is not clear whether this is an effective decrease of the offered load from a congestion perspective. Note that the endpoint receiving the congestion notification will reduce its congestion window, so that when the receiver window opens, it will transmit with the reduced congestion window.

3.2.3.2. TCP receiver behaviour

Upon reception of a ZWP with the ECT(0), ECT(1) or CE codepoint, the TCP receiver will process it as if it were any other legitimate packet (e.g. a data packet). The exact treatment depends on the flavour of ECN that the endpoint implements (either RFC3168 or AccECN).

3.2.4. FIN

3.2.4.1. TCP sender behaviour

A TCP endpoint MAY set the ECT(0) or the ECT(1) codepoint in the IP header of a packet whose TCP header has the FIN flag set.

After sending the FIN, the endpoint will not send any more data in the connection, although it may send one or more pure ACKs. So, if the endpoint that has set the ECT codepoint in the FIN receives feedback from the other endpoint that the FIN was received with the CE codepoint, there is little it can do to reduce the load offered to the network. It is pointless to reduce the congestion window, as the endpoint will not send any more data. It can try to reduce the number of pure ACKs it sends, by using an approach similar to the one suggested in Section 3.2.2 of increasing the number of segments accumulated before sending a delayed ACK.

3.2.4.2. TCP receiver behaviour

Upon reception of a FIN with the ECT(0), ECT(1) or CE codepoint, the TCP receiver will process it as if it were any other legitimate packet (e.g. a data packet). The exact treatment depends on the flavour of ECN that the endpoint implements (either RFC3168 or AccECN).

3.2.5. RST

A RST message is hardly a useful vehicle to convey congestion notification information.
The reason for this is that the endpoint generating the RST message does not have an open connection after sending it (either because there was no such connection when the packet that triggered the RST message was received or because the packet that triggered the RST message also triggered the closure of the connection). So, if a congestion notification signal is fed back to the sender of the RST message, the sender will not be able to do anything about it. Moreover, from the perspective of the receiver of a RST message with the CE codepoint set, it can either accept the RST message and close the connection, in which case there is no point in echoing the congestion notification signal received, or it can discard the RST message (e.g. because the sequence number is out of window), in which case it probably makes sense to discard the CE signal as well. So, from the receiver perspective, there is no reaction to the reception of a CE marked RST message.

So, the only motivation for marking the RST message with the ECT codepoint is to reduce the chances of the RST message getting dropped. The question of whether it is useful to provide more reliable delivery of RST messages is also non-trivial. RST messages are used both to create and to mitigate attacks. Spoofed RST messages are used by attackers to terminate ongoing connections. Legitimate RST messages allow endpoints to inform their peers that they should eliminate existing state that corresponds to non-existing connections, liberating resources, e.g. in DoS attack scenarios.

So, with all this, the recommendation should probably be that:

   for senders, stacks MUST allow administrators to configure whether RST messages are marked with the ECT(0) or ECT(1) codepoint. We should define a default behaviour, but it is not yet clear what it should be.

   for receivers, ECT and CE codepoints are ignored.

3.2.6. Retransmissions

3.2.6.1. TCP sender behaviour

A TCP endpoint MAY set the ECT(0) or the ECT(1) codepoint in the IP header of a packet carrying a retransmitted segment.

Upon reception of congestion notification that the retransmitted packet was marked with CE, the sender will react as it would if it received congestion notification feedback concerning any other data packet.

3.2.6.2. TCP receiver behaviour

The receiver of a retransmitted packet marked with the ECT(0), ECT(1) or CE codepoint reacts as it would with any other data packet. In particular, the condition of ignoring ECN information for packets outside the receiver window still holds. This means that for those retransmitted packets whose original packet was properly received, the ECN information will be ignored. There is no problem with that, since allowing the ECN marking of retransmitted packets still increases the reliability of their transmission.

4. Discussion about the arguments in RFC3168

This section goes through each of the arguments presented in RFC3168 to prevent the ECN marking of the different packet types and provides counter-arguments for each of them.

4.1. The reliability argument

While RFC 3168 provides a set of specific arguments for preventing the marking of each type of packet, it presents the reliable delivery of the congestion signal as an overarching argument that needs to be considered when trying to enable the ECT marking of TCP control packets. In particular, Section 5.2 of RFC3168 states:

   To ensure the reliable delivery of the congestion indication of the CE codepoint, an ECT codepoint MUST NOT be set in a packet unless the loss of that packet in the network would be detected by the end nodes and interpreted as an indication of congestion.

We believe this argument is overly conservative.
The overall principle that should determine the level of reliability required for ECN capable packets should be "do no harm". Reliable delivery of the CE codepoint is indeed paramount, but the level of reliability required should be that of the original congestion signal (i.e. the detection of the loss of the original packet). In other words, the situation without ECN is that when a packet is to be transmitted through a congested link, the packet may be dropped, and that drop is the congestion signal sent to the endpoint. When ECN is introduced, the reliability of the delivery of the congestion signal should be no worse than without ECN. In particular, setting the CE codepoint in the very same packet seems to fulfil this criterion, since either the packet is delivered and the CE codepoint signal reaches the endpoint, or the packet is dropped, so the original congestion signal (the packet loss) reaches the endpoint. Requiring more than this implies that the ECN congestion signal is delivered more reliably than in the current situation, which is not a bad thing per se, but, as we describe in this memo, it results in performance penalties that should be reconsidered in view of current deployments.

In addition, the reliability of the delivery of the congestion signal is used as an argument for not setting the ECT codepoint in TCP control packets, which effectively reduces the reliability of the transmission of these TCP control packets. There is then a tradeoff between the reliability of the delivery of the congestion signal and the reliability of the delivery of TCP control packets. As currently specified, ECN adoption implies an increase in the reliability of the ECN congestion signal and a decrease in the reliability of the
TCP control packets. We believe that it is possible and desirable to restore the tradeoff that exists in non-ECN-capable networks in terms of reliability, where the congestion signal delivery is as reliable as in a non-ECN-capable network and so is the delivery of TCP control packets.

4.2. TCP SYNs

We next describe the arguments given by current specifications for precluding ECT on SYN packets. RFC 5562 presents two arguments against ECT marking of SYN packets (quoted verbatim):

   There are several reasons why an ECN-Capable codepoint must not be set in the IP header of the initiating TCP SYN packet. First, when the TCP SYN packet is sent, there are no guarantees that the other TCP endpoint (node B in Figure 2) is ECN-Capable, or that it would be able to understand and react if the ECN CE codepoint was set by a congested router.

   Second, the ECN-Capable codepoint in TCP SYN packets could be misused by malicious clients to "improve" the well-known TCP SYN attack. By setting an ECN-Capable codepoint in TCP SYN packets, a malicious host might be able to inject a large number of TCP SYN packets through a potentially congested ECN-enabled router, congesting it even further.

We next go through the arguments stated above, with a view to enabling ECT marking of SYN packets.

Argument 1: Unknown ECN capability at the responder. The initiator does not know what the responder will do if an ECT or CE SYN arrives.

In a controlled environment, this argument does not hold because the administrator can make sure that servers support ECN and in particular ECN-capable SYN packets. Examples of controlled environments are single-tenant DCs, and possibly multi-tenant DCs if we assume that each tenant mostly communicates with its own VMs.

However, in the public Internet context, it cannot be assumed that all TCP responders support ECN, and much less that they support ECT marked SYN packets. It is possible that the responder will check that the SYN complies with RFC 3168, which says a host "MUST NOT" set ECT on a SYN.
RFC 3168 does not say what the responder should do if an ECN-capable SYN arrives. Some implementations might discard the SYN (either silently or by returning a RST). Also, some middleboxes (e.g. firewalls) might take either of these actions on behalf of the responder.

Silent losses lead to much longer delays than resets, by the following reasoning. If the responder sends a reset immediately, the initiator can fall back at once to retransmitting a non-ECT SYN (and possibly fall back from negotiating ECN in the TCP flags as well). After a silent discard, however, the initiator has to wait longer, for a timeout. Then it might immediately fall back to retransmitting a non-ECT SYN, or it might retransmit an unchanged SYN first, in case the loss was simply due to congestion.

Ironically, the benefit of making SYNs ECN-capable is to avoid the delays when SYNs are lost due to congestion. Policy-based discard of ECN-capable SYNs would merely replace congestion as a cause of these delays. So for ECT SYNs to be worthwhile, it seems that the percentage loss due to policy would have to be less than that due to congestion. However, unlike congestion loss, policy loss is predictable, so the initiator can avoid it by caching those sites that do not support ECN-capable SYNs.

According to a study using 2014 data [ecn-pam] from a limited range of vantage points, out of the top 1M Alexa web sites, 4791 (0.82%) IPv4 sites and 104 (0.61%) IPv6 sites failed to establish a connection when they received a TCP SYN with any ECN codepoint set in the IP header and the appropriate ECN flags in the TCP header. Of these, about 41% failed to establish a connection due to the ECN flags in the TCP header even with a Not-ECT ECN field in the IP header (i.e. despite full compliance with RFC 3168).
One option would be to first send an ECT SYN and then a non-ECT SYN (possibly with a small delay between them) and only accept the non-ECT connection if it returned first. Nonetheless, even a cache of a dozen or so sites would avoid performance problems with roughly the Alexa top thousand, so it is questionable whether the level of failure of ECT on SYNs warrants always sending two SYNs, particularly given that failures at well-maintained sites could decline if ECT SYNs are standardized.

Argument 2: Loss of congestion notification in the SYN packet due to lack of support from the responder. If an ECT SYN packet is marked as CE by a congested router along the path but the responder cannot feed back CE marks on SYN packets, the congestion information will be lost.

Currently, neither the TCP nor the DCTCP protocol provides space in the SYN/ACK to feed back a CE mark on the SYN. The problem is that there are two mutually exclusive uses of ECE on the SYN/ACK: i) per RFC 3168, the responder has to set ECE=1 (with CWR=0) to agree to use ECN as part of the 3-way handshake; ii) both TCP and DCTCP use ECE=1 to feed back CE.

The accurate ECN (AccECN) proposal [I-D.ietf-tcpm-accurate-ecn] suggests a two-pronged solution to this problem. First, AccECN provides a way for the responder to feed back whether there was CE on the SYN. Second, AccECN introduces a different combination of TCP header flags on the SYN/ACK so that the initiator knows whether or not the responder supports AccECN. Then, if the responder does indicate that it supports AccECN, the initiator can be sure that, if there is no CE feedback on the SYN/ACK, there really was no CE on the SYN. If the responder's SYN/ACK shows that it does not support AccECN, the initiator can take a conservative approach, assume the SYN was marked with CE and reduce its initial window.
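The initiator's decision on receipt of the SYN/ACK, as just described, can be sketched as follows. This is only an illustration: the function name is hypothetical, and the two window sizes (in segments) are placeholder assumptions, since this document does not specify by how much the initial window should be reduced.

```python
def initial_window_after_synack(supports_accecn, ce_on_syn_fed_back,
                                normal_iw=10, reduced_iw=4):
    """Return the initial window, in segments, to use after the SYN/ACK.

    normal_iw and reduced_iw are hypothetical values chosen only to make
    the sketch concrete.
    """
    if supports_accecn and not ce_on_syn_fed_back:
        # An AccECN responder reported no CE, so the SYN was not marked.
        return normal_iw
    # Either CE was fed back, or the responder cannot feed it back and
    # the initiator conservatively assumes that the SYN was CE-marked.
    return reduced_iw
```

For example, `initial_window_after_synack(False, False)` yields the reduced window, because a responder without AccECN has no way to report a CE mark on the SYN.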
However, the initiator knows that congestion is not pathological enough for a router to have had to turn off ECN, because it knows that both the SYN and the SYN/ACK have been delivered through the network. Therefore, even a conservative initiator would not have to reduce its initial window as much as it would in response to a timeout following no response to its SYN.

Nonetheless, even a slight conservative reduction in the initial window might be a significant penalty, especially in the early days of deployment, when little support for ECT SYN packets will be available. This could be mitigated by caching previous experience of which servers support AccECN.

Argument 3: DoS attacks. [RFC5562] says that ECT SYN packets could be misused by malicious clients to augment "the well-known TCP SYN attack". It goes on to say "a malicious host might be able to inject a large number of TCP SYN packets through a potentially congested ECN-enabled router, congesting it even further."

We assume this is a reference to the TCP SYN flood attack (see https://en.wikipedia.org/wiki/SYN_flood), which is an attack against a responder end point. We assume the idea of this attack is to use ECT to get more packets through an ECN-enabled router in preference to other non-ECN traffic, so that they can go on to use the SYN flooding attack to inflict more damage on the responder end point. This argument could apply to flooding with any type of packet, but we assume SYNs are singled out because their source address is easier to spoof, whereas floods of other types of packets are easier to block.

Mandating Not-ECT in an RFC does not stop attackers using ECT for flooding. Nonetheless, if a standard says SYNs are not meant to be ECT, it would make it legitimate for firewalls to discard them.
However, this would negate the considerable benefit of ECT SYNs for compliant transports, and it seems unnecessary because RFC 3168 already provides the means to address this concern. Section 7 of RFC 3168 says that an AQM MUST turn off ECN support if under persistent overload, and this advice is repeated in [RFC7567] (section 4.2.1). This makes it hard for flooding packets to gain from ECT, but more experiments are needed to see how much might be gained by an attacker flying "just under the radar".

Alternative behaviour. The initiator can set ECT on a SYN as long as it also negotiates the use of AccECN [I-D.ietf-tcpm-accurate-ecn] and as long as it conservatively reduces its initial window if the SYN/ACK shows that the responder does not support AccECN. The reduction in initial window need not be as great as that required in response to a timeout, because the return of a SYN/ACK proves that congestion is not severe. In controlled environments like data centres, universal support for AccECN could be arranged.

Further experiments are needed to test how much malicious hosts can use ECT to augment flooding attacks without triggering AQMs to turn off ECN support (as mandated by RFC 3168 and RFC 7567). If it is found that ECT can only slightly augment flooding attacks, the risk of such attacks will need to be weighed against the performance benefits of ECT SYNs.

4.3. Pure ACKs.

RFC3168 gives the following arguments for not allowing the ECT marking of pure ACKs (ACKs not piggy-backed on data). In section 5.2 it reads:

   To ensure the reliable delivery of the congestion indication of the CE codepoint, an ECT codepoint MUST NOT be set in a packet unless the loss of that packet in the network would be detected by the end nodes and interpreted as an indication of congestion.
   Transport protocols such as TCP do not necessarily detect all packet drops, such as the drop of a "pure" ACK packet; for example, TCP does not reduce the arrival rate of subsequent ACK packets in response to an earlier dropped ACK packet. Any proposal for extending ECN-Capability to such packets would have to address issues such as the case of an ACK packet that was marked with the CE codepoint but was later dropped in the network. We believe that this aspect is still the subject of research, so this document specifies that at this time, "pure" ACK packets MUST NOT indicate ECN-Capability.

Later on, in section 6.1.4 it reads:

   For the current generation of TCP congestion control algorithms, pure acknowledgement packets (e.g., packets that do not contain any accompanying data) MUST be sent with the not-ECT codepoint. Current TCP receivers have no mechanisms for reducing traffic on the ACK-path in response to congestion notification. Mechanisms for responding to congestion on the ACK-path are areas for current and future research. (One simple possibility would be for the sender to reduce its congestion window when it receives a pure ACK packet with the CE codepoint set). For current TCP implementations, a single dropped ACK generally has only a very small effect on the TCP's sending rate.

We next address each of the arguments presented above.

The first argument is about the lack of reliability in conveying congestion notification information when it is carried in pure ACKs. This is the instance, for pure ACKs, of the reliability argument discussed in Section 4.1. In some cases, the loss of a pure ACK is not detected by the endpoints, so any congestion notification information carried in that packet would be lost inadvertently. As we argued before, the bar for deciding if a packet can be marked with the ECT codepoint, i.e.
if it is suitable for carrying congestion notification information, is that the congestion signal should be communicated as reliably as by dropping the packet. After all, the alternative to setting the CE codepoint in the packet is dropping the packet. So the question is whether carrying congestion information in a pure ACK conveys that information as reliably as dropping the pure ACK would, and the answer is clearly yes. If a pure ACK that has ECT set is marked with CE and is later dropped by the network, the result is essentially a fallback to the use of drop as the congestion signal.

The second argument given in RFC3168 is the lack of means in a sender of pure ACKs to reduce the load that is creating the congestion. Again, marking pure ACKs with the ECT codepoint to allow them to carry congestion marks would be no worse than not doing so (and not doing so would be detrimental from a performance perspective). The TCP receiver does not ACK pure ACKs, so the sender of the pure ACK will receive no echo of any congestion notification. However, this is no worse than if a pure ACK is dropped, which cannot even be detected by the remote end.

The proposed AccECN modification to TCP feedback [I-D.ietf-tcpm-accurate-ecn] involves a data receiver repeatedly sending a count of received congestion marks. So AccECN could include marks on pure ACKs in this count, even though it does not ACK pure ACKs themselves. Nonetheless, if the original sender of the pure ACK does not respond to this feedback, or if it is decided that AccECN will not provide this information, it will still make sense to set ECT on pure ACKs, because the congestion situation will be no worse than it is today with non-ECT pure ACKs.
So, overall, we believe that in terms of conveying and reacting to congestion, allowing ECT (and CE) to be set on pure ACKs is no worse than not doing so (and dropping the pure ACK). And not setting ECT on pure ACKs is certainly detrimental to performance, because the loss of a pure ACK can prevent the release of new data.

4.4. Retransmitted packets.

RFC3168 does not allow setting the ECT codepoint in retransmitted packets. The arguments presented in the specification for supporting this design choice are the following ones (the text is quite long, not sure if we should keep it all):

   This document specifies ECN-capable TCP implementations MUST NOT set either ECT codepoint (ECT(0) or ECT(1)) in the IP header for retransmitted data packets, and that the TCP data receiver SHOULD ignore the ECN field on arriving data packets that are outside of the receiver's current window. This is for greater security against denial-of-service attacks, as well as for robustness of the ECN congestion indication with packets that are dropped later in the network.

   First, we note that if the TCP sender were to set an ECT codepoint on a retransmitted packet, then if an unnecessarily-retransmitted packet was later dropped in the network, the end nodes would never receive the indication of congestion from the router setting the CE codepoint. Thus, setting an ECT codepoint on retransmitted data packets is not consistent with the robust delivery of the congestion indication even for packets that are later dropped in the network.

   In addition, an attacker capable of spoofing the IP source address of the TCP sender could send data packets with arbitrary sequence numbers, with the CE codepoint set in the IP header. On receiving this spoofed data packet, the TCP data receiver would determine that the data does not lie in the current receive window, and return a duplicate acknowledgement.
   We define an out-of-window packet at the TCP data receiver as a data packet that lies outside the receiver's current window. On receiving an out-of-window packet, the TCP data receiver has to decide whether or not to treat the CE codepoint in the packet header as a valid indication of congestion, and therefore whether to return ECN-Echo indications to the TCP data sender. If the TCP data receiver ignored the CE codepoint in an out-of-window packet, then the TCP data sender would not receive this possibly-legitimate indication of congestion from the network, resulting in a violation of end-to-end congestion control. On the other hand, if the TCP data receiver honors the CE indication in the out-of-window packet, and reports the indication of congestion to the TCP data sender, then the malicious node that created the spoofed, out-of-window packet has successfully "attacked" the TCP connection by forcing the data sender to unnecessarily reduce (halve) its congestion window. To prevent such a denial-of-service attack, we specify that a legitimate TCP data sender MUST NOT set an ECT codepoint on retransmitted data packets, and that the TCP data receiver SHOULD ignore the CE codepoint on out-of-window packets.

   One drawback of not setting ECT(0) or ECT(1) on retransmitted packets is that it denies ECN protection for retransmitted packets. However, for an ECN-capable TCP connection in a fully-ECN-capable environment with mild congestion, packets should rarely be dropped due to congestion in the first place, and so instances of retransmitted packets should rarely arise. If packets are being retransmitted, then there are already packet losses (from corruption or from congestion) that ECN has been unable to prevent.
   We note that if the router sets the CE codepoint for an ECN-capable data packet within a TCP connection, then the TCP connection is guaranteed to receive that indication of congestion, or to receive some other indication of congestion within the same window of data, even if this packet is dropped or reordered in the network. We consider two cases, when the packet is later retransmitted, and when the packet is not later retransmitted.

   In the first case, if the packet is either dropped or delayed, and at some point retransmitted by the data sender, then the retransmission is a result of a Fast Retransmit or a Retransmit Timeout for either that packet or for some prior packet in the same window of data. In this case, because the data sender already has retransmitted this packet, we know that the data sender has already responded to an indication of congestion for some packet within the same window of data as the original packet. Thus, even if the first transmission of the packet is dropped in the network, or is delayed, if it had the CE codepoint set, and is later ignored by the data receiver as an out-of-window packet, this is not a problem, because the sender has already responded to an indication of congestion for that window of data.

   In the second case, if the packet is never retransmitted by the data sender, then this data packet is the only copy of this data received by the data receiver, and therefore arrives at the data receiver as an in-window packet, regardless of how much the packet might be delayed or reordered. In this case, if the CE codepoint is set on the packet within the network, this will be treated by the data receiver as a valid indication of congestion.

There are essentially three arguments for not ECT marking retransmitted packets, namely reliability, DoS attacks and over-reaction to congestion. We address each of them next, in order.
Regarding reliability, as described in Section 4.1, we believe that the bar should be that the congestion signal is delivered as reliably as if it were a packet drop. If a retransmitted packet is dropped and this goes unnoticed by the receiver, then the congestion signal expressed as a drop is lost. The same applies to the congestion signal that results when the very same retransmitted packet is sent with ECT, marked with CE and later dropped.

Regarding the possibility of DoS attacks, the protection against the DoS attack does not result from forbidding ECT marking of retransmitted packets. If an attacker decided to launch such an attack, it would simply craft the packet with the ECT codepoint set. In effect, the protection against the described DoS attack comes from the requirement that the receiver should ignore the CE codepoint in out-of-window packets. We propose to allow ECT marking of retransmitted packets, in order to reduce the chances of them being dropped, but to keep the requirement to ignore the CE codepoint in out-of-window packets.

Finally, the third argument is about over-reacting to congestion. The argument goes that, if a retransmitted packet is dropped, the sender will not detect it, so it will not react again to congestion (it would have reduced its congestion window already when it retransmitted the packet). Whereas, if retransmitted packets can be CE marked instead of dropped, senders could potentially react more than once to congestion. However, we argue that it is legitimate to respond again to congestion if it still persists in subsequent round trip(s). So it is not incorrect to set ECT on retransmissions.

4.5. Window probe packets

RFC3168 presents only the reliability argument for preventing setting the ECT codepoint in Window Probe packets. Specifically, it states:

   When the TCP data receiver advertises a zero window, the TCP data sender sends window probes to determine if the receiver's window has increased.
   Window probe packets do not contain any user data except for the sequence number, which is a byte. If a window probe packet is dropped in the network, this loss is not detected by the receiver. Therefore, the TCP data sender MUST NOT set either an ECT codepoint or the CWR bit on window probe packets.

   However, because window probes use exact sequence numbers, they cannot be easily spoofed in denial-of-service attacks. Therefore, if a window probe arrives with the CE codepoint set, then the receiver SHOULD respond to the ECN indications.

The reliability argument has been addressed in Section 4.1. Dropping a window probe when the conditions for the Silly Window Syndrome hold basically implies that the sender will be stalled until the next window probe reaches the receiver, which again results in a performance penalty. On the bright side, receivers should already respond to ECN indications in these packets, so changing the behaviour should be less painful than for other packet types.

5. Security considerations

There are several security arguments presented in RFC 3168 for preventing the ECN marking of TCP control packets and retransmitted segments. We believe all of them have been properly addressed in Section 4.

6. IANA Considerations

There are no IANA considerations in this memo.

7. Acknowledgments

TBD

8. Informative References

[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, September 2001.

[RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. Ramakrishnan, "Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, DOI 10.17487/RFC5562, June 2009.

[RFC7567] Baker, F., Ed. and G.
Fairhurst, Ed., "IETF Recommendations Regarding Active Queue Management", BCP 197, RFC 7567, DOI 10.17487/RFC7567, July 2015.

[RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit Congestion Notification (ECN) Signaling with Nonces", RFC 3540, DOI 10.17487/RFC3540, June 2003.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.

[RFC1122] Braden, R., Ed., "Requirements for Internet Hosts - Communication Layers", STD 3, RFC 1122, DOI 10.17487/RFC1122, October 1989.

[I-D.briscoe-tsvwg-aqm-tcpm-rmcat-l4s-problem] Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency, Low Loss, Scalable Throughput (L4S) Internet Service: Problem Statement", draft-briscoe-tsvwg-aqm-tcpm-rmcat-l4s-problem-02 (work in progress), July 2016.

[I-D.ietf-tcpm-accurate-ecn] Briscoe, B., Kuehlewind, M., and R. Scheffenegger, "More Accurate ECN Feedback in TCP", draft-ietf-tcpm-accurate-ecn-01 (work in progress), June 2016.

[judd-nsdi] Judd, G., "Attaining the promise and avoiding the pitfalls of TCP in the Datacenter", NSDI 2015, 2015.

[ecn-pam] Trammell, B., Kuehlewind, M., Boppart, D., Learmonth, I., Fairhurst, G., and R. Scheffenegger, "Enabling Internet-Wide Deployment of Explicit Congestion Notification", PAM 2015, 2015.

Authors' Addresses

Marcelo Bagnulo
Universidad Carlos III de Madrid
Av. Universidad 30
Leganes, Madrid 28911
SPAIN
Phone: 34 91 6249500
Email: marcelo@it.uc3m.es
URI: http://www.it.uc3m.es

Bob Briscoe
Simula Research Lab
Email: ietf@bobbriscoe.net
URI: http://bobbriscoe.net/