Internet Engineering Task Force K. K. Ramakrishnan INTERNET DRAFT AT&T Labs Research draft-kksjf-ecn-00.txt Sally Floyd LBNL November 1997 Expires: May 1998 A Proposal to add Explicit Congestion Notification (ECN) to IPv6 and to TCP Status of this Memo This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." To view the entire list of current Internet-Drafts, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast). Abstract This note describes a proposed addition of ECN (Explicit Congestion Notification) to IPv6 and to TCP. First we describe TCP's use of packet drops as an indication of congestion. Next we argue that with the addition of active queue management (e.g., RED) to the Internet infrastructure, where routers detect congestion before the queue overflows, routers are no longer limited to packet drops as an indication of congestion, but could instead set an ECN bit in the packet header, for ECN-capable transport protocols. We describe when the ECN bit would be set in the routers, and describe what modifications would be needed to TCP to make it ECN-capable. Modifications to other transport protocols (e.g., unreliable unicast or multicast, reliable multicast, other reliable unicast transport protocols) could be considered as those protocols advance through the standards process. Ramakrishnan and Floyd Informational [Page 1] draft-kksjf-ecn Addition of ECN to IPv6 and TCP November 1997 TCP's congestion control and avoidance algorithms are based on the notion that the network is a black-box [Jacobson88, Jacobson90]. The network's state of congestion or otherwise is determined by end- systems probing for the network state, by gradually increasing the load on the network (by increasing the window of packets that are outstanding in the network) until the network becomes congested and a packet is lost. Treating the network as a "black-box" and treating loss as an indication of congestion in the network is appropriate for pure best-effort data carried by TCP that has little or no sensitivity to delay or loss of individual packets. In addition, TCP's congestion management algorithms have techniques built-in (such as fast retransmit and fast recovery) to minimize the impact of losses from a throughput perspective. However, these mechanisms are not intended to help applications that are in fact sensitive to the delay or loss of one or more individual packets. Interactive traffic such as telnet, web-browsing, and transfer of audio and video data ("real-audio" and "real-video") can be sensitive to packet losses (for unreliable data delivery such as UDP) or to the increased latency of the packet caused by the need to retransmit the packet after a loss (for reliable data delivery such as TCP). Since TCP determines the appropriate congestion window to use by gradually increasing the window size until it experiences a dropped packet, this causes the queues at the bottleneck router to build up. With most packet drop policies at the router that are not sensitive to the load placed by each individual flow, this means that some of the packets of latency-sensitive flows are going to be dropped. Active queue management mechanisms that detect congestion before the queue overflows, and provide an indication of this congestion to TCP, is desirable because it avoids some bad properties of dropping on queue overflow, especially with drop-tail schemes. Drop tail introduces synchronization of loss across multiple flows which is undesirable. Indicating incipient congestion means that TCP does not have to increase its window size up to the point where a router's buffer is filled up. This can reduce queuing delays and avoid synchronization, which are desirable characteristics. 2. Random Early Detection (RED) Random Early Detection (RED) is a mechanism for active queue management that has been proposed to detect incipient congestion [FJ93], and is currently being deployed in the Internet backbone [RED-ietf-draft]. Although RED is meant to be a general mechanism using one of several alternatives for congestion indication, in the current environment of the Internet RED is restricted to using packet drops as a mechanism for congestion indication. By dropping packets Ramakrishnan and Floyd Informational [Page 2] draft-kksjf-ecn Addition of ECN to IPv6 and TCP November 1997 based on the average queue length exceeding a threshold, rather than only when the queue overflows, RED maintains the average queue at a smaller level, and improves the delay experienced by the flows. However, when RED drops packets before the queue actually overflows, RED is not forced by memory limitations to discard the packet. RED could set an Explicit Congestion Notification bit in the packet header instead of dropping the packet, if such a bit was provided in the IP header and understood by the transport protocol. The use of the Explicit Congestion Notification bit would allow the receiver(s) to receive the packet, avoiding the potential for excessive delays due to retransmissions after packet losses. 3. Explicit Congestion Notification We propose that the Internet provide a congestion indication for incipient congestion (as in RED and earlier work [RJ90]) where the notification can sometimes be through marking packets rather than dropping them. This would require an ECN field in the IP header with two bits. The ECN-Capable bit would be set by the data sender to indicate an ECN-capable transport protocol. The ECN bit would be set by the router to indicate congestion to the end nodes. ([Floyd94] outlines a scheme where a single bit could be overloaded to serve the function of both the ECN-Capable bit and the ECN bit, but the two-bit scheme is more straightforward to explain). We expect that routers would provide the congestion indication on incipient congestion as indicated by the average queue size, using the RED algorithms suggested in [FJ93, RED-ietf-draft]. Routers that have a packet arriving at a full queue would drop the packet, just as they do now. The congestion control algorithms followed at the end-systems would be essentially the same as the congestion control response to a *single* dropped packet, for a transport protocol where a dropped packet is used as an indication of congestion. For TCP in particular, the source TCP would halve its congestion window "cwnd" in response to an ECN indication received by the data receiver. However, this action is done only once per window of data (i.e., at most once per roundtrip time), to avoid reacting multiple times to multiple indications of congestion within a roundtrip time. 4. Proposed Algorithm at the Router We describe the proposed algorithm at the router in the context of current router implementations. We assume that the router is capable of implementing the probability computation for RED and uses a pure packet drop mechanism (e.g., drop from front, drop from tail, or random drop) whenever a packet arrives at a full queue. When the router's buffer is not yet full and the router is prepared Ramakrishnan and Floyd Informational [Page 3] draft-kksjf-ecn Addition of ECN to IPv6 and TCP November 1997 to drop a packet to inform end nodes of incipient congestion, the router should first check to see if the ECN-Capable bit is set in that packet's IP header. If so, then instead of dropping the packet, the router could instead set the ECN bit in the IP header. When more severe congestion has occurred and the router's queue is full, then the router has no choice but to drop some packet when a new packet arrives. The router determines it is congested if the AVERAGE length of any of its queues where packets are waiting to be processed or transmitted exceeds a threshold. We believe that the router should use the ECN bit to notify that it is congested only when the *average* queue length, rather than the instantaneous queue length, exceeds a threshold. There are potentially several alternatives for estimating the average queue length and marking the ECN bit. Since there is considerable effort involved already in implementing RED, we believe it is best to leverage these efforts for ECN as well. One potential mechanism for the averaging and marking is to perform functions similar to RED queue management: RED uses an exponential moving average of the queue size. When the average queue size goes above a lower threshold, packets are marked with a probability of marking that increases with the average queue size. (Packets that are not ECN-capable are dropped instead of marked.) When the average queue size gets up to or above a high threshold, all incoming packets should be dropped (assuming that the router intends to control the average queue size even in the presence of unresponsive traffic). It is anticipated that when all of the source end-systems participate in TCP's congestion management mechanisms or other compatible congestion control, and respond to ECN by reducing their offered load, packet losses would be relatively infrequent. Packet losses in this case would occur primarily during transients and in the presence of non-cooperating entities. When a packet is received by a router with the ECN bit set indicating that congestion was encountered upstream, then the bit is left unchanged, and the packet transmitted as usual. 5. Support from the Transport Protocol ECN requires support from the transport protocol, in addition to the ECN field in the IPv6 packet header. For TCP, ECN requires two new mechanisms: negotiation between the endpoints during setup to determine if they are both ECN-capable, and an ECN-Notify bit in the TCP header so that the data receiver can inform the data sender when a packet has been received with the ECN bit set. The support Ramakrishnan and Floyd Informational [Page 4] draft-kksjf-ecn Addition of ECN to IPv6 and TCP November 1997 required from other transport protocols is likely to be different, particular for unreliable or reliable multicast transport protocols, and will have to be determined as other transport protocols are brought to the IETF for standardization. The following sections describe in detail the proposed TCP use of ECN. This is also described in [Floyd94]. We assume that the source TCP uses the current set of congestion control algorithms of Slow-start, Fast Retransmit and Fast Recovery [RFC 2001]. 5.1. TCP Initialization Initially, the source and destination TCPs exchange the desire and/or capability to use ECN in the TCP connection setup phase. As a result of the negotiation, the TCP sender indicates using the ECN-Capable bit in the IPv6 header that the transport is capable and willing to participate in ECN. This will indicate to the routers that they may mark packets with the ECN bit, if they would like to use that as a method of congestion notification. If the TCP connection does not wish to use ECN notification, the sending TCP sets the ECN-Capable bit equal to 0 (i.e., not set), and the TCP receiver ignores the ECN bit in received packets. 5.2. The TCP Sender For a connection that expects to use ECN, packets are transmitted with the ECN-Capable bit set in the IP header (set to a "1"). If the sender receives a TCP acknowledgement with the ECN-Notify bit set in the TCP header, then the sender knows that congestion was encountered in the network on the path from the sender to the receiver. The indication of congestion should be treated just as a congestion loss in non-ECN-Capable TCP. That is, the TCP source halves the congestion window "cwnd" and reduces the slow start threshold "ssthresh". The sending TCP does NOT increase the congestion window in response to the receipt of an ACK packet with the ECN-Notify bit set. However, a very important difference is that TCP does not react to ECN congestion indications more than once every window of data (or more loosely, more than once every round-trip time). If a response to the ECN-Notify bit was made over the last round-trip time, based on the window of packets, then the sending TCP doesn't respond to any further ECN messages. If at time "t", the source TCP reacted to an ECN, then it notes the packets that are outstanding at that time and have not yet been acknowledged. Until all these packets are acknowledged, say at time "u", the source TCP does not react to another ECN indication of congestion. In addition, when a TCP sender receives duplicate acks during the time interval between "t" and "u", it does not reduce the congestion window. The result is that decreases in the congestion window occur Ramakrishnan and Floyd Informational [Page 5] draft-kksjf-ecn Addition of ECN to IPv6 and TCP November 1997 at most once per roundtrip time. When the TCP sender receives a packet with the ECN-Notify bit set, and therefore reduces its congestion window, the sender does not need to slow-start (as is done in Tahoe TCP in response to a packet drop) or to stop sending packets for a period of time to allow the queue to dissipate (as is done by Reno TCP for roughly half a round-trip time in response to a packet drop). The ECN-Notify bit being set does not indicate the urgent transient congestion state of a buffer overflow. Incoming acknowledgements will still arrive to "clock out" outgoing packets when allowed by the congestion window. TCP follows existing algorithms for sending data packets in response to incoming ACKs, multiple duplicate acknowledgements, or retransmit timeouts [RFC2001]. 5.3. The TCP Receiver At the destination end-system, when TCP receives a packet with the ECN bit set in the IP header, TCP sets the ECN-Notify bit in the TCP header in the returning ACK packet. We do not provide here any notion of destination congestion, because this is already being indicated in the receiver's advertised window. The destination TCP continues to perform the duplicate ACK procedure already specified - to generate a duplicate ACK when an out-of- sequence packet is received. If there is any ACK withholding implemented, as in current TCP implementations where the TCP receiver often sends an ACK for two arriving data packets, then the TCP destination will send the OR of all the ECN bits of packets that the ACK is acknowledging. That is, if any packet is received with the ECN bit set, then the ACK carries the ECN-Notify bit set. 5.4. Congestion on the ACK-path For the current generation of TCP congestion control algorithms, pure acknowledgement packets (e.g., packets that do not contain any accompanying data) should be sent with the ECN-capable bit off. Current TCP receivers have no mechanisms for reducing traffic on the ACK-path in response to congestion notification. Mechanisms for responding to congestion on the ACK-path can be relegated as an area for future research. (One simple possibility would be for the sender to reduce its congestion window when it receives a pure ACK packet with the ECN bit set). For current TCP implementations, a single dropped ACK generally has only a very small effect on the TCP's sending rate. Ramakrishnan and Floyd Informational [Page 6] draft-kksjf-ecn Addition of ECN to IPv6 and TCP November 1997 6. Summary of changes required in IPv6 and TCP Two bits need to be specified in the IPv6 header, the ECN-Capable bit and the ECN bit. The ECN-Capable bit set to "0" indicates that the transport protocol will ignore the ECN bit. This is the default value. The ECN-Capable bit set to "1" indicates that the transport protocol is willing and able to participate in ECN. The default value for the ECN bit is "0". The router sets the ECN bit to "1" to indicate congestion to the end nodes. The ECN bit in a packet header should never be reset by a router from "1" to "0". TCP requires two changes, a negotiation phase during setup to determine if both end nodes are ECN-capable, and a bit in the TCP header (possibly one of the "reserved" bits in the TCP flags field) as an ECN-Notify bit so that the receiver can inform the sender of a packet received with the ECN bit set. 7. Non-relationship to ATM's EFCI indicator or Frame Relay's FECN Since these ATM and Frame Relay mechanisms typically have been defined without any notion of average queue size as the basis for concluding that there is congestion, we believe that they provide a very noisy signal. The interpretation we have here for ECN is NOT the appropriate reaction for such a noisy signal of congestion notification. It is our belief that such mechanisms would be phased out over time within the ATM network. However, if the routers that interface to the ATM network have a way of maintaining the average queue at the interface, and use it to come to a conclusion that the ATM subnet is congested or otherwise, they may use the ECN notification that is defined here. 8. Non-compliance by the End Nodes We believe that, for the most part, the fairness properties of TCP will not be changed with the introduction of ECN. A key issue concerns the vulnerability of ECN to non-compliant end- nodes (i.e., end nodes that set the ECN-capable bit in packets, but do not respond to the ECN bit itself). These concerns exist even in non-ECN environments. An end-node could "turn off congestion control" by not reducing its congestion window in response to packet drops. We recognize that this is a concern for the current Internet. It has been argued that routers will have to deploy mechanisms to detect and differentially treat packets from non-compliant flows. It is likely that techniques such as end-to-end per-flow scheduling and isolation of one flow from another, potentially accompanied by end- to-end reservations, could mitigate such effects. Such isolation Ramakrishnan and Floyd Informational [Page 7] draft-kksjf-ecn Addition of ECN to IPv6 and TCP November 1997 mechanisms could remove some of the more egregious effects of non- compliance. However, even in networks just restricted to packet losses as an indication of congestion, several methods have been proposed to identify and treat non-compliant or unresponsive flows. These mechanisms would be equally applicable for identifying flows that do not respond to ECN. If anything, routers would have a slightly easier time identifying flows that do not respond to ECN. For example, routers can observe packets arriving at the router with the ECN bit set, as well as keeping note of packets that have the ECN bit set at that router itself. It has been argued that dropping packets in itself may be considered a deterrrent for non-compliance. However, we believe that the packet drop rates are likely to be reasonably low in environments where ECN is deployed. The reduction in load due to packet drops to deal with non-compliant nodes is likely to be small. The control of congestion is more likely to come from end-nodes reacting to congestion - either from responding to dropped packets or ECN Notify indications and halving the window. ECN should be used at a router when the average queue size is below some high threshold; when the average queue size exceeds the high threshold, and therefore packet drop/marking rates are higher, our recommendation is that routers drop packets, rather then setting the ECN bit in packet headers. Thus, in scenarios with low packet drop rates, the fact that the congestion control indications are in the form of packet drops rather than ECN bits does not significantly change the negative consequences on the compliant flows because of some flow "turning off" congestion control. We also do not believe that packet dropping itself is an effective deterrent for non-compliance. Many flows that retransmit dropped packets could have an incentive to maintain or even increase their sending rate in response to packet drops, rather than decreasing their sending rate, in the absence of mechanisms at the router to provide a negative deterrance for such behavior. For example, flows that use unreliable transport protocols could simply increase their use of FEC in response to an increased packet drop rate, and might choose increased FEC and no congestion control. We believe that the effect of packet dropping as a deterrence for non-compliance with congestion control mechanisms is quite small. The possibility of non-compliant flows does not offer a compelling reason not to deploy ECN. 9. Additional Considerations Some care is required to handle the ECN and ECN-Capable bits appropriately when packets are encapsulated and un-encapsulated for Ramakrishnan and Floyd Informational [Page 8] draft-kksjf-ecn Addition of ECN to IPv6 and TCP November 1997 tunnels. When the router at the end of the tunnel decapsulates the packet, then the ECN bit in the encapsulating ('outside') header should be ORed with the ECN bit in the encapsulated ('inside') header that remains. Basically, a 1 in the encapsulating header should be copied into the encapsulated header. An additional issue concerns packets that have the ECN bit set at one router, and are later dropped at another router. For the proposed use for ECN in this paper (that is, for data packets for TCP), this is not a concern, because end nodes detect dropped data packets, and the congestion response of the end nodes to a dropped data packet is at least as strong as the congestion response to a packet received with the ECN bit set. This issue will have to be addressed if ECN and ECN-Capable bits are used on pure ACK packets, because in current implementations of TCP the drop of an ACK packet is not explicitly detected by the end nodes. If a packet with the ECN bit is later dropped due to corruption (bit errors), the end node should still invoke congestion control, just as TCP would today, to a dropped data packet. This issue would also have to be addressed in future proposals for distinguishing between packets dropped due to corruption and packets dropped due to congestion. 10. Conclusions Given the current effort to implement RED, we believe this is the right time for router vendors to examine how to also implement congestion avoidance mechanisms that do not depend on packet drops alone. With the growth of applications and transports that are sensitive to delay and loss of a single packet, depending on packet loss as a normal congestion notification mechanism appears to be insufficient (or at the very least, non-optimal). Ramakrishnan and Floyd Informational [Page 9] draft-kksjf-ecn Addition of ECN to IPv6 and TCP November 1997 REFERENCES [FJ93] Floyd, S., and Jacobson, V., "Random Early Detection gateways for Congestion Avoidance", IEEE/ACM Transactions on Networking, V.1 N.4, August 1993, p. 397-413. URL "ftp://ftp.ee.lbl.gov/papers/early.pdf". [Floyd94] Floyd, S., "TCP and Explicit Congestion Notification", ACM Computer Communication Review, V. 24 N. 5, October 1994, p. 10-23. URL "ftp://ftp.ee.lbl.gov/papers/tcp_ecn.4.ps.Z". [Floyd97] Floyd, S., and Fall, K., "Router Mechanisms to Support End-to-End Congestion Control", Technical report, February 1997. URL "ftp://ftp.ee.lbl.gov/papers/collapse.ps". [FRED] Lin, D., and Morris, R., "Dynamics of Random Early Detection", SIGCOMM '97, September 1997. URL "http://www.inria.fr/rodeo/sigcomm97/program.html#ab078". [Jacobson88] V. Jacobson, "Congestion Avoidance and Control", Proc. ACM SIGCOMM '88, pp. 314-329. URL "ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z". [Jacobson90] V. Jacobson, "Modified TCP Congestion Avoidance Algorithm", Message to end2end-interest mailing list, April 1990. URL "ftp://ftp.ee.lbl.gov/email/vanj.90apr30.txt". [RED-ietf-draft] B. Braden, D. Clark, J. Crowcroft, B. Davie, S. Deering, D. Estrin, S. Floyd, V. Jacobson, G. Minshall, C. Partridge, L. Peterson, K. Ramakrishnan, S. Shenker, J. Wroclawski, L. Zhang, "Recommendations on Queue Management and Congestion Avoidance in the Internet", Internet draft draft-irtf-e2e-queue-mgt-00.txt, March 25, 1997. [RFC2001] W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms", RFC 2001, January 1997. [RJ90] K. K. Ramakrishnan and Raj Jain, "A Binary Feedback Scheme for Congestion Avoidance in Computer Networks", ACM Transactions on Computer Systems, Vol.8, No.2, pp. 158-181, May 1990. SECURITY CONSIDERATIONS Security issues are not discussed in this document. Ramakrishnan and Floyd Informational [Page 10] draft-kksjf-ecn Addition of ECN to IPv6 and TCP November 1997 AUTHORS' ADDRESSES K. K. Ramakrishnan AT&T Labs. Research Phone: +1 (973) 360-8766 Email: kkrama@research.att.com URL: http://www.research.att.com/info/kkrama Sally Floyd Lawrence Berkeley National Laboratory Phone: +1 (510) 486-7518 Email: floyd@ee.lbl.gov URL: http://www-nrg.ee.lbl.gov/floyd/ This draft was created in November 1997. It expires May 1998. Ramakrishnan and Floyd Informational [Page 11]