Internet Engineering Task Force                            Hadi Salim, J
Internet Draft                                                  Nandy, B
                                                              Seddigh, N
                                       Computing Technology Labs, Nortel
                                                               June 1998

    A proposal for Backward ECN for the Internet Protocol (IPv4/IPv6)

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

To view the entire list of current Internet-Drafts, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern Europe),
ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific Rim),
ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).

Abstract

This memo proposes an alternative approach to the current ECN
mechanism as proposed in the internet draft [draft-kksjf]. A Backward
ECN (BECN) is proposed which uses the existing IP signalling
mechanism, the Internet Control Message Protocol (ICMP) [RFC 792]
Source Quench message. The use of ICMP Source Quench (ISQ) allows a
basic ECN mechanism for IP which does not require any negotiation
between end systems. Congestion notification is kept at the network
(IP) level. The congestion state can be reflected up to the transport
layer (e.g. TCP or UDP) for appropriate action. The ISQ based approach
reduces the reaction time to congestion in the network. In addition,
the ISQ message can include information on the severity of the
congestion, allowing the end host to react accordingly so as to make
maximal use of the resources while maintaining network equilibrium.
Hadi et al               Expires December 1998                  [Page 1]
Internet Draft  Backward ECN for the Internet Protocol         June 1998

1.0 Introduction

IP currently has no widely adhered-to mechanism for notifying its
transport protocols of network congestion problems. ISQs have been
used in the past for congestion notification; today TCP implements its
own congestion control algorithm and makes its own inferences about
network congestion: TCP-Reno and its variants use packet losses as the
indicator, whereas TCP-Vegas uses delay/throughput as the indicator.
UDP applications are usually unresponsive, and the protocols running
over UDP (e.g., RTP) use their own congestion control methods, if they
use any at all.

The initial suggestions to introduce a methodology for adding Explicit
Congestion Notification to IP are outlined in [Floyd94] and later in
the IETF draft [draft-kksjf].

1.1 Current ECN Proposal [draft-kksjf]

Bits 10 and 11 in the IPv6 header are proposed respectively for the
ECT (ECN Capable Transport indicator) and CE (Congestion Experienced
indicator). Bits 6 and 7 of the IPv4 header TOS field are also
proposed as the ECT and CE place holders respectively.

The TCP header is modified to add an additional flag, the ECN Echo, to
notify the sender (from the receiver) that it is contributing to
congestion. The flag's bit-space is borrowed from the reserved field
in the TCP header. This bit is also interchangeably referred to as the
ECE bit in this text.

The ECT bit is set by the sender end system if both the end systems
are ECN capable. This is confirmed in the pre-negotiation during the
connection setup phase in TCP.

Packets encountering congestion are marked (CE bit) by a router on
their way to the receiver end system (from the sender end system),
with a probability proportional to their bandwidth usage, following
the procedure used in RED [RFC2309] routers.
When the receiver end system receives the congestion causing packet
with the CE and ECT bits set, it informs the sender end system that it
is contributing to congestion by setting the ECE bit in the ACK
packet. The sender end system reacts by halving the congestion window
upon receiving the ACK packet. The sender end system reacts to ECE at
most once per in-flight window of data.

1.2 Limitations of the Current ECN Proposal [draft-kksjf]

1) The [draft-kksjf] proposal's congestion notification is coupled to
   the transport layer (TCP) via the use of header information (ECE
   bit). Extending this proposal to other transport protocols will
   require changes to each of their respective headers.

2) The proposed [draft-kksjf] scheme requires the congestion
   notification to incur a round trip time (RTT) before the sender can
   react. In a path with a high delay-bandwidth product this would be
   problematic for two reasons:

   i)  in the scenario where the delay-bandwidth product is dominated
       mostly by the high bandwidth (as in high-speed networks), a
       large amount of traffic will pass through the intermediate
       routers, causing an increase in congestion level before the
       sender is notified.

   ii) in the scenario where the delay-bandwidth product is dominated
       mostly by the high latency/RTT (as in satellite networks), the
       reaction will take too long to address the congestion issue.

   In both cases, the efficient use of the available bandwidth is
   affected.

3) Because of the binary nature of the feedback, the reaction is
   limited to halving the window size even if the congestion level is
   very low. Network resources could be more effectively utilized if
   the feedback was indicative of the congestion level at the
   overloaded point in the network.
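To make limitation 2 concrete, the following sketch estimates how much data a sender can inject while a congestion signal is still in transit. The link rates and delays are hypothetical values chosen only for illustration; they do not come from this draft, and the function name is likewise an assumption.

```python
def bytes_sent_before_reaction(bandwidth_bps: int, signal_delay_ms: int) -> int:
    """Bytes a sender transmits at full rate while the congestion
    signal is still on its way back to it (integer arithmetic)."""
    return bandwidth_bps * signal_delay_ms // 8000


# Hypothetical high-speed path: 155 Mbit/s, 100 ms round trip time.
# With receiver-echoed ECN the signal needs roughly a full RTT.
high_speed_forward = bytes_sent_before_reaction(155_000_000, 100)

# Same path, assuming the congested router is 10 ms away from the
# source, so a backward notification arrives after about 10 ms.
high_speed_backward = bytes_sent_before_reaction(155_000_000, 10)

# Hypothetical satellite path: 2 Mbit/s, 550 ms round trip time.
satellite_forward = bytes_sent_before_reaction(2_000_000, 550)

if __name__ == "__main__":
    print(high_speed_forward, high_speed_backward, satellite_forward)
```

Under these assumed numbers the sender injects about ten times less data before reacting when the notification travels straight back from the router.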

In this document we introduce a Backward ECN (BECN), which is a binary
feedback mechanism, and then an incremental improvement to BECN that
provides multilevel feedback, which we refer to as Multilevel ECN
(MECN).

Section 2 gives an introduction to our solution and how it addresses
the above limitations: a justification for using ISQ is made, and
Backward ECN (BECN) and then multilevel BECN (MECN) are introduced.
Section 3 goes into the details of BECN and suggests a role for the
router and the end system. Section 4 goes into the details of MECN and
suggests a role for the router and the end system. Section 5 addresses
the situation of multiple congested routers with our scheme. Section 6
is on security issues.

2.0 Network Level Signalling for ECN

We argue that ECN is a network level functionality and should be
decoupled from the transport protocols. A mechanism should be provided
for the end IP layer to inform its transport protocols of congestion
problems without using their header bit(s). The benefit is that all IP
transport protocols (including any new ones that might be added in the
future) are notified in the same manner about network congestion.

In this document we only deal with TCP, and in particular with TCP
mechanisms which use packet drops as indicators of congestion, such as
TCP-Reno and its variants.

It is assumed that the participating routers are capable of RED or
some other active queue management mechanism. In such a router, a
packet has a probability of being dropped, where this probability is
dependent on the average queue size. For a packet with the ECT bit set
in the IP header, when the average queue size lies between the minimum
and maximum thresholds the router, with a given probability, sets the
CE bit in the header and forwards the packet instead of dropping it,
as described in [draft-kksjf].
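The dependence of the marking probability on the average queue size can be sketched as follows. This is a simplified form of the base probability from [red-paper]: it grows linearly between the two thresholds and omits RED's per-packet count correction and the averaging of the queue size. The function and parameter names are assumptions made for this illustration.

```python
def red_base_probability(avg_queue: float, min_th: float,
                         max_th: float, max_p: float) -> float:
    """Simplified RED base probability for a given average queue size.

    Below min_th nothing is marked or dropped; above max_th every
    packet is dropped; in between the probability rises linearly
    from 0 to max_p.
    """
    if avg_queue < min_th:
        return 0.0
    if avg_queue > max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


if __name__ == "__main__":
    # Hypothetical thresholds (in packets) and max_p, for illustration.
    for avg in (3, 5, 10, 15, 20):
        print(avg, red_base_probability(avg, 5, 15, 0.1))
```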

We leverage ICMP's Source Quench message, whose design intent is to
provide feedback to a source end system about network congestion. Both
the CE and ECT bits defined in [draft-kksjf] are maintained. During
the de-multiplexing of the IP message, the values of both CE and ECT
are passed to the transport layer.

We start by introducing a traditional ISQ, which comprises a binary
feedback mechanism and a binary reaction at the source end system that
is somewhat modified from what is currently required of an end host
reacting to an ISQ [RFC1122].

Definition: The term binary congestion feedback is used to define
gathered knowledge of network congestion being passed back to an end
node, explicit or otherwise, ignoring the levels of congestion. The
data only says that the network is congested.

We then introduce a multilevel congestion feedback mechanism based on
the various incipient congestion levels detected at the RED router.
The sender end system in that scenario has the luxury of having more
varied reactions based on the congestion level that is fed back. This
results in more effective use of the network resources and better
performance.

Definition: The term multilevel congestion feedback is used to define
gathered knowledge of network congestion being passed back to an end
node with explicit level indicators of how severely the network is
congested.

We propose the multilevel congestion feedback and reaction as an
incremental improvement over the binary congestion feedback and
reaction mechanism. In sections 3 and 4 we suggest some simple
algorithms for both the binary and multilevel solutions.

2.1 Backward ECN (BECN)

This section briefly describes the binary feedback-reaction mechanism.
ICMP Source Quench messages (ISQ) are generated by the intermediate
congested RED router and sent back to the source as an indication of
incipient congestion whenever that router decides to mark the CE bit.
ISQs are not generated for a packet that has already been marked by
another router, regardless of whether that packet is contributing to
congestion at the current router; however, when the router queue level
mandates that the packet be dropped, an ISQ is sent back to the source
regardless of whether the packet was marked previously or not.

The source reacts at the transport protocol level by lowering its data
throughput into the network. In TCP, upon identifying the flow causing
the congestion, the sender reacts by halving both the congestion
window and the slow start threshold value for that flow. The sender
does not react to an ISQ message more than once per window. This is
similar to the algorithm defined in the draft [draft-kksjf].

2.2 Multilevel BECN (MECN)

This section briefly describes the multilevel congestion feedback-
reaction. Multilevel ICMP Source Quench messages (ISQ) are generated
by the RED router and sent back to the source as an indication of
incipient congestion whenever the CE bit is marked by the intermediate
congested router.

The levels are based on the RED probability, and therefore on the
average queue size, at the time a congestive packet arrives at the
router. The congestion level sent back is the marking probability
scaled by a multiplicative factor and is stored in the 32-bit unused
field of the ISQ. As an example, the multiplicative value selected is
100. The upper limit of 100 is returned when the probability of
dropping the packet is equal to one (i.e. the average queue size is
above the maximum threshold).

ISQs are not generated for a packet that has already been marked;
however, as in the case of BECN, when the router queue level mandates
that a packet be dropped, an ISQ is sent back to the source regardless
of whether the packet was previously marked or not. The value sent is
the maximum, i.e. 100 in the above example.
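As an illustration of carrying the congestion level in the ISQ, the sketch below packs and parses a Source Quench message using the layout from [RFC 792] (type 4, code 0, checksum, 32-bit unused field, then the IP header and first 8 data bytes of the offending datagram). Encoding the level as a single unsigned 32-bit integer in network byte order is an assumption of this sketch; the draft only states that the level is stored in the unused field. All function names are hypothetical.

```python
import struct

ICMP_SOURCE_QUENCH = 4  # ICMP type of Source Quench in RFC 792


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_isq(level: int, original: bytes) -> bytes:
    """Build an ISQ whose unused field carries the congestion level.

    `original` is the IP header plus the first 8 data bytes of the
    datagram that triggered the message.
    """
    header = struct.pack("!BBHI", ICMP_SOURCE_QUENCH, 0, 0, level)
    checksum = internet_checksum(header + original)
    return struct.pack("!BBHI", ICMP_SOURCE_QUENCH, 0, checksum, level) + original


def parse_isq_level(message: bytes) -> int:
    """Return the congestion level stored in a received ISQ."""
    icmp_type, _code, _checksum, level = struct.unpack("!BBHI", message[:8])
    if icmp_type != ICMP_SOURCE_QUENCH:
        raise ValueError("not an ICMP Source Quench message")
    return level


if __name__ == "__main__":
    # 28 zero bytes stand in for a 20-byte IP header plus 8 data bytes.
    msg = build_isq(42, bytes(28))
    print(len(msg), parse_isq_level(msg))
```

A plain BECN notification would simply leave the field at zero, as in an ordinary Source Quench.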
2.3 The argument to justify the use of ISQ

ISQ messages, generated by a router to an end system, have in the past
been considered inefficient for the following reasons:

1) Gateway CPU abuse while processing these extra messages, and

2) Bandwidth consumption on the reverse path.

It is suggested [RFC1812] that routers, if implementing ISQs, should
rate limit their generation because they consume too much bandwidth in
the reverse path.

We argue that CPU time is no longer a constrained resource today and
that the benefits provided by ECN outweigh the small performance hit
added. Moreover, it has been shown [red-paper] that when using RED
(with cooperating end systems) fewer packet drops happen at the router
than with the traditional drop-tail algorithms that were assumed when
ISQ was disapproved of. This implies the amount of processing needed
at the router is reduced.

It has been quantitatively shown in simulations [kcho-97] that only
about 1-5% of the packets are marked or dropped in a RED gateway under
incipient congestion. We argue that a faster reaction to the problem,
as provided by ISQ, would alleviate the problem faster, resulting in
even further reductions.

Using a RED gateway provides us with an advantage. A connection is
notified (by an ISQ in this case) of congestion at a rate proportional
to the connection's share of the bandwidth at the congested gateway.

ISQ messages are generated only from the time incipient congestion is
detected until the source end system adjusts. In fact, given that our
scheme addresses congested routers sequentially along a downstream
path, we argue that the back path, even if it is the same as the
forward path, is probably not really congested, since it covers the
path only up to the first point of congestion. More details are given
in section 5.
In essence RED addresses both the backward path congestion problem, if
the back path is the same one as the forward path, and the router
processing concerns.

3.0 Suggested BECN algorithm

This is a binary feedback-reaction mechanism. The ISQs sent by the
router to the source host act as an indication of incipient
congestion. The source reacts at the transport level by lowering its
congestion window. The algorithm supplied here is the same as the one
used in the ECN proposal [draft-kksjf].

3.1 Role of the Router

   If the incoming message causes the average queue size to go above
   the maximum threshold, then:
      drop the packet,
      if the ECT bit is marked in the IP header then:
         send an ISQ back to the source.
   else if the incoming message causes the average queue to go
   between the minimum and maximum thresholds then:
      if the RED probability chooses this packet and the ECT bit is
      set and the packet is not already marked then:
         mark the packet (CE bit) and send an ISQ back.
      else if RED chooses this packet and the ECT bit is not set then:
         drop the packet.

3.2 Role of the Source End System

If an ISQ message is received then the sender knows that there is
network congestion. The flow causing the congestion is identified from
the ICMP data. The TCP source reacts by halving both the congestion
window and the slow start threshold value for that flow.

The sender does not react to ISQ more than once per window. Upon
receipt of an ISQ packet at time t, it notes the packets that are
outstanding at that time (sent but not yet acked) and waits until a
time u when they have all been acknowledged before reacting to a new
ISQ message.

4.0 Suggested MECN algorithm

This is an evolution of BECN. The router now sends levels of
congestion notification and the source end system reacts differently
depending on the severity of the congestion.
The level of notification is stored in the 32-bit unused field in the
ISQ.

4.1 Role of the Router

4.1.1 How the congestion level weight is computed

Pb refers to the computed RED packet marking probability. Pb is a
function of the computed average queue size. As the average queue size
varies from the minimum to the maximum threshold, Pb varies between 0
and the maximum value set for it, Maxp. Note that we take Pb to be one
when the average queue size is above the maximum threshold; in that
particular case, the maximum weight is sent to the source system.

We choose, for simplicity's sake, a multiplicative factor of 100 so
that the weight reads as a percentage congestion level. Above the
maximum threshold we send a value of 100 in the feedback message,
indicating 100% incipient congestion.

We multiply Pb by a factor chosen such that we get a reflection of 99%
congestion when Pb reaches its maximum value, and we add 1 to account
for the fact that Pb is zero at the minimum threshold. The equation
used to compute the weight to send between the minimum and maximum
thresholds is:

   level = Pb*(98/Maxp) + 1

At the maximum threshold the weight sent is 99 and at the minimum
threshold the weight sent is 1. For efficiency, 98/Maxp could be
computed at RED initialization.

4.1.2 The Router functionality

   If the incoming message causes the average queue size to go above
   the maximum threshold, then:
      drop the packet,
      if the ECT bit is marked in the IP header then:
         send an ISQ back to the source with a weight of 100

   If the incoming message causes the average queue to go between
   the minimum and maximum thresholds then:
      if the RED probability picks this packet then:
         if the ECT bit is set and the CE bit is not already marked
         then:
            mark the packet and send an ISQ of integer level
            1+(Pb*98/Maxp) back to the source
         else (the ECT bit is not set in the IP header) then:
            drop the packet.
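The router steps of section 4.1.2 can be sketched as below. Several details are assumptions of this sketch rather than statements of the draft: Pb is taken as the simplified linear RED probability, the integer level is obtained by rounding to nearest (the text only asks for an integer), a packet that RED picks but that already carries CE is forwarded without a new ISQ (following section 2.2), and the uniform random draw `u` is passed in so that the behaviour is reproducible. All names are hypothetical.

```python
from typing import Optional, Tuple


def mecn_level(pb: float, max_p: float) -> int:
    """Weight sent between the thresholds: level = Pb*(98/Maxp) + 1."""
    return round(pb * 98 / max_p + 1)


def mecn_router_action(avg_queue: float, min_th: float, max_th: float,
                       max_p: float, ect: bool, ce: bool,
                       u: float) -> Tuple[str, Optional[int]]:
    """Return (action, isq_level); isq_level is None when no ISQ is sent.

    action is one of "forward", "mark" or "drop"; u is a uniform
    random draw in [0, 1) supplied by the caller.
    """
    if avg_queue > max_th:
        # Above the maximum threshold: always drop, weight 100 if ECT.
        return ("drop", 100 if ect else None)
    if avg_queue >= min_th:
        pb = max_p * (avg_queue - min_th) / (max_th - min_th)
        if u < pb:  # RED picks this packet
            if not ect:
                return ("drop", None)
            if not ce:
                return ("mark", mecn_level(pb, max_p))
            # Already marked upstream: forward without a new ISQ.
    return ("forward", None)


if __name__ == "__main__":
    # Hypothetical thresholds (packets) and Maxp, for illustration only.
    print(mecn_router_action(10, 5, 15, 0.1, ect=True, ce=False, u=0.0))
    print(mecn_router_action(20, 5, 15, 0.1, ect=True, ce=True, u=0.9))
```

Precomputing 98/Maxp once, as the text suggests, would replace the division in `mecn_level` with a single multiplication per marked packet.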
4.2 Role of the End System

The end system can now react to a range of congestion level
notifications. We show here a simple algorithm that could be
incrementally improved. We react to each ISQ received, under the
assumption that the effect of burstiness and spuriousness is accounted
for by the RED algorithm at the router.

Since a weight of 100 indicates that the packet was dropped, we use
this information to improve on TCP's retransmission timeout (RTO)
behaviour by retransmitting that packet. Note that the packet sequence
number can be deduced from the 8 bytes of the TCP header passed back
in the ISQ message (an ISQ always carries the first 8 bytes of the
datagram's data in addition to its IP header).

The slow start, congestion avoidance and fast retransmit/recovery
mechanics are maintained.

4.2.1 The Source end system functionality

If an ISQ message is received then the sender knows that there is
network congestion. The flow causing the congestion is identified from
the ICMP data and the congestion level is extracted.

   If the congestion level == 100 then:
      extract the TCP sequence number from the ISQ.
      retransmit the packet.
      cut the congestion window and threshold value by 1/2.
   else (we are between max and min threshold at the router) then:
      if congestion level >= 50 then:
         cut the congestion window and threshold value by 1/2.
      else (anything below 50%) then:
         congestion window is linearly decremented by 1.

Note:

a) The usual rules about the lower bounds of the threshold and
   congestion window values apply when decrementing.

b) The MECN method outlined above will have interactions with the
   existing congestion control mechanisms in TCP. The overall effect
   still slows down the system throughput if the congestion levels
   warrant it.

5.0 Multiple congested routers

Multiple congested routers on the path between the sender and the
receiver have their concerns addressed one at a time in a domino
effect.
If any of the downstream routers are congested to the extent of a
packet drop then that router's congestion concerns are addressed
immediately.

If a packet is marked by a congested router, no further ISQ message is
generated for it on its way to the destination. The exception to the
rule is when, further along the path after the marking, some other
intermediate router decides to drop this packet. In that case it will
transmit an ISQ of level 100, upon which the end system will have to
invoke the congestion reaction immediately. Therefore any router which
is congested to the level of dropping packets will participate in the
congestion control.

Routers which are closer to the source will be favored in the sense
that their incipient congestion levels will be reacted to first. If
the flow is long enough, the router closest to the source will have
its congestion concerns serviced first, with the next downstream
router serviced next and so forth, with the router closest to the
destination being the last one responded to. The bias is more evident
when a router further downstream (other than the one that marked the
packet) would have sent a higher notification level had it had the
opportunity, i.e. had the packet not already been marked and given a
lesser weight at a previous router. We feel that this bias is not of
great significance given that any downstream router dropping a packet
will contribute to the congestion reaction at the source.

6.0 Security issues

ISQ messages can be spoofed. This can be used for a Denial of Service
attack on a source end system. Building in authentication is probably
too heavyweight. This is a problem faced by IP in general and so we
have not attempted to address it.

7.0 References

[draft-kksjf]
   Ramakrishnan, K.K. and Floyd, S. A proposal to add Explicit
   Congestion Notification (ECN) to IPv6 and to TCP, IETF Draft
   draft-kksjf-ECN-00.txt, November 1997.
[Floyd94]
   Floyd, S. TCP and Explicit Congestion Notification, ACM Computer
   Communications Review, V.24N, October 1994.
[red-paper]
   Floyd, S. and Jacobson, V. Random Early Detection Gateways for
   Congestion Avoidance, IEEE/ACM Transactions on Networking,
   Aug 1993.
[kcho-97]
   Cho, K.J. ALTQ/RED Performance,
   http://www.csl.csl.sony.co.jp/person/kjc/red/perf.html
[RFC 792]
   Postel, J. Internet Control Message Protocol (Sep 1981).
[RFC1122]
   Braden, R. (Editor) Requirements for Internet Hosts --
   Communication Layers (Oct 1989).
[RFC2309]
   Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, S.,
   Estrin, D., Floyd, S., Jacobson, V., Minshall, G., Partridge, C.,
   Peterson, L., Ramakrishnan, K., Shenker, S., Wroclawski, J., and
   Zhang, L. Recommendations on Queue Management and Congestion
   Avoidance in the Internet (April 1998).
[RFC1812]
   Baker, F. Requirements for IPv4 routers (June 1995).

8.0 Acknowledgements

The authors are much indebted to Alan Chapman. Without his insight and
multiple edits the ideas embedded in here would have been much more
difficult to present.

9.0 Authors' Addresses

   Jamal Hadi Salim,
   Computing Technology Labs,
   Nortel Canada,
   PO Box 3511 Station C
   Ottawa ON K1Y 4H7 Canada
   Phone: 613-763-6395
   Email: hadi@nortel.com

   Biswajit Nandy,
   Computing Technology Labs,
   Nortel Canada,
   PO Box 3511 Station C
   Ottawa ON K1Y 4H7 Canada
   Phone: 613-765-3709
   Email: bnandy@nortel.com

   Nabil Seddigh,
   Computing Technology Labs,
   Nortel Canada,
   PO Box 3511 Station C
   Ottawa ON K1Y 4H7 Canada
   Phone: 613-763-6396
   Email: nseddigh@nortel.com