Internet Engineering Task Force                      N. Venkitaraman
INTERNET DRAFT                                         Motorola Labs
Expires January, 2001                                      J. Mysore
                                                       Motorola Labs
                                                          R. Srikant
                                                                UIUC
                                                           R. Barnes
                                                                UIUC
                                                           July 2000

                 Stateless Prioritized Fair Queueing

Status of this Memo

   This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups.  Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

   This document proposes a Per Hop Behavior (PHB) in the context of diffserv.  We first propose a labeling procedure for the ingress router that labels packets in a manner that implicitly conveys the packet arrival rate information of each flow to the core routers.  Coupled with a simple forwarding procedure, this provides a fair allocation of bandwidth.  We then propose other labeling procedures that add little computational complexity but provide weighted fair allocation, minimum bandwidth assurances, and support for intra-flow priorities.  By appropriately designing the labeling algorithms, this PHB can be used to optimize the performance of a variety of flows.

1. Introduction

   The Internet is increasingly used for carrying multimedia streams that are sensitive to the end-to-end rate, delay, and drop assurances they receive from the network.  Another characteristic of flows sharing the Internet is that most of them are composed of packets that contribute varying amounts of utility to the flow they belong to.  That being the case, the utility provided by a quantum of resource allocated to a flow depends on the value of the packets that use it.  In short, merely allocating a certain quantity of bandwidth to a flow does not always imply that the flow will be able to make optimal use of it at all times.  This intra-flow heterogeneity in packet utility could, for instance, be caused by the stream employing a hierarchical coding mechanism, as in MPEG, or for reasons intrinsic to the rate adaptation algorithm implemented by the transport layer.  In such cases, dropping the "wrong" packet(s) can significantly affect the qualitative and quantitative extent to which a flow is able to make use of the resources allocated to it.  In general, if we consider optimizing the perceived quality of service as the end goal of an architecture for providing end-to-end QoS, then it is important that we ensure fair allocation of resources and enable simple ways for a flow to make the best use of its fair share.

   To allocate resources fairly among contending flows, core routers have traditionally used either per-flow queueing mechanisms [3] or per-flow dropping mechanisms [6].  These mechanisms, however, require core routers to maintain state for each flow sharing a link, which limits their scalability.
   In addition, the dropping, queueing and forwarding decisions are more complex than with traditional FIFO queueing, which introduces delays in the forwarding path.  More recently, CSFQ [8], Corelite [10] and CJVC [7] have proposed scalable core-stateless architectures for providing per-flow QoS commitments.  Stoica et al. [9] introduced the concept of dynamic packet state (DPS), in which each packet carries in its header, in addition to the PHB codepoint, some PHB-specific state.  The state is initialized by the ingress node.  Core nodes process each incoming packet based on the state carried in the packet's header, and optionally update their internal state and/or the state in the packet's header before forwarding the packet to the next hop.  By using DPS to coordinate the actions of edge (ingress) and core nodes along the path traversed by a flow, distributed algorithms can be designed to approximate the behavior of a broad class of "stateful" networks using networks in which core nodes do not maintain per-flow state.

   This document proposes a PHB based on the concept of dynamic packet state.  Using different labeling procedures at the ingress nodes, this PHB can be tailored to provide services such as fair allocation of bandwidth to flows sharing a network, and support for intra-flow priorities, which ensures that lower-priority packets of a flow are dropped in preference to higher-priority packets.  By appropriately designing the labeling procedure at the ingress node, such a PHB can be tailored to optimize the performance of a variety of flows.  We refer to our approach as Stateless Prioritized Fair Queueing (SPFQ).  In Section 2, we explain the principle behind SPFQ using a simple labeling algorithm for the ingress node.  In Section 3, we present different labeling algorithms that can be used to optimize the perceived quality of service for different classes of flows.

2. Stateless Prioritized Fair Queueing (SPFQ)

   Unlike common PHB codepoints [1,4,5], a PHB codepoint based on DPS has extra state associated with it.  This state is initialized by ingress nodes and carried by packets inside the DS domain.  The semantics of the state are PHB-dependent, and the values taken by the state can be flow, path, or packet dependent.  The state carried by packets can be used by core nodes for a variety of purposes, such as packet scheduling, updating the local node's state, or extending the codepoint space.  We first explain our approach using an idealized fluid model and then present its packetized version.

2.1 Fluid Model Algorithm: Bit Labeling

   We first restate the key observation of CSFQ [8], which we will also use: in a router implementing fair queueing, all flows that are bottlenecked at that router have the same output rate.  We call this the rate threshold, r_t(t).  Let C be the capacity of an output link at a router, and r_i(t) the arrival rate of flow i in bits per second.  Let A(t) denote the total arrival rate of the n flows:

      A(t) = r_1(t) + r_2(t) + ... + r_n(t)

   If A(t) <= C, then no bits are dropped.  But if A(t) > C, then r_t(t) is the unique solution of

      C = min(r_1(t), r_t(t)) + min(r_2(t), r_t(t)) + ... + min(r_n(t), r_t(t))

   In general, if max-min bandwidth allocations are achieved, each flow i receives service at a rate given by min(r_i(t), r_t(t)).  For every flow, in any given second, we consider up to r_t(t) bits of the flow as conforming, and all bits in excess of that as non-conforming.
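   The threshold r_t(t) is simply the max-min fair share of the link.  As an illustration only (the function below is not part of the PHB: its name, and the use of a sorted snapshot of per-flow rates, are assumptions made here for clarity), the following C fragment computes the unique solution of the equation above.  Note that a core router in SPFQ never has the per-flow rates available; Section 2.2 points to [2] and [8] for threshold update rules that use only aggregate measurements.

      /*
       * Illustrative sketch: compute the fluid-model rate threshold r_t,
       * i.e. the unique solution of C = sum_i min(r_i, r_t) when
       * sum_i r_i > C.  Rates are in bits per second.
       */
      #include <math.h>
      #include <stdlib.h>

      static int cmp_rate(const void *a, const void *b)
      {
          double x = *(const double *)a, y = *(const double *)b;
          return (x > y) - (x < y);
      }

      /* r[] holds the arrival rates of the n flows and C the link
       * capacity; r[] is sorted in place. */
      double rate_threshold(double *r, int n, double C)
      {
          double total = 0.0;
          for (int i = 0; i < n; i++)
              total += r[i];
          if (total <= C)
              return HUGE_VAL;        /* link not congested: nothing dropped */

          qsort(r, n, sizeof(double), cmp_rate);
          double remaining = C;
          for (int i = 0; i < n; i++) {
              double share = remaining / (n - i);
              if (r[i] <= share)
                  remaining -= r[i];  /* flow i is not bottlenecked here */
              else
                  return share;       /* remaining flows are capped at r_t */
          }
          return remaining;           /* not reached when total > C */
      }

   For example, with flow rates of 2, 4 and 10 Mb/s sharing a 9 Mb/s link, the function returns r_t = 3.5 Mb/s, and indeed min(2, 3.5) + min(4, 3.5) + min(10, 3.5) = 9.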
   If we mark every bit of a flow as conforming or non-conforming, we can obtain the allocation provided by a fair queueing router by simply having core routers accept all conforming bits of a flow and drop all non-conforming bits.  What we need now is a labeling algorithm at the ingress node that enables the core router to distinguish conforming from non-conforming traffic.  Consider the simple sequential labeling algorithm given below:

      label(bit)
         served += 1
         bit->label = served

   where the value of 'served' is reset to 0 after every second.

   Let us suppose that each flow sends bits at a constant rate.  Then, during any given second, the bits of a flow sending at rate r_i bits per second are labeled sequentially from 1 to r_i, and the counter is reset to 0 at the end of the second.  For any given flow, accepting bits with labels 1 through r_a is therefore equivalent to providing the flow with a rate of r_a bits per second.  So, if all bits carry such a label, the core router can simply identify non-conforming bits as those with label > r_t and drop them.  Consequently, no flow can receive service in excess of r_t.  Furthermore, since all bits with label <= r_t are accepted, flows sending at a rate less than or equal to r_t will not have any of their bits dropped.  As we will see in the next section, a key advantage of such a labeling procedure is that it allows us to convey both rate information and intra-flow utility to the core router using the same field in the packet header.

2.2 Packet Labeling

   To extend the sequential labeling algorithm of the fluid model to a packetized model, we essentially need to take variable packet sizes into account.  Hence, instead of incrementing the counter 'served' by 1 (the size of every packet in a network of single-bit packets), we increment 'served' by the size of the packet.  Given below is pseudo-code for the packetized version of the sequential marking algorithm:

      label(pkt)
         served += pkt->size
         pkt->label = served

   where the value of 'served' is reset to 0 at the end of each fixed-size epoch.  All ingress routers in a DS domain MUST use the same epoch size.  The size of the epoch is a design parameter that reflects a trade-off between mimicking the fluid model accurately and not giving an unfair advantage to flows that arrived most recently in the system.  To understand this trade-off, suppose that the epoch is chosen to be very long and that a new flow arrives in the middle of an epoch.  Then the bits from the new flow would be labeled starting from a value of one and would have a higher priority throughout the rest of the epoch than the bits of flows that have been sending from the beginning of the epoch.  Algorithms for updating the rate threshold (r_t) based on link state, i.e., on parameters such as the queue length and the aggregate accepted rate of packets, have been proposed in [2] and [8].

2.2.1 Forwarding Decision at a Core Router

   The forwarding decision at a core router is made based on the following algorithm:

      enque(pkt)
         if (pkt->label <= r_t)
            Accept(pkt)
         else
            Drop(pkt)

   Note that the core routers do not have to relabel the packets.  This can have significant performance benefits over approaches that require relabeling in the core.
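   To make the division of labor concrete, the following sketch shows the per-packet work at the ingress and at a core router.  It is illustrative only: the structure layouts, the function names, the per-flow lookup assumed at the ingress, and the representation of the label as a 32-bit byte count per epoch are assumptions made here, not part of the PHB definition.

      /* Illustrative sketch of SPFQ ingress labeling and core forwarding. */
      #include <stdint.h>

      struct packet {
          uint32_t size;     /* packet size in bytes */
          uint32_t label;    /* carried in the DPS field of the header */
      };

      struct flow_state {    /* kept only at the ingress router */
          uint32_t served;   /* bytes labeled in the current epoch */
      };

      /* Ingress: label every arriving packet of the flow. */
      void label_packet(struct flow_state *f, struct packet *p)
      {
          f->served += p->size;
          p->label = f->served;
      }

      /* Ingress: called for every flow when the epoch timer expires. */
      void reset_epoch(struct flow_state *f)
      {
          f->served = 0;
      }

      /* Core: forwarding decision.  r_t is the router's current rate
       * threshold, in the same units as the label (bytes per epoch);
       * how r_t is updated is described in [2] and [8].  Returns
       * non-zero to accept the packet and zero to drop it. */
      int accept(const struct packet *p, uint32_t r_t)
      {
          return p->label <= r_t;
      }

   The only per-flow state, the counter 'served', lives at the ingress; a core router needs nothing beyond its current value of r_t.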
3. Services based on SPFQ

   In this section, we present examples of labeling methods that can provide different services with a marginal increase in computational complexity.

3.1 Weighted SPFQ

   The SPFQ algorithm can be readily extended to support flows with different weights.  Let w_i be the weight of flow i.  An allocation is weighted fair if all bottlenecked flows have the same value of r_i/w_i.  The only major change required to achieve weighted fair allocation is in the ingress labeling algorithm, where we use 'served/w_i' as the label instead of 'served'.  This enables per-flow service differentiation without maintaining per-flow state in the core nodes of the network.

3.2 Minimum bandwidth allocation

   From the forwarding algorithm given in Section 2, it is clear that packets marked with smaller labels are dropped only after all packets with larger labels have been dropped.  This suggests that packets marked with the smallest label (0) will not be dropped as long as the aggregate rate of such packets does not exceed the link capacity.  So, for a flow requiring a minimum bandwidth allocation of 'b_min', labeling packets with the smallest label at a rate of 'b_min' guarantees that the flow receives its reserved rate within any reasonable time window (assuming that there is no packet loss due to channel errors).  An admission control mechanism MUST be used to ensure that the aggregate reserved rate does not exceed the capacity of the link.  A distributed admission control mechanism, such as the one proposed in [7], can be used for this purpose.

3.3 Avoiding back-to-back packet drops

   Many real-time applications and certain versions of TCP perform poorly in the presence of back-to-back or bursty packet drops.  The labeling procedure at the ingress router can be exploited to naturally reduce the possibility of bursty drops.

   For ease of presentation, let us first consider a situation where the ingress router stores all packets that arrive during an epoch, then stamps the packets and sends them out, repeating this procedure at the end of each epoch.  Suppose that a flow, say Flow i, sends 10 packets in an epoch and that all packets have the same length.  In the algorithm as described so far, the first packet that arrived in the epoch is stamped 1, the second is stamped 2 (assuming packet lengths are normalized to 1), and so on.  If the threshold r_t at a core router is 5, the first five packets would be accepted and the next five dropped.  To avoid such a bursty loss pattern, we could instead label the first arriving packet 1, the second 6, the third 2, the fourth 7, and so on.  We call this procedure interleaving; a sketch of it is given below.  Using interleaving, with the same forwarding procedure at the core router, alternate packets are dropped instead of consecutive packets.  Of course, a different choice of interleaving the labels could provide some countermeasure against three consecutive losses, four consecutive losses, and so on.  Thus, in principle, the labeling procedure can be used to convey both rate information and intra-flow priorities to the core.
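   The following sketch is illustrative only: the function name, the use of a 0-based per-epoch arrival index, and the unit-length packets of the example above are assumptions made here, and the packet count N for the epoch must be known (or, as discussed next, estimated) in advance.

      /*
       * Two-way interleaved labeling for a flow whose packets are
       * normalized to unit length.  k is the 0-based arrival index of
       * the packet within the current epoch and N the number of the
       * flow's packets expected in the epoch.  For N = 10 the labels
       * produced are 1, 6, 2, 7, 3, 8, 4, 9, 5, 10.
       */
      #include <stdint.h>

      uint32_t interleaved_label(uint32_t k, uint32_t N)
      {
          uint32_t half = (N + 1) / 2;   /* ceil(N/2) */

          if (k >= N)                    /* more arrivals than expected:   */
              return k + 1;              /* fall back to sequential labels */
          if (k % 2 == 0)
              return k / 2 + 1;          /* even arrivals take 1..ceil(N/2) */
          return half + k / 2 + 1;       /* odd arrivals take the upper half */
      }

   With r_t = 5 in the example above, the drops now fall on alternate packets rather than on the last five packets of the epoch.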
   The store-and-stamp procedure described above is difficult to implement, since it requires the edge router to store all packets from all ingress flows during an epoch.  We can obtain an implementable version by maintaining an exponentially-averaged estimate of the rate of each flow at the edge router.  This rate estimate can then be used to estimate the number of packets expected in the next epoch.  Suppose that the estimate for the next epoch is 10 packets, but the actual number of arrivals is larger; then the first 10 packets are interleaved and the remaining packets are not.  If the number of arrivals is less than the estimate, then some packets are assigned larger labels (lower priority) than they should have received.  In a typical epoch, however, the labeling and interleaving procedures significantly reduce the possibility of back-to-back drops.

3.4 Intra-flow priorities for streams encoded in layers

   Consider a flow, Flow i, which is sending packets into the network at rate r_i.  Suppose that the packets of Flow i belong to three priority levels, and let r_i1 be the rate at which Flow i generates packets of level 1.  Rates r_i2 and r_i3 are defined in a similar manner, and r_i1 + r_i2 + r_i3 = r_i.  Let us suppose that intra-flow priority level 1 has higher priority than level 2, which in turn has higher priority than level 3.  As examples, the three levels can be thought of as I, P and B frames in an MPEG video stream, or as layers in a layered multicast session.  Further, suppose that r_i > r_t, where we recall that r_t is the threshold at a core router.  Then a fraction (r_i - r_t)/r_i of Flow i's packets is dropped.  In addition to providing fairness across flows, we would like the network to provide differentiated dropping for the three priority levels of Flow i, so that the fractions of packets dropped at each level are:

      Level 1:  max(0, r_i1 - r_t) / r_i1
      Level 2:  max(0, r_i1 + r_i2 - max(r_t, r_i1)) / r_i2
      Level 3:  max(0, r_i1 + r_i2 + r_i3 - max(r_t, r_i1 + r_i2)) / r_i3

   In other words, if dropping is necessary, Level 3 packets are dropped first, followed by Level 2 packets, and finally Level 1 packets.

   The priority dropping described above can be implemented in SPFQ as follows: the edge router maintains an estimate of r_i1, r_i2 and r_i3.  Thus, if priority dropping is desired with no increase in computational complexity at the core routers, the end hosts must communicate their packet priorities to the edge, for example through a field in the packet header.  This is unavoidable: if priority dropping is required, the priorities of the packets have to be communicated by the end hosts to the network in some manner.  The labeling procedure in an epoch is then as follows (a sketch is given below): without loss of generality, suppose that an epoch is of length 1 second.  We label the Level 1 bits of Flow i from 1 through r_i1, the Level 2 bits from r_i1 + 1 through r_i1 + r_i2, and the Level 3 bits from r_i1 + r_i2 + 1 through r_i1 + r_i2 + r_i3.
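   The following packetized sketch is illustrative only: the structure and function names, the per-level byte counters, and the use of the ingress router's per-level rate estimates (in bytes per epoch) as band offsets are assumptions made here.  Level 0 denotes the highest priority.

      /*
       * Per-level labeling for a flow with three intra-flow priority
       * levels.  est[j] is the ingress router's estimate, in bytes per
       * epoch, of the rate of level j (maintained, for example, by
       * exponential averaging); served[j] counts the bytes of level j
       * labeled so far in the current epoch and is reset to 0 at the
       * end of each epoch, as in Section 2.2.
       */
      #include <stdint.h>

      #define LEVELS 3

      struct layered_flow {
          uint32_t est[LEVELS];      /* estimated bytes per epoch, per level */
          uint32_t served[LEVELS];   /* bytes labeled this epoch, per level  */
      };

      uint32_t label_layered(struct layered_flow *f, uint32_t level,
                             uint32_t pkt_size)
      {
          uint32_t base = 0;

          /* The band for a level starts after the estimated traffic of
           * all higher-priority levels, so higher layers get smaller
           * labels. */
          for (uint32_t j = 0; j < level; j++)
              base += f->est[j];

          f->served[level] += pkt_size;
          return base + f->served[level];
      }

   With the forwarding rule of Section 2.2.1 unchanged, a core router whose threshold falls inside the Level 2 band then drops all Level 3 packets, part of the Level 2 packets, and no Level 1 packets, which is exactly the drop pattern listed above.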
4. Issues

4.1 Recommended Codepoints

   In keeping with the recommendation in [9], it is appropriate to use values drawn from the 16 codepoints reserved for local and experimental use (xxxx11).

4.2 Encoding Flow State

   In PHBs based on DPS, the ingress router inserts flow-specific state when the packet enters the DS domain.  In SPFQ, when a packet from a flow enters the DS domain, the ingress router MUST insert a label that implicitly conveys both the rate of the flow and the relative priority of the packet within the flow.  Additionally, a flow MAY convey the priority of a packet to the ingress router by encoding the information in the packet header.  There are at least three ways to encode state in the packet header: (1) introduce a new IP option and insert the option at the ingress router, (2) introduce a new header between layer 2 and layer 3, and (3) use the existing IP header.  [9] describes the relative merits of each approach.  The key requirement to be met is that all changes performed on packets' headers at ingress nodes or within the DS domain (possibly for the purpose of storing the state) must be undone by egress nodes.

4.3 IP Fragmentation

   Consider a packet of size S bytes carrying a label L.  Suppose the packet is fragmented into n+1 fragments: n fragments of size s bytes followed by a final fragment of size d bytes, so that S = n*s + d.  The labels must then be updated as follows: the i-th of the first n fragments carries the label (L - d) - (n - i)*s, and the last fragment carries the label L.  In this way each fragment carries the cumulative byte count at which it ends, just as if the fragments had been labeled individually at the ingress.
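   As a check on the arithmetic, the following sketch (illustrative only; the function name and the flat output array are assumptions made here) computes the labels of the n+1 fragments:

      /*
       * Compute the labels of the n+1 fragments of a packet of size
       * S = n*s + d bytes that carried label L (Section 4.3).
       * labels[] must have room for n+1 entries.
       */
      #include <stdint.h>

      void fragment_labels(uint32_t L, uint32_t n, uint32_t s, uint32_t d,
                           uint32_t *labels)
      {
          for (uint32_t i = 1; i <= n; i++)
              labels[i - 1] = (L - d) - (n - i) * s;  /* i-th fragment, size s */
          labels[n] = L;                              /* last fragment, size d */
      }

   For example, a 3000-byte packet with label 9000 that is split into two 1400-byte fragments and a 200-byte fragment (n = 2, s = 1400, d = 200) receives the labels 7400, 8800 and 9000: each fragment's label exceeds the previous cumulative count by exactly that fragment's own size.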
5. Conclusion

   In this document, we have presented a scalable architecture for jointly providing fair bandwidth allocation to flows and prioritized dropping for flows with intra-flow priorities.  Key attributes of the presented approach are that it enables the network to make per-flow rate commitments without maintaining per-flow state in the core routers, and to provide support for intra-flow priorities.  The labeling algorithms at the ingress nodes can be tailored to optimize the performance of a variety of flows.  We presented labeling procedures that can be used to provide weighted service differentiation and minimum rate contracts, to reduce the likelihood of back-to-back drops for flows sensitive to bursty losses, and to optimize the perceived quality of layered streams in a network providing fair sharing of the bandwidth.  We believe that PHBs such as SPFQ will be useful in enabling fair resource allocation and in optimizing the perceived quality of service of a variety of flows in a DS domain.

References

   [1]  S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and
        W. Weiss, "An Architecture for Differentiated Services,"
        Internet Draft, draft-ietf-diffserv-arch-02.txt, October 1998.

   [2]  Z. Cao, Z. Wang and E. Zegura, "Rainbow Fair Queueing: Fair
        Bandwidth Sharing Without Per-Flow State," Proc. of INFOCOM
        2000.

   [3]  A. Demers, S. Keshav, and S. Shenker, "Analysis and Simulation
        of a Fair Queueing Algorithm," Journal of Internetworking
        Research and Experience, October 1990.

   [4]  J. Heinanen, F. Baker, W. Weiss, and J. Wroclawski, "Assured
        Forwarding PHB Group," Internet Draft,
        draft-ietf-diffserv-af-04.txt, January 1999.

   [5]  V. Jacobson, K. Poduri and K. Nichols, "An Expedited Forwarding
        PHB," Internet Draft, draft-ietf-diffserv-phb-ef-01.txt,
        November 1999.

   [6]  D. Lin and R. Morris, "Dynamics of Random Early Detection,"
        Proc. of SIGCOMM '97, October 1997.

   [7]  I. Stoica and H. Zhang, "Providing Guaranteed Services Without
        Per Flow Management," Proc. of SIGCOMM '99, September 1999.

   [8]  I. Stoica, S. Shenker and H. Zhang, "Core-Stateless Fair
        Queueing: Achieving Approximately Fair Bandwidth Allocations in
        High Speed Networks," Proc. of SIGCOMM '98, September 1998.

   [9]  I. Stoica et al., "Per Hop Behaviors Based on Dynamic Packet
        States," Internet Draft, February 1999.

   [10] S. Raghupathy, T. Kim, N. Venkitaraman and V. Bharghavan,
        "Achieving Per-flow Weighted Rate Fairness in a Core-Stateless
        Network," Proc. of ICDCS '00, April 2000.

   [11] K. Nichols, S. Blake, F. Baker, and D. L. Black, "Definition of
        the Differentiated Services Field (DS Field) in the IPv4 and
        IPv6 Headers," Internet Draft, draft-ietf-diffserv-header-04.txt,
        October 1998.

Authors' Addresses

   Narayanan Venkitaraman
   Motorola Labs, IL02/2240
   1301 E. Algonquin Rd.
   Schaumburg, IL 60196
   venkitar@rsch.comm.mot.com

   Jayanth Mysore
   Motorola Labs, IL02/2240
   1301 E. Algonquin Rd.
   Schaumburg, IL 60196
   jayanth@rsch.comm.mot.com

   R. Srikant
   Coordinated Science Lab
   Univ. of Illinois
   1308 W Main St.
   Urbana, IL 61801
   rsrikant@uiuc.edu

   Roberto Barnes
   Coordinated Science Lab
   Univ. of Illinois
   1308 W Main St.
   Urbana, IL 61801
   rbarnes@comm.csl.uiuc.edu