Internet Draft Anna Charny, ed. Cisco Systems Fred Baker Cisco Systems Jon Bennett Riverdelta Networks Kent Benson Tellabs Jean-Yves Le Boudec EPFL Angela Chiu AT&T Labs William Courtney TRW Bruce Davie Cisco Systems Shahram Davari PMC-Sierra Victor Firou Nortel Networks Charles Kalmanek AT&T Research K.K. Ramakrishnan AT&T Research Dimitrios Stiliadis Lucent Technologies Expires January, 2001 draft-charny-ef-definition-00.txt July 2000 EF PHB Redefined Status of this Memo This document is an Internet Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, Charny, ed. et al. 1 draft-charny-ef-definition.txt July, 2000 and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts. Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.ietf.org (US East Coast), nic.nordu.net (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim). This document is a product of the Diffserv working group of the Internet Engineering Task Force. Please address comments to the group's mailing list at diffserv@ietf.org, with a copy to the authors. Copyright (C) The Internet Society (1999). All Rights Reserved. Abstract This document proposes text aiming at providing clarification to RFC 2598. The primary motivation for this draft is that the definition of EF PHB given in RFC 2598 does not match the intuition of the EF PHB, and has been determined to be unimplementable in a manner compliant with the definition. In particular, no work-conserving scheduler, including all of the schedulers listed in the Implementation Examples section of RFC 2598, can comply with the definition given in that RFC. This draft gives a rigorous definition of EF PHB which in our opinion preserves the spirit of the EF PHB as intended by RFC 2598 while allowing a number of reasonable compliant implementations. The document also provides implementation guidance and lists several possible implementation examples which conform to the definition given in this draft. 1 Introduction Charny, ed. et al. 2 draft-charny-ef-definition.txt July, 2000 The Expedited Forwarding (EF) Per-Hop Behavior (PHB) was designed to be used to build a low-loss, low-latency, low-jitter, assured bandwidth service. The potential benefits of this service, and therefore the EF PHB, are enormous. Because of the great value of this PHB, it is critical that the forwarding behavior required of and delivered by an EF-compliant node be specific, quantifiable, and unambiguous. Many members of the data networking community have a strong intuitive feel for the packet forwarding behavior required by EF. This intuitive forwarding behavior is very useful and should form the basis of the formal EF conformance definition. Unfortunately however, the actual definition contained in section 2 of RFC 2598 does not fully reflect this intuitive understanding. Many of the forwarding behaviors which are intuitively reasonable do not actually comply with the formal definition of RFC 2598. Furthermore, many of the schedulers believed to deliver EF-compliant behavior cannot be used to implement the formal definition of EF since they result in forwarding treatment which does not comply with the definition of RFC 2598. In fact, schedulers listed in sections 2.2 and A.3 of RFC 2598 as implementation examples realizing EF do not actually deliver the forwarding behavior defined earlier in the same document. Moreover, it can be shown that no work-conserving scheduler can satisfy the definition of RFC 2598. There are several potentially serious problems with having a formal EF definition that does not match people's intuitive understanding. First, the understanding of what it means for a node to be EF- compliant may vary among people. This discrepancy may arise due to the fact that two people's intuitive understanding of the definition may actually differ somewhat; also, someone learning about EF from the formal definition may develop an understanding of EF at odds with the understanding that most people currently familiar with EF have. These discrepancies in people's understanding of EF may have serious consequences. The resulting confusion may increase the time and cost needed to develop equipment, cause interoperability problems, and create mismatches between expected node and network performance and actual performance. Second, the lack of a clear conformance definition makes it impossible to test a piece of equipment and declare it "conforming" or "non-conforming." Third, the lack of a mathematically precise description of a node's behavior makes it impossible to analytically design or evaluate services constructed using the EF PHB or other PHBs that must contend for resources with EF traffic. Fourth, an incorrect formal definition of EF may lead to erroneous reasoning about the properties of networks implementing EF. The goal of this document is to bring the formal compliance definition in line with what we believe is the intuitive understanding of EF PHB. It is NOT the goal of this document to redefine the intuitive behavior expected of EF PHB, nor it is the Charny, ed. et al. 3 draft-charny-ef-definition.txt July, 2000 goal to change the expectations of the properties of schedulers widely believed to deliver adequate EF PHB. Rather, our sole goal is to introduce minimal changes to the existing definition so that forwarding behaviors that unambiguously satisfy the well-established intuition become formally compliant with the definition, and schedulers widely believed suitable for implementing EF PHB actually formally comply with the modified definition. Hence, our goal is not to change the concept of EF PHB, but rather to remove ambiguities and add mathematical precision to the definition of the familiar EF PHB. With this goal in mind, this document proposes a new definition of EF PHB. This definition provides several benefits. First, it closely matches what we believe is the intuitive definition of EF. Second, the proposed definition permits EF to be implemented using many well-known schedulers. Third, the proposed definition allows the behavior of EF-compliant implementations to be unambiguously expressed in mathematical terms. This means that a quantitative guarantee can be given about a node's behavior, nodes can be tested for compliance, and local, per-domain, and end-to-end behaviors of a variety of compliant practical implementations can be studied analytically. In the remainder of section 1, we consider the definition and intuitive content of EF PHB as given in RFC 2598. We identify problems with that definition and show that without some change to the current definition it is impossible to implement EF PHB with any work-conserving scheduler. We then discuss possible simple changes that can be made to the definition which correct some, but not all, of the identified problems. We then argue that a different solution is necessary to correct the remaining problems. In section 2, we provide a new definition for EF PHB called the "packet scale rate guarantee." We then discuss this definition, describe some of its properties, and compare it to the better-known rate-latency curve. In sections 3 and 4, we demonstrate that our definition is satisfiable by several practical schedulers such as Priority Queuing (PQ) and Worst-Case Fair Weighted Fair Queuing (WF2Q). We also discuss other implementations and give values for the key parameter of the new definition for a large variety of schedulers. Proofs of critical statements about the properties and satisfiability of the new definition are given in the Appendices. 1.1 The RFC 2598 Definition of EF PHB and Its Intuitive Meaning The definition of the EF PHB as given in [RFC 2598] states: Charny, ed. et al. 4 draft-charny-ef-definition.txt July, 2000 It [the EF PHB departure rate] SHOULD average at least the configured rate when measured over any time interval equal to or longer than the time it takes to send an output link MTU sized packet at the configured rate. The intuitive content of this definition is fairly clear. On all time scales ranging down to very small time scales, the EF aggregate should be given at least its configured share of the output link bandwidth. Among other things, this allows EF to support applications that are delay- and jitter-sensitive. However, intuition alone will not allow vendors to design compliant schedulers capable of advertising their EF configuration to other routers. As we show in the next section, the simplicity of the definition is misleading in the sense that it does not actually capture the intuition correctly under a number of circumstances. A note is due here on the precise interpretation of the wording of the definition. A potential cause of ambiguity is the fact that the definition contains the word SHOULD which according to [Bra97] means that in principle an implementation of EF PHB may under some circumstances choose not to be strictly compliant with the specified requirement, in which case any issues with the strict definition may be viewed as irrelevant. However, it seems that in order for the SHOULD to be meaningful, there should exist at least some implementations which are strictly compliant, even if non-compliant implementations may be chosen under some circumstances. Furthermore, the Virtual Wire behavior aggregate [JNP2000] is defined by replacing SHOULD by MUST in the definition of EF PHB in RFC 2598. Therefore, in all cases the exact mathematical properties of the EF definition and the existence of strictly compliant implementations are of substantial interest. The remainder of this section concentrates on the discussion of these issues in detail. 1.2 Difficulties with the RFC 2598 EF PHB Definition A literal interpretation of the definition would consider the behaviors given in the next two subsections as non-compliant. The definition also unnecessarily constrains the maximum configurable rate of an EF aggregate. 1.2.1 Perfectly-Clocked Forwarding Consider the following stream forwarded from a router with EF- configured rate R=C/2, where C is the output line rate. In the illustration, E is an MTU-sized EF packet while x is a non-EF packet or unused capacity, also of size MTU. ... E x E x E x E x E x E x... |-----| Charny, ed. et al. 5 draft-charny-ef-definition.txt July, 2000 The interval between the vertical bars is 3*MTU/C, which is greater than MTU/(C/2), and so is subject to the EF PHB definition. During this interval, 3*MTU/2 bits of the EF aggregate should be forwarded, but only MTU bits are forwarded. Therefore, while this forwarding pattern should be considered compliant under any reasonable interpretation of the EF PHB, it actually does not formally comply with the definition of RFC 2598. Note that this forwarding pattern can occur in any work-conserving scheduler in an ideal output-buffered architecture where EF packets arrive in a perfectly clocked manner according to the above pattern and are forwarded according to exactly the same pattern in the absence of any non-EF traffic. Trivial as this example may be, it reveals the lack of mathematical precision in the formal definition. The fact that no work-conserving scheduler can formally comply with the definition is unfortunate, and appears to warrant some changes to the definition that would correct this problem. The underlying reason for the problem described here is quite simple - one can only expect that the EF aggregate is served at configured rate in some interval where there is enough backlog of EF packets to sustain that rate. In the example above the packets come in exactly at the rate at which they are served, and so there is no persistent backlog. Certainly, if the input rate is even smaller than the configured rate of the EF aggregate, there will be no backlog as well, and a similar formal difficulty will occur. A seemingly simple solution to this difficulty might be to require that the EF aggregate is served at its configured rate only when the queue is backlogged. However, as we show in the remainder of this section, this solution does not suffice. 1.2.2 Router Internal Delay We now argue that the example considered in the previous section is not as trivial as it may seem at first glance. Consider a router with EF configured rate R = C/2 as in the previous example, but with an internal delay of 3T (where T = MTU/C) between the time that a packet arrives at the router and the time that it is first eligible for forwarding at the output link. Such things as header processing, route look-up, and delay in switching through a multi-layer fabric could cause this delay. Now suppose that EF traffic arrives regularly at a rate of (2/3)R = C/3. The router will perform as shown below. EF Packet Number 1 2 3 4 5 6 ... Arrival (at router) 0 3T 6T 9T 12T 15T ... Arrival (at scheduler) 3T 6T 9T 12T 15T 18T ... Departure 4T 7T 10T 13T 16T 19T ... Charny, ed. et al. 6 draft-charny-ef-definition.txt July, 2000 Again, the output does not satisfy the RFC 2598 definition of EF PHB. As in the previous example, the underlying reason for this problem is that the scheduler cannot forward EF traffic faster than it arrives. However, it can be easily seen that the existence of internal delay causes one packet to be inside the router at all times. An external observer will rightfully conclude that the number of EF packets that arrived to the router is always at least one greater than the number of EF packets that left the router, and therefore the EF aggregate is constantly backlogged. However, while the EF aggregate is continuously backlogged, the observed output rate is nevertheless strictly less that the configured rate. This example indicates that the simple addition of the condition that EF aggregate must receive its configured rate only when the EF aggregate is backlogged does not suffice in this case. Yet, the problem described here is of fundamental importance in practice. Most routers have a certain amount of internal delay. A vendor declaring EF compliance is not expected to simultaneously declare the details of the internals of the router. Therefore, the existence of internal delay may cause a perfectly reasonable EF implementation to display seemingly non-conformant behavior, which is clearly undesirable. 1.2.3 Maximum Configurable Rate and Provisioning Efficiency It is well understood that with any non-preemptive scheduler, the compliant configurable rate for an EF aggregate cannot exceed C/2 [JNP2000]. This is because an MTU-sized EF packet may arrive to an empty queue at time t just as an MTU-sized non-EF packet begins service. The maximum number of EF bits that could be forwarded during the interval [t, t + 2*MTU/C] is MTU. But if configured rate R > C/2, then this interval would be of length greater than MTU/R, and more than MTU EF bits would have to be served during this interval for the router to be compliant. Thus, R must be no greater than C/2. It can be shown that for schedulers other than PQ, such as various implementations of WFQ, the maximum compliant configured rate may be much smaller than 50%. For example, for SCFQ [Gol94] the maximum configured rate cannot exceed C/N, where N is the number of queues in the scheduler. For WRR, mentioned as compliant in section 2.2 of RFC 2598, this limitation is even more severe. This is because in these schedulers a packet arriving to an empty EF queue may be forced to wait until one packet from each other queue (in the case of SCFQ) or until several packets from each other queue (in the case of WRR) are served before it will finally be forwarded. While it is frequently assumed that the configured rate of EF traffic will be substantially smaller than the link bandwidth, the requirement that this rate should never exceed 50% of the link Charny, ed. et al. 7 draft-charny-ef-definition.txt July, 2000 bandwidth appears unnecessarily limiting. For example, in a fully connected mesh network, where any flow traverses a single link on its way from source to its destination there seems no compelling reason to limit the amount of EF traffic to 50% (or an even smaller percentage for some schedulers) of the link bandwidth. Another, perhaps even more striking example is the fact that even a TDM circuit with dedicated slots cannot be configured to forward EF packets at more than 50% of the link speed without violating RFC 2598 (unless the entire link is configured for EF). If the configured rate of EF traffic is greater than 50% (but less than the link speed), there will always exist an interval longer than MTU/R in which less than the configured rate is achieved. For example, suppose the configured rate of the EF aggregate is 2C/3. Then the forwarding pattern of the TDM circuit might be E E x E E x E E x ... |---| where only one packet is served in the marked interval of length 2T = 2MTU/C. But at least 4/3 MTU would have to be served during this interval by a router in compliance with the definition in RFC 2598. The fact that even a TDM line cannot be booked over 50% by EF traffic indicates that the restriction is artificial and unnecessary. 1.3 The Non-trivial Nature of the Difficulties One possibility to correct the problems discussed in the previous sections might be to attempt to clarify the definition of the intervals to which the definition applied or by averaging over multiple intervals. However, an attempt to do so meets with considerable analytical and implementation difficulties. For example, attempting to align interval start times with some epochs of the forwarded stream appears to require a certain degree of global clock synchronization and is fraught with the risk of misinterpretation and mistake in practice. Another approach might be to allow averaging of the rates over some larger time scale. However, it is unclear exactly what finite time scale would suffice in all reasonable cases. Furthermore, this approach would compromise the notion of very short-term time scale guarantees that are the essence of EF PHB. We also explored a combination of two simple fixes. The first is the addition of the condition that the only intervals subject to the definition are those that fall inside a period during which the EF aggregate is continuously backlogged in the router (i.e., when an EF packet is in the router). The second is the addition of an error (latency) term that could serve as a figure-of-merit in the advertising of EF services. Charny, ed. et al. 8 draft-charny-ef-definition.txt July, 2000 With the addition of these two changes the candidate definition becomes as follows: In any interval of time (t1, t2) in which EF traffic is continuously backlogged, at least R(t2 - t1 - E) bits of EF traffic must be served, where R is the configured rate for the EF aggregate and E is an implementation-specific latency term. The "continuously backlogged" condition eliminates the insufficient- packets-to-forward difficulty while the addition of the latency term of size MTU/C resolves the perfectly-clocked forwarding example (section 1.2.1), and also removes the limitation on EF configured rate. However, neither fix (nor the two of them together) resolves the example of section 1.2.2. To see this, recall that in the example of section 1.2.2 the EF aggregate is continuously backlogged, but the service rate of the EF aggregate is consistently smaller than the configured rate, and therefore no finite latency term will suffice to bring the example into conformance. This appears to be a serious problem. Therefore, we believe that such modification, albeit attractive in its simplicity, falls short of addressing all the problems identified with the definition of the RFC 2598. The next section proposes a new definition that aims at addressing all the issues discussed so far. 2 Definition of EF PHB 2.1 The Formal Definition The intent of EF PHB is to provide the configured service rate to the EF aggregate at as small a timescale as possible. In order to express this notion rigorously, we introduce the definition of "packet time scale rate guarantee". We first need to introduce some notation. Let P_inp(j) and P_out(j) denote the j-th packet of the EF aggregate arriving to and departing from a network node respectively. If several packets of an EF aggregate arrive simultaneously from different inputs, ties are broken arbitrarily. In the case when all EF traffic shares a single FIFO queue, P_inp(j) and P_out(j) refer to the same packet, but in general the j-th arrival and the j-th departure may correspond to different packets. Let a(j) denote the time of arrival of the last bit of P_inp(j) to a network node. Let d(j) denote the time of departure of the last bit of P_out(j) from the network node. Let L(j) denote the length of P_out(j). Charny, ed. et al. 9 draft-charny-ef-definition.txt July, 2000 We require that the indexing is chosen in such a way that the packet P_inp(1) arriving at time a(1) sees no other packet of the EF aggregate in the node upon arrival, and d(1) >= a(1). Definition DEF_1 ---------------- We say that a node offers to the EF aggregate a "packet time scale rate guarantee R with latency E" if the j-th departure time satisfies the following condition for all j>=0: d(j) <= F(j) + E (eq_1) where F(j) is defined iteratively by F(0)=0, d(0) = 0 F(j)=max(a(j), min(d(j-1), FF(j-1)))+ L(j)/R for all j>0 (eq_2) Note that the choice of indexes does not restrict when in the actual packet stream we start the observation of the arrival and departure process. The only restriction that is being imposed is that the observation starts when there are no EF packets in the node. Note also that while index j=1 corresponds to the first packet in the observation, index j=0 does not correspond to any packet at all and is used solely for initial conditions on d(j) and F(j) in the recursion. We now define EF PHB as a forwarding treatment for a particular diffserv aggregate where the node offers to the aggregate a packet time scale rate guarantee R with latency E, where R is a configurable rate and E is a tolerance which depends on the particular node characteristics. 2.2 The Meaning and Basic Properties of DEF_1. To understand the meaning of DEF_1 we compare it with a well-known rate-latency curve [LEB98], and argue that DEF_1 is stronger than the rate-latency curve [LEB98] in the sense that if a scheduler satisfies DEF_1 it also satisfies the rate-latency curve. As a result, all the properties known for the rate-latency curve also apply to DEF_1 above. We also argue why DEF_1 is more suitable to reflect the intent of EF PHB than the rate-latency curve. It is shown in [LEB98] that the rate-latency curve is equivalent to the following definition: Definition DEF_2: d(j) <= F'(j) + E (eq_3) Charny, ed. et al. 10 draft-charny-ef-definition.txt July, 2000 where F'(0)=0, F'(j)=max(a(j), F'(j-1))+ L(j)/R for all j>0 (eq_4) It can be easily verified that DEF_1 is stronger than DEF_2 by noticing that for all j, F'(j) >= F(j). It is easy to see that F'(j) in the definition DEF_2 corresponds to the time the j-th departure should have occurred should the EF aggregate be constantly served exactly at its configured rate R. Following the common convention, we refer to F'(j) as the "fluid finish time" of the j-th packet to depart. The intuitive meaning of the rate-latency curve of DEF_2 is that any packet is served at most time E later than this packet would finish service in the fluid model. For a rate-latency curve DEF_2 (and hence for the stronger DEF_1) it holds that in any interval (0,t) the EF aggregate gets close to the desired service rate R (as long as there is enough traffic to sustain this rate). The discrepancy between the ideal and the actual service in this interval depends on the latency term E, which in turn depends on the scheduling implementation. The smaller E, the smaller the difference between the configured rate and the actual rate achieved by the scheduler. While DEF_2 guarantees the desired rate to the EF aggregate in all intervals (0,t) to within a specified error, it may nevertheless result in large gaps in service. For example, suppose that (a large number) N of identical EF packets of length L arrived from different interfaces to the EF queue in the absence of any non-EF traffic. Then any work-conserving scheduler will serve all N packets at link speed. When the last packet is sent at time NL/C, where C is the capacity of output link, F(N) will be equal to NL/R. Suppose now that at time NL/C a large number of non-EF packets arrive, followed by a single EF packet. Then the scheduler can legitimately delay starting to send the EF packet until time F(N+1)=(N+1)L/R + E - L/C. This means that the EF aggregate will have no service at all in the interval (NL/C, (N+1)L/R + E - L/C). This interval can be quite large if R is substantially smaller than C. In essence, the EF aggregate can be "punished" by a gap in service for receiving faster service than its configured rate at the beginning. Definition DEF_1 alleviates this problem by introducing the term min(d(j-1), F(j-1)) in the recursion. Essentially, this means that the fluid finishing time is "reset" if that packet is sent too early. As a consequence of that, for the case where the EF aggregate is served in a FIFO order, suppose a packet arrives at time t to a server satisfying DEF_1. The packet will be transmitted no later Charny, ed. et al. 11 draft-charny-ef-definition.txt July, 2000 than time t + Q(t)/R + E, where Q(t) is the EF queue size at time t (including the packet under discussion). This statement is proved in Appendix A1. Note finally that just as discussed in section 1.3, the latency term E in EF_1 can still be viewed as a "figure of merit" and can be used to compare different implementations of EF PHB. 2.3 Reordering within the EF aggregate. The EF PHB SHOULD NOT cause packets in a microflow to be reordered. Neither DEF_1 nor the RFC 2598 definition directly addresses how reordering is to be avoided should packets from the same microflow arrive at a node on different inputs. Nonetheless, because DEF_1 explicitly refers to the times of arrival and departure of EF packets, the issue of re-ordering in microflows deserves a brief discussion. Section 4.5, below, discusses the effects that router internal delay can have on the order in which EF packets are forwarded. The upshot there is that differences in internal delay between one input and output and another input and the same output can cause a later- arriving packet to be delivered to the output scheduler ahead of an earlier-arriving packet. A work-conserving scheduler may have no alternative but to forward the two packets in reverse order of their arrival. This shuffling of packets causes no harm, provided the packets are not part of the same microflow. That is, this shuffling (and any other shuffling caused by the scheduler itself, if the scheduler is not a FIFO) will not reorder packets (in the EF PHB sense) unless packets from the same microflow arrive on different inputs of the node. This would imply that somewhere upstream, packets of the same EF microflow were distributed across multiple outputs. Such behavior is inherently destructive of microflow packet order. It is recommended that all EF packets with the same destination address arriving at a node be forwarded on the same output. 2.4 Per Packet Delay. It is important to note that just as the definition of EF PHB in RFC 2598, DEF_1 does not in itself guarantee per-packet delay for EF aggregate. While the definition implies that the aggregate service is within a certain error from the desired service at the configured rate, the definition says nothing about per-packet delay. In particular, for non-FIFO service order for packets within the EF aggregate it is possible in principle that a scheduler satisfying the EF definition (both new and old) delays a given packet an infinite amount of time. However, once the queuing and scheduling mechanisms are specified, it should be possible to Charny, ed. et al. 12 draft-charny-ef-definition.txt July, 2000 demonstrate per-packet delay bound for a reasonable scheduling implementation (see section 4 for implementation examples). For the FIFO service order within the EF aggregate the DEF_1 does imply per-packet per-hop, per-domain, and end-to-end delay guarantees. For example, an end-to-end delay bound for an EF aggregate sharing an aggregate FIFO queue in a network of rate- latency servers is derived in [CLeB2000]. This result implies that the same per packet delay bound holds for a stronger DEF_1. 3. Satisfiability of the definition It is important to make sure that there exist at least some schedulers that satisfy definition DEF_1 with reasonable latency terms. It can be shown that the strict priority scheduler in which all EF packets share a single FIFO queue which is served at strict non- preemptive priority over other queues satisfies DEF_1 with the latency term E = MTU/C where MTU is the maximum packet size and C is the speed of output link. The proof that PQ satisfies DEF_1 is given in appendix B. Another scheduler that satisfies DEF_1 with a small latency term is WF2Q described in [BZ96a]. A class-based WF2Q scheduler, in which all EF traffic shares a single queue with the weight corresponding to the configured rate of the EF aggregate can be shown to satisfy DEF_1 with the latency term E = MTU/C+MTU/R. The proof that WF2Q satisfies DEF_1 with E=MTU/C+MTU/R is given in Appendix C. 4. Implementation considerations 4.1 General queuing and scheduling considerations. Definition DEF_1 does not mandate a particular underlying queuing structure. While it can be implemented using aggregate queuing, where all packets of the EF aggregate share a single queue, it also allows finer queuing granularity, where EF packets may be assigned to a number of different queues. Likewise, DEF_1 allows in principle a wide range of scheduling algorithms, ranging from a strict priority scheduling of aggregate EF queue, to hierarchical scheduling with per-flow queuing as described in section 4.4 below. Both the queuing structure and the scheduling algorithm have a significant impact on the delay and jitter which can be provided to the packets of the EF aggregate. It is typically more difficult to Charny, ed. et al. 13 draft-charny-ef-definition.txt July, 2000 provide strict deterministic end-to-end delay and/or jitter guarantees if aggregate queuing is implemented [CLeB2000]. However, implementing and scheduling a large number of queues at high speeds presents a significant engineering challenge, while aggregate scheduling is very attractive due to its simplicity and scalability. 4.2 Aggregate Queuing and Scheduling Accuracy for FIFO Service of the EF Aggregate It can be shown that if all packets in the EF aggregate share a single FIFO queue served by a scheduler satisfying the rate-latency service curve, then end-to-end delay and jitter guarantees depend on the latency term E of the scheduler [CLeB2000]. The smaller the latency term, the better the delay and jitter bounds that can be provided. In that respect, a strict priority queuing implementation which has a very small latency term is a natural candidate for implementing EF PHB. Various implementations of Weighted Fair Queuing-like schedulers are also possible candidates for such implementation, but the delay and jitter characteristics of these schedulers differ substantially depending on the accuracy of the implementation. A widely used way of evaluating the accuracy of rate-based scheduling implementations is to compare the output of the scheduler with the so-called "fluid model" [Par92]. In this framework, a given scheduler S and the reference fluid scheduler are subject to the same arrival patterns. The accuracy of the scheduler S can be determined by how close the time of the i-th departure in the scheduler S is to the corresponding departure time in the fluid scheduler. More precisely, if d(i) is the time of the i-th departure under some scheduler S, and G(i) is the time of the i-th departure in the reference fluid scheduler, then the accuracy of S may be determined by two latency terms E1 and E2 such that for all i G(i)-E1 <= d(i)<= G(i) + E2 (eq_5) While the term E2 determines the maximum per-hop delay bound, E1 has an effect on the jitter at the output of the scheduler. For example, as shown in [BZ96a], for WF2Q, E1 = MTU/R, E2= MTU/C, and for PGPS [Par92] E2 = MTU/C as well, while E1 is linear in the number of queues in the scheduler. It is demonstrated in [BZ96a] that while WF2Q and PGPS have the same delay bounds, PGPS may result in substantially burstier departure patterns. In general, it can be shown that if a scheduler satisfies DEF_2, then it also satisfies definition DEF_1 with the latency term E <= E1 + E2. The proof of this statement is given in Appendix C. Note that E1+E2 is not necessarily a tight latency bound, and for a given scheduler a tighter bound may be obtained. That is, the fact that a given scheduler has a large E1+E2 does not necessarily mean that is has a large E. Charny, ed. et al. 14 draft-charny-ef-definition.txt July, 2000 4.3 Additional examples of efficient WFQ-Like Scheduling Implementations and their Latency Terms. In this section we briefly discuss some schedulers that can be used to implement EF PHB with different degrees of accuracy and with different implementation complexity. 4.3.1 Weighted Fair Queuing (WFQ/PGPS) For WFQ/PGPS ([DKS90],[Par92]), E2 = MTU/C just as for the case of WF2Q. However, it can be shown that E1 can grow linearly with the number of queues in the scheduler (which here and below is denoted by N). The worst case complexity of WFQ is also O(N). 4.3.2.Deficit Round Robin (DRR) For DRR [SV95], both E1 and E2 can be shown to grow linearly with N*(r_max/r_min)*MTU, where r_min and r_max denote the smallest and the largest rate among the rate assignments of all queues in the scheduler. The implementation complexity of DRR is O(1). 4.3.3. Start-Time Fair Queuing (SFQ) and Self-Clocked Fair Queuing (SCFQ) For SFQ [GVC96] and SCFQ [Gol94] both E1 and E2 can be shown to grow linearly with N. Implementation complexity of both of these schedulers is O(log N). 4.3.4 WF2Q+ For WF2Q+ [BZ96b], E1 = MTU/R, while E2 can grow linearly with N. The implementation complexity of WF2Q+ is O(log N). 4.4. Hierarchical scheduling implementations A possible implementation of EF PHB may be based on a hierarchical scheduling framework, such as described in [FJ95]. In this framework, different subsets of EF packets may be assigned to different queues. The semantics of exactly how packets are classified into different EF queues is highly implementation- dependent. For convenience, the subset of EF packets sharing a single queue will be referred to as "EF flows". The EF queues are grouped in a "logical queue", which is scheduled as a single entity along with other non-EF queues or groups of queues by a "top-level" scheduler. It is this top-level scheduler that must satisfy DEF_1. Once the EF aggregate (i.e the EF "logical queue") is scheduled by this top-level scheduler, an "EF flow-level" scheduler is invoked. As an example, a hierarchical scheduler with WF2Q at each level of the hierarchy (as described in [BZ96b]) can be used for such a Charny, ed. et al. 15 draft-charny-ef-definition.txt July, 2000 purpose. Alternatively, the EF "logical" queue can be served at strict priority over all non-EF queues, while the EF queue at the "EF flow" level can be served by some other scheduler, such as WF2Q. In principle, hierarchical scheduling structure allows a substantial flexibility in the choice of scheduling mechanisms at each level of the hierarchy. Per-packet delay guarantees in such a hierarchical scheduling framework strongly depend on the accuracy of schedulers employed at each level of the hierarchy. In general, the more accurate the scheduling implementation at each level, the better the per-packet guarantee that can be provided. It can be shown that for the scheduling hierarchy, the E1 and E2 latency terms of the hierarchical scheduler with respect to a particular "leaf queue" can be obtained by summing the E1 and E2 terms of the schedulers employed at the nodes of the scheduling tree along the ascending branch of the tree from the root to the leaf. 4.5. Effect of internal switching mechanisms A packet passing through a router will experience delay for a number of reasons. Two familiar components of this delay are the time the packet sits in a buffer at an outgoing link waiting for the scheduler to select it and the time it takes to actually transmit the packet on the outgoing line. There may be other components of a packet's delay through a router, however. A router might have to do some amount of header processing before the packet can be given to the correct output scheduler, for example. In another case a router may have a FIFO buffer (called a transmission queue in [FC2000]) where the packet sits after being selected by the output scheduler but before it is transmitted. In cases such as these, the extra delay a packet may experience can be accounted for by absorbing it into the latency term, E, in DEF_1. Implementing EF on a router with a multi-stage switch fabric requires special attention. A packet may experience additional delays due to the fact that it must compete with other traffic for forwarding resources at multiple contention points in the core. The delay an EF packet may experience before it even reaches the output- link scheduler should be included in the latency term. Input- buffered and input/output-buffered routers may also require modification of their latency terms. Delay in the switch core comes from two sources, both of which must be considered. The first part of this delay is the fixed delay a packet experiences regardless of the other traffic. This component of the delay includes the time it takes for things such as packet segmentation and reassembly in cell based cores, enqueueing and dequeueing at each stage, and transmission between stages. The second part of the switch core delay is variable and depends on the type and amount of other traffic traversing the core. This delay comes about if Charny, ed. et al. 16 draft-charny-ef-definition.txt July, 2000 the stages in the core mix traffic flowing between different input/output port pairs. Thus, EF packets must compete against other traffic for forwarding resources in the core. Some of this competing traffic may even be EF traffic from other aggregates. This introduces extra delay, that can also be absorbed by the latency term in the definition. 5. Security Considerations This draft makes the PHB definition in [RFC2598] more rigorous, but adds no new functions to it. As a result, it adds no security issues to those described in that specification. 6. Appendix A. Statement A1. If a scheduler conforms to DEF_1, and the EF packets are served in the FIFO order, then all EF packets that are in the system at time t will leave the system no later than at time t+Q/R+E, where Q is the total EF queue at time t. Proof of A1. Consider the time t. Let a(1)...a(n) denote the arrival times of all EF packets that are in the system at time t. For all of these packets a(i)<=t. Let d(0)<=t denote the previous departure time before time t. Let d(1),...d(n) denote the first n departures after time t, and let L(1),...L(n) denote the packet lengths corresponding to these departures and let F(1),...F(n) denote the corresponding "finish times" in DEF_1. We now prove by induction that F(i)<=t+(L(1)+...L(i))/R for all 1<=i<=n. Base case. Since a(1)=0. We prove by induction that for all k>=1 in this busy period d(k) <= F(k)+MTU/C (eq_b1) This would immediately imply Statement B1. Base case. For k=1, F(1) = max(a(1), min(d(0), F(0)) + L(1)/R >= a(1) + L(1)/R >= a(1)+L(1)/C (eq_b2) and d(1) <= a(1) + MTU/C + L(1)/C <= F(1) + MTU/C where the first inequality follows from the fact that the first packet in a PQ may wait at most for one largest packet transmission before its own transmission begins, and the second inequality follows from (eq_b2). Inductive step. Note that since EF has the highest priority, for k>1 in the busy period of the EF queue d(k)=d(k-1) + L(k)/C (eq_b3) Now from the induction hypothesis F(k-1) >= d(k-1) - MTU/C And the definition (eq_2) of F(k) gives F(k) >= max(a(k), min(d(k-1), d(k-1) - MTU/C))+ L(k)/C = max(a(k), d(k-1)- MTU/C)+ L(k)/C (eq_b4) It follows immediately from (eq_b4) that F(k) >= d(k-1)- MTU/C + L(k)/C Charny, ed. et al. 18 draft-charny-ef-definition.txt July, 2000 Combining with (eq_b3) demonstrates (eq_b1) and completes the inductive step. 8. Appendix C Statement C1. ============ If a scheduler satisfies the condition G(i) - E1 <= d(i) <= G(i) + E2 (eq_c1) where G(i) is the i-th finishing time of the reference fluid scheduler, then it satisfies DEF_1 in section 2.1 with latency term E <= E1 + E2 Proof of Statement C1. ---------------------- To prove Statement C1 we will prove that for all i >= 0 F(i) >= G(i) - E1 (eq_c2) where F(i) is the set of finish times recursively defined by (eq_2). If (eq_c2) is proven, then from (eq_c1) and (eq_c2) d(i) <= G(i) + E2 <= F(i) + E1 + E2, which means that the scheduler satisfies DEF_1 with the latency term E = E1 + E2 Proof of (eq_c2). ----------------- First note that in the reference GPS system, packet i starts its service at time max ( a(i), G(i-1)) and receives a service rate at least equal to R. Thus G(i) <= max ( a(i), G(i-1)) + L(i)/R (eq_c3) Now the proof of (eq_c2) proceeds by induction. Base case F(0)=0, G(0) = 0, so (eq_c2) trivially holds for i=0. Inductive step. Suppose (eq_c2) holds for all j=0,1...i-1, (i>=1) We have both F(i-1) >= G(i-1) - E1 and d(i-1) >= G(i-1) - E1, thus min (F(i-1), d(i-1)) >= G(i-1) - E1 (eq_c4) Charny, ed. et al. 19 draft-charny-ef-definition.txt July, 2000 Combining this with equation (eq_2), we obtain F(i) >= G(i-1) - E1 + L(i)/R (eq_c5) Again from equation (eq_2) we have F(i) >= a(i)+ L(i)/R >= a(i) - E1 + L(i)/R (eq_c6) Combining (eq_c5), (eq_c6) and (eq_c3) gives F(i) >= G(i)-E1, which completes the proof of (eq_c2) and statement C1. Statement C2. ============= WF2Q satisfies DEF_1 with E = MTU/C + MTU/R Proof of C2. ------------ It follows from the results of [BZ96a] that the departures in WF2Q satisfy the condition max(G(i-1), a(i))<= d(i) <= G(i) + MTU/C From Equation (eq_c3) this implies that d(i) >= G(i) - L(i)/R >= G(i) - MTU/R Therefore (eq_c1) in Statement C1 holds with E1=MTU/R and E2 = MTU/C. Therefore, by Statement C1, WF2Q satisfies DEF_1 with E=MTU/C + MTU/R. 9. References [BZ96a] J.C.R. Bennett and H. Zhang, ``WF2Q: Worst-case Fair Weighted Fair Queuing'', INFOCOM'96, Mar, 1996 [BZ96b] J.C.R. Bennett and H. Zhang, Hierarchical Packet Fair Queuing Algorithms. IEEE/ACM Transactions on Networking, 5(5):675-689, Oct 1997. Also in Proceedings of SIGCOMM'96, Aug, 1996 [RFC2475] Black, D., Blake, S., Carlson, M., Davies, E., Wang, Z. and W. Weiss, "An Architecture for Differentiated Services", RFC 2475, December 1998. [LEB98] J.-Y. Le Boudec, "Application of Network Calculus To Guaranteed Service Networks", IEEE Transactions on Information theory, (44) 3, May 1998 Charny, ed. et al. 20 draft-charny-ef-definition.txt July, 2000 [Bra97] Bradner, S., "Key Words for Use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [CLeB2000] A. Charny, J.-Y. Le Boudec "Delay Bounds in a Network with Aggregate Scheduling". To appear in Proc. of QoFIS'2000, September 25-26, 2000, Berlin, Germany. [DKS90] A. Demers, S. Keshav, and S. Shenker, "Analysis and Simulation of a Fair Queuing Algorithm". In Journal of Internetworking Research and Experience, pages 3-26, October 1990. Also in Proceedings of ACM SIGCOMM'89, pp 3-12. [FC2000] T. Ferrari and P. F. Chimento, "A Measurement- Based Analysis of Expedited Forwarding PHB Mechanisms," Eighth International Workshop on Quality of Service, Pittsburgh, PA, June 2000, [FJ95] S. Floyd and V. Jacobson, "Link-sharing and Resource Management Models for Packet Networks", IEEE/ACM Transactions on Networking, Vol. 3 no. 4, pp. 365- 386,August 1995. [Gol94] S.J. Golestani. "A Self-clocked Fair Queuing Scheme for Broad-band Applications". In Proceedings of IEEE INFOCOM'94, pages 636-646, Toronto, CA, April 1994. [GVC96] P. Goyal, H.M. Vin, and H. Chen. "Start-time Fair Queuing: A Scheduling Algorithm for Integrated Services". In Proceedings of the ACM-SIGCOMM 96, pages 157-168, Palo Alto, CA, August 1996. [RFC2598] V. Jacobson, K. Nichols, K. Poduri, "An Expedited Forwarding PHB", RFC 2598, June 1999 [RFC2474] Nichols, K., Blake, S., Baker, F. and D. Black, "Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers", RFC 2474, December 1998. [JNP2000] V. Jacobson, K. Nichols, K. Poduri, "The 'Virtual Wire' Behavior Aggregate," (draft-ietf-diffserv-ba-vw-00.txt), March 2000. [Par92] A. Parekh. "A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks". PhD dissertation, Massachusetts Institute of Technology, February 1992. [SV95] M. Shreedhar and G. Varghese. "Effient Fair Queueing Charny, ed. et al. 21 draft-charny-ef-definition.txt July, 2000 Using Deficit Round Robin". In Proceedings of SIGCOMM'95, pages 231-243, Boston, MA, September 1995. [Sto95] I. Stoica and H. Abdel-Wahab, "Earliest Eligible Virtual Deadline First: A Flexible and Accurate Mechanism for Proportional Share Resource Allocation", Technical Report 95-22, Old Dominion University, November 1995. 9. Authors' addresses Anna Charny, ed. Cisco Systems 300 Apollo Drive Chelmsford, MA 01824 acharny@cisco.edu Fred Baker Cisco Systems 170 West Tasman Dr. San Jose, CA 95134 fred@cisco.com Jon Bennett RiverDelta Networks 3 Highwood Drive East Tewksbury, MA 01876 jcrb@riverdelta.com Kent Benson Tellabs Research Center 3740 Edison Lake Parkway #101 Mishawaka, IN 46545 Kent.Benson@tellabs.com Jean-Yves Le Boudec ICA-EPFL, INN Ecublens, CH-1015 Lausanne-EPFL, Switzerland leboudec@epfl.c Angela Chiu AT&T Labs 100 Schulz Dr. Rm 4-204 Red Bank, NJ 07701 alchiu@att.com Bill Courtney TRW Bldg. 201/3702 One Space Park Redondo Beach, CA 90278 Charny, ed. et al. 22 draft-charny-ef-definition.txt July, 2000 bill.courtney@trw.com Shahram Davari PMC-Sierra Inc 555 Legget drive Suit 834, Tower B Ottawa, ON K2K 2X3, Canada shahram_davari@pmc-sierra.com Bruce Davie Cisco Systems 300 Apollo Drive Chelmsford, MA 01824 bsd@cisco.com Victor Firou Nortel Networks 600 Tech Park Billerica, MA 01821 vfirou@nortelnetworks.com Charles Kalmanek AT&T Labs-Research 180 Park Avenue, Room A113, Florham Park NJ crk@research.att.com. K.K. Ramakrishnan AT&T Labs-Research Rm. A155, 180 Park Ave, Florham Park, NJ 07932 kkrama@research.att.com Dimitrios Stiliadis Lucent Technologies 1380 Rodick Road Markham, Ontario, L3R-4G5, Canada stiliadi@bell-labs.com 10. Full Copyright Copyright (C) The Internet Society 1999. All Rights Reserved. This document and translations of it may be copied andFurnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of Charny, ed. et al. 23 draft-charny-ef-definition.txt July, 2000 developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English. The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns. This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. draft-charny-ef-definition-00.txt Expires January, 2001 Charny, ed. et al. 24