INTERNET-DRAFT Werner Almesberger EPFL ICA, Switzerland Jamal Hadi Salim CTL Nortel Networks Alexey Kuznetsov INR Moscow June 1999 Differentiated Services on Linux Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract Recent Linux kernels offer a wide variety of traffic control functions, which can be combined in a modular way. We have designed support for Differentiated Services based on the existing traffic control elements, and we have implemented new components where necessary. In this document we give a brief overview of the structure of Linux traffic control, and we describe our prototype implementation in more detail. 1. Introduction The Differentiated Services architecture (Diffserv) lays the foundation for implementing service differentiation in the Internet in an efficient, scalable way. We assume that readers are familiar Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 1] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 with the concepts and terminology defined in [1]. Furthermore, we assume familiarity with the packet marking as described in [2]. We have developed a design to support basic classification and DS field manipulation required by Diffserv nodes. The design enables configuration of the first PHBs that are being defined in the Diffserv WG. We have implemented a prototype of this design using the traffic control framework available in recent Linux kernels. The source code, configuration examples, and related information can be obtained from http://icawww1.epfl.ch/linux-diffserv/ The main focus of our work is to allow maximum flexibility for node configuration and for experiments with PHBs, while still maintaining a design that does not unnecessarily sacrifice performance. This document is structured as follows. Section "Linux Traffic Control" gives a brief overview of traffic control functions in recent Linux kernels. Section "Diffserv extensions to Linux traffic control" discusses where the existing model needed to be extended. Section "New components" describes the new components in more detail. We conclude with examples of configuration scripts in section "Building sample configurations". 2. Linux Traffic Control Figure 1 shows roughly how the kernel processes data received from the network, and how it generates new data to be sent on the network. +---------------+ +---->| TCP, UDP, ... | | +---------------+ | | TRAFFIC CONTROL | v | +---------------------+ +------------+ +----------------+ -->|Input de-multiplexing|-->| Forwarding |-->| Output queuing |--> +---------------------+ +------------+ +----------------+ Figure 1: Processing of network data. "Forwarding" includes the selection of the output interface, the selection of the next hop, encapsulation, etc. Once all this is done, packets are queued on the respective output interface. This is the point where traffic control comes into play. Traffic control can, among other things, decide if packets are queued or if they are dropped (e.g. if the queue has reached some length limit, or if the traffic exceeds some rate limit), it can decide in which order packets are sent (e.g. to give priority to certain flows), it can delay the sending of packets (e.g. to limit the rate of outbound traffic), etc. Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 2] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 Once traffic control has released a packet for sending, the device driver picks it up and emits it on the network. 2.1 Components The traffic control code in the Linux kernel consists of the following major conceptual components: - queuing disciplines - classes (within a queuing discipline) - filters - policing Each network device has a queuing discipline associated with it, which controls how packets enqueued on that device are treated. A very simple queuing discipline may just consist of a single queue, where all packets are stored in the order in which they have been enqueued, and which is emptied as fast as the respective device can send. See figure 2 for such a queuing discipline without externally visible internal structure. +--------------------+ --->| Queuing discipline |---> +--------------------+ Figure 2: A simple queuing discipline without classes. More elaborate queuing disciplines may use filters to distinguish among different classes of packets and process each class in a specific way, e.g. by giving one class priority over other classes. +-+ +-----+ +------------------+ +-+ +-+ | | +------+ | |-->|Queuing discipline|-->| | | | | |---->|Filter|--->|Class| +------------------+ | |--+ | | | | | +------+ | +--------------------------+ | | | | | | | +----------------------------------+ | | | | | | +------+ | | | | | +->|Filter|-_ +-----+ +------------------+ +-+ | | | -->| | | +------+ ->| |-->|Queuing discipline|-->| | | | |--> | | | |Class| +------------------+ | |--+-->| | | | | +------+ _->| +--------------------------+ | | | | | +->|Filter|- +----------------------------------+ | | | | +------+ | | | +-----------------------------------------------------------+ | | Queuing discipline | +---------------------------------------------------------------+ Figure 3: A simple queuing discipline with multiple classes. Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 3] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 Figure 3 shows an example of such a queuing discipline. Note that multiple filters may map to the same class. Queuing disciplines and classes are intimately tied together: the presence of classes and their semantics are fundamental properties of the queuing discipline. In contrast to that, filters can be combined arbitrarily with queuing disciplines and classes as long as the queuing discipline has classes to map the packets to. But flexibility does not end there yet - classes normally do not take care of storing their packets themselves, but they use another queuing discipline to take care of that. That queuing discipline can be arbitrarily chosen from the set of available queuing disciplines, and it may well have classes, which in turn use queuing disciplines, etc. The term qdisc would be used interchangeably to mean queueing discipline in this draft. +-+ +------+ +---------------+ +-+ +-+ | | +------+ | |-->|TBF, rate=1Mbps|-->| | | | | |--+->|Filter|--->|"high"| +---------------+ | |--+ | | | | | +------+ | +-----------------------+ | | | | | | | +--------------------------------+ | | | | | | | | | | | | +------+ +---------------+ +-+ | | | -->| | | Default | |-->| FIFO |-->| | | | |--> | | +------------->|"low" | +---------------+ | |--+-->| | | | | +-----------------------+ | | | | | +--------------------------------+ | | | | | | | +---------------------------------------------------------+ | | Queuing discipline with two delay priorities | +-------------------------------------------------------------+ Figure 4: Combination of priority, TBF, and FIFO queuing disciplines. Figure 4 shows an example of such a stack: first, there is a queuing discipline with two delay priorities. Packets which are selected by the filter go to the high-priority class, while all other packets go to the low-priority class. Whenever there are packets in the high-priority queue, they are sent before packets in the low-priority queue (e.g. the "sch_prio" queuing discipline works this way). In order to prevent high-priority traffic from starving low-priority traffic, we use a token bucket filter (TBF), which enforces a rate of at most 1 Mbps. Finally, the queuing of low-priority packets is done by a FIFO queuing discipline. Note that there are other ways to accomplish what we have done here, e.g. by using class-based queuing (CBQ). Packets are enqueued as follows: when the "enqueue" function of a queuing discipline is called, it scans the filters until one of them indicates a match to a class identifier. It then queues the Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 4] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 packet for the corresponding class, which usually means to invoke the "enqueue" function of the queuing discipline "owned" by that class. Packets which do not match any of the filters are typically attributed to some default class. Typically, each class "owns" one queue, but it is in principle also possible that several classes share the same queue or even that a single queue is used by all classes of the respective queuing discipline. Note, however, that packets do not carry any explicit indication of which class they were attributed to. Queuing disciplines that change per-class information when dequeuing packets (e.g. CBQ) will therefore not work properly if the "inner" queues are shared, unless they are able either to repeat the classification or to pass the classification result from "enqueue" to "dequeue" by some other means. Usually when enqueuing packets, the corresponding flow(s) can be policed, e.g. by discarding packets which exceed a certain rate. 3. Diffserv extensions to Linux traffic control The traffic control framework available in recent Linux kernels [3] already offers most of the functionality required for implementing Diffserv support. We therefore closely followed the existing design and added new components only where it was deemed strictly necessary. 3.1 Overview Figure 5 shows the general structure of the forwarding path in a Diffserv node. +---------+ +-----+ +---------+ | Classi- |--->| +-----+ | | --->| fier & |----->| +-----+ | Marking |---> | Meter |------->| PHB |--->| | +---------+ +-----+ +---------+ Figure 5: General Diffserv forwarding path. Depending on the implementation, marking may also occur at different places, possibly even several times. The classification result may be used several times in the Diffserv processing path, and it may also depend on external factors (e.g. time), so reproducing the classification result may not only be expensive, but actually impossible. We therefore added a new field "tc_index" to the packet buffer descriptor ("struct sk_buff"), where we store the result of the Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 5] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 initial classification. In order to avoid confusing "tc_index" with the classifier "cls_tcindex", we will call the former skb->tc_index throughout this document. skb->tc_index is set using the "sch_dsmark" queuing discipline, which is also responsible for initially retrieving the DSCP, and for setting the DS field in packets before they are sent on the network. "sch_dsmark" provides the framework for all other operations. The "cls_tcindex" classifier reads all or part of skb->tc_index and uses this to select classes. Finally, we need a queuing discipline to support multiple drop priorities as required for Assured Forwarding. For this, we designed GRED, a generalized RED. "sch_gred" provides a configurable number of drop priorities which are selected by the lower bits of skb->tc_index. 3.2 Classification and marking The classifiers "cls_rsvp" and "cls_u32" can handle all micro-flow classification tasks. Additionally, the "ipchains" firewall is also capable of tagging microflows into classes. Behavior aggregate classification could also be done using "cls_u32" and "ipchains", but since we usually already have "sch_dsmark" at the top level, we use the simpler "cls_tcindex" and retrieve the DSCP using "sch_dsmark", which then puts it into skb->tc_index. When using "sch_dsmark", the class number returned by the classifier is stored in skb->tc_index. This way, the result can be re-used during later processing steps. Nodes in multiple DS domains must also be able to distinguish packets by the inbound interface in order to translate the DSCP to the correct PHB. This can be done using the "route" classifier, in combination with the "ip rule" command interface subset. Marking is done when a packet is dequeued from "sch_dsmark". "sch_dsmark" uses skb->tc_index as an index to a table in which the outbound DSCP is stored and puts this value into the packet's DS field. skb->ihp->tos - - - - - - - - - - - - - - - - - - - - - - - - - - - - -> Initial DS field marking -- ^ Initial value of tc_index | +-+ +------+ | +---+-+ +-+ +-|-+ | | | cls_ |--->| | |--> . . . -->| | | | | | |--->| rsvp |--->| | | | |--->| | | | | | |--->| | | +---------------+ | | v | Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 6] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 -->| | +------+ +-|-+-------^-----------+ | O |--> | | | | | ^ | | +------------------|---------|----------------+ | | | sch_dsmark | | | | +--------------------|---------|------------------|-+ | | -- tc_index may | v v change | - - - - - - - - - - - - - - - - - - - - - - - - - - - - -> skb->tc_index Figure 6: Micro-flow classifier. Figure 6 shows the use of "sch_dsmark" and skb->tc_index in a micro-flow classifier based on "cls_rsvp". Figure 7 shows a behavior aggregate classifier using "cls_tcindex". skb->ihp->tos - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -> | -- DS field is used May change DS field -- ^ | for classification | +-|-+ +------+ +---+-+ +-+ +-|-+ | | | | cls_ |--->| | |--> . . . -->| | | | | | | |----->| tc |--->| | | | |--->| | | | | | |index |--->| | | +---------------+ | | v | -->| O | +------+ +-|-+--------------^----+ | O |--> | | | ^ | | | ^ | | | +----------|---------|----------------|---------+ | | | | sch_dsmark | | | | | +-|------------|---------|----------------|-----------|-+ | | | -- tc_index -- | | v | v may change v | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -> skb->tc_index Figure 7: Behaviour aggregate classifier. 3.3 Cascaded meters Multiple meters are needed if traffic should be assigned to more than two classes, based on the bandwidth it uses. As an example, such classes could be for "low", "high", and "excess" traffic. Our current implementation supports a limited form of cascading at the level of classifiers. We are testing a cleaner and more efficient solution at the time of writing. 3.4 Implementing PHBs Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 7] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 PHBs based only on delay priorities, e.g. Expedited Forwarding [4], can be built using CBQ [5] or the more simple "sch_prio". (See section "Building sample configurations".) Besides four delay priorities, which can again be implemented with already existing components, Assured Forwarding [6] also needs three drop priorities, which is more than the current implementation of RED supports. We therefore added a new queuing discipline which we call "generalized RED" (GRED). GRED uses the lower bits of skb->tc_index to select the drop class and hence the corresponding set of RED parameters. 3.5 Shaping The so-called Token Bucket Filter ("sch_tbf") can be used for shaping at edge nodes. Unfortunately, the highest rate at which "sch_tbf" can shape is limited by the system timer, which normally ticks at 100 Hz, but can be accelerated to 1 kHz or more if the processor is sufficiently powerful. Note that Linux traffic control supports more granular clocking for droppers (i.e. shapers without buffer). CBQ can also be used to do shaping. Higher rates can be shaped when using hardware-based solutions, such as ATM. 3.6 End systems Diffserv-capable sources use the same functionality as edge routers, i.e. any classification and traffic conditioning can be administratively configured. In addition to that, an application may also choose to mark packets when they are generated. For IPv4, this can be done using the "IP_TOS" socket option, which is commonly available on Unix, and of course also on Linux. Note that Linux follows the [7] convention of not allowing the lowest bit of the TOS byte to be different from zero. This restriction is compatible with use for Diffserv. Furthermore, the use of values corresponding to high precedences (i.e. DSCP 0x28 and above) is restricted. This can be avoided either by giving the application the appropriate capabilities (privileges), or by re-marking (see below). Setting the DS field with IPv6 is currently very awkward. In the future, an improved interface is likely to be provided that unifies the IPv4 and IPv6 usage and that may contain additional improvements, e.g. selection of services instead of "raw" DS field values. Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 8] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 An application's choice of DS field values can always be refused or changed by traffic control (using re-marking) before a packet actually reaches the network. 4. New components The prototype implementation of Diffserv support requires the addition of three new traffic control elements to the kernel: (1) the queuing discipline "sch_dsmark" to extract and to set the DSCP, (2) the classifier "cls_tcindex" which uses this information, and (3) the queuing discipline "sch_gred" which supports multiple drop priorities and buffer sharing. Only the queueing discipline to extract and set the DSCP is truly specific to the differentiated services architecture. The other two elements can also be used in other contexts. Figure 6 shows the use of "sch_dsmark" for the initial packet marking when entering a Diffserv domain. The classification and rate control metering is performed by a micro-flow classifier, e.g. "cls_rsvp", in this case. This classifier determines the initial TC index which is then stored in skb->tc_index. Afterwards, further processing is performed by an inner queuing discipline. Note that this queuing discipline may read and even change skb->tc_index. When a packet leaves "sch_dsmark", skb->tc_index is examined and the DS field of the packet is set accordingly. Figure 7 shows the use of "sch_dsmark" and "cls_tcindex" in a node which works on a behavior aggregate, i.e. on packets with the DS field already set. The procedure is quite similar to the previous scenario, with the exception that "cls_tcindex" takes over the role of "cls_rsvp" and that the DS field of the incoming packet is copied to "tc_index" before invoking the classifier. Note that the value of the outbound DS field can be affected at three locations: (1) in "sch_dsmark", when classifying based on skb->tc_index, which contains the original value of the DS field; (2) by changing skb->tc_index in an inner queuing discipline; and (3) in "sch_dsmark", when mapping the final value of skb->tc_index back to a new value of the DS field. 4.1 "sch_dsmark" As illustrated in figure 8, the "sch_dsmark" queuing discipline performs three actions based on the scripting invocation: - If "set_tc_index" is set, it retrieves the content of the DS Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 9] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 field and stores it in skb->tc_index. - It invokes a classifier and stores the class ID returned in skb->tc_index. If the classifier finds no match, the value of "default_index" is used instead. If "default_index" is not set, the value of skb->tc_index is not changed. Note that this can yield undefined behaviour if neither "set_tc_index" nor "default_index" is set. - After sending the packet through its inner queuing discipline, it uses the resulting value of skb->tc_index as an index into a table of (mask,value) pairs. The original value of the DS field is then replaced using the following formula: ds_field = (ds_field & mask) | value skb->ihp->tos - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -> | -- Optional: DS field is copied to tc_index | ^ | | | +-|-+ +---------+ tc_index is translated +----|---|---+ | | |-+->|Filter |_ to new DSCP \| | | | | | | | +-------^-+ -_ -- res.classid contains \ v | | | | | | | -_ new tc_index |\ (&)>(or) | | | | | +-------|-+ -->+-------+-------------+ | ^ ^ | | v | +->|Filter | |------->|classid|Queuing disc.|-->| | | | -->| O | | +-------^-+ _-->+-------+-------------+ | |___| |--> | | | | Default | _- | | [__|__] | | | | +----------|---- | | +>[__|__] | | | | | default_index provides tc_index | | [__|__] | | | +------------|--------------|---------------------+ |Mask Value| | | sch_dsmark | | | | +-|--------------|--------------|-----------------------|----------+ | | -- classifier|may use | v | tc_index v | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -> skb->tc_index Figure 8: The "dsmark" queuing discipline. Table 1 lists the parameters that can be configured in the "dsmark" queuing discipline. The upper part of the table shows parameters of the queuing discipline itself. The lower part shows parameters of each class. --------------------------------------------------------- | Variable name / tc keyword | Value | Default | --------------------------------------------------------- | indices | 2^n | none | | default_index | 0... indices-1 | absent | | set_tc_index | none (flag) | absent | --------------------------------------------------------- | mask | 0...0xff | 0xff | | value | 0...0xff | 0 | Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 10] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 --------------------------------------------------------- Table 1: Configuration parameters of "sch_dsmark". "indices" is the size of the table of (mask,value) pairs. 4.2 "cls_tcindex" As shown in figure 9, the "cls_tcindex" classifier uses skb->tc_index to select classes. It first calculates the lookup key using the algorithm key = (skb->tc_index >> shift) & mask Then it looks for an entry with this handle. If an entry is found, it may call a meter (if configured), and it will return the class IDs of the corresponding class. If no entry is found, the result depends on whether "fall_through" is set. If set, a class ID is constructed from the lookup key. Otherwise, it returns a "not found" indication and the search continues with the next classifier. We call construction of the class ID an "algorithmic mapping". This can be used to avoid setting up a large number of classifier elements if there is a sufficiently simple relation between values of skb->tc_index and class IDs. An example of this trick is used in the AF scripts on the web site. The size of the lookup table can be set using the "hash" option. "cls_tcindex" automatically uses perfect hashing if the range of possible choices does not exceed the size of the lookup table. If the "hash" option is omitted, an implementation-dependent default value is chosen. +-----------------+ | mask shift | | | | | skb->tc_index ---->(&) ---->(>>) | | | | +-------------|---+ | +-------------------+ | Key v +------------------------+ | key class(id) police | +------------------------+ | key class(id) police | +------------------------+ | | | : | +--------> Profile : | : +------------------> Class Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 11] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 | +-----------------> *:Key if fall_through Figure 9: The "tcindex" classifier. Table 2 shows the parameters that can be configured in the "tcindex" classifier. The upper part of the table shows parameters of the classifier itself. The lower part shows parameters of each element. ---------------------------------------------------------------- | Variable | tc keyword | Value | Default | ---------------------------------------------------------------- | hash | hash | 1...0x10000 | implementation- | | | | | dependent | | mask | mask | 0...0xffff | 0xffff | | shift | shift | 0...15 | 0 | | fall_through | fall_through/ | flag | fall_through | | | pass_on | | | ---------------------------------------------------------------- | res | classid | major:minor | none | | police | police | Profile | none | ---------------------------------------------------------------- Table 2: Configuration parameters of "cls_tcindex". Note that the keyword used by "tc" (the command-line tool used to manually configure traffic control elements) does not always correspond to the variable internally used by "cls_tcindex". 4.3 "sch_gred" Virtual Queue RED Parameters +-------+ Class virtual queue | VC0 | selector +---->| VC1 |----+ | | ... | | skb->tc_index | | VCn | | Class physical | +-------+ v Queue queue Packet | +---+ packet ---------+ ================================>| |========> | | | |===> +---+ ---------+ I I Drop packet X Figure 10: Generic RED and the use of "skb->tc_index" Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 12] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 Figure 10 shows how "sch_gred" uses skb->tc_index for the selection of the right virtual queue (VQ) within a physical queue. What makes "sch_gred" different from other Multi-RED implementations is the fact that it is decoupled from any one specific block along the packet's path such as a header classifier or meter. For example, CISCO's DWRED [8] is tied to mapping VQ selection based on the precedence bits classification. On the other hand, RIO [9] is tied to the IN/OUT metering levels for the selection of the VQ. In the case of GRED, any classifier, meter, etc. along the data path can affect the selection of the VQ by setting the appropriate value of skb->tc_index. GRED also differs from the two mentioned multiple RED mechanisms in that it is not limited to a specific number of VQ. The number of VQs is configurable for each physical class queue. GRED does not assume certain drop precedences (or priorities). It depends on the configuration parameters passed on by the user. In essence, DWRED and RIO are special cases of GRED. Currently, the number of virtual queues is limited to 16 (the least significant 4 bits of skb->tc_index). There is a one to one mapping between the values of skb->tc_index and the virtual queue number in a class. Buffer sharing is achieved using one of two ways (selectable via configuration): - Simple setting of physical queue limits. It is up to the individual configuring the virtual queues parameters to decide which one gets preferential treatment. Sharing and preferential treatment amongst virtual queues is based on parameter settings such as the per-virtual queue physical limit, threshold values and drop probabilities. This is the default setting. - A similar average queue trick as that is used in [9]. This is selected by the operator "grio" during the setup. Each VQ within a class is assigned a priority at configuration time. Priorities range from 1 to 16 at the moment, with 1 being the highest. The computation of the average queue value (for a VQ) involves first summing to the current stored average queue value all the the other average queue values of the VQs which are more important than it. This way a relatively higher priority (lower priority value) gets preferential treatment because its average queue is always the lowest; the relatively lower priority will still continue to send when the higher ones are not dominating the buffer space. A user can still configure the per-virtual-Queue physical queue limits, threshold values, and drop probabilities as in the (first) case when the "grio" option is not defined. The second scheme is slightly slower than the first one (a few more per-packet computations). GRED is configured in two steps. First (see also the upper part of Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 13] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 table 3) the generic parameters are configured to select the number of virtual queues "DPs" and whether to turn on the RIO-like buffer sharing scheme ("grio"). Also at this point, a default virtual queue is selected so that packets with out of range values of skb->tc_index or misconfigured priorities in the case of "grio" buffer-sharing setup are directed to it. Normally, the default virtual queue is the one with the highest likelihood of having a packet discarded. The operator "setup" identifies that this is a generic setup for GRED. -------------------------------------------------- | Variable | tc keyword | Value | Default | -------------------------------------------------- | DPs | DPs | 1...16 | none | | def | default | 1...DPs | none | | grio | grio | none (flag) | absent | -------------------------------------------------- | limit | limit | bytes | none | | qth_min | min | bytes | none | | qth_max | max | bytes | none | | n/a | avpkt | bytes | none | | n/a | bandwidth | rate | 10 Mbps | | n/a | burst | packets | none | | n/a | probability | [0...1) | 0.02 | | DP | DP | 1...DPs | 0 | | prio | prio | 1...DPs | none | -------------------------------------------------- Table 3: Configuration parameters of "sch_gred". The second step is to set parameters for individual virtual queues. (See also the lower part of table 3). These parameters are equivalent to the traditional RED parameters. In addition, each RED configuration identifies which virtual queue the parameters belong to as well as the priority if the "grio" technique is selected. The mandatory parameters are: - "limit" defines the virtual queue "physical" limit in bytes. - "min" defines the minimum threshold value in bytes. - "max" defines the maximum threshold value in bytes. - "avpkt" is the average packet size in bytes. - "bandwidth" is the wire-speed of the interface. - "burst" is the number of average-sized packets allowed to burst. The Linux RED implementation attempts to compute an optimal W value for the user based on the "avpkt", minimum threshold and allowed burst size. This is based on the equation: burst + 1 - qmin/avpkt < (1-(1-W)^burst)/W as described in [10]. - "probability" defines the drop probability in the range [0...). - "DP" identifies the virtual queue assigned to these parameters. Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 14] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 - "prio" identifies the virtual queue priority if "grio" was set in the general parameters. 5. Building sample configurations ^ ^ ^ ^ ^^ +------------+ ( netlink ) +------------+ | tc |<---> ( cloud ) <--> | kernel | +------------+ ( ) +------------+ ( ) ^^^^^^ Figure 11: User space to kernel communication using "tc" Communication and configuration of the kernel code or modules is achieved by a user level program tc written by Alexey. The interaction is shown in figure 11. Given the flexibility of the code, there are many ways to reach the same end goal. Depending on the requirement, one could script the same PHB using a different combinations of qdiscs; e.g. one could build a core EF capable router using either CBQ to rate limit it and prioritise its traffic or instead use the PRIO qdisc with a Token Bucket attached to rate limit it. It is hoped that users of Linux Diffserv will be able to script their own flavored configurations. The examples below (as well as those on the Linux Diffserv web site) are simplistic, in the sense that they only assume one interface per node. One should easily be able to extend them for more than one interface The normal recipe for creating a configuration script is: - attach "sch_dsmark" to the output interface - define the structure of the queuing discipline(s) inside "sch_dsmark" - number the classes and decide on a numbering scheme to use for skb->tc_index (the latter may be trivial if skb->tc_index is only used within "sch_dsmark".) - identify which packets go to which classes and configure the classifier(s) of "sch_dsmark" accordingly The script lines in the next subsections are numbered for clarity of the accompanying description below. For clarity, we did not include handling of historical DS field values in our scripts. 5.1 Edge device: Packet re-marking Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 15] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 Table lookup +-+ +-----------------+ +---------+------+ +-----------+ | |-+->| u32 | Meter |--->|Class 1:1| | | DSCP 0x2e | | | | +-----------------+ +---------+ | | | | | | | Exceeded -- +----->|Class 1:2| | | DSCP 0x18 | | | | | limit +---------+ FIFO |---->| |--> -->| | | +-------------------->|Class 1:3| | | DSCP 0x1a | | | | No match +---------+ | | | | | +---------------------------------->| | | | | | +------+ | | | | | | | +--------------------------------------------------+ | | sch_dsmark (1:0) | +----------------------------------------------------------------+ Figure 12: Packet re-marking at the edge. 1. tc qdisc add dev eth0 handle 1:0 root dsmark indices 64 2. tc class change dev eth0 classid 1:1 dsmark mask 0x3 value 0xb8 3. tc class change dev eth0 classid 1:2 dsmark mask 0x3 value 0x68 4. tc class change dev eth0 classid 1:3 dsmark mask 0x3 value 0x48 5. tc filter add dev eth0 parent 1:0 protocol ip prio 4 handle 1: u32 divisor 1 6. tc filter add dev eth0 parent 1:0 protocol ip prio 5 handle 2: u32 divisor 1 7. tc filter add dev eth0 parent 1:0 prio 4 u32 match ip dst 10.0.0.0/24 police rate 1Mbit burst 2K continue flowid 1:1 8. tc filter add dev eth0 parent 1:0 prio 5 u32 match ip dst 10.0.0.0/24 flowid 1:2 9. tc filter add dev eth0 parent 1:0 prio 4 u32 match ip dst 10.1.0.0/16 match ip src 192.1.0.0/16 match ip protocol 6 0xff match ip dport 0x17 0xffff flowid 1:3 The first line attaches a dsmarker to the interface eth0 on the root node. The second line instructs the dsmarker to remark the DSCP of classid 1:2 by first masking out bits 6 and 7 then ORing that with a value of 0xb8. Note that: This is equivalent to ignoring the ECN bits, and setting the code point value to 0x2e (which happens to be the DSCP for EF). In a similar manner, the Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 16] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 third line instructs the dsmarker to remark the CP of classid 1:2 to 0x1a (DSCP for AF31). The fourth line adds a remarking the class 1:3 DSCPs to 0x12 (DSCP for AF21). These three lines in effect are also registering the classes 1:2, 1:3 and 1:4. Line 5 adds a u32 classifier with priority of 4. Line 6 adds another classifier of a lower priority. Line 7 maps all packets with a source IP address of 10.0.0.0/24 to class 1:1. Line 7 and 8 show how one can attach a meter to a classifier and the reaction to an exceeding of the rate. Basically, the trick is to define two filters matching the same headers with a higher priority one attached with a meter and policing action. The operator "continue" is used to allow a lookup of the next lower priority matching filter. In this case, should the metering be exceeded in class 1:1, the flow is reclassified to class 1:2. Line 9 selects all TCP packets from source subnet 10.1.1.0/16 destined towards subnet 192.1.1.0/16 and sends them to the queue for class 1:3. The overall effect is: all packets coming in from source subnet address 10.0.0.0/24 will get their packets marked with a DSCP of 0x2e (EF class/PHB) up to a point where they start exceeding their allocated rate (of 1Mbps and burst of 2K). In this case, the packets are demoted to class 1:2 where they will be remarked to DSCP 0x18 (AF21). Any TCP packets of origin subnet 10.1.1.0/16 destination subnet 192.1.1.0/16 will be remarked to 0x1A (AF22). It is easy to see that one can build a multi-color marking scheme of large depths using using such cascading filter/metering schemes. 5.2 Core device: EF using CBQ DSCP=0x2E | +-+ +--------+ +-+ +--------+ | +---------+-------+ +-+ +-+ | |->|tc_index|->| |--+-->|tc_index|-->|Class 2:1| pFIFO |--+ | | | | | | +--------+ | | | +--------+ +---------+-------+ | | | | | | | | | | | | | | | | | | | | No match +---------+-------+ | | | | | | | | | +--------------->|Class 2:2| RED |--+-->| |->| | | | | | +---------+-------+ | | | | -->| | | | | | | |--> | | | +--------------------------------------------+ | | | | | | CBQ (2:0) | | | | | +------------------------------------------------+ | | | | | | | +------------------------------------------------------------------+ | | sch_dsmark (1:0) | +----------------------------------------------------------------------+ Figure 13: Configuring EF using CBQ. Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 17] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 The script below is the output of the EF Perl script on the Linux Diffserv Web site. 1. tc qdisc add dev eth0 handle 1:0 root dsmark indices 64 set_tc_index 2. tc filter add dev eth0 parent 1:0 protocol ip prio 1 tcindex mask 0xfc shift 2 3. tc qdisc add dev eth0 parent 1:0 handle 2:0 cbq bandwidth 10Mbit allot 1514 cell 8 avpkt 1000 mpu 64 4. tc class add dev eth0 parent 2:0 classid 2:1 cbq bandwidth 10Mbit rate 1500Kbit avpkt 1000 prio 1 bounded isolated allot 1514 weight 1 maxburst 10 defmap 1 5. tc qdisc add dev eth0 parent 2:1 pfifo limit 5 6. tc filter add dev eth0 parent 2:0 protocol ip prio 1 handle 0x2e tcindex classid 2:1 pass_on 7. tc class add dev eth0 parent 2:0 classid 2:2 cbq bandwidth 10Mbit rate 5Mbit avpkt 1000 prio 7 allot 1514 weight 1 maxburst 21 borrow 8. tc qdisc add dev eth0 parent 2:2 red limit 60KB min 15KB max 45KB burst 20 avpkt 1000 bandwidth 10Mbit probability 0.4 9. tc filter add dev eth0 parent 2:0 protocol ip prio 2 handle 0 tcindex mask 0 classid 2:2 pass_on Line 1 attaches to the root node on interface eth0 a dsmarker which copies the TOS byte into skb->tc_index. Line 2 adds a filter to the root node which exists merely to mask out the ECN bits and extract the DSCP field by shifting to the right by two bits. A classful qdisc using CBQ is attached to node 2:0 (2:0 is the child of the root node 1:0) - this is in line 3. Two child classes are defined out of the 2:0 node. 2:1 is of type CBQ which is bound to a rate of 1.5 Mbps (line 4). A packet counting FIFO qdisc ("pfifo") with a maximum queue size of 5 packets is attached to the CBQ class as the buffer management scheme (line 5). Line 6 adds a "tcindex" classifier which will redirect all packets with a skb->tc_index 0x2e (the DSCP for EF) to classid 2:1 - non 0x2e are allowed to fall through so they can be matched by another filter. Line 7 defines another CBQ class, 2:1, emanating out of node 2:0 - this is intended to be the Best Effort class. The rate is limited to 5 Mbps; however, the class is allowed to borrow extra bandwidth if it is not being used (via the operator "borrow"). Since the EF class does not lend its bandwidth (operator "isolated" line 4), the BE can only borrow up to a maximum of an extra 3.5Mbps. Note that in scenarios where there is no congestion on the wire, this might not be a very smart provisioning scheme since the BE traffic will probably get equivalent traffic performance as EF. The major differentiator in that case will be the priorities. The EF class' Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 18] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 traffic will always be served first as long as there is something on the queue (prio 1 is higher than prio 8 in comparing line 4 and 7). Line 8 attaches RED as the buffer management scheme to be used by the BE class. Line 9 then maps the rest of the packets (without DSCP of 0x2e) to the classid 2:2. The description of the RED and CBQ parameters are beyond the scope of this document. 6. Conclusion We have given a brief introduction to the elements of Linux traffic control in general, and we have explained how the existing infrastructure can be extended in order to support Diffserv. We have then shown how we implemented support for the Diffserv architecture in Linux, using the traffic control framework of recent kernels. We have also described how nodes can be configured using our work. Our implementation provides a very flexible platform for experiments with PHBs already under standardization as well as experiments with new PHBs. It can also serve as a platform for work in other areas of Diffserv, such as edge configuration management and policy management. Future work will focus on the elimination of a few restrictions that still exist in our architecture, and also in the simplification of the configuration procedures. 7. References [1] RFC2475; Blake, Steven; Black, David; Carlson, Mark; Davies, Elwyn; Wang, Zheng; Weiss, Walter. An Architecture for Differentiated Services, IETF, December 1998. [2] RFC2474; Nichols, Kathleen; Blake, Steven; Baker, Fred; Black, David. Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers, IETF, December 1998. [3] Almesberger, Werner. Linux Traffic Control - Implementation Overview, ftp://lrcftp.epfl.ch/pub/people/almesber/pub/tcio-current.ps.gz, Technical Report SSC/1998/037, EPFL, November 1998. [4] RFC2598; Jacobson, Van; Nichols, Kathleen; Poduri, Kedarnath. An Expedited Forwarding PHB, IETF, June 1999. [5] Floyd, Sally; Jacobson, Van. Link-sharing and Resource Management Models for Packet Networks, IEEE/ACM Transactions on Networking, Vol. 3 No. 4, pp. 365-386, August 1995. [6] RFC2597; Heinanen, Juha; Baker, Fred; Weiss, Walter; Wroclawski, John. Assured Forwarding PHB Group, IETF, June 1999. [7] RFC1349; Almquist, Philip. Type of Service in the Internet Protocol Suite, IETF, July 1992. [8] CISCO DWRED. Distributed Weighted Random Early Detection, http://www.cisco.com/univercd/cc/td/doc/product/software/ios111/cc111/wred.htm [9] Clark, David; Wroclawski, John. An Approach to Service Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 19] INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999 Allocation in the Internet, Internet Draft draft-clark-diff-svc-alloc-00.txt, July 1997. [10] Floyd, Sally; Jacobson, Van. Random Early Detection Gateways for Congestion Avoidance, IEEE/ACM Transactions on Networking, August 1993. 8. Author's address Werner Almesberger Institute for computer Communications and Applications Swiss Federal Institute of Technology (EPFL) CH-1015 Lausanne Switzerland email: Werner.Almesberger@epfl.ch Jamal Hadi Salim Computing Technology Labs Nortel Networks P.O.BOX 3511, Station C Ottawa, Ontario Canada K1Y 4H7 email: hadi@nortelnetworks.com Alexey Kuznetsov Institute for Nuclear Research (INR) 60th October Anniversary pr. 7a Moscow 117312 Russia email: kuznet@ms2.inr.ac.ru Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 20]