INTERNET-DRAFT                                      Werner Almesberger
draft-wajhak-diffserv-linux-00.txt               EPFL ICA, Switzerland
Expires: August 1999                                  Jamal Hadi Salim
                                                   CTL Nortel Networks
                                                      Alexey Kuznetsov
                                                            INR Moscow
                                                         February 1999

                   Differentiated Services on Linux

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   Recent Linux kernels offer a wide variety of traffic control
   functions which can be combined in a modular way.  We have designed
   support for Differentiated Services based on the existing traffic
   control elements, and we have implemented new components where
   necessary.  In this document we give a brief overview of the
   structure of Linux traffic control, and we describe our prototype
   implementation in more detail.

1. Introduction

   The Differentiated Services architecture (Diffserv) lays the
   foundation for implementing service differentiation in the Internet
   in an efficient, scalable way.  We assume that readers are familiar
   with the concepts and terminology defined in [1].  Furthermore, we
   assume familiarity with packet marking as described in [2].

   We have developed a design to support the basic classification and
   DS field manipulation required by Diffserv nodes, and to configure
   the first PHBs being defined in the Diffserv WG.  We have
   implemented a prototype of this design using the traffic control
   framework available in recent Linux kernels.  The source code and
   related information can be obtained from
   http://lrcwww.epfl.ch/linux-diffserv/

   The main focus of our work is to allow a maximum of flexibility for
   node configuration and for experiments with PHBs, while still
   maintaining a design that does not unnecessarily sacrifice
   performance.

   This document is structured as follows.  Section "Linux Traffic
   Control" gives a brief overview of traffic control functions in
   recent Linux kernels.  Section "Diffserv extensions to Linux
   traffic control" discusses where the existing model needed to be
   extended.  Section "New components" describes the new components in
   more detail.  We conclude with examples of configuration scripts in
   section "Building sample configurations".

2. Linux Traffic Control

   Figure 1 shows roughly how the kernel processes data received from
   the network, and how it generates new data to be sent on the
   network.

                  +---------------+
            +---->| TCP, UDP, ... |
            |     +---------------+
            |             |
            |             |            TRAFFIC CONTROL
            |             v
      +---------------+  +------------+  +----------------+
   -->|   Input de-   |->| Forwarding |->| Output queuing |-->
      | multiplexing  |  +------------+  +----------------+
      +---------------+

               Figure 1: Processing of network data.

   "Forwarding" includes the selection of the output interface, the
   selection of the next hop, encapsulation, etc.  Once all this is
   done, packets are queued on the respective output interface.
   This is the point where traffic control comes into play.  Traffic
   control can, among other things, decide whether packets are queued
   or dropped (e.g. if the queue has reached some length limit, or if
   the traffic exceeds some rate limit); it can decide in which order
   packets are sent (e.g. to give priority to certain flows); and it
   can delay the sending of packets (e.g. to limit the rate of
   outbound traffic).

   Once traffic control has released a packet for sending, the device
   driver picks it up and emits it on the network.

2.1 Components

   The traffic control code in the Linux kernel consists of the
   following major conceptual components:

    - queuing disciplines
    - classes (within a queuing discipline)
    - filters
    - policing

   Each network device has a queuing discipline associated with it,
   which controls how packets enqueued on that device are treated.  A
   very simple queuing discipline may just consist of a single queue,
   where all packets are stored in the order in which they have been
   enqueued, and which is emptied as fast as the respective device can
   send.  See figure 2 for such a queuing discipline without
   externally visible internal structure.

           +--------------------+
      ---->| Queuing discipline |---->
           +--------------------+

      Figure 2: A simple queuing discipline without classes.

   More elaborate queuing disciplines may use filters to distinguish
   among different classes of packets and process each class in a
   specific way, e.g. by giving one class priority over other classes.

   +--------------------------------------------------------------+
   |      +------+                                                |
   |  +-->|Filter|--+   +-------+   +--------------------+        |
   |  |   +------+  +-->| Class |-->| Queuing discipline |--+     |
   |  +-->|Filter|--+   +-------+   +--------------------+  |     |
   |  |   +------+                                          |     |
 --+--+                                                     +--+-->
   |  |   +------+      +-------+   +--------------------+  |     |
   |  +-->|Filter|----->| Class |-->| Queuing discipline |--+     |
   |      +------+      +-------+   +--------------------+        |
   |                   Queuing discipline                         |
   +--------------------------------------------------------------+

    Figure 3: A simple queuing discipline with multiple classes.

   Figure 3 shows an example of such a queuing discipline.  Note that
   multiple filters may map to the same class.

   Queuing disciplines and classes are intimately tied together: the
   presence of classes and their semantics are fundamental properties
   of the queuing discipline.  In contrast to that, filters can be
   combined arbitrarily with queuing disciplines and classes, as long
   as the queuing discipline has classes at all.

   But flexibility doesn't end yet: classes normally don't take care
   of storing their packets themselves; instead, they use another
   queuing discipline for that.  That queuing discipline can be chosen
   arbitrarily from the set of available queuing disciplines, and it
   may well have classes of its own, which in turn use queuing
   disciplines, etc.

   +--------------------------------------------------------------+
   |      +------+   +--------+   +-----------------+             |
   |  +-->|Filter|-->| "high" |-->| TBF, rate=1Mbps |---+         |
   |  |   +------+   +--------+   +-----------------+   |         |
 --+--+                                                 +------+-->
   |  |   Default    +--------+   +-----------------+   |      |
   |  +------------->| "low"  |-->|      FIFO       |---+      |
   |                 +--------+   +-----------------+          |
   |       Queuing discipline with two delay priorities        |
   +--------------------------------------------------------------+

      Figure 4: Combination of priority, TBF, and FIFO queuing
                             disciplines.

   Figure 4 shows an example of such a stack: first, there is a
   queuing discipline with two delay priorities.  Packets which are
   selected by the filter go to the high-priority class, while all
   other packets go to the low-priority class.  Whenever there are
   packets in the high-priority queue, they are sent before packets in
   the low-priority queue (e.g. the sch_prio queuing discipline works
   this way).  In order to prevent high-priority traffic from starving
   low-priority traffic, we use a token bucket filter (TBF), which
   enforces a rate of at most 1 Mbps.  Finally, the queuing of
   low-priority packets is done by a FIFO queuing discipline.  Note
   that there are better ways to accomplish what we've done here, e.g.
   by using class-based queuing (CBQ).
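   For readers who wish to experiment, the stack of figure 4 can be
   approximated with the tc command-line tool.  The following is only
   a sketch: the device name, the address, and the rate/burst/limit
   values are illustrative, and the exact keywords accepted depend on
   the iproute2 version.

      # two-band priority qdisc; band 1:1 is served first, and
      # unfiltered traffic goes to band 1:2 (the all-ones priomap)
      tc qdisc add dev eth0 root handle 1:0 prio bands 2 \
          priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      # high-priority band: token bucket filter enforcing 1 Mbps
      tc qdisc add dev eth0 parent 1:1 handle 10:0 tbf \
          rate 1Mbit burst 10kb limit 30kb
      # low-priority band: plain FIFO
      tc qdisc add dev eth0 parent 1:2 handle 20:0 pfifo limit 100
      # the filter that selects high-priority traffic, here
      # (arbitrarily) everything from 10.0.0.1
      tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 \
          match ip src 10.0.0.1 flowid 1:1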
   Packets are enqueued as follows: when the enqueue function of a
   queuing discipline is called, it runs one filter after the other
   until one of them indicates a match.  It then queues the packet for
   the corresponding class, which usually means invoking the enqueue
   function of the queuing discipline "owned" by that class.  Packets
   which do not match any of the filters are typically attributed to
   some default class.

   Typically, each class "owns" one queue, but it is in principle also
   possible that several classes share the same queue, or even that a
   single queue is used by all classes of the respective queuing
   discipline.  Note, however, that packets do not carry any explicit
   indication of which class they were attributed to.  Queuing
   disciplines that change per-class information when dequeuing
   packets (e.g. CBQ) will therefore not work properly if the "inner"
   queues are shared, unless they are able either to repeat the
   classification or to pass the classification result from enqueue to
   dequeue by some other means.

   Usually, the corresponding flow(s) can also be policed when packets
   are enqueued, e.g. by discarding packets which exceed a certain
   rate.
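   As a brief illustration, a u32 filter with an attached policing
   function might be configured as follows.  This is a sketch only: it
   assumes a classful queuing discipline 1:0 with a class 1:1 already
   in place, and the address and rate values are arbitrary.

      # packets from 10.0.0.1 go to class 1:1 as long as they stay
      # below 1 Mbps; excess packets are dropped at enqueue time
      tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 \
          match ip src 10.0.0.1 \
          police rate 1Mbit burst 10k drop flowid 1:1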
3. Diffserv extensions to Linux traffic control

   The traffic control framework available in recent Linux kernels [3]
   already offers most of the functionality required for implementing
   Diffserv support.  We therefore closely followed the existing
   design and added new components only where strictly necessary.

3.1 Overview

   Figure 5 shows the general structure of the forwarding path in a
   Diffserv node.

       +---------+     +-----+
       | Classi- |---->| PHB |--+   +---------+
    -->| fier &  |     +-----+  +-->| Marking |-->
       | Meter   |---->| PHB |--+   +---------+
       +---------+     +-----+

          Figure 5: General Diffserv forwarding path.

   Depending on the implementation, marking may also occur at
   different places, possibly even several times.

   The classification result may be used several times in the Diffserv
   processing path, and it may also depend on external factors (e.g.
   time), so reproducing it later may be not merely expensive, but
   outright impossible.  We therefore added a new field tc_index to
   the packet buffer descriptor (struct sk_buff), where we store the
   result of the initial classification.  In order to avoid confusing
   tc_index with the classifier cls_tcindex, we will call the former
   skb->tc_index throughout this document.

   skb->tc_index is set using the sch_dsmark queuing discipline, which
   is also responsible for initially retrieving the DSCP, and for
   setting the DS field in packets before they are sent on the
   network.  sch_dsmark provides the framework for all other
   operations.

   The cls_tcindex classifier reads all or part of the skb->tc_index
   field and uses this to select classes.

   Finally, we need a queuing discipline to support multiple drop
   priorities as required for Assured Forwarding.  For this, we
   designed GRED, a generalized RED.  sch_gred provides a configurable
   number of drop priorities which are selected by the lower bits of
   skb->tc_index.

3.2 Classification and marking

   The classifiers cls_rsvp and cls_u32 can handle all micro-flow
   classification tasks.  In principle, behavior aggregate
   classification could also be done using cls_u32, but since we
   usually already have sch_dsmark at the top level, we use the
   simpler cls_tcindex and retrieve the DSCP using sch_dsmark, which
   then puts it into skb->tc_index.

   When using sch_dsmark, the class number returned by the classifier
   is stored in skb->tc_index.  This way, the result can be re-used
   during later processing steps.

   Nodes in multiple DS domains must also be able to distinguish
   packets by the inbound interface in order to translate the DSCP to
   the correct PHB.  This can be done using the route classifier, in
   combination with ip rule.

   Marking is done when a packet is dequeued from sch_dsmark.
   sch_dsmark uses skb->tc_index as an index into a table in which the
   outbound DSCP is stored, and puts this value into the packet's DS
   field.

      [diagram: cls_rsvp inside sch_dsmark classifies the packet and
       sets the initial value of skb->tc_index; an inner queuing
       discipline may read and change skb->tc_index; the DS field
       (skb->ihp->tos) is written from skb->tc_index at dequeue]

                  Figure 6: Micro-flow classifier.

   Figure 6 shows the use of sch_dsmark and skb->tc_index in a
   micro-flow classifier based on cls_rsvp.  Figure 7 shows a behavior
   aggregate classifier using cls_tcindex.

      [diagram: as in figure 6, but the DS field of the incoming
       packet is first copied to skb->tc_index, which cls_tcindex
       then uses for classification; the DS field may be changed at
       dequeue]

              Figure 7: Behaviour aggregate classifier.
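   The essence of the behaviour aggregate classifier of figure 7 can
   be expressed in two tc commands, shown here as a sketch (the
   complete configuration appears in section "Building sample
   configurations"):

      # dsmark qdisc; set_tc_index copies the DS field into
      # skb->tc_index at enqueue time
      tc qdisc add dev eth0 handle 1:0 root dsmark indices 64 \
          set_tc_index
      # mask out the two ECN bits and shift, so that skb->tc_index
      # contains the DSCP
      tc filter add dev eth0 parent 1:0 protocol ip prio 1 \
          tcindex mask 0xfc shift 2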
3.3 Cascaded classifiers

   Linux traffic control supports a limited form of cascading of
   classifiers: if multiple classifiers are specified for the same
   class, they are invoked in sequence until one of them does not
   return a "no match" code.  This can also be used to configure
   multiple meters, e.g. for "low", "high", and "excess" traffic.  To
   this end, we have added the possibility for a metering decision to
   yield "no match".

   Note that cascading classifiers in this way is not sufficiently
   flexible for more demanding classification schemes.  We are
   currently examining approaches to further generalize
   classification.

3.4 Implementing PHBs

   PHBs based only on delay priorities, e.g. Expedited Forwarding [4],
   can be built using CBQ [5] or the simpler sch_prio.  (See section
   "Building sample configurations".)

   Besides four delay priorities, which can again be implemented with
   already existing components, Assured Forwarding [6] also needs
   three drop priorities, which is more than the current
   implementation of RED supports.  We therefore added a new queuing
   discipline which we call "generalized RED" (GRED).  GRED uses the
   lower bits of skb->tc_index to select the drop class and hence the
   corresponding set of RED parameters.

3.5 Shaping

   The so-called Token Bucket Filter (sch_tbf) can be used for shaping
   at edge nodes.  Unfortunately, the highest rate at which sch_tbf
   can shape is limited by the system timer, which normally ticks at
   100 Hz but can be accelerated to 1 kHz or more.  Higher rates can
   be shaped using hardware-based solutions, such as ATM.
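   As an example, shaping the aggregate output of an edge interface to
   2 Mbps could look as follows.  This is a sketch: the rate, burst,
   and limit values are illustrative, and the achievable accuracy is
   bounded by the timer resolution discussed above.

      # shape everything leaving eth0 to 2 Mbps
      tc qdisc add dev eth0 root handle 1:0 tbf \
          rate 2Mbit burst 20kb limit 60kb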
4. New components

   The prototype implementation of Diffserv support required the
   addition of three new traffic control elements to the kernel: (1)
   the queuing discipline sch_dsmark to extract and to set the DSCP,
   (2) the classifier cls_tcindex which uses this information, and (3)
   the queuing discipline sch_gred which supports multiple drop
   priorities.  Only the queuing discipline to extract and set the
   DSCP is truly specific to the differentiated services architecture.
   The other two elements can also be used in other contexts.

   Figure 6 shows the use of sch_dsmark for the initial packet marking
   when entering a Diffserv domain.  The classification and rate
   control is performed by a micro-flow classifier, e.g. cls_rsvp,
   which is designed to identify RSVP flows.  This classifier
   determines the initial TC index, which is then stored in
   skb->tc_index.  Afterwards, further processing is performed by an
   inner queuing discipline.  Note that this queuing discipline may
   read and even change skb->tc_index.  When a packet leaves
   sch_dsmark, skb->tc_index is examined and the Diffserv field of the
   packet is set accordingly.

   Figure 7 shows the use of sch_dsmark and cls_tcindex in a node
   which works on a behavior aggregate, i.e. on packets with the
   Diffserv field already set.  The procedure is quite similar to the
   previous scenario, except that cls_tcindex takes over the role of
   cls_rsvp, and that the DS field of the incoming packet is copied to
   skb->tc_index before the classifier is invoked.

   Note that the value of the outbound DS field can be changed in
   three ways: (1) by establishing a mapping from tc_index to the DS
   field that is different from the mapping that was used during
   classification, (2) by mapping the DS field of inbound packets to
   tc_index values that will be translated to different DS field
   values on output, and (3) by changing tc_index in the inner queuing
   discipline.  Because the mapping in case (1) is more efficient than
   the mapping in case (3), any numbering scheme should try to use the
   DS field values of incoming packets also for tc_index.

4.1 sch_dsmark

   As illustrated in figure 8, the sch_dsmark queuing discipline
   performs three actions:

    - If set_tc_index is set, it retrieves the content of the DS field
      and stores it in skb->tc_index.

    - It invokes a classifier and stores the class ID returned in
      skb->tc_index.  If the classifier finds no match, default_index
      is used instead.

    - After sending the packet through its inner queuing discipline,
      it uses the resulting value of skb->tc_index as an index into a
      table of (mask,value) pairs.  The original value of the DS field
      is then replaced using the following formula:

         ds_field = (ds_field & mask) | value

      [diagram: sch_dsmark; the DS field (skb->ihp->tos) is
       optionally copied to skb->tc_index; filters set skb->tc_index
       to the returned class ID, or to default_index if nothing
       matches; the packet passes through an inner queuing
       discipline, which may change skb->tc_index; finally the
       (mask,value) table entry selected by skb->tc_index rewrites
       the DS field]

               Figure 8: The dsmark queuing discipline.

   Table 1 lists the parameters that can be configured in the dsmark
   queuing discipline.  The upper part of the table shows the
   parameters of the queuing discipline itself; the lower part shows
   the parameters of each class.

      ---------------------------------------------------------
      | Variable name / tc keyword | Value          | Default  |
      ---------------------------------------------------------
      | indices                    | 2^n            | none     |
      | default_index              | 0...indices-1  | 0        |
      | set_tc_index               | none (flag)    | absent   |
      ---------------------------------------------------------
      | mask                       | 0...0xff       | 0xff     |
      | value                      | 0...0xff       | 0        |
      ---------------------------------------------------------

          Table 1: Configuration parameters of sch_dsmark.

   indices is the size of the table of (mask,value) pairs.
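   As a brief illustration of the table rewrite (a sketch only; the
   device and the values are arbitrary, and the (mask,value)
   convention follows the edge script in section "Building sample
   configurations"): a packet whose skb->tc_index ends up as 1 leaves
   with ds_field = (ds_field & 0xc0) | 0x2e.

      # dsmark qdisc with a four-entry (mask,value) table
      tc qdisc add dev eth0 handle 1:0 root dsmark indices 4 \
          default_index 0
      # entry 1 of the table: keep the upper two bits, OR in 0x2e
      tc class change dev eth0 classid 1:1 dsmark mask 0xc0 value 0x2e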
4.2 cls_tcindex

   As shown in figure 9, the cls_tcindex classifier uses
   skb->tc_index to select classes.  It first calculates the lookup
   key using the algorithm

      key = (skb->tc_index >> shift) & mask

   Then it looks for an entry with this handle.  If an entry is found,
   it may call a meter (if one is configured), and it returns the
   class ID of the corresponding class.  If no entry is found, the
   result depends on fall_through: if fall_through is set, a class ID
   is constructed directly from the lookup key; otherwise, a "not
   found" indication is returned.  We call this construction of the
   class ID an "algorithmic mapping".  It can be used to avoid setting
   up a large number of classifier elements when there is a
   sufficiently simple relation between the values of skb->tc_index
   and the class IDs.

                     +-------------------+
                     |  shift      mask  |
                     |    |         |    |
     skb->tc_index --+-->(>>)----->(&)   |
                     +--------------|----+
                                    | key
                                    v
                     +------------------------+
                     | key  class(id)  police |
                     +------------------------+
                     | key  class(id)  police |---> Profile
                     +------------------------+
                              |
                              +---> Class
                                    (*: key if fall_through)

                  Figure 9: The tcindex classifier.

   Table 2 shows the parameters that can be configured in the tcindex
   classifier.  The upper part of the table shows the parameters of
   the classifier itself; the lower part shows the parameters of each
   element.

     -------------------------------------------------------------
     | Variable     | tc keyword    | Value       | Default      |
     -------------------------------------------------------------
     | mask         | mask          | 0...0xffff  | 0xffff       |
     | shift        | shift         | 0...15      | 0            |
     | fall_through | fall_through/ | flag        | fall_through |
     |              | pass_on       |             |              |
     -------------------------------------------------------------
     | res          | classid       | major:minor | none         |
     | police       | police        | Profile     | none         |
     -------------------------------------------------------------

          Table 2: Configuration parameters of cls_tcindex.

   Note that the keyword used by tc (the command-line tool used to
   manually configure traffic control elements) does not always
   correspond to the variable name used internally by cls_tcindex.

4.3 sch_gred

      [diagram: per-class physical queue containing virtual queues
       VQ0...VQn; skb->tc_index selects the virtual queue, and hence
       the set of RED parameters that decides whether the packet is
       queued or dropped]

        Figure 10: Generic RED and the use of skb->tc_index.

   Figure 10 shows how sch_gred uses skb->tc_index to select the
   right virtual queue within a physical queue.  What makes sch_gred
   different from other multi-RED implementations is that it is
   decoupled from any specific classification scheme, and indeed from
   any particular classifier.  For example, Cisco's WRED ties the
   selection of the virtual queue to a classification based on the
   precedence bits, while RIO ties it to the IN/OUT levels.  With
   GRED, any classifier, meter, or policer along the data path can
   affect the selection of the virtual queue by setting the
   appropriate value of skb->tc_index.

   GRED also differs from the two multiple-RED mechanisms mentioned
   above in that it is not limited to a specific number of virtual
   queues; the number of virtual queues is configurable for each class
   queue.  GRED does not assume particular drop precedences (or
   priorities) either; they depend entirely on the configuration
   parameters supplied by the user.  In essence, WRED and RIO are
   special cases of GRED.

   Currently, the number of virtual queues is limited to 16 (the least
   significant 4 bits of skb->tc_index).  There is a one-to-one
   mapping between the values of skb->tc_index and the virtual queue
   number within a class.
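   The following sketch shows how three virtual queues might be
   configured with tc.  It assumes a class 2:1 already exists (e.g. a
   CBQ class as in the next section), and it uses the keyword set of
   the gred support distributed with our prototype (setup, DPs,
   default, grio, DP, prio); keyword names and all RED parameters are
   illustrative and may differ between versions.

      # create three virtual queues (DPs); unclassified traffic
      # uses DP 2
      tc qdisc add dev eth0 parent 2:1 gred setup DPs 3 default 2 grio
      # one set of RED parameters per virtual queue; higher DPs drop
      # more aggressively
      tc qdisc change dev eth0 parent 2:1 gred limit 60KB min 15KB \
          max 45KB burst 20 avpkt 1000 bandwidth 10Mbit \
          DP 1 probability 0.02 prio 2
      tc qdisc change dev eth0 parent 2:1 gred limit 60KB min 15KB \
          max 45KB burst 20 avpkt 1000 bandwidth 10Mbit \
          DP 2 probability 0.04 prio 3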
5. Building sample configurations

   Given the flexibility of the code, there are many ways to reach the
   same end goal.  Depending on the requirements, the same PHB can be
   scripted using different combinations of qdiscs; e.g. one could
   build an EF-capable core router either with CBQ to rate-limit and
   prioritise the traffic, or with the PRIO qdisc and an attached
   Token Bucket for rate limiting.  It is hoped that users of Linux
   Diffserv will script their own flavored configurations.

   The examples below are simplistic, in the sense that they assume
   only one interface per node.  The lines are numbered for clarity of
   the description below.

   The normal recipe for creating a configuration script is:

    - attach a classful qdisc to a device,
    - define your classes, and
    - identify which packets go to which classes.

5.1 Edge device: Packet re-marking

   1. tc qdisc add dev eth0 handle 1:0 root dsmark indices 64
   2. tc class change dev eth0 classid 1:2 dsmark mask 0xc0 \
          value 0x2e
   3. tc class change dev eth0 classid 1:3 dsmark mask 0xc0 \
          value 0x18
   4. tc class change dev eth0 classid 1:4 dsmark mask 0xc0 \
          value 0x1a
   5. tc filter add dev eth0 parent 1:0 protocol ip prio 5 \
          handle 1: u32 divisor 256
   6. tc filter add dev eth0 parent 1:0 prio 4 u32 ht 1:6: \
          match ip src 10.0.0.1 \
          flowid 1:2
   7. tc filter add dev eth0 parent 1:0 prio 4 u32 ht 1:7: \
          police rate 1000kbit burst 1000 action -1 \
          match ip src 11.0.0.1 \
          flowid 1:3
   8. tc filter add dev eth0 parent 1:0 prio 5 u32 ht 1:8: \
          match ip src 11.0.0.1 \
          flowid 1:4
   9. tc filter add dev eth0 parent 1:0 prio 5 handle ::1 \
          u32 ht 800:: \
          match ip nofrag \
          offset mask 0x0F00 shift 6 \
          hashkey mask 0x00ff0000 at 8 \
          link 1:

   The first line attaches a dsmarker to the root node of interface
   eth0; the dsmarker is capable of setting skb->tc_index by copying
   the DS field into it.  The second line instructs the dsmarker to
   remark the DSCP of classid 1:2 to 0x2e (which happens to be the
   DSCP for EF).  Similarly, the third line instructs the dsmarker to
   remark the DSCP of classid 1:3 to 0x18 (the DSCP for AF21), and the
   fourth line remarks the DSCPs of class 1:4 to 0x1a (the DSCP for
   AF22).  These three lines in effect also register the classes 1:2,
   1:3, and 1:4.

   Line 5 adds a u32 classifier with 256 hash buckets.  Line 6 maps
   all packets with a source IP address of 10.0.0.1 to class 1:2.
   Lines 7 and 8 show how one can attach a meter to a classifier, and
   what happens when the rate is exceeded.  Basically, the trick is to
   define two filters matching the same headers, the higher-priority
   one with a meter and a policing action attached; the action -1
   stands for fall-through.  Line 7 matches all packets whose source
   IP address is 11.0.0.1, up to a certain rate.  If the rate exceeds
   1000 kbps (with a burst of 1000 bytes), the action is to continue
   searching for the next filter.  In this case, the next
   lower-priority filter with the same match is the one on line 8,
   which redirects the packet to class 1:4.

   The overall effect is: all packets coming in from source IP address
   10.0.0.1 get marked with a DSCP of 0x2e (EF).  All packets from
   source IP address 11.0.0.1 get marked with 0x18, i.e. AF class
   AF21.  When the meter starts reporting that the flows from 11.0.0.1
   exceed their metering rate, they get remarked to AF22 (DSCP 0x1a).
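   Whether the meter and the remarking behave as intended can be
   checked with the statistics option of tc; the exact output format
   varies between iproute2 versions.

      tc -s qdisc show dev eth0               # per-qdisc counters
      tc -s class show dev eth0               # per-class counters
      tc -s filter show dev eth0 parent 1:0   # filter/police hits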
5.2 Core device: EF using CBQ

   The script below is the output of the EF perl script available on
   the Linux Diffserv web site.

   1. tc qdisc add dev eth0 handle 1:0 root dsmark indices 64 \
          set_tc_index
   2. tc filter add dev eth0 parent 1:0 protocol ip prio 1 \
          tcindex mask 0xfc shift 2
   3. tc qdisc add dev eth0 parent 1:0 handle 2:0 cbq \
          bandwidth 10Mbit allot 1514 cell 8 avpkt 1000 mpu 64
   4. tc class add dev eth0 parent 2:0 classid 2:1 cbq \
          bandwidth 10Mbit \
          rate 1500Kbit avpkt 1000 prio 1 bounded isolated \
          allot 1514 weight 1 maxburst 10 defmap 1
   5. tc qdisc add dev eth0 parent 2:1 pfifo limit 5
   6. tc filter add dev eth0 parent 2:0 protocol ip prio 1 \
          handle 0x2e tcindex classid 2:1 pass_on
   7. tc class add dev eth0 parent 2:0 classid 2:2 cbq \
          bandwidth 10Mbit rate 5Mbit avpkt 1000 prio 7 \
          allot 1514 weight 1 maxburst 21 borrow
   8. tc qdisc add dev eth0 parent 2:2 red limit 60KB min 15KB \
          max 45KB burst 20 avpkt 1000 bandwidth 10Mbit \
          probability 0.4
   9. tc filter add dev eth0 parent 2:0 protocol ip prio 2 \
          handle 0 tcindex mask 0 classid 2:2 pass_on

   Line 1 attaches to the root node of interface eth0 a dsmarker which
   copies the TOS byte into skb->tc_index.  Line 2 adds a filter to
   the root node which exists merely to mask out the ECN bits and
   extract the DSCP field by shifting it to the right by two bits.  In
   line 3, a classful qdisc using CBQ is attached to node 2:0 (2:0 is
   the child of the root node 1:0).  Two child classes are defined
   under the 2:0 node.  Class 2:1 is of type CBQ and is bounded to a
   rate of 1.5 Mbps (line 4).  A packet-counting FIFO qdisc (pfifo)
   with a maximum queue size of 5 packets is attached to this CBQ
   class as its buffer management scheme (line 5).  Line 6 adds a
   tcindex classifier which redirects all packets with a skb->tc_index
   of 0x2e (the DSCP for EF) to classid 2:1; packets with other values
   are allowed to fall through, so that they can be matched by another
   filter.

   Line 7 defines another CBQ class, 2:2, under node 2:0; this is
   intended to be the Best Effort class.  Its rate is limited to
   5 Mbps; however, the class is allowed to borrow extra bandwidth if
   it is not being used (via the operator borrow).  Since the EF class
   does not lend its bandwidth (operator isolated, line 4), the BE
   class can only borrow up to a maximum of an extra 3.5 Mbps.  Note
   that in scenarios where there is no congestion on the wire, this
   might not be a very smart provisioning scheme, since the BE traffic
   will probably see performance equivalent to EF.  The major
   differentiator in that case will be the priorities: the EF class'
   traffic will always be served first as long as there is something
   in its queue (prio 1 is higher than prio 7; compare lines 4 and 7).
   Line 8 attaches RED as the buffer management scheme to be used by
   the BE class.  Line 9 then maps the rest of the packets (those
   without a DSCP of 0x2e) to classid 2:2.

   A description of the RED and CBQ parameters is beyond the scope of
   this document.
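   For comparison, the PRIO-based alternative mentioned at the
   beginning of this section could be sketched as follows.  The dsmark
   and tcindex lines mirror the CBQ script above; the TBF parameters
   are illustrative only.

      # EF packets (skb->tc_index 0x2e) go to band 2:1, which is
      # rate-limited by a TBF; everything else falls into band 2:2
      tc qdisc add dev eth0 handle 1:0 root dsmark indices 64 \
          set_tc_index
      tc filter add dev eth0 parent 1:0 protocol ip prio 1 \
          tcindex mask 0xfc shift 2
      tc qdisc add dev eth0 parent 1:0 handle 2:0 prio bands 2 \
          priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      tc qdisc add dev eth0 parent 2:1 tbf rate 1.5Mbit burst 5kb \
          limit 10kb
      tc filter add dev eth0 parent 2:0 protocol ip prio 1 \
          handle 0x2e tcindex classid 2:1 pass_on
      tc filter add dev eth0 parent 2:0 protocol ip prio 2 \
          handle 0 tcindex mask 0 classid 2:2 pass_on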
6. Conclusion

   We have given a brief introduction to the elements of Linux traffic
   control in general, and we have explained how the existing
   infrastructure can be extended to support Diffserv.  We have then
   shown how we implemented support for the Diffserv architecture in
   Linux, using the traffic control framework of recent kernels, and
   we have described how nodes can be configured using our work.

   Our implementation provides a very flexible platform for
   experiments with PHBs already under standardization, as well as
   with new PHBs.  It can also serve as a platform for work in other
   areas of Diffserv, such as edge configuration management.

   Future work will focus on eliminating a few restrictions that still
   exist in our architecture, and on simplifying the configuration
   procedures.

7. References

   [1] Blake, Steven; Black, David; Carlson, Mark; Davies, Elwyn;
       Wang, Zheng; Weiss, Walter.  An Architecture for Differentiated
       Services, RFC 2475, IETF, December 1998.

   [2] Nichols, Kathleen; Blake, Steven; Baker, Fred; Black, David.
       Definition of the Differentiated Services Field (DS Field) in
       the IPv4 and IPv6 Headers, RFC 2474, IETF, December 1998.

   [3] Almesberger, Werner.  Linux Traffic Control - Implementation
       Overview, Technical Report SSC/1998/037, EPFL, November 1998.
       ftp://lrcftp.epfl.ch/pub/people/almesber/pub/tcio-current.ps.gz

   [4] Jacobson, Van; Nichols, Kathleen; Poduri, Kedarnath.  An
       Expedited Forwarding PHB (work in progress), Internet Draft
       draft-ietf-diffserv-phb-ef-02.txt, February 1999.

   [5] Floyd, Sally; Jacobson, Van.  Link-sharing and Resource
       Management Models for Packet Networks, IEEE/ACM Transactions on
       Networking, Vol. 3 No. 4, pp. 365-386, August 1995.

   [6] Heinanen, Juha; Baker, Fred; Weiss, Walter; Wroclawski, John.
       Assured Forwarding PHB Group (work in progress), Internet Draft
       draft-ietf-diffserv-af-06.txt, February 1999.

8. Authors' addresses

   Werner Almesberger
   Institute for Computer Communications and Applications
   Swiss Federal Institute of Technology (EPFL)
   CH-1015 Lausanne
   Switzerland
   email: Werner.Almesberger@epfl.ch

   Jamal Hadi Salim
   CTL Nortel Networks
   email: hadi@nortelnetworks.com

   Alexey Kuznetsov
   INR Moscow
   email: kuznet@ms2.inr.ac.ru