Internet Engineering Task Force
INTERNET DRAFT

Authors
   Mandis Beigi
   Raymond Jennings
   Srinivasa Rao
   Dinesh Verma
   IBM T J Watson Research Center

18 November 1998

Supporting Service Level Agreements using Differentiated Services

Status of Memo

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as ``work in progress.''

To learn the current status of any Internet-Draft, please check the ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).

Beigi Jennings Rao Verma     Expires 18 May 1998     [Page i]
Internet Draft     draft-verma-diffserv-ntimplem-00.txt     18 November 1998

Abstract

This document describes an NT-based implementation of a Differentiated Services access router as described in [DSARCH]. It covers our access router implementation, the lessons learned, and the issues in supporting service level agreements using the differentiated services architecture.

1. Introduction

Differentiated Services is a mechanism by which network providers can offer their customers a range of network services which are differentiated by the performance level to be observed within the provider's network. The description of differentiated services is provided in its DS Field Specification [DSHEAD], and its architecture [DSARCH] and framework [DSFRAME] documents.

We have attempted to implement an access router to exploit the capabilities offered by core routers that support differentiated services. In this document, we describe the implementation of the access router, including the mechanisms implemented in it, and the techniques used for consistent configuration of access routers and monitoring service level agreements (SLAs) in a differentiated services environment.

Section 2 describes the overall architecture of the system we implemented, and motivates the design choices that we made. It also describes the mechanisms implemented in the control path and the data-path of the access router. Section 3 describes some of the lessons that we learned from this implementation. Section 4 describes an attempt to implement a virtual leased line service and our experience in doing so within the constraints of an NT environment. The appendices in the draft provide a description of the schema and message formats needed for communication among the different components.

We hope that this draft will provide useful information to implementors of differentiated services and contribute towards the goal of using differentiated services to support service level agreements as defined in the framework document.

2. System Architecture

The overall architecture consists of three types of boxes: an access router, a policy server, and a control server. The architecture is illustrated in Figure 1.

   [Customer Network]<----->[Access Router]<-------->[Core Router]
                                                          |--------[Policy Server]
                                                          |--------[Control Server]
                                                          |
   [Customer Network]<----->[Access Router]<-------->[Core Router]

                              Figure 1.

The function of the access routers is as described in the Differentiated Services architecture document. Each access router is responsible for collecting traffic and performance statistics, and for classifying, marking and policing the packets received from the customer network.
The core routers provide differentiated services to packets marked differently.

The access routers are implemented on a Windows NT platform, and consist of two components, a data-path component and a control-path component. The data-path component is responsible for marking, policing and classifying packets using information available at the network and transport protocol levels. The control-path component enables classification and marking using higher application-level information, and incorporates functionality to communicate with the policy and the control servers.

The two servers, namely the policy server and the control server, are not described explicitly in the Differentiated Services architecture. However, they are required if we need to support service level agreements using differentiated services. The policy server is used to provide a consistent definition of the different types of services that are to be supported in the network, a consistent classification of traffic across different classes, and the target performance metrics that have to be met for specific service level agreements. Defining these parameters at a centralized policy server eliminates the need to configure each access router independently, and increases the probability of a consistent configuration of all access routers.

The control server is responsible for validating compliance with the SLAs specified at the policy server. It periodically polls all the access-routers in the network and collects traffic and performance statistics from them. The statistics are compared to target performance objectives specified in the service level agreements stored at the policy server. The control server is also responsible for detecting changes in the policies stored at the policy server, and notifying access routers when the policies have changed.

2.1. Access Router

The access router consists of two components, a data-path component and a control-path component. The data-path component is implemented as an NDIS Intermediate Device Driver which sits between the IP layer and the token-ring/ethernet layer. The control-path component is implemented as a Windows application.

   [Proxy Interface]  [Policy Interface]  [Control Interface]   } Control Path
           |                  |                    |
           +------------------+--------------------+
                              |
                              |
                         [Data Path]

In the design of the access-router, we have classified each interface as one of two types, an external interface or an internal interface. The external interface connects the access router to another DS domain whereas the internal interface connects the access router to our own (ISP) DS domain. Any DS-specific processing in the data-path is only done on packets being received or sent on an internal interface.

Packets being sent out on an internal interface are passed through a classification module. The classification module looks at the 6-tuple in the IP/TCP headers and assigns a class of service to the packet. The 6-tuple consists of source and destination addresses, protocol, source and destination port numbers, and the incoming DS Field contained in the packet.

After classification the packet is passed through a routing module which determines the egress access router out of the DS domain. This information is needed in order to ensure that different packets are rate-controlled appropriately.

After the routing module, the packets are passed through a statistics collection module. Statistics are collected at the granularity of a channel, which is a logical pipe consisting of an ingress access router, an egress access router, and a class of service.

After statistics collection, the packets are passed through a pacing module. The pacer implements a simple policing function based on the virtual clock paradigm, and drops packets that are too early.
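
A minimal sketch of such a policer follows, written in Python for readability rather than as NDIS driver code. It applies the rule given with the pacer entries in Section 2.2: a packet is dropped when its expected arrival time runs ahead of the physical time by more than the grace-period. The class and field names are hypothetical, and two details are assumptions not stated in this draft: the clock of an idle pacer is pulled forward to the current time, and a dropped packet does not advance the clock.

```python
from dataclasses import dataclass


@dataclass
class VirtualClockPolicer:
    """Policing-only pacer: early packets are dropped, never delayed."""
    rate_bps: float             # mean bandwidth allowed through the pacer
    grace_s: float              # grace-period, in seconds
    virtual_time: float = 0.0   # expected arrival time of the next packet

    def admit(self, now: float, size_bytes: int) -> bool:
        """Return True to forward the packet, False to drop it."""
        # Assumption: an idle pacer banks no credit, so the clock never
        # falls behind the physical time.
        expected = max(self.virtual_time, now)
        if expected - now > self.grace_s:
            return False        # too early; assumption: clock is not advanced
        # Charge the packet's transmission time at the assigned rate.
        self.virtual_time = expected + size_bytes * 8 / self.rate_bps
        return True


pacer = VirtualClockPolicer(rate_bps=8000, grace_s=1.0)
# Three back-to-back 1000-byte packets at t=0, then one more at t=1.
decisions = [pacer.admit(0.0, 1000) for _ in range(3)] + [pacer.admit(1.0, 1000)]
```

With these illustrative numbers each packet costs one second of virtual time, so the third back-to-back packet is two seconds early and is the only one dropped.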
We would have liked to implement a shaping function, wherein early packets are delayed rather than dropped, but could not design an efficient method to do so within the context of an NDIS intermediate driver. The only way to do shaping would be to make a copy of every packet to be shaped, which would seriously degrade performance.

Multiple channels are paced together into a single pacer. Due to the congestion control protocol employed by TCP, all TCP connections that share a bottleneck link in the core network of the DS domain should be paced together. Otherwise, the TCP connection that is not being paced will take up the lion's share of the bandwidth on the bottleneck link. It is due to this phenomenon that we need to determine the egress access router at the ingress access router. While this does not ensure pacing of all TCP sessions sharing a congested link, it ensures that TCP sessions that originate at the same ingress access router and are likely to share a congested link are paced together.

As a last step of the packet processing, the TOS byte of the packet is changed and the IP header checksum is updated.

Packets being received at an internal interface are subject to statistics collection. We want to collect traffic and performance data on all the channels in the DS domain. A straightforward implementation of statistics collection could be done by determining the ingress access router that sent the packets being received. This would have required that the egress data-path on the egress access router also perform a routing lookup, which adds unnecessary complexity. We therefore developed an active probing mechanism to monitor the performance of the channels. Probes are generated at the ingress access routers after every 50 packets (with some randomization to prevent synchronization effects) and sent to the destination of the sampled packet.
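
The probe spacing can be illustrated with a short Python sketch. It counts packets on a channel and signals when a probe should be generated, drawing each gap at random around the nominal 50 packets. The class name, the uniform distribution, and the jitter of plus or minus 10 packets are assumptions made for illustration; this draft states only that some randomization is applied.

```python
import random


class ProbeSampler:
    """Decides, packet by packet, when the ingress router emits a probe."""

    def __init__(self, mean_gap=50, jitter=10, rng=None):
        self.mean_gap = mean_gap    # nominal packets between probes
        self.jitter = jitter        # assumed spread around the nominal gap
        self.rng = rng if rng is not None else random.Random()
        self.countdown = self._next_gap()

    def _next_gap(self):
        # Uniform jitter keeps routers from probing in lock-step.
        return self.rng.randint(self.mean_gap - self.jitter,
                                self.mean_gap + self.jitter)

    def on_packet(self):
        """Call once per forwarded packet; True means 'send a probe now'."""
        self.countdown -= 1
        if self.countdown > 0:
            return False
        self.countdown = self._next_gap()
        return True


sampler = ProbeSampler(rng=random.Random(1998))
probe_positions = [i for i in range(1, 5001) if sampler.on_packet()]
```

A probe generated this way would be addressed to the destination of the packet that triggered it, so that it follows the same channel.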
The egress access router detects the probes and removes them from the traffic stream. It then reflects the probes back to the originating access-router, which is responsible for collecting round-trip delay information. As a result, the packets being received are examined to see if they are probe packets. The probe packets are either reflected back to the originating access-router or processed to collect delay and traffic statistics.

The control path of the access-router consists of three components. The first component is a client of the policy server, and is responsible for obtaining policies from the directory and passing them to the data-path. The second component is responsible for receiving requests from the control server and sending out the appropriate responses. The final component provides an interface for an application-specific proxy to specify classification behavior.

The data-path component is only capable of marking packets on the basis of the 6-tuple. However, we may want to classify traffic on the basis of other criteria and information that is available only at higher layers. Examples of such classifications include different marking on the basis of the URL being visited or the type of application being run. In order to accommodate such applications, we have provided an interface for trusted proxies to specify a 6-tuple to be processed appropriately by the data-path. The proxy application is responsible for converting the application-level information into a 6-tuple that can be understood by the data-path. We have verified the proxy interface using an HTTP proxy that can classify web accesses using the URL and a PICS classification scheme [PICSREF].
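
The division of labor between the proxy and the data-path can be sketched as follows, in Python. The data-path side only matches 6-tuples; the proxy side turns an application-level decision into one more 6-tuple rule. The names `Rule`, `Classifier` and `install`, the class-of-service labels, and the choice to place proxy rules ahead of general policy rules are all hypothetical and are not taken from our driver.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Rule:
    src: str                # source subnet, CIDR notation
    dst: str                # destination subnet, CIDR notation
    proto: Optional[int]    # IP protocol number; None matches any
    sport: Optional[int]    # source port; None matches any
    dport: Optional[int]    # destination port; None matches any
    ds_in: Optional[int]    # incoming DS field; None matches any
    cos: str                # class of service assigned on a match

    def matches(self, src, dst, proto, sport, dport, ds_in) -> bool:
        return (ip_address(src) in ip_network(self.src)
                and ip_address(dst) in ip_network(self.dst)
                and self.proto in (None, proto)
                and self.sport in (None, sport)
                and self.dport in (None, dport)
                and self.ds_in in (None, ds_in))


class Classifier:
    def __init__(self, default_cos="best-effort"):
        self.rules = []
        self.default_cos = default_cos

    def install(self, rule, front=False):
        # Assumption: session-specific proxy rules take precedence.
        if front:
            self.rules.insert(0, rule)
        else:
            self.rules.append(rule)

    def classify(self, *six_tuple) -> str:
        for rule in self.rules:
            if rule.matches(*six_tuple):
                return rule.cos
        return self.default_cos


clf = Classifier()
# General policy obtained from the policy server.
clf.install(Rule("10.0.0.0/8", "0.0.0.0/0", None, None, None, None, "bronze"))
# A hypothetical HTTP proxy has rated one session and requests better service.
clf.install(Rule("10.1.2.3/32", "192.0.2.10/32", 6, 40000, 80, None, "gold"),
            front=True)
```

Only the rated session is lifted to the higher class; other traffic from the same host still falls through to the general policy rule.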
While an edge-device without an HTTP proxy accesses the policy server relatively infrequently (during initialization and then only if the policies change), an edge-device with an HTTP proxy needs to access the policy server every time it receives an HTTP request. Caching of application-level policies helps in reducing the access rate to some extent.

2.2. Policy Server

The policy server is implemented as an LDAP [LDAPREF] directory. The different types of entries that are required for configuration of the access-routers are stored at the directory. The schema used at the directory is shown in Appendix-I. We realize that the schema description in the appendix is very terse and cryptic, but we hope it will provide a flavor of the type of entries to be stored there.

Our initial thought was that the policy directory would have a relatively flat structure. It would consist of only two types of entries, one specifying the different levels of service implemented in the network, and the other specifying the mapping of different 6-tuple combinations (or application-level information) to one of the service classes. The application-level information that was used to define policy in our implementation was a mapping from the PICS ratings to one specific service level.

The service-level entry provides a description of how packets belonging to a specific service class are to be treated. Surprisingly enough, the specification of the Per-Hop Behavior is not needed in the description of the service-level. It is adequate to simply specify the DS Field value to be used to mark packets belonging to that service-level. Each service level is also associated with a set of target performance levels.

For traffic conditioning operations other than marking, e.g. policing or shaping, we need more information than just the service-level that a packet belongs to.
As mentioned in Section 2.1, the ingress access-router needs to determine the egress access-router of packets flowing through it. Such routing information can be generated using routing protocols like BGP, or it can be pre-configured. We assumed that this local topology would be stored at the policy server, from where the access-routers can obtain it.

Besides the local topology, the policy server specifies how multiple channels are to be paced at each edge-device. A channel is identified by the ingress access-router, the egress access-router and the service-level. Multiple channels are mapped into one pacer. This mapping is stored in the directory. The channels represent virtual pipes across the DS domain, and the pacers represent the constraints on the capacity of those pipes.

The characteristics of each pacer at each edge-device are also stored at the directory. The main pacer characteristic is the mean bandwidth which is to be allowed through the pacer. Another attribute which needs to be specified is the grace-period for packet dropping. A very simple policing scheme is used in our initial implementation: an expected time of packet arrival is computed depending on the rate assigned to the pacer, and a packet is dropped if the expected time of arrival exceeds the physical time by more than the grace-period.

An additional entry in the directory specifies the location of the control server, if any.

The modification time attribute of the directory server is turned on in our implementation. This requests the directory server to keep track of the time an entry was created or modified. The modification time information is used by the directory client to determine if the entries stored at the directory have changed.

2.3. Control Server

The control server is responsible for collecting traffic and performance statistics from the different access routers and verifying that the service level agreements are being satisfied.
It does so by polling the access-routers and comparing the reported performance of each channel to the expected level.

The performance statistics could be retrieved using SNMP [SNMPREF] from the control server. In our implementation, we decided to use a simple HTTP-like request-response protocol to obtain the information. Performance-related XML messages are exchanged in the polling exchanges. The XML messages are human-readable, and make debugging of the implementation much easier than with an equivalent SNMP implementation. The DTD used for the XML documents is shown in Appendix-II. We realize that the description in the appendix is very terse and cryptic, but we hope it will provide a flavor of the type of messages exchanged between the control server and the access routers.

In addition to correlating the performance information, the control server is also responsible for tracking the changes in entries stored at the policy server. The control server tracks changes by obtaining the modification time, the distinguished names of entries, and the count of entries of each type at the directory. If the modification times or the counts have changed, entries have been added to or removed from the directory. The control server attempts to determine the set of edge-devices which may be impacted by the change in the directory, and notifies them in the next polling period that they need to re-access the directory.

3. Implementation Lessons

During the implementation of the NT-based differentiated services access router, we came across several surprising lessons. In this section, we would like to share some of these experiences. In retrospect, many of these lessons should have been obvious. Unfortunately, we learned most of them the hard way.

The biggest limitation of our implementation is that we could not perform packet shaping easily.
An NDIS intermediate driver is implemented as a callback function which is invoked after IP processing is completed, but before the MAC processing is invoked. The management of buffers needed to hold a packet is kept in the NDIS layer, and the driver is required to provide a return code to enable forward processing, or else to provide an error code. This is adequate for functions such as marking or policing. However, shaping requires that the control of the buffer be handed over to the intermediate driver, which resumes the forward processing of the packet after the passage of some time. Given the lack of a return code that grants such control, the only way of doing shaping is to make a physical copy of the packet. The performance impact of the copying is significant.

Another interesting lesson was learned from the way the NT data-path handled the packets which are used to communicate with the policy server. Our initial assumption that all rules would be fetched from the policy server turned out to have one strong limitation. The initial (as well as subsequent) access to the policy server has to be done at a higher priority, otherwise the control packets are dropped during the pacing stage. As a result, the first action taken by the control path is to install a policy rule for communicating with the policy server. Additionally, with the use of the control server, each access-router needs to have an additional rule which marks the control server communication as network control (high priority) traffic.

The policy server implementation as an LDAP directory illustrated the significant impact schema definition could have on the performance of the directory server. Our initial attempt to look up rules for an access router involved searching the root subtree for all policy rules that matched the IP address of an access router.
Converting this search scheme to a directed entry lookup could speed up operations by an order of magnitude.

One limitation of the LDAP implementation of the policy server is that there are no efficient ways of monitoring when policies have changed. Our initial implementation used polling from each access router to the policy server. In subsequent revisions, we used the control server to make this determination. The control-server-based implementation reduced the load on the policy server significantly. However, there was a slightly bigger lag before access routers determined that the policies had changed at the server.

Another interesting lesson was learned during the implementation of the HTTP proxy to provide PICS-based rating. Our initial attempt was to build the proxy as an add-on to an existing server using the add-on APIs provided by web servers such as Lotus Domino Go Server or the Netscape server. To our dismay, the API exported did not permit determining the full 5-tuple of a web session, which prevented us from passing the proper parameters to the data-path. We had to implement an independent web proxy. Interestingly enough, the initial accesses to the web proxy (before the proxy has made a classification decision) need to occur at the priority of network control. Without this additional implicit rule, we noticed that too many TCP connect requests were being dropped as low priority before the proxy could classify them as high priority.

4. Virtual Leased Line Service

Realizing that the concept of channels is close to that of a virtual leased line, we tried to examine how closely we could mimic the concept of a virtual leased line using the edge-device developed on NT. We mapped each channel into its individual pacer, and tried to set the pacer parameters so that they would correspond to an equivalent leased line.
The pacer as implemented in our system uses two parameters, a rate and a grace-period. The interesting part, of course, is to determine whether a virtual leased line service can be provided by using a policing mechanism only, rather than using a shaping mechanism which we were unable to implement. In order to provide a virtual leased link with a fixed rate and a specific buffer capacity at the ingress access-router, we use a pacer with the same rate and a grace-period that would drop packets only if a corresponding physical link would have run out of buffer space.

We conducted a variety of experiments to validate whether this model would provide the abstraction of a virtual leased line. Our findings are summarized below.

We know that by setting a very large grace-period, we would be able to obtain a line with a rate equivalent to that of the access router output link. We also know that by setting a very low grace-period, we would be able to reduce the effective rate to zero (i.e. deny access to a class of packets). In our implementation, which used a token ring, we figured we should be able to move from 0 to 16 Mbps in this manner. It seemed logical that by computing the grace-period in an appropriate manner, we would be able to obtain a target rate for UDP and TCP traffic.

For UDP traffic, determination of the grace-period was relatively straightforward. As long as the buffer size for the input link exceeded the size of the UDP packets being sent on the network, UDP traffic throughput could be controlled to fit quite closely to the assigned pacing rate. With TCP, the dynamics of congestion control provide quite a different story.

For UDP traffic, a single application flow as well as five application flows aggregated into a single pacer were considered. In the multiple user case, we looked at equal sharing of the targeted throughput as well as unequal weights.
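
One way to read the buffer-based rule above, offered as an assumption rather than as the formula used in our experiments, is that the grace-period should equal the time the emulated link needs to drain a full buffer: a buffer of B bytes on a link of R bits per second holds 8B/R seconds of traffic. The Python sketch below evaluates this for a hypothetical 16 KByte buffer; the function name and the buffer size are illustrative, and 1 Kbps is taken as 1000 bits per second.

```python
def grace_period_s(buffer_bytes: int, rate_bps: float) -> float:
    """Time needed to drain a full buffer of the emulated leased line."""
    return buffer_bytes * 8 / rate_bps


# Hypothetical 16 KByte buffer at the three targeted rates used in the tests.
examples = {rate: grace_period_s(16 * 1024, rate)
            for rate in (56_000, 500_000, 1_000_000)}
```

Under this reading, the same buffer corresponds to a much longer grace-period on a slow line than on a fast one, so the grace-period has to be recomputed whenever the pacer rate changes.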
When there is a single flow with UDP packets of size 4 KBytes, the measured throughput is close to the targeted throughput: 55 Kbps, 495 Kbps, and 986 Kbps for targeted rates of 56 Kbps, 500 Kbps, and 1 Mbps respectively. As the packet size gets smaller, we were surprised to see the measured throughput decrease gradually. Our guess is that the decrease in throughput is due to the larger number of packets that need to be processed by the access-router, but we do not have a conclusive answer yet. When multiple application flows were considered, the results were similar.

The performance of our pacing mechanism with TCP connections has been studied only for the single user case so far. With a packet size of 4 KBytes, the measured throughput remains in the range of a few hundred bps for expected rates up to 1 Mbps for a single user. For a packet size of 1 byte, the measured throughputs are significantly better but still far from the expected rates, e.g. 16.8 Kbps for 56 Kbps and 389 Kbps for 1 Mbps.

The bursty nature of TCP packet arrivals is not well handled by the simple pacing algorithm. The interaction of the TCP congestion control mechanism with the packet drops means that the achievement of targeted rates is more difficult in this case. What we found was that TCP was timing out, resulting in long periods of no activity. This implies that it would be difficult to obtain the full throughput of a leased line for a single TCP flow using a policing-only mechanism. When multiple TCP flows are paced together, the impact of dropping is likely to be less severe.

We are currently investigating ways to achieve rates closer to the targeted values. Our goal would be to find a way to determine the grace-period for TCP so that the access-router would work with standard implementations of TCP.
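
The figures quoted in this section can be restated as fractions of the targeted rate. The short Python calculation below uses only the numbers reported above, with 1 Mbps taken as 1000 Kbps; the helper name is hypothetical.

```python
# (targeted Kbps, measured Kbps) pairs as reported in this section.
reported_kbps = {
    "UDP, 4 KByte packets": [(56, 55), (500, 495), (1000, 986)],
    "TCP, 1 byte packets": [(56, 16.8), (1000, 389)],
}


def achieved_fraction(target_kbps, measured_kbps):
    """Share of the targeted rate that was actually measured."""
    return measured_kbps / target_kbps


summary = {
    label: [round(achieved_fraction(t, m), 3) for t, m in pairs]
    for label, pairs in reported_kbps.items()
}
```

The single UDP flow reaches roughly 98 to 99 percent of its target, while the single TCP flow reaches only about 30 to 39 percent in the better of the two TCP cases.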

Acknowledgments

The authors would like to acknowledge the helpful comments and suggestions of the following individuals: Edward Ellesson, John Tavs, Arvind Krishna, Kurt Dietrich, Sanjay Kamat and Rajendran Rajan.

References

[DSARCH]   S. Blake, et al., "An Architecture for Differentiated Services", Internet Draft, October 1998.

[DSFRAME]  Y. Bernet, J. Binder, S. Blake, et al., "A Framework for Differentiated Services", Internet Draft, November 1998.

[DSHEAD]   K. Nichols and S. Blake, "Definition of the Differentiated Services Field (DS Byte) in the IPv4 and IPv6 Headers", Internet Draft, May 1998.

[LDAPREF]  W. Yeong, T. Howes, and S. Kille, "Lightweight Directory Access Protocol", March 1995.

[SNMPREF]  J. Case, M. Fedor, M. Schoffstall, and J. Davin, "Simple Network Management Protocol", RFC 1157, May 1990.

[PICSREF]  W3C Consortium, "Platform for Internet Content Selection (PICS)", http://www.w3.org/PICS.

Authors' Address

   Mandis Beigi       Phone: (914) 784-3277
   Raymond Jennings   Phone: (914) 784-5475
   Srinivasa Rao      Phone: (914) 784-7477
   Dinesh Verma       Phone: (914) 784-7466

   IBM T. J. Watson Research Center
   P.O. Box 704
   Yorktown Heights, NY 10598
   Email: {mandis,raymondj,psrao,dverma}@watson.ibm.com

Appendix-I: Schema used in LDAP Repository

The schema used in our policy server is shown below. We have not described the different attributes in detail for the sake of brevity. The classes defined in the schema perform the following functions:

- customer: defines a customer for whom SLAs are being defined.
- interface: defines an interface IP address of an access router.
- controlserver: defines the location of the control server in the network.
- slaprincipal: defines the policy applicable for an interface.
- servicelevel: defines a service offered by the network.
- slachannel: defines the mapping from a channel to a pacer.
- slapacer: defines the pacing parameters of a pacer.
- EdgeSubnet: defines the local topology of an access-router.

Please note that this is a simple schema done purely to expedite our access router implementation, and has many limitations. We are providing it only for informational purposes.

All relevant entries for an interface on an access router are identified by matching on the attribute "if" of the entry, which should match the IP address of the interface.

objectclass: customer
   Required Attributes: objectClass, o
   Permissible Attributes: address, description

objectclass: interface
   Required Attributes: objectClass, cust, ipaddr, defcos, interfaceUpdateTime
   Permissible Attributes: edName, description

objectclass: controlserver
   Required Attributes: objectClass, if, port, cos
   Permissible Attributes: description

objectclass: slaprincipal
   Required Attributes: objectClass, if, ptype, cos
   Permissible Attributes: sourceSubnet, sourceSubnetMask, destSubnet, destSubnetMask, sourcePort, destPort, subnetExchangeFlag, portExchangeFlag, proto, principalURL, principalHigherLayer, userGroup, ratingSystem, ratingService, businessRelevance, profitability, targetGroup, description

objectclass: servicelevel
   Required Attributes: objectClass, o, cos, sourcePort, destPort, markPacketFlag, violation
   Permissible Attributes: delay, loss, delayPriority, lossPriority, nextCos, tosEncoding, portEncoding, description

objectclass: slachannel
   Required Attributes: objectClass, sourceIP, destinationIP, cos, channelpacer
   Permissible Attributes: description

objectclass: slapacer
   Required Attributes: objectClass, if, pacerNum, pacerRate, maxPacketDrop, maxByteDrop
   Permissible Attributes: description

objectclass: EdgeSubnet
   Required Attributes: objectClass, if, SubnetAddress, SubnetMask
   Permissible Attributes: description

Appendix-II: DTD used for Control Server Communication