Network Working Group Internet Draft C. Proust Document: draft-proust-flow-routing-01.txt France Telecom Expires: January 2005 August 2004 An approach for routing at flow level draft-proust-flow-routing-01.txt Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026 [RFC2026]. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/ietf/1shadow-sites.txt Abstract This specification designs an architecture that aims at load balancing flows along multiple routes according to a measure of their average traffic loads. This architecture supports such a service due to the operation in a closed loop of an enhanced routing control plane with an enhanced forwarding plane. Basically, this routing paradigm allows a second dynamic metric called load to be taken into account when calculating new routes and when forwarding the flows along the routes. Proust Expires û January 2005 [Page 1] Internet-Draft flow routing 8/2/2004 Table of Contents 1. Introduction...................................................3 1.1 A Brief History............................................3 1.2 Common Understanding of the Requirements...................4 1.3 Some Strong Architectural Beliefs..........................4 1.4 Design Philosophy..........................................5 1.5 Overview of the Architecture...............................6 1.6 Review of an Example......................................10 2. Extensions of the Forwarding plane............................16 2.1 Overview..................................................16 2.2 A new option in the IP protocol...........................17 2.3 FIB structure.............................................34 2.4 FIB update................................................34 3. Extensions of the Routing Control plane.......................37 3.1 Overview..................................................37 3.2 Statistic module..........................................40 3.3 RIB structure.............................................41 3.4 LLP control messages......................................42 3.5 Synchronization of the LLDB...............................46 3.6 RIB update................................................49 3.7 Route calculation algorithm...............................52 4. Ongoing Experiment............................................52 5. Conclusion....................................................53 IANA Considerations..............................................54 New IP Option.................................................54 New PDU Types.................................................54 New TLV Code..................................................54 New sub-TLV Codes.............................................55 Security Considerations..........................................55 Normative References.............................................55 Informative References...........................................56 Acknowledgments..................................................57 Changes..........................................................57 Author's Address.................................................57 IPR Notice.......................................................57 Full Copyright Statement.........................................57 Proust Expires û January 2005 [Page 2] Internet-Draft flow routing 8/2/2004 1. Introduction 1.1 A Brief History While the Internet has dramatically evolved during the last decade in terms of services, numbers of users, network sizes, or bandwidth rates, there have not been any fundamental changes in the field of routing apart from MPLS developments. To quote from [IAB-FUTURE], "The deployed Internet routing system assumes that the shortest path is always the best path. This is probably false, however it is a reasonable compromise given the routing protocols currently available. The Internet lacks deployable approaches for policy-based routing or routing with alternative metrics." Hence today, it is envisioned that the routing area needs some improvement. As an indication, one can underscore the past activity that has been done in the working group QOSR (QOS Routing). It delivered a framework document [QOSRFRWK] and a protocol specification [QOSPF] that are still used as references in this field. After this working group has concluded another approach named [OMP] was proposed at the IETF. It also enhances the routing control plane with adaptive capabilities to network traffic load changes. Despite the interest of the IETF in both approaches, they were not brought into alignment with the standard tracks mainly because they were not appropriate for a large-scale deployment. Moreover, the IAB recently released a new framework document [QOSARCFRWK] about QOS architectural considerations for IP networks that reminds the need for enhanced routing capabilities in the Internet. On page 11 of this document, G. Huston states: "There is a more fundamental issue here concerning resource management and traffic engineering. The approach of single path selection with static load characteristics does not match a networked environment which contains a richer mesh of connectivity and dynamic load characteristics. In order to make efficient use of a rich connectivity mesh, it is necessary to be able to direct traffic with a common ingress and egress point across a set of available network paths, spreading the load across a broader collection of network links." This I-D addresses this problem statement. It designs an architecture that allows flows within a routing domain to be load balanced along multiple routes according to a measure of their relative traffic load. Proust Expires û January 2005 [Page 3] Internet-Draft flow routing 8/2/2004 1.2 Common Understanding of the Requirements Several documents served as the foundations to derive the requirements that apply to this document. First of all, there is the already mentioned framework document about QOS routing [QOSRFRWK]. Second, there is the more recent framework document about QOS architecture [QOSARCFRWK]. Third, there are the specifications documents of two experimental protocols that broaden the understanding of the above problem statement, namely, protocols known as the QOS Routing Mechanisms and OSPF Extensions (QOSPF), and the Optimized Multi Path (OMP) approach. Finally, the BCP 41 [BCP41] gives authoritative directions about congestion control principles that can serve to check that the solution is coherent with TCP and other elaborated end-to-end congestion control mechanisms. This section serves as a reminder of the fundamental requirements that the solution must meet. - the solution must provide for multiple paths between any node pair - the solution must optimize the use of the network (mainly by means of load balancing mechanisms), based on knowledge of available resources (mainly link bandwidth) - the solution must exhibit good properties in terms of scalability, stability and performance - the solution must not place a new security breach upon the Internet architecture - the solution must allow for a fine granularity of routing - the solution may provide support for management mechanisms to ease its deployment and supervision 1.3 Some Strong Architectural Beliefs The solution described hereafter is based on several strong assumptions. First, "stateless" switching technology is considered necessary to build an Internet architecture that exhibits good properties, such as, scalability, robustness and security. Second, the finer the level of granularity of network services, the more value added the resulting network. Third, resource access control mechanisms are keys to avoid contention of resources. Besides, over-provisioning technique is a Proust Expires û January 2005 [Page 4] Internet-Draft flow routing 8/2/2004 common practice among network operators. This technique is considered adapted to IP networks where IP traffic is unpredictable. Furthermore, the deployment of a router-based end-to-end congestion control mechanism [BCP41] in the Internet is seen as a priority for its evolution. Finally, it is generally admitted that stability and simplicity [INT- ARCH2] are key concerns when promoting an architecture whose scope challenges that of the Internet. 1.4 Design Philosophy As a consequence of the stateless nature of the Internet, every IP datagram contains enough information in its header to make routers forward them independently towards their final destination. In order to cope with this fundamental characteristic, the solution that is depicted, hereafter, conforms to the stateless paradigm. It defines a new option for the IP protocol that allows flows to be load balanced among multiple routes according to network traffic load. To do so, route pinning information is recorded and conveyed in the optional IP header of every datagram. With regard to the level of granularity to which this network service applies, the solution examines routing at flow level in order to increase the added value of those services. Hence the definition of a flow that applies to this architecture is that of a pair of Internet sockets combined with a number of transport protocol (source address, source port number, destination address, destination port number). With regard to the third assumption, the architecture relies on a resource access control mechanism that operates at flow level and takes into account the availability level of network resources, mainly link bandwidth. However, over a long period of time, the network resources would remain over-provisioned. There are numerous RFCs [BCP41], [QOSRFRWK] that stress the need for a router-based end-to-end congestion control mechanism within the Internet. As a result, it would be opportunistic to design a routing solution that not only provides for IP connectivity but also that controls congestion at the scale of the network. With regard to the stability criteria, the routing solution must first include mechanisms that ensure convergence within a reasonable timeframe given the dynamic of the flows. Second, the routing system must generate an amount of information about resource availability that is manageable. Third, the routing system must guarantee that the flows are routed in a stable way. That is to say, that the system must not allow for Proust Expires û January 2005 [Page 5] Internet-Draft flow routing 8/2/2004 oscillations of traffic between different routes except when topology changes. To comply with the simplicity criteria the solution must keep the overall architecture on a rather uncomplicated level. Hence their underlying mechanisms must be kept as simple as possible in order to be implemented and operated on large scale. 1.5 Overview of the Architecture This architecture allows flows of IP datagrams to be dynamically load balanced along multiple routes according to their relative level of availability. In particular, the routing control plane of the network system is enhanced to take into account an additional metric called the traffic load. Due to its dynamic nature, this metric is periodically refreshed. Basically, a flow is defined as the combination of a pair of Internet sockets and the number of a transport protocol: i.e. a quintuple of a source network address, a source port number, a destination network address, a destination port number, and a transport protocol number. The load balancing policy of the flows is defined at the routing control plane while it is applied at the forwarding plane. In this architecture the forwarding plane is placed under a tighter control of the routing control plane since the latter dynamically modifies its routing policy to adapt its sets of routes to the fluctuation of the network load distribution. Therefore, this I-D presents the functional extensions needed both in the forwarding and in the routing control plane. The extensions of the routing control plane are based on any standard Interior Gateway Protocol (IGP) of type Link State such as OSPF [OSPF] or ISIS [ISIS]. In particular, their route calculation algorithms are enhanced to take into account both the static metric known as the cost, and a dynamic metric that assesses the average traffic load of the transmission links. For this purpose, their routing control process includes a statistical function that periodically computes aggregated information about the level of incoming and outgoing traffic circulating over their active transmission links. Each router broadcasts this aggregated information to neighbor routers and further throughout the routing domain using specific Link State Packets (LSP) and databases synchronization procedures. For convenience, these specific LSPs are called Link Load Packets (LLP). As a consequence, this augmented Link State DataBase (LSDB) enables every router to figure out the topology of the network, as well as, its traffic load distribution. Given that additional information, every router can determine a richer set of routes satisfying multiple criteria, such as, basic connectivity and length of the route, level of traffic load along a Proust Expires û January 2005 [Page 6] Internet-Draft flow routing 8/2/2004 route. Thus it is possible to define a routing policy that activates only short routes free of long term congestion. Given that capability, when the network system detects the overloading of a primary route, it can react appropriately by inserting an alternative route in the forwarding plane that would transport the additional traffic. To support such enhancement of the routing control plane, the structure of the Routing Information dataBase (RIB) associated to an IGP is extended. It includes three new attributes. The first one called load is the dynamic metric which represents the average traffic load along the corresponding route for a given period. Its value is set to the maximum of the average traffic load measures of the elementary transmission links that compose the route. The second one called index, identifies a route among the set of routes associated to a given destination prefix. The third one called path, allows the routing control process to efficiently refresh the load attribute in the RIB when a new LLP is received. Hence, given those new RIB capabilities, the functional extension of the routing control process develops along two axes: the calculation of a set of short routes under the control of a congestion avoidance mechanism, and the refreshing of the load attribute in the RIB table each time a new LLP is received. By means of symmetry in relation to the routing control plane, this architecture enhances the data forwarding plane in various directions. First, the structure of the Forwarding Information dataBase (FIB) is extended in order to reflect the additional information provided by the routing control plane. Hence, every route of the FIB possesses two more attributes: a traffic load indication and a route index parameter. The latter allows a route to be uniquely identified among the set of routes that exactly matches a given destination prefix. Second, the forwarding algorithm of the IP protocol is modified to efficiently load balance the flows according to the routing policy defined at the routing control plane. To this end, its route lookup algorithm is modified to operate on an ordered list of criteria which first considers the destination address then the route index. Besides, it includes two main execution paths. The first one operates like a route recording mechanism. It is activated when a flow is crossing the network for the first time or when a topology change is affecting the current route associated to any ongoing flow. Its purpose is to link the forwarding of a flow onto a route that efficiently meets some load balancing criteria. That linking operation is known in the literature as route pinning. For scalability, robustness and security reasons, this architecture Proust Expires û January 2005 [Page 7] Internet-Draft flow routing 8/2/2004 conforms to the stateless network paradigm. Therefore, every IP datagram header includes an additional field that allows some route pinning information like a route index value to be selected and written in the datagram by every router at each hop. That additional information is also used to distinguish a new flow from an old one. By convention, a zero value in a route index field would indicate a new flow. For stability and performance reasons, once a choice of load balancing is made, it should be maintained for all subsequent datagrams belonging to the flow. To that end a piggybacking procedure is designed between the two endpoints of a bidirectional IP communication that allows the route pinning information (i.e. a vector of route index values) collected in the IP datagram header to be sent backward when reaching one terminal. Hence, every IP datagram conveys within its header, both updated route pinning information (i.e. two vectors of route index values) associated to the forward and backward flows of any bidirectional IP communication. Besides, each terminal manages a communication context enriched with that information. Moreover, at each endpoint two finite state machines are defined. They implement a synchronization process that guarantees that only one vector of route index values is valid for a flow at a time in the network. The second execution path of the forwarding process operates like a hop by hop routing mechanism constrained with route pinning information. It is effective as far as the network topology remains unchanged and bidirectional IP communications are established. The route pinning information is contained in the IP datagram header, it is a route index value that uniquely identifies a route among multiple routes in the FIB associated to a given destination prefix. A route index value different from zero indicates that the datagram relates to an already route pinned flow. In this case, the route lookup procedure proceeds in two fast stages. According to the destination address value contained in the IP datagram header, it first performs a longest prefix matching route lookup operation in the FIB table. Then, it selects the pinned route according to the route index value conveyed in the datagram header by means of an exact route index matching search operation. If the latter stage fails, it indicates an invalid route index value that may be related to a topology change affecting the current pinned route. In that case, the router switches from that fast forwarding execution path to the slower one described above. From an implementation standpoint, such functional extensions needed in the forwarding plane are specified by means of a new option in the IP protocol. Proust Expires û January 2005 [Page 8] Internet-Draft flow routing 8/2/2004 With regard to security considerations, the knowledge of both vectors of route index values, relating to the forward and backward flows of a bidirectional IP communication, is of no use from the perspective of an end user. A vector of route index values has only a local meaning in relation to the FIB of the routers. Therefore, without the knowledge of all the FIBs of the routers that take place along the route used by a flow, an end user cannot take advantage of this information to supersede the routing decision enforced by the network. At worst, an end user that would inadvertently change any value of one vector of route index values, enforced by the network, would lose the benefit of being transmitted efficiently along the same route. In terms of performance, that change could affect the delivery timeframe, due to the reprocessing of a valid vector of route index values, as well as, the reordering of some datagrams at the receiving endpoint. Proust Expires û January 2005 [Page 9] Internet-Draft flow routing 8/2/2004 1.6 Review of an Example In this example the topology of the network is represented in the figure below. It is composed of two IP terminals whose IP addresses are S1 and D1. Moreover, four interconnected routers named R1, R2, R3 and R4 compose the IP network. subnet A subnet B +----+ | | +----+ | S1 | | ,---. ,---. ,---. | | D1 | +----+____| / \ / \ / \ |___+----+ +----+ |____( R1 )......( R2 ).....( R3 )___| +----+ | \ / \ / \ / | `---'. `---' .'`---' `. .' `. .-' `. .' `,---..' / \ ( R4 ) \ / `---' Figure 1: network topology example In this scenario, the terminal S1 is first initializing a bidirectional IP communication with the terminal D1. As this time, the distribution of the network traffic load is within well acceptable boundaries. In other words, there is no transmission link in the network that experiences a traffic overload. Therefore, in this context, the routing control plane of the network has converged to the simplest form in which, every router has determined a unique route per destination prefix. In a first approximation, a simplified representation of the four RIBs and therefore FIBs of the related network is represented on the four figures below. This approximation assimilates the router names as the destination prefixes of the network. It does not change the logic presented hereafter. Furthermore, the RIB tables include two additional columns compared to a more conventional RIB table. The first one called index, identifies a route among a set of routes linked to a given destination prefix, while the second one called load, designates the short term traffic load observed over the related path. The maximum traffic load threshold is arbitrarily fixed to 30 percent of the absolute available bandwidth calculated over a route. One can calculate such bandwidth by considering the Proust Expires û January 2005 [Page 10] Internet-Draft flow routing 8/2/2004 maximum of the elementary absolute bandwidths associated with the transmission links that compose a given route. The maximum threshold that identifies an acceptable level of network congestion might be determined by following some over-provisioning engineering rules. +--------+-----+-------+------+ +--------+-----+-------+------+ |dst pref|index|nexthop| load | |dst pref|index|nexthop| load | +--------+-----+-------+------+ +--------+-----+-------+------+ | R2 | 1 | R2 | <30% | | R1 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | R3 | 1 | R2 | <30% | | R3 | 1 | R3 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | R4 | 1 | R4 | <30% | | R4 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | B | 1 | R2 | <30% | | B | 1 | R3 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | A | 1 | A | <30% | | A | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ Figure 2: RIB of router R1 Figure 3: RIB of router R2 +--------+-----+-------+------+ +--------+-----+-------+------+ |dst pref|index|nexthop| load | |dst pref|index|nexthop| load | +--------+-----+-------+------+ +--------+-----+-------+------+ | R2 | 1 | R2 | <30% | | R1 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | R1 | 1 | R2 | <30% | | R2 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | R4 | 1 | R4 | <30% | | R3 | 1 | R3 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | B | 1 | B | <30% | | B | 1 | R3 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | A | 1 | R2 | <30% | | A | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ Figure 4: RIB of router R3 Figure 5: RIB of router R Hence at first, as the network is lightly loaded, the bidirectional IP communication that takes place between terminal S1 and D1, is routed along the primary route composed of router R1, R2 and R3. The figure below represents the two flows f1 and f1' of that bidirectional IP communication. For convenience, flow f1 is characterized by the quintuple S1, D1, p, p' and protocol, where protocol designates the number of the transport protocol, and p and p' designate the source and destination port numbers identifying both communication points on terminals S1 Proust Expires û January 2005 [Page 11] Internet-Draft flow routing 8/2/2004 and D1, respectively. By means of symmetry, the flow f1' is characterized by the quintuple D1, S1, p', p and protocol. flow f1' subnet A _.-----------------------. subnet B ,--'' flow f1 `--. | ,-' ,----------------------------. `-----| +----+ <--+' ,-' `------. |`- +----+ | S1 | ---+--' ,---. ,---. ,---. `+-->| D1 | +----+____| / \ / \ / \ |___+----+ +-+-++ |____( R1 )......( R2 ).....( R3 )___| ++-+-+ | \ / \ / \ / | `---'. `---' .'`---' `. .' `. .-' `. .' `,---..' / \ ( R4 ) \ / `---' Figure 6: pinning of flows f1 and f1' on the primary route A moment later, the network starts to experiment some traffic overload at the transmission link located between router R2 and R3. The figure below depicts this event. flow f1' _.-----------------------. subnet B ,--'' flow f1 `--. | ,-' ,----------------------------. `-----| +----+ <--+' ,-' `------. |`- +----+ | S1 | ---+--' ,---. ,---. ,---. `+-->| D1 | +----+____| / \ / \ / \ |___+----+ +-+-++ |____( R1 )......( R2 .XXXXX. R3 )___| ++-+-+ | \ / \ / |\` \ / | `---'. `---' \ .'`---' subnet A `. .\ `. .-' \ `. .' \ `,---..' \______________ / \ overloaded link ( R4 ) \ / `---' Figure 7: network experiencing a light traffic overload Proust Expires û January 2005 [Page 12] Internet-Draft flow routing 8/2/2004 A moment later, the routing control plane detects such an event by means of newly received LLP messages. As defined before, LLP messages characterize the distribution of a network traffic load in close relationship with its topology. Hence, in relation with this scenario, this event triggers the calculation of secondary routes. Only the routes that are affected by the temporary unavailability of this transmission link can trigger the activation of alternative routes. This decision belongs to each router and depends on the positions of routers and hot spots within the topology. In the present case, the overload of the transmission link located between routers R2 and R3 has a different impact depending on the considered router. For instance from the viewpoint of router R3, this occurrence affects all routes that rely on this specific transmission link. Therefore, the router R3 calculates and activates three secondary routes in its updated RIB table, which is represented below on figure 10. Meanwhile, from the viewpoint of router R4, none of its routes are affected by this minor outage. The figure 11 shows that the RIB table of router R4 remains unchanged at this moment. +--------+-----+-------+------+ +--------+-----+-------+------+ |dst pref|index|nexthop| load | |dst pref|index|nexthop| load | +--------+-----+-------+------+ +--------+-----+-------+------+ | R2 | 1 | R2 | <30% | | R1 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | | 1 | R2 | >30% | | | 1 | R3 | >30% | | R3 |-----+-------+------+ + R3 +-----+-------+------| | | 2 | R4 | <30% | | | 2 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | R4 | 1 | R4 | <30% | | R4 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | | 1 | R2 | >30% | | | 1 | R3 | >30% | | B |-----+-------+------+ + B +-----+-------+------| | | 2 | R4 | <30% | | | 2 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | A | 1 | A | <30% | | A | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ Figure 8: RIB of router R1 Figure 9: RIB of router R2 Proust Expires û January 2005 [Page 13] Internet-Draft flow routing 8/2/2004 +--------+-----+-------+------+ |dst pref|index|nexthop| load | +--------+-----+-------+------+ | | 1 | R2 | >30% | | R2 |-----+-------+------+ | | 2 | R4 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | | 1 | R2 | >30% | |dst pref|index|nexthop| load | | R1 |-----+-------+------+ +--------+-----+-------+------+ | | 2 | R4 | <30% | | R1 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | R4 | 1 | R4 | <30% | | R2 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | B | 1 | B | <30% | | R3 | 1 | R3 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | | 1 | R2 | >30% | | B | 1 | R3 | <30% | | A |-----+-------+------+ +--------+-----+-------+------+ | | 2 | R4 | <30% | | A | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ Figure 10: RIB of router R3 Figure 11: RIB of router R4 Finally, later on, terminal S1 and D1 are engaged in a secondary bidirectional IP communication characterized by two flows f2 and f2'. This time, as the primary route is already slightly loaded, it is expected that both routers R3 and R1 will route the datagrams of flows f2 and f2' along the secondary route. This secondary route is composed of the sequence of routers R1, R4 and R3. The main advantage of the design of this architecture is that it allows flows f1 and f1' to still be forwarded along the primary route which is estimated to be lightly overloaded, while allowing more recent flows f2 and f2' to be forwarded along a secondary route which is believed to be under loaded. This capability is enabled through the enhancement of the forwarding plane of all routers. First, their RIB tables are extended to include two new attributes per route: an index parameter and the traffic load attribute. Second, their forwarding algorithm is extended to take into account a route index value, in addition to, the destination address of the IP header when processing an IP datagram, received on one of its interfaces. Moreover, the route lookup processing is extended to operate differently according to the nature of the flows being forwarded. Proust Expires û January 2005 [Page 14] Internet-Draft flow routing 8/2/2004 When a datagram belonging to a new flow (e.g. one datagram for a router with a related route index value contained in its IP header equals to zero) needs to be routed, or when some topology change has occurred that affects the route followed by the datagrams of an ongoing bidirectional IP communication, a load balancing and multi criteria route lookup procedure is carried out. Doing so, the router first selects the route whose prefix destination matches the address destination contained in the IP header, and second, for which the traffic load value is under the maximum threshold, and third, exhibits the lowest cost (in relation to the metric being considered by the intra-domain routing policy). When the datagram belonging to an old flow (e.g. one datagram for a router with a related route index value contained in its IP header different from zero) needs to be forwarded, a simpler and faster route lookup procedure is executed. It operates in two steps. First, based on the destination address, a Longest prefix Matching Search (LMS) operation is executed in the FIB. Second, the route index value that relates to the current router serves as a refinement of the route lookup processing. It indicates which load-balancing choice was previously assigned to the forwarding of this flow. Such route pinning information are stored and sent by terminals during any bidirectional IP communication through the use of a piggybacking procedure. The figure 12 below represents the routing of all flows depicted in this scenario among the available resources. flow f1' subnet A _.-----------------------. subnet B ,--'' flow f1 `--. | ,-' ,----------------------------. `-----| +----+ <--+' ,-' overloaded `------. |`- +----+ | S1 | ---+--' ,---. ,---. link ,---. `+-->| D1 | +----+____| / \ / \ / \ |___+----+ +-+-++ |____( R1 )......( R2 .XXXXX. R3 )__,+-- ++-+-+ <--+---. \ / \ / \ / / |--> ---|--. `.`---'. `---' .'`-,----',' `-.`----.`. . ,-' ,---' flow f2 `--. \ `. flow f2Æ .-' / ,-' `. `---. -' _.-' / `--. `,---.'.-''_.--' ` /-----\--'' ( R4 ) \ / `---' Figure 12: pinning of all flows f1, f1', f2 and f2' on the routes Proust Expires û January 2005 [Page 15] Internet-Draft flow routing 8/2/2004 2. Extensions of the Forwarding plane 2.1 Overview A very schematic functional representation of a router might be depicted by the figure 13 below. Its role is to forward datagrams towards their final destinations via the transmission interfaces that interconnect it with other networking systems. For that purpose, any router operates at the same time on two functional planes: the control plane and the forwarding plane. Usually, the control plane operates in a shared time processing environment while the forwarding plane operates in near real time. In short, the control plane allows a router to adapt to network changes, in due time, with minimum impact on the transport of network traffic. The forwarding plane of a router is in charge of applying all needed processing to datagrams, alongside the forwarding function. For instance, it may include some network filtering mechanisms or some quality of service (QOS) management functions. +---------------------------------------+ | | | +-------------------+ | | | | | | | routing protocol | | control | | |----+ | plane | | (RIB) | | | | | | | | | +-------------------+ | | | ^ | | |---------|------------------|----------| | | | | forwarding | | yes v | plane | ,-----. +--------------+ | FORWARDED | /is the \ | | | DATAGRAM +------+--+ | /datagram \ no | forwarding | | +------+--+ | | |---+-->(intended to)--->| module |---+-->| | | | | | | \ a local / | | | | | | +------+--+ | \process/ | (FIB) | | +------+--+ RECEIVED | \ ? / | | | DATAGRAM | `---' +--------------+ | | | +---------------------------------------+ Figure 13: schematic functional representation of a router During the last decade, the core technology of routers has been improved by a factor of several orders of magnitude. Today, routers are flexible enough to apply all sorts of elementary processing while Proust Expires û January 2005 [Page 16] Internet-Draft flow routing 8/2/2004 still forwarding datagrams at transmission line rates. However, with regard to the principle of forwarding, there has been little change in the past. Most of the time the routing decision relies on a longest prefix matching search operation, which is based on the destination address consigned in the header of a datagram. While this forwarding mechanism has proven to be robust and scalable, it lacks some flexibility with regard to policy routing and load balancing. As described in the introduction section, the support of an adaptive load balancing service at flow level, implies a modification of the IP protocol, the forwarding information database (FIB), and the updating mechanism of the FIB. 2.2 A new option in the IP protocol The specification of a new option in the IP protocol allows some of the functional extensions required in the forwarding plane to be supported, namely the load balancing procedure through a modified route lookup algorithm and the collection and transport of route pinning information. This new option for the IP protocol is named "dynamic load balancing". It is assumed that the combination of a destination prefix and of a vector of route index values completely characterizes a route in an IP network based on a hop by hop forwarding mode. A route index value identifies a route among a set of routes associated to a given destination prefix in the FIB of a router. This assumption is reasonably acceptable on condition that the routing control plane of the network has converged. Therefore such characterization of a route followed by a datagram can be used to allocate some network resources to a flow. By analogy with the route recording optional mode, this new option of the IP protocol allows a vector of route index values to be recorded in the IP header when a datagram crosses the network for the first time or each time a topology change affects the route associated to its related flow. It is during this initial phase that network resources are allocated to flows, in other words that a flow is pinned onto a route selected for its high level of availability. Once the load balancing choice has been executed for a flow at each router along the route selected according to some quality indicators, each router simply forwards them according to a combination of the prefix destination and the route pinning information (a vector of route index values). As mentioned before, the forwarding procedure and mechanisms described hereafter first apply to bidirectional IP communications. Proust Expires û January 2005 [Page 17] Internet-Draft flow routing 8/2/2004 As routers are stateless, the route pinning information must be carried in the IP header of the datagrams. Furthermore, as the route pinning information is recorded in the IP header of a new flow when being forwarded initially, it must be sent backward to the sending terminal to be used for future sending. This is accomplished through a piggybacking procedure that operates between terminals. Each optional IP header would include two vectors of route index values, one relating to the flow being forwarded and one relating to the backward flow. Besides, each terminal manages a communication context enriched with that information. Anytime a terminal receives a new datagram, it might update the route pinning information stored in the related communication context. At each endpoint two finite state machines are defined. They implement a synchronization process that guarantees that only one vector of route index values is valid for a flow in the network at any time. As mentioned earlier, the treatment of this IP option differs according to the type of system being considered. In short, when a router processes this IP option, it may: - select a route according to some load balancing criteria - mark a route index value in the related field of the first vector contained in the optional IP header of a datagram - forward a datagram according to the destination prefix value and the related route index value contained in its IP header In short, when a terminal processes this IP option, it may: - record both vectors of route index values from received datagrams in the related communication context - synchronize the value of both vectors of route index values exchanged with the other endpoint of a bidirectional IP communication resulting in an unambiguous route pinning process - initialize both sequences of route index values in the optional IP header of sent datagrams The subsequent paragraphs detail the format of the optional IP header and the different treatments carried out by each system. 2.2.1 Format of the optional IP header This section describes the format of this new optional IP header called "dynamic load balancing", compliant with version 4 of the Internet Protocol. The IPv4 protocol specification allows for the definition of any optional IP treatment. To this end, it provides the definition of a generic optional header in its header format. The former is defined using TLV style coding. Hence, the first two bytes identify the Proust Expires û January 2005 [Page 18] Internet-Draft flow routing 8/2/2004 option type and the option value. Moreover, the maximum size of an optional header is limited to 40 bytes. The data structure of this new optional IP header is designed to contain two vectors of route index values, one vector index field and one flag field. This optional header is of maximum size. It is represented on the figure 14 below. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Option Type | Opt Length=40 | CVI | FF | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | | | | | | | FRIV | | | | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | | | | | | BRIV | | | | | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 14: format of the optional IP Header Thus, the IHL field of the basic IP header has a value of 15 when the Option Length field has a value of 40. The data field called Current Vector Index (CVI) is one byte long. It represents the current position within the vector called Forward Route Index Vector (FRIV). A router uses this index value to point to the route index value within the first vector (FRIV) that is to be associated to this router when forwarding a datagram along the pinned route associated to this flow. When a terminal sends a datagram using this option it is set to 1, then each router increments it by Proust Expires û January 2005 [Page 19] Internet-Draft flow routing 8/2/2004 one at each hop. When a datagram reaches its final destination, this field represents the length of the route followed by the datagram. The data field called Forward Route Index Vector (FRIV) is 18 bytes long. It represents the route pinning information or the sequence of route index values allocated to a new flow by the load balancing procedure while being forwarded by each router. Each route index value is encoded using 4 bits. Hence the FRIV vector can characterize a route of 35 hops maximum length. Due to the 4 bits provision, a route index value can vary between 0 and 15 integer values. The data field called Backward Route Index Vector (BRIV) is 18 bytes long. It represents the route pinning information of a flow piggybacked in the related backward flow of the bidirectional IP communication. It uses the same encoding convention as used for the FRIV vector. Each time a terminal sends a datagram, it should initialize the BRIV vector field in the optional header with the most updated information recorded in the related communication context. The data field called FF (Forward Flag) is 8 bits long. It represents different flags used by the pair of FSMs that synchronize the exchange of route pinning information at each endpoint of a bidirectional IP communication. 2.2.2 Role of the Terminal 2.2.2.1 Creation of a communication context When a terminal engages in a bidirectional IP communication it should create an associated communication context and two FSM for the purpose of managing the route pinning information conveyed in the optional IP header of the related datagrams. Namely, FRIV and BRIV vector values. Such a communication context characterizes both flows of the bidirectional IP communication by means of a combination of a pair of Internet sockets along with the content of the three fields defined for this IP option. To give an example, the communication context of a bidirectional IP communication between two terminals whose IP addresses are S and D, which, furthermore, relies on the transport protocol TCP and that uses port numbers p and pÆ, might be defined from the viewpoint of terminal S by the following combinations of information fields. - forward flow o source address: S . should be 4 bytes long Proust Expires û January 2005 [Page 20] Internet-Draft flow routing 8/2/2004 . set to the IP address of the local host o destination address: D . should be 4 bytes long . set to the IP address of the remote host o source port number: p . should be 2 bytes long . set by the application o destination port number: pÆ . should be 2 bytes long . set by the application o transport protocol number: TCP . should be 1 byte long . set by the application o forward route index vector: FRIV . should be 18 bytes long . initially set to the null vector, that is a vector full of zero integer values o current vector index: CVI . should be 1 byte long . it is a constant whose initial value is set to one integer value - backward flow o source address: D . should be 4 bytes long . set to the IP address of the local host o destination address: S . should be 4 bytes long . set to the IP address of the remote host o source port number: pÆ . should be 2 bytes long . set by the application o destination port number: p . should be 2 bytes long . set by the application o transport protocol number: TCP . should be 1 byte long . set by application o backward route index vector: BRIV . should be 18 bytes long . initially set to the null vector, that is a vector full of zero integer values From the viewpoint of terminal D, the communication context is obtained by inversion of the roles of receiver and sender. 2.2.2.2 Receiving a datagram Proust Expires û January 2005 [Page 21] Internet-Draft flow routing 8/2/2004 Whenever a terminal receives a datagram with this IP option enabled it may copy the content of the optional header in the related communication context. If such a context does not exist at the time the datagram is received, the terminal should create one prior to the copying operation. Whatever the case, the copy of the content of the optional header should be done as depicted hereafter. The BRIV vector contained in the optional header of the received datagram should be copied in the FRIV field associated to the backward flow of the related communication context. The FRIV vector contained in the optional header of the received datagram might be copied in the BRIV field associated to the forward flow of the related communication context. The change of the BRIV vector recorded in the communication context is placed under the control of a FSM described later in this section and represented on figure 16. 2.2.2.3 Sending a datagram Whenever a terminal sends a datagram with this IP option enabled, it should initialize both vectors FRIV and BRIV in the optional header with the last data recorded in the related communication context. If such a context does not exist at the time a datagram is sent, the terminal should create one prior to this initializing operation. Whatever the case, the initialization of both vectors in the optional header should be done as described hereafter. The FRIV vector contained in the optional header of the datagram about to be sent, should be initialized with the last recorded FRIV vector of the related communication context. The BRIV vector contained in the optional header of the datagram about to be sent should be initialized with the last recorded BRIV vector of the related communication context. Besides, the CVI index value contained in the optional header of the datagram about to be sent should be initialized with the 1 integer value. 2.2.2.4 Control of the uniqueness of the pinned routes Proust Expires û January 2005 [Page 22] Internet-Draft flow routing 8/2/2004 When a terminal initiates a bidirectional IP communication or when a topology change is affecting a route pinned to the flow of an established bidirectional IP communication, it becomes necessary that both terminals agree on using the same unique pair of route index vectors FRIV. If it is not the case, it may happen that both flows oscillate between several routes that present a sufficient level of availability. For that purpose, two FSMs are defined to guarantee that only one route is associated to each flow during the transitory period where the network dynamically determines how to load balance the flows among the different available routes. Two FSMs are started on each terminal for each active communication context. The first one controls the setting of the FF flag contained in the optional header of the sent datagrams. The second one controls the change of the BRIV vector in the communication context that can occur upon receiving a datagram. Initially the FSM that controls the FF flag starts in the state called "syn sent". As far as the FSM remains in this state, the terminal set the FF flag contained in the optional header to "syn" value in all sent datagrams. From the state called "syn sent", the FSM can switch to the state called "ack sent" as soon as the terminal receives a datagram. As far as the FSM remains in the state called "ack sent", all datagrams sent by the terminal contained a FF flag set to "ack" in the optional header. As far as the terminal receives a datagram whose value of the BRIV vector contained in the optional header is identical to the value of the FRIV vector recorded in the communication context, the FSM stays in the state called "ack sent" From the state called "ack sent", the FSM can switch to the state called "syn sent" as soon as the terminal receives a datagram whose value of the BRIV vector contained in the optional header differs from the value of the FRIV vector recorded in the related communication context. Proust Expires û January 2005 [Page 23] Internet-Draft flow routing 8/2/2004 ,-----. / syn \ ,->( sent )-. ,' \ / `. packet ; `-----' \ received : : && | | packet FRIV vector | | received of context : | has changed \ ,-----. ; \ / ack \ , `--( sent )<--' \ / packet `-----'. received ^ \ && / : FRIV vector ( | of context \ ; is unchanged \ / `------' Figure 15: FSM that controls the FF flag Initially the FSM that controls the change of the BRIV vector starts in the state called "BRIV vector of context is changing". In this state, the BRIV vector recorded in the related communication context can be changed. From the state called "BRIV vector of context is changing", the FSM can switch to the state called "BRIV vector of context is fixed" as soon as the terminal receives a datagram whose flag FF contained in the optional header is set to "syn". As far as the FSM remains in the state called "BRIV vector of context is fixed", the BRIV vector recorded in the related communication context should not be allowed to be changed. As far as the terminal receives a datagram whose flag FF contained in the optional header is set to "ack" and whose value of the FRIV vector is identical to the BRIV vector recorded in the related communication context, the FSM stays in the state called "BRIV vector of context is fixed". From the state called "BRIV vector of context is fixed", the FSM can switch to the state called "BRIV vector of context is changing" as soon as the terminal receives a datagram whose flag FF contained in the optional header is set to "ack" and whose value of the FRIV vector differs from the BRIV vector recorded in the related communication context. Proust Expires û January 2005 [Page 24] Internet-Draft flow routing 8/2/2004 ,-------. ,' `. / BRIV vector \ ( of context )-. ,--->\ is changing / `. / `. ,' \ ack ; '-------' : received | | syn && | ; received BRIV vector | ; of context | ; has changed : ,-------. ,' \ ,' `.<--' \ / BRIV vector \ `--- of context ) \ is fixed / `. ,' '------+' ack ^ \ received ; \ && ; : BRIV vector ( | of context \ ; is unchanged `-------' Figure 16: FSM that controls the change of the BRIV vector of context 2.2.3 Role of the Router From the viewpoint of a router, this IP option specifies a new forwarding mechanism that dynamically load balances the flows according to the availability of network resources. It relies on a modified FIB table where each route is described by a combination of the following elements: - destination prefix - route index - next-hop address - outgoing interface - cost value - load value 2.2.3.1 Overview of the forwarding algorithm Proust Expires û January 2005 [Page 25] Internet-Draft flow routing 8/2/2004 Basically, prior to any route lookup operation, this new forwarding mechanism operates a classification of flows between old and new ones. Its execution path differs according to the result of this prior operation. A received datagram is considered to belong to an old flow when the route index value contained in its optional header is different from zero. By convention, the value zero in a route index field of the optional header is used to characterize new flows entering the network. When a terminal initiates a bidirectional IP communication, it initializes the FRIV vector with zero values. Therefore, at this initial point, the routers that forward this datagram consider that it belongs to a new flow. Consequently, each router must execute the load balancing procedure followed by the route index recording operation of the forwarding algorithm. As soon as an IP communication is established between both endpoints, the FRIV and BRIV vectors conveyed in the optional header of the related datagrams should have been valued with route index values different from zero. Therefore, at this point, the routers that forward these datagrams consider that they belong to an old flow. Consequently, each router must execute the simple destination and route index-based route lookup procedure of the forwarding algorithm. 2.2.3.2 Detail of the forwarding algorithm Upon receiving a datagram on one of its interfaces, a router first has to check the validity of the IP header. Then it proceeds to read the destination address contained in the IP header of the datagram. In case the datagram is intended for the router itself, its forwarding algorithm should deliver it to its related process. On the contrary, based on the destination address, the router proceeds to a "longest prefix matching search" (LMS) operation in the FIB table. The result of the route lookup operation may determine that several routes are associated to the selected destination prefix. At this point, the forwarding algorithm differs according to the type of flow being forwarded. It determines to which type of flow the datagram belongs to, by examining a specific route index value contained in the optional IP header. It is the route index value of the FRIV vector that should be associated to this router. Its position is recorded in the dynamic field called CVI in the optional IP header. Thus the route index value linked to the current router is the FRIV[CVI] field value contained in the optional IP header. Proust Expires û January 2005 [Page 26] Internet-Draft flow routing 8/2/2004 In case the FRIV[CVI] route index value is zero, this means the datagram belongs to a new flow. Therefore the router must execute the load balancing procedure of the forwarding algorithm. Its purpose is to associate the new flow with a route that presents a high level of availability according to the current value of its load attribute. Thus, the procedure called "widest-enough shortest search" consists in selecting the shortest route with enough bandwidth among those that were pre selected during the preliminary LMS search operation. The bandwidth condition is assessed using the load attribute; it is satisfied when the load attribute of a route remains under an arbitrary threshold. Finally, the router marks the route index value that corresponds to the selected route in the FRIV[CVI] vector field of the optional header. Then, the CVI index value is incremented by one unity in the optional header and the datagram is forwarded along the selected route. In case the FRIV[CVI] route index value is different from zero, this means the datagram belongs to an already load-balanced flow. Therefore, the router checks the validity of the current route index value. To do so, it executes an "exact matching search" (EMS) operation based on the route index value among the routes that were pre selected during the preliminary LMS search operation. If it is the case, the CVI index value is incremented by one unity and the datagram is forwarded along the pinned route. If it is not the case, due to some network topology change or due to some misbehaving or malfunctioning terminal, the router reinitializes the route index values in the FRIV vector beginning at rank CVI up, to the end of the vector. Then it must proceed to a new load balancing operation before forwarding the datagram farther. This part of the forwarding algorithm is the same as described above. Proust Expires û January 2005 [Page 27] Internet-Draft flow routing 8/2/2004 The complete forwarding algorithm is represented in the flow chart of figure 17 below. +----------------+ |LMS route lookup| +-------+--------+ | | ,-V---. /is new \ no / flow ? \ yes +-------------------( FRIV[CVI] )--------------------+ | \ equal / | | \ to 0 ?/ | V `-----' | +----------------+ | |EMS route lookup| | +-------+--------+ | | | | | ,-V---. | /is the \ +-------------------+ V / pointed \ no | re-initialization | +----------------------+ ( route )----->| of FRIV vector |----->|widest-enough shortest| \ valid ? / | from CVI position | |route lookup | \ / | up to the end | +-----------+----------+ `-+---' +-------------------+ | |yes V | +------------------+ | |write the index | | |of the selected | | |route in FRIV[CVI]| | +--------+---------+ | | | | | | | | | | | +----------------+ | +--------------->| increment CVI |<-----------------+ +-------+--------+ | V +-----------------------+ | transmit the datagram | +-----------------------+ Figure 17: flow chart of the forwarding algorithm Proust Expires û January 2005 [Page 28] Internet-Draft flow routing 8/2/2004 2.2.4 Review of an Example The network topology, to which this example applies, is the one described in the introductory section. It is represented in figure 1. As mentioned earlier, the design of this architecture only applies to bidirectional IP communications. In that respect, we consider in this example a bidirectional IP communication that takes place between terminals S1 and D1. As mentioned before, the network is initially under loaded. Hence, the routing control plane has converged around a simple network routing state, where there exists only one available route. The primary route calculated to transport both flows of the newly created IP communication is built on the sequence of routers R1, R2 and R3. The figure 18 below details the processing stages of the IP option, as and when the datagram crosses the network. The high part of the diagram represents the network topology, whereas, the lower part depicts the evolution of the content of the optional header while a datagram is, little by little, forwarded through the network. +----+ ,---. ,---. ,---. +----+ | S1 |-------( R1 )-------( R2 )------( R3 )--------| D1 | +----+ `---' `---' `---' +----+ ^ +----+ ^ ^ ^ ^ +----+ ^ | | | | | | | | | | | | v v v v v v +---------+ +---------+ +---------+ +---------+ +---------+ |IP header| |IP header| |IP header| |IP header| |IP header| CVI | 1 | | 2 | | 3 | | 4 | | 4 | FRIV| 0,...,0 |-->| 1,0,..,0|->| 1,1,0,..|->|1,1,1,0,.|-->| 1,1,1,0,| BRIV| 0,...,0 | | 0,...,0 | | 0,...,0 | | 0,...,0 | | 0,...,0 | +---------+ +---------+ +---------+ +---------+ +---------+ | | | | | | | | | | | payload | | payload | | payload | | payload | | payload | | | | | | | | | | | +---------+ +---------+ +---------+ +---------+ +---------+ (A) (B) (C) (D) (E) Figure 18: forwarding of the first datagram of flow f1 along the primary route built on the sequence of R1, R2 and R3 routers At stage labeled (A), the terminal S1 is initiating the bidirectional IP communication with terminal D1. As such, the terminal creates a communication context, which is characterized by the set of values given below. Proust Expires û January 2005 [Page 29] Internet-Draft flow routing 8/2/2004 - forward flow o source address: S1 o destination address: D1 o source port number: p o destination port number: pÆ o transport protocol number: TCP o forward route index vector: FRIV = (0, 0,à, 0) o current vector index: CVI = 1 - backward flow o source address: D1 o destination address: S1 o source port number: pÆ o destination port number: p o transport protocol number: TCP o backward route index vector: BRIV = (0, 0,à, 0) Given that initialization step, the terminal packages the datagram to be sent with the initial values defined for that context and transmits it towards its access router, namely R1. At stage (B), the router R1 identifies that the datagram relates to a new flow since the FRIV[1] value is equal to zero. Thus, R1 carries out a full route lookup operation, independently. As its FIB contains only one route that matches the destination address of the datagram, it is unambiguously selected. Its route index value is one. According to the IP option processing, this index value is written in the FRIV vector at a place indicated by the CVI index value. Hence, at that time, FRIV[CVI] is set to one in the optional IP header. Moreover, the CVI index value is incremented by one unity and the datagram is transmitted towards the next-hop router R2. At stages (C) and (D), router R2 and R3 successively perform the same processing chain as depicted in stage (B) on that datagram. At stage (E), the terminal D1 that receives this first datagram creates its own context of communication session. This time it is characterized by the set of values given below. - forward flow o source address: D1 o destination address: S1 o source port number: pÆ o destination port number: p o transport protocol number: TCP o forward route index vector: FRIV = (0, 0,à, 0) o current vector index: CVI = 1 - backward flow Proust Expires û January 2005 [Page 30] Internet-Draft flow routing 8/2/2004 o source address: S1 o destination address: D1 o source port number: p o destination port number: pÆ o transport protocol number: TCP o backward route index vector: BRIV = (1, 1, 1, 0,à, 0) Doing so, it has first copied the FRIV vector contained in the optional header of the received datagram, to initialize the BRIV vector of the newly created communication context. Then, it has copied the BRIV vector contained in the optional header of the received datagram, to initialize the FRIV vector of the related communication context. Moreover, the constant CVI is set to the one integer value. At stage (F), the terminal D1 sends a datagram in response to terminal S1. In turn, the terminal D1 packages the datagram to be sent with the initial values defined for that context and transmits it towards its access router, namely R3. At stages (G), (H) and (I), router R3, R2 and R1 successively perform the same processing chain as depicted in stage (B) on that datagram. At stage (J), the terminal S1 that receives this first datagram updates its related context of communication session. Thus, at that time it is characterized by the set of values given below. - forward flow o source address: S1 o destination address: D1 o source port number: p o destination port number: pÆ o transport protocol number: TCP o forward route index vector: FRIV = (1, 1, 1, 0,à, 0) o current vector index: CVI = 1 - backward flow o source address: D1 o destination address: S1 o source port number: pÆ o destination port number: p o transport protocol number: TCP o backward route index vector: BRIV = (1, 1, 1, 0,à, 0) Doing so, it has first copied the FRIV vector contained in the optional header of the received datagram to update the BRIV vector of the related communication context. Then, it has copied the BRIV vector contained in the optional header of the received datagram, to update the FRIV vector of the related communication context. Proust Expires û January 2005 [Page 31] Internet-Draft flow routing 8/2/2004 +----+ ,---. ,---. ,---. +----+ | S1 |-------( R1 )-------( R2 )------( R3 )--------| D1 | +----+ `---' `---' `---' +----+ ^ +----+ ^ ^ ^ ^ +----+ ^ | | | | | | | | | | | | v v v v v v +---------+ +---------+ +---------+ +---------+ +---------+ |IP header| |IP header| |IP header| |IP header| |IP header| CVI | 4 | | 4 | | 3 | | 2 | | 1 | FRIV|1,1,1,0,.|<--|1,1,1,0,.|<-|1,1,0,.. |<-|1,0,..,0 |<--| 0,...,0 | BRIV|1,1,1,0,.| |1,1,1,0,.| |1,1,1,0,.| |1,1,1,0,.| |1,1,1,0,.| +---------+ +---------+ +---------+ +---------+ +---------+ | | | | | | | | | | | payload | | payload | | payload | | payload | | payload | | | | | | | | | | | +---------+ +---------+ +---------+ +---------+ +---------+ (J) (I) (H) (G) (F) Figure 19: forwarding of the first datagram of flow f1Æ along the primary route built on the sequence of R3, R2 and R1 routers. At stage (K), the terminal S1 is sending a second datagram to the terminal D1. Thus, it packages the datagram to be sent with the initial values defined for that context at that time and transmits it towards its access router, namely R1. At stage (L), the router R1 identifies that the datagram may relate to an already routed flow, since the FRIV[1] value is different from zero. Thus, R1 carries out a validation of this index value by performing the two stage route lookup operation. As its FIB still contains a valid route that, first, matches the destination address of the datagram and that second, possesses a route index value, that equals to the FRIV[1] value of that datagram, it is unambiguously selected as the route on which the flow is pinned. Once this is done, the router R1 only has to increment, by one unity, the CVI index value of the IP option header and transmit the datagram towards the next-hop router R2. At stages (M) and (N), router R2 and R3 successively perform the same processing chain as depicted in stage (I) on that datagram. At stage (O), the terminal D1 that receives this second datagram updates its related context of communication session. Thus, at that time it is characterized by the set of values given below. - forward flow o source address: D1 Proust Expires û January 2005 [Page 32] Internet-Draft flow routing 8/2/2004 o destination address: S1 o source port number: pÆ o destination port number: p o transport protocol number: TCP o forward route index vector: FRIV = (1, 1, 1, 0,à, 0) o current vector index: CVI = 1 - backward flow o source address: S1 o destination address: D1 o source port number: p o destination port number: pÆ o transport protocol number: TCP o backward route index vector: BRIV = (1, 1, 1, 0,à, 0) Doing so, it has first copied the FRIV vector contained in the optional header of the received datagram, to update the BRIV vector of the related communication context. Then, it has copied the BRIV vector contained in the optional header of the received datagram to update the FRIV vector of the related communication context. +----+ ,---. ,---. ,---. +----+ | S1 |-------( R1 )-------( R2 )------( R3 )--------| D1 | +----+ `---' `---' `---' +----+ ^ +----+ ^ ^ ^ ^ +----+ ^ | | | | | | | | | | | | v v v v v v +---------+ +---------+ +---------+ +---------+ +---------+ |IP header| |IP header| |IP header| |IP header| |IP header| CVI | 1 | | 2 | | 3 | | 4 | | 4 | FRIV|1,1,1,0,.|-->|1,1,1,0,.|->|1,1,1,0,.|->|1,1,1,0,.|-->|1,1,1,0,.| BRIV|1,1,1,0,.| |1,1,1,0,.| |1,1,1,0,.| |1,1,1,0,.| |1,1,1,0,.| +---------+ +---------+ +---------+ +---------+ +---------+ | | | | | | | | | | | payload | | payload | | payload | | payload | | payload | | | | | | | | | | | +---------+ +---------+ +---------+ +---------+ +---------+ (K) (L) (M) (N) (O) Figure 20: forwarding of the subsequent datagrams of flow f1 along the primary route built on the sequence of R1, R2 and R3 routers Proust Expires û January 2005 [Page 33] Internet-Draft flow routing 8/2/2004 2.3 FIB structure The support of this IP option relies on a new structure of the FIB. It is represented below on figure 21. +--------------------+-------+----------+-----------+------+------+ | Destination Prefix | Index | Next-Hop | Interface | Load | Cost | +--------------------+-------+----------+-----------+------+------+ | 192.168.1.0/24 | 1 | 10.1.1.1 | GE1/0 | 42% | 100 | +--------------------+-------+----------+-----------+------+------+ | 192.168.1.0.24 | 2 | 10.2.1.1 | GE1/1 | 32% | 150 | +--------------------+-------+----------+-----------+------+------+ | 192.168.1.0/24 | 3 | 10.3.1.1 | GE1/2 | 8% | 200 | +--------------------+-------+----------+-----------+------+------+ | 192.168.2.0/24 | 1 | 10.3.1.1 | GE1/2 | 20% | 100 | +--------------------+-------+----------+-----------+------+------+ Figure 21: extended structure of the FIB table The "Destination Prefix" field represents an IP prefix associated with a reachable underlying sub-network known within a routing domain. The "Index" field represents an identifier that unambiguously identifies a route, among the subset of FIB routes that share the same destination prefix. The "Next-Hop" field represents the IP address of the next-hop router to be used along the route. The "Interface" field identifies the output transmission link of the router to be used along the route. The "Load" field represents the maximum of the elementary traffic load that has been measured for each transmission link along the calculated route. This quantity is expressed as a percentage of an absolute bandwidth. The "Cost" field represents the total usage cost of resources along the route. It is calculated for a given topology by means of the Dijkstra's shortest path first (SPF) algorithm. 2.4 FIB update 2.4.1 RIB/FIB relationship At the control plane level, in addition to static routing a router can activate several dynamic routing processes like ISIS or BGP. According to the definition of a specific metric each routing process calculates independently the best routes that are stored in separate RIB tables. Due to the diversity of routing information sources, two Proust Expires û January 2005 [Page 34] Internet-Draft flow routing 8/2/2004 routes in different RIB tables may exist at the same time for a given destination prefix. In consequence of such an eventuality, the control plane of a router also possesses a central routing process that manages a central RIB. The latter contains the final set of valid routes that are used in the transmission plane. The central routing process selects the best routes by comparing their relative degrees of trustworthiness. This static criterion is defined according to some conventional engineering rules. It is assigned to the different routing processes by means of a manual configuration. The routes contained in the FIB table are a reflection of the set of the routes contained in the central RIB. The figure 22 below depicts the relationships between processes that control the updating of the different routing tables. +-----------------------------------------------------------+ | control plane | | | | +---------+ +--------+ +---------+ | | ,---. | dynamic | | static | | dynamic | ,---. | |( RIB )<-->| routing | |routing | | routing |<-->( RIB ) | | `---' | process | |process | | process | `---' | | +----.----+ +---+----+ +----.----+ | | `-._ | _.-' | | `-._ | _.-' | | +---v--v---v---+ | | | central | ,---. | | | routing |<-->( RIB ) | | | process | `---' | | +------^-------+ | | | | +----------------------------+------------------------------+ | forwarding plane | | | +------v-------+ | | | forwarding | ,---. | | | module |<-->( FIB ) | | | | `---' | | +--------------+ | | | +-----------------------------------------------------------+ Figure 22: relations between processes that control routing tables As a result of this functional organization, each time a route is added, modified or deleted from the central RIB, a similar updating action should be applied to the FIB. Proust Expires û January 2005 [Page 35] Internet-Draft flow routing 8/2/2004 Furthermore, according to the proposed architecture, it is envisioned that the load attribute of the routes must be updated on a periodic basis. The regular updating of this attribute is fundamental to enable the whole system to keep the global network traffic load under control. In this way, some feedback is introduced into this closed control loop. In a first step, the dynamic routing protocol measures the distribution of network traffic load and controls the state of the topology. In a second step, it updates its routing table according to the current state of both the distribution of traffic and the topology. In a third step, the central routing process updates the central RIB then the FIB table. In a forth step, the forwarding module transmits the traffic according to the updated routes of the FIB. This cycle should occur on a regular basis. How often depends on the dynamic level of the traffic and of the ratio of the bandwidth pipes in relation to the average bandwidth of flows. It is assumed that the bandwidth of individual flows should be very small compared to the available bandwidth provisioned in the core of the network. Thus, the routes of the FIB might be updated by means of two types of mechanism. The first one is invoked on a periodic basis. The second one is triggered when a topology change is detected or when the measured distribution of network traffic load requires a modification of the routing table. 2.4.2 Periodic updates One can distinguish two types of periodic updates of the FIB routing table. The first one occurs over long periods of time and is mandated by the standard operating of any link state dynamic routing protocol. The second one occurs over short periods of time (lasting only a few minutes), that is to say, when a new measure of some traffic load is received from a part of the network. It may modify the load attributes that relate to an existing set of calculated routes. To do so, the combination of the destination prefix and of the index attribute is used as a key identifier to unambiguously address a specific route in the FIB table. 2.4.3 Triggered updates As it is already the case with the standard specifications, a triggered update of the FIB may occur when some topology change is detected. Proust Expires û January 2005 [Page 36] Internet-Draft flow routing 8/2/2004 Besides, as a result of a periodic update of the load attribute in the FIB, some alternative routes might be added or suppressed. Therefore, it is another case of triggered FIB update. 3. Extensions of the Routing Control plane 3.1 Overview As depicted above, the support of an adaptive load balancing service relies on a periodic update of the dynamic load attribute of the routes. Moreover, when the topology changes or when the distribution of the network traffic sustains a significant modification, it is assumed that the dynamic routing protocol would update the content of its RIB, which in turn may lead to the updating of the FIB. For this purpose, the dynamic routing protocol must be extended to support such functional requirements. In this approach, the ISIS dynamic routing protocol is chosen for extension. However, the approach should be easily transposable to the other well-known link state dynamic routing protocol, namely OSPF. The standard specification of the ISIS routing protocol requires that all information about the topology of a given network is conveyed in only one control message called a Link State PDU (LSP). In this respect, each router generates only one valid LSP at a time that should contain complete up-to-date information about its local topology in relation to the global one. Moreover, the specification of the protocol requires that each router must regenerate its LSP on a periodic basis. Besides, whenever a topology change is detected, the routers of the network should propagate such an event in order to inform all routers. It is the router that detects the local topology change that should initiate the updating process by regenerating its LSP. Whatever the reason, when a new valid LSP is received, it should always take overall precedence in the Link State DataBase (LSDB). The convergence property of the routing protocol is ensured as soon as the LSDBs of the routers are synchronized with one another within a routing domain. Thus, most of the inner mechanisms of the ISIS routing protocol aim at synchronizing the LSDBs as quickly as possible. With regard to the first functional requirement, the ISIS protocol is extended to integrate a statistic module which would synthesize the traffic load measures of its transmission interfaces on a periodic basis. To this end, this module might rely on the SNMP protocol to collect the counter values coming from its transmission interfaces which are stored in the MIB-II database of the system. Moreover, the structure of the RIB managed by the ISIS routing protocol must be Proust Expires û January 2005 [Page 37] Internet-Draft flow routing 8/2/2004 extended to allow a traffic load parameter to be attributed to a route. Besides, as such synthesis information is by nature much more dynamic than the usual link state information, it is envisioned that such information may not be conveyed within usual LSP control messages. Therefore, the ISIS protocol is extended to support the propagation and management of a new control message called Link Load PDU (LLP). Each router would generate a single LLP on a periodic basis, which would be propagated by neighboring systems within a routing domain. For convenience, all valid LLPs would be stored in a separate database called Link Load DataBase (LLDB). As it is the case for the topology-driven route calculation process, it is fundamental that the LLDBs be synchronized as quickly as possible to ensure the coherence of the load balancing mechanism on the scale of a routing domain. To this end, the mechanisms defined for the synchronization of the LSDBs serve as a canvas to design those used to ensure an efficient synchronization of the contents of the LLDBs. Thus, two additional ISIS control messages are also defined that would enforce the synchronization of the LLDBs within a routing domain, namely the Complete Sequence Number Link Load PDUs (CSNLLP) and the Partial Sequence Number Link Load PDUs (PSNLLP). Besides, a new TLV is defined that allows any LLP to be referenced in both CSNLLP and PSNLLP control messages. Hence, whatever the type of the data link interface, these messages would allow ISIS neighbor routers to synchronize both their LLDBs and their LSDBs in due time. The synchronization of the LLDBs should occur whenever an LLP control message is generated. Therefore, the reception of any valid LLP control message would immediately trigger the update of the load attribute of the affected routes contained in the RIB table. For this purpose, the structure of the RIB table under control of the ISIS routing protocol must be extended. A new attribute called "path" is defined. It represents the sequence of next-hop routers along a given route from the viewpoint of a router. Therefore, the update of the traffic load attribute of any route calculated by the ISIS routing protocol can be achieved. It is the maximum of the traffic load of the transmission links that relate to the considered route, expressed as a percentage of an absolute bandwidth. With regard to the second functional requirement, the ISIS routing protocol should be extended to allow routes from the RIB to be added or removed due to some significant changes in the distribution of the traffic. Typically, following the processing of a newly received LLP control message as described above, such a procedure would look through the RIB table, and would determine on the basis of the load attribute, if some routes requires some update action in response. Proust Expires û January 2005 [Page 38] Internet-Draft flow routing 8/2/2004 For instance, such adaptive action might consist in adding an alternative route to the one being temporarily overwhelmed by traffic. Thus, this triggered update mechanism requires a route calculation procedure that takes into account several criteria. This path selection procedure should consider, at least, both the cost and the load attributes, and may also consider some routing policy constraints. However, this optimization issue is known as being non tractable and synonymous with a NP-complete issue. For this reason, the adopted approach relies on a simple heuristic procedure, which optimizes all criteria but according to a specific order, called "widest-enough shortest", since it first considers the bandwidth requirement and then the total cost of the path. Once this operation is completed, the ISIS routing protocol might proceed immediately to the update of the route contained in the FIB. For this purpose, the structure of the RIB table, under control of the ISIS routing protocol must be extended. A new attribute called "index" is defined. Combined with the destination prefix, it identifies, unambiguously, a route within the FIB. Proust Expires û January 2005 [Page 39] Internet-Draft flow routing 8/2/2004 3.2 Statistic module As shown in the figure 23 below, the statistic module might be integrated inside the ISIS routing process. +-------------------------------------------------------------+ | | | +-------------------+ | | ,----. | ISIS dynamic | ,----. | |( LSDB )<-->| routing protocol |<-->( LLDB ) | | `----' | | `----' | | | +----------+ | | | ,---. | |statistic |<--+--------+ | | ( RIB )<-->| | module | | | | | `---' | +----------+ | | | | +-------------------+ | | | ^ | | | | | | | | | | | ,---. +---------v---------+ +----v------+ ,----. | | ( RIB )<-->| central routing | | SNMP agent|<->( MIB2 ) | | `---' | protocol | | | `----' | | +---------^---------+ +-----------+ | +----------------------+--------------------------------------+ | | | | +---------v----------+ | | | | ,---. | | | forwarding module |<-->( FIB ) | | | | `---' | | | | | | +--------------------+ | | | +-------------------------------------------------------------+ Figure 23: ISIS routing process in relation to an SNMP agent Moreover, for convenience, it may use the information stored in the MIB-II database as it is a standard that should be included in a router. The MIB-II database specification defines a set of objects that provides information about the operating state of a network device such as a router. These objects are gathered into several groups. With regard to the functional purpose of the statistic module, the group called "interface" is of particular interest. It defines several objects that supply raw statistical information about the operating state of each transmission interface of a router. For instance, the "ifTable" sub-group contains valuable counters that record statistical information about the input and output traffic observed on each Proust Expires û January 2005 [Page 40] Internet-Draft flow routing 8/2/2004 transmission interface as time goes by. Moreover, these absolute values are expressed either in term of number of datagrams (ifInUcastPkts, ifOutUcastPkts, etc) or in bytes (ifInOctets, ifOutOctets). Thus, the statistic module could request the values of these counters on a regular basis. Given the maximum bandwidth resource of each transmission interface, it could calculate the resulting synthesis of the traffic load on its interfaces, experimented over a fixed period of time. Such a synthesis of results could be expressed as a percentage of the absolute bandwidth. Besides, these syntheses results could be calculated over a well-chosen period of time. As an example, three periods of time are proposed for the experiment. A short time period might be considered as about 2 minutes. A medium time period might be considered as about 5 minutes. A long time period might be considered as about 30 minutes. At best, the duration of the period of time should be a tunable parameter. In some way, this parameter controls the efficiency of the router-based closed control loop, therefore, it might enable a network operator to keep some control over the dynamic of its network system. Among the factors that influence the dynamic state of a network system are the distribution of the traffic or the ratio between the average flow bandwidth and the average available bandwidth provisioned in the core network. Besides, when the statistic module starts collecting measures about the traffic load, it should initialize all its synthesis results with zero. 3.3 RIB structure As mentioned before, the structure of the RIB table under control of the ISIS routing process should be extended along three axes. First, the load attribute is added to the RIB table to reflect the level of traffic load that is experimented along a given route. Second, the path attribute allows the load attribute of any given route, to be updated efficiently and unambiguously. Third, the index attribute serves as a key identifier, when combined with a destination prefix. Therefore, it allows the FIB table to be updated reliably and can be used as a routing information indication. Proust Expires û January 2005 [Page 41] Internet-Draft flow routing 8/2/2004 +---------------+-----+--------+---------+------------+-----+------+ | Destination |Index|Next-Hop|Interface| Path |Load | Cost | | Prefix | | | | | | | +---------------+-----+--------+---------+------------+-----+------+ |192.168.1.0/24 | 1 |10.1.1.1| GE1/0 | nh1,nh2,.. | 42% | 100 | +---------------+-----+--------+---------+------------+-----+------+ |192.168.1.0.24 | 2 |10.2.1.1| GE1/1 | nh3,nh4,.. | 32% | 150 | +---------------+-----+--------+---------+------------+-----+------+ |192.168.1.0/24 | 3 |10.3.1.1| GE1/2 | nh5,nh6,.. | 8% | 200 | +---------------+-----+--------+---------+------------+-----+------+ |192.168.2.0/24 | 1 |10.3.1.1| GE1/2 | nh1,nh2,.. | 20% | 100 | +---------------+-----+--------+---------+------------+-----+------+ Figure 24: extended structure of the RIB table The "Destination Prefix" field represents an IP prefix linked to a reachable underlying sub-network known within a routing domain. The "Index" field represents an identifier that unambiguously identifies a route among the subset of routes that share the same destination prefix. The "Next-Hop" field represents the IP address of the next-hop router to be used along the route. The "Interface" field identifies the output transmission link of the router to be used along the route. The "Path" field represents the sequence of next-hop routers that identify, unambiguously, the set of transmission and switching resources associated to a given route. The "Load" field represents the maximum of the elementary traffic load that has been measured for each transmission link along the calculated route. This quantity is expressed as a percentage of an absolute bandwidth. The "Cost" field represents the total usage cost of resources along the route. It is calculated for a given topology by means of the Dijkstra's shortest path first (SPF) algorithm. 3.4 LLP control messages The need of an additional ISIS control message dedicated to the transport of statistic traffic load information is first motivated by stability reasons. That way, the routing process clearly separates the processing of the topology-related information from the processing of the traffic load information. The former type of information is assumed to be quite stable over time. It may be stored without any change in the LSDB database for quite a long period of time (from 20 minutes up to 18 hours). Proust Expires û January 2005 [Page 42] Internet-Draft flow routing 8/2/2004 To the contrary, the latter type of information is much more dynamic by nature and requires to be regenerated by all routers at a more important pace. The order of magnitude of its period of refreshing might be as little as a few minutes. Moreover, from the network bandwidth optimization viewpoint, it might be more efficient to use a separate control message dedicated to the transport of the traffic load related information. In compensation, the routing process must be extended to deal with a new control message, namely, the Link Load PDU (LLP). With regard to the LAN environments, it is assumed that the transformation of the topology, which is used to synchronize the LSDBs efficiently, should also be applied to synchronize the LLDB. That is to say, any fully meshed topology should be transformed into a star topology by means of a pseudonode router located at the center of the star. Since this router is only an abstraction, the cost and traffic load values associated with its transmission links, are equal to zero. In addition, as the standard specification of the ISIS routing protocol defines two routing levels designated by the acronyms L1 and L2, it should be necessary to define one LLP control message for each routing level. However, as this specification is still in a very early development stage, it is designed for only one routing level. Besides, it is a common practice for network operators to deploy ISIS by means of only one routing level, namely, the routing level 2 that is synonymous of a backbone routing level. Therefore, this Internet Draft focuses on extending ISIS routing capabilities at routing level 2 only. Proust Expires û January 2005 [Page 43] Internet-Draft flow routing 8/2/2004 No. of Octets +--------------------------------+ | | | Common Fixed ISIS Header | 8 | | +--------------------------------+ | PDU Length | 2 +--------------------------------+ | Remaining Lifetime | 2 +--------------------------------+ | LLP ID | ID Length + 2 +--------------------------------+ | Sequence Number | 4 +--------------------------------+ | Checksum | 2 +--------------------------------+ |RES|RES|RES| IS Type | 1 +================================+==================== | | | Variable Length Fields | Variable Length | | +--------------------------------+ Figure 25: format of the LLP control message The field called "PDU Length" represents the total length of the Link Load PDU including the header. The field called "Remaining Lifetime" represents the validity delay of the Link Load PDU expressed in seconds. The field called "LLP ID" represents the identifier of the LLP; it is made of the concatenation of three data fields, the "Source ID", the "Pseudonode ID" and the "LLP number". The "Source ID" field might be six bytes long. It is usually characterized by an IP loopback address that may identify, unambiguously, a router within a routing domain. The "Pseudonode IP" field represents an integer value. It is one byte long. A non zero value identifies a pseudonode router, while a zero value identifies a real router. The "LLP number" field represents the fragment number of the LLP, in case fragmentation processing is needed. It is also one byte long. When no fragmentation occurs, this field is filled with zero. The field called "Sequence Number" represents the sequence number of the Link Load PDU. Proust Expires û January 2005 [Page 44] Internet-Draft flow routing 8/2/2004 The field called "Checksum" represents a control sum that covers all the bytes of the Link Load PDU included between the "Source ID" field and the end the message. The field called "IS Type" represents the level of routing (L1 only, L2 only, or L1L2) which this LLP belongs to. The variable length field represents all the elements of information conveyed in this LLP. They are represented using TLV coding format. An LLP control message must include exactly one TLV code of type 1. The TLV code of type 1 characterizes the "area address", to which the router that originates this LLP belongs. Moreover, an LLP control message may include exactly one TLV code of type 134 and one TLV code of type 133. The TLV code of type 134 characterizes the "router ID", to which the router that originates this LLP refers. The TLV code of type 133 allows some authentication information to be associated with an LLP control message. Besides, an LLP control message may include as many TLV codes of type 22 as necessary. The TLV code of type 22 allows a transmission link to be defined extensively. This code allows some sub-TLV codes to be specified. In particular, an LLP control message may encapsulate some sub-TLV codes of types 6, 8 and 9 in its TLV codes of type 22. The sub-TLV code of type 6 allows an IPv4 address associated with a transmission interface to be defined. The sub-TLV code of type 8 allows an IPv4 address associated with an ISIS neighbor interface to be defined. The sub-TLV code of type 9 allows a maximum link bandwidth associated with a transmission interface to be defined. Moreover, in order to convey the statistic information about the traffic load of any transmission link it is necessary to define additional sub-TLV codes. Thus, it is proposed to define 3 additional sub-TLVs that would enable a network operator to finely configure the granularity of information collected in the LLDBs. A sub-TLV called "Short Term Link Load" could be used to represent the level of traffic load measured over the last two minutes. The sub-TLV called "Medium Term Link Load" could be used to represent the level of traffic load measured over the last five minutes. The sub-TLV called "Long Term Link Load" could be used to represent the level of traffic load measured over the last 30 minutes. Proust Expires û January 2005 [Page 45] Internet-Draft flow routing 8/2/2004 3.5 Synchronization of the LLDB As mentioned before, the mechanisms used to synchronize the LLDBs are identical to the ones used for the synchronization of the LSDBs. For independence and flexibility reasons, those mechanisms are duplicated. Thus, two additional ISIS control messages should be defined for a given routing level, a Complete Sequence Number Link Load PDU (CSNLLP) and a Partial Sequence Number Link Load PDU (PSNLLP). As mentioned before, this Internet Draft focuses on extending ISIS routing capabilities at routing level 2 only. The processing of the CSNLLP and PSNLLP control messages should be the same as the ones defined for the CSNP and PSNP control messages respectively. In brief, the procedure uses the same selective flooding mechanism to quickly broadcast any new LLP message generated by a router within a routing domain. With the provision that the received LLP is valid and includes more recent information than the one contained in its LLDB, a router should retransmits this LLP through all its interfaces except the one whose LLP message was received from. Such condition is based both on the sequence number and the remaining lifetime defined in the header of the LLP messages. Besides, in order to guarantee the delivery of LLP messages to all routers within a routing domain, the procedure checks periodically that all neighboring LLDBs are identical. Depending on the type of transmission link, this procedure is more or less complex. In LAN subnetwork environment, it is based on the concept of pseudonode router which logically transforms a broadcast subnetwork into a simple star of point to point interconnection links. Thus, all exchanges of PSNLLP and CSNLLP messages are conducted between the pseudonode router and the other neighboring routers. The pseudonode router is designated among all neighboring routers by means of a simple election mechanism. This artifice reduces the control traffic load on the LAN and quickens the synchronization period of time. Furthermore, to cope with the contention issue inherent in LAN subnetworks accesses, the pseudonode router regularly broadcasts a CSNLLP message on the LAN. This message includes a summary of its LLDB by listing all the LLP headers recorded in its database. Hence, each neighboring router can check the consistency of its own LLDB in relation to the one of the pseudonode router. In point to point subnetwork environment, it is based on a simple exchange of CSNLLP and PSNLLP messages that serve as query and response messages. Once the LLDB databases are synchronized, no more exchange of message is necessary. Proust Expires û January 2005 [Page 46] Internet-Draft flow routing 8/2/2004 However, while the body of CSNP and PSNP control messages refer to LSP messages, the body of CSNLLP and PSNLLP control messages should refer to LLP messages. To this end, a new TLV code is needed to identify a list of LLP identifier entries. It is intentionally called "LLP Entries" by analogy with the TLV code of type 9. Each LLP identifier entry would be composed of an LLP ID, a remaining lifetime value, a sequence number value, and a checksum value. The format of such TLV code is represented in figure 26 below. No. of Octets +--------------------------------+ | CODE | 1 +--------------------------------+ | LENGTH | 1 +--------------------------------+ | VALUE | LENGTH +--------------------------------+ | REMAINING LIFETIME | 2 +--------------------------------+ | LLP ID | 8 +--------------------------------+ | LLP SEQ NUMBER | 4 +--------------------------------+ | CHECKSUM | 2 +--------------------------------+ : : : : +--------------------------------+ | REMAINING LIFETIME | 2 +--------------------------------+ | LLP ID | 8 +--------------------------------+ | LLP SEQ NUMBER | 4 +--------------------------------+ | CHECKSUM | 2 +--------------------------------+ Figure 26: new TLV code called "LLP entries" Proust Expires û January 2005 [Page 47] Internet-Draft flow routing 8/2/2004 The format of both CSNLLP and PSNLLP control messages are represented in the figures hereafter. No. of Octets +--------------------------------+ | | | Common Fixed ISIS Header | 8 | | +--------------------------------+ | PDU Length | 2 +--------------------------------+ | Source ID | ID length + 1 +--------------------------------+ | Start LLP ID | ID Length + 2 +--------------------------------+ | End LLP ID | ID Length + 2 +================================+==================== | | | Variable Length Fields | Variable Length | | +--------------------------------+ Figure 27: format of the CSNLLP control message The field called "PDU Length" represents the total length of the CSNLLP control message including its header. The field called "Source ID" usually represents a system address that unambiguously identifies the router that originates this control message. In most cases, the "ID length" might be six bytes long. Usually, the "Source ID" value is obtained by concatenation of an IP loopback address and an integer value named the "Pseudonode IP". The latter is one byte long. A non zero value identifies a pseudonode router, while a zero value identifies a real router. The field called "Start LLP ID" represents the minimum value of the LLP ID identifiers, included in the list of LLP identifier entries that follows in the body of the CSNLLP control message. The field called "End LLP ID" represents the maximum value of the LLP ID identifiers, included in the list of LLP identifier entries that follows in the body of the CSNLLP control message. The variable length fields might contain one TLV code of type 133 and one or several TLV codes of the new type called "LLP entries". Proust Expires û January 2005 [Page 48] Internet-Draft flow routing 8/2/2004 No. of Octets +--------------------------------+ | | | Common Fixed ISIS Header | 8 | | +--------------------------------+ | PDU Length | 2 +--------------------------------+ | Source ID | ID length + 1 +================================+==================== | | | Variable Length Fields | Variable Length | | +--------------------------------+ Figure 28: format of the PSNLLP control message The field called "PDU Length" represents the total length of the PSNLLP control message including its header. The field called "Source ID" usually represents a system address that unambiguously identifies the router that originates this control message. In most cases, the "ID length" might be six bytes long. Usually, the "Source ID" value is obtained by concatenation of an IP loopback address and an integer value named the "Pseudonode IP". The latter is one byte long. A non zero value identifies a pseudonode router while a zero value identifies a real router. The variable length fields might contain one TLV code of type 133 and one or several TLV codes of the new type called "LLP entries". 3.6 RIB update 3.6.1 Periodic updates The standard specification of the ISIS routing protocol requires that each router within a given routing domain should regenerate and propagate its LSP on a periodic basis. This period of time is called the maximumLSPGenerationInterval and may vary between 20 minutes and 18 hours. Therefore, on receipt of such LSP, a router should update its LSDB which might, in turn, lead to the updating of its RIB. The present specification of the extension capabilities of the ISIS routing protocol requires that each router should regenerate and propagate its LLP on a periodic basis. This period of time is called the maximumLLPGenerationInterval and may vary between 2 and 30 minutes. Therefore, on receipt of such LLP, a router should update its LLDB which might in turn lead to the updating of its RIB. Proust Expires û January 2005 [Page 49] Internet-Draft flow routing 8/2/2004 This updating operation consists only in updating the load attribute of the affected routes. Nevertheless, though being an independent operation, such periodic update might also be followed by a trigger update that may add or suppress an alternative route due to a significant change in the distribution of the network traffic load. 3.6.2 Triggered updates As said above, such an update of the load attribute may trigger the withdrawal or the addition of an alternative route. Anytime, the load attribute update operation that corresponds to the processing of a newly received LLP message, is carried out, the router should launch a control process to check that no route in the RIB is overloaded or under-loaded. To this end, it uses two thresholds: a minimum and a maximum. Anytime, the level of the load attribute of a route is below the minimum or above the maximum threshold, the router should retrieves all the routes that share the same destination prefix as the one linked to this route from its FIB. The decision of adding an alternative route into the RIB would be based on the following set of conditions: - if the load attribute of a route is greater than a maximum threshold value, then, - if the load attribute of all the other routes that share the same destination prefix, shows that these routes are also in a state of traffic overload, then, - if the number of active routes (for instance, such a limit may have a value of 15) that share this destination prefix is still under a limit fixed by a routing policy constraint, then, o a new route might be added into the RIB The decision of withdrawing an alternative route from the RIB would be based on the following set of conditions: - if the load attribute of a given route in the RIB is smaller than a minimum threshold value, then, - if that given route is not the primary route also called SPF route (i.e. a route which has an index value of 1), then, - if the load attribute of another non primary route among those that share the same destination prefix, shows that it is also in a state of traffic under load, then, - if the total cost of this given route is higher than the total cost of this other non primary route being in a state of traffic under load, then, o this given route might be withdrawn from the RIB Proust Expires û January 2005 [Page 50] Internet-Draft flow routing 8/2/2004 In no case, should a valid primary route be suppressed due to a low level of traffic load. Moreover, in order to avoid both decisional procedures to trigger each other, which may result in a sequence of addition and withdrawal operations, once a shortest alternative route is added it is no longer suppressed according to this mechanism. On the contrary, a non shortest alternative route could be suppressed due to a low level of its load attribute. Anytime an LLP control message is correctly processed in the LLDB and RIB databases, each router would walk through its RIB and look for the applicability of the above set of conditions. The figure 29 below shows the dynamics of this decision process in relation to the load attribute. at this time a new route is created | route | traffic ^ | at this time a route might be suppressed load | v | | ,-. | | ; \ _ | MAX 30%|________;____:______/_\___|_____________ | / : ,' \ | | / `--' : | | ,' | | | / | | | ; : v / MIN 3%|__|______________________\____/_________ | | `--' |_______________________________________> time Figure 29: dynamics of the process of creation/suppression of a route Besides, according to the standard specification of the ISIS routing protocol whenever a topology change is detected, it is advertised within a given routing domain by means of the propagation of a new LSP. As a result, the LSDB table is updated which in turn might lead to the updating of the RIB table. Anytime a topology change affects the RIB table, the ISIS routing protocol should also consider its impact on the alternative routes. To this end, all routes that need to be recalculated should be identified. The ones that are not valid anymore should be suppressed from the RIB as soon as possible. Proust Expires û January 2005 [Page 51] Internet-Draft flow routing 8/2/2004 3.7 Route calculation algorithm The route calculation procedure relies both on the LSDB and the LLDB databases. Whenever a route must be calculated, the route calculation procedure is carried out either in one step or in two steps according to the type of route. If the RIB database no longer contains any more a valid primary route (also named SPF route) associated with the given destination prefix, then the calculation procedure must calculate it by means of a SPF algorithm based on the overall LSDB database. The index value attributed to a primary route must always be set to one. To the contrary, if the RIB database still contains a valid primary route associated with the given destination prefix, then the route calculation procedure must be carried out in two steps. First, it selects, from the LSDB database, the link transmissions that are able to sustain a certain amount of traffic load. On the basis of the information stored in the LLDB, this procedure logically suppresses from the LSDB all transmission links, whose traffic loads are greater than a maximum threshold value. Second, based on the resulting reduced topology, this route calculation procedure would work out a shortest path according to the second metric being considered, namely, the cost. This metric is associated to any transmission link and represents the cost of using a transmission link. Usually, the value associated with the cost attribute of a transmission link is inversely proportionate to its maximum bandwidth. Therefore, the more bandwidth associated with a transmission link, the less it costs to use it. Finally, if the total cost of the calculated route is not exceeding a certain limit value fixed to comply with a routing policy consideration, then the route is activated in the RIB of the ISIS routing protocol. 4. Ongoing Experiment At the present time, an implementation of the overall architecture has been started. The testbed is composed of several PC-based routers and terminals that rely on the Linux operating system. Besides, all developments that relate to the ISIS routing protocol are based on a commercial distribution of the Zebra routing protocol stack. Proust Expires û January 2005 [Page 52] Internet-Draft flow routing 8/2/2004 5. Conclusion The architecture presented in the present Internet Draft aims at load balancing flows among multiple routes that are believed to be slightly loaded. Basically, the approach relies on several functional extensions that affect both the IP protocol and an intra- domain link state dynamic routing protocol, such as, ISIS. The former extensions enable flow to be load balanced according to a criterion synonymous with a level of resource availability. Besides, thanks to an end-to-end piggybacking mechanism, those extensions of the IP protocol allow flows, in the course of time, to be pinned along the routes. The latter extensions enable multiple routes associated with a given destination prefix to be calculated according to multiple criteria, such as, the level of bandwidth availability and the total cost. In addition to the shortest path first routes, the routing protocol is able to manage alternative routes that would use available resources. To this end, the ISIS routing protocol is extended to maintain a representation of the current distribution of the network traffic load on a short term periodic basis. As the solution applies to flow level, in term of performance and efficiency, it possesses a useful level of granularity. Besides, the solution is stateless, hence, it is in some ways scalable and secure friendly. Moreover, thanks to the route pinning mechanism, the load-balancing service should exhibit some good stability properties. In terms of simplicity, the implementation of the end-to-end piggybacking mechanism should be of low cost. However, the support of the designed forwarding mechanism in the routers might require special care. With regard to this concern, it is expected to get some feedback from the experimental field in the future. Proust Expires û January 2005 [Page 53] Internet-Draft flow routing 8/2/2004 IANA Considerations This draft requires the use of a new IP option and, therefore, a new option number from the number space within the IP protocol. Moreover, it also requires the use of new ISIS control messages and, therefore, new PDU type values from the number spaces within the specification of the ISIS routing protocol. Besides, this draft also requires the use of new sub-TLVs and, hence, new code values from the number spaces within the specification of the ISIS TLV type number 22 named "Extended IS Reachability TLV". This section explains the logic used by the author to choose the most appropriate number space for each new entity, and is intended to assist in the determination of any final values assigned by IANA. New IP Option The Internet Protocol has provisions for optional header fields identified by an option type field. As the IP option defined in this draft must be copied in the fragments, the most significant bit of the one byte length option field must be set to one. Besides, as this option belongs to the class of control messages, the next two significant bits must be set to zero. Finally, the option type within the class 0 is coded using the 5 least significant bits of the option field. At the time of writing this draft, the number 25 is the next available value. Therefore, the proposed value for the IP option defined in this draft is the decimal number 153. New PDU Types The type of any ISIS control message is defined in the fixed ISIS common header of eight bytes long. Besides, it is coded in the "PDU type" field that is 5 bits long. As the time of writing this Internet Draft, it appears that values 30, 35 and 37 of the "PDU type" field are available. Thus, with regard to the code value needed to be associated with the definition of the PDU type named LLP at routing level 2, the proposed value is 30. Besides, it is proposed to use the number 35 to be associated with the definition of the PDU type named CSNLLP at routing level 2. Finally, it is proposed to use the number 37 to be associated with the definition of the PDU type named PSNLLP at routing level 2. New TLV Code The TLV codes are included in the variable length fields of the different ISIS control messages. As its acronym implies, the type of any TLV code is defined in the first field of this standard data structure. It is coded by means of one byte. Proust Expires û January 2005 [Page 54] Internet-Draft flow routing 8/2/2004 This draft requires the use of a new TLV code called "LLP Entries". Thus, by analogy with the TLV code of type 9, it is proposed to associate the value of 19 with this TLV code. New sub-TLV Codes The TLV code of type 22, called "Extended IS Reachability TLV", enables the characterization information of a transmission link being advertised in an LSP to be extended. It is specified in [ISIS-TE] and relies on the concept of sub-TLV for the purpose of traffic engineering parameters definition. This draft also requires the use of three new sub-TLVs and, hence, three new code values from the number space within the specification of the TLV type number 22. At the time of writing this document, values 19, 20 and 21 are available from this sub-TLV number space. Hence, it is proposed to associate the values 19, 20 and 21 with the sub-TLVs named "Short Term Link Load", "Medium Term Link Load" and "Long Term Link Load" respectively. Security Considerations As mentioned earlier, a terminal might misbehave when processing the IP option described in this document. For instance, a terminal could send a datagram with an inappropriate sequence of index values. In that case, the routers along the route would detect such anomaly and, therefore, would proceed to the remarking of the FFIV vector. Therefore, from the viewpoint of a terminal, it is useless to spoof the FFIV vector, since such practice would finally end up rerouting the flow. Besides, as the routes are calculated independently of the flows by means of a shortest path algorithm, it should be impossible to introduce a routing loop, by means of a manipulation of the FFIV vector values. Normative References [RFC2026] Bradner, S., "The Internet Standards Process û Revision 3", BCP 9, RFC 2026, October 1996. [BCP14] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [FYI18] Malkin, G., "Internet Users' Glossary", BCP 18, RFC 1983, August 1996. [ISIS] ISO 10589, "Intermediate System to Intermediate System Intra- Domain Routeing Exchange Protocol for use in Conjunction with the Proust Expires û January 2005 [Page 55] Internet-Draft flow routing 8/2/2004 Protocol for Providing the Connectionless-mode Network Service (ISO 8473)" [Also republished as RFC 1142] Informative References [ISIS-IETF] Callon, R.W., "Use of OSI IS-IS for routing in TCP/IP and dual environments", RFC 1195, December 1990. [INT-ARCH1] Carpenter, B., "Architectural Principles of the Internet", RFC 1958, June 1996. [INT-ARCH2] Bush, R., Meyer, D., "Some Internet Architectural Guidelines and Philosophy", RFC 3439, December 2002. [BCP41] Floyd, S., "Congestion Control Principles", September 2000. [QOSARCFRWK] Huston, G., "Next steps for the IP QoS Architecture", RFC 2990, November 2000. [QOSRFRWK] Crawley, E., Nair, R., Rajagopalan, B., Sandick, H., "A framework for QoS-based Routing in the Internet", RFC 2386, August 1998. [TE-REQ] Awduche, D., Malcolm, J., Agogbua, J., O'Dell, M., and McManus, J., "Requirements for Traffic Engineering Over MPLS,", RFC 2702, September 1999. [OMP] Villamizar, C., Li, T., "IS-IS Optimized Multipath (ISIS- OMP)", Work in Progress, January 2002. [QOSPF] Apostolopoulos, G., Williams, D., Kamat, S., Guerin, R., Orda, A., Przygienda, T., "QOS routing mechanims and OSPF extensions ", RFC 2676, August 1999. [OSPF] Moy, J., "Open Shortest Path First version 2", STD 54, RFC 2328, April 1998. [E2E] Kempf, J., Austein, R., "The Rise of the Middle and the Future of End to End: Reflections on the evolution of the Internet Architecture", Work in Progress, October 2003. [IAB-FUTURE] Atkinson, R., Floyd, S., "IAB Concerns and Recommendations Regarding Internet Research and Evolution", Work in Progress, October 2003. [IRTF-ROUTREQ] Doria, A., Davies, E., Kastenholtz, F., "Requirements for Internet Domain Routing", Work in Progress, December 2003. Proust Expires û January 2005 [Page 56] Internet-Draft flow routing 8/2/2004 [IRTF-RR] http://www.irtf.org/rrg/ [ISIS-TE] Smit, H., Li, T., "IS-IS extensions for Traffic Engineering", Work in Progress, August 2003. Acknowledgments The author would like to thank No‹l Cantenot of the R&D division of France Telecom, and Samuel Clouet who is currently doing an internship at the R&D division of France Telecom, for their fruitful comments on this work. Changes Changes from -00 version 1. add descriptions of two FSM (Finite State Machines) that controls the uniqueness of the selected routes 2. modification of the route recording mode of the forwarding algorithm when a topology change is detected Author's Address Christophe Proust France Telecom - R&D Division 42 rue des Coutures 14066 Caen cedex 4 - FRANCE Phone: +33 2 31759308 Email: christophe.proust@francetelecom.com IPR Notice The IETF has been notified of intellectual property rights claimed in regard to some or all of the specification contained in this document. For more information consult the online list of claimed rights. Full Copyright Statement Copyright (C) The Internet Society (2003). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any Proust Expires û January 2005 [Page 57] Internet-Draft flow routing 8/2/2004 kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English. Proust Expires û January 2005 [Page 58]