Network Working Group Internet Draft C. Proust Document: draft-proust-flow-routing-00.txt France Telecom Expires: August 2004 February 2004 An approach for Routing at Flow level draft-proust-flow-routing-00.txt Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026 [RFC2026]. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/ietf/1shadow-sites.txt Abstract This document proposes an approach that aims at load balancing flows among multiple routes that are estimated to be slightly loaded. Mainly, it designs a router-based, end-to-end, congestion control mechanism by means of a piggybacking procedure which is distributed both in routers and terminals. This document proposes a set of functional extensions that might be applied to the IPv4 protocol and to the ISIS routing protocol. Proust Expires - August 2004 [Page 1] Internet-Draft Routing at Flow level 2/5/2004 Table of Contents 1. Introduction...................................................3 1.1 A Brief History............................................3 1.2 Common Understanding of the Requirements...................4 1.3 Some Strong Architectural Beliefs..........................4 1.4 Design Philosophy..........................................5 1.5 Overview of the Architecture...............................6 1.6 Review of an Example......................................10 2. Identification of IP flow forwarding states...................16 2.1 Overview..................................................16 2.2 Format of the IP option header............................18 2.3 Role of the Terminal......................................20 2.4 Role of the Router........................................23 2.5 Review of an Example......................................26 3. Extension of the Forwarding plane.............................31 3.1 Overview..................................................31 3.2 FIB structure.............................................33 3.3 FIB update................................................33 3.4 Forwarding algorithm......................................36 4. Extension of the Routing Control plane........................38 4.1 Overview..................................................38 4.2 Statistic module..........................................41 4.3 RIB structure.............................................42 4.4 LLP control messages......................................43 4.5 Synchronization of the LLDB...............................47 4.6 RIB update................................................50 4.7 Route calculation algorithm...............................53 5. Ongoing Experiment............................................53 6. Conclusion....................................................54 IANA Considerations..............................................55 New IP Option.................................................55 New PDU Types.................................................55 New TLV Code..................................................55 New sub-TLV Codes.............................................56 Security Considerations..........................................56 Normative References.............................................56 Informative References...........................................57 Acknowledgments..................................................58 Author's Address.................................................58 IPR Notice.......................................................58 Full Copyright Statement.........................................58 Proust Expires - August 2004 [Page 2] Internet-Draft Routing at Flow level 2/5/2004 1. Introduction 1.1 A Brief History While the Internet has dramatically evolved during the last decade in terms of services, number of users, network sizes, or bandwidth rates, there have not been any fundamental changes in the field of routing apart from MPLS developments. To quote from [IAB-FUTURE], "The deployed Internet routing system assumes that the shortest path is always the best path. This is probably false, however it is a reasonable compromise given the routing protocols currently available. The Internet lacks deployable approaches for policy-based routing or routing with alternative metrics." Hence today, it is envisioned that there is still an important need for improvement in this area. As an indication, one can underscore the past activity that has been done in the working group QOSR (QOS Routing). It delivered a framework document [QOSRFRWK] that still serves as a reference in this field. Moreover, during this period (1996-1999) of focused research two protocols, [QOSPF] and [OMP] that address part of the requirements defined in the QOSR working group were designed. Despite the interest of the IETF in these approaches, they were not brought into alignment with the standard tracks mainly because they were not appropriate for a large-scale deployment. From the point of view of the author of this draft, the experimental protocol [QOSPF] seems not to be scalable enough, since it relies on a resource signaling protocol that operates at flow level, such as, the Resource reSerVation Protocol (RSVP), while the approach known as OMP [OMP] seems to lack the required level of granularity and to be prone to some instability. As another argument, one could also cite the long term ongoing work being conducted within the Routing Research Group of the IRTF [IRTF- RR], which is on the verge of publishing a set of routing requirements for next generation Internet [IRTF-ROUTREQ]. Moreover, a recent framework document [QOSARCFRWK] which is supported by the IAB and that deals with QOS architectural considerations applied to IP networks emphasizes the need for enhanced routing capabilities. On page 11 of this document, G. Huston states: "There is a more fundamental issue here concerning resource management and traffic engineering. The approach of single path selection with static load characteristics does not match a networked environment which contains a richer mesh of connectivity and dynamic load characteristics. In order to make efficient use of a rich connectivity mesh, it is necessary to be able to direct traffic with a common ingress and egress point across a set of available network Proust Expires - August 2004 [Page 3] Internet-Draft Routing at Flow level 2/5/2004 paths, spreading the load across a broader collection of network links." All the remainder of this document addresses this problem statement. 1.2 Common Understanding of the Requirements Several documents served as the foundations to derive the requirements that apply to this document. First of all, there is the already mentioned framework document about QOS routing [QOSRFRWK]. Second, there is the more recent framework document about QOS architecture [QOSARCFRWK]. Third, there are the specifications documents of two experimental protocols that broaden the understanding of the above problem statement, namely, protocols known as the QOS Routing Mechanisms and OSPF Extensions (QOSPF), and the Optimized Multi Path (OMP) approach. Finally, the BCP 41 [BCP41] about congestion control principles serves to check that the solution is coherent with TCP and other elaborated end-to-end congestion control mechanisms. The remainder of this paragraph serves as a reminder of the fundamental requirements that this solution needs to meet. - the solution must provide for multiple paths between any node pair - the solution must optimize the use of the network, based on knowledge of available resources (mainly link bandwidth) - the solution must exhibit good properties in terms of scalability, stability and performance - the solution must not place a new security breach upon the Internet architecture - the solution must allow for a fine granularity of routing - the solution may provide support for policy routing and may provide management mechanisms to ease its deployment and supervision 1.3 Some Strong Architectural Beliefs The solution described hereafter is based on several strong assumptions. Proust Expires - August 2004 [Page 4] Internet-Draft Routing at Flow level 2/5/2004 First, "stateless" switching technology is considered necessary to build an Internet architecture that exhibits good properties, such as, scalability and robustness. Second, the finer the level of granularity of network services, the more value added the resulting network. Third, resource access control mechanisms are keys to avoid contention of resources. Besides, over-provisioning technique is a common practice among network operators. This technique is considered adapted to IP networks where IP traffic is unpredictable. Furthermore, there is a fundamental lack of router-based end-to-end congestion control mechanism [BCP41] in the Internet. Finally, it is generally admitted that stability and simplicity [INT- ARCH2] are key concerns when promoting an architecture whose scope challenges that of the Internet. 1.4 Design Philosophy As a consequence of the stateless argument, every datagram conveys some form of signaling information about its ongoing IP communication. This is a characteristic of the IP protocol where every datagram contains an information header enabling routers to forward independently every datagram towards its final destination. In order to cope with this characteristic, the solution that is depicted, hereafter, relies on this stateless paradigm. It defines a new option for the IP protocol that allows dynamic flow forwarding states information to be conveyed within every datagram. With regard to the level of granularity to which this network service applies, the solution examines routing at flow level in order to increase the added value of those services. Hence the definition of a flow that applies to this architecture is that of an Internet socket. With regard to the third assumption, the architecture relies on a resource access control mechanism that operates at flow level and takes into account the availability level of network resources, mainly link bandwidth. However, over a long period of time, the network resources would remain over-provisioned. There are numerous RFCs [BCP41], [QOSRFRWK] that stress the need for a router-based end-to-end congestion control mechanism within the Internet. As a result, it would be opportunistic to design a routing solution that not only provides for IP connectivity but also that controls congestion at the scale of the network. With regard to the stability criteria, the routing solution must first include mechanisms that ensure convergence within a reasonable timeframe given the dynamic of the flows. Proust Expires - August 2004 [Page 5] Internet-Draft Routing at Flow level 2/5/2004 Second, the routing system must not generate an amount of information about resource availability that is unmanageable. Third, the routing system must guarantee that the flows are routed in a stable way. That is to say, that the system must not allow for oscillations of traffic between different routes except when topology changes. To comply with the simplicity criteria the solution must keep the overall architecture on a rather uncomplicated level. Hence their underlying mechanisms must be kept as simple as possible in order to be implemented and operated on large scale. 1.5 Overview of the Architecture This architecture allows any IP datagram to be routed at flow level. With that regard a flow is defined as an Internet socket. Moreover, this routing model is able to load balance traffic efficiently, taking into account real availability of network resources. For this purpose, the architecture is composed of three complementary functional building blocks. The first one is in charge of the routing control. The second one takes care of the forwarding data plane. Finally the third one is responsible for identifying flow forwarding states and makes use of a marking function that affects the IP option header. The routing control part is based on any standard Interior Gateway Protocol (IGP) of type Link State such as OSPF [OSPF] or ISIS [ISIS]. Those protocols are enhanced in order to take into account information about network resource availability in their route calculation algorithm. For this purpose, their routing process embeds a statistical function that periodically computes the level of incoming and outgoing traffic circulating over their active transmission links. This aggregated information is generated by every router and distributed among all routers, using specific Link State Packets (LSP) and databases synchronization procedures. For convenience, these specific LSPs are called Link Load Packets (LLP). As a consequence, this augmented Link State DataBase (LSDB) enables every router to figure out the topology of the network, as well as, the distribution of its traffic load. Given that information, every router is able to determine a set of routes that comply with multiple criteria, such as, connectivity and availability, i.e. short routes that are estimated to be free of congestion. Moreover, when a primary route seems to be overloaded, the routing system is able to determine and activate an alternative route in order to capture some of the overloading traffic. To this end, the route calculation algorithm is based on an ordered multi-criteria search algorithm of the "widest-enough shortest" type. Furthermore, this path calculation algorithm may also take into Proust Expires - August 2004 [Page 6] Internet-Draft Routing at Flow level 2/5/2004 account some routing policy constraints. For instance, one could set up constraints such as: the cost of a route should be inferior to some maximum threshold, a route should not be longer than 32 hops, or no more than 15 different routes associated to a given destination prefix should be allowed, etc). To reflect this ability, the structure of the Routing Information dataBase (RIB) associated to an IGP would be extended to include three new attributes. The first one called load, defines the level of traffic load along the corresponding route for a given period. The second one called index, allows multiple routes associated to a destination prefix to be identified. The third one called path, allows the routing process to efficiently refresh the load attribute, associated to its routes when a set of new LLP is received. Hence, given that sort of enhancement, the role of the routing process becomes threefold: first, to calculate a set of short routes free of congestion, second, to refresh the load attribute, associated with the set of active routes, when a new set of LLP is received, and third, to trigger an alternate route calculation algorithm when a maximum load threshold is reached. By means of symmetry in relation to the control plane, this architecture enhances the data forwarding plane model in various directions. First, the structure of the Forwarding Information dataBase (FIB) is extended in order to reflect the additional information provided by the routing control plane. Hence, every route of the FIB possesses two more attributes: an indication of traffic load and an index parameter. Second, the forwarding algorithm is modified in order to take into account several criteria, when transmitting any datagram towards its final destination. This route lookup algorithm is based on an ordered list of criteria that, first of all, examines the destination prefix, then the index parameter. Therefore, this two-stage procedure performs, first, a longest prefix matching operation followed by an exact matching operation. When the index parameter is invalid or equal to zero, the forwarding process completes its route lookup operation using a selection algorithm of the "widest-enough shortest" type. This optional procedure is applied to the subset of the FIB routes that were identified at the preliminary stage of the longest match search. This algorithm takes into account both the load attribute and the cost attribute. Besides, it is designed to be applied when transmitting the first datagram of a bi-directional IP communication or when some major topology changes affect the route followed by the flow. The role of this optional stage is to load balance new flows among multiple routes that exhibit some good properties: being free of congestion (e.g. traffic load attribute is under a maximum Proust Expires - August 2004 [Page 7] Internet-Draft Routing at Flow level 2/5/2004 threshold that could correspond to some over-provisioning limitations), being the shortest (e.g. in relation to the cost metric) among active under-loaded routes and satisfying some routing policy constraints. Hence, it is a route lookup processing that adapts to the type of the datagram being forwarded. When the first datagram of a new flow is transmitted through a network, every router carries out the load-balancing procedure and selects a specific index value that, from its viewpoint, optimizes the use of the network at the time of the datagram is being processed. In order to cope with stability and performance criteria, once a choice of load balancing is made, it should be maintained for all subsequent datagrams belonging to the flow. This function is known in the literature as route pinning. In this architecture it is intimately intertwined with the forwarding processing. To cope with the strong assumption that scaling IP networks are stateless, the route pinning function is carried out through a marking process that is distributed both in routers and in terminals. In brief, in this functional decomposition, the role of the router is to determine load-balancing index values (i.e., to calculate adequate flow forwarding states), while the role of the terminal is to store and remind the network of the previous sequence of index values (i.e., to store the flow forwarding state information generated by every flow). The desired result is obtained by means of a piggybacking procedure that operates in a closed loop. Every datagram that belongs to a bi- directional IP communication conveys within its header, both the previous valid sequence of flow forwarding states, associated with the forward flow being transmitted and the previous valid sequence of flow forwarding states associated with the backward flow that would be sent in response. When the network is running smoothly, i.e. no major outage is affecting its topology, both sequences of flow forwarding states relating to the forward and backward flows of a bi- directional IP communication, should not experience any change over the duration of the communication, once they are calculated. With regard to security considerations, the knowledge of both sequences of flow forwarding states, relating to the forward and backward flows of a bidirectional IP communication, is of no use from the perspective of an end user. In effect, since those values have only some local meaning in relation to the FIB of the routers, without the knowledge of all the FIBs of the routers that take place along the route used by a flow, an end user cannot take advantage of this information to supersede the routing decision enforced by the network. At worst, an end user that would inadvertently change any value of one sequence of flow forwarding states, enforced by the Proust Expires - August 2004 [Page 8] Internet-Draft Routing at Flow level 2/5/2004 network, would lose the benefit of being transmitted efficiently along the same route. In terms of performance, that change could affect the delivery timeframe, due to the reprocessing of a valid sequence of flow forwarding state values, as well as, the reordering of some datagrams at the receiving endpoint. Proust Expires - August 2004 [Page 9] Internet-Draft Routing at Flow level 2/5/2004 1.6 Review of an Example In this example the topology of the network is represented in the figure below. It is composed of two IP terminals whose IP addresses are S1 and D1. Moreover, four interconnected routers named R1, R2, R3 and R4 compose the IP network. subnet A subnet B +----+ | | +----+ | S1 | | ,---. ,---. ,---. | | D1 | +----+____| / \ / \ / \ |___+----+ +----+ |____( R1 )......( R2 ).....( R3 )___| +----+ | \ / \ / \ / | `---'. `---' .'`---' `. .' `. .-' `. .' `,---..' / \ ( R4 ) \ / `---' Figure 1: Topology of the Example Network In this scenario, the terminal S1 is first initializing a bi- directional IP communication with the terminal D1. As this time, the distribution of the network traffic load is within well acceptable boundaries. In other words, there is no transmission link in the network that experiences a traffic overload. Therefore, in this context, the routing control plane of the network has converged to form the simplest routing state where every router has determined a unique route per destination prefix. In a first approximation, a simplified version of the four RIBs and therefore FIBs of the related network are represented by the four figures below. This approximation assimilates the router designation names as the destination prefixes of the network. It does not change the logic presented hereafter. Furthermore, the RIB tables include two additional columns compared to a more conventional RIB table. The first one called index, designates the rank of the route in relation to its uniqueness to a given destination prefix, while the second one called load, designates the short term traffic load observed over the related path. The maximum traffic load threshold is arbitrarily fixed to 30 percent of the absolute available bandwidth calculated over a route. One can calculate such bandwidth Proust Expires - August 2004 [Page 10] Internet-Draft Routing at Flow level 2/5/2004 by considering the maximum of the elementary absolute bandwidths associated with the transmission links that compose a given route. The maximum threshold that identifies an acceptable level of network congestion might be determined by following some over-provisioning engineering rules. +--------+-----+-------+------+ +--------+-----+-------+------+ |dst pref|index|nexthop| load | |dst pref|index|nexthop| load | +--------+-----+-------+------+ +--------+-----+-------+------+ | R2 | 1 | R2 | <30% | | R1 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | R3 | 1 | R2 | <30% | | R3 | 1 | R3 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | R4 | 1 | R4 | <30% | | R4 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | B | 1 | R2 | <30% | | B | 1 | R3 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | A | 1 | A | <30% | | A | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ Figure 2: RIB of router R1 Figure 3: RIB of router R2 +--------+-----+-------+------+ +--------+-----+-------+------+ |dst pref|index|nexthop| load | |dst pref|index|nexthop| load | +--------+-----+-------+------+ +--------+-----+-------+------+ | R2 | 1 | R2 | <30% | | R1 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | R1 | 1 | R2 | <30% | | R2 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | R4 | 1 | R4 | <30% | | R3 | 1 | R3 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | B | 1 | B | <30% | | B | 1 | R3 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | A | 1 | R2 | <30% | | A | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ Figure 4: RIB of router R3 Figure 5: RIB of router R Hence at first, as the network is lightly loaded, the bidirectional IP communication that takes place between terminal S1 and D1, is routed along the primary route composed of router R1, R2 and R3. The figure below represents the two flows f1 and f1' of that bi- directional IP communication. For convenience, flow f1 is characterized by the quintuple S1, D1, p, p' and protonum, where proto designates the number of higher end-to- end transport protocols, and p and p' designate the port numbers identifying the communication points of terminals S1 and D1, Proust Expires - August 2004 [Page 11] Internet-Draft Routing at Flow level 2/5/2004 respectively. By means of symmetry, the flow f1' is characterized by the quintuple D1, S1, p', p and protonum. flow f1' subnet A _.-----------------------. subnet B ,--'' flow f1 `--. | ,-' ,----------------------------. `-----| +----+ <--+' ,-' `------. |`- +----+ | S1 | ---+--' ,---. ,---. ,---. `+-->| D1 | +----+____| / \ / \ / \ |___+----+ +-+-++ |____( R1 )......( R2 ).....( R3 )___| ++-+-+ | \ / \ / \ / | `---'. `---' .'`---' `. .' `. .-' `. .' `,---..' / \ ( R4 ) \ / `---' Figure 6: mapping of flows f1 and f1' on the primary route A moment later, the network starts to experiment some traffic overload at the transmission link located between router R2 and R3. The figure below depicts this event. flow f1' _.-----------------------. subnet B ,--'' flow f1 `--. | ,-' ,----------------------------. `-----| +----+ <--+' ,-' `------. |`- +----+ | S1 | ---+--' ,---. ,---. ,---. `+-->| D1 | +----+____| / \ / \ / \ |___+----+ +-+-++ |____( R1 )......( R2 .XXXXX. R3 )___| ++-+-+ | \ / \ / |\` \ / | `---'. `---' \ .'`---' subnet A `. .\ `. .-' \ `. .' \ `,---..' \______________ / \ overloaded link ( R4 ) \ / `---' Figure 7: network experiencing a light traffic overload Proust Expires - August 2004 [Page 12] Internet-Draft Routing at Flow level 2/5/2004 At that point, the routing control plane detects such an event by means of a newly received set of LLP messages. As defined before, LLP messages characterize the distribution of a network traffic load in close relationship with its topology. Hence this event triggers the calculation of secondary routes. From the viewpoint of each router, only the routes that are affected by the temporary unavailability of some transmission links are susceptible to be reinforced by means of an alternative route. In the present case, the overload of the transmission link located between routers R2 and R3 has a different impact depending on the considered router. For instance from the viewpoint of router R3, this occurrence affects all routes that rely on this specific transmission link. Therefore, the router R3 calculates and activates three secondary routes in its updated RIB table of router R3, which is represented below on figure 12. Meanwhile, from the viewpoint of router R4, none of its routes are affected by this minor outage. The figure 13 represents the RIB table of router R4 at this moment. +--------+-----+-------+------+ +--------+-----+-------+------+ |dst pref|index|nexthop| load | |dst pref|index|nexthop| load | +--------+-----+-------+------+ +--------+-----+-------+------+ | R2 | 1 | R2 | <30% | | R1 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | | 1 | R2 | >30% | | | 1 | R3 | >30% | | R3 |-----+-------+------+ + R3 +-----+-------+------| | | 2 | R4 | <30% | | | 2 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | R4 | 1 | R4 | <30% | | R4 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | | 1 | R2 | >30% | | | 1 | R3 | >30% | | B |-----+-------+------+ + B +-----+-------+------| | | 2 | R4 | <30% | | | 2 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | A | 1 | A | <30% | | A | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ Figure 10: RIB of router R1 Figure 11: RIB of router R2 Proust Expires - August 2004 [Page 13] Internet-Draft Routing at Flow level 2/5/2004 +--------+-----+-------+------+ |dst pref|index|nexthop| load | +--------+-----+-------+------+ | | 1 | R2 | >30% | | R2 |-----+-------+------+ | | 2 | R4 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | | 1 | R2 | >30% | |dst pref|index|nexthop| load | | R1 |-----+-------+------+ +--------+-----+-------+------+ | | 2 | R4 | <30% | | R1 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | R4 | 1 | R4 | <30% | | R2 | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | B | 1 | B | <30% | | R3 | 1 | R3 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ | | 1 | R2 | >30% | | B | 1 | R3 | <30% | | A |-----+-------+------+ +--------+-----+-------+------+ | | 2 | R4 | <30% | | A | 1 | R1 | <30% | +--------+-----+-------+------+ +--------+-----+-------+------+ Figure 12: RIB of router R3 Figure 13: RIB of router R Finally, later on, terminal S1 and D1 are engaged in a secondary bidirectional IP communication characterized by two flows f2 and f2'. This time, as the primary route is already slightly loaded, it is expected that both routers R3 and R1 will route the datagrams of flows f2 and f2' along the secondary route. This secondary route is composed of the sequence of routers R1, R4 and R3. The main advantage of the design of this architecture is that it allows flows f1 and f1' to still be forwarded along the primary route which is estimated to be lightly overloaded, while allowing more recent flows f2 and f2' to be forwarded along a secondary route which is believed to be under loaded. This capability is enabled through the enhancement of the forwarding plane of all routers. First, their RIB tables are extended to include two new attributes per route: an index parameter and the traffic load attribute. Second, their forwarding algorithm is extended to take into account an index value, in addition to, the destination address of the IP header when processing an IP datagram, received on one of its interfaces. Moreover, the route lookup processing is extended to operate differently according to the type of request. When the request concerns the forwarding of a new flow (e.g. one that concerns a datagram that has not been marked as such by routers), the procedure is carried out entirely. That means that the router Proust Expires - August 2004 [Page 14] Internet-Draft Routing at Flow level 2/5/2004 selects the route, whose prefix destination matches the address destination contained in the IP header, and for which the traffic load value is under the maximum threshold and exhibits the lowest cost (in relation to the metric being considered by the routing protocol). When the request concerns the forwarding of a flow that belongs to an already ongoing bi-directional IP communication (e.g. one that concerns a datagram that has already been marked as such by routers) the procedure is shortcut. It is only based on the destination address and the index values read in the IP header of the datagram...The index value serves as a refinement of the lookup processing. It indicates which load-balancing choice was assigned to the forwarding of this flow. The way the terminal can store such information is through the use of a piggybacking procedure that operates in a closed loop. Every IP header conveys both the sequence of choices made by the routers located on the route used by a forwarded datagram, therefore, relating to a forward flow, as well as, the sequence of choices made by the routers located on the route used by a prior datagram that relates to the backward flow. Since by design, this architecture operates in a closed loop, it applies first to bi-directional IP communication. The figure 14 below represents the routing of all flows depicted in this scenario among the available resources. flow f1' subnet A _.-----------------------. subnet B ,--'' flow f1 `--. | ,-' ,----------------------------. `-----| +----+ <--+' ,-' overloaded `------. |`- +----+ | S1 | ---+--' ,---. ,---. link ,---. `+-->| D1 | +----+____| / \ / \ / \ |___+----+ +-+-++ |____( R1 )......( R2 .XXXXX. R3 )__,+-- ++-+-+ <--+---. \ / \ / \ / / |--> ---|--. `.`---'. `---' .'`-,----',' `-.`----.`. . ,-' ,---' flow f2 `--. \ `. flow f2Æ .-' / ,-' `. `---. -' _.-' / `--. `,---.'.-''_.--' ` /-----\--'' ( R4 ) \ / `---' Figure 14: mapping of all flows f1, f1', f2 and f2' on the routes Proust Expires - August 2004 [Page 15] Internet-Draft Routing at Flow level 2/5/2004 2. Identification of IP flow forwarding states The overall architecture aims at load balancing efficiently IP flows among multiple routes that are believed to be free of congestion. This architecture supports such a service due to a combination of three functional building blocks that operate in a complementary way. This chapter describes the low level one entitled "identification of IP flow forwarding states" in comparison with the control plane which is said to be on a higher level in the load-balancing processing chain. This low level functional building block first serves to pin the flows on the selected routes. 2.1 Overview This functional building block is intimately intertwined with the forwarding data plane, since it identifies the index attribute of a route of the FIB as a flow forwarding state. The assumption, here, is that the combination of a destination prefix and of a sequence of index attributes might be used to uniquely characterize a route. This assumption is reasonably acceptable on condition that the routing control plane of the network has converged. Every router that contributes to the forwarding of a datagram has to determine its next-hop router according to the load-balancing service. Hence, as the route lookup processing is applied to the FIB, an index value is identified at each hop during the journey of a datagram that crosses the network. By analogy with record routing, the basic idea consists in writing such a sequence of flow forwarding states in the IP option header of a datagram, while it is transmitted through the network. As mentioned before, the support of such load-balancing service requires IP flows to relate to some bi-directional IP communications. Hence, this supplementary assumption is used to design a piggybacking procedure that would make both terminal endpoints of an IP communication aware of the recorded sequence of flow forwarding states which would subsequently contribute to the efficient forwarding of their own flow. To this end, a new option for the IP protocol is provided. It is intentionally called "Identification of IP Flow Forwarding States". Proust Expires - August 2004 [Page 16] Internet-Draft Routing at Flow level 2/5/2004 The data structure of this option header is designed to contain two vectors of flow forwarding state values (e.g. index values that identify routes in the FIB) and two index fields that allow both vectors to be addressed. Furthermore, this IP option is designed to operate differently according to the type of system being addressed. In short, the role of the routers is to identify, mark and validate the sequence of flow forwarding states, included in the IP option header. To this effect, each router makes use of the first vector and of the related index field, included in the IP option header. To the contrary, the role of the terminals is, on one hand, to record the complete sequence of flow forwarding states extracted from received datagrams, and, on the other hand, to initialize the various fields of the IP option header when sending new data. To this end, every terminal that engages in a bi-directional IP communication creates a context of communication session that enables it to store all needed information of interest. Anytime a terminal receives a new datagram, it might update this IP option-based information (the two vectors and the two index values), stored in the related context of communication that, in particular, characterizes the route followed by the flows. Due to this functional decomposition, the number of states created in the routers remains independent of the number of flows. Therefore, this solution addresses some scalability concerns. Therefore, the core processing of this IP option articulates around two simple sets of functions. When processing this IP option, a router needs to: - identify a flow forwarding state - mark a flow forwarding state in the first vector of the IP option header - validate a flow forwarding state When processing this IP option a terminal needs to: - record both sequences of flow forwarding states from received datagrams - manage a context of an IP communication session - initialize both sequences of flow forwarding states in sent datagrams The subsequent paragraphs gives details of the processing carried out by each system. Proust Expires - August 2004 [Page 17] Internet-Draft Routing at Flow level 2/5/2004 2.2 Format of the IP option header This paragraph describes the format of this IP option header that complies with version 4 of the Internet Protocol. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |Version| IHL |Type of Service| Total Length | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Identification |Flags| Fragment Offset | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Time to Live | Protocol | Header Checksum | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Source Address | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Destination Address | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Option Type | Option Length | FVI | BVL | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | + + | | + + | FFIV | + + | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | |+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+| | | | + + | | + BFIV + | | + + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 15: Format of the IP option Header Proust Expires - August 2004 [Page 18] Internet-Draft Routing at Flow level 2/5/2004 The Internet Protocol specification allows for the definition of any optional IP processing. To this end, it provides the definition of a generic optional header in its header format. The former is defined using TLV style coding. Hence, the first two bytes identify the option type and the option value. Moreover, the maximum size of an optional header is limited to 40 bytes. Given that information, the IP option header of type ôIdentification of Flow Forwarding Statesö is designed to use the maximum size allowed. It contains two vectors and two index fields. Thus, the IHL field has a value of 15 when the Option Length field has a value of 40. The data field called Forwarding Vector Index (FVI) is one byte long. It represents the index of the first following vector. The routers use this index value to point to the flow forwarding state of the first vector that is to be associated with them when processing this IP option. It is incremented at each hop. When the datagram reaches its final destination, this field represents the length of the route followed by the present datagram. The data field called Backward Vector Length (BVL) is one byte long. It represents the length of the route followed by the prior datagram of the backward flow in relation to the present datagram. The data field called Forward Flow Information Vector (FFIV) is 18 bytes long. The routers use it to characterize/refine the route followed by the present datagram that relates to the forward flow. The characterization of the route is based on a combination of a destination prefix and a sequence of route indexes. In the context of this functional building block, the route index is called ôflow forwarding stateö. Each value is encoded using 4 bits. Hence the FFIV vector might record the characterization of a route of 36 hops maximum length. Due to the 4 bits provision, a route index value (i.e. a flow forwarding state) might vary between 0 and 15 integer values. The first element of the vector would reference the index value of the route selected by the first router when forwarding a datagram. The second element of the vector would reference the index value of the route selected by the second router when forwarding a datagram and so forth. The data field called Backward Flow Information Vector (BFIV)is 18 bytes long. It serves the purpose of the control closed loop that operates on the highest level of the routing control plane. It is piggybacked on every forwarded datagram to convey the characterization of the route followed by a recent datagram which relates to the backward flow. It uses the same encoding convention as that described for the FFIV vector. In accordance with the inner Proust Expires - August 2004 [Page 19] Internet-Draft Routing at Flow level 2/5/2004 logical reasoning of this control closed loop, each time a terminal sends a datagram, it should initialize the BFIV vector and BVL index field of its optional header with the related information contained in its created context of the communication session. This context of the communication session allows previous FFIV vector and FVI index value, consigned to the optional header of a received datagram in relation to the present communication session, to be stored as BFIV vector and BVL index value for future datagrams sending. 2.3 Role of the Terminal 2.3.1 Creation of a context of a communication session When a terminal engages in a bi-directional IP communication it should create an associated context of a communication session for the purpose of managing the data of the routing control loop that are consigned in the IP option header of all datagrams. Namely, FFIV and BFIV vectors as well as the FVI index and BVL length values. Such a context of a communication session might be as simple as a combination of information that defines an Internet socket along with the content of the four fields defined for this IP option. To give an example, the context of a communication session, in relation to a bi-directional IP communication between two endpoint terminals whose IP addresses are S and D, which, furthermore, relies on the transport protocol TCP and that uses port numbers p and pÆ, might be defined from the viewpoint of terminal S by the following combination of information fields. - forward flow o source address: S . should be 4 bytes long . its initial value is set to the IP address of the local host o destination address: D . should be 4 bytes long . its initial value is set to the IP address of the remote host o source port number: p . should be 2 bytes long . its initial value is set by the application protocol o destination port number: pÆ . should be 2 bytes long . its initial value is set by the application protocol o transport protocol number: TCP . should be 1 byte long . its initial value is set by the transport protocol over which runs the application protocol o forward flow information vector: FFIV . should be 18 bytes long Proust Expires - August 2004 [Page 20] Internet-Draft Routing at Flow level 2/5/2004 . its initial value is set to the null vector, that is a vector full of zero integer values o forward vector index: FVI . should be 1 byte long . it is a constant whose initial value is set to one integer value o forward vector length: FVL . should be 1 byte long . its initial value is set to zero integer value - backward flow o source address: D . should be 4 bytes long . its initial value is set to the IP address of the local host o destination address: S . should be 4 bytes long . its initial value is set to the IP address of the remote host o source port number: pÆ . should be 2 bytes long . its initial value is set by the application protocol o destination port number: p . should be 2 bytes long . its initial value is set by the application protocol o transport protocol number: TCP . should be 1 byte long . its initial value is set by the transport protocol over which runs the application protocol o backward flow information vector: BFIV . should be 18 bytes long . its initial value is set to the null vector, that is a vector full of zero integer values o backward vector length: BVL . should be 1 byte long . its initial value is set to zero integer value From the viewpoint of terminal D, the context of communication session is obtained by inversion of the roles of receiver and sender. 2.3.2 Receiving a datagram Whenever a terminal receives a datagram with this IP option enabled it should copy the content of the optional header in the related context of the communication session. If such a context does not exist at the time the datagram is received, the terminal should create one prior to the copying operation. Proust Expires - August 2004 [Page 21] Internet-Draft Routing at Flow level 2/5/2004 Whatever the case, the copy of the content of the optional header should be done as depicted hereafter. The FFIV vector consigned to the optional header of the received datagram should be copied in the BFIV vector data element of the related context of communication session. The FVI index value consigned to the optional header of the received datagram should be copied in the BVL integer data element of the related context of communication session. The BFIV vector consigned to the optional header of the received datagram should be copied in the FFIV vector data element of the related context of communication session. The BVL length value consigned to the optional header of the received datagram should be copied in the FVL integer data element of the related context of communication session. The FVI integer data element of the related context of a communication session is a constant equal to the one integer value. 2.3.3 Sending a datagram Whenever a terminal sends a datagram with this IP option enabled, it should initialize the four fields of the optional header with the more related up-to-date data elements recorded in the related context of communication session. If such a context does not exist at the time a datagram is sent, the terminal should create one prior to this initializing operation. Whatever the case, the initialization of the four fields of the optional header should be done as described hereafter. The FFIV vector consigned to the optional header of the datagram about to be sent, should be initialized with the most up-to-date FFIV vector data element of the related context of the communication session. The FVI index value consigned to the optional header of the datagram about to be sent should be initialized with the 1 integer value. The BFIV vector consigned to the optional header of the datagram about to be sent should be initialized with the most up-to-date BFIV vector data element of the related context of the communication session. Proust Expires - August 2004 [Page 22] Internet-Draft Routing at Flow level 2/5/2004 The BVL length value consigned to the optional header of the datagram about to be sent should be initialized with the most up-to-date BVL integer data element of the related context of the communication session. 2.3.4 Managing a context of a communication session For statistical purposes, it might be interesting to integrate, in the processing of this IP option, a session manager that would record all past values of the FFIV, FVL, BFIV and BVL data elements of the related context of the session. For instance, the evolution of the FFIV vector might be an indication with regard to the stability of the network topology being used. Also, the evolution of the FVL fields might indicate the variation of the length of the route being used. 2.4 Role of the Router From the viewpoint of a router, this IP option specifies a forwarding mechanism for network resource optimization purposes. Its algorithm proceeds as described hereafter. Upon receiving a datagram on one of its interfaces, a router first checks, as usual, the validity of the IP header. Then it proceeds to the read the destination address consigned to the IP header.. In case the datagram is intended for the router itself, its forwarding algorithm should deliver it to the related endpoint process. Conversely, the router should proceed to a modified route lookup operation. In a first stage, given the destination address of the datagram, the router proceeds to the longest match search of the destination prefix in its FIB. Having said this, the FIB may contain several routes for a given destination prefix. For this reason, the router proceeds to the second stage of the route lookup procedure to refine the routing selection process. Hence, the router should identify the flow forwarding state of this datagram that relates to the state of the forwarding plane of this router. To do so, it reads the FVI index value consigned to the IP option header. This value indicates the row of the FFIV vector, where the flow forwarding state that relates to this router, should be. Therefore, the router identifies this flow forwarding state (FFS) by reading the FFIV[FVI] indexed value which is coded using 4 bits. Proust Expires - August 2004 [Page 23] Internet-Draft Routing at Flow level 2/5/2004 In case the FFS read value is zero, this means the datagram belongs to a new flow for which no previous FFS value has been set. Therefore the router selects the best route among those that were pre-selected at the previous longest match stage. The multi-criteria selection algorithm is of type widest-enough shortest.. In other words, as the routes possess two metrics, cost and traffic load, the selection process should consider the shortest route in relation to the cost metric, which has enough bandwidth. Finally, the index value of the selected route defines the new value of the flow forwarding state for that datagram. Thus, the forwarding process marks this index value in the FFIV vector of the optional header at the row indicated by the FVI index value. Then the FVI index value is incremented by one unity in the optional header and the datagram is forwarded along the selected route. In case the FFS read value is different from zero, this means the datagram belongs to a flow for which a previous FFS value has been set. Therefore, the router refines the routing selection process. First, it checks the validity of the FFS read value by verifying that an associated route does still exist. In relation to a network topology change, it may happen that an index value is not valid anymore. If this is the case, the datagram should be directed along this route. In doing so, the flow remains stuck to the route that was selected at the beginning of the communication. Then the FVI index value is incremented by one unity in the optional header and the datagram is forwarded along the selected route. If this is not the case, because of some network topology change or because of some misbehaving or malfunctioning terminal, the router should apply the selection procedure described above which relates to the forwarding of the initial datagram of a flow. Proust Expires - August 2004 [Page 24] Internet-Draft Routing at Flow level 2/5/2004 The second stage of this forwarding algorithm is depicted in figure 16. +---------+ |read FVI | +----+----+ | +------v-------+ |read FFIV[FIV]| +------+-------+ | ,-v---. / is \ /FFIV[FVI]\ yes ( equal )-----------------------+ \ to 0 ? / | \ / | `--+--' | |no | | | ,--v--. | / does \ +-----------v-----------+ /FFIV[FVI]\ no | widest-enough shortest| ( belongs to)---------->| selection of a route | \ the sub / | in the sub FIB | \ FIB ? / +-----------+-----------+ `--+--' | |yes | | +-----------v------------+ | | write the index of the | | | selected route in the | | | FFIV[FIV] vector field | | | of the IP option header| | +-----------+------------+ | | | | | +----------v---------+ +----------------->|increment FVI in the| | IP option header | +----------+---------+ | +--------v--------+ |transmit datagram| +-----------------+ Figure 16: refinement step of the routing selection process Proust Expires - August 2004 [Page 25] Internet-Draft Routing at Flow level 2/5/2004 2.5 Review of an Example The network topology, to which this example applies, is the one described in the introductory chapter. It is represented in figure 1. As mentioned earlier, the design of this architecture only applies to bi-directional IP communications. In that respect, we consider in this example a bi-directional IP communication that takes place between terminals S1 and D1. As mentioned before, the network is globally under loaded. Hence, the routing control plane has converged around a simple network routing state, where there exists only one available route at that time. The primary route calculated to transport both flows of the newly created IP communication is built on the sequence of routers R1, R2 and R3. The figure 17 below details the processing stages of the IP option, as and when the datagram crosses the network. The higher part of the diagram represents the network topology, whereas, the lower part depicts the evolution of the content of the optional header while a datagram is, little by little, forwarded through the network. +----+ ,---. ,---. ,---. +----+ | S1 |-------( R1 )-------( R2 )------( R3 )--------| D1 | +----+ `---' `---' `---' +----+ ^ +----+ ^ ^ ^ ^ +----+ ^ | | | | | | | | | | | | v v v v v v +---------+ +---------+ +---------+ +---------+ +---------+ |IP header| |IP header| |IP header| |IP header| |IP header| FVI | 1 | | 2 | | 3 | | 4 | | 4 | FFIV| 0,...,0 |-->| 1,0,..,0|->| 1,1,0,..|->|1,1,1,0,.|-->| 1,1,1,0,| BVL | 0 | | 0 | | 0 | | 0 | | 0 | BFIV| 0,...,0 | | 0,...,0 | | 0,...,0 | | 0,...,0 | | 0,...,0 | +---------+ +---------+ +---------+ +---------+ +---------+ | | | | | | | | | | | payload | | payload | | payload | | payload | | payload | | | | | | | | | | | +---------+ +---------+ +---------+ +---------+ +---------+ (A) (B) (C) (D) (E) Figure 17: forwarding of the first datagram of flow f1 along the primary route built on the sequence of R1, R2 and R3 routers At stage labeled (A), the terminal S1 is initiating the bi- directional IP communication with terminal D1. As such, the terminal creates a context of the communication session, which is characterized by the set of values given below. Proust Expires - August 2004 [Page 26] Internet-Draft Routing at Flow level 2/5/2004 - forward flow o source address: S1 o destination address: D1 o source port number: p o destination port number: pÆ o transport protocol number: TCP o forward flow information vector: FFIV = (0, 0,à, 0) o forward vector index: FVI = 1 o forward vector length: FVL = 0 - backward flow o source address: D1 o destination address: S1 o source port number: pÆ o destination port number: p o transport protocol number: TCP o forward flow information vector: BFIV = (0, 0,à, 0) o forward vector index: BVL = 0 Given that initialization step, the terminal packages the datagram to be sent with the initial values defined for that context and transmits it towards its access router, namely R1. The figure above shows at that time the content of the four fields that relates to this IP option. At stage (B), the router R1 identifies that the datagram relates to a new flow since the FFVI[1] value is equal to zero. Thus, R1 carries out a full route lookup operation, independently. As its FIB contains only one route that matches the destination address of the datagram, it is unambiguously selected. Its index value is one. According to the IP option processing, this index value is written in the FFIV vector at a place indicated by the FVI index value. Hence, at that time FFIV[FVI] is set to one in the IP option header. Moreover, the FVI index value is incremented by one unity and the datagram is transmitted towards the next-hop router R2. At stages (C) and (D), router R2 and R3 successively perform the same processing chain as depicted in stage (B) on that datagram. At stage (E), the terminal D1 that receives this first datagram creates its own context of communication session. This time it is characterized by the set of values given below. - forward flow o source address: D1 o destination address: S1 o source port number: pÆ o destination port number: p Proust Expires - August 2004 [Page 27] Internet-Draft Routing at Flow level 2/5/2004 o transport protocol number: TCP o forward flow information vector: FFIV = (0, 0,à, 0) o forward vector index: FVI = 1 o forward vector length: FVL = 0 - backward flow o source address: S1 o destination address: D1 o source port number: p o destination port number: pÆ o transport protocol number: TCP o forward flow information vector: BFIV = (1, 1, 1, 0,à, 0) o forward vector index: BVL = 4 Doing so, it has first copied the FFIV vector and FVI index value, contained in the option header of the received datagram, to initialize the BFIV and BVL data elements that relate to the backward flow of the newly created context of communication session. Then, it has copied the BFIV vector and BVL length value, contained in the option header of the received datagram, to initialize the FFIV and FVL data elements that relate to the forward flow of the context of communication session. Moreover, the constant FVI is set to the one integer value. At stage (F), the terminal D1 sends a datagram in response to terminal S1. In turn, the terminal D1 packages the datagram to be sent with the initial values defined for that context and transmits it towards its access router, namely R3. At stages (G), (H) and (I), router R3, R2 and R1 successively perform the same processing chain as depicted in stage (B) on that datagram. At stage (J), the terminal S1 that receives this first datagram updates its related context of communication session. Thus, at that time it is characterized by the set of values given below. - forward flow o source address: S1 o destination address: D1 o source port number: p o destination port number: pÆ o transport protocol number: TCP o forward flow information vector: FFIV = (1, 1, 1, 0,à, 0) o forward vector index: FVI = 1 o forward vector length: FVL = 4 - backward flow o source address: D1 o destination address: S1 Proust Expires - August 2004 [Page 28] Internet-Draft Routing at Flow level 2/5/2004 o source port number: pÆ o destination port number: p o transport protocol number: TCP o forward flow information vector: BFIV = (1, 1, 1, 0,à, 0) o forward vector index: BVL = 4 Doing so, it has first copied the FFIV vector and FVI index value contained in the option header of the received datagram to update the BFIV and BVL data elements that relate to the backward flow of the context of communication session. Then, it has copied the BFIV vector and BVL length value, contained in the option header of the received datagram, to update the FFIV and FVL data elements that relate to the forward flow of the context of communication session. +----+ ,---. ,---. ,---. +----+ | S1 |-------( R1 )-------( R2 )------( R3 )--------| D1 | +----+ `---' `---' `---' +----+ ^ +----+ ^ ^ ^ ^ +----+ ^ | | | | | | | | | | | | v v v v v v +---------+ +---------+ +---------+ +---------+ +---------+ |IP header| |IP header| |IP header| |IP header| |IP header| FVI | 4 | | 4 | | 3 | | 2 | | 1 | FFIV|1,1,1,0,.|<--|1,1,1,0,.|<-|1,1,0,.. |<-|1,0,..,0 |<--| 0,...,0 | BVL | 4 | | 4 | | 4 | | 4 | | 4 | BFIV|1,1,1,0,.| |1,1,1,0,.| |1,1,1,0,.| |1,1,1,0,.| |1,1,1,0,.| +---------+ +---------+ +---------+ +---------+ +---------+ | | | | | | | | | | | payload | | payload | | payload | | payload | | payload | | | | | | | | | | | +---------+ +---------+ +---------+ +---------+ +---------+ (J) (I) (H) (G) (F) Figure 18: forwarding of the first datagram of flow f1Æ along the primary route built on the sequence of R3, R2 and R1 routers. At stage (K), the terminal S1 is sending a second datagram to the terminal D1. Thus, it packages the datagram to be sent with the initial values defined for that context at that time and transmits it towards its access router, namely R1. At stage (L), the router R1 identifies that the datagram may relate to an already routed flow, since the FFVI[1] value is different from zero. Thus, R1 carries out a validation of this index value by performing the two stage route lookup operation. As its FIB still contains a valid route that, first, matches the destination address of the datagram and that second, possesses an index value, that Proust Expires - August 2004 [Page 29] Internet-Draft Routing at Flow level 2/5/2004 equals to the FFIV[1] value of that datagram, it is unambiguously selected as the route on which the flow is pinned. Once this is done, the router R1 only has to increment, by one unity, the FVI index value of the IP option header and transmit the datagram towards the next-hop router R2. At stages (M) and (N), router R2 and R3 successively perform the same processing chain as depicted in stage (I) on that datagram. At stage (O), the terminal D1 that receives this second datagram updates its related context of communication session. Thus, at that time it is characterized by the set of values given below. - forward flow o source address: D1 o destination address: S1 o source port number: pÆ o destination port number: p o transport protocol number: TCP o forward flow information vector: FFIV = (1, 1, 1, 0,à, 0) o forward vector index: FVI = 1 o forward vector length: FVL = 4 - backward flow o source address: S1 o destination address: D1 o source port number: p o destination port number: pÆ o transport protocol number: TCP o forward flow information vector: BFIV = (1, 1, 1, 0,à, 0) o forward vector index: BVL = 4 Doing so, it has first copied the FFIV vector and FVI index value, contained in the option header of the received datagram, to update the BFIV and BVL data elements that relate to the backward flow of the context of communication session. Then, it has copied the BFIV vector and BVL length value contained in the option header of the received datagram to update the FFIV and FVL data elements that relate to the forward flow of the context of communication session. Proust Expires - August 2004 [Page 30] Internet-Draft Routing at Flow level 2/5/2004 +----+ ,---. ,---. ,---. +----+ | S1 |-------( R1 )-------( R2 )------( R3 )--------| D1 | +----+ `---' `---' `---' +----+ ^ +----+ ^ ^ ^ ^ +----+ ^ | | | | | | | | | | | | v v v v v v +---------+ +---------+ +---------+ +---------+ +---------+ |IP header| |IP header| |IP header| |IP header| |IP header| FVI | 1 | | 2 | | 3 | | 4 | | 4 | FFIV|1,1,1,0,.|-->|1,1,1,0,.|->|1,1,1,0,.|->|1,1,1,0,.|-->|1,1,1,0,.| BVL | 4 | | 4 | | 4 | | 4 | | 4 | BFIV|1,1,1,0,.| |1,1,1,0,.| |1,1,1,0,.| |1,1,1,0,.| |1,1,1,0,.| +---------+ +---------+ +---------+ +---------+ +---------+ | | | | | | | | | | | payload | | payload | | payload | | payload | | payload | | | | | | | | | | | +---------+ +---------+ +---------+ +---------+ +---------+ (K) (L) (M) (N) (O) Figure 19: forwarding of the subsequent datagrams of flow f1 along the primary route built on the sequence of R1, R2 and R3 routers 3. Extension of the Forwarding plane 3.1 Overview The very schematic representation of a router might be depicted by the figure 20 below. The functional role of a router aims at forwarding datagrams towards their final destinations via the transmission interfaces that interconnect it with other systems. For that purpose, any router operates at the same time on two functional levels: the control plane and the transmission plane. Usually, the control plane operates in a shared time processing environment while the transmission plane operates in near real time. In short, the control plane allows a router to adapt to network changes, in due time, with minimum impact on the transport of network traffic. The transmission plane of a router is in charge of applying all needed processing, alongside the forwarding function. For instance, it may include some network filtering mechanisms or some quality of service (QOS) management functions. During the last decade, the core technology of routers has been improved by a factor of several orders of magnitude. Today, routers are flexible enough to apply all sorts of elementary processing while still forwarding datagrams at transmission line rates. However, with Proust Expires - August 2004 [Page 31] Internet-Draft Routing at Flow level 2/5/2004 regard to the principle of forwarding, there has been little change in the past. Most of the time the routing decision relies on a longest match search operation, which is based on the destination address consigned in the header of a datagram. While this forwarding mechanism has proven to be robust and scalable, it lacks some flexibility with regard to policy routing and load balancing. +---------------------------------------+ | | | +-------------------+ | | | | | | | routing protocol | | control | | |----+ | plane | | (RIB) | | | | | | | | | +-------------------+ | | | ^ | | |---------|------------------|----------| | | | | transmission | | yes v | plane | ,-----. +--------------+ | FORWARDED | /is the \ | | | DATAGRAM +------+--+ | /datagram \ no | forwarding | | +------+--+ | | |---+-->(intended to)--->| module |---+-->| | | | | | | \ a local / | | | | | | +------+--+ | \process/ | (FIB) | | +------+--+ RECEIVED | \ ? / | | | DATAGRAM | `---' +--------------+ | | | +---------------------------------------+ Figure 20: schematic functional representation of a router As described in the introductory chapter, the support of a fair load balancing network service, implies a modification of the forwarding algorithm, the forwarding information database (FIB), and the updating mechanism of the FIB. Proust Expires - August 2004 [Page 32] Internet-Draft Routing at Flow level 2/5/2004 3.2 FIB structure The support of the IP option described in chapter 2.4 requires a new structure of the FIB. It is represented below on figure 21. +--------------------+-------+----------+-----------+------+------+ | Destination Prefix | Index | Next-Hop | Interface | Load | Cost | +--------------------+-------+----------+-----------+------+------+ | 192.168.1.0/24 | 1 | 10.1.1.1 | GE1/0 | 42% | 100 | +--------------------+-------+----------+-----------+------+------+ | 192.168.1.0.24 | 2 | 10.2.1.1 | GE1/1 | 32% | 150 | +--------------------+-------+----------+-----------+------+------+ | 192.168.1.0/24 | 3 | 10.3.1.1 | GE1/2 | 8% | 200 | +--------------------+-------+----------+-----------+------+------+ | 192.168.2.0/24 | 1 | 10.3.1.1 | GE1/2 | 20% | 100 | +--------------------+-------+----------+-----------+------+------+ Figure 21: extended structure of the FIB table The "Destination Prefix" field represents an IP prefix associated with a reachable underlying sub-network known within a routing domain. The "Index" field represents an identifier that unambiguously identifies a route, among the subset of FIB routes that share the same destination prefix. The "Next-Hop" field represents the IP address of the next-hop router to be used along the route. The "Interface" field identifies the output transmission link of the router to be used along the route. The "Load" field represents the maximum of the elementary traffic load that has been measured for each transmission link along the calculated route. This quantity is expressed as a percentage of an absolute bandwidth. The "Cost" field represents the total usage cost of resources along the route. It is calculated for a given topology by means of the Dijkstra's shortest path first (SPF) algorithm. 3.3 FIB update 3.3.1 RIB/FIB relationship At the control plane level, in addition to static routing a router can activate several dynamic routing processes like ISIS or BGP. According to the definition of a specific metric each routing process calculates independently the best routes that are stored in separate RIB tables. Due to the diversity of routing information sources, two Proust Expires - August 2004 [Page 33] Internet-Draft Routing at Flow level 2/5/2004 routes in different RIB tables may exist at the same time for a given destination prefix. In consequence of such an eventuality, the control plane of a router also possesses a central routing process that manages a central RIB. The latter contains the final set of valid routes that are used in the transmission plane. The central routing process selects the best routes by comparing their relative degrees of trustworthiness. This static criterion is defined according to some conventional engineering rules. It is assigned to the different routing processes by means of a manual configuration. The routes contained in the FIB table are a reflection of the set of the routes contained in the central RIB. The figure 22 below depicts the relationships between processes that control the updating of the different routing tables. +-----------------------------------------------------------+ | control plane | | | | +---------+ +--------+ +---------+ | | ,---. | dynamic | | static | | dynamic | ,---. | |( RIB )<-->| routing | |routing | | routing |<-->( RIB ) | | `---' | process | |process | | process | `---' | | +----.----+ +---+----+ +----.----+ | | `-._ | _.-' | | `-._ | _.-' | | +---v--v---v---+ | | | central | ,---. | | | routing |<-->( RIB ) | | | process | `---' | | +------^-------+ | | | | +----------------------------+------------------------------+ | transmission plane | | | +------v-------+ | | | forwarding | ,---. | | | module |<-->( FIB ) | | | | `---' | | +--------------+ | | | +-----------------------------------------------------------+ Figure 22: relations between processes that control routing tables As a result of this functional organization, each time a route is added, modified or deleted from the central RIB, a similar updating action should be applied to the FIB. Proust Expires - August 2004 [Page 34] Internet-Draft Routing at Flow level 2/5/2004 Furthermore, according to the proposed architecture, it is envisioned that the load attribute of the routes must be updated on a periodic basis. The regular updating of this attribute is fundamental to enable the whole system to keep the global network traffic load under control. In this way, some feedback is introduced into this closed control loop. In a first step, the dynamic routing protocol measures the distribution of network traffic load and controls the state of the topology. In a second step, it updates its routing table according to the current state of both the distribution of traffic and the topology. In a third step, the central routing process updates the central RIB then the FIB table. In a forth step, the forwarding module transmits the traffic according to the updated routes of the FIB. This cycle should occur on a regular basis. How often depends on the dynamic level of the traffic and of the ratio of the bandwidth pipes in relation to the average bandwidth of flows. It is assumed that the bandwidth of individual flows should be very small compared to the available bandwidth provisioned in the core of the network. Thus, the routes of the FIB might be updated by means of two types of mechanism. The first one is invoked on a periodic basis. The second one is triggered when a topology change is detected or when the measured distribution of network traffic load requires a modification of the routing table. 3.3.2 Periodic updates One can distinguish two types of periodic updates of the FIB routing table. The first one occurs over long periods of time and is mandated by the standard operating of any link state dynamic routing protocol. The second one occurs over short periods of time (lasting only a few minutes), that is to say, when a new measure of some traffic load is received from a part of the network. It may modify the load attributes that relate to an existing set of calculated routes. To do so, the combination of the destination prefix and of the index attribute is used as a key identifier to unambiguously address a specific route. 3.3.3 Triggered updates As it is already the case with the standard specifications, a triggered update of the FIB may occur when some topology change is detected. Proust Expires - August 2004 [Page 35] Internet-Draft Routing at Flow level 2/5/2004 Besides, as a result of a periodic update of the load attribute in the FIB, some alternative routes might be added or suppressed. Therefore, it is another case of triggered FIB update. 3.4 Forwarding algorithm With regard to the forwarding mechanism needed, it is fully explained in paragraph 2.4, which defines a new option for the IP protocol that extends the traditional forwarding algorithm by adding a refinement stage to the route lookup procedure. In addition to the longest match search (LMS), this optional IP processing examines the value of a flow forwarding state that should have been marked in the optional header by every processing router during the forwarding of a prior datagram of that flow. When the network topology remains stable, this refinement stage is an exact match search (EMS) applied to a short list of pre-selected routes. When the network topology changes or at the beginning of a bi- directional IP communication, this refinement stage is an ordered multi-criteria search of the "widest-enough longest" type, which is applied to a short list of pre-selected routes. In a first step, this procedure selects the routes from the pre-selected subset, whose traffic load attribute still remains below a maximum threshold. In a second step, it selects the shortest one in relation to the cost attribute. The complete forwarding algorithm is depicted below in the flow chart of figure 23. Proust Expires - August 2004 [Page 36] Internet-Draft Routing at Flow level 2/5/2004 +----------------------+ |read the IP header of | |the received datagram | +----------+-----------+ | ,-v---. / is \ +---------------------+ / the IP \ no |transmit the datagram| ( option )----->|along the SPF route | \present ?/ |according to the | \ / |standard procedure | `--+--' +---------------------+ | yes +-------v--------+ |LMS route lookup| +-------+--------+ | ,-v---. / \ / does \ no +----------+ ( any route )------>| drop the | \ exist ? / | datagram | \ / +----------+ `-+---' | yes +-------v--------+ |EMS route lookup| +-------+--------+ | ,-v---. /is the \ / pointed \ no +---------------+ ( route )----------->|widest-enough | \ valid ? / |shortest search| \ / +-------+-------+ `-+---' | | yes +-------------v--------------+ | |write the index of the | | |selected route in FFIV[FIV] | | +-------------+--------------+ +------v--------+ | | increment FVI |<-----------------+ +---------------+ | +----------v------------+ | transmit the datagram | +-----------------------+ Figure 23: flow chart of the forwarding algorithm Proust Expires - August 2004 [Page 37] Internet-Draft Routing at Flow level 2/5/2004 4. Extension of the Routing Control plane 4.1 Overview As depicted above, the support of an adaptive load balancing service relies on a periodic update of the dynamic load attribute of the routes. Moreover, when the topology changes or when the distribution of the network traffic sustains a significant modification, it is assumed that the dynamic routing protocol would update the content of its RIB, which in turn may lead to the updating of the FIB. For this purpose, the dynamic routing protocol must be extended to support such functional requirements. In this approach, the ISIS dynamic routing protocol is chosen for extension. However, the approach should be easily transposable to the other well-known link state dynamic routing protocol, namely OSPF. The standard specification of the ISIS routing protocol requires that all information about the topology of a given network is conveyed in only one control message called a Link State PDU (LSP). In this respect, each router generates only one valid LSP at a time that should contain complete up-to-date information about its local topology in relation to the global one. Moreover, the specification of the protocol requires that each router must regenerate its LSP on a periodic basis. Besides, whenever a topology change is detected, the routers of the network should propagate such an event in order to inform all routers. It is the router that detects the local topology change that should initiate the updating process by regenerating its LSP. Whatever the reason, when a new valid LSP is received, it should always take overall precedence in the Link State DataBase (LSDB). The convergence property of the routing protocol is ensured as soon as the LSDBs of the routers are synchronized with one another within a routing domain. Thus, most of the inner mechanisms of the ISIS routing protocol aim at synchronizing the LSDBs as quickly as possible. With regard to the first functional requirement, the ISIS protocol is extended to integrate a statistic module which would synthesize the traffic load measures of its transmission interfaces on a periodic basis. To this end, this module might rely on the SNMP protocol to collect the counter values coming from its transmission interfaces which are stored in the MIB-II database of the system. Moreover, the structure of the RIB managed by the ISIS routing protocol must be extended to allow a traffic load parameter to be attributed to a route. Besides, as such synthesis information is by nature much more dynamic than the usual link state information, it is envisioned that such Proust Expires - August 2004 [Page 38] Internet-Draft Routing at Flow level 2/5/2004 information may not be conveyed within usual LSP control messages. Therefore, the ISIS protocol is extended to support the propagation and management of a new control message called Link Load PDU (LLP). Each router would generate a single LLP on a periodic basis, which would be propagated by neighboring systems within a routing domain. For convenience, all valid LLPs would be stored in a separate database called Link Load DataBase (LLDB). As it is the case for the topology-driven route calculation process, it is fundamental that the LLDBs be synchronized as quickly as possible to ensure the coherence of the load balancing mechanism on the scale of a routing domain. To this end, the mechanisms defined for the synchronization of the LSDBs serve as a canvas to design those used to ensure an efficient synchronization of the contents of the LLDBs. Thus, two additional ISIS control messages are also defined that would enforce the synchronization of the LLDBs within a routing domain, namely the Complete Sequence Number Link Load PDUs (CSNLLP) and the Partial Sequence Number Link Load PDUs (PSNLLP). Besides, a new TLV is defined that allows any LLP to be referenced in both CSNLLP and PSNLLP control messages. Hence, whatever the type of the data link interface, these messages would allow ISIS neighbor routers to synchronize both their LLDBs and their LSDBs in due time. The synchronization of the LLDBs should occur whenever an LLP control message is generated. Therefore, the reception of any valid LLP control message would immediately trigger the update of the load attribute of the affected routes contained in the RIB table. For this purpose, the structure of the RIB table under control of the ISIS routing protocol must be extended. A new attribute called "path" is defined. It represents the sequence of next-hop routers along a given route from the viewpoint of a router. Therefore, the update of the traffic load attribute of any route calculated by the ISIS routing protocol can be achieved. It is the maximum of the traffic load of the transmission links that relate to the considered route, expressed as a percentage of an absolute bandwidth. With regard to the second functional requirement, the ISIS routing protocol should be extended to allow routes from the RIB to be added or removed due to some significant changes in the distribution of the traffic. Typically, following the processing of a newly received LLP control message as described above, such a procedure would look through the RIB table, and would determine on the basis of the load attribute, if some routes requires some update action in response. For instance, such adaptive action might consist in adding an alternative route to the one being temporarily overwhelmed by traffic. Thus, this triggered update mechanism requires a route calculation procedure that takes into account several criteria. This path selection procedure should consider, at least, both the cost and Proust Expires - August 2004 [Page 39] Internet-Draft Routing at Flow level 2/5/2004 the load attributes, and may also consider some routing policy constraints. However, this optimization issue is known as being non tractable and synonymous with a NP-complete issue. For this reason, the adopted approach relies on a simple heuristic procedure, which optimizes all criteria but according to a specific order, called "widest-enough shortest", since it first considers the bandwidth requirement and then the total cost of the path. Once this operation is completed, the ISIS routing protocol might proceed immediately to the update of the route contained in the FIB. For this purpose, the structure of the RIB table, under control of the ISIS routing protocol must be extended. A new attribute called "index" is defined. Combined with the destination prefix, it identifies, unambiguously, a route within the FIB. Proust Expires - August 2004 [Page 40] Internet-Draft Routing at Flow level 2/5/2004 4.2 Statistic module As shown in the figure 24 below, the statistic module might be integrated inside the ISIS routing process. +-------------------------------------------------------------+ | | | +-------------------+ | | ,----. | ISIS dynamic | ,----. | |( LSDB )<-->| routing protocol |<-->( LLDB ) | | `----' | | `----' | | | +----------+ | | | ,---. | |statistic |<--+--------+ | | ( RIB )<-->| | module | | | | | `---' | +----------+ | | | | +-------------------+ | | | ^ | | | | | | | | | | | ,---. +---------v---------+ +----v------+ ,----. | | ( RIB )<-->| central routing | | SNMP agent|<->( MIB2 ) | | `---' | protocol | | | `----' | | +---------^---------+ +-----------+ | +----------------------+--------------------------------------+ | | | | +---------v----------+ | | | | ,---. | | | forwarding module |<-->( FIB ) | | | | `---' | | | | | | +--------------------+ | | | +-------------------------------------------------------------+ Figure 24: ISIS routing process in relation to an SNMP agent Moreover, for convenience, it may use the information stored in the MIB-II database as it is a standard that should be included in a router. The MIB-II database specification defines a set of objects that provides information about the operating state of a network device such as a router. These objects are gathered into several groups. With regard to the functional purpose of the statistic module, the group called "interface" is of particular interest. It defines several objects that supply raw statistical information about the operating state of each transmission interface of a router. For instance, the "ifTable" sub-group contains valuable counters that record statistical information about the input and output traffic observed on each Proust Expires - August 2004 [Page 41] Internet-Draft Routing at Flow level 2/5/2004 transmission interface as time goes by. Moreover, these absolute values are expressed either in term of number of datagrams (ifInUcastPkts, ifOutUcastPkts, etc) or in bytes (ifInOctets, ifOutOctets). Thus, the statistic module could request the values of these counters on a regular basis. Given the maximum bandwidth resource of each transmission interface, it could calculate the resulting synthesis of the traffic load on its interfaces, experimented over a fixed period of time. Such a synthesis of results could be expressed as a percentage of the absolute bandwidth. Besides, these syntheses results could be calculated over a well-chosen period of time. As an example, three periods of time are proposed for the experiment. A short time period might be considered as about 2 minutes. A medium time period might be considered as about 5 minutes. A long time period might be considered as about 30 minutes. At best, the duration of the period of time should be a tunable parameter. In some way, this parameter controls the efficiency of the router-based closed control loop, therefore, it might enable a network operator to keep some control over the dynamic of its network system. Among the factors that influence the dynamic state of a network system are the distribution of the traffic or the ratio between the average flow bandwidth and the average available bandwidth provisioned in the core network. Besides, when the statistic module starts collecting measures about the traffic load, it should initialize all its synthesis results with zero. 4.3 RIB structure As mentioned before, the structure of the RIB table under control of the ISIS routing process should be extended along three axes. First, the load attribute is added to the RIB table to reflect the level of traffic load that is experimented along a given route. Second, the path attribute allows the load attribute of any given route, to be updated efficiently and unambiguously. Third, the index attribute serves as a key identifier, when combined with a destination prefix. Therefore, it allows the FIB table to be updated reliably and can be used as a routing information indication. Proust Expires - August 2004 [Page 42] Internet-Draft Routing at Flow level 2/5/2004 +---------------+-----+--------+---------+------------+-----+------+ | Destination |Index|Next-Hop|Interface| Path |Load | Cost | | Prefix | | | | | | | +---------------+-----+--------+---------+------------+-----+------+ |192.168.1.0/24 | 1 |10.1.1.1| GE1/0 | nh1,nh2,.. | 42% | 100 | +---------------+-----+--------+---------+------------+-----+------+ |192.168.1.0.24 | 2 |10.2.1.1| GE1/1 | nh3,nh4,.. | 32% | 150 | +---------------+-----+--------+---------+------------+-----+------+ |192.168.1.0/24 | 3 |10.3.1.1| GE1/2 | nh5,nh6,.. | 8% | 200 | +---------------+-----+--------+---------+------------+-----+------+ |192.168.2.0/24 | 1 |10.3.1.1| GE1/2 | nh1,nh2,.. | 20% | 100 | +---------------+-----+--------+---------+------------+-----+------+ Figure 25: extended structure of the RIB table The "Destination Prefix" field represents an IP prefix linked to a reachable underlying sub-network known within a routing domain. The "Index" field represents an identifier that unambiguously identifies a route among the subset of routes that share the same destination prefix. The "Next-Hop" field represents the IP address of the next-hop router to be used along the route. The "Interface" field identifies the output transmission link of the router to be used along the route. The "Path" field represents the sequence of next-hop routers that identify, unambiguously, the set of transmission and switching resources associated to a given route. The "Load" field represents the maximum of the elementary traffic load that has been measured for each transmission link along the calculated route. This quantity is expressed as a percentage of an absolute bandwidth. The "Cost" field represents the total usage cost of resources along the route. It is calculated for a given topology by means of the Dijkstra's shortest path first (SPF) algorithm. 4.4 LLP control messages The need of an additional ISIS control message dedicated to the transport of statistic traffic load information is first motivated by stability reasons. That way, the routing process clearly separates the processing of the topology-related information from the processing of the traffic load information. The former type of information is assumed to be quite stable over time. It may be stored without any change in the LSDB database for quite a long period of time (from 20 minutes up to 18 hours). Proust Expires - August 2004 [Page 43] Internet-Draft Routing at Flow level 2/5/2004 To the contrary, the latter type of information is much more dynamic by nature and requires to be regenerated by all routers at a more important pace. The order of magnitude of its period of refreshing might be as little as a few minutes. Moreover, from the network bandwidth optimization viewpoint, it might be more efficient to use a separate control message dedicated to the transport of the traffic load related information. In compensation, the routing process must be extended to deal with a new control message, namely, the Link Load PDU (LLP). With regard to the LAN environments, it is assumed that the transformation of the topology, which is used to synchronize the LSDBs efficiently, should also be applied to synchronize the LLDB. That is to say, any fully meshed topology should be transformed into a star topology by means of a pseudonode router located at the center of the star. Since this router is only an abstraction, the cost and traffic load values associated with its transmission links, are equal to zero. In addition, as the standard specification of the ISIS routing protocol defines two routing levels designated by the acronyms L1 and L2, it should be necessary to define one LLP control message for each routing level. However, as this specification is still in a very early development stage, it is designed for only one routing level. Besides, it is a common practice for network operators to deploy ISIS by means of only one routing level, namely, the routing level 2 that is synonymous of a backbone routing level. Therefore, this Internet Draft focuses on extending ISIS routing capabilities at routing level 2 only. Proust Expires - August 2004 [Page 44] Internet-Draft Routing at Flow level 2/5/2004 No. of Octets +--------------------------------+ | | | Common Fixed ISIS Header | 8 | | +--------------------------------+ | PDU Length | 2 +--------------------------------+ | Remaining Lifetime | 2 +--------------------------------+ | LLP ID | ID Length + 2 +--------------------------------+ | Sequence Number | 4 +--------------------------------+ | Checksum | 2 +--------------------------------+ |RES|RES|RES| IS Type | 1 +================================+==================== | | | Variable Length Fields | Variable Length | | +--------------------------------+ Figure 26: format of the LLP control message The field called "PDU Length" represents the total length of the Link Load PDU including the header. The field called "Remaining Lifetime" represents the validity delay of the Link Load PDU expressed in seconds. The field called "LLP ID" represents the identifier of the LLP; it is made of the concatenation of three data fields, the "Source ID", the "Pseudonode ID" and the "LLP number". The "Source ID" field might be six bytes long. It is usually characterized by an IP loopback address that may identify, unambiguously, a router within a routing domain. The "Pseudonode IP" field represents an integer value. It is one byte long. A non zero value identifies a pseudonode router, while a zero value identifies a real router. The "LLP number" field represents the fragment number of the LLP, in case fragmentation processing is needed. It is also one byte long. When no fragmentation occurs, this field is filled with zero. The field called "Sequence Number" represents the sequence number of the Link Load PDU. Proust Expires - August 2004 [Page 45] Internet-Draft Routing at Flow level 2/5/2004 The field called "Checksum" represents a control sum that covers all the bytes of the Link Load PDU included between the "Source ID" field and the end the message. The field called "IS Type" represents the level of routing (L1 only, L2 only, or L1L2) which this LLP belongs to. The variable length field represents all the elements of information conveyed in this LLP. They are represented using TLV coding format. An LLP control message must include exactly one TLV code of type 1. The TLV code of type 1 characterizes the "area address", to which the router that originates this LLP belongs. Moreover, an LLP control message may include exactly one TLV code of type 134 and one TLV code of type 133. The TLV code of type 134 characterizes the "router ID", to which the router that originates this LLP refers. The TLV code of type 133 allows some authentication information to be associated with an LLP control message. Besides, an LLP control message may include as many TLV codes of type 22 as necessary. The TLV code of type 22 allows a transmission link to be defined extensively. This code allows some sub-TLV codes to be specified. In particular, an LLP control message may encapsulate some sub-TLV codes of types 6, 8 and 9 in its TLV codes of type 22. The sub-TLV code of type 6 allows an IPv4 address associated with a transmission interface to be defined. The sub-TLV code of type 8 allows an IPv4 address associated with an ISIS neighbor interface to be defined. The sub-TLV code of type 9 allows a maximum link bandwidth associated with a transmission interface to be defined. Moreover, in order to convey the statistic information about the traffic load of any transmission link it is necessary to define additional sub-TLV codes. Thus, it is proposed to define 3 additional sub-TLVs that would enable a network operator to finely configure the granularity of information collected in the LLDBs. A sub-TLV called "Short Term Link Load" could be used to represent the level of traffic load measured over the last two minutes. The sub-TLV called "Medium Term Link Load" could be used to represent the level of traffic load measured over the last five minutes. The sub-TLV called "Long Term Link Load" could be used to represent the level of traffic load measured over the last 30 minutes. Proust Expires - August 2004 [Page 46] Internet-Draft Routing at Flow level 2/5/2004 4.5 Synchronization of the LLDB As mentioned before, the mechanisms used to synchronize the LLDBs are identical to the ones used for the synchronization of the LSDBs. For independence and flexibility reasons, those mechanisms are duplicated. Thus, two additional ISIS control messages should be defined for a given routing level, a Complete Sequence Number Link Load PDU (CSNLLP) and a Partial Sequence Number Link Load PDU (PSNLLP). As mentioned before, this Internet Draft focuses on extending ISIS routing capabilities at routing level 2 only. The processing of the CSNLLP and PSNLLP control messages should be the same as the ones defined for the CSNP and PSNP control messages respectively. In brief, the procedure uses the same selective flooding mechanism to quickly broadcast any new LLP message generated by a router within a routing domain. With the provision that the received LLP is valid and includes more recent information than the one contained in its LLDB, a router should retransmits this LLP through all its interfaces except the one whose LLP message was received from. Such condition is based both on the sequence number and the remaining lifetime defined in the header of the LLP messages. Besides, in order to guarantee the delivery of LLP messages to all routers within a routing domain, the procedure checks periodically that all neighboring LLDBs are identical. Depending on the type of transmission link, this procedure is more or less complex. In LAN subnetwork environment, it is based on the concept of pseudonode router which logically transforms a broadcast subnetwork into a simple star of point to point interconnection links. Thus, all exchanges of PSNLLP and CSNLLP messages are conducted between the pseudonode router and the other neighboring routers. The pseudonode router is designated among all neighboring routers by means of a simple election mechanism. This artifice reduces the control traffic load on the LAN and quickens the synchronization period of time. Furthermore, to cope with the contention issue inherent in LAN subnetworks accesses, the pseudonode router regularly broadcasts a CSNLLP message on the LAN. This message includes a summary of its LLDB by listing all the LLP headers recorded in its database. Hence, each neighboring router can check the consistency of its own LLDB in relation to the one of the pseudonode router. In point to point subnetwork environment, it is based on a simple exchange of CSNLLP and PSNLLP messages that serve as query and response messages. Once the LLDB databases are synchronized, no more exchange of message is necessary. Proust Expires - August 2004 [Page 47] Internet-Draft Routing at Flow level 2/5/2004 However, while the body of CSNP and PSNP control messages refer to LSP messages, the body of CSNLLP and PSNLLP control messages should refer to LLP messages. To this end, a new TLV code is needed to identify a list of LLP identifier entries. It is intentionally called "LLP Entries" by analogy with the TLV code of type 9. Each LLP identifier entry would be composed of an LLP ID, a remaining lifetime value, a sequence number value, and a checksum value. The format of such TLV code is represented in figure 27 below. No. of Octets +--------------------------------+ | CODE | 1 +--------------------------------+ | LENGTH | 1 +--------------------------------+ | VALUE | LENGTH +--------------------------------+ | REMAINING LIFETIME | 2 +--------------------------------+ | LLP ID | 8 +--------------------------------+ | LLP SEQ NUMBER | 4 +--------------------------------+ | CHECKSUM | 2 +--------------------------------+ : : : : +--------------------------------+ | REMAINING LIFETIME | 2 +--------------------------------+ | LLP ID | 8 +--------------------------------+ | LLP SEQ NUMBER | 4 +--------------------------------+ | CHECKSUM | 2 +--------------------------------+ Figure 27: new TLV code called "LLP entries" Proust Expires - August 2004 [Page 48] Internet-Draft Routing at Flow level 2/5/2004 The format of both CSNLLP and PSNLLP control messages are represented in the figures hereafter. No. of Octets +--------------------------------+ | | | Common Fixed ISIS Header | 8 | | +--------------------------------+ | PDU Length | 2 +--------------------------------+ | Source ID | ID length + 1 +--------------------------------+ | Start LLP ID | ID Length + 2 +--------------------------------+ | End LLP ID | ID Length + 2 +================================+==================== | | | Variable Length Fields | Variable Length | | +--------------------------------+ Figure 28: format of the CSNLLP control message The field called "PDU Length" represents the total length of the CSNLLP control message including its header. The field called "Source ID" usually represents a system address that unambiguously identifies the router that originates this control message. In most cases, the "ID length" might be six bytes long. Usually, the "Source ID" value is obtained by concatenation of an IP loopback address and an integer value named the "Pseudonode IP". The latter is one byte long. A non zero value identifies a pseudonode router, while a zero value identifies a real router. The field called "Start LLP ID" represents the minimum value of the LLP ID identifiers, included in the list of LLP identifier entries that follows in the body of the CSNLLP control message. The field called "End LLP ID" represents the maximum value of the LLP ID identifiers, included in the list of LLP identifier entries that follows in the body of the CSNLLP control message. The variable length fields might contain one TLV code of type 133 and one or several TLV codes of the new type called "LLP entries". Proust Expires - August 2004 [Page 49] Internet-Draft Routing at Flow level 2/5/2004 No. of Octets +--------------------------------+ | | | Common Fixed ISIS Header | 8 | | +--------------------------------+ | PDU Length | 2 +--------------------------------+ | Source ID | ID length + 1 +================================+==================== | | | Variable Length Fields | Variable Length | | +--------------------------------+ Figure 29: format of the PSNLLP control message The field called "PDU Length" represents the total length of the PSNLLP control message including its header. The field called "Source ID" usually represents a system address that unambiguously identifies the router that originates this control message. In most cases, the "ID length" might be six bytes long. Usually, the "Source ID" value is obtained by concatenation of an IP loopback address and an integer value named the "Pseudonode IP". The latter is one byte long. A non zero value identifies a pseudonode router while a zero value identifies a real router. The variable length fields might contain one TLV code of type 133 and one or several TLV codes of the new type called "LLP entries". 4.6 RIB update 4.6.1 Periodic updates The standard specification of the ISIS routing protocol requires that each router within a given routing domain should regenerate and propagate its LSP on a periodic basis. This period of time is called the maximumLSPGenerationInterval and may vary between 20 minutes and 18 hours. Therefore, on receipt of such LSP, a router should update its LSDB which might, in turn, lead to the updating of its RIB. The present specification of the extension capabilities of the ISIS routing protocol requires that each router should regenerate and propagate its LLP on a periodic basis. This period of time is called the maximumLLPGenerationInterval and may vary between 2 and 30 minutes. Therefore, on receipt of such LLP, a router should update its LLDB which might in turn lead to the updating of its RIB. Proust Expires - August 2004 [Page 50] Internet-Draft Routing at Flow level 2/5/2004 This updating operation consists only in updating the load attribute of the affected routes. Nevertheless, though being an independent operation, such periodic update might also be followed by a trigger update that may add or suppress an alternative route due to a significant change in the distribution of the network traffic load. 4.6.2 Triggered updates As said above, such an update of the load attribute may trigger the withdrawal or the addition of an alternative route. Anytime, the load attribute update operation that corresponds to the processing of a newly received LLP message, is carried out, the router should launch a control process to check that no route in the RIB is overloaded or under-loaded. To this end, it uses two thresholds: a minimum and a maximum. Anytime, the level of the load attribute of a route is below the minimum or above the maximum threshold, the router should retrieves all the routes that share the same destination prefix as the one linked to this route from its FIB. The decision of adding an alternative route into the RIB would be based on the following set of conditions: - if the load attribute of a route is greater than a maximum threshold value, then, - if the load attribute of all the other routes that share the same destination prefix, shows that these routes are also in a state of traffic overload, then, - if the number of active routes (for instance, such a limit may have a value of 15) that share this destination prefix is still under a limit fixed by a routing policy constraint, then, o a new route might be added into the RIB The decision of withdrawing an alternative route from the RIB would be based on the following set of conditions: - if the load attribute of a given route in the RIB is smaller than a minimum threshold value, then, - if that given route is not the primary route also called SPF route (i.e. a route which has an index value of 1), then, - if the load attribute of another non primary route among those that share the same destination prefix, shows that it is also in a state of traffic under load, then, - if the total cost of this given route is higher than the total cost of this other non primary route being in a state of traffic under load, then, o this given route might be withdrawn from the RIB Proust Expires - August 2004 [Page 51] Internet-Draft Routing at Flow level 2/5/2004 In no case, should a valid primary route be suppressed due to a low level of traffic load. Moreover, in order to avoid both decisional procedures to trigger each other, which may result in a sequence of addition and withdrawal operations, once a shortest alternative route is added it is no longer suppressed according to this mechanism. On the contrary, a non shortest alternative route could be suppressed due to a low level of its load attribute. Anytime an LLP control message is correctly processed in the LLDB and RIB databases, each router would walk through its RIB and look for the applicability of the above set of conditions. The figure 30 below shows the dynamics of this decision process in relation to the load attribute. at this time a new route is created | route | traffic ^ | at this time a route might be suppressed load | v | | ,-. | | ; \ _ | MAX 30%|________;____:______/_\___|_____________ | / : ,' \ | | / `--' : | | ,' | | | / | | | ; : v / MIN 3%|__|______________________\____/_________ | | `--' |_______________________________________> time Figure 30: dynamics of the process of creation/suppression of a route Besides, according to the standard specification of the ISIS routing protocol whenever a topology change is detected, it is advertised within a given routing domain by means of the propagation of a new LSP. As a result, the LSDB table is updated which in turn might lead to the updating of the RIB table. Anytime a topology change affects the RIB table, the ISIS routing protocol should also consider its impact on the alternative routes. To this end, all routes that need to be recalculated should be identified. The ones that are not valid anymore should be suppressed from the RIB as soon as possible. Proust Expires - August 2004 [Page 52] Internet-Draft Routing at Flow level 2/5/2004 4.7 Route calculation algorithm The route calculation procedure relies both on the LSDB and the LLDB databases. Whenever a route must be calculated, the route calculation procedure is carried out either in one step or in two steps according to the type of route. If the RIB database no longer contains any more a valid primary route (also named SPF route) associated with the given destination prefix, then the calculation procedure must calculate it by means of a SPF algorithm based on the overall LSDB database. The index value attributed to a primary route must always be set to one. To the contrary, if the RIB database still contains a valid primary route associated with the given destination prefix, then the route calculation procedure must be carried out in two steps. First, it selects, from the LSDB database, the link transmissions that are able to sustain a certain amount of traffic load. On the basis of the information stored in the LLDB, this procedure logically suppresses from the LSDB all transmission links, whose traffic loads are greater than a maximum threshold value. Second, based on the resulting reduced topology, this route calculation procedure would work out a shortest path according to the second metric being considered, namely, the cost. This metric is associated to any transmission link and represents the cost of using a transmission link. Usually, the value associated with the cost attribute of a transmission link is inversely proportionate to its maximum bandwidth. Therefore, the more bandwidth associated with a transmission link, the less it costs to use it. Finally, if the total cost of the calculated route is not exceeding a certain limit value fixed to comply with a routing policy consideration, then the route is activated in the RIB of the ISIS routing protocol. 5. Ongoing Experiment At the present time, an implementation of the overall architecture has been started. The testbed is composed of several PC-based routers and terminals that rely on the Linux operating system. Besides, all developments that relate to the ISIS routing protocol are based on a commercial distribution of the Zebra protocol routing stack. Proust Expires - August 2004 [Page 53] Internet-Draft Routing at Flow level 2/5/2004 6. Conclusion The architecture presented in the present Internet Draft aims at load balancing flows among multiple routes that are believed to be slightly loaded. Basically, the approach relies on several functional extensions that affect both the IP protocol and an intra- domain link state dynamic routing protocol, such as, ISIS. The former extensions enable flow to be load balanced according to a criterion synonymous with a level of resource availability. Besides, thanks to an end-to-end piggybacking mechanism, those extensions of the IP protocol allow flows, in the course of time, to be pinned along the routes. The latter extensions enable multiple routes associated with a given destination prefix to be calculated according to multiple criteria, such as, the level of bandwidth availability and the total cost. In addition to the shortest path first routes, the routing protocol is able to manage alternative routes that would use available resources. To this end, the ISIS routing protocol is extended to maintain a representation of the current distribution of the network traffic load on a short term periodic basis. As the solution applies to flow level, in term of performance and efficiency, it possesses a useful level of granularity. Besides, the solution is stateless, hence, it is in some ways scalable and secure friendly. Moreover, thanks to the route pinning mechanism, the load-balancing service should exhibit some good stability properties. In terms of simplicity, the implementation of the end-to-end piggybacking mechanism should be of low cost. However, the support of the designed forwarding mechanism in the routers might require special care. With regard to this concern, it is expected to get some feedback from the experimental field in the future. Proust Expires - August 2004 [Page 54] Internet-Draft Routing at Flow level 2/5/2004 IANA Considerations This draft requires the use of a new IP option and, therefore, a new option number from the number space within the IP protocol. Moreover, it also requires the use of new ISIS control messages and, therefore, new PDU type values from the number spaces within the specification of the ISIS routing protocol. Besides, this draft also requires the use of new sub-TLVs and, hence, new code values from the number spaces within the specification of the ISIS TLV type number 22 named "Extended IS Reachability TLV". This section explains the logic used by the author to choose the most appropriate number space for each new entity, and is intended to assist in the determination of any final values assigned by IANA. New IP Option The Internet Protocol has provisions for optional header fields identified by an option type field. As the IP option defined in this draft must be copied in the fragments, the most significant bit of the one byte length option field must be set to one. Besides, as this option belongs to the class of control messages, the next two significant bits must be set to zero. Finally, the option type within the class 0 is coded using the 5 least significant bits of the option field. At the time of writing this draft, the number 25 is the next available value. Therefore, the proposed value for the IP option defined in this draft is the decimal number 153. New PDU Types The type of any ISIS control message is defined in the fixed ISIS common header of eight bytes long. Besides, it is coded in the "PDU type" field that is 5 bits long. As the time of writing this Internet Draft, it appears that values 30, 35 and 37 of the "PDU type" field are available. Thus, with regard to the code value needed to be associated with the definition of the PDU type named LLP at routing level 2, the proposed value is 30. Besides, it is proposed to use the number 35 to be associated with the definition of the PDU type named CSNLLP at routing level 2. Finally, it is proposed to use the number 37 to be associated with the definition of the PDU type named PSNLLP at routing level 2. New TLV Code The TLV codes are included in the variable length fields of the different ISIS control messages. As its acronym implies, the type of any TLV code is defined in the first field of this standard data structure. It is coded by means of one byte. Proust Expires - August 2004 [Page 55] Internet-Draft Routing at Flow level 2/5/2004 This draft requires the use of a new TLV code called "LLP Entries". Thus, by analogy with the TLV code of type 9, it is proposed to associate the value of 19 with this TLV code. New sub-TLV Codes The TLV code of type 22, called "Extended IS Reachability TLV", enables the characterization information of a transmission link being advertised in an LSP to be extended. It is specified in [ISIS-TE] and relies on the concept of sub-TLV for the purpose of traffic engineering parameters definition. This draft also requires the use of three new sub-TLVs and, hence, three new code values from the number space within the specification of the TLV type number 22. At the time of writing this document, values 19, 20 and 21 are available from this sub-TLV number space. Hence, it is proposed to associate the values 19, 20 and 21 with the sub-TLVs named "Short Term Link Load", "Medium Term Link Load" and "Long Term Link Load" respectively. Security Considerations As mentioned earlier, a terminal might misbehave when processing the IP option described in this document. For instance, a terminal could send a datagram with an inappropriate sequence of index values. In that case, the routers along the route would detect such anomaly and, therefore, would proceed to the remarking of the FFIV vector. Therefore, from the viewpoint of a terminal, it is useless to spoof the FFIV vector, since such practice would finally end up rerouting the flow. Besides, as the routes are calculated independently of the flows by means of a shortest path algorithm, it should be impossible to introduce a routing loop, by means of a manipulation of the FFIV vector values. Normative References [RFC2026] Bradner, S., "The Internet Standards Process û Revision 3", BCP 9, RFC 2026, October 1996. [BCP14] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [FYI18] Malkin, G., "Internet Users' Glossary", BCP 18, RFC 1983, August 1996. [ISIS] ISO 10589, "Intermediate System to Intermediate System Intra- Domain Routeing Exchange Protocol for use in Conjunction with the Proust Expires - August 2004 [Page 56] Internet-Draft Routing at Flow level 2/5/2004 Protocol for Providing the Connectionless-mode Network Service (ISO 8473)" [Also republished as RFC 1142] Informative References [ISIS-IETF] Callon, R.W., "Use of OSI IS-IS for routing in TCP/IP and dual environments", RFC 1195, December 1990. [INT-ARCH1] Carpenter, B., "Architectural Principles of the Internet", RFC 1958, June 1996. [INT-ARCH2] Bush, R., Meyer, D., "Some Internet Architectural Guidelines and Philosophy", RFC 3439, December 2002. [BCP41] Floyd, S., "Congestion Control Principles", September 2000. [QOSARCFRWK] Huston, G., "Next steps for the IP QoS Architecture", RFC 2990, November 2000. [QOSRFRWK] Crawley, E., Nair, R., Rajagopalan, B., Sandick, H., "A framework for QoS-based Routing in the Internet", RFC 2386, August 1998. [TE-REQ] Awduche, D., Malcolm, J., Agogbua, J., O'Dell, M., and McManus, J., "Requirements for Traffic Engineering Over MPLS,", RFC 2702, September 1999. [OMP] Villamizar, C., Li, T., "IS-IS Optimized Multipath (ISIS- OMP)", Work in Progress, January 2002. [QOSPF] Apostolopoulos, G., Williams, D., Kamat, S., Guerin, R., Orda, A., Przygienda, T., "QOS routing mechanims and OSPF extensions ", RFC 2676, August 1999. [OSPF] Moy, J., "Open Shortest Path First version 2", STD 54, RFC 2328, April 1998. [E2E] Kempf, J., Austein, R., "The Rise of the Middle and the Future of End to End: Reflections on the evolution of the Internet Architecture", Work in Progress, October 2003. [IAB-FUTURE] Atkinson, R., Floyd, S., "IAB Concerns and Recommendations Regarding Internet Research and Evolution", Work in Progress, October 2003. [IRTF-ROUTREQ] Doria, A., Davies, E., Kastenholtz, F., "Requirements for Internet Domain Routing", Work in Progress, December 2003. Proust Expires - August 2004 [Page 57] Internet-Draft Routing at Flow level 2/5/2004 [IRTF-RR] http://www.irtf.org/rrg/ [ISIS-TE] Smit, H., Li, T., "IS-IS extensions for Traffic Engineering", Work in Progress, August 2003. Acknowledgments The author would like to thank No‹l Cantenot of France Telecom R&D for his fruitful comments on this work. Author's Address Christophe Proust France Telecom R&D 42 rue des Coutures 14066 Caen cedex 4 - FRANCE Phone: +33 2 31759308 Email: christophe.proust@francetelecom.com IPR Notice The IETF has been notified of intellectual property rights claimed in regard to some or all of the specification contained in this document. For more information consult the online list of claimed rights. Full Copyright Statement Copyright (C) The Internet Society (2003). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English. Proust Expires - August 2004 [Page 58]