INTERNET DRAFT                                            OHTA Masataka
draft-ohta-ric-hqlip-00.txt               Tokyo Institute of Technology
                                                         FUJIKAWA Kenji
                                                       Kyoto University
                                               Real Internet Consortium
                                                          25 March 2001
                                                 Expires on 25 November

          Hierarchical QoS Link Information Protocol (HQLIP)

Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

HQLIP is a protocol to distribute link information with multiple
layers of hierarchy. It can be used for Internet-wide link state
routing.

1. Introduction

HQLIP is a protocol to distribute link information with multiple
layers of hierarchy. It can be used for Internet-wide link state
routing.

Link state routing protocols, such as HQLIP, are essential for QoS
routing and multicast. Conventional routing protocols such as BGP can
also be used for best effort unicast routing.

In a sense, HQLIP is an extension of OSPF that takes QoS routing into
account. OSPF supports only two layers of hierarchy and can not
propagate link failure or QoS degradation.

With HQLIP, multiple layers of hierarchy are supported, so that it
scales to Internet-wide routing.

The QoS link state payload is not opaque to the HQLIP protocol:
HQLIP knows whether some QoS information is better or worse than
another.
As such, HQLIP can rapidly propagate QoS degradation while
suppressing the rate of QoS information updates.

With HQLIP, a network is divided into areas connected by routers.
Areas have multiple layers of hierarchy. Each physical interface of a
router is a level 0 area. Areas of level i or lower (subareas) may be
connected by routers to form an area of level i+1 (parent area).

Area borders are located on routers. If an interface of a router
belongs to an area, the router belongs to the area. A router
belonging to multiple areas is called a border router of those areas.

The lowest layer, that is, the layer with the most detailed and least
aggregated information, is a level 0 area. A datalink layer is an
area of level 1 or more.

Each area has a center to aggregate QoS information.

The following values are recommended for level numbers of areas:

   0-7    site (organization or home) local
   8-15   inter providers
   16     maximum level (area of the whole Internet)

Areas are connected by links, which also have multiple layers of
hierarchy. A link connecting from a level i area to a level j area
has a level of (i,j).

Within an area, two kinds of information are flooded. One is area
information, which is relatively static. IP addresses of the area and
accounting information are examples of area information. The other is
link information, which is relatively dynamic. Connectivity between
areas and QoS between centers of areas are the link information.

Link information is further classified into internal link information
and external link information.

Information, flooded in a level i area, on how its subareas (of
levels j and k (j, k < i)) are connected is internal link information
of level (j, k).

Information, flooded in a level i area, on how its subarea (of level
j (j < i)) is connected to an area (of level k) adjacent to the level
i area is external link information of level (j, k).
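The level ranges and the internal/external distinction above can be sketched as follows. This is only an illustration: the function names, the representation of an area ID as a (level, name) tuple and the explicit set of subareas are assumptions made for the example and are not defined by this document.

```python
def level_scope(level):
    """Return the recommended scope for an area level (section 1)."""
    if not 0 <= level <= 16:
        raise ValueError("level must be in the range 0..16")
    if level <= 7:
        return "site local"
    if level <= 15:
        return "inter providers"
    return "whole Internet"


def classify_link(area_level, subareas, start, end):
    """Classify link information flooded in an area of level area_level.

    subareas is the set of (level, name) IDs contained in the flooding
    area; start and end are the (level, name) IDs of the link ends.
    Returns (kind, (j, k)) where kind is "internal" or "external".
    """
    if start not in subareas:
        raise ValueError("link must start at a subarea of the flooding area")
    internal = end in subareas
    # Every end that lies inside the flooding area must have a lower level.
    for lvl, _name in ([start, end] if internal else [start]):
        if lvl >= area_level:
            raise ValueError("subarea level must be below the area level")
    return ("internal" if internal else "external", (start[0], end[0]))
```

For example, in a level 2 area whose subareas are (1, "a") and (0, "b"), a link from (1, "a") to (0, "b") is internal link information of level (1, 0), while a link from (1, "a") to an adjacent area (3, "x") is external link information of level (1, 3).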
HQLIP routers use UDP port XXX to find each other and TCP port YYY
for communication between routers.

Configuration information is explained in section 2, mutual finding
of routers in section 3, communication between routers in section 4,
generation of area information in section 5 and generation of link
information in section 6.

2. Configuration Information

At least one router of an area should have, as configuration
information, the ID of the area, the IP address ranges (a list of IP
addresses and netmask lengths) belonging to the area and other area
information. If there are multiple such routers, the configuration
information must be identical.

An area ID must be unique in its parent area and among the areas
contained (directly or recursively) by the parent area.

An area ID is constructed as (level, IP_address). The IP_address is a
globally unique unicast IPv4 address of the area (IPv4 case), the
lower 8 bytes of an IPv6 address of the area (IPv6 level 0 case) or
the upper 8 bytes of a globally unique unicast IPv6 address of the
area (IPv6 level 1 or more case). A level is represented by 1 byte,
so an IPv4 area ID is 5 bytes long and an IPv6 area ID is 9 bytes
long.

An area of level 1 or more has a center for QoS information
aggregation. A level 0 area is a point with no internal structure and
is itself the center.

If a router of an area is a candidate for the center of the area, the
router has configuration information of center ID and center
priority, both of which must be unique within the area. A center
candidate must have area information.

3. Mutual Finding of Routers and Their Confirmation

Each router broadcasts (IPv4 case) or multicasts to the all-hosts
multicast address (IPv6 case) UDP packets of length 0 at intervals of
20 to 30 seconds (the interval varies randomly each time). The source
address of the packets is that of the originating interface and the
TTL is 1.

Each router maintains a list of adjacent routers. The list is empty
at first.
The list contains the interfaces used to reach adjacent routers and
the IP addresses used by the adjacent routers.

Each router, upon receiving a packet (broadcast or multicast) on UDP
port XXX, looks up the list of adjacent routers. Newly found routers
are added to the list and a TCP connection is established between the
routers at port YYY. The TCP connection is initiated by the router
with the smaller source IP address in its UDP packets.

A router periodically transmits keep alive messages if there is no
data to send. The interval of keep alive messages is 10 seconds. The
interval may be shortened by the administrators of both ends of the
TCP connection if the keep alive messages do not consume too much
resource (link bandwidth and message processing power). In that case,
TCP timer values, such as those for retransmission, must also be made
smaller, or keep alive will time out unnecessarily.

If no data, including keep alive, is sent over a TCP connection for
more than 3.5 times the keep alive interval (35 seconds by default),
the connection is considered to be down and is disconnected. The
router at the other end must be deleted from the list of adjacent
routers.

4. Communication between Routers

A router has a database for each level of area to which the router
belongs, holding the link and area information of the area. Routers
flood their databases to each other through the TCP connections
between them.

The initial state of the database is empty.

The format of the entries of the database is as shown in Figures 1
and 2.

   +--------------------------+
   |         Type(1)          |
   |--------------------------|
   |      Parent area ID      |
   |--------------------------|
   |    Link start area ID    |
   |--------------------------|
   |     Link end area ID     |
   |--------------------------|
   |    Source time stamp     |
   |--------------------------|
   |          Metric          |
   |--------------------------|
   |     QoS Information      |
   +--------------------------+

   Figure 1. Link Information

   +---------------------+
   |    Type (2,3,4)     |
   |---------------------|
   |   Parent area ID    |
   |---------------------|
   |      Source ID      |
   |---------------------|
   |  Source time stamp  |
   |---------------------|
   |       Payload       |
   +---------------------+

   Figure 2. Area Information

The type field of link information is 1 byte long and has the value
1. The source time stamp and metric fields are 4 bytes long. QoS
information is of variable length.

Metric is used for routing best effort communication and QoS
information is used for routing resource reserved communication.

The source ID of area information is the center ID of the router
which generated the area information. Values for the type field of
area information are 2 for center information, 3 for IP address
information and 4 for accounting information. The type field is 1
byte long, the source time stamp field is 4 bytes long and the
payload is of variable length.

If no accounting is performed in an area, accounting information for
the area is generated with an empty payload.

Center information with an empty payload means that the source is no
longer a center or a center candidate.

The source time stamp is the lower 4 bytes of the number of
milliseconds since 00:00:00 January 1st, 1970 (GMT) and is compared
using the serial number arithmetic of RFC 1982.

For the data format in TCP streams, see Appendix A.

Each entry in a database is tagged with a local time stamp recording
the time at which the entry was updated. Though the various
constraints on time in this specification are defined for the source
time stamp, it is recommended that the local time stamp also be used
for sanity checking. Except for link local messages, when comparing
source and local time stamps, 15 seconds of jitter for message
propagation should be tolerated. The maximum allowable jitter for
link local messages is defined link by link.

To prevent inconsistency in accounting, each router should keep the
error of its source time stamp within 5 seconds.
To prevent jitter, the value of a source time stamp should be
adjusted by at most 1 second per minute and should never move
backward.

The source time stamp on the same link or area must be incremented
each time. If link or area information is updated repeatedly within
1ms, the source time stamp value will be different from the value of
the clock.

A database for an area has two states: synchronizing and
synchronized. The initial state is synchronizing.

4.1 Initial synchronization of database

After a TCP connection between routers is established, the routers
synchronize their databases with each other by sending the content of
the databases of the areas common to them, starting from the lowest
level area, and update their databases. During the synchronization,
the synchronizing flag of the messages is turned on.

At the end of the synchronization of each area, an area
synchronization message is sent.

When the area synchronization message has been sent and received for
an area, the state of the database becomes synchronized and the
database may be used for routing table computation.

4.2 Maintaining synchronization of database

A router receiving a link information message adds the received
information to its database if there is no database entry with the
same link start and link end, or updates the existing entry if the
time stamp of the message is newer than that of the existing entry.
However, if the update is too frequent, the router ignores the
update, generates a warning message to the operator and floods
information that the link is down.

If the updated link information is degraded in every sense, there is
no update rate limit.
If the updated link information stays the same or is improved in some
sense, at least 30 seconds must have passed since the previous update
if the previous update also stayed the same or was improved in some
sense, or at least 30*k seconds must have passed since the k-th
previous update (1<=k<=n) if the previous n updates were degraded in
every sense.

A router receiving an area information message replaces the existing
entry of the same type. However, if the update is too frequent (more
than 3 updates are performed in 70 minutes), the router ignores the
update, generates warning messages to operators and stops using the
area of the message.

If a router receiving a message adds or updates its local database
entry, the router relays the message to the other interfaces
belonging to the same area as the interface through which the message
was received. If the area over the connection is in the synchronizing
state, the messages to be relayed are buffered appropriately.

Link information with an empty payload means the disconnection or
failure of the link. A router receiving link failure information
removes the entry for the link from its database and floods the
information through TCP connections to all the interfaces of the
area except for the incoming interface of the information. If there
are entries of links or areas which become unreachable from the
router, those entries are also removed.

5. Generation of Area Information

5.1 Determining the Center

The subarea ID of an area of level i, the area ID of the area and
the center priority are flooded in the area by center candidates as
area information on the center.

If the priority of a center candidate changes or a former center
candidate is no longer a center candidate, the information is
flooded immediately.

If a router of an area is a center candidate of the area and there
are no other routers with higher priority than the router, the router
becomes the center of the area.
If a router finds another router of the same priority as itself,
this is an error to be reported to the operator.

5.2 Other area information

A center router of an area floods area information other than center
information within the area.

The router may flood the information if there is some change to the
information, but no more frequently than 3 times an hour.

Routers other than a center router that have area information warn
the operators if they receive area information contradicting their
own copy of the area information.

6. Generation of Link Information

6.1 Generation of Link Information at Level (0,0)

A router generates internal link information of level 0 for the
outgoing direction of packets on the links connected to its
interfaces and floods the information in the parent area of the
interface.

Link start and end are determined by the IP addresses of the TCP
connections.

Link metric is statically configured for the outbound direction and
is empty on link failure.

QoS information is empty on link failure or if the link is not
capable of QoS assurance. Otherwise, the QoS information is the
outbound one. QoS information may be worse than the actually
available QoS of the link.

6.2 Generation of External Link Information at Level (0,i)

A router on a border of an area of level i computes QoS from the
center of the area to the router and floods it as external link
information (at level (0,i)) from the interfaces that belong not to
the area but to the parent area of the area, within the parent areas
of those interfaces.

Link start and end are determined by the IP address of the interface
and the area ID of the level i area.

Link metric is statically defined for the outgoing direction and is
empty on link failure.

QoS information is empty on link failure or if the link is not
capable of QoS assurance. Otherwise, the QoS information is the
inbound one. QoS information may be worse than the actually available
QoS of the link.
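The rules in sections 6.1 and 6.2 for when the metric and QoS fields are empty can be sketched as follows. The sketch assumes that "empty" is modeled as None and that QoS information is an opaque byte string; the LinkInfo class and the make_link_info function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkInfo:
    start: str
    end: str
    metric: Optional[int]  # None models an empty metric (link failure)
    qos: Optional[bytes]   # None models empty QoS information


def make_link_info(start, end, configured_metric, qos, link_up, qos_capable):
    """Build a link information entry following sections 6.1 and 6.2."""
    if not link_up:
        # On link failure both the metric and the QoS information are empty.
        return LinkInfo(start, end, None, None)
    # A link that can not assure QoS still carries its static metric.
    return LinkInfo(start, end, configured_metric, qos if qos_capable else None)
```

A best effort only link therefore remains usable for metric based routing while being excluded from routing of resource reserved communication.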
6.3 Generation of Link Information at Level (j,i)

A router at the center of an area of level j (j>=1) computes, based
on the internal and external link information flooded in the area,
the inbound QoS from an area (of level i) at the end of external link
information and floods it as link information at level (j,i) to the
parent area. The link information is internal if the parent area is
also a parent area of the area of level i, or external otherwise.

Link start and end are the area IDs of the level j and level i areas.

Link metric is statically defined for the direction from the level j
area to the level i area and is empty on link failure.

QoS information is empty on link failure or if the link is not
capable of QoS assurance. Otherwise, the QoS information is for the
direction from the level i area to the level j area. QoS information
may be worse than the actually available QoS of the link.

6.4 Link Information Update

The rate of link information updates is limited so that the amount
of traffic and route recomputation does not become too large.

However, to minimize the possibility of reservation failure caused by
link state degradation, there is no rate limitation for updates to a
link state worse than the current one in every sense.

There still is a limitation on the average update rate: if updates
worse than the current state in every sense are repeated too often,
updates to a state better in some sense are delayed. That is, if the
link state is getting worse gradually and announcements of worse
state are repeated many times, the announcement of a better link
state will be delayed.

Link state A is "better in some sense" than link state B if there is
some resource reservation request which is possible with A but not
with B. Sometimes, A is "better in some sense" than B and B is
"better in some sense" than A. "Worse in some sense" can be defined
likewise.
Link state A is "worse in every sense" than B if A is "worse in some
sense" than B and A is not "better in some sense" than B.

Two minutes after its TCP connection becomes synchronized, a router
investigates the link to the other end of the connection and
generates a link state entry equal to or worse in every sense than
the actual link state.

When a TCP connection is disconnected, a link state entry of link
failure is immediately generated and added to the database.

When the link state changes and the actual QoS becomes worse in some
sense than the current database entry, the database is immediately
updated. The new link state entry must be equal to or worse in every
sense than the current link state. If 30 seconds have not passed
since the last update, or if the previous n updates were toward worse
in every sense and k*30 seconds have not passed since the k-th
previous update (1<=k<=n), the new link state entry must be worse in
every sense than the current database entry.

When the link state is better in some sense than the current database
entry, a router may update the database with an entry equal to or
worse in every sense than the current link state and better in some
sense than the current database entry. 30 seconds must have passed
since the last such update. If the previous n updates were toward
worse in every sense, k*30 seconds must have passed since the k-th
previous update (1<=k<=n).

7. Routing with Hierarchical Structure

Routing with HQLIP is performed from the upper layer first, without
considering the routing in lower layers.

First, in the lowest level area that contains both of the points
between which a route is to be computed, a sequence of subareas
containing the best path between the points is computed using the
internal link information of the area.

Within each subarea, the route is computed recursively using the
link information of the subarea.
Metrics of subareas are not added to the metric of the parent area
and so do not affect routing within the parent area.

8. Terminal Area

A part of the Internet may be treated as a terminal area.

All the information within the terminal area must be flooded within
the area and propagated outside of the area.

However, information from outside areas may be omitted.

This property is especially useful to reduce traffic in the terminal
area when address aggregation is not performed well, as is the case
with IPv4.

If an area is, at a certain level, connected with the rest of the
Internet only by a single link, outside information at that level or
above may be omitted from the announcement to the inside. Such an
area is called a terminal area.

Further, if an area has, other than terminal areas, only a single
area connecting it to the rest of the Internet, the area needs only
the information between them, and such an area is recursively called
a terminal area.

In this case, however, resource reservation to the outside of the
terminal area can not be known to be successful until RESV messages
reach the outside of the area. It is not possible to have policy,
either.

References

[SS] Fujikawa, K., and Sasaki, M., ``Service Specification (SS),''
     Internet Draft (work in progress), draft-fujikawa-ric-ss-01.txt,
     March 2001.

Authors' Address

Masataka Ohta
Computer Center
Tokyo Institute of Technology
2-12-1, O-okayama, Meguro-ku, Tokyo 152, JAPAN

Phone: +81-3-5734-3299
Fax: +81-3-5734-3415
EMail: mohta@necom830.hpcl.titech.ac.jp

FUJIKAWA Kenji
Graduate School of Informatics
Kyoto University
Yoshidahonmachi, Sakyo-Ku, Kyoto City, 606-8501, JAPAN

Phone: +81-75-753-5387
Fax: +81-75-751-0482
EMail: fujikawa@real-internet.org

A. Packet Formats

A.1 BNF Notation

   KALIVE   = Header
   SYNC     = Header Level
   LINKQOS  = Header Level Stamp SrcArea DstArea
              [ Metric [ LinkQoS ]*{0-8} ]
   AREACNTR = Header Level Stamp SrcArea [ ParentArea Priority ]
   AREAADDR = Header Level Stamp SrcArea [ Address ]*
   AREAQOS  = Header Level Stamp SrcArea InArea OutArea
              [ AreaQoS ]*{0-4}

A.2 Header

   +---------------+---------------+---------------+---------------+
   |     Type      |     Flags     |            Length             |
   +---------------+---------------+---------------+---------------+

   Flags: 8bits
      0x1 = Syncing
      0x2 = External (for LINKQOS)

   Type: 8bits
      0x0 = KALIVE
      0x1 = LINKQOS
      0x2 = AREACNTR
      0x3 = AREAADDR
      0x4 = AREAQOS
      0x5 = SYNC

   Length: 16bits
      The total length of the message including the Header.

A.3 Level

   +---------------+
   |     Level     |
   +---------------+

   Level: 8bits
      The parent level of link and area information. For example,
      this value for link information of level (2,1) is three or
      greater.

A.4 Stamp (Source Time Stamp)

   for IPv4
   +---------------+---------------+---------------+---------------+
   |                         IPv4 Address                          |
   +---------------+---------------+---------------+---------------+
   |                             Time                              |
   +---------------+---------------+---------------+---------------+

   for IPv6
   +---------------+---------------+---------------+---------------+
   |                                                               |
   +                         IPv6 Address                          +
   |                                                               |
   +---------------+---------------+---------------+---------------+
   |                             Time                              |
   +---------------+---------------+---------------+---------------+

   Time: 32bits

   IPv4 Address: 32bits

   IPv6 Address: 64bits
      Lower 64bits of IPv6 address. This must be globally unique.
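The Header and IPv4 Stamp layouts above, together with the RFC 1982 comparison of the 32 bit Time field described in section 4, can be sketched as follows. This document does not state the byte order of multi-byte fields; network byte order is assumed here. The function names are hypothetical and introduced only for this example.

```python
import socket
import struct

# Message types and flags from A.2.
KALIVE, LINKQOS, AREACNTR, AREAADDR, AREAQOS, SYNC = range(6)
FLAG_SYNCING = 0x1
FLAG_EXTERNAL = 0x2

# Type (8 bits), Flags (8 bits), Length (16 bits).
# Network byte order is an assumption of this sketch.
HEADER = struct.Struct("!BBH")


def pack_message(msg_type, flags, body=b""):
    """Prepend a Header whose Length covers the Header and the body."""
    return HEADER.pack(msg_type, flags, HEADER.size + len(body)) + body


def unpack_message(data):
    """Return (type, flags, body) of the first message in data."""
    msg_type, flags, length = HEADER.unpack_from(data)
    if length < HEADER.size or length > len(data):
        raise ValueError("bad Length field")
    return msg_type, flags, data[HEADER.size:length]


def pack_stamp_v4(address, time_ms):
    """IPv4 Stamp: 32 bit address followed by the lower 32 bits of time."""
    return socket.inet_aton(address) + struct.pack("!I", time_ms & 0xFFFFFFFF)


def stamp_newer(a, b):
    """True if 32 bit time a is later than b in RFC 1982 serial arithmetic."""
    diff = (a - b) & 0xFFFFFFFF
    return 0 < diff < 0x80000000
```

With this comparison, a time stamp that has wrapped around past 2^32 milliseconds is still recognized as newer than one taken shortly before the wrap, which a plain integer comparison would get wrong.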
A.5 Area (SrcArea DstArea ParentArea InArea OutArea)

   for IPv4
   +---------------+
   |     Level     |
   +---------------+---------------+---------------+---------------+
   |                         IPv4 Address                          |
   +---------------+---------------+---------------+---------------+

   for IPv6
   +---------------+
   |     Level     |
   +---------------+---------------+---------------+---------------+
   |                                                               |
   +                         IPv6 Address                          +
   |                                                               |
   +---------------+---------------+---------------+---------------+

   Level: 8bits

   IPv4 Address: 32bits

   IPv6 Address: 64bits

A.6 Metric

   +---------------+
   |    Metric     |
   +---------------+

   Metric: 8bits

A.7 LinkQoS

   +---------------+---------------+---------------+---------------+
   //                     LINK_QOS (See [SS])                     //
   +---------------+---------------+---------------+---------------+

A.8 Priority

   +---------------+
   |   Priority    |
   +---------------+

   Priority: 8bits
      Priority 0 is the highest one.

A.9 Address

   +---------------+
   |   PrefixLen   |
   +---------------+-
   |  Address      +

   PrefixLen: 8bits
      The length of a network address prefix. From 1 to 32 for IPv4,
      and from 1 to 64 for IPv6.

   Address: (Integer of (PrefixLen+7)/8)*8 bits

A.10 AreaQoS

   +---------------+---------------+---------------+---------------+
   //                     AREA_QOS (See [SS])                     //
   +---------------+---------------+---------------+---------------+
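The variable length Address field of A.9 and the IPv4 Area field of A.5 can be sketched as follows. The sketch assumes that the Address field carries the leading bytes of the prefix, which this document does not state explicitly; the function names are hypothetical.

```python
import ipaddress


def pack_address(prefix):
    """Encode a prefix as PrefixLen followed by (PrefixLen+7)/8 bytes."""
    net = ipaddress.ip_network(prefix, strict=True)
    limit = 32 if net.version == 4 else 64
    if not 1 <= net.prefixlen <= limit:
        raise ValueError("prefix length out of range for A.9")
    nbytes = (net.prefixlen + 7) // 8
    # Assumption: the leading bytes of the network address are carried.
    return bytes([net.prefixlen]) + net.network_address.packed[:nbytes]


def pack_area_v4(level, address):
    """Encode an IPv4 area ID: 1 byte Level and a 4 byte IPv4 address."""
    if not 0 <= level <= 255:
        raise ValueError("Level must fit in 8 bits")
    return bytes([level]) + ipaddress.IPv4Address(address).packed
```

For instance, 192.0.2.0/24 is encoded in 4 bytes (one byte of PrefixLen and three bytes of Address), and an IPv4 area ID occupies 5 bytes, which matches the length given in section 2.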