INTERNET-DRAFT Lan Wang Expires: October 1999 Andreas Terzis Lixia Zhang UCLA April 1999 A Proposal for reducing RSVP Refresh Overhead using State Compression Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract As a soft-state protocol, RSVP specifies that each RSVP node sends periodic RSVP messages for each existing RSVP session. The overhead due to such periodic refreshes goes up linearly with the number of active RSVP sessions. One can reduce the overhead by using a longer refresh period, which unfortunately leads to longer delays in re- synchronizing RSVP state. To overcome the dilemma, this draft describes the use of a 'state-compression' approach to soft-state protocol refreshes. In the absence of state changes this compression mechanism has a constant overhead while retaining the soft-state nature of RSVP. draft-wang-rsvp-state-compression-00.txt [Page 1] INTERNET-DRAFT April 1999 1. Introduction As a soft-state setup protocol, RSVP specifies that each RSVP node sends control messages for each active RSVP session to each of its neighbors serving the same session. These control messages are sent immediately whenever a change of state occurs, for example when a new session starts, when the TSPEC or RSPEC parameters of an existing session change, or when a route change occurs that affects the paths of existing RSVP sessions. In addition, an RSVP node also sends periodic refresh messages to its neighbors about all existing sessions. Such periodic refreshes serve the purpose of keeping consistent RSVP reservation state along data flow paths. There are potentially a number of causes that can make the reservation state between neighbor RSVP nodes out-of-synchronization. For example: - An RSVP message that carries state-change information gets lost. - A rare local event inadvertently changed RSVP state values. - Other locally induced state changes occur, for example when the current Kerberos key expires, a new key is obtained which needs to be carried to the next hop by the POLICY_DATA object in the next refresh. Instead of attempting to identify all the potential causes of changes and handle them individually, RSVP simply uses periodic refreshes to persistently re-synchronize the reservation state along the path. This approach assures consistent reservation state along the data flow path, even when the routers do not have an exhaustive list of all the possible causes of state inconsistency or how to correct them individually. The drawback of this simple approach is that its overhead grows linearly with the number of active RSVP sessions. A common mechanism to reduce the periodic refresh overhead is by using a longer refresh period (example: RTCP, SDR announcements). A longer refresh period, however, leads to longer delays in re-synchronizing state. In this draft we propose a "compression" mechanism for refresh overhead reduction. The goal of this scheme is to improve the efficiency of soft-state protocols in their communication behavior, not to further simplify RSVP. Rather, we applied *additional* processing (compression) to the protocol state information so that the number of refresh messages is reduced from being proportional to the number of sessions to being proportional to the number of neighbor nodes, while keeping the soft-state nature of RSVP. When a state change is notified by explicit RSVP messages, or detected by the periodic refreshes, the cost of updating the "compressed" state is proportional to LOG(T) where T is the total number of RSVP sessions. Section 7 presents a detailed overhead analysis of our draft-wang-rsvp-state-compression-00.txt [Page 2] INTERNET-DRAFT April 1999 compression mechanism. 2. Refresh Overhead Reduction by State Compression Even in the absence of new control information generated by sources or destinations, an RSVP node sends one message per refresh period to its neighboring nodes for each of all the RSVP sessions to keep the state consistency. Instead of sending one message per refresh period per session, one way to reduce the overhead is to compress the state of all RSVP sessions to a single signature, which can then be sent to neighbor nodes in place of all the "raw" refresh messages. All we need is an algorithm that can produce a one-to-one mapping from the given set of RSVP state to a signature. In this draft we propose to use the MD5 algorithm [MD5] for RSVP state compression. As stated in [MD5], "it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given pre-specified target message digest". We can therefore conclude, with a high level of assurance, that no two sets of different RSVP states should result in the same signature, and whenever the computed signatures at two neighboring nodes differ, the two nodes will be able to detect the differences from the states of shared RSVP sessions. Whenever a state change is detected (the digests of two neighbors disagree), raw RSVP messages are transmitted to re-synchronize the state. As a further optimization, we also add an acknowledgment component (ACK) to the RSVP protocol. The ACK is used for the following two purposes. 1. To minimize the re-synchronization delay after an explicit state change request. A node can request an ACK for each RSVP message that carries state-change information, and promptly retransmit the message until an acknowledgment is received. 2. To minimize the re-synchronization cost when inconsistent state between RSVP neighboring nodes is detected but the exact difference is unknown. A node should request an ACK for each digest sent, and it should log all the state changes since the last agreed-upon digest (which indicates that the two nodes have consistent state at that point in time). When an inconsistency is detected, the first recovery effort of a node is to send those latest changes. We propose to compute the MD5 signature in a structured way (see Section 3.2), so that in cases where the retransmission of latest changes does not result in state re-synchronization, two neighbors draft-wang-rsvp-state-compression-00.txt [Page 3] INTERNET-DRAFT April 1999 can efficiently locate the portion of inconsistent state and send raw RSVP messages to re-synchronize. 2.1. Definitions - Input Message The input to the MD5 Algorithm. In this document an input message is some portion of the RSVP state stored at the local node. - MD5 Signature The result of the computation of the MD5 algorithm on an input message. - Digest A set of MD5 Signatures that represents a compressed version of the RSVP state shared between two neighboring RSVP nodes. The following definitions are taken from RFC 2209. - PSB Path State Block. Each PSB holds path state for a particular (session, sender) pair, defined by SESSION and SENDER_TEMPLATE objects, respectively. -RSB Reservation State Block. Each RSB holds a reservation request that arrived in a particular RESV message, corresponding to the triple: (session, next hop, Filter_spec_list). 2.2. Limitations of the New Approach The ability for two RSVP neighbors to exchange a digest instead of raw RSVP messages relies on the assumption that the two nodes know precisely all the RSVP sessions that go through these two nodes in sequence. Whenever a unicast route change occurs, the upstream node draft-wang-rsvp-state-compression-00.txt [Page 4] INTERNET-DRAFT April 1999 must be able to detect the change and synchronize the state with the new downstream node (as well as tearing down the session with the old downstream neighbor). Similar actions need to be taken whenever a change in multicast routing occurs, which includes not only route changes but also multicast group membership changes (either addition or deletion). When two neighbor nodes of an RSVP session are directly connected, RSVP can expect to receive notification of route changes through the interface to the local routing module. When the two nodes are interconnected through a non-RSVP cloud, however, there seems no known way to reliably detect all routing changes without the per session periodic refreshes. Furthermore, due to the fact that each multicast RSVP session may involve a different set of up- and down-stream neighbors than any other multicast session, it is difficult to compute a digest for aggregate multicast sessions. Thus this first attempt focuses on refresh reduction for unicast RSVP sessions only. Finally, because an RSVP session is identified by the destination address, a unicast RSVP session can be a many-to-one session with multiple senders, thus the reservation parameters may change when crossing a merging point that has multiple upstream branches towards different senders. Consequently, computing the digest for each RSVP neighbor requires that the local node store data value that is not in the current protocol specification. We further discuss the difficulty in handling multicast and many-to-one unicast session, as well as possible solutions, in appendix A. Therefore this first draft of the refresh compression mechanism limits the problem space to handling one-to-one unicast sessions whose paths do not cross non-RSVP clouds. An RSVP node resorts to the current refresh scheme for all the sessions that have the "non_RSVP_cloud" bit set, or that have a multicast destination address. 2.3. Benefits and Constraints of the New Scheme The design of the new scheme aims at the following features - efficient state re-synchronization when two RSVP neighbors discover state inconsistency. - choice of individual nodes to use either original RSVP refreshes or the refresh reduction, or switch in between over time to achieve the best tradeoff. draft-wang-rsvp-state-compression-00.txt [Page 5] INTERNET-DRAFT April 1999 - backward compatibility with the current RSVP specification. - incremental digest computation when part of the session(s) changes state. We also recognize the following constraints of the new scheme. - Generally speaking, individual RSVP sessions follow different paths. However to reduce the refresh overhead to one refresh message per refresh period per neighbor, a digest represents the compressed state of aggregate sessions. The only way to do so is to first group all RSVP sessions by each neighbor, then compute the digest. That is, refreshing of compressed state must be done on a per neighbor basis. - In its current design, RSVP PATH messages carry the destination address of the session, thus all refresh messages automatically follow the latest unicast or multicast routing changes (including multicast group membership changes). Although RSVP has an interface to the local routing module for prompt notification of all the changes to minimize the adaptation delay, all RSVP reservations eventually converge on new paths even in the absence of such notifications. When we move from per session refresh to per hop refresh by digest, however, the protocol loses its automatic routing adaptation. Thus the new scheme relies entirely on specific notification to adjust to any routing changes. - To detect all the RSVP state changes, the digest must be computed over the RSVP sessions state rather than over the messages that are exchanged. Because those messages represent partial, but not all causes of state changes, digests of the messages alone are inadequate in detecting all the state changes; two neighbor nodes may end up in inconsistent state even when all RSVP messages are delivered correctly. One challenge we encountered many times during proposing the compression refresh mechanism is that when an RSVP node processes the objects carried in a refresh message, this processing step may take other factors as part of the input, and produce new object values to be passed on to next neighbor; what is saved at the local node is the value of the received objects. Change of RSPEC is a typical example in a multicast RSVP session. The same issue exists even when we limit our consideration to unicast sessions only; ADSPEC and POLICY_DATA are two such objects whose new values sent to the next hop may be different from that stored at the local node. To compare digest between neighbor node pairs a node must keep the latest values of all the objects it has sent to the neighbor. This requirement leads to a necessary change to the current protocol specification. draft-wang-rsvp-state-compression-00.txt [Page 6] INTERNET-DRAFT April 1999 3. Digest Mechanism Description 3.1. State Organization for Digest Computation RSVP maintains path state and reservation state for each active session [RFC2205]. A session is uniquely identified by a session object which contains the IP destination address, protocol ID and optionally a destination port number of the associated data packets. A Path State Block (PSB) is comprised of a sender template (i.e. IP address and port number of the sender), a Tspec that describes the sender's traffic characteristics, and possibly objects for policy control and advertisements. A Reservation State Block (RSB) contains filterspecs (i.e. sender templates) of the senders for which the reservation is intended, the reservation style and a flowspec that quantifies the reservation. It may also contain objects for policy control and confirmation. To "compress" a set of RSVP states into an MD5 signature, neighboring RSVP routers must order their states using the same rules since different orders of the states may result in different signatures. We propose that sessions be ordered by the increasing order of IP address, protocol ID and port number in session objects. Ordering rules for multicast and many-to-one unicast sessions are described in appendix A. 3.2. Structured Digest Computation As the number of sessions increases, a node may experience state changes more frequently and, with a flat structure, the overhead of digest computation increases linearly with the number of sessions. Therefore we need a scheme to facilitate partial state changes by incremental MD5 computation. As mentioned in section 3.1, a node orders all the RSVP sessions it shares with each neighbor node. The node first computes a MD5 signature for each session. Table 1 shows the objects that should be included in the digest computation. These objects should be fed into the computation procedure in the order as presented in the table. Although PSB and RSB also contain a few other fields such as incoming interface and outgoing interfaces, those fields have only local meaning to a specific router and therefore are excluded from the digest computation. draft-wang-rsvp-state-compression-00.txt [Page 7] INTERNET-DRAFT April 1999 RSVP Structure || Objects to Include _________________||_______________________________________ || Session || session object _________________||_______________________________________ || PSB || sender template, sender tspec, || adspec, policy _________________||_______________________________________ || RSB || filterspec, flowspec, style, policy _________________||_______________________________________ Table 1. Objects to Include in Digest Computation Assuming that N is the number of MD5 signatures that can be packed in one Digest message, a node constructs a digest tree for each of its neighbors in the following way. First, it computes the signature of each session, making these signatures as the leaf node of the tree. If there are more than N leaf-level signatures, it groups each N signatures in one block and compute a level-1 signature for the group. With a total of T sessions, this grouping results in Ceiling[T/N] level-1 signatures. If the number of level-1 signatures is still larger than N, the node continues on to group each of N level-1 signatures to compute a level-2 signature to get Ceiling[T/N^2] level-2 signatures. Let Ci be the number of level-i signatures, we repeat the grouping until Ci is smaller than or equal to N. o o o digest /|\ /|\ /|\ ooo ooo ooo level-1 signature /|\ ....../|\ o o o ....o o o leaf-level signatures Figure 1. A Digest Tree with N=3 There are two ways to define the value for N. The first way is to define it statically in the specification. One would like the largest N value that can fit a digest in the link MTU, but the value of MTU may be different at different places. A second approach is to let two RSVP neighbor nodes negotiate N's value at startup time and then keep it constant. This second approach may require new handshake message exchanges between two neighbor RSVP nodes. In this draft we assume a configured value for N. draft-wang-rsvp-state-compression-00.txt [Page 8] INTERNET-DRAFT April 1999 A digest needs to be recomputed when the following events occur: o a new session is added; o a session is changed, i.e. a state is added to a session, or a state is modified or deleted from a session; o an existing session is deleted. The cost of digest computation is summarized in section 7. 3.3. Neighbor Data Structure A router needs to compute two digests for each neighbor, one digest for the states that are refreshed by messages from that neighbor and one for the states that the local sends refresh messages to the neighbor. We refer to the two digests InDigest and OutDigest, respectively. The OutDigest is used in the Digest message and the InDigest is sent in a DigestErr message to the corresponding neighbor. Section 4 specifies the format of the new RSVP message types proposed for our scheme. To avoid having to traverse all the RSVP sessions to compute the digest for a specific neighbor, a router maintains a linked list that points to all the sessions shared with the neighbor. An example of the per neighbor structure is shown in Figure 2. The Neighbor structure holds all the information needed to send and receive digests to/from a RSVP neighbor node. A description of each field in the Neighbor structure follows: IP Address Holds the IP address of the interface this neighbor uses to communicate with the local node. OutSession OutSession contains the sessions to be included in the computation of the digest the local node is sending to this neighbor. The sessions included are those that have PSBs whose next downstream hop is the neighbor in question and RSBs that have that neighbor as the immediate upstream hop. OutDigest The digest sent to this neighbor. RefreshOutTimer Contains the next time a digest should be sent to this neighbor. draft-wang-rsvp-state-compression-00.txt [Page 9] INTERNET-DRAFT April 1999 Session Struct Session Struct ^ ^ | | Neighbor Struct | | SessDigest Struct +----------------- + +-----------------+ +-----------------+ | IP Address | | Session* | | Session* | +----------------- + +-----------------+ +-----------------+ | OutSession |--->| Next SessDigest |--->| Next SessDigest |---->... +----------------- + +-----------------+ +-----------------+ | OutDigest | | P/R | | P/R | +----------------- + +-----------------+ +-----------------+ | RefreshOutTimer | | MD5 Signature | | MD5 Signature | +----------------- + +-----------------+ +-----------------+ | TimestampLastSent| +------------------+ |TimestampLastAcked| +------------------+ | OutDigestTimeout | +------------------+ | InSession |---> Linked List of SessDigest structs +------------------+ | InDigest | +------------------+ | CleanupInTimer | +------------------+ | ... | +------------------+ Figure 2. Neighbor Structure TimestampLastSent Contains the timestamp of the latest digest sent to this neighbor. TimestampLastAcked This is the timestamp of the latest digest acknowledged by the neighbor, i.e. the time instant when state, between the local node and this neighbor, was last synchronized. OutDigestTimeout Contains the timeout period in milliseconds for digests sent to this neighbor. If an acknowledgment is not received by the end of this period, the digest should be retransmitted. draft-wang-rsvp-state-compression-00.txt [Page 10] INTERNET-DRAFT April 1999 InSession A pointer to a linked list of SessDigest objects. This list contains all the sessions that are refreshed by digests received from this neighbor. Specifically, this includes sessions whose PSBs have this neighbor as previous hop and sessions whose RSBs have this neighbor as next hop. InDigest The digest expected to be received from this neighbor. CleanupInTimer If no digest is received from this neighbor by the time contained in the CleanupInTimer field, all associated state should be cleaned up. The SessDigest structure contains information about a session that has the neighbor in question as a Next Hop. The SessDigest structure has the following fields: Session* A pointer to the portion of the RSVP state that holds information for the session represented by this structure. P/R This field indicates whether the PSB or the RSB of a Session will be used in the signature operation. MD5 signature Contains the MD5 signature for this session. 3.4. Neighbor Discovery To use the compressed refresh scheme, a router needs to discover all the neighbors that are capable of exchanging digests (compression- capable). For this purpose, when a RSVP router starts sending (raw) RSVP messages of a session, it should request that the neighbor(s) acknowledge the messages by including a TIMESTAMP object with the ACK_Requested flag on (see section 4). If the local router receives an ACK message from a neighbor whose address is not currently on the draft-wang-rsvp-state-compression-00.txt [Page 11] INTERNET-DRAFT April 1999 Neighbor Structure list, it has discovered a new digest-capable neighbor. If it receives an error message from a neighbor who does not understand the new TIMESTAMP object, it discovers a neighbor that is compression-incapable. When a non-RSVP cloud exists between two RSVP neighbor nodes, although they can discover each other using the acknowledgments during initial message exchanges, the upstream node may not be able to detect a change of the downstream neighbor caused by a unicast route change. As shown in Figure 3, node A originally has a downstream neighbor B for a particular unicast RSVP session. After a route change, node C becomes A's downstream neighbor. However, since A's outgoing interface is unchanged, A will not notice the route change, hence it will continue to include the path state of sender S when calculating the digest to B. Node B will not be aware of the change either as long as A sends B the same digest. C may never get a path message from A. Therefore, the digest scheme cannot be used crossing non-RSVP clouds until an effective way of detecting route changes is found. RSVP routers can detect the existence of non-RSVP clouds between neighbors by comparing the TTL value in the IP header with the Send_TTL in the RSVP common header, as described in [RFC2209]. original route ___________ ---> -> ->| |-- B ----| -> S - - --- A --| non-RSVP | |------- D | cloud | | |___________|-- C ----| new route ---> Figure 3. An RSVP Session with a non-RSVP cloud 3.5. Digest Refreshes RSVP states are refreshed by regular RSVP messages as well as by digest Messages. However, different from a regular RSVP message which refreshes the state of one RSVP session, a digest message may carry multiple MD5 signature values, each representing the compressed state of multiple sessions. If an individual signature in the digest matches the corresponding local value, all the RSVP states covered by that signature should be refreshed. That is the local node should carry out the same operation as if one refresh message for each of all the covered sessions is received. draft-wang-rsvp-state-compression-00.txt [Page 12] INTERNET-DRAFT April 1999 The refresh rate of digest messages can be different from the refresh rates of individual sessions covered by that digest. For digest messages, the default refresh rate is 30 seconds as RFC 2205 suggests. If a different refresh rate is used, it SHOULD be carried in the TIME_VALUES object of the digest message. After a digest is received, all sessions covered by this digest MUST be refreshed at this rate. 4. New RSVP Message Types and Objects 4.1. TIMESTAMP Object TIMESTAMP Class = Value to be assigned by IANA of form 0bbbbbbb TIMESTAMP object Class = TIMESTAMP Class, C_Type = 1 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Flags | reserved | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Timestamp | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Flags: 8 bits 0x80 = ACK_Requested flag Indicates that the sender is requesting a message acknowledgment. Acknowledgments MUST be silently ignored when they are sent in response to messages whose ACK_Requested flag is not set. This flag MUST be set when the message is a Digest message. It MUST not be set in an ACK or DigestErr message. Reserved: 24 bits This field is reserved. It MUST be set to zero on transmission and MUST be ignored on receipt. draft-wang-rsvp-state-compression-00.txt [Page 13] INTERNET-DRAFT April 1999 Timestamp: 32 bits It is the middle 32 bits (lower 16 bits of the integer part and higher 16 bits of the fractional part) of the NTP time value when the message is sent. If NTP time is not available, system timer can also be used. When combined with the message generator's IP address, it uniquely identifies a message. We chose to use a timestamp to identify a message because it is always increasing and so a sender does not have to check if the value is in use or has been used before. It also helps the receiver to identify which of the RSVP messages for the same state is the most recent one. However, depending on the processing speed of the router and the existing timer granularity, we may have two consecutive messages with the same timestamp value. Therefore, we define the timestamp to be max(current_timestamp, last_used_timestamp+1), where last_used_timestamp is the timestamp of the latest RSVP message that contains a TIMESTAMP object. 4.2. DIGEST Object DIGEST Class = Value to be assigned by IANA of form 10bbbbbb DIGEST object Class = DIGEST Class, C_Type = 1 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Level | Group | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Reserved | Number of Signatures | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | // signature (16 bytes) // | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Level: 8 bits This is the level of this set of signatures in the corresponding digest tree. The level of a signature is defined as follows: 1. Leaf signatures are assigned level 0; draft-wang-rsvp-state-compression-00.txt [Page 14] INTERNET-DRAFT April 1999 2. The level of a non-leaf signature is (1 + the level of its immediate children). Group: 24 bits This is the group this set of signatures belong to (recall we divide signatures into groups of size N and compute a signature for each group to build a tree). We number the groups at the same level of a digest tree increasingly from left to right and the left-most group is assigned group 0. Level and group uniquely identify the location of a set of signatures in a digest tree. Reserved: 16 bits This field is reserved. It MUST be set to zero on transmission and MUST be ignored on receipt. Number of Signatures: 16 bits Number of MD5 signatures contained in this object. Signature: a 16-byte MD5 signature. 4.3. Digest Message The format of a Digest message is as follows: ::= [ ] [ ] A Digest message has a common RSVP header with the message type set to 14. It MUST contain a TIMESTAMP object with the ACK_Requested flag set and a digest object. The TIME_VALUES object may be included so that a router can inform its neighbor what timer value to use to time out states when using a digest scheme. draft-wang-rsvp-state-compression-00.txt [Page 15] INTERNET-DRAFT April 1999 4.4. ACK Message The format of an ACK message is as follows: ::= [ ] [ ... ] An ACK message has a message type of 15. It MUST contain at least one TIMESTAMP object. With multiple TIMESTAMP objects, the ACK message can be used to acknowledge multiple messages. 4.5. DigestErr Message A DigestErr message has the following format: ::= [ ] The message type of DigestErr message is 16. It MUST contain a TIMESTAMP object and a DIGEST object. A DigestErr message acts as a negative acknowledgment to a Digest message. The TIMESTAMP object identifies the Digest message being acknowledged. The DIGEST object carries the local digest value of the sender of the DigestErr message. 5. RSVP Message Operations 5.1. Non-digest RSVP Messages A node sends non-digest RSVP messages in the same way as before, with the new option of asking for an ACK. To request an acknowledgment, the node sends the message with a TIMESTAMP object and sets the ACK_Requested bit in the object. It also sets a retransmission timer. When the neighbor receives a message that requires acknowledgment, it processes the message as usual and sends back an ACK message which contains the timestamp of the message being acknowledged. If there is any change in the state, it then updates the signature. If no ACK is received after timeout, the node retransmits the message draft-wang-rsvp-state-compression-00.txt [Page 16] INTERNET-DRAFT April 1999 up to a configured number of times. The retransmission period SHOULD be shorter than regular refresh period and be proportional to the round-trip time between the two neighboring nodes. Each retransmission MUST carry the same timestamp as that in the original message. If the corresponding state is changed before an ACK is received, the original message becomes obsolete and no longer needs to be retransmitted. If changing of the state triggers a new message, the timestamp of the new message should be max(current_timestamp, last_used_timestamp + 1). If the node does not receive an ACK after all the retransmissions, this message needs to be retransmitted later in the recovery process (see section 6.2). 5.2. Digest Message Once a compression-capable node discovers a compression-capable neighbor, instead of sending one refresh message for each PATH/RESV state, it sends a digest, in one RSVP message per refresh period to each of the neighbors. Special treatments for RSVP states containing ADSPEC and POLICY_DATA objects are discussed in section 5.3. Each digest MUST carry a timestamp object with the ACK_Requested flag set. When the neighbor receives the Digest message, it first locates the corresponding set of signatures in the local digest tree and then performs a binary comparison between each signature in the message and that in the local digest tree. o If any of the signatures does not match the local one, it should respond with a DigestErr message containing the local digest value after processing the Digest message. Note that if there is any matching signature, RSVP states represented by that signature should be refreshed. o Otherwise, it is in a consistent state with the sender, so it sends an ACK message and refreshes all the states shared with the sender. The retransmission of a Digest message is handled in the same way as non-digest messages. However, as in the original RSVP design where an RSVP node never stops sending refresh messages for each active session, a node should not stop sending digest refreshes even when a previous one failed to receive an acknowledgment. If a neighbor node crashed and becomes alive again, it will find the digest value different from its own and the two routers will start resynchronization step. When the digest value is changed, any pending retransmission of the obsolete Digest message should be canceled. draft-wang-rsvp-state-compression-00.txt [Page 17] INTERNET-DRAFT April 1999 5.3. Refresh of ADSPEC and POLICY_DATA An RSVP message may carry an ADSPEC object that contains resource information used by receivers to predict the end-to-end service. This object is updated hop-by-hop to gather information along a route. Since it is opaque to RSVP, RSVP simply records the received ADSPEC in the corresponding state and, whenever it needs to send a refresh message for a state, it calls the traffic control module with the recorded ADSPEC to get an updated ADSPEC object. Therefore, the ADSPEC object sent by a node may be different from that received by the node. POLICY_DATA object is handled in the same way except that it is submitted to the policy control module. To handle ADSPEC or POLICY_DATA objects correctly with the digest scheme, we need to solve the following two problems. First, for each session with ADSPEC or POLICY_DATA object, we must trigger periodic updating of ADSPEC or POLICY_DATA objects. Second, since ADSPEC and POLICY_DATA are opaque to RSVP, RSVP cannot tell whether the objects returned by the traffic control (or policy control) module changes after each call. For the first problem, we treat digest refreshes the same way as regular RSVP refresh messages. In other words, before a node sends a digest, the node calls corresponding modules to update these objects. There are three possible solutions to the second problem. First, a node can simply assume that the object returned by the traffic control (or policy control) module changes after each call at refresh period, which represents a change in RSVP state and hence a regular RSVP PATH state should be sent to carry the new object value to the next hop. This approach requires no changes to RSVP specification but it essentially falls back to the original RSVP refresh mechanism whenever ADSPEC or POLICY_DATA object exists in an RSVP session. The second option is to modify the RSVP/traffic control and RSVP/policy control interfaces so that they return a change bit that indicates whether the updated object is different from the previously returned one. The previously returned object can be an argument passed to the interface. For example, the original RSVP/traffic control interface for ADSPEC is Call: TC_Advertise( Interface, Adspec, Non_RSVP_Hop_flag ) -> New_Adspec, where Adspec is the ADSPEC from received RSVP messages. We can change it to draft-wang-rsvp-state-compression-00.txt [Page 18] INTERNET-DRAFT April 1999 Call: TC_Advertise( Interface, Adspec, Old_Adspec Non_RSVP_Hop_flag ) -> New_Adspec, change bit, and Old_Adspec is the ADSPEC object returned by the previous call. When the changed bit is set, a regular RSVP refresh must be sent. The third option is to let RSVP perform a binary comparison between the previously and newly returned objects after each call to traffic control (policy control) module to find out whether the object has changed. This third option can be a short-term fix while we work on the proposed changes in RSVP interfaces to other modules. 5.4. Error Message for Unknown Object The class number of TIMESTAMP object is of the form 0bbbbbbb. According to RFC 2205, the receiver of a message containing this object should send back an error message (PathErr or ResvErr depending on the original message) if it does not understand the object. Upon receiving the error message, the sender should do the following things: 1. stop sending digests to the receiver; 2. withhold sending TIMESTAMP object in future messages until it receives from that neighbor a message containing a TIMESTAMP object; 3. continue regular RSVP refreshes. 6. RSVP State Re-synchronization 6.1. Discovery of Inconsistency Two RSVP neighbors may become out-of-sync due to a number of reasons. - A state-changing RSVP message is lost, and the sender did not ask for ACK. - A neighbor crashed and lost part or all of its state. - Other unknown reasons. Receipt of a DigestErr message indicates inconsistency between two nodes. The timestamp and digest value in the DigestErr message help the two neighbors to localize the problem. draft-wang-rsvp-state-compression-00.txt [Page 19] INTERNET-DRAFT April 1999 o If the timestamp acknowledged is smaller than the timestamp of the last Digest message sent (TimestampLastSent), this error message is for an obsolete message. This message should be ignored since it may not represent the current state of the neighbor. o If they are equal, the node should start the recovery process described in the next section. 6.2. Recovery To help state synchronization, each node needs to keep a log of the RSVP messages sent after the last synchronization point (TimestampLastAcked) for each neighbor. We record the following information for a message: o message type o a flag indicating whether the message contains a TIMESTAMP object o timestamp of the message (if no TIMESTAMP object is present in the message, use the timestamp that could be assigned when the message is sent) o a flag indicating whether ACK is required for the message o a pointer to the corresponding RSVP state o objects required to reconstruct the message that cannot be obtained from the RSVP state (e.g. ERROR_SPEC). Note that if the message is a teardown message, all the objects in the message should be stored here since the corresponding state has already been deleted.) When a node receives an ACK message, it first checks if the message acknowledges the latest Digest message. If so, it should update TimestampLastAcked to the value carried by the TIMESTAMP object in the ACK message and remove all the RSVP messages sent before TimestampLastAcked. Otherwise, only the acked message needs to be removed. Any message rendered obsolete by a new RSVP message should be removed from the log. If the node detects any discrepancy with a neighbor, it can retransmit all the messages on the log followed by a Digest message. It may set the ACK_Requested flag in these messages and a message should be removed from the log if the corresponding ack message is received. There is a tradeoff between the number of messages contained in the log and the bandwidth used to acknowledge messages. The more messages with ACK_Requested set, the less messages we may have on the log. However, this requires more ACK messages sent from the neighbor. In the extreme case where the log is full, older messages should be purged to accommodate new RSVP messages. draft-wang-rsvp-state-compression-00.txt [Page 20] INTERNET-DRAFT April 1999 If retransmission of the logged messages cannot lead to consistency, i.e. the node still receives a DigestErr, regular refresh messages should be sent for the problematic part of the digest tree. One possible cause of the problem is that some of the "synchronized" states were accidentally modified or erased by the neighbor, while RSVP messages for these states have already been deleted from the log. Another possible scenario is that some messages were not received by the neighbor and they are removed from the log because of limited logging space. To find the states that are inconsistent, the node first compares the digest value in the DigestErr message with its own. When it finds the first mismatching signature (call it S1), it sends a Digest message containing the signatures used to compute S1. A DigestErr is expected for this Digest message since at least one of the children signatures does not match. The node again looks for the first mismatching signature (S2) in the DigestErr message and sends the children of S2 in a Digest message. This procedure is repeated until the leaf session (Sn) causing the problem is found. Following refreshing of Sn, the node sends the top-level digest to the neighbor. If the inconsistency is caused by a single session, walking down the tree once can solve the problem. Otherwise, the node will receive another DigestErr and it has to continue to identify the other problematic sessions by traversing the tree. There is a tradeoff between the latency of the recovery procedure and the transmission efficiency. For example, if the tree has many levels, we need many RTTs to exchange the signatures to find the leaf-level sessions that contribute to the inconsistency. However, if we have sufficient resources, we can stop at an intermediate level of the tree and refresh all the states represented by the mismatching signature at that level. 7. MD5 Computation Cost RFC 1231 [MD5] provides a detailed description of the MD5 algorithm computation. The algorithm divides the message in 64-byte chunks and applies a four step process to each one of the 64-byte chunks. In each of the four steps, a number of bit-wise logical operations are applied to the 64-byte chunk. The results from the computation on the nth chunk are used as input for the computation of the (n+1)th chunk. The computation cost of MD5 is a linear function of the size of the message as measured in number of bytes. Given T total unicast RSVP sessions shared between two RSVP neighbors draft-wang-rsvp-state-compression-00.txt [Page 21] INTERNET-DRAFT April 1999 the time needed to create the digest is proportional to the size of the RSVP state in bytes. The node first computes the MD5 signature for each of all the sessions, then computes the MD5 signature for each of N session signatures. So on and so forth, this computation continues to build a forest of signature trees from bottom up, until the number of trees is less than or equal to N, the number of signatures that can be carried in one packet. The list of all the highest level signatures, the digest, should be carried in one IP packet without fragmentation. When the parameters in one of the sessions change, all the signatures on the path from that leaf to the root of the tree must be recomputed. The number of signatures in this path, is bounded by the height of the tree. Assuming T total sessions and an N-ary tree, this number is: Ceiling[log_N(T)] (log_N() is the logarithm of base N) For example with T=1000 and N=80 (80 MD5 signature is 1280 bytes), the number of signatures to recompute is 2. Insertion and deletion times will depend on the particular data structure used to implement the digest tree. In the N-ary tree described above the worst case time for insertion or deletion maybe O(T) since a insertion or deletion may cause a complete tree restructure. On the other hand, other data structures, such as the BB tree described in [BB1],[BB2], have insertion and deletion times of O(log_2(T)). The tradeoff in this case is between the increase in the recomputation time (from log_N(T) to log_2(T)) and the decrease in the insertion/deletion time. State ordering after an arbitrary number of session insertions and deletions, provided that these changes are reliably delivered to both neighbors, is guaranteed if both neighbors start from a consistent state and use the same insertion/deletions algorithms. Neighboring RSVP nodes SHOULD use the same data structure to store the digest tree. otherwise even though RSVP state between two neighboring nodes is consistent the two neighbors will possibly compute different digests. draft-wang-rsvp-state-compression-00.txt [Page 22] INTERNET-DRAFT April 1999 8. Discussion Motivated by the requirements from using RSVP for MPLS traffic engineering, Berger et al have proposed a few extensions to the basic RSVP protocol [REDUCT] for two purposes. 1. to assure prompt delivery of new RSVP control message by adding an acknowledgment option. 2. to reduce the state refresh overhead by eliminating refresh messages. Instead, a "Hello" exchange is added between RSVP neighbors to keep track of the liveness of each other. The overhead reduction by compression scheme proposed in this draft shares a similar ACK component with the scheme proposed in [REDUCT]. Both schemes also face the same challenges posed by non-RSVP clouds and multicast RSVP sessions; when facing either of these conditions, both schemes resort back to the current RSVP refresh mechanism. However we believe that stopping refreshes entirely represents a fundamental departure from the soft-state approach adopted in RSVP design. As we have argued at a few places in this draft, reliable RSVP message delivery alone is neither necessary nor sufficient for assuring consistent reservation state inside the network. To keep the soft-state approach while making the protocol scale with the number of sessions, the proposed state compression mechanism pays the cost of computing and maintaining a tree of session signatures. When sessions or session parameters change, maintaining the tree of session signatures has a computational overhead that is proportional to the log of the total number of sessions (see a quick analysis in Section 7). When large numbers of RSVP sessions exist in a network, we believe that the proposed scheme brings substantial performance gain from reduced refresh overhead, especially when most of the sessions are relatively stable. However the feasibility of using this approach in a highly dynamic environment where sessions are constantly added and deleted remains an open issue. 9. Compatibility The extensions introduced here are fully compatible with the existing version of RSVP. The TIMESTAMP has an assigned value whose form is 0bbbbbbb. The RSVP specification dictates that objects of this class should generate an error message when received by nodes that do not support them. An RSVP node that sends a message with a TIMESTAMP object and subsequently receives an "Unknown Object Class" error, MUST NOT send any more messages with attached TIMESTAMP object and draft-wang-rsvp-state-compression-00.txt [Page 23] INTERNET-DRAFT April 1999 MUST NOT start sending Digest objects but SHOULD use regular refreshes instead. Regarding Digest messages, a node SHOULD only start sending Digest messages only when it discovers that its particular peer is compression-capable using the procedure outlined in Section 3.4. Implementations that support the TIMESTAMP object, MUST also support the digest mechanism and vice-versa. 10. IANA Considerations IANA must provide three new Msg Types for the three new messages defined in this draft: Digest, ACK and DigestErr. Also IANA must provide Class Numbers for the objects defined here. Specifically: The TIMESTAMP object requires a Class-Num value of the form 0bbbbbbb. The DIGEST object requires a a Class-Num value of the form 10bbbbbb. 11. Security Considerations No new security issues are raised in this document. See [RFC2205] for a general discussion on RSVP security issues. 12. Acknowledgments The phrase "overhead reduction by state compression" was suggested by Van Jacobson of cisco. Our design also benefited from discussion with Steve McCanne of UC Berkeley regarding the roles of acknowledgment in soft-state protocol design. 13. References [BB1] R. Bayer. "Symmetric binary B-trees: Data structure and maintenance algorithms". Acta Informatica, 1:290-306, 1972 [BB2] L. Guibas, R. Sedgewick, "A diocrhomatic framework for balance trees", In Proceedings of the 19th Annual Symposium on Foundations of Computer Science, pages 8-21. IEEE Computer Society, 1978. draft-wang-rsvp-state-compression-00.txt [Page 24] INTERNET-DRAFT April 1999 [REDUCT] Lou Berger et al, "RSVP Refresh Reduction Extensions", draft-berger-rsvp-refresh-reduct-00.txt, March 1999. [MD5] R. Rivest, "The MD5 Message-Digest Algorithm", RFC 1321, April 1992. [RFC2205] Braden, R. Ed., Zhang, L., et al, "Resource ReserVation Protocol -- Version 1 Functional Specification", RFC 2205, September 1997. [RFC2209] Braden, R., Zhang, L, "Resource ReSerVation Protocol (RSVP) -- Version 1 Message Processing Rules", RFC 2209, September 1997. Appendix A. Considerations for Multicast and Many-to-one Unicast Sessions In a multicast session or many-to-one unicast session, there may be multiple path states and reservation states in each session, so routers also need to order path and reservation states in the same way. The order of path states can be determined by sender templates and the order of reservation states can be based on filterspecs and Rspecs. In addition to ordering path and reservation states, a router may need to keep information about the reservation request sent to each upstream neighbor if it is a merging or splitting point because 1. the reservation request it received from a downstream neighbor may not be the same as the one sent to an upstream neighbor; 2. currently reservation states only record reservation made on a link to a downstream neighbor; 3. the sender of a digest has to compute the digest based on what flowspec and filterspec were sent to its upstream neighbor. For a multicast session, complication arises for neighbor discovery if a router is attached to a broadcast LAN. First, there may be both compression-capable and compression-incapable downstream neighbors on the LAN. In this case, both regular RSVP refreshes and digests should be sent to the same interface. Furthermore, a router may not detect a change in the downstream neighbors when a downstream router on the broadcast LAN joins or leaves a group without affecting the list of outgoing interfaces of the associated RSVP states. One solution for this problem is to change the RSVP/routing interface so that it reports a route change when the list of downstream neighbor routers changes. If a message with ACK request is sent to a multicast session over a draft-wang-rsvp-state-compression-00.txt [Page 25] INTERNET-DRAFT April 1999 multi-access interface (PATH and PATH-TEAR), the receiving node must add a random delay before sending the ACK to avoid ACK implosion. The random interval SHOULD be between zero (0) and a configured maximum time. The configured maximum SHOULD be set in proportion to the refresh and retransmission interval. 14. Authors' Addresses Lan Wang UCLA 4805 Boelter Hall Los Angeles, CA 90095 Phone: 310-267-2190 Email: lanw@cs.ucla.edu Andreas Terzis UCLA 4677 Boelter Hall Los Angeles, CA 90095 Phone: 310-267-2190 Email: terzis@cs.ucla.edu Lixia Zhang UCLA 4531G Boelter Hall Los Angeles, CA 90095 Phone: 310-825-2695 EMail: lixia@cs.ucla.edu draft-wang-rsvp-state-compression-00.txt [Page 26]