Routing Over Large Clouds Working Group                    James V. Luciani
INTERNET-DRAFT                                                (Bay Networks)
                                                          Grenville Armitage
                                                                  (Bellcore)
                                                                Joel Halpern
                                                                 (Newbridge)
Expires October 1996

            Server Cache Synchronization Protocol (SCSP) - NBMA

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as ``work in progress.'' To learn the current status of any Internet-Draft, please check the ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow Directories on ds.internic.net (US East Coast), nic.nordu.net (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).

Abstract

This document describes the Server Cache Synchronization Protocol (SCSP) for Non Broadcast Multiple Access (NBMA) networks. SCSP attempts to solve the generalized server synchronization/cache-replication problem. In this problem, a set of server entities bound to a Server Group (SG) through some means (e.g., all servers belonging to the same Logical IP Subnet (LIS)[1]) wish to synchronize the contents (or a portion thereof) of their caches. These caches contain information on the state of the clients within the scope of interest of the SG. NHRP using IP provides an example of the types of information that must be synchronized: there, the information includes the REGISTERED clients' IP to NBMA mappings in the SG LIS.

Luciani, et al.                                                 [Page 1]

INTERNET-DRAFT                 SCSP-NBMA                Expires October 1996

1. Introduction

It is perhaps an obvious goal for any protocol to avoid depending on a single point of failure, such as a single server in a client/server paradigm.
Even when there are redundant servers, the problem of cache synchronization remains. When one server becomes aware of a change in the state of cache information, that server must propagate the knowledge of the change to all servers which are actively mirroring that state information. Further, this must be done in a timely fashion without putting undue resource strains on the servers. Assuming that the state information kept in the server cache is the state of the server's clients, it is also highly desirable, in order to minimize the burden placed upon the client, that clients need not have complete knowledge of all servers which they may use. However, any mechanism for synchronization should not preclude a client from having access to several (or all) servers. Of course, any solution must be reasonably scalable, capable of using some autoconfiguration service, and lend itself to a wide range of authentication methodologies.

This document describes the Server Cache Synchronization Protocol (SCSP). SCSP solves the generalized server synchronization/cache-replication problem while addressing the issues described above. SCSP synchronizes caches (or a portion of the caches) of a set of server entities which are bound to a Server Group (SG) through some means (e.g., all NHRP servers belonging to a Logical IP Subnet (LIS)[1]). These servers may exist in any topology as long as the resultant graph spans the set of servers that need to be synchronized. These caches contain information on the state of the clients within the scope of interest of the SG. NHRP[2] using IP provides an example of the types of information that must be synchronized: there, the information includes the REGISTERED clients' IP to NBMA mappings in the SG LIS.

Only the first few pages of this document constitute the SCSP description proper.
However, this document also includes (rightly or wrongly) a description of the use of SCSP by a number of protocols (e.g., NHRP, ATMARP, MARS, etc.) and some optional functionality which may be implemented as deemed appropriate. It is hoped that these appendices will spark interest in applying SCSP to the server synchronization needs of other protocols by supplying examples of SCSP's use.

2. Overview

SCSP places no topological requirements upon the SG. Obviously, however, the resultant graph must span the set of servers to be synchronized. SCSP borrows heavily from the link state protocols [3,4]. However, unlike those technologies, SCSP performs no Shortest Path First (SPF) calculation. It also requires little or no memory beyond what is needed to hold the cached information, which would exist regardless of the synchronization technology.

In order to give a frame of reference for the following discussion, the terms Local Server (LS), Directly Connected Server (DCS), and Remote Server (RS) are introduced. The LS is the server under scrutiny; i.e., all statements are made from the perspective of the LS when discussing the SCSP protocol. The DCS is a server which is directly connected to the LS; e.g., there exists a VC between the LS and DCS. Thus, every server is a DCS with respect to every other server which connects to it directly, and every server is an LS which has zero or more DCSs directly connected to it. An RS is a server that is neither an LS nor a DCS; i.e., an RS is always two or more hops away from an LS (whereas a DCS is always one hop away from an LS).

SCSP uses three message types: "Hello", "Cache Alignment", and "Client State Update". "Hello" messages are used to ascertain whether a DCS is operational and whether the connection between the LS and DCS is bidirectional, unidirectional, or non-functional. "Cache Alignment" (CA) messages allow an LS to synchronize its entire cache with that of the cache of its DCSs. "Client State Update" (CSU) messages are used to update the state of cache entries in servers for a given SG. Sections 2.1, 2.2, and 2.3 explain the Hello, CA, and CSU messages, respectively, in more depth.

   [State diagram; states: DOWN, WAITING, BIDIRECTION CONNECTION, UNIDIRECTION CONNECTION]

   Figure 1: Hello Finite State Machine (HFSM)

2.1 Hello Messages

"Hello" messages ascertain whether a DCS is operational and whether the connections between the LS and DCS are bidirectional, unidirectional, or non-functional. Every LS MUST periodically send Hello messages to each of its DCSs. An LS must be configured with a list of DCS NBMA addresses. The mechanism for this configuration is beyond the scope of this document, although one possible mechanism would be an autoconfiguration server.

An LS has a Hello Finite State Machine (HFSM) associated with each of its DCSs (see Figure 1). The HFSM monitors the state of the connectivity between the LS and a particular DCS. The HFSM starts in the "Down" State and transitions to the "Waiting" State after NBMA-level connectivity has been established. Once in the Waiting State, the LS starts sending Hello messages to the DCS. The Hello message includes the following fields:

- a Sender ID, which is set to the LS's ID (LSID);
- a Receiver ID, which identifies the expected receiver of the Hello message and is initially set to zero;
- an S bit, which is described below;
- a HelloInterval and a DeadFactor, which are described below.

At this point, the DCS may or may not be sending its own Hello messages to the LS. In either case, upon receiving the LS's Hello, the DCS copies the LSID from the Sender ID (SID) field of the LS's Hello message. When the DCS sends its next Hello to that LS, the DCS includes the LSID in the Receiver ID field and its own ID (the DCSID) in the Sender ID field. When the LS receives the DCS's Hello message, the LS knows that the DCS has received the LS's Hello message. Bidirectional communication is therefore possible, and the HFSM transitions from the Waiting State to the "Bidirectional Connection" State.

Suppose the HFSM for a DCS is not in the Down State, and the LS receives a Hello message from that DCS with a zero in the Receiver ID field. The HFSM then transitions to the "Unidirectional Connection" State. If, while in the Unidirectional Connection State, the LS receives a subsequent Hello message from that DCS with a Receiver ID equal to the LSID, the HFSM transitions to the Bidirectional Connection State. Any abnormal event causes the HFSM to transition to the Waiting State. Examples are receiving a malformed Hello message, or receiving a Hello message whose Receiver ID is neither the LSID nor zero. A loss of NBMA connectivity, however, causes the HFSM to transition to the Down State.

Hello messages also contain a HelloInterval and a DeadFactor. The HelloInterval advertises the time between the sending of consecutive Hello messages by a server. That is, if the time between receptions of Hello messages from a DCS exceeds the HelloInterval advertised by that DCS, the LS considers the next Hello message late. If the LS does not receive a Hello message within HelloInterval*DeadFactor seconds, the LS MUST consider the DCS to be stalled. At that point the LS should transition the HFSM to the Waiting State. Like the HelloInterval, the DeadFactor is advertised on a per-server basis.
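The Hello procedure described above can be summarized as a small per-DCS state machine. The following Python sketch is illustrative only. The class and method names (HFSM, nbma_up, nbma_down, receive, tick) are hypothetical. Server IDs are modeled as integers, and a `malformed` flag stands in for real message validation. The sketch assumes the dead interval is computed from the HelloInterval and DeadFactor carried in the DCS's most recent valid Hello. It does not model sending Hellos or the S bit.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class HelloState(Enum):
    DOWN = auto()
    WAITING = auto()
    UNIDIRECTIONAL = auto()
    BIDIRECTIONAL = auto()


@dataclass
class Hello:
    sender_id: int
    receiver_id: int          # 0 until the sender has heard from the receiver
    hello_interval: float     # seconds, advertised by the sender
    dead_factor: int          # advertised by the sender
    malformed: bool = False   # hypothetical stand-in for real validation errors


class HFSM:
    """Hello FSM that an LS keeps for one DCS (illustrative sketch)."""

    def __init__(self, lsid: int) -> None:
        self.lsid = lsid
        self.state = HelloState.DOWN
        self.last_rx: Optional[float] = None   # time of the last valid Hello
        self.dead_interval: float = 0.0        # HelloInterval * DeadFactor of the DCS

    def nbma_up(self) -> None:
        # NBMA-level connectivity established: Down -> Waiting.
        if self.state is HelloState.DOWN:
            self.state = HelloState.WAITING

    def nbma_down(self) -> None:
        # Loss of NBMA connectivity always returns the HFSM to Down.
        self.state = HelloState.DOWN
        self.last_rx = None

    def receive(self, hello: Hello, now: float) -> None:
        if self.state is HelloState.DOWN:
            return  # Hellos are only acted on once NBMA connectivity exists
        if hello.malformed:
            self.state = HelloState.WAITING   # abnormal event
            return
        self.last_rx = now
        self.dead_interval = hello.hello_interval * hello.dead_factor
        if hello.receiver_id == 0:
            self.state = HelloState.UNIDIRECTIONAL   # DCS has not heard us yet
        elif hello.receiver_id == self.lsid:
            self.state = HelloState.BIDIRECTIONAL    # DCS echoed our LSID
        else:
            self.state = HelloState.WAITING          # neither zero nor our LSID

    def tick(self, now: float) -> None:
        """Treat the DCS as stalled if no Hello arrived within HelloInterval*DeadFactor."""
        if self.state is HelloState.DOWN or self.last_rx is None:
            return  # no Hello seen yet, so no advertised dead interval is known
        if now - self.last_rx > self.dead_interval:
            self.state = HelloState.WAITING
```

The dead-interval check lives in a separate tick() call. This lets the caller drive it from whatever timer facility the server already has, rather than embedding a clock in the state machine.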
   [State diagram; states: DOWN, Master/Slave Negotiation, Cache Summarize, Update Cache, Aligned]

   Figure 2: Cache Alignment Finite State Machine

2.2 Cache Alignment Messages

"Cache Alignment" (CA) messages allow an LS to synchronize its entire cache with that of the cache of its DCSs. That is, CA messages allow a booting LS to synchronize with its DCSs. A CA message contains a CA header followed by zero or more Client State Advertisement Summary records (CSAS records).

An LS has a Cache Alignment Finite State Machine (CAFSM) associated with each of its DCSs (see Figure 2). The CAFSM starts in the Down State. When the HFSM reaches the Bidirectional Connection State, the CAFSM transitions to the Master/Slave Negotiation State. In the Master/Slave Negotiation State, either the LS or the DCS takes on the role of master over the cache alignment process. When the LS's CAFSM reaches the Master/Slave Negotiation State, the LS sends a CA message to the DCS associated with the CAFSM.

The first CA message which the LS sends includes no CSAS records. Its CA header contains the LSID in the Sender ID field, the DCSID in the Receiver ID field, a sequence number, and three bits. These three bits are the M (Master/Slave) bit, the I (Initialization of master) bit, and the O (More) bit. In the first CA message sent by the LS to a particular DCS, the M, I, and O bits are set to one. If the LS does not receive a CA message from the DCS within CAReXmtInterval seconds, it resends the CA message it just sent.
The LS continues to do this until the CAFSM transitions to the Cache Summarize State or until the HFSM transitions out of the Bidirectional Connection State. Any time the HFSM transitions out of the Bidirectional Connection State, the CAFSM transitions to the Down State.

When the LS receives a CA message from the DCS while in the Master/Slave Negotiation State, the role the LS plays in the exchange depends on packet processing as follows:

1) If the CA message from the DCS has the M, I, and O bits set to one, contains no CSAS records, and carries a Sender ID larger than the LSID, then:
   a) The timer counting down the CAReXmtInterval is stopped.
   b) The CAFSM corresponding to that DCS transitions to the Cache Summarize State and the LS takes on the role of slave.
   c) The LS adopts the sequence number it received in the CA message as its own sequence number.
   d) The LS sends a CA message to the DCS, formatted as follows: the M and I bits are set to zero, the Sender ID field is set to the LSID, the Receiver ID field is set to the DCSID, and the sequence number is set to the sequence number that appeared in the DCS's CA message. If there are CSAS records to be sent (i.e., if the LS's cache is not empty), the O bit is set to one and the initial set of CSAS records is included in the CA message.
2) If the CA message from the DCS has the M and I bits off and carries a Sender ID smaller than the LSID, then:
   a) The timer counting down the CAReXmtInterval is stopped.
   b) The CAFSM corresponding to that DCS transitions to the Cache Summarize State and the LS takes on the role of master.
   c) The LS must process any CSAS records in the received CA message. Record processing is explained below.
   d) The LS sends a CA message to the DCS, formatted as follows: the M bit is set to one, the I bit is set to zero, the Sender ID field is set to the LSID, the Receiver ID field is set to the DCSID, and the LS's current sequence number is incremented by one and placed in the CA message. If there are any CSAS records to be sent from the LS to the DCS (i.e., if the LS's cache is not empty), the O bit is set to one and the initial set of CSAS records is included in the CA message.
3) Otherwise, the packet must be ignored.

At any given time, the master and the slave each have at most one outstanding CA message.

Once the LS's CAFSM has transitioned to the Cache Summarize State, CA messages are exchanged as follows:

1) If the LS receives a CA message with the M bit set incorrectly (e.g., the M bit is set in the DCS's CA message while the LS is master), or with the I bit set, the CAFSM transitions back to the Master/Slave Negotiation State.
2) If the LS is master and receives a CA message whose sequence number is one less than the LS's current sequence number, the message is a duplicate and MUST be discarded.
3) If the LS is master and receives a CA message whose sequence number equals the LS's current sequence number, the CA message MUST be processed. Message processing is explained below. Having received the CA message from the DCS, the LS does the following:
   a) The timer counting down the CAReXmtInterval is stopped.
   b) The LS must process any CSAS records in the received CA message.
   c) The LS's sequence number is incremented by one.
   d) The cache exchange continues as follows:
      1) If the LS has no more CSAS records to send and the received CA message has the O bit off, the CAFSM transitions to the Update Cache State.
      2) If the LS has no more CSAS records to send and the received CA message has the O bit on, the LS sends back a CA message with the new sequence number, no CSAS records, and the O bit off. The timer counting down the CAReXmtInterval is reset.
      3) If the LS has more CSAS records to send, the LS sends the next CA message with its next set of CSAS records. The O bit is off if this is the LS's last set of CSAS records and on otherwise. The timer counting down the CAReXmtInterval is reset.
4) If the LS is slave and receives a CA message whose sequence number equals the LS's current sequence number, the CA message is a duplicate. The LS MUST resend the CA message it had just sent to the DCS.
5) If the LS is slave and receives a CA message whose sequence number is one more than the LS's current sequence number, the message is valid and MUST be processed. Message processing is explained below. Having received the CA message from the DCS, the LS does the following:
   a) The LS must process any CSAS records in the received CA message.
   b) The LS sets its sequence number to the sequence number in the CA message.
   c) The cache exchange continues as follows:
      1) If the LS had just sent a CA message with the O bit off and the received CA message has the O bit off, the CAFSM transitions to the Update Cache State. The LS then sends a CA message with no CSAS records and with the O bit off.
      2) If the LS still has CSAS records to send, the LS MUST send a CA message containing CSAS records. If this message contains the last CSAS records the LS needs to send, it is sent with the O bit off.
6) If the LS is slave and receives a CA message whose sequence number is neither equal to nor one more than the LS's current sequence number, an error has occurred. The CAFSM transitions to the Master/Slave Negotiation State.

CA message processing occurs as follows. The LS makes a list of those cache entries which are more "up to date" in the DCS than in the LS's own cache. An entry summarized by a received CSAS record is more "up to date" than the corresponding cache entry in the LS if either:

1) the sequence number in the CSAS record is larger than the one in the LS's corresponding cache entry, or
2) the combination of the Client ID and CSA Originator ID does not exist in the LS's cache.

This list is called the CSA Request List (CRL). During this process, the DCS makes a similar list with respect to the LS.

If the CRL of the LS is empty upon transition into the Update Cache State, the CAFSM immediately transitions into the Aligned State. If the CRL is not empty, the LS solicits the relevant CSA records from the DCS associated with the CAFSM. When the LS has all the updated CSA record information, the CAFSM transitions into the Aligned State.

The LS solicits the relevant CSA records by forming CSU Solicit (CSUS) messages from the CRL. CSUS messages contain a CSUS header and CSAS records from the CRL. The LS sends the CSUS messages to the DCS, and the DCS responds by sending CSU messages containing the appropriate CSA records. The DCS acquires updated CSA records for the CSAS records in its own CRL in the same manner. In this way, the LS and DCS databases are synchronized.

At most one CSUS message will be outstanding at any given time. Just before the LS sends its first CSUS message to the DCS associated with the CAFSM, it sets a timer to CSUSReXmtInterval seconds.
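The negotiation rules, the Cache Summarize sequence-number checks, and the "up to date" test that builds the CRL can be sketched in Python as follows. This is a sketch under stated assumptions, not a complete CAFSM:

- Server IDs are modeled as integers (for example, protocol addresses converted with int.from_bytes).
- Cache entries are keyed by (Client ID, CSA Originator ID) and carry only a sequence number.
- The returned action strings are hypothetical labels.
- The draft does not say what the master should do with a sequence number that is neither current nor one less. The sketch treats that case as "ignore".

```python
from enum import Enum, auto
from typing import Dict, List, Optional, Tuple


class Role(Enum):
    MASTER = auto()
    SLAVE = auto()


def negotiate(lsid: int, sender_id: int, m: bool, i: bool, o: bool,
              n_csas: int) -> Optional[Role]:
    """Master/Slave Negotiation rules 1-3; None means the packet is ignored."""
    if m and i and o and n_csas == 0 and sender_id > lsid:
        return Role.SLAVE    # rule 1: the DCS has the larger ID and becomes master
    if not m and not i and sender_id < lsid:
        return Role.MASTER   # rule 2: the DCS has already taken the slave role
    return None              # rule 3


def classify_seq(role: Role, received: int, current: int) -> str:
    """Sequence-number checks in the Cache Summarize State."""
    if role is Role.MASTER:
        if received == current - 1:
            return "discard-duplicate"   # rule 2
        if received == current:
            return "process"             # rule 3
        return "ignore"                  # not covered by the text (assumption)
    if received == current:
        return "resend-last"             # rule 4
    if received == current + 1:
        return "process"                 # rule 5
    return "renegotiate"                 # rule 6


Key = Tuple[bytes, bytes]  # (Client ID, CSA Originator ID)


def build_crl(local: Dict[Key, int], summaries: List[Tuple[Key, int]]) -> List[Key]:
    """CSA Request List: entries the DCS holds that are newer than, or absent from, ours."""
    crl: List[Key] = []
    for key, seq in summaries:
        mine = local.get(key)
        if mine is None or seq > mine:
            crl.append(key)
    return crl
```

For example, an LS with LSID 5 that receives an initial CA (M, I, and O set, no CSAS records) from a DCS with ID 9 becomes slave. The LS with LSID 9 on the other end becomes master once it sees the slave's reply with M and I cleared.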
If the timer expires before all the CSA records corresponding to the CSAS records in the CSUS message have been received, a new CSUS message is created. It includes the CSAS records for all the still-outstanding CSA records, plus additional CSAS records not covered in the previous CSUS message. The new CSUS message is then sent to the DCS. If all CSA record updates for the previously sent CSUS message arrive before the timer expires, the timer is stopped. If the CRL still contains CSAS records that were not covered in the previous CSUS message, the timer is reset and a new CSUS message is created from CSAS records in the CRL that have not yet been sent to the DCS. This process continues until all the CSAS records that were in the CRL have been updated in the LS. When the LS has a completely updated cache, the LS's CAFSM transitions to the Aligned State, as previously mentioned.

2.3 Client State Update Messages

"Client State Update" (CSU) messages are used to update the state of cache entries in servers for a given SG. An LS may send a CSU to, or receive a CSU from, a DCS only when the corresponding CAFSM is in either the Aligned State or the Update Cache State. A CSU message is sent from an LS to each of its DCSs when the LS observes changes in the state of one or more clients in the SG. The change in state of a particular client is noted in a CSU message via a "Client State Advertisement" (CSA) record within the CSU. In this way, state changes are propagated throughout the SG. Examples of such changes in state are as follows:

1) an LS receives a request to add an entry to its cache (e.g., an NHRP Registration Request or an administrative intervention),
2) an LS receives a request to remove an entry from its cache (e.g., an NHRP Purge Request or an administrative intervention),
3) a cache entry has timed out in the LS's cache, has been refreshed in the LS's cache, or has been administratively modified (e.g., in NHRP, an Internetworking address to NBMA address binding has timed out or has been refreshed).

After receiving a CSU, an LS acknowledges it by sending a CSU Reply. Each CSA which the LS has not already seen is propagated to each of the "appropriate" DCSs. The choice of DCSs to which the CSU is propagated is specific to the instance of SCSP being executed. For example, one instance of SCSP might require that the CSU be propagated to every DCS except the DCS from which the LS originally received the CSU. Another instance might propagate the CSU only to specific DCSs (e.g., if the database being synchronized is one which is encompassed within LNNI[6]).

An LS responds to CSUS messages by sending CSU messages containing the appropriate CSA records to the DCS. The LS acknowledges the CSU messages as described above. An LS may receive a CSUS message containing a CSAS record for an entry which is no longer in its database. This can happen, for instance, when the entry timed out and was discarded after the Cache Alignment exchange completed but before the entry was requested through a CSUS message. In that case the LS responds with a CSA record containing a client state code of "4 - No such client data in server cache".

Conclusions

The above text is couched in terms of synchronizing knowledge of a client's state within the caches of the servers in a server grouping. However, this solution generalizes easily to many other database synchronization problems (e.g., LECS synchronization). In such a case, the Client ID (CID) and Client state would be replaced by a unique token and an octet string describing the database entry being synchronized.
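The generalization described in the Conclusions replaces the CID and client state with an opaque token and an octet string. Combined with the CSU and CSUS rules of Section 2.3, it can be sketched in Python as below. This is a hedged illustration, not a specification, and rests on these assumptions:

- Entries are keyed by (token, CSA Originator ID).
- "Not already seen" means "sequence number larger than the cached one".
- Fresh CSAs are flooded to every DCS except the one they arrived on, which is only one of the propagation policies the text allows.
- The sequence number carried in a "4 - No such client data in server cache" reply is not specified, so 0 is used as a placeholder.

The names CSA, ServerCache, accept_csu, and answer_csus are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

NO_SUCH_CLIENT = 4  # "No such client data in server cache" (Section 2.3)

Key = Tuple[bytes, bytes]  # (token, CSA Originator ID): an assumed cache key


@dataclass(frozen=True)
class CSA:
    token: bytes        # generalized Client ID
    originator: bytes   # CSA Originator ID
    seq: int            # larger means more recent
    state: int          # client state code
    value: bytes = b""  # octet string describing the entry

    @property
    def key(self) -> Key:
        return (self.token, self.originator)


class ServerCache:
    """Hypothetical LS-side cache applying the CSU and CSUS rules of Section 2.3."""

    def __init__(self, dcs_ids: Iterable[str]) -> None:
        self.entries: Dict[Key, CSA] = {}
        self.dcs_ids = list(dcs_ids)

    def accept_csu(self, csas: Iterable[CSA], from_dcs: str) -> Dict[str, List[CSA]]:
        """Store CSAs not already seen; return the CSAs to forward, per DCS."""
        fresh: List[CSA] = []
        for csa in csas:
            known = self.entries.get(csa.key)
            if known is None or csa.seq > known.seq:   # "not already seen"
                self.entries[csa.key] = csa
                fresh.append(csa)
        if not fresh:
            return {}
        # Policy (one option from the text): flood to all DCSs but the sender.
        return {d: list(fresh) for d in self.dcs_ids if d != from_dcs}

    def answer_csus(self, keys: Iterable[Key]) -> List[CSA]:
        """Build the CSA records for a CSU sent in response to a CSUS."""
        reply: List[CSA] = []
        for key in keys:
            csa = self.entries.get(key)
            if csa is None:
                # Entry vanished after cache alignment; the sequence number to
                # use here is not specified, so 0 is a placeholder.
                csa = CSA(key[0], key[1], 0, NO_SUCH_CLIENT)
            reply.append(csa)
        return reply
```

Swapping the forwarding policy (e.g., for the LNNI case) would only require changing the dictionary built at the end of accept_csu.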
The appendices below show how SCSP can be implemented in terms of packet syntax specific to a set of other protocols.

Appendix A: Terminology

This appendix introduces the terminology associated with SCSP.

A.1 Abbreviations

   CA    - Cache Alignment Message
   CAFSM - Cache Alignment Finite State Machine
   CID   - Client ID
   CRL   - CSA Request List
   CSA   - Client State Advertisement
   CSAS  - Client State Advertisement Summary
   CSU   - Client State Update
   CSUS  - Client State Update Solicit
   DCS   - Directly Connected Server
   HFSM  - Hello Finite State Machine
   I     - Initialize bit
   LS    - Local Server
   LSID  - Local Server ID
   M     - Master/Slave bit
   O     - More bit
   RS    - Remote Server
   SG    - Server Group
   SID   - Server ID

A.2 Definitions

Cache Alignment message (CA message)
   These messages allow an LS to synchronize its entire cache with that of the cache of one of its DCSs.

Cache Alignment Finite State Machine (CAFSM)
   The CAFSM monitors the state of the cache alignment between an LS and a particular DCS. There exists one CAFSM per DCS as seen from an LS.

Client ID (CID)
   The CID is a unique token which identifies a client whose state is being kept in a server's cache. This value might be taken from the protocol address of the client.

CSA Request List (CRL)
   When CA messages are exchanged between an LS and one of its DCSs, the LS lists those cache entries which are more recent in the DCS than in the LS's own cache, based on a CSAS sequence number. The LS adds to that list any entry in the DCS which is not already in its cache. This list is the CRL.

Client State Advertisement record (CSA record)
   A CSA is a record within a CSU message which identifies an update to the status of a "particular" client.

Client State Advertisement Summary record (CSAS record)
   A CSAS contains a summary of the information in a CSA.
   A server sends CSAS records describing its cache entries to another server during the cache alignment process. CSAS records are also included in CSUS messages when an LS wants to request the entire CSA from the DCS. The LS requests the CSA because it believes that the DCS has a more recent view of the state of the cache entry in question.

Client State Update message (CSU message)
   This is a message sent from an LS to its DCSs when the LS becomes aware of a change in state of a client.

Client State Update Solicit message (CSUS message)
   This message is sent by an LS to its DCS after the LS and DCS have exchanged CA messages. The CSUS message contains one or more CSAS records which represent solicitations for entire CSA records (as opposed to just the summary information held in the CSAS).

Directly Connected Server (DCS)
   The DCS is a server which is directly connected to the LS; e.g., there exists a VC between the LS and DCS. This term, along with the terms LS and RS, is used to give a frame of reference when talking about servers and their synchronization. Unless explicitly stated to the contrary, there is no implied difference in functionality between a DCS, LS, and RS.

Hello Finite State Machine (HFSM)
   An LS has an HFSM associated with each of its DCSs. The HFSM monitors the state of the connectivity between the LS and a particular DCS.

Initialize bit (I bit)
   This bit is included in a CA message. When set, this bit indicates that the sender of the CA message wishes to negotiate for Master/Slave server status in the cache alignment process.

Local Server (LS)
   The LS is the server under scrutiny; i.e., all statements are made from the perspective of the LS. This term, along with the terms DCS and RS, is used to give a frame of reference when talking about servers and their synchronization.
   Unless explicitly stated to the contrary, there is no implied difference in functionality between a DCS, LS, and RS.

Local Server ID (LSID)
   The LSID is a unique token that identifies an LS. This value might be taken from the protocol address of the LS.

Master/Slave bit (M bit)
   This bit is included in a CA message. When set, this bit indicates that the sender of the CA message wishes to be master of the cache alignment process.

More bit (O bit)
   This bit is included in a CA message. When set, this bit indicates that the sender of the CA message has more CA messages to send beyond the message it is currently sending.

Remote Server (RS)
   An RS is a server that is neither an LS nor a DCS. Unless otherwise stated, an RS refers to a server in the SG. This term, along with the terms LS and DCS, is used to give a frame of reference when talking about servers and their synchronization. Unless explicitly stated to the contrary, there is no implied difference in functionality between a DCS, LS, and RS.

Server Group (SG)
   SCSP synchronizes caches (or a portion of the caches) of a set of server entities which are bound to an SG through some means (e.g., all servers belonging to a Logical IP Subnet (LIS)[1]). Thus an SG is just a grouping of servers around some commonality.

Server Group ID (SGID)
   This ID is a 32-bit identification field that uniquely identifies the SG instance. Multiple SG instances may therefore run concurrently, and this field may be used to demultiplex them.

Server ID (SID)
   The SID is a unique token that identifies a given server. This value might be taken from the protocol address of the server.

Appendix B: Packet Formats

B.1 SCSP Message Formats

This section of the appendix includes the message formats for SCSP. SCSP messages may be carried as TLVs within a given protocol's packet. Alternatively, they may form the mandatory part of separate packet types, as in NHRP (see Section B.2).
B.1.1 Cache Alignment (CA)

The Cache Alignment (CA) message allows an LS to synchronize its entire cache with that of the cache of its DCSs. The CA message type code is 1. The CA message format is as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Sender ID Len | Recvr ID Len  |M|I|O|u| Type  | No. of CSASs  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      CA Sequence Number                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Server Group ID                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Sender ID (variable length)                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Receiver ID (variable length)                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          CSAS Record                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                .......
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          CSAS Record                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Sender ID Len
   This field holds the length in octets of the Sender ID.

Recvr ID Len
   This field holds the length in octets of the Receiver ID.

M
   This bit is part of the negotiation process for the cache alignment. When this bit is set, the sender of the CA message is indicating that it wishes to lead the alignment process. This bit is the "Master/Slave bit".

I
   When set, this bit indicates that the sender of the CA message believes that it is in a state where it is negotiating for the status of master or slave. This bit is the "Initialization bit".

O
   This bit indicates that the sender of the CA message has more CSAS records to send. This implies that the cache alignment process must continue. This bit is the "More bit" despite its dubious name.

u
   Unused.

Type
   This is the code for the message type.

No. of CSASs
   This field contains the number of Client State Advertisement Summaries (CSASs) contained in the CA message.

CA Sequence Number
   A value which provides a unique identifier to aid in the sequencing of the cache alignment process. The slave server always copies the sequence number from the master server's previous CA message into its current CA message, thus acknowledging the master's CA message. When the slave receives a sequence number "higher" than the number it previously sent, the slave's previous CA message is acknowledged. A "larger" sequence number means a more recent CA message.

Server Group ID
   This ID is a 32-bit identification field that uniquely identifies the SG instance. Multiple SG instances may therefore run concurrently, and this field may be used to demultiplex them.

Sender ID
   This is the protocol address of the server which is sending the CA message.

Receiver ID
   This is the protocol address of the server which is to receive the CA message.

CSAS record
   See the protocol-specific section.

B.1.2 Client State Update Request (CSU Request)

The Client State Update Request (CSU Request) message is used to update the state of cache entries in servers which are attached to the server sending the message. A CSU Request message is sent from one server (the LS) to another directly connected server (the DCS) when the LS observes changes in the state of one or more clients. This observation may result from receiving a CSU from another DCS, or from some event occurring for a client that has registered with the LS. The change in state of a "particular" client is noted in a CSU message via a "Client State Advertisement" (CSA) record within the CSU. The CSU Request message type code is 2.
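The fixed part of the CA header described in B.1.1 can be packed and parsed as in the following Python sketch. The field widths are read off the 32-bit-wide diagram: 8-bit Sender ID Len and Recvr ID Len; 1-bit M, I, O and u; a 4-bit Type; an 8-bit No. of CSASs; and a 32-bit CA Sequence Number and Server Group ID. Network byte order is an assumption. CSAS records are treated as opaque, pre-encoded byte strings because their format is protocol specific. The names CAHeader, pack_ca, and unpack_ca_fixed are hypothetical.

```python
import struct
from dataclasses import dataclass, field
from typing import List, Tuple

CA_TYPE = 1  # CA message type code (B.1.1)

# Sender ID Len, Recvr ID Len, M|I|O|u|Type (one octet), No. of CSASs,
# CA Sequence Number, Server Group ID; network byte order (assumption).
_FIXED = struct.Struct("!BBBBII")


@dataclass
class CAHeader:
    sender_id: bytes
    receiver_id: bytes
    seq: int
    sgid: int
    m: bool = False
    i: bool = False
    o: bool = False
    csas_records: List[bytes] = field(default_factory=list)  # opaque, pre-encoded


def pack_ca(msg: CAHeader) -> bytes:
    # M, I, O occupy the top three bits of the third octet; u stays 0.
    flags_type = (msg.m << 7) | (msg.i << 6) | (msg.o << 5) | (CA_TYPE & 0x0F)
    fixed = _FIXED.pack(len(msg.sender_id), len(msg.receiver_id), flags_type,
                        len(msg.csas_records), msg.seq, msg.sgid)
    return fixed + msg.sender_id + msg.receiver_id + b"".join(msg.csas_records)


def unpack_ca_fixed(data: bytes) -> Tuple[CAHeader, int, int]:
    """Decode the fixed part; return (header, CSAS count, offset of first CSAS)."""
    sid_len, rid_len, flags_type, count, seq, sgid = _FIXED.unpack_from(data)
    if flags_type & 0x0F != CA_TYPE:
        raise ValueError("not a CA message")
    off = _FIXED.size
    sender = data[off:off + sid_len]
    receiver = data[off + sid_len:off + sid_len + rid_len]
    hdr = CAHeader(sender, receiver, seq, sgid,
                   m=bool(flags_type & 0x80), i=bool(flags_type & 0x40),
                   o=bool(flags_type & 0x20))
    return hdr, count, off + sid_len + rid_len
```

For example, the first CA message of the Master/Slave negotiation would be built with m, i, and o all True and no CSAS records. A slave's reply would clear m and i and set o only when it has CSAS records to send.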
   The CSU Request message format is as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Sender ID Len | Recvr ID Len  |unused | Type  |  No. of CSAs  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      CSU Sequence Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Server Group ID                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Sender ID (variable length)                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Receiver ID (variable length)                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          CSA Record                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                .......
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          CSA Record                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Sender ID Len
      This field holds the length in octets of the Sender ID.

   Recvr ID Len
      This field holds the length in octets of the Receiver ID.

   Type
      This is the code for the message type.

   No. of CSAs
      This field contains the number of Client State Advertisements
      (CSAs) contained in the CSU message.

   CSU Sequence Number
      A value which, when coupled with the address of the source,
      provides a unique identifier for the CSU Request.  This value is
      equivalent to the CSU Sequence Number in SCSP.  A "larger" sequence
      number means a more recent advertisement.

   Server Group ID
      This ID is a 32 bit identification field that uniquely identifies
      the SG instance.  Thus multiple SG instances may be running
      concurrently, and this field may be used to demultiplex them.

   Sender ID
      This is the protocol address of the server which is sending the
      CSU message.

   Receiver ID
      This is the protocol address of the server which is to receive the
      CSU message.
   CSA Record
      See the protocol specific section for the CSA (e.g., the NHRP CSA
      is described in the NHRP section below).

B.1.3 Client State Update Reply (CSU Reply)

   The Client State Update Reply (CSU Reply) message is used to
   acknowledge the reception of a CSU Request.  A CSU Reply message is
   sent from one server (the DCS) to the server (the LS) which sent the
   original CSU Request.  The CSU Reply message type code is 3.

   The CSU Reply message format is the same as that of the CSU Request,
   so that when a server receives a CSU Request, all that needs to be
   done to reply to it is to change the type code to 3 and send the
   message back.  However, the CSU Reply message may be truncated from
   the CSU Request at (but not including) the Client State field.

B.1.4 Client State Update Solicit Message (CSUS message)

   The CSUS message contains a CSU header and zero or more CSAS records.
   This message allows one server (the LS) to solicit the entirety of
   the CSA data stored in the cache of a directly connected server (the
   DCS).  The DCS responds with CSU messages containing the appropriate
   CSAs.  The CSUS message type code is 4.

   The CSUS message format is the same as that of the CA message;
   however, the M, I, and O bits are not meaningful in this context and
   are set to zero.  Also, the CSUS Sequence Number is drawn from a
   different number space than the CA Sequence Number.

B.1.5 Hello:

   The Hello message is used to check connectivity between the sending
   server (the LS) and one of its directly connected neighbor servers
   (the DCSs).  The Hello message type code is 5.  The Hello message
   format is as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Sender ID Len | Recvr ID Len  | unused| Type  |    unused     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         HelloInterval         |          DeadFactor           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Server Group ID                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Sender ID (variable length)                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Receiver ID (variable length)                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Sender ID Len
      This field holds the length in octets of the Sender ID.

   Recvr ID Len
      This field holds the length in octets of the Receiver ID.

   Type
      This is the code for the message type.

   HelloInterval
      The hello interval advertises the time between the sending of
      consecutive Hello messages by an LS.  If the time between Hello
      messages exceeds the HelloInterval, then the Hello is to be
      considered late by the DCS.  On the other hand, if the LS does not
      receive a Hello Reply within its HelloInterval, then the LS resends
      the same Hello message it sent previously.

   DeadFactor
      This is a multiplier to the HelloInterval.  If a DCS does not
      receive a Hello message within the interval
      HelloInterval*DeadFactor from an LS that advertised the
      HelloInterval, then the DCS MUST consider the LS to be stalled, at
      which point the DCS should transition to the Waiting State.  On the
      other hand, if the LS does not receive a Hello Reply within
      DeadFactor*HelloInterval, then one of two things happens: 1) if the
      LS has received Hello messages from the DCS during this time, then
      the LS transitions to the Unidirectional State; otherwise, 2) the
      LS transitions to the Waiting State.

   Server Group ID
      This ID is a 32 bit identification field that uniquely identifies
      the SG instance.
      Thus multiple SG instances may be running concurrently, and this
      field may be used to demultiplex them.

   Sender ID
      This is the protocol address of the server which is sending the
      Hello.

   Receiver ID
      This is the protocol address of the server which is to reply to
      the Hello.  If the sender does not know this address, then the
      sender sets it to zero, and it will be filled in by the subsequent
      reply.

B.2: Packet Formats For NHRP

   In NHRP, each SCSP message type is a separate packet type, and the
   SCSP message is included as the mandatory part of the packet.  Note
   that the NHRP packet type code is different from the SCSP message
   type code.

B.2.1 NHRP CA

   The NHRP CA packet has an SCSP CA message as its mandatory part, and
   this NHRP packet has a type code of 11.

B.2.2 NHRP CSU Request

   The NHRP CSU Request packet has an SCSP CSU Request message as its
   mandatory part, and this NHRP packet has a type code of 12.

B.2.3 NHRP CSU Reply

   The NHRP CSU Reply packet has an SCSP CSU Reply message as its
   mandatory part, and this NHRP packet has a type code of 13.

B.2.4 NHRP CSU Solicit

   The NHRP CSU Solicit packet has an SCSP CSU Solicit message as its
   mandatory part, and this NHRP packet has a type code of 14.

B.2.5 NHRP Hello

   The NHRP Hello packet has an SCSP Hello message as its mandatory
   part, and this NHRP packet has a type code of 15.

B.2.6 CSA record and CSAS record for NHRP

   The CSA record and CSAS record are protocol specific (e.g., NHRP,
   IPMC, ATMARP, etc.) because they carry protocol specific data.  This
   section describes the information carried in CSA records and CSAS
   records for NHRP.

B.2.6.1 CSA Record

   The Client State Advertisement (CSA) record contains the information
   necessary to relate the current state of a client to the servers
   being synchronized.  There are zero or more CSA records in a CSU
   Request message.  The contents of a record are as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Client ID Len |CSA Orig ID Len| Cli NBMA T/L  |Cli NBMA SubT/L|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      CSA Sequence Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             Client NBMA Address (variable length)             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Client NBMA Subaddress (variable length)            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Client ID (variable length)                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       CSA Originator Protocol Address (variable length)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                Client State (variable length)                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Client ID Len
      This field holds the length in octets of the Client ID.

   CSA Orig ID Len
      This field holds the length in octets of the CSA Originator ID.

   Cli NBMA T/L
      Type & length of the source NBMA address, interpreted in the
      context of the 'address family number'[7].  "Cli NBMA T/L" is
      coded in a similar manner as ar$shtl as seen in Section 5.1 of
      [2].

   Cli NBMA SubT/L
      Type & length of the source NBMA subaddress, interpreted in the
      context of the 'address family number'[7].  "Cli NBMA SubT/L" is
      coded in a similar manner as ar$sstl as seen in Section 5.1 of
      [2].

   CSA Sequence Number
      This field contains a sequence number that identifies the CSA
      record instance for the given client.  A "larger" sequence number
      means a more recent advertisement.

   Client NBMA Address
      This field is the NBMA address of the client whose state is being
      kept.  If the field's length as specified in "Cli NBMA T/L" is 0,
      then no storage is allocated for this address at all.
   Client NBMA SubAddress
      This field is the NBMA subaddress of the client whose state is
      being kept.  If the field's length as specified in "Cli NBMA
      SubT/L" is 0, then no storage is allocated for this subaddress at
      all.

   Client ID
      This field identifies the protocol address of the client whose
      state is being kept in the servers' cache.

   CSA Originator ID
      This field contains the protocol address of the server which
      originated the CSA record.

   Client State
      This field/record contains an octet string which identifies the
      current state of the client.  The field/record is broken down as
      follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             State             |   Maximum Transmission Unit   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Holding Time          | Other Data (variable length)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      State
         This field contains a value which represents the change in
         state of the client.  For example:

            0 - Client is registered and available.
            1 - Holding timer expired for client.
            2 - Client reregistered.
            3 - Client has been purged.
            4 - No such client data in server cache.

      Maximum Transmission Unit
         This field represents the maximum acceptable MTU for
         connections to the client.

      Holding Time
         The Holding Time field specifies the number of seconds for
         which the client information specified is valid.  Cached
         information SHALL be discarded when the holding time expires.

      Other Data
         This is a variable length octet string which is potentially
         vendor specific.  This may be encoded in a way similar to the
         Vendor Private extension.

B.2.6.2 Client State Advertisement Summary Record (CSAS record):

   The client state advertisement summary is a summarization of the CSA.
   A CSAS contains the following:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Client ID Len |CSA Orig ID Len|            unused             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      CSA Sequence Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Client Protocol Address (variable length)           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       CSA Originator Protocol Address (variable length)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Client ID Len
      This field holds the length in octets of the Client ID.

   CSA Orig ID Len
      This field holds the length in octets of the CSA Originator ID.

   CSA Sequence Number
      This field contains a sequence number that identifies the CSA
      record instance for the given client.  A "larger" sequence number
      means a more recent advertisement.

   Client ID
      This field identifies the protocol address of the client whose
      state is being kept in the servers' cache.

   CSA Originator ID
      This field contains the protocol address of the server which
      originated the CSA record.

B.3 Packet Formats For ATMARP

   The following packet format uses the ATMARP packet format exactly as
   it exists in [1], with some additions at the end.  Consult Sections
   6.6 and 6.7 of [1] for more details.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            ar$hrd             |            ar$pro             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            unused             |             ar$op             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    SCSP "mandatory parts"                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   ar$hrd
      The "Hardware type" is assigned to the ATM Forum address family and
      is 19 decimal (0x0013).
   ar$pro
      The "Protocol type" is the protocol type number (see Assigned
      Numbers) of the protocol using ATMARP (IP is 0x0800).

   ar$op
      The operation type value identifies the SCSP message carried in
      the packet; the values used for SCSP are given in Sections B.3.1
      through B.3.5.

   SCSP "mandatory parts"
      This part depends on the value of ar$op.  This part/field is
      analogous to the mandatory part in NHRP, while the preceding
      fields are directly analogous to the "Fixed" part in NHRP.  See
      the following sections for details.

B.3.1 ATMARP CA

   The ATMARP CA packet has an SCSP CA message as its mandatory part,
   and this ATMARP packet has an ar$op value of 3.

B.3.2 ATMARP CSU Request

   The ATMARP CSU Request packet has an SCSP CSU Request message as its
   mandatory part, and this ATMARP packet has an ar$op value of 4.

   Since ATMARP clients have no concept of a sequence number, SCSP
   cannot acquire a unique sequence number from the client for a given
   update.  SCSP must therefore generate a sequence number for each
   client request which causes a database update to occur.

B.3.3 ATMARP CSU Reply

   The ATMARP CSU Reply packet has an SCSP CSU Reply message as its
   mandatory part, and this ATMARP packet has an ar$op value of 5.

B.3.4 ATMARP CSU Solicit

   The ATMARP CSU Solicit packet has an SCSP CSU Solicit message as its
   mandatory part, and this ATMARP packet has an ar$op value of 6.

B.3.5 ATMARP Hello

   The ATMARP Hello packet has an SCSP Hello message as its mandatory
   part, and this ATMARP packet has an ar$op value of 7.

B.3.6 CSA record and CSAS record for ATMARP

   These records are the same as those found in Sections B.2.6.1 and
   B.2.6.2 of this document, with two exceptions.  The first exception
   is that the Holding Time is always set to 20 minutes (1200 seconds).
   The second exception is that the "Cli NBMA T/L" and "Cli NBMA SubT/L"
   fields are coded in a manner similar to ar$shtl and ar$sstl
   respectively, as seen in Section 6.6 of [1].
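Both the NHRP and the ATMARP records rely on the rule that a larger CSA Sequence Number means a more recent advertisement. Section B.3.2 also requires SCSP to generate sequence numbers on behalf of ATMARP clients. The Python sketch below shows one way a server might apply these two rules. It makes several assumptions: the ClientCache class and its methods are hypothetical, the cache is keyed by Client ID alone, and sequence number wraparound is ignored.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CacheEntry:
    seq: int      # CSA Sequence Number of the installed advertisement
    state: bytes  # opaque Client State field


@dataclass
class ClientCache:
    """Hypothetical server cache keyed by Client ID (protocol address)."""
    entries: Dict[bytes, CacheEntry] = field(default_factory=dict)

    def apply_csa(self, client_id: bytes, seq: int, state: bytes) -> bool:
        """Install a CSA only if it is more recent than the cached one.

        Returns True if the cache changed and False for a stale or
        duplicate CSA.  Sequence number wraparound is ignored.
        """
        current = self.entries.get(client_id)
        if current is not None and seq <= current.seq:
            return False
        self.entries[client_id] = CacheEntry(seq, state)
        return True

    def next_local_seq(self, client_id: bytes) -> int:
        """Generate a sequence number for a client (e.g. ATMARP) that has none.

        The value is one greater than the sequence number currently cached
        for the client, or 1 if the client is not yet cached.
        """
        current = self.entries.get(client_id)
        return (current.seq if current is not None else 0) + 1
```

Returning a boolean lets the caller decide whether to pass the CSA on to its other DCSs. A stale or duplicate CSA leaves the cache unchanged, so there is nothing new to propagate.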
B.4 Packet Formats For MARS

   The MARS model does not currently specify the nature of any
   distributed-MARS, but it contains an implicit assumption that a
   distributed-MARS would replicate the dynamic information that the
   MARS is required to keep on behalf of its Cluster: the CMIs, the
   Cluster/ServerControlVCs (the actual sets of registered Cluster
   members and MCSs), ClusterSequenceNumbers, and
   ServerSequenceNumbers.

   The minimal goal of a distributed-MARS is to ensure that members of
   the ServerGroup could take over running the Cluster if the designated
   MARS failed.  Allowing cluster members to terminate on any one of the
   component MARSs in the ServerGroup is a more difficult goal, which
   nevertheless needs to be pursued in the longer term.

B.4.1 The MARS Sub-caches.

   This description will assume a 'Designated MARS' model (see Appendix
   C).  The overall MARS state is made up of the following components:

      Cluster membership list.
      Cluster Member IDs.
      Cluster Sequence Number.
      (Multicast) Server membership list.
      (Multicast) Server Sequence Number.
      Absolute maximum and minimum group addresses for the protocol
      being supported.
      Member map (hostmap) for each Layer 3 group.
      (MCS) Servermap for each Layer 3 group.
      Block-join map.

   For the rest of this description, combinations of these components
   may be referred to as MARS sub-caches.

   The Cluster membership list is the most fundamental object for a
   MARS.  It contains the ATM addresses of every cluster member, and
   explicitly maps ATM addresses to Cluster Member IDs.  Both of these
   pieces of information will be combined into a single sub-cache and
   carried in the same CSAS and/or CSA Records.  (This list allows a
   backup MARS to construct backup ClusterControlVCs as necessary.)

   The ClusterSequenceNumber (CSN) is a single integer value, which
   increments every time a MARS control message is transmitted on
   ClusterControlVC.  It is essential that backup MARSs are reasonably
   up to date with their concept of CSN.
   Unfortunately, the CSN increments more often than database changes
   occur.  It is possible that the designated MARS should treat every
   increment of the CSN as a reason to issue an update to the
   ServerGroup.  It may not be sufficient to update the CSN only when
   another cache state change needs to be propagated from the designated
   MARS.  (During a failover from the designated MARS to a backup MARS,
   this could result in the advertised CSN being older than what the
   cluster members expect to see, opening a window of opportunity for
   cluster members to lose a subsequent message on the backup
   ClusterControlVC and not realize it.)

   The (Multicast) Server membership list is essential to enable
   construction of a backup ServerControlVC by any one of the backup
   MARSs.  The ServerSequenceNumber poses a similar problem to the
   ClusterSequenceNumber: it is likely that backup MARSs really need to
   keep it up to date with the latest value.

   Each layer 3 group that has members and/or MCSs registered for it
   will have database entries in the designated MARS.  The (MCS)
   servermaps are likely to change far less frequently than the (host)
   membership maps, and so (for the same layer 3 multicast group) the
   hostmaps and servermaps are treated as separate sub-caches.  To
   simplify and shorten the CSAS and CSA Records, members of these maps
   are identified by indexes into the respective Cluster Membership list
   or Server Membership list (rather than enumerating their actual ATM
   addresses in CSA updates, etc).  The Cluster/Server Membership lists
   are the most important parts of the cache for the distributed-MARS to
   get right.

   The block-join map represents all currently valid block MARS_JOINs
   registered with the MARS.  This allows the preceding, group-specific
   hostmaps to be simplified.  (The CSA Records representing the hostmap
   for a given group only list nodes that have issued a specific
   single-group MARS_JOIN for that group.)
   Internally, the MARS builds whatever database structure is required
   to ensure that replies to MARS_REQUESTs, and general hole-punching
   activities, take the block-join map's contents into account.

B.4.2 Client State Advertisement Summary (CSAS) records.

   These are carried in the Cache Alignment message defined in Section
   B.1.1.  Since a number of different sub-caches exist in a MARS (as
   described above), a number of different CSAS record types are
   defined.  The general form is:

      csas$type        16 bits   Type of sub-cache in this CSAS record.
      csas$contents    n octets  CSAS contents, determined by csas$type.

   Available CSAS record types are:

      CSAS_CLUSTER_LIST    1
      CSAS_MCS_LIST        2
      CSAS_HOST_MAP        3
      CSAS_MCS_MAP         4
      CSAS_BLOCK_JOINS     5

   The specific formats of the associated csas$contents field are
   described in the following sub-sections.

B.4.2.1 CSAS_CLUSTER_LIST.

   The complete CSAS Record looks like:

      csas$type        16 bits   Set to 1 (CSAS_CLUSTER_LIST)
      csas$orig_len    8 bits    Length of csas$origin field.
      csas$unused      8 bits    unused.
      csas$sequence    32 bits   CSAS Sequence number.
      csas$origin      x octets  Originator's protocol address.

   For this CSAS, the sequence number is incremented every time a new
   cluster member registers, or an old one is considered to have died or
   deregistered.

B.4.2.2 CSAS_MCS_LIST.

   The complete CSAS Record looks like:

      csas$type        16 bits   Set to 2 (CSAS_MCS_LIST)
      csas$orig_len    8 bits    Length of csas$origin field.
      csas$unused      8 bits    unused.
      csas$sequence    32 bits   CSAS Sequence number.
      csas$origin      x octets  Originator's protocol address.

   For this CSAS, the sequence number is incremented every time a new
   MCS registers, or an old one is considered to have died or
   deregistered.

B.4.2.3 CSAS_HOST_MAP.

   The complete CSAS Record looks like:

      csas$type        16 bits   Set to 3 (CSAS_HOST_MAP)
      csas$orig_len    8 bits    Length of csas$origin field.
      csas$group_len   8 bits    Length of group address.
      csas$sequence    32 bits   CSAS Sequence number.
      csas$origin      x octets  Originator's protocol address.
      csas$group       y octets  Hostmap's group address.

   For this CSAS, the sequence number is incremented whenever a cluster
   member joins or leaves the group.

B.4.2.4 CSAS_MCS_MAP.

   The complete CSAS Record looks like:

      csas$type        16 bits   Set to 4 (CSAS_MCS_MAP)
      csas$orig_len    8 bits    Length of csas$origin field.
      csas$group_len   8 bits    Length of group address.
      csas$sequence    32 bits   CSAS Sequence number.
      csas$origin      x octets  Originator's protocol address.
      csas$group       y octets  Servermap's group address.

   For this CSAS, the sequence number is incremented whenever an MCS
   joins or leaves the group.

B.4.2.5 CSAS_BLOCK_JOINS.

   The complete CSAS Record looks like:

      csas$type        16 bits   Set to 5 (CSAS_BLOCK_JOINS)
      csas$orig_len    8 bits    Length of csas$origin field.
      csas$unused      8 bits    unused.
      csas$sequence    32 bits   CSAS Sequence number.
      csas$origin      x octets  Originator's protocol address.

   For this CSAS, the sequence number is incremented whenever a block
   MARS_JOIN, or matching block MARS_LEAVE, occurs.

B.4.3 Client State Advertisement (CSA) Records.

   The amount of information needed to update, e.g., a cluster
   membership list or group membership list, may exceed the size of a
   link layer PDU.  Hence, MARS related CSA Records relating to a single
   sub-cache may be fragmented across a number of CSU Request messages.
   This may be considered analogous to the fragmentation of a group's
   membership list across a number of MARS_MULTIs when a MARS replies to
   a single MARS_REQUEST.

   To match the CSAS records, a set of CSA record types are defined:

      CSA_CLUSTER_LIST     1
      CSA_MCS_LIST         2
      CSA_HOST_MAP         3
      CSA_MCS_MAP          4
      CSA_BLOCK_JOINS      5

   Every type allows for fragmentation of the CSA Record across multiple
   CSU Request messages.  Analogous to MARS_MULTIs, CSA Record fragments
   carry a 15 bit fragment number and a 1 bit 'end of fragment' (EOF)
   flag.
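The fragmentation scheme and the re-assembly rules given below can be sketched in Python. This is an illustration under stated assumptions, not a normative implementation. The bit layout of csa$flagxy is assumed by analogy with the mar$seqxy field of MARS_MULTI, with the EOF flag in the most significant bit and the fragment number in the low 15 bits. The Reassembler class, its add() method, and the caller-supplied key that identifies the sub-cache type and entry are hypothetical. Timeouts are checked lazily, when the next fragment for the same key arrives. This sketch also flags any fragment that is not exactly one greater than its predecessor, which is slightly stricter than the "skips by more than one" rule.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, List, Optional, Tuple

EOF_BIT = 0x8000           # assumed: EOF flag is the most significant bit
FRAG_MASK = 0x7FFF         # low 15 bits: fragment number
REASSEMBLY_TIMEOUT = 10.0  # seconds allowed since the last fragment


def pack_flagxy(frag_num: int, eof: bool) -> int:
    """Build a csa$flagxy value (assumed bit layout)."""
    if not 1 <= frag_num <= FRAG_MASK:
        raise ValueError("fragment number must be in 1..32767")
    return (EOF_BIT if eof else 0) | frag_num


def unpack_flagxy(value: int) -> Tuple[int, bool]:
    """Split csa$flagxy into (fragment number, EOF flag)."""
    return value & FRAG_MASK, bool(value & EOF_BIT)


@dataclass
class _Partial:
    csu_seq: int            # CSU Sequence Number shared by all fragments
    csa_seq: int            # CSA Sequence Number shared by all fragments
    last_time: float        # arrival time of the most recent fragment
    last_frag: int = 0      # fragment number of the most recent fragment
    in_error: bool = False  # a fragment was missing or out of order
    payloads: List[bytes] = field(default_factory=list)


class Reassembler:
    """Hypothetical re-assembly state for fragmented MARS CSA Records."""

    def __init__(self) -> None:
        self._partials: Dict[Hashable, _Partial] = {}

    def add(self, key: Hashable, csu_seq: int, csa_seq: int, flagxy: int,
            payload: bytes, now: float) -> Optional[bytes]:
        """Feed one fragment; return the complete record, or None."""
        frag, eof = unpack_flagxy(flagxy)
        p = self._partials.get(key)
        # Discard collected fragments on timeout or a sequence number change.
        if p is not None and (now - p.last_time > REASSEMBLY_TIMEOUT
                              or p.csu_seq != csu_seq
                              or p.csa_seq != csa_seq):
            p = None
        if p is None:
            p = _Partial(csu_seq, csa_seq, last_time=now)
            self._partials[key] = p
        if frag != p.last_frag + 1:
            p.in_error = True  # kept until the final fragment, then dropped
        p.last_frag = frag
        p.last_time = now
        p.payloads.append(payload)
        if not eof:
            return None
        del self._partials[key]
        return None if p.in_error else b"".join(p.payloads)
```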
   Re-assembly of fragments requires collecting CSA Record fragments
   referring to the same sub-cache type and entry, until the EOF flag is
   set.  The re-assembled CSA Record is then processed.

   A sequence of CSU Requests carrying a fragmented CSA Record SHALL
   carry the same CSU Sequence Number (Appendix B.1.2).  If the CSU
   Sequence Number changes during the re-assembly of a CSA Record, the
   fragments collected so far are discarded.

   A sequence of CSA Record fragments of the same CSA Record type SHALL
   carry the same CSA Sequence Number.  If the CSA Sequence Number
   changes during the re-assembly of a fragmented CSA Record, the
   fragments collected so far are discarded.  (The CSA Sequence Number
   for any given type of cache information is derived in the same way as
   the CSAS Sequence Number for the equivalent CSAS record, as described
   in the previous section.)

   The 15 bit fragment number in consecutive fragments of a CSA Record
   SHALL start at 1 and increment by 1 for each fragment.  Fragments
   SHALL be transmitted in order of their fragment sequence numbers.
   All but the final fragment shall have the EOF flag set to 0.  The
   final (or first, if there is only one) fragment SHALL have the EOF
   flag set to 1.

   If the fragment sequence number skips by more than one at the
   receiver, the CSA Record being re-assembled is considered in error.
   It is discarded after the final fragment is received.  If the final
   fragment does not arrive within 10 seconds of the last received
   fragment, the CSA Record re-assembly is terminated and the fragments
   collected so far are discarded.

B.4.3.1 CSA_CLUSTER_LIST.

   This CSA Record carries the entire membership of the current cluster,
   along with the Cluster Member IDs (CMIs) assigned by the MARS they
   registered with.  These CMIs are then used in other CSA Record types
   as a short-form representation of actual cluster members.

      csa$type         16 bits   Set to 1 (CSA_CLUSTER_LIST).
      csa$orig_len     8 bits    Length of csa$origin.
      csa$unused       8 bits    unused.
      csa$sequence     32 bits   CSA Sequence number.
      csa$flagxy       16 bits   Fragment number and EOF flag.
      csa$cnum         16 bits   Number of entries in this fragment.
      csa$thtl         8 bits    Type and length of ATM addresses.
      csa$tstl         8 bits    Type and length of ATM sub-addresses.
      csa$origin       x octets  Originator's protocol address.
      csa$atmaddr.1    q octets  ATM address of member 1.
      csa$subaddr.1    r octets  ATM sub-address of member 1.
      csa$cmi.1        16 bits   Cluster Member ID for entry 1.
                  [..etc..]
      csa$atmaddr.N    q octets  ATM address of member N.
      csa$subaddr.N    r octets  ATM sub-address of member N.
      csa$cmi.N        16 bits   Cluster Member ID for entry N.

B.4.3.2 CSA_MCS_LIST.

   This CSA Record carries the entire list of currently registered
   Multicast Servers (MCSs).  Each MCS is also assigned an internal ID by
   the MARS it registered with; this ID is used to compress the size of
   subsequent CSA_MCS_MAP records.

      csa$type         16 bits   Set to 2 (CSA_MCS_LIST).
      csa$orig_len     8 bits    Length of csa$origin.
      csa$unused       8 bits    unused.
      csa$sequence     32 bits   CSA Sequence number.
      csa$flagxy       16 bits   Fragment number and EOF flag.
      csa$cnum         16 bits   Number of entries in this fragment.
      csa$thtl         8 bits    Type and length of ATM addresses.
      csa$tstl         8 bits    Type and length of ATM sub-addresses.
      csa$origin       x octets  Originator's protocol address.
      csa$atmaddr.1    q octets  ATM address of MCS 1.
      csa$subaddr.1    r octets  ATM sub-address of MCS 1.
      csa$cmi.1        16 bits   Internal MCS ID for entry 1.
                  [..etc..]
      csa$atmaddr.N    q octets  ATM address of MCS N.
      csa$subaddr.N    r octets  ATM sub-address of MCS N.
      csa$cmi.N        16 bits   Internal MCS ID for entry N.

B.4.3.3 CSA_HOST_MAP

   This CSA Record carries the list of cluster members who have joined a
   specified group using a single-group MARS_JOIN operation.  The Cluster
   Member IDs are used to represent each group member within each CSA
   Record fragment.
   A recipient MARS uses this CSA in conjunction with the current
   Cluster membership list to derive the actual ATM addresses of group
   members.

      csa$type         16 bits   Set to 3 (CSA_HOST_MAP).
      csa$orig_len     8 bits    Length of csa$origin.
      csa$group_len    8 bits    Length of csa$group.
      csa$sequence     32 bits   CSA Sequence number.
      csa$flagxy       16 bits   Fragment number and EOF flag.
      csa$cnum         16 bits   Number of entries in this fragment.
      csa$origin       x octets  Originator's protocol address.
      csa$group        y octets  Multicast group's protocol address.
      csa$cmi.1        16 bits   Cluster Member ID for entry 1.
      csa$cmi.2        16 bits   Cluster Member ID for entry 2.
                  [..etc..]
      csa$cmi.N        16 bits   Cluster Member ID for entry N.

B.4.3.4 CSA_MCS_MAP

   This CSA Record carries the list of MCSs that have joined to support
   a specified group.  The internal MCS IDs from prior CSA_MCS_LIST CSA
   Records are used to represent each MCS.  A recipient MARS uses this
   CSA in conjunction with the current MCS membership list to derive the
   actual ATM addresses of the MCSs supporting the group.

      csa$type         16 bits   Set to 4 (CSA_MCS_MAP).
      csa$orig_len     8 bits    Length of csa$origin.
      csa$group_len    8 bits    Length of csa$group.
      csa$sequence     32 bits   CSA Sequence number.
      csa$flagxy       16 bits   Fragment number and EOF flag.
      csa$cnum         16 bits   Number of entries in this fragment.
      csa$origin       x octets  Originator's protocol address.
      csa$group        y octets  Multicast group's protocol address.
      csa$cmi.1        16 bits   Internal MCS ID for entry 1.
      csa$cmi.2        16 bits   Internal MCS ID for entry 2.
                  [..etc..]
      csa$cmi.N        16 bits   Internal MCS ID for entry N.

B.4.3.5 CSA_BLOCK_JOINS

   This CSA Record carries the list of Cluster Members who have joined
   blocks of the layer 3 group address space.  The Cluster Member IDs
   from prior CSA_CLUSTER_LIST CSA Records are used to represent each
   cluster member and associate it with a specific <min, max> pair.

      csa$type         16 bits   Set to 5 (CSA_BLOCK_JOINS).
      csa$orig_len     8 bits    Length of csa$origin.
      csa$group_len    8 bits    Length of the csa$min and csa$max fields.
      csa$sequence     32 bits   CSA Sequence number.
      csa$flagxy       16 bits   Fragment number and EOF flag.
      csa$cnum         16 bits   Number of entries in this fragment.
      csa$origin       x octets  Originator's protocol address.
      csa$min.1        y octets  Minimum group address of block 1.
      csa$max.1        y octets  Maximum group address of block 1.
      csa$cmi.1        16 bits   Cluster Member ID for block 1.
                  [..etc..]
      csa$min.N        y octets  Minimum group address of block N.
      csa$max.N        y octets  Maximum group address of block N.
      csa$cmi.N        16 bits   Cluster Member ID for block N.

B.4.4 CSAS and CSA priorities.

   The most important cache types for a MARS to exchange are the
   CSA_CLUSTER_LIST and CSA_MCS_LIST.  Without alignment of these caches,
   the backup MARSs cannot know the cluster's membership or the currently
   registered MCSs.  They will also be unable to interpret the other CSA
   Record types, which identify nodes using ID values supplied in the
   CSA_CLUSTER_LIST and CSA_MCS_LIST records.

B.5: Packet Formats For LECS

   Work in progress.

Appendix C: A Canonical Point Of Query

   The following sections of this appendix describe optional Designated
   Server (DS) functionality which is not completely within the realm of
   server synchronization but is closely related.  One use of this
   functionality might be to have a dynamically elected server be
   responsible for assigning CMIs [5] to clients in an IPMC
   implementation.

   CSU messages are used to elect the "Designated" Server (DS) from the
   set of "Eligible" Servers (ESs).  A server must also be configured
   with its Designated Server Priority (DSP), which determines its
   priority in the election of a DS.  An ES is a server that is eligible
   to become the DS by virtue of having a DSP greater than zero.
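The eligibility rule above can be combined with the preference rule in Section C.3.1, where the highest DSP wins and ties go to the largest Server ID. The small Python function below expresses that combination. Its name, preferred_ds, and its dictionary input are hypothetical. The text does not say how Server IDs are ordered, so comparing them by length and then octet by octet is an assumption. That comparison matches unsigned big-endian ordering for IDs without leading zero octets.

```python
from typing import Dict, Optional


def preferred_ds(dsp_by_sid: Dict[bytes, int]) -> Optional[bytes]:
    """Return the Server ID of the preferred DS, or None if none is eligible.

    dsp_by_sid maps each known server's ID (its protocol address) to its
    DSP.  Servers with a DSP of 0 are not Eligible Servers and are ignored.
    Among the rest, the highest DSP wins; ties go to the largest Server ID,
    compared by length first and then octet by octet (an assumption).
    """
    candidates = [(dsp, len(sid), sid)
                  for sid, dsp in dsp_by_sid.items() if dsp > 0]
    if not candidates:
        return None
    return max(candidates)[2]
```

A server would include its own ID and DSP in the input, because servers keep and advertise their own state in their own cache.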
C.1 Additional Abbreviations

   DS   - Designated Server
   DSID - Designated Server ID
   DSP  - Designated Server Priority
   ES   - Eligible Server

C.2 Additional Definitions

   Designated Server (DS)
      The DS is the contact point within the SG for off-SG stations
      wishing to query the state of the SG.

   Designated Server ID (DSID)
      The DSID is a unique token that identifies the DS in an SG.  This
      value might be taken from the protocol address of the DS.

   Designated Server Priority (DSP)
      The DSP identifies the priority of a given server to become the DS.
      If the DSP is 0, then the server is ineligible to become the DS.

   Eligible Server (ES)
      An ES is a server that is eligible to become the DS as a result of
      having a DSP greater than zero.

C.3 The Designated Server Functionality

   The remainder of this section assumes that the canonical point of
   query functionality is to be implemented.

C.3.1 Overview

   When an LS has one or more CAFSMs in the Aligned State, the LS
   participates in the Designated Server (DS) election process for the
   given SG.  Once a CAFSM has reached the Aligned State, the LS starts
   the DSTimer, which is set to DSInitTime.  Before this DSTimer
   expires, the LS MUST NOT include a Preferred DSID or Preferred DSP in
   the CSU messages it originates.  While the DSTimer is running, the LS
   keeps track of its preferred DS from knowledge contained in its cache
   and from knowledge of its own DS Priority (DSP) and LSID.  The
   preferred DS is the server with the highest DSP; in the case of a
   tie, the largest Server ID (SID) wins.

   CSU messages contain CSA records.  Each CSA contains the following
   additional fields: a DS bit (which proclaims that the originator
   believes that it is the DS) and a C/S bit (which proclaims that the
   cache entry refers to a Client (bit is zero) or a Server (bit is set
   to one)).  Further, if the C/S bit is set, then the CSA also contains
   a Preferred DSID field and a Preferred DSP field.
   Note that clients are assumed to have a DSP of zero. Servers are
   clients of themselves in the sense that they keep their own state in
   their own cache; thus a server always advertises itself.

C.3.2 The Election Algorithm

   When the DSTimer expires, the LS chooses its preferred DS and starts
   advertising it, together with the preferred DSP. The LS then does
   the following:

   1) If the LS thinks that it is the preferred DS, then:
      a) If all known servers have chosen this LS as leader, then the
         LS becomes the DS (see step 2).
      b) If one or more servers are advertising a DS other than the
         LS, then:
         1) Start the DSOverrideTimer, set to DSOverrideInterval.
         2) When the DSOverrideTimer expires:
            a) If 2/3 of the servers believe the LS to be leader, then
               the LS becomes the DS (see step 2).

   2) If the LS becomes the DS, it does the following:
      a) It increases its DSP by DSPIncrement, but not beyond DSPMax;
         i.e., the new DSP is the lesser of (DSP + DSPIncrement) and
         DSPMax.
      b) It sends out a CSU message with its new DSP in the Preferred
         DSP field, its LSID in the Preferred DSID field, the DS bit
         set, the Originator ID field set to its LSID, and the
         Originator DSP field set to its new DSP.

   3) At all times, an LS listens for a new DS with a higher DSP than
      the current preferred DSP (and preferred DSID). If at any time
      the LS sees a DSP higher than the preferred DSP, or a DSP equal
      to the current preferred DSP but with an associated DSID larger
      than the preferred DSID, then the LS acts as follows:
      1) If the LS was the DS, then:
         a) The LS announces that the other server is the DS by sending
            out a CSU message with the new DS's DSID in the Preferred
            DSID field, the new DS's DSP in the Preferred DSP field, the
            DS bit cleared, and the Originator DSP field set to its
            original DSP (not its incremented DSP).
         b) The LS resets its DSP to its original value.
      2) If the LS was not the DS, then:
         a) If the new preferred DS is not the LS, then the LS simply
            advertises the new information pertaining to the new DS.
         b) If the new preferred DS is the LS, then the LS restarts the
            election process as if the DSTimer had just expired.

   If the LS loses "connectivity" with the DS (e.g., the cache entry in
   the LS for the DS is removed), then the LS acts as follows:

   1) The LS starts a Re-electionTimer.
      a) If connectivity is reestablished before the timer expires,
         then the LS stops the timer and continues as normal.
      b) Otherwise, the LS restarts the election process as if the
         DSTimer had just expired.

   If at any time the last CAFSM of the LS for the given SG leaves the
   Aligned State, then all memory of the DS for that SG is erased from
   the LS. Re-election will not take place until at least one CAFSM of
   the LS for the given SG reaches the Aligned State, at which point
   the election process starts from the beginning.

C.3.3 Message Additions

   A CSU message carries zero or more CSA records. When designated
   server functionality is used, CSA records have the following fields
   appended to them:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | DS Proto Len  |   Pref DSP    |            unused             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Preferred Designated NHS Protocol Address            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   DS Proto Len
      This field holds the length in octets of the Preferred Designated
      NHS's Protocol Address.

   Pref DSP
      This field contains the priority of the preferred Designated NHS
      as seen from the perspective of the server creating the CSA
      record. This field does not exist in a record when the C/S bit is
      zero.

   Preferred DSID (Preferred Designated NHS Protocol Address)
      This field contains the ID of the preferred Designated Server as
      seen from the perspective of the server creating the CSA record.
      This field does not exist in a record when the C/S bit is zero.
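   The C.3.3 additions consist of a fixed 4-octet part followed by a
   variable-length protocol address. A minimal Python encoder and
   decoder is sketched below under stated assumptions: the caller has
   already checked that the C/S bit is set (Pref DSP and Preferred DSID
   are absent otherwise), the address is not padded to a 4-octet
   boundary, and the unused field is sent as zero and ignored on
   receipt. The draft does not specify padding or the treatment of the
   unused field, so these choices and the function names are
   hypothetical.

```python
import struct
from typing import Tuple

# Fixed part of the DS additions: DS Proto Len (8 bits), Pref DSP
# (8 bits), unused (16 bits), in network byte order.
_DS_EXT_HEADER = struct.Struct("!BBH")


def encode_ds_extension(pref_dsp: int, pref_dsid: bytes) -> bytes:
    """Build the DS fields appended to a server CSA record (C/S bit set).

    Assumes no padding after the address and a zero-filled unused field.
    """
    if not 0 <= pref_dsp <= 0xFF:
        raise ValueError("Pref DSP must fit in 8 bits")
    if len(pref_dsid) > 0xFF:
        raise ValueError("address too long for the 8-bit DS Proto Len field")
    return _DS_EXT_HEADER.pack(len(pref_dsid), pref_dsp, 0) + pref_dsid


def decode_ds_extension(data: bytes, offset: int = 0) -> Tuple[int, bytes, int]:
    """Parse the DS fields starting at offset.

    Returns (pref_dsp, pref_dsid, next_offset); the unused field is ignored.
    """
    if len(data) - offset < _DS_EXT_HEADER.size:
        raise ValueError("truncated DS extension header")
    proto_len, pref_dsp, _unused = _DS_EXT_HEADER.unpack_from(data, offset)
    start = offset + _DS_EXT_HEADER.size
    end = start + proto_len
    if end > len(data):
        raise ValueError("truncated Preferred DSID address")
    return pref_dsp, bytes(data[start:end]), end


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address used here as a sample DSID.
    ext = encode_ds_extension(7, bytes([192, 0, 2, 1]))
    print(ext.hex())                 # 04070000c0000201
    print(decode_ds_extension(ext))  # (7, b'\xc0\x00\x02\x01', 8)
```

   Returning next_offset lets a caller continue parsing whatever
   follows the DS fields within the same CSU message.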
References

   [1] "Classical IP and ARP over ATM", Laubach, RFC 1577.

   [2] "NBMA Next Hop Resolution Protocol (NHRP)", Katz, Piscitello,
       Cole, Luciani, draft-ietf-rolc-nhrp-07.txt.

   [3] "OSPF Version 2", Moy, RFC 1583.

   [4] "PNNI Draft Specification", Dykeman, Goguen, ATM Forum
       94-0471R16 (Straw Vote), 1996.

   [5] "Support for Multicast over UNI 3.0/3.1 based ATM Networks",
       Armitage, draft-ietf-ipatm-ipmc-12.txt.

   [6] "LAN Emulation over ATM Version 2 - LNNI Specification - Draft
       3", ATM Forum 95-1082R3, April 1996.

   [7] "Assigned Numbers", Reynolds, Postel, RFC 1700.

Acknowledgments

   This I-D is a distillation of issues raised during private
   discussions, on the IP-ATM mailing list, and during the Dallas IETF
   (12/95). Thanks to all who have contributed, with particular thanks
   to Andy Malis, Raj Nair, and Matthew Doar of Ascom Nexion. I would
   also like to thank James Watt of Newbridge for comments that led to
   a tighter document.

Authors' Addresses

   James V. Luciani
   Bay Networks, Inc.
   3 Federal Street, BL3-04
   Billerica, MA 01821
   Phone: +1-508-439-4724
   Email: luciani@baynetworks.com

   Grenville Armitage
   Bellcore
   445 South Street
   Morristown, NJ 07960
   Phone: +1 201 829 2635
   Email: gja@thumper.bellcore.com

   Joel M. Halpern
   Newbridge Networks Corp.
   593 Herndon Parkway
   Herndon, VA 22070-5241
   Phone: +1-703-708-5954
   Email: jhalpern@Newbridge.COM