Network Working Group                                        Ralph Droms
INTERNET DRAFT                                       Bucknell University
                                                              Greg Rabil
                                                             Mike Dooley
                                                              Arun Kapur
                                                       Quadritek Systems
                                                             Kim Kinnear
                                                              Mark Stapp
                                                           Cisco Systems
                                                            Steve Gonczi
                                                             Bernie Volz
                                                        Process Software

                                                           November 1998
                                                       Expires June 1999

                         DHCP Failover Protocol

Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   To view the entire list of current Internet-Drafts, please check the
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
   Europe), ftp.nic.it (Southern Europe), munnari.oz.au (Pacific Rim),
   ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).

Abstract

   DHCP [RFC 2131] allows for multiple servers to be operating on a
   single network. Some sites are interested in running multiple
   servers in such a way so as to provide redundancy in case of server
   failure. In order for this to work reliably, the cooperating primary
   and secondary servers must maintain a consistent database of the
   lease information. This implies that servers will need to coordinate
   any and all lease activity so that this information is synchronized
   in case of failover.

   This document defines a protocol to provide this synchronization
   between two servers. One server is designated the "Primary" server,
   the other is the "Secondary" server. Additionally, this document
   describes a protocol for the automatic transfer of control from the
   primary to the secondary in the case of failure (failover), as well
   as a network partition.

   This document further develops the concepts presented in
   draft-ietf-dhc-failover-02.txt.

1.  Introduction

   As the use of DHCP servers in networked environments grows, the
   dependency of those networks on the DHCP server increases. This is
   particularly true of the hosts that receive their configuration
   information from the DHCP server. Therefore, it is very important to
   be able to provide reliable, continuous availability of DHCP
   services.

   This specification describes a protocol to support automatic
   failover from a primary to its secondary server. The failover
   mechanism allows the secondary server to perform DHCP actions while
   the primary is down, or when a network failure prevents the primary
   and secondary from communicating. The protocol also specifies how
   reintegration is achieved when the primary again becomes operational
   or when the primary and secondary can again communicate.

   In providing the specification for the failover, the protocol
   specifies how to guarantee reliable delivery of binding changes to
   the partner server. This is required to synchronize lease data
   between the primary and the secondary.

   The protocol further specifies a mechanism to allow either server to
   determine if it can communicate with its partner. The secondary will
   automatically begin to service DHCP requests whenever it cannot
   communicate with the primary. When the primary server becomes
   available again, the secondary will convey any changes that occurred
   since the time of failover back to the primary.

   Through careful control of the difference between the lease times
   offered to DHCP clients and the lease time known by the secondary
   server, the protocol allows the primary to communicate with the
   secondary after the primary has completed communication with the
   DHCP client (a technique known as "lazy" update) and still guarantee
   that duplicate IP address allocations do not occur.
   Thus, the protocol does not directly impact the ability of a DHCP
   server to respond to DHCP client requests.

1.1.  Requirements Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC 2119].

1.2.  DHCP Terminology

   This document uses the following terms:

   o "DHCP client" or "client"

      A DHCP client is an Internet host using DHCP to obtain
      configuration parameters such as a network address.

   o "DHCP server" or "server"

      A DHCP server is an Internet host that returns configuration
      parameters to DHCP clients.

   o "binding"

      A binding is a collection of configuration parameters, including
      at least an IP address, associated with or "bound to" a DHCP
      client. Bindings are managed by DHCP servers.

   o "binding database"

      The collection of bindings managed by a primary and secondary.

   o "subnet address pool"

      A subnet address pool is the set of IP addresses which is
      associated with a particular network number and subnet mask. In
      the simple case, there is a single network number and subnet mask
      and a set of IP addresses. In the more complex case (sometimes
      called "secondary subnets", sometimes "superscopes"), several
      (apparently unrelated) network number and subnet mask
      combinations with their associated IP addresses may all be
      configured together into one subnet address pool.

   o "Primary server" or "Primary"

      A DHCP server configured to provide primary service to a set of
      DHCP clients for a particular set of subnet address pools.

   o "Secondary server" or "Secondary"

      A DHCP server configured to act as backup to a primary server for
      a particular set of subnet address pools.

   o "stable storage"

      Every DHCP server is assumed to have some form of what is called
      "stable storage".
      Stable storage is used to hold information concerning IP address
      bindings (among other things) so that this information is not
      lost in the event of a server failure which requires restart of
      the server.

1.3.  Requirements for this protocol

   The following list of goals must be (and are) achieved by this
   protocol.

   1.  Implementations of this protocol must work with existing DHCP
       client implementations based on the DHCP protocol [RFC 2131].

   2.  Implementations of the protocol must work with existing BOOTP
       relay implementations.

   3.  The protocol must provide failover redundancy between servers
       that are not located on the same subnet.

1.4.  Goals for this protocol

   1.  Provide for continued service to DHCP clients through an
       automated mechanism in the event of failure of the primary
       server.

   2.  Avoid binding an IP address to a client while that binding is
       currently valid for another client. In other words, do not
       allocate the same IP address to two clients.

   3.  Minimize any need for manual administrative intervention.

   4.  Introduce no additional delays in server response time as a
       result of the communications required to implement the Failover
       protocol.

   5.  Share IP address ranges between primary and secondary servers;
       i.e., impose no requirement that the pool of available addresses
       be divided between servers.

   6.  Continue to meet the goals and objectives of this protocol in
       the event of server failure or network partition.

   7.  Provide graceful reintegration of full protocol service after
       server failure or network partition.

   8.  Allow for one computer to act as a secondary server for multiple
       primary servers. Other topologies (e.g., mesh) are also
       possible. Primary and secondary servers SHOULD be viewed as
       "logical" servers and not necessarily physical computers.

   9.  Ensure that an existing client can keep its existing IP address
       binding if it can communicate with either the primary or
       secondary DHCP server implementing this protocol, not just
       whichever server originally offered it the binding.

   10. Ensure that a new client can get an IP address from some server.
       Ensure that in the face of partition, where servers continue to
       run but cannot communicate with each other, the above goals and
       requirements may be met. In addition, when the partition
       condition is removed, allow graceful automatic reintegration
       without requiring human intervention.

   11. If either primary or secondary server loses all of the
       information that it has stored in stable storage, it should be
       able to refresh its stable storage from the other server.

1.5.  Limitations of this Protocol

   The following are explicit limitations of this protocol.

   1.  Under normal operation, only one server at a time will hand out
       new IP addresses, but client lease renewals are serviced by both
       servers; the protocol provides reliability through redundancy
       and some degree of load balancing of lease renewals.

   2.  This protocol provides only one level of redundancy through a
       single secondary server for each primary server.

   3.  The protocol provides a way to detect when the primary and
       secondary server cannot communicate, but once this condition has
       been detected, does not (indeed, cannot) provide any way to
       further distinguish between network failure and failure of one
       of the servers. The protocol allows detection of an orderly
       shutdown of a participating server.

   4.  A subset of the address pool is reserved for secondary server
       use. In order to handle the failure case where both servers are
       able to communicate with DHCP clients, but unable to communicate
       with each other, a subset of the IP address pool must be set
       aside as a private address pool for the secondary server.
       The secondary can use these to service newly arrived DHCP
       clients during such a period. The size of this private pool
       SHOULD be based only on the arrival rate of new DHCP clients and
       the length of expected down-time, and is not influenced in any
       way by the total number of DHCP clients supported by the server
       pair.

   5.  The primary and secondary servers do not respond to client
       requests at all while recovering from a failure that could have
       resulted in duplicate IP assignments (when synchronizing in
       POTENTIAL-CONFLICT state).

2.  Protocol Operations

   The protocol features a small number of messages to communicate
   binding information, operational status and to manage various
   disconnect-reconnect scenarios between servers.

2.1.  Message Addressing and Configuration granularity

   When discussing messages, an important question is "to whom are
   messages sent" and "from whom are messages sent". What is the
   addressable entity from which and to which messages are sent?

   At one level, this would seem to be a single DHCP server, but in
   fact there are many situations where additional flexibility in
   configuration is useful. For instance, there might be several
   servers which are each primary for a distinct set of address pools,
   and one server which is secondary for all of those address pools.
   The situation with the primaries is straightforward, but the
   secondary will need to maintain a separate failover state, partner
   state, and communications up/down status for each of the separate
   primary servers for which it is acting as a secondary.

   The protocol allows for there to be a unique failover entity per
   partner per role (where role is primary or secondary). This failover
   entity can take actions and hold unique states. There are thus a
   maximum of two failover entities per partner (one for the partner as
   a primary and one for that same partner as a secondary.)
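
   The "failover entity per partner per role" idea above can be
   pictured as a table keyed by the partner's address and the role the
   partner plays. The following Python sketch is illustrative only and
   is not part of the protocol: the names Role, FailoverEntity and
   EntityTable, and the fields kept per entity, are assumptions made
   for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    """Role played by the partner in a failover relationship."""
    PRIMARY = 0
    SECONDARY = 1


@dataclass
class FailoverEntity:
    """Hypothetical per-partner, per-role state kept by one server."""
    partner_ip: str
    partner_role: Role
    state: str = "STARTUP"      # failover state held by this entity
    comm_up: bool = False       # communications up/down status


class EntityTable:
    """Holds at most one entity per (partner address, partner role)."""

    def __init__(self):
        self._entities = {}

    def get_or_create(self, partner_ip, partner_role):
        key = (partner_ip, partner_role)
        if key not in self._entities:
            self._entities[key] = FailoverEntity(partner_ip, partner_role)
        return self._entities[key]

    def count_for_partner(self, partner_ip):
        # Never exceeds two: one per role the partner can play.
        return sum(1 for ip, _role in self._entities if ip == partner_ip)
```

   A secondary that backs up two primaries would hold two independent
   entries in such a table, each with its own state and communications
   status.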

   Thus, in the case where there are two primary servers A and B each
   backed up by a single common secondary server C, there is one
   failover entity on each of A and B, and two different failover
   entities on C. The two different failover entities on C each have
   unique states and message xid ranges. As far as the protocol
   described in this draft is concerned, they constitute different
   "servers", although they are certainly part of one server (as the
   term is commonly used) if they reside in the same process.

   It is not the case that there is subnet granularity for each
   failover entity. On one server, there is one failover entity per
   "partner-role", regardless of how many subnets or address pools are
   managed by that combination of partner and role. Conversely, any
   given subnet or pool will be associated with exactly one failover
   entity on a single server (but it will also be associated with the
   corresponding partner's failover entity.)

   When a message is received from the partner, the unique failover
   entity to which the message is directed is determined solely by the
   IP address of the partner and the setting of the SECONDARY bit in
   the 'flags' field of the message header.

   Throughout this document, the states and actions taken by "servers"
   are described. The terms "server", "primary server", and "secondary
   server" are commonly used to describe the entity taking these states
   and taking actions. This description is wholly accurate only for the
   simplest of cases, where all of the address pools on one server are
   backed up by all of the address pools on another server. In this
   case, there is a "true" primary and secondary server. In all other
   cases, the term "server" is used to describe one of the two possible
   failover entities per partner.

2.2.  Packet transport

   All messages sent by this protocol are sent in UDP packets. All
   messages are unicast from the sender to the receiver. The next
   section discusses the port to use when sending DHCP failover UDP
   packets.

   DISCUSSION:

      See section 8, Extended discussion #1, for a discussion of the
      reasons to use UDP as the protocol.

2.3.  Port usage

   Compliant servers SHOULD use port 647 (assigned to dhcp-failover by
   IANA) for sending and receiving Failover protocol messages, though
   they MAY be configured to use a different port (including ports 67
   or 68).

   Since the use of port 67 and 68 is allowed, the messages are
   formatted in such a way that they can be distinguished from DHCP or
   BOOTP messages by the use of distinct message 'op' codes. Note that
   sending failover messages on port 67 to servers not designed to
   support them may not only not work, but may cause those servers to
   operate incorrectly or to crash.

   DISCUSSION:

      Some implementors have a strong requirement for using a separate
      port for the Failover protocol, and the use of the allocated port
      647 will accommodate them. Some other implementors seem equally
      committed to allowing failover packets to be sent to the standard
      DHCP port, port 67. The above language strongly suggests that the
      failover port be used (by using SHOULD), but leaves open the
      possibility of using the standard DHCP port (or any other) for
      servers designed to operate in that fashion.

2.4.  Time synchronization between communicating servers

   Each Binding update message carries a "sent time stamp" (the time
   when the message was sent in GMT). This provides a simple mechanism
   to determine any "time drift" between communicating servers.

   DISCUSSION:

      If a UDP packet is successfully transmitted (i.e., it does not
      get lost), the packet travel time is negligible in the framework
      of DHCP leases. By providing a GMT "sent time" stamp, the
      recipient can compare this with its notion of the current GMT
      time at the time it receives the packet. The difference (plus the
      packet travel time, which we ignore) is the time drift. The
      recipient MUST use this time drift value to bias "absolute time"
      values it receives from the sender.

2.5.  Failover Protocol Messages

   The Failover protocol messages are sent using UDP and encoded using
   a packet format specific to the Failover protocol. To allow easy
   recognition of and separation of Failover protocol messages from
   BOOTP and DHCP messages, BOOTP packet 'op' field values 3..11 are
   used to indicate various Failover protocol message types.

   A Failover protocol message is always unicast from the source to the
   destination using the port defined in section 2.3. The sender, and
   never the recipient, is responsible for retransmission when
   necessary.

2.6.  Failover protocol packet header format

   All of the fields in the fixed portion of the packet MUST be filled
   with correct data in every message sent.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     op (1)    |    rev (1)    |       payload offset (2)      |
   +---------------+---------------+---------------+---------------+
   |                            xid (4)                            |
   +---------------------------------------------------------------+
   |             sending server ID ( IP address ) (4)              |
   +---------------------------------------------------------------+
   |                        time stamp (4)                         |
   +---------------------------------------------------------------+
   |   state (1)   |   flags(1)    |         reserved (2)          |
   +---------------+---------------+---------------+---------------+
   |         0 or more additional header bytes (variable)          |
   +---------------------------------------------------------------+
   |         Payload Data, formatted as DHCP-style options         |
   |         (although using a unique option number space)         |
   |                          (variable)                           |
   +---------------------------------------------------------------+

   op - 1 byte

      These values extend the number space of the existing BOOTP
      message type "Op" field.
      The following message types are defined:

      Value   Message Type
      -----   ------------
        0     reserved to BOOTP/DHCP, unused by failover
        1     BOOTREQUEST (reserved to BOOTP/DHCP, unused by failover)
        2     BOOTREPLY (reserved to BOOTP/DHCP, unused by failover)
        3     DHCPPOOLREQ       request allocation of addresses
        4     DHCPPOOLRESP      respond with allocation count
        5     DHCPBNDUPD        update partner with binding info
        6     DHCPBNDACK        acknowledge receipt of binding update
        7     DHCPPOLL          probe partner for comm. integrity
        8     DHCPPRPL          acknowledge comm. integrity
        9     DHCPUPDATEREQALL  request full transfer of binding info
       10     DHCPUPDATEDONE    ack send and ack of req'd binding info
       11     DHCPUPDATEREQ     req transfer of un-acked binding info

   rev - 1 byte

      Failover protocol version supported. Set to 1 for the Failover
      protocol described in this draft. The value 255 is reserved for
      experimental implementations. Such implementations SHOULD use the
      DHCP Vendor Class option to recognize a partner server which is
      using the same vendor's experimental implementation.

   payload offset - 2 bytes, network byte order

      The byte offset of the Payload area, from the beginning of the
      Failover packet header. The value for the current protocol
      version is 20.

   xid - 4 bytes, network byte order

      The sender of a Failover protocol packet is responsible for
      setting this number, and the receiver of the packet copies the
      number over into any response packet, treating it as opaque data.
      The sender SHOULD ensure that every packet sent to a particular
      IP address and port combination has a unique transaction id
      unless that packet is a re-transmission.

   sending server ID - 4 bytes, network byte order

      The IP address of the sending server. In conjunction with the
      setting of the SECONDARY flag, this uniquely determines the
      failover entity sending the message as well as that destined to
      receive the message. This is placed in the packet instead of
      being recovered from the IP header for security purposes (see
      section 8).
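
   As an illustration of the fixed header layout shown in the diagram
   for this section, the following Python sketch packs and unpacks the
   20-byte header with the standard struct module. It is a minimal
   sketch, not a reference encoder: the function names and the
   constants DHCPPOLL and FLAG_SECONDARY are chosen for this example,
   it assumes no additional header bytes are present when packing, and
   the sign convention of time_drift() (arrive time minus send time) is
   one reading of the time drift described in section 2.4.

```python
import socket
import struct
import time

# op(1) rev(1) payload offset(2) xid(4) sending server ID(4)
# time stamp(4) state(1) flags(1) reserved(2), all in network byte order.
HEADER = struct.Struct("!BBHI4sIBBH")

DHCPPOLL = 7            # 'op' value taken from the message type table
FLAG_SECONDARY = 0x80   # bit 7 (MSB) of the 'flags' byte


def pack_header(op, xid, server_ip, state, flags, now=None):
    """Build the 20-byte fixed header (no additional header bytes)."""
    sent = int(time.time() if now is None else now)
    return HEADER.pack(op, 1, HEADER.size, xid,
                       socket.inet_aton(server_ip), sent, state, flags, 0)


def unpack_header(packet):
    """Return (fields, payload); the payload starts at 'payload offset'."""
    (op, rev, offset, xid, server, sent,
     state, flags, _reserved) = HEADER.unpack_from(packet)
    fields = {"op": op, "rev": rev, "payload_offset": offset, "xid": xid,
              "server_id": socket.inet_ntoa(server), "time_stamp": sent,
              "state": state, "flags": flags}
    return fields, packet[offset:]


def time_drift(arrive_time, send_time):
    """Assumed convention: receiver's arrival time minus sender's stamp."""
    return arrive_time - send_time


def to_local_time(sender_time, drift):
    """Bias an absolute time received from the partner by the drift."""
    return sender_time + drift
```

   Using the payload offset field rather than a hard-coded 20 when
   locating the payload lets a receiver skip any additional header
   bytes a sender may include.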

   time stamp - 4 bytes, unsigned, network byte order

      A time stamp, indicating the time when the packet was sent. The
      time is a 32 bit unsigned long value in network byte order, in
      units of seconds (GMT since EPOCH). It is used to determine the
      time drift between the sender and the recipient. The time drift
      is defined as the difference between "Arrive Time (GMT)" and
      "Send Time (GMT)". The actual packet travel time is assumed to be
      negligible in this context. All Date-Time values contained in
      Failover messages MUST be corrected by the time drift before
      being stored by the recipient.

   state - 1 byte

      This field indicates the state of the sender, at the time the
      packet was sent. The field MUST be set in every Failover message.
      The server state value can be one of the following:

      Value   Server State
      -----   ---------------------------------------------------------
        0     NO-STATE                    May only occur in POLL
                                          messages. The partner should
                                          reply, but should not react
                                          with any state transition.
        1     STARTUP                     Startup state (1)
        2     NORMAL                      Normal state
        3     COMMUNICATIONS-INTERRUPTED  Communication interrupted
                                          (safe)
        4     PARTNER-DOWN                Partner down (unsafe mode)
        5     POTENTIAL-CONFLICT          Synchronizing
        6     RECOVER                     Recovering bindings from
                                          partner
        7     PAUSED                      Shutting down for a short
                                          period.
        8     SHUTDOWN                    Shutting down for an extended
                                          period.
        9     RECOVER-DONE                Interlock state prior to
                                          NORMAL

      Note 1: The STARTUP state is never set in the State field of the
      message, but rather is represented by the setting of the STARTUP
      flag (see the description of the Flags field immediately below).
      When the server is in the STARTUP state, the state transmitted in
      the State byte is the PREVIOUS state (usually, but not always,
      the last recorded in stable storage prior to a server going down
      -- see section 6.3 for details.)

   flags - 1 byte

      Currently, bits 7 (MSB), 6, and 5 are defined. All other bits are
      reserved, and must be set to 0.

      o SECONDARY

         Bit 7 is the SECONDARY flag and defines the server role. Bit 7
         is 0 if the sender is a primary server, 1 if it is a secondary
         server. Note that this role is fixed for the duration of the
         relationship between primary and secondary server. In
         particular, it does not change when and if the secondary
         server "takes over" for the primary server when it enters
         COMMUNICATIONS-INTERRUPTED or PARTNER-DOWN state -- each
         server retains its role throughout all of its state
         transitions.

      o RESTART

         Bit 6 is the RESTART flag. If bit 6 is 1, the sender is
         restarting. A server MUST set this bit every time it is
         restarted, and it MUST clear the bit upon receiving the first
         DHCPPRPL to a DHCPPOLL message it has sent with the bit set.

         Whenever a DHCPPOLL message is sent with the RESTART bit set
         in the 'flags' field, the MCLT Option, Option 235, MUST be
         included.

         Whenever a message with the RESTART bit is received by a
         server, it MUST transition through the communications failed
         state transition. The RESTART bit signals that the partner
         server has been restarted, and if communications is already
         considered to have failed, then nothing need be done. If,
         however, the partner server appeared to be operating
         correctly, then it was able to restart without the receiving
         server noticing that it was ever gone. The communications
         failed transition is forced in this case to restart any
         on-going resynchronization processes that were operating with
         the partner server. See section 6.3 for additional
         information.

         Whenever a DHCPPOLL message is sent with the RESTART bit set,
         the server SHOULD include a Vendor Class Identifier, Option
         60, in the message to identify the server to its partner.

      o STARTUP

         Bit 5 is the STARTUP flag. Bit 5 MUST be set to 1 whenever the
         server is in STARTUP state, and set to 0 otherwise.
         (Note that when in STARTUP state, the state transmitted in the
         'state' field is usually the last recorded state from stable
         storage, but see section 6.3 for details.)

   reserved - 2 bytes

      2 filler bytes, reserved.

2.7.  DHCPPOOLREQ and DHCPPOOLRESP

   A secondary server requests addresses for its unique use from the
   primary server by using the DHCPPOOLREQ message. The primary is in
   complete charge of how many addresses the secondary receives.

   The primary server will allocate IP addresses to the secondary
   server upon receipt of a DHCPPOOLREQ message and inform the
   secondary server of the number of additional addresses allocated in
   this allocation cycle by sending the number in the DHCPPOOLRESP
   message.

   When the primary server gets a DHCPPOOLREQ message, it computes
   which addresses should be transferred to the secondary, and queues
   up DHCPBNDUPD transactions by setting the Status of the selected
   addresses to "BACKUP". Having done this, it sends a DHCPPOOLRESP
   message. The DHCPPOOLRESP message carries the "Number of addresses
   transferred" as its payload. The primary server does not have to
   wait until all the above binding updates have been acknowledged.

   The secondary server keeps sending DHCPPOOLREQ messages until it
   receives a DHCPPOOLRESP with "Number of addresses transferred" = 0,
   or it decides that the partner is not responding.

   If the secondary server receives a DHCPPOOLRESP message with "Number
   of addresses transferred" > 0, it MUST send another DHCPPOOLREQ
   message, since additional addresses may still be waiting for it.
   However, the time at which it sends subsequent DHCPPOOLREQ messages
   is implementation dependent. This mechanism makes it possible for
   the primary server to pace the transfer (e.g., it could generate all
   addresses all at once, or one-by-one) and to some degree for the
   secondary to pace their receipt.

   The primary server MUST respond to each DHCPPOOLREQ message it
   receives.
   If it has already generated all private addresses, or it has no
   available addresses, it MUST send DHCPPOOLRESP with "Number of
   addresses transferred" = 0.

   The secondary server MAY send a DHCPPOOLREQ message at any time, and
   although the primary server is under no obligation to allocate any
   additional addresses, it MUST respond with a DHCPPOOLRESP indicating
   how many new addresses it has allocated or 0 if no new addresses
   were allocated.

2.8.  DHCPUPDATEREQ, DHCPUPDATEREQALL and DHCPUPDATEDONE

   Whenever either server wishes to be updated with information the
   other server knows but has not yet transmitted, it will send a
   DHCPUPDATEREQ or DHCPUPDATEREQALL message.

   When either server gets a DHCPUPDATEREQ or DHCPUPDATEREQALL message,
   it computes which updates should be transferred to the partner, and
   queues up DHCPBNDUPD transactions as appropriate. Once all such
   updates have been acknowledged, it sends a DHCPUPDATEDONE message.

   If the message that initiated this process was a DHCPUPDATEREQ
   message, the receiving server will transmit only DHCPBNDUPD messages
   for IP addresses which its information indicates that its partner
   has not acked.

   If, however, the message that initiated this process was a
   DHCPUPDATEREQALL message, the receiving server will transmit
   DHCPBNDUPD messages for all IP addresses involved in failover with
   this partner in this role.

   The secondary server periodically re-transmits the DHCPUPDATEREQ
   message, until it receives a DHCPUPDATEDONE message with a matching
   'xid' field, or until it decides that the partner is not responding.

   This approach is similar to the DHCPPOOLREQ/DHCPPOOLRESP message
   exchange, with one critical difference: the DHCPPOOLRESP is sent as
   soon as the binding updates are queued up, but the DHCPUPDATEDONE
   message is deferred until all of the sender's DHCPBNDUPD messages
   have been successfully transmitted and a corresponding DHCPBNDACK
   message has been received for each of them.
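
   The deferral of DHCPUPDATEDONE described above amounts to tracking
   which queued DHCPBNDUPD messages are still un-acked. The following
   Python sketch shows one way this bookkeeping could look; the class
   UpdateRequest and its method names are hypothetical and are not
   defined by this protocol.

```python
class UpdateRequest:
    """Tracks one DHCPUPDATEREQ or DHCPUPDATEREQALL being serviced."""

    def __init__(self, request_xid, bndupd_xids):
        # xid of the request, to be echoed in the DHCPUPDATEDONE message.
        self.request_xid = request_xid
        # xids of the DHCPBNDUPD messages queued in response.
        self.pending = set(bndupd_xids)
        self.done_sent = False

    def on_bndack(self, xid):
        """Record a DHCPBNDACK; return True once DHCPUPDATEDONE is due.

        True is returned exactly once, when the last outstanding
        DHCPBNDUPD has been acknowledged. Acks for unknown xids are
        ignored.
        """
        self.pending.discard(xid)
        if not self.pending and not self.done_sent:
            self.done_sent = True
            return True
        return False
```

   By contrast, a DHCPPOOLRESP needs no such tracking, because it is
   sent as soon as the binding updates have been queued.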

   The server processing a DHCPUPDATEREQ message MUST NOT send a
   corresponding DHCPUPDATEDONE message until all of the DHCPBNDUPD
   messages have been acked by the partner with a DHCPBNDACK message.

   Any retransmissions of the DHCPUPDATEREQ message MUST have the same
   transaction ID. Use of a new transaction ID may cause rebuilding of
   the outgoing binding update queue or other processing in the server
   with a negative effect on performance.

2.9.  DHCPBNDUPD

   One server notifies its partner of a binding state change by using
   the DHCPBNDUPD message.

   Every DHCPBNDUPD message MUST contain:

   o An Assigned IP Address Option (Option 50).

   o A DHCP Binding Status (Option 230).

   o Where the Binding Status is ACTIVE, EXPIRED, RELEASED, or RESET,
     it MUST also contain one or both of the Client Identifier (Option
     61) and the Client Hardware Address (Option 233). In the case
     where the Binding Status is ACTIVE, it MUST contain the Lease
     Duration, Option 51.

   o Where dynamic DNS updates are being used by the sending server,
     the Client FQDN Option, Option 81, is used by the sender to
     communicate the status of the binding update to its partner.

   In response to a binding update, the recipient server MUST respond
   with a DHCPBNDACK message.

   Multiple binding updates MAY be batched up, and sent in one Failover
   protocol message (see section 3.1).

2.10.  DHCPBNDACK

   This message implements either a positive or negative acknowledgment
   of one or more binding updates.

   A binding update (or a batch of binding updates sent as one message)
   is matched up with its associated acknowledgment by having the same
   'xid' field value in the message header.

   The server sending a DHCPBNDACK message MAY include any of the
   options that are acceptable in a DHCPBNDUPD message when the
   DHCPBNDACK message is returned to the sender. It MUST include at
   least the Assigned IP Address Option.
   If any of this information differs from the information in the
   DHCPBNDUPD message, the receiver MUST NOT update its bindings
   database with that information upon receipt of the DHCPBNDACK
   message, since the sender will have no way of knowing if the
   receiver actually received the message.

   The DHCPBNDACK MAY selectively reject one or more updates, by
   including one or more IP address - Reject Reason option pairs in the
   message body. The DHCPBNDACK implicitly acknowledges any binding
   updates it replies to, except those it enumerates using Reject
   Reason Codes.

   Implementations of this protocol MAY send batched updates, and they
   MUST be prepared to receive batched updates.

2.11.  DHCPPOLL

   In the absence of other messages, a DHCPPOLL message is used to
   verify the communications integrity of the link between the primary
   and secondary servers. It is used by either server whenever there is
   some question about either the communications integrity or running
   status of the other server.

   Since current state and other status information is transmitted in
   every DHCPPOLL and in every DHCPPRPL message, the DHCPPOLL and
   DHCPPRPL exchange can also be used to signal a change in status by a
   server or as a way to request an update of the status of its
   partner.

   Whenever a DHCPPOLL message is generated it MUST have a unique value
   in the 'xid' field, unless it is a retransmission of a previously
   un-acked DHCPPOLL message.

2.12.  DHCPPRPL

   This message simply replies to the DHCPPOLL message (PRPL = Poll
   reply). Like all messages, it needs to have all of the fixed
   portions of the failover packet header filled in, including the
   state and the flags fields.

3.  Protocol Payload Data Format

   Payload data is encoded as a set of flexible DHCP/BOOTP style
   options [RFC 2132] (the usual 1 byte option code, 1 byte length, and
   "length" bytes of data). The options are placed after the header,
   after skipping PayloadOffset bytes.
   The payload data options are not preceded by a "cookie" value.

   Since the packet is NOT a DHCP/BOOTP protocol packet, the options
   used here do not conflict with any existing "proper" DHCP/BOOTP
   options. In fact, these options are allocated in relationship to the
   DHCP option space in the following way.

   In cases where the syntax and semantics of a Failover Payload Option
   is identical to that of a DHCP/BOOTP option, the same option number
   is used. For options unique to the Failover protocol, option numbers
   starting at 230 are used. Thus, all new Failover protocol option
   numbers are assigned from a continuous range beginning with 230.

   The protocol is permissive in allowing various other DHCP options in
   binding updates. As long as the sender wishes to use an option, it
   MAY include it. On the other hand, the recipient MUST ignore any
   option it is not prepared to process.

3.1.  Batching multiple binding updates in one packet

   Implementations of this protocol MAY send batched updates, and they
   MUST be prepared to receive batched updates.

   Multiple DHCPBNDUPD transactions MAY be batched together in one
   protocol message. Data sets for individual transactions MUST always
   begin with the Assigned IP Address (Option 50). Option ordering
   between the Assigned IP Address options is not significant.

   If batched updates are sent, they MUST be formatted as follows:

      Non-IP Address/Non-client specific options first

      Assigned IP address option (50) for the first address
      Options pertaining to first address, including at least
         DHCP Binding Status (230)

      Assigned IP address option (50) for the second address
      Options pertaining to second address, including at least
         DHCP Binding Status (230)

      ...
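
   The batched layout above can be illustrated with a short Python
   sketch that encodes and decodes a payload of code/length/data
   options, starting a new data set at each Assigned IP Address option
   (50). This is a sketch under stated assumptions rather than a
   complete codec: the function names are invented for this example,
   every option is assumed to use the 1 byte code and 1 byte length
   form, and no special handling of pad or end options is attempted.

```python
import socket

OPT_ASSIGNED_IP = 50      # Assigned IP Address
OPT_BINDING_STATUS = 230  # DHCP Binding Status


def encode_option(code, data):
    """Encode one option as code (1 byte), length (1 byte), data."""
    return bytes([code, len(data)]) + data


def encode_batch(common_options, updates):
    """common_options: [(code, data)]; updates: [(ip, [(code, data)])]."""
    out = bytearray()
    for code, data in common_options:          # non-address options first
        out += encode_option(code, data)
    for ip, options in updates:
        out += encode_option(OPT_ASSIGNED_IP, socket.inet_aton(ip))
        for code, data in options:             # options for this address
            out += encode_option(code, data)
    return bytes(out)


def decode_batch(payload):
    """Split a payload into common options and per-address data sets."""
    common, updates = [], []
    pos = 0
    while pos < len(payload):
        if pos + 2 > len(payload):
            raise ValueError("truncated option header")
        code, length = payload[pos], payload[pos + 1]
        data = payload[pos + 2:pos + 2 + length]
        if len(data) != length:
            raise ValueError("truncated option data")
        pos += 2 + length
        if code == OPT_ASSIGNED_IP:
            if length != 4:
                raise ValueError("Assigned IP Address must be 4 bytes")
            updates.append((socket.inet_ntoa(data), []))
        elif updates:
            updates[-1][1].append((code, data))
        else:
            common.append((code, data))
    return common, updates
```

   Because each data set begins with option 50, the decoder needs no
   explicit count or delimiter to separate the batched updates.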
In case an implementation chooses to reject some or all of the IP address binding information in a DHCPBNDUPD message in a DHCPBNDACK reply, the DHCPBNDACK message MUST contain one or more Assigned IP Address (Option 50) / Reject Reason Code pairs to indicate that the updates for the address(es) were not accepted. The Assigned IP Address option communicates which updates out of the batch are being rejected, and the Reject Reason Code indicates why. Any IP addresses present in the DHCPBNDUPD message without corresponding Option 50 / Reject Reason Code pairs in the DHCPBNDACK message are implicitly acked by the DHCPBNDACK message. If the DHCPBNDUPD message only contains one binding update and that update is rejected, a DHCPBNDACK with a single Assigned IP Address / Reject Reason Code pair MUST be sent.

3.2. DHCP Binding Status

This option is used to convey the current state of a binding. This option is mandatory for DHCPBNDUPD messages.

    Code   Len   Type
   +-----+-----+-----+
   | 230 |  1  | 1-7 |
   +-----+-----+-----+

Legal values for this option are:

   Value   Binding Status
   -----   ------------------------------------------------
     1     FREE       Lease has never been used
     2     ACTIVE     Lease is assigned to a client
     3     EXPIRED    Lease has expired
     4     RELEASED   Lease has been released by client
     5     ABANDONED  A server, or client flagged address as unusable
     6     RESET      Lease was freed by some external agent
     7     BACKUP     Lease belongs to secondary's private address pool

3.3. Assigned IP address

Uses identical code and format to DHCP Option 50 (requested IP address). This option is mandatory for DHCPBNDUPD messages and in any DHCPBNDACK message where a Reject Reason Code option appears.

    Code   Len        Address
   +-----+-----+-----+-----+-----+-----+
   | 50  |  4  | a1  | a2  | a3  | a4  |
   +-----+-----+-----+-----+-----+-----+

3.4. Absolute time

This absolute time is used for the lease grant time as well as the partner-down time.
When used in a DHCPBNDUPD or DHCPBNDACK message, it represents the lease grant time. When used in a DHCPPOLL message, it represents the partner-down time.

This option carries an absolute GMT time value, as time synchronization has already been achieved between the source and the target server using the time field in the message. It is represented as seconds elapsed since Jan 1, 1970 (i.e., the ANSI C time_t time value representation). Note that this is (at present) a signed field.

    Code    Len        Time
   +------+-----+-----+-----+-----+-----+
   | 231  |  4  | t1  | t2  | t3  | t4  |
   +------+-----+-----+-----+-----+-----+

3.5. Number of addresses transferred to Secondary Server

A 32 bit unsigned long in network byte order. Reports the number of addresses transferred by the primary to the secondary server (addresses to be used for the secondary server's private address pool).

    Code   Len    Number of Addresses
   +-----+-----+-----+-----+-----+-----+
   | 232 |  4  | n1  | n2  | n3  | n4  |
   +-----+-----+-----+-----+-----+-----+

3.6. Lease Duration

Uses the format and code of the standard DHCP IP Address Lease Time option (51). The time is in units of seconds, and is specified as a 32-bit unsigned integer. A Lease Duration of 0xFFFFFFFF indicates an infinite lease.

    Code   Len       Lease Time
   +-----+-----+-----+-----+-----+-----+
   | 51  |  4  | t1  | t2  | t3  | t4  |
   +-----+-----+-----+-----+-----+-----+

3.7. Client Identifier

The format, code and conventions used are identical to DHCP option 61.

    Code   Len   Type  Client-Identifier
   +-----+-----+-----+-----+-----+---
   | 61  |  n  | t1  | i1  | i2  | ...
   +-----+-----+-----+-----+-----+---

3.8. Client Hardware Address

The format is similar to DHCP option 61. The type byte (t1) MUST be set to the proper ARP hardware address code, as defined in the ARP section of RFC 1700 (it MUST NOT be zero!).

    Code   Len   Type  MAC address
   +-----+-----+-----+-----+-----+---
   | 233 |  n  | t1  | m1  | m2  | ...
   +-----+-----+-----+-----+-----+---

Either Client Id, Client Hardware Address or BOTH MAY be present in binding update transactions. At least one of them MUST be present. If both are present, the Client Id MUST be used to uniquely identify the owner of the binding (exactly as in RFC 2131).

3.9. Host Name

Uses the format and code of DHCP option 12.

    Code   Len   Host Name
   +-----+-----+-----+-----+-----+-----+-----+-----+--
   | 12  |  n  | h1  | h2  | h3  | h4  | h5  | h6  | ...
   +-----+-----+-----+-----+-----+-----+-----+-----+--

3.10. Domain Name

Uses the format and code of DHCP option 15.

    Code   Len   Domain Name
   +-----+-----+-----+-----+-----+-----+--
   | 15  |  n  | d1  | d2  | d3  | d4  | ...
   +-----+-----+-----+-----+-----+-----+--

3.11. Client FQDN

If an implementation supports Dynamic DNS updates, this option can be used to communicate the DNS name that was set. Uses the format and code of the Client FQDN option (81) as described in .

    Code   Len  Flags  Rcode1 Rcode2  Domain Name
   +-----+-----+-----+------+------+-----+------
   | 81  |  n  |  f  |  r1  |  r2  | d1  | d2...
   +-----+-----+-----+------+------+-----+------

3.12. Reject Reason Code

This option is used to selectively reject binding updates. It MAY be used in a DHCPBNDACK message, always following an option 50. Option 50 contains the IP address of the specific update being rejected. Note that a Message option, DHCP Option 56, may be included to give a human readable error indication along with the Reject Reason Code.

    Code   Len   Reason code
   +-----+-----+----------+
   | 234 |  1  |    R1    |
   +-----+-----+----------+

Reason codes:

     0        Reserved
     1        Illegal IP address (not part of any address pool)
     2        Fatal conflict exists: address in use by other client.
     3 - 253  Reserved for new Reason Codes.
     254      Unknown: Error occurred but does not match any reason code
     255      Reserved for code expansion

3.13. Message

This option is used to supply a human readable message.
It may be used in association with the Reject Reason Code to provide a human readable error message for the reject.

    Code   Len     Text
   +-----+-----+------+-----+--
   | 56  |  1  |  c1  | c2  | ...
   +-----+-----+------+-----+--

3.14. MCLT - Maximum Client Lead Time

Maximum Client Lead Time, in seconds. A 32 bit integer value, in network byte order. This option MUST be used in DHCPPOLL and DHCPPRPL messages, when the server is NOT in normal state.

    Code    Len        Time
   +------+-----+-----+-----+-----+-----+
   | 235  |  4  | t1  | t2  | t3  | t4  |
   +------+-----+-----+-----+-----+-----+

3.15. Vendor Class Identifier

A string which identifies the vendor of the failover protocol implementation. The code for this option is 60, and its minimum length is 1.

    Code   Len   Vendor Class Identifier
   +-----+-----+-----+-----+-----+--
   | 60  |  n  | i1  | i2  | i3  | ...
   +-----+-----+-----+-----+-----+--

4. Challenging scenarios for a Failover protocol

There exist a number of failure scenarios which will challenge the correctness guarantees of the Failover protocol. Two of the scenarios that the Failover protocol was specifically designed to handle correctly are detailed in this section in order to motivate some of the more unusual aspects of the protocol's operations.

4.1. Primary Server crash before "lazy" update:

In the case where the primary server sends a DHCPACK to a client for a newly allocated IP address and then crashes prior to sending the corresponding update to the secondary server, the secondary server will have no record of the IP address allocation. When the secondary server takes over, it may well try to allocate that IP address to a different client. In the case where the first client to receive the IP address is not on the net at the time (yet while there was still time to run on its lease), an ICMP echo (i.e., ping) will not prevent the secondary server from allocating that IP address to a different client.
This is handled in the protocol by having the primary and secondary allocate addresses for new clients from distinct address pools.

A more likely (in that DHCPRENEWs are presumably more common than DHCPDISCOVERs) and more subtle version of this problem is where the primary server crashes after extending a client's lease time, and before updating the secondary with a new time using a lazy update. After the secondary takes over, if the client is not connected to the network the secondary will believe the client's lease has expired when, in fact, it has not. In this case as well, the IP address might be reallocated to a different client while the first client is still using it.

This scenario is handled by the Failover protocol through control of the lease time and the use of the maximum client lead time (MCLT). See the next section for details.

4.2. Network partition where servers can't communicate but each can talk to clients:

Several conditions are required for this situation to occur. First, due to a network failure, the primary and secondary servers cannot communicate. As well, some of the DHCP clients must be able to communicate with the primary server, and some of the clients must now only be able to communicate with the secondary server. When this condition occurs, both primary and secondary servers could attempt to allocate IP addresses for new clients from the same pool of available addresses. At some point, then, two clients will end up being allocated the same IP address. This will cause potentially serious problems when the network failure that created this situation is corrected.

This is handled in the protocol by having the primary and secondary servers allocate addresses for new clients from distinct address pools.

The specifics of how these two scenarios are handled are supplied in the next section.

5. Duplicate Address Assignment Control

There are several ways that the Failover protocol avoids the possibility of duplicate address assignment.

5.1. Control of lease time

The key problem with lazy update is that when a server fails after updating a client with a particular lease time and before updating its partner, the partner will believe that a lease has expired even though the client still retains a valid lease on that IP address.

In order to handle this problem, a period of time known as the "Maximum Client Lead Time" (MCLT) is defined and must be known to both the primary and secondary servers. Proper use of this time interval places an upper bound on the difference allowed between the lease time provided to a DHCP client by a server and the lease time known by that server's partner.

So that the MCLT does not become the maximum lease time that a server can ever provide to a client, during a lazy update the updating server typically updates its partner with lease time information which is longer than the lease time previously given to the client. This allows that server to give a longer lease time to the client the next time the client renews its lease.

When moving to the PARTNER-DOWN state (where a server is allowed to reallocate the partner's IP addresses), a server will wait the Maximum Client Lead Time before allocating any IP addresses from its partner's pool to any new DHCP clients. Thus, any clients which have a lease on an IP address with a lease time greater than that known by the server moving into PARTNER-DOWN state will either have contacted that server during the MCLT period or their leases will have expired.

When a server has transitioned to PARTNER-DOWN state, it MUST NOT reallocate an IP address from one client to another client until an additional maximum client lead time interval after the lease on the first client expires.
(Actually, until the maximum client lead time after what it believes to be the lease expiration time of the first client.)

The fundamental relationship on which much of the correctness of this protocol depends is that the lease expiration time known to a DHCP client MUST NOT be more than the maximum client lead time greater than the lease expiration time known to a server's partner.

The remainder of this section makes the above fundamental relationship more explicit.

This protocol requires a DHCP server to deal with several different lease intervals and places specific restrictions on their relationships. The purpose of these restrictions is to allow the other server in the pair to be able to make certain assumptions in the absence of an ability to communicate between servers.

The different lease times are:

   o desired client lease interval

      The desired client lease interval is the lease interval that a DHCP server would like to give to a DHCP client in the absence of any restrictions imposed by the Failover protocol. Its determination is outside of the scope of this protocol. Typically this is the result of external configuration of a DHCP server.

   o actual client lease interval

      The actual client lease interval is the lease interval that a DHCP server gives out to a DHCP client. It may be shorter than the desired client lease interval (as explained below).

   o desired partner server lease interval

      The desired partner server lease interval is the lease expiration interval the local server tells to its partner.

   o acknowledged partner server lease interval

      The acknowledged partner server lease interval is the interval the partner server has most recently acknowledged.

The key restriction (and guarantee) that any server makes with respect to lease intervals is that the actual client lease interval never exceeds the acknowledged partner server lease interval (if any) by more than a fixed amount.
This fixed amount is called the "Maximum Client Lead Time" (MCLT).

The MCLT MAY be configurable, but for correct server operation it MUST be the same and known to both the primary and secondary servers. It is transmitted from the primary to the secondary in every message sent with the RESTART bit set, and also in every poll and poll reply message. The secondary MUST ensure that its value agrees with that of the primary. See section 3.14 concerning the MCLT Option.

A server MUST record in its stable storage both the local server lease interval and the most recently acknowledged partner server lease interval for each IP address binding. It is assumed that the desired client lease interval can be determined through techniques outside of the scope of this protocol.

Again, the fundamental relationship among these times which MUST be maintained is:

   actual client lease interval <
      ( acknowledged partner lease interval + MCLT )

The "acknowledged partner lease interval" is the acknowledged secondary server lease interval for the primary server, and it would be the acknowledged primary server lease interval for the secondary server when it is operating out of contact with the primary server.

Figure 5.1-1 illustrates an initial lease to a client using the rules discussed in the example which follows it.

        DHCP                  Primary                Secondary
       Client                 Server                  Server

         |                       |                       |
         | >-DHCPDISCOVER->      |                       |
         |      <---DHCPOFFER-<  |                       |
         |                       |                       |
         | >-DHCPREQUEST->       |                       |
         |   (selecting)         |                       |
         |                       |                       |
         |   <--------DHCPACK-<  |                       |
         | ^      (MCLT)         |                       |
         | :                     | >-DHCPBNDUPD-->       |
         | :                     | (1/2 MCLT + X )       |
         | :                     |                       |
         | :                     |       <-DHCPBNDACK-<  |
         | MCLT / 2              |                       |
        ... :                   ...                     ...
         | :                     |                       |
         | V                     |                       |
         | >-DHCPREQUEST->       |                       |
         |    (renew)            |                       |
         |                       |                       |
         |   <--------DHCPACK-<  |                       |
         | ^       (X)           |                       |
         | :                     | >-DHCPBNDUPD-->       |
         | :                     | ( 1/2 X + X )         |
         | :                     |                       |
         | :                     |       <-DHCPBNDACK-<  |
         | X / 2                 |                       |
         | :                     |                       |
        ... ...                 ...                     ...

            Figure 5.1-1: Lazy Update Message Traffic
                X = Desired Client Lease Interval

DISCUSSION:

This protocol mandates no algorithm concerning these lease intervals, as long as the above fundamental relationship is preserved. In the interests of clarity, however, let's examine a specific example.

The MCLT in this case is 1 hour. The desired client lease interval is 3 days, and its renewal time is half the lease interval.

The rules for this example are:

   o What to tell the client:

      Take the remainder of the acknowledged partner server lease interval. If this is a new lease, then this value will be zero. If this remainder plus the MCLT is greater than the desired client lease interval, give the client the desired client lease interval; otherwise give the client the remainder plus the MCLT.

   o What to tell the failover partner server:

      Take the renewal interval (typically half of the actual client lease interval), and add to it the desired client lease interval.

In operation this might work as follows:

When a primary server makes an offer for a new lease on an IP address to a DHCP client, it determines the desired client lease interval (in this case, 3 days). It then examines the acknowledged partner lease interval (which in this case is zero) and determines the remainder of the time left to run, which is also zero. To this it adds the MCLT. Since the actual client lease interval cannot be allowed to exceed the remainder of the current partner lease interval plus the MCLT, the offer made to the client is for the remainder of the current partner lease interval (i.e., zero) plus the MCLT. Thus, the actual client lease interval is 1 hour.

Once the primary server has performed the ACK to the DHCP client, it will update the secondary server with the lease information. However, the desired partner server lease interval will be composed of one half of the current actual client lease interval added to the desired client lease interval.
Thus, the secondary server is updated with a DHCPBNDUPD with a lease interval of 3 days + 1/2 hour specified in the Lease Duration Option (Option 51).

When the primary server receives an ACK to its update of the secondary server's (partner's) lease interval, it records that as the acknowledged partner server lease interval. A server MUST NOT send a DHCPBNDACK in response to a DHCPBNDUPD message until it is sure that the information in the DHCPBNDUPD message resides in its stable storage. Thus, the primary server in this case can be sure that the secondary server has recorded the desired partner server lease interval in its stable storage when the primary server receives a DHCPBNDACK message from the secondary server.

When the DHCP client attempts to renew at T1 (approximately one half an hour from the start of the lease), the primary server again determines the desired client lease interval, which is still 3 days. It then compares this with the remaining acknowledged partner server lease interval (3 days + 1/2 hour) and adjusts for the time passed since the secondary was last updated (1/2 hour). Thus the remaining time on the acknowledged partner server lease interval is 3 days. Adding the MCLT to this yields 3 days plus 1 hour, which is greater than the desired client lease interval of 3 days. So the client is renewed for the desired client lease interval -- 3 days.

When the primary DHCP server updates the secondary DHCP server after the DHCP client's renewal ACK is complete, it will calculate the desired partner server lease interval as the T1 fraction of the actual client lease interval (1/2 of 3 days this time = 1.5 days). To this it will add the desired client lease interval of 3 days, yielding a total desired partner server lease interval of 4.5 days.
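
The arithmetic of this example can be checked with a minimal Python sketch. It implements only the two example rules given in this DISCUSSION (the protocol itself mandates no particular algorithm), with all intervals in seconds and a renewal fraction of one half. The function names `actual_client_lease` and `desired_partner_lease` and the variable names are hypothetical.

```python
HOUR = 3600
DAY = 24 * HOUR


def actual_client_lease(desired, partner_remaining, mclt):
    """Example rule "what to tell the client".

    The client gets the desired interval unless that would exceed the
    remaining acknowledged partner interval plus the MCLT.
    """
    return min(desired, partner_remaining + mclt)


def desired_partner_lease(actual, desired):
    """Example rule "what to tell the failover partner server".

    The renewal interval (assumed here to be half of the actual client
    lease interval) plus the desired client lease interval.
    """
    return actual // 2 + desired


MCLT = 1 * HOUR
DESIRED = 3 * DAY

# Initial lease: nothing has been acknowledged by the partner yet.
first_actual = actual_client_lease(DESIRED, 0, MCLT)
first_partner = desired_partner_lease(first_actual, DESIRED)

# Renewal at T1, half of the first lease later, assuming the partner
# acknowledged the first update.
elapsed = first_actual // 2
partner_remaining = first_partner - elapsed
renew_actual = actual_client_lease(DESIRED, partner_remaining, MCLT)
renew_partner = desired_partner_lease(renew_actual, DESIRED)
```

With these inputs the first lease is 1 hour and the first partner update is 3 days + 1/2 hour; the renewal is for 3 days and the second partner update is 4.5 days, matching the figures in the text.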
In this way, the primary attempts to have the secondary always "lead" the client in its understanding of the client's lease interval so as to be able to always offer the client the desired client lease interval.

Once the initial actual client lease interval of the MCLT is past, the protocol operates effectively like the DHCP protocol does today in its behavior concerning lease intervals. However, the guarantee that the actual client lease interval will never exceed the remaining acknowledged partner server lease interval by more than the MCLT allows full recovery from a variety of failures.

5.2. Controlled re-allocation of IP addresses

When in PARTNER-DOWN state (after a period defined in detail in section 6.5.2 has passed), there are no restrictions on reallocating a lease from one client to another.

In any other state, a server cannot reallocate an address from one client to another without first notifying (through a DHCPBNDUPD message) and receiving acknowledgement (through a DHCPBNDACK message) that its partner is aware that that first client is not using the address.

This could be modeled in the following way (though this specific implementation is in no way required). An "available" IP address on a server may be allocated to any client. An IP address which was leased to a client and which expired or was released by that client would take on a new state, say "pending-available". When an IP address became "pending-available", the partner server would be notified that this IP address was "available" through a DHCPBNDUPD. When the sending server received the DHCPBNDACK for that IP address showing it was "available", it would move the IP address from "pending-available" to "available", and it would be available for allocation to any clients.

A server MAY reallocate an IP address in "pending-available" state to the same client with no restrictions.

5.3. Secondary renewal of leases

When operating in NORMAL state, a secondary server MAY process DHCPREQUEST messages for renewal or rebinding leases. In this case, the requirements for control of lease time and re-allocation of IP addresses are the same as those of the primary server.

6. Server Operation

This section discusses the operation of a server implementing the Failover protocol using the state transition diagram in Figure 6.2-1. This is the common state transition diagram for both servers in a pair.

6.1. Server Initialization

When a server starts it starts out in STARTUP state. See section 6.4 below for details.

6.2. Establishing Communications Integrity

Central to the operation of the Failover protocol is a notion of "communications okay" or "communications failed". State transitions are taken in many cases when the status of communications with the partner changes.

A specific discipline exists for establishing and verifying communications integrity. Communications is set to "okay" whenever a message sent is acked by the partner. After an implementation dependent length of time from the communications "okay" event the communications with the partner are deemed to have "failed" if no subsequent acknowledgments have been received. Whenever a DHCPPRPL, DHCPUPDATEDONE, DHCPPOOLRESP or DHCPBNDACK is received this time period is restarted.

Obviously, as the time period elapses, a server SHOULD send DHCPPOLL messages in order to elicit a DHCPPRPL message in reply, which will reset the time period. While an implementation SHOULD restart this time period on every DHCPUPDATEDONE, DHCPPOOLRESP, DHCPBNDACK or DHCPPRPL, it MAY choose to only restart it on a DHCPPRPL.

This technique ensures that two-way communications integrity exists between the servers.
Were the timeout period to be reset on the receipt of any message from the partner, a network failure where one server could send messages to the partner but not receive any from it could lead to failure of the entire redundant DHCP subsystem. For example, in a situation where the primary could send but not receive any messages, the secondary would never take over from the primary and yet DHCP clients would not receive any service.

6.3. Server State Transitions

Figure 6.2-1 is the diagram of the server state transitions. The remainder of this section contains information important to the understanding of that diagram.

The server stays in the current state until all of the actions specified on the state transition are complete. If communications fails during one of the actions, the server simply stays in the current state and attempts a transition whenever the conditions for a transition are later fulfilled.

In the state transition diagram below, the "+" or "-" in the upper right corner of each state is a notation about whether communication is ongoing with the other server. The legend "responsive", "partially-responsive", or "unresponsive" in each state indicates whether the server is responsive to DHCP client requests in the respective state. The terms "responsive" and "unresponsive" have the obvious meanings, while "partially-responsive" means that a DHCP server may respond to DHCPREQUEST messages that are RENEWAL or REBINDING, but to no other messages.

In the state transition diagram below, when communication is reestablished between the two servers, each must record the state of the partner when communication was restored. State transitions on one server in some cases imply state transitions on the partner server, so a record of the current state of the partner server must be kept by each server.
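
The three responsiveness legends described above can be expressed as a small predicate. The following Python sketch is illustrative only: the enum `Responsiveness`, the function `may_respond`, and the use of plain strings for the DHCP message type and the client's request kind are all assumptions made for the example, not part of the protocol.

```python
from enum import Enum


class Responsiveness(Enum):
    RESPONSIVE = "responsive"
    PARTIALLY_RESPONSIVE = "partially-responsive"
    UNRESPONSIVE = "unresponsive"


def may_respond(level, message_type, request_kind=None):
    """Return True if a server at this level may answer the client message.

    message_type: DHCP message name such as "DHCPDISCOVER" or "DHCPREQUEST".
    request_kind: for a DHCPREQUEST, "RENEWAL", "REBINDING" or another
                  label chosen by the caller (hypothetical classification).
    """
    if level is Responsiveness.RESPONSIVE:
        return True
    if level is Responsiveness.UNRESPONSIVE:
        return False
    # Partially-responsive: only RENEWAL or REBINDING DHCPREQUESTs.
    return message_type == "DHCPREQUEST" and request_kind in ("RENEWAL", "REBINDING")
```

A server in a partially-responsive state would therefore ignore a DHCPDISCOVER but could still answer a renewing client.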
If a message is received from a partner with the state equal to zero (0), then the receiving server should respond to that message with a DHCPPRPL if it was a DHCPPOLL, but under no circumstances should it consider communications to be "okay", nor take any state transitions based on receipt of that message.

If the state of the partner changes while communicating, a server moves through the communications-failed transition and into whatever state results. It then immediately moves through whatever state transition is appropriate given the current state of the partner server.

DISCUSSION:

The point of this technique is simplicity, both in explanation of the protocol and in its implementation. The alternative to this technique of memory of partner state and automatic state transition on change of partner state is to have every state in the following diagram have a state transition for every possible state of the partner. With the approach adopted, only the states in which communications are reestablished require a state transition for each possible partner state.

The current state of a server must be recorded in stable storage and thus be available to the server after a server restart.

+---------------+ V +--------------+ | RECOVER - | | | STARTUP - | |(unresponsive) | +->|(unresponsive)| +---------------+ +--------------+ Comm. OK +-----------------+ Other State:-RECOVER | PARTNER DOWN - |<-----+ | | | (responsive) | | All POTENTIAL- +-----------------+ | Others CONFLICT------------ | --------+ ^(see | | Comm. OK | | 6.93) | UPDATEREQ(ALL) Other State: | +-----+ | Wait UPDATEDONE | | | Comm. | | Wait MCLT from fail RECOVER All Others| Failed | | +--------------+ | V V | | | |RECOVER-DONE +| +--+ +--------------+ | | |(unresponsive)| | | POTENTIAL + |<--+ | +--------------+ Wait for +>| CONFLICT | | Comm.
OK Other | |(unresponsive)|<--- | --+ +--Other State:-+ State: | +--------------+ | | | | | RECOVER | | | | | All POTENT. DONE | Resolve Conflict | | | Others: CONFLICT-- | ----+ (see 6.9) | | | Wait for V V | | | Other State: NORMAL +-----------------+ | | | V | NORMAL + | External | | | +--+----------+-->|(see 6.72, 6.73) |-Command-->+ | | ^ ^ +-----------------+ | | | | | | | | | Wait for Comm. OK Comm. External | | Other Other Failed Command | | State: State: | or | | |RECOVER-DONE NORMAL Start Safe Safe | | | | COMM. INT. Period Timer Period | | | Comm. OK. | V expiration | | Other State: | +------------------+ | | | RECOVER +--| COMMUNICATIONS - |-----------+ | V +-------------| INTERRUPTED | Comm. OK | RECOVER | (responsive) |--Other State:-+ RECOVER-DONE--------->+------------------+ All Others

            Figure 6.2-1: Server state diagram.

6.4. STARTUP state

The STARTUP state affords an opportunity for a server to probe its partner server, before starting to service DHCP clients.

DISCUSSION:

Without the STARTUP state, a server would likely start in a state derived from its previously stored state (held in stable storage), if any. However, this may be inconsistent with the current state of the partner. The STARTUP state affords the opportunity for a server to potentially learn the partner's state and determine if that state is consistent with its derived starting state or whether some significant state change has occurred at the partner that forces the server to start in another state.

This is especially critical if significant time has elapsed while the server was down.

6.4.1. Operation while in STARTUP state

Whenever a server is in STARTUP state, it MUST be unresponsive to DHCP client requests, and so the time spent in the STARTUP state is necessarily short, typically on the order of a few seconds to a few tens of seconds.
The exact time spent in the STARTUP state is implementation dependent, and the primary and secondary server are not required to spend the same amount of time in the STARTUP state.

Whenever any message is sent to the partner while in STARTUP state the STARTUP bit MUST be set in the 'flags' field of the message header.

6.4.2. Transition out of STARTUP state

Each server starts out in STARTUP state every time it initializes itself, and performs the following algorithm as part of its initialization:

1. Ensure that the RESTART bit is set in the 'flags' field of the failover message header. Once set, the RESTART bit must remain set in all failover messages sent by the server to the partner until the first acknowledgment of a message is received from that partner. This is required to assure that the partner knows that the server has restarted, even if the partner itself is unreachable for a long while.

   Do not send any messages until step 5.

2. Is there any record in stable storage of a previous failover state? If yes, set previous-state to the last recorded state in stable storage, and continue with step 3.

   Is there any configuration information that indicates that this server was previously running but lost its stable storage? Such information must typically come from some administrative intervention, since it is difficult for a server to distinguish first startup from a startup after it has lost its stable storage. If yes, then set the previous-state to RECOVER, and set the time-of-failure to whatever time was configured, and go on to step 3. This time-of-failure will be used in the transition out of the RECOVER state into the RECOVER-DONE state, below.

   If there is no record of any previous failover state in stable storage nor of any previous operational activity for this server, then set the previous-state to RECOVER and set the time-of-failure to a time more than the maximum-client-lead-time before now.
   If using standard Posix times, 0 would typically do quite well.

3. Is the previous-state NORMAL? If yes, set the previous-state to COMMUNICATIONS-INTERRUPTED.

4. Start the STARTUP state timer. The time that a server remains in the STARTUP state (absent any communications with its partner) is implementation dependent (and would typically be configurable). It should be long enough to poll several times and stand a good chance to receive a response to at least one poll from a heavily loaded partner across a slow network.

5. Start sending DHCPPOLL messages (with both the RESTART and STARTUP bits set in the 'flags' field).

6. Wait for "communications okay", i.e., the receipt of a DHCPPRPL message. When a DHCPPRPL message is received, clear the RESTART flag, clear the STARTUP flag, and set the current state to the previous-state.

   If the partner is in PARTNER-DOWN state, and if its partner-down time (received in the DHCPPRPL message in the Absolute Time Option) is later than the last recorded time of operation of this server, then set the current state to RECOVER.

   Then, transition to the current state and take the "communications okay" state transition based on the current state of this server and the partner.

7. If the startup time expires, take an implementation dependent action: The server MAY go to the previous-state, or the server MAY wait.

   Reasons to go to previous-state and begin processing:

   If the current server is the only operational server, then if it waits, there will be no operational DHCP servers. This situation could occur very easily where one server fails and then the other crashes and reboots. If the rebooting server doesn't start processing DHCP client requests without first being in communication with the other server, then the level of DHCP redundancy is not particularly high.
   This is an appropriate approach if the possibility of partition is low, or if the safe period expiration time is well beyond the time at which an operator would notice and react to a partition situation. It is also quite appropriate if the safe period will never expire.

   Reasons to wait:

   If the current server has been down for longer than the maximum-client-lead-time, and it is partitioned from the other server, then when it returns it will attempt to use its own available addresses to allocate to new DHCP clients, and the other server may well be in PARTNER-DOWN state and may have already allocated some of those available addresses to DHCP clients.

   In cases where the possibility of partition is high, and the safe period expiration time is less than the likely operator reaction time, this is a good approach to use.

6.5. PARTNER-DOWN state

PARTNER-DOWN state is a state either server can enter. When in this state, the server does not assume that the other server could still be operating and servicing a different set of clients, but instead assumes that it is the only server operating. For this reason, only one server should be operating in this state at a time.

6.5.1. Upon Entry to PARTNER-DOWN state

When entering PARTNER-DOWN state a server MUST record the time of entry, and must transmit it in every DHCPPOLL or DHCPPRPL message sent while in PARTNER-DOWN state.

6.5.2. Operation while in PARTNER-DOWN state

A server in PARTNER-DOWN state MUST respond to DHCP client requests. It will allow renewal of all outstanding leases on IP addresses and will allocate IP addresses from its own pool. Once a fixed period of time (the MCLT interval) has elapsed from entry into PARTNER-DOWN state, it will also allocate IP addresses from the set of all available IP addresses.
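As an illustration only, the allocation rule in the preceding paragraph can be sketched as a small function. The names `Pool` and `allocatable_pools`, and the use of POSIX seconds for all times, are assumptions made for this example; they are not defined by this protocol.

```python
from enum import Enum


class Pool(Enum):
    """Hypothetical tag: which server an available address belonged to."""
    OWN = "own"
    PARTNER = "partner"


def allocatable_pools(now, partner_down_entry, mclt):
    """Return the pools usable for new allocations in PARTNER-DOWN state.

    `now` and `partner_down_entry` are assumed to be POSIX timestamps in
    seconds; `mclt` is the maximum-client-lead-time in seconds.
    """
    pools = {Pool.OWN}
    # Addresses outside the server's own pool stay off limits until the
    # MCLT interval has elapsed since entry into PARTNER-DOWN state.
    if now - partner_down_entry >= mclt:
        pools.add(Pool.PARTNER)
    return pools


entry = 1_000_000
print(sorted(p.name for p in allocatable_pools(entry + 60, entry, mclt=3600)))
print(sorted(p.name for p in allocatable_pools(entry + 3600, entry, mclt=3600)))
```

The sketch treats the interval as elapsed once exactly MCLT seconds have passed; an implementation could reasonably choose a stricter comparison.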
Once a server has entered NORMAL state, the PARTNER-DOWN state is entered only on command of an external agency (typically an administrator of some sort) or after the expiration of an externally configured minimum safe-time after the beginning of COMMUNICATIONS-INTERRUPTED state.

Any available IP address tagged as belonging to the other server (at entry to PARTNER-DOWN state) MUST NOT be used until the maximum-client-lead-time beyond the entry into PARTNER-DOWN state has elapsed.

A server in PARTNER-DOWN state MUST NOT allocate an IP address to a DHCP client different from that to which it was allocated at the entrance to PARTNER-DOWN state until the maximum-client-lead-time beyond its expiration time has elapsed. If this time would be earlier than the current time plus the maximum-client-lead-time, then the current time plus the maximum-client-lead-time is used.

Two options exist for lease times given out while in PARTNER-DOWN state, with different ramifications flowing from each.

If the server wishes the Failover protocol to protect it from loss of stable storage in PARTNER-DOWN state, then it should ensure that the MCLT based lease time restrictions in Section 5.1 are maintained, even in PARTNER-DOWN state.

If the server wishes to forgo the protection of the Failover protocol in the event of loss of stable storage, then it need recognize no restrictions on actual client lease times while in PARTNER-DOWN state.

A server in PARTNER-DOWN state MUST poll its partner and attempt to establish communications and synchronization.

While a server is in PARTNER-DOWN state, it MUST send the absolute time of entry into PARTNER-DOWN using the absolute time option in every DHCPPOLL and DHCPPRPL message sent.

6.5.3. Transitions out of PARTNER-DOWN state

When a server in PARTNER-DOWN state succeeds in contacting its partner, its actions are conditional on the state and flags received in the message from the other server.
If the STARTUP bit is set in the 'flags' field of a received DHCPPOLL message, the server in PARTNER-DOWN state will send a DHCPPRPL message with its current state (and with the absolute PARTNER-DOWN time in the DHCPPRPL). A server in PARTNER-DOWN state MUST NOT take any state transitions based on reestablishing communications if the STARTUP bit is set in the 'flags' field of the messages that reestablished communications.

If the STARTUP bit is not set in the 'flags' field, then a server in PARTNER-DOWN state will move into POTENTIAL-CONFLICT state if the other server is in the NORMAL, COMMUNICATIONS-INTERRUPTED, PARTNER-DOWN, or POTENTIAL-CONFLICT state.

If the STARTUP bit is not set in the 'flags' field, then a server in PARTNER-DOWN state will stay in PARTNER-DOWN state if it detects that the other server is in RECOVER state.

If the STARTUP bit is not set in the 'flags' field, then a server in PARTNER-DOWN state moves into NORMAL state if it detects that the other server is in RECOVER-DONE state.

6.6. RECOVER state

This state indicates that the server has no information in its stable storage or that it is re-integrating with a server in PARTNER-DOWN state after it has been down. A server in this state will attempt to refresh its stable storage from the other server.

6.6.1. Operation in RECOVER state

A server in RECOVER state MUST NOT respond to DHCP client requests.

A server in RECOVER state will attempt to reestablish communications with the other server.

6.6.2. Transitions out of RECOVER state

If the other server is in POTENTIAL-CONFLICT state when communications are reestablished, then the server in RECOVER state will move to POTENTIAL-CONFLICT state itself.

If the other server is in RECOVER state, then this server SHOULD signal an error and halt processing.
If the other server is in any other state, then the server in RECOVER state will request an update of missing binding information. If the server has been configured to indicate that it has lost its stable storage, it will send an UPDATEREQALL message; otherwise it will send an UPDATEREQ message.

It will wait for an UPDATEDONE message, and upon receipt of that message it will start a timer whose expiration is set to a time equal to the time the server went down (if known) or the current time (if the down-time is unknown) plus the maximum-client-lead-time. When this timer goes off, the server will go into RECOVER-DONE state.

This is to allow any clients holding IP addresses allocated by this server prior to the loss of its client binding information in stable storage either to contact the other server or to have their leases time out. See Figure 6.6-1.

   DISCUSSION:

      The actual requirement on this wait period in RECOVER is that it start when the recovering server went down, not necessarily when it came back up. If the time when the recovering server failed is known, then it could be communicated to the recovering server, and the wait period could be reduced to the maximum-client-lead-time less the difference between the current time and the time the server failed. In this way, the waiting period could be minimized.

If an UPDATEDONE message isn't received within an implementation dependent amount of time, and no DHCPBNDUPD messages are being received, then the UPDATEREQ(ALL) message will be re-transmitted.

          A                                 B
       Server                            Server

          |                                 |
       RECOVER                        PARTNER-DOWN
          |                                 |
          | >--DHCPUPDATEREQ------------->  |
          |                                 |
          | <-----------------DHCPBNDUPD--< |
          | >--DHCPBNDACK---------------->  |
         ...                               ...
          |                                 |
          | <-----------------DHCPBNDUPD--< |
          | >--DHCPBNDACK---------------->  |
          |                                 |
          | <-------------DHCPUPDATEDONE--< |
          |                                 |
Wait MCLT from last known                   |
  time of operation                         |
          |                                 |
    RECOVER-DONE                            |
          |                                 |
          | >--DHCPPOLL-(RECOVER-DONE)--->  |
          | <-------------------DHCPPRPL--< |
          |                                 |
          |                              NORMAL
          |                                 |
          | <----------(NORMAL)-DHCPPOLL--< |
          | >--DHCPPRPL------------------>  |
          |                                 |
       NORMAL                               |
          |                                 |
          |                                 |

          Figure 6.6-1: Transition out of RECOVER state

6.7. NORMAL state

NORMAL state is the state used by a server when it can communicate with the other server. When in this state, the primary responds to all DHCP client requests, while the secondary only responds to renewal or rebinding requests which it receives. This is one of the few states where the operation of the primary and secondary servers is quite different.

6.7.1. Upon Entry to NORMAL state

When entering NORMAL state, a server will send to the other server all currently unacknowledged DHCPBNDUPD messages.

When the above process is complete, if the server entering NORMAL state is a secondary server, then it will request IP addresses for allocation using the DHCPPOOLREQ message and the techniques described in section 2.5.

6.7.2. Operation in NORMAL state: Primary Server

When in NORMAL state, the primary server takes the following actions to implement the Failover protocol:

o Lease Time Calculations

  As discussed in section 5.1, "Control of lease time", the lease interval given to a DHCP client can never be more than the maximum-client-lead-time greater than the acknowledged partner-server-lease-interval. As long as the primary server adheres to this constraint, the specifics of the lease intervals that it gives to either the DHCP client or the secondary DHCP server are implementation dependent. One possible approach is shown in section 5.1, but that particular approach is in no way required by this protocol.
o Lazy Update of Secondary Server

  After an ACK of an IP address binding, the primary server attempts to update the secondary with the binding information. The lease time used in the update of the secondary MUST be at least that given to the DHCP client in the DHCPACK. It MAY, however, be longer.

o Reallocation of IP Addresses Between Clients

  Whenever a client binding is released, a DHCPBNDUPD message must be sent to the secondary server, setting the binding state to RELEASED. However, until a DHCPBNDACK is received for this message, the IP address cannot be allocated to another client. It can be allocated to the same client again.

6.7.3. Operation in NORMAL state: Secondary Server

In NORMAL state, the secondary server receives binding updates from the primary server in DHCPBNDUPD messages. It records these in its client binding database in stable storage and then sends the corresponding DHCPBNDACK message to the primary server. It MUST ensure that the information is recorded in stable storage prior to sending the DHCPBNDACK message back to the primary server.

While in NORMAL state, the secondary server MUST also acquire a series of IP addresses from the primary server to be used to satisfy DHCPDISCOVER requests from DHCP clients when in COMMUNICATIONS-INTERRUPTED state. See section 2.5 for details of this acquisition process.

The secondary server periodically polls the primary server with the DHCPPOLL message. If it fails to receive a DHCPPRPL message in reply after a configured number of retries or some administratively determined time, the secondary server transitions into COMMUNICATIONS-INTERRUPTED state. Both the DHCPPOLL and DHCPPRPL messages carry the current state of the sender.

When in NORMAL state, a secondary server is responsive to DHCP client requests if they are RENEWAL or REBINDING.
Any changes it makes to any leases based on these responses should be sent to the primary server using DHCPBNDUPD messages.

6.7.4. Transitions out of NORMAL state

If an external command is received by a server in NORMAL state informing it that its partner is down, then it will transition into PARTNER-DOWN state.

If a server in NORMAL state fails to receive acks to any messages sent to its partner for an implementation dependent period of time, it will move into COMMUNICATIONS-INTERRUPTED state. (See section 6.2.)

If a server in NORMAL state receives any messages from its partner where the partner has changed state from that expected by the server in NORMAL state, then the server should transition into COMMUNICATIONS-INTERRUPTED state and take the appropriate state transition from there. For example, it would be expected for the partner to transition from POTENTIAL-CONFLICT into NORMAL state, but not for the partner to transition from NORMAL into POTENTIAL-CONFLICT state.

6.8. COMMUNICATIONS-INTERRUPTED State

A server goes into COMMUNICATIONS-INTERRUPTED state whenever it is unable to communicate with the other server. Primary and secondary servers cycle automatically (without administrative intervention) between NORMAL and COMMUNICATIONS-INTERRUPTED state as the network connection between them fails and recovers, or as the partner server cycles between operational and non-operational. No duplicate IP address allocation can occur while the servers cycle between these states.

6.8.1. Upon Entry to COMMUNICATIONS-INTERRUPTED state

When a server enters COMMUNICATIONS-INTERRUPTED state, if it has been configured to support an automatic transition out of COMMUNICATIONS-INTERRUPTED state and into PARTNER-DOWN state, then a timer MUST be started for an implementation dependent period.

It is anticipated that some alarm condition would be raised upon the transition from NORMAL state to COMMUNICATIONS-INTERRUPTED state.

6.8.2. Operation in COMMUNICATIONS-INTERRUPTED State

In this state a server may respond to DHCP client requests. When allocating new IP addresses, each server allocates from its own IP address pool. When responding to renewal requests, each server will allow continued renewal of a DHCP client's current lease on an IP address, although the renewal period MUST NOT exceed the maximum-client-lead-time (MCLT) beyond the lease time already acknowledged by the other server.

A server operates in COMMUNICATIONS-INTERRUPTED state as the primary server does in NORMAL state. However, since the server cannot communicate with its partner in this state, the acknowledged-partner-lease-time will not be updated in any new bindings. This is likely to eventually cause the actual-client-lease-times to be the current-time plus the maximum-client-lead-time (unless this is greater than the desired-client-lease-time).

6.8.3. Transition out of COMMUNICATIONS-INTERRUPTED State

If the safe period timer expires while a server is in the COMMUNICATIONS-INTERRUPTED state, it will go immediately into PARTNER-DOWN state.

If an external command is received by a server in COMMUNICATIONS-INTERRUPTED state informing it that its partner is down, it will go immediately into PARTNER-DOWN state.

If communications are restored with the other server, then the server in COMMUNICATIONS-INTERRUPTED state will go into another state based on the state of the partner:

o partner in NORMAL or COMMUNICATIONS-INTERRUPTED

  The server will transition into the NORMAL state.

o partner in RECOVER

  Stay in COMMUNICATIONS-INTERRUPTED state.

o partner in RECOVER-DONE

  Transition into NORMAL state.

o partner in PARTNER-DOWN or POTENTIAL-CONFLICT

  Transition into POTENTIAL-CONFLICT state.

o partner in PAUSED

  Stay in COMMUNICATIONS-INTERRUPTED state.

o partner in SHUTDOWN

  Transition into PARTNER-DOWN state.
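The partner-state cases listed above amount to a simple lookup table. The following is a minimal sketch of that table, covering only the case where communications are restored; the names `COMM_INTERRUPTED_TRANSITIONS` and `next_state` are hypothetical, and raising an error for a partner state that the list does not mention is an assumption of this example rather than a requirement of the protocol.

```python
# Partner state reported when communications are restored, mapped to the
# next state of a server currently in COMMUNICATIONS-INTERRUPTED state.
COMM_INTERRUPTED_TRANSITIONS = {
    "NORMAL": "NORMAL",
    "COMMUNICATIONS-INTERRUPTED": "NORMAL",
    "RECOVER": "COMMUNICATIONS-INTERRUPTED",
    "RECOVER-DONE": "NORMAL",
    "PARTNER-DOWN": "POTENTIAL-CONFLICT",
    "POTENTIAL-CONFLICT": "POTENTIAL-CONFLICT",
    "PAUSED": "COMMUNICATIONS-INTERRUPTED",
    "SHUTDOWN": "PARTNER-DOWN",
}


def next_state(partner_state):
    """Return the state to enter, given the partner's reported state."""
    if partner_state not in COMM_INTERRUPTED_TRANSITIONS:
        # Assumption: a partner state not covered by the list is an error.
        raise ValueError(f"no transition listed for {partner_state!r}")
    return COMM_INTERRUPTED_TRANSITIONS[partner_state]


print(next_state("RECOVER-DONE"))  # NORMAL
print(next_state("SHUTDOWN"))      # PARTNER-DOWN
```

Expiration of the safe period timer and an external "partner is down" command are handled separately, since both lead to PARTNER-DOWN state regardless of the partner's reported state.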
       Primary                                 Secondary
       Server                                   Server

       NORMAL                                   NORMAL
          | >--DHCPPOLL----->:                     |
          |                  :<--------DHCPPOLL--< |
          |                  :                     |
   COMMUNICATIONS            :              COMMUNICATIONS
     INTERRUPTED             :                INTERRUPTED
          |                  :                     |
          | >--DHCPPOLL------------------>         |
          |        <-------------------DHCPPRPL--< |
       NORMAL                                      |
          |                                        |
          | >--DHCPBNDUPD---------------->         |
          |        <-----------------DHCPBNDACK--< |
          |                                        |
          |        <-------------------DHCPPOLL--< |
          | >--DHCPPRPL------------------>         |
          |                                     NORMAL
          |                                        |
          |        <-----------------DHCPBNDUPD--< |
          | >--DHCPBNDACK---------------->         |
         ...                                      ...
          |                                        |
          |        <----------------DHCPPOOLREQ--< |
          | >--DHCPPOOLRESP-(2)---------->         |
          |                                        |
          | >--DHCPBNDUPD-(#1)----------->         |
          |        <-----------------DHCPBNDACK--< |
          |                                        |
          |        <----------------DHCPPOOLREQ--< |
          | >--DHCPPOOLRESP-(0)---------->         |
          |                                        |
          | >--DHCPBNDUPD-(#2)----------->         |
          |        <-----------------DHCPBNDACK--< |
          |                                        |

   Figure 6.8-1: Transition from NORMAL to COMMUNICATIONS-INTERRUPTED
                 and back (example with 2 addresses allocated to
                 secondary)

6.9. POTENTIAL-CONFLICT state

This state indicates that the two servers are attempting to reintegrate with each other, but at least one of them was running in a state that did not guarantee automatic reintegration would be possible. In POTENTIAL-CONFLICT state the servers may determine that the same IP address has been offered and accepted by two different DHCP clients.

It is a goal of this protocol to minimize the possibility that POTENTIAL-CONFLICT state is ever entered.

6.9.1. Upon Entry to POTENTIAL-CONFLICT

When a primary server enters POTENTIAL-CONFLICT state it should request that the secondary send it all updates of which it is currently unaware by sending an UPDATEREQ message to the secondary server.

A secondary server entering POTENTIAL-CONFLICT state will wait for the primary to send it an UPDATEREQ message.

6.9.2. Operation in POTENTIAL-CONFLICT state

Any server in POTENTIAL-CONFLICT state MUST be unresponsive to incoming DHCP requests.

6.9.3. Transitions out of POTENTIAL-CONFLICT state

If communications fail with the partner while in POTENTIAL-CONFLICT state, then a primary server will transition to PARTNER-DOWN state and a secondary server will stay in POTENTIAL-CONFLICT state.

Whenever either server receives an UPDATEDONE message from its partner, it MUST transition to NORMAL state. This will cause the primary server to leave POTENTIAL-CONFLICT state prior to the secondary, since the primary sends an UPDATEREQ message and receives an UPDATEDONE before the secondary sends an UPDATEREQ message and receives its UPDATEDONE message.

When a secondary server receives an indication that the primary server has transitioned from POTENTIAL-CONFLICT to NORMAL state, it SHOULD send an UPDATEREQ message to the primary server.

       Primary                          Secondary
       Server                            Server

          |                                 |
 POTENTIAL-CONFLICT                POTENTIAL-CONFLICT
          |                                 |
          | >--DHCPUPDATEREQ------------->  |
          |                                 |
          | <-----------------DHCPBNDUPD--< |
          | >--DHCPBNDACK---------------->  |
         ...                               ...
          |                                 |
          | <-----------------DHCPBNDUPD--< |
          | >--DHCPBNDACK---------------->  |
          |                                 |
          | <-------------DHCPUPDATEDONE--< |
       NORMAL                               |
          | >--DHCPPOLL--(NORMAL)-------->  |
          | <-------------------DHCPPRPL--< |
          |                                 |
          | <--------------DHCPUPDATEREQ--< |
          |                                 |
          | >--DHCPBNDUPD---------------->  |
          | <-----------------DHCPBNDACK--< |
         ...                               ...
          |                                 |
          | >--DHCPBNDUPD---------------->  |
          | <-----------------DHCPBNDACK--< |
          |                                 |
          | >--DHCPUPDATEDONE------------>  |
          |                                 |
          |                              NORMAL
          |                                 |
          | <----------------DHCPPOOLREQ--< |
          | >--DHCPPOOLRESP-------------->  |
          |                                 |

       Figure 6.9-1: Transition out of POTENTIAL-CONFLICT

6.10. RECOVER-DONE state

This state exists to allow an interlocked transition for one server from RECOVER state and another server from PARTNER-DOWN or COMMUNICATIONS-INTERRUPTED state into NORMAL state.

6.10.1. Operation in RECOVER-DONE state

A server in RECOVER-DONE state is responsive only to RENEWAL and REBINDING DHCP messages.

6.10.2. Transitions out of RECOVER-DONE state

When a server in RECOVER-DONE state determines that its partner server has entered NORMAL state, then it will transition into NORMAL state as well.

6.11. PAUSED state

This state exists to allow one server to inform another that it will be out of service for what is predicted to be a relatively short time, and to allow the other server to transition to COMMUNICATIONS-INTERRUPTED state immediately and (if it is a secondary server) to begin servicing clients with no interruption.

A server which is aware that it is shutting down temporarily SHOULD send one or more DHCPPOLL messages with the 'state' field containing PAUSED.

While a server may or may not transition internally into PAUSED state, the 'previous' state determined when it is restarted MUST be the state the server was in prior to receiving the command to shut down and restart and its entry into the PAUSED state.

6.11.1. Upon entry to PAUSED state

When entering PAUSED state, the server MUST remember the previous state, and use that state as the previous state when it is restarted.

6.11.2. Transitions out of PAUSED state

A server transitions out of PAUSED state by being restarted. At that time, the previous state MUST be the state the server was in prior to entering the PAUSED state.

6.12. SHUTDOWN state

This state exists to allow one server to inform another that it will be out of service for what is predicted to be a relatively long time, and to allow the other server to transition immediately to PARTNER-DOWN state, and take over completely for the server going down.

A server which is aware that it is shutting down SHOULD send one or more DHCPPOLL messages with the 'state' field containing SHUTDOWN.
While a server may or may not transition internally into SHUTDOWN state, the 'previous' state determined when it is restarted MUST be the state active prior to the command to shut down unless the server detects that its partner has moved to PARTNER-DOWN, in which case it MUST be RECOVER.

6.12.1. Upon entry to SHUTDOWN state

When entering SHUTDOWN state, the server MUST record the previous state in stable storage for use when the server is restarted. It also MUST record the current time as the last time operational.

A DHCPPOLL message SHOULD be sent to the partner with the 'state' field containing SHUTDOWN state.

6.12.2. Operation in SHUTDOWN state

A server in SHUTDOWN state MUST be unresponsive to DHCP client input.

If a server receives any message indicating that the partner has moved to PARTNER-DOWN state while it is in SHUTDOWN state (e.g., in response to the DHCPPOLL it sent containing SHUTDOWN state), then it MUST record RECOVER state as the previous state to be used when it is restarted.

A server SHOULD wait for a few seconds after informing the partner of entry into SHUTDOWN state (if communications are okay) to determine if the partner will enter PARTNER-DOWN state.

6.12.3. Transitions out of SHUTDOWN state

A server transitions out of SHUTDOWN state by being restarted.

7. Safe Period

Due to the restrictions imposed on each server while in COMMUNICATIONS-INTERRUPTED state, long-term operation in this state is not feasible for either server. One reason that these states exist at all is to allow the servers to easily survive transient network communications failures of a few minutes to a few days (although the actual time periods will depend a great deal on the DHCP activity of the network in terms of arrival and departure of DHCP clients on the network).

Eventually, when the servers are unable to communicate, they will have to move into a state where they no longer can re-integrate without some possibility of a duplicate IP address allocation.
There are two ways that they can move into this state (known as PARTNER-DOWN).

They can be informed by external command that, indeed, the partner server is down. In this case, there is no difficulty in moving into the PARTNER-DOWN state since it is an accurate reflection of reality and the protocol has been designed to operate correctly (even during reintegration) if, when in PARTNER-DOWN state, the partner is, indeed, down.

The more difficult scenario is when the servers are running unattended for extended periods, and in this case an option is provided to configure something called a "safe-period" into each server. This OPTIONAL safe-period is the period after which either the primary or secondary server will automatically transition to PARTNER-DOWN from COMMUNICATIONS-INTERRUPTED state. If this transition is completed and the partner is not down, then the possibility of duplicate IP address allocations will exist.

The goal of the "safe-period" is to allow network operations staff some time to react to a server moving into COMMUNICATIONS-INTERRUPTED state. During the safe-period the only requirement is that the network operations staff determine if both servers are still running -- and if they are, to either fix the network communications failure between them, or to take one of the servers down before the expiration of the safe-period.

The length of the safe-period is installation dependent, and depends in large part on the number of unallocated IP addresses within the subnet address pool and the expected frequency of arrival of previously unknown DHCP clients requiring IP addresses. Many environments should be able to support safe-periods of several days.

During this safe period, either server will allow renewals from any existing client. The only limitation concerns the need for IP addresses for the DHCP server to hand out to new DHCP clients and the need to re-allocate IP addresses to different DHCP clients.
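As a rough illustration of that limitation, the following sketch estimates how many spare addresses a server would need in order to last through a safe period. The function name, the parameter names, and the example figures are assumptions made for illustration; the estimate depends only on the arrival rate of new clients and the length of the safe period, not on the number of outstanding leases.

```python
import math


def extra_addresses_needed(new_clients_per_hour, safe_period_hours):
    """Expected number of new DHCP clients arriving during the safe period.

    Rounded up, since a fraction of an address cannot be allocated.
    """
    return math.ceil(new_clients_per_hour * safe_period_hours)


# Hypothetical site: 2 previously unknown clients per hour, 3-day safe period.
print(extra_addresses_needed(2, 72))  # 144
```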
The number of "extra" IP addresses required is equal to the expected total number of new DHCP clients encountered during the safe period. This is dependent only on the arrival rate of new DHCP clients, not the total number of outstanding leases on IP addresses.

In the unlikely event that a relatively short safe period of an hour is all that can be used (given a dearth of IP addresses or a very high arrival rate of new DHCP clients), even that can provide substantial benefits in allowing the DHCP subsystem to ride through minor problems that could occur and be fixed within that hour. In these cases, no possibility of duplicate IP address allocation exists, and re-integration after the failure is solved will be automatic and require no operator intervention.

8. Security

The Failover protocol MAY be secured with a simple shared secret message digest which covers each message. Since there are a number of configuration parameters that must be the same on each server in a pair, it is not unreasonable to require a shared secret be configured as well.

Only information within the packet and covered by the message digest is used for operation of the protocol. It is for this reason that the IP address of the sending server is sent in the 'sending server id' field of the fixed header of the failover message when it might seem that the same information could be recovered from the source address of the IP packet.

9. Extended Discussion

Some areas in the draft above warranted more extended discussion than was feasible to insert directly into the text.

1. UDP or TCP

   There has been debate about the utility of using UDP for the Failover protocol, since it doesn't supply guaranteed delivery.
   UDP has been chosen as the transport for the Failover protocol due to the following factors:

   First, it is important to recognize that mere receipt of a packet by the other server in the pair (e.g., receipt of a DHCPBNDUPD packet by the secondary server) is not sufficient for the primary to update its own bindings database with new information about what the secondary knows. In all cases of transfers of binding information, the receiver of a DHCPBNDUPD message MUST update its own stable storage prior to replying with a DHCPBNDACK message (except in the marginal case where all of the updates are rejected). An action is required by the receiving server and an explicit ACK is needed by the sending server to ensure the integrity of the protocol. So, just knowing that the other server has received a Failover protocol packet is not intrinsically interesting.

   Second, the DHCP protocol, both the client and server side, is being implemented in progressively smaller and smaller machines. While this progression is most evident in DHCP clients, there exist implementations today of DHCP servers embedded in devices that are by no stretch of the imagination traditional "servers" running mainstream operating systems. In many ways, the Failover protocol is very well suited to such devices. Adding additional protocol infrastructure requirements to implement the Failover protocol might prevent its implementation in devices that in some ways need it most (devices with limited stable storage of their own).

   Third, there are only a few cases where the Failover protocol requires guaranteed delivery of packets. In particular, the normal Primary to Secondary DHCPBNDUPD messages do not have to be delivered reliably. The consequences of lost DHCPBNDUPD messages are handled by the use of the MCLT, for the simple reason that since these messages are "lazy", they may not get delivered because of a server failover prior to their transmission.
   The protocol is robust in the face of loss of either a DHCPBNDUPD message or a DHCPBNDACK message.

   Furthermore, a technique known as "fire and forget" may be used with this protocol and two cooperating implementations. If the DHCPBNDACK message contains all of the information originally in the DHCPBNDUPD message, then the DHCPBNDUPD message may be transmitted and forgotten by the sending server (typically the primary). When and if the secondary receives the DHCPBNDUPD and replies with a DHCPBNDACK message and the primary receives it, the primary will update its stable storage with a new picture of what the secondary knows about the lease time. If either of these messages is lost, the only downside is that the DHCP client associated with the binding in question may receive a shorter lease for one lease period than it would otherwise. This "fire and forget" technique could substantially ease both the complexity of implementation and memory requirements of an implementation of the Failover protocol, especially where two servers were communicating over a very slow link.

10. Acknowledgments

Ralph Droms started it all, by sketching out an initial interserver draft that embodied ideas from several past IETF meetings. In that draft, he acknowledged contributions by Jeff Mogul, Greg Minshall, Rob Stevens, Walt Wimer, Ted Lemon, and the DHC working group.

Kim Kinnear and Bob Cole each extended that draft, separately and then together, until they created an interserver draft that supported any number of servers. The complexity of that approach was just too great, and that draft wasn't greeted with enthusiasm by many, including its authors.

It did however lead to a much simpler approach embodied in the first Failover draft by Greg Rabil, Mike Dooley, Arun Kapur and Ralph Droms. This draft posited only two servers -- a primary and a secondary.
Kim Kinnear then wrote the Safe Failover draft to layer on top of the Failover Draft and increase its robustness in the face of certain rare network failures.

At the spring 1998 IETF meeting in LA, the DHC working group said that they wanted a merged Failover and Safe Failover draft. Steve Gonczi and Bernie Volz stepped up and produced the raw material for such a merged draft, along with a new message format designed around DHCP options and other extensions and clarifications. Kim Kinnear edited their work into draft format and made other changes in time for the Summer Chicago IETF meeting.

During the summer and fall of 1998, two groups have been working on separate implementations of the evolving draft. Bernie Volz and Steve Gonczi constitute one group, and Kim Kinnear, Mark Stapp and Paul Fox make up the other. These two groups have worked together to produce considerable changes and simplifications of the protocol during this period, and Steve Gonczi and Kim Kinnear have edited these changes into this latest revision in time for submission to the December 1998 Orlando IETF meeting.

These most recent changes have been reviewed by Ralph Droms, Greg Rabil, Bernie Volz, Steve Gonczi, Mark Stapp, Paul Fox, and Kim Kinnear. This does not preclude any of these people from expressing disagreement with what is contained in this draft at any future time.

Many people have reviewed the various earlier drafts that went into this result. At American Internet, ideas were contributed by Brad Parker. At Cisco Systems, Paul Fox and Ellen Garvey have contributed greatly to the form of the protocol. Glenn Waters of Bay Networks contributed ideas and enthusiasm to make a Failover protocol that was both "safe" and "lazy".

11. References

[RFC 2131] Droms, R., "Dynamic Host Configuration Protocol", RFC 2131, March 1997.

[RFC 2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", RFC 2119, March 1997.
[RFC 2132] Alexander, S., Droms, R., "DHCP Options and BOOTP Vendor Extensions", RFC 2132, March 1997.

12. Authors' information

Ralph Droms
323 Dana Engineering
Bucknell University
Lewisburg, PA 17837
Phone: (717) 524-1145
EMail: droms@bucknell.edu

Greg Rabil, Mike Dooley, Arun Kapur
Lucent Technologies (Quadritek)
10 Valley Stream Parkway, Suite 240
Malvern, PA 19355
Phone: (800) 208-2747
EMail: grabil@lucent.com
       mdooley@lucent.com
       akapur@lucent.com

Kim Kinnear
Mark Stapp
Cisco Systems
250 Apollo Drive
Chelmsford, MA 01824
Phone: (978) 244-8000
EMail: kkinnear@cisco.com
       mjs@cisco.com

Steve Gonczi, Bernie Volz
Process Software Corporation
959 Concord St.
Framingham, MA 01701
Phone: (508) 879-6994
EMail: gonczi@process.com
       volz@process.com