Network Working Group                           Ralph Droms
INTERNET DRAFT                                  Bucknell University
                                                Greg Rabil
                                                Mike Dooley
                                                Arun Kapur
                                                Quadritek Systems
                                                Kim Kinnear
                                                American Internet
                                                Steve Gonczi
                                                Bernie Volz
                                                Process Software
                                                August 1998
                                                Expires March 1999

                         DHCP Failover Protocol

Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as ``work in progress.''

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts
   Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ftp.ietf.org (US East Coast), or
   ftp.isi.edu (US West Coast).

Abstract

   DHCP [RFC 2131] allows for multiple servers to be operating on a
   single network. Some sites are interested in running multiple
   servers in such a way as to provide redundancy in case of server
   failure. In order for this to work reliably, the cooperating Primary
   and Secondary servers must maintain a consistent database of the
   lease information. This implies that servers will need to coordinate
   any and all lease activity so that this information is synchronized
   in case of failover.

   This document defines a protocol to provide this synchronization
   between two servers. One server is designated the "Primary" server,
   the other is the "Secondary" server. Additionally, this document
   describes a protocol for the automatic transfer of control from the
   Primary to the Secondary in the case of failure (failover), as well
   as in the case of a network partition.

Droms, et. al.                                                  [Page 1]

DRAFT                                                       January 1998
   This document is a merge of draft-ietf-dhc-failover-01.txt and
   draft-ietf-dhc-safe-failover-proto-00.txt, along with substantial
   changes to each. Unfortunately, this merge was not completed with
   sufficient time to allow review by any of the authors of
   draft-ietf-dhc-failover-01.txt, and so it may well not reflect their
   views even though their names appear as authors. See Section 11,
   issue #1 and Section 12 for more details.

1. Introduction

   As the use of DHCP servers in networked environments grows, the
   dependency of those networks on the DHCP server increases. This is
   particularly true of the hosts that receive their configuration
   information from the DHCP server. Therefore, it is very important to
   be able to provide reliable, continuous availability of DHCP
   services.

   This specification describes a protocol to support automatic
   failover from a primary to its secondary server. The failover
   mechanism allows the secondary server to perform DHCP actions while
   the primary is down, or when a network failure prevents the primary
   and secondary from communicating. The protocol also specifies how
   reintegration is achieved when the primary again becomes operational
   or when the primary and secondary can again communicate.

   In providing the specification for the failover, the protocol
   specifies how to guarantee reliable delivery of changes to the
   secondary. This is required to synchronize the secondary's lease
   data with that of the primary.

   The protocol further specifies a mechanism to allow the secondary to
   determine if it can communicate with the primary server. The
   secondary will automatically begin to service DHCP requests whenever
   it cannot communicate with the primary. When the primary server
   becomes available again, the secondary will convey any changes that
   occurred since the time of failover back to the primary.

   Through careful control of the difference between the lease times
   offered to DHCP clients and the lease time known by the secondary
   server, the protocol allows the primary to communicate with the
   secondary after the primary has completed communication with the
   DHCP client (a technique known as "lazy" update) and still guarantee
   that duplicate IP address allocations do not occur. Thus, the
   protocol does not directly impact the ability of a DHCP server to
   respond to DHCP client requests.

1.1. Requirements Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC 2119].

1.2. DHCP Terminology

   This document uses the following terms:

      o "DHCP client" or "client"

        A DHCP client is an Internet host using DHCP to obtain
        configuration parameters such as a network address.

      o "DHCP server" or "server"

        A DHCP server is an Internet host that returns configuration
        parameters to DHCP clients.

      o "binding"

        A binding is a collection of configuration parameters,
        including at least an IP address, associated with or "bound to"
        a DHCP client. Bindings are managed by DHCP servers.

      o "binding database"

        The collection of bindings managed by a primary and secondary.

      o "subnet address pool"

        A subnet address pool is the set of IP addresses that is
        associated with a particular network number and subnet mask. In
        the simple case, there is a single network number and subnet
        mask and a set of IP addresses. In the more complex case
        (sometimes called "secondary subnets", sometimes
        "superscopes"), several (apparently unrelated) network number
        and subnet mask combinations with their associated IP addresses
        may all be configured together into one subnet address pool.

      o "primary server" or "primary"

        A DHCP server configured to provide primary service to a set of
        DHCP clients for a particular set of subnet address pools.

      o "secondary server" or "secondary"

        A DHCP server configured to act as backup to a primary server
        for a particular set of subnet address pools.

      o "stable storage"

        Every DHCP server is assumed to have some form of what is
        called "stable storage". Stable storage is used to hold
        information concerning IP address bindings (among other things)
        so that this information is not lost in the event of a server
        failure which requires restart of the server.

1.3. Requirements for this protocol

   The following requirements must be (and are) met by this protocol.

      1. Implementations of this protocol must work with existing DHCP
         client implementations based on the DHCP protocol [1].

      2. Implementations of the protocol must work with existing BOOTP
         relay implementations.

      3. The protocol must provide failover redundancy between servers
         that are not located on the same subnet.

1.4. Goals for this protocol

      1. Provide for continued service to DHCP clients through an
         automated mechanism in the event of failure of the Primary
         Server.

      2. Avoid binding an IP address to a client while that binding is
         currently valid for another client. In other words, don't
         allocate the same IP address to two clients.

      3. Minimize any need for manual administrative intervention.

      4. Introduce no additional delays in server response time as a
         result of inter-server communication.

      5. Share IP address ranges between primary and secondary servers;
         i.e., impose no requirement that the pool of available
         addresses be divided between servers.

      6. Continue to meet the goals and objectives of this protocol in
         the event of server failure or network partition.

      7. Provide graceful reintegration of full protocol service after
         server failure or network partition.

      8. Allow for one computer to act as a Secondary Server for
         multiple Primary Servers. Other topologies (e.g., mesh) are
         also possible.
         Primary and Secondary Servers SHOULD be viewed as "logical"
         servers and not necessarily physical computers.

      9. Ensure that an existing client can keep its existing IP
         address binding if it can communicate with either the Primary
         or Secondary DHCP server implementing this protocol, not just
         the server that originally offered it the binding.

      10. Ensure that a new client can get an IP address from some
          server. Ensure that in the face of partition, where servers
          continue to run but cannot communicate with each other, the
          above goals and requirements may be met. In addition, when
          the partition condition is removed, allow graceful automatic
          re-integration without requiring human intervention.

      11. If either Primary or Secondary Server loses all of the
          information that it has stored in stable storage, it should
          be able to refresh its stable storage from the other server.

1.5. Limitations of this Protocol

   The following are explicit limitations of this protocol.

      1. Under normal operation, only one server at a time will service
         DHCP client requests; this protocol provides reliability
         through redundancy but not load balancing.

      2. This protocol provides only one level of redundancy through a
         single Secondary Server for each Primary Server.

      3. The protocol provides a way to detect when the primary and
         secondary server cannot communicate, but once this condition
         has been detected, does not (indeed, cannot) provide any way
         to further distinguish between network failure and failure of
         one of the servers.

      4. A small number of IP addresses are reserved for Secondary
         Server use.

         In order to handle the failure case where both servers are
         able to communicate with DHCP clients, but unable to
         communicate with each other, a small number of IP addresses
         must be set aside as a private address pool for the Secondary
         Server. The Secondary can use these to service newly arrived
         DHCP clients during such a period.
         The size of this private pool SHOULD be based only on the
         arrival rate of new DHCP clients and the length of expected
         downtime, and is not influenced in any way by the total number
         of DHCP clients supported by the server pair.

      5. The Primary and Secondary Servers SHOULD pause normal DHCP
         transaction processing while resynchronizing after a system
         failure.

2. Protocol Operations

   The protocol messages needed to provide redundant/failover servers
   can be grouped into three areas:

      o Messages to keep the Secondary Server's lease data synchronized
        with that of the Primary so that when failover occurs, there is
        no degradation of service.

      o Messages that allow the Secondary to determine the operational
        state of the Primary, so as to know when to start servicing
        DHCP traffic.

      o Messages that are used to coordinate the Primary regaining
        control when it has become available again.

2.1. Time synchronization between communicating servers

   Each Binding update message carries a "sent time stamp" (the time,
   in GMT, when the message was sent). This provides a simple mechanism
   to determine any "time drift" between communicating servers.

   DISCUSSION:

      If a UDP packet is successfully transmitted (i.e., it is not
      lost), the packet travel time is negligible in the framework of
      DHCP leases. By providing a GMT "sent time" stamp, the recipient
      can compare this with its notion of the current GMT time at the
      time it receives the packet. The difference (plus the packet
      travel time, which we ignore) is the time drift. The recipient
      can use this time drift value to bias all "absolute time" values
      it receives from the sender.

2.2. Failover Protocol Messages

   The Failover Protocol messages are encoded using a packet format
   specific to the Failover Protocol. To allow easy recognition of
   Failover Protocol messages, BOOTP packet "op" field values 3..14 are
   proposed to mark the various Failover Protocol messages.
   A Failover Protocol message is always unicast from the source to the
   destination. The sender, and never the recipient, is responsible for
   reliable retransmission.

2.3. Failover Protocol packet header format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     op (1)    |    rev (1)    |      payload offset (2)       |
   +---------------+---------------+---------------+---------------+
   |                            xid (4)                            |
   +---------------------------------------------------------------+
   |         0 or more additional header bytes (variable)          |
   +---------------------------------------------------------------+
   |         Payload data, formatted as DHCP-style options         |
   |         (although using a unique option number space)         |
   |                          (variable)                           |
   +---------------------------------------------------------------+

   op - 1 byte

      These values extend the number space of the existing BOOTP
      message type "Op" field. The following types are defined:

          3   DHCPPOOLREQ
          4   DHCPPOOLRESP
          5   DHCPBNDUPD
          6   DHCPBNDACK
          7   DHCPPOLL
          8   DHCPPRPL
          9   DHCPCTLREQ
         10   DHCPCTLRET
         11   DHCPCTLACK
         12   DHCPCTLACKACK
         13   DHCPREQUEREQ
         14   DHCPREQUERESP

   rev - 1 byte

      Failover protocol version supported. Set to 1 for the Failover
      Protocol described in this draft.

   payload offset - 2 bytes, network byte order

      The byte offset of the Payload area, from the beginning of the
      Failover packet header. The value for the current protocol
      version is 8.

   xid - 4 bytes, network byte order

      The sender of a failover protocol packet is responsible for
      setting this number, and the receiver of the packet copies the
      number over into any response packet. To the receiver it is
      opaque. The sender SHOULD ensure that every packet sent to a
      particular IP address and port combination has a unique
      transaction id unless that packet is a re-transmission.

2.4.
DHCPPOOLREQ and DHCPPOOLRESP:

   Whenever the Secondary server transitions into NORMAL mode, it first
   sends a DHCPPOOLREQ message to initiate a transfer of a small range
   of IP addresses that will serve as its private address pool. This is
   necessary because initially the Secondary server has no such address
   pool, and its pool gets depleted when it hands out addresses in
   COMMUNICATION-INTERRUPTED mode. This is why the request is sent
   every time the Secondary server transitions into NORMAL mode.

   The DHCPPOOLREQ message does not carry any payload data. When the
   Primary Server gets a DHCPPOOLREQ message, it computes which
   addresses should be transferred to the Secondary, and queues up
   DHCPBNDUPD transactions, setting the Status of these bindings to
   "BACKUP". Having done this, it sends a DHCPPOOLRESP message. The
   DHCPPOOLRESP message carries the "Number of addresses transferred"
   as its payload.

   The Secondary server keeps sending DHCPPOOLREQ messages until it
   receives a DHCPPOOLRESP with "Number of addresses transferred" = 0,
   or it decides that the partner is not responding. Each one of these
   messages MUST have the same transaction ID. If a new transaction ID
   is used in one of these messages, the receiving server will begin
   the transmission of the DHCPBNDUPD messages all over again. To be
   clear, if the Secondary Server receives a DHCPPOOLRESP message with
   "Number of addresses transferred" > 0, it MUST send another
   DHCPPOOLREQ message. This mechanism makes it possible for the
   Primary Server to pace the transfer (e.g., it could generate all
   addresses at once, or one by one).

   The Primary Server must respond to each DHCPPOOLREQ message it
   receives. If it has already generated all private addresses, or it
   has no available addresses, it MUST send DHCPPOOLRESP with "Number
   of addresses transferred" = 0.

2.5.
DHCPREQUEREQ and DHCPREQUERESP:

   Whenever either server wishes to be updated with the information
   that the other server knows and has not yet transmitted to it, it
   sends a DHCPREQUEREQ. The DHCPREQUEREQ message does not carry any
   payload data. When either server gets a DHCPREQUEREQ message, it
   computes which updates should be transferred to the Secondary, and
   queues up DHCPBNDUPD transactions as appropriate. Having done this,
   it sends a DHCPREQUERESP message. The DHCPREQUERESP message carries
   the "Number of addresses queued up" as its payload.

   The set of binding updates queued up will depend on the requesting
   server's state. (The state has already been communicated via prior
   DHCPPOLL/DHCPPRPL messages.)

   The Secondary server keeps sending DHCPREQUEREQ messages until it
   receives a DHCPREQUERESP with "Number of addresses queued up" = 0,
   or it decides that the partner is not responding. This is the same
   approach used for the DHCPPOOLREQ/DHCPPOOLRESP messages. Each one of
   these DHCPREQUEREQ messages MUST have the same transaction ID. Use
   of a new transaction ID will cause re-building of the outgoing
   binding update queue.

   The Primary Server must respond to each DHCPREQUEREQ message it
   receives. If it has already queued up all of the previously unsent
   binding updates, then it MUST send DHCPREQUERESP with "Number of
   addresses queued up" = 0.

2.6. DHCPBNDUPD

   The Primary notifies the Secondary (or the other way around) of a
   change in binding state and data. In response to a binding update,
   the recipient server MUST respond with a DHCPBNDACK message.
   Multiple binding updates can be batched up and sent in one Failover
   Protocol message.

2.7. DHCPBNDACK

   This message implements a positive or negative acknowledgement of
   one or more binding updates. A binding update (or a batch of binding
   updates sent as one message) is matched with its associated
   acknowledgment by having the same Xid field value in the message
   header.
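
   The Xid matching described above can be illustrated with a small
   sketch. It is not part of the protocol definition: the class and
   method names (PendingUpdates, send, acknowledge, unacknowledged) are
   hypothetical, and the use of a simple incrementing counter to obtain
   unique transaction ids is an assumption. The draft only requires
   that the sender choose the xid, keep it unique per destination
   except for re-transmissions, and match a DHCPBNDACK to its update by
   that value.

```python
import itertools


class PendingUpdates:
    """Tracks DHCPBNDUPD messages that were sent but not yet acknowledged."""

    def __init__(self):
        # Assumption: a counter is one simple way to get unique xids.
        self._next_xid = itertools.count(1)
        self._pending = {}  # xid -> list of binding updates in that message

    def send(self, updates):
        """Register a new batch and return the xid to put in its header."""
        xid = next(self._next_xid) & 0xFFFFFFFF  # xid is a 4-byte field
        self._pending[xid] = list(updates)
        return xid

    def acknowledge(self, xid):
        """Handle a DHCPBNDACK: return the matched batch, or None if the
        xid does not correspond to any outstanding update."""
        return self._pending.pop(xid, None)

    def unacknowledged(self):
        """Batches the sender still has to re-transmit, keyed by the
        ORIGINAL xid (a re-transmission reuses its xid)."""
        return dict(self._pending)
```

   In this sketch the sender alone keeps retry state, which mirrors the
   rule that the sender, and never the recipient, is responsible for
   reliable re-transmission.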

   The server sending a DHCPBNDACK message MAY include any of the
   options that are acceptable in a DHCPBNDUPD message when the
   DHCPBNDACK message is returned to the sender. If any of this
   information differs from the information in the DHCPBNDUPD message,
   the receiver SHOULD update its bindings database with that
   information upon receipt of the DHCPBNDACK message.

   The DHCPBNDACK MAY selectively reject one or more updates by
   including one or more IP address - Reject Reason option pairs in the
   message body. The DHCPBNDACK implicitly acknowledges any binding
   updates it replies to, except those it enumerates using Reject
   Reason Codes.

2.8. DHCPPOLL

   In order to determine the state of a given server, or to communicate
   a critical change in its own status, a participant can use the
   DHCPPOLL message. This message inquires about the current state of
   the recipient, and tells the recipient what state the sender is in.
   After sending a DHCPPOLL message, the participant listens for a
   DHCPPRPL message in response.

2.9. DHCPPRPL

   This message replies to the DHCPPOLL message (PRPL = Poll reply).
   The DHCPPRPL also carries server status information (see message
   payload details below).

   After a failover, when the Primary Server is restarted, the
   following messages are used to coordinate the Primary taking control
   back from the Secondary:

      DHCPCTLREQ    - Request for control
      DHCPCTLRET    - Return of control initiated
      DHCPCTLACK    - Return of control completed
      DHCPCTLACKACK - Return of control completed message acknowledged

   The Primary Server sends a DHCPCTLREQ message, indicating that it
   would like to take control of the bindings database. The Secondary
   Server replies with a DHCPCTLRET message, which serves as a signal
   to the Primary: "Stand by to receive binding updates". This message
   is then followed by a set of binding updates from the secondary to
   the primary.
   When all updates have been transmitted (and acknowledged) from
   Secondary to Primary, a DHCPCTLACK message is sent from the
   Secondary to the Primary, to signal that "all updates from the
   Secondary are now completed".

   DISCUSSION:

      Note that the DHCPCTLACK message type must be transmitted
      reliably, as the Primary Server will not start servicing clients
      until it has received the DHCPCTLACK message.

      To provide this reliability, the DHCPCTLACKACK message is
      provided. This provides an acknowledgment of the DHCPCTLACK
      message, and the DHCPCTLACK message will be periodically re-sent
      until it is acknowledged. We could just periodically re-send the
      DHCPCTLACK message until we start receiving binding updates from
      the Primary, but the Primary may not have any updates to send at
      all, hence the need for an explicit DHCPCTLACKACK message.

   The Primary Server transitions into NORMAL state upon receiving a
   DHCPCTLACK from the secondary, when the secondary has completed
   sending all of its updates during synchronization. The DHCPCTLACKACK
   message is needed to prevent the primary from waiting and not
   servicing clients if the DHCPCTLACK message got lost.

   The Secondary server will keep re-sending the DHCPCTLACK message
   until:

      1. It decides that the primary is not responding, so the
         Secondary server goes into COMMUNICATION-INTERRUPTED mode.

      2. It receives a DHCPCTLACKACK or a DHCPBNDUPD message from the
         primary. The Primary's DHCPBNDUPD messages would start
         arriving at the Secondary server if the Primary did get the
         DHCPCTLACK, but the DHCPCTLACKACK message got lost.

3. Protocol Payload Data Format

   Payload data is encoded as a set of flexible DHCP/BOOTP style
   options (the usual 1 byte option code, 1 byte length, and "length"
   bytes of data). The options are placed after the header, after
   skipping PayloadOffset bytes. The payload data options are not
   preceded by a "cookie" value.
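
   The layout described above can be sketched in a few lines of Python.
   This is an illustration under stated assumptions, not a normative
   decoder: the function name parse_failover_packet is hypothetical,
   and the sketch assumes that every option carries a length byte (it
   does not handle DHCP's length-less pad and end options, which this
   draft does not mention). The 8-byte fixed header and network byte
   order are taken from Section 2.3.

```python
import struct


def parse_failover_packet(packet: bytes):
    """Split a Failover packet into header fields and (code, data) options."""
    if len(packet) < 8:
        raise ValueError("packet shorter than the 8-byte fixed header")
    # op (1 byte), rev (1 byte), payload offset (2 bytes), xid (4 bytes),
    # multi-byte fields in network byte order.
    op, rev, payload_offset, xid = struct.unpack("!BBHI", packet[:8])
    if payload_offset < 8 or payload_offset > len(packet):
        raise ValueError("payload offset outside packet")

    options = []
    pos = payload_offset  # no magic cookie precedes the options
    while pos < len(packet):
        if pos + 2 > len(packet):
            raise ValueError("truncated option header")
        code, length = packet[pos], packet[pos + 1]
        data = packet[pos + 2:pos + 2 + length]
        if len(data) != length:
            raise ValueError("truncated option data")
        options.append((code, data))
        pos += 2 + length
    return {"op": op, "rev": rev, "xid": xid}, options
```

   Reading the payload offset from the header, rather than assuming the
   value 8, lets the same loop skip any additional header bytes that a
   sender may include.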

   Since the packet is NOT a DHCP/BOOTP protocol packet, the options
   used here do not conflict with any existing "proper" DHCP/BOOTP
   options. In fact, these options are allocated in relationship to the
   DHCP option space in the following way. In cases where the syntax
   and semantics of a Failover Payload Option are identical to those of
   a DHCP/BOOTP option, the same option number is used. For options
   unique to the Failover protocol, option numbers starting at 230 are
   used. Thus, all new Failover Protocol option numbers are assigned
   from a continuous range beginning with 230. This number is shown as
   an X in the tables below.

   The protocol is permissive in allowing various other DHCP options in
   binding updates. As long as the sender wishes to use an option, it
   MAY include it. On the other hand, the recipient MUST ignore any
   option it is not expecting.

   Multiple DHCPBNDUPD transactions can be batched together in one UDP
   packet. The option set for each individual transaction MUST always
   begin with the IP address (Option 50). This is the only restriction
   on payload item ordering. In any other case, payload data items can
   be included in any desired order.

   In case an implementation chooses to use the DHCPBNDNAK mechanism,
   the DHCPBNDNAK message SHOULD contain one or more Option 50s from
   the NAK-ed message, to indicate which specific update items are
   being NAK-ed.

   While the synchronization is in progress, the secondary MUST NOT
   accept client requests, and the primary MUST NOT send any updates to
   the secondary. This is necessary to allow the Primary to be the sole
   arbitrator of any conflicting updates.

3.1. DHCP Server Status

   This option is used to convey the current state of a server.

    Code Len  Type
   +--+---+------+
   | X| 1 | 1-15 |
   +--+---+------+

   Allowed values for this option:

      Value   Server State
      -----   ------------
        1     UNKNOWN-STATE
        2     PRIMARY-NORMAL        Normal state
        3     BACKUP-NORMAL
        4     PRIMARY-COMINT        Communication interrupted (safe)
        5     BACKUP-COMINT
        6     PRIMARY-PARTNERDOWN   Partner down (unsafe mode)
        7     BACKUP-PARTNERDOWN
        8     PRIMARY-CONFLICT      Synchronizing, after a
                                    "Partner-Down" divergence
        9     PRIMARY-SYNC          Synchronizing, after a
                                    "communications-interrupted"
                                    divergence.
       10     BACKUP-SYNC
       11     PRIMARY-RECOVER       Recovering ALL bindings from partner
       12     BACKUP-RECOVER
       13     FAILOVER-DISABLED     The server is running with the
                                    failover protocol disabled.
                                    (standalone)
       14     SERVER-PAUSED         The server is inactive, shutting
                                    down for a short period.
       15     SERVER-SHUTDOWN       The server is inactive, shutting
                                    down for an extended period.

   When a server is being re-started, it should send a DHCPPOLL message
   to its partner, reporting its status (SERVER-PAUSED). In response,
   the recipient SHOULD go into COMMUNICATION-INTERRUPTED mode.

   When a server is being shut down, it should send a DHCPPOLL message
   to its partner, reporting its status (SERVER-SHUTDOWN). In response,
   the recipient SHOULD go into PARTNER-DOWN mode.

3.2. DHCP Binding Status

   This option is used to convey the current state of a binding. This
   option is mandatory for DHCPBNDUPD messages.

     Code   Len  Type
   +-----+-----+-----+
   | X+1 |  1  | 1-7 |
   +-----+-----+-----+

   Legal values for this option are:

      Value   Binding State
      -----   -------------
        1     FREE        The lease has never been used
        2     ACTIVE      assigned to a client *
        3     EXPIRED
        4     RELEASED    A client released the lease
        5     ABANDONED   A server or client flagged address as not
                          usable.
        6     RESET       Lease was freed by some external agent.
        7     BACKUP      Lease is set aside for Secondary server's
                          private address pool.

3.3. Assigned IP address

   Uses identical code and format to DHCP Option 50 (requested IP
   address).

     Code   Len        Address
   +-----+-----+-----+-----+-----+-----+
   |  50 |  4  |  a1 |  a2 |  a3 |  a4 |
   +-----+-----+-----+-----+-----+-----+

3.4. Lease grant time

   This option carries an absolute GMT time value. An absolute value
   can be used because time synchronization has already been achieved
   between the source and the target server using the Sent Time Stamp
   option. The value is represented as seconds since Jan 1, 1970 (i.e.,
   the ANSI C time_t time value representation).

     Code   Len          Time
   +-----+-----+-----+-----+-----+-----+
   | X+2 |  4  |  t1 |  t2 |  t3 |  t4 |
   +-----+-----+-----+-----+-----+-----+

3.5. Sent Time Stamp

   A time stamp, using GMT, of when the packet was sent. It is used to
   determine the time drift between the sender and the recipient. The
   time drift is defined as the difference between "Arrive Time (GMT)"
   and "Send Time (GMT)". The actual packet travel time is assumed to
   be negligible in this context. All Date-Time values contained in
   Failover messages will be corrected by the time drift before being
   stored by the recipient.

     Code   Len          Time
   +-----+-----+-----+-----+-----+-----+
   | X+3 |  4  |  t1 |  t2 |  t3 |  t4 |
   +-----+-----+-----+-----+-----+-----+

   The time is a 32 bit unsigned long in network byte order, in units
   of seconds (GMT since EPOCH).

3.6. Number of addresses transferred to Secondary Server

   A 32 bit unsigned long in network byte order. Reports the number of
   addresses transferred by the Primary to the Secondary Server
   (addresses to be used for the Secondary Server's private address
   pool).

     Code   Len          Count
   +-----+-----+-----+-----+-----+-----+
   | X+4 |  4  |  n1 |  n2 |  n3 |  n4 |
   +-----+-----+-----+-----+-----+-----+

3.7. Lease Duration

   Uses the format and code of the standard DHCP IP Address Lease Time
   option. It is used in exactly the same way as in the DHCPOFFER
   message of the DHCP protocol. The time is in units of seconds, and
   is specified as a 32-bit unsigned integer. A Lease Duration of
   0xFFFFFFFF indicates an infinite lease.

     Code   Len       Lease Time
   +-----+-----+-----+-----+-----+-----+
   |  51 |  4  |  t1 |  t2 |  t3 |  t4 |
   +-----+-----+-----+-----+-----+-----+

3.8. Client Identifier

   The format, code and conventions used are identical to DHCP option
   61.

     Code   Len   Type  Client-Identifier
   +-----+-----+-----+-----+-----+---
   |  61 |  n  |  t1 |  i1 |  i2 | ...
   +-----+-----+-----+-----+-----+---

3.9. Client Hardware Address

   The format is similar to DHCP option 61. T1 (type) MUST be set to
   the proper ARP hardware address code (it MUST NOT be zero!).

   TBD: Reference the ARP document here.

     Code   Len   Type  Client-Identifier
   +-----+-----+-----+-----+-----+---
   | X+5 |  n  |  t1 |  i1 |  i2 | ...
   +-----+-----+-----+-----+-----+---

   Either Client Id, Client Hardware Address or BOTH MAY be present in
   binding update transactions. At least one of them MUST be present.
   If both are present, the Client Id MUST be used to uniquely identify
   the owner of the binding (exactly as in RFC 2131).

3.10. Host Name

   Uses the format and code of DHCP option 12.

     Code   Len              Host Name
   +-----+-----+-----+-----+-----+-----+-----+-----+--
   |  12 |  n  |  h1 |  h2 |  h3 |  h4 |  h5 |  h6 | ...
   +-----+-----+-----+-----+-----+-----+-----+-----+--

3.11. Domain Name

   Uses the format and code of DHCP option 15.

     Code   Len       Domain Name
   +-----+-----+-----+-----+-----+-----+--
   |  15 |  n  |  d1 |  d2 |  d3 |  d4 | ...
   +-----+-----+-----+-----+-----+-----+--

3.12. Reject Reason Code

   This option is used to selectively reject binding updates. It MAY be
   used in a DHCPBNDACK message, always following an option 50. (The
   option 50 contains the IP address of the specific update being
   rejected.)

     Code   Len  Reason code
   +-----+-----+-----+
   | X+6 |  1  |  R1 |
   +-----+-----+-----+

   Reason codes:

      1   Illegal IP address (not part of any address pool)
      2   Fatal conflict exists: address in use by other client.

3.13. MDLI

   Maximum Delta Lease Interval, in seconds. A 32 bit integer value, in
   network byte order.

     Code   Len          Time
   +-----+-----+-----+-----+-----+-----+
   | X+7 |  4  |  t1 |  t2 |  t3 |  t4 |
   +-----+-----+-----+-----+-----+-----+

4. Exchange of control between Primary and Secondary

   The Primary and Secondary Servers coordinate the exchange of control
   over the bindings database through the use of DHCPPOLL and
   DHCPCTLREQ messages.

   In normal operation:

   The Primary sends notification of each change to its bindings
   database to the Secondary, and the Secondary keeps its bindings
   database synchronized with the Primary's database.

   The Secondary periodically sends DHCPPOLL messages to the Primary,
   and the Primary responds to each DHCPPOLL message with a DHCPPRPL
   message. If the Secondary does not receive a DHCPPRPL response
   message, the Secondary takes control of the bindings database and
   begins answering requests from DHCP clients. Note that the Secondary
   should be able to be configured to not perform the automatic
   switch-over. The conditions under which a Secondary takes control of
   the bindings database, e.g., the number of consecutive missing
   acknowledgments, should be configurable in the Secondary by the DHCP
   administrator.

   The Secondary records any changes it makes to the bindings database
   while it has control. The Secondary continues to send DHCPPOLL
   messages to the Primary. The DHCPPOLL messages also carry
   information on the state of the Secondary Server.

   To regain control of the bindings database, e.g., after the Primary
   Server has recovered from a failure or a partitioned network
   condition, the Primary sends a DHCPCTLREQ message to the Secondary.
   The Secondary stops answering DHCP client requests, and responds to
   its Primary with a DHCPCTLRET message. After sending the DHCPCTLRET
   message, the Secondary sends DHCPBNDUPD messages for each of the
   changes it has made to the bindings database. The Primary sends a
   DHCPBNDACK for each DHCPBNDUPD message it receives.
   The Secondary completes the transfer of control by sending a
   DHCPCTLACK message to the Primary as soon as all of its updates have
   been acknowledged.

   Note that the Primary SHOULD NOT send any DHCPBNDUPD messages while
   synchronization is in progress with the Secondary. Once the
   synchronization is completed, the Primary transitions into NORMAL
   state and starts sending DHCPBNDUPD transactions for any accumulated
   binding changes it may have.

5. Duplicate address assignment scenarios

   In the following two scenarios, the protocol could end up allocating
   duplicate IP addresses, unless the measures recommended in Section 6
   are taken:

   Primary Server crash before "lazy" update:

      In the case where the Primary Server sends an ACK to a client for
      a newly allocated IP address and then crashes prior to sending
      the corresponding update to the Secondary Server, the Secondary
      Server will have no record of the IP address allocation. When the
      Secondary Server takes over, it may well try to allocate that IP
      address to a different client. In the case where the first client
      to receive the IP address is not on the net at the time (while
      there is still time left to run on its lease), an ICMP echo
      (i.e., ping) will not prevent the Secondary Server from
      allocating that IP address to a different client.

      A more likely and subtle version of this problem is where the
      Primary Server crashes after extending a client's lease time, and
      before updating the Secondary with a new time using a lazy
      update. After the Secondary takes over, if the client is not
      connected to the network the Secondary will believe the client's
      lease has expired when, in fact, it has not. In this case as
      well, the IP address might be reallocated to a different client
      while the first client is still using it.

   Network partition where servers can't communicate but each can talk
   to clients:

      Several conditions are required for this situation to occur.
First, due to a network failure, the Primary and Secondary Servers
cannot communicate. In addition, some of the DHCP clients must be
able to communicate with the Primary Server, while others can
communicate only with the Secondary Server. When this condition
occurs, both Primary and Secondary Servers could attempt to allocate
IP addresses for new clients from the same pool of available
addresses. At some point, then, two clients will end up being
allocated the same IP address. This will cause potentially serious
problems when the network failure that created this situation is
corrected.

The next section details how the Failover Protocol prevents either of
the above scenarios (and other related scenarios) from causing
duplicate IP address allocation.

6. Duplicate Address Assignment Control

There are several ways in which the Failover Protocol avoids the
possibility of duplicate address assignment.

6.1. Control of lease time

The key problem with lazy update is that when the primary server
fails after updating a client with a particular lease time and before
updating the secondary server, the secondary server will believe that
a lease has expired even though the client still retains a valid
lease on that IP address.

In order to handle this problem, a period of time known as the
"maximum delta lease interval" (MDLI) is defined and must be known to
both the primary and secondary servers. Proper use of this time
interval places an upper bound on the difference allowed between the
lease time provided to a DHCP client and the lease time known by the
secondary server. So that the MDLI does not become the maximum lease
time that the primary can ever provide to a client, the primary
typically updates the secondary, during a lazy update, with a lease
time that is longer than the lease time previously given to the
client.
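
The MDLI bound described above can be illustrated with a minimal
sketch. The function names clamp_client_lease and
partner_update_interval, the lead_factor parameter, and the inclusive
treatment of the bound are assumptions made for illustration only;
the 1.5 multiplier is borrowed from the worked example later in this
section and is not mandated by the protocol.

```python
from datetime import timedelta


def clamp_client_lease(desired, acked_partner, mdli):
    """Return the lease interval the server may actually give a client.

    The client lease may not run more than `mdli` beyond the interval
    the partner has acknowledged (assumed inclusive in this sketch).
    """
    return min(desired, acked_partner + mdli)


def partner_update_interval(actual, desired, lead_factor=1.5):
    """Interval sent to the partner in the lazy update.

    Hypothetical policy: lead the client by a multiple of the desired
    interval so that later renewals are not limited to the MDLI.
    """
    return actual + desired * lead_factor


mdli = timedelta(hours=1)
desired = timedelta(days=3)

# New lease: the partner has acknowledged nothing for this address yet.
first_lease = clamp_client_lease(desired, timedelta(0), mdli)
first_update = partner_update_interval(first_lease, desired)

# Renewal half an hour later: the acknowledged interval has shrunk by
# the elapsed time, but it still exceeds the desired interval.
remaining_ack = first_update - timedelta(minutes=30)
renewed_lease = clamp_client_lease(desired, remaining_ack, mdli)
```

With these sample values the first lease is limited to the MDLI,
while the renewal can be granted the full desired interval because
the partner's acknowledged interval already leads the client.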
In the case where the secondary needs to take over from the primary,
the secondary will not reallocate any IP addresses from one client to
a different client. When transitioning to the PARTNER-DOWN state
(where the secondary is allowed to reallocate IP addresses), the
secondary will wait the maximum-delta-lease-interval before
completing the state transition. Thus, any clients which have a lease
on an IP address with a lease time greater than that known by the
secondary will either have contacted the secondary during that time
or their lease will have expired.

This protocol requires a DHCP server to deal with several different
lease intervals and places specific restrictions on their
relationships. The purpose of these restrictions is to allow the
other server in the pair to be able to make certain assumptions in
the absence of an ability to communicate between servers.

The different lease times are:

o  desired client lease interval

   The desired client lease interval is the lease interval that the
   DHCP server would like to give to the DHCP client in the absence
   of any restrictions imposed by the Failover Protocol. Its
   determination is outside of the scope of this protocol. Typically
   this is the result of external configuration of a DHCP server.

o  actual client lease interval

   The actual client lease interval is the lease interval that the
   DHCP server gives out to the DHCP client. It may be shorter than
   the desired client lease interval (as explained below).

o  Primary Server lease interval

   The Primary Server lease interval is the interval after which the
   Primary Server believes that the DHCP client's lease will expire.

o  desired Secondary Server lease interval

   The desired Secondary Server lease interval is the interval,
   reported by the Primary Server to the Secondary Server, after
   which the lease will expire.
o  acknowledged Secondary Server lease interval

   The acknowledged Secondary Server lease interval is the interval
   the Secondary Server has most recently acknowledged.

The key restriction (and guarantee) that the Primary Server makes
with respect to lease intervals is that the actual client lease
interval never exceeds the acknowledged Secondary Server lease
interval (if any) by more than a fixed amount. This fixed amount is
called the "maximum delta lease interval" (MDLI). The MDLI MAY be
configurable, but for correct server operation it MUST be known to
both the Primary and Secondary Servers.

The Primary Server MUST record in its state both the Primary Server
lease interval and the most recently acknowledged Secondary Server
lease interval. It is assumed that the desired client lease interval
can be determined through techniques outside of the scope of this
protocol.

The above lease time descriptions are written for the case where the
Primary server is operating and in communication with the Secondary
server. In the case where the Secondary server is operating out of
communication with the Primary server, the relationships must hold
in the other direction.

The fundamental relationship among these times which MUST be
maintained is:

   actual client lease interval <
       ( acknowledged other server lease interval + MDLI )

The "acknowledged other server lease interval" is the acknowledged
secondary server lease interval for the Primary server, and it would
be the acknowledged primary server lease interval for the Secondary
server when it is operating out of contact with the Primary server.

DISCUSSION:

This protocol mandates no particular detailed algorithms concerning
these lease intervals, as long as the above fundamental relationship
is preserved. In the interests of clarity, however, let's examine a
specific example. The MDLI in this case is 1 hour. The desired client
lease interval is 3 days.
In operation this might work as follows:

When a Primary Server makes an offer for a new lease on an IP address
to a DHCP client, it determines the desired client lease interval (in
this case, 3 days). It then examines the acknowledged Secondary lease
interval (which in this case is zero). Since the actual client lease
interval cannot be allowed to exceed the current Secondary lease
interval by more than the MDLI, the offer made to the DHCP client
(the actual client lease interval) is for (essentially) the MDLI,
1 hour.

Once the Primary Server has performed the ACK to the DHCP client, it
will update the Secondary Server with the lease information. However,
the Secondary Server lease interval will be composed of the current
actual client lease interval + ( 1.5 * desired client lease
interval ). Thus, the Secondary Server is updated with a lease
interval of 4.5 days + 1 hour.

When the Primary Server receives an ACK to its update of the
Secondary Server's lease interval, it records that as the
acknowledged Secondary Server lease interval. The Primary Server MUST
ensure that the Secondary Server has received and recorded in its
stable storage the Secondary Server lease interval.

When the DHCP client attempts to renew at T1 (approximately one half
hour from the start of the lease), the Primary Server again
determines the desired client lease time, which is still 3 days. It
then compares this with the remaining acknowledged Secondary Server
lease interval (adjusting for the time passed since the Secondary
Server was last updated), which is 4.5 days + 1/2 hour. It sets the
actual client lease interval to the desired client lease interval,
as it is less than the acknowledged Secondary lease interval.

When the Primary DHCP server updates the Secondary DHCP server after
the DHCP client's renewal ACK is complete, it will calculate the
Secondary Server lease interval as the actual client lease interval
(3 days this time) + ( 0.5 * desired client lease interval ), which
is 1.5 days.
In this way, the Primary attempts to have the Secondary always "lead"
the client in its understanding of the client's lease interval.

Once the initial actual client lease interval of the MDLI is past,
the protocol operates effectively like the DHCP protocol does today
in its behavior concerning lease intervals. However, the guarantee
that the actual client lease interval will never exceed the
acknowledged Secondary Server lease interval by more than the MDLI
allows full recovery from failures in lazy update.

6.2. Controlled re-allocation of IP addresses

When the servers cannot communicate, neither server will allow an IP
address previously used by one client to be offered to a different
client. As a corollary, during normal operations the primary server
must update the secondary server whenever a lease expires or an IP
address is released, and must receive acknowledgement of that update
before offering the expired or released IP address to a different
client.

7. Server States

The following server states are defined:

NORMAL state:

NORMAL state is the state used by a server when it can communicate
with the other server in the Primary-Secondary Server pair. When in
this state, the Primary responds to DHCP client requests, while the
Secondary does not.

COMMUNICATION-INTERRUPTED state:

A server goes into this state whenever it is unable to communicate
with the other server. Both the Primary and Secondary Servers can go
into this state, although the resulting changes in behavior differ
between them. Primary and Secondary Servers cycle automatically
(without administrative intervention) between NORMAL and
COMMUNICATION-INTERRUPTED state as the network connection between
them fails and recovers, or as the partner server cycles between
operational and non-operational. No duplicate IP address allocation
can occur while the servers cycle between these states.
In this state both servers may respond to DHCP client requests. When
allocating new IP addresses, each server allocates from a different
pool. When responding to renewal requests, each server will allow
continued renewal of a DHCP client's current lease on an IP address.

PARTNER-DOWN state:

PARTNER-DOWN state is a state either server can enter. Once a server
has entered NORMAL state, the PARTNER-DOWN state is entered only on
command of an external agency (typically an administrator of some
sort) or after the expiration of an externally configured minimum
safe-time after the beginning of COMMUNICATION-INTERRUPTED state.
When in this state, the server no longer assumes that the other
server could still be operational and servicing a different set of
clients, but instead assumes that it is the only server operating.
Only one server should be operating in this state at a time.

The server in this state will respond to DHCP client requests. It
will allow renewal of all outstanding leases on IP addresses, and
will allocate IP addresses from its own pool, and after a fixed
period of time, it will allocate IP addresses from the set of all
available IP addresses.

The server will transition out of PARTNER-DOWN state after automatic
re-integration with the companion server is complete. This automatic
re-integration will typically be initiated by the restart of the
server which was down.

POTENTIAL-CONFLICT state:

This state indicates that the two servers are attempting to
reintegrate with each other, but at least one of them was running in
a state that did not guarantee automatic reintegration would be
possible. In POTENTIAL-CONFLICT state the servers may determine that
the same IP address has been offered and accepted by two different
DHCP clients.

RECOVER state:

This state indicates that the server has no information in its stable
storage. A server in this state will attempt to refresh its stable
storage from the other server.
SYNC state:

In this state, the Secondary Server attempts to synchronize its
stable storage with the Primary Server. Both the Primary and
Secondary may have information that the other lacks.

8. Primary Server Operation

This section discusses the operation of the primary server using the
state transition diagram in Figure 8.2-1.

8.1. Primary Server Initialization

When the Primary Server starts, there are three possibilities:

o  it has never started before and therefore has no record of any
   previous state nor of any client binding information;

o  it has started before and has a record of a previous state and
   possibly of some client binding information;

o  it has started before, but failed catastrophically, and now has no
   record of any previous state (nor of any client binding
   information).

When the Primary Server starts, if it has any record of a previous
state, then if that state was NORMAL or COMMUNICATION-INTERRUPTED it
moves to COMMUNICATION-INTERRUPTED state. If that state was
PARTNER-DOWN or POTENTIAL-CONFLICT, then it moves to PARTNER-DOWN
state. If that state was RECOVER, then the Primary Server moves into
the RECOVER state.

If it has no record of any previous state, then either this is an
initial startup, or a recovery from a catastrophic failure where
stable storage and all client binding information was lost. The two
cases are distinguished by external configuration: recovery from a
catastrophic failure is indicated to the Primary Server by some
external configuration setting.

8.2. Primary Server State Transitions

Figure 8.2-1 is the diagram of the Primary Server's state
transitions. The remainder of this section contains information
important to the understanding of that diagram.

The server stays in the current state until all of the actions
specified on the state transition are complete.
If communication fails during one of the actions, the server simply
stays in the current state and attempts a transition whenever the
conditions for a transition are later fulfilled.

In the state transition diagram below, the "+" or "-" in the upper
right corner of each state is a notation about whether communication
is ongoing with the Secondary Server. The legend "responsive" and
"unresponsive" in each state indicates whether the Primary Server is
responsive to DHCP client requests in the respective state.

In the state transition diagram below, when communication is
reestablished between the Primary and Secondary Server, the Primary
server must record the state of the Secondary Server at the time
communication was reestablished.

If the state of the Secondary Server changes while communicating,
then the Primary Server moves through the communications-failed
transition, and into whatever state results. It then immediately
moves through whatever state transition is appropriate given the
current state of the Secondary Server.

DISCUSSION:

The point of this technique is simplicity, both in explanation of the
protocol and in its implementation. The alternative to this technique
of memory of partner state and automatic state transition on change
of partner state is to have every state in the following diagram have
a state transition for every possible state of the partner. With the
approach adopted, only the states in which communications are
reestablished require a state transition for each possible partner
state.

All state transitions of the Primary Server must be recorded in its
stable storage, and thus be available to the server after a server
restart.
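
The startup rules of Section 8.1, which depend on the recorded state
surviving a restart, can be sketched as a simple lookup. The State
enumeration and the primary_startup_state function are hypothetical
names introduced for illustration. The case with no recorded state
returns None here, since Section 8.1 leaves the choice between
initial startup and catastrophic recovery to external configuration.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    NORMAL = "NORMAL"
    COMMUNICATION_INTERRUPTED = "COMMUNICATION-INTERRUPTED"
    PARTNER_DOWN = "PARTNER-DOWN"
    POTENTIAL_CONFLICT = "POTENTIAL-CONFLICT"
    RECOVER = "RECOVER"


def primary_startup_state(previous: Optional[State]) -> Optional[State]:
    """Map the state read from stable storage to the Primary's startup state."""
    if previous in (State.NORMAL, State.COMMUNICATION_INTERRUPTED):
        # The partner's condition is unknown at startup, so assume no contact.
        return State.COMMUNICATION_INTERRUPTED
    if previous in (State.PARTNER_DOWN, State.POTENTIAL_CONFLICT):
        return State.PARTNER_DOWN
    if previous is State.RECOVER:
        return State.RECOVER
    # No record of a previous state: decided by external configuration,
    # which this sketch does not model.
    return None
```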
   State entered at startup, by previous Primary state:

   Previous Primary state                  State entered
   -------------------------------------   -------------------------
   NORMAL or COMMUNICATION INTERRUPTED     COMMUNICATIONS INTERRUPTED
   RECOVER                                 RECOVER
   PARTNER DOWN or POTENTIAL CONFLICT      PARTNER DOWN

   States ("+" / "-": communicating with the Secondary or not):

   State                        Comm.   DHCP client requests
   --------------------------   -----   --------------------
   RECOVER                        -     unresponsive
   PARTNER DOWN                   -     responsive
   POTENTIAL CONFLICT             +     responsive
   NORMAL                         +     responsive
   COMMUNICATIONS INTERRUPTED     -     responsive

   Transitions:

   From                 Event                        Action / Next state
   ------------------   --------------------------   ---------------------
   RECOVER              Comm. OK, Sec. State:        Note Poss. Error;
                        RECOVER                      NORMAL
   RECOVER              Comm. OK, Sec. State:        Sec->Pri Sync, Wait
                        All Others                   MDLI from Fail.;
                                                     NORMAL
   PARTNER DOWN         Comm. OK, Sec. State:        Pri->Sec Sync;
                        RECOVER, NORMAL              NORMAL
   PARTNER DOWN         Comm. OK, Sec. State:        POTENTIAL CONFLICT
                        All Others
   POTENTIAL CONFLICT   Resolve Conflict             NORMAL
   POTENTIAL CONFLICT   Comm. Failed                 PARTNER DOWN
   NORMAL               Comm. Failed                 COMMUNICATIONS
                                                     INTERRUPTED
   NORMAL               External Command             PARTNER DOWN
   COMMUNICATIONS       Comm. OK, Sec. State:        Pri<->Sec Sync;
   INTERRUPTED          NORMAL, COMM. INT.,          NORMAL
                        RECOVER
   COMMUNICATIONS       Comm. OK, Sec. State:        POTENTIAL CONFLICT
   INTERRUPTED          All Others
   COMMUNICATIONS       External Command or          PARTNER DOWN
   INTERRUPTED          "Safe Period" expiration

           Figure 8.2-1: Primary Server state diagram.

8.3. Primary Server in PARTNER-DOWN state

When it is in PARTNER-DOWN state, the Primary Server operates largely
as does a normal DHCP server, with none of the special algorithms
described below.

In PARTNER-DOWN state the Primary Server MUST respond to DHCP client
requests.

Any available IP address tagged as belonging to the Secondary Server
(at entry to PARTNER-DOWN state) MUST NOT be used until the MDLI
beyond the entry into PARTNER-DOWN state has elapsed.

The Primary Server MUST NOT allocate an IP address to a DHCP client
different from that to which it was allocated at the entrance to
PARTNER-DOWN state until the MDLI beyond its expiration time has
elapsed.
If this time would be earlier than the current time plus the MDLI,
then the current time plus the MDLI is used.

Two options exist for lease times, with different ramifications
flowing from each.

If the Primary Server wishes the Failover Protocol to protect it from
loss of stable storage in any state, then it should ensure that the
MDLI-based lease time restrictions in Section 6.1 are maintained,
even in PARTNER-DOWN state.

If the Primary Server wishes to forego the protection of the Failover
Protocol in the event of loss of stable storage, then it need
recognize no restrictions on actual client lease times while in
PARTNER-DOWN state.

The Primary Server MUST poll the Secondary Server and attempt to
establish communications and synchronization with it. Once the
Primary succeeds in contacting the Secondary Server, the Primary
examines the state of the Secondary Server. If the state of the
Secondary Server is RECOVER or NORMAL, then both servers have been
running in such a way that duplicate IP address allocations were
inhibited. In this case, the Primary Server updates the Secondary
Server with its client binding information, and moves into the NORMAL
state.

Once contact has been established, if the state of the Secondary
Server is anything other than RECOVER or NORMAL, then the Primary
Server moves into the POTENTIAL-CONFLICT state.

8.4. Primary Server in RECOVER state

When the Primary Server is initialized in the RECOVER state it
expects to refresh its stable storage from an existing Secondary
Server. In this state the Primary Server MUST NOT respond to DHCP
client requests.
When the Primary Server succeeds in contacting the Secondary Server,
if it determines that the Secondary Server is itself in the RECOVER
state (which indicates that the Secondary Server has no existing
client binding information), the Primary Server will move directly
into NORMAL state after signaling some kind of an error (since some
person had to explicitly start the Primary Server in RECOVER state to
refresh its lost client binding information from the Secondary, and
the Secondary had no state).

If the Primary Server determines that the Secondary Server is in any
state other than RECOVER, then the Secondary Server has some client
binding information that the Primary Server needs before it moves
into the NORMAL state. The Primary Server will attempt to refresh its
state from the Secondary Server, and it will remain in the RECOVER
state until it is successful in doing so.

The Primary Server MUST remain in RECOVER state until a period of at
least the MDLI has passed since the Primary Server was known to have
failed. This is to allow any clients holding IP addresses that were
allocated by the Primary Server prior to the loss of the Primary
Server's client binding information in stable storage to contact the
Secondary Server, or to allow their leases to time out.

DISCUSSION:

The actual requirement on this wait period in RECOVER is that it
start when the Primary Server went down, not necessarily when it came
back up. If the time when the Primary Server failed is known, then it
could be communicated to the recovering server, and the wait period
could be reduced to the MDLI less the difference between the current
time and the time the server failed. In this way, the waiting period
could be minimized.

8.5. Primary Server in NORMAL state

When in NORMAL state, the Primary Server takes the following actions
to implement the Safe Failover Protocol:

o  Lease Time Calculations

   As discussed in Section 6.1, "Control of lease time", the lease
   interval given to a DHCP client can never exceed the acknowledged
   Secondary Server lease interval by more than the maximum delta
   lease interval.

   As long as the Primary Server adheres to this constraint, the
   specifics of the lease intervals that it gives to either the DHCP
   client or the Secondary DHCP server are implementation dependent.
   One possible approach is shown in Section 6.1, but that particular
   approach is in no way required by this protocol.

o  Lazy Update of Secondary Server

   After an ACK of an IP address binding, the Primary Server attempts
   to update the Secondary with the binding information. The lease
   time used in the update of the Secondary MUST be at least that
   given to the DHCP client in the DHCPACK. It MAY, however, be
   longer.

o  Reallocation of IP Addresses Between Clients

   Whenever a client binding is released, a DHCPBNDUPD message must
   be sent to the Secondary Server, setting the binding state to
   RELEASED. However, until a DHCPBNDACK is received for this
   message, the IP address cannot be allocated to another client.

8.6. Primary Server in COMMUNICATION-INTERRUPTED state

When in COMMUNICATION-INTERRUPTED state the Primary Server operates
in such a way that correct operation is ensured even if the Secondary
Server is still up and operational, but unable to communicate with
the Primary Server. When communications are reestablished between the
Primary and Secondary Servers, if both are still in
COMMUNICATION-INTERRUPTED state, then the re-integration of their
operation will proceed automatically and without human intervention.
The protocol is designed to ensure that reintegration will proceed in
an error-free manner and that no actions taken by either server while
in COMMUNICATION-INTERRUPTED state will cause problems during
reintegration.

The Primary Server operates in COMMUNICATION-INTERRUPTED state as it
does in NORMAL state. However, since it cannot communicate with the
Secondary in this state, the acknowledged-Secondary-lease-time will
not be updated in any new bindings. This is likely eventually to
cause each actual-client-lease-time to be the current-time plus the
MDLI (unless this is greater than the desired-client-lease-time).

The Primary Server can simply queue updates to the Secondary on
communication interruption and stay in the NORMAL state. If, at the
time communication with the Secondary is reestablished, the Secondary
remains in the NORMAL state as well, then the queued updates for the
Secondary will simply be processed.

COMMUNICATION-INTERRUPTED state for the Primary Server is a signal
that it has stopped queuing updates to the Secondary, and is able to
respond to a variety of possible Secondary states. It is anticipated
that some alarm condition would be raised upon the transition from
NORMAL state to COMMUNICATION-INTERRUPTED state.

Once the Primary Server has been in COMMUNICATION-INTERRUPTED state
for a period equal to the safe-period, then it can (if configured to
do so) transition into the PARTNER-DOWN state. An external command
may also force a transition to PARTNER-DOWN state.

9. Secondary Server Operation

The Secondary Server responds to DHCP client requests only in the
PARTNER-DOWN and COMMUNICATION-INTERRUPTED states.

9.1. Secondary Server Initialization

When the Secondary Server starts, there are three possibilities:

o  it has never started before and therefore has no record of any
   previous state nor of any client binding information;

o  it has started before and has a record of a previous state and
   possibly of some client binding information;

o  it has started before, but failed catastrophically, and now has no
   record of any previous state (nor of any client binding
   information).

When the Secondary Server starts, if it has any record of a previous
state, then if that state was NORMAL, COMMUNICATION-INTERRUPTED, or
SYNC, it moves to COMMUNICATION-INTERRUPTED state. If that state was
PARTNER-DOWN or POTENTIAL-CONFLICT, then it moves to PARTNER-DOWN
state. In all other cases (both other previous states and the cases
where there is no record of a previous state), the Secondary Server
moves into the RECOVER state.

9.2. Secondary Server State Transitions

The server stays in the current state until all of the actions
specified on the state transition are complete. If communication
fails during one of the actions, the server simply stays in the
current state and attempts a transition whenever the conditions for a
transition are later fulfilled.

In the state transition diagram below, the "+" or "-" in the upper
right corner of each state is a notation about whether communication
is ongoing with the Primary Server. The legend "responsive" and
"unresponsive" in each state indicates whether the Secondary Server
is responsive to DHCP client requests in the respective state.

In the state transition diagram below, when communication is
reestablished between the Secondary and Primary Server, the Secondary
Server must record the state of the Primary Server at the time
communication was reestablished.
If the state of the Primary Server changes while communicating, then
the Secondary Server moves through the communications-failed
transition, and into whatever state results. It then immediately
moves through whatever state transition is appropriate for the
current state of the Primary Server.

All state transitions of the Secondary Server must be recorded in its
stable storage, and thus be available to the server after a server
restart.

   State entered at startup, by previous Secondary state:

   Previous Secondary state                State entered
   -------------------------------------   -------------------------
   NORMAL, COMM. INT., or SYNC             COMMUNICATIONS INTERRUPTED
   RECOVER                                 RECOVER
   PARTNER DOWN or POTENTIAL CONFLICT      PARTNER DOWN

   States ("+" / "-": communicating with the Primary or not):

   State                        Comm.   DHCP client requests
   --------------------------   -----   --------------------
   RECOVER                        -     unresponsive
   PARTNER DOWN                   -     responsive
   POTENTIAL CONFLICT             +     unresponsive
   NORMAL                         +     unresponsive
   SYNC                           +     unresponsive
   COMMUNICATIONS INTERRUPTED     -     responsive

   Transitions:

   From                 Event                        Action / Next state
   ------------------   --------------------------   ---------------------
   RECOVER              Comm. OK, Pri. State:        Note Poss. Error;
                        RECOVER                      NORMAL
   RECOVER              Comm. OK, Pri. State:        Pri->Sec Sync;
                        All Others                   NORMAL
   PARTNER DOWN         Comm. OK, Pri. State:        Sec->Pri Sync;
                        RECOVER                      NORMAL
   PARTNER DOWN         Comm. OK, Pri. State:        POTENTIAL CONFLICT
                        All Others
   POTENTIAL CONFLICT   Resolve Conflict             NORMAL
   POTENTIAL CONFLICT   Comm. Failed                 PARTNER DOWN
   NORMAL               Comm. Failed                 Start Alloc Timer;
                                                     COMMUNICATIONS
                                                     INTERRUPTED
   NORMAL               External Command             PARTNER DOWN
   SYNC                 Pri<->Sec Sync complete      NORMAL
   SYNC                 Comm. Failed                 COMMUNICATIONS
                                                     INTERRUPTED
   COMMUNICATIONS       Comm. OK, Pri. State:        SYNC
   INTERRUPTED          NORMAL, COMM. INT.
   COMMUNICATIONS       Comm. OK, Pri. State:        Sec->Pri Sync;
   INTERRUPTED          RECOVER                      NORMAL
   COMMUNICATIONS       Comm. OK, Pri. State:        POTENTIAL CONFLICT
   INTERRUPTED          All Others
   COMMUNICATIONS       External Command or          PARTNER DOWN
   INTERRUPTED          "Safe Period" expiration

           Figure 9.2-1: Secondary Server State Diagram.

9.3. Secondary Server in RECOVER state

The Secondary DHCP server comes up in the RECOVER state when it has
no record of any previous state (or that previous state was RECOVER).
It stays in this state until it establishes communication with the
Primary Server, and is unresponsive to DHCP client requests in this
state. Essentially it is idle until it can contact the Primary
Server.

When it establishes communication with the Primary Server, it
attempts to load its client binding database from that of the Primary
Server using the techniques specified in Section 6.

Once the Secondary Server's client binding database is refreshed from
that of the Primary, the Secondary Server moves into NORMAL state.

9.4. Secondary Server in NORMAL state

In NORMAL state, the Secondary Server receives state updates from the
Primary Server in DHCPBNDUPD messages. It records these in its client
binding database in stable storage and then sends the corresponding
DHCPBNDACK message to the Primary Server.

While in NORMAL state, the Secondary Server MUST also acquire a
series of IP addresses from the Primary Server to be used to satisfy
DHCPDISCOVER requests from DHCP clients when in
COMMUNICATION-INTERRUPTED state. See Section 2.2.2 for details of
this acquisition process.

The Secondary Server periodically polls the Primary Server with the
DHCPPOLL message. If it fails to receive a DHCPPRPL message in reply
after a configured number of retries or some administratively
determined time, the Secondary Server transitions into
COMMUNICATION-INTERRUPTED state. Both the DHCPPOLL and DHCPPRPL
messages carry the current status of the sender.

If an external command is received by the Secondary Server, it can
move from NORMAL to PARTNER-DOWN state directly.
Such a command might be sent when the Primary Server was removed from
service, and an operator wanted the Secondary Server to take over
immediately and completely from the Primary Server. (Note that the
Secondary Server takes over from the Primary Server when in
COMMUNICATION-INTERRUPTED state, but less completely than in
PARTNER-DOWN state.)

9.5. Secondary Server in COMMUNICATION-INTERRUPTED state

When in COMMUNICATION-INTERRUPTED state the Secondary Server operates
in such a way that correct operation is ensured even if the Primary
Server is still up and operational, but unable to communicate with
the Secondary Server. When communications are reestablished between
the Primary and Secondary Servers, if both are still in
COMMUNICATION-INTERRUPTED state, then the re-integration of their
operation will proceed automatically and without human intervention.
The protocol is designed to ensure that reintegration will proceed in
an error-free manner and that no actions taken by either server while
in COMMUNICATION-INTERRUPTED state will cause any conflicts to occur
during re-integration.

In COMMUNICATION-INTERRUPTED state, the Secondary Server responds to
DHCP client requests.

When processing a DHCPREQUEST from a DHCP client, the Secondary
Server MUST ensure that the client-lease-time is never more than the
maximum-delta-lease-interval from the current-time, independent of
the desired-client-lease-time.

When processing a DHCPRELEASE request from a DHCP client or the
expiration of a lease, the Secondary Server MUST NOT reallocate the
IP address to a different client. If the same client subsequently
performs a DHCPDISCOVER request, the Secondary Server SHOULD offer it
the previously used IP address.

When processing a DHCPDISCOVER request from a DHCP client, the
Secondary MUST allocate IP addresses from the list of IP addresses
that it acquired from the Primary Server in RECOVER state.
When it exhausts this list, it MUST stop responding to DHCPDISCOVER
requests (except those it can satisfy by offering expired or released
IP addresses to their previously bound clients).

The Secondary Server MUST continue to send DHCPPOLL messages to the
Primary Server when in COMMUNICATION-INTERRUPTED state. If it
receives a DHCPPRPL message in reply, the Secondary Server determines
the state of the Primary Server. If the Primary Server is in NORMAL
or COMMUNICATION-INTERRUPTED state, then the Secondary Server moves
into the SYNC state.

If, however, the Primary Server is in RECOVER state, then the
Secondary Server updates the Primary Server with its known client
binding information, and moves into NORMAL state upon completion of
that update.

If instructed to by an outside agency (e.g., an administrator), the
Secondary Server SHOULD move into PARTNER-DOWN state.

Once the Secondary Server has been in COMMUNICATION-INTERRUPTED state
for a period equal to the safe-period, then it may (if configured to
do so) transition into the PARTNER-DOWN state in the absence of an
external command.

9.6. Secondary Server in SYNC state

The Secondary Server does not respond to DHCP client requests when in
SYNC state.

DISCUSSION:

This is the entire reason for this state's existence; otherwise the
activities specified for this state could happen as part of a state
transition from the COMMUNICATION-INTERRUPTED state to the NORMAL
state. However, in the COMMUNICATION-INTERRUPTED state the Secondary
Server responds to DHCP client requests. Having the Secondary Server
respond to DHCP client requests during the synchronization process
(and thus take actions requiring further synchronization) seemed
like a bad idea.

The Secondary Server synchronizes its information with the Primary
Server while in SYNC state.
Both Primary and Secondary Servers may have information the other lacks because of operations performed while communications were interrupted.

During the synchronization process, the Secondary Server continues to poll the Primary Server with DHCPPOLL messages. If it fails to receive a reply, it moves back into COMMUNICATION-INTERRUPTED state.

When synchronization is complete, the Secondary Server moves into NORMAL state.

9.7. Secondary Server in PARTNER-DOWN state

The Secondary Server responds to DHCP client requests when in PARTNER-DOWN state.

Any available IP address which does not belong to the private pool established by the Secondary Server (at entry to PARTNER-DOWN state) MUST NOT be used until the MDLI beyond the entry into PARTNER-DOWN state has elapsed.

The Secondary Server MUST NOT allocate an IP address to a DHCP client different from that to which it was allocated at the entrance to PARTNER-DOWN state until the MDLI beyond its expiration time has elapsed. If this time would be earlier than the current time plus the MDLI, then the current time plus the MDLI is used.

Two options exist for lease times, with different ramifications flowing from each.

If the Secondary Server wishes the Failover Protocol to protect it from loss of stable storage in any state, then it should ensure that the MDLI-based lease time restrictions in Section 6.1 are maintained, even in PARTNER-DOWN state.

If the Secondary Server wishes to forgo the protection of the safe Failover Protocol in the event of loss of stable storage, then it MAY recognize no restrictions on actual client lease times while in PARTNER-DOWN state.

The Secondary Server continues to poll the Primary Server with DHCPPOLL messages. If the Secondary Server receives a reply, and the Primary Server is in the RECOVER state, the Secondary Server updates the Primary Server with all of the Secondary's client binding information, and then moves into the NORMAL state.
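
The address reuse rules stated earlier in this section for PARTNER-DOWN state can be illustrated with a small sketch. It is not part of the protocol specification. The function and parameter names are hypothetical, and it assumes that the "current time" in the rule is the time of entry into PARTNER-DOWN state, which the text does not state explicitly.

```python
from datetime import datetime, timedelta


def available_address_usable_at(partner_down_entry: datetime,
                                mdli: timedelta) -> datetime:
    """Earliest time an available address outside the private pool may be used.

    The text requires waiting until the MDLI beyond the entry into
    PARTNER-DOWN state has elapsed.
    """
    return partner_down_entry + mdli


def earliest_reuse_time(lease_expiry: datetime,
                        partner_down_entry: datetime,
                        mdli: timedelta) -> datetime:
    """Earliest time a bound address may be given to a different client.

    The address is held until the MDLI beyond its expiration time, but
    never released earlier than the MDLI beyond the reference time
    (assumed here to be the entry into PARTNER-DOWN state).
    """
    after_expiry = lease_expiry + mdli
    floor = partner_down_entry + mdli
    return max(after_expiry, floor)
```

For example, with an MDLI of one hour and entry into PARTNER-DOWN state at 12:00, a lease that expired at 11:00 could be reassigned at 13:00, while a lease expiring at 15:00 could be reassigned at 16:00.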
If communications with the Primary Server are reestablished, and the Primary Server is in any state other than RECOVER, the Secondary Server moves into the POTENTIAL-CONFLICT state (as does the Primary Server).

9.8. Secondary Server in POTENTIAL-CONFLICT state

The Secondary Server enters POTENTIAL-CONFLICT state when the combination of its state and that of the Primary Server indicates that a potential conflict of IP address allocation has occurred. There is no guarantee that such a conflict has occurred -- just the possibility. In this state each server compares its client binding information with that of the other server, and any conflicts are resolved in an implementation-dependent manner.

When (and if) the resolution process completes, each server moves into the NORMAL state.

10. Safe Period

Due to the restrictions imposed on each server while in COMMUNICATION-INTERRUPTED state, long-term operation in this state is not feasible for either server. One reason that this state exists at all is to allow the servers to easily survive transient network communications failures of a few minutes to a few days (although the actual time periods will depend a great deal on the DHCP activity of the network in terms of arrival and departure of DHCP clients on the network).

Eventually, when the servers are unable to communicate, they will have to move into a state where they no longer can re-integrate without some possibility of a duplicate IP address allocation. There are two ways that they can move into this state (known as PARTNER-DOWN).

First, they can be informed by external command that the partner server is, indeed, down. In this case, there is no difficulty in moving into the PARTNER-DOWN state, since it is an accurate reflection of reality and the protocol has been designed to operate correctly (even during re-integration) if, when in PARTNER-DOWN state, the partner is, indeed, down.
The other way arises when the servers are running unattended for extended periods. For this case, the option is provided to configure a "safe-period" into each server. This OPTIONAL safe-period is the period after which either the Primary or Secondary Server will automatically transition to PARTNER-DOWN from COMMUNICATION-INTERRUPTED state. If this transition is completed and the partner is not down, then the possibility of duplicate IP address allocations will exist.

The goal of the "safe-period" is to allow network operations staff some time to react to a server moving into COMMUNICATION-INTERRUPTED state. During the safe-period the only requirement is that the network operations staff determine whether both servers are still running -- and if they are, either fix the network communications failure between them or take one of the servers down before the expiration of the safe-period.

The length of the safe-period is installation dependent, and depends in large part on the number of unallocated IP addresses within the subnet address pool and the expected frequency of arrival of previously unknown DHCP clients requiring IP addresses. Many environments should be able to support safe-periods of several days.

During the safe-period, either server will allow renewals from any existing client. The only limitation concerns the need for IP addresses for the DHCP server to hand out to new DHCP clients and the need to re-allocate IP addresses to different DHCP clients.

The number of "extra" IP addresses required is equal to the expected total number of new DHCP clients encountered during the safe-period. This is dependent only on the arrival rate of new DHCP clients, not the total number of outstanding leases on IP addresses.
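
The sizing relationship above can be worked through with a short sketch. It is illustrative only. The function names, the hourly units, and the assumption of a constant average arrival rate are not from this draft.

```python
import math


def extra_addresses_needed(new_clients_per_hour: float,
                           safe_period_hours: float) -> int:
    """Expected number of new DHCP clients arriving during the safe-period.

    Assumes a constant average arrival rate of previously unknown clients.
    The number of outstanding leases does not enter into the result.
    """
    if new_clients_per_hour < 0 or safe_period_hours < 0:
        raise ValueError("rate and safe-period must be non-negative")
    return math.ceil(new_clients_per_hour * safe_period_hours)


def max_safe_period_hours(free_addresses: int,
                          new_clients_per_hour: float) -> float:
    """Longest safe-period the currently unallocated addresses can cover."""
    if new_clients_per_hour <= 0:
        return math.inf  # no new clients expected, so no limit from this rule
    return free_addresses / new_clients_per_hour
```

Under these assumptions, a subnet that sees two new clients per hour would need 144 spare addresses to cover a three-day (72 hour) safe-period, and a subnet with 100 unallocated addresses and four new clients per hour could cover a safe-period of about 25 hours.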
In the unlikely event that a relatively short safe-period of an hour is all that can be used (given a dearth of IP addresses or a very high arrival rate of new DHCP clients), even that can provide substantial benefits in allowing the DHCP subsystem to ride through minor problems that could occur and be fixed within that hour. In these cases, no possibility of duplicate IP address allocation exists, and re-integration after the failure is solved will be automatic and require no operator intervention.

11. Open Issues

A number of details remain to be worked out. They are as follows:

1. Level of Agreement and Completion

This draft is incomplete in two senses. First, none of the authors agree with everything written, and quite a number of issues remain to be worked out among the various authors (to say nothing of the rest of the community). Second, this draft is not yet complete enough to support creation of inter-operable implementations.

However, we believe that even though this draft is very much a work in progress, there is value in sharing it with the rest of the DHCP community in its current form.

2. Failover Port

We need to resolve whether the Failover protocol uses the same port as the DHCP protocol or a different one. In the interests of allowing implementation of the Failover protocol by a different process or sub-process, having it use a different port seems reasonable.

3. High Level Operations

While the detailed operations are beginning to come together, the higher level operations (like re-integration) are, as yet, incompletely specified. This will be rectified in a later revision.

4. Option Spaces

The draft currently reflects some rather fuzzy goals of using DHCP options where they apply but also defining new options. It uses the "user defined option space" for this, which is probably not a good idea.
Perhaps the DHCP Panel will produce a larger option space in which all of these options can be defined, or perhaps (as is written in the draft) this protocol will just have to define entirely unique options.

5. Subnet Level Granularity

This protocol talks about a server being in one state or another; however, the desire is for this protocol to operate independently in each address pool for which a primary and secondary server are defined. In this way, the "server" state really refers to the "subnet" state. Once the protocol is validated, the editing work to make it operate at subnet granularity will be performed.

6. Secondary Server Communications with DHCP Clients

There are two situations where we may want to allow the secondary server to communicate with DHCP clients even though the secondary can communicate with the primary and would normally be unresponsive to DHCP client requests.

The first situation which deserves consideration is where the secondary has given a DHCP client a lease on an IP address when it was not able to communicate with the primary, and then subsequently the secondary becomes able to communicate with the primary. When the client unicasts its DHCPREQUEST to the secondary to renew its lease, the secondary will not be able to communicate with the client (as this protocol is defined). Should we allow the secondary to extend the lease for the DHCP client and then inform the primary of that extension using the DHCPBNDUPD message in the same way as the primary uses that message?

The second situation arises where a client can only communicate with the secondary due to some network failure, but the primary and secondary server can communicate. As written, the protocol will not allow the secondary to offer a lease to the DHCP client, but it would be straightforward to modify the protocol to allow the secondary to do so.
The only difficult part of this change to the protocol would be to suggest how the secondary would know that the DHCP client could talk only to the secondary. But, given that if the DHCP primary could talk to the DHCP client, the secondary would expect to hear about it in DHCPBNDUPD messages at some point, the absence of such messages could be used as a signal to communicate with the DHCP client in question.

7. UDP or TCP

There has been much debate about the utility of using UDP for the failover protocol, since it doesn't supply guaranteed delivery. Certainly rebuilding TCP out of UDP would be a mistake. Some factors to consider in this debate are as follows:

First, it is important to recognize that mere receipt of a packet by the other server in the pair (e.g., receipt of a DHCPBNDUPD packet by the secondary server) is not sufficient for the primary to update its own bindings database with new information about what the secondary knows. In all cases of transfers of bindings information, the receiver of a DHCPBNDUPD message MUST update its own stable storage prior to replying with a DHCPBNDACK message (except in the marginal case where all of the updates are rejected). An action is required by the receiving server and an explicit ACK is needed by the sending server to ensure the integrity of the protocol. So, just knowing that the other server has received a Failover protocol packet is not intrinsically interesting.

Second, the DHCP protocol, both the client and server side, is being implemented in progressively smaller and smaller machines. While this progression is most evident in DHCP clients, there exist implementations today of DHCP servers embedded in devices that are by no stretch of the imagination traditional "servers" running mainstream operating systems. In many ways, the Failover protocol is very well suited to such devices.
Adding additional protocol infrastructure requirements to implement the Failover protocol could easily prevent its implementation in devices that in some ways need it most.

Third, there are only a few cases where the Failover protocol requires guaranteed delivery of packets. In particular, the normal Primary to Secondary DHCPBNDUPD message does not have to be delivered reliably. The consequences of lost DHCPBNDUPD messages are handled by the use of the MDLI, for the simple reason that since these messages are "lazy", they may not get delivered because of a server failover prior to their transmission.

Given that the protocol is robust in the face of loss of either a DHCPBNDUPD message or a DHCPBNDACK message, a technique known as "fire and forget" may be used with this protocol and two cooperating implementations. If the DHCPBNDACK message contains all of the information originally in the DHCPBNDUPD message, then the DHCPBNDUPD message may be transmitted and forgotten by the sending server (typically the primary). When and if the secondary receives the DHCPBNDUPD and replies with a DHCPBNDACK message and the primary receives it, the primary will update its stable storage with a new picture of what the secondary knows about the lease time. If either of these messages is lost, the only downside is that the DHCP client associated with the binding in question may receive a shorter lease for one lease period than it would otherwise.

This "fire and forget" technique could substantially ease both the complexity of implementation and the memory requirements of an implementation of the Failover protocol, especially where two servers are communicating over a very slow link.

12. Acknowledgments

Ralph Droms started it all, by sketching out an initial interserver draft that embodied ideas from several past IETF meetings.
In that draft, he acknowledged contributions by Jeff Mogul, Greg Minshall, Rob Stevens, Walt Wimer, Ted Lemon, and the DHC working group.

Kim Kinnear and Bob Cole each extended that draft, separately and then together, until they created an interserver draft that supported any number of servers. The complexity of that approach was just too great, and led to a much simpler approach embodied in the first Failover draft by Greg Rabil, Mike Dooley, Arun Kapur, and Ralph Droms. This draft posited only two servers -- a primary and a secondary.

Kim Kinnear then wrote the Safe Failover draft to layer on top of the Failover draft and increase its robustness in the face of certain rare network failures.

At the spring 1998 IETF meeting in LA, the DHC working group said that they wanted a merged Failover and Safe Failover draft. Steve Gonczi and Bernie Volz stepped up and produced the raw material for such a merged draft, along with a new message format designed around DHCP options and other extensions and clarifications. Kim Kinnear edited their work into draft format and made other changes, and that is what you have in your hands.

Many people have reviewed the various drafts that went into this result. At American Internet, ideas have been contributed by Mark Stapp, Brad Parker, and Ellen Garvey. Glenn Waters of Bay Networks contributed ideas and enthusiasm to make a Failover protocol that was both "safe" and "lazy".

13. References

[1] Droms, R., "Dynamic Host Configuration Protocol", RFC 2131, March 1997.

[2] Alexander, S., Droms, R., "DHCP Options and BOOTP Vendor Extensions", RFC 2132, March 1997.

[3] Rabil, G., Dooley, M., Kapur, A., Droms, R., "DHCP Failover Protocol", draft-ietf-dhc-failover-00.txt.

[4] Gudmundsson, O., "Security Architecture for DHCP", draft-ietf-dhc-security-arch-00.txt.

14. Authors' Information

Ralph Droms
323 Dana Engineering
Bucknell University
Lewisburg, PA 17837
Phone: (717) 524-1145
EMail: droms@bucknell.edu

Greg Rabil, Mike Dooley, Arun Kapur
Quadritek Systems, Inc.
10 Valley Stream Parkway, Suite 240
Malvern, PA 19355
Phone: (800) 208-2747
EMail: grabil@quadritek.com
       mdooley@quadritek.com
       akapur@quadritek.com

Kim Kinnear
American Internet Corporation
4 Preston Ct.
Bedford, MA 01730-2334
Phone: (781) 276-4587
EMail: kinnear@american.com

Steve Gonczi, Bernie Volz
Process Software Corporation
959 Concord St.
Framingham, MA 01701
Phone: (508) 879-6994
EMail: gonczi@process.com
       volz@process.com