Network Working Group                                         T. Asveren
Internet-Draft                                            Sonus Networks
Expires: June 13, 2008                                          U. Bodin
                                                                  Operax
                                                       December 11, 2007


                 Diameter State Recovery Considerations
                draft-asveren-dime-state-recovery-02.txt

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on June 13, 2008.

Copyright Notice

Copyright (C) The IETF Trust (2007).

Abstract

This document discusses parameters to consider, different approaches and design strategies to synchronize and/or recover state in Diameter applications after failure of an active instance.

Table of Contents

1. Introduction
2. Terminology
3. Session State and the Need for Recovery
4. Proprietary Mechanisms
5. Protocol Assisted State Recovery
   5.1. Service Models
   5.2. Parameters to Consider
        5.2.1. Notification of the Peer About Failure
        5.2.2. Transfer of Session Data
        5.2.3. Backup Server Selection
        5.2.4. Timing of State Reconstruction
   5.3. Approaches
        5.3.1. Using a New Session
        5.3.2. Backup Instance Triggered Recovery
6. IANA Considerations
7. Security Considerations
8. Acknowledgments
9. Normative References
Authors' Addresses
Intellectual Property and Copyright Statements

1. Introduction

There are a variety of Diameter applications defined to perform different tasks. For some of these applications, e.g. the Diameter Credit Control Application, synchronizing and/or recovering state for ongoing sessions after failure of a Diameter endpoint is desirable. The recovery could be achieved by a proprietary mechanism, could be assisted by protocol mechanisms, or could be a combination thereof. This document focuses on issues associated with protocol assisted state recovery.

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [2].

The following terms are used to describe entities and their functionality in this document.

Ongoing Session
   A Diameter session for which at least the first transaction, but not the last transaction, has been completed according to the application message flow.

Terminated Session
   A Diameter session that existed in the past, for which the last transaction according to the application message flow has been completed.

Initial message
   A Diameter message used to create a new Diameter session.

Mid-session message
   A Diameter message used to refresh or modify an existing Diameter session.

Service Instance
   An instance of service provided by a Diameter application to another entity, e.g. charging or authentication services.

Diameter Transaction
   A Diameter request/answer pair.

3. Session State and the Need for Recovery

Some Diameter applications make use of sessions consisting of multiple transactions. The context necessary to process or trigger further messages in an ongoing session constitutes the session state.

In multi-transaction sessions, it is possible that one of the endpoints fails during a session. Depending on the application, it may not be possible or desirable to terminate the corresponding service instance. In such a case, it is necessary either to utilize a backup node which can process messages for the ongoing session, or to use a new session without terminating the service instance.

   Diameter       Active        Backup
     Peer        Instance      Instance
      |             |            |
      |----REQ1---->|            |
      | (session1)  |            |
      |             |            |
      |<---ANS1-----|            |
      | (session1)  |            |
      |             |            |
      |          Active          |
      |         Instance         |
      |           Fails          |
      |                          |
      |----REQ2----------------->|
      | (session1)               |
      |                          |
      |<---ANS2------------------|
      | (session1)               |
      |                          |

         Figure 1: Session Failover to Backup Instance

Another important aspect of failing instances is the possibility of hanging resources on the peer Diameter entity. This could happen if the peer Diameter entity does not clean up session state unless the session is terminated according to the expected application message flow. It should be noted that while state recovery is a desirable feature for certain applications, hanging resources are unacceptable for all applications. Hence, although some of the mechanisms described in this document could be used to prevent the occurrence of such a case, it is recommended that application layer mechanisms, e.g. application layer timers, be used for this purpose. Nonetheless, certain strategies mentioned in this document could be used to expedite session state cleanup after failovers.

4. Proprietary Mechanisms

Proprietary mechanisms do not assume any specific behavior from their peers. They usually rely on some form of state replication between active and backup instances.

   +---------+               +----------+
   | Diameter|<------------->|  Active  |
   |  Peer   |    Session    | Instance |
   +---------+   Messaging   +----------+
                                  ^
                                  | Session
                                  | State
                                  | Replication
                                  V
                             +----------+
                             |  Backup  |
                             | Instance |
                             +----------+

      Figure 2: Data Replication with a Proprietary Mechanism

It should be noted that Figure 2 is just an abstract representation of proprietary data replication between active and backup instances. Actual implementations may vary depending on the mechanisms used.

Proprietary state synchronization is a common technique utilized by Public Switched Telephone Network equipment vendors to provide "five nines" (99.999%) reliability. There are also initiatives to define a standard set of APIs for platforms/middleware providing data synchronization services, e.g. the Application Interface Specification of the Service Availability Forum.

Proprietary data replication between active and backup instances may be asynchronous in nature. This means that it may not provide lossless state replication at all times. Hence, after a failover to a backup instance, some session states may have been lost and other states may be wrongly kept by the backup instance. The latter happens when sessions were terminated through session signalling to the initially active instance, but the removal of the corresponding session state was not properly reflected in the data replication process.

5. Protocol Assisted State Recovery

Protocol assisted state recovery relies on the contents of the messages exchanged between Diameter entities.

5.1. Service Models

For each Diameter session, Diameter messaging happens between a client and a server. Although it is not a sender or receiver of Diameter messages, the physical service/resource provided is also a factor to consider when designing state recovery mechanisms. The physical resource/service is application dependent; it could be, for example, bandwidth allocated on a router for a QoS application or voice transfer resources used for a prepaid voice call application. Depending on the Diameter application, the physical resource/service could be at the client or at the server side. For example, for the Diameter Credit Control Application (DCCA) the physical resource is controlled by the client, whereas for a QoS application with a push scenario it is controlled by the server.

If a proprietary data replication mechanism which is not lossless is used between active and backup instances to support failover, it may be desirable to make use of the data present in the physical resource/service. This case can benefit from a synchronization phase before session data is transferred for the purpose of rebuilding lost state.

The physical resource/service could be used to extract some information regarding the session state to be reconstructed. For certain scenarios this information could be enough for state reconstruction; in others it could be used in addition to information obtained via other means. For example, with a proprietary data replication mechanism, failovers could be followed by a synchronization phase based on information obtained from the physical resource/service.
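
The synchronization phase described above amounts to comparing two views of the ongoing sessions: the one the backup instance received through replication and the one that can be extracted from the physical resource/service. The following Python sketch illustrates that comparison only. The names `SyncResult` and `reconcile_sessions`, and the use of plain session identifier strings as keys, are assumptions made for illustration; they are not defined by this document or by any Diameter application.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class SyncResult:
    """Outcome of comparing replicated state with the physical resource."""
    missing: Set[str]   # in use on the resource, but lost by replication
    stale: Set[str]     # wrongly kept by the backup, no longer in use
    in_sync: Set[str]   # known to both sides


def reconcile_sessions(replicated: Iterable[str],
                       resource_reported: Iterable[str]) -> SyncResult:
    """Hypothetical synchronization step run by a backup instance.

    `replicated` holds the session identifiers received through a
    (possibly lossy) proprietary replication mechanism.
    `resource_reported` holds the session identifiers for which the
    physical resource/service still holds resources.
    """
    replicated_ids = set(replicated)
    resource_ids = set(resource_reported)
    return SyncResult(
        missing=resource_ids - replicated_ids,
        stale=replicated_ids - resource_ids,
        in_sync=replicated_ids & resource_ids,
    )


if __name__ == "__main__":
    result = reconcile_sessions(
        replicated=["s1", "s2", "s3"],
        resource_reported=["s2", "s3", "s4"],
    )
    print("request session data for:", sorted(result.missing))
    print("clean up:", sorted(result.stale))
```

In this sketch, sessions in `missing` are candidates for an explicit transfer of session data, while sessions in `stale` are candidates for cleanup. How either action is carried out depends on the application.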

Below is a conceptual diagram of DCCA client side state recovery utilizing the state kept by the service control logic.

          +-----+
          | +-------+
          | |  (2)  |
---(1)--->| |Service|
Service   | | Data-1|
Start     | +-------+            +---------+
Request   |     |                |         |
          |     |-----(3)------->|         |
          |     |Credit Control  |  DCCA   |
          |     | Request for    | Client  |---(4)----->
          |     | Service Data-1 |  Logic  | CCR(Initial)
          |     |                | (Active)|
          |     |                |         |<---(5)------
          |     |<-----(6)-------|         | CCA(Initial)
          |     | Grant Service  +---------+
          |     |                     |
          |  S  |                    (7)
          |  e  |                DCCA Client
          |  r  |               Logic (Active)
          |  v  |                   fails
          |  i  |                     |
          |  c  |                    (8)
          |  e  |                DCCA Client
          |     |               Logic (Standby)
          |  C  |               detects failure
          |  o  |                     |
          |  n  |                +---------+
          |  t  |<-----(9)-------|         |
          |  r  | Request for    |         |
          |  o  | State Retrieval|  DCCA   |
          |  l  |                | Client  |
          |     |-------(10)---->|  Logic  |
          |     | Credit Control |(Standby)|---(11)---->
          |     | Request for    |         | CCR(Initial)
          |     | Service Data-1 |         |
          |     |                |         |<---(12)-----
          |     |                |         | CCA(Initial)
          |     |                |         |
          |     |                |         |---(13)---->
          |     |                |         | CCR(Update)
          |     |                |         |
          |     |                |         |<---(14)-----
---(15)-->|     |                |         | CCA(Update)
Service   |     |                |         |
End       |     |                |         |---(16)---->
Request   |     |                |         | CCR(Terminate)
          |     |                |         |
          |     |                |         |<---(17)-----
          +-----+                +---------+ CCA(Terminate)

   Figure 3: Using Service Information for DCCA Client Side State Recovery

5.2. Parameters to Consider

There are several aspects which may be important for a protocol assisted session state recovery mechanism. Depending on the strategy utilized, they may or may not be part of the design choices for such a mechanism.

5.2.1. Notification of the Peer About Failure

Usually it is necessary for the remote peer to be informed about the failure of the active instance in the context of protocol assisted state recovery.
This could be achieved in different ways:

Application Layer Timers
   Application layer timers could be utilized to send new requests periodically. The lack of a new request, the lack of an answer to a sent request, or the receipt of a DIAMETER_UNABLE_TO_DELIVER error answer could indicate that the peer Diameter entity has failed.

Notification from Standby Instance
   After failure of the active instance, the standby instance can send a message to the remote Diameter peer to inform it about the failure of the active instance. This method requires the standby instance to know the identities of the remote Diameter peers with which the failed active instance had ongoing sessions. This information could be exchanged by a proprietary data replication mechanism. Alternatively, the standby instance could have a configured list of remote peers and notify all of them.

5.2.2. Transfer of Session Data

For protocol assisted recovery it is necessary to supply enough information to the backup instance so that session state can be constructed. What constitutes session state data needs to be defined on a per application basis. Also, in certain cases (e.g. when a separate mechanism for state replication is used in combination with protocol assisted state recovery), the transfer of session data may be preceded by a state synchronization phase. For example, a generic message providing a list of all active sessions could be used for such a synchronization phase.

Some approaches to transfer session data include:

Using a New Session
   Upon detection of the failure of the active instance, the remote Diameter peer may start a new session without terminating the service instance.

Using Application Messages
   Data necessary to reconstruct the session state may be transferred in an application defined message by AVP(s) specifically defined for that purpose. Alternatively, an AVP may be used to flag that all data carried in the message is sent for the purpose of state synchronization.

Using a Generic Message
   Data necessary to reconstruct session state may be transferred in a message specifically defined for that purpose. Such a message may carry state information for one or multiple sessions.

5.2.3. Backup Server Selection

A Diameter peer needs to know the identity of the backup instance, so that it can send the data necessary to reconstruct session state. Furthermore, load balancing of the ongoing sessions across different backup instances may be necessary as well, to prevent overloading of backup entities.

Active Instance Guided Selection
   The active instance could communicate the identity of the backup instance(s) to the peer Diameter entity with an AVP. Information about how the load should be distributed among multiple backup instances could be communicated as well.

Backup Instance Guided Selection
   If the notification of the peer Diameter entity about the failure of the active instance is performed via a message sent by the standby instance, the identity of the backup instance would be known to the peer Diameter entity. This message could also carry information about other backup instances and load sharing information.

Selection Based on Configuration
   The Diameter peer may know the identities of backup servers through configuration and try to load share ongoing sessions based on a locally defined algorithm. For requests which are rejected by a standby instance with a DIAMETER_TOO_BUSY error answer, another standby instance could be tried.

5.2.4. Timing of State Reconstruction

When state reconstruction should happen may vary depending on the application.
The following two models are foreseen:

State Reconstruction After Failure
   It may be necessary to reconstruct the state as soon as the backup instance detects failure of the active instance. This model is useful when the state for ongoing sessions is necessary to generate answers for requests belonging to new sessions. Care should be taken when determining the necessary information for such cases. What is needed may be some cumulative data based on session states rather than the per session information. This could impact the design choices to recover/replicate the data, or even the choice between a proprietary mechanism and protocol assisted recovery.

   Another use case is when autonomous requests need to be generated from the side where the active instance has failed. In such a situation, the backup instance needs to know the ongoing sessions immediately after it detects failure of the active instance, so that it can generate such requests.

   If state reconstruction after failure is needed, notification of the Diameter peer about the failure should be done by the backup instance.

State Reconstruction Upon Receipt of a Request
   For certain applications, it could be enough if a backup server can reply to requests for ongoing sessions after the failure of the active instance. In such scenarios, state information contained in the new requests for ongoing sessions (i.e. mid-session messages) could be used to reconstruct session state on the standby instance.

5.3. Approaches

The choice between a proprietary and a protocol assisted state recovery mechanism is not a straightforward one. Depending on the application and the reliability level required, a detailed analysis needs to be done to justify usage of one of the methods.

If it is desired to use protocol assisted recovery, the parameters discussed in Section 5.2 need to be considered.
It should be noted that choices made for different parameters are not always independent of each other. For example, if state reconstruction immediately after failure detection is necessary, the strategy of using a new session to transfer session data cannot be utilized.

Below, two different approaches are discussed in detail.

5.3.1. Using a New Session

As mentioned in Section 5.2.2, a new session can be used to rebuild state after failure. This approach can be sufficient if immediate state reconstruction after failure is not needed, that is, if knowledge of the history of the session is not needed to continue providing the service of the failed over Diameter node.

An example diagram is given in Figure 3. It focuses on events happening on the client side for a DCCA session. On the server side, the sessions which were created by the active instance are cleaned up after expiry of the Tcc timer.

A variant of using a new session for rebuilding state is to use application messages. For example, regular mid-session messages maintaining soft state can be used if they contain enough information for the desired state reconstruction. Such messages could contain an AVP carrying a flag indicating that the message is a mid-session message and not an initial message issued to create a completely new session.

The ability to distinguish between recreated sessions and new sessions can be important to some applications. For example, it may be desirable to give recreated sessions preference over new sessions for resources controlled by a Diameter server.

5.3.2. Backup Instance Triggered Recovery

In case immediate state reconstruction is desired or strictly needed by a backup Diameter instance, this instance may need to trigger the transfer of session data to recover state. This requires session data to be available and reachable to the backup Diameter instance.
Possible locations of such data include the physical resource/service controlled by the failed over Diameter instance and the entities utilizing the service offered by the Diameter instance (i.e. entities issuing Diameter requests for the offered service).

As mentioned in Section 5.2.2, application messages or a generic message can be used to transfer session data for state reconstruction. Application messages or a generic message transferring the desired session data could be preceded by a generic synchronization message providing the backup Diameter instance with a complete list of all active sessions. With that list, the backup Diameter instance can distribute the recovery of session data over time. This may be useful if this instance is to start providing its service immediately instead of waiting until the state reconstruction process is completed. Requesting session data in parallel with answering service requests requires, however, that a period with incomplete session state after the backup Diameter instance starts providing the service be acceptable.

A generic synchronization message can also be useful in a combined solution using both a proprietary mechanism for state replication and protocol assisted state recovery. The complete list of all active sessions provided in such a message can be compared with the list of sessions replicated through a proprietary mechanism. Thereby a potential mismatch can be identified, and missing session data can be explicitly requested by the backup Diameter instance.

6. IANA Considerations

This document does not require any IANA action.

7. Security Considerations

Certain procedures in protocol assisted state recovery, e.g. notification of the Diameter peer about failure of an active instance by the standby instance, could introduce security risks.
It is expected that the use of IPsec/TLS together with a transitive trust model should eliminate these concerns.

8. Acknowledgments

9. Normative References

[1]  Calhoun, P., Loughney, J., Guttman, E., Zorn, G., and J. Arkko, "Diameter Base Protocol", RFC 3588, September 2003.

[2]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

Authors' Addresses

Tolga Asveren
Sonus Networks
4400 Route 9 South
Freehold, NJ, 07728
USA

Email: tasveren@sonusnet.com

Ulf Bodin
Operax
Aurorum Science Park 8
SE-977 75 Lulea
Sweden

Email: uffe@operax.com

Full Copyright Statement

Copyright (C) The IETF Trust (2007).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgment

Funding for the RFC Editor function is provided by the IETF Administrative Support Activity (IASA).