Network Working Group                                           D. Zhang
Internet-Draft                                                    Huawei
Intended status: Informational                                  D. Zhang
Expires: April 21, 2011                                  Huawei Symantec
                                                        October 18, 2010


       Redundancy Considerations for IPsec Distributed Deployment
           draft-zhang-ipsecme-distributed-redundancy-00.txt

Abstract

   This informational document analyzes several critical issues in
   the failover of IPsec Gateways (IGs) distributed across different
   networks. It also lists and compares several candidate approaches
   to these issues. The objective of this work is to provide useful
   information for future research on the failover of multiple
   geographically distributed IGs.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on April 21, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Zhang & Zhang            Expires April 21, 2011                 [Page 1]

Internet-Draft     Redundancy for IPsec D-Deployment        October 2010

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before
   November 10, 2008. The person(s) controlling the copyright in some
   of this material may not have granted the IETF Trust the right to
   allow modifications of such material outside the IETF Standards
   Process. Without obtaining an adequate license from the person(s)
   controlling the copyright in such materials, this document may not
   be modified outside the IETF Standards Process, and derivative
   works of it may not be created outside the IETF Standards Process,
   except to format it for publication as an RFC or to translate it
   into languages other than English.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Comparison of Tight Clusters with Loose Clusters
   4.  Application Scenarios
     4.1.  Multi-Homing
     4.2.  High Available Home Agent
     4.3.  Datacenter
   5.  Failure Detection
   6.  Discovery of Replacing Gateways in a Switchover
   7.  Reconstruction of IPsec Channels
     7.1.  Hot Stand-by
     7.2.  Cold Stand-by
     7.3.  IKE Session Resumption
       7.3.1.  Tickets by IGs
       7.3.2.  Synchronizing IKE-SAs amongst IGs
   8.  Security Considerations
   9.  IANA Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References

1. Introduction

   In [PS], a cluster is defined as "...a set of two or more
   gateways, implementing the same security policy, and protecting
   the same domain". According to whether cluster members share an
   identical IP address, clusters can be further divided into tight
   clusters and loose clusters. The members of a tight cluster share
   the same IP address, while the members of a loose cluster do not.
   In practice, tight clusters are widely employed to improve the
   availability of IPsec gateway systems. [PS] explores many of the
   issues in building highly available tight clusters. However, the
   issues with highly available loose clusters are rarely discussed
   in previous work. This document lists several application
   scenarios for highly available loose IG clusters, analyzes several
   core issues in implementing them, and discusses candidate
   solutions.

2. Terminology

   HA: short for "highly available" (or "high availability").

   IG: short for "IPsec Gateway".

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

3. Comparison of Tight Clusters with Loose Clusters

   In a tight IG cluster, all members (i.e., IGs) share the same IP
   address. Therefore, little work needs to be done by users during a
   failure recovery process.
   In the best case, users may not even notice that a failure has
   occurred and that the functionality of one IG has been taken over
   by another IG in the cluster. However, because all the IGs in a
   tight cluster share the same IP address, they must be deployed on
   the same link and thus must be geographically adjacent. Typically,
   the IPsec gateways in a tight cluster are located within the same
   building, or even in the same room. In some IG clusters, state
   information (e.g., SAs) of different IGs needs to be synchronized
   in order to handle potential failures. The adjacency of the
   gateways benefits such synchronization.

   In a loose IG cluster, different IGs hold different IP addresses,
   which introduces more complexity into failover processes. For
   instance, when an IG takes over for a failed IG, users must be
   informed of the change so that they can update the IP address of
   the gateway in their IKE and IPsec SAs. Compared with tight IG
   clusters, the members of a loose IG cluster can be located in
   different networks and may also share load simultaneously. In the
   event of a failure, the users of a failed IG may be taken over by
   one or more of the other IGs in the cluster. These properties
   introduce more flexibility in deployment but also make the state
   synchronization of cluster members more difficult. In addition,
   when two members are located far from each other, the latency of
   the packets transported between them will be considerable.
   Link-layer state synchronization protocols such as [VRRP] can no
   longer be used.

4. Application Scenarios

   This section introduces some application scenarios for the
   distributed deployment of loose clusters.

4.1. Multi-Homing

   When an enterprise has multiple branches distributed in various
   places, it is common for the enterprise to connect its branches
   and headquarters to the Internet and to use IPsec VPNs to protect
   the communication transported over the Internet. For the purposes
   of traffic engineering and high availability, the enterprise may
   take advantage of the services of multiple ISPs. Therefore, there
   can be more than one IPsec gateway deployed at the border of the
   enterprise network; each of them connects to a different ISP and
   has a different IP address. Those IGs can be used to construct a
   loose IG cluster that provides a highly available service for
   users within the enterprise.

4.2. High Available Home Agent

   In Mobile IPv6, it is recommended to use IPsec to protect
   signaling messages transported between mobile nodes and their home
   agents. It has also been recognized that the failure of a home
   agent may affect the communications of a large number of users. To
   address this problem, a highly reliable solution has been proposed
   in [HARP]. This solution requires a mobile node to register in
   advance with different home agents that have different IP
   addresses. When a home agent fails, another home agent informs the
   affected mobile nodes through the pre-generated IPsec SAs. Because
   a home agent actually performs part of the functionality of an IG,
   in the rest of this document we refer to home agents as IGs as
   well in order to simplify the discussion.

4.3. Datacenter

   As an efficient and secure way to maintain large amounts of data,
   Internet Data Centers (IDCs) have been widely adopted by big
   organizations (e.g., banks, financial corporations, public medical
   systems, and governments) to store their crucial information
   assets.
   Because the loss of such data may have unacceptable consequences,
   most IDCs deploy security mechanisms that allow their users to
   access and transport their data securely, and it is common for
   IDCs to maintain multiple copies of critical data in different
   locations for disaster recovery.

   Now, consider an IDC with multiple geographically distributed
   branches. To simplify the discussion, we assume that an IPsec
   gateway (or several gateways organized as a tight cluster) is
   deployed at the boundary of the network of each branch. Therefore,
   a user (e.g., a corporation) can select an appropriate gateway to
   access the IDC, according to certain policies.

   In this case, it may not be feasible for users to use anycast or
   multicast to achieve failover and load balancing amongst the
   gateways of different IDC branches. Typically, the gateways of the
   IDC can be deployed thousands of miles away from one another.
   Because neither anycast nor multicast addresses can be aggregated,
   the adoption of the two technologies in global networks will raise
   route scaling issues. In addition, as illustrated in [ANALYSIS],
   when anycast is used, packets are delivered in a nondeterministic
   way, which may cause confusion in the trust model and result in
   errors during the execution of stateful protocols. Therefore, it
   is not recommended to use anycast together with IPsec [ANALYSIS].
   A practical design option is to associate different IPsec gateways
   of an IDC with different IP prefixes.

   An effective failover solution for the geographically distributed
   gateways of an IDC (which actually compose a loose cluster) can
   improve the disaster resistance of the IDC. When the network
   between a user and a gateway is broken due to network errors or
   disasters, the user can try to access the data through the gateway
   of another branch. In the remainder of this document, an example
   IDC is used to further discuss the issues with loose clusters.

   +---------------------------+    +----------------------------+
   |                           |    |                            |
   |           NW 1            |----|            NW 2            |
   |       +-------+           |    |          +-------+         |
   +-------|  GW1  |-----------+    +----------|  GW2  |---------+
           +-------+                           +-------+
               |                                   |
               |             +---------+           |
               +-------------|  User   |-----------+
                             +---------+

                      Figure 1. An example IDC

   As illustrated in Figure 1, the IDC has two branch networks, NW1
   and NW2, which are located in different cities and connected by a
   private link. NW1 has an IPsec gateway IG1 (GW1 in the figure),
   and NW2 has an IPsec gateway IG2 (GW2 in the figure). Under normal
   conditions, a user accesses the IDC through IG1. However, when the
   network between the user and IG1 is unavailable, the user needs to
   quickly switch to IG2 to access the IDC.

5. Failure Detection

   An important issue in developing a loose cluster and achieving
   failover is how to identify the failure of a cluster member. An
   intuitive approach is to have each user execute a liveness
   detection protocol (e.g., [DPD]) to monitor the condition of the
   gateway on the other side of the tunnel. However, when a gateway
   supports a large number of users, the overhead of processing the
   liveness checks sent by users may be considerable and may degrade
   the performance of the gateway.

   It is also possible to assign the job of liveness detection to the
   IGs themselves. That is, liveness detection protocols are executed
   among the IGs of a cluster. After the failure of an IG is
   detected, the work of the failed IG is taken over by another
   member of the cluster. The IG taking the place of the failed one
   needs to notify the affected users of the switchover (knowledge of
   these users is shared by the failed IG in advance). During this
   process, cryptographic mechanisms need to be adopted in order to
   protect the integrity of notification messages and enable users to
   assess the switchover claims inside the messages.
   For instance, a user can establish multiple IKE and IPsec SAs with
   different IGs in the cluster, an approach that has been adopted in
   [HARP]. Alternatively, an IG can generate a ticket for every other
   IG in the cluster; the ticket is signed with its private key. In a
   failover process, the replacing IG sends its ticket to a user. The
   user can then verify the signature associated with the ticket and
   decide whether to trust the new IG.

6. Discovery of Replacing Gateways in a Switchover

   When a user realizes that its current IG (e.g., IG1 in Figure 1)
   cannot work properly, the user needs to find an IG (e.g., IG2 in
   Figure 1) that can replace the failed one and provide the secure
   access service. The following are three ways in which the user can
   obtain the knowledge essential for a switchover.

   o  From the current IG. A user can get the information needed to
      access IG2 (e.g., its IP address) from IG1. The exchange of
      such information can be embedded into the IKE exchange or be
      carried out by additional signaling messages. The redirection
      mechanism for IKEv2 proposed in [REDIRECT] can potentially be
      extended to support this approach.

   o  From a third party. The address of IG2 can be offered by a
      third party such as a DNS server. In order to achieve this, new
      resource records need to be specified.

   o  Manual configuration. The system administrator of the user can
      pre-configure the information of the IGs in a cluster in a
      list. In particular, IG1 is designated as the master IG. After
      detecting the failure of the master IG, the user can select IG2
      from the candidate list and attempt to contact it to achieve
      failover. This method requires the IPsec equipment on the user
      side to maintain multiple IPsec peers for the purpose of
      failover. This functionality has been implemented in the IG
      products of multiple vendors.

7. Reconstruction of IPsec Channels

   During a switchover, in order to recover from the impact of the
   failed IG, a user needs to establish a secure channel with the
   replacing IG quickly and efficiently. This section discusses
   several candidate solutions.

7.1. Hot Stand-by

   In a hot stand-by solution, the IKE and IPsec states are
   synchronized between the IGs in a timely fashion. Taking the IDC
   in Figure 1 as an example, after the user and IG1 have established
   an IPsec tunnel successfully, the states related to the user are
   synchronized to IG2. In the event that IG1 fails and the user
   switches over to IG2, the negotiation process for generating IKE
   and IPsec SAs can be largely avoided. This solution has been
   widely adopted in developing tight clusters. In such tight
   clusters, a VRRP-like protocol is executed to synchronize state
   between cluster members. However, in certain circumstances,
   real-time synchronization may be difficult to guarantee. For
   instance, a gateway may crash before it can send the latest
   synchronization packets. Also, a user may have already sent a
   number of packets to the crashed gateway before the failure is
   detected. In both cases, the states on the replacing gateway and
   the user can be inconsistent. Such issues have been discussed in
   [PS]. In [HA], a solution is proposed to help users and replacing
   gateways synchronize their IKE/IPsec states (e.g., the latest
   sequence numbers of IPsec packets).

   Compared with tight clusters, more issues need to be considered
   when integrating hot stand-by into loose clusters. Firstly, two
   members of a loose cluster can be located thousands of miles from
   each other. In such a case, the latency in transporting
   synchronization packets can be considerable, and the network
   connecting the two members may be unreliable and have only limited
   bandwidth as well.
   Secondly, after receiving a synchronized SA, the IP addresses in
   the SA must be carefully replaced with the IP addresses of the
   local IG. Thirdly, a user and the replacing IG need to assign new
   indexes to the IKE and IPsec SAs generated before the switchover
   so as to avoid index conflicts.

7.2. Cold Stand-by

   Cold stand-by is the simplest solution for achieving failover in
   loose clusters. It does not require any state synchronization
   among cluster members. If cold stand-by is implemented in the IDC
   in Figure 1, when the failure of IG1 is detected, the user simply
   contacts IG2 and generates IKE and IPsec SAs from scratch. This
   process normally consumes a large amount of computing resources
   and time, which makes this solution relatively inefficient.

7.3. IKE Session Resumption

   IKE session resumption is effectively a "semi-hot" stand-by
   solution, which allows users and gateways to avoid generating
   their IKE SAs from scratch in failover scenarios. In [RESUMPTION],
   a quick recovery mechanism for IKEv2 state is proposed. After an
   IG reboots, the client and the IG can re-establish the previous
   IKE SAs while using fewer message exchanges and less computation.
   Although this solution is restricted to rebuilding state on the
   same IG, it can be extended to support failover in IG clusters.

7.3.1. Tickets by IGs

   This section introduces a solution that achieves IKE session
   resumption in loose clusters by maintaining IKE state information
   on the user side. When a user communicates with its original IG
   (e.g., IG1 in Figure 1), the IG encapsulates its IKE state in a
   ticket and sends the ticket to the user. The data structure of a
   ticket is specified in RFC 5723 [RESUMPTION]. When IG1 fails, the
   user sends the ticket obtained from IG1 to the replacing IG (e.g.,
   IG2 in Figure 1). Upon receiving the ticket, IG2 parses it and
   stores the IKE state locally. IG2 can then re-establish the IPsec
   SA with the user based on the resumed IKE SA.
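
   The ticket round trip just described can be sketched as follows.
   This is only an illustration under stated assumptions: the
   function names issue_ticket and accept_ticket, the JSON encoding,
   the field names in the example state, and the use of HMAC-SHA256
   with a key pre-shared by the cluster members are all hypothetical
   and are not the ticket format of [RESUMPTION]. Only integrity
   protection is shown; encryption of the ticket contents is omitted
   to keep the example within the Python standard library.

```python
import hashlib
import hmac
import json

TAG_LEN = 32  # length in bytes of an HMAC-SHA256 tag


def issue_ticket(cluster_key: bytes, ike_state: dict) -> bytes:
    """Original IG (IG1): serialize IKE state and append an integrity tag.

    cluster_key is a hypothetical secret shared in advance by the IGs.
    """
    payload = json.dumps(ike_state, sort_keys=True).encode("utf-8")
    tag = hmac.new(cluster_key, payload, hashlib.sha256).digest()
    return payload + tag


def accept_ticket(cluster_key: bytes, ticket: bytes) -> dict:
    """Replacing IG (IG2): verify the tag, then recover the IKE state."""
    if len(ticket) <= TAG_LEN:
        raise ValueError("ticket too short")
    payload, tag = ticket[:-TAG_LEN], ticket[-TAG_LEN:]
    expected = hmac.new(cluster_key, payload, hashlib.sha256).digest()
    # Constant-time comparison of the received and recomputed tags.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("ticket failed integrity check")
    return json.loads(payload.decode("utf-8"))


if __name__ == "__main__":
    key = b"secret-shared-by-IG1-and-IG2"  # illustrative value only
    state = {"spi_i": "a1b2", "spi_r": "c3d4", "user": "user@example.com"}
    ticket = issue_ticket(key, state)          # IG1 -> user
    print(accept_ticket(key, ticket) == state)  # user -> IG2
```

   In this sketch IG1 and IG2 never exchange per-user state; IG2
   accepts a ticket only if it was produced with the shared key, and
   rejects a ticket that was modified or issued under another key.
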
   This method requires no direct state synchronization between IG1
   and IG2. However, IG1 and IG2 must share certain secrets in
   advance so that IG2 can decrypt the ticket and learn the data
   encapsulated in it. In addition, both IGs need to agree on the
   format of the ticket. Otherwise, IG2 cannot interpret the data in
   the ticket.

7.3.2. Synchronizing IKE-SAs amongst IGs

   This section introduces a solution that achieves IKE session
   resumption in loose clusters by synchronizing IKE SAs amongst the
   IGs. In this solution, IKE SAs are synchronized amongst the
   cluster members. Therefore, the original IG only needs to send the
   index of a ticket to the associated user. When the original IG
   fails, the user can present the ticket to the replacing IG, which
   can then use the index to find the corresponding IKE state.

8. Security Considerations

   TBD.

9. IANA Considerations

   No assignments by the IANA are required.

10. Acknowledgements

   We would like to thank Paul Hoffman and Yaron Sheffer for
   discussion about the work and for their valuable advice.

11. References

11.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", RFC 2119, March 1997.

11.2. Informative References

   [ANALYSIS]
              Hagino, J. and E. Karupiah, "An analysis of IPv6
              anycast", draft-ietf-ipngwg-ipv6-anycast-analysis-02
              (work in progress), June 2003.

   [DPD]      Huang, G., Beaulieu, S., and D. Rochefort, "A
              Traffic-Based Method of Detecting Dead Internet Key
              Exchange (IKE) Peers", RFC 3706, February 2004.

   [HA]       Singh, R., Kalyani, G., Nir, Y., and D. Zhang,
              "Protocol Support for High Availability IKEv2/IPsec",
              draft-ietf-ipsecme-ipsecha-protocol-00 (work in
              progress), September 2010.

   [HARP]     Wakikawa, R., "Home Agent Reliability Protocol",
              draft-ietf-mip6-hareliability-06 (work in progress),
              July 2010.

   [PS]       Vidya, V., "IPsec Gateway Failover and Redundancy -
              Problem Statement and Goals",
              draft-vidya-ipsec-failover-ps-02 (work in progress),
              May 2007.

   [REDIRECT]
              Devarapalli, V. and K. Weniger, "Redirect Mechanism for
              the Internet Key Exchange Protocol Version 2 (IKEv2)",
              RFC 5685, November 2009.

   [RESUMPTION]
              Sheffer, Y. and H. Tschofenig, "Internet Key Exchange
              Protocol Version 2 Session Resumption", RFC 5723,
              January 2010.

   [VRRP]     Hinden, R., "Virtual Router Redundancy Protocol",
              RFC 3768, April 2004.

Authors' Addresses

   Dacheng Zhang
   Huawei
   KuiKe Building, No.9 Xinxi Rd.,
   Shangdi HaiDian District, Beijing
   China

   EMail: zhangdacheng@huawei.com


   Dong Zhang
   Huawei Symantec
   3rd Floor, Section D, Keshi Building, No.28, Xinxi Rd.,
   Shangdi HaiDian District, Beijing
   China

   EMail: zhangdong_rh@huaweisymantec.com