Network Working Group                                           D. Zhang
Internet-Draft                                                    Huawei
Intended status: Informational                                  D. Zhang
Expires: April 21, 2011                                  Huawei Symantec
                                                        October 18, 2010


       Redundancy Considerations for IPsec Distributed deployment
           draft-zhang-ipsecme-distributed-redundancy-00.txt

Abstract

   This informational document attempts to analyze several critical
   issues with failover of IPSec Gateways (IG) distributed across
   different networks.  Additionally, several candidate approaches to
   such issues are listed and compared.  The objective of this work is
   to provide useful information for the future research in the failover
   of multiple geographically distributed IGs.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 21, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of


Zhang & Zhang            Expires April 21, 2011                 [Page 1]

Internet-Draft      Redundancy for IPsec D-Deployment       October 2010


   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1.  Instroduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
   2.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . 3
   3.  Compare of Tight Clusters to Loose Clusters . . . . . . . . . . 3
   4.  Application Scenarios . . . . . . . . . . . . . . . . . . . . . 4
     4.1.  Multi-Homing  . . . . . . . . . . . . . . . . . . . . . . . 4
     4.2.  High Available Home Agent . . . . . . . . . . . . . . . . . 4
     4.3.  Datacenter  . . . . . . . . . . . . . . . . . . . . . . . . 4
   5.  Failure Detection . . . . . . . . . . . . . . . . . . . . . . . 6
   6.  Discovery of Replacing Gateways in a Switchover . . . . . . . . 6
   7.  Reconstruction of IPSec Channels  . . . . . . . . . . . . . . . 7
     7.1.  Hot Stand-by  . . . . . . . . . . . . . . . . . . . . . . . 7
     7.2.  Cold Stand-by . . . . . . . . . . . . . . . . . . . . . . . 8
     7.3.  IKE Session Resumption  . . . . . . . . . . . . . . . . . . 8
       7.3.1.  Tickets by IGs  . . . . . . . . . . . . . . . . . . . . 8
       7.3.2.  Synchronizing IKE-SAs amongst IGs . . . . . . . . . . . 9
   8.  Security Considerations . . . . . . . . . . . . . . . . . . . . 9
   9.  IANA Considerations . . . . . . . . . . . . . . . . . . . . . . 9
   10. Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . . 9
   11. References  . . . . . . . . . . . . . . . . . . . . . . . . . . 9
     11.1. Normative References  . . . . . . . . . . . . . . . . . . . 9
     11.2. Informative References  . . . . . . . . . . . . . . . . . . 9


Zhang & Zhang            Expires April 21, 2011                 [Page 2]

Internet-Draft      Redundancy for IPsec D-Deployment       October 2010


1.  Instroduction

   In [PS], a cluster is defined as "...a set of two or more gateways,
   implementing the same security policy, and protecting the same
   domain".  According to whether cluster members share an identical IP
   address, clusters can be further broken down into tight clusters and
   loose clusters.  The members of a tight cluster share a same IP
   address, while the members of a loose cluster do not.  In practice,
   tight clusters are widely employed to improve the availability of
   IPSec gateway systems.  In [PS], multiple issues in generating high-
   available tight clusters are well explored.  However, the issues with
   high-available loose clusters are rarely discussed in previous work.

   In this document, we list several application scenarios of high-
   available loose IG clusters and analyze several core issues with
   implementing high-available loose clusters and discuss the candidate
   solutions.

2.  Terminology

   HA: denotes high-available for short.

   IG: denotes IPSec Gateway for short.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Compare of Tight Clusters to Loose Clusters

   In a tight IG cluster, all members (i.e., IGs) share the same IP
   address.  Therefore, little work needs to be done by users during a
   failure recovery process.  In the best condition, users may even not
   notice that a failure has occurred and the functionality of an IG is
   taken over by another in the cluster.  Again, because, all the IGs in
   a tight cluster share a same IP address, they must be deployed on a
   same link and thus must be geographically adjacent.  Typically, the
   IPSec gateways in a tight cluster are located within a same building,
   or even in a same room.  In some IG clusters, state information
   (e.g., SAs) of different IGs needs to be synchronized for handling
   potential failures.  Obviously, the adjacency of the gateways can
   bring benefit in such synchronizations.

   In a loose IG cluster, different IGs hold different IP addresses,
   which will introduce more complexity in failover processes.  For
   instance, when an IG takes over a failed IG, users must be informed
   of the change so that they can update the IP address of the gateway
   in their IKE and IPSec SAs.  Compared with tight IG clusters, the


Zhang & Zhang            Expires April 21, 2011                 [Page 3]

Internet-Draft      Redundancy for IPsec D-Deployment       October 2010


   members of a loose IG cluster can be located in different network and
   may also load balance simultaneously.  In the event of a failure, the
   users of a failed IG may be taken over by one or more of other IGs in
   the cluster.  These properties introduce more flexibility in the
   deployment but also make the state synchronization of cluster members
   more difficult.  In addition, when two members are located far from
   each other, the transporting latency of the packets will be
   considerable.  The link-layer state synchronization protocols such as
   [VRRP] cannot be used as more.

4.  Application Scenarios

   Some application scenarios for distributed deployment of loose
   clusters are introduced in this section.

4.1.  Multi-Homing

   When an enterprise has multiple branches distributed in various
   places, it is common for the enterprise to connect its branches and
   headquarter with the Internet and use IPsec VPNs to protect the
   security of the communication transported over the Internet.  For the
   purpose of traffic engineering and high availability, the enterprise
   may take advantage of the services of multiple ISPs.  Therefore,
   there can be more than one IPSec gateway deployed at the border of
   the enterprise network; each of them contacts to a different ISP and
   has a different IP address.  Those IGs can be used to construct a
   loose IG cluster to provide high available service for users within
   the corporations.

4.2.  High Available Home Agent

   In Mobile IPv6, it is recommended to use IPSec to protect signaling
   messages transported between mobile nodes and their home agents.
   Also, it has been realized that the communications of a number of
   users may be affected by a failure of a home agent.  To address this
   problem, a high reliable solution has been proposed in [HARP].  This
   solution requires a mobile node to register at different home agents
   with various IP address in advance.  When a home agent is failed,
   another home agent informs the affected mobile nodes through the pre-
   generated IPSec SAs.  Because the home agent actually perform a part
   of functionality of IGs, in the rest document, we denote home agents
   and IGs as well in order to simplify the discussion.

4.3.  Datacenter

   As an efficient and security way to maintain large amounts of data,
   Internet Data Centers (IDCs) has been widely adopted by big
   organizations (e.g., banks, financial corporations, public medical


Zhang & Zhang            Expires April 21, 2011                 [Page 4]

Internet-Draft      Redundancy for IPsec D-Deployment       October 2010


   systems and governments) to store their crucial information
   properties.  Because the loss of such data may cause unaffordable
   results, most IDCs deploy security mechanisms for their users to
   securely access and transport their data, and it is common for IDCs
   to maintain multiple copies of critical data in different locations
   for disaster recovery

   Now, consider an IDC where there are multiple geographically
   distributed branches.  To benefit the discussion, we assume that an
   IPSec gateway (or several gateways which are organized as a tight
   cluster) is deployed at the boundary of the network of each branch.
   Therefore, a user (e.g., a corporation) can select an appropriate
   gateway to access the IDC, according to certain policies.

   On this occasion, it may be not feasible for users to use anycast and
   multicast to achieve failover and load balancing amongst the gateways
   of different IDC branches.  Typically, the gateways of the IDC can be
   deployed thousands miles away from one another.  Because both anycast
   and multicast addresses cannot be aggregated, the adoption of the two
   technologies in global networks will raise route scaling issues.  In
   addition, as with the illustration in [ANALYSIS], when using anycast,
   packets are delivered in a nondeterministic way, which may cause
   confusions in trust model and result in errors during executions of
   stateful protocols.  Therefore, it is not recommended to use anycast
   and IPSec cooperatively [ANALYSIS].

   A practical design option is to associate different IPSec gateways of
   an IDC with different IP prefixes.  An effective failover solution
   for the geographically distributed gateways of an IDC (which actually
   compose a loose cluster) can effectively improve the disaster
   resistance capability of the IDC.  When the network between a user
   and a gateway is broken due to network errors or disasters, the user
   can try to access the data through the gateway of another branch.

   In the remainder of this document, an example IDC will be used as an
   example to further discuss the issues with loose clusters.

   +---------------------------+     +----------------------------+
   |                           |     |                            |
   |          NW 1             |---- |             NW 2           |
   |       +-------+           |     |          +-------+         |
   +-------|  GW1  |-----------+     +----------|  GW2  |---------+
           +-------+                            +-------+
               |                                    |
               |             +---------+            |
               +-------------|  User   |------------+
                             +---------+
                      Figure 1. An example IDC


Zhang & Zhang            Expires April 21, 2011                 [Page 5]

Internet-Draft      Redundancy for IPsec D-Deployment       October 2010


   As illustrated in Figure 1, the IDC has two branch networks, NW1 and
   NW2, which are located in different cities and connected with a
   private link.  NW1 has an IPsec gateway IG1, and NW1 has an IPsec
   gateway IG2.  In a normal condition, a user accesses the IDC through
   IG1.  However, when the network between the user and IG1 is un-
   available, user needs to quickly transfer to IG2 to access the IDC.

5.  Failure Detection

   An important issue in developing a loose cluster and achieving
   failover is how to identify the failure of a cluster.  An intuitive
   approach is to enable a user to execute aliveness detection protocols
   (e.g., [DPD]) to monitor the condition of the Gateway on the other
   side of the tunnel.  However, when a gateway supports a large amount
   of users, the overhead of processing the aliveness checking queries
   sent from users may be considerable and may negatively influence the
   performance of the gateway.

   It is also possible to assign the job of aliveness detection to IGs
   themselves.  That is, aliveness detection protocols are executed on
   the IGs of a cluster.  After the failure of an IG is perceived, the
   work of the failed IG will be taken over by another member in the
   cluster.  The IG taking the place of the failed one needs to notify
   the affected users of the switchover (the knowledge of the users are
   shared from the failed IG in prior).  During this process
   cryptographic mechanisms need to be adopted, in order to protect the
   integrity of notification messages and enable users to assess the
   switchover claims inside the messages.  For instance, a user can to
   generate multiple IKE and IPSec SAs with different IG in the cluster,
   which has been adopted in HARP.  Apart from this solution, an IG can
   generate a ticket for every other IG in the cluster; the ticket is
   signed with its private key.  In a failover process, the replacing IG
   sends its ticket to a user.  Thus, the user can verify the signature
   associated with the ticket and decide whether to trust the new IG.

6.   Discovery of Replacing Gateways in a Switchover

   When a user realizes that its current IG (e.g., G1 in Figure 1)
   cannot work appropriately, the user needs to find the IG (e.g., G2 in
   Figure 1) which can replace the failed one to perform security access
   service.  Follows are three types of methodologies which the user can
   use to get the knowledge essential for a switchover.

   o  From the current IG.  A user can get the information to assess G2
      (e.g., the IP address) from IG1.  The exchange of such information
      can be embedded into the IKE exchange or be carried out by
      addition signal messages.  The redirection mechanism for IKEv2
      proposed in [REDIRECT] can be potentially extended to support this


Zhang & Zhang            Expires April 21, 2011                 [Page 6]

Internet-Draft      Redundancy for IPsec D-Deployment       October 2010


      approach.

   o  From a third party.  The address of G2 can be offer by a third
      party such as a DNS server.  In order to achieve this, new
      resource records need to be specified.

   o  Manual configuration.  The system manager of the user can pre-
      configure the information of the IGs in a cluster in a list.
      Particularly, G1 is assigned as the master IG.  After detecting
      the failure of the master IG, the user can select IG2 from the
      candidate list and attempts to contact it to achieve failover.
      This method requires the IPSec equipments on the user side to
      maintain multiple IPSec peers for the purpose of failover.  This
      functionality has been implemented in the IG products of multiple
      vendors.

7.  Reconstruction of IPSec Channels

   During a switchover, in order to recover the influence introduced by
   the failed IG, a user needs to generate a secure channel with the
   replacing IG in quick and efficient ways.  In this section, several
   candidate solutions are discussed.

7.1.  Hot Stand-by

   In a hot stand-by solution, the IKE and IPSec states are synchronized
   between the IGs in a timely fashion.  Take the IDC in figure as
   example, after the user and IG1 has generated an IPSec tunnel
   successfully, the states related to the user are synchronized to IG2.
   In the event that IG1 fails and the user switches over to IG2, the
   negotiation process in generating IKE and IPSec SAs can be largely
   avoided.  This solution has been widely adopted in developing tight
   clusters.  In such tight clusters, a VRRP like protocol is executed
   to synchronize state between cluster members.  However, in certain
   circumstances, real-time synchronization may be difficult to be
   guaranteed.  For instance, a gateway may be crashed before it can
   send the latest synchronization packets.  Also, a user may have
   already sent out certain numbers of packets to the crashed gateway
   before the failure is detected.  On both occasions, the states on the
   replacing gateway and the the users can be inconsistent.  Such issues
   have been discussed in [PS].  In [HA], a solution is proposed to help
   users and replacing gateways to synchronize their IKE/IPSec states
   (e.g., the latest sequence numbers of IPSec packets).  Compared with
   tight clusters, more issues need to be considered in integrating hot
   stand-by in loose clusters.  Firstly, two members of a loose cluster
   can be located thousands miles from each other.  In such a case, the
   latency in transporting synchronization packets can be considerable
   and the network connecting two members may be unreliable and only


Zhang & Zhang            Expires April 21, 2011                 [Page 7]

Internet-Draft      Redundancy for IPsec D-Deployment       October 2010


   have limited bandwidth as well.  Secondly, after receiving a
   synchronized SA, the IP addresses in the SA must be carefully
   replaced with the IP addresses of the local IG.  Thirdly, a user and
   the replacing IG need to assign new indexes to the IKE and IPSec SAs
   generated before the switchover so as to avoid the confliction of
   indexes.

7.2.  Cold Stand-by

   Cold stand-by is the simplest solution to achieve failover in loose
   clusters.  There is no requirement in synchronizing states among
   cluster members.  If the cold stand-by function is implemented in the
   IDC in Figure 1, when the failure on IG1 is detected, user just
   attempts to contact IG2 and try to generate IKE and IPSec SAs from
   scratch.  This process normally takes a large amount of computing
   resources and time, which makes this solution relatively low
   efficient.

7.3.  IKE Session Resumption

   IKE session resumption is actually "semi-"hot standby solution, which
   enables users and gateways not to generate their IKE SAs in scratch
   in failover scenarios.  In [RESUMPTION], a quick recovery mechanism
   for IKEv2 states is proposed.  After an IG reboots, the client and
   the IG can re-establish the previous IKE SAs, while using less
   message exchanges and computational costs.  While this solution is
   restricted in re-building state on a same IG, it can be easily
   extended to be used for failover in IG clusters.

7.3.1.  Tickets by IGs

   This section introduces the solution which achieves IKE session
   resumption in loose clusters by maintaining IKE state information on
   the user side.  When a user communicate with its original IG (e.g.,
   the IG1 in Figure 1), the IG encapsulate its IKE state in a ticket
   and send the ticket to the user.  In RFC5723, the data structure of a
   ticket is specified.  When IG1 fails, the user sends the ticket
   obtained from IG1 to the replacing IG (e.g., the IG2 in Figure 1).
   Upon receiving the ticket, IG2 parses it and stores the IKE state
   locally.  Then, IG2 can re-establish the IPSec SA with the user based
   on resumed IKE SA.  This method requires no direct state
   synchronization between IG1 and IG2.  However, IG1 and IG2 must share
   certain secrets in advance so that IG2 can decrypt the ticket and
   learn the date encapsulated.  In addition, both IGs need to share the
   format of the ticket.  Otherwise, IG2 cannot learn the meanings of
   data in the ticket.


Zhang & Zhang            Expires April 21, 2011                 [Page 8]

Internet-Draft      Redundancy for IPsec D-Deployment       October 2010


7.3.2.  Synchronizing IKE-SAs amongst IGs

   This section introduces the solution which achieves IKE session
   resumption in loose clusters by synchronizing IKE-SAs amongst IGs.
   In this solution, IKE SAs are synchronized amongst the cluster
   members.  Therefore, the original IG only needs to send the index of
   a ticket to the associated user.  When the original IG fails, the
   user can present the ticket to the replacing IG, the IG can thus use
   the index to find the correspondent IKE state.

8.  Security Considerations

   TBD.

9.  IANA Considerations

   No assignments by the IANA are required.

10.  Acknowledgements

   We would like to thank Paul Hoffman and Yaron Sheffer for discussion
   about the work and for their valuable advice.

11.  References

11.1.  Normative References

   [RFC2119]     Bradner, S., "Key words for use in RFCs to Indicate
                 Requirement Levels", RFC 2119, March 1997.

11.2.  Informative References

   [ANALYSIS]    Hagino, J. and E. Karupiah, "An analysis of IPv6
                 anycast",
                 draft-ietf-ipngwg-ipv6-anycast-analysis-02 (work in
                 progress), June 2003.

   [DPD]         Huang, G., Beaulieu, S., and D. Rochefort, "A Traffic-
                 Based Method of Detecting Dead Internet Key Exchange
                 (IKE) Peers", RFC 3706, February 2004.

   [HA]          Singh, R., Kalyani, G., Nir, Y., and D. Zhang,
                 "Protocol Support for High Availability IKEv2/IPsec",
                 draft-ietf-ipsecme-ipsecha-protocol-00 (work in
                 progress), September 2010.

   [HARP]        Wakikawa, R., "Home Agent Reliability Protocol",
                 draft-ietf-mip6-hareliability-06 (work in progress),


Zhang & Zhang            Expires April 21, 2011                 [Page 9]

Internet-Draft      Redundancy for IPsec D-Deployment       October 2010


                 July 2010.

   [PS]          Vidya, V., "IPsec Gateway Failover and Redundancy -
                 Problem Statement and Goals",
                 draft-vidya-ipsec-failover-ps-02 (work in progress),
                 May 2007.

   [REDIRECT]    Devarapalli, V. and K. Weniger, "Redirect Mechanism for
                 the Internet Key Exchange Protocol Version 2 (IKEv2)",
                 RFC 5685, November 2009.

   [RESUMPTION]  Sheffer, Y. and H. Tschofenig, "Internet Key Exchange
                 Protocol Version 2 Session Resumption", RFC 5273,
                 January 2010.

   [VRRP]        Hinden, R., "Virtual Router Redundancy Protocol",
                 RFC 3768, April 2004.

Authors' Addresses

   Dacheng Zhang
   Huawei
   KuiKe Building, No.9 Xinxi Rd., Shangdi
   HaiDian District, Beijing
   China

   EMail: zhangdacheng@huawei.com


   Dong Zhang
   Huawei Symantec
   3rd Floor,Section D, Keshi Building, No.28, Xinxi Rd., Shangdi
   HaiDian District, Beijing
   China

   EMail: zhangdong_rh@huaweisymantec.com


Zhang & Zhang            Expires April 21, 2011                [Page 10]