Network Working Group                                            Y. Wang
Internet-Draft                                                    Huawei
Intended status: Standards Track                       December 30, 2013
Expires: July 3, 2014


            NFV High-Availability Technologies Gap Analysis
            draft-wang-nfv-high-availability-gap-analysis-00

Abstract

   High availability (HA) has been a very important requirement
   throughout the history of carrier networks, and many technologies
   have emerged to provide it.  With the trend towards Network Function
   Virtualization (NFV), network functions are migrated from dedicated
   hardware to software running on COTS servers, and the same HA SLA
   still has to be provided, according to the needs of each network
   service.  NFV, however, brings some new challenges.  One example is
   the Virtualized Network Function (VNF) cluster, which is needed
   because an individual VNF instance is limited in availability and
   performance.  For the VNF cluster, gaps exist between its needs and
   the available HA technologies on the issues of multi-homing, state
   synchronization, share-risk prevention and HA role election,
   especially in networks with a large-scale deployment of NFV.

   This document first identifies the challenges that emerge in networks
   where NFV is deployed.  The available HA technologies are then
   reviewed, and the gaps between them and the new challenges are
   analysed in depth.  Finally, a summary of these gaps is presented.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on July 3, 2014.
Wang                      Expires July 3, 2014                  [Page 1]

Internet-Draft      NFV HA Technologies Gap Analysis       December 2013

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  New challenges and requirements to NFV HA solution
     3.1.  New features of NFV
     3.2.  VNF cluster
     3.3.  Failure detection and handling
     3.4.  Consideration of central management server
     3.5.  Policy enforcement
     3.6.  NFV HA solution requirements
   4.  Gaps between available HA solutions and NFV's challenges
     4.1.  VRRP
     4.2.  BFD
     4.3.  APS
     4.4.  NSR, NSF, SSO and GR
     4.5.  STP
     4.6.  FRR
     4.7.  OAM
   5.  Summary
   6.  IANA Considerations
   7.  Security Considerations
   8.  Informative References
   Author's Address

1.  Introduction

   Because of its benefits of reduced operational and capital costs,
   automated deployment and enhanced elasticity, virtualization of
   network functions (e.g. firewall, IPS, load balancer) is expected to
   be widely supported in DC and carrier networks.  One key issue,
   however, is that the HA requirements have to be reconsidered for NFV
   so that the same HA SLA can still be provided, according to the needs
   of each network service.

   Considering the reliability requirements, an NFV HA architecture
   should support the key functions listed below:

   o  Redundancy mechanism: LACP, VRRP, ECMP, etc.;

   o  Failure detection: IEEE 802.1ag, ITU-T Y.1731, BFD, MPLS(-TP) OAM,
      etc.;

   o  Failure notification: APS, etc.;

   o  State synchronization: NSR, NSF, SSO, GR, etc.;

   o  Failure handling (switchover and failover): STP, EAPS, MPLS FRR,
      etc.

   [VNFP-PS] provides the problem statement and working scope analysis
   focusing on VNF reliability and high availability issues.  A
   companion draft, [VNFP-UC], provides an overview of VNF HA use cases.
   This document reviews the challenges of NFV and the traditional HA
   solutions, and summarizes the gaps between them.

2.  Terminology

   NFV   Network Function Virtualization

   VRRP  Virtual Router Redundancy Protocol

   STP   Spanning Tree Protocol

   FRR   Fast Reroute

   BFD   Bidirectional Forwarding Detection

   APS   Automatic Protection Switching

   HA    High Availability

3.  New challenges and requirements to NFV HA solution

   As described in [VNFP-PS] and [VNFP-UC], NFV brings new features and
   requirements to current networks, which in turn poses new challenges
   to the NFV HA solution in the following aspects.

3.1.  New features of NFV

   An NFV network should support high scalability, cost reduction, high
   flexibility and high automation.

   o  High scalability: The NFV HA solution should support large-scale
      networks.  To provide HA for a large number of VNFs in a virtual
      network, a highly efficient HA solution with minimal impact on the
      network is needed.  The NFV HA solution should also support an
      automatic and fast mechanism to discover new VNFs and to add them
      to, or delete them from, a VNF cluster;

   o  Cost reduction: One of the intentions of deploying NFV is to
      reduce cost.  Thus, whether for a VNF cluster or for a
      working/protection group deployment, making full use of the
      members in active/active mode is the preferred choice, and a 1:N
      redundancy mechanism should be the basic requirement rather than
      1:1, 1+1 or 1+N.  Multiple active VNFs, however, may cause a
      performance problem, because every active VNF needs to synchronize
      its state to every other VNF in the same VNF cluster.  Full-mesh
      connectivity for state synchronization is very complex and
      consumes a great deal of bandwidth and system resources, which
      adds delay;

   o  High flexibility: NFV brings high flexibility to the network,
      making it easy for the operator to scale the virtualized network
      up or down and to migrate VNF instances to other locations on
      demand.  VNF instances can thus be highly distributed across DC
      networks, carrier networks and even customer premises, and they
      can be migrated dynamically.  It is hard to pin VNF VMs to fixed
      hosts.  If the VNF instances of the same VNF cluster are located
      on the same host, or VNF chains cross, the achievable HA SLA is
      reduced because of the share-risk problem.  Another problem is
      related to the service chain.
      For HA, the working and backup service chains should not cross at
      the same location (e.g. host, hypervisor, VM, VNF); otherwise that
      location becomes a single point of failure;

   o  High automation: A network with NFV deployed can be a large-scale,
      highly distributed and highly dynamic network.  This means that a
      large number of operations and functions must be performed
      automatically; otherwise, the amount of manual configuration and
      intervention is not affordable.  For example, the deployment and
      configuration of redundancy mechanisms, failure detection and
      switchover, and the adaptation to dynamic changes of VNFs should
      all be performed automatically and quickly.

3.2.  VNF cluster

   NFV moves running network functions from physical platforms to
   virtualized platforms, which limits the performance and reliability
   of an individual VNF instance.  NFV therefore normally uses clusters
   with more members to meet the performance and reliability
   requirements.  As the number of VNFs in a VNF cluster increases, the
   following aspects should be considered:

   o  State synchronization: State synchronization is an essential
      function for NFV, especially for the clustering mechanism; without
      it, a VNF cluster cannot work in several scenarios.  One example
      is when the forward flow and the return flow hit different VNFs.
      Another example is the switchover case.  As the number of VNFs in
      a cluster increases, full-mesh state synchronization between them
      becomes more complex and resource consuming.  A general and
      efficient state synchronization technology for various kinds of
      VNFs is needed;

   o  Dynamic change of VNF members' roles: The role of a VNF member
      (active or, especially, standby) in a VNF cluster can change
      dynamically at run time because of switchover.  The NFV HA
      solution must adapt to this.

3.3.  Failure detection and handling

   o  Failure detection: In the NFV infrastructure, a failure can happen
      in multiple layers: hardware, hypervisor, VM or VNF instance.  The
      new failure detection mechanism should therefore be able to detect
      failures in all these layers.  The mechanism should be simple,
      uniform, protocol independent and software driven.  One candidate
      is BFD;

   o  Failure handling approaches: NFV technology brings approaches to
      failure handling that are not available in traditional networks.
      For instance, scaling up can resolve an overload problem, and
      restarting a VNF instance or dynamically relocating a VNF can be
      used when a VNF fails.  New HA solutions should make full use of
      these new NFV features.

3.4.  Consideration of central management server

   The north bound interfaces of a VNF can be connected to the central
   management server to report the VNF's running status.  In this way,
   the central management server can decide the roles of the VNFs in a
   VNF cluster and can be used for the switchover mechanism.  It can
   also be used for state synchronization.

3.5.  Policy enforcement

   There could be policies that reflect the different reliability
   classes of services and hence affect the selection of VNF instances.
   Examples include isolation policies requiring that VNF instances be
   placed on separate physical servers or in separate DC sites.  Another
   example is placing some VNF instances in topologically close
   locations.  Other policies and related HA issues are TBD.

3.6.  NFV HA solution requirements

   To cope with the new challenges above, the NFV HA solution should
   provide the following functionality:

   o  Redundancy mechanism

      *  The function to support VNF clusters in active/active mode;

      *  The function to support 1:N redundancy of a VNF cluster;

      *  The function to support VNFs joining/leaving or scaling up/down
         efficiently and automatically;

      *  The function of share-risk prevention.
   o  Failure detection

      *  A simple, uniform, protocol independent and software driven
         function to detect failures in the multiple layers of the NFV
         infrastructure.

   o  Failure notification

      *  The function to notify the central management server about a
         VNF failure;

      *  The function to notify other interested or affected VNFs about
         a VNF failure.

   o  State synchronization

      *  The function to support the dynamic change of a VNF member's
         role in a VNF cluster due to switchover, by tracking the
         binding relationship between active VNFs and standby VNFs;

      *  The function to support VNF keep-alive monitoring and efficient
         state synchronization that avoids full-mesh connectivity by
         using centrally controlled technology.

   o  Failure handling (switchover and failover)

      *  The function to support quick convergence for large VNF
         clusters;

      *  The function to support a central management server;

      *  The function to utilize new NFV features, for example scaling
         up/down, restarting and dynamic relocation of VNFs, to overcome
         VNF failure or overload.

   o  Compatibility

      *  Support for VNFs from different providers;

      *  Compatibility with existing hardware network elements.

4.  Gaps between available HA solutions and NFV's challenges

   The available HA solutions are combinations of several technologies
   deployed at different points of the network.  An overview is as
   follows:

   o  VRRP is a common redundancy mechanism in Layer 3 and supports
      failure handling;

   o  BFD is a common technology for failure detection;

   o  The APS protocol is mainly used for failure notification, which
      naturally leads to the failure handling process;

   o  NSR, NSF, SSO and GR can support state synchronization between
      neighboring devices, or with the standby RP, while a device
      restarts;

   o  STP and MPLS FRR are deployed as failure handling mechanisms in
      Layer 2 or Layer 3.

4.1.  VRRP

   The Virtual Router Redundancy Protocol (VRRP) is designed to
   eliminate the single point of failure inherent in a statically
   configured default-route environment at Layer 3.  The nodes or ports
   in the same group share the same virtual IP address but have
   different MAC addresses.  The elements in a VRRP group elect the
   active role using a pre-set priority.  This priority/role changes
   only if a switchover or fallback occurs.  The active VRRP element
   notifies the hosts in the subnet using gratuitous ARP.

   If the active VRRP element fails, the standby elements in the VRRP
   group select a new active element.  The new active element sends a
   gratuitous ARP to notify the corresponding hosts to update their MAC
   tables.  Consequently, flows whose destination IP is the
   corresponding virtual IP are directed to the new active element.

   VRRP works as an integrated solution for redundancy and failure
   handling.

   For requirements of redundancy mechanism:

   o  VRRP supports clusters.  However, there is only one active element
      per VRRP group; multiple active elements in the same group are not
      supported.  Multiple active elements require multiple VRRP groups,
      distinguished by different virtual IP addresses.  The gateway
      configuration of the related hosts then has to be planned
      carefully across these addresses, which is very inflexible;

   o  VRRP supports at most 255 elements in one group, which may be a
      drawback for future NFV deployments in cross-DC-site scenarios;

   o  Only the active element can send VRRP ADVERTISEMENT messages to
      notify the standby elements in the same group; standby elements
      cannot announce their presence.  Automatic discovery of new
      elements is therefore a problem for state synchronization;

   o  VRRP does not support share-risk prevention.
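   The election behaviour and the single-active limitation described
   above can be illustrated with a minimal Python sketch.  It is not an
   implementation of VRRP: the VrrpElement class, its fields and the
   example addresses are hypothetical names chosen for illustration, and
   advertisement timers, preemption and gratuitous ARP are left out.  It
   assumes the usual VRRP rule that the highest priority wins and that a
   tie is broken by the higher primary IP address.

```python
from dataclasses import dataclass, replace
from ipaddress import IPv4Address


@dataclass(frozen=True)
class VrrpElement:
    """One member of a VRRP group (hypothetical model for illustration)."""
    name: str
    priority: int             # pre-set priority; higher value wins
    primary_ip: IPv4Address   # used only to break priority ties
    alive: bool = True


def elect_active(group):
    """Return the single active element of a group, or None if none is alive.

    Assumption: highest priority wins, ties go to the higher primary IP.
    """
    candidates = [e for e in group if e.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda e: (e.priority, e.primary_ip))


group = [
    VrrpElement("vnf-a", 200, IPv4Address("192.0.2.1")),
    VrrpElement("vnf-b", 100, IPv4Address("192.0.2.2")),
    VrrpElement("vnf-c", 100, IPv4Address("192.0.2.3")),
]

# Normal operation: the element with the highest priority is active.
active = elect_active(group)

# The active element fails: the remaining standbys elect a new active one.
after_failure = [replace(e, alive=(e.name != "vnf-a")) for e in group]
new_active = elect_active(after_failure)

print(active.name, "->", new_active.name)
```

   Because elect_active() returns exactly one element per group, running
   several active VNFs this way would need one group, and one virtual
   IP, per active element, which is the inflexibility noted in the list
   above.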
   For requirements of failure handling:

   o  A VRRP group operates autonomously, without the support of a
      central management server.  It therefore cannot take advantage of
      such a server to choose flexibly and optimally between the VRRP
      elements according to their respective performance;

   o  The standby elements take over the load of the active element when
      it fails.  If the active element failed because of overload,
      however, this is not effective.  A VNF can scale up to solve this
      problem, but VRRP cannot.

4.2.  BFD

   BFD is a lightweight hello protocol designed to run over multiple
   transport protocols (e.g. IPv4, IPv6, MPLS) and is used for Layer 3
   failure detection.  Any interested client (e.g. OSPF, BGP, HSRP)
   registers with BFD and is notified as soon as BFD detects the loss of
   a neighbor.  BFD establishes monitoring sessions between two
   neighbors and declares a link or node failure if no BFD packet is
   received for a period.

   BFD is a good candidate for the failure detection solution of the NFV
   network because of its simplicity, efficiency, uniformity and
   protocol independence, and because it is software driven.  However,
   BFD can only detect Layer 3 failures and does not support Layer 2.
   How it could support failure detection for VNF clusters and for the
   multiple layers of the NFV infrastructure also needs more study.

4.3.  APS

   The APS (Automatic Protection Switching) protocol is a mature and
   proven mechanism specified for bidirectional protection switching,
   which needs the coordination of the two endpoints of the transport
   entity in SONET/SDH networks.  It can now also be used in Ethernet
   [ITU-T G.8031] and MPLS [ITU-T Y.1720] networks.

   When a change in the transmitted status occurs (e.g. link/node
   failure, forced switch, signal fail), an endpoint immediately
   transmits a new APS packet to inform the far-end endpoint, so that
   protection switching can be coordinated.
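   The change-triggered notification described above can be sketched in
   a few lines of Python.  This is only an illustration under stated
   assumptions: the ApsEndpoint class, its methods and the status
   strings are hypothetical, the "packet" is an in-memory tuple, and
   other aspects of the real protocol (message formats, priorities
   between requests, any periodic retransmission) are not modelled.

```python
class ApsEndpoint:
    """Endpoint that informs its far end when its own status changes.

    Hypothetical model for illustration; not the APS message format.
    """

    def __init__(self, name):
        self.name = name
        self.peer = None
        self.status = "no-request"   # assumed initial status
        self.received = []           # (sender, status) seen from the far end

    def connect(self, peer):
        """Pair the two endpoints of one protected transport entity."""
        self.peer, peer.peer = peer, self

    def set_status(self, status):
        """Record a local status and notify the far end only on a change.

        Returns True if a notification was sent, False otherwise.
        """
        if status == self.status:
            return False             # unchanged: nothing new to transmit
        self.status = status
        self.peer.received.append((self.name, status))
        return True


west, east = ApsEndpoint("west"), ApsEndpoint("east")
west.connect(east)

sent_first = west.set_status("signal-fail")    # change: far end is informed
sent_repeat = west.set_status("signal-fail")   # no change: nothing is sent

print(east.received)
```

   Note that the notification travels only between the two endpoints;
   nothing in this exchange reports the event to a third party such as a
   central management server.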
   The APS protocol is mainly used for failure notification, which
   naturally leads to the failure handling process.

   For requirements of redundancy mechanism:

   o  APS does not support VNF clusters in active/active mode;

   o  APS usually provides 1:1 or 1+1 backup and supports no more than
      1:14 backup, which limits the scale of an NFV network;

   o  APS does not support a dynamic election mechanism for path roles.
      The role of a path is pre-configured;

   o  APS does not support share-risk prevention.

   For requirements of failure notification:

   o  APS does not provide a north bound interface to a central
      management server.

   For requirements of failure handling:

   o  APS does not support the new NFV failure handling features.

4.4.  NSR, NSF, SSO and GR

   TBD.

4.5.  STP

   The Spanning Tree Protocol (STP) is a network protocol that ensures a
   loop-free topology for any bridged Ethernet local area network.  The
   basic function of STP is to prevent bridge loops and the broadcast
   storms that result from them.  Spanning tree also allows a network
   design to include spare (redundant) links that provide automatic
   backup paths if an active link fails, without the danger of bridge
   loops and without the need to enable or disable these backup links
   manually.

   STP works as an integrated solution for redundancy and failure
   handling.

   For requirements of redundancy mechanism:

   o  STP does not support VNF clusters in active/active mode;

   o  STP does not provide a mechanism for dynamically electing active
      and standby elements in element groups.  The roles are determined
      by the configured network layout and by pre-assigned values such
      as priority and port ID.  After convergence, the role of a port
      can only be changed by manual configuration or at the next
      convergence, when some ports fail;

   o  STP does not provide a share-risk prevention mechanism.
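   The point that STP roles follow from pre-assigned values rather than
   from run-time conditions can be illustrated with a minimal Python
   sketch of root bridge selection.  It assumes the standard rule that
   the bridge with the numerically lowest bridge ID (priority first,
   then MAC address) becomes the root; the function names, switch names
   and addresses are hypothetical, and port roles, path costs and BPDU
   exchange are not modelled.

```python
def bridge_id(priority, mac):
    """Bridge ID as a comparable tuple: priority first, then MAC bytes."""
    return (priority, bytes.fromhex(mac.replace(":", "")))


def elect_root(bridges):
    """Return the name of the bridge with the numerically lowest bridge ID."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


# name -> (configured priority, MAC address); all values are illustrative.
bridges = {
    "sw1": (32768, "00:00:5e:00:53:0a"),
    "sw2": (32768, "00:00:5e:00:53:01"),
    "sw3": (4096, "00:00:5e:00:53:ff"),
}

# sw3 wins purely because of its configured priority value.
root = elect_root(bridges)

# The role moves only when the topology changes, e.g. sw3 disappears;
# the remaining bridges tie on priority, so the lower MAC decides.
del bridges["sw3"]
new_root = elect_root(bridges)

print(root, "->", new_root)
```

   Nothing in this selection takes load, performance or placement into
   account, which is why the outcome cannot adapt to the dynamic role
   changes expected in a VNF cluster.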
   For requirements of failure handling:

   o  STP does not provide a north bound interface to a central
      management server;

   o  STP does not support the new NFV failure handling features.

4.6.  FRR

   MPLS Fast Reroute is a local restoration mechanism for network
   resiliency.  It is a feature of Resource Reservation Protocol Traffic
   Engineering (RSVP-TE).  In MPLS local protection, each label switched
   path (LSP) passing through a facility is protected by a backup path
   that originates at the node immediately upstream of that facility.

   FRR works as an integrated solution for redundancy and failure
   handling.

   For requirements of redundancy mechanism:

   o  FRR does not support VNF clusters in active/active mode;

   o  FRR paths cannot dynamically elect active or standby paths; they
      are configured manually or computed by algorithms such as LFA
      (Loop-Free Alternate) [RFC 5286];

   o  FRR does not support share-risk prevention;

   o  FRR usually protects two or slightly more paths, so the large
      number of VNFs in a cluster would make the network layout very
      complex.

   For requirements of failure handling:

   o  FRR does not provide north bound interfaces to a central
      management server;

   o  FRR has a problem similar to that of VRRP in deploying the new NFV
      failure handling features.

4.7.  OAM

   TBD

5.  Summary

   In conclusion, there is a gap between the available HA technologies
   and the new challenges of NFV.
   +--------------------+---------+---------+---------+---------+
   |                    |  VRRP   |   APS   |   STP   |   FRR   |
   +--------------------+---------+---------+---------+---------+
   | Support            |         |         |         |         |
   | active/active      |   no    |   no    |   no    |   no    |
   | cluster            |         |         |         |         |
   +--------------------+---------+---------+---------+---------+
   | Support 1:N        |   no    |   no    |   no    |   no    |
   | backup             |         |         |         |         |
   +--------------------+---------+---------+---------+---------+
   | Automatic          |   no    |   no    |   no    |   no    |
   | scalability        |         |         |         |         |
   +--------------------+---------+---------+---------+---------+
   | Share-risk         |   no    |   no    |   no    |   no    |
   | prevention         |         |         |         |         |
   +--------------------+---------+---------+---------+---------+

          Figure 1: Gap Analysis Table of Redundancy Mechanism

   Note:

   1.  For NFV, a VNF cluster in active/active mode is one of the basic
       requirements;

   2.  1:N backup means 1 standby element for N active elements in one
       VNF cluster.

   +--------------------------------------+--------------------+
   |                                      |        BFD         |
   +--------------------------------------+--------------------+
   | Support failure detection of         |        TBD         |
   | VNF cluster and multiple layers      |                    |
   +--------------------------------------+--------------------+

           Figure 2: Gap Analysis Table of Failure Detection

   +--------------------------------+--------------+
   |                                |     APS      |
   +--------------------------------+--------------+
   | Notify central network         |      no      |
   | server                         |              |
   +--------------------------------+--------------+
   | Notify related VNF             |     yes      |
   +--------------------------------+--------------+

          Figure 3: Gap Analysis Table of Failure Notification

   +-----------------------------------------+---------------------+
   |                                         |                     |
   +-----------------------------------------+---------------------+
   | Dynamic election                        |                     |
   +-----------------------------------------+---------------------+
   | Full-mesh avoidance                     |                     |
   +-----------------------------------------+---------------------+

         Figure 4: Gap Analysis Table of State Synchronization

   +-------------------+---------+--------+----------+----------+
   |                   |  VRRP   |  APS   |   STP    |   FRR    |
   +-------------------+---------+--------+----------+----------+
   | North bound API   |   no    |   no   |    no    |    no    |
   +-------------------+---------+--------+----------+----------+
   | Support new NFV   |   no    |   no   |    no    |    no    |
   | approaches        |         |        |          |          |
   +-------------------+---------+--------+----------+----------+

           Figure 5: Gap Analysis Table of Failure Handling

6.  IANA Considerations

   This document has no actions for IANA.

7.  Security Considerations

   TBD.

8.  Informative References

   [VNFP-PS]  Zong, N., "Problem Statement for Reliable Virtualized
              Network Function (VNF) Pool", ID
              draft-zong-vnfpool-problem-statement-01, September 2013.

   [VNFP-UC]  Xia, L., "Use cases and Requirements for Virtual Service
              Node Pool Management", ID
              draft-xia-vsnpool-management-use-case-01, October 2013.

Author's Address

   Yang Wang
   Huawei
   101 Software Avenue, Yuhua District
   Nanjing, Jiangsu  210012
   China

   Email: alex.wangyang@huawei.com