NFVRG C. Meirosu Internet Draft Ericsson Intended status: Informational A. Manzalini Expires: September 2016 Telecom Italia R. Steinert SICS G. Marchetto Politecnico di Torino I. Papafili Hellenic Telecommunications Organization K. Pentikousis EICT S. Wright AT&T March 20, 2016March 18, 2016 DevOps for Software-Defined Telecom Infrastructures draft-unify-nfvrg-devops-04.txt Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html This Internet-Draft will expire on September 20, 2016. Meirosu, et al. Expires September 20,18, 2016 [Page 1] Internet-Draft DevOps Challenges March 2016 Copyright Notice Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Abstract Carrier-grade network management was optimized for environments built with monolithic physical nodes and involves significant deployment, integration and maintenance efforts from network service providers. The introduction of virtualization technologies, from the physical layer all the way up to the application layer, however, invalidates several well-established assumptions in this domain. This draft opens the discussion in NFVRG about challenges related to transforming the telecom network infrastructure into an agile, model-driven production environment for communication services. We take inspiration from data center DevOps regarding how to simplify and automate management processes for a telecom service provider software-defined infrastructure (SDI). Among the identified challenges, we consider scalability of observability processes and automated inference of monitoring requirements from logical forwarding graphs, as well as initial placement (and re-placement) of monitoring functionality following changes in flow paths enforced by the controllers. In another category of challenges, verifying correctness of behavior for network functions where flow rules are no longer necessary and sufficient for determining the forwarding state (for example, stateful firewalls or load balancers) is very difficult with current technology. Finally, we introduce challenges associated with operationalizing DevOps principles at scale in software-defined telecom networks in three areas related to key monitoring, verification and troubleshooting processes. Table of Contents 1. Introduction...................................................3 Meirosu, et al. Expires September 20,18, 2016 [Page 2] Internet-Draft DevOps Challenges March 2016 2. Software-Defined Telecom Infrastructure: Roles and DevOps principles........................................................5 2.1. Service Developer Role....................................5 2.2. VNF Developer role........................................6 2.3. System Integrator role....................................6 2.4. Operator role.............................................6 2.5. Customer role.............................................6 2.6. DevOps Principles.........................................7 3. Continuous Integration.........................................8 4. Continuous Delivery............................................9 5. Consistency, Availability and Partitioning Challenges..........9 6. Stability Challenges..........................................10 7. Observability Challenges......................................12 8. Verification Challenges.......................................14 9. Troubleshooting Challenges....................................16 10. Programmable network management..............................17 11. DevOps Performance Metrics...................................18 12. Security Considerations......................................19 13. IANA Considerations..........................................19 14. References...................................................19 14.1. Informative References..................................19 15. Contributors.................................................22 16. Acknowledgments..............................................22 17. Authors' Addresses...........................................23 1. Introduction Carrier-grade network management was developed as an incremental solution once a particular network technology matured and came to be deployed in parallel with legacy technologies. This approach requires significant integration efforts when new network services are launched. Both centralized and distributed algorithms have been developed in order to solve very specific problems related to configuration, performance and fault management. However, such algorithms consider a network that is by and large functionally static. Thus, management processes related to introducing new or maintaining functionality are complex and costly due to significant efforts required for verification and integration. Network virtualization, by means of Software-Defined Networking (SDN) and Network Function Virtualization (NFV), creates an environment where network functions are no longer static or strictly embedded in physical boxes deployed at fixed points. The virtualized network is dynamic and open to fast-paced innovation enabling efficient network management and reduction of operating cost for network operators. A significant part of network capabilities are expected to become available through interfaces that resemble the APIs widespread within Meirosu, et al. Expires September 20,18, 2016 [Page 3] Internet-Draft DevOps Challenges March 2016 datacenters instead of the traditional telecom means of management such as the Simple Network Management Protocol, Command Line Interfaces or CORBA. Such an API-based approach, combined with the programmability offered by SDN interfaces [RFC7426], open opportunities for handling infrastructure, resources, and Virtual Network Functions (VNFs) as code, employing techniques from software engineering. The efficiency and integration of existing management techniques in virtualized and dynamic network environments are limited, however. Monitoring tools, e.g. based on simple counters, physical network taps and active probing, do not scale well and provide only a small part of the observability features required in such a dynamic environment. Although huge amounts of monitoring data can be collected from the nodes, the typical granularity is rather coarse. Debugging and troubleshooting techniques developed for software- defined environments are a research topic that has gathered interest in the research community in the last years. Still, it is yet to be explored how to integrate them into an operational network management system. Moreover, research tools developed in academia (such as NetSight [H2014], OFRewind [W2011], FlowChecker [S2010], etc.) were limited to solving very particular, well-defined problems, and oftentimes are not built for automation and integration into carrier- grade network operations workflows. The topics at hand have already attracted several standardization organizations to look into the issues arising in this new environment. For example, IETF working groups have activities in the area of OAM and Verification for Service Function Chaining [I-D.aldrin-sfc-oam-framework] [I-D.lee-sfc-verification] for Service Function Chaining. At IRTF, [RFC7149] asks a set of relevant questions regarding operations of SDNs. The ETSI NFV ISG defines the MANO interfaces [NFVMANO], and TMForum investigates gaps between these interfaces and existing specifications in [TR228]. The need for programmatic APIs in the orchestration of compute, network and storage resources is discussed in [I-D.unify-nfvrg-challenges]. From a research perspective, problems related to operations of software-defined networks are in part outlined in [SDNsurvey] and research referring to both cloud and software-defined networks are discussed in [D4.1]. The purpose of this first version of this document is to act as a discussion opener in NFVRG by describing a set of principles that are relevant for applying DevOps ideas to managing software-defined telecom network infrastructures. We identify a set of challenges related to developing tools, interfaces and protocols that would Meirosu, et al. Expires September 20,18, 2016 [Page 4] Internet-Draft DevOps Challenges March 2016 support these principles and how can we leverage standard APIs for simplifying management tasks. 2. Software-Defined Telecom Infrastructure: Roles and DevOps principles Agile methods used in many software focused companies are focused on releasing small interactions of code to implement VNFs with high velocity and high quality into a production environment. Similarly, Service providers are interested to release incremental improvements in the network services that they create from virtualized network functions. The cycle time for devops as applied in many open source projects is on the order of one quarter year or 13 weeks. The code needs to undergo a significant amount of automated testing and verification with pre-defined templates in a realistic setting. From the point of view of infrastructure management, the verification of the network configuration as result of network policy decomposition and refinement, as well as the configuration of virtual functions, is one of the most sensitive operations. When troubleshooting the cause of unexpected behavior, fine-grained visibility onto all resources supporting the virtual functions (either compute, or network-related) is paramount to facilitating fast resolution times. While compute resources are typically very well covered by debugging and profiling toolsets based on many years of advances in software engineering, programmable network resources are a still a novelty and tools exploiting their potential are scarce. 2.1. Service Developer Role We identify two dimensions of the "developer" role in software- defined infrastructure (SDI). One dimension relates to determining which high-level functions should be part of a particular service, deciding what logical interconnections are needed between these blocks and defining a set of high-level constraints or goals related to parameters that define, for instance, a Service Function Chain. This could be determined by the product owner for a particular family of services offered by a telecom provider. Or, it might be a key account representative that adapts an existing service template to the requirements of a particular customer by adding or removing a small number of functional entities. We refer to this person as the Service Developer and for simplicity (access control, training on technical background, etc.) we consider the role to be internal to the telecom provider. Meirosu, et al. Expires September 20,18, 2016 [Page 5] Internet-Draft DevOps Challenges March 2016 2.2. VNF Developer role Another dimension of the "developer" role is a person that writes the software code for a new virtual network function (VNF). Depending on the actual VNF being developed, this person might be internal or external (e.g. a traditional equipment vendor) to the telecom provider. We refer to them as VNF Developers. 2.3. System Integrator role The System Integrator role is to some extent similar to the Service Developer: people in this role need to identify the components of the system to be delivered. However, for the Service Developer, the service components are pre-integrated meaning that they have the right interfaces to interact with each other. In contrast, the Systems Integrator needs to develop the software that makes the system components interact with each other. As such, the Systems Integrator role combines aspects of the Developer roles and adds yet another dimension to it. Compared to the other Developer roles, the System Integrator might face additional challenges due to the fact that they might not have access to the source code of some of the components. This limits for example how fast they could address issues with components to be integrated, as well as uneven workload depending on the release granularity of the different components that need to be integrated. 2.4. Operator role The role of an Operator in SDI is to ensure that the deployment processes were successful and a set of performance indicators associated to a service are met while the service is supported on virtual infrastructure within the domain of a telecom provider. 2.5. Customer role A Customer contracts a telecom operator to provide one or more services. In SDI, the Customer may communicate with the provider through an online portal. Compared to the Service Developer, the Customer is external to the operator and may define changes to their own service instance only in accordance to policies defined by the Service Developer. In addition to the usual per-service utilization statistics, in SDI the portal may enable the customer to trigger certain performance management or troubleshooting tools for the service. This, for example, enables the Customer to determine whether the root cause of certain error or degradation condition that they observe is located in the telecom operator domain or not and may facilitate the interaction with the customer support teams. Meirosu, et al. Expires September 20,18, 2016 [Page 6] Internet-Draft DevOps Challenges March 2016 2.6. DevOps Principles In line with the generic DevOps concept outlined in [DevOpsP], we consider that these four principles as important for adapting DevOps ideas to SDI: * Deploy with repeatable, reliable processes: Service and VNF Developers should be supported by automated build, orchestrate and deploy processes that are identical in the development, test and production environments. Such processes need to be made reliable and trusted in the sense that they should reduce the chance of human error and provide visibility at each stage of the process, as well as have the possibility to enable manual interactions in certain key stages. * Develop and test against production-like systems: both Service Developers and VNF Developers need to have the opportunity to verify and debug their respective SDI code in systems that have characteristics which are very close to the production environment where the code is expected to be ultimately deployed. Customizations of Service Function Chains or VNFs could thus be released frequently to a production environment in compliance with policies set by the Operators. Adequate isolation and protection of the services active in the infrastructure from services being tested or debugged should be provided by the production environment. * Monitor and validate operational quality: Service Developers, VNF Developers and Operators must be equipped with tools, automated as much as possible, that enable to continuously monitor the operational quality of the services deployed on SDI. Monitoring tools should be complemented by tools that allow verifying and validating the operational quality of the service in line with established procedures which might be standardized (for example, Y.1564 Ethernet Activation [Y1564]) or defined through best practices specific to a particular telecom operator. * Amplify development cycle feedback loops: An integral part of the DevOps ethos is building a cross-cultural environment that bridges the cultural gap between the desire for continuous change by the Developers and the demand by the Operators for stability and reliability of the infrastructure. Feedback from customers is collected and transmitted throughout the organization. From a technical perspective, such cultural aspects could be addressed through common sets of tools and APIs that are aimed at providing a shared vocabulary for both Developers and Operators, as well as simplifying the reproduction of problematic situations in the development, test and operations environments. Meirosu, et al. Expires September 20,18, 2016 [Page 7] Internet-Draft DevOps Challenges March 2016 Network operators that would like to move to agile methods to deploy and manage their networks and services face a different environment compared to typical software companies where simplified trust relationships between personnel are the norm. In software companies, it is not uncommon that the same person may be rotating between different roles. In contrast, in a telecom service provider, there are strong organizational boundaries between suppliers (whether in Developer roles for network functions, or in Operator roles for outsourced services) and the carrier's own personnel that might also take both Developer and Operator roles. How DevOps principles reflect on these trust relationships and to what extent initiatives such as co-creation could transform the environment to facilitate closer Dev and Ops integration across business boundaries is an interesting area for business studies, but we could not for now identify a specific technological challenge. 3. Continuous Integration Software integration is the process of bringing together the software component subsystems into one software system, and ensuring that the subsystems function together as a system. Software integration can apply regardless of the size of the software components. The objective of Continuous Integration is to prevent integration problems close to the expected release of a software development project into a production (operations) environment. Continuous Integration is therefore closely coupled with the notion of DevOps as a mechanism to ease the transition from development to operations. Continuous integration may result in multiple builds per day. It is also typically used in conjunction with test driven development approaches that integrate unit testing into the build process. The unit testing is typically automated through build servers. Such servers may implement a variety of additional static and dynamic tests as well as other quality control and documentation extraction functions. The reduced cycle times of continuous enable improved software quality by applying small efforts frequently. Continuous Integration applies to developers of VNF as they integrate the components that they need to deliver their VNF. The VNFs may contain components developed by different teams within the VNF Provider, or may integrate code developed externally - e.g. in commercial code libraries or in open source communities. Service providers also apply continuous integration in the development of network services. Network services are comprised of Meirosu, et al. Expires September 20,18, 2016 [Page 8] Internet-Draft DevOps Challenges March 2016 various aspects including VNFs and connectivity within and between them as well as with various associated resource authorizations. The components of the networks service are all dynamic, and largely represented by software that must be integrated regularly to maintain consistency. Some of the software components that Service Providers may be sourced from VNF Providers or from open source communities. Service Providers are increasingly motivated to engage with open Source communities [OSandS]. Open source interfaces supported by open source communities may be more useful than traditional paper interface specifications. Even where Service Providers are deeply engaged in the open source community (e.g. OPNFV) many service providers may prefer to obtain the code through some software provider as a business practice. Such software providers have the same interests in software integration as other VNF providers. 4. Continuous Delivery The practice of Continuous Delivery extends Continuous Integration by ensuring that the software (either a VNF code or code for SDI) checked in on the mainline is always in a user deployable state and enables rapid deployment by those users. For critical systems such as telecommunications networks, Continuous Delivery has the advantage of including a manual trigger before the actual deployment in the live system, compared to the Continuous Deployment methodology which is also part of DevOps processes in software companies. 5. Consistency, Availability and Partitioning Challenges The CAP theorem [CAP] states that any networked shared-data system can have at most two of following three properties: 1) Consistency (C) equivalent to having a single up-to-date copy of the data; 2) high Availability (A) of that data (for updates); and 3) tolerance to network Partitions (P). Looking at a telecom SDI as a distributed computational system (routing/forwarding packets can be seen as a computational problem), just two of the three CAP properties will be possible at the same time. The general idea is that 2 of the 3 have to be chosen. CP favor consistency, AP favor availability, CA there are no partition. This has profound implications for technologies that need to be developed in line with the "deploy with repeatable, reliable processes" principle for configuring SDI states. Latency or delay and partitioning properties are closely related, and such relation becomes more important in the case of telecom service providers where Devs and Ops interact with widely distributed infrastructure. Meirosu, et al. Expires September 20,18, 2016 [Page 9] Internet-Draft DevOps Challenges March 2016 Limitations of interactions between centralized management and distributed control need to be carefully examined in such environments. Traditionally connectivity was the main concern: C and A was about delivering packets to destination. The features and capabilities of SDN and NFV are changing the concerns: for example in SDN, control plane Partitions no longer imply data plane Partitions, so A does not imply C. In practice, CAP reflects the need for a balance between local/distributed operations and remote/centralized operations. Furthermore to CAP aspects related to individual protocols, interdependencies between CAP choices for both resources and VNFs that are interconnected in a forwarding graph need to be considered. This is particularly relevant for the "Monitor and Validate Operational Quality" principle, as apart from transport protocols, most OAM functionality is generally configured in processes that are separated from the configuration of the monitored entities. Also, partitioning in a monitoring plane implemented through VNFs executed on compute resources does not necessarily mean that the dataplane of the monitored VNF was partitioned as well. 6. Stability Challenges The dimensions, dynamicity and heterogeneity of networks are growing continuously. Monitoring and managing the network behavior in order to meet technical and business objectives is becoming increasingly complicated and challenging, especially when considering the need of predicting and taming potential instabilities. In general, instability in networks may have primary effects both jeopardizing the performance and compromising an optimized use of resources, even across multiple layers: in fact, instability of end- to-end communication paths may depend both on the underlying transport network, as well as the higher level components specific to flow control and dynamic routing. For example, arguments for introducing advanced flow admission control are essentially derived from the observation that the network otherwise behaves in an inefficient and potentially unstable manner. Even with resources over provisioning, a network without an efficient flow admission control has instability regions that can even lead to congestion collapse in certain configurations. Another example is the instability which is characteristic of any dynamically adaptive routing system. Routing instability, which can be (informally) defined as the quick change of network reachability and topology information, has a number of possible origins, including problems with connections, router Meirosu, et al. Expires September 20,18, 2016 [Page 10] Internet-Draft DevOps Challenges March 2016 failures, high levels of congestion, software configuration errors, transient physical and data link problems, and software bugs. As a matter of fact, the states monitored and used to implement the different control and management functions in network nodes are governed by several low-level configuration commands (today still done mostly manually). Further, there are several dependencies among these states and the logic updating the states (most of which are not kept aligned automatically). Normally, high-level network goals (such as the connectivity matrix, load-balancing, traffic engineering goals, survivability requirements, etc) are translated into low-level configuration commands (mostly manually) individually executed on the network elements (e.g., forwarding table, packet filters, link- scheduling weights, and queue-management parameters, as well as tunnels and NAT mappings). Network instabilities due to configuration errors can spread from node to node and propagate throughout the network. DevOps in the data center is a source of inspiration regarding how to simplify and automate management processes for software-defined infrastructure. Although the low-level configuration could be automated by DevOps tools such as CFEngine [C2015], Puppet [P2015] and Ansible [A2015], the high-level goal translation towards tool- specific syntax is still a manual process. In addition, while carrier-grade configuration tools using the NETCONF protocol support complex atomic transaction management (which reduces the potential for instability), Ansible requires third-party components to support rollbacks and the Puppet transactions are not atomic. As a specific example, automated configuration functions are expected to take the form of a "control loop" that monitors (i.e., measures) current states of the network, performs a computation, and then reconfigures the network. These types of functions must work correctly even in the presence of failures, variable delays in communicating with a distributed set of devices, and frequent changes in network conditions. Nevertheless cascading and nesting of automated configuration processes can lead to the emergence of non- linear network behaviors, and as such sudden instabilities (i.e. identical local dynamic can give rise to widely different global dynamics). Meirosu, et al. Expires September 20,18, 2016 [Page 11] Internet-Draft DevOps Challenges March 2016 7. Observability Challenges Monitoring algorithms need to operate in a scalable manner while providing the specified level of observability in the network, either for operation purposes (Ops part) or for debugging in a development phase (Dev part). We consider the following challenges: * Scalability - relates to the granularity of network observability, computational efficiency, communication overhead, and strategic placement of monitoring functions. * Distributed operation and information exchange between monitoring functions - monitoring functions supported by the nodes may perform specific operations (such as aggregation or filtering) locally on the collected data or within a defined data neighborhood and forward only the result to a management system. Such operation may require modifications of existing standards and development of protocols for efficient information exchange and messaging between monitoring functions. Different levels of granularity may need to be offered for the data exchanged through the interfaces, depending on the Dev or Ops role. Modern messaging systems, such as Apache Kafka [AK2015], widely employed in datacenter environments, were optimized for messages that are considerably larger than reading a single counter value (typical SNMP GET call usage) - note the throughput vs record size from [K2014]. It is also debatable to what extent properties such as message persistence within the bus are needed in a carrier environment, where MIBs practically offer already a certain level of persistence of management data at the node level. Also, they require the use of IP addressing which might not be needed when the monitored data is consumed by a function within the same node. * Common communication channel between monitoring functions and higher layer entities (orchestration, control or management systems) - a single communication channel for configuration and measurement data of diverse monitoring functions running on heterogeneous hard- and software environments. In telecommunication environments, infrastructure assets span not only large geographical areas, but also a wide range of technology domains, ranging from CPEs, access-, aggregation-, and transport networks, to datacenters. This heterogeneity of hard- and software platforms requires higher layer entities to utilize various parallel communication channels for either configuration or data retrieval of monitoring functions within these technology domains. To address automation and advances in monitoring programmability, software defined telecommunication infrastructures would benefit from a single flexible communication channel, thereby supporting the dynamicity of virtualized environments. Such a channel should ideally support propagation of Meirosu, et al. Expires September 20,18, 2016 [Page 12] Internet-Draft DevOps Challenges March 2016 configuration, signalling, and results from monitoring functions; carrier-grade operations in terms of availability and multi-tenant features; support highly distributed and hierarchical architectures, keeping messages as local as possible; be lightweight, topology independent, network address agnostic; support flexibility in terms of transport mechanisms and programming language support. Existing popular state-of-the-art message queuing systems such as RabbitMQ [R2015] fulfill many of these requirements. However, they utilize centralized brokers, posing a single point-of-failure and scalability concerns within vastly distributed NFV environment. Furthermore, transport support is limited to TCP/IP. ZeroMQ [Z2015] on the other hard lacks any advanced features for carrier-grade operations, including high-availability, authentication, and tenant isolation. * Configurability and conditional observability - monitoring functions that go beyond measuring simple metrics (such as delay, or packet loss) require expressive monitoring annotation languages for describing the functionality such that it can be programmed by a controller. Monitoring algorithms implementing self-adaptive monitoring behavior relative to local network situations may employ such annotation languages to receive high-level objectives (KPIs controlling tradeoffs between accuracy and measurement frequency, for example) and conditions for varying the measurement intensity. Steps in this direction were taken by the DevOps tools such as Splunk [S2015], whose collecting agent has the ability to load particular apps that in turn access specific counters or log files. However, such apps are tool specific and may also require deploying additional agents that are specific to the application, library or infrastructure node being monitored. Choosing which objects to monitor in such environment means deploying a tool-specific script that configures the monitoring app. * Automation - includes mapping of monitoring functionality from a logical forwarding graph to virtual or physical instances executing in the infrastructure, as well as placement and re-placement of monitoring functionality for required observability coverage and configuration consistency upon updates in a dynamic network environment. Puppet [P2015] manifests or Ansible [A2015] playbooks could be used for automating the deployment of monitoring agents, for example those used by Splunk [S2015]. However, both manifests and playbooks were designed to represent the desired system configuration snapshot at a particular moment in time - they would now need to be generated automatically by the orchestration tools instead of a DevOps person. * Actionable data Meirosu, et al. Expires September 20,18, 2016 [Page 13] Internet-Draft DevOps Challenges March 2016 Data produced by observability tools could be utilized in a wide category of processes, ranging from billing and dimensioning to real- time troubleshooting and optimization. In order to allow for data- driven automated decisions and actuations based on these decisions, the data needs to be actionable. We define actionable data as being representative for a particular context or situation and an adequate input towards a decision. Ensuring actionable data is challenging in a number of ways, including: defining adaptive correlation and sampling windows, filtering and aggregation methods that are adapted or coordinated with the actual consumer of the data, and developing analytical and predictive methods that account for the uncertainty or incompleteness of the data. * Data Virtualization Data is key in helping both Developers and Operators perform their tasks. Traditional Network Management Systems were optimized for using one database that contains the master copy of the operational statistics and logs of network nodes. Ensuring access to this data from across the organization is challenging because strict privacy and business secrets need to be protected. In DevOps-driven environments, data needs to be made available to Developers and their test environments. Data virtualization collectively defines a set of technologies that ensure that restricted copies of the partial data needed for a particular task may be made available while enforcing strict access control. Further than simple access control, data virtualization needs to address scalability challenges involved in copying large amounts of operational data as well as automatically disposing of it when the task authorized for using it has finished. 8. Verification Challenges Enabling ongoing verification of code is an important goal of continuous integration as part of the data center DevOps concept. In a telecom SDI, service definitions, decompositions and configurations need to be expressed in machine-readable encodings. For example, configuration parameters could be expressed in terms of YANG data models. However, the infrastructure management layers (such as Software-Defined Network Controllers and Orchestration functions) might not always export such machine-readable descriptions of the runtime configuration state. In this case, the management layer itself could be expected to include a verification process that has the same challenges as the stand-alone verification processes we outline later in this section. In that sense, verification can be considered as a set of features providing gatekeeper functions to Meirosu, et al. Expires September 20,18, 2016 [Page 14] Internet-Draft DevOps Challenges March 2016 verify both the abstract service models and the proposed resource configuration before or right after the actual instantiation on the infrastructure layer takes place. A verification process can involve different layers of the network and service architecture. Starting from a high-level verification of the customer input (for example, a Service Graph as defined in [I-D.unify-nfvrg-challenges]), the verification process could go more in depth to reflect on the Service Function Chain configuration. At the lowest layer, the verification would handle the actual set of forwarding rules and other configuration parameters associated to a Service Function Chain instance. This enables the verification of more quantitative properties (e.g. compliance with resource availability), as well as a more detailed and precise verification of the abovementioned topological ones. Existing SDN verification tools could be deployed in this context, but the majority of them only operate on flow space rules commonly expressed using OpenFlow syntax. Moreover, such verification tools were designed for networks where the flow rules are necessary and sufficient to determine the forwarding state. This assumption is valid in networks composed only by network functions that forward traffic by analyzing only the packet headers (e.g. simple routers, stateless firewalls, etc.). Unfortunately, most of the real networks contain active network functions, represented by middle-boxes that dynamically change the forwarding path of a flow according to function-local algorithms and an internal state (that is based on the received packets), e.g. load balancers, packet marking modules and intrusion detection systems. The existing verification tools do not consider active network functions because they do not account for the dynamic transformation of an internal state into the verification process. Defining a set of verification tools that can account for active network functions is a significant challenge. In order to perform verification based on formal properties of the system, the internal states of an active (virtual or not) network function would need to be represented. Although these states would increase the verification process complexity (e.g., using simple model checking would not be feasible due to state explosion), they help to better represent the forwarding behavior in real networks. A way to address this challenge is by attempting to summarize the internal state of an active network function in a way that allows for the verification process to finish within a reasonable time interval. Meirosu, et al. Expires September 20,18, 2016 [Page 15] Internet-Draft DevOps Challenges March 2016 9. Troubleshooting Challenges One of the problems brought up by the complexity introduced by NFV and SDN is pinpointing the cause of a failure in an infrastructure that is under continuous change. Developing an agile and low- maintenance debugging mechanism for an architecture that is comprised of multiple layers and discrete components is a particularly challenging task to carry out. Verification, observability, and probe-based tools are key to troubleshooting processes, regardless whether they are followed by Dev or Ops personnel. * Automated troubleshooting workflows Failure is a frequently occurring event in network operation. Therefore, it is crucial to monitor components of the system periodically. Moreover, the troubleshooting system should search for the cause automatically in the case of failure. If the system follows a multi-layered architecture, monitoring and debugging actions should be performed on components from the topmost layer to the bottom layer in a chain. Likewise, the result of operations should be notified in reverse order. In this regard, one should be able to define monitoring and debugging actions through a common interface that employs layer hopping logic. Besides, this interface should allow fine-grained and automatic on-demand control for the integration of other monitoring and verification mechanisms and tools. * Troubleshooting with active measurement methods Besides detecting network changes based on passively collected information, active probes to quantify delay, network utilization and loss rate are important to debug errors and to evaluate the performance of network elements. While tools that are effective in determining such conditions for particular technologies were specified by IETF and other standardization organization, their use requires a significant amount of manual labor in terms of both configuration and interpretation of the results. In contrast, methods that test and debug networks systematically based on models generated from the router configuration, router interface tables or forwarding tables, would significantly simplify management. They could be made usable by Dev personnel that have little expertise on diagnosing network defects. Such tools naturally lend themselves to integration into complex troubleshooting workflows that could be generated automatically based on the description of a particular service chain. However, there are scalability challenges associated with deploying such tools in a network. Some tools may poll each networking device for the forwarding table information to Meirosu, et al. Expires September 20,18, 2016 [Page 16] Internet-Draft DevOps Challenges March 2016 calculate the minimum number of test packets to be transmitted in the network. Therefore, as the network size and the forwarding table size increase, forwarding table updates for the tools may put a non- negligible load in the network. 10. Programmable network management The ability to automate a set of actions to be performed on the infrastructure, be it virtual or physical, is key to productivity increases following the application of DevOps principles. Previous sections in this document touched on different dimensions of programmability: - Section 5 approached programmability in the context of developing new capabilities for monitoring and for dynamically setting configuration parameters of deployed monitoring functions - Section 7 reflected on the need to determine the correctness of actions that are to be inflicted on the infrastructure as result of executing a set of high-level instructions - Section 8 considered programmability in the perspective of an interface to facilitate dynamic orchestration of troubleshooting steps towards building workflows and for reducing the manual steps required in troubleshooting processes We expect that programmable network management - along the lines of [RFC7426] - will draw more interest as we move forward. For example, in [I-D.unify-nfvrg-challenges], the authors identify the need for presenting programmable interfaces that accept instructions in a standards-supported manner for the Two-way Active Measurement Protocol (TWAMP)TWAMP protocol. More specifically, an excellent example in this case is traffic measurements, which are extensively used today to determine SLA adherence as well as debug and troubleshoot pain points in service delivery. TWAMP is both widely implemented by all established vendors and deployed by most global operators. However, TWAMP management and control today relies solely on diverse and proprietary tools provided by the respective vendors of the equipment. For large, virtualized, and dynamically instantiated infrastructures where network functions are placed according to orchestration algorithms proprietary mechanisms for managing TWAMP measurements have severe limitations. For example, today's TWAMP implementations are managed by vendor-specific, typically command-line interfaces (CLI), which can be scripted on a platform-by-platform basis. As a result, although the control and Meirosu, et al. Expires September 20,18, 2016 [Page 17] Internet-Draft DevOps Challenges March 2016 test measurement protocols are standardized, their respective management is not. This hinders dramatically the possibility to integrate such deployed functionality in the SP-DevOps concept. In this particular case, recent efforts in the IPPM WG [I-D.cmzrjp-ippm-twamp-yang] aim to define a standard TWAMP data model and effectively increase the programmability of TWAMP deployments in the future. Data center DevOps tools, such as those surveyed in [D4.1], developed proprietary methods for describing and interacting through interfaces with the managed infrastructure. Within certain communities, they became de-facto standards in the same way particular CLIs became de- facto standards for Internet professionals. Although open-source components and a strong community involvement exists, the diversity of the new languages and interfaces creates a burden for both vendors in terms of choosing which ones to prioritize for support, and then developing the functionality and operators that determine what fits best for the requirements of their systems. 11. DevOps Performance Metrics Defining a set of metrics that are used as performance indicators is important for service providers to ensure the successful deployment and operation of a service in the software-defined telecom infrastructure. We identify three types of considerations that are particularly relevant for these metrics: 1) technical considerations directly related to the service provided, 2) process-related considerations regarding the deployment, maintenance and troubleshooting of the service, i.e. concerning the operation of VNFs, and 3) cost-related considerations associated to the benefits from using a Software- Defined Telecom Infrastructure. First, technical performance metrics shall be service-dependent/- oriented and may address inter-alia service performance in terms of delay, throughput, congestion, energy consumption, availability, etc. Acceptable performance levels should be mapped to SLAs and the requirements of the service users. Metrics in this category were defined in IETF working groups and other standardization organizations with responsibility over particular service or infrastructure descriptions. Second, process-related metrics shall serve a wider perspective in the sense that they shall be applicable for multiple types of Meirosu, et al. Expires September 20,18, 2016 [Page 18] Internet-Draft DevOps Challenges March 2016 services. For instance, process-related metrics may include: number of probes for end-to-end QoS monitoring, number of on-site interventions, number of unused alarms, number of configuration mistakes, incident/trouble delay resolution, delay between service order and deliver, or number of self-care operations. Third, cost-related metrics shall be used to monitor and assess the benefit of employing SDI compared to the usage of legacy hardware infrastructure with respect to operational costs, e.g. possible man- hours reductions, elimination of deployment and configuration mistakes, etc. Finally, identifying a number of highly relevant metrics for DevOps and especially monitoring and measuring them is highly challenging because of the amount and availability of data sources that could be aggregated within one such metric, e.g. calculation of human intervention, or secret aspects of costs. 12. Security Considerations TBD 13. IANA Considerations This memo includes no request to IANA. 14. References 14.1. Informative References [NFVMANO] ETSI, "Network Function Virtualization (NFV) Management and Orchestration V0.6.1 (draft)", Jul. 2014 [I-D.aldrin-sfc-oam-framework] S. Aldrin, R. Pignataro, N. Akiya. "Service Function Chaining Operations, Administration and Maintenance Framework", draft-aldrin-sfc-oam-framework-02, (work in progress), July 2015. Meirosu, et al. Expires September 20,18, 2016 [Page 19] Internet-Draft DevOps Challenges March 2016 [I-D.lee-sfc-verification] S. Lee and M. Shin. "Service Function Chaining Verification", draft-lee-sfc-verification-00, (work in progress), February 2014. [RFC7426] E. Haleplidis (Ed.), K. Pentikousis (Ed.), S. Denazis, J. Hadi Salim, D. Meyer, and O. Koufopavlou, "Software Defined Networking (SDN): Layers and Architecture Terminology", RFC 7426, January 2015 [RFC7149] M. Boucadair and C Jaquenet. "Software-Defined Networking: A Perspective from within a Service Provider Environment", RFC 7149, March 2014. [TR228] TMForum Gap Analysis Related to MANO Work. TR228, May 2014 [I-D.unify-nfvrg-challenges] R. Szabo et al. "Unifying Carrier and Cloud Networks: Problem Statement and Challenges", draft- unify-nfvrg-challenges-03 (work in progress), October 2016 [I-D.cmzrjp-ippm-twamp-yang] Civil, R., Morton, A., Zheng, L., Rahman, R., Jethanandani, M., and K. Pentikousis, "Two-Way Active Measurement Protocol (TWAMP) Data Model", draft- cmzrjp-ippm-twamp-yang-02 (work in progress), October 2015. [D4.1] W. John et al. D4.1 Initial requirements for the SP-DevOps concept, universal node capabilities and proposed tools, August 2014. [SDNsurvey] D. Kreutz, F. M. V. Ramos, P. Verissimo, C. Esteve Rothenberg, S. Azodolmolky, S. Uhlig. "Software-Defined Networking: A Comprehensive Survey." To appear in proceedings of the IEEE, 2015. [DevOpsP] "DevOps, the IBM Approach" 2013. [Online]. [Y1564] ITU-R Recommendation Y.1564: Ethernet service activation test methodology, March 2011 [CAP] E. Brewer, "CAP twelve years later: How the "rules" have changed", IEEE Computer, vol.45, no.2, pp.23,29, Feb. 2012. [H2014] N. Handigol, B. Heller, V. Jeyakumar, D. Mazieres, N. McKeown; "I Know What Your Packet Did Last Hop: Using Packet Histories to Troubleshoot Networks", In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pp.71-95 Meirosu, et al. Expires September 20,18, 2016 [Page 20] Internet-Draft DevOps Challenges March 2016 [W2011] A. Wundsam, D. Levin, S. Seetharaman, A. Feldmann; "OFRewind: Enabling Record and Replay Troubleshooting for Networks". In Proceedings of the Usenix Anual Technical Conference (Usenix ATC '11), pp 327-340 [S2010] E. Al-Shaer and S. Al-Haj. "FlowChecker: configuration analysis and verification of federated Openflow infrastructures" In Proceedings of the 3rd ACM workshop on Assurable and usable security configuration (SafeConfig '10). Pp. 37-44 [OSandS] S. Wright, D. Druta, "Open Source and Standards: The Role of Open Source in the Dialogue between Research and Standardization" Globecom Workshops (GC Wkshps), 2014 , pp.650,655, 8-12 Dec. 2014 [C2015] CFEngine. Online: http://cfengine.com/product/what-is- cfengine/, retrieved Sep 23, 2015. [P2015] Puppet. Online: http://puppetlabs.com/puppet/what-is-puppet, retrieved Sep 23, 2015. [A2015] Ansible. Online: http://docs.ansible.com/ , retrieved Sep 23, 2015. [AK2015] Apache Kafka. Online: http://kafka.apache.org/documentation.html, retrieved Sep 23, 2015. [S2015] Splunk. Online: http://www.splunk.com/en_us/products/splunk- light.html , retrieved Sep 23, 2015. [K2014] J. Kreps. Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines). Online: https://engineering.linkedin.com/kafka/benchmarking-apache- kafka-2-million-writes-second-three-cheap-machines, retrieved Sep 23, 2015. [R2015] RabbitMQ. Online: https://www.rabbitmq.com/ , retrieved Oct 13, 2015 [Z2015] ZeroMQ. Online: http://zeromq.org/ , retrieved Oct 13, 2015 Meirosu, et al. Expires September 20,18, 2016 [Page 21] Internet-Draft DevOps Challenges March 2016 15. Contributors W. John (Ericsson), J. Kim (Deutsche Telekom), S. Sharma (iMinds) 16. Acknowledgments The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement no. 619609 - the UNIFY project. The views expressed here are those of the authors only. The European Commission is not liable for any use that may be made of the information in this document. We would like to thank in particular the UNIFY WP4 contributors, the internal reviewers of the UNIFY WP4 deliverables and Russ White and Ramki Krishnan for their suggestions. This document was prepared using 2-Word-v2.0.template.dot. Meirosu, et al. Expires September 20,18, 2016 [Page 22] Internet-Draft DevOps Challenges March 2016 17. Authors' Addresses Catalin Meirosu Ericsson Research S-16480 Stockholm, Sweden Email: catalin.meirosu@ericsson.com Antonio Manzalini Telecom Italia Via Reiss Romoli, 274 10148 - Torino, Italy Email: antonio.manzalini@telecomitalia.it Juhoon Kim Deutsche Telekom AG Winterfeldtstr. 21 10781 Berlin, Germany Email: J.Kim@telekom.de Rebecca Steinert SICS Swedish ICT AB Box 1263, SE-16429 Kista, Sweden Email: rebste@sics.se Sachin Sharma Ghent University-iMinds Research group IBCN - Department of Information Technology Zuiderpoort Office Park, Blok C0 Gaston Crommenlaan 8 bus 201 B-9050 Gent, Belgium Email: sachin.sharma@intec.ugent.be Guido Marchetto Politecnico di Torino Corso Duca degli Abruzzi 24 10129 - Torino, Italy Email: guido.marchetto@polito.it Ioanna Papafili Hellenic Telecommunications Organization Measurements and Wireless Technologies Section Laboratories and New Technologies Division 2, Spartis & Pelika str., Maroussi, GR-15122, Attica, Greece Buidling E, Office 102 Meirosu, et al. Expires September 20,18, 2016 [Page 23] Internet-Draft DevOps Challenges March 2016 Email: iopapafi@oteresearch.gr Kostas Pentikousis EICT GmbH Torgauer Strasse 12-15 Berlin 10829 Germany Email: k.pentikousis@eict.de Steven Wright AT&T Services Inc. 1057 Lenox Park Blvd NE, STE 4D28 Atlanta, GA 30319 USA Email: sw3588@att.com Wolfgang John Ericsson Research S-16480 Stockholm, Sweden Email: wolfgang.john@ericsson.com Meirosu, et al. Expires September 20,18, 2016 [Page 24]