Internet Research Task Force                                W. Tavernier
Internet-Draft                                   Ghent University - IBBT
Intended status: Informational                          D. Papadimitriou
Expires: April 18, 2011                              Alcatel-Lucent Bell
                                                                D. Colle
                                                 Ghent University - IBBT
                                                        October 15, 2010

     Learning Capable Communication Network (LCCN) Problem Statement
             draft-tavernier-irtf-lccn-problem-statement-00

Abstract

Operational procedures and protocols of today's communication networks typically use explicitly defined mechanisms and representations to reach the goals associated with their design. This practice results in numerous protocols having a restricted space for (self-)adaptability and sensitivity with respect to their network context (e.g. network traffic conditions, failure conditions, etc.). On the other hand, a wide spectrum of learning and optimization techniques is available with which networks could learn and optimize their behavior in the running context. This document describes the opportunities and challenges for a Learning Capable Communication Network (LCCN).

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on April 18, 2011.

Tavernier, et al.          Expires April 18, 2011              [Page 1]
Internet-Draft           LCCN Problem Statement            October 2010

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Introduction
2. Learning opportunities
  2.1. Availability of network data and statistics
  2.2. Availability of processing capacity
3. The learning process
4. Architectural implications
  4.1. From a pre-defined open-loop control towards a self-adaptive closed-loop control
  4.2. The integration of learning capability
5. Applicability
  5.1. Functional domains
  5.2. Scope with respect to the hourglass model
6. Research directions and objectives
  6.1. Relation to existing research domains
  6.2. Experimental research objectives
7. IANA Considerations
8. Security Considerations
9. Conclusions
10. Acknowledgements
11. Informative references
Authors' Addresses

1. Introduction

As currently instantiated, the Internet hourglass model drives a top-down approach. Current communication networks typically operate with an explicit internal representation of themselves, their network knowledge, and their global goals. Routers follow explicitly pre-defined behavior: they decide persistently and execute uniformly. Global Internet behavior is evaluated, and configuration is changed, when the evaluation results indicate that the networking systems are not accomplishing what they were intended to, or when better functionality or performance is expected.

In several Internet areas, this operational model shows its limits. Inter-domain routing protocols such as BGP are, due to their inherent exploration properties, increasingly impacted by topology and policy dynamics that delay their convergence. Network management becomes more and more complex, and networks do not automatically take network traffic statistics into account. Several efforts have been undertaken to overcome the increasing number of issues. However, improving the routing system to accommodate various scales of challenges in network efficiency further complicates its operation ([I-D.ietf-idr-bgp-issues]). Further patching the inter-domain routing system and equipment will result in more operational complexity.

In this document, we suggest an alternative (bottom-up) approach to the operation of the Internet routing and forwarding system. Compared to current routed networks, which require explicit specification of their expected behavior, self-organizing and self-adaptive systems could dynamically modify or adjust their behavior to varying network conditions in order to tune their operation, optimize their overall performance and even add functionalities through closed-loop adaptive control.

We see three main drivers for the design of Learning Capable Communication Networks (LCCN): i) the availability of network-related data, ii) the wide range of possible learning paradigms that can be borrowed from domains such as Artificial Intelligence (AI), machine learning, and bio-inspired learning, and iii) the increased CPU capacity available at both forwarding and control plane level, allowing for background monitoring, learning and optimization in routers.

The structure of this document is as follows. In Section 2, we describe the opportunities for communication networks to learn how to best improve their performance. Section 3 gives a more formal but broad definition of the concept of learning. Section 4 provides a first set of architectural implications of adding learning capability to communication networks. The applicability domain of LCCNs is covered in Section 5, and possible research directions are described in Section 6. Concluding remarks and future work are outlined in Section 9.

2. Learning opportunities

2.1. Availability of network data and statistics

Hosts communicate by sending packets to each other via transit network nodes. As such, a communication network is loaded with packets corresponding to network traffic flows between given source and destination nodes. Many techniques exist to gather statistics about the resulting traffic flows crossing routers.

o Online statistical counters: properties of transiting traffic are measured in a router using counters, for example the number of packets per destination prefix or packet size distribution curves.

o Traffic sampling: instead of counting certain traffic characteristics, unmodified traffic is captured for some time interval. This sample is then used to derive certain characteristics, using e.g. the setting proposed in [Estan04], by means of a sample-and-hold technique.

Unfortunately, the resulting statistical data is rarely used to directly improve the routing and/or forwarding decisions of network nodes (see the active self-adaptive closed-loop control in Section 4.1). However, it is clear that network operation could benefit from taking these statistics into account automatically, to allow for inherent load balancing, prioritized traffic flow switch-over, network load maximization, etc. To a lesser extent (since the routing system is deterministically adaptive to topological and/or policy changes), this observation also applies to routing information exchanges.

Not only are the statistics of network traffic valuable; the behavioral aspects of the network itself may also contain information usable for increasing the performance of the network. Statistics about node or link failures can help network recovery mechanisms fine-tune their operation based on the specific statistical context of the running network. The convergence behavior of routing protocols in the specific running context can be monitored so as to reduce the duration of transient loops. In brief, the specific running conditions of communication networks may hide (statistical) information that is currently (largely) unused by Internet protocols, yet provides an opportunity to better analyze the behavior of the network depending on the context it is running within.

2.2. Availability of processing capacity

The possibility of maintaining network statistics is not only dependent on the network conditions and environment themselves, but also on the physical feasibility of monitoring and storing them over longer periods.
Supported by Moore's law, processing power has increased over the last years, either through higher CPU clock frequencies or through the combination of multiple CPUs on one chip. In combination with the high increase in line card speeds (up to 100 Gbps), the possibility of capturing useful network statistics in the background seems within reach.

3. The learning process

Many research fields study the concept of learning from various viewpoints. In the context of LCCNs, learning algorithms correspond to the (broad) class of algorithms that discover the relationship between system variables (i.e. input, output and hidden variables) from data samples of the system's environment (obtained by means of measurement/monitoring). More formally, the learning process consists of the following steps (see Figure 1).

                           ,--.
                          +    +
                          |`--'|
                          |KIB |<------------------+
                          +    +                   |
                           `--'                    |
                             |                     |
                             v                     |
                     +----------------+            |
            ,--.     |    Learner     |         +------+
   E       +    +    |                |        /        \
   v       |`--'|    | +------------+ |       / Hypothe- \
   e ----->|    |------> Learning   | |------>\  sis h   /
   n   |   +    +    | | algorithm  | |        \        /
   t   |    `--'     | +------------+ |         +------+
       |  Training   +----------------+           |  ^
       |  data set                                |  |
       |                     +--------------------+  |
       |                     |                       |
       |                     v                       +
       |    ,--.     +----------------+             /  \
       |   +    +    |   Performer    |            /    \
       |   |`--'|    | +------------+ |           / Test \     Target
       +-->|    |------> Learned    | |---------->\      /---> function
           +    +    | | hypothesis | |            \    /
            `--'     | +------------+ |             \  /
            Test     +----------------+              +
          data set           ^                       |
                             |                       |
                             +-----------------------+

                               Figure 1

o Step 0: Choose training and test data sets associated with a given (sequence of) event(s) observed in the system's running environment.

o Step 1: Training (learner): learn a hypothesis h (model), a function of the input (training data set) that best approximates the output y (symbolic output corresponds to classification, numeric output to regression). Knowledge: use prior "knowledge" stored in the Knowledge Information Base (KIB) to learn h.

o Step 2: Testing (performer): evaluate the learned model using the test data set.

4. Architectural implications

The control of dynamic systems such as communication networks, and routers in particular, can be explained as an iterative cycle referred to as the control loop. The following sections explain how the control loop of existing communication networks and routers differs from that of LCCNs.

4.1. From a pre-defined open-loop control towards a self-adaptive closed-loop control

The configuration and operation of existing communication networks typically consist of a set of components and algorithms acting in a relatively small space of states, transitions and optimization steps. Take routers as an example: they distribute topology and/or distance information from which they compute (e.g. shortest) routing paths. Using this information, they derive the entries that are looked up to forward packets based on the destination address of incoming packets. When a topological or distance change occurs, routing updates are disseminated in the network in a timely manner such that each router obtains a coherent full view of the new network topology and/or distances and can re-compute routing paths taking into account this new state of the network.

While these procedures might seem effective at first sight, they are mostly pre-determined and inflexible with respect to the environment they are running in. Indeed, routers are agnostic to traffic characteristics and to statistics of network failures. This situation exists because these techniques were developed in the early days of packet communication networks. At that time, computational and memory resources were scarce, and the resulting techniques needed to use the available resources sparingly. Moreover, most of these techniques aim to automate manual procedures used to configure or operate communication networks.
As such, routers forward packets based on their destination address by applying pre-determined decision rules and execution procedures.

While many engineering disciplines, such as the automotive or bio-industry, have adopted learning techniques to improve the performance of their operational control loops, in computer networking their application has been restricted mainly to passive applications, leading to open-loop control procedures. Examples of such applications are: time series models to analyze and predict network traffic data, anomaly detection techniques to check networks for unusual events, and statistical models that try to detect Shared Risk Link Groups (SRLG). Most applications of learning techniques serve as interesting side information in the context of network operation. They help network managers to understand and predict the behavior of their network; however, few existing network operation models include this learning capability in their direct control loop.

In this context, the overall objective is to bring the application of data mining and learning techniques one step further: towards the active integration of these techniques into the operational and control processes of communication networks. For instance, we could augment the above control paradigm with a machine learning component enabling the system and network to learn about their own behavior and environment over time, to detect and analyze problems, adapt their decisions, and tune their execution using the output of models in order to increase their functionality and performance. Systems with such an adaptive closed-loop control have network elements that are autonomously interrelated and controlled, dynamically adapting to changing environments, and learning desired behavior.

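As a rough illustration of such an adaptive closed loop, the following Python sketch cycles through detect, analyze, decide and execute phases on a single controlled element. Everything in it is an assumption made for illustration only: the ControlledElement class, the moving-average "model", the thresholds and the weight adjustment are hypothetical and are not taken from this document or from any existing router implementation.

```python
from statistics import mean


class ControlledElement:
    """Hypothetical controlled element: a link with a tunable routing weight."""

    def __init__(self, load_samples):
        self.weight = 10
        # Simulated utilization samples standing in for real monitoring data.
        self.load_samples = list(load_samples)

    def measure_load(self):
        return self.load_samples.pop(0)


def detect(element):
    """Detection phase: obtain one monitoring sample."""
    return element.measure_load()


def analyze(history, sample, window=3):
    """Analysis/learning phase: a moving average used as a trivial model."""
    history.append(sample)
    return mean(history[-window:])


def decide(predicted_load, high=0.8, low=0.3):
    """Decision phase: infer a weight adjustment from the model output."""
    if predicted_load > high:
        return 5    # make the link less attractive
    if predicted_load < low:
        return -5   # make the link more attractive
    return 0


def execute(element, delta, min_weight=5, max_weight=50):
    """Execution phase: apply the decision, clamped to an assumed safe range."""
    element.weight = max(min_weight, min(max_weight, element.weight + delta))


def run_loop(element, iterations):
    history = []
    for _ in range(iterations):
        sample = detect(element)
        predicted = analyze(history, sample)
        execute(element, decide(predicted))
    return element.weight


link = ControlledElement([0.9, 0.95, 0.9, 0.2, 0.1, 0.1])
final_weight = run_loop(link, 6)
```

The learning step is deliberately trivial here; any richer learning paradigm could replace analyze() without changing the structure of the loop.
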
Such fully distributed and technology-independent systems allow: i) self-configuration and self-organization, ii) self-protection and self-healing, and iii) self-optimization. The objective is to improve the Internet control/routing and forwarding process by enabling, automating, and distributing the decision-making processes involved in the execution of their operation.

                +-----------+            +-----------+
 system    ===> |  analyze  |----------->+  decide   | <=== rules
 knowledge      +-----------+            +-----------+
                      ^                        |
                      |                        v
                +-----+-----+            +-----------+
 self-          |  detect   |<-----------+  execute  |  self-
 monitoring     +-----------+            +-----------+  configuration
                      ^                        |
                      |                        v
                   +------------------------------+
                   |      Controlled Element      |
                   +------------------------------+

                               Figure 2

Using a more advanced control loop, the network routing mechanism could learn from network traffic, failure patterns and other context-related data observed in the network, and adapt its routing machinery to optimize it in this context. The resulting self-adaptive closed-loop control is a four-step cyclic process consisting of: i) a detection phase (e.g., monitor network traffic), which is about monitoring data, ii) an analysis or learning phase (e.g., build traffic models for prediction), in which the data obtained during the detection phase is analyzed and from which models can be learned, iii) a decision phase, in which rules/decisions are inferred from the performed analysis such that the learned model can influence the operation of the network, and iv) an execution phase.

4.2. The integration of learning capability

While it is premature (and part of the research work) to detail the implications on the Internet architecture, the design of a control system incorporating learning capability would benefit from the following design principles.

o Adaptability: modular, instead of relying on a unified and ubiquitous approach, in order to ensure gradual development (e.g. access vs core router)

o Segmentability: rely on a relatively local view rather than a global network view in order to ensure scalability, robustness, and resiliency

o Sizeability: inherit the distributed properties and capabilities of the routing system (e.g. intra- vs inter-domain) instead of a uniform and ubiquitous plane construction in order to ensure organic deployment

Taking these principles into account, the resulting architecture should specify: i) the expected behavior of the self-adaptive closed-loop process, ii) its components, and iii) the interfaces with existing routers' components and between learning-capable routers of a network (both intra- and inter-domain). The resulting closed-loop adaptive control includes a learning component that is either an upfront step or an online process, a feedback phase, and interactions with router/network control.

     Today                 Step 1                  Step 2
+--------------+     +----------------+     +------------------+
|              |     | +------------+ |     | +--------------+ |
| +----------+ |     | |  Learning  | |     | |   Routing    | |
| | Routing  | |     | +------------+ |     | | + learning   | |
| +----------+ |     |  weak coupling |     | +--------------+ |
|              | ==> | +------------+ | ==> |    integrated    |
|              |     | |  Routing   | |     | strong coupling  |
| +----------+ |     | +------------+ |     | +--------------+ |
| |Forwarding| |     | +------------+ |     | |  Forwarding  | |
| +----------+ |     | | Forwarding | |     | | + learning   | |
|              |     | +------------+ |     | +--------------+ |
+--------------+     +----------------+     +------------------+

                               Figure 3

Including learning capabilities in current Internet router architectures can follow a phased approach.
Internet routers typically consist of two functional components: i) a forwarding component, which takes care of processing and forwarding packets according to pre-configured forwarding tables, and ii) a routing component, which takes care of distributing topology/distance information, computing (shortest) routing paths using this information, and storing the resulting entries in routing tables. Forwarding table entries are subsequently derived from routing table entries.

As a first integration step, a new functional component comprising learning capability could be included. The new component would then be weakly coupled to the existing forwarding and routing components. This implies that the routing and/or forwarding component can be enhanced by functionalities of the learning component. These functionalities could be called via pre-defined interfaces between the components.

While this is an overlaid but modular build-up of a router, the integration of learning capability can go one step further. Indeed, in a next phase, instead of a separate learning component, the learning functionality could be tightly integrated into the routing and forwarding components themselves. This implies that the routing and forwarding processes themselves comprise a learning cycle (a self-adaptive closed-loop control).

It is clear that both the phasing and the detailing of the architecture are important challenges in the design of LCCNs.

5. Applicability

5.1. Functional domains

The incorporation of a learning component within the router architecture aims to i) enhance Internet functionality in order to cope with known operational challenges such as manageability and diagnosability, ii) address new challenges such as security and accountability, and iii) improve its performance (in terms of e.g. scalability and availability) by adapting forwarding and routing system decisions.

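To make the idea of adapting routing decisions more concrete, the Python sketch below lets a routing component query a weakly coupled learning component (in the spirit of the first integration step of Section 4.2) through a small interface, and uses the predicted link load to inflate link costs before a shortest-path computation. All names (LoadPredictor, adapted_cost, shortest_path), the exponential moving average used as the learner, and the cost formula are assumptions made for illustration; none of them is specified by this document.

```python
import heapq
from collections import defaultdict


class LoadPredictor:
    """Hypothetical learning component: exponential moving average of link utilization."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.estimate = defaultdict(float)  # link -> predicted utilization in [0, 1]

    def observe(self, link, utilization):
        old = self.estimate[link]
        self.estimate[link] = (1 - self.alpha) * old + self.alpha * utilization

    def predict(self, link):
        return self.estimate[link]


def adapted_cost(static_cost, predicted_utilization):
    # Assumed cost formula: the cost grows as predicted utilization approaches 1.
    return static_cost / (1.0 - min(predicted_utilization, 0.99))


def shortest_path(graph, source, target, predictor):
    """Dijkstra over adapted costs. graph maps node -> {neighbor: static_cost}."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neigh, cost in graph.get(node, {}).items():
            nd = d + adapted_cost(cost, predictor.predict((node, neigh)))
            if nd < dist.get(neigh, float("inf")):
                dist[neigh] = nd
                prev[neigh] = node
                heapq.heappush(heap, (nd, neigh))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


graph = {"A": {"B": 1, "C": 1}, "B": {"D": 1}, "C": {"D": 2}, "D": {}}
predictor = LoadPredictor()
path_idle = shortest_path(graph, "A", "D", predictor)

# Feed the learner repeated observations of a heavily loaded link A->B.
for _ in range(5):
    predictor.observe(("A", "B"), 0.9)
path_loaded = shortest_path(graph, "A", "D", predictor)
```

In this toy topology the statically shortest route uses link A->B; once the predictor has seen sustained load on that link, the adapted costs steer the computation towards the alternative route via C. The routing component only depends on predict(), which keeps the coupling weak.
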
In this context of network quality, we can think of the automated inclusion of network traffic knowledge into the configuration of routes and the resulting forwarding tables.

5.2. Scope with respect to the hourglass model

Even if learning paradigms can be applied at all levels of the hourglass model, LCCN-related research focuses on the (largest) lower half of the hourglass model ("everything over IP, and IP over everything"). As depicted in Figure 4, the goal of LCCN research is to apply learning capabilities from the transport layer down to the physical layer (thus also including the network and data link layers).

Whereas learning capability has typically already been used at higher layers, for example by banking applications or large-scale websites such as Amazon or Google, the underlying networking machinery still relies on low-information processes with very limited learning capabilities. The incorporation of a learning component within wired and wireless communication networks aims to improve both their operation and performance from the physical layer up to the TCP/IP layer.

             +---------------------+
              \    email, WWW,    /
               \     TV, ...     /
                \---------------/
                 \SMTP,HTTP,RTP/
        ---       \-----------/       ---
         ^         \  TCP,   /         ^
         |          \  UDP  /          |
         |           \-----/           |
  LCCN   |           / IP  \           |
  scope  |          /-------\          |
         |         /Ethernet,\         |
         |        /  PPP,...  \        |
         |       /-------------\       |
         v      /  CSMA, Sonet  \      v
        ---    /-----------------\    ---
              /copper,fiber,radio \
             +---------------------+

                     Figure 4

6. Research directions and objectives

6.1. Relation to existing research domains

Learning opportunities in communication networks have characteristics that are typically well suited to techniques borrowed from (machine) learning, robotics, AI, computational biology, etc.

o Difficult to explicitly characterize: events cannot be well characterized even when examples are available (inherent complexity in characterizing an event)

o Correlation: hidden correlations and trends between events within large amounts of associated data

o Dynamicity: changing conditions over time (in particular for the routing system, but also variability of traffic, user expectations and behaviors)

o Quantity: the amount of available data is too large to be handled by manual intervention

o Evolutive: new events are constantly detected/discovered

6.2. Experimental research objectives

Experimental research is a primary goal of the activities to be conducted. The following objectives would be targeted:

o The production of various studies is to be stimulated; these should enable evaluation of the performance and functional improvements resulting from the exploitation of various learning paradigms. A common understanding of these paradigms and their associated capabilities would complement this first step.

o As different models can be considered for the distribution of the learning processes (taking into account the various objectives but also the constraints resulting from network partition), determining which model best fits Internet evolution is a specific target of this research activity.

o Iterative cycles of experimentation shall make it possible to determine the suitability of the resulting architecture as well as the practical feasibility, applicability and deployability of the concept on a large scale. Documentation of appropriate use cases/scenarios would complement this work item.

7. IANA Considerations

This memo includes no request to IANA.

8. Security Considerations

Besides the research objectives detailed above, security mechanisms for the communication channels between learning components, and for the learning components themselves, shall be considered. These comprise, among others, message authentication as well as means to prevent man-in-the-middle and DDoS attacks.

9. Conclusions

Current communication networks fail to use network-related statistics which could be valuable to improve their performance. In addition, current networks fail to provide solutions to challenging issues, because they become too complex to operate and manage by manual/open-loop procedures. A learning-capable communication network (LCCN) includes a learning component which learns from the statistics of the network environment and adapts and optimizes its behavior accordingly. This gives new possibilities to improve network efficiency in several domains including network recoverability, accountability, security, scalability, and so on. The challenges (and next steps) for LCCNs lie in: i) developing a self-adaptive closed-loop control system relying on learning capability, ii) building it and applying it to various network mechanisms, and iii) verifying the resulting prototypes in experimental environments.

10. Acknowledgements

This work is supported by the European Commission (EC) Seventh Framework Programme (FP7) ECODE project (Grant No. 223936).

11. Informative references

[AI-modern]  Russell, S., "Artificial Intelligence: A Modern Approach", 2003.

[Estan04]  Estan, C., "Building a better NetFlow", October 2004.

[I-D.ietf-idr-bgp-issues]  Lange, A., "Issues in Revising BGP-4 (RFC1771 to RFC4271)", draft-ietf-idr-bgp-issues-03 (work in progress), August 2010.

[PRML]  Bishop, C., "Pattern Recognition and Machine Learning", October 2003.

Authors' Addresses

Wouter Tavernier
Ghent University - IBBT
Gaston Crommenlaan 8 bus 201
Gent, 9050
Belgium
Phone: +32(0)9 331 49 81
Email: wouter.tavernier@intec.ugent.be

Dimitri Papadimitriou
Alcatel-Lucent Bell
Copernicuslaan 50
Antwerpen, 2018
Belgium
Phone:
Email: dimitri.papadimitriou@alcatel-lucent.com

Didier Colle
Ghent University - IBBT
Gaston Crommenlaan 8 bus 201
Gent, 9050
Belgium
Phone: +32(0)9 331 49 70
Email: didier.colle@intec.ugent.be