Network Management Research Group                               M-S. Kim
Internet-Draft                                                 Y-G. Hong
Intended status: Informational                                      ETRI
Expires: January 3, 2019                                        Y-H. Han
                                                               KoreaTech
                                                                T-J. Ahn
                                                                      KT
                                                                K-H. Kim
                                                                    ETRI
                                                            July 2, 2018


      Intelligent Network Management using Reinforcement Learning
                          draft-kim-nmrg-rl-03

Abstract

   This document describes an intelligent network management system
   that manages and monitors a network autonomously using machine
   learning techniques. Reinforcement learning is one such technique;
   it can provide autonomous management with multi-agent path planning
   over a communication network. In an intelligent distributed
   multi-agent system, the main centralized node, called the global
   environment, should not only manage the workflow of all agents in a
   hybrid peer-to-peer networking architecture but also transfer and
   share information among distributed nodes. Each agent in a
   distributed node receives a cumulative reward for every action it
   takes, with respect to optimized knowledge based on a policy that
   is learned over the learning process. The optimized and trained
   knowledge involves large state information produced by control
   actions over a network. A reward from the global environment is
   reflected autonomously in the next optimized control action for
   network management in distributed networking nodes.

   The reinforcement learning (RL) process has been developed and
   expanded into deep reinforcement learning (DRL), with model-driven
   or data-driven technical approaches to the learning process. The
   technique has been widely applied in networking fields, since deep
   reinforcement learning can be used in practical networking areas
   that are subject to dynamic and heterogeneous environment
   disturbances, so that an effective strategy can be learned
   intelligently.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

Kim, et al.              Expires January 3, 2019                [Page 1]

Internet-Draft            draft-kim-nmrg-rl-03                 July 2018

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 3, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions and Terminology
   3.  Motivation
     3.1.  General Motivation for Reinforcement Learning
     3.2.  Reinforcement Learning in networks
     3.3.  Deep Reinforcement Learning in networks
     3.4.  Motivation in our work
   4.  Related Works
     4.1.  Autonomous Driving System
     4.2.  Network Defect Prediction
     4.3.  Wireless Sensor Network (WSN)
     4.4.  Routing Enhancement
     4.5.  Routing Optimization
     4.6.  Game Theory
   5.  Intelligent Machine Learning Technologies
     5.1.  Reinforcement Learning (RL)
     5.2.  Deep Learning (DL)
     5.3.  Deep Reinforcement Learning (DRL)
     5.4.  Advantage Actor Critic (A2C)
     5.5.  Asynchronous Advantage Actor Critic (A3C)
     5.6.  Policy using Distance and Frequency
     5.7.  Distributed Computing Node
     5.8.  Agent Sharing Information
   6.  Proposed Architecture
     6.1.  Architecture for Reinforcement Learning
     6.2.  Architecture for Deep Reinforcement Learning
   7.  Use case of Reinforcement Learning
     7.1.  Distributed Multi-agent Reinforcement Learning (RL): Sharing
           Information Technique
     7.2.  Intelligent Edge Computing technique for Traffic Control
           using Deep Reinforcement Learning
     7.3.  Edge computing system in a field of construction works using
           Reinforcement Learning
     7.4.  Fault prediction for core-network using Deep Learning
   8.  IANA Considerations
   9.  Security Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   In large infrastructures such as transportation, health, and energy
   systems, collaborative monitoring is needed, and with it comes a
   particular need for intelligent distributed networking systems with
   learning schemes. Agent reinforcement learning for intelligent,
   autonomous network management is, in general, a challenging method
   in a dynamic, complex, and cluttered environment over a network. It
   also requires the development of computational multi-agent learning
   systems across large numbers of distributed networking nodes, where
   the agents have limited and incomplete knowledge and can access only
   local information.

   Reinforcement learning can be an effective technique for
   transferring and sharing information among agents via the global
   environment (centralized node), as it does not require a priori
   knowledge of the agent behavior or environment to accomplish its
   tasks [Megherbi]. Such knowledge is usually acquired and learned
   automatically and autonomously by trial and error.

   Reinforcement learning is one of the machine learning techniques
   that will be adapted to various networking environments for
   automatic networks [I-D.jiang-nmlrg-network-machine-learning].
   Thus, this document provides motivation, learning techniques, and
   use cases for network machine learning.

   Deep reinforcement learning has recently extended the reinforcement
   learning algorithm into more powerful model-driven or data-driven
   techniques that work over a large state space and overcome the
   limits of the classical reinforcement learning process. Deep
   reinforcement learning has been shown to produce successful models
   for playing Atari games [Mnih]. It provides more effective system
   performance in a complex and cluttered networking environment.
   Classical reinforcement learning is of limited use in networking,
   because networking environments consist of large and complex
   components in areas such as routing configuration, optimization,
   and system management; deep reinforcement learning can take much
   more state information into account in the learning process.

2.  Conventions and Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Motivation

3.1.  General Motivation for Reinforcement Learning

   Reinforcement learning allows a system to acquire and incorporate
   knowledge autonomously. The system continuously improves its
   learning process with experience and attempts to maximize the
   cumulative reward, so that multi-agent-based monitoring systems can
   maintain optimized learned knowledge [Teiralbar]. Maximizing the
   reward can also increase the learning speed of an agent's
   autonomous learning process.

3.2.  Reinforcement Learning in networks

   Reinforcement learning is an emerging technology for monitoring
   network systems to achieve fair resource allocation for nodes in a
   wired or wireless mesh setting. Monitoring network parameters and
   adjusting to the network dynamics has been shown to improve fairness
   in wireless infrastructures and resources [Nasim].

3.3.  Deep Reinforcement Learning in networks

   Deep reinforcement learning is a model-driven or data-driven
   approach to intelligent learning strategies over a large state
   space. The technique successfully trains models that learn control
   policies directly from high-dimensional sensory input, using
   reinforcement learning with a Q-value function represented by a
   convolutional neural network [Mnih].
   The model repeatedly estimates the reward from the defined reward
   function, depending on the current state, in order to choose more
   effective and optimized control actions in the following steps.
   Deep reinforcement learning can be adopted in routing optimization
   to minimize the network delay [Stampa].

3.4.  Motivation in our work

   There are many network management problems to solve intelligently,
   such as connectivity, traffic management, and low-latency Internet
   access. We expect that machine-learning-based mechanisms such as
   reinforcement learning will provide network solutions in many cases
   that exceed human operating capacity, even though this is a
   challenging area for a multitude of reasons: the large state space,
   the complexity of assigning rewards, the difficulty of control
   actions, and the difficulty of sharing and merging trained knowledge
   between agents on distributed memory nodes when it must be
   transferred over a communication network [Minsuk].

4.  Related Works

4.1.  Autonomous Driving System

   Recently, 5G networks and AI have become new trends and future
   research areas, and many business models have appeared in the
   networking field. Autonomous vehicles have been developed alongside
   5G and AI. An autonomous vehicle can drive itself without human
   supervision, relying on an optimized trust region policy learned by
   reinforcement learning, which enables learning in more complex and
   specialized network management environments. Such a vehicle
   provides a comfortable user experience safely and reliably over an
   interactive communication network [April] [Markus].

4.2.  Network Defect Prediction

   Nowadays, networking equipment handles a variety of services such
   as Internet, IPTV, and VoIP in a single device.
   As equipment performance improves, consolidating functions that
   were once built separately into a single device has advantages, but
   it may also increase the probability that a failure of the network
   equipment disrupts service. Equipment failure is therefore a major
   risk for network carriers, and there is a growing need to prevent
   disturbances by detecting network failures in advance. Machine
   learning techniques such as deep learning and reinforcement learning
   have emerged as preferred solutions for managing and monitoring
   networking equipment (LTE core, routers, and switches) so that
   failures can be prevented.

4.3.  Wireless Sensor Network (WSN)

   A wireless sensor network (WSN) consists of a large number of
   sensors and sink nodes that monitor systems through event parameters
   such as temperature, humidity, and air conditioning. Reinforcement
   learning has been applied in WSNs to a wide range of schemes such as
   cooperative communication, routing, and rate control. The sensors
   and sink nodes can observe their respective operating environments
   and carry out optimal actions in them to enhance network and
   application performance [Kok-Lim].

4.4.  Routing Enhancement

   Reinforcement learning is used to enhance multicast routing
   protocols in wireless ad hoc networks, where each node has different
   capabilities. Routers in the multicast routing protocol discover
   the optimal route using a predicted reward and then create the
   optimal path with multicast transmissions to reduce overhead
   [Kok-Lim].

4.5.  Routing Optimization

   Routing optimization, as a part of traffic engineering, is an
   important means of controlling the behavior of transmitted data in
   order to maximize network performance [Stampa]. There have been
   several attempts to apply machine learning algorithms to routing
   optimization.
   Deep reinforcement learning has recently become one solution for
   handling unseen network states, which a traditional table-based
   reinforcement learning agent cannot handle [Stampa]. A deep
   reinforcement learning agent can further improve the routing
   configuration it controls in a complex network.

4.6.  Game Theory

   Adaptive multi-agent systems, which combine the complexities that
   arise from interacting game players, have developed within the field
   of reinforcement learning. Early game theory focused only on
   competitive games, but it has since developed into a general
   framework for analyzing strategic interaction and has attracted
   fields as diverse as psychology, economics, and biology [Ann].
   AlphaGo, developed by Google DeepMind, is an example of a
   game-playing system that uses reinforcement learning. Although it
   began as a small learning program with some simple actions, it has
   since trained policy and value networks on thirty million actions,
   states, and rewards.

5.  Intelligent Machine Learning Technologies

5.1.  Reinforcement Learning (RL)

   Agent reinforcement learning is a class of machine learning
   algorithms, trained without labelled examples, that is based on an
   agent learning process. Reinforcement learning is normally used
   with a reward from the centralized node (the global environment) and
   is capable of acquiring and incorporating knowledge autonomously.
   It continuously improves itself and becomes more efficient as the
   agent gains experience, optimizing management performance through
   an autonomous learning process [Sutton] [Madera].

5.2.  Deep Learning (DL)

   Rule-based judgment and prediction of network equipment failure
   requires a correct rule to be written for each piece of equipment or
   each case, and the rules must be updated continuously whenever a new
   failure pattern occurs.
   Deep learning (DL) techniques such as the convolutional neural
   network (CNN) and the recurrent neural network (RNN) can be adapted
   to learn new patterns caused by network faults, and these models can
   then be used to judge and predict fault conditions. Deep learning
   models have advantages in maintenance and extensibility, since they
   learn the features underlying the patterns automatically, without
   detailed rules having to be written.

5.3.  Deep Reinforcement Learning (DRL)

   Some advanced reinforcement learning techniques are now combined
   with deep learning in neural networks (NNs), which has made it
   possible to extract high-level features from raw data in computer
   vision [Krizhevsky]. Applying deep learning models such as
   convolutional and recurrent neural networks to the reinforcement
   learning approach raises many challenges. The benefit of deep
   learning applications is that many networking models that are
   problematic because of complex and cluttered network structure can
   be trained with large amounts of labelled training data.

   Recent advances in training deep neural networks have produced a
   novel artificial agent, termed a deep Q-network, that can learn
   successful policies directly from high-dimensional sensory inputs
   using end-to-end reinforcement learning [Mnih]. Deep reinforcement
   learning with a deep Q-network supports broader and more powerful
   scenarios for building networking models with optimized control
   actions, very large system state, and real-time reward functions.
   Moreover, the technique has a significant advantage in handling
   highly sequential data in a large model state space.
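
   As a minimal sketch of how a deep Q-network of this kind is usually
   trained, the Python below stores transitions in a replay memory and
   computes bootstrapped Q-learning targets, in the spirit of [Mnih].
   The names ReplayMemory, td_targets, and q_target are hypothetical,
   and q_target is any callable that returns one Q-value per action,
   standing in for the neural network. It is an illustration under
   those assumptions, not the implementation used in this work.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of past transitions (hypothetical helper)."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng):
        # Uniform random sampling breaks the ordering of consecutive steps.
        return rng.sample(list(self.buffer), batch_size)


def td_targets(batch, q_target, gamma=0.99):
    """Return y = r for terminal steps, else r + gamma * max_a Q(s', a)."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(q_target(next_state))
        targets.append(reward + bootstrap)
    return targets
```

   In a full agent, the network would be fitted so that its value for
   each sampled (state, action) pair moves toward the matching target.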
   One difficulty is that the data distribution in reinforcement
   learning changes as the agent learns new behaviors, which is a
   problem for deep learning approaches that assume a fixed underlying
   distribution [Mnih].

5.4.  Advantage Actor Critic (A2C)

   Advantage Actor Critic (A2C) is a reinforcement learning model
   based on the policy gradient method. The approach optimizes deep
   neural network controllers with reinforcement learning algorithms,
   and it has been shown that parallel actor-learners have a
   stabilizing effect on training, allowing the methods to train neural
   network controllers successfully [Volodymyr]. Although earlier deep
   reinforcement learning algorithms that use an experience replay
   memory perform very well in challenging control domains, they need
   more memory and computational power because they rely on off-policy
   learning. Actor-critic algorithms appeared to address this
   shortcoming.

   The Advantage Actor Critic method, which consists of an actor and a
   critic, implements generalized policy iteration, alternating between
   a policy evaluation step and a policy improvement step. The actor
   is policy-based and improves the current policy so that it selects
   the best available next action. The critic is value-based; it
   evaluates the current policy and reduces variance by bootstrapping.
   The result is more stable and effective than pure policy gradient
   methods.

5.5.  Asynchronous Advantage Actor Critic (A3C)

   Asynchronous Advantage Actor Critic (A3C) is the asynchronous
   variant of Advantage Actor Critic. The main idea is to run multiple
   environments in parallel, with agents acting asynchronously, instead
   of using experience replay. The parallel environments reduce the
   correlation in the agents' data and lead each agent to experience a
   variety of states, so that the learning process becomes more
   stationary.
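
   The actor and critic roles described in Section 5.4 can be sketched
   with a one-step advantage update. The Python below is a minimal
   illustration under stated assumptions: tabular parameters replace
   the neural networks, the three-state chain environment and all names
   (softmax, actor_critic_step, train) are hypothetical, and only a
   single worker is shown. An asynchronous variant would run several
   such workers on separate copies of the environment and apply their
   updates to shared parameters.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, values, s, a, r, s_next, done,
                      gamma=0.9, lr_actor=0.1, lr_critic=0.1):
    """Apply one advantage actor-critic update in place; return the advantage."""
    # Critic: bootstrapped target; the TD error is the advantage estimate.
    target = r + (0.0 if done else gamma * values[s_next])
    advantage = target - values[s]
    values[s] += lr_critic * advantage
    # Actor: for a softmax policy, the gradient of log pi(a|s) with
    # respect to theta[s][b] is 1{a == b} - pi(b|s).
    probs = softmax(theta[s])
    for b in range(len(probs)):
        indicator = 1.0 if b == a else 0.0
        theta[s][b] += lr_actor * advantage * (indicator - probs[b])
    return advantage


def train(episodes=2000, seed=0, max_steps=100):
    """Toy chain 0 -> 1 -> 2 (assumed): action 1 advances, action 0 stays."""
    rng = random.Random(seed)
    goal = 2
    theta = [[0.0, 0.0] for _ in range(goal + 1)]  # actor parameters
    values = [0.0] * (goal + 1)                    # critic estimates
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = 1 if rng.random() < softmax(theta[s])[1] else 0
            s_next = s + 1 if a == 1 else s
            done = s_next == goal
            # Reward 1 at the goal, a small penalty for staying in place.
            r = 1.0 if done else (0.0 if a == 1 else -0.1)
            actor_critic_step(theta, values, s, a, r, s_next, done)
            s = s_next
            if done:
                break
    # Probability of the advancing action in each non-terminal state.
    return [softmax(theta[s])[1] for s in range(goal)]
```

   Calling train() returns the learned probability of the advancing
   action in states 0 and 1; because staying is penalized, both
   probabilities should move above one half as training proceeds.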
   A3C is also practical, since it can be trained effectively even on
   a standard multi-core CPU. In addition, it can be applied to
   continuous as well as discrete action spaces, and it can train both
   feedforward and recurrent agents.

   A number of complementary improvements to the neural network
   architecture are possible for A3C. Including separate streams for
   the state value and the advantage in the network has been shown to
   produce more accurate estimates of Q-values, and making it easier
   for the network to represent feature coordinates could improve both
   value-based and policy-based methods [Volodymyr].

5.6.  Policy using Distance and Frequency

   The Distance-and-Frequency algorithm uses the state occurrence
   frequency in addition to the distance to the goal. It avoids
   deadlocks, lets the agent escape a deadlocked state, and was derived
   to increase the agent's learning speed. Distance-and-Frequency
   relies on more levels of agent visibility, enhancing the learning
   algorithm through the additional use of the state occurrence
   frequency [Al-Dayaa].

5.7.  Distributed Computing Node

   Autonomous multi-agent learning for network management involves
   transferring optimized knowledge between agents on a given local
   node or on distributed memory nodes over a communication network.

5.8.  Agent Sharing Information

   This section describes how agents can share information for an
   optimal learning process. The quality of agent decision making
   often depends on the willingness of agents to share the learning
   information they have collected. Sharing information means that an
   agent communicates the knowledge it has learned and acquired to
   other agents using RL. Agents normally have limited resources and
   incomplete knowledge during learning exploration.
   For that reason, under RL the agents take actions and transfer
   their states to the global environment, which then shares the
   information with the other agents; all agents explore to reach
   their goals via a distributed, reward-based reinforcement learning
   method on the existing local distributed memory nodes. MPI (Message
   Passing Interface) is used for communication. Even if the agents do
   not have the capabilities and resources to monitor an entire large
   terrain environment, they can share the information needed to
   manage a collaborative learning process for optimized management in
   distributed networking nodes [Chowdappa] [Minsuk].

6.  Proposed Architecture

6.1.  Architecture for Reinforcement Learning

   The reinforcement learning architecture is a collaborative
   multi-agent-based system in distributed environments, as shown in
   Figure 1. It is a hybrid architecture that combines a master/slave
   architecture with a peer-to-peer architecture. The centralized node
   (global environment) assigns each slave computing node a portion of
   the distributed terrain and an initial number of agents.

   +-------------+
   |             |                       +-----------------+
   |             |<...... node 1 .......>|    terrain 1    |
   |             |                       +-----------------+
   | Global env. |                                +
   |  (node 0)   |                                |
   |             |                                +
   |             |                       +-----------------+
   |             |<...... node 2 .......>|    terrain 2    |
   |             |                       +-----------------+
   +-------------+

        Figure 1: Hybrid P2P and Master/Slave Architecture Overview

   Reinforcement learning (RL) actions involve interacting with a given
   environment, so the environment provides the agent learning process
   with the following elements:

   o  Agent control actions, large states, and cumulative rewards

   o  Initial data set in memory

   o  Random or learning process in a given node

   o  Subsequent optimization in the neural network under reinforcement
      learning

   Additionally, an agent acts on states toward its goal as follows:

   o  The agent continuously takes control actions to reach the next
      optimized state, based on its policy and reward

   o  After an agent reaches its goal, it can repeatedly carry the
      information collected by the random or learning process into the
      next learning process for optimal management

   o  The agent learning process is optimized in the following phase
      and exploratory learning trials

6.2.  Architecture for Deep Reinforcement Learning

   Figure 2 illustrates the fundamental architecture relating actions,
   states, and rewards; each agent explores to reach its goal(s) under
   deep reinforcement learning. The agent takes an action that leads
   to a reward for achieving an optimal path toward its goal.

                                         DRL Network
                             +----------------------------------+
                             |Q-Value1|                         |
                             |--------+    +-------+    +------+|
           ......Action......|Q-Value2|----|Network|----|States||<...
           .                 |--------+    +-------+    +------+|   .
           .                 |Q-Value3|                         |   .
           .                 +----------------------------------+   .
           .                                                        .
 +---------+----------+                                             .
 | Global Environment |                                             .
 +---------+----------+                                             .
           .                                                        .
           .                                                        .
           .           +-------------------+                +----------+
           ...........>+ Large State Space +....States.....>+ D-Memory +
                       +-------------------+                +----------+

                      Figure 2: DRL work-flow Overview

   A deep reinforcement learning network can use a convolutional neural
   network to overcome the difficulties reinforcement learning has in
   learning control policies from raw data in a complex environment.
   It is also used with an experience replay memory that randomly
   samples previous transitions and thereby smooths the training
   distribution over many past behaviors [Mnih].

7.  Use case of Reinforcement Learning

7.1.  Distributed Multi-agent Reinforcement Learning (RL): Sharing
      Information Technique

   This section deals with the case of a collaborative distributed
   multi-agent system, where each agent has the same or different
   individual goals in a distributed environment. Since sharing
   information among the agents is a difficult problem, the work
   described here needs to be expanded by solving the challenging
   cases. The main proposed algorithm for distributed multi-agent RL
   is presented below:

   +-------------------------------------------------------------------+
   | Proposed Algorithm                                                |
   +-------------------------------------------------------------------+
   | (1) Let Ni denote the number of nodes (i = 1, 2, 3 ...)           |
   |                                                                   |
   | (2) Let Aj denote the number of agents                            |
   |                                                                   |
   | (3) Let Dk denote the number of goals                             |
   |                                                                   |
   | (4) Place the initial number of agents Aj at random positions     |
   |     (Xm, Yn)                                                      |
   |                                                                   |
   | (5) Initialize the data-set memory for the neural network         |
   |                                                                   |
   | (6) Copy the neural network Q and store it as the data-set memory |
   |                                                                   |
   | (7) For every agent Aj in node Ni:                                |
   |                                                                   |
   | -----> (a) Do initial exploration (random) to corresponding Dk    |
   |                                                                   |
   | -----> (b) Do exploration (using RL) for Tx trials, where Tx      |
   |            denotes the number of trials                           |
   +-------------------------------------------------------------------+

                        Table 1: Proposed Algorithm
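
   The outer loop of Table 1 can be sketched in a single process as
   follows. This is an illustration under stated assumptions, not the
   authors' implementation: a tabular Q-function held by the global
   environment stands in for the neural network Q, one goal on a small
   square grid stands in for the terrain, plain function calls replace
   message passing between nodes, and all names (GlobalEnvironment,
   trial, run) are hypothetical.

```python
import random

ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # moves on a 2-D grid terrain


class GlobalEnvironment:
    """Stand-in for node 0: owns the terrain and the shared knowledge."""

    def __init__(self, size, goal):
        self.size, self.goal = size, goal
        self.q = {}  # shared Q-table: (state, action index) -> value

    def step(self, state, action):
        # Clip moves at the terrain border; reward 1 only at the goal.
        x = min(max(state[0] + action[0], 0), self.size - 1)
        y = min(max(state[1] + action[1], 0), self.size - 1)
        nxt = (x, y)
        done = nxt == self.goal
        return nxt, (1.0 if done else -0.01), done

    def best(self, state):
        # Greedy action index under the shared table (unseen pairs are 0).
        return max(range(len(ACTIONS)),
                   key=lambda a: self.q.get((state, a), 0.0))


def trial(env, start, rng, epsilon, alpha=0.5, gamma=0.9, max_steps=200):
    """One exploratory trial; epsilon=1.0 gives a purely random trial."""
    state = start
    for _ in range(max_steps):
        if rng.random() < epsilon:
            a = rng.randrange(len(ACTIONS))
        else:
            a = env.best(state)
        nxt, reward, done = env.step(state, ACTIONS[a])
        future = 0.0 if done else env.q.get((nxt, env.best(nxt)), 0.0)
        old = env.q.get((state, a), 0.0)
        # Q-learning update written into the table that every agent reads.
        env.q[(state, a)] = old + alpha * (reward + gamma * future - old)
        state = nxt
        if done:
            return True
    return False


def run(num_agents=3, num_trials=100, size=4, seed=0):
    rng = random.Random(seed)
    env = GlobalEnvironment(size, goal=(size - 1, size - 1))
    # Step (4): place the agents at random positions.
    starts = [(rng.randrange(size), rng.randrange(size))
              for _ in range(num_agents)]
    reached = 0
    for start in starts:  # step (7): every agent
        reached += trial(env, start, rng, epsilon=1.0)      # (a) random
        for _ in range(num_trials):                         # (b) RL trials
            reached += trial(env, start, rng, epsilon=0.1)
    return env, reached
```

   Because every agent reads and writes the same table in the global
   environment, what one agent learns about the terrain is immediately
   available to the others, which is the sharing effect this section
   describes. A distributed version would exchange these updates
   between nodes instead of sharing one in-memory dictionary.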

   +-------------------------------------------------------------------+
   | Random Trial                                                      |
   +-------------------------------------------------------------------+
   | (1) Let Si denote the current state                               |
   |                                                                   |
   | (2) Relinquish Si so that the other agent can occupy the position |
   |                                                                   |
   | (3) Assign the agent a new position                               |
   |                                                                   |
   | (4) Update the current state Si -> Si+1                           |
   +-------------------------------------------------------------------+

                           Table 2: Random Trial

   +-------------------------------------------------------------------+
   | Optimal Trial                                                     |
   +-------------------------------------------------------------------+
   | (1) Let Si denote the current state                               |
   |                                                                   |
   | (2) Let ACj denote a control action                               |
   |                                                                   |
   | (3) Let DRm denote the discounted reward                          |
   |                                                                   |
   | (4) Choose ACj <- Policy(Si, ACj) in the neural network           |
   |                                                                   |
   | (5) Update and copy the network for the learning process in the   |
   |     global environment                                            |
   |                                                                   |
   | (6) Update the current state Si -> Si+1                           |
   |                                                                   |
   | (7) Repeat an available network control action                    |
   +-------------------------------------------------------------------+

                           Table 3: Optimal Trial

   Multi-agent reinforcement learning in distributed nodes can improve
   overall system performance by transferring or sharing information
   from one node to another in the following cases: expanded
   complexity of the RL technique with various experimental factors
   and conditions, and analysis of multi-agent information sharing for
   the agent learning process.

7.2.  Intelligent Edge Computing technique for Traffic Control using
      Deep Reinforcement Learning

   Edge computing is a concept that allows data from a variety of
   devices to be analyzed directly at the site or near the data, rather
   than being sent to a centralized data center such as the cloud. As
   such, edge computing will accelerate data flows by processing data
   in real time with low latency. In addition, because large amounts
   of data can be processed efficiently near the source, Internet
   bandwidth usage will also be reduced. Deep reinforcement learning
   would be a useful technique for improving system performance in an
   intelligent edge-controlled service system that requires fast
   response time, reliability, and security. Because deep
   reinforcement learning can be model-free, many algorithms such as
   DQN, A2C, and A3C can be adopted to resolve network problems in
   time-sensitive systems.

7.3.  Edge computing system in a field of construction works using
      Reinforcement Learning

   A construction site has many dangerous elements that call for
   alerts, such as noise, gas leaks, and vibration. A real-time
   monitoring system that detects these conditions using machine
   learning techniques (DL, RL) can provide a more effective way to
   recognize dangerous elements on the site. Typically, CCTV
   (closed-circuit television) monitors these elements by broadcasting
   locally and continuously from the construction site. It is
   ineffective and wasteful, however, for the CCTV to broadcast
   unchanging scenes constantly in high definition. When an alert is
   raised because of a dangerous element, on the other hand, the stream
   should be switched to high quality so that the dangerous situation
   can be shown and detected rapidly. Technically, DL is one solution
   for detecting these kinds of dangerous situations automatically and
   predicting them in advance. It can provide the transformed data,
   including the high-rate streaming video, and help prevent further
   risks quickly. RL additionally plays an important role in managing
   and monitoring efficiently with the given data set in real time.
   [TBD]

7.4.  Fault prediction for core-network using Deep Learning

   EPC equipment in the LTE core network, such as the PGW, SGW, MME,
   HSS, and PCRF, sends and receives messages over interfaces based on
   the 3GPP standard specifications. When failures of specific
   equipment or of the LTE network service are discovered, data from
   this EPC equipment can be used to create training data and a model
   that predicts or detects the precursor symptoms that occur before a
   network failure. In addition, for IP core network equipment, deep
   learning (DL) can predict various network faults related to inbound
   and outbound traffic, CPU and memory resource information, and QoS
   performance. [TBD]

8.  IANA Considerations

   There are no IANA considerations related to this document.

9.  Security Considerations

   [TBD]

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

10.2.  Informative References

   [I-D.jiang-nmlrg-network-machine-learning]
              Jiang, S., "Network Machine Learning", ID
              draft-jiang-nmlrg-network-machine-learning-02,
              October 2016.

   [Megherbi] Megherbi, D. B., Kim, M., and Madera, M., "A Study of
              Collaborative Distributed Multi-Goal and Multi-agent
              based Systems for Large Critical Key Infrastructures and
              Resources (CKIR) Dynamic Monitoring and Surveillance",
              IEEE International Conference on Technologies for
              Homeland Security, 2013.

   [Teiralbar]
              Megherbi, D. B., Teiralbar, A., Boulenouar, J., "A
              Time-varying Environment Machine Learning Technique for
              Autonomous Agent Shortest Path Planning", Proceedings of
              SPIE International Conference on Signal and Image
              Processing, Orlando, Florida, 2001.

   [Nasim]    Arianpoo, N. and Leung, V. C. M., "How network
              monitoring and reinforcement learning can improve TCP
              fairness in wireless multi-hop networks", EURASIP
              Journal on Wireless Communications and Networking, 2016.

   [Minsuk]   Megherbi, D. B. and Kim, M., "A Hybrid P2P and
              Master-Slave Cooperative Distributed Multi-Agent
              Reinforcement Learning System with Asynchronously
              Triggered Exploratory Trials and Clutter-index-based
              Selected Sub goals", IEEE CIG Conference, 2016.

   [April]    Yu, A., Palefsky-Smith, R., and Bedi, R., "Deep
              Reinforcement Learning for Simulated Autonomous Vehicle
              Control", Stanford University, 2016.

   [Markus]   Kuderer, M., Gulati, S., and Burgard, W., "Learning
              Driving Styles for Autonomous Vehicles from
              Demonstration", Robotics and Automation (ICRA), 2015.

   [Ann]      Nowe, A., Vrancx, P., and De Hauwere, Y., "Game Theory
              and Multi-agent Reinforcement Learning", in
              Reinforcement Learning: State of the Art, Adaptation,
              Learning, and Optimization, Volume 12, 2012.

   [Kok-Lim]  Yau, K.-L. A., Goh, H. G., Chieng, D., and Kwong, K. H.,
              "Application of Reinforcement Learning to wireless
              sensor networks: models and algorithms", Computing,
              Volume 97, Issue 11, pp. 1045-1075, November 2015.

   [Sutton]   Sutton, R. S. and Barto, A. G., "Reinforcement Learning:
              an Introduction", MIT Press, 1998.

   [Madera]   Madera, M. and Megherbi, D. B., "An Interconnected
              Dynamical System Composed of Dynamics-based
              Reinforcement Learning Agents in a Distributed
              Environment: A Case Study", Proceedings IEEE
              International Conference on Computational Intelligence
              for Measurement Systems and Applications, Italy, 2012.

   [Al-Dayaa] Al-Dayaa, H. S. and Megherbi, D. B., "Towards A
              Multiple-Lookahead-Levels Reinforcement-Learning
              Technique and Its Implementation in Integrated
              Circuits", Journal of Artificial Intelligence, Journal
              of Supercomputing, Vol. 62, issue 1, pp. 588-61, 2012.

   [Chowdappa]
              Chowdappa, A., Skjellum, A., and Doss, N., "Thread-Safe
              Message Passing with P4 and MPI", Technical Report
              TR-CS-941025, Computer Science Department and NSF
              Engineering Research Center, Mississippi State
              University, 1994.

   [Mnih]     Mnih, V., et al., "Human-level Control Through Deep
              Reinforcement Learning", Nature 518.7540, 2015.

   [Stampa]   Stampa, G., Arias, M., et al., "A Deep-reinforcement
              Learning Approach for Software-defined Networking
              Routing Optimization", cs.NI, 2017.

   [Krizhevsky]
              Krizhevsky, A., Sutskever, I., and Hinton, G., "Imagenet
              classification with deep convolutional neural networks",
              Advances in Neural Information Processing Systems,
              pp. 1106-1114, 2012.

   [Volodymyr]
              Mnih, V., et al., "Asynchronous Methods for Deep
              Reinforcement Learning", ICML, arXiv:1602.01783, 2016.

Authors' Addresses

   Min-Suk Kim
   ETRI
   161 Gajeong-Dong Yuseung-Gu
   Daejeon  305-700
   Korea

   Phone: +82 42 860 5930
   Email: mskim16@etri.re.kr


   Yong-Geun Hong
   ETRI
   161 Gajeong-Dong Yuseung-Gu
   Daejeon  305-700
   Korea

   Phone: +82 42 860 6557
   Email: yghong@etri.re.kr


   Youn-Hee Han
   KoreaTech
   Byeongcheon-myeon Gajeon-ri, Dongnam-gu
   Choenan-si, Chungcheongnam-do  330-708
   Korea

   Phone: +82 41 560 1486
   Email: yhhan@koreatech.ac.kr


   Tae-Jin Ahn
   Korea Telecom
   70 Yuseong-daero 1689 Beon-gil Yuseung-Gu
   Daejeon  305-811
   Korea

   Phone: +82 42 870 8409
   Email: Taejin.ahn@kt.com


   Kwi-Hoon Kim
   ETRI
   161 Gajeong-Dong Yuseung-Gu
   Daejeon  305-700
   Korea

   Phone: +82 42 860 6746
   Email: kwihooi@etri.re.kr