NFVRG                                                      Bose Perumal
Internet-Draft                                              Wenjing Chu
Intended Status: Informational                              R. Krishnan
                                                           Hemalathaa S
                                                                   Dell
Expires: December 3, 2015                                 June 29, 2015


              NFV Compute Acceleration Evaluation and APIs
            draft-perumal-nfvrg-nfv-compute-acceleration-00

Abstract

   Network functions are being virtualized and moved to industry
   standard servers.  The steady growth of traffic volume requires
   more compute power to process these network functions.  Packet
   based network processing offers considerable scope for parallel
   processing.  Generic parallel processing can be done on common
   multicore platforms such as GPUs, coprocessors like the Intel Xeon
   Phi [6][7], and Intel [7]/AMD [10] multicore CPUs.

   In this draft, to check the feasibility of exploiting this
   parallel processing capability, multi-string matching for URL
   filtering is taken as the sample network function.  The
   Aho-Corasick algorithm is used for multi-pattern matching.  The
   implementation uses OpenCL [3] to support many common platforms
   [7][10][11].  A set of optimizations is applied, and the
   application is tested on Nvidia Tesla K10 GPUs.  A common API for
   NFV compute acceleration is proposed.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
      1.1 Terminology
   2. OpenCL based Virtual Network Function Architecture
      2.1 CPU Process
      2.2 Device Discovery
      2.3 Mixed Version Support
      2.4 Scheduler
   3. Aho-Corasick Algorithm
   4. Optimizations
      4.1 Variable Size Packet Packing
      4.2 Pinned Memory
      4.3 Pipelined Scheduler
      4.4 Reduce Global Memory Access
      4.5 Organizing GPU Cores
   5. Results
      5.1 Worst Case: 0 String Match
      5.2 Packet Match
   6. Compute Acceleration API
      6.1 Add Network Function
      6.2 Add Traffic Stream
      6.3 Add Packets to Buffer
      6.4 Process Packets
      6.5 Event Notification
      6.6 Read Results
   7. Other Accelerators
   8. Conclusion
   9. Future Work
   10. Security Considerations
   11. IANA Considerations
   12. References
      12.1 Normative References
      12.2 Informative References
   Acknowledgements
   Authors' Addresses
1. Introduction

   Network equipment vendors use specialized hardware to process data
   at low latency and high throughput.  Packet processing above
   4 Gb/s is done using expensive, purpose-built application-specific
   integrated circuits (ASICs).  However, low unit volumes force
   manufacturers to price these devices at many times the cost of
   producing them in order to recover the R&D cost.

   Network Function Virtualization (NFV) [1] is a key emerging area
   for network operators, hardware and software vendors, cloud
   service providers, and in general network practitioners and
   researchers.  NFV introduces virtualization technologies into the
   core network to create a more intelligent, more agile service
   infrastructure.  Network functions that are traditionally
   implemented in dedicated hardware appliances will need to be
   decomposed and executed in virtual machines running in data
   centers.

   The parallelism of graphics processors gives them the potential to
   function as network coprocessors.  A virtual network function is
   responsible for a specific treatment of received packets and can
   act at various layers of a protocol stack.  With more compute
   power, multiple virtual network functions can be executed in a
   single system or VM, and some of them can be processed in parallel
   with other network functions.  This draft proposes a method to
   represent an ordered set of virtual network functions as a
   combination of sequential and parallel stages.

   This draft addresses software based network functions, so any
   further reference to a network function means a virtual network
   function.

   Software written for specialized hardware like network processors,
   ASICs, and FPGAs is closely tied to the hardware and to specific
   vendor products.  It cannot be reused on other hardware platforms.
   For generic compute acceleration, different hardware platforms can
   be used: GPUs from different vendors, Intel Xeon Phi coprocessors,
   and multicore CPUs from different vendors.  All of these compute
   acceleration platforms support OpenCL as a parallel programming
   language.  Instead of every vendor writing its own OpenCL code, an
   NFV Compute Acceleration (NCA) API is proposed in this draft as a
   common compute acceleration interface.  This API will be a library
   with C API functions for declaring network functions as an ordered
   set and for moving packets around.

   Multi-pattern string matching is used in a number of applications
   including network intrusion detection and digital forensics, and
   is therefore chosen as the sample network function.  The
   Aho-Corasick [2] algorithm, with a few modifications, is used to
   find the first occurrence of any pattern from the signature
   database.  Throughput numbers are measured for this network
   function.

1.1 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
   in this document are to be interpreted as described in RFC 2119
   [RFC2119].

2. OpenCL based Virtual Network Function Architecture

   Network functions like multi-pattern matching are compute
   intensive and common to multiple NFV applications.  Generic
   compute acceleration framework functions and service specific
   functions are clearly separated in this prototype.  The
   architecture with one network function is shown in Figure 1;
   multiple network functions can also be loaded.  Most signature
   based algorithms, like Aho-Corasick [2] and Regex [8], generate a
   Deterministic Finite Automaton (DFA) [2][8].  The DFA database is
   generated on the CPU and loaded into the accelerator.  Kernels
   executed on the accelerator use the DFA.

   +---------------------------------------+    +-------------------+
   | CPU Process                           |    | GPU/Xeon Phi,etc. |
   |                                       |    |                   |
   | Scheduler                             |    |                   |
   | +------------+    +------------+      |    | +------------+    |
   | | Packet     |    |Copy Packet |      |    | |Input Buffer|    |
   | | Generator  +--->|CPU to GPU  +------------>|P1,P2,...,Pn|    |
   | +------------+    +------------+      |    | +-----+------+    |
   |                                       |    |       |           |
   |                   +------------+      |    | +-----v------+    |
   |                   |Launch GPU  |      |    | |GPU Kernels |    |
   |                   |Kernels     +------------>|K1,K2,...,Kn|<-+ |
   |                   +------------+      |    | +-----+------+  | |
   |                                       |    |       |         | |
   | +-------------+   +------------+      |    | +-----v------+  | |
   | |Results for  |   |Copy Results|      |    | |Result Buf  |  | |
   | |each packet  |<--+GPU to CPU  |<------------+R1,R2,...,Rn|  | |
   | +-------------+   +------------+      |    | +------------+  | |
   |                                       |    |                 | |
   | +----------------+  +-----------+     |    | +------------+  | |
   | |Network Function|  | NF        |     |    | | NF         |  | |
   | |(AC,Regex,etc)  +->| Database  +----------->| Database   +--+ |
   | +-------+--------+  +-----------+     |    | +------------+    |
   |         ^                             |    |                   |
   +---------|-----------------------------+    +-------------------+
             |
       +-----+-----+
       | Signature |
       | Database  |
       +-----------+

       Figure 1. OpenCL based Virtual Network Function Software
                 Architecture Diagram

2.1 CPU Process

   Accelerators like GPUs and coprocessors augment the CPU; they
   cannot currently function alone.  A virtual network function is
   therefore split between the CPU and the GPU.  The CPU process owns
   packet preprocessing, packet movement and scheduling, while the
   GPU performs the core functionality of the network functions.  The
   CPU process interfaces between packet I/O and the GPU.  During
   initialization it performs the following functions:

   1. Device discovery
   2. Initialize the OpenCL object model
   3. Initialize the memory module
   4. Initialize the network functions
   5. Trigger the scheduler

2.2 Device Discovery

   Using OpenCL functions, the device discovery module discovers the
   available platforms and devices.  Based on the number of devices
   discovered, device contexts and command queues are created, as
   sketched below.
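   The following C sketch performs this step using standard OpenCL
   1.2 host calls.  It is a minimal illustration rather than the
   prototype source; it limits itself to one platform, at most eight
   devices, and coarse error handling.

   #include <stdio.h>
   #include <CL/cl.h>

   int main(void)
   {
       cl_platform_id platform;
       cl_device_id devices[8];
       cl_uint num_devices;
       cl_int err;

       /* Discover the first platform and up to 8 devices on it. */
       err = clGetPlatformIDs(1, &platform, NULL);
       if (err != CL_SUCCESS)
           return 1;
       err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8,
                            devices, &num_devices);
       if (err != CL_SUCCESS)
           return 1;

       /* One context over the discovered devices.  Profiling is
        * enabled on each queue so clGetEventProfilingInfo
        * (Section 5) can be used later; the prototype creates
        * several queues per device for pipelining (Section 4.3). */
       cl_context ctx = clCreateContext(NULL, num_devices, devices,
                                        NULL, NULL, &err);
       for (cl_uint i = 0; i < num_devices; i++)
           clCreateCommandQueue(ctx, devices[i],
                                CL_QUEUE_PROFILING_ENABLE, &err);

       printf("%u device(s) discovered\n", num_devices);
       return 0;
   }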
2.3 Mixed Version Support

   OpenCL is designed to support devices with different capabilities
   under a single platform [3].  There are three version identifiers
   in OpenCL: the platform version, the version of a device, and the
   version(s) of the OpenCL C language supported on a device.

   The platform version indicates the version of the OpenCL runtime
   supported.  The device version is an indication of the device's
   capabilities.  The language version for a device represents the
   OpenCL programming language features a developer can assume are
   supported on that device.  OpenCL C is designed to be backwards
   compatible, so a device is not required to support more than a
   single language version to be considered conformant.  If multiple
   language versions are supported, the compiler defaults to the
   highest language version supported for the device.  Code written
   for an old device version may not utilize the full capabilities of
   a new device if there are hardware architectural changes.

2.4 Scheduler

   The scheduler maps packet buffers coming from network I/O onto the
   device command queues.  It operates on the following parameters:

   N - Number of packet buffers (default 6)
   M - Number of packets in each buffer (default 16384)
   K - Number of devices (2 discovered)
   J - Number of command queues for each device (default 3)
   I - Number of commands to the device to complete a single network
       function (default 3)
   S - Number of network functions executed in parallel (default 1)

   The default values above gave the best results in our current
   hardware environment for the multi-string match function.

   The operations for completing a network function for one packet
   buffer are:

   1. Identify a free command queue
   2. Copy packets from I/O memory to pinned memory for the GPU
   3. Fill in the kernel function parameters
   4. Copy pinned memory to GPU global memory
   5. Launch kernels for the number of packets in the packet buffer
   6. Check kernel execution completion and collect results
   7. Report results to the application

   The scheduler calls the OpenCL API with the number of kernel
   instances to be executed in parallel; distributing the kernel
   instances to cores is taken care of by the OpenCL library.  If an
   error occurs while launching the kernels, the OpenCL API returns
   an error code and appropriate error handling can be done.  A
   sketch of steps 2-6 follows.
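   The following C sketch shows steps 2-6 for one packet buffer,
   chaining non-blocking OpenCL 1.2 calls with events.  The memory
   objects (dev_pkts, dev_dfa, dev_results), the pinned host
   pointers, and the built kernel are assumed to have been created
   during initialization; all names are illustrative, not the
   prototype source.

   static void process_one_buffer(cl_command_queue queue,
                                  cl_kernel kernel,
                                  cl_mem dev_pkts, cl_mem dev_dfa,
                                  cl_mem dev_results,
                                  void *pinned_pkts, size_t buf_bytes,
                                  void *host_results,
                                  size_t result_bytes, size_t npkts)
   {
       cl_event copied, done;

       /* Step 4: non-blocking copy of the packed packet buffer
        * from pinned host memory to GPU global memory. */
       clEnqueueWriteBuffer(queue, dev_pkts, CL_FALSE, 0, buf_bytes,
                            pinned_pkts, 0, NULL, &copied);

       /* Step 3: kernel arguments - packets, DFA tables (bundled
        * in one buffer here for brevity), result buffer. */
       clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_pkts);
       clSetKernelArg(kernel, 1, sizeof(cl_mem), &dev_dfa);
       clSetKernelArg(kernel, 2, sizeof(cl_mem), &dev_results);

       /* Step 5: one kernel instance per packet, after the copy. */
       clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &npkts, NULL,
                              1, &copied, &done);

       /* Step 6: read the results back once the kernel finishes. */
       clEnqueueReadBuffer(queue, dev_results, CL_TRUE, 0,
                           result_bytes, host_results, 1, &done,
                           NULL);
   }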
3. Aho-Corasick Algorithm

   The Aho-Corasick algorithm [2] is a widely used and effective
   multi-pattern matching algorithm.  It is a dictionary-matching
   algorithm that locates elements of a finite set of strings within
   an input text.  The complexity of the algorithm is linear in the
   length of the patterns plus the length of the searched text plus
   the number of output matches.

   The algorithm works in two parts.  The first part builds a state
   machine from the keywords to be searched for; the second part
   searches the text for the keywords using the previously built
   state machine.

   Searching for a keyword is efficient because it only moves through
   the states of the state machine.  If a character matches, the
   goto() function is followed; otherwise the fail() function is
   followed.  A match is reported by the out() function.  All three
   functions simply index into precomputed data structures and return
   a value.  The goto() data structure is a two-dimensional matrix
   indexed by the current state and the character being compared.
   The fail() data structure is an array that links each state to its
   alternate path.  The out() data structure is an array, indexed by
   state, that records whether a string search completes at that
   state.

   Based on the signature database, all three data structures are
   constructed on the CPU.  They are copied to GPU global memory
   during the initialization stage, and pointers to them are passed
   as kernel parameters when the kernels are launched.  A sketch of
   the search loop over these tables follows.
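   The following C sketch shows the search half of the algorithm
   over the three tables, with the modification used in this
   prototype of stopping at the first match.  The table encoding (a
   NUM_STATES x 256 goto matrix and per-state fail and out arrays,
   with -1 meaning "no entry") is an assumption for illustration;
   the GPU kernel implements the same loop in OpenCL C.

   #define ALPHABET 256

   /* Returns the matched pattern id, or -1 if no pattern occurs.
    * gotofn : NUM_STATES x ALPHABET next-state matrix (-1 = none)
    * fail   : failure link per state
    * out    : pattern id completing at this state, or -1         */
   int ac_search_first(const int *gotofn, const int *fail,
                       const int *out, const unsigned char *text,
                       int len)
   {
       int state = 0;
       for (int i = 0; i < len; i++) {
           unsigned char c = text[i];
           /* Follow failure links until an edge for c exists. */
           while (state != 0 && gotofn[state * ALPHABET + c] == -1)
               state = fail[state];
           if (gotofn[state * ALPHABET + c] != -1)
               state = gotofn[state * ALPHABET + c];
           if (out[state] != -1)
               return out[state];   /* first match: stop early */
       }
       return -1;
   }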
4. Optimizations

   For this prototype an Nvidia Tesla K10 GPU [5] is used; it has two
   processors with 1536 cores each, running at 745 MHz.  Each
   processor has 4 GB of memory attached to it.  The card is
   connected to the CPU via a PCIe 3.0 x16 interface.  The server is
   a Dell R720 with two Intel Xeon E5-2665 processors of 16 cores
   each.  Only one CPU core is used in our experiments.

4.1 Variable Size Packet Packing

   Multiple copies from CPU to GPU are costly, so packets are batched
   for processing on the GPU.  Packet sizes vary from 64 bytes to
   1500 bytes.  Using a fixed size buffer for each packet leads to
   copying a lot of unwanted memory from CPU to GPU when packets are
   small.

   For variable size packing, a single large buffer is allocated for
   all packets in the batch.  The initial portion of the buffer holds
   the packet start offsets for all packets.  At each packet offset,
   the packet size and the packet contents are placed.  Only the
   portion of the buffer actually filled with packets is copied from
   CPU to GPU.

4.2 Pinned Memory

   Memory allocated using malloc is paged memory.  When copying from
   CPU to GPU, data is first copied from paged memory to non-paged
   memory, and then from non-paged memory to GPU global memory.
   OpenCL provides commands and procedures to allocate and copy
   non-paged (pinned) memory [3][4].  Using pinned memory avoids one
   internal copy and showed a 3x improvement in memory copy time.  In
   our experiments pinned memory was used for the CPU-to-GPU packet
   buffer copy and the GPU-to-CPU result buffer copy.

4.3 Pipelined Scheduler

   OpenCL supports multiple command queues, and Nvidia supports 32
   command queues.  Using non-blocking calls, commands can be placed
   on each queue, so memory copies between CPU and GPU can proceed
   while GPU kernel functions are being executed.  In our experiment
   6 command queues were created, 3 for each GPU processor.  Copying
   the packet buffer to the GPU, launching the GPU kernel functions,
   and reading results back from the GPU are executed in parallel for
   6 batches of data.  Scheduling is performed round robin to
   maintain packet order.  Pipelining hides 99% of the copy time and
   utilizes the full processing power of the GPU.

4.4 Reduce Global Memory Access

   The OpenCL architecture and the Nvidia GPU architecture have three
   levels of memory: global, local and private.  Packets from the CPU
   are copied into GPU global memory.  Global memory access is
   costly, and character-by-character access is not efficient.
   Accessing private memory is faster, but private memory is small
   and cannot hold a complete packet.  So packet data is fetched 32
   bytes at a time using vload8 with the float type.

4.5 Organizing GPU Cores

   The number of kernel instances launched (global size) should be
   larger than the number of GPU cores in order to hide latency.  The
   GPU provides sub-grouping of cores that share memory.  The optimal
   group size (local size) is calculated specifically for each GPU
   card.  A kernel sketch combining the buffer layout of Section 4.1
   with the vectorized loads of Section 4.4 follows.
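   A minimal OpenCL C sketch of such a kernel is shown below.  It
   assumes the Section 4.1 buffer layout (an offset table followed by
   size-prefixed, 4-byte-aligned packet data) and the table encoding
   from the Section 3 sketch; the names and exact layout are
   illustrative, not the prototype source.

   __kernel void match_packets(__global const uchar *pktbuf,
                               __global const int   *offsets,
                               __global const int   *gotofn,
                               __global const int   *fail,
                               __global const int   *out,
                               __global int         *results)
   {
       int gid = get_global_id(0);       /* one work-item, one pkt */
       __global const uchar *pkt = pktbuf + offsets[gid];
       int len = *((__global const int *)pkt);  /* size, then data */
       __global const float *src =
           (__global const float *)(pkt + sizeof(int));

       int state = 0, match = -1;
       for (int i = 0; i * 32 < len && match < 0; i++) {
           /* Fetch 32 bytes (8 floats) with one vload8 from global
            * memory into private memory (Section 4.4). */
           float tmp[8];
           vstore8(vload8(i, src), 0, tmp);
           const uchar *b = (const uchar *)tmp;

           for (int j = 0; j < 32 && i * 32 + j < len; j++) {
               uchar c = b[j];
               while (state != 0 && gotofn[state * 256 + c] == -1)
                   state = fail[state];
               if (gotofn[state * 256 + c] != -1)
                   state = gotofn[state * 256 + c];
               if (out[state] != -1) {
                   match = out[state];  /* first match: stop early */
                   break;
               }
           }
       }
       results[gid] = match;             /* -1 when nothing matched */
   }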
5. Results

   The performance of the GPU system was measured with different
   parameters using the Aho-Corasick algorithm.  The signature
   database consists of the top website names.  The Ethernet and IP
   headers, 34 bytes in each packet, are skipped in the search.
   Alternatively, only protocol header or application header analysis
   can be performed.

   The Aho-Corasick algorithm is modified to match any one string
   from the signature database.  After the first string is matched,
   the result is written into the result buffer and the function
   exits.  If no string matches, the whole packet is searched; this
   is the worst case.  If a string matches early, the remainder of
   the packet is not searched.

   To understand the performance and to track the timing of command
   execution, OpenCL provides the function clGetEventProfilingInfo,
   which queries a cl_event for timing counter values.  The device
   time counter is returned in nanoseconds.

   For these experiments an Nvidia Tesla K10 GPU and a Dell R720
   server were used.  Results were taken by executing on a bare metal
   server running Linux.  The same code can also be executed inside a
   virtual machine.

5.1 Worst Case: 0 String Match

   Performance was measured with signature database sizes of 1000,
   2500 and 5000 strings.  Fixed size packets were generated with
   packet sizes of 64, 583 and 1450 bytes.  Variable size packets
   were generated with packet sizes from 64 to 1500 bytes and an
   average packet size of 583 bytes.  The results are shown in
   Table 1 and Table 2.

   +--------------+----------+----------+-----------+------------+
   |No of Strings | 64 Fixed | 583 Fixed| 1450 Fixed|583 Variable|
   +--------------+----------+----------+-----------+------------+
   |     1000     |  37.03   |  30.74   |   31.68   |   15.08    |
   |     2500     |  37.03   |  30.17   |   31.15   |   14.94    |
   |     5000     |  36.75   |  30.03   |   31.15   |   14.87    |
   +--------------+----------+----------+-----------+------------+

    Table 1: Bandwidth in Gbps for different packet sizes of traffic

   +--------------+----------+----------+-----------+------------+
   |No of Strings | 64 Fixed | 583 Fixed| 1450 Fixed|583 Variable|
   +--------------+----------+----------+-----------+------------+
   |     1000     |  77.67   |   7.07   |    2.93   |    3.47    |
   |     2500     |  77.41   |   6.95   |    2.88   |    3.44    |
   |     5000     |  77.07   |   6.91   |    2.88   |    3.42    |
   +--------------+----------+----------+-----------+------------+

    Table 2: Packet rate in million packets per second (mpps) for
             different packet sizes of traffic

   Varying the signature database size over 1000, 2500 and 5000
   strings does not have any major impact: the state machine grows
   with the signature database, but the processing time for the
   packets remains the same.

   For fixed size packets the total bandwidth processed was always
   above 30 Gbps.  For variable size packets the total bandwidth
   processed is 14.9 Gbps.  Variable packet sizes range from 64 to
   1500 bytes, and each packet is assigned to one core.  A core that
   finishes early is idle until the other cores complete their work,
   so the full GPU power is not effectively utilized with variable
   length packets.

5.2 Packet Match

   With the match percentage as the key variable, different
   parameters were measured.  Table 3 shows the match percentage
   against the bandwidth in Gbps.  For this experiment variable size
   packets with an average of 583 bytes are used.  16384 packets are
   batched for processing on the GPU and 16384 threads are
   instantiated.  Each packet is checked against 5000 strings.

   +--------------+-----------+
   |% of packets  | Bandwidth |
   |   matched    |  in Gbps  |
   +--------------+-----------+
   |       0      |   14.87   |
   |      15      |   18.50   |
   |      25      |   20.85   |
   |      35      |   33.02   |
   +--------------+-----------+

    Table 3: Bandwidth in Gbps for different packet match
             percentages

   +--------------+---------------+
   |% of packets  | No of Packets |
   |   matched    |    in mpps    |
   +--------------+---------------+
   |       0      |      3.42     |
   |      15      |      4.25     |
   |      25      |      4.80     |
   |      35      |      7.60     |
   +--------------+---------------+

    Table 4: Packet rate in mpps for different packet match
             percentages

   The packet match percentage against the number of packets
   processed in mpps is shown in Table 4.

   The worst case experiment is 0 packets matched, where the whole
   packet must be searched.  The time to copy a single buffer (16384
   packets) from CPU to GPU is 0.903 milliseconds.  The kernel
   execution time for a single buffer is 9.784 milliseconds.  The
   result buffer copy from GPU to CPU takes 0.161 milliseconds.  A
   total of 209 buffers are processed in one second, which is 3.42
   million packets and 14.9 Gbps.

   The best case experiment was executed with 35% of packets
   matching.  The time to copy a single buffer (16384 packets) from
   CPU to GPU is 0.923 milliseconds.  The kernel execution time for a
   single buffer is 4.38 milliseconds.  The result buffer copy from
   GPU to CPU takes 0.168 milliseconds.  A total of 464 buffers are
   processed in one second, which is 7.6 million packets and 33.02
   Gbps.  A sketch of how such timings are read with
   clGetEventProfilingInfo follows.
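   As an illustration, the fragment below reads the device-side copy
   time of the clEnqueueWriteBuffer call from Section 2.4, assuming
   its command queue was created with CL_QUEUE_PROFILING_ENABLE and
   `copied' is the event returned by that call.

       cl_ulong t_start, t_end;

       /* Wait for the command to finish before reading counters. */
       clWaitForEvents(1, &copied);

       clGetEventProfilingInfo(copied, CL_PROFILING_COMMAND_START,
                               sizeof(cl_ulong), &t_start, NULL);
       clGetEventProfilingInfo(copied, CL_PROFILING_COMMAND_END,
                               sizeof(cl_ulong), &t_end, NULL);

       /* Device time counters are in nanoseconds. */
       double ms = (double)(t_end - t_start) / 1e6;
       printf("copy time: %.3f ms\n", ms);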
6. Compute Acceleration API

   Multiple compute accelerators like GPUs, coprocessors, ASICs/FPGAs
   and multicore CPUs can be used for NFV.  A common API for NFV
   Compute Acceleration (NCA) can abstract the hardware details and
   enable NFV applications to use compute acceleration.  The API will
   be a C library that users compile along with their code.

   The delivery of end-to-end services often requires various network
   functions.  Compute acceleration APIs should support the
   definition of an ordered set of network functions, together with
   subsets of these network functions that can be processed in
   parallel.

6.1 Add Network Function

   Multiple network functions can be defined in the system.  Network
   functions are identified by a network function id.  Based on
   service chain requirements, network functions are dynamically
   loaded onto the cores and executed.  The API function
   nca_add_network_function adds a new network function to the NCA.

   In OpenCL terminology, a kernel is a function (or set of
   functions) executed on a compute core.  OpenCL code files are
   small files containing these kernel functions.

   int nca_add_network_function(
       int network_func_id,
       int (*network_func_init)(int network_func_id,
                                void *init_params),
       char *cl_file_name,
       char *kernel_name,
       int (*set_kernel_arg)(int network_func_id, void *sf_args,
                             char *pkt_buf),
       int result_buffer_size
   )

   network_func_id    : Network function identifier, unique for every
                        network function in the framework
   network_func_init  : Initializes the network function; device
                        memory allocations and service specific data
                        structures are created here
   cl_file_name       : File with the network function kernel code
   kernel_name        : Kernel entry function name
   set_kernel_arg     : Function that sets up the kernel arguments
                        before the kernel is called
   result_buffer_size : Result buffer size for this network function

6.2 Add Traffic Stream

   Traffic streams are identified by a stream id.  A traffic stream
   is initialized with the number of buffers and the size of each
   buffer allocated for the stream.  Each buffer is identified by a
   buffer id and can hold N packets.  The buffers are treated as ring
   buffers; they are allocated as contiguous memory by NCA and the
   pointer is returned.  Any notification during buffer processing is
   given through the callback function with stream_id, buffer_id and
   event.

   A traffic stream is associated with a service function chain,
   which is defined by three parameters.  The number of network
   functions is given in num_network_funcs.  The actual network
   function ids are in the service_func_chain array.  Network
   functions are divided into subsets, each with a subset number.
   Network functions within a subset can be executed in parallel;
   subsets are executed in sequence.  The special subset number 0 can
   be executed independently of any other network functions in the
   chain.

   For example, 6 network functions are represented below.

   num_network_funcs = 6;
   service_func_chain = {101, 105, 107, 108, 109, 110}
   network_func_parallel_sets = {1, 1, 1, 2, 2, 0}

   In this example subset 1, which is 101, 105 and 107, is executed
   first; within this subset all three can be executed in parallel.
   After subset 1, subset 2, which is 108 and 109, is executed.
   Function 110 is in subset 0, which has no dependencies; the
   scheduler can execute it at any time.

   typedef struct dev_params_s {
       int dev_type;
       int num_devices;
   } nca_dev;

   int nca_traffic_stream_init(
       int num_buffers,
       int buffer_size,
       int (*notify_callback)(int stream_id, int buffer_id,
                              int event),
       int num_network_funcs,
       int service_func_chain[NCA_MAX_SF],
       int network_func_parallel_sets[NCA_MAX_SF],
       nca_dev dev_params
   )

   num_buffers        : Number of buffers
   buffer_size        : Size of each buffer
   notify_callback    : Event notification callback
   num_network_funcs  : Number of network functions in the service
                        chain
   service_func_chain : Network function ids in this service chain
   network_func_parallel_sets : Subsets for sequential and parallel
                        ordering of the network functions
   dev_params         : Lets the user choose the device for
                        processing this traffic stream
   Return value       : stream_id, a unique id identifying the
                        traffic stream

   A usage sketch of these two calls follows.
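   The fragment below sketches how an application might register the
   multi-string match function and create a traffic stream for the
   example chain above.  The callbacks ac_init and ac_set_args, the
   constant NCA_DEV_GPU, the kernel file name and the sizes are
   illustrative assumptions; the draft does not define them.

   #define NCA_MAX_SF 16            /* illustrative value */

   extern int ac_init(int network_func_id, void *init_params);
   extern int ac_set_args(int network_func_id, void *sf_args,
                          char *pkt_buf);
   extern int my_notify_cb(int stream_id, int buffer_id, int event);

   /* Register the Aho-Corasick multi-string match function. */
   nca_add_network_function(101, ac_init, "ac_match.cl",
                            "match_packets", ac_set_args,
                            16384 * (int)sizeof(int));

   /* Chain of Section 6.2: subset 1 = {101,105,107},
    * subset 2 = {108,109}, subset 0 = {110}. */
   int chain[NCA_MAX_SF] = {101, 105, 107, 108, 109, 110};
   int sets[NCA_MAX_SF]  = {1, 1, 1, 2, 2, 0};
   nca_dev dev = { NCA_DEV_GPU, 2 };

   int stream_id = nca_traffic_stream_init(
       6,                 /* num_buffers, as in Section 2.4       */
       16384 * 1500,      /* buffer_size: room for 16384 packets  */
       my_notify_cb,      /* called when a buffer is processed    */
       6, chain, sets, dev);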
6.3 Add Packets to Buffer

   Packets are added to the buffer directly by the client application
   or by calling nca_add_packets.  One or more packets can be added
   to the buffer at a time.

   int nca_add_packets(
       int stream_id,
       int buffer_id,
       char *packets,
       int packet_len[],
       int num_packets
   )

   stream_id    : Stream id of the traffic stream
   buffer_id    : Identifies the buffer to add the packets to
   packets      : Packet contents
   packet_len[] : Length of each packet
   num_packets  : Number of packets

6.4 Process Packets

   Once packets have been filled into the buffer, nca_buffer_ready is
   called to process the buffer.  This function can also be called
   without filling the complete buffer.  The NCA scheduler marks the
   buffer for processing.

   int nca_buffer_ready(
       int stream_id,
       int buffer_id
   )

   stream_id : Stream id of the traffic stream
   buffer_id : Identifies the buffer to be processed

6.5 Event Notification

   NCA notifies events about a buffer using the registered callback
   function.  After a buffer has been processed by the registered
   network functions, the notify callback is called and the client
   can read the result buffer.

   int (*notify_callback)(
       int stream_id,
       int buffer_id,
       int event
   )

   stream_id : Stream id of the traffic stream
   buffer_id : Identifies the buffer the event refers to
   event     : Maps to one of the buffer events.  If the event is
               not specific to a buffer, buffer_id is 0

6.6 Read Results

   The client can read the results after service chain processing.
   Completion of service chain processing is notified by an event
   through the callback function.

   int nca_read_results(
       int stream_id,
       int buffer_id,
       char *result_buffer
   )

   stream_id     : Stream id of the traffic stream
   buffer_id     : Identifies the buffer whose results are read
   result_buffer : Pointer to which the results are copied

   An end-to-end sketch of this packet path follows.
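   The fragment below sketches the packet path of Sections 6.3
   through 6.6 for the stream created in Section 6.2.  The event
   constant NCA_EVENT_BUFFER_DONE, the buffer id and the result size
   are illustrative assumptions, not values defined by this draft.

   static char results[16384 * sizeof(int)];

   /* Called by NCA once a buffer has passed the service chain. */
   int my_notify_cb(int stream_id, int buffer_id, int event)
   {
       if (event == NCA_EVENT_BUFFER_DONE)
           nca_read_results(stream_id, buffer_id, results);
       return 0;
   }

   void send_batch(int stream_id, char *pkts, int *lens, int n)
   {
       int buffer_id = 0;           /* first ring buffer slot */

       /* Fill the buffer, then hand it to the NCA scheduler. */
       nca_add_packets(stream_id, buffer_id, pkts, lens, n);
       nca_buffer_ready(stream_id, buffer_id);
       /* Results arrive asynchronously via my_notify_cb(). */
   }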
7. Other Accelerators

   The prototype multi-string search written in OpenCL compiled and
   executed successfully on both an Intel Xeon Phi coprocessor and a
   CPU-only system, with minimal changes to the makefile.  For
   CPU-only systems the memory copies can be avoided.  Since the
   optimizations for these platforms have not been carried out, their
   performance numbers are not published.

8. Conclusion

   To get the best performance out of GPUs with a large number of
   cores, the number of threads executed in parallel should be large.
   For a single network function the latency will be in milliseconds,
   which suits network monitoring functions.  If GPUs are tasked with
   multiple network functions in parallel, they can be used for other
   NFV functions as well.

   Assigning a single core to each packet gives the best performance
   when all packet sizes are equal.  For variable length packets
   performance goes down, because a core processing a smaller packet
   is idle until the other cores complete processing the larger
   packets.

   Code written in OpenCL is easily portable to other platforms like
   the Intel Xeon Phi and multicore CPUs with just makefile changes.
   Though the same code executes correctly on all platforms, platform
   specific optimizations are needed to achieve good performance.
   This draft proposes a network compute acceleration framework that
   encapsulates the hardware specific optimizations and exposes high
   level APIs to applications: a set of APIs for defining traffic
   streams, adding network functions, and declaring service chains
   with both sequential and parallel ordering.

9. Future Work

   Dynamic device discovery and optimized code for different
   algorithms and devices will make NCA a common platform on which to
   develop applications.  Integration of compute acceleration with
   I/O acceleration technologies like Intel DPDK [9] can provide a
   complete networking platform for applications.  Remaining work
   includes verification and performance measurement of the compute
   acceleration platform running inside virtual machines, and running
   the platform inside Linux containers or Docker.

10. Security Considerations

   Not applicable.

11. IANA Considerations

   Not applicable.

12. References

12.1 Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

12.2 Informative References

   [1]  ETSI NFV White Paper, "Network Functions Virtualisation: An
        Introduction, Benefits, Enablers, Challenges, & Call for
        Action", http://portal.etsi.org/NFV/NFV_White_Paper.pdf

   [2]  Aho, A.V. and Corasick, M.J., "Efficient string matching: An
        aid to bibliographic search", Communications of the ACM,
        vol. 18, no. 6, 1975.

   [3]  OpenCL Specification,
        https://www.khronos.org/registry/cl/specs/opencl-1.2.pdf

   [4]  OpenCL Best Practices Guide,
        http://www.nvidia.com/content/cudazone/CUDABrowser/
        downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf

   [5]  Nvidia Tesla K10,
        http://www.nvidia.in/content/PDF/kepler/
        Tesla-K10-Board-Spec-BD-06280-001-v07.pdf

   [6]  Intel Xeon Phi,
        http://www.intel.in/content/www/in/en/processors/xeon/
        xeon-phi-detail.html

   [7]  Intel OpenCL, https://software.intel.com/en-us/intel-opencl

   [8]  Implementing Regular Expressions,
        https://swtch.com/~rsc/regexp/

   [9]  Intel DPDK, http://dpdk.org/

   [10] AMD OpenCL Zone,
        http://developer.amd.com/tools-and-sdks/opencl-zone/

   [11] Nvidia OpenCL, https://developer.nvidia.com/opencl

Acknowledgements

   The authors would like to thank the following individuals for
   their support in verifying the prototype on different platforms:
   Shiva Katta and K. Narendra.

Authors' Addresses

   Bose Perumal
   Dell
   Bose_Perumal@Dell.com

   Wenjing Chu
   Dell
   Wenjing_Chu@Dell.com

   Ram (Ramki) Krishnan
   Dell
   Ramki_Krishnan@Dell.com

   Hemalathaa S
   Dell
   Hemalathaa_S@Dell.com