                                                   S. Bailey (Sandburst)
Internet-Draft                                        D. Garcia (Compaq)
Expires: May 2002                                    J. Hilland (Compaq)
                                                      A. Romanow (Cisco)
                                                        13 November 2001

                     Direct Access Problem Statement
                 draft-garcia-direct-access-problem-00

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright Notice

Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

This problem statement describes barriers to the use of Internet Protocols for highly scalable, high bandwidth, low latency transfers necessary in some of today's important applications, particularly applications found within data centers. In addition to describing technical reasons for the problems, it gives an overview of common non-IP solutions to these problems which have been deployed over the years. The perspective of this draft is that it would be very beneficial to have an IP-based solution for these problems so IP can be used for high speed data transfers within data centers, in addition to IP's many other uses.

Table Of Contents

   1.      Introduction
   1.1.    High Bandwidth Transfer Overhead
   1.2.    Proliferation Of Fabrics in Data Centers
   1.3.    Potential Solutions
   2.      High Bandwidth Data Transfer In The Data Center
   2.1.    Scalable Data Center Applications
   2.2.    Client/Server Communication
   2.3.    Block Storage
   2.4.    File Storage
   2.5.    Backup
   2.6.    The Common Thread
   3.      Non-IP Solutions
   3.1.    Proprietary Solutions
   3.2.    Standards-based Solutions
   3.2.1.  The Virtual Interface Architecture (VIA)
   3.2.2.  InfiniBand
   4.      Conclusion
   5.      Security Considerations
   6.      References
           Authors' Addresses
   A.      RDMA Technology Overview
   A.1     Use of Memory Access Transfers
   A.2     Use Of Push Transfers
   A.3     RDMA-based I/O Example
           Full Copyright Statement

1. Introduction

Protocols in the IP family offer a huge, ever increasing range of functions, including mail, messaging, telephony, media and hypertext content delivery, block and file storage, and network control. IP has been so successful that applications only use other forms of communication when there is a very compelling reason.

Currently, it is often not acceptable to use IP protocols for high-speed communication within a data center. In these cases, copying data to application buffers consumes too much of the CPU capacity that is otherwise needed to perform application functions.

This limitation of IP protocols has not been particularly important until now because the domain of high performance transfers was limited to a relatively specialized niche of low volume applications, such as scientific supercomputing. Applications that needed more efficient transfer than IP could offer simply used other purpose-built solutions.

As the use of the Internet has become pervasive and critical, the growth in number and importance of data centers has matched the growth of the Internet. The role of the data center is similarly critical. The high-end environment of the data center makes up the core and nexus of today's Internet. Everything goes in and out of data centers.

Applications running within data centers frequently require high bandwidth data transfer. Due to the high host processing overhead of high bandwidth communication in IP, the industry has developed non-IP technology to serve data center traffic. That said, the obstacles to lowering host processing overhead in IP are well understood and straightforward to address. Simple techniques could allow the penetration of existing IP protocols into data centers where non-IP technology is currently used.

Technology advances have made it feasible to build specially designed network interfaces that place IP protocol data directly in application buffers. While it is certainly possible to use control information directly from existing IP protocol messages to place data in application buffers, the sheer number and diversity of current and future IP protocols calls for a generic solution instead. Therefore, the goal is to investigate a generic data placement solution for IP protocols that would allow a single network interface to perform direct data placement for a wide variety of mature, evolving and completely new protocols.

There is a great desire to develop lower overhead, more scalable data transfer technology based on IP. This desire comes from the advantages of using one protocol technology rather than several, and from the many efficiencies of technology based upon a single, widely adopted, open standard.

This document describes the problems that IP faces in delivering highly scalable, high bandwidth data transfer. The first section describes the issues in general. The second section describes several specific scenarios, discussing particular application domains and specific problems that arise. The third section describes approaches that have historically been used to address low overhead, high bandwidth data transfer needs. The appendix gives an overview of how a particular class of non-IP technologies addresses this problem with Remote Direct Memory Access (RDMA).

1.1. High Bandwidth Transfer Overhead

Transport protocols such as TCP [TCP] and SCTP [SCTP] have successfully shielded upper layers from the complexities of moving data between two computers. This success has helped make TCP/IP ubiquitous. However, with current IP implementations, Upper Layer Protocols (ULPs), such as NFS [NFSv3] and HTTP [HTTP], require incoming data packets to be buffered and copied before the data is used. It is this data copying that is a primary source of overhead in IP data transfers.

Copying received data for high bandwidth transfers consumes significant processing time and memory bandwidth. If data is buffered and then copied, the data moves across the memory bus at least three times during the data transfer. By comparison, if the incoming data is placed directly where the application requires it, the data moves across the memory bus only once. This copying overhead currently means that additional processing resources, such as additional processors in a multiprocessor machine, are needed to reach faster and faster wire speeds.

A wide range of ad hoc solutions have been explored to eliminate data copying overhead within the framework of current IP protocols, but despite extensive study, no adequate or general solution yet exists [Chase].

1.2. Proliferation Of Fabrics in Data Centers

The current alternative to paying the high cost of data transfer overhead in data centers is to use several different communication technologies at once. Data centers are likely to have separate Ethernet IP, Fibre Channel storage, and InfiniBand, VIA or proprietary interprocess communication (IPC) networks. Special purpose networks are used for storage and IPC to reduce the processor overhead associated with data communications, and, in the case of IPC, to reduce latency as well.

Using such proprietary and special purpose solutions runs counter to the requirements of data center computing. Data center designers and operators do not want the expense and complexity of building and maintaining three separate communications networks. Three NICs and three fabric ports are expensive and consume valuable I/O card slots, power and machine room space. A single IP fabric would be far preferable.

IP networks are best positioned to fill the role of all three of these existing networks. At 1 to 10 gigabit speeds, current IP interconnects could offer performance characteristics comparable or superior to those of special purpose interconnects, if it were not for the high overhead and latency of IP data transfers. An IP-based alternative to the IPC and storage fabrics would be less costly, and much more easily managed, than maintaining separate communication fabrics.

1.3. Potential Solutions

One frequently proposed solution to the problem of overhead in IP data transfers is to wait for the next generation of faster processors and speedier memories to render the problem irrelevant. However, in the evolution of the Internet, processor and memory speeds are not the only variables that have increased exponentially over time. Data link speeds have grown exponentially as well. Recently, spurred by the demand for core network bandwidth, data link speeds have grown faster than both processor computation rates and processor memory transfer rates. Whatever speed increases occur in processors and memories, it is clear that link speeds will continue to grow aggressively as well.
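
A rough calculation makes this mismatch concrete. The sketch below compares the memory bus bandwidth consumed by the buffered receive path described in Section 1.1, which crosses the bus three times, with direct placement, which crosses it once. The link rate and bus bandwidth figures are illustrative assumptions only, not measurements of any particular system.

   #include <stdio.h>

   int main(void)
   {
       /* Assumed, illustrative figures -- not measurements. */
       double link_gbit = 10.0;  /* link rate in Gbit/s             */
       double bus_gbyte = 2.0;   /* memory bus bandwidth in GByte/s */

       double wire_gbyte     = link_gbit / 8.0;  /* payload arrival rate  */
       double buffered_gbyte = 3.0 * wire_gbyte; /* DMA write + copy read
                                                    + copy write          */
       double direct_gbyte   = 1.0 * wire_gbyte; /* single DMA write      */

       printf("buffered receive: %.2f of %.2f GByte/s bus (%.0f%%)\n",
              buffered_gbyte, bus_gbyte, 100.0 * buffered_gbyte / bus_gbyte);
       printf("direct placement: %.2f of %.2f GByte/s bus (%.0f%%)\n",
              direct_gbyte, bus_gbyte, 100.0 * direct_gbyte / bus_gbyte);
       return 0;
   }

Under these assumed figures the buffered path would need nearly twice the bus bandwidth the machine has, so the memory system saturates well before the link does, no matter how fast the processor becomes.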

Rather than relying on increasing CPU performance, non-IP solutions use network interface hardware to attack several distinct sources of overhead. For a small, one-way IP data transfer, typically both the sender and receiver must make several context switches, process several interrupts, and send and receive a network packet. In addition, the receiver must perform at least one data copy. This single transfer could require 10,000 instructions of execution and total time measured in hundreds of microseconds if not milliseconds. The sources of overhead in this transfer are:

o  context switches and interrupts,

o  execution of protocol code,

o  copying the data on the receiver.

Copying competes with DMA and other processor accesses for memory system bandwidth, and all these sources of overhead can also have significant secondary effects on the efficiency of application execution by interfering with system caches.

Depending on the application, each of these sources of overhead may be a small or a large factor in total overhead, but the cumulative effect of all of them is nearly always substantial for high bandwidth transfers. If data transfers are very small, data copying is only a small cost, but context switching and protocol stack execution become performance limiting factors. For large transfers, the most common high bandwidth data transfers, context switching and protocol stack execution can be amortized away, within certain limits, but data copying becomes costly.

Non-IP solutions address these sources of overhead with network interface hardware that:

o  reduces context switches and interrupts with kernel-bypass capability, where the application communicates directly through the network interface without kernel intervention,

o  reduces protocol stack processing with protocol offload hardware that performs some or all protocol processing (e.g. ACK processing),

o  reduces data copying overhead by placing data directly in application buffers.

The application of these techniques reduces both data transfer overhead and data transfer latency. Context switches and data copying are substantial sources of end-to-end latency that are eliminated by kernel-bypass and direct data placement. Offloaded protocol processing can also typically be performed an order of magnitude faster than by a comparable, general purpose protocol stack, due to the ability to exploit extensive parallelism in hardware. While protocol offload does reduce overhead, for the vast majority of current high bandwidth data transfer applications, eliminating data copies is much more important. These techniques, and others, may be equally applicable to reducing the overhead of IP data transfers.

2. High Bandwidth Data Transfer In The Data Center

There are numerous uses of high bandwidth data transfers in today's data centers. While these applications are found in the data center, they have implications for the desktop as well. This problem statement focuses on data center scenarios below, but it would be beneficial to find a solution that meets data center needs while possibly remaining affordable for the desktop.

Why is high bandwidth data transfer in the data center important for IP networking? Performance on the Internet, as well as intranets, is dependent on the performance of the data center.
Every request, be it a web page, database query or file and print service, goes to or through data center servers. Often a multi-tiered computing solution is used, where multiple machines in the data center satisfy these requests.

Despite the explosive growth of the server market, data centers are running into critical limitations that impact every client directly or indirectly. Clients are largely limited in performance by the human at the interface. In contrast, data center performance is limited by the speeds and feeds of the network and I/O devices, as well as by hardware and software components.

With new protocols such as iSCSI, IP networks are increasingly taking on the functions of special purpose interconnects, such as Fibre Channel. However, the limitations created by the high data transfer overhead described here have not yet been addressed for IP protocols in general.

First and foremost, all the problems illustrated in the scenarios below occur on IP protocol-based networks. It is imperative to understand the pervasiveness of IP networks within the data center and that all of the problems described below occur in IP-based data transfer solutions. Therefore, a solution to these problems will naturally also be a part of the IP protocol suite.

Although the problems discussed below manifest themselves in different ways, investigation into the source of these problems shows a common thread running through them. These scenarios are not an exhaustive list, but rather describe the wide range of problems exhibited in the scalability and performance of the applications and infrastructures encountered in data center computing as a result of high communication overhead.

2.1. Scalable Data Center Applications

A key characteristic of any data center application is its ability to scale as demands increase. For many Internet services, applications must scale in response to the success of the service and the increased demand which results. In other cases, applications must be scaled as capabilities are added to a service, again in response to the success of the service, changes in the competitive environment or the goals of the provider.

Virtually all data center applications require intermachine communication, and therefore application scalability may be directly limited by communication overhead. From the application viewpoint, every CPU cycle spent performing data transfer is a wasted cycle that affects scalability. For high bandwidth data transfers using IP, this overhead can be 30-40% of available CPU.

If an application is running on a single server, and it is scaled by adding a second server, communication overhead of 40% means that the CPU available to the application from two servers is only 120% of that of the single server. The problem is even worse with many servers, because most servers are communicating with more than one other server. If three servers are connected in a pipeline where 40% CPU is required for data transfers to or from another server, the total available CPU power would still be only 120% of the power of a single server! Not all data center applications require this level of communication, but many do.

The high overhead of data transfers in IP severely impacts the viability of IP for scalable data center applications.
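
The arithmetic behind these figures is simple enough to check directly. The sketch below treats the 40% figure quoted above as an assumption and compares the application CPU available from one standalone server, which exchanges no data with peers, against two cooperating servers that each lose that fraction to data transfer.

   #include <stdio.h>

   int main(void)
   {
       /* Assumption taken from the text above: 40% of each cooperating
          server's CPU is consumed by inter-server data transfer.       */
       double overhead = 0.40;

       double one_server  = 1.0;                    /* standalone, no peers */
       double two_servers = 2.0 * (1.0 - overhead); /* each loses 40%       */

       printf("one server : %.0f%% of a single server's CPU\n",
              100.0 * one_server);
       printf("two servers: %.0f%% of a single server's CPU\n",
              100.0 * two_servers);
       return 0;
   }

Charging each server once for every peer it exchanges data with, as the pipeline example above does, erodes the gain even further.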

2.2. Client/Server Communication

Client/server communication in the data center is a variation of the scalable data center application scenario, but applies to standalone servers as well as parallel applications. The overhead of high bandwidth data communication weighs heavily on the server. The server's ability to respond is limited by any communication overhead it incurs.

In addition, client/server application performance is often dominated by data transfer latency characteristics. Reducing latency can greatly improve application performance. Techniques commonly employed in IP network interfaces, such as TCP checksum calculation offload, reduce transfer overhead somewhat, but they typically do not reduce latency at all.

Another technique used to reduce latency in IP communication is to dedicate multiple threads of execution, each running on a separate processor, to processing requests concurrently. However, this multithreading solution has limits, as the number of outstanding requests can vastly exceed the number of processors. Furthermore, the effect of multithreading concurrency is additive with any other latency reduction in the data transfers themselves.

To address the problems of high bandwidth IP client/server communication, a solution would ideally reduce both end-to-end communication latency and communication overhead.

2.3. Block Storage

Block storage, in the form of iSCSI [iSCSI] and the IP Fibre Channel protocols [FCIP, iFCP], is a new IP application area of great interest to the storage and data center communities. Just as data centers eagerly desire to replace special-purpose interprocess communication fabrics with IP, there is parallel and equal interest in migrating block storage traffic from special-purpose storage fabrics to IP.

As with other forms of high bandwidth communication, the data transfer overhead in traditional IP implementations, particularly the three bus crossings required for receiving data, may substantially limit data center storage transfer performance compared to what is commonplace with special-purpose storage fabrics. In addition, data copying, even if it is performed within a specialized IP-storage adapter, will substantially increase transfer latency, which can noticeably degrade the performance of both file systems and applications.

Protocol offload and direct data placement comparable to what is provided by existing storage fabric interfaces (Fibre Channel, SCSI, FireWire, etc.) are possible pieces of a solution to the problems created by IP data transfer overhead for block storage.

It has been claimed that block storage is such an important application that IP block storage protocols should be directly offloaded by network interface hardware, rather than through use of a generic, application-independent offload solution. However, even the block storage community recognizes the benefits of more general-purpose ways to reduce IP transfer overhead, and most expect to eventually use such general-purpose capabilities for block storage when they become available, if for no other reason than that it reduces the risks and impact of changing and evolving the block storage protocols themselves.

2.4. File Storage

The file storage application exhibits a compound problem within the data center. File servers and clients are subject to the communication characteristics of both block storage and client/server applications.

The problems created by high transfer overhead are particularly acute for file storage implementations that are built with a substantial amount of user-mode code.

In any form of file storage application, many CPU cycles are spent traversing the kernel mode file system, disk storage subsystems and protocol stacks, and driving network hardware, similar to the block storage scenario. In addition, file systems must address the communication problems of a distributed client/server application. There may be substantial shared state distributed among servers and clients, creating the need for extensive communication to maintain this shared state.

A solution to the communication overhead problems of IP data transfer for file storage involves a union of the approaches for efficient disk storage and efficient client/server communication, as discussed above. In other words, both low overhead and low latency communication are goals.

2.5. Backup

One of the problems with IP-based storage backup is that it consumes a great deal of the host CPU's time and resources. Unfortunately, the high overhead required for IP-based backup is typically not acceptable in an active data center.

The challenge of backup is that it is usually performed on machines which are also actively participating in the services the data center is providing. At a minimum, a machine performing backup must maintain some synchronization with other machines modifying the state being backed up, so the backup is coherent. As discussed in the section above on Scalable Data Center Applications, any overhead placed on active machines can substantially affect scalability and solution cost.

Backup solutions on specialized storage fabrics allow systems to back up the data without the host processor ever touching the data. Data is transferred to the backup device from disk storage through host memory, or sometimes even directly without passing through the host, as a so-called third party transfer. Storage backup in the data center could be done with IP if data transfer overhead were substantially reduced.

2.6. The Common Thread

There is a common thread running through the problems of using IP communication in all of these scenarios. The union of the solutions to these problems is a high bandwidth, low latency, low CPU overhead data transfer solution. Non-IP solutions offer technical solutions to these problems, but they lack the ubiquity and price/performance characteristics necessary for a viable, general solution.

3. Non-IP Solutions

The most refined non-IP solution for reducing communication overhead has a rich history reaching back almost 20 years. This solution uses a data transfer metaphor called Remote Direct Memory Access (RDMA). See Appendix A for an introduction to RDMA.

In spite of the technical advantages of the various non-IP solutions, all have ultimately lacked the ubiquity and price/performance characteristics necessary to gain widespread usage. This lack of widespread adoption has also resulted in various shortcomings of particular incarnations, such as incomplete integration with native platform capabilities, or other software implementation limitations. In addition, no non-IP solutions offer the massive range of network scalability that IP protocols support.
Non-IP solutions typically only scale to tens or hundreds of nodes in a single network, and offer no provision for interconnecting multiple networks. Several non-IP solutions will be briefly described here to show the state of experience with this set of problems.

3.1. Proprietary Solutions

Low overhead communication technologies have traditionally been developed as proprietary value-added products by computer platform vendors. Such solutions were tightly integrated with platform operating systems and did provide powerful, well integrated communication capabilities. However, applications written for one solution were not portable to others. Also, the solutions were expensive, as is typically the case with value-added technologies.

The earliest example of a low overhead communication technology was Digital's VAX Cluster Interconnect (CI), first released in 1983. The CI allowed computers and storage to be connected as peers on a small multipoint network used for both IPC and I/O. The CI made VAX/VMS Clusters the only alternative to mainframes for large commercial applications for many years.

Tandem ServerNet was another proprietary block transfer technology, developed in the mid-1990s. It has been used to perform disk I/O, IPC and network I/O in the Himalaya product line. This architecture allows the Himalaya platform to be inherently scalable because the software has been designed to take advantage of the offload capability and zero copy techniques. Tandem attempted to take this product into the Industry Standard Server market, but its price/performance characteristics and its proprietary nature prevented wide adoption.

Silicon Graphics used a standards-based network fabric, HiPPI-800, but built a proprietary low overhead communication mechanism on top. Other platform vendors such as IBM, HP and Sun have also offered a variety of proprietary low overhead communication solutions over the years.

3.2. Standards-based Solutions

Increasing fluidity in the landscape of major platform vendors has drastically increased the desire for all applications to be portable. Platforms which were here yesterday might be gone tomorrow. This has killed the willingness of application and data center designers and maintainers to use proprietary features of any platform.

Unwillingness to continue to use proprietary interconnects forced platform vendors to collaborate on standards-based low overhead communication technologies to replace the proprietary ones which had become critical to building data center applications. Two of these standards-based solutions, considered to be roughly parent and child, are described below.

3.2.1. The Virtual Interface Architecture (VIA)

VIA [VI] was a technology jointly developed by Compaq, Intel and Microsoft. VIA helped prove the feasibility of doing IPC offload, user mode I/O and traditional kernel mode I/O as well.

While VIA implementations met with some limited success, VIA turned out to fill only a small market niche, for several reasons. First, commercially available operating systems lacked a pervasive interface. Second, because the standard did not define a wire protocol, no two implementations of the VIA standard were interoperable on the wire.
Third, different implementations were not interoperable at the software layer either, since the API definition was an appendix to the specification and not part of the specification itself.

Yet with parallel applications, VIA proved itself time and again. It was used to set a new benchmark record in the terabyte data sort at Sandia Labs. It set new TPC-C records for distributed databases, and it was used to set new TPC-C records as the client-server communication link. VIA also laid the foundation for work such as the Sockets Direct Protocol through the implementation of the Winsock Direct Protocol in Windows 2000 [WSD]. It also gave the DAFS Collaborative a rallying point for a common programming interface [DAFSAPI].

3.2.2. InfiniBand

InfiniBand [IB] was developed by the InfiniBand Trade Association (IBTA) as a low overhead communication technology that provides remote direct memory access transfers, including interlocked atomic operations, as well as traditional datagram-style transfers. InfiniBand defines a new electromechanical interface, card and cable form factors, physical interface, link layer, transport layer and upper layer software transport interface. The IBTA has also described a fabric management infrastructure to initialize and maintain the fabric.

While all of the specialized technology of InfiniBand does provide impressive performance characteristics, IB lacks the ubiquity and price/performance of IP. In addition, management of InfiniBand fabrics will require new tools and training, and InfiniBand lacks the huge base of applications, protocols, and thoroughly engineered security and routing technology available in IP.

4. Conclusion

This document has described the set of problems that hinder the widespread use of IP for high speed data transfers in data centers. There have been a variety of non-IP solutions available, which have met with only limited success, for different reasons. After many years of experience in both the IP and non-IP domains, the problems appear to be reasonably well understood, and a direction toward a solution is suggested by this study. However, additional investigation, and subsequent work on an architecture and the necessary protocol(s) for reducing overhead in high bandwidth IP data transfers, are still required.

5. Security Considerations

This draft states a problem and, therefore, does not require particular security considerations other than those dedicated to squelching the free spread of ideas, should the problem discussion itself be considered seditious or otherwise unsafe.

6. References

[Chase]   J. S. Chase, et al., "End system optimizations for high-speed TCP", IEEE Communications Magazine, Volume 39, Issue 4, April 2001, pp. 68-74. http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}

[DAFSAPI] "Direct Access File System Application Programming Interface", version 0.9.5, 09/21/2001. http://www.dafscollaborative.org/tools/dafs_api.pdf

[FCIP]    R. Bhagwat, et al., "Fibre Channel Over TCP/IP (FCIP)", 09/20/2001. http://www.ietf.org/internet-drafts/draft-ietf-ips-fcovertcpip-06.txt

[HTTP]    J. Gettys, et al., "Hypertext Transfer Protocol - HTTP/1.1", RFC 2616, June 1999.

[IB]      InfiniBand Architecture Specification, Volumes 1 and 2, release 1.0.a. http://www.infinibandta.org

[iFCP]    C. Monia, et al., "iFCP - A Protocol for Internet Fibre Channel Storage Networking", 10/19/2001. http://www.ietf.org/internet-drafts/draft-ietf-ips-ifcp-06.txt

[iSCSI]   J. Satran, et al., "iSCSI", 10/01/2001. http://www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-08.txt

[NFSv3]   B. Callaghan, "NFS Version 3 Protocol Specification", RFC 1813, June 1995.

[SCTP]    R. R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. J. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, and V. Paxson, "Stream Control Transmission Protocol", RFC 2960, October 2000.

[TCP]     J. Postel, "Transmission Control Protocol - DARPA Internet Program Protocol Specification", RFC 793, September 1981.

[VI]      Virtual Interface Architecture Specification, version 1.0. http://www.viarch.org/html/collateral/san_10.pdf

[WSD]     "Winsock Direct and Protocol Offload On SANs", version 1.0, 3/3/2001, from "Designing Hardware for the Microsoft Windows Family of Operating Systems". http://www.microsoft.com/hwdev/network/san

Authors' Addresses

   Stephen Bailey
   Sandburst Corporation
   600 Federal Street
   Andover, MA 01810 USA

   Phone: +1 978 689 1614
   EMail: steph@sandburst.com

   Dave Garcia
   Compaq Computer Corp.
   19333 Valco Parkway
   Cupertino, CA 95014 USA

   Phone: +1 408 285 6116
   EMail: dave.garcia@compaq.com

   Jeff Hilland
   Compaq Computer Corp.
   20555 SH 249
   Houston, TX 77070 USA

   Phone: +1 281 514 9489
   EMail: jeff.hilland@compaq.com

   Allyn Romanow
   Cisco Systems, Inc.
   170 W. Tasman Drive
   San Jose, CA 95134 USA

   Phone: +1 408 525 8836
   EMail: allyn@cisco.com

Appendix A. RDMA Technology Overview

This section describes how Remote Direct Memory Access (RDMA) technologies such as the Virtual Interface Architecture (VIA) and InfiniBand (IB) provide for low overhead data transfer. VIA and IB are examples of the RDMA technology also used by many proprietary low overhead data transfer solutions. The IB and VIA protocols both provide memory access and push transfer semantics.

With memory access transfers, data from the local computer is written/read directly to/from an address space of the remote computer. How, when and why buffers are accessed is defined by the ULP layer above IB or VIA.

With push transfers, the data source pushes data to an anonymous receive buffer at the destination. TCP and UDP transfers are both examples of push transfers. VIA and IB both call their push transfer a Send operation, which is a datagram-style push transfer. The data receiver chooses where to place the data; the receive buffer is anonymous with respect to the sender of the data.

A.1 Use of Memory Access Transfers

In the memory access transfer model, the initiator of the data transfer explicitly indicates where data is extracted from or placed on the remote computer. VI and InfiniBand both define memory access read (called RDMA Read) and memory access write (called RDMA Write) transfers. The buffer address is carried in each PDU, allowing the network interface to place the data directly in application buffers.

Placing the data directly into the application's buffer has three significant benefits:

o  CPU and memory bus utilization are lowered by not having to copy the data.
   Since memory access transfers use buffer addresses supplied by the application, data can be directly placed at its final location.

o  Memory access transfers incur no CPU overhead during transfers if the network interface offloads RDMA (and lower layer) protocol processing. There is enough information in RDMA PDUs for the target network interface to complete RDMA Reads or RDMA Writes without any local CPU action.

o  Memory access transfers allow splitting of ULP headers and data. With memory access transfers, the ULP can control the exact placement of all received data, including ULP headers and ULP data. ULP headers and other control information can be placed in separate buffers from ULP data. This is frequently a distinct advantage compared to having ULP headers and data in the same buffers, as an additional data copy may otherwise be required to separate them.

Providing memory access transfers does not mean a processor's entire memory space is open for unprotected transfers. The remote computer controls which of its buffers can be accessed by memory access transfers. Incoming RDMA Read and RDMA Write operations can only access buffers to which the receiving host has explicitly permitted RDMA accesses. When the ULP allows RDMA access to a buffer, the extent and address characteristics of the buffer can be chosen by the ULP. A buffer could use the virtual address space of the process, a physical address (if allowed), or a new virtual address space created for the individual buffer.

In both IB and VIA, the RDMA buffer is registered with the receiving network interface before RDMA operations can occur. For a typical hardware offload network interface, this is enough information to build an address translation table and associate appropriate security information with the buffer. The address translation table lets the NIC convert the incoming buffer target address into a local physical address.

A.2 Use Of Push Transfers

Memory access transfers contrast with the push transfers typically used by IP applications. With push transfers, the source has no visibility or control over where data will be delivered on the destination machine. While most protocols use some form of push transfer, IB and VIA define a datagram-style push transfer that allows a form of direct data placement on the receive side.

IB and VIA both require the application to pre-post receive buffers. The application pre-posts receive buffers for a connection and they are filled by subsequent incoming Send operations. Since the receive buffer is pre-posted, the network interface can place the data from the incoming Send operation directly into the application's buffer. IB and VIA allow the use of scattered receive buffers to support splitting the ULP header from data within a single Send.

Neither memory access nor push transfers are inherently superior -- each has its merits. Furthermore, memory access transfers can be built atop push transfers or vice versa. However, direct support of memory access transfers allows much lower transfer overhead than if memory access transfers are emulated.

A.3 RDMA-based I/O Example

If the RDMA protocol is offloaded to the network interface, the RDMA Read operation allows an I/O subsystem, such as a storage array, to fully control all aspects of data transfer for outstanding I/O operations. An example of a simple I/O operation shows several benefits of using memory access transfers.
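
The paragraphs that follow walk through a simple I/O block Write in prose. The sketch below outlines the same host-side sequence in C; every name in it (rdma_region, rdma_register, msg_send, msg_recv, write_request) is hypothetical and introduced only to make the steps concrete, and the small stubs exist only so the sketch compiles. None of it is taken from the VIA or InfiniBand interfaces.

   #include <stdint.h>
   #include <stdio.h>
   #include <string.h>

   /* Hypothetical registration handle: which buffer the NIC may expose. */
   typedef struct { void *buf; size_t len; } rdma_region;

   /* Stubs standing in for a hypothetical RDMA-capable NIC interface.   */
   static rdma_region rdma_register(void *buf, size_t len)
   {
       rdma_region r = { buf, len };   /* a real NIC would also build an
                                          address translation entry here */
       return r;
   }

   static void rdma_deregister(rdma_region *r)
   {
       (void)r;                        /* withdraw RDMA access to buffer  */
   }

   static void msg_send(const void *msg, size_t len)   /* small Send     */
   {
       (void)msg;
       printf("sent %zu-byte write request\n", len);
   }

   static void msg_recv(void *msg, size_t len)         /* completion     */
   {
       memset(msg, 0, len);
       printf("received completion message\n");
   }

   /* ULP-defined request message: where the I/O subsystem may RDMA Read
      from.  A real ULP would also carry a protection tag for the buffer. */
   struct write_request {
       uint64_t rdma_address;
       uint64_t length;
   };

   static void block_write(void *data, size_t len)
   {
       /* 1. Register the data source so the I/O subsystem may RDMA Read
             it directly.                                                  */
       rdma_region region = rdma_register(data, len);

       /* 2. Push a small Send describing the write and where the data is. */
       struct write_request req = { (uintptr_t)data, len };
       msg_send(&req, sizeof req);

       /* 3. The I/O subsystem pulls the data with RDMA Reads at its own
             pace; the host CPU does no per-byte work for the transfer.    */

       /* 4. Wait for the completion message, then deregister the buffer.  */
       char completion[32];
       msg_recv(completion, sizeof completion);
       rdma_deregister(&region);
   }

   int main(void)
   {
       char block[4096] = "example data";
       block_write(block, sizeof block);
       return 0;
   }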

Consider an I/O block Write operation where the host processor wishes to move a block of data (the data source) to an I/O subsystem. The host first registers the data source with its network interface as an RDMA address block. Next the host pushes a small Send operation to the I/O subsystem. The message describes the I/O write request and tells the I/O subsystem where it can find the data in the virtual address space presented through the communication connection by the network interface.

After receiving this message, the I/O subsystem can pull the data from the host's buffer as needed. This gives the I/O subsystem the ability to both schedule and pace its data transfer, thereby requiring less buffering on the I/O subsystem. When the I/O subsystem completes the data pull, it pushes a completion message back to the host with a small Send operation. The completion message tells the host the I/O operation is complete and that it can deregister its RDMA block.

In this example the host processor spent very few CPU cycles doing the I/O block Write operation. The processor sent out a small message and the I/O subsystem did all the data movement. After the I/O operation was completed the host processor received a single completion message.

Full Copyright Statement

Copyright (C) The Internet Society (2001). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.