Internet-Draft                                          Brent Callaghan
Expires: June 2004                               Sun Microsystems, Inc.
                                                            Mark Wittle
                                                Network Appliance, Inc.
Document: draft-callaghan-nfsrdmareq-00.txt               December 2003


                         NFS RDMA Requirements


Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This memo provides information for the Internet community. This
   memo does not specify an Internet standard of any kind.
   Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2003). All Rights Reserved.

Expires: June 2004         Callaghan and Wittle                [Page 1]

Internet-Draft            NFS RDMA Requirements           December 2003

Abstract

   Remote Direct Memory Access (RDMA) technology provides an efficient
   memory-to-memory data path across a network. This draft addresses
   the requirements that an RDMA provider must satisfy to meet the
   needs of the NFS protocol and its implementations. It is a
   companion to the NFS RDMA Problem Statement which presents a case
   for applying RDMA technology to the NFS protocol.

Table of Contents

   1. Introduction
   2. RDMA Generic Requirements
   2.1. RDMA features: SEND, READ and WRITE
   2.2. Preserve ordering of RDMA SENDs
   2.3. Support 32 bit Steering Tags
   2.4. Need to support RDMA ops with 64 bit virtual addresses
   2.5. Integrity of transferred data.
   2.6. Data Privacy
   2.7. Must support existing NFS protocols, 2, 3 and 4
   2.8. Direct Data Placement
   2.9. Flow control
   2.10. Buffer size control
   2.11. Recovery from RDMA errors
   2.12. Efficient use of RDMA services: RDMA READ vs WRITE
   3. RDMA Specific Requirements
   3.1. IP based addressing
   3.2. Minimal effect on latency or protocol overhead.
   3.3. Should handle large transfers
   4. Security
   5. Acknowledgements
   6. References
   7. Authors' Addresses
   8. Full Copyright Statement

1. Introduction

   The NFS RDMA Problem Statement [NFSRDMAPS] describes RDMA
   technology and how it can benefit NFS implementations. It
   recommends the application of RDMA to the existing NFS protocol
   versions. One example of this is presented in "RDMA Transport for
   ONC RPC" [RPCRDMA], which implements RDMA as a network transport at
   the RPC layer. This approach preserves the semantics of RPC based
   protocols while providing data marshaling access for RDMA.

   Extensions to the NFS version 4 protocol are proposed in the draft
   "NFSv4 RDMA and Session Extensions" [NFSRDMA], which supply RDMA
   session negotiation and support for Exactly-Once Semantics in NFS
   version 4.

   Any use of RDMA by NFS must necessarily affect the RPC layer that
   is used by NFS. Therefore, this document considers the problem from
   the perspective of RPC. This RDMA transport is assumed to be
   implemented as an upper-half generic part that assumes a general
   RDMA model, and a lower-half that presents a particular
   implementation of RDMA, like RDDP.

   The requirements that follow are split into two groups: RDMA
   generic requirements, followed by RDMA specific requirements -
   specific to a particular implementation of RDMA.

2. RDMA Generic Requirements

   ONC RPC can utilize a number of different transports to move RPC
   messages. Established transports are UDP (connectionless) and TCP
   (connection-oriented). This draft anticipates that RDMA will be
   made available as another ONC RPC transport and assumes a generic
   RDMA model, i.e. an application may specify "rdma" as a transport
   for its RPC messages without regard to the particular
   implementation of RDMA such as RDDP. The following requirements
   apply to this generic RDMA transport layer for RPC.

2.1. RDMA features: SEND, READ and WRITE

   The RDMA connection must support RDMA SEND, READ and WRITE
   operations.

   An RDMA SEND must move data to a pre-posted, untagged receive
   buffer on the peer and must signal a completion event to the
   receiver.

   An RDMA WRITE is initiated on the sending system and must transfer
   data between registered buffers using a steering tag, offset and
   size to designate buffers on both the sending system (data source)
   and the receiving system (data sink). A completion event on the
   data source must signal completion of the RDMA WRITE operation at
   the data source, though it does not necessarily signal the arrival
   of data at the sink.

   Similarly, an RDMA READ is initiated on the data sink and must
   transfer data between registered buffers on peers using a steering
   tag, offset and size to designate buffers on both source and sink
   systems.
   A completion event on the data sink system must signal completion
   of the RDMA READ operation and the availability of data in the sink
   buffer.

2.2. Preserve ordering of RDMA SENDs

   The RDMA transport must preserve the ordering of SENDs: a sequence
   of SENDs must arrive at the receiver in the same order that they
   were sent. There is no requirement for ordering of READ or WRITE
   operations.

2.3. Support 32 bit Steering Tags

   The remote addresses used by RDMA are qualified by a Steering Tag
   (STag) or memory handle that uniquely describes a region of
   registered memory on a host. There are no requirements for the
   content of the steering tag. An STag must not exceed 32 bits.

2.4. Need to support RDMA ops with 64 bit virtual addresses

   Support for 64 bit virtual addresses is universal in large
   operating environments. The NFS protocol allows up to 64 bit file
   offsets. RDMA operations must support 64 bit virtual addresses.

2.5. Integrity of transferred data.

   The use of NFS exposes file and filesystem contents to a network,
   where data may be corrupted or modified by attackers, or privacy
   lost to a third party. Network media like Ethernet provide CRC
   protection for data in an Ethernet frame. Data is also protected by
   UDP or TCP checksums. Finally, the use of an integrity or privacy
   mechanism in RPCSEC_GSS can protect data at the RPC layer - albeit
   with some loss of performance.

   An RDMA transport layer must provide no less guarantee of data
   security than is already provided by existing transports. It is
   better still if the RDMA transport provides stronger protection of
   data integrity, particularly in a business-critical data center
   environment characterized by very high bit rates.

2.6. Data Privacy

   Some may perceive RDMA as providing new opportunities for attack or
   disclosure via the exposure of registered buffers to external view
   or modification.
   Since a peer system can obtain data from or place data directly
   into a remote buffer, it is not hard to imagine another malicious
   peer also accessing or using direct placement to modify that data
   before it is used by the application. While RPCSEC_GSS can provide
   protection, there should also be an RDMA provided mechanism that
   disallows remote access to direct-placed data after the data has
   been placed in the receiving buffer and before it is checked for
   integrity or decrypted.

2.7. Must support existing NFS protocols, 2, 3 and 4

   An RDMA transport for RPC must support existing versions of NFS.
   This isn't to say that an RDMA transport must be interoperable with
   old implementations that do not support RDMA. We don't expect an
   NFS/UDP client to be interoperable with an NFS/TCP server. However,
   we should expect the benefits of an RDMA transport to be made
   available to existing versions of the NFS protocol. The
   availability of RDMA as a simple transport add-on can provide
   performance benefits to existing implementations while preserving
   their semantics ("bug-for-bug compatible"). This will reduce
   implementation, training and deployment costs.

2.8. Direct Data Placement

   The NFS RDMA Problem Statement [NFSRDMAPS] describes the overhead
   in data copying suffered by existing NFS implementations. The data
   copying is necessary, either to move data between an
   application-defined buffer and a network buffer, or to achieve
   correct data alignment. An RDMA transport must support direct data
   placement and data alignment - sending data directly to a
   destination buffer on the remote system without any intermediate
   copies.

2.9. Flow control

   While RPC protocols are typically call-response, connection sharing
   and use of read-ahead and write-behind can cause multiple RPC
   messages to be in flight concurrently on a single network
   connection.
   RPC provides no inherent flow control - there is nothing to prevent
   a client from initiating any number of concurrent RPC messages that
   can overwhelm an NFS server. Although TCP lets the server push back
   on a client by advertising a window of the amount of data it is
   willing to accept, that window does not reflect the transaction
   capacity of the server, e.g. the number of service threads
   available to handle requests. While the effect of a transaction
   overload on a TCP connection is just the server dropping requests,
   an overload on an RDMA transport can cause transport errors and
   loss of the connection.

   RDMA transports that support the use of untagged, pre-posted
   receive buffers require that any message sent over the connection
   have a receive buffer posted at the transport endpoint. A client
   that transmits a number of messages without acknowledgement must
   not send more messages than the number of receive buffers queued at
   the connection. The client and server must be able to negotiate the
   number of posted receive buffers, i.e. the maximum number of
   messages that can be in flight. It's possible that the number of
   posted buffers may change over time depending on the memory or
   compute resources available to the server, so this negotiation
   should extend over the life of the connection.

   Given that RPC is a call-response protocol, the number of messages
   in flight is determined by the client. Hence the client needs to
   know the server's limits, but the client can guarantee sufficient
   posted receive buffers at its end simply by posting a new receive
   buffer to accept an RPC reply whenever it sends an RPC call
   message.

2.10. Buffer size control

   While the number of untagged receive buffers must be agreed upon by
   the client and server, the size of these buffers is also important.
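
   The agreement on the number of receive buffers described in
   Section 2.9 can be pictured as a credit scheme. The following is a
   minimal sketch under stated assumptions, not a definitive
   implementation: this draft does not say how the limit is carried
   between peers, so the class CreditedClient, its fields, and the
   idea that each reply carries an updated credit count are all
   hypothetical names and choices made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CreditedClient:
    """Hypothetical client-side model of the Section 2.9 flow control."""

    credits: int                   # receive buffers the server has posted
    in_flight: int = 0             # calls sent but not yet answered
    posted_reply_buffers: int = 0  # local receive buffers awaiting replies
    backlog: deque = field(default_factory=deque)  # calls awaiting a credit

    def submit(self, call):
        """Queue a call and return the calls that may be sent right now."""
        self.backlog.append(call)
        return self._drain()

    def on_reply(self, new_credits):
        """Consume one reply; the server may raise or lower its limit.

        Assumption: the updated limit arrives with each reply.
        """
        self.in_flight -= 1
        self.posted_reply_buffers -= 1  # the reply used one local buffer
        self.credits = new_credits
        return self._drain()

    def _drain(self):
        sent = []
        # Never exceed the number of receive buffers posted at the server.
        while self.backlog and self.in_flight < self.credits:
            # Post a local receive buffer for the reply before each call.
            self.posted_reply_buffers += 1
            self.in_flight += 1
            sent.append(self.backlog.popleft())
        return sent
```

   With a limit of two, a third call waits in the backlog until a
   reply frees a credit; if a later reply lowers the limit, the client
   simply sends less until replies bring it back under the new value.
   Tying the local receive buffer to each call mirrors the observation
   that the client can always guarantee buffers at its own end.
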
   If an RPC message exceeds the size of the posted receive buffer
   then a network error will result and the connection may be lost.
   The client and server must agree on the size of posted buffers.

   The NFS protocol already allows some negotiation of buffer sizes in
   the protocol itself (the NFS version 3 FSINFO procedure provides
   maximum and preferred transfer sizes) so this buffer negotiation
   could be supported in an extended version of the NFS protocol,
   perhaps in an NFSv4 minor version. Alternatively, buffer
   negotiation might be conducted in a protocol-generic fashion via a
   simple RPC protocol that negotiates receive buffer size. This
   protocol may run over the RDMA connection while it is being
   initialized, bootstrapped by a requirement for a minimum receive
   buffer size, say 1 KB.

2.11. Recovery from RDMA errors

   NFS is recognized as a robust protocol that survives network errors
   and outages. Even the loss of a connection or crash and recovery of
   an NFS server need not affect the persistence of an NFS mount. NFS
   users are familiar with the "NFS server not responding" message,
   and the ability of applications to carry on as if nothing had
   happened after an "NFS server OK" message. We should expect the
   same behavior from an NFS mount over an RDMA connection. It must be
   able to detect and recover from RDMA errors, re-establishing the
   RDMA connection if necessary.

2.12. Efficient use of RDMA services: RDMA READ vs WRITE

   An RDMA transport must allow the NFS implementation to take full
   advantage of RDMA services - especially RDMA READ and WRITE
   operations that perform efficient memory-to-memory data transfers.
   The transport must allow client and server to make best use of RDMA
   READ and WRITE to perform NFS data transfers either for aligned
   data movement such as NFS READ and WRITE, or for movement of large
   amounts of non-aligned data - such as READDIR results.

3. RDMA Specific Requirements

   The following requirements are placed on a specific provider of
   RDMA service. These are the features that it must provide or
   support to meet the needs of the generic RDMA model assumed by the
   RPC transport.

3.1. IP based addressing

   Existing RPC networks are based on DNS provided hostnames and NICs
   identified with IP addresses. Although an RDMA transport may have
   different addressing requirements, it must be possible to map those
   addresses to IP addresses, or derive them from IP addresses, to
   preserve the system administrator's sanity.

   Similarly, we should expect the familiar properties of RPC
   protocols as implemented on IP to be preserved by an RDMA
   transport. For instance, there should be some notion of a unique
   transport endpoint, like the UDP or TCP port space; the ability to
   handle multiple client connections to a server endpoint; and the
   ability for multiple RPC protocols to share a single RDMA
   connection.

3.2. Minimal effect on latency or protocol overhead.

   Although the NFS protocol is commonly used to transfer large chunks
   of file data, most of the protocol messages are small. For
   instance, even though an NFS READ message may return a large amount
   of data in the reply, the call message is quite tiny: typically
   less than 100 bytes. NFS traffic mixes show that directory LOOKUP
   and fetching of file metadata with GETATTR, which account for the
   majority of NFS traffic, also have small RPC messages. It is
   important that an RDMA transport feature a small transaction cost,
   i.e. small messages must be handled efficiently with latency as low
   as or lower than an equivalent UDP or TCP connection.

3.3. Should handle large transfers

   The performance of the NFS protocol is often determined by the
   network transfer size available. NFS version 2 was limited by the
   protocol to 8 KB transfers, which was quite large back in the
   mid-1980s. NFS version 3 allowed the client and server to negotiate
   transfer size.
   While 32 KB transfers are common today, there is evidence to
   suggest that much larger transfer sizes can improve the performance
   of sequential access to large files. An RDMA transport is expected
   to provide efficient transfer of large blocks of data using
   specialized network hardware. NFS should expect an RDMA transport
   to support transfers of a megabyte or more.

4. Security

   ONC RPC makes no security assumptions of its transports, since it
   can already provide security for RPC messages through RPCSEC_GSS
   mechanisms [RFC2203]. However, link-level encryption or encryption
   via IPsec [RFC2401] may provide important performance advantages
   for an RPCSEC_GSS security mechanism [CCM].

5. Acknowledgements

   The authors would like to thank Tom Talpey and Chet Juszczak for
   their contributions to this document.

6. References

   [CCM]
      M. Eisler, N. Williams, "The Channel Conjunction Mechanism (CCM)
      for GSS", Internet-Draft,
      http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-ccm-01.txt

   [DAFS]
      "Direct Access File System Specification version 1.0",
      September 2001, available from http://www.dafscollaborative.org

   [NFSRDMA]
      T. Talpey, S. Shepler, "NFSv4 RDMA and Session Extensions",
      Internet-Draft,
      http://www.ietf.org/internet-drafts/draft-talpey-nfsv4-rdma-sess-00.txt

   [NFSRDMAPS]
      T. Talpey, C. Juszczak, "NFS RDMA Problem Statement",
      Internet-Draft,
      http://www.ietf.org/internet-drafts/draft-talpey-nfs-rdma-problem-statement-00.txt

   [RFC1094]
      "NFS: Network File System Protocol Specification" (NFS version
      2), Informational RFC, http://www.ietf.org/rfc/rfc1094.txt

   [RFC1813]
      B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol
      Specification", Informational RFC,
      http://www.ietf.org/rfc/rfc1813.txt

   [RFC1831]
      R. Srinivasan, "RPC: Remote Procedure Call Protocol
      Specification Version 2", Standards Track RFC,
      http://www.ietf.org/rfc/rfc1831.txt

   [RFC1832]
      R. Srinivasan, "XDR: External Data Representation Standard",
      Standards Track RFC, http://www.ietf.org/rfc/rfc1832.txt

   [RFC2203]
      M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol
      Specification", RFC 2203, September 1997,
      http://www.ietf.org/rfc/rfc2203.txt

   [RFC2401]
      S. Kent, R. Atkinson, "Security Architecture for the Internet
      Protocol", RFC 2401, November 1998,
      http://www.ietf.org/rfc/rfc2401.txt

   [RFC3530]
      S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame,
      M. Eisler, D. Noveck, "NFS version 4 Protocol", Standards Track
      RFC, http://www.ietf.org/rfc/rfc3530.txt

   [RPCRDMA]
      B. Callaghan, T. Talpey, "RDMA Transport for ONC RPC",
      Internet-Draft,
      http://www.ietf.org/internet-drafts/draft-callaghan-rpc-rdma-00.txt

7. Authors' Addresses

   Brent Callaghan
   Sun Microsystems, Inc.
   17 Network Circle
   Menlo Park, California 94025
   USA

   Phone: +1 650 786 5067
   EMail: brent.callaghan@sun.com

   Mark Wittle
   Network Appliance, Inc.
   375 Totten Pond Road
   Waltham, MA 02451
   USA

   Phone: +1 919 993 5627
   EMail: mark.wittle@netapp.com

8. Full Copyright Statement

   Copyright (C) The Internet Society (2003). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain
   it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.