Network File System Version 4                                   C. Lever
Internet-Draft                                                    Oracle
Updates: 8166 (if approved)                             January 31, 2020
Intended status: Standards Track
Expires: August 3, 2020

RDMA Connection Manager Private Data For RPC-Over-RDMA Version 1
draft-ietf-nfsv4-rpcrdma-cm-pvt-data-07

Abstract

This document specifies the format of RDMA-CM Private Data exchanged between RPC-over-RDMA version 1 peers as part of establishing a connection. The addition of the private data payload specified in this document is an optional extension that does not alter the RPC-over-RDMA version 1 protocol. This document updates RFC 8166.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on August 3, 2020.

Copyright Notice

Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Lever                    Expires August 3, 2020                 [Page 1]
Internet-Draft      RPC-Over-RDMA CM Private Data          January 2020

Table of Contents

1. Introduction
2. Requirements Language
3. Advertised Transport Properties
   3.1. Inline Threshold Size
   3.2. Remote Invalidation
4. Private Data Message Format
   4.1. Interoperability Considerations
      4.1.1. Interoperability with RPC-over-RDMA Version 1 Implementations
      4.1.2. Interoperability Amongst RDMA Transports
5. Updating the Message Format
   5.1. Feature Support Flags
   5.2. Inline Threshold Values
6. Security Considerations
7. IANA Considerations
   7.1. Guidance for Designated Experts
8. References
   8.1. Normative References
   8.2. Informative References
Acknowledgments
Author's Address

1. Introduction

The RPC-over-RDMA version 1 transport protocol [RFC8166] enables payload data transfer using Remote Direct Memory Access (RDMA) for upper-layer protocols based on Remote Procedure Calls (RPC) [RFC5531].
The terms "Remote Direct Memory Access" (RDMA) and "Direct Data Placement" (DDP) are introduced in [RFC5040].

The two most immediate shortcomings of RPC-over-RDMA version 1 are:

o  Setting up an RDMA data transfer (via RDMA Read or Write) can be costly. The small default size of messages transmitted using RDMA Send forces the use of RDMA Read or Write operations even for relatively small messages and data payloads.

   The original specification of RPC-over-RDMA version 1 provided an out-of-band protocol for passing inline threshold values between connected peers [RFC5666]. However, [RFC8166] eliminated support for this protocol, making it unavailable for this purpose.

o  Unlike most other contemporary RDMA-enabled storage protocols, there is no facility in RPC-over-RDMA version 1 that enables the use of remote invalidation [RFC5042].

RPC-over-RDMA version 1 has no means of extending its XDR definition in such a way that interoperability with existing implementations is preserved. As a result, an out-of-band mechanism is needed to help relieve these constraints for existing RPC-over-RDMA version 1 implementations.

This document specifies a simple, non-XDR-based message format designed to be passed between RPC-over-RDMA version 1 peers at the time each RDMA transport connection is first established. The mechanism assumes that the underlying RDMA transport has a private data field that is passed between peers at connection time, such as is present in the iWARP protocol (described in Section 7.1 of [RFC5044]) or the InfiniBand Connection Manager [IBA].

To enable current RPC-over-RDMA version 1 implementations to interoperate with implementations that support the private message format described in this document, implementation of the private data message is OPTIONAL. When the private data message has been successfully exchanged, peers may choose to perform extended RDMA semantics.
However, the private message format does not alter the XDR definition specified in [RFC8166].

The message format is intended to be further extensible within the normal scope of such IETF work (see Section 5 for further details). Section 7 of the current document defines an IANA registry for this purpose. In addition, interoperation between implementations of RPC-over-RDMA version 1 that present this message format to peers and those that do not recognize this message format is guaranteed.

2. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3. Advertised Transport Properties

3.1. Inline Threshold Size

Section 3.3.2 of [RFC8166] defines the term "inline threshold." An inline threshold is the maximum number of bytes that can be transmitted using one RDMA Send and one RDMA Receive. There is a pair of inline thresholds for a connection: a client-to-server threshold and a server-to-client threshold.

If an incoming RDMA message exceeds the size of a receiver's inline threshold, the receive operation fails and the RDMA provider typically terminates the connection. To convey an RPC message larger than the receiver's inline threshold without risking receive failure, a sender must use explicit RDMA data transfer operations, which are more expensive than an RDMA Send. See Sections 3.3 and 3.5 of [RFC8166] for a complete discussion.

The default value of inline thresholds for RPC-over-RDMA version 1 connections is 1024 bytes (as defined in Section 3.3.3 of [RFC8166]). This value is adequate for nearly all NFS version 3 procedures.
NFS version 4 COMPOUND operations [RFC7530] are larger on average than NFS version 3 procedures [RFC1813], forcing clients to use explicit RDMA operations for frequently-issued requests such as LOOKUP and GETATTR. The use of RPCSEC_GSS security also increases the average size of RPC messages, due to the larger size of RPCSEC_GSS credential material included in RPC headers [RFC7861].

If a sender and receiver could somehow agree on larger inline thresholds, frequently-used RPC transactions could avoid the cost of explicit RDMA operations.

3.2. Remote Invalidation

After an RDMA data transfer operation completes, an RDMA consumer can use remote invalidation to request that the remote peer RNIC invalidate an STag associated with the data transfer [RFC5042].

An RDMA consumer requests remote invalidation by posting an RDMA Send With Invalidate Work Request in place of an RDMA Send Work Request. Each RDMA Send With Invalidate carries one STag to invalidate. The receiver of an RDMA Send With Invalidate performs the requested invalidation and then reports that invalidation as part of the completion of a waiting Receive Work Request.

If both peers support remote invalidation, an RPC-over-RDMA responder might use remote invalidation when replying to an RPC request that provided chunks. Because one of the chunks has already been invalidated, finalizing the results of the RPC is made simpler and faster.

However, there are some important caveats that contraindicate the blanket use of remote invalidation:

o  Remote invalidation is not supported by all RNICs.

o  Not all RPC-over-RDMA responder implementations can generate RDMA Send With Invalidate Work Requests.

o  Not all RPC-over-RDMA requester implementations can recognize when remote invalidation has occurred.
o  On one connection in different RPC-over-RDMA transactions, or in a single RPC-over-RDMA transaction, an RPC-over-RDMA requester can expose a mixture of STags that may be invalidated remotely and some that must not be. No indication is provided at the RDMA layer as to which is which.

A responder therefore must not employ remote invalidation unless it is aware of support for it in its own RDMA stack, and on the requester. And, without altering the XDR structure of RPC-over-RDMA version 1 messages, it is not possible to support remote invalidation with requesters that mix STags that may and must not be invalidated remotely in a single RPC or on the same connection.

There are some NFS/RDMA client implementations whose STags are always safe to invalidate remotely. For such clients, indicating to the responder that remote invalidation is always safe can allow such invalidation without the need for additional protocol to be defined.

4. Private Data Message Format

With an InfiniBand lower layer, for example, RDMA connection setup uses a Connection Manager when establishing a Reliable Connection [IBA]. When an RPC-over-RDMA version 1 transport connection is established, the client (which actively establishes connections) and the server (which passively accepts connections) populate the CM Private Data field exchanged as part of CM connection establishment.

The transport properties exchanged via this mechanism are fixed for the life of the connection. Each new connection presents an opportunity for a fresh exchange. An implementation of the extension described in this document MUST be prepared for the settings to change upon a reconnection.

For RPC-over-RDMA version 1, the CM Private Data field is formatted as described below. RPC clients and servers use the same format.
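
As an illustration only, the following Python sketch shows one way a peer might populate the eight-octet message whose layout is defined in the remainder of this section. The names build_private_data, encode_size, and FLAG_REMOTE_INVAL are hypothetical and do not come from this document. The sketch reads bit 15 of the diagram as the least significant bit of the Flags octet (most significant bit first), and it rounds down a buffer size that is not a multiple of 1024; the rounding choice is an assumption, since this document does not address it.

```python
import struct

FORMAT_IDENTIFIER = 0xF6AB0E18  # fixed value from Section 4
FORMAT_VERSION = 1              # the only version this document defines
FLAG_REMOTE_INVAL = 0x01        # assumption: diagram bit 15 is the Flags LSB


def encode_size(nbytes):
    """Encode a buffer size as (size / 1024) - 1, per Section 5.2."""
    if not 1024 <= nbytes <= 256 * 1024:
        raise ValueError("size must be between 1 KB and 256 KB")
    # Assumption: round down, so a peer never advertises more than it has.
    return nbytes // 1024 - 1


def build_private_data(send_bytes, recv_bytes, remote_inval):
    """Return the eight octets to place in the CM Private Data field."""
    flags = FLAG_REMOTE_INVAL if remote_inval else 0
    # "!" selects network byte order: one 32-bit word, then four octets.
    return struct.pack("!IBBBB", FORMAT_IDENTIFIER, FORMAT_VERSION, flags,
                       encode_size(send_bytes), encode_size(recv_bytes))


message = build_private_data(4096, 4096, remote_inval=True)
print(message.hex())
```

For 4096-byte send and receive buffers with remote invalidation enabled, the resulting octets are f6 ab 0e 18 01 01 03 03.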
If the capacity of the Private Data field is too small to contain this message format, the underlying RDMA transport is not managed by a Connection Manager, or the underlying RDMA transport uses Private Data for its own purposes, the CM Private Data field cannot be used on behalf of RPC-over-RDMA version 1.

The first 8 octets of the CM Private Data field are to be formatted as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Format Identifier                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Version    |     Flags     |   Send Size   | Receive Size  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Format Identifier: This field contains a fixed 32-bit value that identifies the content of the Private Data field as an RPC-over-RDMA version 1 CM Private Data message. In RPC-over-RDMA version 1 Private Data, the value of this field is always 0xf6ab0e18, in network byte order. The use of this field is further expanded upon in Section 4.1.2.

Version: This 8-bit field contains a message format version number. The value "1" in this field indicates that exactly eight octets are present, that they appear in the order described in this section, and that each has the meaning defined in this section. Further considerations about the use of this field are discussed in Section 5.

Flags: This 8-bit field contains bit flags that indicate the support status of optional features, such as remote invalidation. The meaning of these flags is defined in Section 5.1.

Send Size: This 8-bit field contains an encoded value corresponding to the maximum number of bytes this peer is prepared to transmit in a single RDMA Send on this connection. The value is encoded as described in Section 5.2.
Receive Size: This 8-bit field contains an encoded value corresponding to the maximum number of bytes this peer is prepared to receive with a single RDMA Receive on this connection. The value is encoded as described in Section 5.2.

4.1. Interoperability Considerations

The extension described in this document is designed to allow RPC-over-RDMA version 1 implementations that use CM Private Data to interoperate fully with RPC-over-RDMA version 1 implementations that do not exchange this information. Implementations that use this extension must also interoperate fully with RDMA implementations that use CM Private Data for other purposes. Realizing these goals requires that implementations of this extension follow the practices described in the rest of this section.

4.1.1. Interoperability with RPC-over-RDMA Version 1 Implementations

When a peer does not receive a CM Private Data message that conforms to Section 4, it needs to act as if the remote peer supports only the default RPC-over-RDMA version 1 settings, as defined in [RFC8166]. In other words, the peer MUST behave as if a Private Data message was received in which bit 15 of the Flags field is zero, and both Size fields contain the value zero.

4.1.2. Interoperability Amongst RDMA Transports

The Format Identifier field defined in Section 4 is provided to enable implementations to distinguish RPC-over-RDMA version 1 Private Data from private data inserted at other layers, such as the private data inserted by the iWARP MPAv2 enhancement described in [RFC6581].

As part of connection establishment, the received private data buffer is searched for the Format Identifier word. The offset of the Format Identifier is not restricted to any alignment.
If the RPC-over-RDMA version 1 CM Private Data Format Identifier is not present, an RPC-over-RDMA version 1 receiver MUST behave as if no RPC-over-RDMA version 1 CM Private Data has been provided.

Once the RPC-over-RDMA version 1 CM Private Data Format Identifier is found, the receiver parses the subsequent octets as RPC-over-RDMA version 1 CM Private Data. As additional assurance that the private data content is valid RPC-over-RDMA version 1 CM Private Data, the receiver should check that the format version number field contains a valid and recognized version number, the size of the private data does not overrun the length of the buffer, and all reserved flag bits are zero.

5. Updating the Message Format

Although the message format described in this document provides the ability for the client and server to exchange particular information about the local RPC-over-RDMA implementation, it is possible that there will be a future need to exchange additional properties. This would make it necessary to extend or otherwise modify the format described in this document.

Any modification faces the problem of interoperating properly with implementations of RPC-over-RDMA version 1 that are unaware of the existence of the new format. These include implementations that do not recognize the exchange of CM Private Data as well as those that recognize only the format described in this document.

Given the message format described in this document, these interoperability constraints could be met by the following sorts of new message formats:

o  A format which uses a different value for the first four bytes of the format, as provided for in the registry described in Section 7.

o  A format which uses the same value for the Format Identifier field and a value other than one (1) in the Version field.
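
To make the receiver's side concrete, the following Python sketch combines the search and validation steps of Section 4.1.2 with the fallback of Section 4.1.1. It also shows how a receiver that recognizes only version 1 might treat both kinds of new format listed above: a different Format Identifier is simply not found, and an unrecognized Version value fails validation. The function and key names are hypothetical. Treating a failed validation the same as an absent message, and examining only the first occurrence of the Format Identifier, are assumptions of this sketch rather than requirements of this document.

```python
import struct

FORMAT_IDENTIFIER = struct.pack("!I", 0xF6AB0E18)  # network byte order
FLAG_REMOTE_INVAL = 0x01   # assumption: diagram bit 15 is the Flags LSB
RESERVED_FLAGS = 0xFE      # bits 14 - 8, always zero in version 1

# Section 4.1.1: flag clear and both Size fields zero (1024 bytes each).
DEFAULTS = {"remote_inval": False, "send_size": 1024, "recv_size": 1024}


def parse_private_data(buf):
    """Return the remote peer's advertised properties, or the defaults."""
    offset = buf.find(FORMAT_IDENTIFIER)       # any alignment is allowed
    if offset < 0 or offset + 8 > len(buf):    # absent, or would overrun
        return dict(DEFAULTS)
    version, flags, send, recv = struct.unpack_from("!BBBB", buf, offset + 4)
    if version != 1 or flags & RESERVED_FLAGS:
        return dict(DEFAULTS)
    return {
        "remote_inval": bool(flags & FLAG_REMOTE_INVAL),
        "send_size": (send + 1) * 1024,        # decode per Section 5.2
        "recv_size": (recv + 1) * 1024,
    }


def inline_threshold(local_send_size, remote):
    """Sender-side threshold: the smaller of our send and their receive."""
    return min(local_send_size, remote["recv_size"])


# Three octets of unrelated private data precede the message.
peer = parse_private_data(b"\x00\x00\x00" + bytes.fromhex("f6ab0e1801010303"))
print(peer, inline_threshold(8192, peer))
```

Returning a copy of the defaults on every failure path keeps the caller's logic uniform: it always receives a complete set of properties and never needs to distinguish a missing message from an unusable one.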
Although it is possible to reorganize the last three of the eight bytes in the existing format, extended formats are unlikely to do so. New formats would take the form of extensions of the format described in this document with added fields starting at byte eight of the format and changes to the definition of previously reserved flags.

5.1. Feature Support Flags

The bits in the Flags field are labeled from bit 8 to bit 15, as shown in the diagram above. When the Version field contains the value "1", the bits in the Flags field are to be set as follows:

Bit 15: When both connection peers have set this flag in their CM Private Data, the responder MAY use RDMA Send With Invalidate when transmitting RPC Replies. Each RDMA Send With Invalidate MUST invalidate an STag associated only with the XID in the rdma_xid field of the RPC-over-RDMA Transport Header it carries. When either peer on a connection clears this flag, the responder MUST use only RDMA Send when transmitting RPC Replies.

Bits 14 - 8: These bits are reserved and are always zero when the Version field contains 1.

5.2. Inline Threshold Values

Inline threshold sizes from 1 KB to 256 KB can be represented in the Send Size and Receive Size fields. A sender computes the encoded value by dividing the buffer size, in octets, by 1024 and subtracting one from the result. A receiver decodes this value by performing the inverse set of operations: it adds one to the encoded value and then multiplies that result by 1024. For example, a 4096-octet buffer is encoded as the value 3.

The client uses the smaller of its own send size and the server's reported receive size as the client-to-server inline threshold. The server uses the smaller of its own send size and the client's reported receive size as the server-to-client inline threshold.

6. Security Considerations

The reader is directed to the Security Considerations section of [RFC8166] for background and further discussion.
The RPC-over-RDMA version 1 protocol framework depends on the semantics of the Reliable Connected (RC) queue pair (QP) type, as defined in Section 9.7.7 of [IBA]. The integrity of CM Private Data and the authenticity of its source are ensured by the exclusive use of RC queue pairs. Any attempt to interfere with or hijack data in transit on an RC connection results in the RDMA provider terminating the connection.

Additional analysis of RDMA transport security appears in the Security Considerations section of [RFC5042]. That document recommends IPsec as the default transport layer security solution. When deployed with iWARP, IPsec establishes a protected channel before any iWARP operations are exchanged; thus, it protects the exchange of Private Data that occurs as each QP is established. However, IPsec is not available for InfiniBand or RoCE deployments. Those fabrics rely on physical security and cyclic redundancy checks to protect network traffic.

Improperly setting one of the fields in a version 1 Private Message can result in an increased risk of disconnection (i.e., self-imposed Denial of Service). There is no additional risk of exposing upper-layer payloads after exchanging the Private Message format defined in the current document.

In addition to describing the structure of a new format version, any document that extends the Private Data format described in the current document must discuss security considerations of new data items exchanged between connection peers.

7. IANA Considerations

In accordance with [RFC8126], the author requests that IANA create a new registry in the "Remote Direct Data Placement" Protocol Category Group. The new registry is to be called the "RDMA-CM Private Data Identifier Registry". This is a registry of 32-bit numbers that identify the upper-layer protocol associated with data that appears in the application-specific RDMA-CM Private Data area. The fields in this registry include: Format Identifier, Description, and Reference.
The initial contents of this registry are a single entry:

+------------------+------------------------------------+-----------+
| Format           | Format Description                 | Reference |
| Identifier       |                                    |           |
+------------------+------------------------------------+-----------+
| 0xf6ab0e18       | RPC-over-RDMA version 1 CM Private | [RFC-TBD] |
|                  | Data                               |           |
+------------------+------------------------------------+-----------+

Table 1: RDMA-CM Private Data Identifier Registry

IANA is to assign subsequent new entries in this registry using the Expert Review policy as defined in Section 4.5 of [RFC8126].

7.1. Guidance for Designated Experts

The Designated Expert (DE), appointed by the IESG, should ascertain the existence of suitable documentation that defines the semantics and format of the private data, and verify that the document is permanently and publicly available. Documentation produced outside the IETF must not conflict with work that is active or already published within the IETF. The new Reference field should contain a reference to that documentation.

The DE can assign new Format Identifiers at random as long as they do not conflict with existing entries in this registry. The Description field should contain the name of the RDMA consumer that will generate and use the private data.

The DE will post the request to the nfsv4 WG mailing list (or a successor to that list, if such a list exists), for comment and review. The DE will approve or deny the request and publish notice of the decision within 30 days.

8. References

8.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.

[RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. Garcia, "A Remote Direct Memory Access Protocol Specification", RFC 5040, DOI 10.17487/RFC5040, October 2007.

[RFC5042] Pinkerton, J. and E. Deleganes, "Direct Data Placement Protocol (DDP) / Remote Direct Memory Access Protocol (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October 2007.

[RFC8126] Cotton, M., Leiba, B., and T. Narten, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 8126, DOI 10.17487/RFC8126, June 2017.

[RFC8166] Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct Memory Access Transport for Remote Procedure Call Version 1", RFC 8166, DOI 10.17487/RFC8166, June 2017.

[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017.

8.2. Informative References

[IBA] InfiniBand Trade Association, "InfiniBand Architecture Specification Volume 1", Release 1.3, March 2015. Available from https://www.infinibandta.org/

[RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 Protocol Specification", RFC 1813, DOI 10.17487/RFC1813, June 1995.

[RFC5044] Culley, P., Elzur, U., Recio, R., Bailey, S., and J. Carrier, "Marker PDU Aligned Framing for TCP Specification", RFC 5044, DOI 10.17487/RFC5044, October 2007.

[RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, May 2009.

[RFC5666] Talpey, T. and B. Callaghan, "Remote Direct Memory Access Transport for Remote Procedure Call", RFC 5666, DOI 10.17487/RFC5666, January 2010.

[RFC6581] Kanevsky, A., Ed., Bestler, C., Ed., Sharp, R., and S. Wise, "Enhanced Remote Direct Memory Access (RDMA) Connection Establishment", RFC 6581, DOI 10.17487/RFC6581, April 2012.

[RFC7530] Haynes, T., Ed. and D. Noveck, Ed., "Network File System (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, March 2015.

[RFC7861] Adamson, A. and N. Williams, "Remote Procedure Call (RPC) Security Version 3", RFC 7861, DOI 10.17487/RFC7861, November 2016.

Acknowledgments

Thanks to Christoph Hellwig and Devesh Sharma for suggesting this approach, and to Tom Talpey and Dave Noveck for their expert comments and review. The author also wishes to thank Bill Baker and Greg Marsden for their support of this work. Also, thanks to expert reviewers Sean Hefty and Dave Minturn.

Special thanks go to document shepherd Brian Pawlowski, Transport Area Director Magnus Westerlund, NFSV4 Working Group Chairs David Noveck and Spencer Shepler, and NFSV4 Working Group Secretary Thomas Haynes.

Author's Address

Charles Lever
Oracle Corporation
United States of America

Email: chuck.lever@oracle.com