Network Working Group                                         R. Stewart
Internet Draft                                             Cisco Systems
Category: Internet Draft                                         D. Otis
                                                                SANlight
                                                        January 22, 2002

                        SCTP DDP/RDMA Adaptation
                draft-stewart-otis-sctp-ddp-rdma-00.txt

Status of this Memo

   This document is an internet-draft and is in full conformance with
   all provisions of Section 10 of [RFC2026].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   In some applications direct placement of data without the overhead
   of multiple copies is a desirable feature. To accomplish this goal,
   a direct placement adaptation layer is defined within this document.
   We propose a small shim that sits directly above SCTP and that
   possibly places data into a user buffer. The ultimate goal is to
   have placement performed by the network interface card, with this
   shim coordinating such placement while proper network layering is
   maintained. Because SCTP was not designed to handle offset-based
   fragmentation directly, the shim must fragment messages itself so
   that each fragment carries the proper offset. The shim must also
   determine completion notifications, because immediate placement
   requires unordered delivery.

Table of Contents

   1 Introduction
   1.1 Conventions
   2 Adaptation Layer Formats
   2.1 Adaptation Layer Indicator
   2.2 DATA chunk format
   3 Procedures
   3.1 Association Initialization
   3.2 DDP and RDMA Data Placement
   3.2.1 Receiver Side Behavior
   3.2.2 Sender Side Behavior
   4 IANA considerations
   5 Security Considerations
   6 Acknowledgments
   7 Authors' Addresses
   8 References

1 Introduction

   In some applications, the direct placement of data without the
   overhead of multiple copies is a desirable feature. To accomplish
   this goal, a direct placement adaptation layer is defined within
   this document. We propose a small shim sitting directly above SCTP
   that enables data to be directly placed into user buffers without
   assembly buffering. This assumes hardware that can validate each
   DATA chunk as it is received, prior to placement, and that each DATA
   chunk carries an offset within an identified user buffer. Some
   implementations may include this adaptation layer within their SCTP
   implementations to obtain maximum performance, but the behavior of
   SCTP will be unaffected.

   In order to accomplish this we specify the use of the new adaptation
   layer indication as defined in [STEWa].

1.1 Conventions

   The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD,
   SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL, when
   they appear in this document, are to be interpreted as described in
   [RFC2119].

   RDMA is a mnemonic for Remote Direct Memory Access.

   DDP is a mnemonic for Direct Data Placement.

2 Adaptation Layer Formats

2.1 Adaptation Layer Indicator

   Five separate adaptation layer indications are defined, one of which
   MAY appear in the INIT or INIT-ACK with the following format as
   defined in [STEWa].

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Type =0xC006           |      Length = Variable        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Adaptation Indication                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Adaptation Indication:

      The following five values are allowed and one of them MUST be
      present to enable specific behaviors defined in this document:

         DDP                  - 0x00000001
         DDP_WITH_TAG         - 0x00000002
         RDMA_ONLY            - 0x00000003
         RDMA_AND_NEGOTIATION - 0x00000004
         RDMA_NEGOTIATION     - 0x00000005

      If DDP is specified then completion semantics are limited to
      either individual messages or individual Streams. The Stream
      implies the user buffer, where length is the only limit. Note
      that in this mode the 'Placement Tag' field holds no meaning.

      If DDP_WITH_TAG is specified, then completion semantics are
      limited to individual messages as delineated by the Placement Tag
      or individual Streams. The Placement Tag implies the user buffer.

      If RDMA_ONLY is specified then the RDMA placement algorithms as
      specified in [NEW-DRAFT] MUST be used to place data directly into
      user buffers. Normally the 'Placement Tag' field is set to a
      value hashed in the receiver to a user buffer. The 'Placement
      Offset' is a byte offset into the user buffer. Please see
      [NEW-DRAFT] for the specific details.

      If RDMA_NEGOTIATION is specified then this association MUST NOT
      place any data, but instead is being used for RDMA negotiation as
      a separate SCTP association. The procedures for placement
      negotiation are defined in [NEW-DRAFT].

      If RDMA_AND_NEGOTIATION is specified then both RDMA placement and
      negotiation are being sent over this association.

2.2 DATA chunk format

   The following format MUST be used on all DATA chunks. Note that the
   format expands the existing DATA chunk so that direct placement
   fields are considered user data by the SCTP stack. In addition, to
   allow immediate placement, all DATA chunks are sent as Unordered.
   The shim is required to perform all message fragmentation before
   messages are delivered to SCTP, and SCTP is placed in a mode that
   refuses messages larger than the path MTU.

   Common DATA Chunk header:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Type = 0    | Reserved|U|B|E|            Length             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              TSN                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Stream Identifier S      |   Stream Sequence Number n    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Payload Protocol Identifier                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   DDP Header Extension:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                Placement Mode                 |     Flags     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Placement Tag                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                       Placement Offset                        +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   \                                                               \
   /                   Data (seq n of Stream S)                    /
   \                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   RDMA Header Extension:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                Placement Mode                 |     Flags     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Placement Tag                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                       Placement Offset                        +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                Association Wide Sequence (AWS)                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   \                                                               \
   /                   Data (seq n of Stream S)                    /
   \                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Note that the following fields are defined in [RFC2960], and the
   reader should refer to it for any details of these fields: Type,
   Reserved, U, B, E, Length, TSN, Stream Identifier, Stream Sequence
   Number, and Payload Protocol Identifier.

   In the case of DDP and RDMA, Type will always be 0, the U, B, and E
   flags MUST be set, the Length will indicate the unpadded length of
   the DATA chunk, the TSN will represent a unique value associated
   with the DATA chunk, the Stream Identifier will indicate the Stream
   on which the message was sent, the Stream Sequence Number is
   invalid, and the Payload Protocol Identifier will be determined by
   the layer above the shim. An exception occurs when the Placement
   Mode is NO_PLACEMENT_MODE: the U, B, and E flags may then take any
   value, and the Stream Sequence Number is valid if the U flag is not
   set.

   Placement Offset: 64 bits (unsigned integer)

      When the Placement Mode is not set to NO_PLACEMENT_MODE, this
      value holds the placement byte offset for this data. The local
      endpoint MUST verify the offset is within the valid range for the
      placement buffer.

   Placement Mode: 24 bits (unsigned integer)

      This field will hold one of the following values:

      0x000000 - NO_PLACEMENT_MODE. No RDMA or DDP placement is
                 active. This indicates that the Offset is not
                 specified and the data in this chunk is not to be
                 directly placed.
                 If the adaptation layer indication was either
                 RDMA_AND_NEGOTIATION or RDMA_NEGOTIATION, this mode
                 indicates the DATA chunk contains negotiation
                 information. This mode MUST NOT be used if the
                 adaptation layer indication was RDMA_ONLY.

      0x000001 - DDP_MODE. The Direct Data Placement mode is active
                 and the associated header extension is valid. This
                 mode MUST only be used when the adaptation layer
                 indication was DDP. In this mode, the Placement Tag
                 field MAY contain information used by the ULP above
                 the shim.

      0x000002 - DDP_WITH_TAG_MODE. The Direct Data Placement mode is
                 active and the associated header extension is valid.
                 This mode MUST only be used when the adaptation layer
                 indication was DDP_WITH_TAG.

      0x000003 - RDMA_PLACEMENT_MODE. The RDMA placement mode is
                 active and the associated header extension is valid.
                 This mode MUST only be used when the adaptation layer
                 indication was RDMA_ONLY or RDMA_AND_NEGOTIATION.

   Flags: 8 bits (unsigned integer)

      Bit 0    - Acknowledgement Requested. A signal provided to the
                 ULP above the shim to indicate an Acknowledgement was
                 requested.

      Bit 1    - Notification upon Completion. A signal provided to
                 the ULP above the shim to indicate completion of this
                 placement.

      Bit 2    - Release Tag upon Completion. A signal provided to the
                 ULP that the current Tag is to be invalidated.

      Bits 3-7 - Reserved.

      For DDP modes, signals are held until the cumulative TSN is
      greater than or equal to the TSN of the DATA chunk carrying the
      signal flag. For RDMA modes, signals are held until the
      cumulative AWS is greater than or equal to the AWS of the DATA
      chunk carrying the signal flag.

      Comparisons and arithmetic on TSNs or AWS in this document SHOULD
      use Serial Number Arithmetic as defined in [RFC1982] where
      SERIAL_BITS = 32.

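      The signal-holding rule above can be illustrated with a minimal
      sketch. It is not part of the specification: the function names
      (serial_lt, serial_ge, releasable_signals) and the representation
      of held signals as (sequence, flags) tuples are assumptions made
      only for this example. The comparison follows the [RFC1982]
      definition with SERIAL_BITS = 32, so it remains correct when the
      TSN or AWS wraps around.

```python
SERIAL_BITS = 32
MODULUS = 1 << SERIAL_BITS
HALF = 1 << (SERIAL_BITS - 1)


def serial_lt(a: int, b: int) -> bool:
    """Return True if a precedes b in RFC 1982 serial order.

    RFC 1982 leaves the result undefined when the two values are
    exactly 2**(SERIAL_BITS - 1) apart; this sketch returns False
    in that case.
    """
    return a != b and ((b - a) % MODULUS) < HALF


def serial_ge(a: int, b: int) -> bool:
    """Return True if a equals b or follows b in serial order."""
    return a == b or serial_lt(b, a)


def releasable_signals(pending, cumulative):
    """Split held signals into (ready, still_held).

    pending    -- list of (sequence, flags) tuples, where sequence is
                  the TSN (DDP modes) or AWS (RDMA modes) of the DATA
                  chunk that carried the signal flags (hypothetical
                  representation)
    cumulative -- the current cumulative TSN or AWS
    """
    ready = [p for p in pending if serial_ge(cumulative, p[0])]
    held = [p for p in pending if not serial_ge(cumulative, p[0])]
    return ready, held
```

      For example, with a cumulative value of 1 just after wraparound,
      a signal carried at sequence 0xFFFFFFFE is released, while a
      signal carried at sequence 3 remains held.
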
   Placement Tag: 32 bits (unsigned integer)

      When the Placement Mode is set to NO_PLACEMENT_MODE or DDP_MODE,
      this field may hold signal information used by the ULP;
      otherwise, it holds the pre-negotiated placement tag. This tag
      should be used to look up the actual buffer address, limits, and
      restrictions in the local endpoint's tag lookup cache.

   Association Wide Sequence: 32 bits (unsigned integer)

      This value is incremented for each chunk sent to SCTP. This value
      is used to resolve the relative sequence of messages between
      Streams irrespective of the order in which they were sent by
      SCTP.

3 Procedures

3.1 Association Initialization

   At the startup of an association, an endpoint wishing to perform DDP
   or RDMA placement MUST include an adaptation layer indication in its
   INIT or INIT-ACK (as defined in 2.1). After the exchange of the
   first two messages (INIT and INIT-ACK), an endpoint MUST verify that
   the peer supports the mode by confirming that the peer included one
   of the adaptation indications.

   If the peer did specify a DDP or RDMA adaptation, then ALL DATA
   chunks MUST contain the header extensions specified in section 2.2
   and the endpoint SHOULD enable the indicated adaptation.

   If the peer endpoint did NOT specify a DDP or RDMA placement
   adaptation then the local endpoint MUST disable DDP and RDMA
   adaptation and it MUST NOT send DATA chunks with the additional
   fields as specified in section 2.2.

3.2 DDP and RDMA Data Placement

3.2.1 Receiver Side Behavior

   When a DATA chunk arrives and DDP or RDMA Placement adaptation has
   been enabled, the following procedures MUST be performed.

   R1 - If the Placement Mode is set to NO_PLACEMENT_MODE and the peer
        endpoint indicated RDMA_ONLY in its adaptation indication, the
        endpoint MUST abort the association.

   R2 - If the Placement Mode is set to DDP_MODE and the peer endpoint
        did not indicate DDP in its adaptation indication, the endpoint
        MUST abort the association.

   R3 - If the Placement Mode is set to DDP_WITH_TAG_MODE and the peer
        endpoint did not indicate DDP_WITH_TAG in its adaptation
        indication, the endpoint MUST abort the association.

   R4 - If the Placement Mode is set to RDMA_PLACEMENT_MODE and the
        peer endpoint did not indicate RDMA_ONLY or
        RDMA_AND_NEGOTIATION in its adaptation indication, the endpoint
        MUST abort the association.

   R5 - If the Placement Mode is set to a recognized mode other than
        NO_PLACEMENT_MODE, the endpoint MUST use its placement cache to
        determine the data buffer to receive the payload of this DATA
        chunk. For modes using a Placement Tag, this field SHOULD be
        used to obtain buffer related information. The buffer SHOULD be
        indexed by Placement Offset and the data SHOULD be directly
        placed within the user buffer.

        Note: Great caution must be taken when referencing offsets to
        memory addresses. The Placement Tag SHOULD NOT be a direct
        memory address but instead an index to be translated into a
        memory address, memory limits, and read/write restrictions. The
        Placement Offset must be carefully verified to assure that the
        Offset is within the valid range of the indicated buffer. If
        any data placement specification is incorrect the association
        SHOULD be aborted.

   R6 - Otherwise, if the Placement Mode is set to NO_PLACEMENT_MODE,
        the endpoint MUST pass the message to its adaptation
        negotiation layer and process it as specified in [NEW-DRAFT].

3.2.2 Sender Side Behavior

   The sender of a message MUST always include an extension header if a
   DDP or RDMA adaptation is enabled. The sender MUST perform the
   following when sending data:

   S1 - If RDMA_ONLY was specified by the sender in its adaptation
        indication it MUST NOT set the Placement Mode to the value of
        NO_PLACEMENT_MODE.

   S2 - If RDMA_NEGOTIATION was specified by the sender in its
        adaptation indication it MUST set the Placement Mode to the
        value of NO_PLACEMENT_MODE.

   S3 - If the user message to be sent is to be directly placed and the
        Tag and Offset are known, the Placement Mode SHOULD be set to
        the appropriate placement mode and the Tag, Offset and flags
        SHOULD be placed into the appropriate fields in the outgoing
        DATA chunk. For messages that must be fragmented by the shim,
        only the last DATA chunk of the message will include the flag
        values, and the Placement Offset of each subsequent fragment is
        advanced by the sum of the sizes of all previous fragments.

   S4 - If the user message to be sent is NOT to be directly placed
        (such as a message for negotiation as specified in [NEW-DRAFT]
        or a non-placed data message) the sender MUST specify the value
        of NO_PLACEMENT_MODE in the Placement Mode and set the Offset
        field to the value of 0. The Tag field may contain information
        used by the ULP.

   [Editor's note: To save space, the Placement Mode and Flag field
   could be placed within the Payload Protocol Identifier field and the
   Placement Offset and Placement Tag could exchange position to still
   allow 64 bit alignment.]

4 IANA considerations

   This document defines five new Adaptation Layer Indications as
   specified within section 2.1.

5 Security Considerations

   Any direct placement of memory poses a significant security risk.
   Great caution must be taken when referencing offsets to memory
   addresses on behalf of peer endpoints. The Placement Tag SHOULD NOT
   be a direct memory address passed to a peer but instead an index to
   be translated into a memory address. The Placement Offset must be
   carefully verified to assure that the Offset is within a valid range
   of the buffer. If any data placement specification is incorrect the
   association SHOULD be aborted.

6 Acknowledgments

   The authors would like to thank the following people who have
   provided comments and input: Stephen Bailey and Allyn Romanow.

7 Authors' Addresses

   Randall R. Stewart
   24 Burning Bush Trail.
   Crystal Lake, IL 60012
   USA
   EMail: rrs@cisco.com

   Douglas Otis
   800 E. Middlefield
   Mountain View, CA 94043
   USA
   EMail: dotis@sanlight.net

8 References

   [RFC1982]   Elz, R. and R. Bush, "Serial Number Arithmetic",
               RFC 1982, August 1996.

   [RFC2026]   Bradner, S., "The Internet Standards Process --
               Revision 3", BCP 9, RFC 2026, October 1996.

   [RFC2119]   Bradner, S., "Key words for use in RFCs to Indicate
               Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2960]   Stewart, R., Xie, Q., Morneault, K., Sharp, C.,
               Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M.,
               Zhang, L., and V. Paxson, "Stream Control Transmission
               Protocol", RFC 2960, October 2000.

   [STEWa]     Stewart, Ramalho, Xie, Tuexen, Rytina, Conrad, "SCTP
               Extensions for Dynamic Reconfiguration of IP Addresses",
               November 2001, draft-ietf-tsvwg-addip-sctp-03.txt,
               work-in-progress.

   [NEW-DRAFT] A new draft to do the placement negotiation?

Full Copyright Statement

   Copyright (C) The Internet Society (2002). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

   Funding for the RFC Editor function is currently provided by the
   Internet Society.