Internet Engineering Task Force
INTERNET-DRAFT                                   Eddie Kohler
draft-kohler-dcp-00.txt                          Mark Handley
                                                 Sally Floyd
                                                 Jitendra Padhye
                                                 ACIRI
                                                 13 July 2001
Expires: January 2002


                   Datagram Control Protocol (DCP)


Status of this Document

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of [RFC 2026]. Internet-Drafts are
   working documents of the Internet Engineering Task Force (IETF),
   its areas, and its working groups. Note that other groups may also
   distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This document specifies the Datagram Control Protocol (DCP), which
   implements a congestion-controlled, unreliable flow of datagrams
   suitable for use by applications such as streaming media.

Kohler/Handley/Floyd/Padhye                                   [Page 1]
INTERNET-DRAFT          Expires: January 2002                July 2001

Table of Contents

   1. Introduction
   2. Concepts and Terminology
      2.1. Anatomy of a DCP Connection
      2.2. Congestion Control
      2.3. Connection Initiation and Termination
      2.4. Features
   3. DCP Packets
      3.1. Examples of DCP Congestion Control
         3.1.1. DCP with TCP-like Congestion Control
         3.1.2. DCP with TFRC Congestion Control
      3.2. DCP Generic Packet Header
      3.3. DCP-Request Packet Format
      3.4. DCP-Response Packet Format
      3.5. DCP-Data, DCP-Ack, and DCP-DataAck Packet Formats
      3.6. DCP-CloseReq and DCP-Close Packet Format
      3.7. DCP-Reset Packet Format
   4. Options and Features
      4.1. Padding Option
      4.2. Ignored Option
      4.3. Feature Negotiation
         4.3.1. Feature Numbers
         4.3.2. Ask Option
         4.3.3. Choose Option
         4.3.4. Answer Option
         4.3.5. Example Negotiations
         4.3.6. Unknown Features
         4.3.7. State Diagram
      4.4. Data Discarded Option
      4.5. Init Cookie Option
      4.6. Timestamp Option
      4.7. Timestamp Echo Option
   5. Congestion Control IDs
      5.1. Single-Window Congestion Control
      5.2. Unspecified Sender-Based Congestion Control
      5.3. TCP-like Congestion Control
      5.4. TFRC Congestion Control
      5.5. CCID-Specific Options and Features
   6. Acknowledgements
      6.1. Acknowledgements and CCIDs
      6.2. Ack Piggybacking
      6.3. Ack Ratio Feature
      6.4. Use Ack Vector Feature
      6.5. Ack Vector Options
         6.5.1. Ack Vector Consistency
         6.5.2. Ack Vector Coverage
      6.6. Receive Buffer Drops Option
      6.7. Ack Vector Implementation Notes
         6.7.1. New Packets
         6.7.2. Sending Acknowledgements
         6.7.3. Clearing State
         6.7.4. Processing Acknowledgements
   7. Explicit Congestion Notification
      7.1. ECN Capable Feature
      7.2. ECN Nonces
   8. Path MTU Discovery
   9. Abstract API
   10. DCP and the Congestion Manager
   11. DCP and RTP
   12. Security Considerations
   13. IANA Considerations
   14. Thanks
   15. References
   16. Authors' Addresses

1. Introduction

   This document specifies the Datagram Control Protocol (DCP). DCP
   provides the following features:

   o  An unreliable flow of datagrams, with acknowledgements.

   o  A reliable handshake for connection setup and teardown.

   o  Reliable negotiation of options, including negotiation of a
      suitable congestion control mechanism.

   o  Mechanisms allowing a server to avoid holding any state for
      unacknowledged connection attempts or already-finished
      connections.

   o  An optional mechanism that allows the sender to know, with high
      reliability, which packets reached the receiver.

   o  Congestion control incorporating Explicit Congestion
      Notification (ECN) and the ECN Nonce, as per [RFC 2481] and
      [WES01].

   o  Path MTU discovery, as per [RFC 1191].

   DCP is intended for applications that require the flow-based
   semantics of TCP, but which do not want TCP's in-order delivery and
   reliability semantics, or which would like different congestion
   control dynamics than TCP. Similarly, DCP is intended for
   applications that do not require the features of SCTP [RFC 2960]
   such as sequenced delivery within multiple streams.

   The sort of applications which could make use of DCP are those
   which have timing constraints on the delivery of data, such that
   reliable in-order delivery, when combined with congestion control,
   is likely to result in some information arriving at the receiver
   after it is no longer of use. Such applications might include
   streaming media and Internet telephony.

   To date most such applications have used either TCP, with the
   problems described above, or used UDP and implemented their own
   congestion control mechanisms (or no congestion control at all).
   The purpose of DCP is to provide a standard way to implement
   congestion control and congestion control negotiation for such
   applications. One of the motivations for DCP is to enable the use
   of ECN, along with conformant end-to-end congestion control, for
   applications that otherwise would be using UDP. In addition, DCP
   implements reliable connection setup, teardown, and feature
   negotiation.

   A DCP connection contains acknowledgement traffic as well as data
   traffic. Acknowledgements inform a sender whether its packets
   arrived, and whether they were ECN marked. Acks are transmitted as
   reliably as the congestion control mechanism in use requires,
   possibly up to completely reliably.

2. Concepts and Terminology

2.1. Anatomy of a DCP Connection

   Each DCP connection runs between two endpoints, which we often name
   DCP A and DCP B. Data may pass over the connection in either or
   both directions. The DCP connection between DCP A and DCP B
   consists of four sets of packets, as follows:

   (1) Data packets from DCP A to DCP B.

   (2) Acknowledgements from DCP B to DCP A.

   (3) Data packets from DCP B to DCP A.

   (4) Acknowledgements from DCP A to DCP B.

   We use the following terms to refer to subsets and endpoints of a
   DCP connection.

   Subflows
      A subflow consists of either data or acknowledgement packets,
      sent in one direction (from DCP A to DCP B, say). Each of the
      four sets of packets above is a subflow. (Subflows may overlap
      to some extent, since acknowledgements may be piggybacked on
      data packets.)

   Sequences
      A sequence consists of all packets sent in one direction,
      regardless of whether they are data or acknowledgements. The
      sets 1+4 and 2+3, from above, are each sequences. Each packet on
      a sequence has a different sequence number.

   Half-connections
      A half-connection consists of the data packets sent in one
      direction, plus the corresponding acknowledgements. The sets 1+2
      and 3+4, from above, are each half-connections. Half-connections
      are named after the direction of data flow, so the A-to-B
      half-connection contains the data packets from A to B and the
      acknowledgements from B to A.

   HC-Sender and HC-Receiver
      In the context of a single half-connection, the HC-Sender is the
      endpoint sending data, while the HC-Receiver is the endpoint
      sending acknowledgements. For example, in the A-to-B
      half-connection, DCP A is the HC-Sender and DCP B is the
      HC-Receiver.

2.2. Congestion Control

   Each half-connection is managed by a congestion control mechanism.
   The endpoints negotiate these mechanisms at connection setup; the
   mechanisms for the two half-connections need not be the same, but
   they must both be TCP-compatible.

   Conformant congestion control mechanisms correspond to single-byte
   congestion control identifiers, or CCIDs. The CCID for a
   half-connection describes how the HC-Sender limits data packet
   rates in a TCP-friendly manner; how it maintains necessary
   parameters, such as congestion windows; how the HC-Receiver sends
   congestion feedback via acknowledgements; and how it manages the
   acknowledgement rate. Section 5 introduces the currently allocated
   CCIDs, which are defined in separate profile documents.

   The special CCID 0, Single-Window Congestion Control [CCID 0
   PROFILE], is reserved for half-connections containing at most an
   initial window's worth of data. (The initial window is defined as
   in TCP; it is currently 2 packets.) This is useful for scenarios
   such as broadcast media, where all data travels from a "server" to
   a "client". If the client-to-server half-connection uses CCID 0,
   the server may use a simplified DCP implementation -- for instance,
   it need not keep lots of information about acknowledgements. We
   have not yet determined whether CCID 0 should reliably transmit
   this initial window of packets.

2.3. Connection Initiation and Termination

   Every DCP connection is actively initiated by one DCP, which
   connects to a DCP socket in the passive listening state. We refer
   to the active endpoint as "the client" and the passive endpoint as
   "the server". Most of the DCP specification is indifferent to
   whether a DCP is client or server. However, only the server may
   generate a DCP-CloseReq packet. (A DCP-CloseReq packet forces the
   receiving DCP to close the connection and maintain connection state
   for a reasonable time, allowing old segments to clear the network.)
   This means that the client cannot force the server to maintain
   connection state after the connection is closed.
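
   The role restriction described above can be illustrated with a
   small sketch. It is not part of the protocol definition: the names
   PacketType, CLIENT, SERVER, and may_send are hypothetical, the
   numeric type codes are those listed in Section 3.2, and the rules
   for DCP-Request and DCP-Response simply restate that the client
   initiates the connection and the server responds.

```python
from enum import IntEnum


class PacketType(IntEnum):
    """DCP packet type codes as listed in Section 3.2."""
    REQUEST = 0
    RESPONSE = 1
    DATA = 2
    ACK = 3
    DATAACK = 4
    CLOSEREQ = 5
    CLOSE = 6
    RESET = 7


# Hypothetical role labels used only in this sketch.
CLIENT = "client"
SERVER = "server"

# Packet types that only one role may generate. Every other type may be
# sent by either endpoint.
_RESTRICTED_TO = {
    PacketType.REQUEST: CLIENT,    # the active endpoint initiates
    PacketType.RESPONSE: SERVER,   # the passive endpoint responds
    PacketType.CLOSEREQ: SERVER,   # only the server may ask for a close
}


def may_send(role: str, ptype: PacketType) -> bool:
    """Return True if an endpoint in `role` may generate `ptype`."""
    required = _RESTRICTED_TO.get(ptype)
    return required is None or required == role
```

   With these definitions, may_send(CLIENT, PacketType.CLOSEREQ) is
   False while may_send(CLIENT, PacketType.CLOSE) is True, which is
   what keeps the TimeWait burden on the client.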

   DCP does not support TCP-style simultaneous open. In particular, a
   host MUST NOT respond to a DCP-Request packet with a DCP-Response
   packet unless the destination port specified in the DCP-Request
   corresponds to a local socket opened for listening.

   DCP also does not support half-open connections. That is, DCP shuts
   down both half-connections as a unit.

2.4. Features

   DCP uses a generic mechanism to negotiate connection properties,
   such as the CCIDs active on the two half-connections. These
   properties are called features. (We reserve the term "option" for a
   collection of bytes in some DCP header.) A feature name, such as
   "CCID", generally corresponds to two features on a connection, one
   per endpoint (or, equivalently, one per half-connection). For
   instance, there are two CCIDs per connection. The endpoint in
   charge of a particular feature is called its feature location.

   The Ask, Choose, and Answer options negotiate feature values. Ask
   is sent to a feature location, asking it to change its value for
   the feature. The feature location may respond with Choose, which
   asks the other endpoint to Ask again with different values, or it
   may change the feature value and acknowledge the request with
   Answer. Retransmissions make feature negotiation reliable. Section
   4.3 describes these options further.

3. DCP Packets

   DCP has eight different packet types:

   o  DCP-Request

   o  DCP-Response

   o  DCP-Data

   o  DCP-Ack

   o  DCP-DataAck

   o  DCP-CloseReq

   o  DCP-Close

   o  DCP-Reset

   The progress of a typical DCP connection is as follows.

   (1) The client sends the server a DCP-Request packet specifying the
       client and server ports, the service that is being requested,
       and any features that are being negotiated, including the CCID
       that the client would like the server to use.
       The client MAY optionally piggyback some data on the
       DCP-Request packet -- an application-level request, say --
       which the server MAY ignore.

   (2) The server sends the client a DCP-Response packet indicating
       that it is willing to communicate with the client. The response
       indicates any features and options that the server agrees to,
       whether an application request in the DCP-Request was actually
       passed to the application, and optionally an Init Cookie that
       wraps up all this information and which MUST be returned by the
       client for the connection to complete.

   (3) The client sends the server a DCP-Ack packet that acknowledges
       the DCP-Response packet. This acknowledges the server's initial
       sequence number and returns the Init Cookie if there was one in
       the DCP-Response. It may also continue feature negotiation.

   (4) Next comes zero or more DCP-Ack exchanges as required to
       finalize feature negotiation. The client may piggyback an
       application-level request on its final ack, producing a
       DCP-DataAck packet.

   (5) The server and client then exchange DCP-Data packets, DCP-Ack
       packets acknowledging that data, and, optionally, DCP-DataAck
       packets containing piggybacked data and acknowledgements. If
       the client has no data to send, then the server will send
       DCP-Data and DCP-DataAck packets, while the client will send
       DCP-Acks exclusively.

   (6) The server sends a DCP-CloseReq packet requesting a close.

   (7) The client sends a DCP-Close packet acknowledging the close.

   (8) The server sends a DCP-Reset packet and clears its connection
       state.

   (9) The client receives the DCP-Reset packet and holds state for a
       reasonable interval of time to allow any remaining packets to
       clear the network.

   An alternative connection closedown sequence is initiated by the
   client:

   (6) The client sends a DCP-Close packet closing the connection.

   (7) The server sends a DCP-Reset packet and clears its connection
       state.

   (8) The client receives the DCP-Reset packet and holds state for a
       reasonable interval of time to allow any remaining packets to
       clear the network.

   This arrangement of setup and teardown handshakes permits the
   server to decline to hold any state until the handshake with the
   client has completed, and ensures that the client must hold the
   TimeWait state at connection closedown.

3.1. Examples of DCP Congestion Control

   Before giving the detailed specifications of DCP, we first give two
   examples of DCP congestion control in operation.

3.1.1. DCP with TCP-like Congestion Control

   The first example is of a connection where both half-connections
   use TCP-like Congestion Control, specified by CCID 2 [CCID 2
   PROFILE]. In this example, the client sends an application-level
   request to the server, and the server responds with a stream of
   data packets. This example is of a connection using ECN.

   (1) The client sends the DCP-Request, which includes an Ask option
       asking the server to use CCID 2 for the server's data packets,
       and a Choose option informing the server that the client would
       like to use CCID 2 for its data packets.

   (2) The server sends a DCP-Response, including an Answer option
       indicating that the server agrees to use CCID 2 for its data
       packets, and an Ask option indicating that the server agrees to
       the client's suggestion of CCID 2 for the client's data
       packets.

   (3) The client responds with a DCP-DataAck acknowledging the
       server's initial sequence number, and including an Answer
       option finalizing the negotiation of the client-to-server CCID,
       and an application-level request for data. We will not discuss
       the client-to-server half-connection further in this example.

   (4) The server sends DCP-Data packets, where the number of packets
       sent is governed by a congestion window cwnd, as in TCP. The
       details of the congestion window are defined in the profile for
       CCID 2, which is a separate document [CCID 2 PROFILE].
       The server also sends Ack Ratio feature options specifying the
       number of server data packets to be covered by an Ack packet
       from the client.

       Some of these data packets are DCP-DataAck packets
       acknowledging data and/or ack packets from the client.

   (5) The client sends a DCP-Ack packet acknowledging the data
       packets for every Ack Ratio data packets transmitted by the
       server. Each DCP-Ack packet uses a sequence number and contains
       an Ack Vector, as defined in Section 6 on Acknowledgements.
       These packets also include Answer options answering any Ack
       Ratio requests from the server.

   (6) The server continues sending DCP-Data packets as controlled by
       the congestion window. Upon receiving DCP-Ack packets, the
       server examines the Ack Vector to learn about marked or dropped
       data packets, and adjusts its congestion window accordingly, as
       described in [CCID 2 PROFILE]. Because this is unreliable
       transfer, the server does not retransmit dropped packets.

   (7) Because DCP-Ack packets use sequence numbers, the server has
       direct information about the fraction of lost or marked DCP-Ack
       packets. The server responds to lost or marked DCP-Ack packets
       by modifying the Ack Ratio sent to the client, as described in
       [CCID 2 PROFILE].

   (8) The server estimates round-trip times and calculates a TimeOut
       (TO) value much as the RTO (Retransmit Timeout) is calculated
       in TCP. Again, the specification for this is in [CCID 2
       PROFILE]. The TO is used to determine when a new DCP-Data
       packet can be transmitted when the server has been limited by
       the congestion window and no feedback has been received from
       the client.

   (9) Each DCP-Data, DCP-DataAck, and DCP-Ack packet is sent as
       ECN-Capable, with either the ECT(0) or the ECT(1) codepoint
       set, as described in [WES01]. The client echoes the accumulated
       ECN Nonce for the server's packets along with its Ack Vector
       options.

   (10) The DCP-CloseReq, DCP-Close, and DCP-Reset packets to close
        the connection are as in the example above.

3.1.2. DCP with TFRC Congestion Control

   This example is of a connection where both half-connections use
   TFRC Congestion Control, specified by CCID 3. The specification for
   CCID 3 is in a separate profile [CCID 3 PROFILE]; the purpose of
   this example is to illustrate the range of uses for DCP.

   (1) The DCP-Request and DCP-Response packets specifying the use of
       CCID 3 and the initial DCP-DataAck packet are similar to those
       in the TCP-like example above.

   (2) The server sends DCP-Data packets, where the number of packets
       sent is governed by an allowed transmit rate, as in TFRC. The
       details of the allowed transmit rate are defined in the profile
       for CCID 3, which is a separate document [CCID 3 PROFILE]. Each
       DCP-Data packet has a sequence number, a timestamp, the
       server's estimate of the round-trip time, and the current
       sending rate.

       Some of these data packets are DCP-DataAck packets
       acknowledging data and/or ack packets from the client, but for
       simplicity we will not discuss the half-connection of data from
       the client to the server in this example.

   (3) The client sends DCP-Ack packets at most once per round-trip
       time, or as indicated by the Ack Ratio, acknowledging the data
       packets. These acknowledgements may be piggybacked on data
       packets, producing DCP-DataAck packets. Each DCP-Ack packet
       uses a sequence number and identifies the most recent packet
       received from the server, a timestamp, and feedback about the
       loss event rate calculated by the client, as specified by [CCID
       3 PROFILE].

   (4) The server continues sending DCP-Data packets as controlled by
       the allowed transmit rate. Upon receiving DCP-Ack packets, the
       server updates its allowed transmit rate as specified by [CCID
       3 PROFILE].

   (5) The server estimates round-trip times and calculates a TimeOut
       (TO) value much as the RTO (Retransmit Timeout) is calculated
       in TCP. Again, the specification for this is in [CCID 3
       PROFILE].

   (6) The use of ECN follows TCP-like Congestion Control, above, and
       is described further in [CCID 3 PROFILE].

   (7) The DCP-CloseReq, DCP-Close, and DCP-Reset packets to close the
       connection are as in the examples above.

3.2. DCP Generic Packet Header

   All DCP packets begin with a generic DCP packet header:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |           Dest Port           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type  |  Res  |                Sequence Number                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data Offset  | # NDP | Cslen |           Checksum            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Source and Destination Ports: 16 bits each
      These fields identify the connection. Packets sent on the other
      sequence switch the source and destination port values.

   Type: 4 bits
      The type field specifies the type of the DCP message. The
      following values are defined:

      0   DCP-Request packet.
      1   DCP-Response packet.
      2   DCP-Data packet.
      3   DCP-Ack packet.
      4   DCP-DataAck packet.
      5   DCP-CloseReq packet.
      6   DCP-Close packet.
      7   DCP-Reset packet.

   Reserved (Res): 4 bits
      This field is reserved for future expansion. The version of DCP
      specified here MUST set the field to all zeroes on generated
      packets, and ignore its value on received packets.

   Sequence Number: 24 bits
      The sequence number field is initialized by a DCP-Request or
      DCP-Response packet, and increases by one (modulo 16777216) with
      every packet sent. The receiver uses this information to
      determine whether packet losses have occurred. Even packets
      containing no data update the sequence number.

   Data Offset: 8 bits
      The offset from the start of the DCP header to the beginning of
      the packet's payload, measured in 32-bit words.

   Number of Non-Data Packets (# NDP): 4 bits
      DCP sets this field to the number of non-data packets it has
      sent so far on its sequence, modulo 16. A non-data packet is
      simply any packet not containing user data; DCP-Ack packets are
      the canonical example. When sending a non-data packet, DCP
      increments the # NDP counter before storing its value in the
      packet header.

      This field can help the receiving DCP decide whether a lost
      packet contained any user data. (An application may want to know
      when it has lost data. DCP could report every packet loss as a
      potential data loss, but that would cause false loss reports
      when non-data packets were lost.) For example, say that packet
      10 had # NDP set to 5; packet 11 was lost; and packet 12 had
      # NDP set to 5. Then the receiving DCP could deduce that packet
      11 contained data, since # NDP did not change. Likewise, if
      # NDP had gone up to 6 (and packets 10 and 12 contained user
      data), then packet 11 must not have contained any data.

   Checksum Length (Cslen): 4 bits
      The checksum length field specifies how much of the packet (in
      32-bit words) following the DCP Options is covered by the
      checksum. If this field is 15, the entire packet is covered by
      the checksum. If this field is zero, only the DCP header and
      options are covered by the checksum. By setting the checksum
      length field to a value other than 15, a sender specifies that
      corruption is acceptable in some of the DCP packet's payload,
      and that partially corrupted data packets may be received and
      counted for congestion control purposes. For this field to be
      meaningful when set to a value other than 15, the link-layer
      must also support selective CRC mechanisms.
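
   The header layout and the # NDP reasoning above can be sketched in
   a few lines of Python. This is an illustration under stated
   assumptions, not a reference implementation: the names
   GenericHeader, pack_header, parse_header, and lost_packet_had_data
   are hypothetical; fields are assumed to be packed in network byte
   order with bit 0 of the diagram as the most significant bit; and
   the loss check handles only the single-lost-packet case used in
   the example.

```python
import struct
from typing import NamedTuple

# Two 16-bit ports, one 32-bit word (Type | Res | Sequence Number),
# Data Offset, one octet (# NDP | Cslen), and a 16-bit Checksum.
HEADER_FORMAT = "!HHIBBH"
HEADER_LEN = struct.calcsize(HEADER_FORMAT)  # 12 octets


class GenericHeader(NamedTuple):
    source_port: int
    dest_port: int
    type: int          # 4 bits
    reserved: int      # 4 bits, zero on generated packets
    seq: int           # 24 bits
    data_offset: int   # in 32-bit words
    ndp: int           # 4 bits
    cslen: int         # 4 bits
    checksum: int      # 16 bits, carried as-is (not verified here)


def pack_header(h: GenericHeader) -> bytes:
    """Serialize the generic header into 12 octets."""
    word2 = ((h.type & 0xF) << 28) | ((h.reserved & 0xF) << 24) | (h.seq & 0xFFFFFF)
    ndp_cslen = ((h.ndp & 0xF) << 4) | (h.cslen & 0xF)
    return struct.pack(HEADER_FORMAT, h.source_port, h.dest_port, word2,
                       h.data_offset, ndp_cslen, h.checksum)


def parse_header(packet: bytes) -> GenericHeader:
    """Extract the generic header from the first 12 octets of a packet."""
    if len(packet) < HEADER_LEN:
        raise ValueError("packet shorter than the generic DCP header")
    sport, dport, word2, offset, ndp_cslen, checksum = struct.unpack(
        HEADER_FORMAT, packet[:HEADER_LEN])
    return GenericHeader(sport, dport, word2 >> 28, (word2 >> 24) & 0xF,
                         word2 & 0xFFFFFF, offset, ndp_cslen >> 4,
                         ndp_cslen & 0xF, checksum)


def lost_packet_had_data(ndp_before: int, ndp_after: int,
                         after_is_data: bool) -> bool:
    """Decide whether ONE lost packet, sitting between two received
    packets, carried user data, using the # NDP values of its neighbours."""
    # Non-data packets sent after the earlier packet, up to and including
    # the later one (the counter wraps modulo 16).
    delta = (ndp_after - ndp_before) % 16
    if not after_is_data:
        # The later packet itself incremented the counter before sending.
        delta = (delta - 1) % 16
    # If nothing else incremented the counter, the lost packet was data.
    return delta == 0
```

   For the example in the text, lost_packet_had_data(5, 5, True)
   reports that packet 11 carried data, while
   lost_packet_had_data(5, 6, True) reports that it did not. The
   payload starts data_offset * 4 octets into the packet.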

   Checksum: 16 bits
      DCP uses the TCP/IP checksum algorithm. Specifically, the
      checksum field is the 16 bit one's complement of the one's
      complement sum of all 16 bit words in the DCP header and options
      and, depending on the value of the checksum length field, some
      or all of the payload. When calculating the checksum, the
      checksum field itself is treated as 0. If a packet contains an
      odd number of header and text octets to be checksummed, the last
      octet is padded on the right with zeros to form a 16 bit word
      for checksum purposes. The pad is not transmitted as part of the
      packet.

3.3. DCP-Request Packet Format

   A DCP connection is initiated by sending a DCP-Request packet. The
   format of a DCP request packet is:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   /                      Generic DCP Header                       /
   /                          (12 octets)                          /
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Service Name                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   |                              ...                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The Service Name field, in combination with the Destination Port,
   identifies the service to which the sender is trying to connect.
   Service Names are 32-bit numbers allocated by the IETF; they are
   meant to correspond to application services and protocols. The host
   operating system MAY force every DCP socket, both actively and
   passively opened, to specify a Service Name. The connection will
   succeed only if the Destination Port on the receiver has the same
   Service Name as that given in the packet. If they differ, the
   receiver will respond with a DCP-Reset packet.

3.4. DCP-Response Packet Format

   In the second phase of the three-way handshake, the server sends a
   DCP-Response message to the client. The response initializes the
   server-to-client sequence number. In this phase, a server will
   often specify the options it would like to use, either from among
   those the client requested, or in addition to those. Among these
   options is the congestion control mechanism the server expects to
   use.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   /                      Generic DCP Header                       /
   /                          (12 octets)                          /
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Reserved    |            Acknowledgement Number             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   |                              ...                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Acknowledgement Number: 24 bits
      The acknowledgement number field acknowledges the largest valid
      sequence number received so far on this connection. (The usual
      care must be taken in case of wrapped sequence numbers.) In the
      case of a DCP-Response packet, the acknowledgement number field
      will equal the sequence number from the DCP-Request.
      Acknowledgement numbers make no attempt to provide precise
      information about which packets have arrived; options such as
      the Ack Vector do this.

3.5. DCP-Data, DCP-Ack, and DCP-DataAck Packet Formats

   The payload data in a DCP connection is sent in DCP-Data and
   DCP-DataAck packets.
   DCP-Data packets look like this:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   /                      Generic DCP Header                       /
   /                          (12 octets)                          /
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   |                              ...                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   DCP-Ack packets dispense with the data, but contain an
   acknowledgement number:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   /                      Generic DCP Header                       /
   /                          (12 octets)                          /
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Reserved    |            Acknowledgement Number             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   DCP-DataAck packets contain both data and an acknowledgement
   number. That is, acknowledgement information is piggybacked on a
   data packet.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   /                      Generic DCP Header                       /
   /                          (12 octets)                          /
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Reserved    |            Acknowledgement Number             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   |                              ...                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   DCP-Ack and DCP-DataAck packets may include additional
   acknowledgement options, such as Ack Vector, as required by the
   congestion control mechanism in use.

   DCP A sends DCP-Data and DCP-DataAck packets to DCP B due to
   application events on host A.
   These packets are congestion-controlled by the CCID for the A-to-B
   half-connection. In contrast, DCP-Ack packets sent by DCP A are
   controlled by the CCID for the B-to-A half-connection. Generally,
   DCP A will piggyback acknowledgement information on data packets
   when acceptable, creating DCP-DataAck packets. DCP-Ack packets are
   used when there is no data to send from DCP A to DCP B, or when the
   link from A to B is completely congested (so sending data would be
   inappropriate).

   Section 6, below, describes acknowledgements in DCP.

   A DCP-Data or DCP-DataAck packet may contain no data if the
   application sends a zero-length datagram.

3.6. DCP-CloseReq and DCP-Close Packet Format

   The DCP-CloseReq and DCP-Close packets have the same format.
   However, only the server can send a DCP-CloseReq packet. Either
   client or server may send a DCP-Close packet.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   /                      Generic DCP Header                       /
   /                          (12 octets)                          /
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Reserved    |            Acknowledgement Number             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3.7. DCP-Reset Packet Format

   DCP-Reset packets unconditionally shut down a connection. Every
   connection shutdown sequence ends with a DCP-Reset, but resets may
   be sent for other reasons, including bad port numbers, bad option
   behavior, incorrect ECN Nonce Echoes, and so forth. The reason for
   a reset is represented in the reset itself by a four-byte number,
   the Reason field.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   /                      Generic DCP Header                       /
   /                          (12 octets)                          /
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Reserved    |            Acknowledgement Number             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            Reason                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Reason: 32 bits

   The Reason field represents the reason that the sender reset the
   DCP connection. Particular values for this field will be
   described in later versions of this document.

4. Options and Features

All DCP packets can contain options which can be used to extend
DCP's functionality. Options may occupy space at the end of the DCP
header and are a multiple of 8 bits in length. All options are
included in the checksum. An option may begin on any byte boundary.

The first octet of an option is the option type. Options with types
0 through 31 are single-byte options. Other options are followed by
an octet indicating the option's length. The option-length counts
the two octets of option-type and option-length as well as the
option-data octets.

The following options are currently defined:

   Option                                          Section
   Type      Length     Meaning                    Reference
   ----      ------     -------                    ---------
   0         1          Padding                    4.1
   1         1          Data Discarded             4.4
   32        4          Ignored                    4.2
   33        variable   Ask                        4.3
   34        variable   Choose                     4.3
   35        variable   Answer                     4.3
   36        variable   Init Cookie                4.5
   37        variable   Ack Vector [Nonce 0]       6.5
   38        variable   Ack Vector [Nonce 1]       6.5
   39        3          Receive Buffer Drops       6.6
   40        6          Timestamp                  4.6
   41        10         Timestamp Echo             4.7
   128-255   variable   CCID-Specific Options      5.5

4.1. Padding Option

The padding option, with type 0, is a single byte option used to pad
between or after options.
It either ensures the payload begins on a 32-bit boundary (as
required), or ensures alignment of following options (not
mandatory).

4.2. Ignored Option

The Ignored option, with type 32, signals that a DCP did not
understand some option. This can happen, for example, when a
conventional DCP converses with an extended DCP.

Each Ignored option has two octets of payload, the first containing
the offending option type and the second containing the first octet
of the offending option's payload. (If the offending option had no
payload, this octet is 0.)

   +--------+--------+--------+--------+
   |00100000|00000100|Opt Type|Opt Data|
   +--------+--------+--------+--------+
    Type=32  Length=4

4.3. Feature Negotiation

DCP contains a mechanism for reliably negotiating features, most
notably the congestion control mechanism in use on each
half-connection. The motivation was to implement reliable feature
negotiation once, so that different options need not reinvent that
particular wheel.

Three options, Ask, Choose, and Answer, implement feature
negotiation. Ask is sent to a feature's location, asking it to
change the feature's value. The feature location may respond with
Choose, which asks the other endpoint to Ask again with different
values, or it may change the feature value and acknowledge the
request with Answer.

Features MUST NOT change values apart from feature negotiation, and
enforced retransmissions make feature negotiation reliable. This
ensures that both endpoints eventually agree on every feature's
value.

Some features are non-negotiable, meaning that the feature location
MUST set its value to whatever the other endpoint requests. (The Ask
option, for non-negotiable features, is more like "Command".) These
features use the feature framework simply to achieve reliability.

4.3.1. Feature Numbers

The first data octet of every Ask, Choose, or Answer option is a
feature number, defining the type of feature being negotiated. The
remainder of the data gives one or more values for the feature, and
is interpreted according to the feature. The current set of feature
numbers is as follows:

                                                Section
   Number    Meaning                    Neg.?   Reference
   ------    -------                    -----   ---------
   1         Congestion Control (CC)    Y       5
   2         ECN Capable                Y       7.1
   3         Ack Ratio                  N       6.3
   4         Use Ack Vector             Y       6.4
   128-255   CCID-Specific Features     ?       5.5

The "Neg.?" column is "Y" for normal features, and "N" for
non-negotiable features.

4.3.2. Ask Option

DCP B sends an Ask option to DCP A to ask it to change the value of
some feature. (DCP A is the feature location.) DCP A MUST respond to
the Ask option with either Choose or Answer. DCP B MUST retransmit
the Ask option until it receives some relevant response. DCP B will
always generate an Ask option in response to a Choose option; it may
also generate an Ask option due to some application event.

4.3.3. Choose Option

DCP A sends a Choose option to DCP B to ask it to confirm the value
of some feature. (Again, DCP A is the feature location.) DCP B MUST
respond to the Choose option with an Ask. DCP A MUST retransmit the
Choose option until it receives a relevant Ask response. DCP A may
generate a Choose option in response to some Ask option, or in
response to some application event.

4.3.4. Answer Option

DCP A sends an Answer option to DCP B to inform it of the current
value of some feature. (Again, DCP A is the feature location.) DCP A
MUST generate Answer options only in response to Ask options. DCP A
need not ever retransmit an Answer option: DCP B will retransmit the
relevant Ask as necessary.

4.3.5. Example Negotiations

This section demonstrates several negotiations of the congestion
control feature for the A-to-B half-connection. (This feature is
located at DCP A.)
In this sequence of packets, DCP A is happy with DCP B's suggestion
of CC mechanism 2:

   B > A   Ask(CC, 2)
   A > B   Answer(CC, 2)

Here, A and B jointly settle on CC mechanism 5:

   B > A   Ask(CC, 3, 4)
   A > B   Choose(CC, 1, 2, 5)
   B > A   Ask(CC, 5)
   A > B   Answer(CC, 5)

In this sequence, A refuses to use CC mechanism 5. If B requires CC
mechanism 5, its only recourse is to abort the connection:

   B > A   Ask(CC, 3, 4, 5)
   A > B   Choose(CC, 1, 2)
   B > A   Ask(CC, 5)
   A > B   Choose(CC, 1, 2)

Here, A elicits agreement from B that it is satisfied with
congestion control mechanism 2:

   A > B   Choose(CC, 1, 2)
   B > A   Ask(CC, 2)
   A > B   Answer(CC, 2)

4.3.6. Unknown Features

If a DCP receives an Ask or Choose option referring to a feature
number it does not understand, it MUST respond with a corresponding
Ignored option. This informs the remote DCP that the local DCP does
not implement the feature. No other action need be taken. (Ignored
may also indicate that the DCP endpoint could not respond to a
CCID-specific feature request because the CCID was in flux; see
Section 5.5.)

4.3.7. State Diagram

These state diagrams present the legal transitions in a DCP feature
negotiation. They define DCP's states and transitions with respect
to the negotiation of a single feature it understands. There are two
diagrams, corresponding to the two endpoints: the feature location,
or DCP A, and what we call the "feature requester", DCP B.

Transitions between states are triggered by receiving a packet
("RECV") or by an application event ("APP"). Received packets are
further distinguished by any options relevant to the feature being
negotiated. "RECV -" means the packet contained no relevant option.
"RECV Ask" denotes an Ask option, "RECV Ans" an Answer option, and
"RECV Ch" a Choose option.
The data contained in an option is given in parentheses when
necessary. The "SEND" action indicates which option the DCP will
send next. Finally, the "SET-VALUE" action causes the DCP to change
its value for the relevant feature.

"SEND" does not force DCP to immediately generate a packet; rather,
it says which feature option must be sent on the next packet
generated. A DCP MAY choose to generate a packet in response to some
"SEND" action. However, it MUST NOT generate a packet if doing so
would violate the congestion control mechanism in use.

The requester, DCP B, has four states: Known, Unknown, Failed, and
Asking. Similarly, the feature location, DCP A, has four states:
Known, Unknown, Failed, and Confirming. In both cases, Known denotes
a state where the DCP knows the feature's current value, and
believes that the other DCP agrees. Asking and Confirming denote
states where the DCPs are in the process of negotiating a new value
for the feature. The Unknown state can occur only at connection
setup time. It denotes a state where the DCP does not know any value
for the feature, and has not yet entered a negotiation to determine
its value. Finally, the Failed state represents a state where the
other DCP does not implement the feature under negotiation.

A DCP may start in either the Unknown or Known state, depending on
the feature in question. In particular, some features have a
well-known value for new connections, in which case the DCPs begin
the connection in the Known states.
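
As an informal companion to the state diagrams, the feature location's
reaction to a received Ask can be sketched in Python. This is only an
illustration under stated assumptions: the function name
location_respond and the representation of options as (name, data)
tuples are hypothetical, and the policy shown (Answer with the first
acceptable offered value, otherwise Choose with the local list) is one
simple choice among those the specification permits. It omits
retransmission, the Unknown and Failed states, and Ignored handling.

```python
from typing import List, Sequence, Tuple, Union

# A reply is ("Answer", value) or ("Choose", [values...]).
# This tuple shape is an assumption made for illustration only.
Reply = Tuple[str, Union[int, List[int]]]


def location_respond(acceptable: Sequence[int], offered: Sequence[int]) -> Reply:
    """Decide how a feature location (DCP A) reacts to Ask(offered).

    acceptable: values DCP A is willing to use, in A's preference order.
    offered:    values carried by the requester's Ask, in B's preference order.
    """
    for value in offered:
        if value in acceptable:
            # A is willing to use this value: SET-VALUE and SEND Answer.
            return ("Answer", value)
    # Nothing offered is acceptable: ask B to Ask again (SEND Choose).
    return ("Choose", list(acceptable))


if __name__ == "__main__":
    # Replays the second and third negotiations from Section 4.3.5.
    print(location_respond([1, 2, 5], [3, 4]))   # Choose with A's list
    print(location_respond([1, 2, 5], [5]))      # Answer with 5
    print(location_respond([1, 2], [3, 4, 5]))   # Choose with A's list
```

A real implementation may also send Choose when an offered value is
acceptable but the DCP would like to negotiate further, so the policy
above should be read as a simplification.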
[Page 22] INTERNET-DRAFT Expires: January 2002 July 2001 REQUESTER STATE DIAGRAM (DCP B) +-----------+ | Unknown | +-----------+ +----------+ | +-----------+ | |RECV - |RECV -/Ch | APP | |RECV Ch/Ans V |SEND - |SEND Ask V |SEND Ask +-----------+ | | +------------+ | | |----+ +------------>| |-----+ | Known |------------------------------>| Asking | | | RECV Ch | APP | |-----+ +-----------+ SEND Ask +------------+ |RECV - ^ | | ^ |SEND -/Ask | | | | | +------------------------------------------+ | +---------+ RECV Ans(O) | +----------+ SEND - +--------->| Failed | SET-VALUE O RECV Ign +----------+ SEND - Kohler/Handley/Floyd/Padhye Section 4.3.7. [Page 23] INTERNET-DRAFT Expires: January 2002 July 2001 FEATURE LOCATION STATE DIAGRAM (DCP A) (O represents any feature value acceptable to DCP A; X is not acceptable.) RECV Ask(O) SEND Ans(O) RECV - | APP SET-VALUE O +-----------+ SEND Ch(O) +--------------------| Unknown |------------+ | +-----------+ | | +-------+ | | +-----------+ | | |RECV - |RECV Ask(X) | | |RECV Ask(X) V V |SEND - |SEND Ch(O) V V |SEND Ch(O) +-----------+ | | +------------+ | (need not be | |----+ +------------>| |-----+ the same O) | Known |------------------------------>| Confirming | | |----+ RECV Ask | APP | |-----+ +-----------+ | SEND Ch(O) +------------+ |RECV - ^ ^ | | | ^ |SEND -/Ch(O) | | |RECV Ask(O) | | | | | | |SEND Ans(O) | | +---------+ | | |SET-VALUE O | | | +-------+ | | +----------+ +---------------------------------------------+ +-------->| Failed | RECV Ask(O) RECV Ign +----------+ SEND Ans(O) SEND - SET-VALUE O This specification allows several choices of action in certain states. The implementation will generally use feature-specific information to decide how to respond. For example, DCP A in the Known state may respond to an Ask option with either an Answer or a Choose option. 
If DCP A is willing to set the feature to the value specified by
Ask, it will generally send an Answer; but if it would like to
negotiate further, it will send a Choose.

DCP B must retransmit Ask options, and DCP A must retransmit Choose
options, until receiving a relevant response. However, they need not
retransmit the option on every packet, as shown by the "RECV - /
SEND -" transitions in the Asking and Confirming states.

These state diagrams guarantee safety, but not liveness. Namely, no
unexpected or erroneous options will be sent, but option negotiation
might not terminate. For example, the following infinite negotiation
is legal according to this specification.

   A > B   Choose(1)
   B > A   Ask(2)
   A > B   Choose(1)
   B > A   Ask(2)...

Implementations may choose to enforce a maximum length on any
negotiation -- for example, by resetting the connection when any
negotiation lasts more than some maximum time.

In the Asking and Confirming states, the value of the corresponding
feature is in flux. DCP MAY change its behavior in these states --
for example, by refusing to send data until reentering a Known
state.

4.4. Data Discarded Option

This option is permitted in a DCP-Response packet only. It indicates
that the payload of the DCP-Request packet was discarded by the
server, and therefore should be resent in a following DCP-Data or
DCP-DataAck packet.

This option can be set by the server to avoid having to keep state
for the connection until the handshake is complete. Doing so causes
an additional round-trip time before the server can begin servicing
the request. The tradeoff is under the control of local policy at
the server.

4.5. Init Cookie Option

This option is permitted in DCP-Response, DCP-Data, and DCP-DataAck
messages. The option MAY be returned by the server in a DCP-Response
message.
If so, then the client MUST echo the same Init Cookie option in its
ensuing DCP-Data or DCP-DataAck message.

The purpose of this option is to allow a DCP server to avoid having
to hold any state until the three-way connection setup handshake has
completed. The server wraps up the service name, server port, and
any options it cares about from both the DCP-Request and
DCP-Response in an opaque cookie. Typically the cookie will be
encrypted using a secret known only to the server and include a
cryptographic checksum or magic value so that correct decryption can
be verified. When the server receives the cookie back from the
client, it can decrypt the cookie and instantiate all the state it
avoided keeping.

The precise implementation of the Init Cookie does not need to be
specified here as it is only relayed by the client, and does not
need to be understood by the client.

4.6. Timestamp Option

This option is permitted in any DCP packet. The length of the option
is 6 bytes.

   +--------+--------+--------+--------+--------+--------+
   |00101000|00000110|          Timestamp Value          |
   +--------+--------+--------+--------+--------+--------+
    Type=40  Length=6

The four bytes of option data carry the timestamp of this packet, in
some undetermined form. A DCP receiving a Timestamp option SHOULD
respond with a Timestamp Echo option on the next packet it sends.

4.7. Timestamp Echo Option

This option is permitted in any DCP packet, as long as at least one
packet carrying the Timestamp option has been received. The length
of the option is 10 bytes.

   +--------+--------+------- ... -------+------- ... -------+
   |00101001|00001010|      TS Echo      |      Elapsed      |
   +--------+--------+------- ... -------+------- ... -------+
    Type=41   Len=10       (4 bytes)           (4 bytes)

The first four bytes of option data, TS Echo, carry a Timestamp
Value taken from a preceding received Timestamp option.
Usually, this will be the last packet that was received. The final
four bytes indicate the amount of time elapsed since receiving the
packet whose timestamp is being echoed. This time MUST be in
microseconds. We are currently investigating ways to relax the last
requirement.

5. Congestion Control IDs

Each congestion control mechanism supported by DCP is assigned a
congestion control identifier, or CCID: a number from 0 to 255.
During connection setup, and optionally thereafter, the endpoints
negotiate their congestion control mechanisms by negotiating the
values for their Congestion Control features. Congestion Control has
feature number 1.

The feature located at DCP A is the CCID in use for the A-to-B
half-connection. DCP B sends an "Ask(CC, K)" option to DCP A to ask
A to use CCID K for its data packets.

The data octets of Congestion Control feature negotiation options
form a list of acceptable CCIDs, sorted in descending order of
priority. For example, the option "Ask(CC, 1, 2, 3)" asks the sender
to use CCID 1, although CCIDs 2 and 3 are also acceptable. (This
corresponds to the octets "33, 6, 1, 1, 2, 3": Ask option type (33,
per the option table in Section 4), option length (6), feature ID
(1), CCIDs (1, 2, 3).) Similarly, "Answer(CC, 1, 2, 3)" tells the
receiver that the sender is using CCID 1, but that CCIDs 2 or 3
might also be acceptable.

The CCIDs defined by this document are:

   CCID   Meaning
   ----   -------
   0      Single-Window Congestion Control
   1      Unspecified Sender-Based Congestion Control
   2      TCP-like Congestion Control
   3      TFRC Congestion Control

A new connection starts with CCID 0 for both DCPs. If this is
unacceptable for either DCP, that DCP will start in the Unknown
state. A DCP SHOULD NOT send data when its Congestion Control
feature is in the Unknown state.

5.1. Single-Window Congestion Control

CCID 0 denotes the absence of congestion control, and is appropriate
only for streams of pure acknowledgements, possibly including at
most one window of data at connection startup. (Streams of pure
acknowledgements are congestion controlled, but by the other
half-connection's CCID. See Section 6 below.)

This is appropriate for half-connections that will contain no data
-- for example, the client-to-server half-connection on a streaming
media connection. Servers may want to encourage their clients to use
CCID 0, since this will ensure that they need not maintain detailed
acknowledgement information for clients' packets, simplifying their
implementation.

HC-Senders using CCID 0 MUST NOT send any data packets during the
lifetime of the connection, except possibly for at most one initial
window of data (as defined by TCP; currently two packets) during
connection startup. HC-Receivers using CCID 0 SHOULD reset the
connection if they receive an unexpected data packet.

We have not yet determined whether CCID 0 should reliably transmit
this initial window of packets.

CCID 0 is further described in [CCID 0 PROFILE].

5.2. Unspecified Sender-Based Congestion Control

CCID 1 denotes an unspecified sender-based congestion control
mechanism. Separate features negotiate the corresponding congestion
acknowledgement options -- for example, Ack Vector.

CCID 1 is designed for research and extensibility. For example, say
that CCID 98, a new sender-based congestion control mechanism using
Ack Vector for acknowledgements, has entered the IETF standards
process. Now, DCP A, which understands and would like to use CCID
98, is trying to communicate with DCP B, which doesn't yet know
about CCID 98. DCP A can simply negotiate use of CCID 1 and,
separately, negotiate Use Ack Vector.
DCP B will provide the feedback DCP A requires for CCID 98, namely
Ack Vector, without needing to understand the congestion control
mechanism in use.

It is not a conformant use of DCP to use CCID 1 in production
environments as a proxy for a congestion control mechanism that has
not entered the IETF standards process.

5.3. TCP-like Congestion Control

CCID 2 denotes Additive Increase, Multiplicative Decrease (AIMD)
congestion control with behavior modelled directly on TCP, including
congestion window, slow start, timeouts, and so forth.

CCID 2 is further described in [CCID 2 PROFILE].

5.4. TFRC Congestion Control

CCID 3 denotes TCP-Friendly Rate Control, an equation-based
rate-controlled congestion control mechanism.

CCID 3 is further described in [CCID 3 PROFILE].

5.5. CCID-Specific Options and Features

Option and feature numbers 128 through 255 are available for
CCID-specific use. CCIDs may often need new option types -- for
communicating acknowledgement or rate information, for example.
CCID-specific option types let them create options at will without
polluting the global options space. Option 128 might have different
meanings on a half-connection using CCID 4 and a half-connection
using CCID 8. CCID-specific options and features will never conflict
with global options introduced by later versions of this
specification.

Any packet may contain information meant for either half-connection,
so CCID-specific option and feature numbers explicitly signal the
half-connection to which they apply. Option numbers 128 through 191
are for options sent from the HC-Sender to the HC-Receiver; option
numbers 192 through 255 are for options sent from the HC-Receiver to
the HC-Sender. Similarly, feature numbers 128 through 191 are for
features located at the HC-Sender; feature numbers 192 through 255
are for features located at the HC-Receiver. (Ask options for a
feature are sent *to* the feature location; Choose and Answer
options are sent *from* the feature location. Thus, Ask(128) options
are sent by the HC-Receiver by definition, while Ask(192) options
are sent by the HC-Sender.)

For example, consider a DCP connection where the A-to-B
half-connection uses CCID 4 and the B-to-A half-connection uses CCID
5. Here is how a sampling of CCID-specific options and features are
assigned to half-connections:

                                 Relevant     Relevant
   Packet   Option               Half-conn.   CCID
   ------   ------               ----------   ----
   A > B    128                  A-to-B       4
   A > B    192                  B-to-A       5
   A > B    Ask(128, ...)        B-to-A       5
   A > B    Choose(128, ...)     A-to-B       4
   A > B    Answer(128, ...)     A-to-B       4
   A > B    Ask(192, ...)        A-to-B       4
   A > B    Choose(192, ...)     B-to-A       5
   A > B    Answer(192, ...)     B-to-A       5

CCID-specific options and features have no clear meaning when the
relevant CCID is in flux. A DCP SHOULD respond to CCID-specific
options and features with Ignored options during those times.

6. Acknowledgements

Congestion control requires receivers to transmit information about
packet losses and ECN marks to senders. DCP receivers MUST report
all congestion they see, using mechanisms appropriate for the CCID
in use. Generally, this is accomplished through options. For
example, on a half-connection with CCID 2 (TCP-like), the receiver
reports acknowledgement information using the Ack Vector option.
CCID-specific profiles say which options are relevant, and how to
decide when to ack; this section describes common acknowledgement
options and shows how acks using those options will commonly work.

Acknowledgement options, such as Ack Vector, are only allowed on
DCP-Ack, DCP-DataAck, DCP-Close, and DCP-CloseReq packets.

6.1. Acknowledgements and CCIDs

Acknowledgements are controlled by CCIDs. Each CCID specifies which
options its acknowledgements must use, when they should be sent, how
they should be congestion controlled, and so on.
Each CCID additionally describes the form acks-of-acks must take --
if required at all -- when the CCID is active on a unidirectional
connection.

This last point requires some explanation. DCP was designed to work
well for both bidirectional and unidirectional flows of data, and
for connections that transition between these states. However,
acknowledgements required for a bidirectional connection are very
different from those required for a unidirectional connection.

Consider a connection where both half-connections use the same CCID
(either 2 or 3), but the B-to-A half-connection has become
quiescent; that is, DCP B has no more data to send to DCP A, and is
sending only DCP-Acks.

Now, for CCID 2, TCP-like Congestion Control, DCP B uses Ack Vector
to reliably communicate which packets it has received. Because of
this reliability, DCP A must inform DCP B when it receives an Ack
Vector: that is, DCP A must occasionally acknowledge a pure
acknowledgement. The ack-of-ack traffic need not be reliable; for
instance, it need not use Ack Vector. DCP A might just send a
DCP-DataAck packet every now and then, instead of DCP-Data.

In contrast, for CCID 3, TFRC Congestion Control, DCP B's
acknowledgements need not be reliable. B's DCP-Acks contain
cumulative loss rates; TFRC works even if every DCP-Ack is lost.
Therefore, DCP A need not ever acknowledge an acknowledgement.

When communication is bidirectional, DCP A's ack-of-ack traffic is
automatically contained in its normal acknowledgement traffic for
DCP B's data. However, the required ack-of-ack traffic is
significantly smaller and simpler than the normal ack traffic.
Therefore, DCP sends only the ack-of-ack traffic when communication
is unidirectional, since this reduces DCP A's acknowledgements to
nothing, or nearly nothing.
Thus, when communication is unidirectional, a single CCID -- in the
example, the A-to-B CCID -- is controlling both DCP A's and DCP B's
acknowledgements, in terms of their content, their frequency, and so
forth. In the bidirectional case, the A-to-B CCID governs DCP B's
acknowledgements, while the B-to-A CCID governs DCP A's
acknowledgements.

DCP A switches its ack pattern from bidirectional to unidirectional
when it notices that DCP B has gone quiescent -- that is, B is no
longer sending data packets. It switches from unidirectional to
bidirectional when it must acknowledge even a single DCP-Data or
DCP-DataAck packet from DCP B. (This includes the case where a
single DCP-Data or DCP-DataAck packet was lost in transit. DCP A can
detect this case using the # NDP field in the DCP packet header.)

The B-to-A CCID defines when DCP B has gone quiescent; usually, this
happens when a period has passed without B sending any data packets.
For CCID 2, this period is roughly two round-trip times. The A-to-B
CCID defines how DCP A handles acks-of-acks once DCP B has gone
quiescent.

6.2. Ack Piggybacking

Acknowledgements of A-to-B data MAY be piggybacked on data sent by
DCP B, as long as that does not delay the acknowledgement longer
than the A-to-B CCID would find acceptable. However, data
acknowledgements often require more than 4 bytes to express. A large
set of acknowledgements prepended to a large data packet might
exceed the path's MTU. In this case, DCP B SHOULD send separate
DCP-Data and DCP-Ack packets, or wait for a smaller datagram (but
not too long).

Piggybacking is particularly common at DCP A when the B-to-A
half-connection is quiescent -- that is, when DCP A is just
acknowledging DCP B's acknowledgements, as described above.
There are three reasons to acknowledge DCP B's acknowledgements: to
allow DCP B to free up information about previously acknowledged
data packets from A; to shrink the size of future acknowledgements;
and to manipulate the rate at which future acknowledgements are
sent. Since these are secondary concerns, DCP A can generally afford
to wait indefinitely for a data packet to piggyback its
acknowledgement onto.

Any restrictions on ack piggybacking are described in the relevant
CCID's profile.

6.3. Ack Ratio Feature

With Ack Ratio, DCP A can perform rudimentary congestion control on
DCP B's acknowledgement stream by telling DCP B how to clock its
acks.

Ack Ratio has feature number 3. The Ack Ratio feature located at DCP
B equals the ratio of data packets sent by DCP A to acknowledgement
packets sent back by DCP B. For example, if it is set to four, then
DCP B will send at least one acknowledgement packet for every four
data packets DCP A sends. DCP A sends an "Ask(Ack Ratio)" option to
DCP B to change DCP B's ack ratio.

An Ack Ratio option contains two bytes of data: a sixteen-bit
integer representing the ratio. A new connection starts with Ack
Ratio 2 for both DCPs.

This feature is non-negotiable.

6.4. Use Ack Vector Feature

The Use Ack Vector feature lets DCPs negotiate whether they should
use Ack Vector options to report congestion. Ack Vector provides
detailed loss information, and lets senders report back to their
applications whether particular packets were dropped. Use Ack Vector
is mandatory for some CCIDs, and optional for others.

Use Ack Vector has feature number 4. The Use Ack Vector feature
located at DCP B specifies whether DCP B should use the Ack Vector
option to report congestion back to DCP A. DCP A sends an "Ask(Use
Ack Vector, 1)" option to DCP B to ask B to send Ack Vector options
as part of its acknowledgement traffic.
A Use Ack Vector option contains a single octet of data. The
receiver should send Ack Vector options if and only if this octet is
nonzero. A new connection starts with Use Ack Vector 0 for both
DCPs.

6.5. Ack Vector Options

The Ack Vector gives a run-length encoded history of data packets
received at the receiver. Each octet of the vector gives the state
of that data packet in the loss history, and the number of preceding
packets with the same state. The option's data looks like this:

   +--------+--------+--------+--------+--------+
   |001001??| Length |SSLLLLLL|SSLLLLLL|SSLLLLLL|...
   +--------+--------+--------+--------+--------+
   Type=37/38         \________ Vector ________/

The two Ack Vector options (option types 37 and 38) differ only in
the values they imply for ECN Nonce Echo. Section 7.2 describes this
further.

The vector itself consists of a series of octets, each of whose
encoding is:

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |St | Run Length|
   +-+-+-+-+-+-+-+-+

   St[ate]: 2 bits

   Run Length: 6 bits

State occupies the most significant two bits of each byte, and can
have one of four values:

   0   Packet received (and not ECN marked).
   1   Packet ECN marked.
   2   Reserved.
   3   Packet not yet received.

The first byte in the first Ack Vector option refers to the packet
indicated in the Acknowledgement Number; subsequent bytes refer to
older packets. (Ack Vector may not be sent on DCP-Data packets,
which lack an Acknowledgement Number.) If an Ack Vector contains the
decimal values 0,192,3,64,5 and the Acknowledgement Number is
decimal 100, then:

   Packet 100 was received (Acknowledgement Number 100, State 0,
   Run Length 0).

   Packet 99 was lost (State 3, Run Length 0).

   Packets 98, 97, 96 and 95 were received (State 0, Run Length 3).

   Packet 94 was ECN marked (State 1, Run Length 0).

   Packets 93, 92, 91, 90, 89, and 88 were received (State 0, Run
   Length 5).
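
The decoding of the example above can be sketched in Python. This is a
minimal illustration, not part of the specification: the function name
decode_ack_vector and the state labels are hypothetical, the input is
assumed to be the vector octets only (option type and length already
stripped), and sequence number wraparound is deliberately ignored.

```python
from typing import List, Tuple

# Labels for the two-bit State field; the names are illustrative only.
STATE_NAMES = {
    0: "received",
    1: "ecn_marked",
    2: "reserved",
    3: "not_yet_received",
}


def decode_ack_vector(ack_number: int, vector: bytes) -> List[Tuple[int, str]]:
    """Expand Ack Vector octets into (sequence number, state) pairs.

    The first octet describes the packet named by the Acknowledgement
    Number; later octets describe progressively older packets. Each
    octet covers (Run Length + 1) packets that share one state.
    """
    result: List[Tuple[int, str]] = []
    seq = ack_number
    for octet in vector:
        state = octet >> 6          # most significant two bits
        run_length = octet & 0x3F   # least significant six bits
        for _ in range(run_length + 1):
            result.append((seq, STATE_NAMES[state]))
            seq -= 1                # assumption: no wraparound handling
    return result


if __name__ == "__main__":
    for seq, state in decode_ack_vector(100, bytes([0, 192, 3, 64, 5])):
        print(seq, state)
```

Run on the octets 0,192,3,64,5 with Acknowledgement Number 100, the
sketch yields 13 entries covering packets 100 down to 88, matching the
list above.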
Run lengths of more than 64 must be encoded in multiple bytes. A
single Ack Vector option can acknowledge up to 16192 data packets.
Should more packets need to be acknowledged than can fit in 253
bytes of Ack Vector, then multiple Ack Vector options can be sent.
The second Ack Vector option will begin where the first Ack Vector
option left off, and so forth.

Packets dropped in the receive buffer should be reported as not
received (State 3). The Receive Buffer Drops option distinguishes
between congestion losses and losses due to receive buffer overflow.

6.5.1. Ack Vector Consistency

A DCP sender will commonly receive multiple acknowledgements for
some of its data packets. For instance, an HC-Sender might receive
two DCP-Acks with Ack Vectors, both of which contained information
about sequence number 24. (Because of cumulative acking, information
about a sequence number is repeated in every ack until the HC-Sender
acknowledges an ack. Perhaps the HC-Receiver is sending acks faster
than the HC-Sender is acknowledging them.) In a perfect world, the
two Ack Vectors would always be consistent. However, there are many
reasons why they might not be:

o  The HC-Receiver received packet 24 between sending its acks, so
   the first ack said 24 was not received (State 3) and the second
   said it was received or ECN marked (State 0 or 1).

o  The HC-Receiver received packet 24 between sending its acks, and
   the network reordered the acks. In this case, the packet will
   appear to transition from State 0 or 1 to State 3.

o  The network duplicated packet 24, but only one of the duplicates
   was ECN marked. Depending on the HC-Receiver's implementation,
   this might show up as a transition between States 0 and 1.
To cope with these situations, HC-Sender DCP implementations SHOULD combine multiple received Ack Vector states according to this table:

                    Received State
                      0   1   3
                    +---+---+---+
                  0 | 0 | 1 | 0 |
           Old      +---+---+---+
                  1 | 1 | 1 | 1 |
          State     +---+---+---+
                  3 | 0 | 1 | 3 |
                    +---+---+---+

To read the table, choose the row corresponding to the packet's old state, and the column corresponding to the packet's state in the newly received Ack Vector, then read the packet's new state off the table. The table is symmetric about the main diagonal, so it is indifferent to ack reordering.

An HC-Sender MAY choose to throw away old information gleaned from the HC-Receiver's Ack Vectors, in which case it MUST ignore newly received acknowledgements from the HC-Receiver for those old packets. However, it is often kinder to save recent Ack Vector information for a while, so that the HC-Sender can undo its reaction to presumed congestion when a "lost" packet unexpectedly shows up (the transition from State 3 to State 0).

6.5.2. Ack Vector Coverage

We can divide the packets that have been sent from an HC-Sender to an HC-Receiver into four roughly contiguous groups. From oldest to youngest, these are:

(1)  Packets already acknowledged by the HC-Receiver, where the HC-Receiver knows that the HC-Sender has definitely received the acknowledgements.

(2)  Packets already acknowledged by the HC-Receiver, where the HC-Receiver cannot be sure that the HC-Sender has received the acknowledgements.

(3)  Packets not yet acknowledged by the HC-Receiver.

(4)  Packets not yet received by the HC-Receiver.

The union of groups 2 and 3 is called the Unacknowledged Window. Generally, every Ack Vector the HC-Receiver sends will cover the whole Unacknowledged Window: Ack Vector acknowledgements are cumulative. (This simplifies Ack Vector maintenance at the HC-Receiver; see Section 6.7, below.)
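As an aside on Section 6.5.1, the state-combination table given there can be written as a simple lookup. The sketch below is one possible rendering, not a required implementation; the names COMBINE, combine_states, and merge_report are hypothetical, and the reserved State 2 is deliberately left out.

```python
# Illustrative sketch only. Keys are (old state, newly received state);
# the values are copied from the table in Section 6.5.1.
COMBINE = {
    (0, 0): 0, (0, 1): 1, (0, 3): 0,
    (1, 0): 1, (1, 1): 1, (1, 3): 1,
    (3, 0): 0, (3, 1): 1, (3, 3): 3,
}


def combine_states(old, received):
    """Return the state an HC-Sender should record for one packet."""
    return COMBINE[(old, received)]


def merge_report(history, report):
    """Fold decoded (seqno, state) pairs into the sender's per-packet history."""
    for seqno, state in report:
        if seqno in history:
            history[seqno] = combine_states(history[seqno], state)
        else:
            history[seqno] = state
    return history


# Packet 24 is first reported lost, then received, then "lost" again by a
# reordered (older) ack. The recorded state ends up as received (0).
history = {}
merge_report(history, [(24, 3)])
merge_report(history, [(24, 0)])
merge_report(history, [(24, 3)])
print(history[24])
```

Because the table is symmetric, the order in which the three reports arrive does not change the final recorded state.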
As packets are received, this window both grows on the right and shrinks on the left. It grows because there are more packets, and shrinks because the data packets' Acknowledgement Numbers will acknowledge previous acknowledgements, moving packets from group 2 into group 1.

6.6. Receive Buffer Drops Option

The Receive Buffer Drops option indicates that some packets reported as not received were actually dropped at the endpoint due to insufficient kernel space. The sender will probably react differently to receive buffer drops than to congestion losses; for instance, it might not reduce its congestion window. The option's data looks like this:

   +--------+--------+--------+
   |00100111|00000011| Count  |
   +--------+--------+--------+
    Type=39  Length=3

   Count: 8 bits

The Count field says how many acknowledged packets were dropped at the receive buffer, limited to packets acknowledged by the packet containing the option. Count is simply a number between 0 and 255.

Multiple Receive Buffer Drops options are added together, so a single option with Count 2 is equivalent to two options, each with Count 1.

A packet's total Receive Buffer Drops count MUST be less than or equal to the number of packets acknowledged by it as "not yet received". For example, assuming Ack Vector, the Receive Buffer Drops count must be less than or equal to the total number of State-3 packets in the Ack Vectors.

If an ECN-marked packet is dropped at the receive buffer, it MUST NOT be included in the Receive Buffer Drops count. Such packets MUST be reported as the equivalent of "dropped by the network". (For Ack Vector, this is "not yet received".)

6.7. Ack Vector Implementation Notes

This section discusses the particulars of DCP acknowledgement handling, in the context of an abstract implementation for Ack Vector. It may safely be skipped.
The first part of our implementation runs at the HC-Receiver, and therefore acknowledges data packets. It generates Ack Vector options. The implementation has the following characteristics:

o  At most one byte of state per acknowledged packet.

o  O(1) time to update that state when a new packet arrives (normal case).

o  Cumulative acknowledgements.

o  Quick removal of old state.

The basic data structure is a circular buffer containing information about acknowledged packets. Each byte in this buffer contains a state and run length; the state can be 0 (packet received), 1 (packet ECN marked), or 3 (packet not yet received). The live portion of the buffer is marked off by head and tail pointers; each is further marked with the HC-Sender sequence number to which it corresponds. The buffer grows from right to left. For example:

   +-------------------------------------------------------------------+
   |S,L|S,L|S,L|S,L|S,L|   |   |   |   |S,L|S,L|S,L|S,L|S,L|S,L|S,L|S,L|
   +-------------------------------------------------------------------+
                     ^                   ^
              Tail, seqno = T     Head, seqno = H

              <=== Head and Tail move this way <===

Each `S,L' represents a State/Run length byte. We will draw these buffers showing only their live portion; for example, here is another representation for the buffer above:

            +---------------------------------------------------+
   (Head) H |S,L|S,L|S,L|S,L|S,L|S,L|S,L|S,L|S,L|S,L|S,L|S,L|S,L| T (Tail)
            +---------------------------------------------------+

This smaller Example Buffer contains actual data.

      +---------------------------+
   10 |0,0|3,0|3,0|3,0|0,4|1,0|0,0| 0  [Example Buffer]
      +---------------------------+

In concrete terms, its meaning is as follows:

   Packet 10 was received. (The head of the buffer has sequence number 10, state 0, and run length 0.)

   Packets 9, 8, and 7 have not yet been received. (The three bytes preceding the head each have state 3 and run length 0.)
   Packets 6, 5, 4, 3, and 2 were received.

   Packet 1 was ECN marked.

   Packet 0 was received.

6.7.1. New Packets

When a packet arrives whose sequence number is larger than any in the buffer, the HC-Receiver simply moves the Head pointer to the left, increases the head sequence number, and stores a byte representing the packet into the buffer. For example, if HC-Sender packet 11 arrived ECN marked, the Example Buffer above would enter this new state (the change is marked with stars):

      +***----------------------------+
   11 |1,0|0,0|3,0|3,0|3,0|0,4|1,0|0,0| 0
      +***----------------------------+

If the packet's state equals the state at the head of the buffer, the HC-Receiver may choose to increment its run length (up to the maximum). For example, if HC-Sender packet 11 arrived without ECN marking, the Example Buffer might enter this state instead:

      +--*------------------------+
   11 |0,1|3,0|3,0|3,0|0,4|1,0|0,0| 0
      +--*------------------------+

Of course, the new packet's sequence number might not equal the expected sequence number. In this case, the HC-Receiver should enter the intervening packets as State 3. If several packets are missing, the HC-Receiver may prefer to enter multiple bytes with run length 0, rather than a single byte with a larger run length; this simplifies table updates when one of the missing packets arrives. For example, if HC-Sender packet 12 arrived while the Example Buffer was in its original state, with packet 11 still missing, the buffer would enter this state:

      +*******----------------------------+
   12 |0,0|3,0|0,0|3,0|3,0|3,0|0,4|1,0|0,0| 0
      +*******----------------------------+

When a new packet's sequence number is less than the head sequence number, the HC-Receiver should scan the table for the byte corresponding to that sequence number. (Slightly more complex indexing structures could reduce the complexity of this scan.)
Assume that the sequence number was previously lost (State 3), and that it was stored in a byte with run length 0. Then the HC-Receiver can simply change the byte's state. For example, if HC-Sender packet 8 was received, the Example Buffer would enter this state:

      +--------*------------------+
   10 |0,0|3,0|0,0|3,0|0,4|1,0|0,0| 0
      +--------*------------------+

If the sequence number is not marked as lost in the buffer, or if it is not contained in the table at all, the packet is probably a duplicate, and should be ignored. (The new packet's ECN marking state might differ from the state in the buffer; Section 6.5.1 describes what to do then.) If the packet's corresponding buffer byte has a non-zero run length, then the buffer might need to be reshuffled to make space for one or two new bytes.

Of course, the circular buffer may overflow: when the HC-Sender is sending data at a very high rate, when the HC-Receiver's acknowledgements are not reaching the HC-Sender, or when the HC-Sender is forgetting to acknowledge those acks (so the HC-Receiver is unable to clean up old state). In this case, the HC-Receiver should either compress the buffer, transfer its state to a larger buffer, or drop all received packets until its buffer shrinks again.

6.7.2. Sending Acknowledgements

Whenever the HC-Receiver needs to generate an acknowledgement, the buffer's contents can simply be copied into one or more Ack Vector options. Copied Ack Vectors might not be maximally compressed; for example, the Example Buffer above contains three adjacent 3,0 bytes that could be combined into a single 3,2 byte. The HC-Receiver might, therefore, choose to compress the buffer in place before sending the option, or to compress the buffer while copying it; either operation is simple.

Every acknowledgement sent by the HC-Receiver should include the entire state of the buffer.
That is, acknowledgements are cumulative.

The HC-Receiver should store information about each acknowledgement it sends in another buffer. Specifically, for every acknowledgement it sends, the HC-Receiver should store:

o  The HC-Receiver sequence number it used for the ack packet.

o  The HC-Sender sequence number it acknowledged (that is, the packet's Acknowledgement Number). Since acknowledgements are cumulative, this single number completely specifies the set of HC-Sender packets acknowledged by this ack packet.

6.7.3. Clearing State

Some of the HC-Sender's packets will include acknowledgement numbers, which ack the HC-Receiver's acknowledgements. When such an ack is received, the HC-Receiver simply finds the HC-Sender sequence number corresponding to that acked HC-Receiver packet, and moves the buffer's Tail pointer up to that sequence number. (It may choose to keep some older information, in case a lost packet shows up late.)

For example, say that the HC-Receiver storing the Example Buffer had sent two acknowledgements already:

   HC-Receiver Ack 59 acknowledged HC-Sender Seq 3, and
   HC-Receiver Ack 60 acknowledged HC-Sender Seq 10.

Say the HC-Receiver then received a DCP-DataAck packet from the HC-Sender with Acknowledgement Number 59. This informs the HC-Receiver that the HC-Sender received, and processed, all the information in HC-Receiver packet 59. This packet acknowledged HC-Sender packet 3, so the HC-Sender has now received the HC-Receiver's acknowledgements for packets 0, 1, 2, and 3. The Example Buffer should enter this state:

      +------------------*+ *
   10 |0,0|3,0|3,0|3,0|0,2| 4
      +------------------*+ *

Note that the tail byte's run length was adjusted, since packet 3 was in the middle of that byte. The HC-Receiver can also throw away the information about HC-Receiver Ack 59.
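The tail-trimming step described above can be sketched as follows. This is a simplified illustration rather than the abstract implementation itself: it models the live buffer as a Python list ordered from head to tail instead of a circular buffer, ignores sequence number wraparound, and uses the hypothetical names clear_state and ack_record.

```python
# Illustrative sketch only: a plain list stands in for the circular buffer.
def clear_state(buffer, head_seq, tail_seq, ack_record, acked_receiver_seq):
    """Drop state for HC-Sender packets covered by an acknowledged ack.

    buffer:      list of (state, run_length) bytes, head (newest) first.
    ack_record:  dict mapping HC-Receiver ack seqno -> HC-Sender seqno acked.
    Returns (new_buffer, new_tail_seq).
    """
    if acked_receiver_seq not in ack_record:
        return buffer, tail_seq  # nothing known about this ack; keep state
    # Forget the record for this ack and compute the new tail.
    new_tail = ack_record.pop(acked_receiver_seq) + 1
    new_buffer = []
    newest = head_seq  # newest packet covered by the current byte
    for state, run in buffer:
        oldest = newest - run  # oldest packet covered by the current byte
        if newest < new_tail:
            break  # this byte and all older bytes are fully acknowledged
        if oldest < new_tail:
            run = newest - new_tail  # new tail falls inside this byte: trim it
        new_buffer.append((state, run))
        newest = oldest - 1
    return new_buffer, new_tail


# The example from the text: Acknowledgement Number 59 arrives.
example = [(0, 0), (3, 0), (3, 0), (3, 0), (0, 4), (1, 0), (0, 0)]
record = {59: 3, 60: 10}
trimmed, tail = clear_state(example, 10, 0, record, 59)
print(trimmed, tail)
```

With the Example Buffer as input, the result is the five-byte buffer shown above, ending in a 0,2 byte with tail sequence number 4, and only the record for Ack 60 remains.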
A careful implementation might also modify its own acknowledgement record to ensure that it is reasonably robust to reordering. Suppose that the Example Buffer is as before, but that packet 9 now arrives, out of sequence. The Example Buffer would enter this state:

      +----*----------------------+
   10 |0,0|0,0|3,0|3,0|0,4|1,0|0,0| 0
      +----*----------------------+

Now, if the HC-Receiver then received a DCP-DataAck packet from the HC-Sender with Sequence Number 11 and Acknowledgement Number 60, this might cause the tail pointer to be moved up to packet 10, although packet 9's arrival has not yet been acknowledged. Instead, when packet 9 arrived, the HC-Receiver's acknowledgement record might be modified to:

   HC-Receiver Ack 59 acknowledged HC-Sender Seq 3, and
   HC-Receiver Ack 60 acknowledged HC-Sender Seq 8.

That is, any HC-Sender sequence number in the acknowledgement record is reduced to at most 8. This would prevent the Tail pointer from moving past packet 9 until the HC-Receiver knows that the HC-Sender has seen an Ack Vector indicating this packet's arrival.

6.7.4. Processing Acknowledgements

When the HC-Sender receives an acknowledgement, it generally cares about the number of packets that were dropped and/or ECN marked. It simply reads this off the Ack Vector. Additionally, it may check the ECN Nonce for correctness. (As described in Section 6.5.1, it may want to keep more detailed information about acknowledged packets in case packets change states between acknowledgements, or in case the application queries whether a packet arrived.)

Of course, the HC-Sender must also acknowledge the HC-Receiver's acknowledgements, so the HC-Receiver can free up its state. This is much simpler than the HC-Receiver's acknowledgement code, since the HC-Receiver doesn't need complete acknowledgement information. For example, assuming that the HC-Receiver sends no data, the HC-Sender can simply ensure that at least once a round-trip time, it sends a DCP-DataAck packet acknowledging the latest DCP-Ack packet it has received. (The HC-Sender must watch for drops and ECN marks on received DCP-Ack packets, so that it can adjust the HC-Receiver's ack-sending rate in response to congestion; but it need not inform the HC-Receiver about which acks were dropped.)

If the other half-connection is not quiescent -- that is, the HC-Receiver is sending data to the HC-Sender, possibly using another CCID -- then the acknowledgements on that half-connection are usually sufficient for the HC-Receiver to free its state.

7. Explicit Congestion Notification

DCP is fully ECN-aware. Every CCID specifies how its endpoints respond to ECN marks. Furthermore, DCP, unlike TCP, allows senders to control the rate at which acknowledgements are generated (with options like Ack Ratio); this means that acknowledgements are generally congestion-controlled, and may have ECN-Capable Transport set. Every CCID profile describes how that profile interacts with ECN, both for data traffic and pure-acknowledgement traffic. A sender SHOULD set ECN-Capable Transport on a sent packet whenever the receiver has its ECN Capable feature turned on, and the relevant CCID allows it.

The rest of this section describes the ECN Capable feature, and the interaction of the ECN Nonce with acknowledgement options such as Ack Vector.

7.1. ECN Capable Feature

The ECN Capable feature lets a DCP inform its partner that it cannot read ECN bits from received IP headers, so the partner must not set ECN-Capable Transport on its packets.

ECN Capable has feature number 2. The ECN Capable feature located at DCP A indicates whether or not A can successfully read ECN bits from received frames' IP headers. (This is independent of whether it can set ECN bits on sent frames.)
DCP A sends a "Choose(ECN Capable, 0)" option to DCP B to inform B that A cannot read ECN bits.

An ECN Capable feature contains a single octet of data. ECN capability is on if and only if this octet is nonzero.

A new connection starts with ECN Capable 1 (that is, ECN capable) for both DCPs. If a DCP is not ECN capable, it MUST send "Choose(ECN Capable, 0)" options to the other endpoint until acknowledged (by "Ask(ECN Capable, 0)") or the connection closes. Furthermore, it MUST NOT accept any data until the other endpoint sends "Ask(ECN Capable, 0)".

7.2. ECN Nonces

Congestion avoidance will not occur, and the receiver will sometimes get its data faster, when the sender is not told about any congestion events. Thus, the receiver has some incentive to falsify acknowledgement information, reporting that marked or dropped packets were actually received unmarked. This problem is more serious with DCP than with TCP, since TCP provides reliable transport: it is more difficult with TCP to lie about lost packets without breaking the application.

ECN Nonces are a general mechanism to prevent ECN cheating (or loss cheating). Two values for the two-bit ECN header field indicate ECN-Capable Transport, 01 and 10. The second code point, 10, is the ECN Nonce. In general, a protocol sender chooses between these code points randomly on its output packets, remembering the sequence it chose. The protocol receiver reports, on every acknowledgement, the number of ECN Nonces it has received thus far. This is called the ECN Nonce Echo. Since ECN marking and packet dropping both destroy the ECN Nonce, a receiver that lies about an ECN mark or packet drop has a 50% chance of guessing right and avoiding discipline. The sender may react punitively to an ECN Nonce mismatch, possibly up to dropping the connection.
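The general mechanism just described can be sketched in a few lines. This is only an illustration of the count-based echo in the paragraph above, under the assumption that the sender can tell which packets the receiver claims to have received unmarked; the function names are hypothetical and nothing here is a DCP wire format.

```python
import random


# Illustrative sketch only: function names are hypothetical.
def choose_nonces(seqnos, rng=random):
    """Sender: pick a random nonce bit (1 = ECN Nonce code point) per packet."""
    return {seq: rng.getrandbits(1) for seq in seqnos}


def expected_echo(nonces, claimed_unmarked):
    """Sender: count the nonces on packets the receiver claims arrived unmarked."""
    return sum(nonces[seq] for seq in claimed_unmarked)


def echo_matches(nonces, claimed_unmarked, echo):
    """Sender: compare the receiver's reported count with the expected count."""
    return expected_echo(nonces, claimed_unmarked) == echo


nonces = choose_nonces(range(4))

# Packet 3 is dropped, so the receiver only ever sees the nonces of 0, 1, 2.
seen = {seq: nonces[seq] for seq in (0, 1, 2)}

# An honest receiver reports packet 3 as missing and echoes what it saw.
honest_ok = echo_matches(nonces, [0, 1, 2], sum(seen.values()))

# A cheating receiver claims packet 3 arrived and must guess its nonce bit.
# A wrong guess is always detected; a right guess is not.
wrong_guess = 1 - nonces[3]
cheat_caught = not echo_matches(nonces, [0, 1, 2, 3], sum(seen.values()) + wrong_guess)
print(honest_ok, cheat_caught)
```

The cheater cannot do better than a coin flip on each concealed packet, which is where the 50% figure in the text comes from.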
The ECN Nonce Echo field need not be an integer; one bit is enough to catch 50% of infractions.

In DCP, the ECN Nonce Echo field is encoded in acknowledgement options. For example, the Ack Vector option comes in two forms, Ack Vector [Nonce 0] (option 37) and Ack Vector [Nonce 1] (option 38), corresponding to the two values for a one-bit ECN Nonce Echo. The Nonce Echo for a given Ack Vector equals the number of received ECN Nonce packets represented by that Ack Vector, taken modulo 2. Only packets marked as State 0 matter for this calculation (that is, received packets that were not ECN marked or dropped in the receive buffer). Every Ack Vector option is detailed enough for the sender to determine what the Nonce Echo should have been. It can check this calculation against the actual Nonce Echo, and complain if there is a mismatch.

(The Ack Vector could conceivably report every ECN Nonce packet, using a separate code point for received ECN Nonces. However, this would limit Ack Vector's compressibility without providing much extra protection.)

Consider a half-connection from DCP A to DCP B. DCP A SHOULD set ECN Nonces on its packets, and remember which packets had nonces, whenever DCP B reports that it is ECN Capable. An ECN-capable endpoint MUST calculate and use the correct value for ECN Nonce Echo when sending acknowledgement options. An ECN-incapable endpoint, however, SHOULD treat the ECN Nonce Echo as always zero.

When a sender detects an ECN Nonce Echo mismatch, it SHOULD behave as if the receiver had reported one or more packets as ECN-marked (instead of unmarked). It MAY take more punitive action, such as resetting the connection.

8. Path MTU Discovery

A DCP implementation should be capable of performing Path MTU (PMTU) discovery, as described in [RFC 1191].
The API to DCP SHOULD allow this mechanism to be disabled in cases where IP fragmentation is preferred. The rest of this section assumes PMTU discovery has not been disabled.

A DCP implementation MUST maintain its idea of the current PMTU for each active DCP session. The PMTU should be initialized from the interface MTU that will be used to send packets.

To perform PMTU discovery, the DCP sender sets the IP Don't Fragment (DF) bit. However, it is undesirable for MTU discovery to occur on the initial connection setup handshake, as the connection setup process may not be representative of packet sizes used during the connection, and performing MTU discovery on the initial handshake might unnecessarily delay connection establishment. Thus, DF SHOULD NOT be set on DCP-Request and DCP-Response packets. In addition, DF SHOULD NOT be set on DCP-Reset packets, although typically these would be small enough to not be a problem. On all other DCP packets, DF SHOULD be set.

Any API to DCP MUST allow the application to discover DCP's current PMTU. DCP applications SHOULD use the API to discover the PMTU, and SHOULD NOT send datagrams that are greater than the PMTU; the only exception to this is if the application disables PMTU discovery. If the application tries to send a packet bigger than the PMTU, the DCP implementation MUST drop the packet and return an appropriate error.

As specified in [RFC 1191], when a router receives a packet with DF set that is larger than the PMTU, it sends an ICMP Destination Unreachable message to the source of the datagram with the Code indicating "fragmentation needed and DF set" (also known as a "Datagram Too Big" message). When a DCP implementation receives a Datagram Too Big message, it decreases its PMTU to the Next-Hop MTU value given in the ICMP message.
If the MTU given in the message is zero, the sender chooses a value for PMTU using the algorithm described in Section 7 of [RFC 1191]. If the MTU given in the message is greater than the current PMTU, the Datagram Too Big message is ignored, as described in [RFC 1191]. (We are aware that this may cause problems for DCP endpoints behind certain firewalls.)

If the DCP implementation has decreased the PMTU, and the sending application attempts to send a packet larger than the new PMTU, the API MUST cause the send to fail, returning an appropriate error to the application, and the application SHOULD then use the API to query the new value of the PMTU. When this occurs, it is possible that the kernel has some packets buffered for transmission that are smaller than the old PMTU, but larger than the new PMTU. The kernel MAY send these packets with the DF bit cleared, or it MAY discard these packets; it MUST NOT transmit these datagrams with the DF bit set.

DCP currently provides no way to increase the PMTU once it has decreased.

A DCP sender MAY optionally treat the reception of an ICMP Datagram Too Big message as an indication that the packet being reported was not lost due to congestion, and so for the purposes of congestion control it MAY ignore the DCP receiver's indication that this packet did not arrive. However, if this is done, then the DCP sender MUST check the ECN bits of the IP header echoed in the ICMP message, and only perform this optimization if these ECN bits indicate that the packet did not experience congestion prior to reaching the router whose MTU it exceeded.

9. Abstract API

TBA

10. DCP and the Congestion Manager

This section will discuss the use of DCP with the Congestion Manager [RFC 3124], when there is a desire to share congestion control among multiple connections between the same pair of source and destination addresses.

TBA

11. DCP and RTP

This section discusses the relationship between DCP and RTP [RFC 1889].

TBA

12. Security Considerations

TBA

13. IANA Considerations

DCP introduces five sets of numbers whose values should be allocated by IANA.

o  32-bit Service Names (Section 3.3).

o  32-bit DCP-Reset Reasons (Section 3.7).

o  8-bit DCP Option Types (Section 4). The CCID-specific options 128 through 255 need not be allocated by IANA.

o  8-bit DCP Feature Numbers (Section 4.3). The CCID-specific features 128 through 255 need not be allocated by IANA.

o  8-bit DCP Congestion Control Identifiers (CCIDs) (Section 5).

In addition, DCP would require a Protocol Number to be added to the registry of Assigned Internet Protocol Numbers.

14. Thanks

There is a wealth of work in this area, including the Congestion Manager. We thank the staff and interns of ACIRI and the members of the End-to-End Research Group for feedback on DCP.

15. References

[CCID 0 PROFILE] E. Kohler. Profile for DCP Congestion Control ID 0: Single-Window Congestion Control. Work in progress.

[CCID 2 PROFILE] S. Floyd, E. Kohler. Profile for DCP Congestion Control ID 2: TCP-like Congestion Control. Work in progress.

[CCID 3 PROFILE] J. Padhye. Profile for DCP Congestion Control ID 3: TFRC Congestion Control. Work in progress.

[RFC 1191] J. C. Mogul, S. E. Deering. Path MTU discovery. RFC 1191.

[RFC 1889] Audio-Video Transport Working Group, H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson. RTP: A Transport Protocol for Real-Time Applications. RFC 1889.

[RFC 2026] S. Bradner. The Internet Standards Process -- Revision 3. RFC 2026.

[RFC 2481] K. Ramakrishnan, S. Floyd. A Proposal to add Explicit Congestion Notification (ECN) to IP. RFC 2481.

[RFC 2960] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, V. Paxson. Stream Control Transmission Protocol. RFC 2960.

[RFC 3124] H. Balakrishnan, S. Seshan. The Congestion Manager. RFC 3124.

[WES01] David Wetherall, David Ely, Neil Spring. Robust ECN Signaling with Nonces. draft-ietf-tsvwg-tcp-nonce-00.txt, work in progress, January 2001.

16. Authors' Addresses

Eddie Kohler
Mark Handley
Sally Floyd
Jitendra Padhye

AT&T Center for Internet Research at ICSI (ACIRI),
ICSI, 1947 Center Street, Suite 600
Berkeley, CA 94704.