Internet Engineering Task Force                                RMT WG
INTERNET-DRAFT                                         Adamson/Macker
draft-macker-rmt-mdp-00.txt                               Newlink/NRL
                                                       22 October 1999
Expires: Apr 2000

             The Multicast Dissemination Protocol (MDP)

Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.  Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Copyright Notice

Copyright (C) The Internet Society (1999).  All Rights Reserved.

Abstract

The Multicast Dissemination Protocol (MDP) is a protocol framework
designed to provide reliable multicast data and file delivery
services on top of the generic UDP/IP multicast transport [1].  MDP
is well suited for reliable multicast bulk transfer of data across a
heterogeneous internetwork.  Further enhancements make the protocol
suitable for a range of network environments, including wireless
internetwork environments.  At its core, MDP is an efficient negative
acknowledgement (NACK) based reliable multicast protocol that
leverages erasure-based coding to improve protocol efficiency and
robustness.  MDP also includes an optional adaptive end-to-end
rate-based congestion control mode that is designed to operate with
competing flows (e.g., TCP sessions or other congestion-aware
flows).  This document describes the protocol messages, building
blocks, general operation, and optional modes of the present MDP
instantiation and implementation.

1.0 Background

This document describes the Multicast Dissemination Protocol (MDP),
a protocol framework for reliable multicast data delivery that is
especially suitable for efficient bulk data transfer.  MDP is
expected to meet many of the criteria described in [2].  The core
MDP framework makes no design assumptions about network structure,
hierarchy, or reciprocal routing paths.  The techniques and building
blocks utilized in MDP are directly applicable to "flat" multicast
groups but could be applied to a given level of a hierarchical
(e.g., tree-based) multicast distribution system if so desired.
Working MDP applications have been demonstrated across a range of
network architectures and heterogeneous conditions, including the
worldwide Internet MBone, bandwidth and routing asymmetries,
satellite networks, and mobile wireless networks.

Previous work on an earlier MDP design was implemented as part of
the freely available Image Multicaster (IMM) reliable multicast
application used and tested over the Internet Multicast Backbone
(MBone) since 1993 [3].  This document describes a more recent
design and implementation of the Multicast Dissemination Protocol
(MDP).  The authors intend the present design to replace previous
MDP work, and this document only references and discusses the recent
design work.
2.0 Protocol Motivation and Design Overview

MDP provides end-to-end reliable transport of data over IP multicast
capable networks.  The primary design goal of MDP is to provide an
efficient, scalable, and robust bulk data (e.g., computer files,
transmission of persistent data) transfer capability adaptable
across heterogeneous networks and topologies.  MDP provides a number
of different reliable multicast services and modes of operation as
described in different parts of this document.  The goal of this
flexible approach is to provide a useful set of reliable multicast
building blocks.  In addition, while the current capabilities of MDP
focus on meeting specific bulk data and limited messaging transfer
requirements, the MDP framework is envisioned to be extended to meet
additional requirements in the future.  The following factors were
important considerations in the MDP design:

1) Heterogeneous, WAN-based networking operation
2) Minimal assumption of network structure for general operation
3) Operation over a wide range of topologies and network link rates
4) Efficient asymmetric operation
5) Low protocol overhead and minimal receiver feedback
6) Potential use in large group sizes
7) Loose timing constraints and minimal group coordination
8) Dynamic group session support (members may leave and join)

The current MDP protocol employs a form of parity-based repair using
packet-level forward error correction coding techniques similar to
the basic concept described in [4].  The use of parity-based repair
for multicast reliability offers significant performance advantages
in the case of uncorrelated packet loss among receivers (such as in
broadcast wireless environments or WAN distribution) [5, 6].  The
technique can also be leveraged to increase the effectiveness of
receiver-based feedback suppression.  These encoded parity packets
are generally sent only "on demand" in response to repair requests
from the receiver group.  However, the protocol can be optionally
configured to transmit some portion of repair packets proactively to
potentially increase protocol performance (throughput and/or delay)
in certain conditions (e.g., "a priori" expected group loss, long
delays, some asymmetric network conditions, etc.) [7].

Another aspect of the MDP protocol design is providing support for
distributed multicast session participation with minimal
coordination among sources and receivers.  The protocol allows
sources and receivers to dynamically join and leave multicast
sessions at will with minimal overhead for control information and
timing synchronization among participants.  To accommodate this
capability, MDP protocol message headers contain some common
information allowing receivers to easily synchronize to sources
throughout the lifetime of a defined session.  These common headers
also include support for collection of transmission timing
information (e.g., round trip delays) that allows MDP to adapt
itself to a wide range of dynamic network conditions with little or
no pre-configuration.  The protocol was purposely designed to be
tolerant of inaccurate timing estimation and lossy conditions, which
may occur in mobile and wireless networks.  The protocol is also
designed to converge even under heavy packet loss and large queueing
or transmission delays.
Scalability concerns in data multicasting have led to a general
increase in interest in, and adoption of, negative acknowledgement
(NACK) based protocol schemes [8].  MDP is a protocol centered on
the use of selective NACKs to request repair of missing data.  MDP
also uses NACK suppression methods and dynamic event timers to
reduce retransmission requests and avoid congestion within the
network.  When used in pure multicast session operation, both NACKs
and repair transmissions are multicast to the group to aid in
feedback and control message suppression.  This feature, along with
additional message aggregation functionality, reduces the likelihood
of multicast control message implosion.  MDP also dynamically
collects group timing information and uses it to further improve its
data delivery efficiency in terms of latency, overhead, and minimal
redundant transmissions.

In summary, the MDP design goals were to create a scalable, reliable
multicast transport protocol capable of operating in heterogeneous
and possibly mobile internetwork environments.  The capability for
fully distributed operation with minimal pre-coordination among the
group, including the ability for participants to join and leave at
any time, was also an important consideration.  MDP is intended to
be suitable primarily for bulk data and file transfer, with eventual
support for streaming and other group data transport paradigms.
While the various features of MDP are designed to provide some
measure of general-purpose utility, we wish here to reemphasize the
importance of understanding that "no one size fits all" in the
reliable multicast transport arena.  There are numerous engineering
tradeoffs involved in reliable multicast transport design, and these
require increased consideration of application and network
architecture requirements.  Some performance requirements affecting
design include: group size, heterogeneity (e.g., capacity and/or
delay), asymmetric delivery, data ordering, delivery delay, group
dynamics, mobility, congestion control, and transport across low
capacity connections.

MDP contains various options to accommodate many of these differing
requirements.  However, MDP is intended to work most efficiently as
a reliable multicast bulk data transfer protocol in environments
where protocol overhead and heterogeneity are primary concerns.
Likely application areas include mobile wireless, asymmetric
satellite, and heterogeneous WAN conditions.  MDP's most general
mode of operation assumes little or no structure in the network
architecture and works in an end-to-end fashion.  This does not
preclude the adaptation of the protocol to more structured
applications (e.g., reliable multicast hierarchy, addition of local
repair mechanisms, sessions spanning multiple groups, etc.).

3.0 MDP Protocol Definition

3.1 Assumptions

An MDP protocol "session" instantiation (MdpSession) is defined by
participants communicating User Datagram Protocol (UDP) packets over
an Internet Protocol (IP) network on a common, pre-determined
network address and host port number.  Generally, the participants
exchange packets on an IP multicast group address, but unicast
transport may also be established or applied as an adjunct to
multicast delivery.  Currently the protocol uses a single multicast
address for transmissions associated with a given MDP session.
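For illustration only, the following sketch shows how a session
participant might join such a multicast group and bind the common
port using the standard BSD sockets API.  The helper name, the
IPv4-only handling, and the minimal error handling are assumptions
of this sketch and are not part of the protocol specification:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Hypothetical helper: join an MdpSession's multicast group on
     * the given address/port.  Returns a bound UDP socket, or -1 on
     * error. */
    int JoinMdpSession(const char *groupAddr, unsigned short port)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        if (sock < 0) return -1;

        struct sockaddr_in local;
        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        local.sin_port = htons(port);
        if (bind(sock, (struct sockaddr *)&local, sizeof(local)) < 0)
        {
            close(sock);
            return -1;
        }

        /* Join the session's pre-determined multicast group. */
        struct ip_mreq mreq;
        mreq.imr_multiaddr.s_addr = inet_addr(groupAddr);
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                       &mreq, sizeof(mreq)) < 0)
        {
            close(sock);
            return -1;
        }
        return sock;
    }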
However, in the future, multiple multicast addresses might be
employed to segregate different degrees of repair information to
different groups of receivers experiencing different packet loss
characteristics with respect to a given source.  This capability is
under ongoing investigation.  The protocol also supports asymmetry:
receiver participants may transmit back to source participants via
unicast routing instead of transmitting to the session multicast
address.

Each participant (MdpNode) within an MdpSession is assumed to have a
preselected, unique 32-bit identifier (MdpNodeId).  Source MdpNodes
MUST have uniquely assigned identifiers within a single MdpSession
to distinguish multiple sources.  Receivers SHOULD have unique
identifiers to avoid certain protocol inefficiencies that may occur,
particularly when operating with congestion control modes enabled or
when using MDP's optional positive acknowledgement feature.  The
protocol does not preclude multiple source nodes actively
transmitting within the context of a single MDP session (i.e.,
many-to-many operation), but any type of interactive coordination
among these sources is assumed to be controlled at a higher protocol
layer.

Unique data content transmitted within an MdpSession uses
source-specific identifiers (MdpObjectTransportId) which are valid
and applicable only during the actual _transport_ of the particular
portion of data content.  Any globally unique identification of
transported data content must be assigned and processed by the
higher level application using the MDP transport service.

[ASIDE:  It is anticipated that if, in the future, MDP is extended
to support local repair mechanisms, the application will interact
with the MDP transport service to process any _global_ data
identifiers which may be required to support such operation.  The
same is true of possible extensions to MDP to support operation as
part of a more highly structured (e.g., tree-based) data
dissemination system.  There are also numerous other supporting
capabilities that may be required to implement certain types of
multicast applications.  While it does not currently support them
specifically, MDP could play a role in the provision of these
services in addition to its role of data transport.  Examples of
such capabilities include security (key distribution, group
authentication) or multicast session management.]

3.2 General MDP Source and Receiver Messaging and Interaction

An MDP source primarily generates messages of type MDP_DATA and
MDP_PARITY, which carry the data content and related parity-based
repair information for the bulk data (or file) objects being
transferred.  The MDP_PARITY information is by default sent only on
demand, thus normally requiring no additional protocol overhead.
The transport of an object can be optionally configured to
proactively transmit some amount of MDP_PARITY messages with the
original MDP data blocks to potentially enhance performance (e.g.,
improved delay).  This configuration MAY be sensible for certain
network conditions and can also allow for robust, asymmetric
multicast (e.g., unidirectional routing, satellite, cable).  A
source message of type MDP_INFO is also defined and is used to carry
any optional "out-of-band" context information for a given transport
object.
The content of MDP_INFO messages is repaired with a lower-delay
process than the general encoded data and thus may serve special
purposes in a reliable multicast application.  The source also
generates messages of type MDP_CMD to perform certain protocol
operations such as congestion control probing, end-of-transmission
flushing, round trip time estimation, optional positive
acknowledgement requests, and "squelch" commands to indicate to
requesting receivers the non-availability of previously-available or
obsolete data.

An MDP receiver generates messages of type MDP_NACK or MDP_ACK in
response to transmissions of data and commands from a source.  The
MDP_NACK messages are generated to request repair of detected data
transmission losses.  Receivers generally detect losses by tracking
the sequence of transmission from a source.  Sequencing information
is embedded in the transmitted data packets and end-of-transmission
commands from the source.  MDP_ACK messages are generated in
response to certain commands transmitted by the source.  In the
general (and most scalable) protocol mode, receivers do not transmit
any MDP_ACK messages.  However, in order to meet potential user
requirements for positive data acknowledgement, and to collect more
detailed information for potential multicast congestion control
algorithms, MDP_ACK messages are defined and potentially used.
MDP_ACK messages are also generated by a small subset of receivers
when MDP dynamic end-to-end congestion control is in operation.

In addition to the messages described above, the protocol defines an
optional MDP_REPORT message that is periodically transmitted by all
source and receiver nodes.  The MDP_REPORT message contains some
additional session level information, such as a string identifying
the node's "name", and provides a mechanism for collecting group
statistics on protocol operation.  The MDP_REPORT messages are not
critical for operation except during use of a current experimental
feature which allows for automated formation of subset receiver
groups participating in positive acknowledgement of data
transmissions.  In the future, the content of MDP_REPORT messages
may be determined by the application, but currently it is fixed to
contain performance statistics reporting.

The current definition of MDP allows for reliable transfer of two
different types of data content.  These include the type
MDP_OBJECT_DATA, which consists of static, persistent blocks of data
content maintained in the source's application memory storage, and
the type MDP_OBJECT_FILE, which corresponds to data stored in the
source's non-volatile file system.  Both of these current types
represent "MdpObjects" of finite size which are encapsulated for
transmission and are temporarily yet uniquely identified by the
given source's MdpNodeId and a temporarily unique
MdpObjectTransportId.

All transmissions by individual sources and receivers are subject to
rate control governed by a peak transmission rate set for each
participant by the application.  This can be used to limit the
quantity of multicast data transmitted by the group.  When MDP's
congestion control algorithm is enabled, the rate for sources is
automatically adjusted.  Even when congestion control is enabled, it
may be desirable in some cases to establish minimum and maximum
bounds for the rate adjustment, depending upon the application.
[ASIDE:  The protocol has been designed with future support
envisioned for data content of type MDP_OBJECT_STREAM which will
correspond to an unbounded "stream" of messages (small and/or large
in size).  Although this behavior can be emulated with transmission
of a series of MdpObjects of type MDP_OBJECT_DATA, there are
additional protocol efficiencies which can be realized with true
"stream" support.  In the long term, the use of the
MDP_OBJECT_STREAM type may supplant the other object types for most
applications.  This document will be updated when that design is
complete.]

3.3 Message Type and Header Definitions

This section describes the message formats used in MDP.  Note that
these messages do not currently adhere to any particular machine
alignment methodology.  During development of this protocol design,
message field alignment has not been explicitly addressed, so it is
likely that some optimization of the protocol message alignment,
resulting in changes to the message formats, will occur in the
future.  Therefore, please note that the message formats presented
here represent the current experimental implementation and are
documented here for purposes of describing the fields'
functionality.  The field values are presented in standard network
byte order (Big Endian) for those fields greater than one byte (8
bits) in length.

3.3.1 MDP Common Message Header

All MDP protocol messages begin with a common header with
information fields as follows:

   +---------+---------------+--------------------------------+
   | Field   | Length (bits) | Purpose                        |
   +---------+---------------+--------------------------------+
   | type    | 8             | MDP message type               |
   +---------+---------------+--------------------------------+
   | version | 8             | Protocol version number        |
   +---------+---------------+--------------------------------+
   | node_id | 32            | Message originator's MdpNodeId |
   +---------+---------------+--------------------------------+

The message "type" field is an 8-bit value indicating the MDP
protocol message type.  These types are defined as follows:

        Message Type    Value
        MDP_REPORT        1
        MDP_INFO          2
        MDP_DATA          3
        MDP_PARITY        4
        MDP_CMD           5
        MDP_NACK          6
        MDP_ACK           7

The "version" field is an 8-bit value indicating the protocol
version number.  Currently, MDP implementations SHOULD ignore
received messages with a different protocol version number.  This
number is intended to indicate and distinguish upgrades of the
protocol that may be non-interoperable.

The "node_id" is a 32-bit value uniquely identifying the source of
the message.  A participant's MDP node identifier (MdpNodeId) can be
set according to the application's needs, but unique identifiers
must be assigned within a single MdpSession.  In many cases, use of
the host IP address can suffice, but in some cases alternative
methodologies for assignment of unique node identifiers within a
multicast session may need to be considered.  For example, the
"source identifier" mechanism defined in the RTPv2 specification [9]
may be applicable for use as MDP node identifiers.  At this point in
time, the protocol makes no assumptions about how these unique
identifiers are actually assigned.
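For illustration, the common header layout above might be
represented and serialized as follows in C.  The type names are
assumptions of this sketch (the document specifies only the wire
fields); because the wire format carries the fields back-to-back
without alignment padding, the sketch serializes field-by-field in
network byte order rather than copying a struct directly:

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    /* MDP message types, per the table above. */
    typedef enum {
        MDP_REPORT = 1,
        MDP_INFO   = 2,
        MDP_DATA   = 3,
        MDP_PARITY = 4,
        MDP_CMD    = 5,
        MDP_NACK   = 6,
        MDP_ACK    = 7
    } MdpMessageType;

    /* Common header: 8-bit type, 8-bit version, 32-bit node_id. */
    typedef struct {
        uint8_t  type;     /* MdpMessageType                  */
        uint8_t  version;  /* protocol version number         */
        uint32_t node_id;  /* originator's MdpNodeId          */
    } MdpCommonHeader;

    /* Illustrative serializer: 6 bytes, network byte order. */
    size_t PackCommonHeader(const MdpCommonHeader *h, uint8_t *buf)
    {
        buf[0] = h->type;
        buf[1] = h->version;
        uint32_t id = htonl(h->node_id);
        memcpy(&buf[2], &id, 4);
        return 6;
    }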
3.3.2 MDP_REPORT Message

The MDP_REPORT message is used to report status information to other
session participants.  This report currently includes node name
information for the purposes of potential bookkeeping, the reporting
address identifier, and a number of statistics relating to the
present source session.  A report includes the following information
and estimates based upon the reporting node's local activity:
duration of session participation, successful transfers, pending
transfers, failed transfers, source re-syncs, block loss statistics,
transmit rate, transmitted NACKs, suppressed NACKs, buffer
utilization, average goodput, and receiver rate.

The MDP_REPORT message is not required for protocol operation, but
provides useful periodic feedback for protocol debugging,
performance monitoring, and statistical estimation.  In addition to
the common header, the MDP_REPORT message contains the following
fields:

   +---------+---------------+-------------------------------+
   | Field   | Length (bits) | Purpose                       |
   +---------+---------------+-------------------------------+
   | status  | 8             | Reporting node's status flags |
   +---------+---------------+-------------------------------+
   | flavor  | 8             | Type of MDP_REPORT message    |
   +---------+---------------+-------------------------------+
   | content | --            | Flavor-dependent content      |
   +---------+---------------+-------------------------------+

The "status" field contains flags indicating the sending node's
current operating mode.  The flags currently defined include:

   +------------+-------+------------------------------------------+
   | Flag       | Value | Purpose                                  |
   +------------+-------+------------------------------------------+
   | MDP_CLIENT | 0x01  | Node is participating as a client        |
   |            |       | (receiver)                               |
   +------------+-------+------------------------------------------+
   | MDP_SERVER | 0x02  | Node is participating as a server        |
   |            |       | (source)                                 |
   +------------+-------+------------------------------------------+
   | MDP_ACKING | 0x04  | Node wishes to provide positive          |
   |            |       | acknowledgements                         |
   +------------+-------+------------------------------------------+

The MDP_CLIENT and MDP_SERVER flags indicate the reporting node's
levels of participation in the corresponding MdpSession.  The
MDP_ACKING flag is set by reporting nodes wishing to participate in
positive acknowledgement cycles.  This is not a robust mechanism for
forming positive acknowledgement receiver subsets and is provided
for experimental purposes.

The "flavor" field indicates the type of MDP_REPORT message.
Currently, only one type of MDP_REPORT is defined:
MDP_REPORT_HELLO (value = 1).  This report contains a string with
the "name" of the reporting MdpNode and a detailed periodic
reception statistics report.

(TBD) The contents of the "client_stats" field will be described in
the future.  (In the present implementation, this field includes
transmission/reception data rates, object transfer
successes/failures, goodput measurement, buffer utilization/overrun
reports, loss rate histograms, etc.)

3.3.3 MDP_INFO Message

The object information message is used by sources to announce a
small amount of "out-of-band" information regarding an object in
transport.  The information content must fit within the source's
current maximum "segment_size" setting.  Since the MDP_INFO content
is entirely contained within a single MDP message, it allows for a
shorter turn-around time for receivers to "NACK" for the information
and for the source to subsequently provide a repair retransmission
(NACK aggregation is greatly simplified under this condition).
There are several uses envisioned for the MDP_INFO content in
general multicast applications.  For example, MDP_INFO packets may
be useful for carrying _global_ object identifiers used as part of
an SRM-like local repair protocol [10] embedded within MDP.
MDP_INFO may also be useful for reliable multicast session
management purposes.  Additionally, when MDP_OBJECT_STREAM objects
are introduced, the attached MDP_INFO may be useful for providing
context information (MIME-type info, etc.) for the corresponding
stream.  Note that the availability of MDP_INFO for a given object
is optional.  A flag in the header of MDP_DATA and MDP_PARITY
packets indicates the availability of MDP_INFO for a given transport
object.  In addition to the MDP common message header, MDP_INFO
messages contain the following fields:

   +--------------+---------------+---------------------------------+
   | Field        | Length (bits) | Purpose                         |
   +--------------+---------------+---------------------------------+
   | sequence     | 16            | Packet loss detection sequence  |
   |              |               | number                          |
   +--------------+---------------+---------------------------------+
   | object_id    | 32            | MdpObjectTransportId identifier |
   +--------------+---------------+---------------------------------+
   | object_size  | 32            | Size of object (in bytes)       |
   +--------------+---------------+---------------------------------+
   | ndata        | 8             | Source's FEC data block size    |
   +--------------+---------------+---------------------------------+
   | nparity      | 8             | Maximum available parity per    |
   |              |               | block                           |
   +--------------+---------------+---------------------------------+
   | flags        | 8             | Object transmission flags       |
   +--------------+---------------+---------------------------------+
   | grtt         | 8             | Quantized current source GRTT   |
   |              |               | estimate                        |
   +--------------+---------------+---------------------------------+
   | segment_size | 16            | Source maximum segment payload  |
   |              |               | (bytes)                         |
   +--------------+---------------+---------------------------------+
   | data         | --            | Info content (up to             |
   |              |               | "segment_size" bytes)           |
   +--------------+---------------+---------------------------------+

The "sequence" field is used by MDP receivers for calculating a
running estimate of packet loss for feedback to the MDP source in
support of MDP's automatic congestion control technique.  The 16-bit
sequence number increases monotonically with each packet transmitted
by an MDP source and rolls over when the maximum value is reached.
This sequence number increases independently of specific MdpObject
transmission or repair.

The "object_id" field is a monotonically and incrementally
increasing value assigned by a source to the object being
transmitted.  Transmissions and repair requests related to that
object use the same "object_id" value.  For sessions of very long
duration, the "object_id" field may wrap, but it is presumed that
the 32-bit field size provides an adequate sequence space to prevent
temporary object confusion amongst receivers and sources (i.e.,
receivers SHOULD re-synchronize with a server upon receiving object
sequence identifiers sufficiently out-of-range with respect to the
current state kept for a given source).  During the course of
transmission within an MDP session, an object is uniquely identified
by the concatenation of the source "node_id" and the given
"object_id".
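This document does not mandate a particular out-of-range test.  As
one possible approach, a receiver could use two's-complement "serial
number" arithmetic (in the spirit of RFC 1982) to compare 32-bit
object identifiers across wrap; the window size below is purely an
illustrative assumption:

    #include <stdint.h>

    /* Signed distance from object id 'b' to 'a' in the circular
     * 32-bit sequence space.  Positive => 'a' is "newer". */
    static int32_t ObjectIdDelta(uint32_t a, uint32_t b)
    {
        return (int32_t)(a - b);
    }

    /* A receiver might re-synchronize when a received id falls far
     * outside the window of state kept for the source.  The window
     * size is an illustrative assumption, not from this document. */
    #define MDP_RESYNC_WINDOW 256

    int ShouldResync(uint32_t rxObjectId, uint32_t lastKnownId)
    {
        int32_t delta = ObjectIdDelta(rxObjectId, lastKnownId);
        return (delta < -MDP_RESYNC_WINDOW) ||
               (delta >  MDP_RESYNC_WINDOW);
    }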
The "object_size" field indicates the size of the given transport object in bytes. Note that as MDP is extended to include "stream" objects of indeterminate length, a corresponding flag in the flags field will indicate the non-validity of the object_size field (or it may assume another use, e.g. additional sequencing informa- tion). The "ndata" and "nparity" fields are used by the source to adver- tise its current FEC encoding parameters, the number of MDP_DATA segments per coding block and number of available MDP_PARITY seg- ments for repair per block, respectively. The "flags" field is used to advertise information about current object transmission status. Defined flags currently include: Adamson, Macker Expires April 2000 [Page 12] Internet Draft The Multicast Dissemination Protocol October 1999 +--------------------+-------+------------------------------------------+ | Flag | Value | Purpose | +--------------------+-------+------------------------------------------+ |MDP_FLAG_REPAIR | 0x01 | Indicates message is a repair transmis- | | | | sion | +--------------------+-------+------------------------------------------+ |MDP_FLAG_BLOCK_END | 0x02 | Indicates end of coding block transmis- | | | | sion | +--------------------+-------+------------------------------------------+ |MDP_FLAG_RUNT | 0x04 | Indicates message size is less than seg- | | | | ment_size (applies to MDP_DATA messages | | | | only) | +--------------------+-------+------------------------------------------+ |MDP_FLAG_INFO | 0x10 | Indicates availability of MDP_INFO for | | | | object | +--------------------+-------+------------------------------------------+ |MDP_FLAG_UNRELIABLE | 0x20 | Indicates that repair transmissions for | | | | the specified object will be unavail- | | | | able. (One-shot, best effort transmis- | | | | sion) | +--------------------+-------+------------------------------------------+ |MDP_FLAG_FILE | 0x80 | Indicates object is "file-based" data | | | | (hint to use disk storage for reception) | +--------------------+-------+------------------------------------------+ The "grtt" field contains a quantized representation of the source- based current estimate of greatest round trip transmission delay time for the group. The value is in units of microseconds and is quantized using the following C function: unsigned char QuantizeGrtt(double grtt) { if (grtt > 1.0e03) grtt = 1.0e03; else if (grtt < 1.0e-06) grtt = 1.0e-06; if (grtt < 3.3e-05) return ((unsigned char)(grtt * 1.0e06) - 1); else return ((unsigned char)(ceil(255.0.- (13.0 * log(1.0e03/grtt))))); } Note that this function is useful for quantizing GRTT times in the range of 1 microsecond to 1000 seconds. MDP implementations may wish to further constrain GRTT estimates for practical reasons. The "segment_size" field indicates the source's current setting for Adamson, Macker Expires April 2000 [Page 13] Internet Draft The Multicast Dissemination Protocol October 1999 maximum message payload content (in bytes). Knowledge of this value allows an MDP receiver to allocate appropriate buffering resources. The "data" field of the MDP_INFO packet contains the information set for this object by the source MDP application. MdpObjects of type MDP_OBJECT_FILE use this field for file name information. An application may use this field for its own purposes for MdpObjects of type MDP_OBJECT_DATA. Furthermore, it is possible that data content of one "segment_size" or less may be entirely represented by a single MDP_INFO packet. 
The "segment_size" field indicates the source's current setting for
maximum message payload content (in bytes).  Knowledge of this value
allows an MDP receiver to allocate appropriate buffering resources.

The "data" field of the MDP_INFO packet contains the information set
for this object by the source MDP application.  MdpObjects of type
MDP_OBJECT_FILE use this field for file name information.  An
application may use this field for its own purposes for MdpObjects
of type MDP_OBJECT_DATA.  Furthermore, it is possible that data
content of one "segment_size" or less may be entirely represented by
a single MDP_INFO packet.  The advantage of this approach for small
messaging purposes is a more rapid repair retransmission cycle.  The
disadvantage is that FEC-based repair is not available for MDP_INFO
messages.  A number of uses for this special message type are
anticipated in potential applications of MDP.  The definition and
use of MDP_INFO as applied to objects of type MDP_OBJECT_FILE,
MDP_OBJECT_DATA, and MDP_OBJECT_STREAM will be further refined in
the future.

3.3.4 MDP_DATA

The MDP_DATA message is used for carrying multicast user object data
content within a session.  In addition to the common header, it
includes the following fields:

   +---------------+---------------+---------------------------------+
   | Field         | Length (bits) | Purpose                         |
   +---------------+---------------+---------------------------------+
   | sequence      | 16            | Packet loss detection sequence  |
   |               |               | number                          |
   +---------------+---------------+---------------------------------+
   | object_id     | 32            | MdpObjectTransportId identifier |
   +---------------+---------------+---------------------------------+
   | object_size   | 32            | Size of object (in bytes)       |
   +---------------+---------------+---------------------------------+
   | ndata         | 8             | Source's FEC data block size    |
   +---------------+---------------+---------------------------------+
   | nparity       | 8             | Maximum available parity per    |
   |               |               | block                           |
   +---------------+---------------+---------------------------------+
   | flags         | 8             | Object transmission flags       |
   +---------------+---------------+---------------------------------+
   | grtt          | 8             | Quantized current source GRTT   |
   |               |               | estimate                        |
   +---------------+---------------+---------------------------------+
   | offset        | 32            | Data content's "offset" within  |
   |               |               | object                          |
   +---------------+---------------+---------------------------------+
   | segment_size* | 16            | Source maximum segment payload  |
   |               |               | (bytes)                         |
   +---------------+---------------+---------------------------------+
   | data          | --            | Data content (up to             |
   |               |               | "segment_size" bytes)           |
   +---------------+---------------+---------------------------------+

   *The "segment_size" field is only present in MDP_DATA packets
    which are less than the source's current "segment_size" setting
    (i.e., for the last ordinal segment of an object).

Note that many of the fields and their use are the same as for the
MDP_INFO message type.  Receivers can synchronize to sources and
begin receiving reliable multicast content upon the reception of
MDP_INFO, MDP_DATA, or MDP_PARITY.  Some provision is made in the
present protocol implementation to prevent dynamically joining
receivers from significantly slowing the forward progress of an
ongoing source session.  This will be discussed in further detail
later.  There are only three fields in the MDP_DATA message which
differ from the MDP_INFO message.

The "offset" field is provided to indicate the position (in bytes)
of the MDP_DATA message's data content with respect to the beginning
of the object (offset zero).  For example, for a file object this
corresponds to the "seek" offset, and for static data objects it
corresponds to the offset from the base pointer to memory storage.

For MDP_DATA messages, the "segment_size" field is present (and
required) only for payloads shorter than the segment size.
The MDP_FLAG_RUNT flag in the "flags" field indicates the presence
of a "segment_size" indicator.  For other (non-runt) packets, the
source's "segment_size" setting can be implicitly determined from
the size of the received message.

The MDP_DATA "data" field simply contains the data content for the
indicated portion of the associated transport object.

3.3.5 MDP_PARITY

The MDP_PARITY message is used for parity-based repair messages.  It
is similar to the MDP_DATA message.  In addition to the common
header, it includes the following fields:

   +-------------+---------------+----------------------------------+
   | Field       | Length (bits) | Purpose                          |
   +-------------+---------------+----------------------------------+
   | sequence    | 16            | Packet loss detection sequence   |
   |             |               | number                           |
   +-------------+---------------+----------------------------------+
   | object_id   | 32            | Object transport identifier      |
   +-------------+---------------+----------------------------------+
   | object_size | 32            | Size of object (in bytes)        |
   +-------------+---------------+----------------------------------+
   | ndata       | 8             | Source FEC block size            |
   +-------------+---------------+----------------------------------+
   | nparity     | 8             | Maximum available parity         |
   +-------------+---------------+----------------------------------+
   | flags       | 8             | Object transmission flags        |
   +-------------+---------------+----------------------------------+
   | grtt        | 8             | Quantized current source GRTT    |
   |             |               | estimate                         |
   +-------------+---------------+----------------------------------+
   | offset      | 32            | "Offset" of applicable coding    |
   |             |               | block                            |
   +-------------+---------------+----------------------------------+
   | parity_id   | 8             | Parity segment id                |
   +-------------+---------------+----------------------------------+
   | data        | --            | Parity content ("segment_size"   |
   |             |               | bytes)                           |
   +-------------+---------------+----------------------------------+

With the exception of the occasional need for the "segment_size"
field, a slightly different use of the "offset" field, and the
presence of the "parity_id" field, all of the fields in the
MDP_PARITY message are the same and have the same use as the
corresponding fields described for the MDP_DATA message type.  The
source's "segment_size" setting can always be implicitly determined
from MDP_PARITY messages since the parity payload is always
"segment_size" bytes.

The "offset" field is used to indicate the offset of the first data
segment of the FEC coding block for which the parity repair message
has been calculated.

The "parity_id" field is used to indicate the position within the
source's FEC coding block of the parity segment content contained in
the message.  Note that it will always be a non-zero value, since
any valid coding block will always have at least one segment of data
content.  MDP could make use of other erasure-based coding schemes,
but presently the implementation we are describing uses Reed-Solomon
coding [11], and the description will be limited to application of
that coding approach.

The "data" field contains the Reed-Solomon parity information for
the coding block position indicated by the "parity_id" field.  These
parity packets are calculated using an 8-bit word size Reed-Solomon
forward error correction code, with each byte of the message
corresponding to the same byte position of the associated coding
block data content (i.e., code blocks are "striped" over the payload
content portion of MDP_DATA and MDP_PARITY messages).
code blocks are "striped" over the payload content portion of MDP_DATA and MDP_PAR- ITY messages). For MdpObjects whose last block is shortened because the object size is not an even multiple of the coding block size ("ndata") and source "segment_size", zero-value padding is assumed for short (runt) data messages and the Reed-Solomon encod- ing is further shortened for the last data coding block according to the object's size. 3.3.6 MDP_CMD MDP_CMD messages are generated by sources within a session to ini- tiate or respond to various protocol actions. Different MDP_CMD types are identified by an 8-bit "flavor" field in each MDP_CMD messages. The size and content of MDP_CMD messages vary depending upon their type. The following source command types are currently defined: Adamson, Macker Expires April 2000 [Page 17] Internet Draft The Multicast Dissemination Protocol October 1999 +-----------------+--------+----------------------------------+ | Command | Flavor | Purpose | +-----------------+--------+----------------------------------+ |MDP_CMD_FLUSH | 1 | Indicates source temporary or | | | | permanent end-of-transmission | | | | cycle. (Can assist in robustly | | | | initiating NACK repair requests | | | | from receivers). | +-----------------+--------+----------------------------------+ |MDP_CMD_SQUELCH | 2 | Indicates obsolete object for | | | | which a NACK has been received. | +-----------------+--------+----------------------------------+ |MDP_CMD_ACK_REQ | 3 | Requests positive acknowledge- | | | | ment of a specific object from a | | | | specific list of receivers. | +-----------------+--------+----------------------------------+ |MDP_CMD_GRTT_REQ | 4 | Probe used in collection of | | | | source's group GRTT estimate and | | | | congestion control feedback. | +-----------------+--------+----------------------------------+ 3.3.6.1 MDP_CMD_FLUSH The MDP_CMD_FLUSH command type is used when an MDP source has com- pleted transmission of all data it has pending to send and contains the following fields in addition to the MDP common message header: +------------+---------------+----------------------------------+ | Field | Length (bits) | Purpose | +------------+---------------+----------------------------------+ | sequence | 16 | Packet loss detection sequence | | | | number | +------------+---------------+----------------------------------+ | flavor | 8 | MDP_CMD type (value = 1) | +------------+---------------+----------------------------------+ | object_id | 32 | MdpObjectTransportId of most | | | | recent MdpObject for which the | | | | source has completed transmis- | | | | sion | +------------+---------------+----------------------------------+ The "sequence" and "flavor" fields serve the purposes previously described. The "object_id" field indicates the last MdpObject for which the source completed transmission. This allows this message to initiate repair requests from any receivers with missing data content or completely missing MdpObjects. The process by which the source uses this message to "flush" the receiver set for repairs is Adamson, Macker Expires April 2000 [Page 18] Internet Draft The Multicast Dissemination Protocol October 1999 described later in this document. 3.3.6.2 MDP_CMD_SQUELCH The MDP_CMD_SQUELCH command type is used by the source in response to a repair request for invalid data. A receiver might make such a request after a severe network outage and it allows source applica- tions to "dequeue" data which the application no longer wishes to provide. 
3.3.6 MDP_CMD

MDP_CMD messages are generated by sources within a session to
initiate or respond to various protocol actions.  Different MDP_CMD
types are identified by an 8-bit "flavor" field in each MDP_CMD
message.  The size and content of MDP_CMD messages vary depending
upon their type.  The following source command types are currently
defined:

   +------------------+--------+----------------------------------+
   | Command          | Flavor | Purpose                          |
   +------------------+--------+----------------------------------+
   | MDP_CMD_FLUSH    | 1      | Indicates source temporary or    |
   |                  |        | permanent end-of-transmission    |
   |                  |        | cycle.  (Can assist in robustly  |
   |                  |        | initiating NACK repair requests  |
   |                  |        | from receivers.)                 |
   +------------------+--------+----------------------------------+
   | MDP_CMD_SQUELCH  | 2      | Indicates obsolete object for    |
   |                  |        | which a NACK has been received.  |
   +------------------+--------+----------------------------------+
   | MDP_CMD_ACK_REQ  | 3      | Requests positive acknowledge-   |
   |                  |        | ment of a specific object from a |
   |                  |        | specific list of receivers.      |
   +------------------+--------+----------------------------------+
   | MDP_CMD_GRTT_REQ | 4      | Probe used in collection of      |
   |                  |        | source's group GRTT estimate and |
   |                  |        | congestion control feedback.     |
   +------------------+--------+----------------------------------+

3.3.6.1 MDP_CMD_FLUSH

The MDP_CMD_FLUSH command type is used when an MDP source has
completed transmission of all data it has pending to send.  It
contains the following fields in addition to the MDP common message
header:

   +-----------+---------------+----------------------------------+
   | Field     | Length (bits) | Purpose                          |
   +-----------+---------------+----------------------------------+
   | sequence  | 16            | Packet loss detection sequence   |
   |           |               | number                           |
   +-----------+---------------+----------------------------------+
   | flavor    | 8             | MDP_CMD type (value = 1)         |
   +-----------+---------------+----------------------------------+
   | object_id | 32            | MdpObjectTransportId of most     |
   |           |               | recent MdpObject for which the   |
   |           |               | source has completed             |
   |           |               | transmission                     |
   +-----------+---------------+----------------------------------+

The "sequence" and "flavor" fields serve the purposes previously
described.  The "object_id" field indicates the last MdpObject for
which the source completed transmission.  This allows the message to
initiate repair requests from any receivers with missing data
content or entirely missing MdpObjects.  The process by which the
source uses this message to "flush" the receiver set for repairs is
described later in this document.

3.3.6.2 MDP_CMD_SQUELCH

The MDP_CMD_SQUELCH command type is used by the source in response
to a repair request for invalid data.  A receiver might make such a
request after a severe network outage; the command also allows
source applications to "dequeue" data which the application no
longer wishes to provide.  In either case, receivers should stop
requesting repair of MdpObjects for which MDP_CMD_SQUELCH commands
are received.  The MDP_CMD_SQUELCH contains the following fields in
addition to the MDP common message header:

   +-----------+---------------+----------------------------------+
   | Field     | Length (bits) | Purpose                          |
   +-----------+---------------+----------------------------------+
   | sequence  | 16            | Packet loss detection sequence   |
   |           |               | number                           |
   +-----------+---------------+----------------------------------+
   | flavor    | 8             | MDP_CMD type (value = 2)         |
   +-----------+---------------+----------------------------------+
   | object_id | 32            | MdpObjectTransportId of the      |
   |           |               | MdpObject for which repair       |
   |           |               | requests should be terminated.   |
   +-----------+---------------+----------------------------------+

The "sequence" and "flavor" fields serve the purposes previously
described.  The "object_id" field indicates the MdpObject for which
the receiver should stop requesting repair.

3.3.6.3 MDP_CMD_ACK_REQ

The MDP_CMD_ACK_REQ command can optionally be used by an MDP source
to request explicit positive object receipts from a subset of
receivers.  This message contains the following fields in addition
to the MDP common message header:

   +-----------+---------------+----------------------------------+
   | Field     | Length (bits) | Purpose                          |
   +-----------+---------------+----------------------------------+
   | sequence  | 16            | Packet loss detection sequence   |
   |           |               | number                           |
   +-----------+---------------+----------------------------------+
   | flavor    | 8             | MDP_CMD type (value = 3)         |
   +-----------+---------------+----------------------------------+
   | object_id | 32            | MdpObjectTransportId of          |
   |           |               | MdpObject for which positive     |
   |           |               | acknowledgement of complete      |
   |           |               | reception is requested.          |
   +-----------+---------------+----------------------------------+
   | data      | --            | List of receiver MdpNodeIds from |
   |           |               | which positive acknowledgement   |
   |           |               | is requested (source             |
   |           |               | "segment_size" is maximum        |
   |           |               | "data" size)                     |
   +-----------+---------------+----------------------------------+

The "sequence" and "flavor" fields serve the purposes previously
described.

The "object_id" field indicates the MdpObject for which the
indicated receivers should provide positive acknowledgement (via an
MDP_ACK message) of reception.

The "data" field contains a list of MdpNodeIds from which the source
expects to receive positive acknowledgement of reception.  The
maximum size of the "data" field is limited by the source's
"segment_size" setting.  Thus, for large group sizes, it is possible
that the positive acknowledgement process may take multiple
"rounds".  This process is described in detail in a later section of
this document.

3.3.6.4 MDP_CMD_GRTT_REQ

The MDP_CMD_GRTT_REQ command is periodically transmitted by an
active source in order to collect responses from receivers to attain
a running estimate of round trip packet transmission delays and
other statistics for protocol operation.  The process by which
MDP_CMD_GRTT_REQ messages are sent, and how responses are obtained
from receivers for modes of operation with and without dynamic
congestion control enabled, is described in detail later.
This message type contains the following fields in addition to the
MDP common message header:

   +-----------+---------------+----------------------------------+
   | Field     | Length (bits) | Purpose                          |
   +-----------+---------------+----------------------------------+
   | sequence  | 16            | Packet loss detection sequence   |
   |           |               | number                           |
   +-----------+---------------+----------------------------------+
   | flavor    | 8             | MDP_CMD type (value = 4)         |
   +-----------+---------------+----------------------------------+
   | flags     | 8             | GRTT request flags               |
   +-----------+---------------+----------------------------------+
   | grtt_seq  | 8             | GRTT_REQ sequence identifier     |
   +-----------+---------------+----------------------------------+
   | send_time | 64            | Timestamp reference of when this |
   |           |               | message was sent by the source.  |
   +-----------+---------------+----------------------------------+
   | hold_time | 64            | Receiver response window (time   |
   |           |               | window over which receivers      |
   |           |               | should spread their responses)   |
   +-----------+---------------+----------------------------------+
   | tx_rate   | 32            | Current source transmit rate     |
   |           |               | (bytes/sec)                      |
   +-----------+---------------+----------------------------------+
   | rtt       | 8             | Bottleneck node round trip time  |
   |           |               | estimate                         |
   +-----------+---------------+----------------------------------+
   | loss      | 16            | Bottleneck node packet loss      |
   |           |               | estimate                         |
   +-----------+---------------+----------------------------------+
   | data      | --            | List of representative           |
   |           |               | MdpNodeIds from which explicit   |
   |           |               | (non-wildcard) acknowledgement   |
   |           |               | is requested (source             |
   |           |               | "segment_size" is maximum        |
   |           |               | "data" size)                     |
   +-----------+---------------+----------------------------------+

The "sequence" and "flavor" fields serve the purposes previously
described.

The "flags" field currently has one possible flag value defined:
MDP_CMD_GRTT_FLAG_WILDCARD (value = 0x01), which is used during MDP
congestion control operation to mark MDP_CMD_GRTT_REQ messages to
which _all_ MDP receivers (regardless of their "representative"
status) should explicitly respond via an MDP_ACK message.

The "grtt_seq" field is a sequence number which is incremented each
time the source transmits an MDP_CMD_GRTT_REQ command.  This field
is used in responses from receivers to identify the specific
MDP_CMD_GRTT_REQ message to which the receiver response applies.

The "send_time" field is a precision timestamp indicating the time
that the MDP_CMD_GRTT_REQ message was transmitted.  This consists of
a 64-bit field containing 32 bits with the time in seconds and 32
bits with the time in microseconds since some reference time the
source maintains (usually 00:00:00, 1 January 1970).

The "hold_time" field is in the same format as the "send_time"
field.  (Note: It is likely that this will be quantized to an 8-bit
value in a future revision using the same algorithm previously
described for source GRTT advertisements in other messages.)  The
"hold_time" instructs receivers over what window of time they should
distribute any explicit responses to the MDP_CMD_GRTT_REQ command.

The "tx_rate" field indicates the source's current transmission rate
in units of bytes per second.
This information is used by receivers as part of MDP's rate-based
congestion control algorithm, which is described in detail later in
this document.

The "rtt" field indicates the round trip delay time measured for the
current "bottleneck" congestion control representative node.  This
information is used by receivers as part of MDP's congestion control
algorithm, which is described in detail later.  This 8-bit value is
a quantized representation of the delay using the same quantization
algorithm described for the GRTT estimate advertised in MDP_INFO,
MDP_DATA, and MDP_PARITY messages.

The "loss" field indicates the loss fraction measured for the
current "bottleneck" congestion control representative node.  This
16-bit value represents the loss fraction on a scale of 0.0 to 1.0,
where the decimal loss fraction can be obtained from the formula:

    loss_fraction = "loss" / 65535.0

This information is also used by receivers as part of MDP's
congestion control algorithm, which is described in detail later.

The "data" field of the MDP_CMD_GRTT_REQ message contains a list of
MdpNodeIds indicating the receiver nodes which are currently
selected by the source to serve as congestion control
representatives.  These listed nodes should explicitly respond to
the MDP_CMD_GRTT_REQ with an MDP_ACK message, randomly within the
"hold_time" indicated.  More details on the congestion control
approach are described later.

3.3.7 MDP_NACK

MDP_NACK messages are transmitted by MDP receivers in response to
the detection of missing data in the sequence of transmissions
received from a particular source.  The specific times and
conditions under which receivers will generate and transmit these
MDP_NACK messages are governed by the processes described in detail
later in this document.  The payload of MDP_NACK messages contains a
list of "ObjectNACKs" for different objects and portions of those
objects.  In addition to the common message header, MDP_NACK
messages contain the following fields:

   +-------------------+---------------+---------------------------------+
   | Field             | Length (bits) | Purpose                         |
   +-------------------+---------------+---------------------------------+
   | server_id         | 32            | MdpNodeId of source for which   |
   |                   |               | NACK is intended                |
   +-------------------+---------------+---------------------------------+
   | grtt_response     | 64            | Response to source's            |
   |                   |               | MDP_CMD_GRTT_REQ, if any (zero  |
   |                   |               | value if none)                  |
   +-------------------+---------------+---------------------------------+
   | loss_estimate     | 16            | Current packet loss estimate    |
   |                   |               | for the indicated source.       |
   +-------------------+---------------+---------------------------------+
   | grtt_req_sequence | 8             | Sequence number identifier of   |
   |                   |               | applicable MDP_CMD_GRTT_REQ     |
   +-------------------+---------------+---------------------------------+
   | data              | --            | ObjectNACK list                 |
   +-------------------+---------------+---------------------------------+

The "server_id" field identifies the source to which the MDP_NACK
message is destined.  Other sources should ignore this message.
(Note that this is another reason why multiple potential sources
within an MDP session MUST have unique MdpNodeIds.)

The "grtt_response" field contains a timestamp indicating the time
at which the MDP_NACK was transmitted.  The format of this timestamp
is the same as the "send_time" field of the MDP_CMD_GRTT_REQ.
However, note that the "grtt_response" timestamp is _relative_ to
the "send_time" the source provided with the corresponding
MDP_CMD_GRTT_REQ command.  The receiver adjusts the source's
MDP_CMD_GRTT_REQ "send_time" timestamp by the time differential from
when the receiver received the MDP_CMD_GRTT_REQ to when the MDP_NACK
is transmitted, in order to calculate the value placed in the
"grtt_response" field.  The following formula applies:

    "grtt_response" = request "send_time" +
                      request_to_response_differential

If the "grtt_response" has a ZERO value, this indicates that the
receiver has not yet received an MDP_CMD_GRTT_REQ command from the
source, and the source should ignore this portion of the response.

The "loss_estimate" field is the receiver's current packet loss
fraction estimate for the indicated source.  The loss fraction is a
value from 0.0 to 1.0 corresponding to a range of zero to 100
percent packet loss.  The 16-bit "loss_estimate" value is calculated
by the following formula:

    "loss_estimate" = decimal_loss_fraction * 65535.0

The "grtt_req_sequence" field contains the sequence number
identifier of the received MDP_CMD_GRTT_REQ to which the response
information in this MDP_NACK applies.
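The calculations above can be summarized in code.  The following
sketch uses the BSD timeradd()/timersub() macros for the timestamp
adjustment; the function names are illustrative, as this document
specifies only the formulas:

    #include <stdint.h>
    #include <sys/time.h>

    /* grtt_response = request send_time + (time from receiving the
     * MDP_CMD_GRTT_REQ to transmitting this NACK or ACK). */
    struct timeval GrttResponse(struct timeval send_time,
                                struct timeval req_rx_time,
                                struct timeval resp_tx_time)
    {
        struct timeval diff, out;
        timersub(&resp_tx_time, &req_rx_time, &diff);
        timeradd(&send_time, &diff, &out);
        return out;
    }

    /* 16-bit loss encoding: loss_estimate = fraction * 65535.0 */
    uint16_t EncodeLoss(double fraction)  /* 0.0 <= fraction <= 1.0 */
    {
        return (uint16_t)(fraction * 65535.0);
    }

    double DecodeLoss(uint16_t loss_estimate)
    {
        return (double)loss_estimate / 65535.0;
    }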
The "data" field of the MDP_NACK contains the list of ObjectNACKs
for different source MdpObjects.  Note that ObjectNACKs for multiple
objects may be contained in one MDP_NACK message and that each
ObjectNACK consists of a hierarchical set of indicators and bit
masks, depending upon what data the receiver has detected is
missing.  Each ObjectNACK in the list contained in the MDP_NACK
"data" field is made up of the following fields:

   +-----------+---------------+----------------------------------+
   | Field     | Length (bits) | Purpose                          |
   +-----------+---------------+----------------------------------+
   | object_id | 32            | MdpObjectTransportId of object   |
   |           |               | for enclosed RepairRequests      |
   +-----------+---------------+----------------------------------+
   | nack_len  | 16            | Total length (in bytes) of       |
   |           |               | RepairRequests for indicated     |
   |           |               | object.                          |
   +-----------+---------------+----------------------------------+
   | data      | --            | RepairRequest list               |
   +-----------+---------------+----------------------------------+

The content in the data field of an ObjectNACK consists of a list of
individual "RepairRequests" for the indicated MdpObject.  There are
multiple types of RepairRequests, each of which begins with an 8-bit
type field with one of the following values:

   +-----------------+------+------------------------------------+
   | RepairRequest   | Type | Purpose                            |
   +-----------------+------+------------------------------------+
   | REPAIR_SEGMENTS | 1    | Indicates receiver is missing      |
   |                 |      | portions of an encoding block.     |
   +-----------------+------+------------------------------------+
   | REPAIR_BLOCKS   | 2    | Indicates receiver is missing      |
   |                 |      | some blocks in entirety.           |
   +-----------------+------+------------------------------------+
   | REPAIR_INFO     | 3    | Indicates receiver requires        |
   |                 |      | retransmission of MDP_INFO for     |
   |                 |      | object.                            |
   +-----------------+------+------------------------------------+
   | REPAIR_OBJECT   | 4    | Indicates receiver requires        |
   |                 |      | retransmission of entire object.   |
   +-----------------+------+------------------------------------+

A REPAIR_SEGMENTS RepairRequest identifies the beginning of the
coding block (by its offset) and then provides a bit mask indicating
which segments within that block require retransmission.  A count of
the total number of missing segments (erasures) is also provided.
Thus, the following fields comprise a REPAIR_SEGMENTS RepairRequest:

   +----------+---------------+------------------------------------+
   | Field    | Length (bits) | Purpose                            |
   +----------+---------------+------------------------------------+
   | type     | 8             | value = 1 (REPAIR_SEGMENTS)        |
   +----------+---------------+------------------------------------+
   | nerasure | 8             | Count of missing segments in the   |
   |          |               | block.                             |
   +----------+---------------+------------------------------------+
   | offset   | 32            | Offset of applicable coding block. |
   +----------+---------------+------------------------------------+
   | mask_len | 16            | Length of attached bit mask (in    |
   |          |               | bytes)                             |
   +----------+---------------+------------------------------------+
   | mask     | --            | Bit mask content                   |
   +----------+---------------+------------------------------------+

The REPAIR_BLOCKS RepairRequest identifies the beginning of a set of
FEC coding blocks (by the initial offset) and then provides a bit
mask indicating which coding blocks require retransmission in
entirety.  The following fields make up a REPAIR_BLOCKS
RepairRequest:

   +----------+---------------+------------------------------------+
   | Field    | Length (bits) | Purpose                            |
   +----------+---------------+------------------------------------+
   | type     | 8             | value = 2 (REPAIR_BLOCKS)          |
   +----------+---------------+------------------------------------+
   | offset   | 32            | Offset of initial coding block     |
   +----------+---------------+------------------------------------+
   | mask_len | 16            | Length of attached bit mask (in    |
   |          |               | bytes)                             |
   +----------+---------------+------------------------------------+
   | mask     | --            | Bit mask content                   |
   +----------+---------------+------------------------------------+

The REPAIR_INFO RepairRequest implicitly identifies by its type that
the receiver requires retransmission of the MDP_INFO associated with
an object and thus consists of a single byte:

   +-------+---------------+----------------------------------+
   | Field | Length (bits) | Purpose                          |
   +-------+---------------+----------------------------------+
   | type  | 8             | value = 3 (REPAIR_INFO)          |
   +-------+---------------+----------------------------------+

The REPAIR_OBJECT RepairRequest is also very simple and likewise
consists of a single byte to request retransmission of an entire MDP
transport object:

   +-------+---------------+----------------------------------+
   | Field | Length (bits) | Purpose                          |
   +-------+---------------+----------------------------------+
   | type  | 8             | value = 4 (REPAIR_OBJECT)        |
   +-------+---------------+----------------------------------+
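As an illustration of how a receiver might populate a
REPAIR_SEGMENTS request, the following sketch builds the "mask",
"mask_len", and "nerasure" values from a per-segment record of
missing segments.  The MSB-first bit ordering within each mask byte
is an assumption of this sketch; this document does not specify it:

    #include <stdint.h>
    #include <string.h>

    /* Build a REPAIR_SEGMENTS bit mask: one bit per segment of the
     * coding block, set when that segment is missing.  Returns the
     * mask length in bytes (the "mask_len" field value). */
    size_t BuildSegmentMask(const int missing[], size_t ndata,
                            uint8_t *mask, uint8_t *nerasure)
    {
        size_t maskLen = (ndata + 7) / 8;
        memset(mask, 0, maskLen);
        *nerasure = 0;
        for (size_t i = 0; i < ndata; i++) {
            if (missing[i]) {
                mask[i >> 3] |= (uint8_t)(0x80 >> (i & 7)); /* MSB-first */
                (*nerasure)++;
            }
        }
        return maskLen;
    }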
In addition to the MDP common message header, the following fields make up MDP_ACK messages:

+--------------------+---------------+----------------------------------+
| Field              | Length (bits) | Purpose                          |
+--------------------+---------------+----------------------------------+
| server_id          | 32            | MdpNodeId of source for which    |
|                    |               | ACK is intended                  |
+--------------------+---------------+----------------------------------+
| grtt_response      | 64            | Response to source's             |
|                    |               | MDP_CMD_GRTT_REQ, if any (zero   |
|                    |               | value if none)                   |
+--------------------+---------------+----------------------------------+
| loss_estimate      | 16            | Current packet loss estimate for |
|                    |               | the indicated source.            |
+--------------------+---------------+----------------------------------+
| grtt_req_sequence  | 8             | Sequence number identifier of    |
|                    |               | applicable MDP_CMD_GRTT_REQ      |
+--------------------+---------------+----------------------------------+
| type               | 8             | Type of MDP_ACK message.         |
+--------------------+---------------+----------------------------------+
| object_id          | 32            | Applicable MdpObjectTransportId, |
|                    |               | if any                           |
+--------------------+---------------+----------------------------------+

The "server_id", "grtt_response", "loss_estimate", and "grtt_req_sequence" fields serve the same purpose as the corresponding fields in MDP_NACK messages. The "type" field identifies the type of MDP_ACK and is one of the following values:

+-------------------+------+----------------------------------+
| MDP_ACK Variation | Type | Purpose                          |
+-------------------+------+----------------------------------+
| MDP_ACK_OBJECT    | 1    | Positive acknowledgement of      |
|                   |      | receipt of a particular          |
|                   |      | transport object.                |
+-------------------+------+----------------------------------+
| MDP_ACK_GRTT      | 2    | Indicates that the MDP_ACK is    |
|                   |      | simply a response to a GRTT_REQ  |
|                   |      | command.                         |
+-------------------+------+----------------------------------+

The MDP_ACK_OBJECT acknowledgement type is used to indicate that the receiver has successfully received all transport objects up to and including the object sequence number identified in the "object_id" field. Like MDP_NACK messages, all MDP_ACK responses from receivers contain an embedded response to GRTT_REQ commands from MDP sources. However, the MDP_ACK_GRTT acknowledgement type is also provided to support explicit collection of a GRTT estimate from the group (or potentially a subset of the group). This is used in MDP's congestion control algorithm, described in detail later.

4.0 Detailed Protocol Operation

The following sequence of events roughly describes the general, steady-state operation of the MDP protocol from the perspective of a source transmitting data to a group of receivers. This sequence of events can be used as a guide when reading the subsequent detailed descriptions of the individual portions of the protocol.

1) The source periodically sends out MDP_CMD_GRTT_REQ probes and transmits messages of type MDP_INFO, MDP_DATA, and optionally some amount of MDP_PARITY. The "object_id" and "offset" fields of these messages monotonically increase in sequence.

2) Receivers "synchronize" to the source upon receipt of MDP_INFO, MDP_DATA, or MDP_PARITY.
Receivers will not request repair for objects in sequence prior to the point of synchronization.

3) Receivers monitor the sequence of transmission for any "missing" data and initiate NACK repair cycles using the algorithms described below.

4) The source aggregates the content of received repair requests and transmits appropriate repair, normally using messages of type MDP_PARITY.

5) Transmission of new data is completely interleaved with repair transmissions so the source has no "dead" time.

6) When congestion control is enabled, a dynamically changing subset of the receivers is instructed to quickly, explicitly ACK the MDP_CMD_GRTT_REQ probes. Feedback can also be requested from the entire group over a longer period of time.

7) The source may also initiate optional positive acknowledgement from a subset (or possibly the entire) group of receivers.

4.1 Source Transmission

In the current MDP implementation, protocol activity within a session is initiated by the transmission of data by a source node. The data is comprised of serialized segments of objects enqueued for transmission by the source application. An object is currently defined as static data of fixed, pre-determined size stored in a file or in memory at the source node. In the future, stream objects of indefinite size will likely be supported in the MDP toolkit, but we reserve that discussion for another time.

The rate and format of transmission of the data content of an MDP object is determined by a number of source protocol parameters set by the application. These parameters include the "transmit_rate", "segment_size", availability of out-of-band information (MDP_INFO) for an object, "block_size", "max_parity", and "auto_parity".

The "transmit_rate" parameter governs the peak rate at which data is transmitted by a source MDP node in units of bits per second. If the MDP application has data enqueued for transmission, the source will transmit packets in aggregate at or below this application-defined "transmit_rate". The total transmissions by the source, including the data, repairs, and commands, are governed by this parameter.

The "segment_size" parameter determines the maximum MDP message payload size the source uses for transmissions. (Note that UDP packet payload sizes will be slightly larger than the "segment_size" setting since there is additional MDP protocol overhead in the message formats previously described.) MDP transport objects are fragmented into MDP_DATA messages of "segment_size" bytes. Note that where the transmitted object does not fragment into an exact number of "segment_size" messages, a short ("runt") MDP_DATA message will be transmitted at the end of the object.

The MDP source application has the option of setting and advertising a small amount of out-of-band information ("info") for each object enqueued for transmission. For example, in a file transfer application, MIME-type information and/or name identification for file content might be embedded in the "info" portion of an MDP transport object. The amount of "info" may be up to "segment_size" bytes according to the source's settings. Thus the "info" associated with an object can be transmitted in a single MDP message. As will be discussed later, this allows for more responsive repair than that of the bulk data content.
If the "info" is set, the transmission of an object is initiated by sending an MDP_INFO mes- sage. This is followed by the object data content as follows. The "block_size" and "max_parity" parameters affect how the source calculates, maintains, and transmits parity-based repair messages. The present MDP design uses shortened, 8-bit symbol-based Reed Adamson, Macker Expires April 2000 [Page 29] Internet Draft The Multicast Dissemination Protocol October 1999 Solomon encoding methods to construct repair vectors (packets) based on a block of data vectors (packets). The "block_size" parameter corresponds to the number of MDP_DATA messages per Reed- Solomon encoding block while the "max_parity" parameter corresponds to the number of repair (MDP_PARITY) messages the source calculates and maintains per block. So, for standard (N,k) nomenclature to describe the resulting shortened Reed-Solomon code, N = (block_size + max_parity) and k = block_size. Note that, in addition to the ending "runt" MDP_DATA message, it is likely that transport objects will not often fragment to an exact number of encoding blocks. Thus, for the "short" ending block (containing less than "block_size" MDP_DATA segments), the MDP Reed-Solomon encoder assumes zero-padding of the "runt" message and the calculation of the parity vectors is truncated to a further shortened code for that block. Note that the truncation of parity calculation does not impact the erasure repairing capabilities of the resulting code. In a current MDP implementation, the source incrementally calcu- lates and buffers parity information for the sequence of MDP_DATA messages it transmits. At the end of each encoding block, the source MAY optionally transmit a number of MDP_PARITY repair mes- sages according to the value of the "auto_parity" parameter. While this parameter is by default ZERO for pure reactive retransmission repairing, some network topologies, scenarios, and applications may benefit from implicit transmission or hybrid proactive/reactive repairing of lost packets. The ability of any multicast receiver to fill any erasure within an encoding block with any one MDP_PAR- ITY packet allows for performance gains in some environments and the potential for robust data delivery in cases of uni-directional or asymmetric network connectivity (e.g. broadcast satellite com- munication system). Transmitted coding blocks are identified by using the "offset" field contained within MDP_DATA and MDP_PARITY messages. The fol- lowing integer calculation can be used to identify the coding block with which an MDP_DATA or MDP_PARITY message is associated: block_id = "offset" / ("block_size" * "segment_size"); Recall that MDP_PARITY messages have the "parity_id" field to uniquely identify to which portion of the Reed-Solomon parity con- tent the given message corresponds. This methodology allows receivers to efficiently maintain state for decoding of a block when a sufficient quantity of MDP_DATA and MDP_PARITY messages are received (i.e. a total of "block_size" unique MDP_DATA and/or MDP_PARITY messages for a given coding block). Adamson, Macker Expires April 2000 [Page 30] Internet Draft The Multicast Dissemination Protocol October 1999 The MDP source sequentially transmits transport objects with incre- mentally increasing transport "object_id" values. The content position of MDP_DATA and MDP_PARITY messages is contained and implied in the value of the offset fields in those messages. 
This offset sequencing information is used by MDP receivers to trigger repair requests at the end of each source encoding block and at the transition of transmission from one object to the next. Additionally, when the source reaches the end of the application-enqueued transmission objects, it begins periodically transmitting MDP_CMD_FLUSH command messages to notify receivers of the end of an active transmission period and to additionally prompt them for repair requests. In the absence of subsequent enqueued object transmission, the MDP_CMD_FLUSH messages are sent once every 2*GRTT seconds until a repair request is received from a receiver or until a maximum message count according to a preset robustness factor (FLUSH_ROBUSTNESS_COUNT) is reached (currently a default of 20 flush messages).

4.2 Receiver Synchronization

Upon reception of MDP_INFO or MDP_DATA messages from a new source, an MDP receiver will "synchronize" with the MDP source by beginning to maintain state on the source with the object segmentation and encoding parameters and current transmission sequencing information embedded in the received messages. For this reason, this information is embedded in all MDP_INFO, MDP_DATA, and MDP_PARITY messages transmitted by the source. In the current MDP implementation, if new data is received from a source indicating a change in the value of these parameters, the receiver drops its current state on the source and re-synchronizes to that source, so it is generally expected that these parameters will not change during the lifetime of an MdpSession. Note that it is possible for multiple sources to co-exist within the context of an MDP session and that each source may maintain its own independent set of transmission parameters.

MDP receivers also use their own "transmit_rate" parameter to govern their peak rate of protocol transmissions. However, the quantity of transmissions normally required from an MDP receiver is very low.

The MDP implementation limits the conditions under which receivers will synchronize to a source to prevent the source from being restricted in forward transmission progress in environments with very large group sizes and active group join/leave dynamics. For example, at present, receivers will not synchronize to a source upon receipt of MDP_DATA messages which are part of a repair transmission or upon receipt of MDP_PARITY messages, and receivers are currently only allowed to synchronize during data transmission of the first encoding block of a transport object. Also, receivers will not request repairs for objects earlier in sequence than the object at which they established synchronization to the source. MDP implementations may desire further refinement of these synchronization policy features for different applications and requirements.

The MDP receiver maintains a "synchronization window"; if the current transmission sequence from the source exceeds the bounds of this window, the receiver will re-synchronize with the source and not attempt further repair of earlier objects. The constraints on this window can be established according to application policy and/or the amount of buffering space the application is willing to allocate for a given MdpSession and specific source. Note that for efficiency, this window should well exceed the expected worst-case delay-bandwidth product for the network topology the group is utilizing.
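A loudly-hedged sketch of the synchronization window check follows; the window depth, the state names, and the comparison itself are our illustrative assumptions and not the implementation's actual policy code:

    /* Assumed window depth, in objects; real deployments would size this
     * from buffer policy and the delay-bandwidth product noted above. */
    #define SYNC_WINDOW_DEPTH 64

    /* Returns nonzero if the receiver should drop repair state and
     * re-synchronize (object ids are assumed monotonically increasing). */
    static int needs_resync(unsigned long current_object_id,
                            unsigned long sync_object_id)
    {
        return (current_object_id - sync_object_id) > SYNC_WINDOW_DEPTH;
    }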
4.3 Receiver NACK Process

Once a receiver has "synchronized" with a source, it begins tracking the sequence of transmission using the "object_id" and "offset" fields contained in the data and commands sent by the source. If the receiver detects missing data from the source at the end of an encoding block, at the end of an object transmission, or upon receipt of an MDP_CMD_FLUSH command, it initiates a process to request repair transmissions from the source. Note that the end-of-block or end-of-object boundaries are detected either explicitly by the MDP_FLAG_BLOCK_END indicator in a received message or implicitly by the receipt of data beyond the last incompletely received block. The repair cycle for requesting retransmission of missing MDP_INFO for an object can begin immediately since it is "out-of-band" to the MDP parity encoding process. MDP receivers should consider the MDP_INFO content to be the first "virtual" block of the corresponding MdpObject.

The receiver-initiated repair process will also begin upon a longer-term timeout based on a lack of received packets from a previously-active source. This longer-term timeout should be set to (2.0 * GRTT * FLUSH_ROBUSTNESS_COUNT), which corresponds to the period of MDP_CMD_FLUSH message transmission conducted before a source goes inactive. Receiver implementations SHOULD set reasonable bounds on minimum and maximum values for this source "inactivity" timeout. Receivers SHOULD also limit the number of "inactivity" timeout refreshes so as not to go into a mode of infinite NACKing in the case where the source or network connectivity has completely failed.

To initiate the repair request process and to facilitate the suppression of redundant NACK responses, the receiver begins a random hold-off timeout to delay immediate response to a source node upon detecting loss. Thus, if another NACK for the same (or more) repair information arrives at the receiver (or the repair information itself) before the timeout ends, the receiver will suppress its transmission of an MDP_NACK message. Note that after transmission or suppression of the NACK occurs, another timeout is used to allow some amount of time for the source to respond to the repair request before again initiating an additional repair request process. The initial hold-off timeout is randomly picked from an exponential distribution from ZERO to GRTT seconds. For large multicast group sizes, this generally allows for a significant level of NACK suppression while maintaining reasonably small delays in the repair of data transmissions. The extension of the potential hold-off window to the order of (1 * GRTT) allows for general worst-case receiver-to-receiver transmission delays assuming symmetric unicast routing among nodes in the multicast group. The secondary hold-off timeout after NACK transmission/suppression is fixed at (4 * GRTT) to allow reasonable time for the source to receive a NACK, possibly aggregate multiple NACKs, and begin providing repair messages back to the receiver.
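The timer behavior described above can be summarized with the following C sketch. This document does not specify the exact shape of the exponential distribution, so the truncated-exponential draw and its rate constant below are illustrative assumptions:

    #include <math.h>
    #include <stdlib.h>

    /* Initial NACK hold-off: a random draw from an exponential
     * distribution truncated to [0, grtt] seconds (inverse-CDF method).
     * The rate constant 'lambda' is an assumed value. */
    static double nack_holdoff(double grtt)
    {
        double u = (double)rand() / ((double)RAND_MAX + 1.0); /* [0,1) */
        double lambda = 4.0 / grtt;
        return -log(1.0 - u * (1.0 - exp(-lambda * grtt))) / lambda;
    }

    /* Secondary hold-off after NACK transmission or suppression. */
    static double repair_holdoff(double grtt)
    {
        return 4.0 * grtt;
    }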
As the receiver receives transmissions of MDP_INFO, MDP_DATA, and MDP_PARITY from a source, it attempts to maintain the state of completion for all received objects. The receiver may be buffer-limited, so priority is given to the earliest objects within the "synchronization window" previously described. Additionally, the receiver keeps track of the most recently detected "object_id" and "offset" sequencing index information received from the source. For each incomplete transport object up to this current transmission index, the receiver constructs an ObjectNACK which is included in the payload of the MDP_NACK message. It is critical to the efficiency and convergence of the protocol that the NACK content consist only of repair requests for transmission sequences earlier than the most recently detected source transmission position. This prevents receivers from redundantly requesting repair for data the source may already be intending to transmit (e.g. based on repair requests from other receivers that the receiver in question did not receive). The effect of this controlled NACK process is to "rewind" the source to the earliest required repair point without redundant requests for repair being serviced. This sequencing keeps the new data transmission and repair process within a minimal bound of source and receiver buffer space given the current delay-bandwidth product of the network topology. (The GRTT measurement process and subsequent timers based on GRTT work to maximize the efficiency of transmission and NACK suppression while maintaining minimal repair latency.)

The content of receiver MDP_NACK messages depends upon the repair needs of the requesting receiver. The ObjectNACK consists of a list of RepairRequests for retransmission of repair messages for missing data segments within individual encoding blocks, retransmission of coding blocks in entirety, retransmission of the info content, or possibly retransmission of the entire object. Entire object retransmission is requested with the MDP_REPAIR_OBJECT RepairRequest previously described. Otherwise, partial repairs of the object are requested using a combination of the MDP_REPAIR_SEGMENTS, MDP_REPAIR_BLOCKS, and MDP_REPAIR_INFO RepairRequests. Note that these different types of repair requests can be viewed as a hierarchy to elicit different types of repair transmission behavior. These RepairRequests are constructed as follows:

The MDP_REPAIR_SEGMENTS RepairRequest identifies the encoding block to which the RepairRequest applies, the total number of erasures (missing data segments) in the block, and provides a bit mask indicating which specific segments the receiver requires to repair the encoding block. The MDP protocol leverages the use of parity-based repair by requesting transmission of parity repair messages whenever possible. For example, if the source uses a coding setting of 20 data segments per coding block (block_size = 20) and calculates 20 parity segments per block (max_parity = 20), the receiver will always request transmission of parity for repair. Only when the receiver is missing a greater number of data segments than available parity will the receiver request explicit retransmission of data segments. And then, the receiver only requests retransmission of the minimal number of data segments (those with the highest offset values) to repair what parity alone cannot cover. This methodology allows the source to transmit a minimal number of redundant data packets and leverages the use of parity packet erasure repairing since any one parity segment can repair any one missing data segment at any receiver.
Thus, even if some requested parity packets are lost during the source's transmission of repair, some receivers (those who observed less than the group maximum loss) will likely be able to completely repair a block from the combination of received repair messages. This is also why all MDP_REPAIR_SEGMENTS RepairRequests contain the explicit bit mask marking specific missing segments in addition to the erasure (missing segment) total count for the applicable coding block.

Another feature of MDP is that the source attempts to send previously untransmitted parity segments whenever possible. This also provides efficiency gains since receivers are not required to request and receive explicit segments (data or parity) which they are missing on subsequent iterative repair cycles. This approach provides benefit when parity packets get lost or dropped and the source does not know which receivers missed which parity packets during subsequent repairs. The source makes use of the erasure count provided in the receiver MDP_REPAIR_SEGMENTS RepairRequests to efficiently perform this function, which is why the count is provided in addition to the bit mask content. The length of the bit mask content of the RepairRequest is equal to (block_size + max_parity) bits padded out to an integral number of bytes. A value of one in the bit mask marks the requested segment. The "mask_len" field is provided so that other nodes can easily and safely parse the content of the MDP_NACK.

The MDP_REPAIR_BLOCKS RepairRequest is used by receivers to request retransmission of coding blocks missing in entirety. In typical applications of the protocol, this should occur infrequently, such as in cases of intermittent network outages, during the "short" coding blocks at the end of object transmissions (or for very small objects), and/or in cases of very severe packet loss. The format of the MDP_REPAIR_BLOCKS RepairRequest is similar to that of the MDP_REPAIR_SEGMENTS. An "offset" field is used to indicate the first coding block (as computed using the "block_id" formula presented earlier) and a bit mask is provided with values of one indicating which blocks require retransmission. The source retransmits the entire set of data segments for the encoding blocks requested, including any configured quantity of "auto_parity".

The MDP_REPAIR_INFO RepairRequest is used by the receiver to request retransmission of the available info the source has attached to the transport object. This retransmission is only requested if the source has advertised the availability of info for the object via the MDP_FLAG_INFO flag in other messages transmitted for the given transport object.

The MDP_REPAIR_OBJECT RepairRequest indicates that the receiver requires retransmission of an entire transport object. As with the MDP_REPAIR_BLOCKS request, this will typically be a rare occurrence, except in the case of very small objects (a few "segment_size" or less in length) and/or intermittent network outages or heavy packet loss. MDP receivers maintain a check on the integrity of the sequencing of transport object ids from a source in order to make these requests. This allows MDP to treat the sequence of MdpObjects as a "pseudo-stream" of transmission for which integrity must be maintained.
For some applications in large scale, loosely-controlled data distribution environments, it may be beneficial to have an option to disable this degree of reception integrity checking. The current MDP implementation maintains a window depth for this integrity check before resynchronizing to a source, and the source maintains a finite history of data available for retransmission. These parameters are, or will be, settable in MDP implementations.

Once an MDP_NACK message has been transmitted (or a decision to suppress transmission has been made), the receiver inhibits itself from initiating another repair request cycle for the given source for a period of (4*GRTT) seconds based on the GRTT estimate advertised by the source. This allows time for the repair request to propagate to the source, for the source to aggregate possible MDP_NACKs received from multiple receivers, and for the source responses to the repair request(s) to begin being received.

4.4 Source NACK Aggregation and Repair

Upon receipt of an MDP_NACK for a specific object from a receiver, the source parses and records the RepairRequests and begins a hold-off timeout for a period of (2*GRTT) seconds before it responds to the repair requests. This allows sufficient time to receive and aggregate possible additional RepairRequests from other receivers. The reason for the (2*GRTT) hold-off time is to consider the worst-case condition where a receiver very near the source immediately (random NACK backoff timeout of ZERO seconds) sends an MDP_NACK to begin a repair cycle while a receiver very far from the source (up to possibly one GRTT away in worst-case asymmetry) has a random NACK backoff timeout of GRTT seconds. Note that during this repair response hold-off time, the source will still continue to transmit data for new or other objects pending repair.

The exception to the hold-off timeout is that retransmission of MDP_INFO messages occurs almost immediately upon receipt of the MDP_NACK message. Note, however, that repeat retransmission of duplicate MDP_INFO is restricted to once per (2*GRTT). The source can retransmit the MDP_INFO repair quickly because there is no need to aggregate multiple repair requests to make a determination of what to transmit for maximum efficiency. This added responsiveness of the repair cycle for MDP_INFO messages makes this out-of-band control information potentially useful in the context of reliable multicast session control or for certain types of multicast application data.

Once the repair aggregation hold-off timeout has ended, the source MUST transmit repair information beginning with the lowest ordinal sequence transport object and coding block. It is critical to the convergence of the protocol that the repair transmissions be conducted in this order. Strict adherence to the ordering of repair allows repairs to be conducted within the constraints of source/receiver state buffering. Note that having a good approximation of GRTT lets repairs be conducted in the most efficient and timely manner possible. The NACK process is designed to force the source to "rewind" to the earliest possible repair position in the sequence of transmission so it does not move too far "ahead" of receivers suffering loss.
It is possible that, due to packet loss patterns, processing delays or another anomalies, MDP_NACK messages may arrive "late" from some receivers after repair transmission of a transport object has already begun. The MDP source MUST immediately incorporate these late-arriving repair requests into the actively transmitting object as appropriate and continue with the repair transmissions. The appropriate incorporation of late-arriving requests is to _only_ mark segments or blocks greater than the source's current position in the sequence (segment offset, object_id) of transmission. Then, as needed, receivers will initiate new repair cycles to recover information lost during repair transmissions. An important feature of the MDP protocol is that the source maxi- mizes the use of the parity segments it has calculated for repair transmissions. For example, if during an initial repair cycle for an object, receivers have requested only a portion of the available parity segments, the source will use parity segments from the remaining unused portion for repair transmissions during subsequent repair cycles for the same encoding block. There is a significant gain to this approach since the receiver parity decoding process can fill a certain number of missing data segments (erasures) with any combination of the same number of parity segments. Thus, when multiple repair cycles are required to complete reliable transmis- sion of an encoding block, receivers are freed from the difficulty of requesting an explicit set of parity segments due to lost parity transmission. With a sufficient number of parity segments calcu- lated by the source and nominal packet loss, the source may never need to send the same segments twice, thus maximizing the use of the parity information and minimizing the reception of redundant data among receivers. 4.5 General GRTT Collection Process (without congestion control) To facilitate more efficient protocol operation over different net- work topologies with varying end-to-end delay characteristics, the MDP protocol dynamically collects information and estimates the greatest round trip time (GRTT) packet propagation delay from the source to the other receiver nodes participating in the reliable multicast session. This information is collected and the GRTT is estimated in the following manner. The source periodically transmits an MDP_CMD_GRTT_REQ containing a timestamp (relative to an internal clock at the source). Receivers record the timestamp ("send_time") of the latest MDP_CMD_GRTT_REQ received from a source and the time at which the request was received (recv_time). These times are used by receivers to Adamson, Macker Expires April 2000 [Page 37] Internet Draft The Multicast Dissemination Protocol October 1999 construct a response. When the receiver responds to a MDP_CMD_GRTT_REQ, it embeds a timestamp in the response message calculated with the following formula: "grtt_response" = "send_time" + (current_time - recv_time) where the "send_time" is the timestamp from the last MDP_CMD_GRTT_REQ received from the source and the (current_time - recv_time) is the amount of time differential since that request was received until the receiver generated this response. In the current MDP implementation this "grtt_response" field is contained within MDP_NACK and MDP_ACK messages so that in the general NACK- based operation of the protocol, only receivers sending MDP_NACK messages for repair requests contribute to the estimation of source-to-group GRTT. 
If a receiver is experiencing perfect reception and never NACKs, it will not participate in the GRTT-driven repair process anyway. This allows for relatively efficient and scalable collection of round trip estimates from the pertinent members of the group (those with the worst packet loss). The protocol message formats and code base do support an option to have all or a designated subset of receivers explicitly acknowledge the MDP_CMD_GRTT_REQ messages so that a more accurate estimate of total group GRTT can be collected. This collection method option is utilized when the MDP rate-based congestion control algorithm is enabled.

The source processes the GRTT response by calculating a current round trip estimate for the receiver from whom the response was received using the following formula:

receiver_rtt = current_time - "grtt_response"

During the current GRTT probing interval, the source keeps the peak round trip estimate from the responses it has received. The GRTT estimate is presently filtered to remain conservatively biased towards the greatest receiver RTT measurements received. A conservative estimate of GRTT maximizes the efficiency of redundant NACK suppression and aggregation. The update to the source's estimate of GRTT is done observing the following rules:

1) If a receiver's response round trip calculation is greater than the current GRTT estimate AND the current peak, the response value is immediately fed into the GRTT update filter given below. In any case, the source records the "peak" receiver RTT measurement for the current probe interval.

2) At the end of the response collection period (i.e. the GRTT probe interval), if the recorded "peak" response is less than the current GRTT estimate AND this is the third consecutive collection period with a peak less than the current GRTT estimate, the recorded peak is fed into the GRTT update filter. (Otherwise, Rule #1 was implicitly applied and no new update is required.)

3) At the end of the response collection period, the peak tracking value is set to ZERO if the "peak" is greater than or equal to the current GRTT estimate (i.e. it was already incorporated into the filter under Rule #1), or kept the same if its value is less than the current GRTT estimate AND was not yet incorporated into the GRTT update filter according to Rule #2. Thus, for decreases in the source's estimate of GRTT, the "peak" is tracked across three consecutive probe intervals.

The current MDP implementation uses the following GRTT update filter to incorporate new peak responses into the GRTT estimate:

    if (peak > current_estimate)
        current_estimate = 0.25 * current_estimate + 0.75 * peak;
    else
        current_estimate = 0.75 * current_estimate + 0.25 * peak;

This update method is biased towards maintaining an estimate of the worst-case round trip delay. The GRTT estimate is reduced only after 3 consecutive collection periods with smaller response peaks in order to be conservative where packet loss may have resulted in lost response messages. The reduction is then additionally weighted conservatively using the averaging filter above.
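The timestamp exchange and update filter above can be restated as a pair of C helpers. These are sketches of the formulas given in the text; clock handling and units are simplified assumptions:

    /* Receiver side: build the "grtt_response" timestamp from the last
     * request "send_time" and the local holding time. */
    static double grtt_response(double send_time, double recv_time,
                                double current_time)
    {
        return send_time + (current_time - recv_time);
    }

    /* Source side: fold a new "peak" RTT measurement into the current
     * GRTT estimate with the biased filter shown above. */
    static double grtt_update(double current_estimate, double peak)
    {
        if (peak > current_estimate)
            return 0.25 * current_estimate + 0.75 * peak;  /* grow fast */
        else
            return 0.75 * current_estimate + 0.25 * peak;  /* shrink slow */
    }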
The GRTT collection period (and the period of MDP_CMD_GRTT_REQ transmission) is currently fixed at once per 40 seconds after the source startup phase. During initial source startup, the GRTT collection period is varied from a short period of 5 seconds up to the steady-state collection period of 40 seconds. An algorithm may be developed to adjust the GRTT collection period dynamically in response to the current GRTT estimate (or variations in it) and to an estimation of packet loss. Thus, if the GRTT estimate is stable and unchanging, the overhead of probing with MDP_CMD_GRTT_REQ messages can be reduced, while during periods of variation the GRTT estimate might track more accurately with correspondingly shorter GRTT collection periods.

In summary, although the MDP repair cycle timeouts are based on GRTT, it should be noted that convergent operation of the protocol does not strictly depend on accurate GRTT estimation. The current mechanism has proved sufficient in simulations and in the environments in which MDP has been deployed to date. The estimate provided by the algorithm appears to track the peak envelope of actual GRTT (including operating system effects as well as network delays) even in relatively high loss connectivity. The steady-state probing/update interval may potentially be varied to accommodate different levels of expected network dynamics in different environments.

4.6 Automatic Congestion Control

The MDP design presently has an option for automatic rate-based congestion control of source nodes. The theory of operation is loosely based on concepts presented in [12], [13], and [14]. A major goal of the congestion control approach is to fairly share available network capacity with other ongoing MDP and TCP sessions. MDP transmissions are subject to rate control in its fixed-rate mode of operation. The approach taken to congestion control automatically adjusts an MDP source transmission rate according to feedback it receives from receiver nodes.

MDP uses the model of TCP throughput as described in [15] to establish a goal for its transmission rate. This model estimates the rate at which a TCP source would transmit given estimates of round trip delay, delay variation, and packet loss. This TCP model can be described by the following equation:

                                PacketSize
    B = --------------------------------------------------------------
        RTT*sqrt(2*b*p/3) + T0*min(1, 3*sqrt(3*b*p/8))*p*(1 + 32*p^2)

where:

    B          = Resulting predicted rate in units of bytes per second,
    PacketSize = Nominal transmitted packet size,
    RTT        = Estimate of round trip packet delay in seconds,
    p          = Estimate of packet loss fraction,
    T0         = Applicable TCP retransmission timeout,
    b          = Number of packets acknowledged by a TCP ACK.

The current MDP implementation uses the source "segment_size" plus the overhead of an MDP_DATA message for the "PacketSize" parameter. Measurements of round trip packet delay ("RTT") and delay variation (TCP's "T0" is a function of "RTT" and delay variation) and estimates of packet loss ("p") are obtained from receivers within the group. A fixed value of one is assumed for "b" in the current implementation. A goal rate is established by determining the lowest rate among the different receivers using the equation above with the metrics obtained.
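Expressed as a single C function, the model above (after [15]) might look like the following sketch; the function name and the guard against zero measured loss are our own additions, not part of the MDP code base:

    #include <math.h>

    /* Predicted TCP-fair rate in bytes per second; returns a sentinel
     * when p is zero since the model is undefined for lossless paths. */
    static double tcp_model_rate(double packet_size, double rtt,
                                 double t0, double p, double b)
    {
        if (p <= 0.0)
            return -1.0;  /* caller must special-case no measured loss */
        double f = 3.0 * sqrt(3.0 * b * p / 8.0);
        double denom = rtt * sqrt(2.0 * b * p / 3.0)
                     + t0 * (f < 1.0 ? f : 1.0) * p * (1.0 + 32.0 * p * p);
        return packet_size / denom;
    }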
By using this goal rate and a "linear increase"/"exponential decrease" rate adjustment algorithm (described in detail below), the MDP congestion control algorithm can determine available network capacity and fairly share it with TCP or other transport flows (e.g. other MDP flows) with similar behavior. Simulations and limited empirical tests over networks to date have been used to validate this approach [16].

MDP attempts to maintain "worst path" fairness as described in [17], even under dynamic conditions, by rapidly probing an appropriate subset of the receiver set to determine the current worst-path receiver (according to the above model's predicted rate). The rapid probing allows MDP to quickly take advantage of newly available network capacity and rapidly reduce its transmission rate in the face of congestion. Group size scalability makes it prohibitive to elicit rapid response from the entire receiver set, so the source selects a subset of receivers (a default of 5 in the current implementation) spanning a dynamic estimate of the most significant multicast topology "bottlenecks".

The subset of rapidly probed receivers are termed the congestion control "representatives". The composition of the "representative" set dynamically changes during the course of source transmission based on feedback from the group at large. The current algorithm for selecting and maintaining the congestion control representative set is described below. It is important to note that the MDP probing and rate adjustment algorithms have features to work through periods of intermittent source transmissions and feedback (or lack thereof) from the representative set and group at large (e.g. MDP rapidly reduces its rate when its current "bottleneck" representative is unresponsive to avoid congestion collapse).

At the present time, encouraging results from simulations and limited empirical tests have been obtained using the MDP congestion control approach described in this section. However, multicast congestion control is a complex and still emerging area which will greatly benefit from further study and investigation.

4.6.1 Source Congestion Control Probing

This section describes the technique by which the current MDP implementation transmits congestion control "probes" in the form of MDP_CMD_GRTT_REQ messages. The receivers respond to these probes with information embedded in MDP_NACK and MDP_ACK messages as described previously. At the present time, this probing is conducted in separate MDP_CMD_GRTT_REQ messages, but it is possible that the probing may be aggregated into selected MDP_DATA or MDP_PARITY messages if a corresponding significant reduction in protocol overhead can be realized.

4.6.1.1 Congestion Control Startup

As is often the case, assimilating group state at startup is difficult for multicast protocols due to the desire to minimize feedback and the corresponding cost of state collection and reaction. The startup procedure described here suffers some limitations which could be resolved by some "a priori" configuration (e.g. preloading the representative list with some known "ringers") or another protocol initialization phase. Startup techniques for multicast congestion control (and multicast group communications in general) merit much further study. However, the approach taken in the current MDP implementation (which makes no assumptions about group membership) is presented here.
When the MDP congestion control algorithm is enabled, MDP begins by transmitting MDP_CMD_GRTT_REQ messages at intervals of 1.0, 2.0, 4.0, etc. seconds, as described for fixed-rate operation startup. These initial messages have the MDP_CMD_GRTT_FLAG_WILDCARD flag set, indicating that all receivers in the group should explicitly respond to the MDP_CMD_GRTT_REQ with an MDP_ACK. The "hold_time" in the MDP_CMD_GRTT_REQ is set to 1.0, 2.0, 4.0, etc. seconds respectively, and the receivers should respond to the command within the indicated "hold_time" window, with response times picked from a uniform random distribution. It is understood that these relatively short "hold_time" values at startup _could_ be problematic for cases of large initial group sizes with limited network feedback capacity. The intention of the initial rapid wildcard probing with rapid response is to collect information so MDP can quickly estimate an appropriate transmission rate. The feedback problem created by this current, interim startup algorithm could potentially be solved with side information at startup of an appropriate rate or set of congestion control representatives. Alternatively, an initial, slow startup phase could be added to collect this side information prior to actual data transmission. It is also possible that some form of ACK "suppression" or router-assisted response aggregation might be realized to make "wildcard" probing more scalable. These issues are under investigation.

4.6.1.2 Steady-state Probing

Once the MDP source receives any response from the group, it forms and subsequently maintains a list of congestion control representatives. At this point, the MDP source begins periodically transmitting MDP_CMD_GRTT_REQ messages at a rate of once per its estimate of GRTT. Note that "wildcard" probes are transmitted at the same rate at which MDP_CMD_GRTT_REQ messages are transmitted for fixed-rate operation (currently converging to a steady-state rate of once per 40 seconds) with the corresponding "hold_time" value. (At this time, _all_ receivers explicitly respond to "wildcard" probes. This may limit scalability for extreme group sizes or network conditions. More scalable approaches to "group at large" information collection, e.g. response suppression techniques or router assistance, are under investigation.) The representative list is populated and maintained as described in Section 4.6.3. When the representative list is emptied, the MDP source reduces its transmission rate and reverts to wildcard-only probing until new congestion control representatives are identified.

4.6.2 Receiver Response

MDP receivers respond to congestion control probing with round trip delay timestamps as described in Section 4.5 and an estimate of packet loss fraction obtained from monitoring the "sequence" field in messages received from the corresponding MDP source. It is important to note that the packet loss measurement technique plays an important role in the congestion control algorithm's ability to respond to dynamics in network congestion. MDP receivers currently use a form of a filtered, exponential sliding average to estimate the current packet loss fraction. The source's advertisement in the MDP_CMD_GRTT_REQ message of its current "transmit_rate", bottleneck "rtt", and "loss" is used in the packet loss estimator. (TBD) This algorithm will be described in detail in a future version of this document.
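Since the estimator itself is marked TBD above, the fragment below shows only the general shape of a filtered exponential sliding average of per-interval loss samples; the smoothing weight and structure are placeholders of ours, not the MDP algorithm:

    /* Placeholder sketch only: EWMA of per-interval loss fraction
     * samples derived from gaps in the source "sequence" field. */
    static double loss_update(double smoothed,
                              unsigned lost, unsigned expected)
    {
        const double alpha = 0.1;  /* assumed smoothing weight */
        double sample = expected ? (double)lost / (double)expected : 0.0;
        return (1.0 - alpha) * smoothed + alpha * sample;
    }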
These responses to probing are implicit during usual protocol operation, such as when MDP_NACKs are generated to request repairs or MDP_ACK messages are sent in response to requests for positive acknowledgment. When the source sets the MDP_CMD_GRTT_FLAG_WILDCARD flag, or the local receiver is a member of the advertised representative list, a response is explicitly generated within the "hold_time" specified. It is interesting to note that general congestion control operation is completely the responsibility of the source. The receivers only participate in congestion control as instructed by the source. Thus, it would be easy to design a distributed application which dynamically enables and disables congestion control operation. Also, it might be possible to program the receiver implementation such that only designated receivers respond to congestion control probing. This may allow increased scalability of the current protocol design through intelligent application configuration.

The use of "wildcard" probing is being further examined. It is possible that implicit representative "nomination" through normal MDP_NACKs alone may be sufficient for successful protocol operation in many general cases. Further simulation and study will be conducted in this area.

4.6.3 Source Response Processing and Representative Selection

The TCP analytical model is used to process the responses from receivers containing round trip time (RTT) timestamps and packet loss measurements. First, a current measurement of RTT for the receiver in question is calculated by:

rtt_current = current_time - "grtt_response"

If the source has no state for the receiver in question (i.e. it is not currently a "representative"), this value is used to initialize the estimate of RTT and RTT deviation kept for the receiver. If the receiver _is_ a representative, this value is used to update a smoothed estimate of RTT and RTT deviation with the following algorithm:

    err = rtt_current - rtt_smoothed;
    if (err < 0.0)
    {
        rtt_smoothed  += (1.0/64.0) * err;
        rtt_deviation += (1.0/8.0) * (fabs(err) - rtt_deviation);
    }
    else
    {
        rtt_smoothed  += (1.0/8.0) * err;
        rtt_deviation += (1.0/4.0) * (fabs(err) - rtt_deviation);
    }

Note that this algorithm is similar to the algorithm recommended for the similar state the TCP protocol maintains, except that MDP is more conservative in reducing the RTT and RTT deviation estimates. Examination of statistics from initial simulations of MDP congestion control has indicated that this biased result produces more desirable congestion control behavior. This bias reduces MDP transmit rate fluctuation while maintaining responsiveness to congestion indicated by sudden increases in RTT to bottleneck receivers. The "rtt_smoothed" value is used for the "RTT" parameter in the TCP model and the retransmission timeout value is calculated as:

"T0" = rtt_smoothed + 4.0 * rtt_deviation

Currently, the congestion control representative list maintained by the source is populated with the five receivers with the worst-case rates predicted by the TCP model. The representative with the lowest predicted rate is identified as the current "bottleneck" (worst-path) representative, and that rate establishes a goal rate for the MDP rate adjustment algorithm.
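Tying the pieces together, the per-representative rate prediction can be sketched as below, reusing the illustrative tcp_model_rate() function from the Section 4.6 sketch; only the "T0" formula and the fixed b = 1 come from the text:

    /* Score one representative with the TCP model using its smoothed
     * RTT state and current loss estimate ("b" is fixed at one). */
    static double representative_rate(double rtt_smoothed,
                                      double rtt_deviation,
                                      double loss, double packet_size)
    {
        double t0 = rtt_smoothed + 4.0 * rtt_deviation;
        return tcp_model_rate(packet_size, rtt_smoothed, t0, loss, 1.0);
    }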
Representative election as currently implemented is thus a simple algorithm, and more complex approaches are being considered. For example, it may be possible to use the loss measurements and round-trip estimates collected as individual metrics to correlate receiver "clusters" sharing a common bottleneck and limit the number of representatives per correlation bin. Then, the representative list could be populated with receivers "spread" across multiple candidate bottlenecks in the group's multicast topology. This would allow the MDP source to more quickly identify the worst-path rate as congestion conditions change. Simulations and studies are being conducted in this area. It should also be noted that MDP receivers' suppression of MDP_NACK messages will naturally tend to reduce multiple responses from receiver "clusters" due to their likely correlated loss and close delay proximity. (Ironically, MDP's FEC repair technique, which greatly improves protocol efficiency, tends to reduce suppression based solely on correlated loss patterns, which might otherwise be a reasonably effective identifier of receiver "clusters".)

Representative receivers are removed from the list when they fail to respond to "N" consecutive probes. "N" is a robustness factor to account for probe and/or response loss. The current implementation sets "N" to a value of 5. Removal of receivers from this list makes room for other candidates, and, of course, previous candidates can be quickly restored to the list if responses (implicit or explicit) are later received.

4.6.4 Source Rate Adjustment

The MDP congestion control algorithm can make representative membership adjustments on the same interval at which probing MDP_CMD_GRTT_REQ messages are generated. At each of these intervals, the MDP source evaluates its current representative list and selects the receiver with the lowest rate as the worst-path "bottleneck". This bottleneck rate is used as the goal rate for an adjustment of the source's current transmission rate. The source also checks that the last response recorded for the bottleneck representative is "current" (i.e., the "grtt_req_sequence" echoed in the response has a delta of no more than one from the sequence number of the last MDP_CMD_GRTT_REQ transmitted by the source).

If the goal bottleneck rate is greater than the current source transmission rate and the bottleneck representative's response is "current", the source linearly increases its transmit rate with the following formula:

new_rate = old_rate + (1.0 * "segment_size")

where "new_rate" and "old_rate" are in units of bytes per second and "segment_size" is the MDP source's current "segment_size" setting. Thus the rate is increased by one "segment_size" in bytes per second every GRTT during the representative probing. Note that the "new_rate" is also limited not to exceed the goal rate predicted by the TCP analytical model.

If the goal bottleneck rate is less than the MDP source's current transmission rate, or the bottleneck representative's response is _not_ "current", the source reduces its rate with the following formula:

new_rate = old_rate * 0.75

In this fashion, the rate is exponentially reduced over the course of multiple GRTT intervals. Note that if the bottleneck representative's response _is_ current, the rate reduction is limited to not fall below the goal rate established with the TCP analytical model.
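The adjustment rules above reduce to the following C sketch; the control flow is our paraphrase of the text, not code from the implementation:

    /* One rate adjustment step, performed once per probe interval.
     * Rates are in bytes per second; response_current is nonzero when
     * the bottleneck representative's last response is "current". */
    static double adjust_rate(double old_rate, double goal_rate,
                              double segment_size, int response_current)
    {
        double new_rate;
        if (response_current && goal_rate > old_rate)
        {
            new_rate = old_rate + 1.0 * segment_size; /* linear increase */
            if (new_rate > goal_rate)
                new_rate = goal_rate;   /* capped at the TCP-model rate */
        }
        else
        {
            new_rate = old_rate * 0.75; /* exponential decrease */
            if (response_current && new_rate < goal_rate)
                new_rate = goal_rate;   /* floored at the TCP-model rate */
        }
        return new_rate;
    }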
No rate adjustment is performed when the source is not actively transmitting data. In the current implementation, the MDP source (if kept open by the application) continues to probe during this time and the representative list continues to be updated. However, it may be appropriate for MDP to gradually reduce its probe rate during these periods of data transmission inactivity. Also, it may be desirable for MDP to use a reduced rate, weighted by the period of inactivity, when resuming transmission. The main point here is that the validity of rate prediction from probing without active transmission is questionable, particularly when probing over uncongested connectivity.

4.7 Optional Positive Acknowledgement Process

MDP provides an option for the source application to request positive acknowledgment (ACK) of individual transport objects from a subset of receivers in the group. The list of receivers providing acknowledgement is determined by the source application with "a priori" knowledge of participating nodes and/or is determined during protocol operation from receivers who indicate their ACKing status with a flag in the MDP_REPORT messages each node generates. (Note that this second methodology is only applicable when MDP status reporting is enabled.) Positive acknowledgment can be requested for all transport objects sent by the source or may be applied at certain "watermark" points in the course of transmission of a series (stream) of transport objects.

The ACK process is initiated by the source, which generates MDP_CMD_ACK_REQ messages in periodic "rounds" containing the MdpObjectTransportId identifier of the object to be acknowledged and a list of MdpNodeId identifiers of receiver nodes from which acknowledgement is being requested. The ACK process is self-limiting and avoids ACK implosion in that:

1) Only a single MDP_CMD_ACK_REQ message is generated once per (2*GRTT), and

2) The size of the list of MdpNodeIds from which ACK is requested is limited to a maximum of the source "segment_size" setting per round of the positive acknowledgement process.

The indicated receivers randomly spread their MDP_ACK responses uniformly in time over a window of (1*GRTT). As the source receives responses from receivers, it eliminates them from the message payload list and adds in pending receiver MdpNodeIds, keeping within the "segment_size" limitation of the list size. Each receiver is only queried for a maximum number of repeats (20, by default). Any receivers not responding within this maximum robustness factor are removed from the payload list to make potential room for other receivers pending acknowledgement. The transmission of the MDP_CMD_ACK_REQ is repeated until no further responses are required or until the robustness threshold is exceeded for all pending receivers. The positive acknowledgment process is interrupted in response to negative acknowledgement repair requests (NACKs) received from receivers during the acknowledgment period. The process is resumed once the repairs have been transmitted. Note that receivers will not ACK until they have received complete transmission of all transport objects up to and including the transport object indicated in the MDP_CMD_ACK_REQ message from the source. Receivers will respond to the request with a NACK message if repairs are required.
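The round-based bookkeeping described above might be sketched as follows. The data structure and names are illustrative; only the constants (one request per 2*GRTT, a default of 20 attempts, and the "segment_size" bound on the payload list, omitted here for brevity) come from the text:

    #define ACK_ROBUSTNESS 20  /* maximum queries per receiver (default) */

    struct ack_entry {
        unsigned long node_id;  /* MdpNodeId pending acknowledgement */
        int attempts;           /* MDP_CMD_ACK_REQ rounds sent so far */
        int acked;              /* nonzero once the MDP_ACK is received */
    };

    /* One round (run once per 2*GRTT): returns nonzero if another
     * MDP_CMD_ACK_REQ is still needed for any pending receiver. */
    static int ack_round_pending(struct ack_entry *list, int count)
    {
        int pending = 0;
        for (int i = 0; i < count; i++)
        {
            if (list[i].acked || list[i].attempts >= ACK_ROBUSTNESS)
                continue;       /* done, or dropped after max attempts */
            list[i].attempts++; /* included in this round's payload */
            pending = 1;
        }
        return pending;
    }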
The optional positive acknowledgement process may be further refined in future revisions of the MDP protocol and has undergone only limited operational use to date.

4.8 Buffer Size Considerations

(TBD) MDP is designed to allow sources and receivers to operate within the constraints of limited buffering resources. A complete discussion of issues related to buffering resources will be provided in the future.

4.9 Silent Receiver Operation

MDP supports an option for "silent" receiver, or emission controlled (EMCON), receiver operation. This mode of operation is useful when it is not possible (or desired) for the receiver nodes to transmit, by any means, messages back to the source node. The "auto_parity" feature of the source transmission sequence can be leveraged to provide robust but efficient delivery of data with the robustness tuned to the expected packet loss characteristics of the network media. Additionally, receivers can combine the information from multiple repeat transmissions of transport object data into a complete object.

The primary issue with implementation of this mode of operation is how state and memory are managed at the receiver nodes. In particular, the receivers must have a policy for when to "give up" on reception of an object and free resources for reception of subsequent transmissions. A simple policy was implemented in earlier versions of the MDP protocol. This is being refined in the current MDP implementation to allow support for different, user-defined concepts of silent receiver operation. (TBD) A complete discussion of different concepts of silent receiver operation will be provided in the future.

4.10 Statistics Reporting

(TBD) A complete discussion of the session performance statistics reports generated by participating nodes will be provided in the future.

5.0 Security Considerations

At the time of writing, broad multicast security issues are the subject of research within the IRTF Secure Multicast Group (SMuG). Solutions for these requirements will be standardized within the IETF when ready. However, the current protocol does not preclude the use of application security mechanisms or the use of underlying network security features. For example, the current protocol implementation has an option to use sockets secured with an underlying IPSec implementation on operating systems supporting that feature.

6.0 Future Work and Design Issues

While the present design has been through limited MBone and other operational network use and has been shown to work effectively, there remain design issues which the authors envision will continue to evolve. These include multicast session startup and the dynamic congestion control algorithm. We feel these are general problems and not unique to MDP. While effective reactive congestion control in a multicast environment remains a complex technical design issue, the basic rate control feature in the present design can be activated in a number of environments within the limitations described in this document. We envision potential advantage in applying this protocol framework in combination with a reservation protocol (e.g., RSVP [18]) and future integrated or differentiated services capabilities.
4.8 Buffer Size Considerations (TBD)

MDP is designed to allow sources and receivers to operate within the constraints of limited buffering resources. A complete discussion of issues related to buffering resources will be provided in the future.

4.9 Silent Receiver Operation

MDP supports an option for "silent", or emission-controlled (EMCON), receiver operation. This mode of operation is useful when it is not possible (or desired) for the receiver nodes to transmit, by any means, messages back to the source node. The "auto_parity" feature of the source transmission sequence can be leveraged to provide robust but efficient delivery of data, with the robustness tuned to the expected packet loss characteristics of the network media. Additionally, receivers can combine the information from multiple repeat transmissions of transport object data into a complete object.

The primary implementation issue for this mode of operation is how state and memory are managed at the receiver nodes. In particular, the receivers must have a policy for when to "give up" on reception of an object and free resources for reception of subsequent transmissions; one such policy is sketched below. A simple policy was implemented in earlier versions of the MDP protocol and is being refined in the current MDP implementation to allow support for different, user-defined concepts of silent receiver operation.

(TBD) A complete discussion of different concepts of silent receiver operation will be provided in the future.
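Since the "give up" policy is left user-defined, the sketch below illustrates one simple inactivity-timeout approach of the kind discussed above. The class name and timeout parameter are illustrative assumptions, not behavior defined by the protocol.

   import time

   class SilentReceiverCache:
       # One possible "give up" policy for silent (EMCON) operation:
       # an object whose packets have not been heard for a configured
       # period is abandoned so that its buffers can be reused for
       # subsequent transmissions.

       def __init__(self, inactivity_timeout):
           self.inactivity_timeout = inactivity_timeout  # seconds
           self.last_heard = {}  # object_id -> time of last packet

       def on_object_packet(self, object_id):
           # Any data or parity packet for an object refreshes its
           # timer; repeat transmissions let the receiver fill gaps.
           self.last_heard[object_id] = time.monotonic()

       def reap(self):
           # Give up on objects idle beyond the threshold, freeing
           # their state, and report which objects were abandoned.
           now = time.monotonic()
           expired = [obj for obj, t in self.last_heard.items()
                      if now - t > self.inactivity_timeout]
           for obj in expired:
               del self.last_heard[obj]
           return expired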
4.10 Statistics Reporting (TBD)

A complete discussion of the session performance statistics reports generated by participating nodes will be provided in the future.

5.0 Security Considerations

At the time of writing, broad multicast security issues are the subject of research within the IRTF Secure Multicast Group (SMuG). Solutions for these requirements will be standardized within the IETF when ready. However, the current protocol does not preclude the use of application security mechanisms or the use of underlying network security features. For example, the current protocol implementation has an option to use sockets secured by an underlying IPSec implementation on operating systems supporting that feature.

6.0 Future Work and Design Issues

While the present design has seen limited MBone and other operational network use and has been shown to work effectively, design issues remain which the authors envision will continue to evolve. These include multicast session startup and the dynamic congestion control algorithm. We feel these are general problems and not unique to MDP. While effective reactive congestion control in a multicast environment remains a complex technical design issue, the basic rate control feature in the present design can be activated in a number of environments within the limitations described in this document.

We envision potential advantage in applying this protocol framework in combination with a reservation protocol (e.g., RSVP [18]) and future integrated or differentiated services capabilities. The source rate control setting can then reflect the reserved bandwidth, and protocol timers can be better tuned to operate within average or upper-bound delay expectations. Early simulation results also show that the MDP congestion control approach is effective in allowing multiple multicast flows to dynamically share available capacity. Unicast and multicast traffic isolation methods may then be applied in some scenarios in combination with end-to-end congestion control. Additionally, early study of intermediate-system random early detection and explicit congestion notification (RED/ECN) [19, 20] or other network congestion indicators has shown them to be beneficial to the overall performance of the end-to-end multicast congestion control approach used by MDP.

7.0 Suggested Usage

The present MDP instantiation is seen as a useful tool for reliable data transfer over generic IP multicast services. It is not the intention of the authors to suggest it is suitable for supporting all envisioned multicast reliability requirements. MDP provides a simple and flexible framework for multicast applications with a degree of concern for network traffic implosion and protocol overhead efficiency. As previously described, MDP has been successfully demonstrated within the MBone for bulk data dissemination applications, including weather satellite compressed imagery updates servicing a large group of receivers and a generic reliable web content "push" application.

In addition, this framework approach has some design features making it attractive for bulk transfer in asymmetric and wireless internetwork applications. MDP is capable of operating successfully independent of network structure and in environments with high packet loss, delay, and misordering. Hybrid proactive/reactive FEC-based repair improves protocol performance in some multicast scenarios. A source-only repair approach often makes additional engineering sense in asymmetric networks. MDP's optional unicast feedback mode may be suitable for use in asymmetric networks or in networks where only unidirectional multicast routing/delivery service exists. Asymmetric architectures supporting multicast delivery are likely to make up an important portion of the future Internet structure (e.g., DBS/cable/PSTN hybrids), and efficient, reliable bulk data transfer will be an important capability for servicing large groups of subscribed receivers.

8.0 References

[1] S. Deering, "Host Extensions for IP Multicasting", RFC 1112, August 1989.

[2] A. Mankin, A. Romanow, S. Bradner, V. Paxson, "IETF Criteria for Evaluating Reliable Multicast Transport and Application Protocols", RFC 2357, IETF, June 1998.

[3] J. Macker, W. Dang, "The Reliable Dissemination Protocol", Internet Draft, October 1996, work in progress.

[4] J. Metzner, "An Improved Broadcast Retransmission Protocol", IEEE Transactions on Communications, Vol. COM-32, No. 6, June 1984.

[5] J. Macker, "Integrated Erasure-based Coding for Reliable Multicast Transmission", IRTF RMRG Meeting presentation, March 1997.

[6] J. Macker, "Reliable Multicast Transport and Integrated Erasure-based Forward Error Correction", Proc. IEEE MILCOM 97, October 1997.

[7] D. Grossink, J. Macker, "Reliable Multicast and Integrated Parity Retransmission with Channel Estimation", IEEE GLOBECOM 98, 1998.

[8] B.N. Levine, J.J. Garcia-Luna-Aceves, "A Comparison of Known Classes of Reliable Multicast Protocols", Proc. International Conference on Network Protocols (ICNP-96), Columbus, Ohio, October 29 - November 1, 1996.

[9] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 1889, IETF, January 1996.

[10] S. Floyd, V. Jacobson, S. McCanne, C. Liu, and L. Zhang, "A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing", Proc. ACM SIGCOMM, pp. 342-356, August 1995.

[11] S. Lin, D.J. Costello, "Error Control Coding", Prentice Hall, 1983.

[12] M. Handley and S. Floyd, "Strawman Specification for TCP Friendly (Reliable) Multicast Congestion Control (TFMCC)", ISI/LBNL Technical Report for the IRTF RMRG working group, November 1999.

[13] D. DeLucia and K. Obraczka, "Multicast Feedback Suppression Using Representatives", USC/ISI Technical Report, June 1996.

[14] D. DeLucia and K. Obraczka, "A Congestion Control Mechanism for Reliable Multicast", IRTF RMRG Meeting presentation, September 1997.

[15] J. Padhye, V. Firoiu, D. Towsley and J. Kurose, "Modeling TCP Throughput: A Simple Model and its Empirical Validation", Proc. ACM SIGCOMM, 1998, Vancouver, Canada.

[16] B. Adamson, "MDP Congestion Control Update", IRTF RMRG Meeting presentation, June 1999.

[17] B. Whetten and J. Conlan, "A Rate Based Congestion Control Scheme for Reliable Multicast", Technical Report, GlobalCast Communications, October 1998.

[18] R. Braden, Ed., L. Zhang, S. Berson, S. Herzog, and S. Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1 Functional Specification", RFC 2205, IETF, September 1997.

[19] S. Floyd and V. Jacobson, "Random Early Detection Gateways for Congestion Avoidance", IEEE/ACM Transactions on Networking, V.1 N.4, August 1993, pp. 397-413.

[20] S. Floyd and K. Fall, "Router Mechanisms to Support End-to-End Congestion Control", LBL Technical Report, February 1997.

Authors' Addresses

R. Brian Adamson
Newlink Global Engineering Corporation
6506 Loisdale Road, Suite 209
Springfield, VA 22150
(202) 404-1194
adamson@itd.nrl.navy.mil

Joseph Macker
Information Technology Division
Naval Research Laboratory
Washington, DC 20375
(202) 767-2001
macker@itd.nrl.navy.mil