Internet Engineering Task Force                               H. Schulzrinne
INTERNET-DRAFT                                         AT&T Bell Laboratories
                                                             October 27, 1992
                                                              Expires: 4/1/93

      A Transport Protocol for Audio and Video Conferences and other
                  Multiparticipant Real-Time Applications

Status of this Memo

This document is an Internet Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts.

Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress." Please check the I-D abstract listing contained in each Internet Draft directory to learn the current status of this or any other Internet Draft.

Distribution of this document is unlimited.

Abstract

This draft discusses aspects of transporting real-time services such as voice and video over the Internet. It compares and evaluates design alternatives for a proposed real-time transport protocol. Appendices touch on issues of port assignment and multicast address allocation.

Acknowledgments

This draft is based on discussions within the AVT working group chaired by Stephen Casner. Eve Schooler and Stephen Casner provided valuable comments. This work was supported in part by the Office of Naval Research under contract N00014-90-J-1293, the Defense Advanced Research Projects Agency under contract NAG2-578 and a National Science Foundation equipment grant, CERDCR 8500332.

Contents

1  Introduction                                                          3

2  Goals                                                                 5

3  Services                                                              8
   3.1   Framing                                                         9
   3.2   Version Identification                                          9
   3.3   Conference Identification                                      10
         3.3.1  Demultiplexing                                          10
         3.3.2  Aggregation                                             11
   3.4   Media Encoding Identification                                  11
         3.4.1  Audio Encodings                                         12
         3.4.2  Video Encodings                                         14
   3.5   Playout Synchronization                                        14
         3.5.1  Synchronization Method                                  18
         3.5.2  End-of-talkspurt indication                             21
         3.5.3  Recommendation                                          21
   3.6   Segmentation and Reassembly                                    21
   3.7   Source Identification                                          22
         3.7.1  Gateways, Reflectors and End Systems                    22
         3.7.2  Address Format Issues                                   24
   3.8   Energy Indication                                              25
   3.9   Error Control                                                  25
   3.10  Security                                                       26
         3.10.1 Encryption                                              26
         3.10.2 Authentication                                          27
   3.11  Quality of Service Control                                     27

4  Conference Control Protocol                                          28

5  Packet Format                                                        28
   5.1   Data                                                           28
   5.2   Control Packets                                                30

A  Port Assignment                                                      31

B  Multicast Address Allocation                                         34

C  Glossary                                                             36

D  Address of Author                                                    40

1 Introduction

The real-time transport protocol (RTP) discussed in this draft aims to provide services commonly required by interactive multimedia conferences, in particular playout synchronization, demultiplexing, media identification and active-party identification.
However, RTP is not restricted to multimedia conferences; it is anticipated that other real-time services such as remote data acquisition and control may also find its services of use.

In this context, a conference describes associations that are characterized by the participation of two or more agents, interacting in real time with one or more media of potentially different types. The agents are anticipated to be human, but may also be measurement devices, remote media servers, simulators and the like. Both two-party and multiple-party associations are to be supported, where one or more agents can take active roles, i.e., generate data. Thus, applications not commonly considered conferences fall under our wider definition, for example one-way media such as the network equivalent of closed-circuit television or radio, traditional two-party telephone conversations, or real-time distributed simulations.

Even though RTP is intended for real-time interactive applications, its use for the storage and transmission of recorded real-time data should be possible, with the understanding that the interpretation of some fields, such as timestamps, may be affected by this off-line mode of operation.

RTP uses the services of an end-to-end transport protocol such as UDP, TCP, OSI TPx, ST-II [1, 2] or the like.(1) The services used are end-to-end delivery, framing, demultiplexing and multicast. The underlying network is not assumed to be reliable and can be expected to lose, corrupt, arbitrarily delay and reorder packets. However, the use of RTP within quality-of-service (e.g., rate) controlled networks is anticipated to be of particular interest. Network layer support for multicasting is desirable, but not required.

RTP is supported by a real-time control protocol (RTCP) in a relationship similar to that between IP and ICMP. However, RTP can function, with reduced functionality, without a control protocol.
The control protocol provides minimum functionality for maintaining conference state for a single medium. It is not guaranteed to be reliable and is assumed to be multicast to all participants of a conference. Conferences encompassing several media are managed by a (reliable) conference control protocol, whose definition is outside the scope of this note. Some aspects of its functionality, however, are described in Section 4.

Within this working group, some common encoding rules and algorithms for media should be specified, keeping in mind that this aspect is largely independent of the remainder of the protocol. Without this specification, interoperability cannot be achieved. It is suggested, however, to keep the two aspects in separate RFCs, as changes in media encoding should be independent of the transport aspects. The encoding specification should include such things as the byte order for multi-byte samples, the sample order for multi-channel audio, the format of state information for differential encodings, the segmentation of encoded video frames into packets, and the like.

As part of this working group (or the conference architecture BOF/working group), some number assignment issues will have to be addressed, in particular for encoding formats and for port and address usage. The issue of port assignment is discussed in more detail in Appendix A. It should be emphasized, however, that UDP port assignment does not imply that all underlying transport mechanisms share this or a similar port mechanism.

This draft aims to summarize some of the discussions held within the AVT working group chaired by Stephen Casner, but the opinions are the author's own. Where possible, references to previous work are included, but the author realizes that the attribution of ideas is far from complete.

------------------------------
1.
ST-II is not properly a transport protocol, as it is visible to intermediate nodes, but it provides services such as process demultiplexing commonly associated with transport protocols.

The draft builds on operational experience with Van Jacobson's and Steve McCanne's vat audio conferencing tool as well as implementation experience with the author's Nevot network voice terminal. This note will frequently refer to NVP [3], the network voice protocol, the only such protocol currently specified in an RFC within the Internet. The CCITT has standardized, as recommendations G.764 and G.765, a packet voice protocol stack for use in digital circuit multiplication equipment.

The name RTP was chosen to reflect the fact that audio-visual conferences may not be the only applications employing its services, while the real-time nature of the protocol is important, setting it apart from other multimedia transport mechanisms such as the MIME multimedia mail effort [4].

The remainder of this draft is organized as follows. Section 2 summarizes the design goals of this real-time transport protocol. Then, Section 3 describes the services to be provided in more detail. Section 4 briefly outlines some of the services added by the conference control protocol; a more detailed description is outside the scope of this document. Given the required services and design goals, Section 5 outlines possible packet formats for RTP and RTCP. Two appendices discuss the issues of port assignment and multicast address allocation, respectively. A glossary defines terms and acronyms, providing references for further detail.
2 Goals

Design decisions should be measured against the following goals, not necessarily listed in order of importance:

media flexibility: While the primary applications that motivate the protocol design are conference voice and video, it should be anticipated that other applications may also find the services provided by the protocol useful. Some examples include distribution audio/video (for example, the ``Radio Free Ethernet'' application by Sun) and some forms of (loss-tolerant) remote data acquisition. Note that it may be possible that different media interpret the same packet header field in different ways (e.g., a synchronization bit may be used to indicate the beginning of a talkspurt for audio and the beginning of a frame for video). Also, new formats of established media, for example high-quality multi-channel audio, should be anticipated where possible.

extensible: Researchers and implementors within the Internet community are currently only beginning to explore real-time multimedia services such as audio-visual conferences. Thus, RTP should be able to incorporate additional services as operational experience with the protocol accumulates and as applications not originally anticipated find its services useful. The same mechanisms should also allow experimental applications to exchange application-specific information without jeopardizing interoperability with other applications. Extensibility is also desirable as it will hopefully speed along the standardization effort, making the consequences of leaving out some group's favorite fixed header field less drastic. It should be understood that extensibility and flexibility may conflict with the goals of bandwidth and processing efficiency.

independent of lower-layer protocols: RTP should make as few assumptions about the underlying transport protocol as possible.
It should, for example, work reasonably well with UDP, TCP, ST-II, OSI TP, VMTP and experimental protocols, for example protocols that support resource reservation and quality-of-service guarantees. Naturally, not all transport protocols are equally suited for real-time services; in particular, TCP may introduce unacceptable delays over anything but low-error-rate LANs. Also, protocols that deliver streams rather than packets need additional framing services, as discussed in Section 3.1. It remains to be discussed whether RTP may use services provided by the lower-layer protocols for its own purposes (time stamps and sequence numbers, for example).

The goal of independence from lower-layer considerations also affects the issue of address representation. In particular, anything too closely tied to the current IP 4-byte addresses may face early obsolescence. However, the charter of the working group is short term, so that longer-term changes in host addressing can legitimately be ignored.

gateway-compatible: Operational experience has shown that RTP-level gateways are necessary and desirable for a number of reasons. First, it may be desirable to aggregate several media streams into a single stream and then retransmit it with a possibly different encoding, packet size or transport protocol. A reflector that achieves multicasting by user-level copying may be needed where multicast tunnels are unavailable or the end systems are not multicast-capable.

bandwidth efficient: It is anticipated that the protocol will be used in networks with a wide range of bandwidths and with a variety of media encodings. Despite increasing bandwidths within the national backbone networks, bandwidth efficiency will continue to be important for transporting conferences across 56 kb/s links, office-to-home high-speed modem connections and international links.
To minimize end-to-end delay and the effect of lost packets, packetization intervals have to be limited, which, in combination with efficient media encodings, leads to short packet sizes. Generally, packets containing 16 to 32 ms of speech are considered optimal [5, 6, 7]. For example, even with a 65 ms packetization interval, a 4800 b/s encoding produces 39-byte packets. Current Internet voice experiments use packets containing between 20 and 22.5 ms of audio, which translates into 160 to 180 bytes of audio information coded at 64 kb/s. Video packets are typically much longer, so that header overhead is less of a concern.

For UDP multicast (without counting the overhead of source routing as currently used in tunnels, or of a separate IP encapsulation as planned), IPv4 incurs 20 bytes and UDP an additional 8 bytes of header overhead, not counting any datalink layer headers of at least 4 bytes. With RTP header lengths between 4 and 8 bytes, the total overhead amounts to between 36 and 40 (or more) bytes per audio or video packet. For 160-byte audio packets, the overhead of 8-byte RTP headers together with UDP, IP and PPP headers is 25%. For low-bitrate coding, packet headers can easily double the necessary bit rate. Thus, it appears that any fixed headers beyond eight bytes would have to make a significant contribution to the protocol's capabilities to outweigh their standing in the way of running RTP applications over low-speed links. The current fixed header lengths for NVP and vat are 4 and 8 bytes, respectively. It is interesting to note that G.764 has a total header overhead, including the LAPD data link layer, of only 8 bytes, as the voice transport is considered a network-layer protocol. The overhead is split evenly between layers 2 and 3.
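The overhead figures above can be checked with a short sketch. The header sizes are the draft's assumptions (IPv4 20 bytes, UDP 8 bytes, RTP 4 to 8 bytes, at least 4 bytes of datalink header); the function name is illustrative only.

```python
# Back-of-the-envelope header overhead, mirroring the figures in the text.

def overhead_percent(payload_bytes, rtp_hdr, ip_hdr=20, udp_hdr=8, dl_hdr=4):
    """Header bytes as a percentage of the media payload."""
    hdrs = ip_hdr + udp_hdr + rtp_hdr + dl_hdr
    return 100.0 * hdrs / payload_bytes

# 20 ms of 64 kb/s PCM audio = 160 bytes of payload.
pcm_payload = 64000 // 8 * 20 // 1000            # 160 bytes
print(overhead_percent(pcm_payload, rtp_hdr=8))  # 25.0

# 4800 b/s coder, 65 ms packetization interval: 39-byte packets.
lpc_payload = 4800 * 65 // 1000 // 8             # 39 bytes
print(overhead_percent(lpc_payload, rtp_hdr=8))  # ~102.6: headers double the rate
```

This makes concrete why low-bitrate coders suffer most: the header cost is fixed per packet, so it dominates as the payload shrinks.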
Bandwidth efficiency can be achieved by transporting non-essential or slowly changing protocol state in optional fields or in a separate low-bandwidth control protocol. Also, header compression [8] may be used.

international: Even now, audio and visual conferencing tools are used far beyond the North American continent. It would seem appropriate to give consideration to internationalization concerns, for example to allow for the European A-law audio encoding and non-US-ASCII character sets in textual data such as site identification.

processing efficient: At packet arrival rates on the order of 40 to 50 per second for a single voice or video source, per-packet processing overhead may become a concern, particularly if the protocol is to be implemented on other than high-end platforms. Multiplication and division operations should be avoided where possible, and fields should be aligned to their natural size, i.e., an n-byte integer is aligned on an n-byte multiple, where possible.

implementable: Given the anticipated lifetime and experimental nature of the protocol, it must be implementable with current hardware and operating systems. That does not preclude that hardware and operating systems geared towards real-time services may improve the performance or capabilities of the protocol, e.g., allow better intermedia synchronization.

3 Services

The services that may be provided by RTP are summarized below. Note that not all services have to be offered. Services anticipated to be optional are marked with an asterisk.

o framing (*)

o demultiplexing by conference/association (*)

o demultiplexing by media source

o demultiplexing by media encoding

o synchronization between source(s) and destination(s)

o error detection (*)

o encryption (*)

o quality-of-service monitoring (*)

In the following sections, we will discuss how these services are reflected in the proposed packet header.
Information to be conveyed within the conference can be roughly divided into information that changes with every data packet and other information that stays constant for longer time periods. State information that does not change with every packet can be carried in several different ways:

as a fixed part of the RTP header: This method is easiest to decode and ensures state synchronization between sender and receiver(s), but can be bandwidth-inefficient or restrict the amount of state information that can be conveyed.

as a header option: The information is only carried when needed. It requires more processing by the sending and receiving application. If contained in every packet, it is also less bandwidth-efficient than the first method.

within RTCP packets: This approach is roughly equivalent to header options in terms of processing and bandwidth efficiency. Some means of identifying when a particular option takes effect within the data stream may have to be provided.

within conference control: The state information is conveyed when the conference is established or when the information changes. As for RTCP packets, a synchronization mechanism between data and control may be required for certain information.

through a conference directory: This is a variant of the conference control mechanism, with a (distributed) directory at a well-known location maintaining state information about on-going or scheduled conferences. Changing state information during a conference is probably more difficult than with conference control, as participants need to be told to look at the directory for changed information. Thus, a directory is probably best suited to hold information that will persist through the life of the conference, for example its multicast group, title and organizer.

The first two methods are examples of in-band signaling, the others of out-of-band signaling.
3.1 Framing

To satisfy the goal of transport independence, we cannot assume that the lower layer provides framing. (Consider TCP as an example; even though it would probably not be used for real-time applications except possibly on a local network, it may be used in distributing recorded audio or video segments.) Thus, if and only if the underlying protocol does not provide framing, the RTP packet is prefixed by a 16-bit byte count. The byte count could also be used by mutual agreement if it is deemed desirable to carry several RTP packets in a single TPDU for increased efficiency.

3.2 Version Identification

Humility suggests that we anticipate that we may not get the first iteration of the protocol right. In order to avoid ``flag days'' where everybody shifts to a new protocol, a version identifier could ensure continued interoperability. This is particularly important since UDP, for example, does not carry a ``next protocol'' identifier. The difficulty in interworking between the current vat and NVP protocols further affirms the necessity of a version identifier. However, the version identifier can be anticipated to be the most static of all proposed header fields.

Since the length of the header and the location and meaning of the option length field may be affected by a version change, encoding the version within an optional field is not feasible. Putting the version number into the control protocol packets would make RTCP mandatory and would make rapid scanning of conferences significantly more difficult. vat currently offers a 2-bit version field, while this capability is missing from NVP. Given the low bit usage of a version identifier and its utility in other contexts (IP, ST-II), it may be prudent to include one.
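The framing rule of Section 3.1 can be sketched as follows: each RTP packet carried over a stream transport is prefixed by a 16-bit byte count in network byte order. This is a minimal illustration, not part of any specification; the function names are invented for the example.

```python
import struct

def frame(packet: bytes) -> bytes:
    """Prefix a packet with a 16-bit (network byte order) byte count."""
    if len(packet) > 0xFFFF:
        raise ValueError("packet too long for a 16-bit byte count")
    return struct.pack("!H", len(packet)) + packet

def deframe(stream: bytes):
    """Split a stream of length-prefixed packets back into packets."""
    packets, offset = [], 0
    while offset + 2 <= len(stream):
        (length,) = struct.unpack_from("!H", stream, offset)
        packets.append(stream[offset + 2:offset + 2 + length])
        offset += 2 + length
    return packets

# Several framed packets can share one TPDU, as the text suggests:
buf = frame(b"pkt-one") + frame(b"pkt-two")
print(deframe(buf))  # [b'pkt-one', b'pkt-two']
```

The same byte count also serves the mutual-agreement case of packing several RTP packets into one TPDU, since the receiver simply keeps consuming length-prefixed records until the buffer is exhausted.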
3.3 Conference Identification

A conference identifier (conference ID) could serve two mutually exclusive functions: providing another level of demultiplexing, or a means of logically aggregating flows with different network addresses and port numbers. vat specifies a 16-bit conference identifier.

3.3.1 Demultiplexing

Demultiplexing by RTP allows one association, characterized by destination address and port number, to carry several distinct conferences. However, this appears to be necessary only if the number of conferences exceeds the demultiplexing capability available through (multicast) addresses and port numbers.

Efficiency arguments suggest that combining several conferences or media within a single multicast group is not desirable. Combining several conferences or media within a single multicast address negates the bandwidth efficiency afforded by multicasting. Also, applications that are not interested in a particular conference or not capable of dealing with a particular medium are still forced to handle the packets delivered for that conference or medium. Consider as an example two separate applications, one for audio, one for video. If both share the same multicast address and port, being differentiated only by the conference identifier, the operating system has to copy each incoming audio and video packet into two application buffers and perform a context switch to both applications, only to have one immediately discard the incoming packet. Given that application-layer demultiplexing has strong negative efficiency implications and given that multicast addresses are not an extremely scarce commodity, there seems to be no reason to burden every application with maintaining and checking conference identifiers for the purpose of demultiplexing.
It is also not recommended to use this field to distinguish between different encodings, as it would be difficult for the application to decide whether a new conference identifier means that a new conference has arrived or simply that all participants should be moved to a new conference with a different encoding. Since the encoding may change for some but not all participants, we could find ourselves breaking a single logical conference into several pieces, with a fairly elaborate control mechanism needed to decide which conferences logically belong together.

3.3.2 Aggregation

Particularly within a network with a wide range of capacities, using a different multicast group for each media component of a conference makes it possible to tailor the media distribution to the network bandwidths and end-system capabilities. It appears useful, however, to have a means of identifying groups that logically belong together, for example for purposes of time synchronization. A conference identifier used in this manner would have to be globally unique. It appears that such logical connections would better be identified as part of the control protocol, by identifying all multicast addresses belonging to the same logical conference, thereby avoiding the assignment of globally unique identifiers.

3.4 Media Encoding Identification

This field plays a similar role to the protocol field in data link or network protocols, indicating the next higher layer (here, the media decoder) that the data is meant for. For RTP, this field would indicate the audio, video or other media encoding. In general, the number of distinct encodings should be kept as small as possible to increase the chance that applications can interoperate. A new encoding should only be recognized if it significantly enhances the range of media quality or the types of networks conferences can be conducted over.
The unnecessary proliferation of encodings can be reduced by making reference implementations of standard encoders and decoders widely available. It should be noted that encodings may not be enumerable as easily as, say, transport protocols. A particular family of related encoding methods may be described by a set of parameters, as discussed below in the sections on audio and video encoding.

Encodings may change during the duration of a conference. This may be due to changed network conditions, changed user preference or because the conference is joined by a new participant that cannot decode the current encoding. If the information necessary for the decoder is conveyed out-of-band, some means of indicating when the change takes effect needs to be incorporated. Also, the indication that the encoding is about to change must reach all receivers reliably before the first packet employing the new encoding. Each receiver needs to track pending changes of encodings and check for every incoming packet whether an encoding change is to take effect with this packet.

Conveying media encodings rapidly is also important to allow scanning of conferences or broadcast media. A directory service could provide encoding information for on-going conferences. This may not be sufficient, however, unless all participants within a conference use the same encoding. Also, the usual synchronization problems between transmitted data and directory information apply.

There are at least two approaches to indicating media encoding, either in-band or out-of-band:

conference-specific: Here, the media identifier is an index into a table designating the approved or anticipated encodings (together with any particular version numbers or other parameters) for a particular conference or user community. The table can be distributed through RTCP, a conference control protocol or some other out-of-band means.
Since the number of encodings used during a single conference is likely to be small, the field width in the header can likewise be small. Also, there is no need to agree on an Internet-wide list of encodings. It should be noted that conveying the table of encodings through RTCP forces the application to maintain a separate mapping table for each sender, as there can be no guarantee that all senders will use the same table.

global: Here, the media identifier is an index into a global table of encodings. A global list reduces the need for out-of-band information. Transmitting the parameters associated with an encoding may be difficult, however, if it has to be done within the header space constraints of per-packet signaling.

To make detecting coder mismatches easier, encodings for all media should be drawn from the same numbering space. To facilitate experimentation with new encodings, a part of any global encoding numbering space should be set aside for experimental encodings, with numbers agreed upon within the community experimenting with the encoding and no Internet-wide guarantee of uniqueness.

3.4.1 Audio Encodings

Audio data is commonly characterized by three independent descriptors: the encoding (the translation of one or more audio samples into a channel symbol), the number of channels (mono, stereo) and the sampling rate. Theoretically, sampling rate and encoding are (largely) independent. We could, for example, apply µ-law encoding to any sampling rate, even though it is traditionally used with a rate of 8,000 Hz. In practical terms, it may be desirable to limit the combinations of encoding and sampling rate to the values the encoding was designed for.(2)

Channel counts between 1 and 4 should be sufficient and can be encoded into 2 bits by encoding the channel count minus one.

The audio encodings listed in Table 1 appear particularly interesting, even though the list is by no means exhaustive and does not include some experimental protocols currently in use, for example a non-standard form of LPC. The bit rate is shown per channel. ks/s, b/sample and kb/s denote kilosamples per second, bits per sample and kilobits per second, respectively. If sampling rates are to be specified separately, the values of 8, 16, 32, 44.1, and 48 kHz suggest themselves, even though other values (11.025 and 22.05 kHz) are supported on some workstations (the Silicon Graphics audio hardware and the Apple Macintosh, for example). Clearly, little is to be gained by allowing arbitrary sampling rates, as conversion, particularly between rates not related by simple fractions, is quite cumbersome and processing-intensive.

   Org.      Name        ks/s   b/sample  kb/s    description
   ------------------------------------------------------------------
   CCITT     G.711       8.0    8         64      µ-law PCM
   CCITT     G.711       8.0    8         64      A-law PCM
   CCITT     G.721       8.0    4         32      ADPCM
   Intel     DVI         8.0    4         32      ADPCM
   CCITT     G.723       8.0    3         24      ADPCM
   CCITT     G.726                                ADPCM
   CCITT     G.727                                ADPCM
   NIST/GSA  FS 1015     8.0              2.4     LPC-10E
   NIST/GSA  FS 1016     8.0              4.8     CELP
   NADC      IS-54       8.0              7.95    VSELP
   CCITT     G.7xy       8.0              16      LD-CELP
   GSM                   8.0              13      RPE-LPC
   CCITT     G.722       16.0             64      7 kHz, SB-ADPCM
             MPEG audio  32.0   16        256
             DAT         32.0   16        512
             DAT         44.1   16        705.6   CD, DAT playback
             DAT         48.0   16        768     DAT record

            Table 1: Standardized and common audio encodings

------------------------------
2. Given the wide availability of µ-law encoding and its low overhead, using it with a sampling rate of 16,000 or 32,000 Hz might be quite appropriate for high-quality audio conferences, even though there are other encodings, such as G.722, specifically designed for such applications. Note that the signal-to-noise ratio of µ-law encoding is about 38 dB, equivalent to an AM receiver. The ``telephone quality'' associated with G.711 is due primarily to the limitation in frequency response to the 200 to 3500 Hz range.
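The suggested 2-bit channel field (channel count minus one) can be illustrated with a minimal sketch; the function names are invented for the example, not proposed field names.

```python
# Encoding channel counts 1..4 into a 2-bit header field, as suggested
# in the text: the field carries the channel count minus one.

def encode_channels(channels: int) -> int:
    """Map a channel count of 1..4 onto a 2-bit field value 0..3."""
    if not 1 <= channels <= 4:
        raise ValueError("a 2-bit field only covers 1 to 4 channels")
    return channels - 1

def decode_channels(field: int) -> int:
    """Recover the channel count from the 2-bit field."""
    return (field & 0x3) + 1

# Round-trip mono through quadraphonic:
print([decode_channels(encode_channels(c)) for c in (1, 2, 3, 4)])  # [1, 2, 3, 4]
```

The bias by one is what makes all four useful values fit in two bits; a field carrying the raw count would waste the zero code point and exclude four channels.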
3.4.2 Video Encodings

Common video encodings are listed in Table 2. Encodings with a tunable rate can be configured for different rates, but produce a fixed-rate stream. The average bit rate produced by variable-rate codecs depends on the source material.

   Org.        name    rate               remarks
   ---------------------------------------------------------
   CCITT       JPEG    tunable
   CCITT       MPEG    variable, tunable
   CCITT       H.261   tunable
               Bolter  variable, tunable
   PictureTel  ??
   BBN         DVC     variable, tunable  block differences

                Table 2: Common video encodings

3.5 Playout Synchronization

A major purpose of RTP is synchronization between the source and sink(s) of a single medium. Note that this is to be distinguished from synchronization between different media such as audio and video (lip sync). Sometimes the two forms are referred to as intra-media and inter-media synchronization. RTP concerns itself only with intra-media or playout synchronization, although its mechanisms, such as timestamps, may be necessary for inter-media synchronization.

In connection with playout synchronization, we can group packets into playout units, a number of which in turn form a synchronization unit. More specifically, we define:

synchronization unit: A synchronization unit consists of one or more playout units (see below) that, as a group, share a common fixed delay between generation and playout of each part of the group. The delay may change at the beginning of such a synchronization unit. The most common synchronization units are talkspurts for voice and frames for video transmission.

playout unit: A playout unit is a group of packets sharing a common timestamp. (Naturally, packets whose timestamps are identical due to timestamp wrap-around are not considered part of the same playout unit.)
For voice, the playout unit would typically be a single voice segment, while for video a video frame could be broken down into subframes, each consisting of packets sharing the same timestamp and ordered by some form of sequence number.

All proposed synchronization methods require a timestamp. The timestamp has to have a sufficient range that wrap-arounds are infrequent. It is desirable that the range exceeds the maximum expected inactive (e.g., silence) period. Otherwise, special handling may be necessary in the case of the sequence number/timestamp combination, as the beginning of the next active period could have a timestamp one greater than the last one, thus masking the beginning of the talkspurt. The 10-bit timestamp used by NVP is generally agreed to be too small, as it wraps around after only 20.5 s (for 20 ms audio packets), while a 32-bit timestamp should serve all anticipated needs, even if the timestamp is expressed in units of samples or other sub-packet entities. Three proposals as to the interpretation of the timestamp have been advanced:

packet/frame: Each packetization or (video/audio) frame interval increments the timestamp. This approach is very efficient in terms of processing and bit use, but cannot be used without out-of-band information if the time interval of media ``covered'' by a packet varies from packet to packet. This occurs for example with variable-rate encoders or if the packetization interval is changed during a conference. This interpretation of a timestamp is assumed by NVP, which defines a frame as a block of PCM samples or a single LPC frame. Note that there is no inherent necessity that all participants within a conference use the same packetization interval. Local implementation considerations such as available clocks may suggest other intervals. As another example, consider a conference with feedback.
For the lecture audio, a long packetization interval may be desirable to better amortize packet headers. For side chats, delays are more important, thus suggesting a shorter packetization interval.(3)

sample: This method simply counts samples, allowing a direct translation between timestamp and playout buffer insertion point. It is just as easily computable as the per-packet timestamp. However, for some media and encodings(4), it may not be quite clear what a sample is. Also, some care must be taken at the receiver if incoming streams use different sampling rates. This method is currently used by vat.

subset of NTP timestamp: 16 bits encode seconds relative to 0 o'clock, January 1, 1900 (modulo 65536) and 16 bits encode fractions of a second, with a resolution of approximately 15.2 µs, which is smaller than any anticipated audio sampling or video frame interval. This timestamp is the same as the middle 32 bits of the 64-bit NTP timestamp [9]. It wraps around every 18.2 hours. If it should be desirable to reconstruct absolute transmission time at the receiver for logging or recording purposes, it should be easy to determine the most significant 16 bits of the timestamp.

------------------------------
3. Nevot, for example, allows each participant to have a different packetization interval, independent of the packetization interval used by Nevot for its outgoing audio. Only the packetization interval for outgoing audio for all conferences must be the same.
4. Examples include frame-based encodings such as LPC and CELP. Here, given that these encodings are based on 8,000 Hz input samples, the preferred interpretation would probably be in terms of audio samples, not frames, as samples would be used for reconstruction and mixing.
Otherwise, wrap-arounds are not a significant problem as long as they occur 'naturally', i.e., at a 16 or 32 bit boundary, so that explicit checking on arithmetic operations is not required. Also, since the translation mechanism would probably treat the timestamp as a single integer without accounting for its division into whole and fractional part, the exact bit allocation between seconds and fractions thereof is less important. However, the 16/16 approach simplifies extraction from a full NTP timestamp. The NTP-like timestamp has the disadvantage that its resolution does not map into any of the common sample intervals. Thus, there is a potential uncertainty of one sample at the receiver as to where to place the beginning of the received packet, resulting in the equivalent of a one-sample slip. CCITT recommendation G.821 postulates a mean slip rate of less than 1 slip in 5 hours, with degraded but acceptable service for less than 1 slip in 2 minutes. Tests with appropriate rounding conducted by the author showed that this most likely does not cause problems. In any event, a double-precision floating point multiplication is needed to translate between this timestamp and the integer sample count available on transmission and required for playout.(5) It has been suggested to use timestamps relative to the beginning of the first transmission from a user. This makes correlation between media from different participants difficult and seems to have no technical or implementation advantages, except for avoiding wrap-around during most conferences. As pointed out above, that seems to be of little benefit. Clearly, the reliability of wallclock-synchronized timestamps depends on how closely the system clocks are synchronized, but that does not argue for giving up potential real-time synchronization in all cases. It also needs to be decided whether the timestamp should reflect real time or sample time.
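As an illustration of the 16/16 timestamp format, the following sketch (hypothetical helper names, not defined by this draft) packs a time value into the middle 32 bits of an NTP timestamp, showing the 1/65536 s resolution and the 65536 s (18.2 hour) wrap-around discussed above.

```python
def to_ts1616(seconds):
    """Pack a time in seconds into a 32-bit 16/16 fixed-point timestamp:
    16 bits of whole seconds (modulo 65536) and 16 bits of binary
    fractions of a second (resolution 1/65536 s, about 15.2 us)."""
    return int(round(seconds * 65536)) & 0xFFFFFFFF

def from_ts1616(ts):
    """Unpack a 16/16 timestamp back to seconds (modulo 65536 s)."""
    return ts / 65536.0
```

Because the field is treated as one 32-bit integer, timestamp differences and additions work with ordinary modulo-2^32 arithmetic, which is the property the text relies on for 'natural' wrap-around handling.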
A real time timestamp is defined to track wallclock time plus or minus a constant offset. Sample time increases by the nominal sampling interval for each sample. The two clocks in general do not agree, since the clock source used for sampling will in all likelihood be slightly off the nominal rate. For example, typical crystals without temperature control are only accurate to 50 -- 100 ppm (parts per million), yielding a potential drift of 0.36 seconds per hour between the sampling clock and wallclock time. Using real time rather than sample time allows for easier synchronization between different media and compensation for slow or fast sample clocks. Note that it is neither desirable nor necessary to obtain the wallclock time when each packet was sampled. Rather, the sender determines the wallclock time at the beginning of each synchronization unit (e.g., a talkspurt for voice and a frame for video) and adds the nominal sample clock duration for all packets within the talkspurt to arrive at the timestamp value carried in packets. The real time at the beginning of a talkspurt is determined by estimating the true sample rate for the duration of the conference. The sample rate estimate has to be accurate enough to allow placing the beginning of a talkspurt to within at most, say, 50 to 100 ms; otherwise, the lack of synchronization may be noticeable, delay computations are confused and successive talkspurts may be concatenated. Estimating the true sampling instant to within a few milliseconds is surprisingly difficult for current operating systems.

------------------------------
5. The multiplication with an appropriate factor can be approximated to the desired precision by an integer multiplication and division, but multiplication by a floating point value is generally much faster on modern processors.
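A quick check of the drift figure quoted above (a hypothetical helper, not part of the protocol): a clock off by 100 ppm accumulates 3600 s/h x 100e-6 = 0.36 s of drift per hour.

```python
def drift_seconds(ppm_error, hours):
    """Accumulated drift between a sample clock that is off by ppm_error
    parts per million and wallclock time, over the given number of hours."""
    return hours * 3600.0 * ppm_error * 1e-6
```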
The sample rate r can be estimated as

    r = (s + q) / (t - t0)

Here, t is the current time, t0 the time at which the first sample was acquired, s the number of samples read, and q the number of samples ready to be read (queued) at time t. Then, the timestamp to be inserted into the synchronization packet is computed as t0 + s/r. Unfortunately, only s is known precisely. The accuracy of the estimates of t0 and t depends on how accurately the beginning of sampling and the last reading from the audio device can be measured. There is a non-zero probability that the process will get preempted between the time the audio data is read and the instant the system clock is sampled. It remains unclear whether indications of current buffer occupancy, if available, can be trusted. Experiments with the SunOS audio driver showed significant variations of the estimated sample rate, with discontinuities of the computed timestamps of up to 25 ms. Kernel support is probably required for meaningful real time measurements.

Sample time increments with the sampling interval for every sample or (sub)frame received from the audio or video hardware. It is easy to determine, as long as care is taken to avoid cumulative round-off errors incurred by simply repeatedly adding the approximate packetization interval. However, synchronization between media and end-to-end delay measurements are then no longer feasible. (Example: Consider an audio and a video stream. If the audio sample clock is slightly faster than the real clock and the video sampling clock, a video and an audio frame belonging together would be marked with different timestamps and thus played out at different instants.)
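The rate estimate r = (s + q)/(t - t0) and the derived talkspurt timestamp can be sketched as follows; variable names follow the text, and the helper functions are illustrative.

```python
def estimate_rate(s, q, t, t0):
    """Estimate the true sample rate from s samples read, q samples still
    queued, the current time t and the time t0 at which sampling began."""
    return (s + q) / (t - t0)

def stream_timestamp(t0, s, r):
    """Real-time timestamp for the current stream position: start of
    sampling plus the nominal duration of the s samples read so far."""
    return t0 + s / r
```

For an 8 kHz source, ten seconds of sampling yields roughly 80,000 samples; small errors in t or t0 (e.g., due to preemption between the device read and the clock read) translate directly into rate-estimate jitter, which is the effect observed with the SunOS driver above.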
If we are forced to use sample time, the advantage of using an NTP timestamp disappears, as the receiver can easily reconstruct an NTP sample-based timestamp from the sample count if needed, but would not have to if no cross-media synchronization is required. RTCP could convey the time increment per sample in full precision. It should be noted that it may not be possible to associate a meaningful notion of time with every packet. For example, if a video frame is broken into several fragments, there is no natural timestamp associated with anything but the first fragment, particularly if there is not even a sequential mapping from screen scan location into packets. Thus, any timestamp used would be purely artificial. A synchronization bit could be used in this particular case to mark the beginning of synchronization units. For packets within synchronization units, there are two possible approaches: first, we can introduce an auxiliary sequence number that is only used to order packets within a frame. Secondly, we could abuse the timestamp field by incrementing it by a single unit for each packet within the frame, thus allowing a variable number of packets per frame. The latter approach is barely workable and rather kludgy.

3.5.1 Synchronization Method

timestamp/sequence number: This method is currently used by NVP. The sequence number is incremented with every transmitted packet. For audio, the beginning of a talkspurt is indicated when successive packets differ in timestamp more than they differ in sequence number. As long as packets are not reordered, determination of the beginning of a talkspurt is generally easy, except for the unlikely case where a new talkspurt has a timestamp that, due to timestamp wrap-around, is one greater than the last packet of the previous talkspurt. However, if packets are reordered, delay adaptation at the beginning of a talkspurt becomes unreliable.
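The beginning-of-talkspurt test for the timestamp/sequence number method can be sketched as follows (an illustrative helper; timestamps are assumed to advance by ts_per_packet units per packetization interval):

```python
def starts_talkspurt(prev_seq, prev_ts, seq, ts, ts_per_packet):
    """A new talkspurt begins when successive packets differ more in
    timestamp than in sequence number (scaled to timestamp units),
    i.e., when time passed during which no packets were sent."""
    return (ts - prev_ts) > (seq - prev_seq) * ts_per_packet
```

During silence no packets are transmitted, so the sequence number stays contiguous while the timestamp jumps; the inequality detects exactly that gap.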
Consider the scenario laid out in Table 3. For convenience, the example assumes that clocks at the transmitter and receiver are perfectly synchronized; also, timestamps are expressed in wallclock time, increasing by 20 time units for each packet. The current playout delay, that is, the jitter estimate, is set at 50 time units and is assumed to stay constant throughout the example. In the table, packet 210 is recognized as the beginning of a new talkspurt if there has been no reordering. If packets 210 and 211 arrive in reverse transmission order, the receiver can only conclude that packet 211 introduces a new talkspurt. Because the wrong packet is treated as the beginning of a talkspurt, the playout delay is really one packetization interval too short for the remainder of the talkspurt. In the example, packet 212 arrives too late and misses its playout time, even though it would have made playout without reordering. This scenario assumes that packets are mixed in at the time of arrival so that their playout time cannot be changed. It is possible to relax that assumption and reschedule packets after discovering that the wrong packet was used as the talkspurt beginning; this, however, would seem to complicate the implementation greatly, as determining how long the mixing is to be delayed cannot be readily decided. Unfortunately, reordering at the beginning of a talkspurt is particularly likely, since common silence detection algorithms send a group of packets to prevent front clipping.

                        no reordering       with reordering
 seq.   timestamp      arrival  playout     arrival  playout
 200      1020          1520     1570        1520     1570
 201      1040          1530     1590        1530     1590
 210      1220          1720     1790        1725     1770
 211      1240          1725     1810        1720     1790
 212      1260          1825     1830        1825     1810

 Table 3: Example where out-of-order arrival leads to packet loss

timestamp/synchronization bit: This method is currently used by vat.
Here, the beginning of a talkspurt is indicated by setting the synchronization bit. A sequence number is not required. This synchronization method is unaffected by out-of-order packet delivery. If the first packet of a talkspurt is lost, two talkspurts are simply merged, without dire consequences except for a missed chance to have the playout delay reflect the delay jitter estimate. The synchronization bit has to be ignored if a packet with a larger timestamp has already arrived. The insertion rule can thus be expressed as

         /  p + dmax           for n = 1
    ln = |                                                (1)
         \  l1 + (tn - t1)     for n > 1

where ln denotes the location within the playout buffer for packet n within a talkspurt, tn the timestamp of packet n within a talkspurt, p the current playout location (the read pointer) and dmax the current estimated playout delay, that is, the estimated maximum delay variation. All quantities are measured in appropriate units (time, samples, or bytes). Addition is performed modulo the buffer size. The role of the synchronization bit for packet video remains to be defined. It does not have to bear any relationship to the content, e.g., the frame structure of a packet video source, as it merely indicates where delay can be varied without affecting perceived quality. The disadvantage of this scheme is that it is impossible for the receiver to get an accurate count of the number of packets that it should have received. While gaps within a talkspurt give some indication of packet loss, we cannot tell what part of the tail of a talkspurt has been transmitted. (Example: consider the talkspurts with timestamps 100, 101, 102, 110, 111, where the packets with timestamps 100 and 110 have the synchronization bit set. At the receiver, we have no way of knowing whether we were supposed to have received two talkspurts with a total of five packets, or two or more talkspurts with up to 12 packets.)
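Equation (1) can be sketched as follows (an illustrative helper; all quantities are in the same units, and addition is performed modulo the buffer size as stated in the text):

```python
def playout_location(n, t_n, t_1, l_1, p, d_max, buf_size):
    """Insertion rule (1): the first packet of a talkspurt is placed a full
    playout delay d_max ahead of the read pointer p; subsequent packets are
    placed relative to the first by their timestamp offset."""
    if n == 1:
        return (p + d_max) % buf_size
    return (l_1 + (t_n - t_1)) % buf_size
```

Because every later packet is placed relative to the first one, the playout delay stays fixed for the whole talkspurt and is re-estimated only at synchronization points.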
We can overcome this difficulty by enhancing RTCP as discussed in Section 3.11.

synchronization bit/sequence number within talkspurt: G.764 implements this method. The sequence number zero is reserved for the first packet of a talkspurt, while sequence numbers 1 through 15 are used for the remaining packets within the talkspurt, wrapping around from 15 to 1, if necessary. This is equivalent to the synchronization bit described earlier. A sequence number gap also triggers a new talkspurt. The scheme is designed for networks that cannot reorder packets. With reordering, packets may easily be played out in the wrong order. Consider, for example, packets with sequence numbers 0, 1, and 2. If the packets arrive in the order 1, 2, 0, the receiver interprets this as two talkspurts and plays the packets in the order received. From the example, we can generalize that sequence numbers that number packets within a talkspurt are not suitable for networks that can reorder packets if used without timestamps.

G.764 also features a delay accumulator field, into which each node adds the queueing and processing delay accumulated at that node. A one-byte field is used to encode delays between 0 and 200 ms with a resolution of 1 ms. The resolution of 1 ms suffices since the delay estimate affects only the placement of the beginning of a talkspurt. Note that the synchronization mechanism does not depend on this delay value. The delay value does, however, allow the application to gauge how congested the underlying network is. With a delay estimate, equation (1) changes so that

    l1 = p + dmax - d1

where d1 is the variable delay accumulated by the first packet of the talkspurt. The end-to-end delay then becomes the maximum variable delay plus the fixed delay, rather than the sum of the estimated maximum variable delay, the fixed delay and the variable delay experienced by the first packet in the talkspurt. Thus, the end-to-end delay is lower without affecting the late loss probability. The delay accumulator could be used for any of the synchronization schemes described here.
Despite this benefit, its use within the Internet appears impossible, as we cannot expect routers to update a field in an application layer protocol like RTP.

3.5.2 End-of-talkspurt indication

An end-of-talkspurt indication is useful to distinguish silence from lost packets. The receiver would want to replace silence by an appropriate background noise level to avoid the ``noise-pumping'' associated with silence detection. On the other hand, missing packets should be reconstructed from previous packets. If the silence detector makes use of hangover, the transmitter can easily set the end-of-talkspurt indicator bit in the last hangover packet. If talkspurts follow each other immediately, the end-of-talkspurt indicator has no effect except in the case where the first packet of a talkspurt is lost. In that case, the indicator would erroneously trigger noise fill instead of loss recovery. The end-of-talkspurt indicator is implemented in G.764 as a ``more'' bit which is set to one for all but the last packet within a talkspurt.

3.5.3 Recommendation

Given the ease of cross-media synchronization and the media independence (except for the sub-frame aspect mentioned), the use of 32-bit 16/16 timestamps representing the middle part of the NTP timestamp is suggested. Generally, a real-time based timestamp appears to be preferable to a sample-based one, but it may not be realizable on some current operating systems. Inter-media synchronization has to await mechanisms that can accurately determine when a particular sample was actually received by the A/D converter. Given the lower overhead and the ease of playout reconstruction, a synchronization bit appears preferable to the sequence number/timestamp combination. Since sequence numbers are useful for cases where packets do not carry meaningful timing information and also ease loss detection, they should be provided for, space permitting.
3.6 Segmentation and Reassembly

For high-bandwidth video, a single frame may not fit into the maximum transmission unit (MTU). Thus, some form of frame sequence number is needed. If possible, the same sequence number should be used for synchronization and fragmentation. Several possibilities suggest themselves:

overload timestamp: No sequence number is used. Within a frame, the timestamp has no meaning. Since it is used for synchronization only when the synchronization bit is set, the other timestamps can just increase by one for each packet. However, as soon as the first frame gets lost or reordered, determining positions and timing becomes difficult or impossible.

continuous: The sequence number is incremented without regard to frame boundaries. If a frame consists of a variable number of packets, it may not be clear what position the packet occupies within the frame if packets are lost or reordered. Continuous sequence numbers make it possible to determine if all packets for a particular frame have arrived, but only after the first packet of the next frame, distinguished by a new timestamp, has arrived.

within frame: Naturally, this approach has properties complementary to the first.

continuous with first-packet option: Packets use a continuous sequence number plus an option in every packet indicating the initial sequence number within the playout unit(6). Carrying both a continuous and a packet-within-frame count achieves the same effect.

continuous with last-packet option: Packets carry a continuous sequence number plus an option in every packet indicating the last sequence number within the playout unit. This has the advantage that the receiver can readily detect when the last packet for a playout unit has been received. The transmitter may not know, however, at the beginning of a playout unit how many packets it will comprise.
Also, the position within the playout unit is more difficult to determine if the initial packet is lost.

It could be argued that encoding-specific location information should be contained within the media part, as it will likely vary in format and use from one medium to the next.

------------------------------
6. suggested by Steve Casner

3.7 Source Identification

3.7.1 Gateways, Reflectors and End Systems

It is necessary to be able to identify the origin of the real-time data in terms meaningful to the application. First, this is required to demultiplex sites (or sources) within the same conference. Secondly, it allows an indication of the currently active source. Currently, NVP makes no explicit provisions for this, assuming that the network source address can be used. This may fail if intermediate agents intervene between the media source and final destination. Consider the example in Fig. 1. An RTP-level gateway is defined as an entity that transforms either the RTP header or the RTP media data or both. Such a gateway could for example merge two successive packets for increased transport efficiency or, probably the most common case, translate media encodings for each stream, say from PCM to LPC (called transcoding). A synchronizing gateway is defined here as a gateway that recreates a synchronous media stream, possibly after mixing several sources. An application that mixes all incoming streams for a particular conference, recreates a synchronous audio stream and then forwards it to a set of receivers is an example of a synchronizing gateway. A synchronizing gateway could be built from two end system applications, with the first application feeding the media output to the media input of the second application and vice versa. In figure 1, the gateways are used to translate audio encodings, from PCM and ADPCM to LPC. The gateway could be either synchronizing or not.
Note that a resynchronizing gateway is only necessary if audio packets depend on their predecessors and thus cannot be transcoded independently. It may be advantageous if the packetization interval can be increased. Also, for connections that are barely able to handle one active source at a time, mixing at the gateway avoids excessive queueing delays when several sources are active at the same time. A synchronizing gateway has the disadvantage that it always increases the end-to-end delay.

We define reflectors as transport-level entities that translate between transport protocols, but leave the RTP protocol unit untouched. In the figure, the reflector connects a multicast group to a group of hosts that are not multicast capable by performing transport-level replication. We define an end system as an entity that receives and generates media content, but does not forward it.

We define three types of sources: the media source is the actual origin of the media, e.g., the talker in an audiocast; a synchronization source is the combination of several media sources with its own timing; the network source is the network-level origin as seen by the end system receiving the media. The end system has to synchronize its playout with the synchronization source, indicate the active party according to the media source and return media to the network source. If an end system receives media through a resynchronizing gateway, the end system will see the gateway as the network and synchronization source, but the media sources should not be affected. The reflector does not affect the media or synchronization sources, but the reflector becomes the network source. (Note that having the reflector change the IP source address is not possible since the end systems need to be able to return their media to the reflector.) vat audio packets include a variable-length list of at most 64 4-byte identifiers containing all media sources of the packet.
However, there is no convenient way to distinguish the synchronization source from the network source. The end system needs to be able to distinguish synchronization sources because jitter computation and playout delay differ for each synchronization source.

   /-------\        +------+
   |       | ADPCM  |      |--\  LPC
   | group |<------>|  GW  |   \            /------ end system
   |       |        |      |    \          /
   \-------/        +------+     reflector >------- end system
   /-------\        +------+    /          \
   |       |  PCM   |      |   /            \------ end system
   | group |<------>|  GW  |--/  LPC
   |       |        |      |
   \-------/        +------+

   <---> multicast

             Figure 1: Gateway topology

Rather than having the gateway (which may be unaware of the existence of a reflector downstream) insert a synchronization source identifier or having the reflector know about the internal structure of RTP packets, the current ad-hoc encapsulation solution used by Nevot may be sufficient: the reflector simply prefixes the true network address (and port?) of the last source (either the gateway or media source, i.e., the synchronization source) to the RTP packet. Thus, each end system and gateway has to be aware whether it is being served by a reflector. Also, multiple concatenated reflectors are difficult to handle.

3.7.2 Address Format Issues

The limitation to four bytes of addressing information may not be desirable for a number of reasons. Currently, it is used to hold an IP address. This works as long as four bytes are sufficient to hold an identifier that is unique throughout the conference and as long as there is only one media source per IP address.
The latter assumption tends to be true for many current workstations, but it is easy to imagine scenarios where it might not be, e.g., a system could hold a number of audio cards, could have several audio channels (Silicon Graphics systems, for example) or could serve as a multi-line telephone interface.(7) The combination of IP address and source port can identify multiple sources per site if each media source uses a different network source port. It does not seem appropriate to force applications to allocate ports just to distinguish sources. In the PBX example, a single output port would appear to be the appropriate method for sending all incoming calls across the network. Given the discussion of longer address formats at least in the longer term, it seems appropriate to consider allowing for variable-length identifiers. Ideally, the identifier would identify the agent, not a computer or network interface.(8) A currently viable implementation is the concatenation of the IP address and some locally unique number. The meaning of the local discriminator is opaque to the outside world; it appears to be generally easier to have a local unique id service than a distributed version thereof. Possibilities for the local discriminator include the numeric process identifier (plus some distinguishing information within the application), the network source port number or a numeric user identifier. For efficiency in the common case of one source per workstation, the convention (used in vat) of using the network source address, possibly combined with the user id or source port, as media and synchronization source should be maintained.

------------------------------
7. If we are willing to forego the identification with a site, we could have a multiple-audio-channel site pick unused IP addresses from the local network and associate them with the second and following audio ports.
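The concatenation of IP address and local discriminator described above can be sketched as follows (an assumed layout for illustration, not a wire format defined by this draft):

```python
import socket
import struct

def source_id(ip, discriminator):
    """Form a source identifier from a 4-byte IPv4 address followed by a
    4-byte locally unique discriminator (e.g., a process id or user id)."""
    return socket.inet_aton(ip) + struct.pack("!I", discriminator)
```

Since the discriminator is opaque to other participants, its internal structure can be chosen freely by each host, as long as it stays unique within that host for the duration of the conference.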
3.8 Energy Indication

G.764 contains a 4-bit noise energy field, which encodes the white noise energy to be played by the receiver in the silences between talkspurts. Playing silence periods as white noise reduces the noise-pumping where the background noise audible during the talkspurt is audibly absent at the receiver during silence periods. Substituting white noise for silence periods at the receiver is not recommended for multi-party conferences, as the summed background noise from all silent parties would be distracting. Determining the proper noise level appears to be difficult. It is suggested that the receiver simply take the energy of the last packet received before the beginning of a silence period as an indication of the background noise. With this mechanism, an explicit indication in the packet header is not required.

3.9 Error Control

It remains to be decided whether the header, the whole packet or neither should be protected by checksums. NVP protects its header only, while G.764 has a single 16-bit check sequence covering both the datalink and packet voice header. However, if UDP is used as the transport protocol, a checksum over the whole packet is already computed by the receiver. (Checksumming for UDP can typically be disabled by the sending or receiving host.) ST-II does not compute checksums for either header or data. Many data link protocols already discard packets with bit errors, so that packets are rarely rejected due to higher-layer checksums.

------------------------------
8. In the United States, a one-way encryption function applied to the social security number would serve to identify human agents without compromising the SSN itself, given that the likelihood of identical SSNs is sufficiently small. The use of a telephone number may be less controversial and is applicable world-wide, but may require some local coordination if numbers are shared.
Bit errors within the data part are probably easier to tolerate than a lost packet, particularly since some media encoding formats may provide built-in error correction. The impact of bit errors within the header can vary; for example, errors within the timestamp may cause the audio packet to be played out at the wrong time, probably much more noticeable than discarding the packet. Other noticeable effects are caused by a wrong conference ID or false encoding (if present). If a separate checksum is desired for the cases where the underlying protocols do not already provide one, it should be optional. Once optional, it would be easy to define several checksum options, covering just the header, the header plus a certain part of the body or the whole packet. A checksum can also be used to detect whether the receiver has the correct decryption key, avoiding noise or (worse) denial-of-service attacks. For that application, the checksum should be computed across the whole packet, before encrypting the content. Alternatively, a well-known signature could be added to the packet and included in the encryption, as long as known plaintext does not weaken the encryption security. Recommendation: optional for header; if not used, 4-byte signature in data. 3.10 Security 3.10.1 Encryption Only encryption can provide privacy as long as intruders can monitor the channel. It is desirable to specify an encryption algorithm and provide implementations without export restrictions. DES is widely available outside the United States and could easily be added even to binary-only applications by dynamic linking. We have the choice of either encrypting both the header and data or only the data. Encrypting the header denies the intruder knowledge about some conference details (for example, who the participants are, although this is only true as long as the UDP source address does not already reveal that information). 
It also allows some heuristic detection of key mismatches, as the version identifier, timestamp and other header information are somewhat predictable. However, header encryption makes packet traces and debugging by external programs difficult.

Public-key cryptography does not work for true multicast systems, since the public encryption key differs for every recipient, but it may be appropriate for two-party conversations or application-level multicast. In that case, mechanisms similar to privacy-enhanced mail will probably be appropriate. Key distribution for non-public-key encryption is beyond the scope of this recommendation. For one-way applications, it may be desirable to prohibit listeners from interrupting the broadcast. (After all, since live lectures on campus get disrupted fairly often, there is reason to fear that a sufficiently controversial lecture carried on the Internet would suffer a similar fate.) Again, asymmetric encryption can be used. Here, the decryption key is made available to all receivers, while the encryption key is known only to the legitimate sender. Current public-key algorithms are probably too computationally intensive for all but low-bit-rate voice. In most cases, filtering based on sources will be sufficient.

3.10.2 Authentication

The usual message digest methods are applicable if only the integrity of the message is to be protected against spoofing.

3.11 Quality of Service Control

Because real-time services cannot afford retransmissions, they are immediately affected by packet loss and delays. For debugging and monitoring purposes, it is useful to know exactly where and why losses occur. Losses occur either within the network or because of excessive delay within the application. To determine the fraction of losses and the amount of network loss, knowledge of the number of frames transmitted is required.
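As a sketch of how such counts translate into a loss estimate (illustrative Python; not part of the protocol), the receiver compares its own count against the transmitted count reported by the sender:

```python
def loss_fraction(sent: int, received: int) -> float:
    """Fraction of packets lost, given the sender's transmitted count
    (e.g., learned through a control message) and the receiver's own
    count of packets that arrived in time to be played out."""
    if sent <= 0:
        return 0.0
    # Reordering around a measurement point can briefly make the
    # received count exceed the reported sent count; clamp at zero.
    lost = max(sent - received, 0)
    return lost / sent
```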
A packet sequence number with sufficient range provides the most reliable and easiest-to-implement method of gauging packet loss. If a sequence number is not available, it is difficult or impossible for the receiver to get an accurate count of the packets transmitted. Thus, the following RTCP service is suggested for that case. An RTCP message of type PC (packet count) contains two 32-bit integers, the first containing the timestamp when the measurement was taken, the second the number of transmitted samples, bytes, packets, or the amount of audio/video measured in seconds, expressed as a 16/16 timestamp. To make it easier for the receiver to use that information, the sample should be taken at a synchronization point, indicated by the synchronization bit in the data packet (see Section 3.5.1). Since this field is intended to measure network packet loss, a packet or byte count would be the simplest to maintain, as the meaning of a sample depends on the packet content, for example the number of channels, the encoding, whether it is audio or video, and so on.

The receiver simply stores the number of received samples at each synchronization point and then, after receiving the PC packet, can determine the fraction of packets lost so far. Packet reordering may introduce a slight inaccuracy if a packet sent before the synchronization point arrives afterwards. Given that there typically is a gap between that last packet and the synchronization point, this occurrence should be sufficiently unlikely as to leave the loss measurement accurate enough for QOS monitoring. This method avoids the cumulative errors inherent in estimates based purely on timestamps.

4 Conference Control Protocol

Currently, only conference control functions used for loose conferences (open admission, no explicit conference set-up) have been considered in depth.
Support for the following functionality needs to be specified:

 o authentication

 o floor control, token passing

 o invitations, calls

 o discovery of conferences and resources (directory service)

 o media, encoding and quality-of-service negotiation

 o voting

 o conference scheduling

The functional specification of a conference control protocol is beyond the scope of this draft.

5 Packet Format

Given the above technical justifications, the following packet formats are proposed.

5.1 Data

The data packet header format is shown in Figure 2. The optional 16-bit framing field and the optional 32-bit IP address designating the network source are not shown. All integer fields are in network byte order (most significant byte first). The content of the fields is defined as follows:

protocol version: two-bit version identifier. The initial version number is one. The value of zero is reserved for the current vat protocol.

sync (S): synchronization bit, described in Section 3.5.1.

media: media encoding. The five bits form an index into a table of encodings defined out-of-band. If no mapping has been defined, a standard mapping to be specified by the IANA is used. The value of zero is reserved and indicates that the encoding is carried as an option of type MEDIA. The value of one is reserved and indicates that the encoding is specified in RTCP packets or the conference control protocol. If a packet with a media field value of one arrives and no encoding is known from the conference control protocol, the receiver should defer playing these packets until a control packet has been received. If the packet does not contain a MEDIA option, the last defined encoding is used.

option length: number of 32-bit words contained within the options immediately following the header.

sequence number: 16-bit sequence number counting packets.

timestamp: timestamp, reflecting real time.
The timestamp consists of the middle 32 bits of an NTP timestamp.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Ver|S|  media  | option length |        sequence number        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      timestamp (seconds)      |     timestamp (fraction)      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                Figure 2: RTP data packet format

The packet header is followed by options, if any, and the media data. Optional fields are summarized in Table 4. Each packet may contain any number of options; unless otherwise noted, each option may appear only once per packet. Each option consists of a one-byte option type designation, followed by a one-byte length field denoting the total number of 32-bit words comprising the option, followed by any option-specific data. Options are aligned to the natural length of the field, i.e., 16-bit words are aligned on even addresses, 32-bit words are aligned at addresses divisible by four, etc. Options unknown to the application are to be ignored. The MEDIA option, if present, must precede all other options whose interpretation depends on the current encoding. Currently, no such options are defined.

type   description
-----  ------------------------------------------------

MSRC   Globally unique media source identifier. A packet may contain multiple options of this type, indicating all contributors. A source is identified by a globally unique six-byte string: the concatenation of a two-byte numeric user id unique within the system, followed by a four-byte Internet address(9). If missing, the network source is considered the media source.

SSRC   Globally unique synchronization source identifier. The format is the same as for the MSRC option. If missing, the network source is considered the synchronization source.
MEDIA  Media encoding identification, as discussed in Section 3.4. The first byte designates the encoding, with values of 128 through 255 reserved for experimental encodings. Values of 0 through 127 are assigned by the IANA. Encoding-specific parameters follow. The parameter string is padded with zeros until the option has a length divisible by four. For audio encodings, a single byte contains a two-bit channel count in the most significant bits and a six-bit index into an IANA-defined table of sampling frequencies in the least significant bits. An index value of zero designates the natural sampling frequency defined for each encoding.

ENERG  Energy indication. The length and interpretation of this field are media-dependent and specified for each encoding. The ENERG field must follow the MEDIA field, if present.

BOP    (beginning of playout unit) 16-bit sequence number designating the first packet within the current playout unit.

                  Table 4: Optional fields

5.2 Control Packets

The scope of RTCP is meant to be limited to a single medium, conveying minimal out-of-band state information during a conference. Thus, any means of providing reliability are beyond its scope. A version field is not needed since new control message types can be defined readily. Control packets are sent periodically to the same multicast group as data packets, using the same time-to-live value. The period should be varied randomly to avoid synchronization of all sources. The period determines how long a new receiver has to wait, in the worst case, until it can identify a source. The control packets defined here extend the functionality found in vat session packets. Control packets consist of one or more items using the same format and alignment as options within the data packet.
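The header and option/item layouts described above can be sketched as follows (illustrative Python, not part of the specification; the assumption that the leftmost field of Figure 2 occupies the most significant bits of the first byte follows usual network-diagram convention):

```python
import struct

def pack_header(version, sync, media, opt_words, seq, ts_sec, ts_frac):
    """Fixed data packet header of Figure 2: Ver(2 bits), S(1), media(5),
    option length, sequence number, 16/16 timestamp, network byte order."""
    first = ((version & 0x3) << 6) | ((sync & 0x1) << 5) | (media & 0x1f)
    return struct.pack("!BBHHH", first, opt_words & 0xff,
                       seq & 0xffff, ts_sec & 0xffff, ts_frac & 0xffff)

def pack_option(opt_type, payload):
    """One option/item: type byte, length byte counting the 32-bit words
    comprising the whole option, payload zero-padded to a multiple of
    four bytes."""
    total = 2 + len(payload)
    padded = total + (-total % 4)     # round up to a 32-bit boundary
    return (bytes([opt_type & 0xff, padded // 4])
            + payload + b"\0" * (padded - total))
```

A complete data packet would then be `pack_header(...)` followed by the packed options and the media data.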
Non-overlapping type numbers for data packet options and control message items are to be assigned, so that control information could be carried in data packets if so desired. The packet format is shown in Figure 3, while the item types are defined in Table 5. Padding is used to align fields to multiples of four bytes. The value used for padding is undefined.

A Port Assignment

Since it is anticipated that UDP and similar port-oriented protocols will play a major role in carrying RTP traffic, the issue of port assignment needs to be addressed. The way ports are assigned mainly affects how applications can extract the packets destined for them. For each medium, there also needs to be a mechanism for distinguishing data from control packets.

For unicast UDP, only the port number is available for demultiplexing. Thus, each medium will need a separate port number pair unless a separate demultiplexing agent is used. However, for one-to-one connections, dynamically negotiating a port number is easy. If several UDP streams are used to provide multicast, the port number issue becomes more thorny. For connection-oriented protocols like ST-II or TCP, only packets for a particular connection reach the application. For UDP multicast, an application can elect to receive only packets with a particular port number and multicast address by binding to the appropriate multicast address. Thus, for UDP multicast, there is no need to distinguish media by port numbers, as each medium is assumed to have its own designated multicast group. Any dynamic port allocation mechanism would fail for large, dynamic multicast groups, but might be appropriate for small conferences and two-party conversations.

Data and control packets for a single medium can either share a single port or use two different port numbers. (Currently, two adjacent port numbers are used.) A single port for data and control simplifies the receiver code and conserves port numbers.
It requires some other means of identifying control packets, for example a special media code, and does not allow the sharing of a single control port by several applications.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  type=ENERG   |   length=1    |         energy level          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   type=MSRC   |   length=2    |            user id            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  IP address of media source                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   type=SSRC   |   length=2    |            user id            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             IP address of synchronization source              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  type=MEDIA   |   length=2    |   encoding    |ch#|sampling f.|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 encoding-specific parameters                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   type=BOP    |   length=1    |  first seq.# in playout unit  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 Figure 3: RTCP packet format
type   description
-----  ------------------------------------------------

ID     The media or synchronization source identifier, using the same 6-byte format as the MSRC and SSRC options. This identifier applies to all following items until the next ID item. This results in more compact coding when application gateways are used and allows aggregation of several sources into one control message.

ALIAS  A variable-length string padded with zeros so that the total length of the item, including the type and length bytes, is a multiple of four bytes. The content of the field describes the media source identified by the most recent ID item, for example by giving the name and affiliation of the talker or the call letters of the radio station being rebroadcast. The content is not specified or authenticated. The text is encoded as 7-bit US-ASCII, values 32 to 127 (decimal). The escape mechanism for character sets other than US-ASCII remains to be defined (ISO 2022?).

DESC   Media content description, with the same format as ALIAS. The field describes the current media content. Example applications include the session title for a conference distribution, or the current program title for radio or television redistribution through packet networks.

BYE    The site specified by the most recent ID item requests to be dropped from the conference. No further data; padded to a 32-bit word length.

PC     16 bits of padding are followed by a 32-bit 16/16 timestamp (same format as the synchronization timestamp) and a 32-bit packet count. The item specifies the number of packets transmitted by the sender of this RTCP message up to the time specified.

TIME   16 bits of padding are followed by wallclock time and media clock, both expressed as 16/16 timestamps.

MEDIA  Media description (see Table 4).

              Table 5: The RTCP message types
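A receiver walking an RTCP packet iterates over items in the type/length format above; a minimal sketch (illustrative Python, not part of the specification):

```python
def parse_items(buf: bytes):
    """Split a control packet into (type, payload) items.  Each item
    begins with a one-byte type and a one-byte length counting the
    32-bit words comprising the whole item (including these two bytes)."""
    items = []
    off = 0
    while off + 2 <= len(buf):
        itype, words = buf[off], buf[off + 1]
        total = words * 4
        if words == 0 or off + total > len(buf):
            break                # malformed item: stop parsing
        items.append((itype, buf[off + 2:off + total]))
        off += total
    return items
```

Items with unknown type values would simply be skipped by the caller, mirroring the rule that unknown options are ignored.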
Using a single RTCP stream for several media may be advantageous to avoid duplicating, for example, the same identification information for voice, video and whiteboard streams. This works only if there is one multicast group that all members of a conference subscribe to. Given the relatively low frequency of control messages, the coordination effort between applications and the necessity to designate control messages for a particular medium are probably reason enough to have each application send control messages to the same multicast group as the data. In conclusion, for multicast UDP, two assigned port numbers, one for data and one for control, seem to offer the most flexibility.

B Multicast Address Allocation

A fixed allocation of network multicast addresses to conferences is clearly not feasible, since the lifetime of conferences is unknown, the potential number of conferences is rather large and the available number space is limited to about 2^28 addresses, of which 2^16 have been assigned to conferences. Dynamic allocation of addresses without the intervention of some centralized clearing-house mechanism appears to be difficult. One approach would be akin to carrier sense multiple access: a conference originator would listen on a randomly selected multicast address using the session port (it is left as an exercise to the reader to imagine what happens if a data port is used). Within a small multiple of the session announcement interval (with vat, this interval averages six seconds), we would have some indication of whether the address is in use. This technique may fail for a number of reasons. First, collisions are possible if the same multicast address is checked nearly simultaneously, though this is unlikely as long as the number space is only sparsely utilized.
More seriously, it is quite possible that multicast islands using the same multicast group are unaware of each other because they are isolated by time-to-live restrictions or temporary network interruptions. It is clearly undesirable to be forced to renegotiate a new multicast address in the middle of a conference because time-to-live values or network connectivity have changed. It appears to the author that since multicasting takes place at the IP level, we would have to check all potential ports to avoid drawing multicast traffic with the same group but different destination port towards us. Some IP-level mechanism would have to be added to the kernel to avoid having to scan all ports. A probe packet sent with maximum time-to-live to the desired address would avoid missing time-to-live-isolated islands and would detect temporarily idle multicast groups, but would impose a rather severe load on the network, without solving temporary network partitions. Probe packets and responses could also get lost. Using probe packets also requires an agreement that all potential users of the range of multicast addresses would indeed respond to a probe packet. Using the conference identifier at the RTP level to detect collisions may have severe performance consequences for both the network and the receiving host if the conference sharing the same multicast group happens to send high-bandwidth data.

One solution would be to provide a hierarchical allocation of addresses. Here, the originator of a conference asks the nearest address provider for an available address. The provider in turn asks the next level up (for example, the regional network) or a peer if it has temporarily run out of addresses. The conference originator would be responsible for returning the address after use.
The return of addresses after use raises the issue of what happens if either the requesting agent or the issuer of the address crashes. A timeout mechanism is probably most robust. Addresses could be issued for a certain number of hours. If the original requester renews the request before the expiration of the timeout period, it is guaranteed to have the request granted. With that policy, requester or issuer crashes can be handled gracefully under most circumstances. It remains to be decided what a conference originator is supposed to do if an address renewal request fails because the address provider has crashed or connectivity has been lost. It is imaginable that each site would pay an access fee for a block of addresses, similar to the access-speed dependent fee charged for network connectivity within the Internet. This would provide local incentives for each administrative domain (AD) to recoup unused addresses. Trading of smaller address blocks between friendly ADs could accommodate peak demands or clearing-house failures, similar to the mutual support agreements between electrical utilities. For increased reliability, each AD could offer multiple clearing-houses, just as it typically maintains several name servers. As an extension, it may be desirable to distinguish multicast addresses with different reach. A local address would be given out with the restriction of a maximum time-to-live value and could thus be reused at an AD sufficiently removed, akin to the combination of cell reuse and power limitation in cellular telephony. Given that many conferences will be local or regional (e.g., broadcasting classes to nearby campuses of the same university or a regional group of universities, or an electronic town meeting), this should allow significant reuse of addresses. Reuse of addresses requires careful engineering of thresholds and would probably only be useful for very small time-to-live values that restrict reach to a single local area network. 
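The lease-with-renewal policy above can be sketched as follows (illustrative Python; the class and method names are hypothetical and not part of the draft):

```python
import time

class AddressLease:
    """Toy lease-based multicast address provider: addresses are issued
    for a fixed period and reclaimed on expiry, so a crash of either
    the requester or the issuer heals itself after one timeout."""

    def __init__(self, pool, lease_seconds=3600):
        self.free = set(pool)
        self.leases = {}                 # address -> (holder, expiry)
        self.lease_seconds = lease_seconds

    def _expire(self, now):
        for addr, (_, exp) in list(self.leases.items()):
            if exp <= now:               # holder never renewed: reclaim
                del self.leases[addr]
                self.free.add(addr)

    def request(self, holder, now=None):
        now = time.time() if now is None else now
        self._expire(now)
        if not self.free:
            return None                  # would escalate to the next level up
        addr = self.free.pop()
        self.leases[addr] = (holder, now + self.lease_seconds)
        return addr

    def renew(self, holder, addr, now=None):
        now = time.time() if now is None else now
        lease = self.leases.get(addr)
        if lease and lease[0] == holder and lease[1] > now:
            # Renewal before expiry is always granted, per the policy above.
            self.leases[addr] = (holder, now + self.lease_seconds)
            return True
        return False
```

A crashed requester simply stops renewing, and its address returns to the pool at the next expiry sweep.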
The proposed allocation mechanism has no single point of failure, scales well and conserves the addressing resources by providing appropriate incentives, combined with local control. It requires sufficient address space to supply the hierarchy.(10) The address allocation may or may not be handled by the same authority that provides conference naming and discovery services.

------------------------------
10. The ideas presented here are compatible with the more general proposals contained in ``Remote Conferencing Architecture'' by Yee-Hsiang Chang and Jon Whaley.

C Glossary

The glossary below briefly defines the acronyms used within the text. Further definitions can be found in the Internet draft draft-ietf-userglos-glossary-00.txt, available for anonymous ftp from nnsc.nsf.net and other sites. Some of the general Internet definitions below are copied from that glossary.

16/16 timestamp: A 32-bit integer timestamp consisting of a 16-bit field containing the number of seconds followed by a 16-bit field containing the binary fraction of a second. This timestamp can measure about 18.2 hours with a resolution of approximately 15 microseconds.

ADPCM: adaptive differential pulse code modulation. Rather than transmitting ! PCM samples directly, the difference between the estimate of the next sample and the actual sample is transmitted. This difference is usually small and can thus be encoded in fewer bits than the sample itself. The ! CCITT recommendations G.721, G.723, G.726 and G.727 describe ADPCM encodings.

CCITT: Comite Consultatif International Telegraphique et Telephonique. This organization is part of the United Nations International Telecommunications Union (ITU) and is responsible for making technical recommendations about telephone and data communications systems. X.25 is an example of a CCITT recommendation. Every four years CCITT holds plenary sessions where it adopts new recommendations. Recommendations are known by the color of the cover of the book they are contained in.

CELP: code-excited linear prediction; audio encoding method for low-bit-rate codecs.

CD: compact disc.

codec: short for coder/decoder; device or software that ! encodes and decodes audio or video information.

companding: reducing the dynamic range of audio or video by a non-linear transformation of the sample values. The best-known methods for audio are u-law, used in North America, and A-law, used in Europe and Asia. ! G.711 [10]

DAT: digital audio tape.

encoding: transformation of the media content for transmission, usually to save bandwidth, but also to decrease the effect of transmission errors. Well-known encodings are G.711 (u-law PCM) and ADPCM for audio, JPEG and MPEG for video. ! encryption

encryption: transformation of the media content to ensure that only the intended recipients can make use of the information. ! encoding

end system: host where conference participants are located. RTP packets received by an end system are played out, but not forwarded to other hosts (in a manner visible to RTP).

frame: unit of information. Commonly used for video to refer to a single picture. For audio, it refers to data that forms an encoding unit. For example, an LPC frame consists of the coefficients necessary to generate a specific number of audio samples.

G.711: ! CCITT recommendation for ! PCM audio encoding at 64 kb/s using u-law or A-law companding.

G.764: ! CCITT recommendation for packet voice; specifies both an ! HDLC-like data link layer and a network layer. In the draft stage, this standard was referred to as G.PVNP. The standard is primarily geared towards digital circuit multiplication equipment used by telephone companies to carry more voice calls on transoceanic links.
G.PVNP: designation of CCITT recommendation ! G.764 while in draft status.

GSM: Groupe Special Mobile. In general, the designation for the European mobile telephony standard; in particular, often used to describe the 8 kb/s audio coding it uses.

H.261: ! CCITT recommendation for the compression of motion video at rates of p x 64 kb/s. Originally intended for narrowband ! ISDN.

hangover: audio data transmitted after the silence detector indicates that no audio data is present. Hangover ensures that the ends of words, important for comprehension, are transmitted even though they are often of low energy.

HDLC: high-level data link control; standard data link layer protocol (closely related to LAPD and SDLC).

ICMP: Internet Control Message Protocol; an extension to the Internet Protocol that allows for the generation of error messages, test packets and informational messages related to ! IP.

in-band: signaling information is carried together (in the same channel or packet) with the actual data. ! out-of-band

IP: internet protocol; the Internet Protocol, defined in RFC 791, is the network layer for the TCP/IP protocol suite. It is a connectionless, best-effort packet switching protocol [11].

IP address: four-byte binary host interface identifier used by ! IP for addressing. An IP address consists of a network portion and a host portion. RTP treats IP addresses as globally unique, opaque identifiers.

IPv4: current version (4) of ! IP.

ISDN: integrated services digital network; refers to an end-to-end circuit-switched digital network intended to replace the current telephone network. ISDN offers circuit-switched bandwidth in multiples of 64 kb/s (B or bearer channels), plus a 16 kb/s packet-switched data (D) channel.

JPEG: joint photographic experts group. Designation of a variable-rate compression algorithm using discrete cosine transforms for still-frame color images.

LPC: linear predictive coder.
Audio encoding method that models speech as the parameters of a linear filter; used for very low bit rate codecs.

loosely controlled conference: participants can join and leave the conference without connection establishment or notifying a conference moderator. The identity of conference participants may or may not be known to other participants. See also: tightly controlled conference.

MPEG: motion picture experts group. Designates a variable-rate compression algorithm for full-motion video at low bit rates; uses both intraframe and interframe coding.

media source: entity (user and host) that produced the media content. It is the entity shown as the active participant by the application.

MTU: maximum transmission unit; the largest frame length which may be sent on a physical medium.

Nevot: network voice terminal; application written by the author.

network source: entity denoted by the address and port number from which the ! end system receives an RTP packet and to which the end system sends any RTP packets for that conference in return.

NVP: network voice protocol; original packet format used in early packet voice experiments; defined in RFC 741 [3].

OSI: Open Systems Interconnection; a suite of protocols, designed by ISO committees, to be the international standard computer network architecture.

out-of-band: signaling and control information is carried in a separate channel or separate packets from the actual data. For example, ICMP carries control information out-of-band, that is, as separate packets, for IP, but both ICMP and IP usually use the same communication channel (in band). ! in-band

PCM: pulse-code modulation; speech coding where speech is represented by a given number of fixed-width samples per second. Often used for the coding employed in the telephone network: 8,000 eight-bit samples per second.
playout: delivery of the media content to the final consumer within the receiving host. For audio, this implies digital-to-analog conversion; for video, display on a screen.

PVP: packet video protocol; extension of ! NVP to video data [12].

SB: subband, as in subband codec; audio or video encoding that splits the frequency content of a signal into several bands and encodes each band separately, with the encoding fidelity matched to human perception for that particular frequency band.

RTCP: real-time control protocol; adjunct to ! RTP.

RTP: real-time transport protocol; discussed in this draft.

ST-II: stream protocol; connection-oriented, unreliable, non-sequenced, packet-oriented network and transport protocol with process demultiplexing and provisions for establishing flow parameters for resource control; defined in RFC 1190 [1].

TCP: transmission control protocol; an Internet standard transport layer protocol defined in RFC 793. It is connection-oriented and stream-oriented, as opposed to UDP [13].

TPDU: transport protocol data unit.

tightly controlled conference: participants can join the conference only after an invitation from a conference moderator. The identity of all conference participants is known to the moderator. ! loosely controlled conference

transcoder: device or application that translates between several encodings, for example between ! LPC and ! PCM.

UDP: user datagram protocol; unreliable, non-sequenced, connectionless transport protocol defined in RFC 768 [14].

vat: visual audio tool (voice terminal) written by Steve McCanne and Van Jacobson.

vt: voice terminal software written at the Information Sciences Institute.

VMTP: versatile message transaction protocol; defined in RFC 1045 [15].
D Address of Author

Henning Schulzrinne
AT&T Bell Laboratories
MH 2A244
600 Mountain Avenue
Murray Hill, NJ 07974
telephone: 908 582-2262
electronic mail: hgs@research.att.com

References

[1] S. Casner, C. Lynn, Jr., P. Park, K. Schroder, and C. Topolcic, ``Experimental internet stream protocol, version 2 (ST-II),'' Network Working Group Request for Comments RFC 1190, Information Sciences Institute, Oct. 1990.

[2] C. Topolcic, ``ST II,'' in First International Workshop on Network and Operating System Support for Digital Audio and Video, no. TR-90-062 in ICSI Technical Reports, (Berkeley, CA), 1990.

[3] D. Cohen, ``Specification for the network voice protocol (NVP),'' Network Working Group Request for Comments RFC 741, ISI, Jan. 1976.

[4] N. Borenstein and N. Freed, ``MIME (multipurpose internet mail extensions) mechanisms for specifying and describing the format of internet message bodies,'' Network Working Group Request for Comments RFC 1341, Bellcore, June 1992.

[5] J. G. Gruber and L. Strawczynski, ``Subjective effects of variable delay and speech clipping in dynamically managed voice systems,'' IEEE Transactions on Communications, vol. COM-33, pp. 801--808, Aug. 1985.

[6] N. S. Jayant, ``Effects of packet losses in waveform coded speech and improvements due to an odd-even sample-interpolation procedure,'' IEEE Transactions on Communications, vol. COM-29, pp. 101--109, Feb. 1981.

[7] D. Minoli, ``Optimal packet length for packet voice communication,'' IEEE Transactions on Communications, vol. COM-27, pp. 607--611, Mar. 1979.

[8] V. Jacobson, ``Compressing TCP/IP headers for low-speed serial links,'' Network Working Group Request for Comments RFC 1144, Lawrence Berkeley Laboratory, Feb. 1990.

[9] D. L. Mills, ``Network time protocol (version 2) --- specification and implementation,'' Network Working Group Request for Comments RFC 1119, University of Delaware, Sept. 1989.

[10] N. S.
Jayant and P. Noll, Digital Coding of Waveforms. Englewood Cliffs, NJ: Prentice Hall, 1984.

[11] J. Postel, ``Internet protocol,'' Network Working Group Request for Comments RFC 791, Information Sciences Institute, Sept. 1981.

[12] R. Cole, ``PVP - a packet video protocol,'' W-Note 28, Information Sciences Institute, University of Southern California, Los Angeles, CA, Aug. 1981.

[13] J. B. Postel, ``DoD standard transmission control protocol,'' Network Working Group Request for Comments RFC 761, Information Sciences Institute, Jan. 1980.

[14] J. B. Postel, ``User datagram protocol,'' Network Working Group Request for Comments RFC 768, ISI, Aug. 1980.

[15] D. R. Cheriton, ``VMTP: Versatile Message Transaction Protocol specification,'' Network Working Group Request for Comments RFC 1045, SRI International, Menlo Park, CA, Feb. 1988.