Internet Engineering Task Force                Audio-Video Transport WG
INTERNET-DRAFT                       A. Periyannan, D. Singer, M. Speer
draft-periyannan-generic-rtp-00       Apple Computer / Sun Microsystems
                                                         March 13, 1998
                                            Expires: September 13, 1998


                 Delivering Media Generically over RTP

Status of This Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as ``work in progress.''

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts
   Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
   ftp.isi.edu (US West Coast).

   Distribution of this document is unlimited.

Abstract

   This document specifies a method for delivering generic media
   streams over the Realtime Transport Protocol (RTP). This proposal is
   intended for media or codec types that are not already handled by
   other RTP payload specifications. Three packetization schemes are
   defined for carrying the media data. The Session Description
   Protocol (SDP) is used to convey to receivers the packetization
   scheme used, the media data encoding format and parameters for the
   media encoding format.

1 Introduction

   This document defines a method for delivering generic media streams
   over the Realtime Transport Protocol (RTP) [1]. RTP is a protocol
   designed to carry realtime media data along with synchronization
   information over a datagram protocol (usually UDP over IP).
   The protocol itself does not address the encapsulation of specific
   media types, but instead leaves it to various profile and payload
   format specifications. An accompanying RTP profile document [2]
   contains various payload specifications to carry audio and video
   over RTP for conferencing applications and specifies the static
   payload types for each audio/video compression scheme.

   The RTP payload specifications available today are limited to a few
   audio compression schemes such as PCM, GSM and DVI [2] and a few
   video compression schemes such as JPEG, MPEG and H.261 [3,4,5]. In
   the current model every new compression scheme requires a new RTP
   payload specification. Going forward this model is impractical,
   since many new audio and video codecs will become available. Many
   compression schemes that are already available within media file
   formats and playback architectures such as QuickTime, WAV, RealAudio
   and ASF do not have RTP payload specifications. Media types such as
   text and MIDI are not addressed by any RTP payload specification.
   There needs to be a way of carrying all of this media over RTP
   without having to write an individual payload specification for each
   of them.

   Two proposals were made for delivering QuickTime [7] and ASF [8]
   media over RTP that solve the above mentioned problems. These
   proposals solved the primary problem but raised a major concern:
   they were incompatible with each other. They introduced two ways of
   delivering the same media content over RTP. In contrast, this
   proposal illustrates a method for carrying generic media and codec
   types over RTP that is independent of the file format and media
   playback architecture.

   The goals of this proposal can be divided into two categories:

   - define a set of packetization schemes that cover the needs of
     various media types.

   - define a mechanism within SDP to convey the packetization scheme,
     the media encoding format and other parameters for the media
     encoding format.

   Proposals to achieve the above goals are covered in sections 2 and 3
   of this document. Open issues are listed in section 4.

2 RTP Packetization Schemes

   This proposal defines three packetization schemes. They are designed
   to meet the needs of different types of media samples in media
   streams. The scheme used in a given RTP session is agreed upon by
   the senders and receivers through non-RTP means. (Section 3 defines
   a mechanism to convey this information through SDP.)

   For the purposes of this proposal, a media sample is defined as a
   unit of compressed or uncompressed media data with an associated
   duration and timestamp. Audio samples are typically constant size,
   constant duration units of audio data. Video samples are usually
   variable size units of video data referred to as video frames. MIDI
   samples are typically variable size, variable duration units of MIDI
   instructions. Text samples are variable size, variable duration
   units containing text strings, usually in Unicode format.

   The choice of which scheme to use on a given RTP session is based on
   the type of stream. More specifically, it depends on characteristics
   such as:

   - the duration of each media sample

   - the size of each media sample compared to the Maximum Transmission
     Unit (MTU) size of the underlying network

   - the need for receivers to detect key samples with a mechanism that
     is independent of the encoding format. (Key samples are defined as
     intracoded samples in a media stream that also contains intercoded
     samples.)

   - the need to specify sample durations

   The definition of each packetization scheme in this section is
   preceded by a recommendation on its usage based on the above
   considerations.

   The three schemes proposed here may not satisfy the needs of all the
   current and future types of media streams. More packetization
   schemes may be added to this list to satisfy changing requirements.

2.1 Generic Scheme A

   This packetization scheme is recommended for use with media streams
   with the following characteristics:

   - the duration and size of each media sample is constant

   - the size of each media sample is less than the MTU size of the
     underlying network (after taking into account the RTP header)

   Media streams that typically fall into this category are frame-based
   as well as sample-based compressed or uncompressed audio streams.

   The RTP packet used in Scheme A is formatted as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                          RTP Header                           .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                        Sample Data...                         .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The format and general usage of the RTP header fields are described
   in [1]. The following fields of the RTP header will be used as
   specified below:

   - The payload type should specify one of the dynamic payload types
     that should be agreed upon through some non-RTP means.

   - The RTP timestamp is based on a timescale that should be agreed
     upon through some non-RTP means. The timestamp encodes the
     sampling instant of the first media sample contained in the RTP
     data packet. Multiple samples may be contained in one RTP packet.
     The initial value of the timestamp is random (unpredictable) to
     make known-plaintext attacks on encryption more difficult, see
     RTP [1].

   - The marker bit (M-bit) is unused. Transmitters must set this bit
     to zero. Receivers must ignore this bit.

   The sample data immediately follows the RTP header and contains one
   or more complete media samples.
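
   As an illustration of Scheme A, the following Python sketch groups
   constant-size samples into payloads that fit the available packet
   size and stamps each packet with the sampling instant of its first
   sample. The function name, its parameters, and the 12-byte header
   allowance (a fixed RTP header with no CSRC entries or extension) are
   assumptions of this example, not definitions made by this proposal.
   The random initial timestamp is supplied by the caller.

```python
from typing import List, Tuple

RTP_FIXED_HEADER = 12  # assumption: fixed RTP header, no CSRC list or extension


def scheme_a_packets(
    samples: List[bytes],
    sample_duration: int,
    max_packet_size: int,
    initial_timestamp: int,
) -> List[Tuple[int, int, bytes]]:
    """Return (timestamp, marker, payload) tuples for constant-size samples.

    sample_duration is expressed in units of the session timescale.
    max_packet_size is the room available for RTP header plus payload.
    Each payload holds one or more complete samples; nothing is fragmented.
    """
    if not samples:
        return []
    size = len(samples[0])
    if size == 0 or any(len(s) != size for s in samples):
        raise ValueError("Scheme A expects non-empty, constant-size samples")
    per_packet = (max_packet_size - RTP_FIXED_HEADER) // size
    if per_packet < 1:
        raise ValueError("a sample must fit in one packet under Scheme A")

    packets = []
    for first in range(0, len(samples), per_packet):
        # Timestamp of the first sample in the packet, wrapped to 32 bits.
        timestamp = (initial_timestamp + first * sample_duration) % 2**32
        payload = b"".join(samples[first:first + per_packet])
        packets.append((timestamp, 0, payload))  # M-bit is always zero
    return packets
```

   A receiver can recover the timestamp of the n-th sample in a payload
   by adding n times the constant sample duration to the packet
   timestamp, which is why Scheme A needs no per-sample header.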

2.2 Generic Scheme B

   This packetization scheme is recommended for use with media streams
   with the following characteristics:

   - the size of each media sample is variable and typically greater
     than the MTU size of the underlying network (after taking into
     account the RTP header)

   - the duration of each media sample is the difference between its
     timestamp and the timestamp of the next sample, or the duration is
     either implicitly or explicitly contained within the sample data.

   - the receivers do not need to detect key samples using a mechanism
     that is independent of the encoding format.

   Media streams that typically fall into this category are compressed
   video streams with large frame sizes.

   The RTP packet used in Scheme B is formatted as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                          RTP Header                           .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                        Sample Data...                         .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The format and general usage of the RTP header fields are described
   in [1]. The following fields of the RTP header will be used as
   specified below:

   - The payload type should specify one of the dynamic payload types
     that should be agreed upon through some non-RTP means.

   - The RTP timestamp is based on a timescale that should be agreed
     upon through some non-RTP means. The timestamp encodes the
     sampling instant of the media sample contained in the RTP data
     packet. An RTP packet contains either a single complete sample or
     one fragment of a sample that is spread over multiple RTP packets.
     If a media sample occupies more than one packet, the timestamp
     must be the same on all of those packets. Packets containing
     different samples must have different timestamps so that samples
     may be distinguished by the timestamp.
     The initial value of the timestamp is random (unpredictable) to
     make known-plaintext attacks on encryption more difficult, see
     RTP [1].

   - The marker bit (M-bit) is set to one in the last packet of a
     sample and must otherwise be zero. If one sample is fully
     contained within an RTP packet the M-bit must be set to one. Thus,
     it is possible to easily detect that a complete sample has been
     received and can be decoded and presented.

   The sample data immediately follows the RTP header and contains one
   complete media sample or a fragment of a media sample.

2.3 Generic Scheme C

   This packetization scheme is recommended for use with media streams
   with the following characteristics:

   - the size of each media sample is variable

   - some samples are larger than the MTU size, but most are much
     smaller.

   - sometimes the samples require a mechanism outside their encoding
     format to specify the duration.

   - the receivers need to detect key samples using a mechanism that is
     independent of the encoding format.

   Media streams that typically fall into this category are compressed
   video streams with some large frames and many small frames,
   proprietary video streams whose receivers need to detect key
   samples, and MIDI streams that require the duration of a sample to
   be specified.

   The RTP packet used in Scheme C is formatted as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                          RTP Header                           .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                      Scheme C Header...                       .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                        Sample Data...                         .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                      Scheme C Header...                       .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                        Sample Data...                         .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                            ......                             .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The format and general usage of the RTP header fields are described
   in [1]. The following fields of the RTP header will be used as
   specified below:

   - The payload type should specify one of the dynamic payload types
     that should be agreed upon through some non-RTP means.

   - The RTP timestamp is based on a timescale that should be agreed
     upon through some non-RTP means. The timestamp encodes the
     sampling instant of the first media sample contained in the RTP
     data packet. Multiple samples may be contained in one RTP packet
     or a single sample may be fragmented over multiple RTP packets. If
     a media sample occupies more than one packet, the timestamp must
     be the same on all of those packets. Packets containing different
     samples must have different timestamps so that samples may be
     distinguished by the timestamp. The initial value of the timestamp
     is random (unpredictable) to make known-plaintext attacks on
     encryption more difficult, see RTP [1].

   - The marker bit (M-bit) is set to one in the last packet of a
     sample and must otherwise be zero. If one or more samples are
     fully contained within an RTP packet the M-bit must be set to one.
     Thus, it is possible to easily detect that a complete sample has
     been received and can be decoded and presented.

   The RTP payload contains one of the following:

   - One media sample fragment, i.e. when the sample size is larger
     than the MTU size and hence the sample has to be fragmented over
     multiple packets.

   - One or more complete media samples, i.e. when the sample size is
     smaller than the MTU size and hence one or more media samples can
     be placed in a single RTP packet.

   In both cases each media sample or fragment is preceded by a Scheme
   C header.
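
   The two payload cases above can be sketched as a sender-side
   planning step. The Python function below is only an illustration:
   its name, the tuple layout of its result, and the fixed per-header
   allowance `header_len` are assumptions of this example. A real
   sender would size each header according to the optional fields it
   actually carries.

```python
from typing import List, Tuple

# A planned packet is (marker_bit, entries). Each entry describes the data
# that follows one Scheme C header: (sample_index, byte_offset, byte_count).
Entry = Tuple[int, int, int]
Packet = Tuple[int, List[Entry]]


def plan_scheme_c(
    sample_sizes: List[int], max_payload: int, header_len: int = 4
) -> List[Packet]:
    """Plan RTP payloads for samples of the given sizes, in order.

    header_len is a hypothetical fixed allowance per Scheme C header.
    """
    if max_payload <= header_len:
        raise ValueError("max_payload must leave room for data after a header")
    packets: List[Packet] = []
    pending: List[Entry] = []  # complete samples collected for one packet
    used = 0                   # payload bytes consumed by pending entries

    def flush() -> None:
        nonlocal pending, used
        if pending:
            # A packet of complete samples always carries M-bit = 1.
            packets.append((1, pending))
            pending, used = [], 0

    for index, size in enumerate(sample_sizes):
        need = header_len + size
        if need > max_payload:
            # Too large for one packet: exactly one fragment per packet.
            flush()
            room = max_payload - header_len
            for offset in range(0, size, room):
                count = min(room, size - offset)
                last = offset + count == size
                packets.append((1 if last else 0, [(index, offset, count)]))
        else:
            if used + need > max_payload:
                flush()
            pending.append((index, 0, size))
            used += need
    flush()
    return packets
```

   For example, `plan_scheme_c([100, 100, 3000, 50], 1000)` places the
   two small samples in one packet, spreads the large sample over four
   packets with the M-bit set only on the last, and sends the final
   sample in a packet of its own.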

   The Scheme C header is defined as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S|L|R|D|  RES  |                 Length/Offset                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Relative Timestamp                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Duration                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The fields in the Scheme C header have the following meanings:

   S bit: 1 bit
      The S-bit is set to one if the sample is a key sample, i.e. an
      intracoded sample. Otherwise it is set to zero. The S-bit in all
      headers preceding fragments of the same sample must be set to the
      same value.

   L bit: 1 bit
      The L-bit is set to one if the Length/Offset field contains a
      length. Otherwise it is set to zero and the Length/Offset field
      contains an offset. The L-bit must be set to one in all headers
      preceding complete samples and must be set to zero in all headers
      preceding sample fragments.

   R bit: 1 bit
      The R-bit is set to one if the header contains a relative
      timestamp. Otherwise it is set to zero. The R-bit in all headers
      preceding fragments of the same sample must be set to the same
      value.

   D bit: 1 bit
      The D-bit is set to one if the header contains a sample duration.
      Otherwise it is set to zero. The D-bit in all headers preceding
      fragments of the same sample must be set to the same value.

   RES: 4 bits
      Reserved for future use. Transmitters must set these bits to
      zero. Receivers must ignore these bits.

   Length/Offset: 24 bits
      If a single sample is fragmented over multiple packets, the L-bit
      is set to zero and the Length/Offset field contains the byte
      offset of the first byte of this fragment from the beginning of
      the sample.
      If one or more complete samples are contained in this packet, the
      L-bit is set to one in each Scheme C header, and the
      Length/Offset field contains the length of this sample (including
      the Scheme C header). The sum of the lengths of all samples in a
      packet must be equal to the RTP payload length. Receivers make
      use of this relationship to ascertain whether there are more
      samples to extract from a packet.

   Relative Timestamp: signed 32 bits
      This field is present only if the R-bit is set to one. It
      contains the relative timestamp for this sample with respect to
      the timestamp in the RTP header. The timescale used is the same
      as that used for the timestamp in the RTP header. This field is
      specified as a signed 32-bit number to allow for negative offsets
      from the RTP header timestamp. When this field is absent a
      default relative timestamp of zero is used.

   Duration: 32 bits
      This field is present only if the D-bit is set to one. It
      contains the duration of the sample. The timescale used is the
      same as that used for the timestamp in the RTP header. The
      Duration in all headers preceding fragments of the same sample
      must be set to the same value. When this field is absent the
      default duration is implicitly or explicitly obtained from the
      sample data. If this is not possible the default is the
      difference between this sample's timestamp and the next sample's
      timestamp.

   The length of the Scheme C header varies between 4 bytes and 12
   bytes depending on whether the R and D bits are set. When neither
   bit is set the header length is 4 bytes. When only one of them is
   set the header length is 8 bytes. When both bits are set the header
   length is 12 bytes.

3 SDP Usage

   SDP is used as a mechanism to convey to RTP receivers the
   information required to decode and present a set of RTP sessions.
   SDP can be used as an announcement mechanism as described in [9] or
   can be used as a description format with the Real Time Streaming
   Protocol [6]. In both cases, SDP is used to specify the set of RTP
   sessions being transmitted, the media type in each session, the
   payload format and encoding format of each session and other
   parameters associated with each session.

   This proposal defines a set of extensions to the mechanisms already
   defined by SDP. These extensions are used to convey the following
   information:

   - RTP packetization scheme

   - Media sample encoding format

   - Parameters for the media encoding format

   The SDP rtpmap and fmtp attributes are used to convey the above
   information.

3.1 rtpmap Attribute

   The SDP specification [9] currently defines the following format for
   the rtpmap attribute:

      a=rtpmap:<payload type> <encoding name>/<clock rate>
               [/<encoding parameters>]

   The <encoding name> is either a registered IANA name or an
   unregistered name preceded by "X-". Currently, this name specifies
   the encoding format of the media samples and also implicitly
   specifies the packetization scheme used.

   This proposal defines a new usage for the <encoding name> that
   separates the packetization scheme and the media encoding format.
   When the packetization scheme is implicit in the encoding format the
   optional fields are dropped and the usage becomes identical to the
   old usage of <encoding name>. The <encoding name> is enclosed in
   quotes when it contains the optional fields. All other fields of the
   rtpmap attribute line are used as defined in [9].

   The <encoding name> is defined as follows:

      "[<format scope>/]<encoding format>[,<packetization scheme>]"

   The <encoding format> specifies the encodings used in the media
   samples, i.e. it specifies the compression scheme used for audio or
   video streams or the media type used for other types of streams. The
   optional <packetization scheme> specifies the headers used within
   the RTP payload and the mechanisms used to fragment samples over RTP
   packets. The <encoding format> as well as the <packetization scheme>
   are either registered IANA names or unregistered names preceded by
   "X-".

   The optional <format scope> that precedes the encoding format is
   used to qualify the <encoding format> when the format falls within
   the scope of an encompassing format. The <format scope> is either a
   registered IANA name or an unregistered name preceded by "X-". When
   a <format scope> is present then the <encoding format> falls under
   the scope of the <format scope> and hence is not registered with the
   IANA.

   The packetization schemes defined in section 2 are named as follows:

      Generic Scheme A     genpak-a
      Generic Scheme B     genpak-b
      Generic Scheme C     genpak-c

   Some examples of the new usage of the rtpmap attribute are
   presented:

   An RTP session with Intel Indeo video over generic packetization
   scheme B with a timescale (clockrate) of 600:

      a=rtpmap:99 "indeo,genpak-b"/600

   An RTP session with QuickTime MIDI over generic packetization scheme
   C with a timescale of 30:

      a=rtpmap:99 "x-qt/midi,genpak-c"/30

   An RTP session with Microsoft ADPCM audio over generic packetization
   scheme A with a timescale of 8 kHz:

      a=rtpmap:99 "x-asf/00000002-0000-0010-8000-00AA00389B71,genpak-a"/8000

   An RTP session with QuickTime AppleVideo over generic packetization
   scheme B with a timescale of 90000:

      a=rtpmap:99 "x-qt/viderpza,genpak-b"/90000

   An RTP session with QuickTime Sprites over a proprietary
   packetization scheme with a timescale of 600:

      a=rtpmap:99 "x-qt/twen,x-qtsprite"/600

   This method of using the rtpmap attribute allows widely available
   proprietary video codecs such as Intel's Indeo to be sent over RTP
   regardless of the file format used to store the content or the
   multimedia architecture used to present it. In addition, the method
   is flexible enough to specify proprietary codecs that only exist
   within a proprietary file format or multimedia playback
   architecture. The method also allows new packetization schemes to be
   added independently of new encoding formats.
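
   A receiver could split such an rtpmap line into its parts along the
   following lines. This Python sketch is illustrative only: the
   function name, the `RtpMap` tuple and its field names are
   assumptions of this example, and the regular expressions cover just
   the quoted form shown in the examples above and the plain unquoted
   form. It is not a complete SDP parser.

```python
import re
from typing import NamedTuple, Optional


class RtpMap(NamedTuple):
    payload_type: int
    scope: Optional[str]          # e.g. "x-qt"; None when absent
    encoding: str                 # e.g. "midi" or "indeo"
    packetization: Optional[str]  # e.g. "genpak-c"; None when implicit
    clock_rate: int
    parameters: Optional[str]     # trailing encoding parameters, if any


# Quoted form: the encoding name carries the optional fields.
_QUOTED = re.compile(r'^a=rtpmap:(\d+) "([^"]*)"/(\d+)(?:/(.*))?$')
# Plain form: the packetization scheme is implicit in the encoding name.
_PLAIN = re.compile(r'^a=rtpmap:(\d+) ([^/\s"]+)/(\d+)(?:/(.*))?$')


def parse_rtpmap(line: str) -> RtpMap:
    """Split one rtpmap attribute line into its fields."""
    line = line.strip()
    quoted = _QUOTED.match(line)
    if quoted:
        pt, name, rate, params = quoted.groups()
        # Text after the comma names the packetization scheme.
        name, _, packetization = name.partition(",")
        # Text before the slash, when present, is the enclosing scope.
        scope, slash, encoding = name.partition("/")
        if not slash:
            scope, encoding = None, name
        return RtpMap(int(pt), scope, encoding, packetization or None,
                      int(rate), params)
    plain = _PLAIN.match(line)
    if plain:
        pt, name, rate, params = plain.groups()
        return RtpMap(int(pt), None, name, None, int(rate), params)
    raise ValueError(f"not an rtpmap attribute line: {line!r}")
```

   Applied to the QuickTime MIDI example, the function yields payload
   type 99, scope "x-qt", encoding "midi", packetization "genpak-c" and
   a clock rate of 30.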

3.2 fmtp Attribute

   The SDP specification [9] currently defines the following format for
   the fmtp attribute:

      a=fmtp:<format> <format specific parameters>

   <format> is one of the payload types specified for the media, i.e.
   one of the payload types for which there is an rtpmap attribute
   line. The usage of <format specific parameters> is currently
   undefined in the specification.

   This proposal defines that the usage of <format specific parameters>
   is scoped by the <encoding format> specified in the corresponding
   rtpmap attribute for the payload type specified in the fmtp line.
   Thus, if we are sending Intel's Indeo (indeo) over RTP the format
   specific parameters are those defined for Indeo, and if we are
   sending QuickTime MIDI (x-qt/midi) over RTP the format specific
   parameters are those defined for QuickTime MIDI. The definitions for
   format specific parameters for a given encoding format are beyond
   the scope of this document.

4 Open Issues

   The following open issues need to be resolved:

   - M-bit usage in scheme A

     The M-bit is currently unused in scheme A. It can be used to
     indicate the first packet after a gap in the RTP timeline.

   - A new rtpmap2

     An attempt was made to keep the rtpmap line as close as possible
     to the current specification. This may not be the best choice. A
     new rtpmap2 will eliminate confusion and parsing problems. (Also,
     the "," delimiter is not widely used in SDP. Should a space be
     used instead?)

   - Multiple fmtp attribute lines per format

     We may need multiple fmtp lines per format (payload type) for
     better readability. In the current SDP specification it is unclear
     whether this is allowed.

   - Binary data in fmtp attribute lines

     When sending proprietary encoding formats over RTP, the format
     specific parameters may need to be transparently conveyed to the
     receiver in binary form. There is currently no mechanism defined
     in SDP to convey binary data.

   - Large SDP files

     The format specific parameters may require more space than the 1
     Kbyte limit in SDP.
     This limit needs to be relaxed.

Acknowledgments

   The authors would like to thank all the members of the QuickTime
   Streaming team (Anne Jones, Jay Geagan, Andy Grignon, Sylvain Rouze
   and Kevin Gong) for their valuable input in writing this proposal.

References

   [1] H. Schulzrinne, et al., "RTP: A Transport Protocol for Real-Time
       Applications", IETF RFC 1889, January 1996.

   [2] H. Schulzrinne, et al., "RTP Profile for Audio and Video
       Conferences with Minimal Control", IETF RFC 1890, January 1996.

   [3] L. Berc, et al., "RTP Payload Format for JPEG-compressed Video",
       IETF RFC 2035, October 1996.

   [4] D. Hoffman, et al., "RTP Payload Format for MPEG1/MPEG2 Video",
       IETF RFC 2038, October 1996.

   [5] T. Turletti, C. Huitema, "RTP Payload Format for H.261 Video
       Streams", IETF RFC 2032, October 1996.

   [6] H. Schulzrinne, et al., "Real Time Streaming Protocol", IETF
       Draft, draft-ietf-mmusic-rtsp-09.txt, February 2 1998, Expires:
       August 2 1998.

   [7] A. Jones, et al., "RTP Payload Format for QuickTime Media
       Streams", IETF Draft, draft-ietf-avt-qt-rtp-00.txt, July 22
       1997, Expires: January 22 1998.

   [8] A. Klemets, "RTP Payload Format for ASF Streams", IETF Draft,
       draft-klemets-asf-rtp-00.txt, October 8 1997, Expires: April 8
       1998.

   [9] M. Handley, "SDP: Session Description Protocol", IETF Draft,
       draft-ietf-mmusic-sdp-05.txt, November 21 1997, Expires:
       November 21 1998.

Authors' Contact Information

   Alagu Periyannan
   Email: alagu@apple.com
   Tel: +1 408 862 5387

   David Singer
   Email: singer@apple.com
   Tel: +1 408 974 3162

   Apple Computer, Inc.
   One Infinite Loop, MS:302-3MT
   Cupertino CA 95014
   USA

   Michael Speer
   Email: michael.speer@sun.com
   Tel: +1 650 786 6368

   Sun Microsystems, Inc.
   901 San Antonio Road, MS UMPK15-214
   Palo Alto CA 94303
   USA