INTERNET-DRAFT                                        Katsushi Kobayashi
draft-ietf-avt-dv-video-00.txt         Communication Research Laboratory
                                                          Akimichi Ogawa
                                                         Keio University
                                                          Stephen Casner
                                                           Cisco Systems
                                                         Carsten Bormann
                                                 Universitaet Bremen TZI
                                                           June 25, 1999
                                                   Expires December 1999

                 RTP Payload Format for DV Format Video

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

1. Abstract

This document specifies the packetization scheme for encapsulating the digital video data streams defined by the HD Digital VCR Conference, commonly known as "DV", into a payload format for the Real-Time Transport Protocol (RTP). The RTP payload format specified in this document supports three quality levels of digital video identified as SD-VCR, HD-VCR and SDL-VCR.

2. Introduction

The HD Digital VCR Conference has published a digital video specification set entitled "Specification of Consumer-Use Digital VCRs using 6.3mm magnetic tape" [1,2]. The specification set

Kobayashi, et al          Expires December 1999                 [Page 1]

Internet Draft                                             June 25, 1999

consists of two subset specifications, the first of which is "Specification of Consumer-Use Digital VCRs".
That subset comprises the whole specification for consumer-use digital video including mechanical specifications of a cassette, helical magnetic recording format, error correction in the magnetic tape, DCT video encoding format, and audio encoding format. The digital video format defined by that specification is commonly known as "DV" format. The second subset is "Specification of Digital Interface for Consumer Electronic Audio/Video Equipment" (abbreviated hereafter as the Digital Interface). That subset defines the communication protocol for carrying DV video and audio over the IEEE 1394 high performance serial bus [3]. The IEEE 1394 bus may be used to interconnect digital video cameras, digital VCRs, computers and other devices. This document specifies the RTP payload format for encapsulating the DV format data streams obtained via the Digital Interface into the Real-time Transport Protocol (RTP), version 2 [4]. The HD Digital VCR Conference specification set supports several video formats: SD-VCR (including 525/60, 625/50), HD-VCR (1125/60, 1250/50), SDL-VCR (525/60, 625/50), PALplus, DVB (Digital Video Broadcast) and ATV (Advanced Television). However, the Digital Interface specifies the IEEE1394 communication protocol for only a subset of these video formats. The RTP payload format defined here covers only those video formats that are included in the Digital Interface. Furthermore, some formats defined by the HD Digital VCR Conference, e.g. DVB and ATV, are based on MPEG2. The payload format for encapsulating MPEG2 into RTP has already been defined in RFC 2250. That payload format is more suitable for transmission of MPEG2 over the Internet than would be a packetization of MPEG2 first into the IEEE 1394 protocol and then into RTP. Therefore, packetization of DV formats based on MPEG2 is outside the scope of this document. 
Consequently, the payload format specified in this document will support the original six video formats of the HD Digital VCR Conference: SD-VCR (525/60, 625/50), HD-VCR (1125/60, 1250/50) and SDL-VCR (525/60, 625/50).

The HD Digital VCR Conference is also standardizing an audio and video device control protocol, that is, a command set for video equipment operation and status queries to video devices. This document does not address these control functions.

Throughout this specification, we make extensive use of the VCR Conference terminology. The reader should consult the Digital Interface references for definitions of these terms.

3. DV format encoding

The DV format is designed for magnetic tape applications and is optimized for helical magnetic recording on tape media. All video data including audio and other system data are managed within the picture frame unit of video.

The video encoding consists of a three-level hierarchical structure. A picture frame is divided into rectangle- or clipped-rectangle-shaped DCT super blocks. DCT super blocks are divided into 27 rectangle- or square-shaped DCT macro blocks. The DCT macro block consists of 6 square 8x8 DCT blocks, four of which represent the Y picture component and the remaining two represent Cr and Cb.

Audio is encoded with sampled data. Its frequency is 32 kHz, 44.1 kHz or 48 kHz, its quantization is 16-bit linear or 12-bit non-linear, and the number of channels may range from 2 to 8. Only certain combinations of these parameters are allowed depending upon the video format, as specified in [1].

A frame of data in the DV format stream is divided into several "DIF sequences". A DIF sequence is composed of an integral number of fixed length (80-byte) DIF blocks. Each DIF block contains a 3-byte ID header that specifies the type of the DIF block and its position in the DIF sequence.
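As an informal illustration of the DIF block layout described above, the following Python sketch splits a buffer into 80-byte DIF blocks and separates each 3-byte ID header from the rest of the block. The function name and the constants are hypothetical and chosen for this example only. The bit layout inside the ID header is not given in this document, so the sketch leaves the header undecoded.

```python
from typing import List, Tuple

DIF_BLOCK_SIZE = 80  # fixed DIF block length in bytes, as stated in Section 3
DIF_ID_SIZE = 3      # length of the ID header at the start of each DIF block


def split_dif_blocks(data: bytes) -> List[Tuple[bytes, bytes]]:
    """Split a buffer into (id_header, body) pairs, one pair per DIF block.

    Hypothetical helper for illustration; the ID header is returned as raw
    bytes because its internal layout is defined in the DV references.
    """
    if len(data) % DIF_BLOCK_SIZE != 0:
        raise ValueError("buffer is not an integral number of DIF blocks")
    blocks = []
    for offset in range(0, len(data), DIF_BLOCK_SIZE):
        block = data[offset:offset + DIF_BLOCK_SIZE]
        blocks.append((block[:DIF_ID_SIZE], block[DIF_ID_SIZE:]))
    return blocks
```

A caller would pass the concatenated DIF blocks of a DIF sequence (or of a whole frame) and then interpret each ID header according to the Digital Interface references.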
Five types of DIF blocks are defined: DIF sequence header, Subcode, Video Auxiliary information (VAUX), Audio data and Video data.

3.1 Transmission of DV format over IEEE 1394

The specification of the Digital Interface defines a transport protocol for transmission of video stream data in the isochronous stream mode of IEEE 1394 called "real time data transmission protocol". The protocol defines the general Common Isochronous Packet (CIP) header that does not depend on the encoding format of the payload. Several real time transmission encodings have been defined on CIP, including MPEG2 and MIDI in addition to DV format [1,2].

A DIF block is the basic unit for all transmission on IEEE 1394. Each IEEE 1394 isochronous stream packet is composed of an integral number of DIF blocks, assembled without regard to DIF sequence boundaries, up to the limit of the MTU for IEEE 1394.

4. Usage of RTP

Each RTP packet starts with the RTP header as defined in RFC 1889 [4]. No additional payload-format-specific header is required for this payload format.

4.1 RTP header usage

The meaning of RTP header fields that are specific to the DV format is described in the following:

Payload type (PT): The payload type is dynamically assigned by means outside the scope of this document. Details of the encoding format, such as audio sampling rate and video scan rate, are given in the AAUX and VAUX data embedded in the data stream. However, the same information SHOULD be provided as part of the dynamic payload type assignment. If multiple encoding formats are to be used within one RTP session, then multiple dynamic payload types MUST be assigned, one for each encoding format. The sender MUST change to the corresponding payload type whenever the encoding format is changed.
The sender MUST NOT rely on the information included in AAUX or VAUX to notify the receiver of an encoding format change, because the packet carrying this information might be dropped, and the change would then not be visible to the receiver until the next AAUX or VAUX packet is received.

Timestamp: 32-bit 90 kHz timestamp representing the time at which the first data in the frame was sampled. All RTP packets within the same video frame MUST have the same timestamp. The timestamp SHOULD increment by a multiple of the nominal interval for one frame time, as given in the following table:

   Mode       Frame rate (Hz)   Increase per frame in 90 kHz timestamp
   525-60     29.97             3003
   625-50     25                3600
   1125-60    30                3000
   1250-50    25                3600

The progress of video frame times MAY be monitored using the SYT timestamp carried in the CIP header, as described in Appendix A.

Marker bit (M): The marker bit of the RTP fixed header is set to one on the last packet of a video frame, and must be zero otherwise. The M bit allows the receiver to know that it has received the last packet of a frame so it can display the image without waiting for the first packet of the next frame to arrive to detect the frame change. However, detection of a frame change MUST NOT rely on the marker bit since the last packet of the frame might be lost. Detection of a frame change MUST be done by differences in RTP timestamp.

4.2 DV data encapsulation into RTP payload

All of the information in the IEEE 1394 CIP header is either implicit in the RTP payload format or supplanted by information in the RTP header, so the CIP header is not required. For this payload format, the CIP header MUST be removed from the IEEE 1394 packet, leaving just a sequence of DIF blocks. Integral DIF blocks are placed into the RTP payload beginning immediately after the RTP header.
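The timestamp and frame-change rules of Section 4.1 can be sketched as follows. This is an informal Python illustration, not part of the payload format: the function names are hypothetical, and the only values taken from this document are the per-frame increments in the table above and the 32-bit width of the RTP timestamp.

```python
# Nominal increase of the 90 kHz RTP timestamp per video frame (Section 4.1).
TICKS_PER_FRAME = {
    "525-60": 3003,
    "625-50": 3600,
    "1125-60": 3000,
    "1250-50": 3600,
}

TIMESTAMP_MASK = 0xFFFFFFFF  # RTP timestamps are 32 bits wide and wrap around


def frame_timestamp(first_timestamp: int, frame_index: int, mode: str) -> int:
    """Timestamp shared by every packet of frame number `frame_index`."""
    ticks = TICKS_PER_FRAME[mode]
    return (first_timestamp + frame_index * ticks) & TIMESTAMP_MASK


def is_new_frame(previous_timestamp: int, timestamp: int) -> bool:
    """Detect a frame change from the timestamp, never from the marker bit."""
    return timestamp != previous_timestamp


def frames_elapsed(previous_timestamp: int, timestamp: int, mode: str) -> int:
    """Number of nominal frame times between two timestamps (wrap-safe)."""
    delta = (timestamp - previous_timestamp) & TIMESTAMP_MASK
    return delta // TICKS_PER_FRAME[mode]
```

A result greater than one from `frames_elapsed` would indicate that the sender skipped frames or that whole frames were lost, since the timestamp is still advanced for discarded frames.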
DIF blocks carried by different IEEE 1394 packets may be packed into one RTP packet, except that all DIF blocks in one RTP packet must be from the same video frame. DIF blocks from the next video frame MUST NOT be packed into the same RTP packet even if there is more payload space remaining. This requirement stems from the fact that the transition from one video frame to the next is indicated by a change in the RTP timestamp. It also reduces the processing complexity at the receiver. Since the RTP payload contains an integral number of DIF blocks, the length of the RTP payload will be a multiple of 80 bytes.

Audio and video data may be transmitted as one bundled RTP stream or in separate RTP streams. The choice MUST be indicated as part of the assignment of the dynamic payload type and MUST remain unchanged for the duration of the RTP session to avoid complicated procedures of sequence number synchronization. In the case of one bundled stream, DIF blocks for both audio and video are packed into RTP packets in the same order as they were generated. When audio and video are sent in separate RTP streams, or when only one medium is sent, then only the DIF blocks corresponding to the selected medium are included. If VAUX DIF blocks are included, they MUST only be sent in the video stream.

When sending a separate audio stream in the 16-bit encoding, it is RECOMMENDED that the audio stream data be extracted from the DIF blocks and repackaged in the L16 payload format defined in RFC 1890 [5] in order to maximize interoperability with non-DV-capable receivers.

When sending separate video and audio streams with both in DV format, the same timestamp SHOULD be used for both audio and video data within the same frame in order to simplify lip synchronization at the receiver. Lip synchronization may also be achieved using reference timestamps passed in RTCP as described in [4].
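The packing rules of Section 4.2 can be sketched as below. This is a minimal Python illustration under stated assumptions: the `RtpPacketPlan` type, the `packetize_frame` function and the default `max_payload` of 1440 bytes are hypothetical and are not defined by this document. The sketch only models the fields this payload format constrains (timestamp, marker bit, payload) and does not build a full RTP header.

```python
from dataclasses import dataclass
from typing import List

DIF_BLOCK_SIZE = 80  # bytes per DIF block


@dataclass
class RtpPacketPlan:
    timestamp: int  # identical for every packet of one video frame
    marker: bool    # True only on the last packet of the frame
    payload: bytes  # integral number of DIF blocks, CIP header removed


def packetize_frame(frame: bytes, timestamp: int,
                    max_payload: int = 1440) -> List[RtpPacketPlan]:
    """Pack the DIF blocks of ONE frame into RTP payloads.

    `max_payload` is an assumed payload budget in bytes; a real sender
    would derive it from the path MTU.
    """
    if not frame or len(frame) % DIF_BLOCK_SIZE != 0:
        raise ValueError("frame must be a non-empty run of 80-byte DIF blocks")
    blocks_per_packet = max_payload // DIF_BLOCK_SIZE
    if blocks_per_packet < 1:
        raise ValueError("max_payload cannot hold a single DIF block")
    chunk = blocks_per_packet * DIF_BLOCK_SIZE
    packets = []
    for offset in range(0, len(frame), chunk):
        payload = frame[offset:offset + chunk]
        is_last = offset + chunk >= len(frame)
        # Blocks of the next frame are never appended, even when the last
        # packet of this frame has spare room.
        packets.append(RtpPacketPlan(timestamp, is_last, payload))
    return packets
```

Calling the function once per frame keeps DIF blocks of different frames in different packets, so a receiver can rely on the timestamp change alone to find frame boundaries.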
The sender MAY send null AAUX information and omit VAUX DIF blocks if the VAUX/AAUX information remains constant during the session. However, the VAUX/AAUX information in the DV stream includes source encoding parameters, such as video display aspect ratio, audio quantization and number of audio channels, which are required to decode the stream. Therefore, if VAUX/AAUX information is not transmitted in the stream, the equivalent parameters essential to playout MUST be provided by some out-of-band means beyond the scope of this document.

The receiver MUST be able to process a data stream with null AAUX information and null or omitted VAUX DIF blocks if the equivalent parameters are provided out of band. Therefore, if the RTP receiver is feeding the DV stream to a device that requires AAUX information and VAUX DIF blocks, the receiver MUST be able to generate AAUX within audio DIF blocks and VAUX DIF blocks for the device using the parameters provided out of band.

The sender MAY reduce the video frame rate by discarding the video data and VAUX DIF blocks for some of the video frames. The RTP timestamp must still be incremented to account for the discarded frames. The sender MAY alternatively reduce bandwidth by discarding video data DIF blocks for portions of the image which are unchanged from the previous image. To enable this bandwidth reduction, receivers SHOULD implement an error concealment strategy to accommodate lost or missing DIF blocks by repeating the corresponding DIF block from the previous image.

5. SDP Signaling for RTP/DV

When using SDP (Session Description Protocol) for negotiation of the RTP payload information, the format described in this document SHOULD be used. The SDP description will be slightly different for a bundled stream and an unbundled stream.

5.1 SDP description for unbundled stream

When using an unbundled stream, the RTP streams for video and audio will be sent separately, each to a different port or a different multicast group.
When this is done, the SDP description carries several m=?? lines, each of which gives the media type of one stream (see RFC 2327 [7]). For example, when video is sent on port 31394 with RTP payload type identifier 111, the m=?? line will look like:

   m=video 31394 RTP/AVP 111

The a=rtpmap attribute will look like:

   a=rtpmap:111 DV/90000

"DV" is the encoding name for the DV video payload format defined in this document. 90000 is the clock rate; the payload format defined in this document uses a 90 kHz clock.

In SDP, format specific parameters are defined as a=fmtp, as below.

   a=fmtp:

In the DV video payload format, the a=fmtp line will be used to show the encoding type within the DV video and will be used as below.

   a=fmtp:DV v-encode:

The block with the parameters is used to describe which type of DV format is used. The parameter will be one of the following:

   o SD-VCR/525-60
   o SD-VCR/625-50
   o HD-VCR/1125-60
   o HD-VCR/1250-50
   o SDL-VCR/525-60
   o SDL-VCR/625-50

An example of SDP description using these attributes is:

   v=0
   o=mhandley 2890844526 2890842807 IN IP4 126.16.64.4
   s=SDP Seminar
   i=A Seminar on the session description protocol
   u=http://www.cs.ucl.ac.uk/staff/M.Handley/sdp.03.ps
   e=mjh@isi.edu (Mark Handley)
   c=IN IP4 224.2.17.12/127
   t=2873397496 2873404696
   m=audio 49170 RTP/AVP 112
   m=video 50000 RTP/AVP 113
   a=rtpmap:112 L16/32000/2
   a=rtpmap:113 DV/90000
   a=fmtp:DV 113 v-encode:SD-VCR/525-60

This describes a session where audio and video streams are sent separately. The session is sent to the multicast group 224.2.17.12. The audio is sent using the L16 format, and the video is sent using the 525/60 format, which corresponds to the NTSC format in SD-VCR.

5.2 SDP description for bundled stream

When sending a bundled stream, video and audio DIF blocks will be sent through a single RTP stream.
The encoding information MAY be included in the in-band information of the RTP stream as AAUX/VAUX DIF blocks. However, in order for the receiver to know what type of encoding will be used in the session, the encoding MUST be announced out of band. The parameters related to the video encoding format are already specified in 5.1. The video encoding parameters are also used in an unbundled case. This section specifies the audio format description in SDP for a bundled stream.

To describe the audio parameters, the format specific parameter descriptor is used, as in the unbundled case. The audio encoding format information is specified for each channel in the following format.

   a=fmtp:DV a-encode:\ {...}

Where specifies which channel of format is described in this line. The following channel identifiers are available:

   o 1-4
   o a-h

The sub block represents the type of quantization. Three types of quantization are available:

   o 16L
   o 12NL
   o 20L

The sub block describes the sampling rate with the following parameters:

   o 48000
   o 44100
   o 32000

corresponds to the LF bit in the AAUX source DIF block. The LF bit represents whether the audio block data is locked or unlocked to the video frame in the DV stream.

   o locked
   o unlocked

corresponds to the SM bit in the AAUX source DIF block. The SM bit represents whether the audio data is generated from multiple audio sources (multi-stereo) or from only one source (lumped).

   o multi
   o lumped

corresponds to the CHN bit. The CHN bit shows how many sub channels are included in the audio channel. The following numbers are available:

   o 1
   o 2

corresponds to the PA bit. The PA bit indicates whether the audio channel is related to another channel or not. The following values are available:

   o pair
   o independent

sub block corresponds to the AUDIOMODE field of the AAUX source DIF block. AUDIOMODE describes the type of audio mode on the channel. In DV, a 4-bit field is reserved and each mode is represented by a number from 0 to 15.
   o 0-15

sub block corresponds to the AUDIO LANGUAGE field of the AAUX closed caption DIF block. A 3-bit field is reserved and each mode is represented by a number from 0 to 7.

   o 0-7

The optional sub field corresponds to the EF bit of the AAUX source DIF block. The EF bit indicates whether the audio data is emphasized or not. If the audio data is emphasized on the sender side, the following parameter is specified. If not, this field and the