INTERNET-DRAFT                                        Katsushi Kobayashi
draft-ietf-avt-dv-video-03.txt         Communication Research Laboratory
                                                          Akimichi Ogawa
                                                         Keio University
                                                          Stephen Casner
                                                           Cisco Systems
                                                         Carsten Bormann
                                                 Universitaet Bremen TZI
                                                           June 26, 2000
                                                   Expires December 2000


                 RTP Payload Format for DV Format Video

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

1. Abstract

   This document specifies the packetization scheme for encapsulating
   the compressed digital video data streams commonly known as "DV"
   into a payload format for the Real-Time Transport Protocol (RTP).

   There are two kinds of DV, one for consumer use and the other for
   professional use. The original "DV" specification, designed for
   consumer-use digital VCRs, is approved as the IEC 61834 standard
   set. The specifications for professional DV are published as SMPTE
   306M (D-7) and 314M (D-9). Both are based on consumer DV.

   The RTP payload format specified in this document supports IEC 61834
   consumer DV and the professional SMPTE 306M and 314M (DV-Based)
   formats.

Kobayashi, et al.        Expires December 2000                  [Page 1]
Internet Draft                                             June 26, 2000

2. Introduction

   This document specifies payload formats for encapsulating both
   consumer- and professional-use DV format data streams into the
   Real-time Transport Protocol (RTP), version 2 [6].
   DV compression audio and video formats were designed for
   helical-scan magnetic tape media. The DV standards for
   consumer-market devices, the IEC 61883 and 61834 series, cover many
   aspects of consumer-use digital video, including mechanical
   specifications of a cassette, magnetic recording format, error
   correction on the magnetic tape, DCT video encoding format, and
   audio encoding format [1]. The digital interface part of IEC 61883
   defines an interface on an IEEE 1394 network [2,3]. This
   specification set supports several video formats: SD-VCR (Standard
   Definition), HD-VCR (High Definition), SDL-VCR (Standard Definition
   - Long), PALPlus, DVB (Digital Video Broadcast) and ATV (Advanced
   Television). North American formats are indicated with a number of
   lines and "/60", while European formats use "/50". DV standards
   extended for professional use were published by SMPTE as 306M and
   314M, adding a different sampling system, higher color resolution,
   and higher bit rates [4,5].

   IEC 61834 also includes magnetic tape recording for digital TV
   broadcasting systems (such as DVB and ATV) that use MPEG2 encoding.
   The payload format for encapsulating MPEG2 into RTP has already been
   defined in RFC 2250 [7] and others.

   Consequently, those MPEG2-based formats are outside the scope of
   this document, and the payload format specified here supports six
   video formats of the IEC standard: SD-VCR (525/60, 625/50), HD-VCR
   (1125/60, 1250/50) and SDL-VCR (525/60, 625/50), and six of the
   SMPTE standards: 306M (525/60, 625/50), 314M 25Mbps (525/60, 625/50)
   and 314M 50Mbps (525/60, 625/50). In the future it can be extended
   to other high-definition formats.

   Throughout this specification, we make extensive use of the
   terminology of IEC and SMPTE standards. The reader should consult
   the original references for definitions of these terms.

2.1 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [8].

3. DV format encoding

   The DV format uses only DCT compression within each frame, in
   contrast to the interframe compression of the MPEG video standards
   [9,10]. All video data, including audio and other system data, are
   managed within the picture frame unit of video.

   The DV encoding is composed of a three-level hierarchical structure.
   A picture frame is divided into rectangle- or
   clipped-rectangle-shaped DCT super blocks. DCT super blocks are
   divided into 27 rectangle- or square-shaped DCT macro blocks.

   Audio data is encoded in PCM format. The sampling frequency is
   32 kHz, 44.1 kHz or 48 kHz and the quantization is 12-bit
   non-linear, 16-bit linear or 20-bit linear. The number of channels
   may be up to 8. Only certain combinations of these parameters are
   allowed depending upon the video format; the restrictions are
   specified in each document.

   A frame of data in the DV format stream is divided into several "DIF
   sequences". A DIF sequence is composed of an integral number of
   80-byte DIF blocks. A DIF block is the primitive unit for all
   treatment of DV streams. Each DIF block contains a 3-byte ID header
   that specifies the type of the DIF block and its position in the DIF
   sequence. Five types of DIF blocks are defined: DIF sequence header,
   Subcode, Video Auxiliary information (VAUX), Audio and Video. Audio
   DIF blocks are composed of the 3-byte ID header, 5 bytes of Audio
   Auxiliary data (AAUX) and 72 bytes of audio data.

   Each RTP packet starts with the RTP header as defined in RFC 1889
   [6]. No additional payload-format-specific header is required for
   this payload format.

4.1 RTP header usage

   The RTP header fields that have a meaning specific to the DV format
   are described as follows:

   Payload type (PT): The payload type is dynamically assigned by means
      outside the scope of this document.
      If multiple DV encoding formats are to be used within one RTP
      session, then multiple dynamic payload types MUST be assigned,
      one for each DV encoding format. The sender MUST change to the
      corresponding payload type whenever the encoding format is
      changed.

   Timestamp: 32-bit 90 kHz timestamp representing the time at which
      the first data in the frame was sampled. All RTP packets within
      the same video frame MUST have the same timestamp. The timestamp
      SHOULD increment by a multiple of the nominal interval for one
      frame time, as given in the following table:

      Mode        Frame rate (Hz)    Increase of one frame
                                     in 90 kHz timestamp

      525-60          29.97                 3003
      625-50          25                    3600
      1125-60         30                    3000
      1250-50         25                    3600

      When the DV stream is obtained from an IEEE 1394 interface, the
      progress of video frame times MAY be monitored using the SYT
      timestamp carried in the CIP header, as described in Appendix A.

   Marker bit (M): The marker bit of the RTP fixed header is set to one
      on the last packet of a video frame, and must be zero otherwise.
      The M bit lets the receiver know that it has received the last
      packet of a frame, so it can display the image without waiting
      for the first packet of the next frame to arrive to detect the
      frame change. However, detection of a frame change MUST NOT rely
      on the marker bit since the last packet of the frame might be
      lost. Detection of a frame change MUST be done by differences in
      RTP timestamp.

4.2 DV data encapsulation into RTP payload

   Integral DIF blocks are placed into the RTP payload beginning
   immediately after the RTP header. Any number of DIF blocks may be
   packed into one RTP packet, except that all DIF blocks in one RTP
   packet must be from the same video frame. DIF blocks from the next
   video frame MUST NOT be packed into the same RTP packet even if more
   payload space remains.
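   The header and packing rules above can be illustrated with a small
   sender-side sketch. It is not part of the specification: the
   function name, the default of 17 DIF blocks per packet (chosen here
   only to stay under a typical 1500-byte MTU), and the sample values
   in the usage note are assumptions made for illustration. The header
   layout follows the RTP fixed header of RFC 1889 with no CSRC list,
   padding, or extension.

```python
import struct
from typing import List

DIF_BLOCK_SIZE = 80  # bytes per DIF block, as stated in the text

# 90 kHz timestamp increment per frame, copied from the table above.
FRAME_TICKS = {"525-60": 3003, "625-50": 3600, "1125-60": 3000, "1250-50": 3600}


def packetize_frame(frame: bytes, payload_type: int, seq: int,
                    timestamp: int, ssrc: int,
                    blocks_per_packet: int = 17) -> List[bytes]:
    """Split the DIF blocks of ONE video frame into RTP packets.

    blocks_per_packet is a hypothetical tuning knob, not a value taken
    from the draft. Every packet carries whole DIF blocks from this
    frame only and shares the same timestamp.
    """
    if not frame or len(frame) % DIF_BLOCK_SIZE != 0:
        raise ValueError("frame must be a non-empty multiple of 80 bytes")
    chunk = blocks_per_packet * DIF_BLOCK_SIZE
    packets = []
    for offset in range(0, len(frame), chunk):
        payload = frame[offset:offset + chunk]
        is_last = offset + chunk >= len(frame)
        byte0 = 2 << 6  # version 2, no padding, no extension, CC = 0
        # Marker bit is set only on the last packet of the frame.
        byte1 = (0x80 if is_last else 0x00) | (payload_type & 0x7F)
        header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                             timestamp & 0xFFFFFFFF, ssrc)
        packets.append(header + payload)
        seq += 1
    return packets
```

   A caller would advance the timestamp between frames, for example
   `timestamp = (timestamp + FRAME_TICKS["525-60"]) & 0xFFFFFFFF`, and
   continue the sequence number from the last packet sent. A receiver
   should still detect frame changes from the timestamp rather than
   from the marker bit.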
   This requirement stems from the fact that the transition from one
   video frame to the next is indicated by a change in the RTP
   timestamp. It also reduces the processing complexity on the
   receiver. Since the RTP payload contains an integral number of DIF
   blocks, the length of the RTP payload will be a multiple of 80
   bytes.

   Audio and video data may be transmitted as one bundled RTP stream or
   in separate RTP streams (unbundled). The choice MUST be indicated as
   part of the assignment of the dynamic payload type and MUST remain
   unchanged for the duration of the RTP session to avoid complicated
   procedures of sequence number synchronization.

   The RTP sender MAY include DIF sequence header and subcode DIF
   blocks in the streams. When DIF sequence header and subcode DIF
   blocks are sent, both MUST be included in the video stream.

   DV streams include "source" and "source control" packs that carry
   information indispensable for proper decoding, such as aspect ratio,
   position of picture, quantization of audio sampling, the number of
   audio channels, audio channel assignment, and language of audio.
   However, describing all of these attributes with SDP would require
   large SDP descriptions to enumerate all combinations. Therefore, the
   SDP section later in this document does not define an entry for each
   of these parameters. Instead, the RTP sender MUST transmit at least
   the VAUX DIF blocks and/or AAUX information containing the "source"
   and "source control" packs, filled with the information
   indispensable for decoding.

   In the case of one bundled stream, DIF blocks for both audio and
   video are packed into RTP packets in the same order as they were
   encoded.

   In the case of an unbundled stream, only the header, subcode, video
   and VAUX DIF blocks are sent within the video stream. Audio is sent
   in a different stream if desired, using a different RTP payload
   type.
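   The unbundled case described above can be sketched as a filter over
   the DIF blocks of a frame. This is an illustration only. The
   section-type codes and their position in the top three bits of the
   first ID byte are an ASSUMPTION taken from common descriptions of
   the IEC 61834 and SMPTE 314M DIF block ID; this document does not
   define the ID layout, so the values should be checked against those
   standards. The names `DifType`, `dif_type` and `split_unbundled` are
   hypothetical.

```python
from enum import IntEnum
from typing import Iterable, List, Tuple

DIF_BLOCK_SIZE = 80  # bytes per DIF block, as stated in the text


class DifType(IntEnum):
    # ASSUMPTION: section-type codes carried in the top 3 bits of the
    # first ID byte. Not defined in this document; verify against the
    # IEC/SMPTE specifications before relying on them.
    HEADER = 0
    SUBCODE = 1
    VAUX = 2
    AUDIO = 3
    VIDEO = 4


def dif_type(block: bytes) -> DifType:
    """Classify one 80-byte DIF block by its (assumed) section type."""
    if len(block) != DIF_BLOCK_SIZE:
        raise ValueError("a DIF block is exactly 80 bytes")
    return DifType(block[0] >> 5)  # raises ValueError on unknown codes


def split_unbundled(blocks: Iterable[bytes]) -> Tuple[List[bytes], List[bytes]]:
    """Return (video_stream_blocks, audio_blocks), preserving order.

    Header, subcode, VAUX and video blocks stay in the video stream;
    audio blocks are set aside for a separate stream.
    """
    video_stream: List[bytes] = []
    audio_stream: List[bytes] = []
    for block in blocks:
        if dif_type(block) is DifType.AUDIO:
            audio_stream.append(block)
        else:
            video_stream.append(block)
    return video_stream, audio_stream
```

   The audio blocks are returned separately so that the caller can send
   them under a different payload type or drop them. For a bundled
   stream no filtering is needed, since all blocks are sent in their
   encoded order.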
   It is also possible to send audio duplicated in a separate stream,
   in addition to bundling it in with the video stream.

   When using unbundled mode, it is RECOMMENDED that the audio stream
   data be extracted from the DIF blocks and repackaged into the
   corresponding RTP payload format for the audio encoding (DAT12, L16,
   L20) [11,12] in order to maximize interoperability with
   non-DV-capable receivers while maintaining the original source
   quality.

   In the case of unbundled transmission where both audio and video are
   sent in the DV format, the same timestamp SHOULD be used for both
   audio and video data within the same frame to simplify the lip
   synchronization effort on the receiver. Lip synchronization may also
   be achieved using reference timestamps passed in RTCP as described
   in RFC 1889 [6].

   The sender MAY reduce the video frame rate by discarding the video
   data and VAUX DIF blocks for some of the video frames. The RTP
   timestamp must still be incremented to account for the discarded
   frames. The sender MAY alternatively reduce bandwidth by discarding
   video data DIF blocks for portions of the image which are unchanged
   from the previous image. To enable this bandwidth reduction,
   receivers SHOULD implement an error concealment strategy to
   accommodate lost or missing DIF blocks, e.g. repeating the
   corresponding DIF block from the previous image.

5. SDP Signaling for RTP/DV

   When using SDP (Session Description Protocol) for negotiation of the
   RTP payload information, the format described in this document
   SHOULD be used. The SDP description will be slightly different for a
   bundled stream and an unbundled stream.

   When a DV stream is sent to port 31394 with RTP payload type
   identifier 111, the m= line will look like:

      m=video 31394 RTP/AVP 111

   The a=rtpmap attribute will look like:

      a=rtpmap:111 DV/90000

   "DV" is the encoding name for the DV video payload format defined in
   this document. 90000 is the clock rate.
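   As an illustration of the two SDP lines above, the following sketch
   builds them from a port and a payload type. The function name is
   hypothetical, and the range check assumes the dynamic payload type
   range 96-127 of the RTP audio/video profile, which this document
   does not restate.

```python
from typing import List


def dv_sdp_media_lines(port: int, payload_type: int) -> List[str]:
    """Build the m= and a=rtpmap lines for a DV video stream."""
    # ASSUMPTION: dynamic payload types lie in 96-127 (RTP A/V profile).
    if not 96 <= payload_type <= 127:
        raise ValueError("expected a dynamic payload type in 96-127")
    if not 0 < port < 65536:
        raise ValueError("port out of range")
    return [
        f"m=video {port} RTP/AVP {payload_type}",
        # "DV" is the encoding name; 90000 is the RTP clock rate.
        f"a=rtpmap:{payload_type} DV/90000",
    ]
```

   Calling `dv_sdp_media_lines(31394, 111)` reproduces the two example
   lines shown above.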
   The payload format defined in this document uses a 90 kHz clock.

   In SDP, format-specific parameters are defined as a=fmtp, as below:

      a=fmtp:

   In the DV video payload format, the a=fmtp line is used to show the
   encoding type within the DV video, as below:

      a=fmtp: encode:

   The parameter specifies which type of DV format is used. The DV
   format name will be one of the following:

      o SD-VCR/525-60
      o SD-VCR/625-50
      o HD-VCR/1125-60
      o HD-VCR/1250-50
      o SDL-VCR/525-60
      o SDL-VCR/625-50
      o 306M/525-60
      o 306M/625-50
      o 314M-25/525-60
      o 314M-25/625-50
      o 314M-50/525-60
      o 314M-50/625-50

   To show whether the audio data is bundled into the DV stream, a
   format-specific parameter is defined as below:

      a=fmtp: audio: