INTERNET-DRAFT                                        Katsushi Kobayashi
draft-ietf-avt-dv-video-00.txt         Communication Research Laboratory
                                                          Akimichi Ogawa
                                                         Keio University
                                                          Stephen Casner
                                                           Cisco Systems
                                                         Carsten Bormann
                                                 Universitaet Bremen TZI
                                                           June 25, 1999
                                                   Expires December 1999

                 RTP Payload Format for DV Format Video

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

1. Abstract

   This document specifies the packetization scheme for encapsulating
   the digital video data streams defined by the HD Digital VCR
   Conference, commonly known as "DV", into a payload format for the
   Real-Time Transport Protocol (RTP).  The RTP payload format specified
   in this document supports three quality levels of digital video
   identified as SD-VCR, HD-VCR and SDL-VCR.

2. Introduction

   The HD Digital VCR Conference has published a digital video
   specification set entitled "Specification of Consumer-Use Digital
   VCRs using 6.3mm magnetic tape" [1,2].  The specification set

Kobayashi, et al          Expires December 1999                 [Page 1]

Internet Draft                                             June 25, 1999

   consists of two subset specifications, the first of which is
   "Specification of Consumer-Use Digital VCRs".  That subset comprises
   the whole specification for consumer-use digital video including
   mechanical specifications of a cassette, helical magnetic recording
   format, error correction in the magnetic tape, DCT video encoding
   format, and audio encoding format.  The digital video format defined
   by that specification is commonly known as "DV" format.

   The second subset is "Specification of Digital Interface for Consumer
   Electronic Audio/Video Equipment" (abbreviated hereafter as the
   Digital Interface).  That subset defines the communication protocol
   for carrying DV video and audio over the IEEE 1394 high performance
   serial bus [3].  The IEEE 1394 bus may be used to interconnect
   digital video cameras, digital VCRs, computers and other devices.

   This document specifies the RTP payload format for encapsulating the
   DV format data streams obtained via the Digital Interface into the
   Real-time Transport Protocol (RTP), version 2 [4].

   The HD Digital VCR Conference specification set supports several
   video formats: SD-VCR (including 525/60, 625/50), HD-VCR (1125/60,
   1250/50), SDL-VCR (525/60, 625/50), PALplus, DVB (Digital Video
   Broadcast) and ATV (Advanced Television).  However, the Digital
   Interface specifies the IEEE 1394 communication protocol for only a
   subset of these video formats.  The RTP payload format defined here
   covers only those video formats that are included in the Digital
   Interface.

   Furthermore, some formats defined by the HD Digital VCR Conference,
   e.g. DVB and ATV, are based on MPEG2.  The payload format for
   encapsulating MPEG2 into RTP has already been defined in RFC 2250.
   That payload format is more suitable for transmission of MPEG2 over
   the Internet than would be a packetization of MPEG2 first into the
   IEEE 1394 protocol and then into RTP.  Therefore, packetization of DV
   formats based on MPEG2 is outside the scope of this document.

   Consequently, the payload format specified in this document will
   support the original six video formats of the HD Digital VCR
   Conference: SD-VCR (525/60, 625/50), HD-VCR (1125/60, 1250/50) and
   SDL-VCR (525/60, 625/50).

   The HD Digital VCR Conference is also standardizing an audio and
   video device control protocol, that is, a command set for video
   equipment operation and status queries to video devices.  This
   document does not address these control functions.

   Throughout this specification, we make extensive use of the VCR
   Conference terminology.  The reader should consult the Digital
   Interface references for definitions of these terms.

3. DV format encoding

   The DV format is designed for magnetic tape applications and is
   optimized for helical magnetic recording on tape media.  All video
   data, including audio and other system data, are managed within the
   picture frame unit of video.

   The video encoding consists of a three-level hierarchical structure.
   A picture frame is divided into rectangle- or clipped-rectangle-
   shaped DCT super blocks.  DCT super blocks are divided into 27
   rectangle- or square-shaped DCT macro blocks.  The DCT macro block
   consists of 6 square 8x8 DCT blocks, four of which represent the Y
   picture component and the remaining two represent Cr and Cb.

   Audio is encoded as sampled data.  The sampling frequency is 32 kHz,
   44.1 kHz or 48 kHz, the quantization is 16-bit linear or 12-bit
   non-linear, and the number of channels may range from 2 to 8.  Only
   certain combinations of these parameters are allowed depending upon
   the video format, as specified in [1].

   A frame of data in the DV format stream is divided into several "DIF
   sequences".  A DIF sequence is composed of an integral number of
   fixed length (80-byte) DIF blocks.  Each DIF block contains a 3-byte
   ID header that specifies the type of the DIF block and its position
   in the DIF sequence.  Five types of DIF blocks are defined:  DIF
   sequence header, Subcode, Video Auxiliary information (VAUX), Audio
   data and Video data.
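
   As an illustration of the DIF block structure described above, the
   following Python sketch splits a buffer into 80-byte DIF blocks and
   classifies each block.  The bit layout of the 3-byte ID header is not
   given in this document; the sketch ASSUMES that the block type is
   carried in the top three bits of the first ID byte, with the values
   shown in SECTION_TYPES.  That mapping and the function names are
   illustrative and should be checked against [1] and [2].

```python
DIF_BLOCK_SIZE = 80  # fixed DIF block length stated in the text

# ASSUMPTION: block type is the top three bits of ID byte 0, with these
# values.  This layout is not specified in this document; verify it
# against the DV specifications before relying on it.
SECTION_TYPES = {0: "header", 1: "subcode", 2: "vaux", 3: "audio", 4: "video"}


def split_dif_blocks(data: bytes) -> list:
    """Split a buffer into 80-byte DIF blocks, rejecting partial blocks."""
    if len(data) % DIF_BLOCK_SIZE != 0:
        raise ValueError("buffer is not a whole number of DIF blocks")
    return [data[i:i + DIF_BLOCK_SIZE]
            for i in range(0, len(data), DIF_BLOCK_SIZE)]


def section_type(block: bytes) -> str:
    """Classify one DIF block from the (assumed) type bits of its ID header."""
    if len(block) != DIF_BLOCK_SIZE:
        raise ValueError("a DIF block must be exactly 80 bytes")
    return SECTION_TYPES.get(block[0] >> 5, "unknown")
```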

3.1 Transmission of DV format over IEEE 1394

   The specification of the Digital Interface defines a transport
   protocol for transmission of video stream data in the isochronous
   stream mode of IEEE 1394 called "real time data transmission
   protocol".  The protocol defines the general Common Isochronous
   Packet (CIP) header that does not depend on the encoding format of
   the payload.  Several real time transmission encodings have been
   defined on CIP, including MPEG2 and MIDI in addition to DV format
   [1,2].

   A DIF block is the basic unit for all transmission over IEEE 1394.
   Each IEEE 1394 isochronous stream packet is composed of an integral
   number of DIF blocks, assembled without regard to DIF sequence
   boundaries, up to the limit of the MTU for IEEE 1394.

4. Usage of RTP

   Each RTP packet starts with the RTP header as defined in RFC 1889
   [4].  No additional payload-format-specific header is required for
   this payload format.

4.1 RTP header usage

   The meaning of RTP header fields that are specific to the DV format
   is described in the following:

   Payload type (PT): The payload type is dynamically assigned by means
   outside the scope of this document.  Details of the encoding format,
   such as audio sampling rate and video scan rate, are given in the
   AAUX and VAUX data embedded in the data stream.  However, the same
   information SHOULD be provided as part of the dynamic payload type
   assignment.  If multiple encoding formats are to be used within one
   RTP session, then multiple dynamic payload types MUST be assigned,
   one for each encoding format.  The sender MUST change to the
   corresponding payload type whenever the encoding format is changed.
   The sender MUST NOT expect to notify the receiver of an encoding
   format change with the information included in AAUX or VAUX because
   the packet carrying this information might be dropped and would not
   be available to the receiver until the next AAUX or VAUX packet is
   received.

   Timestamp: 32-bit 90 kHz timestamp representing the time at which the
   first data in the frame was sampled.  All RTP packets within the same
   video frame MUST have the same timestamp.  The timestamp SHOULD
   increment by a multiple of the nominal interval for one frame time,
   as given in the following table:

      Mode        Frame rate (Hz)     Timestamp increment per frame
                                      (90 kHz clock)

      525-60         29.97                   3003
      625-50         25                      3600
      1125-60        30                      3000
      1250-50        25                      3600
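
   The increments in the table follow from dividing the 90 kHz clock
   rate by the frame rate.  The sketch below reproduces that arithmetic
   in Python.  It ASSUMES that the nominal 29.97 Hz rate is exactly
   30000/1001 Hz; the function and constant names are illustrative.

```python
from fractions import Fraction

RTP_CLOCK_HZ = 90_000

# Frame rates per mode.  ASSUMPTION: 29.97 Hz is treated as 30000/1001 Hz.
FRAME_RATES = {
    "525-60": Fraction(30000, 1001),
    "625-50": Fraction(25),
    "1125-60": Fraction(30),
    "1250-50": Fraction(25),
}


def timestamp_increment(mode: str) -> int:
    """Return the number of 90 kHz ticks in one frame time for a mode."""
    ticks = RTP_CLOCK_HZ / FRAME_RATES[mode]
    if ticks.denominator != 1:
        raise ValueError("frame time is not a whole number of ticks")
    return int(ticks)


def next_timestamp(current: int, mode: str, frames: int = 1) -> int:
    """Advance a 32-bit RTP timestamp by a whole number of frame times."""
    return (current + frames * timestamp_increment(mode)) % 2**32
```

   Passing frames greater than one models a sender that skips frames:
   the timestamp still advances by a multiple of the frame interval.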

   The progress of video frame times MAY be monitored using the SYT
   timestamp carried in the CIP header, as described in Appendix A.

   Marker bit (M): The marker bit of the RTP fixed header is set to one
   on the last packet of a video frame, and must be zero otherwise.
   The M bit allows the receiver to know that it has received the last
   packet of a frame so it can display the image without waiting for the
   first packet of the next frame to arrive to detect the frame change.
   However, detection of a frame change MUST NOT rely on the marker bit
   since the last packet of the frame might be lost.  Detection of a
   frame change MUST be done by differences in RTP timestamp.
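
   A receiver following these rules might be sketched as below.  The
   marker bit is used only as a hint to release a frame early; a change
   in RTP timestamp is what actually closes a frame.  The class name is
   illustrative, and the sketch ASSUMES that packets are delivered in
   sequence-number order (reordering and loss detection are omitted).

```python
class FrameAssembler:
    """Group RTP payloads into frames by timestamp, using M as a hint."""

    def __init__(self):
        self.timestamp = None   # timestamp of the frame being collected
        self.payloads = []      # payloads received so far for that frame

    def push(self, timestamp: int, marker: bool, payload: bytes) -> list:
        """Accept one packet and return the list of frames it completes."""
        done = []
        # A timestamp change always ends the previous frame, even when
        # the packet carrying its marker bit was lost.
        if (self.timestamp is not None and timestamp != self.timestamp
                and self.payloads):
            done.append(b"".join(self.payloads))
            self.payloads = []
        self.timestamp = timestamp
        self.payloads.append(payload)
        # The marker bit lets the frame be released without waiting for
        # the first packet of the next frame.
        if marker:
            done.append(b"".join(self.payloads))
            self.payloads = []
        return done
```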

4.2 DV data encapsulation into RTP payload

   All of the information in the IEEE 1394 CIP header is either implicit
   in the RTP payload format or supplanted by information in the RTP
   header, so the CIP header is not required.  For this payload format,
   the CIP header MUST be removed from the IEEE 1394 packet, leaving
   just a sequence of DIF blocks.  Integral DIF blocks are placed into
   the RTP payload beginning immediately after the RTP header.  DIF
   blocks carried by different IEEE 1394 packets may be packed into one
   RTP packet, except that all DIF blocks in one RTP packet must be
   from the same video frame.  DIF blocks from the next video frame
   MUST NOT be packed into the same RTP packet even if there is more
   payload space remaining.  This requirement stems from the fact that
   the transition from one video frame to the next is indicated by a
   change in the RTP timestamp.  It also reduces the processing
   complexity at the receiver.

   Since the RTP payload contains an integral number of DIF blocks, the
   length of the RTP payload will be a multiple of 80 bytes.
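
   The packing rules above can be illustrated with a short Python
   sketch.  It takes the DIF blocks of ONE video frame and groups them
   into payloads that are each a multiple of 80 bytes, never mixing
   frames, with the marker flag set on the last packet.  The function
   name, the tuple layout, and the 1400-byte payload limit are
   assumptions chosen for illustration only.

```python
DIF_BLOCK_SIZE = 80


def packetize_frame(dif_blocks, timestamp, max_payload=1400):
    """Return (timestamp, marker, payload) tuples for one video frame.

    max_payload is a hypothetical per-packet payload budget; it is
    rounded down to a whole number of DIF blocks.
    """
    per_packet = max_payload // DIF_BLOCK_SIZE
    if per_packet < 1:
        raise ValueError("max_payload is smaller than one DIF block")
    if any(len(block) != DIF_BLOCK_SIZE for block in dif_blocks):
        raise ValueError("every DIF block must be exactly 80 bytes")
    packets = []
    for start in range(0, len(dif_blocks), per_packet):
        chunk = dif_blocks[start:start + per_packet]
        is_last = start + per_packet >= len(dif_blocks)
        # All packets of the frame share one timestamp; only the last
        # packet carries the marker.
        packets.append((timestamp, is_last, b"".join(chunk)))
    return packets
```

   Because each call handles a single frame, blocks from the next frame
   can never share a packet with blocks from the current one.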

   Audio and video data may be transmitted as one bundled RTP stream or
   in separate RTP streams.  The choice MUST be indicated as part of the
   assignment of the dynamic payload type and MUST remain unchanged for
   the duration of the RTP session to avoid complicated procedures of
   sequence number synchronization.

   In the case of one bundled stream, DIF blocks for both audio and
   video are packed into RTP packets in the same order as they were
   generated.

   When audio and video are sent in separate RTP streams, or when only
   one medium is sent, then only the DIF blocks corresponding to the
   selected medium are included.  If VAUX DIF blocks are included, they
   MUST only be sent in the video stream.

   When sending a separate audio stream in the 16-bit encoding, it is
   RECOMMENDED that the audio stream data be extracted from the DIF
   blocks and repackaged in the L16 payload format defined in RFC 1890
   [5] in order to maximize interoperability with non-DV-capable
   receivers.

   When sending separate video and audio streams with both in DV format,
   the same timestamp SHOULD be used for both audio and video data
   within the same frame in order to simplify lip synchronization at the
   receiver.  Lip synchronization may also be achieved using reference
   timestamps passed in RTCP as described in [4].

   The sender MAY send null AAUX information and omit VAUX DIF blocks if
   the VAUX/AAUX information remains constant during the session.
   However, the VAUX/AAUX information in the DV stream includes source
   encoding parameters, such as video display aspect ratio, audio
   quantization and number of audio channels, which are required to
   decode the stream.  Therefore, if VAUX/AAUX information is not
   transmitted in the stream, the equivalent parameters essential to
   playout MUST be provided by some out of band means beyond the scope
   of this document.

   The receiver MUST be able to process a data stream with null AAUX
   information and null or omitted VAUX DIF blocks if the equivalent
   parameters are provided out of band.  Therefore, if the RTP receiver
   is feeding the DV stream to a device that requires AAUX information
   and VAUX DIF blocks, the receiver MUST be able to generate AAUX
   within audio DIF blocks and VAUX DIF blocks for the device using the
   parameters provided out of band.

   The sender MAY reduce the video frame rate by discarding the video
   data and VAUX DIF blocks for some of the video frames.  The RTP
   timestamp must still be incremented to account for the discarded
   frames.  The sender MAY alternatively reduce bandwidth by discarding
   video data DIF blocks for portions of the image which are unchanged
   from the previous image.  To enable this bandwidth reduction,
   receivers SHOULD implement an error concealment strategy to
   accommodate lost or missing DIF blocks by repeating the corresponding
   DIF block from the previous image.
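
   The concealment strategy described above might be sketched as
   follows.  Each frame is modeled as a mapping from a block position to
   the 80-byte DIF block at that position.  The position key used here,
   a (DIF sequence number, block number) pair, is a hypothetical choice
   for illustration; a real receiver would derive it from the DIF block
   ID header.

```python
def conceal_frame(previous_frame: dict, received_blocks: dict) -> dict:
    """Build the current frame, repeating blocks that did not arrive.

    Both arguments map a position key to a DIF block.  Positions absent
    from received_blocks (lost, or deliberately not sent because the
    image was unchanged) keep the block from the previous frame.
    """
    current = dict(previous_frame)    # start from the previous image
    current.update(received_blocks)   # overwrite with what arrived
    return current
```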

5. SDP Signaling for RTP/DV

   When SDP (Session Description Protocol) is used to negotiate the RTP
   payload information, the format described in this section SHOULD be
   used.  The SDP description differs slightly between a bundled stream
   and an unbundled stream.

5.1 SDP description for unbundled stream

   When an unbundled stream is used, the RTP streams for video and audio
   are sent separately, each to a different port or a different
   multicast group.  In this case, the SDP description carries several
   m= lines, one for each media type of the stream (see RFC 2327 [7]).
   For example, when video is sent to port 31394 with RTP payload type
   111, the m= line will look like:

        m=video 31394 RTP/AVP 111

   The a=rtpmap attribute will look like:

        a=rtpmap:111 DV/90000

   "DV" is the encoding name for the DV video payload format defined in
   this document.  The value 90000 is the clock rate: the payload format
   defined in this document uses a 90 kHz clock.

   In SDP, format-specific parameters are carried in the a=fmtp
   attribute, as shown below.

          a=fmtp:<format> <format specific parameters>

   In the DV video payload format, the a=fmtp line is used to indicate
   the encoding type of the DV video, as shown below.

          a=fmtp:DV <payload type> v-encode:<DV-video encoding>

   The <DV-video encoding> parameter describes which type of DV format
   is used.  It takes one of the following values:

         o  SD-VCR/525-60
         o  SD-VCR/625-50
         o  HD-VCR/1125-60
         o  HD-VCR/1250-50
         o  SDL-VCR/525-60
         o  SDL-VCR/625-50

   An example of SDP description using these attributes is:

      v=0
      o=mhandley 2890844526 2890842807 IN IP4 126.16.64.4
      s=SDP Seminar
      i=A Seminar on the session description protocol
      u=http://www.cs.ucl.ac.uk/staff/M.Handley/sdp.03.ps
      e=mjh@isi.edu (Mark Handley)
      c=IN IP4 224.2.17.12/127
      t=2873397496 2873404696
      m=audio 49170 RTP/AVP 112
      m=video 50000 RTP/AVP 113
      a=rtpmap:112 L16/32000/2
      a=rtpmap:113 DV/90000
      a=fmtp:DV 113 v-encode:SD-VCR/525-60

   This describes a session in which the audio and video streams are
   sent separately.  The session is sent to the multicast group
   224.2.17.12.  The audio is sent using the L16 format, and the video
   is sent using the 525/60 format, which corresponds to NTSC in SD-VCR.
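
   As an illustration, a receiver could extract the video encoding from
   the a=fmtp line with a small parser such as the Python sketch below.
   It accepts only the v-encode syntax and the six encoding names listed
   in this section; the function name and the choice to return None for
   unrelated lines are assumptions made for the example.

```python
import re

# The six <DV-video encoding> values listed in this section.
V_ENCODINGS = {
    "SD-VCR/525-60", "SD-VCR/625-50",
    "HD-VCR/1125-60", "HD-VCR/1250-50",
    "SDL-VCR/525-60", "SDL-VCR/625-50",
}

_V_ENCODE = re.compile(r"^a=fmtp:DV\s+(\d+)\s+v-encode:(\S+)$")


def parse_v_encode(line: str):
    """Return (payload_type, encoding) or None if not a v-encode line."""
    match = _V_ENCODE.match(line.strip())
    if match is None:
        return None
    payload_type, encoding = int(match.group(1)), match.group(2)
    if encoding not in V_ENCODINGS:
        raise ValueError("unknown DV video encoding: " + encoding)
    return payload_type, encoding
```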

5.2 SDP description for bundled stream

   When sending a bundled stream, video and audio DIF blocks are sent
   in a single RTP stream.  The encoding information MAY be included
   in-band in the RTP stream as AAUX/VAUX DIF blocks.  However, in
   order for the receiver to know what type of encoding will be used in
   the session, the encoding MUST also be announced out of band.  The
   parameters describing the video encoding format are specified in
   Section 5.1 and apply to both the bundled and unbundled cases.  This
   section specifies the SDP description of the audio format in a
   bundled stream.

   As in the unbundled case, audio parameters are described with a
   format-specific parameter attribute.  The audio encoding format
   information is specified for each channel in the following format:

      a=fmtp:DV <payload type> a-encode:\
          <channel id>{<channel id>...} <quantization/\
           sampling rate/frame lock/stereo mode/sub channel number/\
           channel pair/audiomode/language{/emphasis/time constant}>

   Here, <channel id> specifies which channel or channels are described
   by this line.  The following channel identifiers are available:

       o 1-4
       o a-h

   The sub block <quantization> represents the type of quantization.
   Three types of quantization are available:

       o 16L
       o 12NL
       o 20L

   The sub block <sampling rate> describes the sampling rate using one
   of the following values:

       o 48000
       o 44100
       o 32000

   <frame lock> corresponds to the LF bit in the AAUX source DIF block.
   The LF bit indicates whether the audio block data is locked or
   unlocked to the video frame in the DV stream.

      o locked
      o unlocked

   <stereo mode> corresponds to the SM bit in the AAUX source DIF
   block.  The SM bit indicates whether the audio data is generated
   from multiple audio sources (multi-stereo) or from only one source
   (lumped).

      o multi
      o lumped

   <sub channel number> corresponds to the CHN bit.  The CHN bit
   indicates how many sub channels are included in the audio channel.
   The following numbers are available:

       o 1
       o 2

   <channel pair> corresponds to the PA bit.  The PA bit indicates
   whether the audio channel is related to another channel or not.  The
   following values are available:

       o pair
       o independent

   The <audiomode> sub block corresponds to the AUDIOMODE field of the
   AAUX source DIF block.  AUDIOMODE describes the type of audio mode
   on the channel.  In DV, a 4-bit field is reserved for it, and each
   mode is represented by a number from 0 to 15.

       o 0-15

   The <language> sub block corresponds to the AUDIO LANGUAGE field of
   the AAUX closed caption DIF block.  A 3-bit field is reserved for it,
   and each value is represented by a number from 0 to 7.

       o 0-7

   The optional sub field <emphasis> corresponds to the EF bit of the
   AAUX source DIF block.  The EF bit indicates whether the audio data
   is emphasized or not.  If the audio data is emphasized on the sender
   side, the following parameter is specified.  If not, this field and
   the <time constant> sub field do not appear.

       o emphasis

   The optional sub field <time constant> corresponds to the TC bit.
   The TC bit represents the time constant value in the case of
   emphasized audio.  Only the following value, in microseconds, is
   available:

        o 50-15

   An example of SDP description using these attributes is:

      v=0
      o=mhandley 2890844526 2890842807 IN IP4 126.16.64.4
      s=SDP Seminar
      i=A Seminar on the session description protocol
      u=http://www.cs.ucl.ac.uk/staff/M.Handley/sdp.03.ps
      e=mjh@isi.edu (Mark Handley)
      c=IN IP4 224.2.17.12/127
      t=2873397496 2873404696
      m=audio-video 49170 RTP/AVP 112
      a=rtpmap:112 DV/90000
      a=fmtp:DV 112 v-encode:SD-VCR/525-60
      a=fmtp:DV 112 a-encode:abcd 12NL/32000/locked/lumped/2/independent/9/0

   The SDP record above describes a session in which the audio and
   video streams are sent bundled.  The session is sent to the
   multicast group 224.2.17.12.  The video is sent using the 525/60
   format.  The audio data format is 12-bit non-linear, sampled at 32
   kHz, frame locked, lumped, with two sub channels in each channel,
   channel independent, audio mode 9, unknown language, and no
   emphasis.

   If the audio data is 16-bit linear, sampled at 48 kHz, with the L
   channel assigned to CH1 and the R channel to CH2, the audio-related
   lines can be rewritten as:

      a=fmtp:DV 112 a-encode:1 16L/48000/locked/multi/1/pair/0/1
      a=fmtp:DV 112 a-encode:2 16L/48000/locked/multi/1/pair/1/1
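
   The a-encode syntax above could be parsed along the lines of the
   following Python sketch.  It ASSUMES that each attribute is on a
   single line (no backslash continuation), that <channel id> is a run
   of single-character identifiers as in "abcd", and that the optional
   emphasis and time constant fields appear together.  The class and
   function names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional

VALID_CHANNEL_IDS = set("1234abcdefgh")  # identifiers 1-4 and a-h


@dataclass
class AudioEncoding:
    payload_type: int
    channels: List[str]
    quantization: str
    sampling_rate: int
    frame_lock: str
    stereo_mode: str
    sub_channels: int
    channel_pair: str
    audio_mode: int
    language: int
    emphasis: bool = False
    time_constant: Optional[str] = None


def parse_a_encode(line: str) -> AudioEncoding:
    """Parse one 'a=fmtp:DV <pt> a-encode:<ids> <params>' line."""
    prefix = "a=fmtp:DV "
    text = line.strip()
    if not text.startswith(prefix):
        raise ValueError("not a DV fmtp line")
    parts = text[len(prefix):].split()
    if len(parts) != 3 or not parts[1].startswith("a-encode:"):
        raise ValueError("not an a-encode line")
    channels = list(parts[1][len("a-encode:"):])
    if not channels or not set(channels) <= VALID_CHANNEL_IDS:
        raise ValueError("bad channel id list")
    fields = parts[2].split("/")
    if len(fields) not in (8, 10):
        raise ValueError("expected 8 or 10 slash-separated fields")
    has_emphasis = len(fields) == 10
    if has_emphasis and fields[8] != "emphasis":
        raise ValueError("ninth field must be 'emphasis' when present")
    return AudioEncoding(
        payload_type=int(parts[0]), channels=channels,
        quantization=fields[0], sampling_rate=int(fields[1]),
        frame_lock=fields[2], stereo_mode=fields[3],
        sub_channels=int(fields[4]), channel_pair=fields[5],
        audio_mode=int(fields[6]), language=int(fields[7]),
        emphasis=has_emphasis,
        time_constant=fields[9] if has_emphasis else None,
    )
```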

6. Security Considerations

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [4], and any appropriate RTP profile.  This implies
   that confidentiality of the media streams is achieved by encryption.
   Because the data compression used with this payload format is applied
   end-to-end, encryption may be performed after compression so there is
   no conflict between the two operations.

   A potential denial-of-service threat exists for data encodings using
   compression techniques that have non-uniform receiver-end
   computational load.  The attacker can inject pathological datagrams
   into the stream which are complex to decode and cause the receiver to
   be overloaded.  However, this encoding does not exhibit any
   significant non-uniformity.

   As with any IP-based protocol, in some circumstances a receiver may
   be overloaded simply by the receipt of too many packets, either
   desired or undesired.  Network-layer authentication may be used to
   discard packets from undesired sources, but the processing cost of
   the authentication itself may be too high.  In a multicast
   environment, pruning of specific sources may be implemented in future
   versions of IGMP [6] and in multicast routing protocols to allow a
   receiver to select which sources are allowed to reach it.

7. Full Copyright Statement

   Copyright (C) The Internet Society (1999). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.

   However, this document itself may not be modified in any way, such as
   by removing the copyright notice or references to the Internet Soci-
   ety or other Internet organizations, except as needed for the purpose
   of developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be fol-
   lowed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MER-
   CHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE."

8. Authors' Addresses

   Katsushi Kobayashi
   Communication Research Laboratory
   4-2-1 Nukii-kita machi, Koganei
   Tokyo 184-8795
   JAPAN
   EMail:  ikob@koganei.wide.ad.jp

   Akimichi Ogawa
   Keio University
   5322 Endo, Fujisawa
   Kanagawa 252
   JAPAN
   EMail:  akimichi@sfc.wide.ad.jp

   Stephen L. Casner
   Cisco Systems, Inc.
   170 West Tasman Drive
   San Jose, CA 95134-1706
   United States
   EMail: casner@cisco.com

   Carsten Bormann
   Universitaet Bremen FB3 TZI
   Postfach 330440
   D-28334 Bremen, GERMANY
   Phone: +49.421.218-7024
   Fax: +49.421.218-7000
   EMail: cabo@tzi.org

9. Bibliography

   [1] IEC 61834, Helical-scan digital video cassette recording system
       using 6,35 mm magnetic tape for consumer use (525-60, 625-50,
       1125-60 and 1250-50 systems)

   [2] IEC 61883, Consumer audio/video equipment - Digital interface

   [3] IEEE Std 1394-1995, Standard for a High Performance Serial Bus

   [4] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson,
       "RTP: A Transport Protocol for Real-Time Applications",
       RFC 1889, January 1996.

   [5] Schulzrinne, H., "RTP Profile for Audio and Video Conferences
       with Minimal Control", RFC 1890, January 1996.

   [6] Deering, S., "Host Extensions for IP Multicasting", STD 5,
       RFC 1112, August 1989.

   [7] Handley, M. and V. Jacobson, "SDP: Session Description
       Protocol", RFC 2327, April 1998.

Appendix A.

   In the Digital Interface specification, two types of 8-byte CIP
   headers are defined, one type including the SYT field, and the other
   without the SYT field.  The SYT field is a 16-bit timestamp copied
   from the lower 16 bits of the CYCLE_TIME register defined in IEEE
   1394.  The CYCLE_TIME register is incremented by a 24.576 MHz clock,
   but the lower 12 bits count to a maximum of 3071 before wrapping
   around to zero and adding a carry to the high 4 bits.  Therefore,
   the SYT timestamp is not linear.

   If the encoding format requires synchronization between devices, it
   should adopt the CIP header with SYT.  The DV format selects the CIP
   header type including the SYT field, but only requires that the SYT
   field contain a valid timestamp for one CIP header in every video frame
   period.  In the remaining CIP headers, the SYT field may contain the
   special "no information" value (all ones).
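
   Because the SYT value is not linear, a receiver that wants to
   compare two SYT timestamps first has to convert them to a linear
   tick count.  The Python sketch below shows one way to do this from
   the description above: the high 4 bits count cycles and the lower 12
   bits count 0 to 3071 ticks of the 24.576 MHz clock within a cycle.
   The function names are illustrative, and the sketch assumes the two
   values being compared are less than 16 cycles apart.

```python
SYT_NO_INFO = 0xFFFF      # "no information" value (all ones)
TICKS_PER_CYCLE = 3072    # lower 12 bits count 0..3071
SYT_CYCLES = 16           # range of the high 4 bits
SYT_PERIOD = SYT_CYCLES * TICKS_PER_CYCLE


def syt_to_ticks(syt: int):
    """Convert a 16-bit SYT value to linear 24.576 MHz ticks, or None."""
    if syt == SYT_NO_INFO:
        return None
    cycle = (syt >> 12) & 0xF
    offset = syt & 0xFFF
    if offset >= TICKS_PER_CYCLE:
        raise ValueError("cycle offset out of range")
    return cycle * TICKS_PER_CYCLE + offset


def syt_delta_ticks(earlier: int, later: int) -> int:
    """Ticks elapsed from one valid SYT value to a later one (mod period)."""
    a, b = syt_to_ticks(earlier), syt_to_ticks(later)
    if a is None or b is None:
        raise ValueError("SYT carries no timestamp")
    return (b - a) % SYT_PERIOD
```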
