Internet Engineering Task Force                Audio-Video Transport WG
INTERNET-DRAFT                                                   C. Zhu
                                                            Intel Corp.
                                                      November 25, 1996
                                                  Expires: May 25, 1997

               RTP Payload Format for H.263 Video Stream

Status of This Memo

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as ``work in progress.''

To learn the current status of any Internet-Draft, please check the ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

Distribution of this document is unlimited.

Abstract

This document specifies the payload format for encapsulating H.263 bitstreams in the Real-Time Transport Protocol (RTP). Three modes are defined for the RTP H.263 payload header. An RTP packet can use any one of the three modes for H.263 video streams, depending on the desired performance characteristics and the H.263 encoding options employed. The shortest header mode (Mode A) supports fragmentation at Group of Blocks (GOB) boundaries. The long header modes (Mode B and Mode C) support fragmentation at Macroblock (MB) boundaries.

Zhu                                                            [Page 1]
Internet Draft           RTP Payload for H.263        November 25, 1996

1. Introduction

This document describes a scheme to packetize an H.263 video stream for transport using RTP [1]. The H.263 video stream is defined by ITU-T Recommendation H.263 [4] (referred to as H.263 in this document) for video coding at very low bit rates.
RTP is defined by the Internet Engineering Task Force (IETF) to provide end-to-end network transport functions suitable for applications transmitting real-time data over multicast or unicast network services. The complete specification of RTP for a particular application requires a profile specification document [3], a payload format specification, and the RTP protocol document [1]. This document is intended to serve as the payload format specification for H.263 video streams.

2. Definitions

For the purpose of this document, the following definitions apply:

CIF: Common Intermediate Format. For H.263, a CIF picture has 352 x 288 pixels for luminance and 176 x 144 pixels for chrominance.

QCIF: Quarter CIF source format with 176 x 144 pixels for luminance and 88 x 72 pixels for chrominance.

sub-QCIF: picture source format with 128 x 96 pixels for luminance and 64 x 48 pixels for chrominance.

4CIF: picture source format with 704 x 576 pixels for luminance and 352 x 288 pixels for chrominance.

16CIF: picture source format with 1408 x 1152 pixels for luminance and 704 x 576 pixels for chrominance.

GOB: for H.263, a Group of Blocks (GOB) comprises k*16 lines, depending on the picture format (k=1 for sub-QCIF, QCIF and CIF; k=2 for 4CIF; k=4 for 16CIF).

MB: a macroblock (MB) relates to four blocks of luminance and the two spatially corresponding blocks of chrominance. Each block consists of 8x8 pixels.

3. Design Issues for Packetizing H.263 Bitstream

H.263 is based on ITU-T Recommendation H.261 [2] (referred to as H.261 in this document). Although it employs similar techniques to reduce both temporal and spatial redundancy, there are several major differences between the two algorithms that significantly affect the design of packetization schemes. This section summarizes those differences.
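
As a quick check of the source formats listed in Section 2, the following Python sketch derives the chrominance size and the macroblock grid from the luminance size of each format. It assumes only what the definitions state: chrominance has half the luminance resolution in each dimension, and a macroblock covers four 8x8 luminance blocks, that is, a 16x16 luminance area. The names `LUMA_SIZES`, `chroma_size` and `macroblock_grid` are illustrative and not part of this specification.

```python
# Luminance sizes (width, height) as listed in the Definitions section.
LUMA_SIZES = {
    "sub-QCIF": (128, 96),
    "QCIF": (176, 144),
    "CIF": (352, 288),
    "4CIF": (704, 576),
    "16CIF": (1408, 1152),
}

MB_SIZE = 16  # a macroblock spans 2 x 2 luminance blocks of 8x8 pixels


def chroma_size(fmt):
    """Chrominance is subsampled by two in each dimension."""
    width, height = LUMA_SIZES[fmt]
    return width // 2, height // 2


def macroblock_grid(fmt):
    """Return (macroblocks per row, macroblock rows) for a source format."""
    width, height = LUMA_SIZES[fmt]
    return width // MB_SIZE, height // MB_SIZE


if __name__ == "__main__":
    for name in LUMA_SIZES:
        cols, rows = macroblock_grid(name)
        print(name, chroma_size(name), cols, rows, cols * rows)
```

For QCIF this gives 11 macroblocks per row and 9 macroblock rows, which is the grid that MB-level fragmentation has to address.
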

3.1 Optional Features of H.263

In addition to the basic source coding algorithms, H.263 supports four negotiable features to improve performance: Advanced Prediction, PB frames, Syntax-based Arithmetic Coding, and Unrestricted Motion Vectors. They can be used in any combination.

Advanced Prediction (AP): four motion vectors instead of one can be used for some macroblocks in the frame. This feature makes recovery from packet loss harder, because more redundant information has to be preserved at the beginning of the packet when fragmenting at macroblock boundaries.

PB frames: two frames (a P frame and a B frame) are coded into one bitstream with macroblocks from the two frames interleaved. From a packetization point of view, an MB from the P frame and an MB from the B frame must be treated together, because each MB of the B frame is coded based on the corresponding MB of the P frame. A means must be provided to ensure that the two frames are rendered in the right order. Also, if part of this combined bitstream is lost, it will affect both frames, and possibly more.

Syntax-based Arithmetic Coding (SAC): Huffman codes are not the only choice for variable length coding. When the SAC option is on, the run-value pairs that result from quantization of the Discrete Cosine Transform (DCT) coefficients are coded differently from Huffman codes, but the macroblock hierarchy is preserved. Since context variables are only synchronized before fixed length codes in the bitstream, fragmenting at variable length codes makes decoding difficult in the presence of packet loss unless the values of all the context variables are carried in each packet header.

Unrestricted Motion Vectors: this feature also affects packetization, because it allows a larger range of motion vectors than normal.
To enable proper decoding of received packets without dependency on previous packets, the use of these optional features is signaled in the H.263 payload header, as described in section 5.

3.2 GOB Numbering

In H.263, each picture is divided into groups of blocks (GOBs). GOBs are numbered by a vertical scan of the picture, starting with the top GOB and ending with the bottom GOB. In contrast, a GOB in H.261 is composed of three rows of 16x16 MBs for QCIF, and three half rows of MBs for CIF. As in H.261, a GOB in H.263 is divided into macroblocks, and the definition of a macroblock is the same in both.

Each GOB in H.263 can have a fixed GOB header, but unlike in H.261 the use of the header is optional. If the GOB header is present, it may or may not start on a byte boundary. Byte alignment can be achieved by proper bit stuffing by the encoder, but it is not required.

In summary, for the same source format a GOB in H.263 is defined and coded with finer granularity, which gives more flexibility for packetization than H.261 does.

3.3 Motion Vector Encoding

Differential coding is used to code motion vectors as variable length codes. Unlike in H.261, where each motion vector is predicted from the previous MB in the GOB, H.263 employs a more flexible prediction scheme in which three candidate predictors are used instead of one. The prediction depends on whether the GOB header is present. If the GOB header is included for a GOB, motion vectors are coded with reference to MBs in the current GOB only. If the GOB header is not present for the current GOB, three motion vectors must be available to decode one macroblock, and two of them come from the previous GOB. To decode a whole inter-coded GOB, all the motion vectors of the previous GOB must be available. This can be a major problem for a packetization scheme like the one defined for H.261 when packetizing at MB boundaries.
Assume that a packet starts at an MB and that the GOB header is not coded. If the previous packet is lost, none of the motion vectors needed to predict the motion vectors of the MBs in this GOB are available. In order to decode the received MBs correctly, all the motion vectors of the previous GOB would have to be carried at the beginning of the packet. This would be very expensive and unacceptable in terms of bandwidth overhead.

The encoding strategy of each H.263 codec implementation is beyond the scope of this document, even though it has a very significant impact on visual quality in the presence of packet loss. However, we strongly recommend coding the GOB header for every GOB that begins a packet, to address these problems.

3.4 Macroblock Address

As specified by H.261, the macroblock address (MBA) is encoded with a variable length code to indicate the position of a macroblock within a group of blocks in the H.261 bitstream. H.263 does not code the MBA explicitly, but the macroblock address within a GOB is necessary to recover the decoder state in the presence of packet loss when fragmenting at MB boundaries. Therefore, this information must be included in the H.263 payload header for the two modes (Mode B and Mode C, as described in section 5) that allow packetization at MB boundaries.

4. Usage of RTP

When transmitting H.263 video streams over the Internet, the output of the encoder is packetized directly. All the bits of the bitstream, including the fixed length codes and the variable length codes (Huffman codes, or SAC codes if SAC is used), are included in the packets. Audio and video signals are not multiplexed in the same packets, as UDP and RTP provide a much more efficient way to achieve multiplexing.

RTP does not guarantee reliable or ordered data delivery, so a packet might get lost in the network.
To achieve a best-effort recovery from packet loss, the decoder needs assistance to proceed with decoding of the other packets that are received. Thus it is desirable to be able to process each packet independently of other packets. Some frame level information, such as the source format and flags for optional features, is included in each packet to assist the decoder in operating correctly and efficiently in the presence of packet loss. The flags for the H.263 optional features also provide information about the coding options used in the H.263 video stream that can be used by session management tools.

The H.263 video stream is carried as payload data within the RTP packets. A new H.263 payload header is defined in section 5. This section defines the usage of the RTP fixed header and the H.263 video packet structure.

4.1 RTP Header Usage

Each RTP packet starts with a fixed RTP header. The following fields of the RTP fixed header are used for H.263 video streams:

Marker bit (M bit): The Marker bit of the RTP header is set to 1 when the current packet carries the end of the current frame, and to 0 otherwise.

Payload Type (PT): The Payload Type shall specify the H.263 video payload format.

Timestamp: The RTP Timestamp encodes the sampling instant of the first video frame contained in the RTP data packet. The RTP timestamp may be the same on successive packets if a video frame occupies more than one packet. For H.263 video streams, the RTP timestamp is based on a 90 kHz clock, the same as that of the RTP payload format for H.261 streams.

4.2 Video Packet Structure

The H.263 compressed bitstream is carried as payload within each RTP packet. In each RTP packet, the RTP header is followed by the H.263 payload header, which is followed by the standard H.263 compressed bitstream. The size of the H.263 payload header varies with the mode used, as detailed in the next section.
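
The timestamp and marker bit rules of Section 4.1 can be illustrated with a short Python sketch. It assumes that a frame has already been split into fragments, one per RTP packet, and that the sampling instant is known in seconds. The function names and the `initial_offset` parameter are hypothetical; the offset stands in for the starting timestamp value chosen by the RTP sender.

```python
from fractions import Fraction

RTP_VIDEO_CLOCK_HZ = 90_000  # clock rate stated in Section 4.1


def rtp_timestamp(sampling_time_s, initial_offset=0):
    """Map a frame's sampling instant (seconds) to a 32-bit RTP timestamp."""
    ticks = int(Fraction(sampling_time_s) * RTP_VIDEO_CLOCK_HZ)
    return (initial_offset + ticks) % (1 << 32)


def label_packets(frame_fragments, sampling_time_s, initial_offset=0):
    """Return (timestamp, marker, fragment) for each fragment of one frame.

    All packets of a frame share one timestamp; only the packet that
    carries the end of the frame has the marker bit set.
    """
    ts = rtp_timestamp(sampling_time_s, initial_offset)
    last = len(frame_fragments) - 1
    return [(ts, 1 if i == last else 0, frag)
            for i, frag in enumerate(frame_fragments)]


if __name__ == "__main__":
    # Second frame of a 30000/1001 Hz sequence, split into three packets.
    frame_time = Fraction(1001, 30000)
    for ts, marker, frag in label_packets([b"g0", b"g1", b"g2"], frame_time):
        print(ts, marker, frag)
```

Using exact fractions for the sampling instant avoids rounding drift for frame rates such as 30000/1001 Hz, where one frame interval corresponds to exactly 3003 ticks of the 90 kHz clock.
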
The layout of the RTP H.263 video packet is shown below:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     H.263 Payload Header                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         H.263 stream                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

5. H.263 Payload Header

For H.263 video streams, each RTP packet carries only one H.263 video packet. The H.263 payload header is always present in each H.263 video packet.

Three formats (Mode A, Mode B and Mode C) are defined for the RTP H.263 payload header. In Mode A, an H.263 payload header of four bytes precedes the compressed H.263 video bitstream in the packet. It allows fragmentation at GOB boundaries. In Mode B, an eight-byte H.263 payload header is used and each packet starts at an MB boundary, with the PB frame option off. Finally, Mode C, with a 12-byte header, supports fragmentation at MB boundaries for frames that are coded with the PB frame option on.

The mode is indicated by the F field and the P field, the two bits that follow the V field at the start of the header. The three modes can be intermixed within one compressed frame. All client applications are required to be able to receive packets in all three modes, but decoding of Mode C packets is optional because PB frames are an optional feature of an H.263 decoder.

In this section, the H.263 payload format is shown as rows of 32-bit words. Each word is transmitted in network byte order, with the most significant byte shown at the left in the following diagrams.

5.1 Mode A

In this mode, the H.263 bitstream is packetized at GOB boundaries. In other words, each packet starts at the beginning of a GOB, and it can carry one or more MBs or GOBs. Only four bytes are used for the header.
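
Packetizing at GOB boundaries can be sketched as a greedy grouping of consecutive GOBs. The Python example below is only an illustration under simplifying assumptions: each GOB is treated as byte-aligned, the coded size of every GOB is known in advance, and `max_payload` is a hypothetical limit on the bytes available after the RTP header. GOBs that cannot fit in one packet are reported separately, since they need one of the modes that fragment at MB boundaries.

```python
MODE_A_HEADER_BYTES = 4  # Mode A payload header size from Section 5


def group_gobs_mode_a(gob_sizes, max_payload):
    """Greedily pack consecutive GOBs into Mode A packets.

    gob_sizes: coded size of each GOB in bytes (assumed byte-aligned).
    max_payload: bytes available after the RTP header (hypothetical limit).
    Returns (packets, oversized): lists of GOB indices per packet, and the
    indices of GOBs too large to fit in any single Mode A packet.
    """
    budget = max_payload - MODE_A_HEADER_BYTES
    packets, oversized, current, used = [], [], [], 0
    for index, size in enumerate(gob_sizes):
        if size > budget:
            # Cannot be carried whole; close the open packet and flag the GOB.
            if current:
                packets.append(current)
                current, used = [], 0
            oversized.append(index)
            continue
        if used + size > budget:
            # The next GOB would overflow, so start a new packet at this GOB.
            packets.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        packets.append(current)
    return packets, oversized


if __name__ == "__main__":
    print(group_gobs_mode_a([300, 300, 500, 1200, 200], max_payload=1000))
```

A real packetizer would also have to set SBIT and EBIT when GOBs are not byte-aligned; that detail is left out here.
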
Mode A can be used with or without the PB frame option. This mode is recommended for GOBs that are smaller than the network packet size. The H.263 payload header for Mode A, which has F=0, is laid out as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | V |F|P|I|SBIT |EBIT | SRC |U|A|S| R |DBQ| TRB |      TR       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

V: 2 bits
   Version number, set to 0 for this version of the RTP H.263 payload header.

F: 1 bit
   The flag bit indicates the format of the header. F=0 indicates Mode A; F=1 indicates Mode B or Mode C.

P: 1 bit
   Optional PB-frame mode as defined by H.263 [4]. "0" implies a normal I or P frame, "1" a PB frame. When F=1, P also indicates the mode: Mode B if P=0, Mode C if P=1.

I: 1 bit
   Set to 1 if the current picture is intra-coded, otherwise 0. Note that this is the opposite of the picture coding type in PTYPE as defined within the H.263 bitstream [4].

SBIT: 3 bits
   Start bit position. Specifies the number of bits that should be ignored in the first data byte.

EBIT: 3 bits
   End bit position. Specifies the number of bits that should be ignored in the last data byte.

SRC: 3 bits
   Source format. Specifies the resolution of the frames contained, as defined by H.263 [4].

U: 1 bit
   Unrestricted Motion Vector mode as defined by H.263 [4]. "0" off, "1" on.

A: 1 bit
   Optional Advanced Prediction mode as defined by H.263 [4]. "0" off, "1" on.

S: 1 bit
   Optional Syntax-based Arithmetic Coding mode as defined by H.263 [4]. "0" off, "1" on.

R: 2 bits
   Reserved, must be set to zero.

DBQ: 2 bits
   Differential quantization parameter used to calculate the quantizer for the B frame from the quantizer for the P frame when the PB frame option is on. The value should be the same as DBQUANT defined by H.263 [4]. Set to zero if the PB frame option is off.

TRB: 3 bits
   Temporal Reference for the B frame as defined by H.263 [4]. Set to zero if the PB frame option is off.

TR: 8 bits
   Temporal Reference for the P frame as defined by H.263 [4]. Set to zero if the PB frame option is off.

5.2 Mode B

In this mode, the H.263 stream can be fragmented at MB boundaries. Additional information is therefore needed at the start of a packet to recover the decoder's internal state in the presence of packet loss. This mode is intended for GOBs that are larger than the maximum packet size allowed by the underlying protocol. It can only be used with the PB frame option off. Mode C, as defined in the next section, can be used to fragment at MB boundaries with the PB frame option on. The H.263 payload header for Mode B, which has F=1 and P=0, is laid out as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | V |F|P|I|SBIT |EBIT | SRC |  QUANT  |  GOBN   |      MBA      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |U|A|S|R|    HMV1     |    VMV1     |    HMV2     |    VMV2     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The following fields are defined the same as in Mode A: F, P, SBIT, EBIT, SRC, I, A, S, V and U. The other fields are defined as follows:

QUANT: 5 bits
   Quantization value for the first MB coded at the start of the packet. Set to 0 if the packet begins with a GOB header. This is the equivalent of GQUANT defined by H.263 [4].

GOBN: 5 bits
   GOB number in effect at the start of the packet. The GOB number is specified differently for different resolutions. See H.263 [4] for details.

MBA: 8 bits
   The absolute address of the first MB within its GOB, counting from 0.

HMV1, VMV1: 7 bits each
   Horizontal and vertical motion vector predictors for the first MB coded in this packet, taken from the MB on its left. The same as MV1 defined by H.263 [4].

HMV2, VMV2: 7 bits each
   Horizontal and vertical motion vector predictors from the block or MB on the left of block number 3 in the current MB when the advanced prediction option is on. They are the same as MV1 defined for block number 3 in H.263 [4]. They are needed because block number 3 in the first MB requires the motion vector predictor from the block to its left, as block number 1 does. These two fields are not used when advanced prediction is off and must then be set to 0. See H.263 [4] for the block organization in a frame.

R: 1 bit
   Reserved, must be set to zero.

5.3 Mode C

In this mode, the H.263 stream can be fragmented at MB boundaries of P frames when the PB frame option is on. It is intended for GOBs that are larger than the maximum packet size allowed by the underlying protocol. The H.263 payload header for Mode C, which has F=1 and P=1, is laid out as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | V |F|P|I|SBIT |EBIT | SRC |  QUANT  |  GOBN   |      MBA      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |U|A|S|R|    HMV1     |    VMV1     |    HMV2     |    VMV2     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 RR                  |DBQ| TRB |      TR       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The following fields are defined the same as in Mode A: F, P, SBIT, EBIT, SRC, I, A, S, U, V, TR, DBQ and TRB. The remaining fields, including the 1-bit R field, are defined the same as in Mode B, except that the 19-bit field RR is reserved and its value must be zero.

5.4 Selection of Modes for the H.263 Payload Header

Packets with different modes can be intermixed. The modes shall be selected carefully based on performance characteristics, H.263 coding modes and the underlying network protocols.

We strongly recommend the use of Mode A whenever possible. The major advantage of Mode A over Mode B and Mode C is its simplicity.
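
The three header layouts of Sections 5.1 to 5.3 can be decoded with a few shifts and masks. The following Python sketch is one possible reading of the diagrams in this draft, with bit 0 taken as the most significant bit of each 32-bit word. The function names and the dictionary representation are illustrative, and the motion vector fields are returned as raw 7-bit values because their interpretation is left to H.263 [4].

```python
import struct

HEADER_BYTES = {"A": 4, "B": 8, "C": 12}


def _bits(word, first_bit, width):
    """Extract `width` bits starting at `first_bit` (bit 0 = MSB of the word)."""
    return (word >> (32 - first_bit - width)) & ((1 << width) - 1)


def parse_payload_header(data):
    """Decode the H.263 payload header at the start of `data` (bytes)."""
    (w0,) = struct.unpack_from("!I", data, 0)  # network byte order
    f, p = _bits(w0, 2, 1), _bits(w0, 3, 1)
    mode = "A" if f == 0 else ("B" if p == 0 else "C")
    if len(data) < HEADER_BYTES[mode]:
        raise ValueError("truncated Mode %s header" % mode)
    hdr = {
        "mode": mode, "V": _bits(w0, 0, 2), "F": f, "P": p,
        "I": _bits(w0, 4, 1), "SBIT": _bits(w0, 5, 3),
        "EBIT": _bits(w0, 8, 3), "SRC": _bits(w0, 11, 3),
    }
    if mode == "A":
        hdr.update(U=_bits(w0, 14, 1), A=_bits(w0, 15, 1), S=_bits(w0, 16, 1),
                   DBQ=_bits(w0, 19, 2), TRB=_bits(w0, 21, 3),
                   TR=_bits(w0, 24, 8))
        return hdr
    # Modes B and C share the first two words.
    hdr.update(QUANT=_bits(w0, 14, 5), GOBN=_bits(w0, 19, 5),
               MBA=_bits(w0, 24, 8))
    (w1,) = struct.unpack_from("!I", data, 4)
    hdr.update(U=_bits(w1, 0, 1), A=_bits(w1, 1, 1), S=_bits(w1, 2, 1),
               HMV1=_bits(w1, 4, 7), VMV1=_bits(w1, 11, 7),
               HMV2=_bits(w1, 18, 7), VMV2=_bits(w1, 25, 7))
    if mode == "C":
        # The third word carries the PB-frame fields after 19 reserved bits.
        (w2,) = struct.unpack_from("!I", data, 8)
        hdr.update(DBQ=_bits(w2, 19, 2), TRB=_bits(w2, 21, 3),
                   TR=_bits(w2, 24, 8))
    return hdr


if __name__ == "__main__":
    # Mode A header with I=1, SRC=2 and TR=5.
    print(parse_payload_header(bytes.fromhex("08080005")))
```

The compressed bitstream starts `HEADER_BYTES[mode]` bytes into the payload, so a receiver can use the returned mode to locate it.
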
The Mode A header is one half the size of the Mode B header and one third the size of the Mode C header. Transmission overhead is reduced, and the savings may be very significant at very low bit rates, especially when low latency is desired.

Another advantage of Mode A is that it simplifies error recovery in the presence of packet loss. The internal state of the decoder can be recovered at GOB boundaries instead of having to synchronize at MBs as in Mode B and Mode C. The GOB headers and the picture start code are easy to identify, and their presence will normally cause the H.263 decoder to re-synchronize its internal state. Requiring the decoder to synchronize its internal state at MBs introduces extra overhead and complexity for the decoder. Mode A shall be used for packets starting with a GOB that is smaller than the network packet size.

The major disadvantage of Mode A is its lack of flexibility in packetization when a GOB cannot fit in a network packet.

Mode B has the advantage of flexibility, with fragmentation at MB boundaries when the PB frame option is off. This mode is necessary when a GOB is larger than the network packet size. Its disadvantage is the higher overhead of the 8-byte header, which may not be desirable for small packets.

Mode C is the same as Mode B, except that it allows fragmentation with the PB frame option on, at the price of 4 additional bytes.

Finally, we would like to emphasize that recovery from packet loss depends on the decoder's ability to use the information provided in the H.263 payload header within the RTP packets.

6. References

[1] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 1889.

[2] "Video Codec for Audiovisual Services at p x 64 kbit/s", ITU-T Recommendation H.261, 1993.

[3] "RTP Profile for Audio and Video Conferences with Minimal Control", RFC 1890.

[4] "Video Coding for Low Bitrate Communication", ITU-T Recommendation H.263, 1995.

[5] T. Turletti, C. Huitema, "RTP Payload Format for H.261 Video Streams", RFC 2032.

7. Author's Address

Chunrong "Chad" Zhu
Mail Stop: JF2-78
Intel Corporation
2111 N.E. 25th Avenue
Hillsboro, OR 97124
USA

Email: czhu@ibeam.intel.com
Tel: (503) 264-8849
Fax: (503) 264-6067