Internet Draft                                               S. Wenger 
 Document: draft-ietf-avt-rtp-h264-02.txt               M.M. Hannuksela 
 Expires: December 2003                                  T. Stockhammer 
                                                          M. Westerlund 
                                                              D. Singer 
                                                              June 2003 
                                                  Expires December 2003 
                                                
  
  
  
                   RTP Payload Format for H.264 Video 
  
  
  
 Status of this Memo 
     
 This document is an Internet-Draft and is in full conformance with 
 all provisions of Section 10 of RFC2026.  Internet-Drafts are working 
 documents of the Internet Engineering Task Force (IETF), its areas, 
 and its working groups.  Note that other groups may also distribute 
 working documents as Internet-Drafts. 
  
 Internet-Drafts are draft documents valid for a maximum of six months 
 and may be updated, replaced, or obsoleted by other documents at any 
 time.  It is inappropriate to use Internet-Drafts as reference 
 material or to cite them other than as "work in progress." 
  
 The list of current Internet-Drafts can be accessed at 
 http://www.ietf.org/1id-abstracts.txt 
  
 The list of Internet-Draft Shadow Directories can be accessed at 
 http://www.ietf.org/shadow.html 
     
     
 Copyright Notice 
     
    Copyright (C) The Internet Society (2003).  All Rights Reserved. 
     
 Abstract 
     
    This memo describes an RTP Payload format for the ITU-T 
    Recommendation H.264 video codec.  This codec was designed as a 
    joint project of the Video Coding Experts Group (VCEG) of ITU-T and 
    the Moving Picture Experts Group (MPEG) of ISO/IEC.  Recommendation 
    H.264 was approved by ITU-T in May 2003, and the approved draft 
    specification is available for public review.  ISO/IEC 
    International Standard 14496-10 will be technically identical to 
    ITU-T Recommendation H.264. 
     
 Wenger et. al.     Expires December 2003                [Page 2] 

 Internet Draft                                          26 June, 2003 
     
 Table of Contents 
     
 1. Introduction
  1.1. The H.264 codec
  1.2. Parameter Set Concept
  1.3. Network Abstraction Layer Unit Types
 2. Conventions
 3. Scope
 4. Definitions and Abbreviations
 5. RTP Payload Format
  5.1. RTP Header Usage
  5.2. Common structure of the RTP payload format
  5.3. Decoding Order Number (DON)
  5.4. Single NAL Unit Packet
  5.5. Aggregation Packets
  5.6. Fragmentation Units (FUs)
 6. Packetization Rules
  6.1. Unrestricted Mode (Multiple Picture Model)
  6.2. Restricted Mode (Single Picture Model)
 7. De-Packetization Process (Informative)
 8. Payload Format Parameters
  8.1. MIME Registration
  8.2. SDP Parameters
 9. Security Considerations
 10. Informative Appendix: Application Examples
  10.1. Video Telephony, No Slice Data Partitioning, No Packet
  Aggregation
  10.2. Video Telephony, Interleaved Packetization Using Packet
  Aggregation
  10.3. Video Telephony, with Data Partitioning
  10.4. Low-Bit-Rate Streaming
  10.5. Robust Packet Scheduling in Video Streaming
 11. Informative Appendix: Rationale for Decoding Order Number
  11.1. Introduction
  11.2. Example of Multi-Picture Slice Interleaving
  11.3. Example of Robust Packet Scheduling
  11.4. Robust Transmission Scheduling of Redundant Coded Slices
  11.5. Remarks on Other Design Possibilities
 12. Open Issues
 13. Full Copyright Statement
 14. Intellectual Property Notice
 15. References
  15.1. Normative References
  15.2. Informative References
 Annex A: Changes relative to draft-ietf-avt-rtp-h264-01.txt
     
 1.    Introduction 
  
 1.1.      The H.264 codec 
  
    This memo specifies an RTP payload format for the video coding 
    standard known as ITU-T Recommendation H.264 and ISO/IEC 
    International Standard 14496 Part 10 (also known as MPEG-4 Advanced 
    Video Coding).  Recommendation H.264 was approved by ITU-T in May 
    2003, and the approved draft specification is available for public 
    review [1].  In this memo the H.264 acronym is used for the codec 
    and the standard, but the memo is equally applicable to the ISO/IEC 
    counterpart of the coding standard. 
  
    The H.264 video codec has a very broad application range that 
    covers all forms of digital compressed video, from low bit rate 
    Internet streaming applications to HDTV broadcast and Digital 
    Cinema applications with near-lossless coding.  Most, if not all, 
    relevant companies in all of these fields (including video 
    conferencing, streaming, TV broadcast, and Digital Cinema) have 
    participated in the standardization, which suggests that this wide 
    application range may materialize in a relatively short time frame.  
    Bit rate savings of 50% or more, compared to the current state of 
    technology, are reported for H.264.  Digital Satellite TV quality, 
    for example, was reported to be achievable at 1.5 Mbit/s, compared 
    to the current operation point of MPEG-2 video at around 3.5 Mbit/s 
    [6]. 
     
    The codec specification [1] itself distinguishes conceptually 
    between a video coding layer (VCL), and a network abstraction layer 
    (NAL).  The VCL contains the signal processing functionality of the 
    codec, such as transform, quantization, motion search/compensation, 
    and the loop filter.  It follows the general concept of most of 
    today's video codecs: a macroblock-based coder that utilizes inter 
    picture prediction with motion compensation and transform coding of 
    the residual signal.  The VCL encoder outputs slices.  A slice is a 
    bit string that contains the macroblock data of an integer number 
    of macroblocks and the information of the slice header (containing 
    the spatial address of the first macroblock in the slice, the 
    initial quantization parameter, and similar).  Macroblocks in 
    slices are ordered in scan order unless a different macroblock 
    allocation is specified, using the so-called Flexible 
    Macroblock Ordering syntax.  In-picture prediction is used only 
    within a slice. More information is provided in [6].  
     
    The NAL encoder encapsulates the slice output of the VCL encoder 
    into Network Abstraction Layer Units (NAL units), which are 
    suitable for transmission over packet networks or for use in 
    packet-oriented multiplex environments.  Annex B of H.264 defines 
    an encapsulation process to transmit such NAL units over 
    byte-stream oriented networks.  Annex B is not relevant in the 
    scope of this memo. 
     
    Internally, the NAL uses NAL Units.  A NAL unit consists of a one-
    byte header and the payload byte string.  The header co-serves as 
    the RTP payload header and indicates the type of the NAL unit, the 
    (potential) presence of bit errors or syntax violations in the NAL 
    unit payload, and information regarding the relative importance of 
    the NAL unit for the decoding process.  This RTP payload 
    specification is designed to be unaware of the bit string in the 
    NAL unit payload. 
     
    One of the main properties of H.264 is the complete decoupling of 
    the transmission time, the decoding time, and the sampling or 
    presentation time of slices and pictures.  The decoding process 
    specified in H.264 is unaware of time, and the H.264 syntax does 
    not carry information such as the number of skipped frames (as 
    common in the form of the Temporal Reference in earlier video 
    compression standards).  Also, some NAL units affect many pictures 
    and are hence inherently timeless.  For this reason, the handling 
    of the RTP timestamp requires special consideration for those NAL 
    units for which the sampling or presentation time is not defined 
    or, at transmission time, unknown. 
     
     
 1.2.      Parameter Set Concept 
     
    One fundamental design concept of H.264 is to generate 
    self-contained packets, which makes mechanisms such as the header 
    duplication of RFC2429 [8] or MPEG-4's HEC [9] unnecessary.  This 
    was achieved by decoupling information that is relevant to more 
    than one slice from the media stream.  This higher layer meta 
    information should be sent reliably, asynchronously, and in advance 
    of the RTP packet stream that contains the slice packets.  
    (Provisions for sending this information in-band are also available 
    for applications that do not have an out-of-band transport channel 
    appropriate for the purpose.)  The combination of 
    the higher level parameters is called a parameter set.  The H.264 
    specification includes two types of parameter sets: sequence 
    parameter set and picture parameter set.  An active sequence 
    parameter set remains unchanged throughout a coded video sequence, 
    and an active picture parameter set remains unchanged within a 
    coded picture.  The sequence and picture parameter set structures 
    contain information such as 
     
      o picture size, 
      o optional coding modes employed, and 
      o macroblock to slice group map. 
       
    To allow picture parameters (such as the picture size) to change 
    without the need to transmit parameter set updates synchronously 
    with the slice packet stream, the encoder and decoder can maintain 
    a list of more than one sequence and picture parameter set.  Each 
    slice header contains a codeword that indicates the sequence and 
    picture parameter set to be used. 
     
    This mechanism allows the transmission of parameter sets to be 
    decoupled from the packet stream, so that they can be transmitted 
    by external means, e.g. as a side effect of the capability 
    exchange, or through a (reliable or unreliable) control protocol.  
    They may even never be transmitted at all, but instead be fixed by 
    an application design specification. 
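
    Informative note: The parameter set lookup described above can be 
    sketched as follows.  This is a minimal illustration only.  The 
    class and method names are hypothetical and are not part of this 
    memo or of [1].  The sketch assumes that the codeword in the slice 
    header selects a picture parameter set, and that each picture 
    parameter set in turn names the sequence parameter set it belongs 
    to.  Parameter set contents are modelled as plain dictionaries. 

```python
from typing import Any, Dict, Tuple


class ParameterSetStore:
    """Hypothetical receiver-side tables of parameter sets, keyed by identifier."""

    def __init__(self) -> None:
        self._sequence_sets: Dict[int, Dict[str, Any]] = {}
        self._picture_sets: Dict[int, Tuple[int, Dict[str, Any]]] = {}

    def put_sequence_set(self, sps_id: int, params: Dict[str, Any]) -> None:
        # May be called at any time, e.g. from an out-of-band control channel.
        self._sequence_sets[sps_id] = params

    def put_picture_set(self, pps_id: int, sps_id: int,
                        params: Dict[str, Any]) -> None:
        # Assumption: a picture parameter set names its sequence parameter set.
        self._picture_sets[pps_id] = (sps_id, params)

    def resolve(self, pps_id: int) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        """Return (sequence set, picture set) for the id found in a slice header."""
        sps_id, picture_params = self._picture_sets[pps_id]  # KeyError if unknown
        return self._sequence_sets[sps_id], picture_params
```

    Because several parameter sets can be stored at once, a change of 
    picture size only requires that the slice header refer to a 
    different, previously stored identifier. 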
     
     
 1.3.      Network Abstraction Layer Unit Types 
  
    Tutorial information on the NAL design can be found in [10], 
    [11] and [12].   
     
    Every NAL unit begins with a single NAL unit type octet, which also 
    serves as the payload header of this RTP payload format.  The 
    payload of a NAL unit follows immediately.   
     
    The syntax and semantics of the NAL unit type octet are specified 
    in [1], but the essential properties of the NAL unit type octet are 
    summarized below.  The NAL unit type octet has the following 
    format: 
     
    +---------------+ 
    |0|1|2|3|4|5|6|7| 
    +-+-+-+-+-+-+-+-+ 
    |F|NRI|  Type   | 
    +---------------+ 
     
    F: 1 bit 
       forbidden_zero_bit, when zero, indicates a bit error free NAL 
       unit.  The H.264 specification declares a value of 1 as a syntax 
       violation.  Hence, when set, the decoder is advised that bit 
       errors or any other syntax violation may be present in the 
       payload or in the NAL unit type octet.  A prudent reaction of 
       decoders that are incapable of handling erroneous bit streams is 
       to discard such NAL units. 
        
    NRI: 2 bits 
       nal_ref_idc.  A value of 00 indicates that the content of the 
       NAL unit is not used to reconstruct reference pictures for inter 
       picture prediction.  Such NAL units can be discarded without 
       risking the integrity of the reference pictures.  Values above 
       00 indicate that the decoding of the NAL unit is required to 
       maintain the integrity of the reference pictures.  Furthermore, 
       values above 00 indicate the relative transport priority, as 
       determined by the encoder.  Intelligent network elements can use 
       this information to protect more important NAL units better than 
       less important NAL units.  11 is the highest transport priority, 
       followed by 10, then by 01 and, finally, 00 is the lowest. 
     
    Type: 5 bits 
       nal_unit_type.  The NAL unit payload type as defined in Table 
       7-1 of [1], and later within this memo.  Note that the NAL unit 
       types defined in this memo are marked as unspecified in [1]. 
     
    For a reference of all currently defined NAL unit types and their 
    semantics please refer to section 7.4.1 in [1].   
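
    Informative note: The field layout above can be illustrated with a 
    short sketch.  The function and type names below are illustrative 
    assumptions, not identifiers from [1].  Bit 0 in the diagram is 
    taken to be the most significant bit of the octet, following the 
    usual convention for such diagrams. 

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    f: int              # forbidden_zero_bit (1 bit)
    nri: int            # nal_ref_idc (2 bits)
    nal_unit_type: int  # nal_unit_type (5 bits)


def parse_nal_header(octet: int) -> NalHeader:
    """Split the NAL unit type octet into its F, NRI, and Type fields."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("NAL unit type octet must fit in one byte")
    return NalHeader(f=(octet >> 7) & 0x1,
                     nri=(octet >> 5) & 0x3,
                     nal_unit_type=octet & 0x1F)


def may_discard(header: NalHeader) -> bool:
    """True if the NAL unit is not needed to reconstruct reference pictures."""
    return header.nri == 0
```

    For example, the octet 0x65 (binary 0 11 00101) yields F = 0, 
    NRI = 3, and Type = 5. 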
     
     
 2.    Conventions 
  
    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 
    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 
    this document are to be interpreted as described in RFC 2119 [2]. 
     
    This specification uses the notion of setting and clearing a bit 
    when handling bit fields. Setting a bit is the same as assigning 
    that bit the value of 1 (On). Clearing a bit is the same as 
    assigning that bit the value of 0 (Off). 
     
  
 3.    Scope 
  
    This payload specification can only be used to carry the "naked" 
    H.264 NAL unit stream over RTP.  The first applications of this 
    specification are likely to be in the conversational multimedia 
    field, i.e., video telephony or video conferencing.  This memo is 
    not intended for use in conjunction with the byte stream format of 
    Annex B of H.264. 
     
     
 4.    Definitions and Abbreviations 
     
 4.1.      Definitions 
  
    This document uses the definitions of [1]. The following terms 
    defined in [1] are summed up below for convenience: 
     
       access unit: A set of NAL units always containing a primary 
       coded picture.  In addition to the primary coded picture, an 
       access unit may also contain one or more redundant coded 
       pictures or other NAL units not containing slices or slice data 
       partitions of a coded picture.  The decoding of an access unit 
       always results in a decoded picture. 
     
       coded video sequence: A sequence of access units that consists, 
       in decoding order, of an IDR access unit followed by zero or more 
       non-IDR access units including all subsequent access units up to 
       but not including any subsequent IDR access unit. 
        
       instantaneous decoding refresh (IDR) access unit: An access unit 
       in which the primary coded picture is an IDR picture. 
        
       instantaneous decoding refresh (IDR) picture: A coded picture 
       containing only slices with I or SI slice types that causes a 
       "reset" in the decoding process.  After the decoding of an IDR 
       picture all following coded pictures in decoding order can be 
       decoded without inter prediction from any picture decoded prior 
       to the IDR picture. 
     
       primary coded picture: The coded representation of a picture to 
       be used by the decoding process for a bitstream conforming to 
       H.264.  The primary coded picture contains all macroblocks of 
       the picture. 
     
       redundant coded picture: A coded representation of a picture or 
       a part of a picture.  The content of a redundant coded picture 
       shall not be used by the decoding process for a bitstream 
       conforming to H.264.  The content of a redundant coded picture 
       may be used by the decoding process for a bitstream that 
       contains errors or losses. 
        
       VCL NAL unit: A collective term used to refer to coded slice and 
       coded data partition NAL units. 
     
    In addition, the following definitions apply: 
     
       decoding order number (DON): A field in the payload structure or 
       a derived variable indicating NAL unit decoding order.  Values 
       of DON are in the range of 0 to 65535, inclusive.  After 
       reaching the maximum value, the value of DON wraps around to 0. 
        
       NAL unit decoding order: A NAL unit order that conforms to the 
       constraints on NAL unit order given in section 7.4.1.2 in [1].   
     
       transmission order: The order of packets in ascending RTP 
       sequence number order (in modulo arithmetic).  Within an 
       aggregation packet, the NAL unit transmission order is the same 
       as the order of appearance of NAL units in the packet. 
     
     
 4.2.      Abbreviations 
     
    DON:        Decoding Order Number 
    DONB:       Decoding Order Number Base 
    DOND:       Decoding Order Number Difference 
    FU:         Fragmentation Unit 
    IDR:        Instantaneous Decoding Refresh 
    IEC:        International Electrotechnical Commission 
    ISO:        International Organization for Standardization 
    ITU-T:      International Telecommunication Union, 
                Telecommunication Standardization Sector 
    MTAP:       Multi-Time Aggregation Packet 
    MTAP16:     MTAP with 16-bit timestamp offset 
    MTAP24:     MTAP with 24-bit timestamp offset 
    NAL:        Network Abstraction Layer 
    NALU:       NAL Unit 
    STAP:       Single-Time Aggregation Packet 
    STAP-A:     STAP type A 
    STAP-B:     STAP type B 
    TS:         Timestamp 
    VCL:        Video Coding Layer 
     
     
 5.    RTP Payload Format 
     
 5.1.      RTP Header Usage 
  
    The format of the RTP header is specified in RFC 1889 [3] and 
    reprinted in Figure 1 for convenience.  This payload format uses 
    the fields of the header in a manner consistent with that 
    specification. 
     
    When encapsulating one NAL unit per RTP packet, the RECOMMENDED RTP 
    payload format is specified in section 5.4.  The RTP payload (and 
    the settings for some RTP header bits) for aggregation packets and 
    fragmentation units are specified in sections 5.5 and 5.6, 
    respectively.   
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |V=2|P|X|  CC   |M|     PT      |       sequence number         | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                           timestamp                           | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |           synchronization source (SSRC) identifier            | 
    +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ 
    |            contributing source (CSRC) identifiers             | 
    |                             ....                              | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure 1: RTP header according to RFC 1889. 
     
    The RTP header information is set as follows:  
     
    Version (V): 2 bits 
       Set to 2 according to RFC 1889. 
     
    Padding (P): 1 bit 
       Used according to RFC 1889. 
     
    Extension (X): 1 bit 
       Used according to RFC 1889 and profile definitions. 
     
    CSRC count (CC): 4 bits 
       Used according to RFC 1889. 
     
    Marker bit (M): 1 bit 
       Set for the very last packet of the access unit indicated by the 
       RTP timestamp, in line with the normal use of the M bit and to 
       allow an efficient playout buffer handling.  Decoders MAY use 
       this bit as an early indication of the last packet of an access 
       unit, but MUST NOT rely on this property. 
       Informative note: Only one M bit is associated with an 
       aggregation packet carrying multiple NAL units, and thus if a 
       gateway has re-packetized an aggregation packet into several 
       packets, it cannot reliably set the M bit of those packets.  
     
    Payload type (PT): 7 bits 
       The assignment of an RTP payload type for this new packet format 
       is outside the scope of this document, and will not be specified 
       here.  The assignment of a payload type needs to be performed 
       either through the profile used or in a dynamic way. 
     
    Sequence number (SN): 16 bits 
       Increased by one for each sent packet.  Set to a random value 
       during startup as per RFC 1889. 
     
    Timestamp: 32 bits 
       The RTP timestamp is set to the sampling timestamp of the 
       content.  A 90 kHz clock rate MUST be used.  If the content is a 
       part of a coded frame that was sampled as two fields having 
       distinct sampling times and that is supposed to be displayed as 
       fields having distinct display times, the RTP timestamp is set 
       to the sampling timestamp of the latest sampled field.  In 
       addition, the picture timing SEI message (subclauses D.1.2 and 
       D.2.2 of [1]) SHOULD be used to convey the timestamps for 
       display, and the last clock timestamp in decoding order conveyed 
       in a picture timing SEI message MUST correspond to the RTP 
       timestamp of the primary coded picture of the same access unit.  
       If the NAL unit has no timing properties of its own (e.g. 
       parameter set and SEI NAL units), the RTP timestamp is set to 
       the RTP timestamp of the primary coded picture of the access 
       unit in which the NAL unit is included according to section 
       7.4.1.2 of [1].  The setting of the RTP timestamp for MTAPs is 
       defined in section 5.5.2. 
     
    Synchronization source (SSRC) identifier: 32 bits 
       Used according to RFC 1889. 
     
    Contributing source (CSRC) identifiers: 0 to 15 items, 32 bits each 
       Used according to RFC 1889. 
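
    Informative note: As an illustration of the header layout in 
    Figure 1, the following sketch parses the fixed RTP header and the 
    CSRC list from a received packet.  The names RtpHeader and 
    parse_rtp_header are assumptions made for this example.  The sketch 
    does not interpret a header extension (X bit) or padding (P bit). 

```python
import struct
from typing import NamedTuple, Tuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrcs: Tuple[int, ...]


def parse_rtp_header(packet: bytes) -> Tuple[RtpHeader, int]:
    """Parse the fixed header and CSRC list; return (header, end offset)."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the 12-byte fixed RTP header")
    # Network byte order: two octets, 16-bit SN, 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F
    end = 12 + 4 * cc
    if len(packet) < end:
        raise ValueError("packet truncated inside the CSRC list")
    csrcs = struct.unpack("!%dI" % cc, packet[12:end])
    header = RtpHeader(version=b0 >> 6,
                       padding=bool(b0 & 0x20),
                       extension=bool(b0 & 0x10),
                       csrc_count=cc,
                       marker=bool(b1 & 0x80),
                       payload_type=b1 & 0x7F,
                       sequence_number=seq,
                       timestamp=ts,
                       ssrc=ssrc,
                       csrcs=csrcs)
    # If the X bit is set, a header extension follows at this offset;
    # otherwise the payload starts here.
    return header, end
```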
     
     
 5.2.      Common structure of the RTP payload format 
     
    The payload format defines a number of different payload 
    structures, depending on need.  Which structure a received RTP 
    packet contains is evident from the first byte of the payload.  
    This byte is always structured as a NAL unit header, and its NAL 
    unit type field indicates which structure is present.  The 
    possible structures are: 
     
    Single NAL Unit Packet: Contains only a single NAL unit in the 
    payload.  The NAL header type field will be equal to the original 
    NAL type, i.e., in the range of 1 to 23, inclusive.  Specified in 
    section 5.4. 
     
    Aggregation packet: Packet type used to aggregate multiple NAL 
    units into a single RTP payload.  This packet exists in four 
    versions, the Single-Time Aggregation Packet type A (STAP-A), the 
    Single-Time Aggregation Packet type B (STAP-B), Multi-Time 
    Aggregation Packet (MTAP) with 16 bit offset (MTAP16), and Multi-
    Time Aggregation Packet (MTAP) with 24 bit offset (MTAP24).  The 
    NAL type numbers assigned for STAP-A, STAP-B, MTAP16, and MTAP24 
    are 24, 25, 26, and 27 respectively.  Specified in section 5.5. 
     
    Fragmentation unit (FU): Used to fragment a single NAL unit over 
    multiple RTP packets.  Identified with the NAL type number 28.  
    Specified in section 5.6. 
     
    Table 1 - Summary of NAL types and their payload structures. 
     
    Type   Packet    Type name                       Section 
    -------------------------------------------------------- 
    1-23   NAL unit  A single NAL unit packet          5.4 
    24     STAP-A    Single-time aggregation packet    5.5.1 
    25     STAP-B    Single-time aggregation packet    5.5.1 
    26     MTAP16    Multi-time aggregation packet     5.5.2 
    27     MTAP24    Multi-time aggregation packet     5.5.2 
    28     FU        Fragmentation unit                5.6 
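
    Informative note: The dispatch implied by Table 1 can be sketched 
    as follows.  The function name is hypothetical.  Raising an error 
    for type values that do not appear in Table 1 is an assumption of 
    this sketch, not a rule stated above. 

```python
AGGREGATION_AND_FU_TYPES = {
    24: "STAP-A",
    25: "STAP-B",
    26: "MTAP16",
    27: "MTAP24",
    28: "FU",
}


def payload_structure(first_payload_byte: int) -> str:
    """Name the payload structure indicated by the first payload byte."""
    nal_type = first_payload_byte & 0x1F  # low five bits: Type field
    if 1 <= nal_type <= 23:
        return "single NAL unit"
    if nal_type in AGGREGATION_AND_FU_TYPES:
        return AGGREGATION_AND_FU_TYPES[nal_type]
    # Assumption: values not listed in Table 1 are rejected by this sketch.
    raise ValueError("type %d is not listed in Table 1" % nal_type)
```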
     
     
 5.3.      Decoding Order Number (DON) 
     
    This memo allows transmission of NAL units out of their decoding 
    order.  Decoding order number (DON) is a field in the payload 
    structure or a derived variable that indicates the NAL unit 
    decoding order.  Rationale and example use cases for transmission 
    out of decoding order and for the use of DON are given in section 
    11. 
     
    The coupling of transmission and decoding order is controlled by 
    the optional interleaving-depth MIME parameter as follows.  When 
    the value of the optional interleaving-depth MIME parameter is 
    equal to 0 and transmission of NAL units out of their decoding 
    order is disallowed by external means, the transmission order of 
    NAL units MUST conform to the NAL unit decoding order.  When the 
    value of the optional interleaving-depth MIME parameter is greater 
    than 0 or transmission of NAL units out of their decoding order is 
    allowed by external means, 
    o the order of NAL units in an STAP-B, an MTAP16, and an MTAP24 is 
      NOT REQUIRED to be the NAL unit decoding order, and 
    o the order of NAL units composed by decapsulating STAP-Bs, MTAPs, 
      and FUs in two consecutive packets, when neither of these packets 
      is a single NAL unit packet or an STAP-A, is NOT REQUIRED to be 
      the NAL unit decoding order. 
     
    The RTP payload structures for a single NAL unit packet and an 
    STAP-A do not include DON.  STAP-B and FU structures include DON, 
    and the structure of MTAPs enables derivation of DON as specified 
    in section 5.5.2. 
     
    Informative note: If a transmitter wants to encapsulate one NAL 
    unit per packet and transmit packets out of their decoding order, 
    the STAP-B packet type can be used. 
     
    The transmission order of NAL units in single NAL unit packets and 
    STAP-As MUST be the same as their NAL unit decoding order.  The NAL 
    units within an STAP-A MUST appear in the NAL unit decoding order.  
    The transmission order of a single NAL unit packet or an STAP-A 
    MUST succeed the transmission order of any NAL unit that precedes 
    the single NAL unit packet or the STAP-A, respectively, in NAL unit 
    decoding order.  The transmission order of a single NAL unit packet 
    or an STAP-A MUST precede the transmission order of any NAL unit 
    that succeeds the single NAL unit packet or the STAP-A, 
    respectively, in NAL unit decoding order. 
     
    Informative note: Due to the restrictions in the transmission order 
    of single NAL unit packets and STAP-As, the decoding order of 
    single NAL unit packets and STAP-As is the same as their 
    transmission order, and all NAL units preceding a single NAL unit 
    packet or an STAP-A in transmission order are decoded before the 
    single NAL unit packet or the STAP-A, respectively. 
     
    The DON of the first NAL unit in transmission order MAY be set to 
    any value.  Values of DON are in the range of 0 to 65535, 
    inclusive.  After reaching the maximum value, the value of DON 
    wraps around to 0.   
     
    The decoding order of two NAL units is determined as follows.  Let 
    the value of DON of one NAL unit be D1 and the value of DON of 
    another NAL unit be D2.  If D1 < D2 and D2 - D1 < 32768, or if D1 > 
    D2 and D1 - D2 >= 32768, then the NAL unit having a value of DON 
    equal to D1 precedes the NAL unit having a value of DON equal to D2 
    in NAL unit decoding order.  If D1 < D2 and D2 - D1 >= 32768, or if 
    D1 > D2 and D1 - D2 < 32768, then the NAL unit having a value of 
    DON equal to D2 precedes the NAL unit having a value of DON equal 
    to D1 in NAL unit decoding order.   
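
    Informative note: The comparison rule above can be written as a 
    small function.  This is a sketch only, and the name don_precedes 
    is an assumption of this example.  The conditions are transcribed 
    literally from the text rather than reduced to a single modulo 
    expression, because the rule is not symmetric when the two values 
    differ by exactly 32768. 

```python
HALF_RANGE = 32768  # half of the 16-bit DON value space (0..65535)


def don_precedes(d1: int, d2: int) -> bool:
    """True if the NAL unit with DON d1 precedes the one with DON d2."""
    for value in (d1, d2):
        if not 0 <= value <= 65535:
            raise ValueError("DON values are in the range 0 to 65535")
    if d1 < d2:
        return d2 - d1 < HALF_RANGE
    if d1 > d2:
        return d1 - d2 >= HALF_RANGE
    return False  # equal DON values: neither NAL unit precedes the other
```

    For example, don_precedes(65535, 0) is true, because DON has 
    wrapped around from its maximum value to 0. 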
     
    Values of DON related fields (DON, DONB, and DOND, see section 5.5) 
    MUST be such that the decoding order determined by the values of 
    DON as specified above conforms to the NAL unit decoding order.  If 
    the order of two consecutive NAL units in the NAL unit stream is 
    switched and the new order still conforms to the NAL unit decoding 
    order, the NAL units MAY have the same value of DON.  For example, 
    when arbitrary slice order is allowed by the video coding profile 
    in use, all the coded slice NAL units of a coded picture are 
    allowed to have the same value of DON.  Consequently, NAL units 
    having the same value of DON can be decoded in any order, and two 
    NAL units having a different value of DON should be passed to the 
    decoder in the order specified above. 
     
    An example decapsulation process to recover the NAL unit decoding 
    order is given in section 7.   
     
     
 5.4.      Single NAL Unit Packet 
     
    The single NAL unit packet defined here MUST contain one and only 
    one NAL unit of the types defined in [1].  This means that neither 
    an aggregation packet nor a fragmentation unit can be used within a 
    single NAL unit packet.  A NAL unit stream composed by 
    decapsulating single NAL unit packets in RTP sequence number order 
    MUST conform to the NAL unit decoding order.  The structure of the 
    single NAL unit packet is shown in Figure 2. 
     
      0                   1                   2                   3 
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |                          RTP Header                           | 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |                                                               | 
     |                       Single NAL unit                         | 
     |                                                               | 
     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |                               :...OPTIONAL RTP padding        | 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure 2. RTP payload format for single NAL unit packet. 
     
  
 5.5.      Aggregation Packets 
     
    Aggregation packets are the NAL unit aggregation scheme of this 
    payload specification.  The scheme is introduced to reflect the 
    dramatically different MTU sizes of two key target networks -- 
    wireline IP networks (with an MTU size that is often limited by the 
    Ethernet MTU size -- roughly 1500 bytes), and IP or non-IP (e.g. 
    ITU-T H.324/M) based wireless communication systems with preferred 
    transmission unit sizes of 254 bytes or less.  In order to prevent 
    media transcoding between the two worlds, and to avoid undesirable 
    packetization overhead, a NAL unit aggregation scheme is 
    introduced. 
     
    Two types of aggregation packets are defined by this specification: 
     
    o Single-time aggregation packet (STAP) aggregates NAL units with 
      identical NALU-time.  Two types of STAPs are defined, one without 
      DON (STAP-A) and another one including DON (STAP-B). 
    o Multi-time aggregation packet (MTAP) aggregates NAL units with 
      potentially differing NALU-time.  Two different MTAPs are defined 
      that differ in the length of the NAL unit timestamp offset. 
     
    The term NALU-time is defined as the value that the RTP timestamp 
    would have if that NAL unit were transported in its own RTP 
    packet.  
     
    Each NAL unit to be carried in an aggregation packet is 
    encapsulated in an aggregation unit.  Please see below for the 
    three different aggregation units and their characteristics. 
     
    The structure of the RTP payload format for aggregation packets is 
    presented in Figure 3. 
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |F|NRI|  type   |                                               | 
    +-+-+-+-+-+-+-+-+                                               | 
    |                                                               | 
    |             one or more aggregation units                     | 
    |                                                               | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               :...OPTIONAL RTP padding        | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure 3. RTP payload format for aggregation packets. 
     
    MTAPs and STAPs share the following packetization rules:  The RTP 
    timestamp MUST be set to the earliest of the NALU-times of all the 
    NAL units to be aggregated.  The type field of the NAL unit type 
    octet MUST be set to the appropriate value as indicated in Table 2. 
    The F bit MUST be cleared if all F bits of the aggregated NAL units 
    are zero, otherwise it MUST be set. 
     
     
    Table 2: Type field for STAPs and MTAPs 
     
    Type   Packet    Timestamp offset   DON related fields 
                     field length       (DON, DONB, DOND) 
                     (in bits)          present 
    -------------------------------------------------------- 
    24     STAP-A       0                 no 
    25     STAP-B       0                 yes 
    26     MTAP16      16                 yes 
    27     MTAP24      24                 yes 
     
    The marker bit in the RTP header MUST be set to the value the 
    marker bit of the last NAL unit of the aggregated packet would have 
    if it were transported in its own RTP packet. 
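
    As an informative illustration, the F bit and type field rules
    for the NAL unit type octet of an aggregation packet can be
    sketched in Python as follows.  The function name and its
    arguments are assumptions made for this example.  This section
    does not state how the NRI field of an aggregation packet is
    chosen, so the sketch takes it as a caller-supplied value.

```python
from typing import Sequence

# Type field values taken from Table 2.
AGGREGATION_TYPES = {"STAP-A": 24, "STAP-B": 25, "MTAP16": 26, "MTAP24": 27}


def aggregation_header(kind: str, nal_units: Sequence[bytes], nri: int) -> int:
    """Build the first payload octet (F | NRI | type) of an aggregation packet."""
    if not nal_units or any(len(nal) == 0 for nal in nal_units):
        raise ValueError("at least one non-empty NAL unit is required")
    if not 0 <= nri <= 3:
        raise ValueError("NRI is a 2-bit field")
    # F is the most significant bit of each NAL unit's first octet.
    # It is cleared only if the F bits of all aggregated NAL units are zero.
    f_bit = 1 if any(nal[0] & 0x80 for nal in nal_units) else 0
    return (f_bit << 7) | (nri << 5) | AGGREGATION_TYPES[kind]
```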
     
    The payload of an aggregation packet consists of one or more 
    aggregation units.  See sections 5.5.1 and 5.5.2 for the three 
    different types of aggregation units.  An aggregation packet can 
    carry as many aggregation units as necessary; however, the total 
    amount of data in an aggregation packet MUST fit into an IP 
    packet, and the size SHOULD be chosen such that the resulting IP 
    packet is smaller than the MTU size.  An aggregation packet MUST 
    NOT contain fragmentation units specified in section 5.6. 
  
  
 5.5.1.        Single-Time Aggregation Packet 
  
    A single-time aggregation packet (STAP) SHOULD be used whenever 
    aggregating NAL units that share the same NALU-time.  The payload 
    of an STAP-A does not include DON and consists of at least one 
    single-time aggregation unit as presented in Figure 4.  The payload 
    of an STAP-B consists of a 16-bit unsigned decoding order number 
    (DON) followed by at least one single-time aggregation unit as 
    presented in Figure 5. 
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
                    :                                               | 
    +-+-+-+-+-+-+-+-+                                               | 
    |                                                               | 
    |                single-time aggregation units                  | 
    |                                                               | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               : 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure 4. Payload format for STAP-A. 
     
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
                    :  decoding order number (DON)  |               | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               | 
    |                                                               | 
    |                single-time aggregation units                  | 
    |                                                               | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               : 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure 5. Payload format for STAP-B. 
     
    A single-time aggregation unit consists of 16-bit unsigned size 
    information that indicates the size of the following NAL unit in 
    bytes (excluding these two octets, but including the NAL unit type 
    octet of the NAL unit), followed by the NAL unit itself including 
    its NAL unit type byte.  A single-time aggregation unit is byte-
    aligned within the RTP payload but it may not be aligned on a 32-
    bit word boundary.  Figure 6 presents the structure of the single-
    time aggregation unit. 
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
                    :        NAL unit size          |               | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               | 
    |                                                               | 
    |                           NAL unit                            | 
    |                                                               | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               : 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure 6. Structure for single-time aggregation unit. 
     
    The NAL units within an STAP-A MUST appear in the NAL unit decoding 
    order.  If the value of the optional interleaving-depth MIME 
    parameter is equal to 0 or transmission of NAL units out of their 
    decoding order is disallowed by other means, the NAL units within 
    an STAP-B MUST appear in the NAL unit decoding order.  Otherwise, 
    the NAL units within an STAP-B are NOT REQUIRED to appear in the 
    NAL unit decoding order. 
     
    Figure 7 presents an example of an RTP packet that contains an 
    STAP-B.  The STAP contains two single-time aggregation units, 
    labeled as 1 and 2 in the figure. 
     
     0                   1                   2                   3 
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |                          RTP Header                           | 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |STAP-B NAL HDR | DON                           | NALU 1 Size   | 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     | NALU 1 Size   | NALU 1 HDR    | NALU 1 Data                   | 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               + 
     :                                                               | 
     +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |               | NALU 2 Size                   | NALU 2 HDR    | 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |                       NALU 2 Data                             | 
     |                                                               | 
     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |                               :...OPTIONAL RTP padding        | 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure 7. An example of an RTP packet including an STAP-B and two 
    single-time aggregation units. 
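
    As an informative illustration, the STAP-B layout shown in Figure
    7 can be sketched in Python as follows.  The function names are
    assumptions made for this example.  The sketch assumes that the
    16-bit fields are carried in network byte order and that the RTP
    header and any RTP padding have already been removed.

```python
import struct

STAP_B_TYPE = 25  # from Table 2


def pack_stap_b_payload(header_octet: int, don: int, nal_units) -> bytes:
    """Serialize the header octet, the DON, and one unit per NAL unit."""
    if (header_octet & 0x1F) != STAP_B_TYPE:
        raise ValueError("header octet does not carry type 25")
    out = bytearray([header_octet])
    out += struct.pack("!H", don)
    for nal in nal_units:
        if not 0 < len(nal) <= 0xFFFF:
            raise ValueError("NAL unit size must fit in 16 bits")
        # The size counts the NAL unit including its type octet,
        # but not the two size octets themselves.
        out += struct.pack("!H", len(nal)) + nal
    return bytes(out)


def unpack_stap_b_payload(payload: bytes):
    """Return (don, [nal_unit, ...]) from an STAP-B RTP payload."""
    if len(payload) < 3 or (payload[0] & 0x1F) != STAP_B_TYPE:
        raise ValueError("not an STAP-B payload")
    (don,) = struct.unpack_from("!H", payload, 1)
    pos, units = 3, []
    while pos < len(payload):
        if pos + 2 > len(payload):
            raise ValueError("truncated NAL unit size field")
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        if pos + size > len(payload):
            raise ValueError("NAL unit extends past end of payload")
        units.append(payload[pos:pos + size])
        pos += size
    return don, units
```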
     
     
 5.5.2.        Multi-Time Aggregation Packets (MTAPs) 
     
    The NAL unit payload of MTAPs consists of a 16-bit unsigned 
    decoding order number base (DONB) and one or more multi-time 
    aggregation units as presented in Figure 8.  DONB MUST contain the 
    smallest value of DON among the NAL units of the MTAP.  The choice 
    between the different MTAP types (MTAP16 and MTAP24) is application 
    dependent: the larger the timestamp offset field, the greater the 
    flexibility of the MTAP, but also the higher the overhead. 
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
                    :  decoding order number base   |               | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               | 
    |                                                               | 
    |                 multi-time aggregation units                  | 
    |                                                               | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               : 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure 8. NAL unit payload format for MTAPs. 
     
    Two different multi-time aggregation units are defined in this 
    specification.  Both of them consist of 16-bit unsigned size 
    information of the following NAL unit, an 8-bit unsigned decoding 
    order number delta (DOND), and n bits of timestamp offset (TS 
    offset) for this NAL unit, whereby n can be 16 or 24.  The 
    structures of the multi-time aggregation units for MTAP16 and 
    MTAP24 are presented in Figure 9 and Figure 10, respectively.  Note that 
    the starting or ending position of an aggregation unit within a 
    packet is NOT REQUIRED to be on a 32-bit word boundary.  DON of the 
    following NAL unit is equal to DONB + DOND.  The derived DON is 
    REQUIRED to be in the range of 0 to 65535, inclusive.  This memo 
    does not specify how the NAL units within an MTAP are ordered, but, 
    in most cases, NAL unit decoding order SHOULD be used.  The 
    timestamp offset field MUST be set to a value equal to the value of 
    the following formula: (the NALU-time of the NAL unit - the RTP 
    timestamp of the packet).  
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    :        NAL unit size          |      DOND     |  TS offset    | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |  TS offset    |                                               | 
    +-+-+-+-+-+-+-+-+              NAL unit                         | 
    |                                                               | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               : 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure 9. Multi-time aggregation unit for MTAP16 
     
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    :        NAL unit size          |      DOND     |  TS offset    | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |         TS offset             |                               | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               | 
    |                              NAL unit                         | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               : 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure 10. Multi-time aggregation unit for MTAP24 
     
    For the "earliest" multi-time aggregation unit in an MTAP the 
    timing offset MUST be zero.  Hence, the RTP timestamp of the MTAP 
    itself is identical to the earliest NALU-time. 
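
    As an informative illustration, the derivation of DONB, DOND, the
    timestamp offset, and the RTP timestamp for an MTAP16 can be
    sketched in Python as follows.  The function name and the input
    tuple layout are assumptions made for this example.  The sketch
    further assumes network byte order, that the size field counts
    only the octets of the NAL unit (as in the single-time aggregation
    unit), and that neither DON nor the NALU-time wraps around among
    the aggregated NAL units.

```python
import struct


def build_mtap16_body(units):
    """units: list of (don, nalu_time, nal_bytes).

    Returns (rtp_timestamp, body), where body is the DONB field
    followed by the multi-time aggregation units (Figure 8).
    """
    if not units:
        raise ValueError("an MTAP carries at least one aggregation unit")
    donb = min(don for don, _, _ in units)       # smallest DON
    rtp_timestamp = min(t for _, t, _ in units)  # earliest NALU-time
    body = bytearray(struct.pack("!H", donb))
    for don, nalu_time, nal in units:
        dond = don - donb
        ts_offset = nalu_time - rtp_timestamp    # zero for the earliest unit
        if dond > 0xFF:
            raise ValueError("DOND does not fit in 8 bits")
        if ts_offset > 0xFFFF:
            raise ValueError("TS offset does not fit in 16 bits")
        if not 0 < len(nal) <= 0xFFFF:
            raise ValueError("NAL unit size must fit in 16 bits")
        body += struct.pack("!HBH", len(nal), dond, ts_offset) + nal
    return rtp_timestamp, bytes(body)
```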
     
    Figure 11 presents an example of an RTP packet that contains a 
    multi-time aggregation packet of type MTAP16 that contains two 
    multi-time aggregation units, labeled as 1 and 2 in the figure. 
     
     0                   1                   2                   3 
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |                          RTP Header                           | 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |MTAP16 NAL HDR |  decoding order number base   | NALU 1 Size   | 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |  NALU 1 Size  |  NALU 1 DOND  |       NALU 1 TS offset        | 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |  NALU 1 HDR   |  NALU 1 DATA                                  | 
     +-+-+-+-+-+-+-+-+                                               + 
     :                                                               | 
     +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |               | NALU 2 SIZE                   |  NALU 2 DOND  | 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |       NALU 2 TS offset        |  NALU 2 HDR   |  NALU 2 DATA  | 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               | 
     |                                                               | 
     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     |                               :...OPTIONAL RTP padding        | 
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure 11. An example of an RTP packet including a multi-time 
    aggregation packet of type MTAP16 and two multi-time aggregation 
    units. 
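
    As an informative illustration, a receiver could recover the DON
    and the NALU-time of each NAL unit from the MTAP16 layout of
    Figure 11 roughly as follows.  The function name is an assumption
    made for this example.  The sketch assumes network byte order,
    that the size field counts only the octets of the NAL unit, and
    that RTP padding has already been removed.  The NALU-time is
    computed modulo 2^32 because the RTP timestamp is a 32-bit value.

```python
import struct

MTAP16_TYPE = 26  # from Table 2


def parse_mtap16(payload: bytes, rtp_timestamp: int):
    """Return a list of (don, nalu_time, nal_unit) tuples."""
    if len(payload) < 3 or (payload[0] & 0x1F) != MTAP16_TYPE:
        raise ValueError("not an MTAP16 payload")
    (donb,) = struct.unpack_from("!H", payload, 1)
    pos, result = 3, []
    while pos < len(payload):
        if pos + 5 > len(payload):
            raise ValueError("truncated aggregation unit header")
        size, dond, ts_offset = struct.unpack_from("!HBH", payload, pos)
        pos += 5
        if pos + size > len(payload):
            raise ValueError("NAL unit extends past end of payload")
        don = donb + dond
        if don > 0xFFFF:
            raise ValueError("derived DON outside 0..65535")
        # TS offset = NALU-time - RTP timestamp, so invert the formula.
        nalu_time = (rtp_timestamp + ts_offset) & 0xFFFFFFFF
        result.append((don, nalu_time, payload[pos:pos + size]))
        pos += size
    return result
```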
     
     
 5.6.      Fragmentation Units (FUs) 
     
    This payload type allows fragmenting a NAL unit into several RTP 
    packets.  Doing so on the application layer instead of relying on 
    lower layer fragmentation (e.g. by IP) has the following 
    advantages: 
     
    o The payload format is capable of transporting NAL units larger 
      than 64 kbytes over an IPv4 network.  Such NAL units may be 
      present in pre-recorded video, particularly in High Definition 
      formats (there is a limit on the number of slices per picture, 
      which results in a limit on the number of NAL units per picture, 
      which may result in large NAL units). 
    o The fragmentation mechanism allows fragmenting a single picture 
      and applying generic forward error correction, see e.g. [15], to 
      fragmentation units with identical timestamp.   
     
    Fragmentation is defined only for a single NAL unit, and not for 
    any aggregation packets.  A fragment of a NAL unit consists of an 
    integer number of consecutive octets of that NAL unit.  Each octet 
    of the NAL unit MUST be part of exactly one fragment of that NAL 
    unit. Fragments of the same NAL unit MUST be sent in consecutive 
    order with ascending RTP sequence numbers (with no other RTP 
    packets within the same RTP packet stream being sent between the 
    first and last fragment).  Similarly, a NAL unit MUST be 
    reassembled in RTP sequence number order. 
     
    When a NAL unit is fragmented and conveyed within fragmentation 
    units (FUs), it is referred to as a fragmented NAL unit.  STAPs and 
    MTAPs MUST NOT be fragmented. 
  
    The RTP timestamp of an RTP packet carrying an FU is set to the 
    NALU-time of the fragmented NAL unit.   
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    | FU indicator  |   FU header   |        DON (optional)         | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-| 
    |                                                               | 
    |                         FU payload                            | 
    |                                                               | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               :...OPTIONAL RTP padding        | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure 12. RTP payload format for fragmentation unit. 
     
    The RTP payload of a fragmentation unit consists of a fragmentation 
    unit indicator of one octet, a fragmentation unit header of one 
    octet, a 16-bit unsigned decoding order number (DON) that is 
    conditionally present, and a fragmentation unit payload.   
     
    A value equal to 28 in the NAL unit type field of the FU indicator 
    octet identifies an FU.  The FU indicator octet has the following 
    format: 
     
    +---------------+ 
    |0|1|2|3|4|5|6|7| 
    +-+-+-+-+-+-+-+-+ 
    |F|NRI|    28   | 
    +---------------+ 
  
    The use of the F bit is described in section 1.3.  The value of the 
    NRI field MUST be set according to the value of the NRI field in 
    the fragmented NAL unit. 
     
    The FU header has the following format: 
     
    +---------------+ 
    |0|1|2|3|4|5|6|7| 
    +-+-+-+-+-+-+-+-+ 
    |S|E|R|  Type   | 
    +---------------+ 
     
    S: 1 bit 
       The Start bit, when one, indicates the start of a fragmented NAL 
       unit. Otherwise, when the following FU payload is not the start 
       of a fragmented NAL unit payload, the Start bit is set to zero. 
        
    E: 1 bit 
       The End bit, when one, indicates the end of a fragmented NAL 
       unit, i.e. the last byte of the payload is also the last byte of 
       the fragmented NAL unit. Otherwise, when the following FU 
       payload is not the last fragment of a fragmented NAL unit, the 
       End bit is set to zero. 
        
    R: 1 bit 
       The Reserved bit MUST be equal to 0 and MUST be ignored by the 
       receiver. 
        
    Type: 5 bits 
       The NAL Unit payload type as defined in table 7-1 of [1]. 
  
    DON is present when the S bit in the FU header is equal to 1.  The 
    value of DON is selected as described in section 5.3. 
     
    Informative note: The DON field in FUs allows gateways to fragment 
    NAL units to FUs without organizing the incoming NAL units to the 
    NAL unit decoding order.  
     
    A fragmented NAL unit MUST NOT be transmitted in one FU, i.e., 
    Start bit and End bit MUST NOT both be set to one in the same FU 
    header.  A fragmentation unit MUST NOT contain an aggregation 
    packet.  Fragmentation units of a NAL unit MUST be sent in 
    consecutive packets of the same RTP stream. 
     
    The FU payload consists of fragments of the payload of the 
    fragmented NAL unit such that if the fragmentation unit payloads of 
    consecutive FUs are sequentially concatenated, the payload of the 
    fragmented NAL unit is reconstructed.  Note that the NAL unit type 
    octet of the fragmented NAL unit is not included as such in the 
    fragmentation unit payload, but rather the information of the NAL 
    unit type octet of the fragmented NAL unit is conveyed in F and NRI 
    fields of the FU indicator octet of the fragmentation unit and in 
    the type field of the FU header.  An FU payload MAY have any number 
    of octets and MAY be empty. 
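
    As an informative illustration, the construction of the FU
    indicator, the FU header, and the FU payloads can be sketched in
    Python as follows.  The function name and the max_fragment_size
    argument (the maximum number of FU payload octets per FU, not
    counting the FU indicator, FU header, or DON) are assumptions made
    for this example.  The sketch assumes that DON is carried in
    network byte order.

```python
import struct

FU_TYPE = 28


def fragment_nal_unit(nal: bytes, don: int, max_fragment_size: int):
    """Split one NAL unit into a list of FU RTP payloads."""
    if max_fragment_size < 1:
        raise ValueError("max_fragment_size must be positive")
    header, body = nal[0], nal[1:]
    if len(body) <= max_fragment_size:
        # Start and End bits must not both be set in one FU header.
        raise ValueError("NAL unit must not be carried in a single FU")
    fu_indicator = (header & 0xE0) | FU_TYPE  # keep F and NRI, type 28
    nal_type = header & 0x1F                  # goes into the FU header
    chunks = [body[i:i + max_fragment_size]
              for i in range(0, len(body), max_fragment_size)]
    fus = []
    for index, chunk in enumerate(chunks):
        start = index == 0
        end = index == len(chunks) - 1
        # S in bit 0, E in bit 1, R (bit 2) stays 0, then 5 type bits.
        fu_header = (0x80 if start else 0) | (0x40 if end else 0) | nal_type
        fu = bytes([fu_indicator, fu_header])
        if start:
            fu += struct.pack("!H", don)      # DON present when S = 1
        fus.append(fu + chunk)
    return fus
```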
     
    If a fragmentation unit is lost, the receiver SHOULD discard all 
    following fragmentation units in transmission order corresponding 
    to the same fragmented NAL unit. 
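
    As an informative illustration, reassembly of a fragmented NAL
    unit with the loss handling recommended above can be sketched in
    Python as follows.  The function name and the input layout are
    assumptions made for this example.  The sketch assumes that DON is
    carried in network byte order and detects a lost FU as a gap in
    the 16-bit RTP sequence numbers.

```python
import struct


def reassemble_fu(packets):
    """packets: list of (rtp_seq, fu_payload) in sequence number order,
    all belonging to one fragmented NAL unit.

    Returns (don, nal_unit), or None if any fragment is missing.
    """
    if not packets:
        return None
    first_seq, first = packets[0]
    if len(first) < 4 or not (first[1] & 0x80):
        return None  # the start fragment (carrying DON) was lost
    (don,) = struct.unpack_from("!H", first, 2)
    # Rebuild the NAL unit type octet: F and NRI come from the FU
    # indicator, the type comes from the FU header.
    nal = bytearray([(first[0] & 0xE0) | (first[1] & 0x1F)])
    nal += first[4:]
    expected_seq = (first_seq + 1) & 0xFFFF
    for seq, fu in packets[1:]:
        if seq != expected_seq or len(fu) < 2 or (fu[1] & 0x80):
            return None  # a fragment was lost: discard the remainder
        nal += fu[2:]
        if fu[1] & 0x40:  # End bit: the NAL unit is complete
            return don, bytes(nal)
        expected_seq = (seq + 1) & 0xFFFF
    return None  # the end fragment never arrived
```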
  
  
 6.    Packetization Rules 
  
    Two cases of packetization rules are distinguished, according to 
    whether NAL units belonging to more than a single picture may be 
    placed into a single aggregation packet (using MTAPs). 
     
     
 6.1.      Unrestricted Mode (Multiple Picture Model) 
  
    This mode MAY be supported by some receivers.  Usually, the 
    capability of a receiver to support this mode is implied by the 
    application, or indicated by external control protocol means.  The 
    use of this mode MUST be signaled with the optional mtap-allowed 
    MIME or SDP parameter, if MIME or SDP signaling is in use.  The 
    following packetization rules MUST be enforced by the sender: 
     
    o All packet types specified in this memo MAY be used. 
     
    o The transmission order of packets and NAL units is constrained as 
      specified in section 5.3.  Some receivers might not support a 
      transmission order that does not conform to the NAL unit decoding 
      order. 
     
    o Coded slice NAL units or coded slice data partition NAL units 
      belonging to the same coded picture (and hence sharing the same 
      RTP timestamp value) MAY be sent in any order permitted by the 
      applicable profile defined in [1], although, for delay-critical 
      systems, they SHOULD be sent in their original coding order to 
      minimize the delay.  Note that the coding order is not 
      necessarily the scan order, but the order the NAL packets become 
      available to the RTP stack.  
     
    o Sequence and picture parameter set NAL units MUST NOT be sent in 
      an RTP session whose parameter sets were already changed by 
      control protocol messages during the lifetime of the RTP session. 
     
    o Network elements such as gateways MAY convert single NAL unit 
      packets into one aggregation packet, convert an aggregation 
      packet into several single NAL unit packets, or mix both 
      concepts.  However, when doing so they SHOULD take into account 
      at least the following parameters: path MTU size, unequal 
      protection mechanisms (e.g. through packet-based FEC according to 
      RFC2733, carried by RFC2198, especially for sequence and picture 
      parameter set NAL units and coded slice data partition A NAL 
      units), bearable latency of the system, and buffering 
      capabilities of the receiver. 
     
    o Network elements such as gateways MUST NOT duplicate any NAL unit 
      except for sequence or picture parameter set NAL units, because 
      neither this memo nor the H.264 specification provides means to 
      identify duplicated NAL units.  Sequence and picture parameter 
      set NAL units MAY be duplicated to make their correct reception 
      more probable, but any such duplication MUST NOT affect the 
      contents of any active sequence or picture parameter set. 
     
     
 6.2.      Restricted Mode (Single Picture Model) 
     
    This mode MUST be supported by all receivers.  It is primarily 
    intended for low-delay applications.  Its main difference from the 
    Unrestricted Mode is to forbid the packetization of data belonging 
    to more than one picture in a single RTP packet.  Hence, MTAPs MUST 
    NOT be used.  The following packetization rules MUST be enforced by 
    the sender: 
     
    o All rules of the Unrestricted Mode above, but MTAPs MUST NOT be 
      used.  This implies that aggregation packets MUST NOT include 
      coded slices or coded slice data partitions belonging to 
      different access units. 
     
     
 7.    De-Packetization Process (Informative) 
  
    The de-packetization process is implementation dependent.  Hence, 
    the following description should be seen as an example of a 
    suitable implementation.  Other schemes may be used as well.  
    Optimizations relative to the described algorithms are likely 
    possible. 
     
    The general concept behind these de-packetization rules is to 
    reorder NAL units from transmission order to the NAL unit decoding 
    order.   
     
    The receiver includes a receiver buffer, which is used to reorder 
    packets from transmission order to the NAL unit decoding order.  
    The receiver may use the following guidelines when determining the 
    size of the receiver buffer.  The optional interleaving-depth MIME 
    parameter indicates the size of the receiver buffer as the number 
    of VCL NAL units.  The size of the receiver buffer in bytes may be 
    estimated by multiplying the optional initial-buffering-time MIME 
    parameter with the bandwidth-value of the corresponding media level 
    SDP description parameter, if available (and taking into account 
    the necessary conversions between the units of initial-buffering-
    time and bandwidth-value).  The receiver should take buffering for 
    transmission delay jitter into account too and either reserve a 
    separate buffer for transmission delay jitter buffering or combine 
    the buffer for transmission delay jitter with the receiver buffer. 
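
    As an informative illustration, the byte-size estimate described
    above amounts to the following arithmetic.  The function name and
    the units are assumptions made for this example: the sketch
    assumes that the caller has already converted the initial
    buffering time to seconds and that the bandwidth value is given in
    kilobits per second.  Buffering for transmission delay jitter is
    not included.

```python
import math


def estimate_receiver_buffer_bytes(initial_buffering_time_s: float,
                                   bandwidth_kbps: float) -> int:
    """Estimate the receiver buffer size in bytes.

    Both arguments are assumed to be pre-converted: time in seconds,
    bandwidth in kilobits per second.
    """
    bits = initial_buffering_time_s * bandwidth_kbps * 1000
    return math.ceil(bits / 8)
```

    Under these assumptions, half a second of buffering at 384 kbit/s
    corresponds to an estimate of 24000 bytes.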
     
    The receiver stores incoming NAL units in reception order into the 
    receiver buffer as follows.  NAL units contained in a single NAL 
    unit packet are stored into the receiver buffer as such.  NAL units 
    of aggregation packets are stored into the receiver buffer 
    individually.  For each NAL unit of a single NAL unit packet and an 
    aggregation packet, the RTP sequence number of the packet that 
    contained the NAL unit is stored and associated with the stored NAL 
    unit.  All fragments of a fragmented NAL unit are collected and 
    stored into the receiver buffer.  For a fragmented NAL unit, the 
    RTP sequence number of the RTP packet containing the first fragment 
    in transmission order is stored and associated with the fragmented 
    NAL unit.  Moreover, for each NAL unit, the packet type (single NAL 
    unit packet, STAP-A, STAP-B, MTAP16, MTAP24, or fragmentation unit) 
    that contained the NAL unit is stored and associated with each 
    stored NAL unit.  Furthermore, for NAL units carried in STAP-Bs, 
    MTAPs, and FUs, decoding order number (DON) is calculated and 
    stored. 
     
    Hereinafter, let N be the value of the optional interleaving-depth 
    MIME type parameter (see section 8.1).  When the RTP session is 
    initialized, the receiver buffers at least N VCL NAL units or at 
    least for the duration equal to the value of the optional initial-
    buffering-time MIME parameter, if available, whichever results in 
    the shorter initial buffering duration, before passing any packet to 
    the decoder. 
     
    If the receiver buffer contains at least N VCL NAL units, NAL units 
    are removed from the receiver buffer and passed to the decoder in 
    the order specified below until the buffer contains N-1 VCL NAL 
    units.  The variable ts is set to the value of a system timer that 
    was initialized to 0 when the first packet of the NAL unit stream 
    was received.  If the receiver buffer contains a NAL unit whose 
    reception time tr fulfills the condition that ts - tr > initial-
    buffering-time, NAL units are passed to the decoder (and removed 
    from the receiver buffer) in the order specified below until the 
    receiver buffer contains no NAL unit whose reception time tr 
    fulfills the specified condition.  Note that transmission delay 
    jitter should be taken into account in the calculations with 
    timestamps. 
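
    The two release conditions above can be sketched as follows.  This 
    is an illustrative, non-normative Python sketch: the names 
    BufferedNal and should_release and their arguments are assumptions 
    introduced here, all times are assumed to be expressed in 90-kHz 
    ticks to match initial-buffering-time, and delay jitter is not 
    modeled.  The order of removal itself is specified below. 

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BufferedNal:
    """Hypothetical record for one NAL unit held in the receiver buffer."""
    is_vcl: bool  # True for VCL NAL units (coded slices, data partitions)
    tr: int       # reception time, in 90-kHz ticks of the system timer


def should_release(buffer: List[BufferedNal], n: int, ts: int,
                   initial_buffering_time: int) -> bool:
    """Return True if at least one NAL unit should go to the decoder.

    n is the interleaving-depth value (N in the text), ts is the current
    system timer value, and initial_buffering_time is the signaled
    parameter; all three are assumed to be known to the caller.
    """
    vcl_count = sum(1 for nal in buffer if nal.is_vcl)
    # Condition 1: the buffer holds at least N VCL NAL units.
    if buffer and vcl_count >= n:
        return True
    # Condition 2: some NAL unit has waited longer than
    # initial-buffering-time (ts - tr > initial-buffering-time).
    return any(ts - nal.tr > initial_buffering_time for nal in buffer)
```

    A receiver loop would call such a check after each arrival and 
    keep removing NAL units while it returns True. 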
     
    The order in which NAL units are passed to the decoder is specified 
    as follows: 
     
    o Let PDON be a variable that is initialized to 0 at the beginning 
      of an RTP session. 
     
    o If the oldest RTP sequence number in the buffer corresponds to a 
      single NAL unit packet or an STAP-A, the NAL unit in the single 
      NAL unit packet or the STAP-A, respectively, is the next NAL unit 
      in the NAL unit decoding order.  The value of PDON is set to 0. 
     
    o If the oldest RTP sequence number in the buffer corresponds to an 
      STAP-B, an MTAP, or an FU, the NAL unit decoding order is 
      recovered among the NAL units conveyed in STAP-Bs, MTAPs, and FUs 
      in RTP sequence number order until the next single NAL unit 
      packet or the next STAP-A (exclusive).  This set of NAL units is 
      hereinafter referred to as the candidate NAL units.  If no NAL 
      units conveyed in single NAL unit packets or STAP-As reside in 
      the receiver buffer, all NAL units are candidate NAL units. 
     
    o For each NAL unit among the candidate NAL units, a DON distance 
      is calculated as follows.  If the value of DON of the NAL unit is 
      larger than the value of PDON, the DON distance is equal to DON - 
      PDON.  Otherwise, the DON distance is equal to 65535 - PDON + DON 
      + 1.  NAL units are delivered to the decoder in ascending order 
      of DON distance.  When the desired number of NAL units has been 
      passed to the decoder, the value of PDON is set to the value of 
      DON for the last NAL unit passed to the decoder. 
     
    o If several NAL units share the same DON distance, they can be 
      passed to the decoder in any order.  
     
    o If the video decoder in use does not support Arbitrary Slice 
      Order, coded slices and coded slice data partitions A are 
      decoded in ascending order of the first_mb_in_slice syntax 
      element in the slice header.  Moreover, coded slice data 
      partitions B and C immediately follow the corresponding coded 
      slice data partition A in decoding order. 
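
    The DON distance rule in the list above can be sketched as 
    follows.  This is a minimal, non-normative Python illustration; 
    the function names and the (DON, label) tuple layout are 
    assumptions made for the example, and selection of the candidate 
    NAL units and the update of PDON are left to the caller. 

```python
from typing import Iterable, List, Tuple


def don_distance(don: int, pdon: int) -> int:
    """DON distance as described in the text, for 16-bit DON values."""
    if don > pdon:
        return don - pdon
    # Wrap-around case: DON is not larger than PDON.
    return 65535 - pdon + don + 1


def decoding_order(candidates: Iterable[Tuple[int, str]],
                   pdon: int) -> List[Tuple[int, str]]:
    """Sort candidate (DON, label) pairs by ascending DON distance.

    sorted() is stable, so NAL units that share a DON distance keep
    their arrival order, which is one of the permitted orders.
    """
    return sorted(candidates, key=lambda nal: don_distance(nal[0], pdon))
```

    For example, with PDON equal to 65533, NAL units with DON values 
    65535, 0, and 2 have DON distances 2, 3, and 5, and are therefore 
    delivered in that order. 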
     
    The following additional de-packetization rules may be used to 
    implement an operational H.264 de-packetizer: 
     
    o Intelligent RTP receivers (e.g. in gateways) may identify lost 
      coded slice data partitions A (DPAs). If a lost DPA is found, a 
      gateway may decide not to send the corresponding coded slice data 
      partitions B and C, as their information is meaningless for H.264 
      decoders.  In this way a network element can reduce network load 
      by discarding useless packets, without parsing a complex bit 
      stream. 
     
    o Intelligent receivers may discard all packets in which the value 
      of the NRI field of the NAL unit type octet is equal to 0.  
      However, they process those packets if possible, because the user 
      experience may suffer if the packets are discarded. 
     
     
 8.    Payload Format Parameters 
  
    This section specifies the parameters that MAY be used to select 
    optional features of the payload format.  The parameters are 
    specified here as part of the MIME subtype registration for the 
    ITU-T H.264 | ISO/IEC 14496-10 codec.  A mapping of the parameters 
    into the Session Description Protocol (SDP) [4] is also provided 
    for those applications that use SDP.  Equivalent parameters could 
    be defined elsewhere for use with control protocols that do not use 
    MIME or SDP. 
     
     
 8.1.      MIME Registration 
     
    The MIME subtype for the ITU-T H.264 | ISO/IEC 14496-10 codec is 
    allocated from the IETF tree.   
     
    Any unspecified parameter MUST be ignored by the receiver. 
     
    Media Type name:     video 
     
    Media subtype name:  H264 
     
    Required parameters: none 
     
    Optional parameters: 
       profile-level-id: A profile-level element used in specifying the 
                         value of this parameter is generated by 
                         forming a string of hexadecimal 
                         representations of the following three bytes 
                         in the sequence parameter set NAL unit 
                         specified in [1]: 1) profile_idc, 2) a byte 
                         herein referred to as profile-iop, composed of 
                         the values of constrained_set0_flag, 
                         constrained_set1_flag, constrained_set2_flag, 
                         and reserved_zero_5bits in bit-significance 
                         order starting from the most significant bit, 
                         and 3) level_idc.  Note that 
                         reserved_zero_5bits is required to be equal to 
                         0 in [1], but other values for it may be 
                         specified in the future by ITU-T or ISO/IEC.  
                         The value of profile-level-id is a sequence of 
                         profile-level elements.  If this parameter is 
                         used for indicating properties of a NAL unit 
                         stream, it indicates the profiles that are in 
                         use in the stream and the highest level that 
                         is in use for each signaled profile.  The 
                         profile-iop byte for each signaled profile 
                         indicates whether the NAL unit stream also 
                         obeys all constraints of the indicated 
                         profiles as follows.  If bit 7 (the most 
                         significant bit), bit 6, or bit 5 of profile-
                         iop is equal to 1, all constraints of the 
                         Baseline profile, the Main profile, or the 
                         Extended profile, respectively, are obeyed in 
                         the NAL unit stream.  If the profile-level-id 
                         parameter is used for capability exchange or 
                         session setup procedure, it indicates the 
                         profiles that the codec supports and the 
                         highest level that is supported for each 
                         signaled profile.  The profile-iop byte for 
                         each signaled profile indicates whether the 
                         codec has such additional limitations that 
                         only the common subset of the algorithmic 
                         features and limitations of the profiles 
                         signaled with the profile-iop byte and the 
                         profile indicated by profile_idc is supported 
                         by the codec.  For example, if a codec 
                         supports the Baseline Profile at level 3 and 
                         below and the Main Profile at level 2.1 and 
                         below without any additional limitations, the 
                         profile-level-id becomes 42A01E4D4015.  If a 
                         codec supports only the common subset of the 
                         coding tools of the Baseline profile and the 
                         Main profile at level 2.1 and below, the 
                         profile-level-id becomes 42E015.  If no 
                         profile-level-id is present, the Baseline 
                         Profile without additional constraints at 
                         Level 1 MUST be implied. 
        
       parameter-sets:   This parameter MAY be used to convey such 
                         sequence and picture parameter set NAL units, 
                         herein referred to as the initial parameter 
                         set NAL units, that MUST precede any other NAL 
                         units in decoding order.  The parameter MUST 
                         NOT be used to indicate codec capability in 
                         any capability exchange procedure.  The value 
                         of the parameter is the hexadecimal 
                         representation of the initial parameter set 
                         NAL units as specified in sections 7.3.2.1 and 
                         7.3.2.2 of [1].  The parameter sets are 
                         conveyed in decoding order and no framing of 
                         the parameter set NAL units takes place.  A 
                         comma is used to separate any pair of 
                         parameter sets in the list.  Note that the 
                         number of bytes in a parameter set NAL unit is 
                         typically less than 10 bytes, but a picture 
                         parameter set NAL unit can contain several 
                         hundreds of bytes. 
        
       interleaving-depth: This parameter MAY be used to signal the 
                         properties of a NAL unit stream or the 
                         capabilities of a transmitter or receiver 
                         implementation.  The parameter specifies the 
                         maximum number of VCL NAL units that precede 
                         any VCL NAL unit in the NAL unit stream in 
                         transmission order and follow the VCL NAL unit 
                         in decoding order.  If the parameter is not 
                         present, then a value of 0 MUST be used for 
                         interleaving-depth.  The value of 
                         interleaving-depth MUST be an integer in the 
                         range from 0 to 32767, inclusive. 
  
       initial-buffering-time: This parameter MAY be used to signal the 
                         properties of a NAL unit stream or the 
                         capabilities of a transmitter or receiver 
                         implementation.  The parameter is a decimal 
                         representation of the maximum time in clock 
                         ticks of a 90-kHz clock between the 
                         transmission time of any NAL unit and the 
                         decoding time of the NAL unit assuming 
                         reliable and instantaneous transmission and 
                         the same timeline for transmission and 
                         decoding.  If the parameter is not present, 
                         then a value of 0 MUST be used for initial-
                         buffering-time.  The value of initial-
                         buffering-time MUST be an integer in the range 
                         from 0 to 4 294 967 295, inclusive.  Receivers 
                         SHOULD take transmission delay jitter 
                         buffering, including buffering for the delay 
                         jitter caused by mixers, translators, 
                         gateways, proxies, traffic-shapers and other 
                         network elements, into account in addition to 
                         the signaled initial-buffering-time. 
  
       mtap-allowed:    Permissible values are 0 and 1.  If 0 or not 
                         present, MTAPs MUST NOT be present in the NAL 
                         unit stream. If 1, MTAPs MAY be present in the 
                         NAL unit stream. 
  
    Encoding considerations: 
                         This type is only defined for transfer via RTP 
                         (RFC 1889). 
     
    Security considerations: 
                         See section 9 of RFC XXXX. [Ed.Note: to be 
                         replaced with the RFC number of this 
                         specification] 
     
    Public specification: 
                         Please refer to RFC XXXX [Ed.Note: to be 
                         replaced with the RFC number of this 
                         specification] and its section 15. 
     
    Additional information: 
                         None 
     
    File extensions:     none 
    Macintosh file type code: none 
    Object identifier or OID: none 
     
    Person & email address to contact for further information: 
                         stewe@cs.tu-berlin.de 
     
    Intended usage:      COMMON. 
     
    Author/Change controller: 
                         stewe@cs.tu-berlin.de 
                         IETF Audio/Video transport working group 
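
    As an illustration of the profile-level-id format defined in this 
    registration, the following non-normative Python sketch splits a 
    value into its three-byte profile-level elements.  The names 
    ProfileLevel and parse_profile_level_id are assumptions made for 
    the example; only the byte layout and the meaning of bits 7, 6, 
    and 5 of profile-iop are taken from the parameter description. 

```python
from typing import List, NamedTuple


class ProfileLevel(NamedTuple):
    """One profile-level element (hypothetical container type)."""
    profile_idc: int
    baseline_constraints: bool   # bit 7 of profile-iop
    main_constraints: bool       # bit 6 of profile-iop
    extended_constraints: bool   # bit 5 of profile-iop
    level_idc: int


def parse_profile_level_id(value: str) -> List[ProfileLevel]:
    """Split a profile-level-id string into 3-byte elements."""
    raw = bytes.fromhex(value)
    if len(raw) == 0 or len(raw) % 3 != 0:
        raise ValueError("profile-level-id must be a multiple of 3 bytes")
    elements = []
    for i in range(0, len(raw), 3):
        profile_idc, iop, level_idc = raw[i], raw[i + 1], raw[i + 2]
        elements.append(ProfileLevel(
            profile_idc,
            bool(iop & 0x80),
            bool(iop & 0x40),
            bool(iop & 0x20),
            level_idc,
        ))
    return elements
```

    Applied to the example value 42A01E4D4015, the sketch yields two 
    elements: profile_idc 0x42 with level_idc 0x1E, and profile_idc 
    0x4D with level_idc 0x15. 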
     
     
 8.2.      SDP Parameters 
     
    The MIME media type video/H264 string is mapped to fields in the 
    Session Description Protocol (SDP) [4] as follows: 
     
    o The media name in the "m=" line of SDP MUST be video. 
     
    o The encoding name in the "a=rtpmap" line of SDP MUST be H264 (the 
      MIME subtype). 
     
    o The clock rate in the "a=rtpmap" line MUST be 90000. 
     
    o The "a=fmtp" line of SDP MUST contain the optional parameters 
      "profile-level-id", "parameter-sets", "interleaving-depth", 
      "initial-buffering-time", and "mtap-allowed", if any, to indicate 
      the coder capability and configuration.  These parameters are 
      expressed as a MIME media type string, in the form of a 
      semicolon-separated list of parameter=value pairs. 
     
    An example of media representation in SDP is as follows (Baseline 
    Profile, Level 3.0, more than one slice group, arbitrary slice 
    ordering, and redundant slices are in use): 
     
    m=video 49170/2 RTP/AVP 98 
    a=rtpmap:98 H264/90000 
    a=fmtp:98 profile-level-id=42A01E 
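
    A receiver might read such an "a=fmtp" line roughly as sketched 
    below.  This Python sketch is illustrative only: the function 
    names are assumptions, no validation of parameter values is 
    performed, and absent parameters are left for the caller to 
    default as described in section 8.1. 

```python
from typing import Dict, List


def parse_fmtp(line: str) -> Dict[str, str]:
    """Parse an "a=fmtp" line into a dict of parameter names to values."""
    prefix, _, params = line.partition(" ")
    if not prefix.startswith("a=fmtp:"):
        raise ValueError("not an a=fmtp line")
    result = {}
    # Parameters form a semicolon-separated list of name=value pairs.
    for item in params.split(";"):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        result[name.strip()] = value.strip()
    return result


def parameter_set_nal_units(params: Dict[str, str]) -> List[bytes]:
    """Decode the comma-separated hexadecimal parameter-sets value."""
    value = params.get("parameter-sets", "")
    return [bytes.fromhex(item) for item in value.split(",") if item]
```

    For the example above, parse_fmtp returns a single entry mapping 
    "profile-level-id" to "42A01E", and parameter_set_nal_units 
    returns an empty list because no parameter sets are conveyed. 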
     
     
 9.    Security Considerations 
  
    RTP packets using the payload format defined in this specification 
    are subject to the security considerations discussed in the RTP 
    specification [1], and any appropriate RTP profile (for example 
    [2]).  
    This implies that confidentiality of the media streams is achieved 
    by encryption.  Because the data compression used with this payload 
    format is applied end-to-end, encryption may be performed after   
    compression so there is no conflict between the two operations.  
     
    A potential denial-of-service threat exists for data encodings 
    using compression techniques that have non-uniform receiver-end 
    computational load.  The attacker can inject pathological datagrams 
    into the stream which are complex to decode and cause the receiver 
    to be overloaded.  H.264 is particularly vulnerable to such attacks 
    because it is extremely simple to generate datagrams containing NAL 
    units that affect the decoding process of many future NAL units. 
     
    As with any IP-based protocol, in some circumstances a receiver may 
    be overloaded simply by the receipt of too many packets, either 
    desired or undesired.  Network-layer authentication may be used to 
    discard packets from undesired sources, but the processing cost of 
    the authentication itself may be too high.  In a multicast 
    environment, pruning of specific sources may be implemented in 
    future versions of IGMP [5] and in multicast routing protocols to 
    allow a receiver to select which sources are allowed to reach it.  
     
     
 10.     Informative Appendix: Application Examples 
  
    This payload specification is very flexible in its use, to cover 
    the extremely wide application space that is anticipated for 
    H.264.  However, such great flexibility also makes it difficult 
    for an implementer to decide on a reasonable packetization scheme.  
    Some information on how to apply this specification to real-world 
    scenarios is likely to appear in the form of academic publications 
    and a test model software and description in the near future.  
    However, some preliminary usage scenarios are described here as 
    well.   
     
     
 10.1.       Video Telephony according to ITU-T Recommendation H.241  
      Annex A 
     
    H.323-based video telephony systems that use H.264 as an optional 
    video compression scheme are required to support H.241 Annex A as a 
    packetization scheme.  The packetization mechanism defined in this 
    Annex is technically identical with a small subset of this 
    specification. 
    When operating according to H.241 Annex A, parameter set NAL units 
    are sent in-band.  Only single NAL unit packets are used.  A 
    typical packet stream generated by such a system consists of all 
    sequence and picture parameter sets used for the future video 
    sequence, possibly sent in more than one copy to raise the 
    likelihood of their arrival at the receiver, followed by the 
    packets carrying the NAL units of the IDR picture, and followed by 
    packets carrying the subsequent pictures.  Many such systems do 
    not send IDR pictures regularly, but only when required by user 
    interaction or by control protocol means, e.g. when switching 
    between video channels in a Multipoint Control Unit. 
     
 10.2.       Video Telephony, No Slice Data Partitioning, No NAL unit  
      Aggregation 
  
    The RTP part of this scheme is implemented and tested (though not 
    the control-protocol part, see below). 
     
    In most real-world video telephony applications, the picture 
    parameters such as picture size or optional modes never change 
    during the lifetime of a connection.  Hence, all necessary 
    parameter sets (usually only one) are sent as a side effect of the 
    capability exchange/announcement process e.g. according to the SDP 
    syntax specified in section 8.2 of this document.  Since all 
    necessary parameter set information is established before the RTP 
    session starts, there is no need for sending any parameter set NAL 
    units.  Slice data partitioning is not used either.  Hence, the RTP 
    packet stream consists basically of NAL units that carry single 
    coded slices. 
     
    The size of those single-slice NAL units is chosen by the encoder 
    such that they offer the best performance.  Often, this is done by 
    adapting the coded slice size to the MTU size of the IP network.  
    For small picture sizes this may result in a one-picture-per-one-
    packet strategy.  The loss of packets and the resulting drift-
    related artifacts are cleaned up by Intra refresh algorithms. 
     
     
 10.3.       Video Telephony, Interleaved Packetization Using NAL unit 
      Aggregation 
  
    This scheme allows better error concealment and is widely used in 
    H.263-based designs using RFC 2429 packetization.  It has also 
    been implemented, and good results were reported [10].  
     
    The VCL encoder codes the source picture such that all macroblocks 
    (MBs) of one MB line are assigned to one slice.  All slices with 
    even MB row addresses are combined into one STAP, and all slices 
    with odd MB row addresses into another STAP.  Those STAPs are 
    transmitted as RTP packets.  The establishment of the parameter 
    sets is performed as discussed above. 
     
    Note that the use of STAPs is essential here, because the high 
    number of individual slices (18 for a CIF picture) would lead to 
    unacceptably high IP/UDP/RTP header overhead (unless the source 
    coding tool FMO is used, which is not assumed in this scenario).  
    Furthermore, some wireless video transmission systems, such as 
    H.324M and the IP-based video telephony specified in 3GPP, are 
    likely to use relatively small transport packet size.  For example, 
    a typical MTU size of H.223 AL3 SDU is around 100 bytes [13].  
    Coding individual slices according to this packetization scheme 
    provides a further advantage in communication between wired and 
    wireless networks, as individual slices are likely to be smaller 
    than the preferred maximum packet size of wireless systems.  
    Consequently, a gateway can convert the STAPs used in a wired 
    network to several RTP packets with only one NAL unit that are 
    preferred in a wireless network and vice versa.  
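
    The grouping and the header overhead argument above can be made 
    concrete with a small calculation.  The Python sketch below is 
    illustrative only: the function names are assumptions, and the 
    40-byte figure assumes an IPv4 header without options, a UDP 
    header, and an RTP header without CSRC entries or extensions.  The 
    aggregation overhead inside each STAP is not counted. 

```python
from typing import List, Tuple

# IPv4 (20) + UDP (8) + RTP (12) header bytes, no options or CSRCs.
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12


def split_rows_into_staps(mb_rows: int) -> Tuple[List[int], List[int]]:
    """One slice per MB row; even rows go to one STAP, odd rows to another."""
    even = [row for row in range(mb_rows) if row % 2 == 0]
    odd = [row for row in range(mb_rows) if row % 2 == 1]
    return even, odd


def header_overhead(packets: int) -> int:
    """IP/UDP/RTP header bytes spent on the given number of packets."""
    return packets * IP_UDP_RTP_HEADER_BYTES
```

    For a CIF picture with 18 MB rows, sending each slice in its own 
    packet costs header_overhead(18) = 720 header bytes per picture, 
    whereas two STAPs cost header_overhead(2) = 80 bytes. 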
     
     
 10.4.       Video Telephony, with Data Partitioning 
  
    This scheme is implemented and was shown to offer good performance 
    especially at higher packet loss rates [10]. 
     
    Data Partitioning is known to be useful only when some form of 
    unequal error protection is available.  Normally, in single-session 
    RTP environments, even error characteristics are assumed -- 
    statistically, the packet loss probability of all packets of the 
    session is the same.  However, there are means to reduce the packet 
    loss probability of individual packets in an RTP session.  RFC 2198 
    [14], for example, allows carrying a redundant copy of an essential 
    packet in the next RTP packet.  Packet-based Forward Error 
    Correction [15] carried in RFC2198 is also an appropriate means to 
    protect high priority information. 
     
    In all cases, the incurred overhead is substantial, but of the same 
    order of magnitude as the number of bits that would otherwise have 
    to be spent for intra information.  However, this mechanism does 
    not add any delay to the system.   
     
    Again, the complete parameter set establishment is performed 
    through control protocol means. 
     
     
 10.5.       Low-Bit-Rate Streaming 
  
    This scheme has been implemented with H.263 and gave good results 
    [16].  There is no technical reason why similarly good results 
    could not be achievable with H.264.  
     
    In today's Internet streaming, some of the offered bit-rates are 
    relatively low in order to allow terminals with dial-up modems to 
    access the content.  In wired IP networks, relatively large 
    packets, say 500 - 1500 bytes, are preferred to smaller and more 
    frequently occurring packets in order to reduce network congestion. 
    Moreover, use of large packets decreases the amount of RTP/UDP/IP 
    header overhead.  For low-bit-rate video, the use of large packets 
    means that sometimes up to a few pictures should be encapsulated 
    in one packet.  
     
    However, loss of a packet including many coded pictures would have 
    drastic consequences in visual quality, as there is practically no 
    other way to conceal a loss of an entire picture than to repeat the 
    previous one.  One way to construct relatively large packets and 
    maintain possibilities for successful loss concealment is to 
    construct MTAPs that contain slices from several pictures in an 
    interleaved manner.  An MTAP should not contain spatially adjacent 
    slices from the same picture or spatially overlapping slices from 
    any picture.  If a packet is lost, it is likely that a lost slice 
    is surrounded by spatially adjacent slices of the same picture and 
    spatially corresponding slices of the temporally previous and 
    succeeding pictures.  Consequently, concealment of the lost slice 
    is likely to succeed relatively well. 
     
     
 10.6.       Robust Packet Scheduling in Video Streaming 
     
    This scheme has been implemented with MPEG-4 Part 2 and simulated 
    in a wireless streaming environment [17].  There is no technical 
    reason why similar or better results could not be achievable with 
    H.264. 
     
    Streaming clients typically have a receiver buffer that is capable 
    of storing a relatively large amount of data.  Initially, when a 
    streaming session is established, a client does not start playing 
    the stream back immediately, but rather it typically buffers the 
    incoming data for a few seconds.  This buffering helps to maintain 
    continuous playback, because, in case of occasional increased 
    transmission delays or network throughput drops, the client can 
    decode and play buffered data.  Otherwise, without initial 
    buffering, the client has to freeze the display, stop decoding, and 
    wait for incoming data.  The buffering is also necessary for either 
    automatic or selective retransmission in any protocol level.  If 
    any part of a picture is lost, a retransmission mechanism may be 
    used to resend the lost data.  If the retransmitted data is 
    received before its scheduled decoding or playback time, the loss 
    is perfectly recovered.  Coded pictures can be ranked according to 
    their importance in the subjective quality of the decoded sequence. 
    For example, non-reference pictures, such as conventional B 
    pictures, are subjectively least important, because their absence 
    does not affect decoding of any other pictures.  In addition to 
    non-reference pictures, the ITU-T H.264 | ISO/IEC 14496-10 standard 
    includes a temporal scalability method called sub-sequences [18]. 
    Subjective ranking can also be made on coded slice data partition 
    or slice group basis.  Coded slices and coded slice data partitions 
    that are subjectively the most important can be sent earlier than 
    their decoding order indicates, whereas coded slices and coded 
    slice data partitions that are subjectively the least important can 
    be sent later than their natural coding order indicates. 
    Consequently, any retransmitted parts of the most important slices 
    and coded slice data partitions are more likely to be received 
    before their scheduled decoding or playback time compared to the 
    least important slices and slice data partitions. 
     
     
 11.     Informative Appendix: Rationale for Decoding Order Number 
     
 11.1.       Introduction 
  
    The Decoding Order Number (DON) concept was introduced mainly to 
    enable efficient multi-picture slice interleaving (see section 
    10.5) and robust packet scheduling (see section 10.6).  In both of 
    these applications NAL units are transmitted out of decoding order. 
    DON indicates the decoding order of NAL units and should be used in 
    the receiver to recover the decoding order.  Example use cases for 
    efficient multi-picture slice interleaving and for robust packet 
    scheduling are given in sections 11.2 and 11.3 respectively.  
    Section 11.4 describes the benefits of the DON concept in error 
    resiliency achieved by redundant coded pictures.  Section 11.5 
    summarizes considered alternatives to DON and justifies why DON was 
    chosen for this RTP payload specification. 
     
     
 11.2.       Example of Multi-Picture Slice Interleaving 
  
    An example of multi-picture slice interleaving follows.  A subset 
    of a coded video sequence is depicted below in output order.  R 
    denotes a reference picture, N denotes a non-reference picture, and 
    the number indicates a relative output time. 
     
    ... R1 N2 R3 N4 R5 ... 
     
    The decoding order of these pictures is from left to right as 
    follows: 
    ... R1 R3 N2 R5 N4 ... 
     
    The NAL units of pictures R1, R3, N2, R5, and N4 are marked with a 
    DON equal to 1, 2, 3, 4, and 5, respectively. 
     
    Each reference picture consists of three slice groups that are 
    scattered as follows (a number denotes the slice group number for 
    each macroblock in a QCIF frame): 
     
    0 1 2 0 1 2 0 1 2 0 1 
    2 0 1 2 0 1 2 0 1 2 0 
    1 2 0 1 2 0 1 2 0 1 2 
    0 1 2 0 1 2 0 1 2 0 1 
    2 0 1 2 0 1 2 0 1 2 0 
    1 2 0 1 2 0 1 2 0 1 2 
    0 1 2 0 1 2 0 1 2 0 1 
    2 0 1 2 0 1 2 0 1 2 0 
    1 2 0 1 2 0 1 2 0 1 2 
     
    For the sake of simplicity, we assume that all the macroblocks of a 
    slice group are included in one slice.  Three MTAPs are constructed 
    from three consecutive reference pictures so that each MTAP 
    contains three aggregation units, each of which contains all the 
    macroblocks from one slice group.  The first MTAP contains slice 
    group 0 of picture R1, slice group 1 of picture R3, and slice group 
    2 of picture R5.  The second MTAP contains slice group 1 of picture 
    R1, slice group 2 of picture R3, and slice group 0 of picture R5.  
    The third MTAP contains slice group 2 of picture R1, slice group 0 
    of picture R3, and slice group 1 of picture R5.  Each non-reference 
    picture is encapsulated into an STAP-B. 
     
    Consequently, the transmission order of NAL units is the following: 
      R1, slice group 0, DON 1 
      R3, slice group 1, DON 2 
      R5, slice group 2, DON 4 
      R1, slice group 1, DON 1 
      R3, slice group 2, DON 2 
      R5, slice group 0, DON 4 
      R1, slice group 2, DON 1 
      R3, slice group 0, DON 2 
      R5, slice group 1, DON 4 
      N2,                DON 3 
      N4,                DON 5 
     
    The receiver is able to organize the NAL units back in decoding 
    order based on the value of DON associated with each NAL unit. 
     
    If one of the MTAPs is lost, the spatially adjacent and temporally co-
    located macroblocks are received and can be used to conceal the 
    loss efficiently.  If one of the STAPs is lost, the effect of the 
    loss does not propagate temporally. 
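
    The interleaving in this example can be reproduced with a short, 
    non-normative Python sketch.  The names below are assumptions 
    made for illustration, each NAL unit is modeled as a (picture, 
    slice group, DON) tuple, the non-reference pictures are given a 
    placeholder slice group of 0, and a plain sort by DON is used 
    because no DON wrap-around occurs in this small example. 

```python
from typing import List, Tuple

# (picture, slice group, DON); DON values follow the example in the text.
NalUnit = Tuple[str, int, int]

REFERENCE_PICTURES = [("R1", 1), ("R3", 2), ("R5", 4)]


def build_mtaps() -> List[List[NalUnit]]:
    """Build three MTAPs, rotating the slice group across the pictures."""
    mtaps = []
    for k in range(3):
        mtap = [(name, (k + i) % 3, don)
                for i, (name, don) in enumerate(REFERENCE_PICTURES)]
        mtaps.append(mtap)
    return mtaps


def recover_decoding_order(received: List[NalUnit]) -> List[NalUnit]:
    """Sort by DON; the stable sort keeps arrival order within a picture."""
    return sorted(received, key=lambda nal: nal[2])
```

    Each slice group of each reference picture is carried exactly 
    once, and no MTAP carries two slice groups of the same picture, 
    so the loss of one MTAP removes only one slice group per picture. 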
     
     
 11.3.       Example of Robust Packet Scheduling 
     
    An example of robust packet scheduling follows.  The communication 
    system used in the example consists of the following components in 
    the order that the video is processed from source to sink: 
    o camera and capturing 
    o pre-encoding buffer 
    o encoder 
    o encoded picture buffer 
    o transmitter 
    o transmission channel 
    o receiver 
    o receiver buffer 
    o decoder 
    o decoded picture buffer 
    o display 
     
    The video communication system used in the example operates as 
    follows.  Note that processing of the video stream happens 
    gradually and at the same time in all components of the system.  
    The source video sequence is shot and captured to a pre-encoding 
    buffer.  The pre-encoding buffer can be used to order pictures from 
    sampling order to encoding order or to analyze multiple 
    uncompressed frames for bitrate control purposes, for 
    example.  In some cases the pre-encoding buffer may not exist, but 
    rather the sampled pictures are encoded right away.  The encoder 
    encodes pictures from the pre-encoding buffer and stores the 
    output, i.e., coded pictures, to the encoded picture buffer.  The 
    transmitter encapsulates the coded pictures from the encoded 
    picture buffer to transmission packets and sends them to a receiver 
    through a transmission channel.  The receiver stores the received 
    packets to the receiver buffer.  The receiver buffering process 
    typically includes buffering for transmission delay jitter.  The 
    receiver buffer can also be used to recover correct decoding order 
    of coded data.  The decoder reads coded data from the receiver 
    buffer and produces decoded pictures as output into the decoded 
    picture buffer.  The decoded picture buffer is used to recover the 
    output (or display) order of pictures.  Finally, pictures are 
    displayed. 
     
    In the following example figures, I denotes an IDR picture, R 
    denotes a reference picture, N denotes a non-reference picture, and 
    the number after I, R, or N indicates the sampling time relative to 
    the previous IDR picture in decoding order.  Values 
    below the sequence of pictures indicate scaled system clock 
    timestamps.  The system clock is initialized arbitrarily in this 
    example, and time runs from left to right.  Each I, R, and N 
    picture is mapped into the same timeline compared to the previous 
    processing step, if any, assuming that encoding, transmission, and 
    decoding take no time.  Thus, events happening at the same time are 
    located in the same column throughout all example figures. 
     
    A subset of a sequence of coded pictures is depicted below in 
    sampling order.   
     
    ...  N58 N59 I00 N01 N02 R03 N04 N05 R06 ... N58 N59 I00 N01 ... 
    ... --|---|---|---|---|---|---|---|---|- ... -|---|---|---|- ... 
    ...  58  59  60  61  62  63  64  65  66  ... 128 129 130 131 ... 
     
    The sampled pictures are buffered in the pre-encoding buffer to 
    arrange them in encoding order.  In this example, we assume that 
    the non-reference pictures are predicted from both the previous and 
    the next reference picture in output order.  Thus, the pre-encoding 
    buffer has to contain at least two pictures and the buffering 
    causes a delay of two picture intervals.  The output of the 
    pre-encoding buffering process, which is also the encoding (and 
    decoding) order of the pictures, is as follows: 
     
             ... N58 N59 I00 R03 N01 N02 R06 N04 N05 ...  
             ... -|---|---|---|---|---|---|---|---|-  ...  
             ... 60  61  62  63  64  65  66  67  68  ... 
     
    The encoder or the transmitter can set the value of DON for each 
    picture to the value of DON of the previous picture in decoding 
    order plus one. 
     
    For the sake of simplicity, let us assume that: 
    o the frame rate of the sequence is constant, 
    o each picture consists of only one slice, 
    o each slice is encapsulated in a single NAL unit packet, 
    o pictures are transmitted in decoding order, and 
    o pictures are transmitted at constant intervals (equal to 
      1 / frame rate). 
     
    Thus, pictures are received in decoding order: 
     
             ... N58 N59 I00 R03 N01 N02 R06 N04 N05 ...  
             ... -|---|---|---|---|---|---|---|---|- ...  
             ... 60  61  62  63  64  65  66  67  68  ... 
     
    The optional interleaving-depth MIME type parameter is set to 0, 
    because no buffering is needed to recover the correct decoding 
    order from transmission (or reception) order. 
     
    The decoder has to buffer for one picture interval initially in its 
    decoded picture buffer to organize pictures from decoding order to 
    output order as depicted below: 
     
                 ... N58 N59 I00 N01 N02 R03 N04 N05 R06 ...  
                 ... -|---|---|---|---|---|---|---|---|- ...  
                 ... 61  62  63  64  65  66  67  68  69  ... 
     
    The amount of required initial buffering in the decoded picture 
    buffer can be signaled in the buffering period SEI message or with 
    the num_reorder_frames syntax element of H.264 video usability 
    information.  num_reorder_frames indicates the maximum number of 
    frames, complementary field pairs, or non-paired fields that 
    precede any frame, complementary field pair, or non-paired field in 
    the sequence in decoding order and follow it in output order.  For 
    the sake of simplicity, we assume that num_reorder_frames is used 
    to indicate the initial buffer in the decoded picture buffer.  In 
    this example, num_reorder_frames is equal to 1. 
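     
    The definition above can be checked against the example with a 
    short Python sketch.  It is an illustration under simplifying 
    assumptions: every picture is a frame, picture labels are unique 
    within the excerpt, and the helper name max_reorder is 
    hypothetical. 

```python
# Sketch only: count reordering between decoding order and output order.

def max_reorder(first_order, second_order):
    """Largest number of pictures that precede some picture in
    first_order but follow it in second_order."""
    pos_second = {pic: i for i, pic in enumerate(second_order)}
    worst = 0
    for i, pic in enumerate(first_order):
        count = sum(
            1
            for earlier in first_order[:i]
            if pos_second[earlier] > pos_second[pic]
        )
        worst = max(worst, count)
    return worst


decoding_order = "N58 N59 I00 R03 N01 N02 R06 N04 N05".split()
output_order = "N58 N59 I00 N01 N02 R03 N04 N05 R06".split()

num_reorder_frames = max_reorder(decoding_order, output_order)
print(num_reorder_frames)
```

    For N01, for instance, only R03 precedes it in decoding order and 
    follows it in output order, which gives the value 1 stated above. 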
     
    It can be observed that if the IDR picture I00 is lost during 
    transmission and a retransmission request is issued when the value 
    of the system clock is 62, there is one picture interval of time 
    (until the system clock reaches timestamp 63) to receive the 
    retransmitted IDR picture I00. 
     
    Let us then assume that IDR pictures are transmitted two frame 
    intervals earlier than their decoding position, i.e., the pictures 
    are transmitted as follows: 
     
                    ...  I00 N58 N59 R03 N01 N02 R06 N04 N05 ...  
                    ... --|---|---|---|---|---|---|---|---|- ...  
                    ...  62  63  64  65  66  67  68  69  70  ...  
     
    The optional interleaving-depth MIME type parameter is set equal to 
    1 according to its definition.  (The value of interleaving-depth in 
    this example can be derived as follows:  Picture I00 is the only 
    picture preceding picture N58 or N59 in transmission order and 
    following it in decoding order.  Except for pictures I00, N58, and 
    N59, the transmission order is the same as the decoding order of 
    pictures.  Since a coded picture is encapsulated into exactly one 
    NAL unit, the value of interleaving-depth is equal to the maximum 
    number of pictures preceding any picture in transmission order and 
    following the picture in decoding order.) 
     
    The receiver buffering process contains one picture at a time 
    according to the value of the interleaving-depth parameter and 
    orders pictures from the reception order to the correct decoding 
    order based on the value of DON associated with each picture.  The 
    output of the receiver buffering process is the following: 
     
                         ... N58 N59 I00 R03 N01 N02 R06 N04 N05 ...  
                         ... -|---|---|---|---|---|---|---|---|- ...  
                         ... 63  64  65  66  67  68  69  70  71  ... 
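     
    The derivation of interleaving-depth and the resulting receiver 
    buffering can be sketched in Python as follows.  This is a 
    simplified model, not the normative depacketization process: it 
    assumes one NAL unit per picture, DON assigned as the previous DON 
    plus one in decoding order, no packet loss, and no DON 
    wrap-around.  The function names are hypothetical. 

```python
# Sketch only: derive interleaving-depth and de-interleave by DON.
import heapq


def interleaving_depth(transmission, decoding):
    """Max number of pictures that precede a picture in transmission
    order and follow it in decoding order (one NAL unit per picture)."""
    dec_pos = {pic: i for i, pic in enumerate(decoding)}
    return max(
        sum(1 for earlier in transmission[:i]
            if dec_pos[earlier] > dec_pos[pic])
        for i, pic in enumerate(transmission)
    )


def deinterleave(received, depth):
    """Yield pictures in DON order, holding at most `depth` pictures."""
    buffer = []  # min-heap keyed on DON
    for don, pic in received:
        heapq.heappush(buffer, (don, pic))
        if len(buffer) > depth:
            yield heapq.heappop(buffer)[1]
    while buffer:  # flush at end of stream
        yield heapq.heappop(buffer)[1]


decoding = "N58 N59 I00 R03 N01 N02 R06 N04 N05".split()
transmission = "I00 N58 N59 R03 N01 N02 R06 N04 N05".split()

# DON of each picture = DON of the previous picture in decoding order + 1.
don = {pic: i for i, pic in enumerate(decoding)}

depth = interleaving_depth(transmission, decoding)
recovered = list(deinterleave([(don[p], p) for p in transmission], depth))
print(depth, recovered)
```

    In this model I00 stays in the buffer while N58 and N59 pass 
    through and is released when R03 arrives, three picture intervals 
    after its own reception, in line with the timeline above. 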
     
    Again, an initial buffering delay of one picture interval is needed 
    to organize pictures from decoding order to output order as 
    depicted below: 
     
                             ... N58 N59 I00 N01 N02 R03 N04 N05 ... 
                             ... -|---|---|---|---|---|---|---|- ... 
                             ... 64  65  66  67  68  69  70  71  ... 
     
    It can be observed that the maximum delay that IDR pictures can 
    undergo during transmission, including possible application, 
    transport, or link layer retransmission, is equal to three picture 
    intervals.  Thus, the loss resiliency of IDR pictures is improved 
    in systems supporting retransmission compared to the case in which 
    pictures were transmitted in their decoding order. 
     
     
 11.4.       Robust Transmission Scheduling of Redundant Coded Slices 
     
    A redundant coded picture is a coded representation of a picture or 
    a part of a picture that is not used in the decoding process if the 
    corresponding primary coded picture is correctly decoded.  There 
    should be no noticeable difference between any area of the decoded 
    primary picture and a corresponding area that would result from 
    application of the H.264 decoding process for any redundant picture 
    in the same access unit.  A redundant coded slice is a coded slice 
    that is a part of a redundant coded picture. 
     
    Redundant coded pictures can be used to provide unequal error 
    protection in error-prone video transmission.  If a primary coded 
    representation of a picture is decoded incorrectly, a corresponding 
    redundant coded picture can be decoded.  Examples of applications 
    and coding techniques utilizing the redundant coded picture feature 
    include the video redundancy coding [19] and protection of "key 
    pictures" in multicast streaming [20]. 
     
    One property of many error-prone video communications systems is 
    that transmission errors are often bursty and may therefore 
    affect several consecutive transmission packets in 
    transmission order.  In low bitrate video communication it is 
    relatively common that an entire coded picture can be encapsulated 
    into one transmission packet.  Consequently, a primary coded 
    picture and the corresponding redundant coded pictures may be 
    transmitted in consecutive packets in transmission order.  In order 
    to make the transmission scheme more tolerant of bursty 
    transmission errors, it is beneficial to transmit a primary coded 
    picture apart from the corresponding redundant coded pictures.  The 
    DON concept enables this. 
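     
    One possible scheduling is sketched below in Python, purely as an 
    illustration.  The sketch assumes one primary and one redundant 
    coded picture per picture, each in its own packet, and assigns DON 
    values that increase by one per NAL unit in decoding order.  The 
    gap parameter, the picture labels, and the function names are 
    hypothetical and are not defined by this memo. 

```python
# Sketch only: send each redundant picture `gap` pictures after its primary.

def schedule(pictures, gap):
    """Return the transmission order as (don, label) tuples."""
    nal_units = []
    don = 0
    for pic in pictures:
        nal_units.append((don, pic + "/primary"))
        nal_units.append((don + 1, pic + "/redundant"))
        don += 2
    primaries = nal_units[0::2]
    redundants = nal_units[1::2]
    order = []
    for i, primary in enumerate(primaries):
        order.append(primary)
        if i >= gap:
            order.append(redundants[i - gap])
    # Redundant pictures still pending at the end of the sequence.
    order.extend(redundants[max(0, len(primaries) - gap):])
    return order


def survives_burst(order, start, length):
    """True if every picture keeps its primary or its redundant copy
    when packets start .. start + length - 1 are lost."""
    kept = {label for i, (_, label) in enumerate(order)
            if not start <= i < start + length}
    names = {label.split("/")[0] for _, label in order}
    return all(n + "/primary" in kept or n + "/redundant" in kept
               for n in names)


pictures = ["P0", "P1", "P2", "P3", "P4"]
order = schedule(pictures, 2)   # primary and redundant kept apart
naive = schedule(pictures, 0)   # redundant right after its primary

print([label for _, label in order])
print(all(survives_burst(order, s, 3) for s in range(len(order))))
print(survives_burst(naive, 0, 2))
```

    Because every NAL unit carries its DON, the receiver can restore 
    the decoding order by sorting on DON regardless of the gap chosen 
    by the transmitter. 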
     
     
 11.5.       Remarks on Other Design Possibilities 
     
    The slice header syntax structure of the H.264 coding standard 
    contains the frame_num syntax element, which can indicate the 
    decoding order of coded frames.  However, using the frame_num 
    syntax element to recover the decoding order is neither feasible 
    nor desirable, for the following reasons: 
    o The receiver is required to parse at least one slice header per 
      coded picture (before passing the coded data to the decoder). 
    o Coded slices from multiple coded video sequences cannot be 
      interleaved, because the frame_num syntax element is reset to 
      0 in each IDR picture. 
    o The coded fields of a complementary field pair share the same 
      value of the frame_num syntax element.  Thus, the decoding order 
      of the coded fields of a complementary field pair cannot be 
      recovered based on the frame_num syntax element or any other 
      syntax element of the H.264 coding syntax. 
     
    The RTP payload format for transport of MPEG-4 elementary streams 
    [21] enables interleaving of access units and transmission of 
    multiple access units in the same RTP packet.  An access unit is 
    specified in the H.264 coding standard to consist of all NAL units 
    that are associated with a primary coded picture according to 
    subclause 7.4.1.2 of [1].  Consequently, with that payload format, 
    slices of different pictures cannot be interleaved and the 
    multi-picture slice interleaving technique (see section 10.5) for 
    improved error resilience cannot be used. 
     
     
 12.     Open Issues 
     
    Since the length of the value of the optional parameter-sets MIME 
    parameter may be large, it is desirable to use base64 encoding 
    instead of the current base16 (hexadecimal) encoding.  We will 
    carry out the change to base64 encoding in the next release of this 
    memo. 
     
    The security section needs review. 
     
 13.     Full Copyright Statement 
     
    Copyright (C) The Internet Society (2003). All Rights Reserved. 
     
    This document and translations of it may be copied and furnished to 
    others, and derivative works that comment on or otherwise explain it 
    or assist in its implementation may be prepared, copied, published 
    and distributed, in whole or in part, without restriction of any 
    kind, provided that the above copyright notice and this paragraph 
    are included on all such copies and derivative works. 
     
    However, this document itself may not be modified in any way, such 
    as by removing the copyright notice or references to the Internet 
    Society or other Internet organizations, except as needed for the 
    purpose of developing Internet standards in which case the 
    procedures for copyrights defined in the Internet Standards process 
    must be followed, or as required to translate it into languages 
    other than English. 
     
    The limited permissions granted above are perpetual and will not be 
    revoked by the Internet Society or its successors or assigns. 
     
    This document and the information contained herein is provided on an 
    "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING 
    TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING 
    BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION 
    HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF 
    MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 
     
     
 14.     Intellectual Property Notice 
     
    The IETF takes no position regarding the validity or scope of any 
    intellectual property or other rights that might be claimed to 
    pertain to the implementation or use of the technology described in 
    this document or the extent to which any license under such rights 
    might or might not be available; neither does it represent that it 
    has made any effort to identify any such rights.  Information on 
    the IETF's procedures with respect to rights in standards-track and 
    standards-related documentation can be found in BCP-11.  Copies of 
    claims of rights made available for publication and any assurances 
    of licenses to be made available, or the result of an attempt made 
    to obtain a general license or permission for the use of such 
    proprietary rights by implementors or users of this specification 
    can be obtained from the IETF Secretariat. 
     
    The IETF invites any interested party to bring to its attention any 
    copyrights, patents or patent applications, or other proprietary 
    rights which may cover technology that may be required to practice 
    this standard.  Please address the information to the IETF 
    Executive Director. 
     
    The IETF has been notified of intellectual property rights claimed 
    in regard to some or all of the specification contained in this 
    document.  For more information consult the online list of claimed 
    rights at http://www.ietf.org/ipr. 
     
     
 15.     References 
     
 15.1.       Normative References 
  
    [1]  "Draft ITU-T Recommendation and Final Draft International 
          Standard of Joint Video Specification (ITU-T Rec. H.264 | 
          ISO/IEC 14496-10 AVC)", available from 
          ftp://ftp.imtc-files.org/jvt-experts/2003_03_Pattaya/JVT-G50r1.zip, 
          May 2003. 
    [2]  S. Bradner, "Key words for use in RFCs to Indicate 
          Requirement Levels", BCP 14, RFC 2119, March 1997. 
    [3]  H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, 
          "RTP: A Transport Protocol for Real-Time Applications", RFC 
          1889, January 1996. 
    [4]  M. Handley and V. Jacobson, "SDP: Session Description 
          Protocol", RFC 2327, April 1998. 
    [5]  ITU-T Recommendation T.35 (need to put in Rec name and date) 
  
  
 15.2.       Informative References 
  
    [6]  A. Luthra, G. Sullivan, and T. Wiegand (eds.), Special Issue 
          on H.264/AVC, IEEE Transactions on Circuits and Systems for 
          Video Technology, July 2003. 
    [7]  P. Borgwardt, "Handling Interlaced Video in H.26L", 
          VCEG-N57r2, available from 
          ftp://standard.pictel.com/video-site/0109_San/VCEG-N57r2.doc, 
          September 2001. 
    [8]  C. Bormann et al., "RTP Payload Format for the 1998 Version 
          of ITU-T Rec. H.263 Video (H.263+)", RFC 2429, October 1998. 
    [9]  ISO/IEC IS 14496-2. 
    [10] S. Wenger, "H.26L over IP", IEEE Transactions on Circuits and 
          Systems for Video Technology, July 2003. 
    [11] S. Wenger, "H.26L over IP: The IP Network Adaptation Layer", 
          Proceedings Packet Video Workshop 02, April 2002. 
    [12] T. Stockhammer, M.M. Hannuksela, and S. Wenger, "H.26L/JVT 
          Coding Network Abstraction Layer and IP-based Transport" in 
          Proc. ICIP 2002, Rochester, NY, September 2002. 
    [13] ITU-T Recommendation H.223 (1999). 
    [14] C. Perkins et al., "RTP Payload for Redundant Audio Data", 
          RFC 2198, September 1997. 
    [15] J. Rosenberg and H. Schulzrinne, "An RTP Payload Format for 
          Generic Forward Error Correction", RFC 2733, December 1999. 
    [16] V. Varsa and M. Karczewicz, "Slice interleaving in compressed 
          video packetization", Packet Video Workshop 2000. 
    [17] S.H. Kang and A. Zakhor, "Packet scheduling algorithm for 
          wireless video streaming", International Packet Video 
          Workshop 2002, available from http://www.pv2002.org. 
    [18] M.M. Hannuksela, "Enhanced concept of GOP", JVT-B042, 
          available from 
          ftp://standard.pictel.com/video-site/0201_Gen/JVT-B042.doc, 
          January 2002. 
    [19] S. Wenger, "Video Redundancy Coding in H.263+", 1997 
          International Workshop on Audio-Visual Services over Packet 
          Networks, September 1997. 
    [20] Y.-K. Wang, M.M. Hannuksela, and M. Gabbouj, "Error Resilient 
          Video Coding Using Unequally Protected Key Pictures", in 
          Proc. International Workshop VLBV03, September 2003. 
    [21] J. van der Meer, D. Mackie, V. Swaminathan, D. Singer, and P. 
          Gentric, "RTP Payload Format for Transport of MPEG-4 
          Elementary Streams", draft-ietf-avt-mpeg4-simple-07.txt, 
          February 2003. 
     
     
    Authors' Addresses 
     
    Stephan Wenger                    Phone: +49-172-300-0813 
    TU Berlin / Teles AG              Email: stewe@cs.tu-berlin.de 
    Franklinstr. 28-29 
    D-10587 Berlin 
    Germany 
     
    Miska M. Hannuksela               Phone: +358-7180-73151 
    Nokia Corporation                 Email: miska.hannuksela@nokia.com 
    P.O. Box 100 
    33721 Tampere 
    Finland   
     
    Thomas Stockhammer                Phone: +49-89-28923474 
    Institute for Communications Eng. Email: stockhammer@ei.tum.de 
    Munich University of Technology 
    D-80290 Munich 
    Germany 
     
    Magnus Westerlund                 Phone: +46-8-4048287 
    Multimedia Technologies           Email:  
    Ericsson Research EAB/TVA/A       magnus.westerlund@ericsson.com 
    Ericsson AB 
    Torshamsgatan 23 
    S-164 80 Stockholm 
    Sweden 
     
    David Singer                      Phone: +1 408 974-3162 
    QuickTime Engineering             Email: singer@apple.com 
    Apple 
    1 Infinite Loop MS 302-3MT 
    Cupertino 
    CA 95014 
    USA 
     
     
 Annex A: Changes relative to draft-ietf-avt-rtp-h264-01.txt 
  
    [This section will be removed in a future version of this draft.] 
     
    This memo contains the following technical changes relative to the 
    previous I-D: 
    o The memo is aligned with the latest version of the H.264 
      specification.  In particular, the concept of access unit defined 
      in the H.264 specification has been incorporated into the memo. 
    o The use of RTP timestamps for interlaced video has been 
      clarified. 
    o Two forms of single-time aggregation packets are available: one 
      with decoding order number (DON) and another without DON. 
    o The assignment rules for the values of DON have been simplified.  
      NAL units having the same value of DON may be decoded in any 
      order.  NAL units having a different value of DON must be passed 
      to the decoder in ascending order of DON value. 
    o Receiver buffer size considerations have been added. 
    o The depacketization process has been updated to suit lossy 
      transmission better.  The optional initial-buffering-time MIME 
      parameter forms a part of the solution. 
    o Example use cases of DON and rationale behind the DON concept 
      have been added. 
    o The application examples have been clarified and a new example 
      has been added. 
    o The security section has been reworked. 
    o Numerous editorial fixes have been made. 
  