Internet Draft                                               S. Wenger 
 Document: draft-ietf-avt-rtp-h264-00.txt                 M. Hannuksela 
 Expires: March 2003                                     T. Stockhammer 
                                                         September 2002 
                                                     Expires March 2003 
                                                
  
  
  
                    RTP payload Format for JVT Video 
  
  
  
 Status of this Memo 
     
 This document is an Internet-Draft and is in full conformance with 
 all provisions of Section 10 of RFC2026.  Internet-Drafts are working 
 documents of the Internet Engineering Task Force (IETF), its areas, 
 and its working groups.  Note that other groups may also distribute 
 working documents as Internet-Drafts. 
  
 Internet-Drafts are draft documents valid for a maximum of six months 
 and may be updated, replaced, or obsoleted by other documents at any 
 time.  It is inappropriate to use Internet-Drafts as reference 
 material or to cite them other than as "work in progress." 
  
 The list of current Internet-Drafts can be accessed at 
 http://www.ietf.org/1id-abstracts.txt 
  
 The list of Internet-Draft Shadow Directories can be accessed at 
 http://www.ietf.org/shadow.html 
     
     
     
 Abstract 
     
    This memo describes an RTP Payload format for the ITU-T 
    Recommendation H.264 codec.  This codec was designed as a joint 
    project of the ITU-T SG 16 VCEG, and the ISO/IEC JTC1/SC29/WG11 
    MPEG groups.  The most up-to-date draft of the video codec was 
    specified in late August 2002, is due for revision in late October 
    2002, and is available for public review [2].  Final versions carry 
    the denomination H.264 and ISO/IEC 14496-10 and are technically 
    identical. 
      
 Wenger et. al.       Expires March 2003             [Page 1] 

 Internet Draft                                       21 September 2002 
     
 1. The JVT codec 
  
    This memo specifies an RTP payload format for a new video codec 
    that is currently under development by the Joint Video Team (JVT), 
    which is formed of video coding experts from MPEG and the ITU-T.  
    After the likely approval by the two parent bodies, the codec 
    specification will have the status of the ITU-T Recommendation 
    H.264 and become part of the MPEG-4 specification (ISO/IEC 14496 
    Part 10).  The current project timeline of the JVT project is such 
    that a technically frozen specification (pending bug fixes) was 
    finalized in July 2002 in the form of an ISO/IEC Final Committee 
    Draft (FCD).  In October, some editorial changes will be made, and 
    a few technical changes can also be expected.  However, it is 
    believed that only very few, if any, technical details will be 
    changed that directly affect this draft. 
    Before JVT was formed in late 2001, this project used the ITU-T 
    project name H.26L and the JVT project inherited all the technical 
    concepts of the H.26L project. 
  
    The JVT video codec has a very broad application range that covers 
    the whole range from low bit rate Internet Streaming applications 
    to HDTV broadcast and Digital Cinema applications with near loss-
    less coding.  Most, if not all, relevant companies in all of these 
    fields (including TV broadcast) have participated in the 
    standardization, which gives hope that this wide application range 
    is more than an illusion and may materialize, probably in a 
    relatively short time frame.  The overall performance of the JVT 
    codec is such that bit rate savings of 50% or more, compared to 
    the current state of technology, are reported.  Digital Satellite 
    TV quality, for example, was reported to be achievable at 1.5 
    Mbit/s, compared to the current operation point of MPEG 2 video at 
    around 3.5 Mbit/s [1]. 
     
    The codec specification [2] itself distinguishes conceptually 
    between a video coding layer (VCL), and a network abstraction layer 
    (NAL).  The VCL contains the signal processing functionality of the 
    codec, things such as transform, quantization, motion 
    search/compensation, and the loop filter.  It follows the general 
    concept of most of today's video codecs, a macroblock-based coder 
    that utilizes inter picture prediction with motion compensation, 
    and transform coding of the residual signal.  The output of the VCL 
    consists of slices.  A slice is a bit string that contains the 
    macroblock data of an integer number of macroblocks, and the 
    information of the slice header (containing the spatial address of 
    the first macroblock in the slice, the initial quantization 
    parameter, and similar).  
    Macroblocks in slices are ordered in scan order unless a different 
    macroblock allocation is specified, using the so-called Flexible 
    Macroblock Ordering syntax.  In-picture prediction is used only 
    within a slice.   
     
    The NAL encapsulates the slice output of the VCL into Network 
    Abstraction Layer Units (NALUs), which are suitable for 
    transmission over packet networks or for use in packet-oriented 
    multiplex environments.  JVT's Annex B defines an encapsulation 
    process to transmit such NALUs over byte-stream oriented networks.  
    In the scope of this memo Annex B is not relevant. 
     
    Neither VCL nor NAL is claimed to be media or network independent 
    - the VCL needs to know transmission characteristics in order to 
    appropriately select the error resilience strength, slice size, 
    etc., whereas the NAL needs information like the importance of a 
    bit string provided by the VCL to select the appropriate 
    application layer protection. 
     
    Internally, the NAL uses NAL Units or NALUs.  A NALU consists of a 
    one-byte header and the payload byte string.  The header co-serves 
    as the RTP payload header and indicates the type of the NALU, the 
    (potential) presence of bit errors in the NALU payload, and 
    information regarding the relative importance of the NALU for the 
    decoding process.  This RTP payload specification is designed to be 
    unaware of the bit string in the NALU payload. 
     
    One of the main properties of the JVT codec is the possibility of 
    the use of Reference Picture Selection.  For each macroblock the 
    reference picture to be used can be selected independently.  The 
    reference pictures may be used in a first-in, first-out fashion, 
    but it is also possible to handle the reference picture buffers 
    explicitly.  A consequence of this new feature (previously 
    available only in H.263++ [3]) is the complete decoupling of the 
    transmission time, the decoding time, and the sampling or 
    presentation time of slices and pictures.  For this reason, the 
    handling of the RTP timestamp requires some special considerations 
    for those NALUs for which the sampling or presentation time is not 
    defined, or, at transmission time, unknown. 
     
     
 2. Changes relative to draft-wenger-avt-rtp-jvt-01.txt 
  
    [This section will be removed in a future version of this draft.] 
     
 2.1. Status of the JVT standardization, and recent changes to JVT 
  
    Since the last draft, JVT has met twice and each time a new JVT 
    working draft was produced.  The latest JVT working draft is 
    currently in the second stage of the ISO/IEC approval process, the 
    ballot on the so-called Final Committee Draft.  Procedural 
    provisions are taken by interested ISO/IEC members to ensure that 
    changes relative to this draft are still possible, even after the 
    ballot. 
     
    The meetings brought a lot of changes in the VCL, which do not have 
    a direct influence on this memo.  However, there were also numerous 
    changes introduced to the NAL.  Most of these changes can be 
    considered bug fixes or cleanups that re-established the clean NAL 
    design.  In particular, the unreasonably high number of slice types 
    was again reduced to the pre-Fairfax design (as presented in the 
    Minneapolis IETF), and the picture header concept with its 
    redundant carriage mechanism was removed.   
    Newly introduced was a mechanism that allows signaling the relative 
    importance of a NALU for the decoding process.  A two-bit field 
    indicates the importance of a NALU.  A value of 00 indicates that 
    the decoding of the NALU is not necessary to maintain the integrity 
    of the reference pictures.  Values above 00 imply that the NALU is 
    necessary for maintaining the integrity of the reference pictures.  
    In addition, the higher the value of the field, the higher the 
    impact of the loss, as determined by the encoder.  Intelligent 
    network elements can use this information to discard NALUs in a 
    controlled manner in order to produce the best possible picture at 
    a given bit rate. 
     
     
     
 2.2. Changes relative to draft-wenger-avt-rtp-jvt-01.txt 
     
    This memo contains two significant changes relative to the previous 
    I-D.  The first change is the alignment with the current JVT WD, in 
    particular with respect to the NALU types and the priority field in 
    the NALU header.  The second change was discussed in Yokohama and 
    is concerned with the length of the timestamp offset field in the 
    MTAP.  In Yokohama it was felt that more flexibility is needed.  
    Hence, now a total of 4 MTAPs are introduced, called MTAP8, MTAP16, 
    MTAP24, and MTAP32, which differ from each other only by the length 
    of the timestamp offset. 
     
     
 3. Scope 
  
    This payload specification can only be used to carry the "naked" 
    JVT NALU stream over RTP.  The first applications of a Standards 
    Track RFC resulting from this draft will likely be in the 
    conversational multimedia field, such as video telephony or video 
    conferencing.  The draft is not intended for use in conjunction 
    with the Byte Stream format of Annex B of the JVT working draft, 
    the MPEG-4 system layer [4] or other multiplexing schemes. 
     
     
 4. NAL basics 
  
    Tutorial information on the NAL design can be found in [5], 
    [6] and [14].  For the precise definition of the NAL, see [2].  
    This section provides a very short overview of the concepts used. 
     
     
 4.1. Parameter Set Concept 
     
    One very fundamental design concept of the JVT codec is to generate 
    self-contained packets, which makes mechanisms such as the header 
    duplication of RFC2429 [7] or MPEG-4's HEC [8] unnecessary.  This 
    is achieved by decoupling information that is relevant for more 
    than one slice from the media stream.  This higher layer meta 
    information should be sent reliably, asynchronously, and in 
    advance of the RTP packet stream that 
    contains the slice packets.  The combination of the higher level 
    parameters is called a Parameter Set.  The Parameter Set contains 
    information such as 
     
      o picture size, 
      o display window, 
      o optional coding modes employed, 
      o macroblock allocation map, 
      o and others. 
       
    In order to be able to change picture parameters (such as the 
    picture size) without needing to transmit Parameter Set 
    updates synchronously to the slice packet stream, the encoder and 
    decoder can maintain a list of more than one Parameter Set.  Each 
    slice header contains a codeword that indicates the Parameter Set 
    to be used.   
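
    The lookup described above can be sketched as follows.  This is 
    only an illustration and not part of the JVT specification: the 
    class names, the two fields, and the numeric identifiers are 
    assumptions made for the example. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterSet:
    """Hypothetical container; real Parameter Sets carry more fields."""
    width: int
    height: int


class ParameterSetStore:
    """Keeps several Parameter Sets so a slice can select one by id."""

    def __init__(self):
        self._sets = {}

    def update(self, set_id, parameter_set):
        # May be called asynchronously, e.g. from a control protocol.
        self._sets[set_id] = parameter_set

    def lookup(self, set_id):
        # The slice header codeword selects the Parameter Set to use.
        # None means the referenced set was never established.
        return self._sets.get(set_id)


store = ParameterSetStore()
store.update(0, ParameterSet(width=176, height=144))   # QCIF
store.update(1, ParameterSet(width=352, height=288))   # CIF
```

    A slice whose header refers to set 1 would then be decoded with 
    the CIF parameters, regardless of when that set was delivered. 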
     
    This mechanism allows the transmission of the Parameter Sets to be 
    decoupled from the packet stream, so that they can be transmitted 
    by external means, e.g. as a side effect of the capability 
    exchange, or through a (reliable or unreliable) control protocol.  
    It is even possible that they are never transmitted but are fixed 
    by an application design specification. 
     
    Although, conceptually, the Parameter Set updates are not designed 
    to be sent in the synchronous packet stream, this memo contains 
    means to convey them in the RTP packet stream.   
     
     
 4.2. Network Abstraction Layer Unit (NALU) Types 

    All NALUs begin with a single NALU Type octet, which also serves as 
    the payload header.  The payload of a NALU follows immediately.   
     
    The NALU type octet has the following format: 
     
     
     
    +---------------+ 
    |0|1|2|3|4|5|6|7| 
    +-+-+-+-+-+-+-+-+ 
    |F|NSI|  Type   | 
    +---------------+ 
     
    F: 1 bit 
       The Forbidden bit, when zero, indicates a bit error free NAL 
       unit.  The JVT specification declares a value of 1 as a syntax 
       violation.  Hence, when set, the decoder is advised that bit 
       errors may be present in the payload or in the NALU type octet.  
       A prudent reaction of decoders that are incapable of handling 
       bit errors is to discard such packets. 
        
    NSI: 2 bits 
       NAL Storage IDC.  A value of 00 indicates that the content of 
       the NALU is not used to reconstruct stored pictures (that can be 
       used for future reference).  Such NALUs can be discarded without 
       risking the integrity of the reference pictures.  Values above 
       00 indicate that the decoding of the NALU is required to 
       maintain the integrity of the reference pictures.  Furthermore, 
       values above 00 indicate the relative transport priority, as 
       determined by the encoder.  Intelligent network elements can use 
       this information to protect more important NALUs better than less 
       important NALUs.  11 is the highest transport priority, followed 
       by 10, then by 01 and, finally, 00 is the lowest. 
     
    Type: 5 bits 
       The NAL Unit payload type as defined in table 7.1 of [2]. 
     
    For a reference of all currently defined NALU types and their 
    semantics please refer to section 7.1 in [2].   
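
    The layout of the NALU type octet can be sketched as below.  The 
    sketch assumes the usual IETF convention that bit 0 in the diagram 
    is the most significant bit; the function and class names are 
    hypothetical. 

```python
from typing import NamedTuple


class NaluHeader(NamedTuple):
    f: int          # Forbidden bit (bit 0 in the diagram, the MSB)
    nsi: int        # NAL Storage IDC, 2 bits
    nalu_type: int  # Type, 5 bits


def parse_nalu_header(octet: int) -> NaluHeader:
    """Split the NALU type octet into F, NSI and Type."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("NALU type octet must fit in one byte")
    return NaluHeader(f=octet >> 7, nsi=(octet >> 5) & 0x3,
                      nalu_type=octet & 0x1F)


def build_nalu_header(f: int, nsi: int, nalu_type: int) -> int:
    """Inverse of parse_nalu_header."""
    if f not in (0, 1) or not 0 <= nsi <= 3 or not 0 <= nalu_type <= 31:
        raise ValueError("field out of range")
    return (f << 7) | (nsi << 5) | nalu_type


def may_discard(octet: int) -> bool:
    """True if dropping the NALU cannot damage reference pictures."""
    return parse_nalu_header(octet).nsi == 0
```

    A network element that has to shed load could call may_discard() 
    on the first payload octet without parsing the rest of the NALU. 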
     
 4.3. Aggregation Packets 

    Aggregation packets are the packet aggregation scheme of this 
    payload specification.  The scheme reflects the dramatically 
    different MTU sizes of two target networks: wireline IP networks 
    (with an MTU size that is often limited by the Ethernet MTU size 
    of roughly 1500 bytes), and IP or non-IP (e.g. H.324/M) based 
    wireless networks with preferred transmission unit sizes of 254 
    bytes or less.  Aggregation avoids media transcoding between the 
    two worlds as well as undesirable packetization overhead. 
     
    Two types of Aggregation packets are defined by this specification: 
     
    o Single-Time Aggregation Packets (STAPs) aggregate NALUs with 
      identical NALU-time. 
    o Multi-Time Aggregation Packets (MTAPs) aggregate NALUs with 
      potentially differing NALU-time.  Four different MTAPs are 
      defined that differ in the length of the NALU timestamp offset. 
     
    The term NALU-time is defined as the value the RTP timestamp would 
    have if that NALU were transported in its own RTP packet.  
     
    MTAPs and STAPs share the following packetization rules: 
     
    The NSI MUST be set to the maximum of the NSIs of all the NALUs to 
    be aggregated. 
    The Type field of the NALU type octet MUST be set to the 
    appropriate value as indicated in table xxx.  The F bit MUST be 
    cleared if all F bits of the aggregated NALUs are zero, otherwise 
    it MUST be set. 
     
    Table xxx: Type field for STAP and MTAPs 
     
    Type   Packet    Timestamp offset field length (in bits) 
    ---------------------------------------------- 
    0x18   STAP      0 
    0x19   MTAP8     8 
    0x20   MTAP16    16 
    0x21   MTAP24    24 
    0x22   MTAP32    32 
     
    The Marker bit in the RTP header MUST be set to the value the 
    marker bit of the last NALU of the aggregated packet would have if 
    it were transported in its own RTP packet. 
     
    The NALU Payload of an aggregation packet consists of one or more 
    aggregation units.  See sections 4.3.1 and 4.3.2 for the two 
    different types of aggregation units.  An aggregation packet can 
    carry as many aggregation units as necessary.  However, the total 
    amount of data in an aggregation packet MUST fit into an IP 
    packet, and the size SHOULD be chosen such that the resulting IP 
    packet is smaller than the MTU size. 
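
    The rules for the F bit and the NSI of an aggregation packet can 
    be sketched as follows.  The function name is hypothetical, and 
    the Type value is passed in by the caller rather than fixed here. 

```python
def aggregation_header(nalu_octets, aggregation_type):
    """Build the NALU type octet of an STAP or MTAP.

    nalu_octets: the NALU type octets of the NALUs to be aggregated.
    aggregation_type: 5-bit Type value for the chosen STAP/MTAP.
    """
    if not nalu_octets:
        raise ValueError("an aggregation packet needs at least one NALU")
    if not 0 <= aggregation_type <= 0x1F:
        raise ValueError("Type must fit in 5 bits")
    # F is set if any aggregated NALU has F set, cleared otherwise.
    f = 1 if any(o & 0x80 for o in nalu_octets) else 0
    # NSI is the maximum NSI over all aggregated NALUs.
    nsi = max((o >> 5) & 0x3 for o in nalu_octets)
    return (f << 7) | (nsi << 5) | aggregation_type
```

    For example, aggregating NALUs with NSI values 0, 2 and 1 and no 
    F bit set yields an aggregation header with NSI 2 and F cleared. 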
  
 4.3.1. Single-Time Aggregation Packet 
  
    A Single-Time Aggregation Packet (STAP) SHOULD be used whenever 
    aggregating NALUs that share the same NALU-time.  
    The NALU payload of an STAP consists of Single-Picture Aggregation 
    Units. 
      
    A Single-Picture Aggregation Unit consists of 16-bit unsigned size 
    information that indicates the size of the following NALU in bytes 
    (excluding these two octets, but including the NALU type octet of 
    the NALU), followed by the NALU itself including its NALU type  
    byte.   
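
    A minimal sketch of writing and reading Single-Picture Aggregation 
    Units follows.  It assumes that the 16-bit size field is carried 
    in network byte order, which this memo does not state explicitly; 
    the function names are hypothetical.  The STAP header octet is not 
    included in the data handled here. 

```python
import struct


def pack_stap_units(nalus):
    """Concatenate NALUs as Single-Picture Aggregation Units."""
    out = bytearray()
    for nalu in nalus:
        if not 1 <= len(nalu) <= 0xFFFF:
            raise ValueError("NALU size must fit the 16-bit size field")
        # Size counts the NALU including its type octet, but not
        # the two size octets themselves.
        out += struct.pack("!H", len(nalu)) + nalu
    return bytes(out)


def unpack_stap_units(data):
    """Split aggregation units back into a list of NALUs."""
    nalus, pos = [], 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated size field")
        (size,) = struct.unpack_from("!H", data, pos)
        pos += 2
        if size == 0 or pos + size > len(data):
            raise ValueError("invalid aggregation unit size")
        nalus.append(bytes(data[pos:pos + size]))
        pos += size
    return nalus
```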
     
 4.3.2. Multi-Time Aggregation Packets (MTAPs) 
     
    An MTAP has an architecture similar to that of an STAP.  It 
    consists of the NALU header byte and one or more Multi-Picture 
    Aggregation Units.  The choice between the different MTAP types is 
    application dependent: the longer the timestamp offset field, the 
    higher the flexibility of the MTAP, but also the higher the 
    overhead. 
     
     
    This memo does not specify how the NALUs within an MTAP are  
    ordered.  In most cases, the natural "decoding order" SHOULD be 
    used, in particular in conjunction with bi-predicted pictures that 
    use a forward reference picture.  However, all other NALU ordering 
    schemes that are legal in JVT video MAY be used as well. 
     
    Four different Multi-Time Aggregation Units are defined in this 
    specification.  They all consist of 16 bits of unsigned size 
    information for the following NALU (same as the size information 
    in the STAP).  These 16 bits are followed by n bits of timing 
    information for this NALU, where n can be 8, 16, 24, or 32.  The 
    timing information field MUST be set so that the RTP timestamp of 
    an RTP packet of each NALU in the MTAP (the NALU-time) can be 
    generated by subtracting the timing information from the RTP 
    timestamp of the MTAP.  
     
    For the "latest" multi-picture Aggregation Unit in an MTAP the 
    timing offset MUST be zero.  Hence, the RTP timestamp of the MTAP 
    itself is identical to the latest NALU-time. 
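
    The relation between the MTAP timestamp, the timing offsets, and 
    the NALU-times can be sketched as follows.  The sketch assumes 
    that the NALU-times within one MTAP do not wrap around, so that 
    the "latest" NALU-time is simply the largest one; the function 
    names are hypothetical. 

```python
RTP_TS_MODULUS = 1 << 32


def mtap_offsets(nalu_times, offset_bits):
    """Return (mtap_timestamp, offsets) for the given NALU-times.

    Assumes the NALU-times do not wrap around within one MTAP, so the
    "latest" NALU-time is simply the largest value.
    """
    if offset_bits not in (8, 16, 24, 32):
        raise ValueError("offset field must be 8, 16, 24 or 32 bits")
    mtap_ts = max(nalu_times)
    offsets = [mtap_ts - t for t in nalu_times]
    if max(offsets) >= 1 << offset_bits:
        raise ValueError("offset does not fit; use a larger MTAP type")
    return mtap_ts, offsets


def nalu_time(mtap_ts, offset):
    """Recover a NALU-time by subtracting the offset (mod 2**32)."""
    return (mtap_ts - offset) % RTP_TS_MODULUS
```

    With a 90 kHz clock and NALU-times of 90000, 93000 and 96000, the 
    MTAP timestamp is 96000 and the offsets are 6000, 3000 and 0.  
    These offsets need at least the 16-bit offset field. 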
     
     
     
 5. RTP Packetization Process 
  
    The RTP packetization process of the JVT codec is straightforward 
    and follows the general principles outlined in RFC1889.  When using 
    one NALU per RTP packet, the RTP payload consists of the bit buffer 
    containing the NALU.  The RTP payload (and the settings for some 
    RTP header bits) for aggregation packets are defined in section 
    4.3 above.  There is no specific RTP payload header; the NALU 
    type byte doubles as the payload header.  The RTP header 
    information is set as follows:  
     
    Timestamp: 32 bits 
       The RTP timestamp is set to the presentation/sampling timestamp 
       of the content.  If the NALU has no timing properties of its own 
       (e.g. PSIs, SEI), or if the presentation/sampling time is 
       unknown, the RTP timestamp is set to the RTP timestamp of the 
       last transmitted RTP packet in the session.  The setting of the 
       RTP timestamp for MTAPs is defined in section 4.3.2 above. 
     
    Marker bit (M): 1 bit 
       Set for the very last packet of the picture indicated by the RTP 
       timestamp, in line with the normal use of the M bit and to allow 
       an efficient playout buffer handling.  Decoders MAY use this bit 
       as an early indication of the last packet of a coded picture, 
       but MUST NOT rely on this property because the last packet of 
       the picture may get lost, and because the use of MTAPs does not 
       always preserve the M bit.   
     
    Sequence Number (Seq): 16 bits 
       Increased by one for each sent packet.  Set to a random value 
       during startup as per RFC1889. 
     
    Version (V): 2 bits 
       set to 2 
     
    Padding (P): 1 bit 
       set to 0 
     
    Extension (X): 1 bit 
       set to 0 
     
    Payload Type (PT): 7 bits 
       established dynamically during connection establishment 
     
    All other RTP header fields are set as per RFC1889. 
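
    The header settings listed above can be sketched as follows for 
    the 12-byte fixed RTP header of RFC1889.  The sketch assumes no 
    CSRC entries (CC=0); the function name is hypothetical, and the 
    payload type value is whatever was negotiated dynamically. 

```python
import struct


def build_rtp_header(payload_type, seq, timestamp, ssrc, marker=False):
    """Pack the 12-byte fixed RTP header with V=2, P=0, X=0, CC=0."""
    if not 0 <= payload_type <= 0x7F:
        raise ValueError("payload type must fit in 7 bits")
    byte0 = 2 << 6                      # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | payload_type
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF,
                       ssrc & 0xFFFFFFFF)
```

    The RTP payload, starting with the NALU type octet, follows 
    directly after these 12 bytes. 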
     
     
 6. Packetization Rules 
  
    Two sets of packetization rules are distinguished, depending on 
    whether NALUs belonging to more than a single picture may be put 
    into a single aggregated packet (using STAPs or MTAPs). 
     
     
 6.1. Unrestricted Mode (Multiple Picture Model) 
  
    This mode MAY be supported by some receivers.  Usually, the 
    capability of a receiver to support this mode is indicated by one 
    of the profiles of the JVT codec (this is not yet defined in [2]). 
    The following packetization rules MUST be enforced by the sender: 
     
    o Single slice packets belonging to the same picture (and hence 
      sharing the same RTP timestamp value) MAY be sent in any order, 
      although, for delay critical systems, they SHOULD be sent in 
      their original coding order to minimize the delay.  Note that the 
      coding order is not necessarily the scan order, but the order the 
      NAL packets become available to the RTP stack.  
  
    o Both MTAPs and STAPs MAY be used. 
     
    o SEI packets MAY be sent anytime. 
     
    o PSIs MUST NOT be sent in an RTP session whose Parameter Sets were 
      already changed by control protocol messages during the lifetime 
      of the RTP session.  If PSIs are allowed by this condition, they 
      MAY be sent at any time. 
     
    o All NALU types MAY be mixed freely, provided that the above 
      rules are obeyed.  In particular, it is allowed to mix slices in 
      data-partitioned and single-slice mode. 
     
    o Network elements MAY convert multiple RTP packets carrying 
      individual NALUs into one aggregated RTP packet, convert an 
      aggregated RTP packet into several RTP packets carrying 
      individual NALUs, or mix both concepts.  However, when doing so 
      they SHOULD take into account at least the following parameters: 
      path MTU size, unequal protection mechanisms (e.g. through packet 
      duplication, packet-based FEC carried by RFC2198, especially for 
      header and Type A Data Partitioning packets), bearable latency of 
      the system, and buffering capabilities of the receiver. 
     
    o NALUs of all types MAY be conveyed as aggregation units of an 
      STAP or MTAP rather than individual RTP packets.  Special care 
      SHOULD be taken (particularly in gateways) to avoid more than a 
      single copy of identical NALUs in a single STAP/MTAP in order to 
      avoid unnecessary data transfers without any improvements of QoS. 
     
     
 6.2. Restricted Mode (Single Picture Model) 
     
    This mode MUST be supported by all receivers.  It is primarily 
    intended for low delay applications.  Its main difference from the 
    Unrestricted Mode is that it forbids the packetization of data 
    belonging to more than one picture in a single RTP packet.  Hence, 
    MTAPs MUST 
    NOT be used.  The following packetization rules MUST be enforced by 
    the sender: 
     
    o All rules of the Unrestricted Mode above apply, with the 
      following restriction. 

    o Only STAPs MAY be used; MTAPs MUST NOT be used.  This implies 
      that aggregated packets MUST NOT include slices or data 
      partitions belonging to different pictures. 
     
 7. De-Packetization Process 
  
    The de-packetization process is implementation dependent.  Hence, 
    the following description should be seen as an example of a 
    suitable implementation.  Other schemes MAY be used as well.  
    Optimizations relative to the described algorithms are likely to 
    be possible. 
     
    The general concept behind these de-packetization rules is to 
    collect all packets belonging to a picture, bring them into a 
    reasonable order, discard anything that is unusable, and pass the 
    rest to the decoder.  Aggregation packets are handled by unloading 
    their payload into individual RTP packets carrying NALUs.  Those 
    NALUs are processed as if they were received in separate RTP 
    packets, in the order they were arranged in the Aggregation Packet. 
     
    The following de-packetization rules MAY be used to implement an 
    operational JVT de-packetizer: 
     
    o NALUs are presented to the JVT decoder in the order of the  
      RTP sequence number. 
     
    o NALUs carried in an Aggregation Packet are presented in their 
      order in the Aggregation packet.  All NALUs of the Aggregation 
      packet are processed before the next RTP packet is processed.  
     
    o Intelligent RTP receivers (e.g. in Gateways) MAY identify lost  
      DPAs. If a lost DPA is found, the Gateway MAY decide not to send 
      the DPB and DPC partitions, as their information is meaningless 
      for the JVT Decoder.  In this way a network element can reduce 
      network load by discarding useless packets, without parsing a 
      complex bit stream. 
     
    o Intelligent receivers MAY discard all packets that have the 
      Disposable Flag set.  However, they SHOULD process those packets 
      if possible, because the user experience may suffer if the 
      packets are discarded.  
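
    The first two rules above can be sketched as follows.  This is one 
    possible receiver, not a required design.  It handles only STAPs, 
    using the STAP Type value listed in section 4.3.  It assumes that 
    the 16-bit size field is in network byte order and that the 
    sequence numbers of the packets passed in do not wrap around.  The 
    function name is hypothetical. 

```python
import struct

STAP_TYPE = 0x18  # Type value listed for STAP in section 4.3


def depacketize(packets):
    """Yield NALUs in RTP sequence number order.

    packets: iterable of (sequence_number, rtp_payload) tuples for
    one burst; sequence number wrap-around is not handled here.
    """
    for _seq, payload in sorted(packets, key=lambda p: p[0]):
        if not payload:
            continue                      # nothing usable
        if payload[0] & 0x1F != STAP_TYPE:
            yield bytes(payload)          # single NALU packet
            continue
        pos = 1                           # skip the STAP header octet
        while pos + 2 <= len(payload):
            (size,) = struct.unpack_from("!H", payload, pos)
            pos += 2
            if size == 0 or pos + size > len(payload):
                break                     # discard the damaged rest
            yield bytes(payload[pos:pos + size])
            pos += size
```

    All NALUs of an STAP are emitted before the payload of the next 
    RTP packet is examined, as required by the second rule. 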
     
     
 8. MIME Considerations 
  
    This section is to be completed later.   
     
     
 9. Security Considerations 
  
    So far, no security considerations beyond those of RFC1889 have 
    been identified. 
     
    Currently, the JVT CD does not allow carrying any type of active 
    payload.  However, the inclusion of a "user data" mechanism is 
    under consideration, which could potentially be used for mechanisms 
    such as remote software updates of the video decoder and similar 
    tasks.  
     
     
 10. Informative Appendix: Application Examples 
  
    This payload specification is very flexible in its use, to cover 
    the extremely wide application space that is anticipated for the 
    JVT codec.  However, such a great flexibility also makes it 
    difficult for an implementer to decide on a reasonable 
    packetization scheme.  Some information on how to apply this 
    specification to real-world scenarios is likely to appear in the 
    form of academic publications and a Test Model in the near future.  
    However, some preliminary usage scenarios are described here 
    as well.   
     
     
 10.1. Video Telephony, no Data Partitioning, no packet aggregation 
  
    The RTP part of this scheme is implemented and tested (though not 
    the control-protocol part, see below). 
     
    In most real-world video telephony applications, the picture 
    parameters such as picture size or optional modes never change 
    during the lifetime of a connection.  Hence, all necessary 
    Parameter Sets (usually only one) are sent as a side effect of the 
    capability exchange/announcement process.  An example of such a 
    capability exchange with an SDP-like syntax can be found in [9], 
    but other schemes such as ASN.1 are possible as well.  Since all 
    necessary Parameter Set information is established before the RTP 
    session starts, there is no need for sending any PSIs.  Data 
    Partitioning is not used either.  Hence, the RTP packet stream 
    consists basically of NALUs that carry single slices of video 
    information. 
     
    The size of those single-slice NALUs is chosen by the encoder such 
    that they offer the best performance.  Often, this is done by 
    adapting the coded slice size to the MTU size of the IP network.  
    For small picture sizes this may result in a one-picture-per-one-
    packet strategy.  The loss of packets and the resulting drift-
    related artifacts are cleaned up by Intra refresh algorithms. 
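
    As a worked example of adapting the slice size to the MTU, the 
    byte budget for one single-slice NALU can be estimated as below.  
    The sketch assumes IPv4 without options, UDP, and the fixed RTP 
    header without CSRC entries; the names are hypothetical. 

```python
IPV4_HEADER = 20   # bytes, without options
UDP_HEADER = 8
RTP_HEADER = 12    # fixed header, no CSRC entries


def max_nalu_size(mtu):
    """Largest NALU (including its type octet) that fits one packet."""
    budget = mtu - IPV4_HEADER - UDP_HEADER - RTP_HEADER
    if budget < 1:
        raise ValueError("MTU too small for IP/UDP/RTP headers")
    return budget
```

    For an Ethernet MTU of 1500 bytes this leaves 1460 bytes for the 
    NALU, which an encoder could use as its target slice size. 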
     
     
 10.2. Video Telephony, Interleaved Packetization using Packet Aggregation 
  
    This scheme allows better error concealment and is widely used in 
    H.263-based designs using RFC2429 packetization.  It has also been 
    implemented, and good results were reported [5].  
     
    The source picture is coded by the VCL such that all MBs of one MB 
    line are assigned to one slice.  All slices with even MB row 
    addresses are combined into one STAP, and all slices with odd MB 
    row addresses into another STAP.  Those STAPs are transmitted as 
    RTP packets.  The establishment of the Parameter Sets is performed 
    as discussed above. 
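
    The even/odd split described above can be sketched as follows.  
    The sketch assumes that the slices are available as a list indexed 
    by macroblock row; the function name is hypothetical. 

```python
def interleave_rows(row_slices):
    """Split per-row slices into the even-row and odd-row groups.

    row_slices[i] is the coded slice (NALU) for macroblock row i.
    Each returned group is meant to be carried in one STAP.
    """
    even = row_slices[0::2]
    odd = row_slices[1::2]
    return even, odd
```

    For a CIF picture with 18 macroblock rows, each of the two STAPs 
    carries 9 slices.  If one STAP is lost, every missing row still 
    has received neighbours above or below it for concealment. 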
     
    Note that the use of STAPs is essential here, because the high 
    number of individual slices (18 for a CIF picture) would lead to 
    unacceptably high IP/UDP/RTP header overhead (unless the source 
    coding tool FMO is used, which is not assumed in this scenario).  
    Furthermore, some wireless video transmission systems, such as 
    H.324M and the IP-based video telephony specified in 3GPP, are 
    likely to use relatively small transport packet sizes.  For example, 
    a typical MTU size of an H.223 AL3 SDU is around 100 bytes [10].  
    Coding individual slices according to this packetization scheme 
    provides a further advantage in communication between wired and 
    wireless networks, as individual slices are likely to be smaller 
    than the preferred maximum packet size of wireless systems.  
    Consequently, a gateway can convert the STAPs used in a wired 
    network to several RTP packets with only one NALU that are 
    preferred in a wireless network and vice versa.  
     
     
 10.3. Video Telephony, with Data Partitioning 
  
    This scheme is implemented and was shown to offer good performance 
    especially at higher packet loss rates [5]. 
    Data Partitioning is known to be useful only when some form of 
    unequal error protection is available.  Normally, in single-session 
    RTP environments, even error characteristics are assumed -- 
    statistically, the packet loss probability of all packets of the 
    session is the same.  However, there are means to reduce the packet 
    loss probability of individual packets in an RTP session.  One 
    simple way is known as Packet Duplication: simply send the to-be-
    protected packet twice, with the same sequence number.  If both 
    packets survive, the receiver will assume a packet duplication by 
    UDP and discard one of the two packets.  Other means of unequal 
    protection within the same RTP session include the use of RFC 2198 
    [11] (for this application it is essentially a packet duplication 
    process as well, with some saved bytes for the second RTP header), 
    or packet-based Forward Error Correction [12] carried in RFC2198. 
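
    Packet Duplication as described above can be sketched as follows.
    This is an illustration under simplifying assumptions, not the
    implemented software: the function names are hypothetical,
    sequence numbers start at zero instead of a random value, and the
    receiver remembers every sequence number it has seen rather than
    using a bounded window.

```python
def send_with_duplication(packets, is_protected):
    """Yield (seq, payload) pairs; protected packets are yielded twice.

    Both copies carry the same 16-bit sequence number.
    """
    for seq, payload in enumerate(packets):
        yield seq & 0xFFFF, payload
        if is_protected(payload):
            yield seq & 0xFFFF, payload


def drop_duplicates(received):
    """Keep the first copy of each sequence number, discard the rest."""
    seen = set()
    delivered = []
    for seq, payload in received:
        if seq in seen:
            continue
        seen.add(seq)
        delivered.append((seq, payload))
    return delivered


# Hypothetical payloads; only the DPA partitions are protected.
packets = [b"DPA0", b"DPB0", b"DPC0", b"DPA1"]
sent = list(send_with_duplication(packets, lambda p: p.startswith(b"DPA")))

# Simulate the loss of the first copy of the first packet.
delivered = drop_duplicates(sent[1:])
```

    In this example the first DPA packet still reaches the decoder
    through its second copy, and the surplus copy of the last DPA
    packet is discarded.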
     
    The implemented software uses the simple packet duplication process 
    to increase the delivery probability of all DPA NALUs.  The 
    incurred overhead is substantial, but of the same order of 
    magnitude as the number of bits that would otherwise have to be 
    spent on intra information.  However, this mechanism does not add 
    any delay to the system.   
     
    Again, the complete Parameter Set establishment is performed 
    through control protocol means. 
     
     
 10.4. MPEG-2 Transport to RTP Gateway 
  
    This example has not been implemented completely, but the basic 
    mechanisms are part of the interim file format the JVT group uses 
    and are, hence, well tested.   
     
    When using JVT video in satellite/cable broadcast environments, 
    there is no control protocol available that can be used for the 
    transmission of Parameter Sets.  Furthermore, a receiver has to be 
    able to "tune" into an ongoing packet stream at any time, without 
    much delay or artifacts.  For this reason, PSIs that contain all 
    Parameter Set information are included in the packet stream at 
    every Instantaneous Decoder Refresh Point (which is similar to a 
    Key Frame in earlier coding standards).  IDERP packets are used to 
    signal these "key frames" so that a decoder can easily determine 
    where to start its decoding process. 
     
    Since the byte stream format used in satellite/cable broadcast 
    environments does not include timing information in the video 
    stream, the gateway needs to use external timing information (e.g. 
    from the MPEG-2 system layer) to generate the RTP timestamp.  
    Please note that this timestamp is also a 90 kHz clock -- hence, in 
    most cases, the conversion should be relatively simple. 
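
    A minimal sketch of such a conversion is given below.  It assumes
    that the external timing information is the 33-bit, 90 kHz
    presentation timestamp (PTS) of the MPEG-2 system layer and that
    the gateway has chosen an initial RTP timestamp offset; the
    function name and parameters are hypothetical.

```python
PTS_MODULUS = 1 << 33  # MPEG-2 PTS is a 33-bit counter at 90 kHz
RTP_MODULUS = 1 << 32  # RTP timestamps are 32 bits wide


def pts_to_rtp_timestamp(pts, first_pts, rtp_offset):
    """Map an MPEG-2 PTS to an RTP timestamp on the same 90 kHz clock.

    `first_pts` is the PTS of the first forwarded picture and
    `rtp_offset` the RTP timestamp assigned to it.  Wrap-around of
    either counter is handled by modular arithmetic.
    """
    elapsed = (pts - first_pts) % PTS_MODULUS
    return (rtp_offset + elapsed) % RTP_MODULUS


# The PTS wraps between the two pictures, and so does the RTP timestamp.
first = pts_to_rtp_timestamp(8589934000, 8589934000, 0xFFFFFF00)
later = pts_to_rtp_timestamp(3008, 8589934000, 0xFFFFFF00)
```

    Because both clocks run at 90 kHz, no rate conversion is needed;
    only the differing counter widths and the offset have to be
    handled.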
     
    The simplest possible MPEG-2 transport to RTP gateway could take 
    the NALUs as they come from the MPEG-2 transport stream (after 
    de-framing), and send them, each NALU in one RTP packet, with 
    increasing RTP sequence numbers.  However, any non-negligible 
    packet loss rate would lead to very poor performance of such a 
    system.  A Gateway could instead use the protection mechanisms 
    discussed above to unequally protect the most important packets, 
    e.g. all PSIs (very strong protection) and IDERPs (weak 
    protection), and transmit everything else best effort.  The 
    Gateway can do this without parsing the bit stream, by simply 
    using the NALU type byte.  A more sophisticated Gateway may be 
    able to combine some small NALUs into a big STAP or MTAP in order 
    to save the bytes used for the IP/UDP/RTP headers. 
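
    The unequal protection policy of such a Gateway can be sketched
    as follows.  The symbolic NALU types and the repetition counts are
    assumptions chosen for illustration; the actual type values are
    those of the NALU type byte, and the text above only ranks the
    protection strengths.  Protection is modelled as packet
    duplication with identical sequence numbers.

```python
# Hypothetical symbolic NALU types; a real Gateway would derive the
# class from the NALU type byte.
PSI, IDERP, OTHER = "PSI", "IDERP", "OTHER"

# Assumed policy: how many times a NALU of each class is transmitted.
COPIES = {PSI: 3, IDERP: 2}


def schedule(nalus):
    """Return the (seq, payload) pairs to transmit, in sending order.

    `nalus` is a list of (nalu_type, payload) pairs.  All copies of a
    NALU share one sequence number; unlisted types are sent once.
    """
    out = []
    for seq, (nalu_type, payload) in enumerate(nalus):
        copies = COPIES.get(nalu_type, 1)
        out.extend([(seq & 0xFFFF, payload)] * copies)
    return out


stream = [(PSI, b"parameter sets"), (IDERP, b"idr"), (OTHER, b"slice")]
to_send = schedule(stream)
```

    The classification uses only the NALU type, so no parsing of the
    NALU payload is involved.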
     
    A similar mechanism is, of course, also possible in H.320 to RTP 
    gateways.  Here, however, the system environment does not include 
    any timing information, and exact presentation timing is carried in 
    the form of SEIs.  Hence, in the H.320 to IP data path, the gateway 
    has the additional duty to filter out SEIs containing timing 
    information and to set the RTP timestamp of the following video 
    packets accordingly.  In the reverse direction, SEIs need to be 
    generated using the RTP timestamp as a guideline. 
     
     
 10.5. Low-Bit-Rate Streaming 
  
    This scheme has been implemented with H.263 and gave good results 
    [13].  There is no technical reason why similarly good results 
    could not be achieved using the JVT codec.  
     
    In today's Internet streaming, some of the offered bit-rates are 
    relatively low in order to allow terminals with dial-up modems to 
    access the content.  In wired IP networks, relatively large 
    packets, say 500 - 1500 bytes, are preferred to smaller and more 
    frequently occurring packets in order to reduce network  
    congestion.  Moreover, the use of large packets decreases the 
    amount of RTP/UDP/IP header overhead.  For low-bit-rate video, the 
    use of large packets means that sometimes several pictures have to 
    be encapsulated in one packet.  
     
    However, loss of such a packet would have drastic consequences in 
    visual quality, as there is practically no other way to conceal a 
    loss of an entire picture than to repeat the previous one.  One way 
    to construct relatively large packets and maintain possibilities 
    for successful loss concealment is to construct MTAPs that contain 
    slices from several pictures in an interleaved manner.  An MTAP 
    should not contain spatially adjacent slices from the same picture 
    or spatially overlapping slices from any picture.  If a packet is 
    lost, it is likely that a lost slice is surrounded by spatially 
    adjacent slices of the same picture and spatially corresponding 
    slices of the temporally previous and succeeding pictures. 
    Consequently, concealment of the lost slice is likely to succeed 
    relatively well. 
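
    One possible assignment of slices to MTAPs that satisfies both
    rules is sketched below.  It is an assumption made for
    illustration and not necessarily the scheme evaluated in [13]:
    every picture is taken to have the same number of slices, each
    covering the same area in every picture, and packet j carries the
    slice at position pos from picture (j + pos) mod N, where N is
    the number of interleaved pictures.

```python
def interleave_slices(num_pictures, slices_per_picture):
    """Return one list of (picture, position) pairs per packet."""
    if num_pictures < 2:
        raise ValueError("interleaving needs at least two pictures")
    return [
        [((j + pos) % num_pictures, pos) for pos in range(slices_per_picture)]
        for j in range(num_pictures)
    ]


def violates_rules(packet):
    """Check a packet against the two placement rules stated above."""
    positions = [pos for _, pos in packet]
    if len(set(positions)) != len(positions):
        return True  # spatially overlapping slices in one packet
    members = set(packet)
    # Spatially adjacent slices of the same picture in one packet.
    return any((pic, pos + 1) in members for pic, pos in packet)


packets = interleave_slices(3, 6)
```

    With three pictures of six slices each, a lost packet removes two
    non-adjacent slices from each picture, and the co-located slices
    of the neighboring pictures travel in the other two packets.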
     
     
 11. Open Issues 
    There are several open issues on which the authors would like to 
    receive opinions.  They are listed below. 
     
    We now have five xTAPs, with 0, 8, 16, 24, and 32 bit timestamp 
    offsets per aggregation unit.  This is in response to the last AVT 
    meeting.  However, neither the 8 bit nor the 32 bit offset makes a 
    lot of sense.  JVT does not allow frame rates that make 8 bit 
    offsets useful, and a 32 bit offset at 90 kHz is only necessary for 
    frame intervals longer than 186 seconds.  Hence, we believe we 
    should remove the 8 bit and 32 bit timestamp offsets to save the 
    two codepoints. 
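
    The ranges behind this argument follow directly from the offset
    width and the 90 kHz clock.  The short calculation below is given
    for convenience; the names are chosen for this example only.

```python
CLOCK_HZ = 90_000  # RTP timestamp clock rate used for video


def max_offset_seconds(bits):
    """Largest interval an unsigned offset of `bits` bits can span."""
    return (1 << bits) / CLOCK_HZ


ranges = {bits: max_offset_seconds(bits) for bits in (8, 16, 24, 32)}
```

    An 8 bit offset spans less than 3 ms, a 16 bit offset about
    0.73 seconds, a 24 bit offset about 186 seconds, and a 32 bit
    offset more than 13 hours.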
     
    Since JVT will likely be approved as the advanced video codec of 
    MPEG-4, it may be desirable to align this payload specification 
    with other payload specifications for MPEG-4.  The authors of this 
    I-D and some authors of the MPEG-4 packetization I-Ds are 
    discussing the issue, and there is a chance that in the future 
    changes to this I-D will be proposed to AVT to reflect the outcome 
    of these discussions. 
     
 12. Full Copyright Statement 
     
    Copyright (C) The Internet Society (2002). All Rights Reserved. 
     
    This document and translations of it may be copied and furnished to 
    others, and derivative works that comment on or otherwise explain 
    it or assist in its implementation may be prepared, copied, 
    published and distributed, in whole or in part, without 
    restriction of any kind, provided that the above copyright notice 
    and this paragraph are included on all such copies and derivative 
    works. 
     
    However, this document itself may not be modified in any way, such 
    as by removing the copyright notice or references to the Internet 
    Society or other Internet organizations, except as needed for the 
    purpose of developing Internet standards in which case the 
    procedures for copyrights defined in the Internet Standards process 
    must be followed, or as required to translate it into languages 
    other than English. 
     
    The limited permissions granted above are perpetual and will not be 
    revoked by the Internet Society or its successors or assigns. 
     
    This document and the information contained herein is provided on 
    an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET 
    ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR 
    IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 
    THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 
    WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 
     
     
 13. Bibliography 
                      
    [1]  P. Borgwardt, "Handling Interlaced Video in H.26L", VCEG-
         N57r2, available from ftp://standard.pictel.com/video-
         site/0109_San/VCEG-N57r2.doc, September 2001 
    [2]  JVT Joint Final Committee Draft, available from  
    [3]  ITU-T Recommendation H.263-2000 
    [4]  ISO/IEC IS 14496-1 
    [5]  S. Wenger, "H.26L over IP", IEEE Transactions on Circuits and 
         Systems for Video Technology, to appear (April 2002) 
    [6]  S. Wenger, "H.26L over IP: The IP Network Adaptation Layer", 
         Proceedings Packet Video Workshop 02, April 2002, to appear. 
    [7]  C. Bormann et al., "RTP Payload Format for the 1998 Version of 
         ITU-T Rec. H.263 Video (H.263+)", RFC 2429, October 1998 
    [8]  ISO/IEC IS 14496-2 
    [9]  S. Wenger, T. Stockhammer, "H.26L over IP and H.324 
         Framework", VCEG-N52, available from 
         ftp://standard.pictel.com/video-site/0109_San/VCEG-N52.doc, 
         September 2001 
    [10] ITU-T Recommendation H.223 (1999) 
    [11] C. Perkins et al., "RTP Payload for Redundant Audio Data", 
         RFC 2198, September 1997 
    [12] J. Rosenberg, H. Schulzrinne, "An RTP Payload Format for 
         Generic Forward Error Correction", RFC 2733, December 1999  
    [13] V. Varsa, M. Karczewicz, "Slice interleaving in compressed 
         video packetization", Packet Video Workshop 2000 
    [14] T. Stockhammer, M. M. Hannuksela, and S. Wenger, "H.26L/JVT 
         Coding Network Abstraction Layer and IP-based Transport" in 
         Proc. ICIP 2002, Rochester, NY, September 2002. 
     
     
     
     
    Authors' Addresses 
     
    Stephan Wenger                     Phone: +49-172-300-0813 
    TU Berlin / Teles AG               Email: stewe@cs.tu-berlin.de 
    Franklinstr. 28-29 
    D-10587 Berlin 
    Germany 
     
    Thomas Stockhammer                 Phone: +49-89-28923474 
    Institute for Communications Eng.  Email: stockhammer@ei.tum.de 
    Munich University of Technology 
    D-80290 Munich 
    Germany 
     
    Miska M. Hannuksela                Phone: +358 40 5212845 
    Nokia Corporation                  Email: miska.hannuksela@nokia.com 
    P.O. Box 68 
    33721 Tampere 
    Finland   