Audio/Video Transport WG                                  Ari Lakaniemi 
Internet Draft                                              Ye-Kui Wang 
Intended status: Standards track                                  Nokia  
Expires: March 2009                                  September 28, 2008 
                                    
 
                 RTP payload format for G.718 speech/audio  
                    draft-lakaniemi-avt-rtp-evbr-03.txt 


Status of this Memo 

   By submitting this Internet-Draft, each author represents that any 
   applicable patent or other IPR claims of which he or she is aware 
   have been or will be disclosed, and any of which he or she becomes 
   aware will be disclosed, in accordance with Section 6 of BCP 79. 

   Internet-Drafts are working documents of the Internet Engineering 
   Task Force (IETF), its areas, and its working groups.  Note that 
   other groups may also distribute working documents as Internet-
   Drafts. 

   Internet-Drafts are draft documents valid for a maximum of six months 
   and may be updated, replaced, or obsoleted by other documents at any 
   time.  It is inappropriate to use Internet-Drafts as reference 
   material or to cite them other than as "work in progress." 

   The list of current Internet-Drafts can be accessed at 
   http://www.ietf.org/ietf/1id-abstracts.txt 

   The list of Internet-Draft Shadow Directories can be accessed at 
   http://www.ietf.org/shadow.html 

   This Internet-Draft will expire on March 28, 2009. 

Copyright Notice 

   Copyright (C) The IETF Trust (2008). 

Abstract 

   This document specifies the Real-Time Transport Protocol (RTP) 
   payload format for the Embedded Variable Bit-Rate (EV-VBR) 
   speech/audio codec, specified in ITU-T G.718. A media type 
   registration for this RTP payload format is also included. 


Lakaniemi, Wang         Expires March 28, 2009                 [Page 1] 

Internet-Draft    RTP payload for G.718 speech/audio     September 2008 
    

Conventions used in this document 

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 
   document are to be interpreted as described in RFC 2119 [RFC2119]. 

Table of Contents 

    
   1. Introduction...................................................3 
   2. Background.....................................................3 
      2.1. The EV-VBR codec..........................................3 
      2.2. Benefits of layered design................................5 
      2.3. Transmitting layered data.................................5 
      2.4. Scaling scenarios & rate control..........................6 
   3. EV-VBR RTP payload format......................................7 
      3.1. Payload Structure.........................................7 
         3.1.1. Payload Header.......................................7 
         3.1.2. EV-VBR transport blocks..............................8 
      3.2. Handling the Encoded data................................11 
      3.3. EV-VBR scaling...........................................13 
      3.4. CRC verification.........................................14 
      3.5. EV-VBR session...........................................14 
      3.6. Cross-stream/cross-layer timing synchronization..........14 
      3.7. RTP Header usage.........................................15 
   4. Payload Format Parameters.....................................15 
      4.1. Media Type Registration..................................15 
      4.2. Mapping to SDP Parameters................................17 
      4.3. Offer/answer considerations..............................18 
      4.4. Declarative usage of SDP.................................18 
      4.5. SDP examples.............................................19 
   5. Security Considerations.......................................20 
   6. Congestion control............................................21 
   7. IANA Considerations...........................................22 
   APPENDIX A: Payload examples.....................................23 
      A.1. Simple payload examples..................................23 
         A.1.1. All the layers in the same payload..................23 
         A.1.2. Layers in separate RTP streams......................24 
      A.2. Advanced examples........................................25 
         A.2.1. Different update rate for subset of layers..........25 
         A.2.2. Redundant frames with limited set of layers.........26 
   8. References....................................................28 
      8.1. Normative References.....................................28 
      8.2. Informative References...................................29 
   Author's Addresses...............................................29 
   Intellectual Property Statement..................................30 
   Disclaimer of Validity...........................................30 
   Copyright Statement..............................................30 
   Acknowledgment...................................................30 
   9. Open Issues...................................................31 
   10. Changes Log..................................................32 
    
1. Introduction 

   The International Telecommunication Union (ITU-T) Recommendation 
   G.718 [G.718] specifies the Embedded Variable Bit Rate (EV-VBR) 
   speech/audio codec. This document specifies the Real-time Transport 
   Protocol (RTP) [RFC3550] payload format for this codec. 

2. Background 

2.1. The EV-VBR codec 

   EV-VBR is an embedded variable rate speech codec having a layered 
   design. The bitstream of the EV-VBR core codec consists of a core 
   layer, denoted as L1, and four enhancement layers, denoted as L2-L5. 
   The bit-rates of the EV-VBR core codec range from 8 kbit/s (core 
   layer only) to 32 kbit/s (with all layers up to L5). Furthermore, the 
   EV-VBR codec also supports discontinuous transmission (DTX) and 
   comfort noise generation (CNG) by sending Silence Descriptor (SID) 
   frames during periods of non-active input signal, resulting in a 
   reduced bit-rate. The sampling frequency of the core codec is 16 kHz 
   and the codec operates on 20 ms frames. The EV-VBR codec is also 
   capable of narrowband operation with audio input and/or output at 8 
   kHz sampling frequency.  

   While transmitting/receiving the core layer L1 is enough for 
   successful decoding of the audio content, each of the enhancement 
   layers Ln (n being 2 to 5, inclusive) provides an improvement to 
   reconstructed audio quality. Thus, the core layer ensures the basic 
   communication while the enhancement layers can be used to improve the 
   perceptual quality. Furthermore, enhancement layers are dependent on 
   all the lower layers in the sense that successful decoding of layer 
   Ln also requires all the layers Lm with m<n to be available.  

   The sizes, sampling rates and possible outputs of the EV-VBR core 
   codec layers L1-L5 are summarized in Table 1 below, where the "Bytes" 
   column indicates the number of bytes per encoded data unit for a 
   layer, and NB and WB denote narrowband and wideband, respectively. 
   The "Bytes" column in other tables has the same meaning. Note that 
   for layers L1 and L2, the corresponding output may be either NB or 
   WB, depending on the rendering device and the application 
   requirement, regardless of the sampling rate of the encoded data.  

 

                          Table 1: EV-VBR layers 

        Layer   Bytes   Cumulative bit-rate   Sampling rate   Output 
      -----------------------------------------------------------------
         L1       20        8 kbit/s           8 or 16 kHz    NB or WB 
         L2       10       12 kbit/s           8 or 16 kHz    NB or WB 
         L3       10       16 kbit/s           16 kHz         WB 
         L4       20       24 kbit/s           16 kHz         WB 
         L5       20       32 kbit/s           16 kHz         WB 
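
   As a quick cross-check of Table 1, the cumulative bit-rates follow 
   directly from the per-layer byte counts and the 20 ms frame length. 
   The sketch below is illustrative only; the names LAYER_BYTES and 
   cumulative_kbps are hypothetical and not part of this specification. 

```python
# Bytes per EDU for core-codec layers L1..L5, copied from Table 1.
LAYER_BYTES = {1: 20, 2: 10, 3: 10, 4: 20, 5: 20}
FRAME_MS = 20  # the codec operates on 20 ms frames


def cumulative_kbps(highest_layer):
    """Bit-rate in kbit/s when layers L1..highest_layer are all sent."""
    total_bytes = sum(LAYER_BYTES[n] for n in range(1, highest_layer + 1))
    return total_bytes * 8 / FRAME_MS  # bits per millisecond equals kbit/s


# One entry per row of Table 1 (L1, L1-L2, ..., L1-L5).
rates = [cumulative_kbps(n) for n in range(1, 6)]
```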
    

   The EV-VBR codec also includes an operating mode that is compatible 
   with the Adaptive Multi-Rate Wideband (AMR-WB) codec [AMR-WB], for 
   which the RTP payload format is specified in [RFC4867]. In this 
   AMR-WB interoperable mode, layers L1 and L2 are replaced by L1', 
   which consists of AMR-WB encoded data. Furthermore, together with 
   L1', a modified layer L3' is used instead of L3. The usage of layers 
   L4 and L5 is not affected by transmitting AMR-WB data in the lower 
   layers. If layer L3' is present in the encoded bit-stream, the base 
   layer L1' must use AMR-WB mode 2 with the bit-rate of 12.65 kbit/s. 
   Otherwise (the encoded bit-stream contains only the L1' layer), any 
   of the 9 AMR-WB coding modes 0, 1, 2, 3, 4, 5, 6, 7, and 8, 
   corresponding to the bit-rates of 6.60, 8.85, 12.65, 14.25, 15.85, 
   18.25, 19.85, 23.05, and 23.85 kbit/s, respectively, may be in use. 
   Table 2 summarizes the AMR-WB interoperable mode when more than one 
   layer may be present. 

          Table 2: EV-VBR layers in the AMR-WB interoperable mode 

        Layer   Bytes   Cumulative bit-rate   Sampling rate   Output 
      -----------------------------------------------------------------
         L1'      32       12.8 kbit/s           16 kHz          WB 
         L3'       9       16.4 kbit/s           16 kHz          WB 
         L4       20       24.4 kbit/s           16 kHz          WB 
         L5       20       32.4 kbit/s           16 kHz          WB 
    
   Note that the bit-rate for the raw bit-stream of AMR-WB mode 2 is 
   12.65 kbit/s. However, after counting the padding bits that make 
   each encoded data unit byte-aligned, as in the octet-aligned mode 
   specified in [RFC4867], the resulting bit-rate is 12.8 kbit/s.  
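
   The padding arithmetic can be reproduced as follows. This is a 
   minimal sketch that assumes the only overhead is rounding each 20 ms 
   frame up to a whole number of octets; the function and variable 
   names are hypothetical. 

```python
import math

FRAME_MS = 20  # the codec operates on 20 ms frames


def octet_aligned(raw_kbps):
    """Return (octets per frame, resulting kbit/s) after octet padding."""
    bits = round(raw_kbps * FRAME_MS)   # kbit/s * ms = bits per frame
    octets = math.ceil(bits / 8)        # pad up to the next octet boundary
    return octets, octets * 8 / FRAME_MS


# AMR-WB mode 2: 12.65 kbit/s is 253 bits per frame, padded to 32 octets.
l1_octets, l1_kbps = octet_aligned(12.65)

# Cumulative rates for Table 2, using its per-layer byte counts.
table2_bytes = [l1_octets, 9, 20, 20]   # L1', L3', L4, L5
cumulative = []
total = 0
for size in table2_bytes:
    total += size
    cumulative.append(total * 8 / FRAME_MS)
```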

   In the AMR-WB interoperable mode, when the base layer L1' is 
   transported in its own RTP packet stream, the packetisation specified 
   in [RFC4867] MUST be used to enable legacy RFC 4867 receivers to 
   receive the base layer L1'. 

   ITU-T SG16 is currently working on a set of extension layers in order 
   to provide super-wideband (SWB) audio and stereophonic encoding 
   extensions on top of the EV-VBR core codec. Further details and the 
   usage of these layers are TBD. 

   The main application of the EV-VBR codec is telephony. Other expected 
   applications include audio/video conferencing and streaming. 

2.2. Benefits of layered design 

   The layered design enables scalability of the transmitted stream 
   simply by conveying a suitable number of layers. The number of 
   layers used in a session may be selected, for example, based on the 
   capacity of the transmission channel, current transmission 
   conditions, characteristics of the source signal, or available 
   processing capacity.  

   Another obvious benefit of the layered codec design is the 
   possibility to exploit the scalability to support congestion control 
   by transmitting or dropping some of the (higher) enhancement layers 
   in order to alleviate congestion in the network. See section 6 for a 
   more detailed discussion of congestion control.  

   Furthermore, the layered design also implicitly provides the 
   possibility for unequal error detection/protection by employing 
   different levels of protection on the core layer and the enhancement 
   layers.  

2.3. Transmitting layered data 

   In principle there are two basic approaches to carry the data from a 
   layered encoder: 

   1. All the layers are carried within a single RTP session. 

   2. The encoded data is divided over multiple RTP sessions, each 
      session carrying a subset of layers. This is also referred to as 
      Multi-Session Transmission (MST).  

   The first choice is the most efficient in terms of exploitation of 
   transmission bandwidth. Furthermore, using only one packet to carry 
   all encoded data layers of a frame also requires fewer resources from 
   the end-systems (and intermediate systems), since the number of 
   packets is kept at a minimum and only a single RTP packet stream 
   needs to be handled. However, this option requires any intermediate 
   network element performing the scaling operation to be fully 
   media-aware, since removing encoded layers requires modification of 
   the payload. Furthermore, if secure transport is employed, the 
   intermediate network element needs to be within the security context 
   to be able to manipulate the payload meaningfully. This might not be 
   feasible in all systems/scenarios, but some special-purpose devices, 
   such as media gateways in cellular telephone systems, may be able to 
   implement this kind of media-aware functionality. 

   The second alternative, transmitting selected subsets of layers in 
   separate RTP sessions, facilitates simple scalability in intermediate 
   network elements without requiring them to be fully media-aware. 
   One use case of this alternative is layered multicast [McCanne]. On 
   the other hand, this approach introduces separate packet header 
   overhead for each subset of layers in those low-delay application 
   scenarios where aggregation of data from multiple frames is not 
   desirable. In this case, when the size of the encoded data block per 
   single layer is in the range of 10 to 20 bytes, the packetisation may 
   result in a relatively high amount of protocol overhead, which might 
   be expensive on bandwidth-limited links. Another drawback of this 
   approach is the somewhat more complex session setup and the 
   additional complexity associated with handling several concurrent 
   RTP sessions. However, this is a trade-off that enables simple 
   scalability also by intermediate network elements that are not aware 
   of the details of the transmitted media.  
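
   To give a feel for the overhead trade-off, the following sketch 
   compares one frame with layers L1-L3 sent in a single packet against 
   the same frame split over three RTP sessions with one layer each. It 
   assumes IPv4, UDP and RTP fixed headers without header compression 
   (20 + 8 + 12 octets), one frame per packet, and the payload 
   structure defined in section 3.1; all names are hypothetical. 

```python
IP_UDP_RTP = 20 + 8 + 12  # IPv4 + UDP + RTP fixed headers (assumed, uncompressed)
PAYLOAD_HDR = 1           # 8-bit payload CRC (section 3.1.1)
TB_HDR = 1                # L-ID + NF octet of the primary TB (section 3.1.2)
LAYER_BYTES = {1: 20, 2: 10, 3: 10, 4: 20, 5: 20}  # Table 1


def packet_bytes(layers):
    """Octets on the wire for one packet with one frame in one primary TB."""
    media = sum(LAYER_BYTES[n] for n in layers)
    return IP_UDP_RTP + PAYLOAD_HDR + TB_HDR + media


# Alternative 1: layers L1-L3 of a frame in a single packet.
single_session = packet_bytes([1, 2, 3])

# Alternative 2: one RTP session (and thus one packet) per layer.
multi_session = sum(packet_bytes([n]) for n in (1, 2, 3))
```

   Under these assumptions the single-session case needs 82 octets per 
   frame and the three-session case 166 octets per frame, for the same 
   40 octets of encoded data. 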

2.4. Scaling scenarios & rate control 

   In principle there are three different ways to make use of the 
   layered design to control the bandwidth usage: 

   1. A sender decides to change the number of layers it is transmitting 
      (for example due to congestion control constraints) 

   2. A receiver or an intermediate network element instructs a sender 
      to change the number of layers it is transmitting 

   3. An intermediate network element passes forward only a subset of 
      layers it receives 

   The most appropriate mechanism depends on the application and the 
   employed network topology. For example, a point-to-point 
   conversational audio connection can easily introduce rate control by 
   changing the number of transmitted layers, while in a centralized 
   audio/video conferencing scenario the conference server is a more 
   appropriate point to implement the rate control than the 
   transmitting end-point. Please refer to [RFC5117] for an extensive 
   discussion of the different topologies and their implications for 
   the transmission. 

   However, the fundamental difference between these choices is that 
   method 1 does not necessarily need any feedback from the receiver(s), 
   while methods 2 and 3 require a signaling mechanism to support rate 
   control. 

3. EV-VBR RTP payload format 

   The basic EV-VBR source data unit is one layer of an encoded frame. 
   Since the term layer generally refers to a time series of data 
   representing a certain encoding layer, this specification uses the 
   term Encoded Data Unit (EDU) to refer to a single layer of data from 
   a single encoded frame. Thus, each EDU has a (conceptual) frame 
   number indicating its location in encoding/decoding order and a layer 
   number indicating the encoding layer the EDU represents. 

3.1. Payload Structure 

   The EV-VBR payload format consists of a payload header, followed by 
   one or more transport blocks (TB) forming the actual payload data. 

    +-----------------+----------+----------+- /// -+----------+ 
    | Payload header  |  TB(1)   |  TB(2)   |          TB(n)   | 
    +-----------------+----------+----------+- /// -+----------+ 
    

3.1.1. Payload Header 

   The payload header consists of an 8-bit payload CRC checksum: 

    +-+-+-+-+-+-+-+-+ 
    |     CRC       | 
    +-+-+-+-+-+-+-+-+ 
    

   At the transmitting end, the payload checksum is computed over the 
   primary transport block (defined in section 3.1.2) of the payload 
   using the generator polynomial 

      C(z) = z^8 + z^4 + z^3 + z^2 + 1.  

   Subsequent transport blocks are prepared in such a way that the 
   payload checksum is valid for any integer number of contiguous 
   transport blocks starting from the beginning of the primary transport 
   block. 

   At the receiving end, the payload CRC checksum can be used to verify 
   the correct reception of any contiguous subset of transport blocks 
   starting from the beginning of the primary transport block (see 
   section 3.4 for a detailed description). 
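
   A minimal sketch of the checksum computation is given below. This 
   document fixes only the generator polynomial and MSB-first 
   processing; the sketch additionally ASSUMES a zero initial register 
   and a plain polynomial remainder without appended zero bits. The 
   latter choice is made because it lets the Tail rule of section 3.1.2 
   produce a matching running value at every transport block boundary. 
   The name crc8_remainder is hypothetical. 

```python
GENERATOR = 0x11D  # z^8 + z^4 + z^3 + z^2 + 1


def crc8_remainder(data, reg=0):
    """Remainder of the bit string `data` (MSB first) modulo GENERATOR.

    `reg` is the running remainder, so a computation can be continued
    across several buffers. Assumed conventions: zero initial value,
    no bit reflection, no zero augmentation, no final XOR.
    """
    for byte in data:
        for shift in range(7, -1, -1):
            reg = (reg << 1) | ((byte >> shift) & 1)  # shift in next bit
            if reg & 0x100:                           # degree 8 reached
                reg ^= GENERATOR                      # reduce
    return reg
```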
 
 

3.1.2. EV-VBR transport blocks 

   The basic building block of the EV-VBR RTP payload data is an EV-VBR 
   transport block (TB). There are two types of transport blocks: 
   primary transport block and secondary transport block. 

   The structure of the primary transport block is depicted below. 

                      
     0 1 2 3 4 5 6 7  
    +-+-+-+-+-+-+-+-+----------------------------+ 
    |   L-ID    |NF | Encoded data               | 
    +-+-+-+-+-+-+-+-+----------------------------+ 
    
   The structure of the secondary transport block is depicted below. 

                      
     0 1 2 3 4 5 6 7                              0 1 2 3 4 5 6 7 
    +-+-+-+-+-+-+-+-+----------------------------+-+-+-+-+-+-+-+-+ 
    |   L-ID    |NF | Encoded data               |     Tail      | 
    +-+-+-+-+-+-+-+-+----------------------------+-+-+-+-+-+-+-+-+ 
    
   The layer ID (L-ID) and the NF fields form the transport block 
   header. The L-ID field is used to identify the layer structure of the 
   encoded data carried in this EV-VBR transport block, and the NF field 
   indicates the number of encoded frames with this layer structure 
   carried in the Encoded data part following the transport block 
   header. The Tail field of the secondary transport block carries a 
   modified 8-bit CRC checksum computed over the transport block, as 
   specified below. 
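
   The transport block header octet can be packed and parsed as 
   sketched below, with L-ID in the six most significant bits and NF in 
   the two least significant bits, and NF carrying the number of frames 
   minus one as described in the field definitions that follow. The 
   function names are hypothetical. 

```python
def pack_tb_header(l_id, num_frames):
    """Build the TB header octet from an L-ID and a frame count (1..4)."""
    if not 0 <= l_id <= 63:
        raise ValueError("L-ID is a 6-bit field")
    if not 1 <= num_frames <= 4:
        raise ValueError("a transport block carries 1 to 4 frames")
    return (l_id << 2) | (num_frames - 1)  # NF = number of frames - 1


def parse_tb_header(octet):
    """Return (L-ID, number of frames) for a TB header octet."""
    return octet >> 2, (octet & 0x03) + 1
```

   For example, a transport block carrying layers L1-L3 (L-ID=3) of two 
   frames has the header octet 0x0D. 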

             Author's note: For streaming or other applications that 
             allow for relatively long end-to-end delay, sometimes it 
             would be beneficial to aggregate more than 4 frames in one 
             TB. Should the length of NF be larger?  

   An EV-VBR RTP packet payload SHALL include exactly one primary 
   transport block, which MAY be followed by one or more secondary 
   transport blocks. The data fields of both transport block types are 
   described below. 

   L-ID Identification (6 bits) of the encoded data carried in this 
        transport block. Table 3 below specifies the mapping between L-
        ID and the encoded data. 

                                      

                Table 3: Layer identification (L-ID) values  

          L-ID    Encoded data 
        -------------------------------------- 
            0     Empty frame 
            1     L1 
            2     L1-L2 
            3     L1-L3 
            4     L1-L4 
            5     L1-L5 
            6     L2 
            7     L2-L3 
            8     L2-L4 
            9     L2-L5 
           10     L3 
           11     L3-L4 
           12     L3-L5 
           13     L4 
           14     L4-L5 
           15     L5 
           16     L1' 
           17     L1', L3' 
           18     L1', L3', L4 
           19     L1', L3', L4-L5 
           20     EV-VBR SID 
           21     AMR-WB SID 
           22-62  Reserved for stereo and SWB layers 
           63     Time synchronization element (see section 3.6) 
         
             Author's note: The current approach provides maximum 
             flexibility in terms of layer configuration. However, 
             limiting choices would be one way to leave more bits for 
             stereo & SWB layer configurations. 

             Author's note: One suggested way to make sure we do not 
             run out of L-ID values with the extension modes has been 
             to make the mapping between L-ID and layer configuration 
             dynamic (to be specified using SDP in session set-up). 
             While this would provide effective usage of L-ID bits, it 
             would require all elements processing the payload to be 
             signaling-aware. A compromise solution would be to provide 
             static mapping for selected layer configurations and leave 
             'more exotic' cases to be dynamically mapped on session 
             basis. The usage of this type of approach is FFS. 

             Author's note: Yet another possible way is to follow an 
             approach similar to that of the SVC RTP payload format 
             draft, i.e. to signal the bitrate etc. parameters for an 
             operation point, and signal dependency between sessions 
             using the MMUSIC decoding dependency draft. This way 
             should be generic enough and applicable to future versions 
             of scalable codecs. However, the above methods (using 
             detailed layer configuration) may provide more useful 
             information, as the bitrate etc. of each layer is fixed 
             and not as flexible as in SVC. 

             Author's note: Yet another approach is to allocate L-ID 
             values according to the operating mode. For example, the 
             mode with L1 being present and the AMR-WB compatible mode 
             (with L1' being present) would use different value spaces 
             of L-ID.  

   NF   Number of frames in this transport block (2 bits), decreased by 
        one. The number of frames is equal to the value of NF 
        incremented by one. For example, value NF=0 indicates that the 
        transport block carries one frame, and value NF=3 indicates that 
        the transport block carries four frames. If the sender wants to 
        encapsulate more than four frames per payload, several 
        transport blocks need to be used. 

   Encoded data 

        Encoded data consists of EDUs as specified by the values of the 
        L-ID and NF fields, arranged according to the rules given in 
        section 3.2.  When L-ID is equal to 0 (empty frame), the encoded 
        data field is not present (i.e. it consists of zero octets). 

   Tail The 8-bit tail field of the secondary transport block carries a 
        bit field that is needed to modify the partial CRC checksum 
        over the payload data up to the end of this TB to match the 
        payload CRC field value carried in the payload header. 

        In the transmitter the Tail bits for a secondary TB(n) are 
        computed by first computing the CRC checksum CRC(n) over the 
        payload data from the beginning of the primary TB up to the end 
        of TB(n) using the generator polynomial C(z) given above. The 
        bits of the Tail field of TB(n) are set to zero value for the 
        CRC computation. The transmitted value of the Tail field in 
        TB(n) is obtained by bitwise XOR operation between the payload 
        CRC field value carried in the payload header and the CRC(n) 
        computed for TB(n). 
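
   The Tail computation can be sketched as follows. The sketch ASSUMES 
   that the CRC is the plain polynomial remainder (zero initial value, 
   MSB first, no appended zero bits), which is not stated in this 
   document; with that convention, XORing the Tail into the last octet 
   makes the running remainder at the end of every transport block 
   equal to the payload CRC. The names crc8_remainder and build_payload 
   are hypothetical. 

```python
GENERATOR = 0x11D  # z^8 + z^4 + z^3 + z^2 + 1


def crc8_remainder(data, reg=0):
    """Plain remainder of `data` modulo GENERATOR (assumed convention)."""
    for byte in data:
        for shift in range(7, -1, -1):
            reg = (reg << 1) | ((byte >> shift) & 1)
            if reg & 0x100:
                reg ^= GENERATOR
    return reg


def build_payload(primary_tb, secondary_tbs):
    """Assemble payload header, primary TB and secondary TBs with Tails.

    `primary_tb` and each entry of `secondary_tbs` hold the TB header
    octet followed by the encoded data, without any Tail octet.
    """
    payload_crc = crc8_remainder(primary_tb)  # CRC over the primary TB
    body = bytearray(primary_tb)
    for tb in secondary_tbs:
        body += tb
        # CRC(n): from the start of the primary TB to the end of TB(n),
        # with the Tail bits of TB(n) set to zero.
        crc_n = crc8_remainder(body + b"\x00")
        body.append(payload_crc ^ crc_n)      # transmitted Tail value
    return bytes([payload_crc]) + bytes(body)
```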



3.2. Handling the Encoded data 

   In order to provide unique mapping of EDUs to encoded frames, the 
   following rules on sequence of frames and sequence of layers need to 
   be followed when creating a payload: 

   o  The frames within a payload MUST form a set of contiguous frames 
      in decoding order, i.e. if a payload carries frames n and n+N, all 
      frames between n and n+N in decoding order MUST also be present in 
      the payload. 

   o  The layers within a frame MUST form a contiguous set of layers, 
      i.e. if layers Lx and Ly of a frame are included in the payload, 
      all layers between Lx and Ly layers MUST also be present. 

   The EDUs within a transport block are arranged according to the 
   following rules: 

   o  The EDUs within a transport block MUST be arranged in increasing 
      order of layer number 

   o  The EDUs with the same layer number within a transport block MUST 
      be arranged in decoding order 
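
   The resulting layer-major EDU order within a transport block can be 
   sketched as follows for the core codec layers, using the EDU sizes 
   of Table 1. The function name split_edus and the zero-based frame 
   index are hypothetical, and the AMR-WB interoperable and SID cases 
   are not covered. 

```python
LAYER_BYTES = {1: 20, 2: 10, 3: 10, 4: 20, 5: 20}  # Table 1


def split_edus(encoded, layers, num_frames):
    """Split the Encoded data of one TB into {(frame, layer): bytes}.

    `layers` lists the layer numbers signalled by L-ID in increasing
    order; frames are indexed from 0 in decoding order.
    """
    edus = {}
    pos = 0
    for layer in layers:                  # increasing layer number
        for frame in range(num_frames):   # decoding order within a layer
            size = LAYER_BYTES[layer]
            edus[(frame, layer)] = encoded[pos:pos + size]
            pos += size
    if pos != len(encoded):
        raise ValueError("Encoded data length does not match L-ID and NF")
    return edus
```

   For a transport block with L-ID=3 and NF=1 (two frames), this yields 
   the order F1-L1, F2-L1, F1-L2, F2-L2, F1-L3, F2-L3. 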

   Explicit timing information for the transport blocks is not needed, 
   since the ordering of EDUs in the payload and their mapping to 
   transport blocks can be used to implicitly carry this information. 
   The following rules apply: 

   o  If the highest layer carried in transport block k is n, and the 
      lowest layer carried by transport block k+1 is n+1, then the EDUs 
      of transport block k and k+1 belong to the same encoded frame. 
      Furthermore, if transport blocks k and k+1 carry EDUs belonging to 
      the same encoded frame(s), these transport blocks MUST include the 
      same number of EDUs. 

   o  If the highest layer carried in transport block k is n, and the 
      lowest layer carried by transport block k+1 is smaller than or 
      equal to n, the EDUs of transport blocks k and k+1 belong to two 
      separate encoded frames, which are contiguous in decoding order. 

   o  Multiple copies of an EDU MUST NOT be included in the payload. 
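
   The implicit timing rules above can be sketched as a small function 
   that assigns a relative frame index to each transport block. The 
   sketch covers core codec layers only and interprets "the same number 
   of EDUs" as the same number of frames per transport block; both the 
   interpretation and the function name are assumptions. 

```python
def first_frame_of_each_tb(tbs):
    """Return the index of the first frame carried by each TB.

    `tbs` lists (lowest_layer, highest_layer, num_frames) per transport
    block in payload order. Index 0 is the first frame in the payload.
    """
    starts = []
    start = 0
    prev_high = prev_frames = None
    for low, high, frames in tbs:
        if prev_high is not None:
            if low == prev_high + 1:
                # Continuation with higher layers of the same frame(s).
                if frames != prev_frames:
                    raise ValueError("TBs of the same frames differ in size")
            elif low <= prev_high:
                # Layer number restarts: next contiguous frame(s).
                start += prev_frames
            else:
                raise ValueError("layers within a frame are not contiguous")
        starts.append(start)
        prev_high, prev_frames = high, frames
    return starts
```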

   A set of EDUs can be allocated to transport blocks in several ways. 
   For example, each EDU can be encapsulated in its own transport block, 
   all EDUs can be carried in a single transport block, EDUs belonging 
   to the same encoded frame can be encapsulated in a dedicated 
   transport block, or EDUs representing the same layer can be carried 
   in their own transport blocks. Three examples of this, each with two 
   frames with layers L1-L3, are given below. The first example 
   illustrates the case using a single transport block for the whole 
   payload, while the second example introduces a separate transport 
   block for each of the EDUs. The third example shows an approach where 
   each layer is carried in a dedicated transport block. The notation 
   Fx-Ly is used to denote layer y of frame x. 

    
   Example 1: All EDUs in a single transport block 

     +---------+-----+-------+-------+-------+-------+-------+--------+ 
     | L-ID=3  |NF=1 | F1-L1 | F2-L1 | F1-L2 | F2-L2 | F1-L3 | F2-L3  | 
     +---------+-----+-------+-------+-------+-------+-------+--------+ 
      
             Author's note: Currently, it is mandated that lower layer 
             EDUs of later frames go before higher layer EDUs of 
             earlier frames. This way is friendlier to adaptation 
             (dropping of higher layers). However, if all layers are 
             received, then the depacketizer needs to reorder the EDUs 
             to their decoding order before feeding them to the 
             decoder. Therefore, the other way around (i.e. lower layer 
             EDUs of later frames go after higher layer EDUs of earlier 
             frames, or EDUs in transport blocks are placed in decoding 
             order) is more friendly to the depacketizer. Another 
             benefit of the latter is that it does not introduce any 
             end-to-end delay. Which way is to be specified (or 
             whether both are allowed, if needed) is FFS.  

   Example 2: All EDUs in separate transport blocks 

     +---------+-----+-------+---------+-----+-------+ 
     | L-ID=1  |NF=0 | F1-L1 | L-ID=1  |NF=0 | F2-L1 | 
     +---------+-----+-------+---------+-----+-------+ 
     | L-ID=6  |NF=0 | F1-L2 | L-ID=6  |NF=0 | F2-L2 | 
     +---------+-----+-------+---------+-----+-------+ 
     | L-ID=10 |NF=0 | F1-L3 | L-ID=10 |NF=0 | F2-L3 | 
     +---------+-----+-------+---------+-----+-------+ 
      
   Example 3: Dedicated transport block for EDUs of each layer 

     +---------+-----+-------+-------+---------+-----+-------+-------+ 
     | L-ID=1  |NF=1 | F1-L1 | F2-L1 | L-ID=6  |NF=1 | F1-L2 | F2-L2 | 
     +---------+-----+-------+-------+---------+-----+-------+-------+ 
     | L-ID=10 |NF=1 | F1-L3 | F2-L3 | 
     +---------+-----+-------+-------+ 
      
   While the first example, carrying data from all layers in the same 
   transport block, obviously consumes less bandwidth, the second 
   example, using a separate transport block for each EDU, and the third 
   example, using a dedicated transport block for each layer, make 
   scaling simple: in the first case the removal of e.g. layer L3 (from 
   each frame in the payload) would require changing the value of the 
   L-ID in addition to removing the corresponding EDU(s), whereas in the 
   second and third options it is enough to remove all transport blocks 
   carrying L3 data, and the remaining part of the payload can be left 
   untouched (however, the packet size information in high-layer 
   protocol headers needs to change). 

3.3. EV-VBR scaling 

   Some media-aware network elements (MANEs) MAY modify the EV-VBR 
   bitstream by dropping some of the layers in case congestion control 
   or e.g. access link bandwidth requires such scaling to take place. 
   Such MANEs are RTP translators (with the topology Topo-Translator as 
   described in [RFC5117]), for which the rules for RTP translators 
   specified in [RFC3550] apply.  

   A payload can be either completely dropped or some of the transport 
   blocks it carries can be discarded. In case full payloads are dropped 
   to implement scaling, a packet containing the core layer L1 SHOULD 
   NOT be discarded, since the decoding of higher layers of the same 
   encoded frame is not possible without the core layer data being 
   available. This means that payloads with L-ID values from 1 to 5, 
   inclusive, and from 16 to 19, inclusive, SHOULD NOT be completely 
   discarded.  

             Author's note: To be checked whether the case of dropping 
             a subset of the transport blocks in one packet also 
             strictly follows the topology Topo-Translator.  

   In case the payload is forwarded with modified content, at least the 
   primary transport block MUST be preserved in the payload, while some 
   of the secondary transport blocks at the end of the payload MAY be 
   discarded. 
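
   The rule above can be sketched as follows. This is only an 
   illustration: transport blocks are modelled as opaque byte strings, 
   and the function name and its arguments are hypothetical. The sketch 
   does not touch the payload header or any packet size fields. 

```python
from typing import List, Sequence


def thin_payload(transport_blocks: Sequence[bytes], drop_count: int) -> List[bytes]:
    """Drop up to `drop_count` secondary transport blocks from the end.

    transport_blocks[0] is treated as the primary transport block and is
    always preserved.
    """
    if not transport_blocks:
        raise ValueError("a payload carries at least the primary transport block")
    if drop_count < 0:
        raise ValueError("drop_count must not be negative")
    # Never remove the primary transport block at index 0.
    keep = max(1, len(transport_blocks) - drop_count)
    return list(transport_blocks[:keep])
```

   Blocks are removed only from the end of the payload, so the 
   remaining transport blocks keep their original order. 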



3.4. CRC verification 

   At the receiving end, the CRC computation starts from the beginning 
   of the primary TB, i.e. from the MSB of the first octet of TB(1), 
   and continues until the end of the payload data or until an 
   erroneous TB is encountered. At the end of each TB a check MAY be 
   performed: if the CRC value at the end of TB(n) matches the payload 
   CRC value received in the payload header, the verification is 
   successful and the data up to TB(n) is valid. If the CRC value at 
   the end of TB(n) does not match the payload CRC value received in 
   the payload header, there is an error in TB(n) and it MUST be 
   discarded as corrupted. Furthermore, if the verification indicates 
   a corrupted TB(n), all subsequent transport blocks TB(m) with m>n 
   MUST also be discarded. 

3.5. EV-VBR session 

   An EV-VBR session consists of one or several RTP sessions carrying 
   encoded EV-VBR data according to the payload format specified in 
   section 3.2. 

3.6. Cross-stream/cross-layer timing synchronization 

   In case an EV-VBR session consists of multiple RTP sessions, the RTP 
   packets transmitted on separate RTP sessions need to be synchronized 
   in order to enable reconstruction of the frames in the receiving end. 
   Since each of the RTP sessions uses its own random initial value for 
   the RTP timestamp, there is also a random offset between the RTP 
   timestamp values carrying the EDUs belonging to the same encoded 
   frame in different RTP sessions. 

   The receiver SHOULD use the traditional RTCP based mechanism to 
   synchronize streams by using the RTP and NTP timestamps of the RTCP 
   Sender Reports (SR) it receives.  
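
   A minimal sketch of this mapping is given below, assuming that the 
   NTP timestamp of each SR has already been converted to seconds. The 
   class and function names are hypothetical. EDUs from different RTP 
   sessions belong to the same frame when their mapped times agree. 

```python
from dataclasses import dataclass

CLOCK_RATE_HZ = 32000      # RTP timestamp clock rate used by this payload format
RTP_TS_MODULUS = 1 << 32   # RTP timestamps are 32 bits wide and wrap around


@dataclass(frozen=True)
class SenderReport:
    """The two Sender Report fields used for synchronization."""
    ntp_seconds: float   # NTP timestamp of the SR, converted to seconds
    rtp_timestamp: int   # RTP timestamp carried in the same SR


def to_wallclock(rtp_timestamp: int, sr: SenderReport) -> float:
    """Map an RTP timestamp onto the sender's NTP timeline, in seconds."""
    # Signed 32-bit difference, so that packets sent shortly before the SR
    # and timestamp wrap-around are both handled.
    diff = (rtp_timestamp - sr.rtp_timestamp) % RTP_TS_MODULUS
    if diff >= RTP_TS_MODULUS // 2:
        diff -= RTP_TS_MODULUS
    return sr.ntp_seconds + diff / CLOCK_RATE_HZ
```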

             Author's note: The above approach for cross-session 
             synchronization is not possible until the first RTCP SRs 
             are received in all sessions. This implies that decoding 
             only a subset of layers may be possible until RTCP SRs in 
             all sessions have been received. This may impose higher 
             end-to-end delay or higher bandwidth for RTCP data, and 
             the approach may not work perfectly for some multicast 
             topologies. There is a study ongoing by some AVT members. 
             Once there is an acceptable solution the draft documenting 
             that solution may be referenced herein.  

 

3.7. RTP Header usage 

   This section specifies the usage of some fields of the RTP header 
   (specified in section 5 of [RFC3550]) with the EV-VBR RTP payload 
   format. 

   In case the EV-VBR session consists of multiple RTP sessions, the RTP 
   sessions are further separated by using different payload type (PT) 
   values for each of the RTP streams. In case all layers are carried 
   within a single RTP session, only one PT is needed. Note that the 
   assignment of the PT number(s) for this payload format is outside 
   the scope of this document. It is expected that the RTP profile 
   under which this payload is used will either assign PT number(s) for 
   this encoding or specify the PT number(s) to be dynamically 
   assigned. 

   The RTP timestamp corresponds to the sampling instant of the first 
   encoded sample of the earliest frame in the payload. The timestamp 
   clock frequency is 32 kHz.  

   The marker bit (M) of each of the RTP streams of the session SHALL be 
   set to value 1 if the payload carries an EDU belonging to the first 
   frame after an inactive period, i.e. an EDU from the first frame of a 
   talkspurt. For all other packets the marker bit is set to value 0. 
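
   The timestamp and marker bit rules can be illustrated with the 
   sketch below. The 20 ms frame length is an assumption taken from 
   the G.718 codec and is not stated in this section; the function 
   names and the frame indexing are hypothetical. 

```python
from typing import Iterable

CLOCK_RATE_HZ = 32000   # timestamp clock frequency of this payload format
FRAME_MS = 20           # assumed codec frame length in milliseconds
TICKS_PER_FRAME = CLOCK_RATE_HZ * FRAME_MS // 1000


def rtp_timestamp(initial_timestamp: int, earliest_frame_index: int) -> int:
    """Timestamp of a payload whose earliest frame has the given index."""
    # RTP timestamps are 32 bits wide, so the value wraps around.
    return (initial_timestamp + earliest_frame_index * TICKS_PER_FRAME) % (1 << 32)


def marker_bit(frame_indices: Iterable[int],
               talkspurt_start_indices: Iterable[int]) -> int:
    """Return 1 if the payload carries an EDU of the first frame of a talkspurt."""
    starts = set(talkspurt_start_indices)
    return 1 if any(index in starts for index in frame_indices) else 0
```

   Under these assumptions each frame advances the timestamp by 640 
   clock ticks. 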

4. Payload Format Parameters 

   This section defines the parameters that may be used to configure 
   optional features in the EV-VBR RTP transmission. 

   The parameters are defined here as part of the media subtype 
   registration for the EV-VBR codec.  Mapping of the parameters into 
   the Session Description Protocol (SDP) [RFC4566] is also provided for 
   those applications that use SDP.  In control protocols that do not 
   use MIME or SDP, the media type parameters must be mapped to the 
   appropriate format used with that control protocol. 

4.1. Media Type Registration 

   This registration is done using the template defined in RFC 4288 
   [RFC4288] and following RFC 4855 [RFC4855]. 

   Type name:  audio 

   Subtype name:  EV-VBR 

   Required parameters:  none 
 
 

   Optional parameters: 

      mode:      This parameter MAY be used to indicate whether the 
                 mode with layer L1 being present or the AMR-WB 
                 compatible mode (with layer L1' being present) is in 
                 use. If this parameter is not present or the value of 
                 this parameter is equal to 0, the mode with layer L1 
                 being present is in use. Otherwise, the AMR-WB 
                 compatible mode is in use. When this parameter is 
                 present, the value MUST be either 0 or 1.  

      Author's note: When the upcoming stereo and SWB options are 
      present, the semantics of this parameter may change. 

      layers:    The numbers of the layers (in the range from 1 to 5, 
                 denoting layers from L1 to L5, respectively) 
                 transmitted in this session, expressed as a 
                 comma-separated list of layer numbers. If the parameter 
                 is present, at least layer L1 or L1' MUST be included 
                 in the list of layers in one of the RTP sessions 
                 included in the EV-VBR session. If the parameter is not 
                 present, all layers up to layer L5 MAY be used in the 
                 session. 

      Author's note: Why not use semantics similar to those of L-ID? 

      ptime:     The recommended length of time (in milliseconds) 
                 represented by the media in a packet.  See Section 6 
                 of [RFC4566]. 

      maxptime:  The maximum length of time (in milliseconds) that can 
                 be encapsulated in a packet.  See Section 6 of 
                 [RFC4566]. 

      Author's note: Some further study is needed to see if separate 
      parameters for sending and receiving capabilities/preferences are 
      needed -- especially for upcoming stereo and SWB options. 

      Author's note: The support for upcoming SWB and stereo options 
      needs to be taken into account. Basically we can either 1) extend 
      the parameter "layers" to cover also this aspect, or 2) define 
      separate parameter(s) for these new options when more details on 
      the stereo/SWB support are available. 

   Encoding considerations:  



     This media type is framed and contains binary data; see Section 4.8 
     of [RFC4288]. 

   Security considerations:  See Section 5 of RFC xxxx 

   Interoperability considerations:  none 

   Published specification:  RFC xxxx 

   Applications which use this media type: 

     For example Voice over IP, audio and video conferencing, audio 
     streaming and voice messaging. 

   Additional information:  none 

   Person & email address to contact for further information: 

     Ari Lakaniemi, ari.lakaniemi@nokia.com 
       
   Intended usage:  COMMON 

   Restrictions on usage: 

     This media type depends on RTP framing, and hence is only defined 
     for transfer via RTP [RFC3550]. 

   Author: 

     Ari Lakaniemi, ari.lakaniemi@nokia.com 
      
   Change controller: 

     IETF Audio/Video Transport working group delegated from the IESG 

 
4.2. Mapping to SDP Parameters 

   The information carried in the media type specification has a 
   specific mapping to fields of the SDP [RFC4566], which is commonly 
   used to describe RTP sessions.  When SDP is used to specify sessions 
   employing the EV-VBR codec, the mapping is as follows: 

   o  The media type ("audio") goes in SDP "m=" as the media name. 



   o  The media subtype ("EV-VBR") goes in SDP "a=rtpmap" as the 
      encoding name.  The RTP clock rate in "a=rtpmap" MUST be 32000 for 
      EV-VBR. 

      Author's note: The current choice for the RTP clock rate is a 
      'placeholder'. The clock rate needs to be set according to SWB 
      sampling rate, which is still T.B.D. Since the core codec employs 
      16000 Hz sampling rate, an integer multiple of 16000 Hz seems to 
      be a preferable choice.  

   o  The parameters "ptime" and "maxptime" go in the SDP "a=ptime" and 
      "a=maxptime" attributes, respectively. 

   o  Any remaining parameters go in the SDP "a=fmtp" attribute by 
      copying them directly from the media type string as a semicolon 
      separated list of parameter=value pairs. 
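
   The mapping above can be sketched as a small helper that emits the 
   SDP lines for one RTP session. The function name, its arguments and 
   the default profile are hypothetical; only the attribute layout 
   follows the list above. 

```python
from typing import Dict, List, Optional


def build_media_description(
    port: int,
    payload_type: int,
    profile: str = "RTP/AVPF",
    ptime: Optional[int] = None,
    maxptime: Optional[int] = None,
    fmtp: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return the SDP lines describing one EV-VBR RTP session."""
    lines = [
        f"m=audio {port} {profile} {payload_type}",
        # The RTP clock rate in a=rtpmap is 32000 for EV-VBR.
        f"a=rtpmap:{payload_type} EV-VBR/32000/1",
    ]
    if fmtp:
        # Remaining parameters: semicolon separated parameter=value pairs.
        pairs = ";".join(f"{name}={value}" for name, value in fmtp.items())
        lines.append(f"a=fmtp:{payload_type} {pairs}")
    if ptime is not None:
        lines.append(f"a=ptime:{ptime}")
    if maxptime is not None:
        lines.append(f"a=maxptime:{maxptime}")
    return lines
```

   For example, calling the helper with port 49120, payload type 97 
   and fmtp={"layers": "1,2"} yields an "m=" line, an "a=rtpmap" line 
   and an "a=fmtp:97 layers=1,2" line. 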

4.3. Offer/answer considerations 

   The following considerations apply when using the SDP offer/answer 
   [RFC3264] mechanism to negotiate the EV-VBR transport. The parameter 
   "layers" MAY be used to indicate, for each RTP session belonging to 
   the current EV-VBR session, the layer configuration that the 
   end-point making the offer is ready to transmit and wishes to 
   receive. 

   o  In case the EV-VBR session consists of a single RTP session, it is 
      RECOMMENDED not to impose any layer restrictions for the session 
      but to use the rate control functionality to set possible 
      restrictions on usage of the higher or highest layers. If the 
      offer includes a layer configuration parameter, the answer MAY use 
      a different configuration, but the highest layer in the answer 
      MUST NOT be higher than the highest layer of the offered 
      configuration. 

      Author's note: Support for answer modifying the layer 
      configuration is FFS. 

   o  In case the EV-VBR session consists of multiple RTP sessions, the 
      answer MUST use the layer configurations provided in the offer 
      for the sessions it accepts.  
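
   For the single RTP session case, the constraint on the highest 
   layer of the answer can be checked as sketched below. The function 
   names are hypothetical, and the sketch only looks at the value of 
   the "layers" parameter. 

```python
from typing import List


def parse_layers(value: str) -> List[int]:
    """Parse a "layers" value such as "1,2,3" into a list of layer numbers."""
    layers = [int(item) for item in value.split(",")]
    if any(layer < 1 or layer > 5 for layer in layers):
        raise ValueError("layer numbers range from 1 to 5")
    return layers


def answer_layers_allowed(offer_value: str, answer_value: str) -> bool:
    """True if the highest answered layer does not exceed the highest offered one."""
    return max(parse_layers(answer_value)) <= max(parse_layers(offer_value))
```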

4.4. Declarative usage of SDP 

   In declarative usage, such as SDP in RTSP [RFC2326] or SAP [RFC2974], 
   the parameter "layers" SHALL be interpreted to provide a set of 
   layers that the sender may use in the session. 



4.5. SDP examples 

   Some example SDP session descriptions utilizing EV-VBR encodings are 
   provided below. 

   The first example illustrates the simple case where the EV-VBR 
   session employing a single RTP session and the AVPF profile is 
   offered, and the answer accepts the offer without any changes. 

   Offer: 

     m=audio 49120 RTP/AVPF 97 
     a=rtpmap:97 EV-VBR/32000/1 
      
   Answer: 

     m=audio 49120 RTP/AVPF 97 
     a=rtpmap:97 EV-VBR/32000/1 
      
   The second example shows a slightly more complex case where the 
   EV-VBR session using a single RTP session and the AVPF profile is 
   offered with a restriction to send and receive only layers L1 and 
   L2. The answer indicates that the other end-point is willing to 
   receive (and send) layers up to L5. 

   Offer: 

     m=audio 49120 RTP/AVPF 97 
     a=rtpmap:97 EV-VBR/32000/1 
     a=fmtp:97 layers=1,2 
      
   Answer: 

     m=audio 49120 RTP/AVPF 97 
     a=rtpmap:97 EV-VBR/32000/1 
     a=fmtp:97 layers=1,2,3,4,5 
      
   The third example shows an EV-VBR session using multiple RTP sessions 
   with the AVPF profile. The answerer wishes to use only layers up to 
   L3. 

   Offer: 



     m=audio 49120 RTP/AVPF 97 
     a=rtpmap:97 EV-VBR/32000/1 
     a=fmtp:97 layers=1,2 
     a=mid:1 
      
     m=audio 49122 RTP/AVPF 98 
     a=rtpmap:98 EV-VBR/32000/1 
     a=fmtp:98 layers=3 
     a=mid:2 
     a=depend:lay 1 
      
     m=audio 49124 RTP/AVPF 99 
     a=rtpmap:99 EV-VBR/32000/1 
     a=fmtp:99 layers=4,5 
     a=mid:3 
     a=depend:lay 1 2 
      
   Answer: 

     m=audio 49120 RTP/AVPF 97 
     a=rtpmap:97 EV-VBR/32000/1 
     a=fmtp:97 layers=1,2 
     a=mid:1 
      
     m=audio 49122 RTP/AVPF 98 
     a=rtpmap:98 EV-VBR/32000/1 
     a=fmtp:98 layers=3 
     a=mid:2 
     a=depend:lay 1 
      
   Note that the dependency signaling according to [smd-sdp] is used in 
   the third example above to indicate the relationship between the 
   layers distributed into separate RTP sessions. 

5. Security Considerations 

   RTP packets using the payload format defined in this specification 
   are subject to the security considerations discussed in the RTP 
   specification [RFC3550], and in any appropriate RTP profile (for 
   example [RFC3551] or [RFC4585]).  This implies that confidentiality 
   of the media streams is achieved by encryption; for example, through 
   the application of SRTP [RFC3711].  Because the data compression used 
   with this payload format is applied end-to-end, any encryption needs 
   to be performed after compression. 

   A potential denial-of-service threat exists for data encodings using 
   compression techniques that have non-uniform receiver-end 
   computational load.  The attacker can inject pathological datagrams 
   into the stream that will increase the processing load of the decoder 
   and may cause the receiver to be overloaded. For example, inserting 
   additional EDUs representing the higher enhancement layers on top of 
   the ones actually transmitted may increase the decoder load. However, 
   the EV-VBR codec is not particularly vulnerable to such an attack, 
   since the majority of the computational load in an EV-VBR session is 
   associated with the encoder.  Another possible form of attack is the 
   forging of codec bit-rate control messages, which may result in the 
   encoder employing a higher number of enhancement layers than 
   originally intended and thereby requiring a larger amount of 
   computational resources. Therefore, the usage of data origin 
   authentication and data integrity protection of at least the RTP 
   packet is RECOMMENDED; for example, with SRTP [RFC3711]. 

   Note that the appropriate mechanism to ensure confidentiality and 
   integrity of RTP packets and their payloads is very dependent on the 
   application and on the transport and signaling protocols employed. 
   Thus, although SRTP is given as an example above, other possible 
   choices exist. 

   Note that end-to-end security with either authentication, integrity 
   or confidentiality protection will prevent a network element not 
   within the security context from performing media-aware operations 
   other than discarding complete packets.  To allow any (media-aware) 
   intermediate network element to perform its operations, it is 
   required to be a trusted entity which is included in the security 
   context establishment. 

6. Congestion control 

   As a scalable codec, EV-VBR implicitly provides a means for 
   congestion control by making it possible to 'thin' the bitstream. 
   The RTP payload format according to this specification provides 
   several different means for reducing the EV-VBR session bandwidth. 
   The most appropriate mechanism (in terms of impact on the user 
   experience) depends on the employed payload structure and also on 
   the employed session configuration (single RTP session or multiple 
   RTP sessions). The following means (in no particular order) can be 
   used to assist congestion control procedures, either by the sender 
   or by an intermediate node. 

   o  The transport blocks carrying the EDUs representing the highest 
      layers within the payload may be dropped. 

   o  The payloads carrying the EDUs representing the highest layers in 
      an EV-VBR session may be dropped. 

   o  Transport blocks or payloads carrying EDUs belonging to redundant 
      frames included in the payload may be dropped. 
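
   For a session split over multiple RTP sessions, dropping the 
   payloads of the highest layers amounts to choosing which RTP 
   sessions to keep forwarding. The sketch below illustrates one way 
   to make that choice; the function name and the session identifiers 
   are hypothetical. A session that carries the core layer is always 
   kept, even if it also carries layers above the limit; thinning 
   such a session would have to be done at the transport block level. 

```python
from typing import Dict, List, Sequence

CORE_LAYER = 1


def sessions_to_forward(session_layers: Dict[str, Sequence[int]],
                        max_layer: int) -> List[str]:
    """Return the RTP sessions to keep when layers above max_layer are dropped.

    A session is kept if it carries the core layer or if none of its
    layers exceeds max_layer.
    """
    if max_layer < CORE_LAYER:
        raise ValueError("the core layer is always forwarded")
    return [
        session_id
        for session_id, layers in session_layers.items()
        if CORE_LAYER in layers or max(layers) <= max_layer
    ]
```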

7. IANA Considerations 

   IANA is kindly requested to register a media type for the EV-VBR 
   codec for RTP transport, as specified in section 4.1 of this 
   document. 



APPENDIX A: Payload examples 

   The EV-VBR payload structure enables flexible transport either by 
   carrying all layers in the same payload or separating the layers into 
   separate payloads. The following subsections illustrate different 
   possibilities for transport by simple examples. Note that examples do 
   not show the full payload structure to keep the illustration simple.  

A.1. Simple payload examples 

A.1.1. All the layers in the same payload 

   The illustration below shows layers L1-L3 from two encoded frames 
   encapsulated into separate payloads, each using a single transport 
   block. 

    +-------+--------+-----+------+------+------+ 
    | RTP1  | L-ID=3 |NF=0 |F1-L1 |F1-L2 |F1-L3 |     
    +-------+--------+-----+------+------+------+ 
    
    +-------+--------+-----+------+------+------+ 
    | RTP2  | L-ID=3 |NF=0 |F2-L1 |F2-L2 |F2-L3 | 
    +-------+--------+-----+------+------+------+ 
    

   In case the same layers from two input frames are encapsulated into 
   one payload using a single transport block, the structure is as 
   shown below. 

    +-------+--------+-----+------+------+------+------+------+------+ 
    | RTP1  | L-ID=3 |NF=1 |F1-L1 |F2-L1 |F1-L2 |F2-L2 |F1-L3 |F2-L3 | 
    +-------+--------+-----+------+------+------+------+------+------+ 
    

   The third example illustrates the case where the layers L1-L3 from 
   two input frames are encapsulated into one payload using two separate 
   transport blocks, the first one carrying L1 and the other one 
   containing L2 and L3. 



    +-------+--------+-----+------+------+ 
    | RTP1  | L-ID=1 |NF=1 |F1-L1 |F2-L1 | 
    +-------+--------+-----+------+------+------+------+ 
            | L-ID=7 |NF=1 |F1-L2 |F2-L2 |F1-L3 |F2-L3 | 
            +--------+-----+------+------+------+------+ 
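
   The EDU ordering used in the multi-frame examples of this 
   subsection (layer by layer, and frame by frame within each layer) 
   can be generated as sketched below. The names are hypothetical, and 
   the decoding order is assumed here to be frame by frame with the 
   lowest layer first. 

```python
from typing import List, Sequence, Tuple

Edu = Tuple[int, int]   # (frame number, layer number)


def transport_block_order(frames: Sequence[int], layers: Sequence[int]) -> List[Edu]:
    """EDU order inside one transport block: all frames of the lowest
    layer first, then all frames of the next layer, and so on."""
    return [(frame, layer) for layer in layers for frame in frames]


def decoding_order(edus: Sequence[Edu]) -> List[Edu]:
    """Reorder EDUs frame by frame, lowest layer first within a frame."""
    return sorted(edus)
```

   With frames 1 and 2 and layers 1 to 3, the first function yields 
   F1-L1, F2-L1, F1-L2, F2-L2, F1-L3, F2-L3, and the second function 
   regroups these EDUs per frame. 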
    
A.1.2. Layers in separate RTP streams 

   In this case the data for each layer is transmitted in its own 
   payload. 

   In the first example each transport block including a single EDU is 
   carried in its own RTP payload. 

    +-------+--------+-----+-----+    +-------+--------+-----+-----+ 
    | RTP1a | L-ID=1 |NF=0 |F1-L1|    | RTP1b | L-ID=6 |NF=0 |F1-L2| 
    +-------+--------+-----+-----+    +-------+--------+-----+-----+ 
    
    +-------+--------+-----+-----+    +-------+--------+-----+-----+ 
    | RTP1c |L-ID=10 |NF=0 |F1-L3|    | RTP2a | L-ID=1 |NF=0 |F2-L1| 
    +-------+--------+-----+-----+    +-------+--------+-----+-----+ 
      
    +-------+--------+-----+-----+    +-------+--------+-----+-----+ 
    | RTP2b | L-ID=6 |NF=0 |F2-L2|    | RTP2c |L-ID=10 |NF=0 |F2-L3| 
    +-------+--------+-----+-----+    +-------+--------+-----+-----+ 
    

   If the payloads carry data from two consecutive input frames, the 
   same encoded data as in the previous example is arranged as follows. 



    +-------+--------+-----+-----+-----+ 
    | RTP1a | L-ID=1 |NF=1 |F1-L1|F2-L1|   
    +-------+--------+-----+-----+-----+ 
    
    +-------+--------+-----+-----+-----+ 
    | RTP1b | L-ID=6 |NF=1 |F1-L2|F2-L2|   
    +-------+--------+-----+-----+-----+ 
    
    +-------+--------+-----+-----+-----+ 
    | RTP1c |L-ID=10 |NF=1 |F1-L3|F2-L3|   
    +-------+--------+-----+-----+-----+ 
    
    
A.2. Advanced examples 

A.2.1. Different update rate for subset of layers 

   The example below employs different update rates (i.e. a different 
   number of frames per packet) for selected subsets of layers. All 
   core codec layers L1-L5 are shown. 



    +-------+--------+-----+-----+-----+-----+-----+ 
    | RTP1  | L-ID=1 |NF=3 |F1-L1|F2-L1|F3-L1|F4-L1| 
    +-------+--------+-----+-----+-----+-----+-----+ 
    
    +-------+--------+-----+-----+-----+-----+-----+ 
    | RTP2a | L-ID=7 |NF=1 |F1-L2|F2-L2|F1-L3|F2-L3| 
    +-------+--------+-----+-----+-----+-----+-----+ 
    
    +-------+--------+-----+-----+-----+ 
    | RTP3a |L-ID=14 |NF=0 |F1-L4|F1-L5| 
    +-------+--------+-----+-----+-----+ 
    
    +-------+--------+-----+-----+-----+ 
    | RTP3b |L-ID=14 |NF=0 |F2-L4|F2-L5| 
    +-------+--------+-----+-----+-----+ 
    
    +-------+--------+-----+-----+-----+-----+-----+ 
    | RTP2b | L-ID=7 |NF=1 |F3-L2|F4-L2|F3-L3|F4-L3| 
    +-------+--------+-----+-----+-----+-----+-----+ 
    
    +-------+--------+-----+-----+-----+ 
    | RTP3c |L-ID=14 |NF=0 |F3-L4|F3-L5| 
    +-------+--------+-----+-----+-----+ 
    
    +-------+--------+-----+-----+-----+ 
    | RTP3d |L-ID=14 |NF=0 |F4-L4|F4-L5| 
    +-------+--------+-----+-----+-----+ 
    

A.2.2. Redundant frames with limited set of layers 

   An example transmitting layers L1-L3 as primary data and L1 (of the 
   previous frame) as redundant data is shown below. Each payload 
   carries one primary (i.e. new) frame in one transport block and one 
   redundant frame, which in this example is the frame preceding the 
   primary frame, in another transport block. 



    +-------+--------+-----+-----+--------+-----+-----+-----+-----+ 
    | RTP1  | L-ID=1 |NF=0 |F0-L1| L-ID=3 |NF=0 |F1-L1|F1-L2|F1-L3| 
    +-------+--------+-----+-----+--------+-----+-----+-----+-----+ 
    
    +-------+--------+-----+-----+--------+-----+-----+-----+-----+ 
    | RTP2  | L-ID=1 |NF=0 |F1-L1| L-ID=3 |NF=0 |F2-L1|F2-L2|F2-L3| 
    +-------+--------+-----+-----+--------+-----+-----+-----+-----+ 
    
    +-------+--------+-----+-----+--------+-----+-----+-----+-----+ 
    | RTP3  | L-ID=1 |NF=0 |F2-L1| L-ID=3 |NF=0 |F3-L1|F3-L2|F3-L3| 
    +-------+--------+-----+-----+--------+-----+-----+-----+-----+ 
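
   The arrangement shown above can be described programmatically as 
   in the sketch below, which lists the EDUs of the two transport 
   blocks of one payload. The function name and the default layer 
   selections are hypothetical and simply mirror this example. 

```python
from typing import List, Sequence, Tuple

Edu = Tuple[int, int]   # (frame number, layer number)


def payload_with_redundancy(frame: int,
                            primary_layers: Sequence[int] = (1, 2, 3),
                            redundant_layers: Sequence[int] = (1,)) -> List[List[Edu]]:
    """Return the two transport blocks of one payload as lists of EDUs.

    The first block repeats the selected layers of the previous frame;
    the second block carries the new (primary) frame.
    """
    redundant_block = [(frame - 1, layer) for layer in redundant_layers]
    primary_block = [(frame, layer) for layer in primary_layers]
    return [redundant_block, primary_block]
```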
    

   Alternatively, a payload that also carries redundant data for a 
   subset of layers can be arranged differently, as shown in the 
   example below. 

    +-------+--------+-----+-----+-----+-----+--------+-----+-----+ 
    | RTP1  | L-ID=3 |NF=0 |F0-L1|F0-L2|F0-L3| L-ID=1 |NF=0 |F1-L1| 
    +-------+--------+-----+-----+-----+-----+--------+-----+-----+ 
    
    +-------+--------+-----+-----+-----+-----+--------+-----+-----+ 
    | RTP2  | L-ID=3 |NF=0 |F1-L1|F1-L2|F1-L3| L-ID=1 |NF=0 |F2-L1| 
    +-------+--------+-----+-----+-----+-----+--------+-----+-----+ 
    
    +-------+--------+-----+-----+-----+-----+--------+-----+-----+ 
    | RTP3  | L-ID=3 |NF=0 |F2-L1|F2-L2|F2-L3| L-ID=1 |NF=0 |F3-L1| 
    +-------+--------+-----+-----+-----+-----+--------+-----+-----+ 
    

   Now the first transport block carries the primary data and the second 
   transport block carries the redundant data, which in this case covers 
   the frame following the primary frame. The benefit of this approach 
   is that the redundant data is included in the last (secondary) 
   transport block of the payload, which might be beneficial for 
   possible payload scaling operation within the network. 



8. References 

8.1. Normative References 

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 
             Requirement Levels", BCP 14, RFC 2119, March 1997. 

   [RFC3550] Schulzrinne, H., Casner, S., Frederick, R. and Jacobson, 
             V., "RTP: A Transport Protocol for Real-Time Applications", 
             STD 64, RFC 3550, July 2003. 

   [G.718]   ITU-T Recommendation G.718, "Frame Error Robust Narrowband 
             and Wideband Embedded Variable Bit-Rate Coding of Speech 
             and Audio from 8-32 Kbit/s", (consented) May 2008. 

   [AMR-WB]  3GPP TS 26.171, "Adaptive Multi-Rate Wideband (AMR-WB) 
             speech codec; General description (Release 7)", v7.0.0, 
             September 2006. 

   [RFC4867] Sjoberg, J., Westerlund, M., Lakaniemi, A., Xie, Q., "RTP 
             Payload Format and File Storage Format for the Adaptive 
             Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) 
             Audio Codecs", RFC 4867, April 2007. 

   [RFC5104] Wenger, S., Chandra, U., Westerlund, M., Burman, B., "Codec 
             Control Messages in the RTP Audio-Visual Profile with 
             Feedback (AVPF)", RFC 5104, February 2008. 

   [RFC4585] Ott, J., Wenger, S., Sato, N., Burmeister, C., Rey, J., 
             "Extended RTP Profile for Real-Time Transport Control 
             Protocol (RTCP)-Based Feedback (RTP/AVPF)", RFC 4585, July 
             2006. 

   [RFC4566] Handley, M., Jacobson, V. and Perkins, C., "SDP: Session 
             Description Protocol", RFC 4566, July 2006. 

   [RFC4288] Freed, N., Klensin, J., "Media Type Specifications and 
             Registration Procedures", BCP 13, RFC 4288, December 2005. 

   [RFC4855] Casner, S., "Media Type Registration of RTP Payload 
             Formats", RFC 4855, February 2007. 

   [RFC3264] Rosenberg, J., Schulzrinne, H., "An Offer/Answer Model with 
             Session Description Protocol (SDP)", RFC 3264, June 2002. 



   [smd-sdp] Schierl, T., Wenger, S., "Signaling media decoding 
             dependency in Session Description Protocol (SDP)", draft-
             schierl-mmusic-layered-codec-04 (work in progress), June 
             2007. 

   [RFC3551] Schulzrinne, H., Casner, S., "RTP Profile for Audio and 
             Video Conferences with Minimal Control", STD 65, RFC 3551, 
             July 2003. 

   [RFC3711] Baugher, M., McGrew, D., Naslund, M., Carrara, E., Norrman, 
             K., "The Secure Real-Time Transport Protocol (SRTP)", RFC 
             3711, March 2004. 

8.2. Informative References 

   [McCanne] McCanne, S., Jacobson, V., and Vetterli, M., "Receiver-
             driven layered multicast", in Proc. of ACM SIGCOMM'96, 
             pages 117--130, Stanford, CA, August 1996. 

   [RFC5117] Westerlund, M., Wenger, S., "RTP Topologies", RFC 5117, 
             January 2008. 

   [RFC2326] Schulzrinne, H., Rao, A., Lanphier, R., "Real Time 
             Streaming Protocol (RTSP)", RFC 2326, April 1998. 

   [RFC2974] Handley, M., Perkins, C., Whelan, E., "Session Announcement 
             Protocol", RFC 2974, October 2000. 

Authors' Addresses 

   Ari Lakaniemi 
   Nokia 
   P.O.Box 407 
   FIN-00045 Nokia Group, FINLAND 
    
   Phone: +358-71-8008000 
   Email: ari.lakaniemi@nokia.com 
    
   Ye-Kui Wang 
   Nokia Research Center 
   P.O. Box 1000 
   33721 Tampere 
   Finland 
       
   Phone: +358-50-466-7004 
   Email: ye-kui.wang@nokia.com 
    
 

Intellectual Property Statement 

   The IETF takes no position regarding the validity or scope of any 
   Intellectual Property Rights or other rights that might be claimed to 
   pertain to the implementation or use of the technology described in 
   this document or the extent to which any license under such rights 
   might or might not be available; nor does it represent that it has 
   made any independent effort to identify any such rights.  Information 
   on the procedures with respect to rights in RFC documents can be 
   found in BCP 78 and BCP 79. 

   Copies of IPR disclosures made to the IETF Secretariat and any 
   assurances of licenses to be made available, or the result of an 
   attempt made to obtain a general license or permission for the use of 
   such proprietary rights by implementers or users of this 
   specification can be obtained from the IETF on-line IPR repository at 
   http://www.ietf.org/ipr. 

   The IETF invites any interested party to bring to its attention any 
   copyrights, patents or patent applications, or other proprietary 
   rights that may cover technology that may be required to implement 
   this standard.  Please address the information to the IETF at 
   ietf-ipr@ietf.org. 

Disclaimer of Validity 

   This document and the information contained herein are provided on an 
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND 
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS 
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 

Copyright Statement 

   Copyright (C) The IETF Trust (2008). 

   This document is subject to the rights, licenses and restrictions 
   contained in BCP 78, and except as set forth therein, the authors 
   retain all their rights. 

Acknowledgment 

   Funding for the RFC Editor function is currently provided by the 
   Internet Society. 

 

9. Open Issues 

   1) Support of super-wideband (SWB) audio and stereophonic encoding 
      extensions to ITU-T G.718 currently being worked on by ITU-T is to 
      be specified after ITU-T completes the work in that regard. 

        a. Some further study is needed to see if separate parameters 
          for sending and receiving capabilities/preferences are needed 
          -- especially for upcoming stereo and SWB options. 

        b. The support for upcoming SWB and stereo options needs to be 
          taken into account. Basically we can either 1) extend the 
          parameter "layers" to cover also this aspect, or 2) define 
          separate parameter(s) for these new options when more details 
          on the stereo/SWB support are available. 

   2) For streaming or other applications that allow for relatively long 
      end-to-end delay, sometimes it would be beneficial to aggregate 
      more than 4 frames in one Transport Block (TB). Should the length 
      of the NF field be larger? 

   3) On layer structure and configuration signalling. Currently, a 
      unique layer ID is assigned to each possible layer combination. 
      See the editing notes below Table 3 for other possible approaches. 
      One of the alternative ways may be chosen in the final draft. 

   4) Currently, it is mandated that lower layer EDUs of later frames go
      before higher layer EDUs of earlier frames in a transport block.
      This ordering is friendlier to adaptation (dropping of higher
      layers). However, if all layers are received, the depacketizer
      needs to reorder the EDUs into decoding order before feeding them
      to the decoder. The opposite ordering (i.e. lower layer EDUs of
      later frames go after higher layer EDUs of earlier frames, so
      that EDUs in a transport block are placed in decoding order) is
      therefore friendlier to the depacketizer. Another benefit of the
      latter is that it does not introduce any end-to-end delay. Which
      ordering is to be specified (or whether both are to be allowed)
      is FFS.

   5) MANEs dropping RTP packets are RTP translators. But are those 
      MANEs dropping a subset of the transport blocks in one packet also 
      RTP translators? 

   6) RTCP-based cross-session synchronization is not possible until
      the first RTCP SRs have been received in all sessions. This
      implies that only a subset of the layers may be decodable until
      RTCP SRs have been received in all sessions. This may impose
      higher end-to-end delay or higher bandwidth for RTCP data, and the
      approach may not work perfectly for some multicast topologies. A
      study by some AVT members is ongoing. Once an acceptable solution
      is found, the draft documenting that solution may be referenced in
      this draft.

   7) It might be better to change the semantics of the media type
      parameter 'layers' to be similar to those of L-ID.

   8) Offer/answer with answer being capable of modifying the layer 
      configuration is FFS. 

   9) Some references need to be updated in the final draft.  
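
   The trade-off described in open issue 4 can be sketched in code. The
   example below is illustrative only and is not part of the payload
   format: the Edu type, its fields, and both helper functions are
   hypothetical names introduced here, and the actual EDU header layout
   is not modelled. It assumes the currently mandated ordering, in which
   all EDUs of a lower layer precede all EDUs of a higher layer within a
   transport block.

```python
from typing import List, NamedTuple


class Edu(NamedTuple):
    """Hypothetical stand-in for one encoded data unit (EDU)."""
    frame: int   # position of the frame within the transport block
    layer: int   # 1 = lowest layer, larger = higher enhancement layer
    data: bytes


def drop_higher_layers(edus: List[Edu], max_layer: int) -> List[Edu]:
    """Adaptation step: keep only EDUs at or below max_layer."""
    return [e for e in edus if e.layer <= max_layer]


def to_decoding_order(edus: List[Edu]) -> List[Edu]:
    """Depacketizer step: group EDUs per frame, lowest layer first."""
    return sorted(edus, key=lambda e: (e.frame, e.layer))


# Two frames with three layers each, in the currently mandated order:
# all layer-1 EDUs first, then layer 2, then layer 3.
block = [Edu(f, l, b"") for l in (1, 2, 3) for f in (0, 1)]

# Drop layer 3, then restore decoding order for the decoder.
decoding = to_decoding_order(drop_higher_layers(block, 2))
print([(e.frame, e.layer) for e in decoding])
# prints [(0, 1), (0, 2), (1, 1), (1, 2)]
```

   With the mandated ordering, dropping higher layers amounts to
   truncating the tail of the transport block, while the sort is the
   extra step the depacketizer has to perform. If EDUs were carried in
   decoding order instead, the sort would be unnecessary but the EDUs
   to be dropped would be interleaved with those to be kept.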

10. Changes Log 

   From draft-lakaniemi-avt-rtp-evbr-02 to
   draft-lakaniemi-avt-rtp-evbr-03

   - In section 2.1, 1) updated the text and tables to include sampling 
     rates and output as NB or WB, 2) corrected the bit rate values in 
     Table 2, 3) clarified that all AMR-WB modes can be supported, and 
     4) added that in the AMR-WB interoperable mode, when the base 
     layer L1' is transported in its own RTP packet stream, the 
     packetisation specified in [RFC4867] MUST be used, to enable 
     legacy RFC4867 receivers to receive the base layer L1'. 

   - In section 3.1.2, added one more alternative for layer structure
     and configuration signalling in an editing note. This alternative
     uses separate L-ID value spaces for different modes. For example,
     the mode with L1 being present and the AMR-WB compatible mode
     (with L1' being present) use different value spaces of L-ID.

   - In section 3.1.2, clarified that the encoded data is not present
     (i.e. consists of zero octets) for an empty frame (with L-ID equal
     to 0).

   - In section 3.2, clarified that MANEs dropping some of the layers 
     are RTP translators, and added references to RFC 5117 and RFC 
     3550, per Colin's comment.  

   - In section 3.6, removed the payload-specific multi-session
     transmission decoding order recovery mechanism based on time
     synchronization. Instead, the RTCP-based synchronization
     mechanism is used (with SHOULD wording).



   - Removed the original section 4. How the preference of SWB or 
     stereo is to be signaled is for further study after the ITU-T 
     completes the relevant extension.  

   - In section 4.1, added a new media type parameter, 'mode', to 
     indicate whether the AMR-WB compatible mode is in use. 

   - Added section 9 (Open issues) and section 10 (Changes Log). 

