Network Working Group                                     Johan Sjoberg
INTERNET-DRAFT                                        Magnus Westerlund
Expires: June 2005                                             Ericsson
                                                          Ari Lakaniemi
                                                                  Nokia
                                                      December 17, 2004

    Real-Time Transport Protocol (RTP) Payload Format for Extended
                 AMR Wideband (AMR-WB+) Audio Codec

Status of this memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of RFC 3668.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This document is a submission of the IETF AVT WG. Comments should be directed to the AVT WG mailing list, avt@ietf.org.

Abstract

This document specifies a real-time transport protocol (RTP) payload format to be used for Extended AMR Wideband (AMR-WB+) encoded audio signals. The AMR-WB+ codec is an audio extension of the AMR-WB codec providing additional frame types designed to give higher quality of music and speech than the original frame types. A media type registration is included for AMR-WB+.

Sjoberg, et. al.                                               [Page 1]
INTERNET-DRAFT     RTP payload format for AMR-WB+         Dec 17, 2004

TABLE OF CONTENTS

1. Definitions
   1.1. Glossary
   1.2. Terminology
2. Introduction
3. Background on AMR-WB+ and Design Principles
   3.1. The AMR-WB+ Audio Codec
   3.2. Multi-rate Encoding and Rate Adaptation
   3.3. Voice Activity Detection and Discontinuous Transmission
   3.4. Support for Multi-Channel Session
   3.5. Unequal Bit-error Detection and Protection
   3.6. Robustness against Packet Loss
      3.6.1. Use of Forward Error Correction (FEC)
      3.6.2. Use of Frame Interleaving
   3.7. AMR-WB+ Audio over IP scenarios
4. RTP Payload Format for AMR-WB+
   4.1. RTP Header Usage
   4.2. Payload Structure
   4.3. Payload definitions
      4.3.1. The Payload Table of Contents
      4.3.2. Audio Data
      4.3.3. Methods for Forming the Payload
      4.3.4. Payload Examples
   4.4. Interleaving Considerations
   4.5. Implementation Considerations
      4.5.1. ISF recovery when frames are lost
5. Congestion Control
6. Security Considerations
   6.1. Confidentiality
   6.2. Authentication and Integrity
   6.3. Decoding Validation
7. Payload Format Parameters
   7.1. Media Type Registration
   7.2. Mapping Media Type Parameters into SDP
      7.2.1. Offer-Answer Model Considerations
      7.2.2. Examples
8. IANA Considerations
9. Contributors
10. Acknowledgements
11. References
   11.1. Normative references
   11.2. Informative References
12. Authors' Addresses
13. IPR Notice
14. Copyright Notice
15. Changes

Sjoberg, et. al.              Standards Track                  [Page 2]

1. Definitions

1.1. Glossary

3GPP    - the Third Generation Partnership Project
AMR     - Adaptive Multi-Rate Codec
AMR-WB  - Adaptive Multi-Rate Wideband Codec
AMR-WB+ - Extended Adaptive Multi-Rate Wideband Codec
CMR     - Codec Mode Request
CN      - Comfort Noise
DTX     - Discontinuous Transmission
FEC     - Forward Error Correction
FT      - Frame Type
ISF     - Internal Sampling Frequency
SCR     - Source Controlled Rate Operation
SID     - Silence Indicator (the frames containing only CN parameters)
TFI     - Transport Frame Index
TS      - Timestamp
VAD     - Voice Activity Detection
UED     - Unequal Error Detection
UEP     - Unequal Error Protection

1.2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [2].

2. Introduction

This document specifies the payload format for packetization of Extended Adaptive Multi-Rate Wideband (AMR-WB+) [1] encoded audio signals into the Real-time Transport Protocol (RTP) [3].
The payload format supports transmission of mono or stereo audio, aggregation of multiple frames per payload, and mechanisms that enhance robustness against packet loss.

The AMR-WB+ codec is an extension of the Adaptive Multi-Rate Wideband (AMR-WB) codec. The new features include extended audio bandwidth to enable high quality for music as well as speech, native support for stereophonic audio, and the possibility to operate at different internal sampling frequencies (ISFs). The primary usage scenario for AMR-WB+ is transport over IP, so the need for interworking with other transport networks that applied to AMR-WB does not arise.

AMR-WB+ is expected to be used mainly in streaming applications, where an octet-aligned payload format makes the packetization process on a streaming server as efficient as possible; this benefit is considered substantial. Therefore, the bandwidth efficient mode defined for AMR-WB in [7] is not specified for AMR-WB+; the bandwidth saved by a bandwidth efficient mode would in any case be very small, since all extension frame types are already octet aligned at the encoder output. The stereo encoding capability makes support for multi-channel transport at the RTP payload format level, as specified for AMR-WB, obsolete, and therefore this feature is not included in the AMR-WB+ RTP payload format. Because of all these changes and the different scope of the AMR-WB+ codec, this document defines a new RTP payload format that differs significantly from the ones for AMR and AMR-WB [7].

There is no file format for AMR-WB+ defined within this specification. Instead, the 3GPP-defined, ISO-based 3GP file format [14] supports AMR-WB+ and provides all functionality required from a file format. This format also supports storage of AMR and AMR-WB, plus other multimedia formats, allowing for synchronized playback.

The rest of the document is organized as follows.
Background on AMR-WB+ and design principles can be found in Section 3. The payload format itself is specified in Section 4 and follows the principles used in [3] and [9]. In Section 7, a media type registration is provided.

3. Background on AMR-WB+ and Design Principles

The Extended Adaptive Multi-Rate Wideband (AMR-WB+) [1] audio codec is designed for compression of speech and audio signals achieving low bit-rate with good quality. The codec is specified by 3GPP, and primary target applications within 3GPP are packet-switched streaming service (PSS) [13] and multimedia messaging service (MMS). However, due to its flexibility and robustness, AMR-WB+ is very well suited for streaming services in highly varying transport environments, e.g. the Internet.

Some of the options of the payload format remain constant throughout a session, and therefore can be controlled/negotiated at the session set-up. These options and variables are described in general terms at appropriate points in the text of this specification as parameters to be established through out-of-band means. In Section 7, all of the parameters are specified in the form of media type registration for the AMR-WB+ encoding. The method used to signal these parameters at session setup or to arrange prior agreement of the participants is beyond the scope of this document; however, Section 7.2 provides a mapping of the parameters into the Session Description Protocol (SDP) [6] for those applications that use SDP.

3.1. The AMR-WB+ Audio Codec

The AMR-WB+ audio codec was originally developed by 3GPP to be used for streaming and messaging services in GSM and 3G cellular systems.

AMR-WB+ is designed as an audio extension to the AMR-WB speech codec. The extension adds new functionality to the codec in order to provide high audio quality for a large range of signals including music.
Stereophonic operation has also been added: a new high-efficiency hybrid stereo coding algorithm enables stereo operation at bit-rates as low as 6.2 kbit/s in total.

The AMR-WB+ audio codec includes the nine frame types specified for AMR-WB, extended with new bit-rates ranging from 5.2 to 48 kbit/s. Whereas the AMR-WB frame types employ a 16000 Hz sampling frequency and operate only on monophonic signals, the extension can operate at a number of internal sampling frequencies (ISFs), both in mono and stereo; see Table 24 in [1]. However, the output sampling frequency of the decoder is limited to 8, 16, 24, 32 or 48 kHz.

An overview of the AMR-WB+ encoding operations is as follows. The encoder receives the audio sampled at, for example, 48 kHz. The encoding process starts with pre-processing and resampling to the Internal Sampling Frequency (ISF) used. The encoding is performed on equal-sized super-frames, each corresponding to 2048 samples per channel at the ISF. For each super-frame the codec makes a number of encoding decisions, choosing between different encoding algorithms and block lengths to give fidelity-optimized encoding adapted to the signal characteristics of the source. The stereo encoding (if used) is performed separately from the monophonic core encoding, thus enabling the selection of different combinations of core and stereo encoding rates. The resulting encoded audio is produced in 4 equally long transport frames, each individually usable by the decoder and each corresponding to 512 samples.

The codec supports 13 different ISFs, ranging from 12.8 up to 38.4 kHz, as described by Table 24 in [1]. This allows a trade-off between audio bandwidth and the bit-rate required. As encoding is performed on 2048 samples at the ISF, the duration of a super-frame and the effective bit-rate of the frame type in use vary with the ISF. The ISF of 25600 Hz has a super-frame duration of 80 ms and is also the 'nominal' value used to describe the encoding bit-rates.
Using this normalization, the ISF selection results in bit-rate variations from 1/2 up to 3/2 of the nominal bit-rate.

For each of the 4 transport frames of a super-frame to be individually decodable, its position within the super-frame must be known.

The encoding for the extension modes is performed as one monophonic core encoding and one stereo encoding. The core encoding is performed by splitting the monophonic signal into a lower and a higher frequency band. The lower band is encoded using either algebraic code excited linear prediction (ACELP) or transform coded excitation (TCX); the choice is made once per transport frame, with certain allowed combinations within the super-frame. The higher band is encoded using a low-rate parametric bandwidth extension approach. The stereo signal is encoded using a frequency band decomposition similar to that for the mono signal; however, here the signal is divided into three bands that are individually parameterized using different techniques.

The total bit-rate produced by the extension is the result of the combination of the encoder's core rate, stereo rate and ISF. The extension supports 8 different core encoding rates producing bit-rates between 10.4 and 24.0 kbit/s; see Table 22 of [1]. There are 16 stereo encoding rates generating bit-rates between 2.0 and 8.0 kbit/s; see Table 23 of [1]. The frame type encodes the AMR-WB modes, 4 fixed extension rates (see below), 24 combinations of core and stereo rates for stereo signals, and the 8 core rates for mono signals, as listed in Table 25 in [1]. As a result, AMR-WB+ supports encodings between 10.4 and 32 kbit/s using an ISF of 25600 Hz. Further freedom in produced bit-rates and quality is available by using different ISFs.
The selection of an ISF changes the available audio bandwidth of the reconstructed signal and, at the same time, the total bit-rate. The bit-rate for a given combination of frame type and ISF is determined by multiplying the frame type's bit-rate by the bit-rate factor of the ISF used (see Table 24 of [1]).

The extension also has 4 frame types that have fixed core bit-rates, stereo bit-rates and ISFs; see frame types 10-13 in Table 21 in [1]. These four pre-defined frame types have a fixed input sampling frequency to the encoder, set at either 16 or 24 kHz. These frame types share with the AMR-WB modes the property that each transport frame represents only 20 ms of audio signal; however, they are also part of 80 ms super-frames. Thus frame types 0-13 (AMR-WB and fixed extension rates), as listed in Table 21 of [1], do not require explicit ISF indication. The other frame types, 14-47, require the ISF employed to be indicated.

The fact that the extension has 32 different frame types that can be combined with 13 ISFs allows great flexibility in bit-rate and in selection of the desired quality. For example, a number of combinations produce the same codec bit-rate. One possible way of producing a 32 kbit/s audio stream is to use frame type 41, i.e. 25.6 kbit/s, and the ISF of 32 kHz (5/4 * (19.2 + 6.4) = 32 kbit/s); another way is to use frame type 47 and the ISF of 25.6 kHz (1 * (24 + 8) = 32 kbit/s). Which combination to use depends on the content being encoded. In the above example the first case provides wider audio bandwidth, while the second one spends the same number of bits on a somewhat narrower audio bandwidth.

The duration of one AMR-WB+ audio transport frame can vary and depends on the ISF. Since a transport frame always corresponds to 512 samples at the ISF used, its duration is limited to the range 13.33 to 40 ms.
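
The arithmetic in the preceding paragraphs can be checked with a small sketch. It assumes that the bit-rate factor of an ISF equals ISF / 25600, which is consistent with the 1/2 to 3/2 range and the 5/4 factor quoted for 32 kHz; the authoritative factors are those in Table 24 of [1]. The function names are illustrative only and are not part of the codec or of this payload format.

```python
NOMINAL_ISF_HZ = 25600             # ISF at which the nominal bit-rates are quoted
SAMPLES_PER_TRANSPORT_FRAME = 512  # one transport frame is always 512 samples


def effective_bitrate_kbps(core_kbps, stereo_kbps, isf_hz):
    """Scale the nominal bit-rate by the ASSUMED factor ISF / 25600."""
    return (core_kbps + stereo_kbps) * isf_hz / NOMINAL_ISF_HZ


def transport_frame_ms(isf_hz):
    """Duration in milliseconds of one 512-sample transport frame."""
    return 1000.0 * SAMPLES_PER_TRANSPORT_FRAME / isf_hz


if __name__ == "__main__":
    # The two ways of reaching 32 kbit/s described in the text.
    print(effective_bitrate_kbps(19.2, 6.4, 32000))
    print(effective_bitrate_kbps(24.0, 8.0, 25600))
    # Frame durations at the lowest, nominal and highest ISF.
    for isf in (12800, 25600, 38400):
        print(isf, round(transport_frame_ms(isf), 2))
```

A super-frame is four transport frames, so its duration is four times the value returned by transport_frame_ms.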
The RTP TS clock rate of 72000 Hz results in AMR-WB+ transport frame lengths from 960 to 2880 ticks, depending on the selected ISF. If the internal sampling rate is set to 25600 Hz, the transport frame duration is equal to 20 ms and the super-frame duration is equal to 80 ms.

Index   ISF (Hz)   Duration (ms)   Duration (TS ticks)
------------------------------------------------------
  0       N/A         20                1440
  1      12800        40                2880
  2      14400        35.56             2560
  3      16000        32                2304
  4      17067        30                2160
  5      19200        26.67             1920
  6      21333        24                1728
  7      24000        21.33             1536
  8      25600        20                1440
  9      28800        17.78             1280
 10      32000        16                1152
 11      34133        15                1080
 12      36000        14.22             1024
 13      38400        13.33              960

Table 1: RTP Timestamp Ticks for each ISF

The encoder is able to change the ISF and the encoding frame type (both mono and stereo) during an encoding session. For the extension frame types with index 10-13 and 16-47, ISF and frame type changes are constrained to occur at super-frame boundaries, i.e. within a super-frame the ISF is constant. Such a limitation does not apply to frame types with index 0-9, i.e. the original AMR-WB frame types.

In conclusion, some features need special consideration from the transport point of view. Firstly, the fact that the frame duration depends on the ISF sets requirements on the RTP timestamping. Secondly, each frame of encoded audio must maintain information about its frame type, ISF and position in the super-frame.

3.2. Multi-rate Encoding and Rate Adaptation

The multi-rate encoding capability of AMR-WB+ is designed for preserving high audio quality under a wide range of bandwidth requirements and transmission conditions.

AMR-WB+ enables seamless switching between frame types using the same number of audio channels and the same ISF.
Every AMR-WB+ codec implementation is required to support all the respective audio coding frame types defined by the codec and must be able to handle switching between any two frame types. Switching between frame types employing a different number of audio channels or a different ISF is possible, but may not be completely seamless. Therefore it is recommended to perform such switching infrequently and, if possible, during periods where the input is silent.

3.3. Voice Activity Detection and Discontinuous Transmission

AMR-WB+ supports the same algorithms for voice activity detection (VAD) and generation of comfort noise (CN) parameters during silence periods as used by the AMR-WB codec. However, they can only be used in conjunction with the AMR-WB frame types (FT=0-8). As with the AMR-WB codec, this option allows the number of transmitted bits and packets during silence periods to be reduced to a minimum when operating in the AMR-WB frame types (FT=0-8). The operation of sending CN parameters at regular intervals during silence periods is usually called discontinuous transmission (DTX) or source controlled rate (SCR) operation. The AMR-WB+ frames containing CN parameters are called Silence Indicator (SID) frames. See more details about VAD and DTX functionality in [4] and [5].

3.4. Support for Multi-Channel Session

Some of the AMR-WB+ frame types support encoding of stereophonic audio. Because of this native support for two-channel stereophonic signals, it does not seem necessary to support multi-channel transport with separate codecs as done in the AMR-WB RTP payload format [7].

The codec has the capability of stereo to mono downmixing as part of the decoding process. Thus, a receiver that is capable only of monophonic playout can still decode and play signals originally encoded and transmitted as stereo. However, to avoid spending bit-rate on stereo encoding that will not be utilized, a mechanism for signaling a mono-only session is defined.

3.5. Unequal Bit-error Detection and Protection

The audio bits encoded in each AMR-WB frame are sorted according to their different perceptual sensitivity to bit errors. This property can be exploited, e.g. in cellular systems, to achieve better voice quality by using unequal error protection and detection (UEP and UED) mechanisms. However, the bits of the extension frame types of the AMR-WB+ codec do not have a consistent sensitivity property and are not sorted in sensitivity order. Thus, UEP or UED cannot be utilized with the extension frame types. If there is a need to use UEP or UED for AMR-WB frame types, the RTP payload format for AMR-WB defined in RFC 3267 [7] should be used.

3.6. Robustness against Packet Loss

The payload format supports two mechanisms to improve robustness against packet loss: simple forward error correction (FEC) and frame interleaving.

3.6.1. Use of Forward Error Correction (FEC)

The simple scheme of repetition of previously sent data is one way of achieving FEC. Another possible scheme, which can be more bandwidth efficient, is to use payload-external FEC, e.g. RFC 2733 [11], which generates extra packets containing repair data. For the AMR-WB+ extension frame types, it is possible to send redundant copies of an input frame encoded using the same frame type and ISF. Such a scheme is described next.

The basic idea is to send previously transmitted frame(s) together with the new one(s). This is done by using a sliding window to group the audio frames to be sent in each payload. Figure 1 below shows an example.

--+--------+--------+--------+--------+--------+--------+--------+--
  | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
--+--------+--------+--------+--------+--------+--------+--------+--

   <---- p(n-1) ---->
            <----- p(n) ----->
                     <---- p(n+1) ---->
                              <---- p(n+2) ---->
                                       <---- p(n+3) ---->
                                                <---- p(n+4) ---->

Figure 1: An example of redundant transmission.

In this example each frame is retransmitted once in the following RTP payload packet. Here, f(n-2)...f(n+4) denotes a sequence of audio frames and p(n-1)...p(n+4) a sequence of payload packets.

The use of this approach does not require signaling at the session setup. In other words, the audio sender can choose to use this scheme without consulting the receiver. This is because a packet containing redundant frames will not look different from a packet with only new frames. For a certain timestamp, the receiver may receive multiple copies of a frame containing encoded audio data or frames indicated as NO_DATA.

This redundancy scheme provides the same functionality as the one described in RFC 2198 "RTP Payload for Redundant Audio Data" [12]. In most cases the mechanism described above is more efficient and simpler than requiring both endpoints to support RFC 2198 in addition to the AMR-WB+ RTP payload format. However, there is one scenario in which the use of RFC 2198 is needed: if one desires to use some other codec than AMR-WB+ for the redundant encoding, the AMR-WB+ payload format is not able to carry it.

The sender is responsible for selecting an appropriate amount of redundancy based on feedback about the channel conditions, e.g. in RTCP receiver reports. The sender is also responsible for avoiding congestion, which may be exacerbated by redundancy (see Section 5 for more details).

3.6.2. Use of Frame Interleaving

To decrease protocol overhead, the payload design allows several audio frames to be encapsulated into a single RTP packet. One of the drawbacks of such an approach is that the loss of one packet means the loss of several consecutive audio frames, which usually causes clearly audible distortion in the reconstructed audio. Interleaving of frames can improve the audio quality in such cases by distributing the consecutive losses into a series of single-frame losses, which are easier to cover by an error concealment algorithm. However, interleaving and bundling several frames per payload also increases end-to-end delay and sets higher buffering requirements, and it is therefore not appropriate for all usage scenarios or devices. Streaming applications, however, will most likely be able to exploit interleaving to improve audio quality in lossy transmission conditions.

Note that this payload design supports the use of frame interleaving as an option. The usage of this feature needs to be negotiated, or at least signaled, in the session set-up.

The interleaving supported by this format is rather flexible. For example, a continuous pattern can be defined, as the example below shows.

--+--------+--------+--------+--------+--------+--------+--------+--
  | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
--+--------+--------+--------+--------+--------+--------+--------+--

           [ P(n)   ]
  [ P(n+1) ]                 [ P(n+1) ]
                    [ P(n+2) ]                 [ P(n+2) ]
                                      [ P(n+3) ]                 [P(
                                                        [ P(n+4) ]

Figure 2: An example of interleaving pattern that has constant delay.

In Figure 2 the consecutive frames, denoted f(n-2) to f(n+4), are aggregated into packets P(n) to P(n+4), two in each packet with interleaving. This approach provides a pattern that allows for constant delay in both the interleaving and the deinterleaving process.
The deinterleaving buffer in this example needs to have room for at least 3 frames, including the one that is ready to be consumed. Storage space for 3 frames is needed, for example, when f(n) is the next frame to be decoded and played: frame f(n) was received in packet P(n+2), which also carries frame f(n+3), and frame f(n+1), received in packet P(n+1), is already in the deinterleaving buffer. Note also that in this example the buffer occupancy varies: when frame f(n+1) is the next one to be decoded, there are only two frames (f(n+1) and f(n+3)) in the buffer.

3.7. AMR-WB+ Audio over IP scenarios

Since the primary target application for the AMR-WB+ codec is packet-switched streaming, the most relevant usage scenario for this payload format is IP end-to-end between a server and a terminal, as shown in Figure 3.

+----------+                          +----------+
|          |    IP/UDP/RTP/AMR-WB+    |          |
|  SERVER  |<------------------------>| TERMINAL |
|          |                          |          |
+----------+                          +----------+

Figure 3: Server to terminal IP scenario

4. RTP Payload Format for AMR-WB+

Despite belonging to the same family of codecs, the payload format for AMR-WB+ is different from the AMR and AMR-WB payload formats [7]. The main emphasis in the payload design has been to minimize the overhead in typical use cases, while still providing full flexibility with slightly higher overhead. This is made possible by defining some frame-specific parameters to cover all frames in the payload instead of defining them for each frame separately.

The payload format has two modes, the basic mode and the interleaved mode. The main structural difference between the two modes is the extension of the table of contents entries with frame displacement fields in the interleaved mode. The basic mode supports aggregation of multiple consecutive frames in a payload.
The interleaved mode supports aggregation of multiple frames that are non-consecutive in time. In both modes it is possible to have frames encoded with different frame types in the same payload, but the ISF must remain constant throughout the payload. Frequent switching of the ISF is not expected, however, and the codec is restricted to switching the ISF only on super-frame boundaries. Thus, the payload format allows ISF switching only between payloads.

The payload format is designed around the property that AMR-WB+ frames carried in a payload are consecutive in time and share the same frame duration between any two ISF changes. This enables the receiver to derive the timestamp for an individual frame within a payload based either on the order of frames in the payload (basic mode) or on the compact displacement fields (interleaved mode). The frame timestamps are used to regenerate the correct order of frames after reception, identify duplicates, and detect lost frames that require concealment.

The interleaving scheme of this payload format is significantly more flexible than the one specified in RFC 3267. The AMR and AMR-WB payload format is only capable of using periodic patterns with frames taken from an interleaving group at fixed intervals, whereas this interleaving scheme allows any pattern, as long as the difference in decoding order between any two adjacent frames in the interleaved payload is not more than 256 frames. Note that even at the highest ISF this allows an interleaving depth of up to 3.41 seconds.

To allow for error resiliency through redundant transmission, the periods covered by multiple packets MAY overlap in time. A receiver MUST be prepared to receive any audio frame multiple times. All multiply sent frames MUST use the same frame type and ISF, and have the same RTP timestamp, or be a NO_DATA frame (FT=15).
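
The timestamp handling described above can be sketched as follows for the basic mode. This is a minimal illustration rather than normative receiver behavior: it assumes that frame i in a payload has the packet timestamp plus i frame durations, takes the tick counts from Table 1, and lets real audio data replace a NO_DATA copy held for the same timestamp. All names are hypothetical.

```python
RTP_CLOCK_HZ = 72000
NO_DATA = 15  # frame type value marking a frame without audio data

# Ticks per transport frame at 72000 Hz, indexed by ISF index (Table 1).
TICKS_PER_FRAME = [1440, 2880, 2560, 2304, 2160, 1920, 1728,
                   1536, 1440, 1280, 1152, 1080, 1024, 960]


def basic_mode_timestamps(rtp_timestamp, isf_index, frame_count):
    """Timestamps of consecutive frames; RTP timestamps wrap at 32 bits."""
    step = TICKS_PER_FRAME[isf_index]
    return [(rtp_timestamp + i * step) % 2**32 for i in range(frame_count)]


def store_frame(buffer, timestamp, frame_type, data):
    """Keep one copy per timestamp (assumed policy: audio beats NO_DATA)."""
    held = buffer.get(timestamp)
    if held is None or (held[0] == NO_DATA and frame_type != NO_DATA):
        buffer[timestamp] = (frame_type, data)


def max_interleave_span_seconds(isf_index, max_frames=256):
    """Time covered by the largest allowed displacement of 256 frames."""
    return max_frames * TICKS_PER_FRAME[isf_index] / RTP_CLOCK_HZ


if __name__ == "__main__":
    print(basic_mode_timestamps(1000, 8, 3))
    print(round(max_interleave_span_seconds(13), 2))
```

Keying the buffer on the timestamp is what makes redundant copies harmless: a second copy of a frame lands on the same key and is dropped, or fills in a NO_DATA placeholder.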

The payload consists of octet-aligned elements (header, ToC and audio frames), and only the audio frames for AMR-WB frame types (0-9) require any padding to make them an integral number of octets long. If additional padding is required to bring the payload length to a larger multiple of octets or for some other purpose, then the P bit in the RTP header MAY be set and padding appended as specified in [3].

4.1. RTP Header Usage

The format of the RTP header is specified in [3]. This payload format uses the fields of the header in a manner consistent with that specification.

The RTP timestamp corresponds to the sampling instant of the first sample encoded for the first frame in the packet. The timestamp clock frequency SHALL be 72000 Hz. This frequency allows the frame duration to be an integer number of RTP timestamp ticks for the ISFs used, and also gives reasonable conversion factors to the audio sampling frequencies used. See section 4.3.1 for how to derive the RTP timestamp for any audio frame beyond the first one.

The RTP header marker bit (M) SHALL be set to 1 if the first frame carried in the packet is the first audio frame in a talkspurt. For all other packets the marker bit SHALL be set to zero (M=0).

The assignment of an RTP payload type for this new packet format is outside the scope of this document, and will not be specified here. It is expected that the RTP profile under which this payload format is being used will assign a payload type for this encoding or specify that the payload type is to be bound dynamically.

The media type parameter "channels" is used to indicate the maximum number of channels allowed to be used for a given payload type. A payload type where channels=1 (mono) SHALL carry only mono content, while a payload type for which channels=2 has been declared MAY carry both mono and stereo content.

4.2. Payload Structure

The complete payload consists of a payload header, a payload table of contents, and the audio data representing one or more audio frames. The following diagram shows the general payload format layout:

+----------------+-------------------+----------------
| payload header | table of contents | audio data ...
+----------------+-------------------+----------------

Payloads containing more than one audio frame are called compound payloads.

The following sections describe the variations taken by the payload format depending on whether the AMR-WB+ session is set up to use the basic mode or interleaved mode.

4.3. Payload Definitions

4.3.1. The Payload Header

The payload header carries data that is common for all frames in the payload. The structure of the payload header is described below.

 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|   ISF   |TFI|L|
+-+-+-+-+-+-+-+-+

ISF (5 bits): Indicates the Internal Sampling Frequency employed for all frames in this payload. The index value corresponds to the internal sampling frequency as specified in Table 24 in [1]. This field SHALL be set to 0 for Frame Type values 0-13.

TFI (2 bits): Transport Frame Index from 0 (first) to 3 (last), indicating the position of the first transport frame of this payload in the AMR-WB+ super-frame structure. This field SHALL be set to 0 for Frame Type values 0-9, and SHALL be ignored by the receiver.

L (1 bit): Long displacement field flag for payloads in interleaved mode. If set to 0, four-bit displacement fields are used to indicate the interleaving offset; if set to 1, displacement fields of eight bits are used (see section 4.3.2.2). For payloads in the basic mode this bit SHALL be set to 0 and SHALL be ignored by the receiver.

Note that a change of ISF during a session always requires separate packets for frames employing different ISF values.
Furthermore, in the interleaved mode ISF switching also requires
terminating the previous interleaving pattern and starting a new one
for the new ISF.

4.3.2. The Payload Table of Contents

The table of contents (ToC) consists of a list of ToC entries where
each entry corresponds to a group of audio frames carried in the
payload, i.e.

   +----------------+----------------+- ... -+----------------+
   |  ToC entry #1  |  ToC entry #2  |       |  ToC entry #N  |
   +----------------+----------------+- ... -+----------------+

When multiple groups of frames are present in a payload, the ToC
entries SHALL be placed in the packet in order of their creation
time.

4.3.2.1. ToC Entry in the Basic Mode

A ToC entry of a payload in the basic mode takes the following
format:

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F| Frame Type  |    #frames    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

F (1 bit): If set to 1, indicates that this ToC entry is followed by
   another ToC entry; if set to 0, indicates that this ToC entry is
   the last one in the ToC.

Frame Type (FT) (7 bits): Indicates the audio codec frame type used
   for the group of frames corresponding to this ToC entry. FT
   indicates the combination of AMR-WB+ core and stereo rate, one of
   the special AMR-WB+ frame types, the AMR-WB rate, or comfort
   noise, as specified by Table 25 in [1].

#frames (8 bits): This field indicates the number of frames in the
   group corresponding to this ToC entry. The number of frames is the
   value of this field plus one, i.e. in the range 1-256.

4.3.2.2. ToC Entry in the Interleaved Mode

A ToC entry of a payload in the interleaved mode takes the following
format if the L-bit in the payload header is set to 0:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F| Frame Type  |    #frames    | DIS1  |  ...  | DISi  |  ...  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  ...  |  ...  | DISn  | padd  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

F (1 bit): See definition in 4.3.2.1.

Frame Type (FT) (7 bits): See definition in 4.3.2.1.

#frames (8 bits): See definition in 4.3.2.1.

DIS1...DISn (4 bits): A list of n (n=#frames) displacement fields
   indicating the displacement of the i:th (i=1..n) audio frame
   relative to the preceding audio frame in the payload as number of
   frames. The four-bit displacement values may be between 0 and 15,
   indicating the number of audio frames in decoding order between
   the (i-1):th and the i:th frame in the payload. Note that for the
   first ToC entry of the payload the value of DIS1 has no meaning,
   since this frame's location in the decoding order is uniquely
   defined by the RTP timestamp and TFI in the payload header. For
   the first ToC entry of a payload, DIS1 SHALL be set to zero, and
   the receiver SHALL ignore the value. Note also that for subsequent
   ToC entries DIS1 indicates the number of frames between the last
   frame of the previous group and the first frame of this group.

Padd (4 bits): Four padding bits SHALL be included at the end of the
   ToC entry in case there is an odd number of frames in the group
   corresponding to this entry. These bits SHALL be set to zero and
   SHALL be ignored by the receiver. If a group containing an even
   number of frames is associated with this ToC entry, these padding
   bits SHALL NOT be included in the payload.

A ToC entry of a payload in the interleaved mode takes the following
format if the L-bit in the payload header is set to 1:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F| Frame Type  |    #frames    |     DIS1      |      ...      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      ...      |     DISn      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

F (1 bit): See definition in 4.3.2.1.

Frame Type (FT) (7 bits): See definition in 4.3.2.1.

#frames (8 bits): See definition in 4.3.2.1.

DIS1...DISn (8 bits): A list of n (n=#frames) displacement fields
   indicating the displacement of the i:th (i=1..n) audio frame
   relative to the preceding audio frame in the payload as number of
   frames. The eight-bit displacement values may be between 0 and
   255, indicating the number of audio frames in decoding order
   between the (i-1):th and the i:th frame in the payload. Note that
   for the first ToC entry of the payload the value of DIS1 has no
   meaning, since this frame's location in the decoding order is
   uniquely defined by the RTP timestamp and TFI in the payload
   header. For the first ToC entry of a payload, DIS1 SHALL be set to
   zero, and the receiver SHALL ignore the value. Note also that for
   subsequent ToC entries DIS1 indicates the displacement between the
   last frame of the previous group and the first frame of this
   group.

4.3.2.3. RTP Timestamp Derivation

The RTP timestamp value for a frame is the timestamp value of the
first audio sample encoded in the frame. The timestamp value for a
frame is derived differently depending on whether the payload is in
basic or interleaved mode. In both cases the first frame in a
compound packet has an RTP timestamp equal to the one received in the
RTP header.
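
The derivation rules of this section, together with the payload header
and ToC entry layouts of sections 4.3.1, 4.3.2.1 and 4.3.2.2, can be
illustrated with a short, non-normative Python sketch. It rests on
several assumptions that are not part of this specification: the
function names are purely illustrative; the frame duration in
timestamp ticks is supplied by the caller rather than looked up from
the ISF; all frames of the payload are assumed to share one frame
duration; and malformed payloads are handled only by a minimal length
check.

```python
RTP_TS_MODULUS = 1 << 32  # RTP timestamps are 32 bits wide and wrap around


def split_payload_header(octet):
    """Return (ISF index, TFI, L flag) from the one-octet payload header."""
    return octet >> 3, (octet >> 1) & 0x03, octet & 0x01


def read_toc(payload, interleaved):
    """Parse the ToC that follows the payload header (hypothetical helper).

    Returns (entries, offset): entries is a list of
    (frame_type, frame_count, displacements) tuples and offset is the
    index of the first audio data octet.
    """
    _, _, long_dis = split_payload_header(payload[0])
    pos, entries, more = 1, [], True
    while more:
        more = bool(payload[pos] & 0x80)      # F bit: another entry follows
        frame_type = payload[pos] & 0x7F      # FT, 7 bits
        count = payload[pos + 1] + 1          # field value plus one (4.3.2.1)
        pos += 2
        dis = []
        if interleaved and long_dis:          # eight-bit displacement fields
            dis = list(payload[pos:pos + count])
            pos += count
        elif interleaved:                     # four-bit fields, padded to an octet
            for i in range(count):
                octet = payload[pos + i // 2]
                dis.append(octet >> 4 if i % 2 == 0 else octet & 0x0F)
            pos += (count + 1) // 2
        if pos > len(payload):
            raise ValueError("payload ends inside the ToC")
        entries.append((frame_type, count, dis))
    return entries, pos


def basic_timestamps(rtp_ts, frame_count, ticks):
    """Basic mode: frame k starts k frame durations after the first frame."""
    return [(rtp_ts + k * ticks) % RTP_TS_MODULUS for k in range(frame_count)]


def interleaved_timestamps(rtp_ts, displacements, ticks):
    """Interleaved mode: consecutive frames are (DIS + 1) durations apart.

    displacements is the flat list of DIS values over all ToC entries;
    the very first value is ignored, as required for the first ToC entry.
    """
    stamps = [rtp_ts % RTP_TS_MODULUS]
    for dis in displacements[1:]:
        stamps.append((stamps[-1] + (dis + 1) * ticks) % RTP_TS_MODULUS)
    return stamps


# Usage: ISF index 10, TFI 0, L = 0, one ToC entry (FT 47 chosen arbitrarily)
# with four frames and four-bit displacements 0, 6, 4, 7.
payload = bytes([(10 << 3) | 0, 47, 3, 0x06, 0x47])
entries, offset = read_toc(payload, interleaved=True)
print(interleaved_timestamps(12345, entries[0][2], 1152))
# prints [12345, 20409, 26169, 35385]
print(basic_timestamps(12345, 4, 1152)[3])
# prints 15801
```

The header timestamp 12345, the 1152-tick frame duration and the
displacement values are those of the worked examples in this section,
so the printed values can be compared against the figures given there.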

In the basic mode, the RTP timestamp for any subsequent frame is
derived by adding together the frame durations (see Table 1) of all
the preceding frames in the payload and adding the sum to the RTP
header timestamp value. For example, if the RTP header timestamp
value is 12345, the payload carries four frames, and the frame
duration is 16 ms (ISF = 32 kHz) corresponding to 1152 timestamp
ticks, the RTP timestamp of the fourth frame in the payload is
12345 + 3 * 1152 = 15801.

In interleaved mode the RTP timestamp for each frame in the payload
is derived by combining the RTP header timestamp and the sum of the
time offsets of all preceding frames in this payload. The frame
timestamps are computed based on the displacement fields and the
frame duration derived from the ISF value. Note that the displacement
in time between frame i-1 and frame i is (DISi + 1) * frame duration,
because the duration of the (i-1):th frame must also be taken into
account.

The following example derives the RTP timestamps for the frames in an
interleaved mode payload having the following header and ToC
information:

   RTP header timestamp: 12345
   ISF = 32 kHz
   Frame 1 displacement field: DIS1 = 0
   Frame 2 displacement field: DIS2 = 6
   Frame 3 displacement field: DIS3 = 4
   Frame 4 displacement field: DIS4 = 7

The ISF of 32 kHz implies a frame duration of 16 ms, which means 1152
ticks at the 72 kHz timestamp rate. The timestamp of the first frame
in the payload is the RTP timestamp, i.e. TS1 = RTP TS. Note that the
displacement field value for this frame must be ignored. For the
second frame in the payload the timestamp can be calculated as
TS2 = TS1 + (DIS2 + 1) * 1152 = 20409. For the third frame the
timestamp is TS3 = TS2 + (DIS3 + 1) * 1152 = 26169. Finally, for the
fourth frame of the payload we have
TS4 = TS3 + (DIS4 + 1) * 1152 = 35385.

4.3.2.4. Frame Type Considerations

The value of Frame Type is defined in Table 25 in [1]. FT=14
(AUDIO_LOST) is used to indicate frames that are lost. A NO_DATA
(FT=15) frame can mean either that there is no data produced by the
audio encoder for that frame or that no data for that frame is
transmitted in the current payload (i.e., valid data for that frame
could be sent either in an earlier or later packet). The duration for
these non-included frames is dependent on the internal sampling
frequency indicated by the ISF field.

For frame types with index 0-13 the ISF field SHALL be set to 0 and
has no meaning. The frame duration for these frame types is fixed to
20 ms in time, i.e. 1440 ticks at 72 kHz. For payloads containing
only frame types with index 0-9 the TFI field SHALL be set to 0, and
has no meaning.

4.3.2.5. Other ToC Considerations

If a ToC entry with an undefined FT value is received, the whole
packet SHOULD be discarded. This is to avoid the loss of data
synchronization in the depacketization process, which can result in a
severe degradation in audio quality.

Note that packets containing only NO_DATA frames SHOULD NOT be
transmitted. Also, NO_DATA frames at the end of a frame sequence to
be carried in a payload SHOULD NOT be included in the transmitted
packet. The AMR-WB+ SCR/DTX is identical to the AMR-WB SCR/DTX
described in [5] and can only be used in combination with the AMR-WB
frame types (0-8).

When multiple groups of frames are present, their ToC entries will be
placed in the ToC in order of their creation time, independently of
the payload mode. In basic mode the frames will be consecutive in
time, while in interleaved mode the frames may not only be
non-consecutive in time but may even have varying inter-frame
distances.

4.3.2.6. ToC Examples

The following figure shows an example of a ToC for three audio frames
in basic mode.
Note that in this case all audio frames are encoded using the same
frame type, i.e. there is only one ToC entry.

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| Frame Type1 |  #frames = 3  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The following figure shows an example of a ToC of three entries in
basic mode. Note that also in this case the payload carries three
frames, but three ToC entries are needed since all frames of the
payload are encoded using different frame types.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| Frame Type1 |  #frames = 1  |1| Frame Type2 |  #frames = 1  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| Frame Type3 |  #frames = 1  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The following figure shows an example of a ToC of two entries in
interleaved mode using four-bit displacement fields. The payload
includes two groups of frames, the first one including a single
frame, and the other one consisting of two frames.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| Frame Type1 |  #frames = 1  | DIS1  | padd  |0| Frame Type2 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  #frames = 2  | DIS1  | DIS2  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.3.3. Audio Data

Audio data of a payload contains one or more audio frames or comfort
noise frames, as described in the ToC of the payload. Note, for ToC
entries with FT=14 or 15, there will be no corresponding audio frame
present in the audio data.

Each audio frame for an extension frame type represents an AMR-WB+
transport frame corresponding to the encoding of 512 samples of audio
sampled with the internal sampling frequency specified by the ISF
indicator.
As an exception, frame types with index 10-13 are only capable of
using a single internal sampling frequency (25600 Hz). The encoding
rates (combination of core bit-rate and stereo bit-rate) are
indicated in the frame type field of the corresponding ToC entry. The
octet length of the audio frame is implicitly defined by the frame
type field and is given in tables 21 and 25 of [1]. The order and
numbering notation of the bits are as specified in [1]. As specified
there, the bits of the AMR-WB audio frames (frame type values in
range 0...8) have been rearranged in order of decreasing sensitivity.
For the AMR-WB+ extension frame types and comfort noise frames, the
bits are in the order produced by the encoder.

The last octet of each audio frame MUST be padded with zeroes at the
end if not all bits in the octet are used. In other words, each audio
frame MUST be octet-aligned. However, all extension frame types
(10-13, 16-47) specified in [1] lead to octet-aligned frames.

4.3.4. Methods for Forming the Payload

The payload begins with the payload header, followed by the table of
contents consisting of a list of ToC entries.

The audio data follows the table of contents. All of the octets
comprising an audio frame are appended to the payload as a unit. The
audio frames are packed in timestamp order within each group of
frames (per ToC entry). Each group of frames is packed in the same
order as their corresponding ToC entries are arranged in the ToC,
with the exception that for a ToC entry with FT=14 or FT=15 there
will be no data octets present for that group of frames.

4.3.5. Payload Examples

4.3.5.1. Example 1, Basic Mode Payload Carrying Multiple Frames
         Encoded Using the Same Frame Type

The following diagram shows a payload that carries three AMR-WB+
frames encoded using the 14 kbit/s frame type (FT=26) with a frame
length of 280 bits (35 bytes).
The internal sampling frequency in this example is 25.6 kHz
(ISF = 8). The TFI for the first frame is 2, indicating that the
first transport frame in this payload is the third in a super-frame.
Since this payload is in the basic mode the subsequent frames of the
payload are consecutive frames in decoding order, i.e. the fourth
transport frame of the current super-frame and the first transport
frame of the next super-frame. Note that because the frames are all
encoded using the same frame type, only one ToC entry is required.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | ISF = 8 | 2 |0|0|   FT = 26   |  #frames = 3  |   f1(0...7)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      ...      | f1(272...279) |   f2(0...7)   |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | f2(272...279) |   f3(0...7)   |              ...              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      ...                      | f3(272...279) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.3.5.2. Example 2, Basic Mode Payload Carrying Multiple Frames
         Encoded Using Different Frame Types

The following diagram shows a payload that carries three AMR-WB+
frames; the first frame is encoded using 18.4 kbit/s frame type
(FT=33) with a frame length of 368 bits (46 bytes), and the two
subsequent frames are encoded using 20 kbit/s frame type (FT=35)
having frame length of 400 bits (50 bytes). The internal sampling
frequency in this example is 32 kHz (ISF = 10), implying the overall
bit-rates of 23 kbit/s for the first frame of the payload, and
25 kbit/s for the subsequent frames.
The TFI for the first frame is 3, indicating that the first transport
frame in this payload is the fourth in a super-frame. Since this is a
payload in the basic mode the subsequent frames of the payload are
consecutive frames in decoding order, i.e. the first and second
transport frames of the next super-frame. Note that since the payload
carries two different frame types, there are two ToC entries.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | ISF=10  | 3 |0|1|   FT = 33   |  #frames = 1  |0|   FT = 35   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  #frames = 2  |   f1(0...7)   |              ...              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              ...              | f1(360...367) |   f2(0...7)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | f2(392...399) |   f3(0...7)   |              ...              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              ...              | f3(392...399) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.3.5.3. Example 3, Payload in Interleaved Mode

This example shows a payload in interleaved mode carrying four frames
encoded using the 32 kbit/s frame type (FT=47) with a frame length of
640 bits (80 bytes). The internal sampling frequency is 38.4 kHz
(ISF = 13), implying a bit-rate of 48 kbit/s for all frames in the
payload. The TFI for the first frame is 0, i.e. it is the first
transport frame of a super-frame. The displacement fields for the
subsequent frames are DIS2=18, DIS3=15, and DIS4=10, which implies
that the subsequent frames have the TFIs of 3, 3, and 2,
respectively.
The long displacement field flag L in the payload header is set to 1,
which means that the displacement fields in the ToC entry use eight
bits. Note that since all frames of this payload are encoded using
the same frame type, only a single ToC entry is needed. Furthermore,
the displacement field for the first frame corresponding to the first
ToC entry (DIS1=0) must be ignored since its timestamp and TFI are
defined by the RTP timestamp and the TFI found in the payload header.

The RTP timestamp values of the frames in this example are:

   Frame1: TS1 = RTP Timestamp
   Frame2: TS2 = TS1 + 19 * 960
   Frame3: TS3 = TS2 + 16 * 960
   Frame4: TS4 = TS3 + 11 * 960

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | ISF=13  | 0 |1|0|   FT = 47   |  #frames = 4  |   DIS1 = 0    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   DIS2 = 18   |   DIS3 = 15   |   DIS4 = 10   |   f1(0...7)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              ...              | f1(632...639) |   f2(0...7)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              ...              | f2(632...639) |   f3(0...7)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              ...              | f3(632...639) |   f4(0...7)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              ...              | f4(632...639) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.4. Interleaving Considerations

The flexible interleaving scheme requires some further usage
considerations.
As presented in the example in Section 3.6.2, an interleaving pattern
requires a certain size of the deinterleaving buffer. This required
buffer space, expressed as a number of frame slots, is indicated
using the "interleaving" media type parameter. The number of frame
slots needed can be converted into an actual memory requirement by
considering the largest (in bytes) combination of AMR-WB+'s core and
stereo rates.

However, the information about the frame buffer size is not always
sufficient to determine when it is appropriate to start consuming
frames from the interleaving buffer. There are two cases in which
additional information is needed: switching of the ISF, and changes
of the interleaving pattern. For this reason the "int-delay" media
type parameter is defined. It allows a sender to indicate the minimal
media time that needs to be present in the buffer before the decoder
can start consuming frames from the buffer.

4.5. Implementation Considerations

An application implementing this payload format MUST understand all
the payload parameters in the out-of-band signaling used. For
example, if an application uses SDP, all the SDP and MIME parameters
in this document MUST be understood. This requirement ensures that an
implementation can always decide whether or not it is capable of
communicating.

Both basic and interleaved mode SHALL be implemented. The
implementation burden of both is rather small, and requiring both
ensures interoperability. As the AMR-WB+ codec contains the full
functionality of the AMR-WB codec, anyone supporting the AMR-WB+
codec and this payload format is RECOMMENDED to also implement the
payload format in RFC 3267 [7] for the AMR-WB frame types. This will
significantly help interoperability with devices that only support
AMR-WB, in applications and scenarios where this is possible.
Otherwise an AMR-WB+ end-point that is in fact capable of everything
except the RTP payload format for AMR-WB will not be able to
communicate with such devices.

When doing error concealment certain precautions are needed due to
the possibility of switching of the ISF. The main difficulty is that
when a packet is lost, the information required to perform the error
concealment correctly (the ISF, the number of frames and the RTP
timestamp of the missing packet) is lost with it. This may lead to a
case where the error concealment is performed using an incorrect
frame length, which in the worst case can make some of the frames
received in subsequent payloads unusable. More information and an
example algorithm for solving this problem are available in section
4.5.1 below.

4.5.1. ISF recovery in case of packet loss

In case of packet loss a proper error concealment has to be initiated
in the AMR-WB+ decoder to replace the frames carried in the lost
packet. A loss concealment algorithm requires a codec framing that
matches the timestamps of the correctly received frames. Hence, it is
necessary to recover the timestamps of the lost frames. A difficulty
with this may arise because the codec frame length that is associated
with the ISF may have changed during the frame loss.

The task of recovering the timestamps of lost frames is illustrated
by an example case where two frames with timestamps t0 and t1 have
been received properly, the first one being from the last packet
before the loss, and the latter from the first packet after the loss
period. The ISF values for these packets are isf0 and isf1,
respectively. The associated frame lengths (in timestamp ticks) are
given as L0 and L1, respectively. Three frames with timestamps x1 -
x3 have been lost. The example further assumes that the ISF changes
once from isf0 to isf1 during the frame loss, as shown in the figure
below.

What is generally not known in the decoder and what is required for
recovery of the timestamps is:

   * the ISFs associated to the lost frames
   * how many frames have been lost

    |<---L0--->|<---L0--->|<-L1->|<-L1->|<-L1->|
    |   Rxd    |   lost   | lost | lost | Rxd  |
  --+----------+----------+------+------+------+--
    t0         x1         x2     x3     t1

In the following an example algorithm is given, which may be used to
recover timestamps and ISFs belonging to lost frames. As in the above
example, it is assumed that two frames have been received properly
with timestamps t0 and t1, and ISF values isf0 and isf1, and
associated frame lengths L0 and L1, respectively. Furthermore, the
TFIs of the two received frames are denoted by tfi0 and tfi1,
respectively.

Example Algorithm:

   Start:   # check for frame loss
            If (t0 + L0) == t1 Then goto End   # no frame loss

   Step 1:  # check case with no ISF change
            If (isf0 != isf1) Then goto Step 2
                                       # At least one ISF change
            If (isFractional((t1 - t0)/L0)) Then goto Step 3
                                       # More than 1 ISF change
            Return recovered timestamps as x(n) = t0 + n*L1 and
            associated ISF equal to isf0, for 0