Internet Draft                                            Kyle J. McKay
draft-mckay-qcelp-01.txt                                  Eric C. Rosen
Expires: October 1998                             QUALCOMM Incorporated
                                                         August 5, 1998

               RTP Payload Format for PureVoice(tm) Audio

Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress".

   To view the entire list of current Internet-Drafts, please check the
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
   Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
   Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).

   Distribution of this document is unlimited.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [3].

ABSTRACT

   This document describes the RTP payload format for PureVoice(tm)
   Audio. The packet format supports variable interleaving to reduce
   the effect of packet loss on audio quality.

1 Introduction

   This document describes how compressed PureVoice audio as produced
   by the QUALCOMM PureVoice CODEC [1] may be formatted for use as an
   RTP payload type. A method is provided to interleave the output of
   the compressor to reduce quality degradation due to lost packets.
   Furthermore, the sender may choose various interleave settings based
   on the importance of low end-to-end delay versus greater tolerance
   for lost packets.

McKay                     Expires October 1998                 [Page 1]
draft-mckay-qcelp-01.txt     PureVoice over RTP           5 August 1998

2 Background

   The Electronic Industries Association (EIA) & Telecommunications
   Industry Association (TIA) standard IS-733 [1] defines an audio
   compression algorithm for use in CDMA applications. In addition to
   being the standard CODEC for all wireless CDMA terminals, the
   QUALCOMM PureVoice CODEC (a.k.a. Qcelp) is used in several Internet
   applications, most notably JFax(tm), Apple(r) QuickTime(tm), and
   Eudora(r).

   The Qcelp CODEC [1] compresses each 20 milliseconds of 8000 Hz,
   16-bit sampled input speech into one of four different size output
   frames: Rate 1 (266 bits), Rate 1/2 (124 bits), Rate 1/4 (54 bits)
   or Rate 1/8 (20 bits). The CODEC chooses the output frame rate based
   on analysis of the input speech and the current operating mode
   (either normal or reduced rate). For typical speech patterns, this
   results in an average output of 6.8 k bits/sec for normal mode and
   4.7 k bits/sec for reduced rate mode.

3 RTP/Qcelp Packet Format

   The RTP timestamp is in 1/8000 of a second units. The RTP payload
   data for the Qcelp CODEC has the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        RTP Header [2]                         |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |E|R| LLL | NNN |                                               |
   +-+-+-+-+-+-+-+-+      one or more codec data frames            |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The RTP header has the expected values as described in [2]. The
   extension bit is not set. The marker bit MAY be set at the beginning
   of a talkspurt. The codec data frames are aligned on octet
   boundaries. When interleaving is in use, and/or multiple codec data
   frames are present in a single RTP packet, the timestamp is, as
   always, that of the oldest data represented in the RTP packet.
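
   As an illustration only, the first payload octet in the diagram
   above could be unpacked as sketched below. Internet bit 0 is the
   most significant bit, so E is the top bit of the octet. The names
   QcelpHeader and parse_payload_header are hypothetical, and returning
   None for an empty payload is an assumption of this sketch. The range
   checks follow the LLL and NNN rules given with the field definitions
   in this section.

```python
from typing import NamedTuple, Optional


class QcelpHeader(NamedTuple):
    """Fields of the first RTP payload octet (hypothetical helper type)."""
    encrypted: bool   # E, Internet bit 0 (most significant bit)
    reserved: int     # R, Internet bit 1 (ignored by receivers)
    interleave: int   # LLL, Internet bits 2-4
    index: int        # NNN, Internet bits 5-7


def parse_payload_header(payload: bytes) -> Optional[QcelpHeader]:
    """Decode the first payload octet.

    Returns None when the packet should be treated as lost:
    LLL greater than 5, NNN greater than LLL, or (an assumption
    of this sketch) an empty payload.
    """
    if not payload:
        return None
    octet = payload[0]
    header = QcelpHeader(
        encrypted=bool(octet >> 7),
        reserved=(octet >> 6) & 0x1,
        interleave=(octet >> 3) & 0x7,
        index=octet & 0x7,
    )
    if header.interleave > 5 or header.index > header.interleave:
        return None
    return header
```

   A receiver that does not support encryption would additionally
   discard any packet for which the encrypted field is true.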

   The other fields have the following meaning:

   Encrypted (E): 1 bit
      MUST be set to zero by sender except when data frames are
      encrypted. Senders MAY support encryption. Receivers MAY support
      encryption but MUST examine this bit to detect and discard
      encrypted payloads if encryption is not supported. Encryption is
      discussed further in section 6.

   Reserved (R): 1 bit
      MUST be set to zero by sender, ignored by receiver.

   Interleave (LLL): 3 bits
      MUST have a value between 0 and 5 inclusive. The remaining two
      values (6 and 7) MUST NOT be used by senders. If this field is
      non-zero, interleaving is enabled. All receivers MUST support
      interleaving. Senders MAY support interleaving. Senders that do
      not support interleaving MUST set fields LLL and NNN to zero.

   Interleave Index (NNN): 3 bits
      MUST have a value less than or equal to the value of LLL. Values
      of NNN greater than the value of LLL are invalid.

3.1 Receiving Invalid Values

   On receipt of an RTP packet with an invalid value of the LLL or NNN
   field, the RTP packet MUST be treated as lost by the receiver for
   the purpose of generating erasure frames as described in section 4.

3.2 CODEC data frame format

   The output of the Qcelp CODEC must be converted into CODEC data
   frames for inclusion in the RTP payload as follows:

   a. The lower nibble of Octet 0 of each CODEC data frame defines the
      rate and total size of the frame. The upper nibble is reserved.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | RRRR  |   T   |  Vocoder data frame octets as in [1] ....     |
   +-+-+-+-+-+-+-+-+                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The interpretation of Octet 0 of each CODEC data frame is defined as
   follows:

   Reserved (RRRR): 4 bits
      MUST be set to zero by sender. MUST be ignored by receivers.

   Frame Type (T): 4 bits
      Defines (by value) the frame rate and size of the vocoder data
      octets that immediately follow according to the following
      mapping:

                              TOTAL CODEC
      FRAME       FRAME       Data frame
      TYPE (T)    RATE        size (octets)
      --------+-----------+----------------
         0    |  Blank    |    1
         1    |   1/8     |    4
         2    |   1/4     |    8
         3    |   1/2     |   17
         4    |    1      |   35
         5    | Reserved  |    8
        14    |  Erasure  |    1

      All other Frame Type values not listed above are reserved.
      Receipt of a CODEC data frame with a reserved frame type MUST be
      considered invalid data as described in section 3.1.

   b. The bits as numbered in the standard [1] from highest to lowest
      are packed into octets. The highest numbered bit (265 for Rate 1,
      123 for Rate 1/2, 53 for Rate 1/4 and 19 for Rate 1/8) is placed
      in the most significant bit (Internet bit 0) of octet 1 of the
      CODEC data frame. The second highest numbered bit (264 for Rate
      1, etc.) is placed in the second most significant bit (Internet
      bit 1) of octet 1 of the data frame. This continues so that bit
      258 from the standard Rate 1 frame is placed in the least
      significant bit of octet 1. Bit 257 from the standard is placed
      in the most significant bit of octet 2 and so on, until bit 0
      from the standard Rate 1 frame is placed in Internet bit 1 of
      octet 34 of the CODEC data frame. The remaining unused bits of
      the last octet of the CODEC data frame MUST be set to zero.

   Here is a detail of how a Rate 1/8 frame is converted into a CODEC
   data frame:

                          CODEC data frame
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               |1|1|1|1|1|1|1|1|1|1| | | | | | | | | | | | | | |
   | 1 (Rate 1/8)  |9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|Z|Z|Z|Z|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Octet 0 of the data frame has value 1 (see table above) indicating
   the total data frame length (including octet 0) is 4 octets.
   Bits 19 through 0 from the standard Rate 1/8 frame are placed as
   indicated with bits marked with "Z" being set to zero. The Rate 1,
   1/4 and 1/2 standard frames are converted similarly.

3.3 Bundling CODEC data frames

   As indicated in section 3, more than one CODEC data frame MAY be
   included in a single RTP packet by a sender. Receivers MUST handle
   bundles of up to 10 CODEC data frames in a single RTP packet.

   Furthermore, senders have the following additional restrictions:

   o  MUST NOT bundle more CODEC data frames in a single RTP packet
      than will fit in the MTU of the RTP transport protocol. For the
      purpose of computing the maximum bundling value, all CODEC data
      frames should be assumed to have the Rate 1 size.

   o  MUST NOT bundle more than 10 CODEC data frames in a single RTP
      packet.

   o  Once beginning transmission with a given SSRC and given bundling
      value, MUST NOT increase the bundling value. If the bundling
      value needs to be increased, a new SSRC number MUST be used.

   o  MAY decrease the bundling value only between interleave groups
      (see section 3.4). If the bundling value is decreased, it MUST
      NOT be increased (even to the original value), although it may be
      decreased again at a later time.

3.3.1 Determining the number of bundled CODEC data frames

   Since no count is transmitted as part of the RTP payload and the
   CODEC data frames have differing lengths, the only way to determine
   how many CODEC data frames are present in the RTP packet is to
   examine octet 0 of each CODEC data frame in sequence until the end
   of the RTP packet is reached.

3.4 Interleaving CODEC data frames

   Interleaving is meaningful only when more than one CODEC data frame
   is bundled into a single RTP packet.

   All receivers MUST support interleaving. Senders MAY support
   interleaving.
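
   The bit packing of section 3.2 and the frame walk of section 3.3.1
   can be sketched as follows. This is a minimal illustration, not a
   reference implementation. The names pack_frame and split_frames are
   hypothetical, the vocoder frame is assumed to be held in an integer
   whose bit i is bit i of the standard frame, and treating frame type
   5 as invalid data is an assumption of this sketch.

```python
from typing import List

# Total CODEC data frame size in octets (including octet 0), by frame type T.
FRAME_SIZE = {0: 1, 1: 4, 2: 8, 3: 17, 4: 35, 14: 1}
# Number of vocoder bits carried by each rate's frame type.
RATE_BITS = {1: 20, 2: 54, 3: 124, 4: 266}


def pack_frame(frame_type: int, vocoder_bits: int) -> bytes:
    """Build a CODEC data frame from one vocoder frame held as an integer."""
    nbits = RATE_BITS[frame_type]
    if not 0 <= vocoder_bits < (1 << nbits):
        raise ValueError("vocoder frame does not fit the selected rate")
    noctets = FRAME_SIZE[frame_type] - 1
    # Shift left so the highest numbered bit lands in the most significant
    # bit of octet 1 and the unused low bits of the last octet are zero.
    padded = vocoder_bits << (noctets * 8 - nbits)
    return bytes([frame_type]) + padded.to_bytes(noctets, "big")


def split_frames(frame_data: bytes) -> List[bytes]:
    """Split the octets after the payload header into CODEC data frames.

    Raises ValueError for a reserved frame type or a truncated frame.
    """
    frames = []
    offset = 0
    while offset < len(frame_data):
        # The upper nibble of octet 0 is reserved and ignored on receipt.
        frame_type = frame_data[offset] & 0x0F
        size = FRAME_SIZE.get(frame_type)
        if size is None:
            raise ValueError(f"reserved frame type {frame_type}")
        if offset + size > len(frame_data):
            raise ValueError("truncated CODEC data frame")
        frames.append(frame_data[offset:offset + size])
        offset += size
    return frames
```

   The bundling value of a received packet is then the length of the
   list returned by split_frames.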

   Given a time-ordered sequence of output frames from the Qcelp CODEC
   numbered 0..n, a bundling value B, and an interleave value L where
   n = B * (L+1) - 1, the output frames are placed into RTP packets as
   follows (the values of the fields LLL and NNN are indicated for each
   RTP packet):

   First RTP Packet in Interleave group:
      LLL=L, NNN=0
      Frame 0, Frame L+1, Frame 2(L+1), Frame 3(L+1), ... for a total
      of B frames

   Second RTP Packet in Interleave group:
      LLL=L, NNN=1
      Frame 1, Frame 1+L+1, Frame 1+2(L+1), Frame 1+3(L+1), ... for a
      total of B frames

   This continues to the last RTP packet in the interleave group:

   (L+1)th RTP Packet in Interleave group:
      LLL=L, NNN=L
      Frame L, Frame L+L+1, Frame L+2(L+1), Frame L+3(L+1), ... for a
      total of B frames

   Senders MUST transmit in timestamp-increasing order. Furthermore,
   within each interleave group, the RTP packets making up the
   interleave group MUST be transmitted in value-increasing order of
   the NNN field. While this does not guarantee reduced end-to-end
   delay on the receiving end, when packets are delivered in order by
   the underlying network, delay will be reduced to the minimum
   possible.

   Additionally, senders have the following restrictions:

   o  Once beginning transmission with a given SSRC and given
      interleave value, MUST NOT increase the interleave value. If the
      interleave value needs to be increased, a new SSRC number MUST be
      used.

   o  MAY decrease the interleave value only between interleave groups.
      If the interleave value is decreased, it MUST NOT be increased
      (even to the original value), although it may be decreased again
      at a later time.

3.5 Finding Interleave Group Boundaries

   Given an RTP packet with sequence number S, interleave value (field
   LLL) L, and interleave index value (field NNN) N, the interleave
   group consists of RTP packets with sequence numbers from S-N to
   S-N+L inclusive.
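
   As a rough sketch of the frame placement in section 3.4 and the
   sequence number range just given, consider the following. The
   function names are hypothetical. The deinterleave function is the
   inverse mapping a receiver would apply to a complete group, and the
   wrap of sequence numbers modulo 65536 reflects the 16-bit RTP
   sequence number field [2].

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def interleave(frames: Sequence[T], bundling: int,
               interleave_value: int) -> List[Tuple[int, int, List[T]]]:
    """Return (LLL, NNN, frames) for each RTP packet of one interleave group."""
    group_size = interleave_value + 1
    if len(frames) != bundling * group_size:
        raise ValueError("need exactly B * (L+1) frames for one group")
    # Packet N carries Frame N, Frame N+(L+1), Frame N+2(L+1), ...
    return [(interleave_value, n, list(frames[n::group_size]))
            for n in range(group_size)]


def deinterleave(packets: Sequence[Sequence[T]]) -> List[T]:
    """Rebuild time order from packets 0..L of one group (all of bundling B)."""
    bundling = len(packets[0])
    if any(len(p) != bundling for p in packets):
        raise ValueError("all packets in a group must have the same bundling")
    return [packet[b] for b in range(bundling) for packet in packets]


def group_sequence_numbers(s: int, n: int, l: int) -> List[int]:
    """Sequence numbers S-N .. S-N+L of the group, wrapped to 16 bits."""
    first = s - n
    return [(first + i) % 65536 for i in range(l + 1)]
```

   For example, with B = 2 and L = 2, frames 0..5 are carried as
   packets holding frames (0, 3), (1, 4) and (2, 5).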
   In other words, the interleave group always consists of L+1 RTP
   packets with sequential sequence numbers. The bundling value for all
   RTP packets in an interleave group MUST be the same.

   The receiver determines the expected bundling value for all RTP
   packets in an interleave group by the number of CODEC data frames
   bundled in the first RTP packet of the interleave group received.
   Note that this may not be the first RTP packet of the interleave
   group sent if packets are delivered out of order by the underlying
   network.

   On receipt of an RTP packet in an interleave group with other than
   the expected bundling value, the receiver MAY discard CODEC data
   frames off the end of the RTP packet or add erasure CODEC data
   frames to the end of the packet in order to manufacture a substitute
   packet with the expected bundling value. The receiver MAY instead
   choose to discard the whole interleave group and play silence.

3.6 Reconstructing Interleaved Audio

   Given an RTP sequence number ordered set of RTP packets in an
   interleave group numbered 0..L, where L is the interleave value and
   B is the bundling value, and CODEC data frames within each RTP
   packet that are numbered in order from first to last with the
   numbers 0..B-1, the original, time-ordered sequence of output frames
   from the CODEC may be reconstructed as follows:

   First L+1 frames:
      Frame 0 from packet 0 of interleave group
      Frame 0 from packet 1 of interleave group
      And so on up to...
      Frame 0 from packet L of interleave group

   Second L+1 frames:
      Frame 1 from packet 0 of interleave group
      Frame 1 from packet 1 of interleave group
      And so on up to...
      Frame 1 from packet L of interleave group

   And so on up to...

   Bth L+1 frames:
      Frame B-1 from packet 0 of interleave group
      Frame B-1 from packet 1 of interleave group
      And so on up to...
      Frame B-1 from packet L of interleave group

3.6.1 Additional Receiver Responsibility

   Assume that the receiver has begun playing frames from an interleave
   group. The time has come to play frame x from packet n of the
   interleave group. Further assume that packet n of the interleave
   group has not been received. As described in section 4, an erasure
   frame will be sent to the Qcelp CODEC.

   Now, assume that packet n of the interleave group arrives before
   frame x+1 of that packet is needed. Receivers SHOULD use frame x+1
   of the newly received packet n rather than substituting an erasure
   frame. In other words, just because packet n wasn't available the
   first time it was needed to reconstruct the interleaved audio, the
   receiver SHOULD NOT assume it's not available when it's subsequently
   needed for interleaved audio reconstruction.

4 Handling lost RTP packets

   The Qcelp CODEC supports the notion of erasure frames. These are
   frames that for whatever reason are not available. When
   reconstructing interleaved audio or playing back non-interleaved
   audio, erasure frames MUST be fed to the Qcelp CODEC for all of the
   missing packets.

   Receivers MUST use the timestamp clock to determine how many CODEC
   data frames are missing. Each CODEC data frame advances the
   timestamp clock EXACTLY 160 counts.

   Since the bundling value may vary (it can only decrease), the
   timestamp clock is the only reliable way to calculate exactly how
   many CODEC data frames are missing when a packet is dropped.

   Specifically when reconstructing interleaved audio, a missing RTP
   packet in the interleave group should be treated as containing B
   erasure CODEC data frames where B is the bundling value for that
   interleave group.

5 Discussion

   The Qcelp CODEC interpolates the missing audio content when given an
   erasure frame. However, the best quality is perceived by the
   listener when erasure frames are not consecutive.
   This makes interleaving desirable as it increases audio quality when
   dropped packets are more likely.

   On the other hand, interleaving can greatly increase the end-to-end
   delay. Where an interactive session is desired, an interleave (field
   LLL) value of 0 or 1 and a bundling value of 4 or less is
   recommended.

   When end-to-end delay is not a concern, a bundling value of at least
   4 and an interleave (field LLL) value of 4 or 5 is recommended
   subject to MTU limitations.

   The restrictions on senders set forth in sections 3.3 and 3.4
   guarantee that after receipt of the first payload packet from the
   sender, the receiver can allocate a well-known amount of buffer
   space that will be sufficient for all future reception from the same
   SSRC value. Less buffer space may be required at some point in the
   future if the sender decreases the bundling value or interleave, but
   never more buffer space. This prevents the possibility of the
   receiver needing to allocate more buffer space (with the possible
   result that none is available) should the bundling value or
   interleave value be increased by the sender. Also, were the
   interleave or bundling value to increase, the receiver could be
   forced to pause playback while it receives the additional packets
   necessary for playback at an increased bundling value or increased
   interleave.

6 Encrypted RTP/Qcelp Packet Format

   Senders may optionally encrypt PureVoice data frames as a means of
   guarding against eavesdropping. Applications may use any appropriate
   algorithm to encrypt data frames. The encryption algorithm, key
   length, key exchange mechanisms, and other encryption algorithm
   parameters are not defined here.

   When encrypted data frames are sent, the first bit in the RTP
   payload is set to one and additional crypto payload fields are
   defined according to the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        RTP Header [2]                         |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |E|R| LLL | NNN |R|K|    CSL    |    zero or more cryptosync    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |      octets, followed by zero or two cryptocheck octets,      |
   |    followed by one or more encrypted codec data frames ...    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The RTP header has the expected values as described earlier. The
   remaining fields have the following meaning:

   Encrypted (E): 1 bit
      MUST be set to one by sender when data frames are encrypted.

   Reserved (R): 1 bit
      MUST be set to zero by sender, ignored by receiver.

   Interleave (LLL): 3 bits
      As defined in section 3.

   Interleave Index (NNN): 3 bits
      As defined in section 3.

   Reserved (R): 1 bit
      MUST be set to zero by sender, ignored by receiver.

   Cryptocheck Presence (K): 1 bit
      Indicates the presence or absence of two octets comprising a
      single 16-bit cryptocheck word to follow any cryptosync octets in
      the payload. The bit is set if and only if a cryptocheck word
      follows. The generation of the cryptocheck is considered
      application dependent and outside the scope of this description.
      However, it is provided as a means to allow the receiver to
      routinely validate the use of consistent keys and other aspects
      of the encryption process. The cryptocheck word could simply
      consist of an encrypted known-value or be generated by encrypting
      the output of a hash function applied to the plaintext vocoder
      data (or other data).

   Cryptosync Length (CSL): 6 bits
      Indicates (by value) the number of cryptosync octets which follow
      later in the payload. The length and meaning of the cryptosync
      octets are dependent on the encryption algorithm and are
      considered application specific and outside the scope of this
      description. The cryptosync octets could contain a Cipher Block
      Chaining mode initialization vector (IV) or a Counter Mode
      Encryption state variable (SV) [4].

7 Security Considerations

   The presence of the cryptocheck provides an obvious mechanism to
   compromise encryption if adequate care in defining the cryptocheck
   is not taken. Encrypting a fixed known or well documented value
   (such as the RTP header timestamp) could be a particularly poor
   design choice, for example.

   Although no specific encryption algorithms are endorsed here,
   techniques with low overhead requirements are expected to be more
   popular, and these techniques may be more susceptible to spoofing
   and authentication attacks than techniques which include additional
   (redundant) information.

   To reduce overhead, no mechanism for interoperability between
   applications using different encryption algorithms is defined here.
   Rather, it is expected that applications will exchange this and
   other encryption related information in advance of streaming audio.

8 References

   [1] TIA/EIA/IS-733. TR45: High Rate Speech Service Option for
       Wideband Spread Spectrum Communications Systems. Available from
       Global Engineering +1 800 854 7179 or +1 303 792 2181. May also
       be ordered online at http://www.eia.org/eng/.

   [2] Schulzrinne, H., Casner, S., Frederick, R., Jacobson, V., "RTP:
       A Transport Protocol for Real-Time Applications", RFC 1889,
       Audio-Video Transport Working Group, January 1996.

   [3] Bradner, S., "Key words for use in RFCs to Indicate Requirement
       Levels", RFC 2119, March 1997.

   [4] Schneier, B., "Applied Cryptography: Protocols, Algorithms, and
       Source Code", John Wiley & Sons, ISBN: 0471128457, October,
       1995.

Authors' Address

   Kyle J. McKay
   Eric C. Rosen
   QUALCOMM, Inc.
   6455 Lusk Boulevard
   San Diego, CA 92121
   USA

   Phone: +1 619 587 1121
   EMail: "Kyle J. McKay"
          "Eric C. Rosen"