Internet Engineering Task Force                              J. Meunier
Audio/Video Transport WG                                       Motorola
Internet Draft
Document: draft-meunier-avt-rtp-dsr-00.txt                      6/29/00
Category: Informational

RTP Payload Format for Distributed Speech Recognition

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026 [1].

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

1. Abstract

This document specifies an RTP payload format for encapsulating ETSI Standard ES 201 108 Distributed Speech Recognition (DSR) streams in the Real-Time Transport Protocol.

2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [2].

3. Introduction

Motivated by technology advances in the field of speech recognition, voice interfaces to a variety of services (such as airline information systems, unified messaging, and the like) are becoming more and more prevalent. In parallel, the popularity of mobile computing and communications devices has also increased dramatically.
However, the voice codecs typically employed in mobile systems were designed to optimize audible voice quality and not speech recognition accuracy, and using these codecs with speech recognizers can result in poor recognition rates. For systems that can be accessed from multiple networks using multiple speech codecs, recognition system designers are further challenged to accommodate the characteristics of these differences in a robust manner. Channel errors and lost data packets in these networks result in further degradation of the speech signal.

Meunier              Internet Draft - Expires 12/29/00                1
RTP Payload Format for Distributed Speech Recognition          6/29/00

In traditional systems as described above, the entire speech recognizer lies on the server appliance. It is forced to use incoming speech in whatever condition it arrives in after the network decodes the vocoded speech. A solution that combats this uses a scheme called "distributed speech recognition (DSR)." In this system, the remote device acts as a thin client in communication with a speech recognition server. The remote device processes the speech, then compresses and error-protects the bitstream in a manner optimal for speech recognition. The server then uses this representation directly, minimizing the signal processing necessary and benefiting from enhanced error concealment.

To achieve interoperability with different client devices and servers, a common format is needed. Within the "Aurora" DSR working group of the European Telecommunications Standards Institute (ETSI), a payload has been defined and was published as a standard in February 2000.

For interactive voice user interface dialogues between a caller and a voice service, low latency is also a high priority along with accurate speech recognition. While jitter in the speech recognizer input is not particularly important, many issues related to speech interaction over an IP-based connection are still relevant.
Therefore, it will be desirable to use the DSR payload in an RTP-based session.

4. ETSI Distributed Speech Recognition Payload

The ETSI Standard ES 201 108 for distributed speech recognition [3] defines a signal processing front-end and compression scheme for speech input to a speech recognition system. Some important characteristics of the payload are summarized below.

4.1 Input

The coding algorithm, a standard mel-cepstral technique common to many speech recognition systems, supports three raw sampling rates: 8 kHz, 11 kHz, and 16 kHz. The mel-cepstral calculation is a frame-based scheme that produces an output vector every 10 ms.

4.2 Compression and Data Rate

After calculation of the mel-cepstral representation, the representation is quantized via split-vector quantization to reduce the data rate of the encoded stream. This is a lossy compression, with the output being an integer representation of the encoded speech.

4.3 Packet Format and Headers

Pairs of the quantized 10 ms mel-cepstral frames are grouped together and protected with a 4-bit CRC, forming a 92-bit long "frame pair." This becomes the smallest unit of data in the payload. Twelve of these "frame pairs" are grouped together, along with a sync sequence (16 bits) and a payload header (32 bits) to form a payload "multiframe" of 144 octets, or 1152 bits. Each multiframe represents 240 ms of speech data in 1152 bits, resulting in a payload data rate of 4.8 kbps. This bit rate is independent of the sampling frequency, due to the frame rates and frame sizes chosen.

The packet format for the DSR multiframe payload is summarized below.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         sync sequence         |        Payload Header         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        Payload Header         |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                         Frame Pair #1                         |
+                                                               +
|                                                               |
+                       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       |                                       |
+-+-+-+-+-+-+-+-+-+-+-+-+                                       +
|                         Frame Pair #2                         |
+                                                               +
|                                                               |
+       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       |                                                       |
+-+-+-+-+                                                       +
|                         Frame Pair #3                         |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Frame Pair #4                         |
                              . . .

The multiframe header is formatted as follows:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                |SR |T|MFRMCT |       EXP       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           CODEWORD            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields in the header are: (SR) a 2-bit indicator for the source sampling rate (8, 11, or 16 kHz), (T) a 1-bit indicator for the DSR payload specification type (standard or "noise robust," the latter of which is currently under development within ETSI), (MFRMCT) a 4-bit modulo-16 multiframe counter, (EXP) 9 bits of expansion space, and (CODEWORD) a 16-bit systematic codeword for error protection.

5. Usage as an RTP Payload

5.1 Packetization

The most logical choice of Application Data Unit (ADU) for the DSR payload would be a 92-bit "frame pair" (including data and its associated CRC code). Since the unit represents 20 ms of audio, it is in step with the RTP payload guidelines. Since the unit contains its own CRC, both bit errors and frame pair loss can be handled in the standard's error mitigation scheme (described below).
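As an informal, non-normative cross-check, the multiframe figures quoted in Section 4.3 can be recomputed from the field sizes given there. The constant names below are illustrative only and do not appear in the ETSI standard.

```python
# Field sizes and timings as stated in Sections 4.1 and 4.3 of this document.
FRAME_MS = 10                # one mel-cepstral vector every 10 ms
FRAME_PAIR_BITS = 92         # two quantized frames plus a 4-bit CRC
SYNC_BITS = 16               # sync sequence
HEADER_BITS = 32             # multiframe payload header
PAIRS_PER_MULTIFRAME = 12

# The header fields should add up to the 32-bit header:
# SR (2) + T (1) + MFRMCT (4) + EXP (9) + CODEWORD (16).
header_bits_from_fields = 2 + 1 + 4 + 9 + 16

multiframe_bits = SYNC_BITS + HEADER_BITS + PAIRS_PER_MULTIFRAME * FRAME_PAIR_BITS
multiframe_octets = multiframe_bits // 8
multiframe_ms = PAIRS_PER_MULTIFRAME * 2 * FRAME_MS
bit_rate_bps = multiframe_bits * 1000 // multiframe_ms

print(multiframe_bits, multiframe_octets, multiframe_ms, bit_rate_bps)
# prints: 1152 144 240 4800
```

Because the frame period (10 ms) and the frame pair size (92 bits) do not depend on the sampling rate, the same 4800 bit/s result holds for 8, 11, and 16 kHz input.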
The complete 1152-bit multiframe (including its sync word and header) was considered for the ADU; however, the importance of low latency in voice dialogue applications and the potential for unrecoverable loss of an entire multiframe make that choice very undesirable.

When a new multiframe begins, the sync word and header information should be grouped with at least the first frame pair of that multiframe, resulting in the first packet of each multiframe being 48 + n*92 bits where n is the number of "frame pairs" in the packet.

In choosing the number of ADUs per packet (i.e., the value of n), the following restrictions exist.

- Implementations SHOULD NOT include more ADUs (or combined ADUs and header information) in a single RTP packet than will fit in the MTU of the RTP transport protocol.

- Implementations SHOULD limit the number of frame pair ADUs per packet to be less than or equal to the number of frame pairs in a "multiframe" of the ETSI specification (12).

- All frames contained in a single RTP packet MUST be of the same sampling frequency and front-end "type" (per the ETSI standard).

- ADUs MUST NOT be split between RTP packets.

- Packets containing an odd number of ADUs MUST pad the end of the packet with 4 bits of zeros to the last octet boundary.

It is RECOMMENDED that the number of frame pairs contained within an RTP packet be consistent with the application. Many DSR applications are interactive dialogues, and decreasing the number of ADUs per packet reduces the end-to-end delay. Furthermore, current speech recognition techniques are challenged by the loss of large numbers of frame pairs. Therefore, it is RECOMMENDED that the number of frame pair ADUs per packet be minimized. This addresses both the latency sensitivity of interactive DSR applications and the frame loss sensitivity of current speech recognition algorithms.
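The packet-size rules above can be illustrated with a short, non-normative sketch. It assumes that the 48-bit sync word and header are present exactly when a packet starts a multiframe, and that a receiver recovers the frame pair count from the payload length alone, since the payload carries no explicit count. The function names are hypothetical; they come neither from the ETSI standard nor from any RTP library.

```python
FRAME_PAIR_BITS = 92        # one ADU: two frames plus a 4-bit CRC
SYNC_AND_HEADER_BITS = 48   # 16-bit sync sequence + 32-bit payload header


def payload_octets(n_pairs: int, starts_multiframe: bool) -> int:
    """Payload size in octets for n_pairs ADUs (the text suggests n <= 12)."""
    if n_pairs < 1:
        raise ValueError("a packet carries at least one frame pair")
    bits = n_pairs * FRAME_PAIR_BITS
    if starts_multiframe:
        bits += SYNC_AND_HEADER_BITS
    # An odd n leaves 4 bits over; rounding up models the 4 zero pad bits.
    return (bits + 7) // 8


def frame_pairs_in(octets: int, starts_multiframe: bool) -> int:
    """Recover the number of frame pairs from the payload length."""
    bits = octets * 8
    if starts_multiframe:
        bits -= SYNC_AND_HEADER_BITS
    n_pairs, spare = divmod(bits, FRAME_PAIR_BITS)
    # Only the 4 pad bits after an odd count are expected to be left over.
    if n_pairs < 1 or spare != (4 if n_pairs % 2 else 0):
        raise ValueError("length is not a whole number of frame pairs")
    return n_pairs


print(payload_octets(1, False))   # 12 octets: 92 bits + 4 pad bits
print(payload_octets(12, True))   # 144 octets: a complete multiframe
print(frame_pairs_in(18, True))   # 1 frame pair behind the sync and header
```

The padding rule keeps every packet octet-aligned because both the frame pair (92 bits) and the sync-plus-header prefix (48 bits) are multiples of 4 bits, so at most one half octet is ever left over.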
Header compression [4] seems to make the most sense for mitigating the header overhead, given the payload's low bitrate, the recognizer's sensitivity to lost packets, and the application's sensitivity to latency.

Information describing the number of frames in an RTP packet is not transmitted as part of the RTP payload. The only way to determine the number of DSR frame pairs is to divide the size of the packet (less the size of the optional DSR payload header if the RTP header indicates its presence) by the size of a DSR frame pair ADU.

5.2 Packet Reordering & Loss

Since the RTP header includes a 16-bit sequence number, and the payload's multiframe header incorporates a four-bit counter, dealing with packet reordering and duplication is straightforward.

Unlike with other audio payloads, jitter is not a concern for packets received by the server-based speech recognizer. Incoming data is simply used as it arrives to update the recognizer's search hypotheses. Therefore, an explicit jitter buffer is unnecessary. ADUs can be processed as received if they arrive in order, or buffered if received out of order, up to an implementation-specified maximum latency limit.

The payload itself also addresses the issue of packet loss and bit errors through protection of the source data and the use of error detection and concealment techniques at the server end. The payload was originally designed for use on transparent circuit data channels in mobile scenarios; therefore, handling bit errors is an important part of the payload's definition. Though bit errors are not currently a significant issue for IP-based systems, they may become more significant in mobile IP applications where ARQ is undesirable because of its unpredictable latency. Regardless, the mitigation techniques employed for correcting errorful data are also extendable to accommodate dropped packets by synthesizing new ones.
A bi-directional frame repetition method is used for error concealment, copying frames from successfully received frames on either side of those missing.

Currently, no forward error correction (FEC) is defined for the DSR payload. Use of a media-specific FEC (e.g., a subset of the main payload's quantized cepstral parameters, or a lower bandwidth quantization of the parameters) would follow the recommendations for interactive streaming media in [5] and could be investigated in keeping with the recommendations for redundant RTP payload data [6].

The payload's header information is sent only periodically (once per multiframe). Since correct interpretation of the encoded speech in that multiframe depends on error-free receipt of the header, it is protected more thoroughly against bit errors than the raw data. A systematic codeword allows for detection of up to seven bit errors and correction of up to three bit errors in the header field. Furthermore, since the header is sent repeatedly during the course of a session, old header information can be reused if a packet containing a multiframe header is dropped.

To protect the data frames themselves, each frame pair carries a 4-bit CRC, allowing detection of bit errors there as well. The mitigation strategy for correcting errorful frames uses information from surrounding frames to regenerate a substitute frame; dropped packets can therefore be handled by the same scheme in concert with the frame counter.

Some information in the RTP header will be redundant to the end application, since the DSR payload header was designed to be self-sufficient and already carries much of the relevant information. The RTP header marker, sequence number, and timestamp fields may be useful in systems where the RTP session terminates at the application.
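The bi-directional frame repetition described in this section can be sketched as follows. This is an illustration only, not the ETSI algorithm itself. It assumes that each run of missing frames is split in the middle, with the first half copied from the last good frame before the gap and the second half from the first good frame after it; this document says only that frames are copied from either side. The function name and the use of None to mark a lost or CRC-failed frame are hypothetical.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def conceal(frames: Sequence[Optional[T]]) -> List[Optional[T]]:
    """Fill runs of missing frames (None) from their good neighbours."""
    out = list(frames)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        # Find the end of this run of missing frames: [i, j).
        j = i
        while j < n and out[j] is None:
            j += 1
        before = out[i - 1] if i > 0 else None
        after = out[j] if j < n else None
        # At the edges of the buffer only one neighbour exists; reuse it.
        if before is None:
            before = after
        if after is None:
            after = before
        # Assumed split: first half (rounded up) from the earlier neighbour.
        half = (j - i + 1) // 2
        for k in range(i, j):
            out[k] = before if k - i < half else after
        i = j
    return out


print(conceal(["a", None, None, None, None, "b"]))
# prints: ['a', 'a', 'a', 'b', 'b', 'b']
```

Because frames are lost in pairs, the runs seen in practice have even length, so the rounding only matters for inputs that do not come from whole frame pairs. A buffer with no good frame at all is returned unchanged.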
The RTP header marker bit MUST be set to 1 to denote that a packet carries sync and header information, and MUST be set to 0 when a packet carries only frame pairs and no multiframe header. Furthermore, the timestamp field will be useful for applications where parallel data streams carrying related speech data (such as FEC data or other speech parameters) exist. The RTP timestamp SHALL be incremented according to the sampling frequency selected for the DSR processing (8 kHz, 11 kHz, or 16 kHz). The RTP timestamp for a packet SHALL be set to correspond to the first sample of the first frame of the first frame pair in the packet.

Since all payload-related option information is carried in the payload header, no RTP header extension is foreseen.

5.3 Interoperability with non-RTP services

Scenarios for the use of the DSR payload may include both RTP managed sessions and non-RTP circuit data connections at different points in the network. Regeneration of the payload header information at transitional, intermediate points in the network would not be practical. Therefore, the integrity of the ETSI standard payload headers MUST be maintained.

6. MIME Type Considerations (was "IANA Considerations")

Registration of a MIME type for the ETSI DSR payload has been requested in conjunction with submission of this Internet Draft. In accordance with the MIME type registration procedures outlined in [7] and [8], the following registration has been proposed:

MIME media type name : Audio

MIME subtype name : DSR

Required parameters :
   Rate: RTP timestamp clock rate, equal to the sampling rate selected for the DSR processing (8000, 11000, or 16000)

Optional parameters : None

Encoding considerations :
   This data can be transferred via RTP. The document draft-meunier-avt-rtp-dsr-00.doc in conjunction with the published ETSI specification below specifies this encoding. This is binary data.
   Binary encoding is preferred.

Security considerations :
   This media type consists of audio samples; it does not contain any commands, control sequences, or any other material the misuse of which could result in security vulnerability.

Interoperability considerations :
   The data specified by this type must be compliant with the published ETSI specification below, such that the payload data is usable in systems where RTP is not employed and interoperability between RTP and non-RTP systems is maintained.

Published specification :
   European Telecommunications Standards Institute (ETSI) Standard ES 201 108, "Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithms," Ver. 1.1.2, April 11, 2000.

Applications which use this media :
   Client/Server speech recognition systems. The captured speech signal is encoded via the standard encoding algorithm at the client device and sent (possibly via an RTP session) to a server appliance where the encoded signal is used by a DSR-compliant speech recognizer for any speech recognition task.

Additional information :
   1. Magic number(s) :
   2. File extension(s) : .dsr (proposed)
   3. Macintosh file type code :

7. Security Considerations

Implementations using the payload defined in this specification are subject to the security considerations discussed in the RTP specification [9] and the RTP profile [10]. This payload does not specify any different security services.

8. References

[1] Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9, RFC 2026, October 1996.
[2] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[3] European Telecommunications Standards Institute (ETSI) Standard ES 201 108, "Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithms," Ver. 1.1.2, April 11, 2000. http://webapp.etsi.org/pda/home.asp?wki_id=9948

[4] S. Casner and V. Jacobson, "Compressing IP/UDP/RTP Headers for Low-Speed Serial Links," RFC 2508, February 1999.

[5] C. Perkins, O. Hodson, and V. Hardman, "A survey of packet loss recovery techniques for streaming audio," IEEE Network, vol. 12 no. 5, September/October 1998, pp. 40-48.

[6] C. Perkins, I. Kouvelas, O. Hodson, V. Hardman, M. Handley, J. C. Bolot, A. Vega-Garcia, and S. Fosse-Parisis, "RTP Payload for Redundant Audio Data," IETF RFC 2198, September 1997.

[7] N. Freed, J. Klensin, and J. Postel, "Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures," BCP 13, RFC2048, November 1996.

[8] S. Casner and P. Hoschka, "MIME Type Registration of RTP Payload Formats," Internet Draft, work in progress draft-ietf-avt-rtp-mime-02.txt, March 10, 2000.

[9] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A transport protocol for real-time applications," Internet Draft, Internet Engineering Task Force, Feb. 1999 Work in progress, revision to RFC 1889.

[10] H. Schulzrinne and S. Casner, "RTP Profile for Audio and Video Conferences with Minimal Control," Internet Draft draft-ietf-avt-profile-new-08.txt, Work in Progress January 14, 2000, revision to RFC 1890.

9. Acknowledgments

The DSR payload format referenced here is a product of the diligent work of the ETSI Speech Transmission & Quality (STQ) "Aurora" DSR Working Group.

10. Author's Addresses

Jeff Meunier
Motorola Inc.
1301 E. Algonquin Rd.
Schaumburg, IL 60196
USA
Email: meunier@labs.mot.com