Internet Engineering Task Force                              J. Meunier
Audio/Video Transport WG                                       Motorola
Internet Draft
Document: draft-meunier-avt-rtp-dsr-00.txt                      6/29/00
Category: Informational


         RTP Payload Format for Distributed Speech Recognition


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026 [1].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts. Internet-Drafts are draft documents valid for a maximum of
   six months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt
   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.


1. Abstract

   This document specifies an RTP payload format for encapsulating ETSI
   Standard ES 201 108 Distributed Speech Recognition (DSR) streams in
   the Real-Time Transport Protocol.

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119 [2].


3. Introduction

   Motivated by technology advances in the field of speech recognition,
   voice interfaces to a variety of services (such as airline
   information systems, unified messaging, and the like) are becoming
   more and more prevalent. In parallel, the popularity of mobile
   computing and communications devices has also increased
   dramatically. However, the voice codecs typically employed in mobile
   systems were designed to optimize audible voice quality and not
   speech recognition accuracy, and using these codecs with speech
   recognizers can result in poor recognition rates. For systems that

Meunier           Internet Draft - Expires 12/29/00                 1

        RTP Payload Format for Distributed Speech Recognition 6/29/00


   can be accessed from multiple networks using multiple speech codecs,
   recognition system designers are further challenged to accommodate
   the characteristics of these differences in a robust manner. Channel
   errors and lost data packets in these networks result in further
   degradation of the speech signal.

   In the traditional systems described above, the entire speech
   recognizer resides on the server appliance. It is forced to use
   incoming speech in whatever condition it arrives after the network
   decodes the vocoded speech. A scheme called "distributed speech
   recognition" (DSR) addresses this problem. In a DSR system, the
   remote device acts as a thin client in communication with a speech
   recognition server. The remote device processes the speech, then
   compresses and error-protects the resulting bitstream in a manner
   optimal for speech recognition. The server then uses this
   representation directly, minimizing the signal processing necessary
   and benefiting from enhanced error concealment. To achieve
   interoperability with different client devices and servers, a common
   format is needed. Within the "Aurora" DSR working group of the
   European Telecommunications Standards Institute (ETSI), a payload
   has been defined and was published as a standard in February 2000.

   For interactive voice user interface dialogues between a caller and
   a voice service, low latency is also a high priority along with
   accurate speech recognition. While jitter in the speech recognizer
   input is not particularly important, many issues related to speech
   interaction over an IP-based connection are still relevant.
   Therefore, it will be desirable to use the DSR payload in an RTP-
   based session.

4. ETSI Distributed Speech Recognition Payload

   The ETSI Standard ES 201 108 for distributed speech recognition [3]
   defines a signal processing front-end and compression scheme for
   speech input to a speech recognition system. Some important
   characteristics of the payload are summarized below.

   4.1 Input

   The coding algorithm, a standard mel-cepstral technique common to
   many speech recognition systems, supports three raw sampling rates:
   8 kHz, 11 kHz, and 16 kHz. The mel-cepstral calculation is a frame-
   based scheme that produces an output vector every 10 ms.

   4.2 Compression and Data Rate

   After calculation of the mel-cepstral representation, the
   representation is quantized via split-vector quantization to reduce
   the data rate of the encoded stream. This is a lossy compression,
   with the output being an integer representation of the encoded
   speech.

   4.3 Packet Format and Headers

   Pairs of the quantized 10ms mel-cepstral frames are grouped together
   and protected with a 4-bit CRC, forming a 92-bit long "frame pair."
   This becomes the smallest unit of data in the payload. Twelve of
   these "frame pairs" are grouped together, along with a sync sequence
   (16 bits) and a payload header (32 bits) to form a payload
   "multiframe" of 144 octets, or 1152 bits. Each multiframe represents
   240ms of speech data in 1152 bits, resulting in a payload data rate
   of 4.8 kbps. This bit rate is independent of the sampling frequency,
   due to the frame rates and frame sizes chosen.
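
   The figures above can be cross-checked with a few lines of
   arithmetic. The following Python sketch is illustrative only; the
   constant names are assumptions chosen for this example and do not
   appear in the ETSI standard.

```python
# Sizes in bits, as stated in this section (names are illustrative).
SYNC_BITS = 16
HEADER_BITS = 32
FRAME_PAIR_BITS = 92              # two quantized frames plus a 4-bit CRC
FRAME_PAIRS_PER_MULTIFRAME = 12
FRAME_MS = 10                     # one mel-cepstral frame every 10 ms

multiframe_bits = (SYNC_BITS + HEADER_BITS
                   + FRAME_PAIRS_PER_MULTIFRAME * FRAME_PAIR_BITS)
multiframe_octets = multiframe_bits // 8
multiframe_ms = FRAME_PAIRS_PER_MULTIFRAME * 2 * FRAME_MS
bit_rate_bps = multiframe_bits * 1000 // multiframe_ms

print(multiframe_bits, multiframe_octets, multiframe_ms, bit_rate_bps)
```

   Run as written, the sketch prints "1152 144 240 4800", which matches
   the multiframe size and the 4.8 kbps data rate given above.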

   The packet format for the DSR multiframe payload is summarized
   below.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      sync sequence            |         Payload Header        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Payload Header         |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                         Frame Pair #1                         |
   +                                                               +
   |                                                               |
   +                       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       |                                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+                                       +
   |                      Frame Pair #2                            |
   +                                                               +
   |                                                               |
   +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               |                                               |
   +-+-+-+-+-+-+-+-+                                               +
   |                       Frame Pair #3                           |
   +                                                               +
   |                                                               |
   +       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       |                                                       |
   +-+-+-+-+                                                       +
   |                       Frame Pair #4                           |
                                .
                                .
                                .


   The multiframe header is formatted as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                   |SR |T|MFRMCT |       EXP       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        CODEWORD               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


   The fields in the header are: (SR) a 2-bit indicator for the source
   sampling rate (8, 11, or 16 kHz), (T) a 1-bit indicator for the DSR
   payload specification type (standard or "noise robust," the latter
   of which is currently under development within ETSI), (MFRMCT) a 4-
   bit modulo-16 multiframe counter, (EXP) 9 bits of expansion space,
   and a 16-bit systematic codeword for error protection.
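
   The field layout above can be unpacked as sketched below. This is a
   minimal Python illustration, not part of the ETSI standard. It
   assumes that the four header octets are packed most-significant bit
   first, in the left-to-right order of the diagram, and it makes no
   attempt to use the codeword for error detection or correction. The
   names MultiframeHeader and parse_multiframe_header are hypothetical.
   The mapping of SR and T code values to sampling rates and front-end
   types is defined by the ETSI standard and is not reproduced here.

```python
from typing import NamedTuple


class MultiframeHeader(NamedTuple):
    sample_rate_code: int    # SR, 2 bits
    fe_type: int             # T, 1 bit
    multiframe_count: int    # MFRMCT, 4 bits (modulo 16)
    expansion: int           # EXP, 9 bits
    codeword: int            # 16-bit systematic codeword


def parse_multiframe_header(header: bytes) -> MultiframeHeader:
    """Split the 32-bit header that follows the 16-bit sync sequence.

    ASSUMPTION: fields are packed most-significant bit first, in the
    order drawn in the diagram. The codeword is returned unchecked.
    """
    if len(header) != 4:
        raise ValueError("multiframe header must be exactly 4 octets")
    word = int.from_bytes(header, "big")
    return MultiframeHeader(
        sample_rate_code=(word >> 30) & 0x3,
        fe_type=(word >> 29) & 0x1,
        multiframe_count=(word >> 25) & 0xF,
        expansion=(word >> 16) & 0x1FF,
        codeword=word & 0xFFFF,
    )
```

   Under the stated bit-order assumption, the octets 4A 00 BE EF would
   decode to SR code 1, T = 0, multiframe counter 5, EXP = 0, and
   codeword 0xBEEF.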


5. Usage as an RTP Payload

   5.1 Packetization

   The most logical choice of Application Data Unit (ADU) for the DSR
   payload is the 92-bit "frame pair" (the data and its associated CRC
   code). Since this unit represents 20 ms of audio, it is in step with
   the RTP payload guidelines. Since the unit contains its own CRC,
   both bit errors and frame pair loss can be handled by the standard's
   error mitigation scheme (described below). The complete 1152-bit
   multiframe (including its sync word and header) was also considered
   for the ADU. However, the importance of low latency in voice
   dialogue applications and the potential for unrecoverable loss of
   an entire multiframe make that choice very undesirable. When a new
   multiframe begins, the sync word and header information should be
   grouped with at least the first frame pair of that multiframe, so
   the first packet of each multiframe carries 48 + n*92 bits, where n
   is the number of frame pairs in the packet. In choosing the number
   of ADUs per packet (i.e., the value of n), the following
   restrictions apply.

       - Implementations SHOULD NOT include more ADUs (or combined ADUs
         and header information) in a single RTP packet than will fit
         in the MTU of the RTP transport protocol.

       - Implementations SHOULD limit the number of frame pair ADUs per
         packet to no more than the number of frame pairs in a
         "multiframe" of the ETSI specification (12).

       - All frames contained in a single RTP packet MUST be of the
         same sampling frequency and front-end "type" (per the ETSI
         standard).

       - ADUs MUST NOT be split between RTP packets.

       - Packets containing an odd number of ADUs MUST pad the end of
         the packet with 4 zero bits to reach the next octet boundary.
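
   The size and padding rules above can be sketched as follows. This
   Python fragment is illustrative only: the function name
   payload_octets is hypothetical, and for simplicity it treats the
   12-frame-pair SHOULD limit as a hard limit.

```python
SYNC_AND_HEADER_BITS = 48     # 16-bit sync sequence + 32-bit header
FRAME_PAIR_BITS = 92
MAX_FRAME_PAIRS = 12          # frame pairs in one ETSI multiframe


def payload_octets(n_frame_pairs: int, starts_multiframe: bool) -> int:
    """Return the RTP payload size in octets, including zero padding."""
    if not 1 <= n_frame_pairs <= MAX_FRAME_PAIRS:
        raise ValueError("expected between 1 and 12 frame pairs")
    bits = n_frame_pairs * FRAME_PAIR_BITS
    if starts_multiframe:
        bits += SYNC_AND_HEADER_BITS
    # Rounding up to a whole octet adds the 4 pad bits when n is odd.
    return (bits + 7) // 8
```

   For example, a packet carrying a single frame pair occupies 12
   octets (92 data bits plus 4 pad bits), or 18 octets when it also
   carries the sync sequence and header. A full multiframe in one
   packet occupies 144 octets.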


   It is RECOMMENDED that the number of frame pairs contained within an
   RTP packet be consistent with the application. Many DSR applications
   are interactive dialogues, and decreasing the number of ADUs per
   packet reduces the end-to-end delay. Furthermore, current speech
   recognition techniques are challenged by the loss of large numbers
   of frame pairs. Therefore, it is RECOMMENDED that the number of
   frame pair ADUs per packet be minimized. This addresses both the
   latency sensitivity of interactive DSR applications and the frame
   loss sensitivity of current speech recognition algorithms. Header
   compression [4] seems to make the most sense for mitigating the
   header overhead, given the payload's low bitrate, the recognizer's
   sensitivity to lost packets, and the application's sensitivity to
   latency.

   Information describing the number of frames in an RTP packet is not
   transmitted as part of the RTP payload. The only way to determine
   the number of DSR frame pairs is to divide the size of the packet
   (less the size of the optional DSR payload header if the RTP header
   indicates its presence) by the size of a DSR frame pair ADU.
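
   A receiver might implement this calculation as sketched below. The
   Python function count_frame_pairs is hypothetical. It assumes that
   the RTP marker bit signals the presence of the 48-bit sync sequence
   and header, and that packets with an odd number of frame pairs end
   with 4 padding bits, as specified elsewhere in this document.

```python
SYNC_AND_HEADER_BITS = 48
FRAME_PAIR_BITS = 92


def count_frame_pairs(payload_len_octets: int, marker: bool) -> int:
    """Infer the number of frame pairs from the RTP payload length."""
    bits = payload_len_octets * 8
    if marker:
        bits -= SYNC_AND_HEADER_BITS
    if bits < FRAME_PAIR_BITS:
        raise ValueError("payload too short for one frame pair")
    n, leftover = divmod(bits, FRAME_PAIR_BITS)
    # An odd count leaves 4 padding bits; an even count leaves none.
    expected_padding = 4 if n % 2 else 0
    if leftover != expected_padding:
        raise ValueError("payload length is not a whole number of ADUs")
    return n
```

   Rejecting lengths that do not match a whole number of ADUs is a
   design choice of this sketch; an implementation could instead
   discard the trailing bits.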

   5.2 Packet Reordering & Loss

   Since the RTP header includes a 16-bit sequence number, and the
   payload's multiframe header incorporates a four-bit counter, dealing
   with packet reordering and duplication is straightforward. Unlike
   with other audio payloads, jitter is not a concern for packets
   received by the server-based speech recognizer. Incoming data is
   simply used as it arrives to update the recognizer's search
   hypotheses. Therefore, an explicit jitter buffer is unnecessary.
   ADUs can be processed as received if they arrive in order, or
   buffered if received out of order, up to an implementation-specified
   maximum latency limit.
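
   One possible way to apply the RTP sequence number is sketched below
   in Python. The class ReorderBuffer is hypothetical and is not
   defined by this document or by the ETSI standard. As a simplifying
   assumption, it bounds the number of packets held (max_held) as a
   stand-in for a latency limit, and it silently drops late and
   duplicate packets.

```python
from typing import Dict, List, Tuple


class ReorderBuffer:
    """Release payloads in RTP sequence order, tolerating reordering."""

    def __init__(self, first_seq: int, max_held: int = 4) -> None:
        self.next_seq = first_seq & 0xFFFF
        self.max_held = max_held
        self.held: Dict[int, bytes] = {}

    def push(self, seq: int, payload: bytes) -> List[Tuple[int, bytes]]:
        """Accept one packet; return (seq, payload) pairs now ready."""
        ahead = (seq - self.next_seq) & 0xFFFF   # handles 16-bit wraparound
        if ahead >= 0x8000 or seq in self.held:
            return []                            # late or duplicate packet
        self.held[seq] = payload
        ready: List[Tuple[int, bytes]] = []
        while self.held:
            if self.next_seq in self.held:
                ready.append((self.next_seq, self.held.pop(self.next_seq)))
            elif len(self.held) <= self.max_held:
                break                            # keep waiting for the gap
            # Otherwise too many packets are waiting: skip the missing one.
            self.next_seq = (self.next_seq + 1) & 0xFFFF
        return ready
```

   Returning the sequence number with each payload lets the caller see
   any gap left by a skipped packet and hand it to the error
   concealment stage.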

   The payload itself also addresses packet loss and bit errors
   through protection of the source data and the use of error
   detection and concealment techniques at the server end. The payload
   was originally designed for use on transparent circuit data channels
   in mobile scenarios, so handling bit errors is an important part of
   the payload's definition. Though bit errors are not currently a
   significant issue for IP-based systems, they may become more
   significant in mobile IP applications where ARQ is undesirable
   because of its unpredictable latency. Regardless, the mitigation
   techniques employed for correcting corrupted data can also be
   extended to accommodate dropped packets by synthesizing
   replacements. A bi-directional frame repetition method is used for
   error concealment, copying frames from successfully received frames
   on either side of those missing. Currently, no forward error
   correction (FEC) is defined for the DSR payload. Use of a
   media-specific FEC (e.g., a subset of the main payload's quantized
   cepstral parameters, or a lower bandwidth quantization of the
   parameters) would follow the recommendations for interactive
   streaming media in [5] and could be investigated in keeping with
   the recommendations for redundant RTP payload data [6].
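
   The idea of bi-directional frame repetition can be illustrated with
   the following Python sketch. It is a simplified illustration under
   stated assumptions, not the concealment algorithm of the ETSI
   standard: the function name conceal is hypothetical, missing or
   corrupted frames are represented by None, and the criteria the
   standard uses to declare a frame bad are not reproduced.

```python
from typing import List, Optional, Sequence


def conceal(frames: Sequence[Optional[bytes]]) -> List[Optional[bytes]]:
    """Fill each run of missing frames (None) from its good neighbors.

    The first half of a run copies the last good frame before it; the
    second half copies the first good frame after it. A run at the very
    start or end is filled from its single available neighbor.
    """
    out = list(frames)
    i = 0
    while i < len(out):
        if out[i] is not None:
            i += 1
            continue
        j = i
        while j < len(out) and out[j] is None:
            j += 1                          # missing run is out[i:j]
        before = out[i - 1] if i > 0 else None
        after = out[j] if j < len(out) else None
        if before is None:
            before = after
        if after is None:
            after = before
        split = i + (j - i + 1) // 2        # odd run: extra copy of `before`
        for k in range(i, j):
            out[k] = before if k < split else after
        i = j
    return out
```

   For example, the sequence A, missing, missing, missing, missing, B
   becomes A, A, A, B, B, B. If every frame is missing, the sketch
   leaves the input unchanged because no good frame is available.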

   The payload's header information is sent only periodically (once per
   multiframe). Since correct interpretation of the encoded speech in
   that multiframe depends on error-free receipt of the header, it is
   protected against bit errors more thoroughly than the raw data. A
   systematic codeword allows for detection of up to seven bit errors
   and correction of up to three bit errors in the header field.
   Furthermore, since the header is sent repeatedly during the course
   of a session, old header information can be reused if a packet
   containing a multiframe header is dropped. To protect individual
   data frames, each frame pair is protected by a 4-bit CRC, allowing
   detection of bit errors there as well. The mitigation strategy for
   correcting corrupted frames uses information from surrounding frames
   to regenerate a substitute frame, so dropped packets can also be
   handled using the same scheme in concert with the frame counter.

   Some information in the RTP header will be redundant to the end
   application, since the DSR payload header was designed to be self-
   sufficient and already carries much of the relevant information. The
   RTP header marker, sequence number, and timestamp fields may be
   useful in systems where the RTP session terminates at the
   application. The RTP header marker bit MUST be set to 1 to denote
   that a packet carries sync and header information, and not set (set
   to 0) when a packet carries only frame pairs and no multiframe
   header. Furthermore, the timestamp field will be useful for
   applications where parallel data streams carrying related speech
   data (such as FEC data or other speech parameters) exist. The RTP
   timestamp SHALL be incremented according to the sampling frequency
   selected for the DSR processing (8 kHz, 11 kHz, or 16 kHz). The RTP
   timestamp for a packet SHALL be set to correspond to the first
   sample of the first frame of the first frame pair in the packet.
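
   The timestamp rule above can be illustrated with a short Python
   sketch. The function names are hypothetical. The sketch assumes that
   each frame pair covers 20 ms at the RTP clock rate and that the
   clock rates are the values 8000, 11000, and 16000 listed for the
   "Rate" parameter in Section 6.

```python
FRAME_PAIR_MS = 20                       # two 10 ms frames
CLOCK_RATES_HZ = (8000, 11000, 16000)    # assumed RTP clock rates


def timestamp_step(clock_rate_hz: int, n_frame_pairs: int) -> int:
    """RTP timestamp increment covered by a packet of n frame pairs."""
    if clock_rate_hz not in CLOCK_RATES_HZ:
        raise ValueError("unsupported DSR sampling rate")
    return clock_rate_hz * FRAME_PAIR_MS * n_frame_pairs // 1000


def next_timestamp(timestamp: int, clock_rate_hz: int,
                   n_frame_pairs: int) -> int:
    """Timestamp of the following packet, wrapped to 32 bits."""
    step = timestamp_step(clock_rate_hz, n_frame_pairs)
    return (timestamp + step) & 0xFFFFFFFF
```

   Under these assumptions, one frame pair advances the timestamp by
   160 at 8 kHz, 220 at 11 kHz, and 320 at 16 kHz.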

   Since all payload-related option information is carried in the
   payload header, no RTP header extension is foreseen.


   5.3 Interoperability with non-RTP services

   Scenarios for the use of the DSR payload may include both RTP
   managed sessions and non-RTP circuit data connections at different
   points in the network. Regeneration of this header information at
   transitional, intermediate points in the network would not be
   practical. Therefore, the integrity of the ETSI standard payload
   headers MUST be maintained.


6. MIME Type Considerations (was "IANA Considerations")

   Registration of a MIME type for the ETSI DSR payload has been
   requested in conjunction with submission of this Internet Draft. In
   accordance with the MIME type registration procedures outlined in
   [7] and [8], the following registration has been proposed:

          MIME media type name : Audio

          MIME subtype name : DSR

          Required parameters :


            Rate: RTP timestamp clock rate, equal to the sampling
          rate selected for the DSR processing (8000, 11000, or
          16000)


          Optional parameters :
          None


          Encoding considerations :
          This data can be transferred via RTP. The document
          draft-meunier-avt-rtp-dsr-00.txt in conjunction with the
          published ETSI specification below specifies this
          encoding. This is binary data. Binary encoding is
          preferred.


          Security considerations :
          This media type consists of audio samples; it does not
          contain any commands, control sequences, or any other
          material the misuse of which could result in security
          vulnerability.


          Interoperability considerations :
          The data specified by this type must be compliant with
          the published ETSI specification below, such that the
          payload data is usable in systems where RTP is not
          employed and interoperability between RTP and non-RTP
          systems is maintained.


          Published specification :

          European Telecommunications Standards Institute (ETSI)
          Standard ES 201 108, "Speech Processing, Transmission and
          Quality Aspects (STQ); Distributed Speech Recognition;
          Front-end Feature Extraction Algorithm; Compression
          Algorithms," Ver. 1.1.2, April 11, 2000.

          Applications which use this media :
          Client/Server speech recognition systems. The captured
          speech signal is encoded via the standard encoding
          algorithm at the client device and sent (possibly via an
          RTP session) to a server appliance where the encoded
          signal is used by a DSR-compliant speech recognizer for
          any speech recognition task.


          Additional information :

          1. Magic number(s) :
          2. File extension(s) : .dsr (proposed)
          3. Macintosh file type code :


7. Security Considerations

   Implementations using the payload defined in this specification are
   subject to the security considerations discussed in the RTP
   specification [9] and the RTP profile [10]. This payload does not
   specify any different security services.


8. References


   1  Bradner, S., "The Internet Standards Process -- Revision 3", BCP
      9, RFC 2026, October 1996.

   2  Bradner, S., "Key words for use in RFCs to Indicate Requirement
      Levels", BCP 14, RFC 2119, March 1997.

   3  European Telecommunications Standards Institute (ETSI) Standard
      ES 201 108, "Speech Processing, Transmission and Quality Aspects
      (STQ); Distributed Speech Recognition; Front-end Feature
      Extraction Algorithm; Compression Algorithms," Ver. 1.1.2, April
      11, 2000. http://webapp.etsi.org/pda/home.asp?wki_id=9948

   4  S. Casner and V. Jacobson, "Compressing IP/UDP/RTP Headers for
      Low-Speed Serial Links," RFC 2508, February 1999.

   5  C. Perkins, O. Hodson, and V. Hardman, "A survey of packet loss
      recovery techniques for streaming audio," IEEE Network, vol. 12
      no. 5, September/October 1998, pp. 40-48.

   6  C. Perkins, I. Kouvelas, O. Hodson, V. Hardman, M. Handley, J. C.
      Bolot, A. Vega-Garcia, and S. Fosse-Parisis, "RTP Payload for
      Redundant Audio Data," IETF RFC 2198, September 1997.

   7  N. Freed, J. Klensin, and J. Postel, "Multipurpose Internet Mail
      Extensions (MIME) Part Four: Registration Procedures," BCP 13,
      RFC 2048, November 1996.

   8  S. Casner and P. Hoschka, "MIME Type Registration of RTP Payload
      Formats," Internet Draft, work in progress draft-ietf-avt-rtp-
      mime-02.txt, March 10, 2000.

   9  H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A
      transport protocol for real-time applications," Internet Draft,
      Internet Engineering Task Force, Feb. 1999 Work in progress,
      revision to RFC 1889.


   10  H. Schulzrinne and S. Casner, "RTP Profile for Audio and Video
      Conferences with Minimal Control," Internet Draft draft-ietf-avt-
      profile-new-08.txt, Work in Progress January 14, 2000, revision
      to RFC 1890.


9. Acknowledgments

   The DSR payload format referenced here is a product of the diligent
   work of the ETSI Speech Transmission & Quality (STQ) "Aurora" DSR
   Working Group.


10. Author's Address

   Jeff Meunier
   Motorola Inc.
   1301 E. Algonquin Rd.
   Schaumburg, IL 60196
   USA
   Email: meunier@labs.mot.com

