Internet Engineering Task Force                                   AVT WG
Internet Draft                                               Schulzrinne
ietf-avt-dtmf-00.txt                                         Columbia U.
                                                            July 8, 1997
                                               Expires: December 1, 1997

                      RTP Payload for DTMF Digits

STATUS OF THIS MEMO

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as ``work in progress''.

To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).

Distribution of this document is unlimited.

ABSTRACT

This memo describes how to carry dual-tone multifrequency (DTMF)
signaling in RTP packets.

1 Introduction

This memo defines a payload type for carrying dual-tone multifrequency
(DTMF) digits in RTP packets. A separate payload type is desirable
since low-rate voice codecs cannot be guaranteed to accurately
reproduce DTMF. Defining a separate payload type also permits higher
redundancy while maintaining a low bit rate.

The DTMF payload type must be suitable for both a gateway and
end-to-end scenario. In the gateway scenario, a gateway connecting a
packet voice network with the PSTN recreates the DTMF tones and injects
them into the PSTN.
Since DTMF digit recognition may take several tens of milliseconds,
careful time and power (volume) alignment is needed to avoid generating
spurious digits. For interactive voice response (IVR) systems directly
connected to the packet voice network, time alignment and volume levels
are not important, since the unit will not perform any signal analysis
to detect DTMF tones from the audio stream.

DTMF digits are carried as part of the audio stream, and SHOULD use the
same sequence number and time-stamp base as the regular audio channel
to simplify recreation of analog audio at a gateway. The default clock
frequency is 8000 Hz, but the clock frequency can be redefined when
assigning the dynamic payload type.

This format achieves a higher redundancy even in the case of sustained
packet loss than the method proposed for the Voice over Frame Relay
Implementation Agreement [1].

In circumstances where exact timing alignment between the audio stream
and the DTMF digits is not important and data is sent unicast, such as
the IVR example mentioned earlier, it may be preferable to use a
reliable control stream such as H.245.

A source MAY send coded DTMF and coded audio packets for the same time
instants, using DTMF as the redundant encoding for the audio stream, or
it MAY block outgoing audio while DTMF tones are active and only send
DTMF digits as both the primary and redundant encodings.

A source SHOULD send an update with the same packet frequency as the
current audio codec while the DTMF digit is active.

2 Payload Format

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R R R|  digit  |R R|  volume   |           duration            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

digit: The DTMF digits are encoded as follows:

     DTMF digit    encoding (decimal)
     ________________________________
     0             0
     1             1
     2             2
     ...           ...
     9             9
     *             10
     #             11
     A             12
     B             13
     C             14
     D             15
     Flash         16

volume: The power level of the digit, expressed in dBm0 after dropping
     the sign, with range from 0 to -63 dBm0. The range of valid DTMF
     is from 0 to -36 dBm0 (must accept); lower than -55 dBm0 must be
     rejected (TR-TSY-000181, ITU-T Q.24A). Thus, larger values denote
     lower volume. Note: since the acceptable dip is 10 dB and the
     minimum detectable loudness variation is 3 dB, this field could
     be compressed by at least a bit by reducing resolution to 2 dB,
     if needed.

duration: Duration of this digit, in timestamp units. (For a sampling
     rate of 8000 Hz, this 16-bit field is sufficient to express digit
     durations of up to approximately 8 seconds; the minimum
     permissible digit length is 40 ms.)

R: This field is reserved for future use. The sender MUST set it to
     zero, and the receiver MUST ignore it.

An audio source SHOULD start transmitting DTMF digit packets as soon as
it recognizes the first DTMF digit and every multiple of a frame period
or, for sample-based codecs, every 50 ms thereafter. If a digit
continues for more than one period, the source should send a new DTMF
packet with the RTP timestamp value corresponding to the beginning of
the digit and the duration of the digit increased correspondingly. (The
RTP sequence number is incremented by one for each packet.) If there
has been no new digit in the last interval, the digit SHOULD be
retransmitted three times to ensure some measure of reliability for the
last digit.

DTMF digits are sent incrementally to avoid having the receiver wait
for the completion of the digit. Since some tones are two seconds long,
such a wait would incur a substantial delay.

3 Reliability

To achieve reliability even when the network loses packets, the audio
redundancy mechanism described in [2] is used. The effective data rate
is r times 64 bits (32 bits for the redundancy header and 32 bits for
the DTMF payload) every 50 ms, that is, 20 packets per second or r
times 1280 bits/second, where r is the number of redundant DTMF digits
carried in each packet. The value of r is an implementation trade-off,
with a value of 5 suggested.

The timestamp offset in this redundancy scheme has 14 bits (16384
timestamp units), so that it allows a single packet to "cover" 2.048
seconds of DTMF digits at a sampling rate of 8000 Hz. Including the
starting time of previous digits allows precise reconstruction of the
tone sequence at a gateway. The scheme is resilient to consecutive
packet losses spanning this interval of 2.048 seconds or r digits,
whichever is less. Note that for previous digits, only an average
loudness can be represented.

An encoder MAY treat the DTMF payload as a highly-compressed version of
the current audio frame. In that mode, each RTP packet during a DTMF
tone would contain the current audio codec rendition (say, G.723.1 or
G.729) of this digit as well as the representation described in
Section 2, plus any previous digits as before. This approach allows
dumb gateways that do not understand this format to function. Other
reasons?

3.1 Example

The figure below shows a typical RTP packet, sent while the user is
dialing the last digit of the DTMF sequence "911". The first digit was
200 ms long and started at time 0; the second digit lasted 250 ms and
started at time 800 ms; the third digit was pressed at time 1.5 s and
has been held for 100 ms so far. The frame duration is 50 ms. To make
the parts recognizable, the figure below ignores byte alignment.
Timestamp and sequence number are assumed to have been zero at the
beginning of the first digit.
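
As a rough illustration, and not part of the format definition, the
following Python sketch packs the 32-bit payload of Section 2 and
derives the timestamp values for the "911" example from the start times
and durations given in the text, assuming the default 8000 Hz clock.
The names DIGIT_CODES, ms_to_ts, pack_dtmf and unpack_dtmf are
hypothetical helpers chosen for this sketch.

```python
import struct

CLOCK_HZ = 8000  # default clock frequency stated in the introduction

# Digit-to-code mapping from the table in Section 2.
DIGIT_CODES = {str(d): d for d in range(10)}
DIGIT_CODES.update({"*": 10, "#": 11, "A": 12, "B": 13,
                    "C": 14, "D": 15, "Flash": 16})


def ms_to_ts(ms):
    """Convert milliseconds to timestamp units at CLOCK_HZ."""
    return ms * CLOCK_HZ // 1000


def pack_dtmf(digit, volume, duration):
    """Pack RRR(3) digit(5) RR(2) volume(6) duration(16) into 4 bytes."""
    code = DIGIT_CODES[digit]
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    # Reserved bits are sent as zero, so only the low bits are populated.
    return struct.pack("!BBH", code & 0x1F, volume & 0x3F, duration)


def unpack_dtmf(data):
    """Inverse of pack_dtmf; reserved bits are ignored on receipt."""
    b0, b1, duration = struct.unpack("!BBH", data)
    return b0 & 0x1F, b1 & 0x3F, duration


# (digit, start in ms, duration so far in ms, volume) for the example.
events = [("9", 0, 200, 7), ("1", 800, 250, 10), ("1", 1500, 100, 20)]

# The RTP timestamp of the packet is the start of the current digit.
rtp_timestamp = ms_to_ts(events[-1][1])

rows = []
for digit, start_ms, dur_ms, volume in events:
    # Timestamp offset carried in the redundancy header (14 bits).
    offset = rtp_timestamp - ms_to_ts(start_ms)
    payload = pack_dtmf(digit, volume, ms_to_ts(dur_ms))
    rows.append((digit, offset, ms_to_ts(dur_ms), payload.hex()))

for row in rows:
    print(*row)
```

Under these assumptions the sketch prints an offset of 12000 and a
duration of 1600 for the first digit, 5600 and 2000 for the second, and
0 and 800 for the digit currently being pressed.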

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         |
| 2 |0|0|   0   |0|     96      |              31               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
|                             12000                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
|                           0x5234a8                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|F|   block PT  |     timestamp offset      |   block length    |
|1|     96      |           12000           |         4         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|F|   block PT  |     timestamp offset      |   block length    |
|1|     96      |           5600            |         4         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|F|   block PT  |
|0|     96      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R R R|  digit  |R R|  volume   |           duration            |
|0 0 0|    9    |0 0|     7     |             1600              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R R R|  digit  |R R|  volume   |           duration            |
|0 0 0|    1    |0 0|    10     |             2000              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R R R|  digit  |R R|  volume   |           duration            |
|0 0 0|    1    |0 0|    20     |              800              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4 Compact Reliability Scheme

A more compact representation could be achieved by measuring DTMF tones
at a different sampling rate from that of the surrounding audio codec,
e.g., in multiples of 1, 10, 40 or 50 ms. Each RTP payload type should
have a fixed sampling rate, so choosing a value that depends on the
frame interval of the surrounding codec is not recommended.
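
To give a feel for what measuring in coarser units would mean, the
following Python sketch converts the durations and offsets of the
Section 3.1 example from 8000 Hz timestamp units into 50 ms units. It
is only an illustration: the 50 ms unit is one of the candidates named
above, the choice to round up is an assumption of this sketch (the memo
does not specify a rounding rule), and ms_to_ts and ts_to_units are
hypothetical helper names.

```python
CLOCK_HZ = 8000   # default RTP clock frequency from the introduction
UNIT_MS = 50      # coarse unit assumed for this sketch


def ms_to_ts(ms, clock_hz=CLOCK_HZ):
    """Convert milliseconds to RTP timestamp units."""
    return ms * clock_hz // 1000


def ts_to_units(ts, unit_ms=UNIT_MS, clock_hz=CLOCK_HZ):
    """Convert RTP timestamp units to coarse units, rounding up."""
    per_unit = clock_hz * unit_ms // 1000   # 400 timestamp units per 50 ms
    return -(-ts // per_unit)               # ceiling division


# Digit durations (ms) and distance from each digit's start to the
# start of the current digit (ms), taken from the Section 3.1 text.
durations_ms = [200, 250, 100]
offsets_ms = [1500, 700, 0]

duration_units = [ts_to_units(ms_to_ts(m)) for m in durations_ms]
offset_units = [ts_to_units(ms_to_ts(m)) for m in offsets_ms]

print(duration_units)  # [4, 5, 2]
print(offset_units)    # [30, 14, 0]
```

All of these values are small integers, which is what makes a shorter
field for duration and offset plausible in a compact scheme.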

For a sampling interval of 50 ms, the following payload would "cover" 8
seconds of duration and offset:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    offset     |R R R|  digit  |R R|  volume   |   duration    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

5 Acknowledgements

The suggestions of the VoIP working group are gratefully acknowledged.

6 Bibliography

[1] R. Kocen and T. Hatala, "Voice over frame relay implementation
    agreement," Implementation Agreement FRF.11, Frame Relay Forum,
    Foster City, California, Jan. 1997.

[2] C. Perkins, I. Kouvelas, V. Hardman, M. Handley, J.-C. Bolot,
    A. Vega-Garcia, and S. Fosse-Parisis, "RTP payload for redundant
    audio data," Internet Draft, Internet Engineering Task Force,
    Mar. 1997. Work in progress.