Internet Engineering Task Force                   Saravanan Shanmugham
Internet-Draft                                      Cisco Systems Inc.
draft-ietf-speechsc-mrcpv2-05                         October 18, 2004
Expires: April 18, 2005


         Media Resource Control Protocol Version 2 (MRCPv2)

Status of this Memo

   By submitting this Internet-Draft, we certify that any applicable
   patent or other IPR claims of which we are aware have been
   disclosed, and any of which we become aware will be disclosed, in
   accordance with RFC 3668.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt .

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html .

   This Internet-Draft will expire on April 18, 2005.

Copyright Notice

   Copyright (C) The Internet Society (2004). All Rights Reserved.

Abstract

   This document describes a proposal for a Media Resource Control
   Protocol Version 2 (MRCPv2) and aims to meet the requirements
   specified in the SPEECHSC working group requirements document. It is
   based on the Media Resource Control Protocol (MRCP), also called
   MRCPv1, developed jointly by Cisco Systems, Inc., Nuance
   Communications, and Speechworks Inc. The MRCPv2 protocol will
   control media service resources like speech synthesizers,
   recognizers, signal generators, signal detectors, fax servers etc.
   over a network.

S. Shanmugham, et. al.                                          Page 1
MRCPv2 Protocol                                          October, 2004

   This protocol depends on a session management protocol such as the
   Session Initiation Protocol (SIP) to establish a separate MRCPv2
   control session between the client and the server. It also depends
   on SIP to establish the media pipe and associated parameters between
   the media source or sink and the media server. Once this is done,
   the MRCPv2 protocol exchange can happen over the control session
   established above, allowing the client to command and control the
   media processing resources that may exist on the media server.

Table of Contents

   Status of this Memo
   Copyright Notice
   Abstract
   Table of Contents
   1. Introduction:
   2. Notational Convention
   3. Architecture:
      3.1. MRCPv2 Media Resources:
      3.2. Server and Resource Addressing
   4. MRCPv2 Protocol Basics
      4.1. Connecting to the Server
      4.2. Managing Resource Control Channels
      4.3. Media Streams and RTP Ports
      4.4. MRCPv2 Message Transport
      4.5. Resource Types
   5. MRCPv2 Specification
      5.1. Request
      5.2. Response
      5.3. Event
   6. MRCP Generic Features
      6.1. Generic Message Headers
      6.2. SET-PARAMS
      6.3. GET-PARAMS
   7. Resource Discovery
   8. Speech Synthesizer Resource
      8.1. Synthesizer State Machine
      8.2. Synthesizer Methods
      8.3. Synthesizer Events
      8.4. Synthesizer Header Fields
      8.5. Synthesizer Message Body
      8.6. SPEAK
      8.7. STOP
      8.8. BARGE-IN-OCCURRED
      8.9. PAUSE
      8.10. RESUME
      8.11. CONTROL
      8.12. SPEAK-COMPLETE
      8.13. SPEECH-MARKER
      8.14. DEFINE-LEXICON
   9. Speech Recognizer Resource
      9.1. Recognizer State Machine
      9.2. Recognizer Methods
      9.3. Recognizer Events
      9.4. Recognizer Header Fields
      9.5. Recognizer Message Body
      9.6. DEFINE-GRAMMAR
      9.7. RECOGNIZE
      9.8. STOP
      9.9. GET-RESULT
      9.10. START-OF-SPEECH
      9.11. START-INPUT-TIMERS
      9.12. RECOGNITION-COMPLETE
      9.13. START-PHRASE-ENROLLMENT
      9.14. ENROLLMENT-ROLLBACK
      9.15. END-PHRASE-ENROLLMENT
      9.16. MODIFY-PHRASE
      9.17. DELETE-PHRASE
      9.18. INTERPRET
      9.19. INTERPRETATION-COMPLETE
      9.20. DTMF Detection
   10. Recorder Resource
      10.1. Recorder State Machine
      10.2. Recorder Methods
      10.3. Recorder Events
      10.4. Recorder Header Fields
      10.5. Recorder Message Body
      10.6. RECORD
      10.7. STOP
      10.8. RECORD-COMPLETE
      10.9. START-INPUT-TIMERS
   11. Speaker Verification and Identification
      11.1. Speaker Verification State Machine
      11.2. Speaker Verification Methods
      11.3. Verification Events
      11.4. Verification Header Fields
      11.5. Verification Result Elements
      11.6. START-SESSION
      11.7. END-SESSION
      11.8. QUERY-VOICEPRINT
      11.9. DELETE-VOICEPRINT
      11.10. VERIFY
      11.11. VERIFY-FROM-BUFFER
      11.12. VERIFY-ROLLBACK
      11.13. STOP
      11.14. START-INPUT-TIMERS
      11.15. VERIFICATION-COMPLETE
      11.16. START-OF-SPEECH
      11.17. CLEAR-BUFFER
      11.18. GET-INTERMEDIATE-RESULT
   12. Security Considerations
   13. Examples:
   14. Reference Documents
   15. Appendix
      15.1. ABNF Message Definitions
      15.2. XML Schema and DTD
   Full Copyright Statement
   Intellectual Property
   Contributors
   Acknowledgements
   Editors' Addresses

1. Introduction:

   The MRCPv2 protocol is designed for a client device to control media
   processing resources on the network so that an audio/video stream
   can be processed. Examples of such media processing resources are
   speech recognition engines, speech synthesis engines, and speaker
   verification or speaker identification engines. This allows a vendor
   to implement distributed Interactive Voice Response platforms such
   as VoiceXML [7] browsers.

   The SPEECHSC requirements call for a protocol that can reach a media
   processing server and set up communication channels to the media
   resources, in order to send and receive control messages and media
   streams to and from the server. The Session Initiation Protocol
   (SIP) described in [4] meets these requirements and is used to set
   up and tear down media and control pipes to the server. In addition,
   the SIP re-INVITE can be used to change the characteristics of these
   media and control pipes mid-session. MRCPv2 is hence designed to
   leverage and build upon session management protocols such as SIP and
   the Session Description Protocol (SDP). SDP is used to describe the
   parameters of the media pipe associated with that session. It is
   mandatory to support SIP as the session level protocol to ensure
   interoperability. Other protocols can be used at the session level
   by prior agreement.

   The MRCPv2 protocol depends on SIP and SDP to create the session and
   set up the media channels to the server. It also depends on SIP and
   SDP to establish MRCPv2 control channels between the client and the
   server for each media processing resource required for that session.
   The MRCPv2 protocol exchange between the client and the media
   resource can then happen on that control channel. The MRCPv2
   protocol exchange happening on this control channel does not change
   the state of the SIP session, the media, or other parameters of the
   session that SIP initiated. It merely controls and affects the state
   of the media processing resource associated with that MRCPv2
   channel.

   The MRCPv2 protocol defines the messages to control the different
   media processing resources and the state machines required to guide
   their operation. It also describes how these messages are carried
   over a transport layer such as TCP, SCTP or TLS.

2. Notational Convention

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [9].

   Since many of the definitions and syntax are identical to HTTP/1.1,
   this specification only points to the section where they are defined
   rather than copying them. For brevity, [HX.Y] is to be taken to
   refer to Section X.Y of the current HTTP/1.1 specification (RFC 2616
   [1]).

   All the mechanisms specified in this document are described in both
   prose and an augmented Backus-Naur form (ABNF). ABNF is described in
   detail in RFC 2234 [3].

   The complete message format in ABNF form is provided in Appendix
   section 15.1 and is the normative format definition.

   Media Resource   An entity on the MRCP Server that can be controlled
                    through the MRCP protocol.

   MRCP Server      Aggregate of one or more "Media Resource" entities
                    on a Server, exposed through the MRCP protocol
                    ("Server" for short).

   MRCP Client      An entity controlling one or more Media Resources
                    through the MRCP protocol ("Client" for short).

3. Architecture:

   The system consists of a client that requires the generation or the
   processing of media streams, and a media resource server that has
   the resources or engines to process or generate these streams. The
   client establishes a session using SIP and SDP with the server to
   use its media processing resources. A SIP URI refers to the MRCPv2
   server.

   The session management protocol (SIP) will use SDP with the
   offer/answer model described in RFC 3264 to describe and set up the
   MRCPv2 control channels. Separate MRCPv2 control channels are needed
   for controlling the different media processing resources associated
   with that session.
   Within a SIP session, the individual resource control channels for
   the different resources are added or removed through the SDP
   offer/answer model and the SIP re-INVITE dialog.

   The server, through the SDP exchange, provides the client with a
   unique channel identifier and a port number (TCP or SCTP). The
   client MAY then open a new TCP connection with the server using this
   port number. Multiple MRCPv2 channels can share a TCP connection
   between the client and the server. All MRCPv2 messages exchanged
   between the client and the server will also carry the specified
   channel identifier, which MUST be unique among all MRCPv2 control
   channels that are active on that server. The client can use this
   channel to control the media processing resource associated with
   that channel.

   The session management protocol (SIP) will also establish media
   pipes between the client (or source/sink of media) and the MRCP
   server using SDP m-lines. A media pipe may be shared by one or more
   media processing resources under that SIP session, or each media
   processing resource may have its own media pipe.
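
   The relationship between channel identifiers and shared connections
   described above can be illustrated with a small client-side sketch.
   This is not part of the protocol: the ChannelTable class, its method
   names, the identifier strings, and the host and port are all
   hypothetical and chosen only for illustration. It assumes the client
   has already learned each channel identifier and port from the SDP
   exchange.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# (server host, control port learned from the SDP answer)
Endpoint = Tuple[str, int]


@dataclass
class ChannelTable:
    """Hypothetical client-side bookkeeping; not defined by MRCPv2."""

    # channel identifier -> endpoint whose connection carries that channel
    channels: Dict[str, Endpoint] = field(default_factory=dict)

    def add_channel(self, channel_id: str, endpoint: Endpoint) -> bool:
        """Register a channel.

        Returns True when a connection to the same endpoint is already
        in use, so the client could share it instead of opening a new one.
        """
        if channel_id in self.channels:
            raise ValueError(f"channel identifier already active: {channel_id}")
        can_share = endpoint in self.channels.values()
        self.channels[channel_id] = endpoint
        return can_share

    def remove_channel(self, channel_id: str) -> bool:
        """Drop a channel; return True if no other channel uses its endpoint."""
        endpoint = self.channels.pop(channel_id)
        return endpoint not in self.channels.values()

    def route(self, channel_id: str) -> Endpoint:
        """Find the endpoint for a message carrying this channel identifier."""
        return self.channels[channel_id]


table = ChannelTable()
server = ("mrcp.example.com", 32416)            # made-up host and port
print(table.add_channel("chan-synth", server))  # False: first channel, open a connection
print(table.add_channel("chan-recog", server))  # True: the connection can be shared
print(table.route("chan-recog"))                # ('mrcp.example.com', 32416)
print(table.remove_channel("chan-synth"))       # False: chan-recog still uses it
print(table.remove_channel("chan-recog"))       # True: the connection can be closed
```

   Keying the table by channel identifier rather than by connection
   mirrors the text above: the identifier is what every message carries,
   while the connection is only a transport that several channels may
   share.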

       MRCPv2 client                   MRCPv2 Media Resource Server
   |--------------------|             |-----------------------------|
   ||------------------||             ||---------------------------||
   || Application Layer||             || TTS  | ASR  | SV   | SI   ||
   ||------------------||             ||Engine|Engine|Engine|Engine||
   ||Media Resource API||             ||---------------------------||
   ||------------------||             || Media Resource Management ||
   || SIP  |  MRCPv2   ||             ||---------------------------||
   ||Stack |           ||             ||   SIP  |    MRCPv2        ||
   ||      |           ||             ||  Stack |                  ||
   ||------------------||             ||---------------------------||
   ||   TCP/IP Stack   ||----MRCPv2---||       TCP/IP Stack        ||
   ||                  ||             ||                           ||
   ||------------------||-----SIP-----||---------------------------||
   |--------------------|             |-----------------------------|
            |                               /
           SIP                             /
            |                             /
   |-------------------|               RTP
   |                   |              /
   | Media Source/Sink |-------------/
   |                   |
   |-------------------|

                     Fig 1: Architectural Diagram

3.1. MRCPv2 Media Resource Types:

   The MRCP server may offer one or more of the following media
   processing resources to its clients.

   Basic Synthesizer
      A speech synthesizer resource with very limited capabilities that
      can be achieved by playing out concatenated audio file clips. The
      speech data is described as SSML data but with limited support
      for its elements. It MUST support ,