Internet Engineering Task Force                     Saravanan Shanmugham
Internet-Draft                                        Cisco Systems Inc.
draft-ietf-speechsc-mrcpv2-06                          Daniel C. Burnett
Expires: August 20, 2005                           Nuance Communications
                                                       February 20, 2005

         Media Resource Control Protocol Version 2 (MRCPv2)

Status of this Memo

By submitting this Internet-Draft, we certify that any applicable patent or other IPR claims of which we are aware have been disclosed, and any of which we become aware will be disclosed, in accordance with RFC 3668.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress".

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt .

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html .

This Internet-Draft will expire on August 20, 2005.

Copyright Notice

Copyright (C) The Internet Society (2004). All Rights Reserved.

Abstract

This document describes a proposal for a Media Resource Control Protocol Version 2 (MRCPv2) and aims to meet the requirements specified in the SPEECHSC working group requirements document. It is based on the Media Resource Control Protocol (MRCP), also called MRCPv1 developed jointly by Cisco Systems, Inc., Nuance Communications, and Speechworks Inc. The MRCPv2 protocol will control media service resources like speech synthesizers, recognizers, signal generators, signal detectors, fax servers etc. over a network.
This protocol depends on a session management protocol such as the Session Initiation Protocol (SIP) to establish a separate MRCPv2 control session between the client and the server. It also depends on SIP to establish the media pipe and associated parameters between the media source or sink and the media server. Once this is done, the MRCPv2 protocol exchange can happen over the control session established above, allowing the client to command and control the media processing resources that may exist on the media server.

Table of Contents

Status of this Memo
Copyright Notice
Abstract
Table of Contents
1. Introduction:
2. Notational Convention
3. Architecture:
3.1. Server and Resource Addressing
4. MRCPv2 Protocol Basics
4.1. Connecting to the Server
4.2. Managing Resource Control Channels
4.3. Media Streams and RTP Ports
4.4. MRCPv2 Message Transport
5. MRCPv2 Specification
5.1. Request
5.2. Response
5.3. Event
6. MRCP Generic Features
6.1. Generic Message Headers
6.2. SET-PARAMS
6.3. GET-PARAMS
7. Resource Discovery
8. Speech Synthesizer Resource
8.1. Synthesizer State Machine
8.2. Synthesizer Methods
8.3. Synthesizer Events
8.4. Synthesizer Header Fields
8.5. Synthesizer Message Body
8.6. SPEAK
8.7. STOP
8.8. BARGE-IN-OCCURRED
8.9. PAUSE
8.10. RESUME
8.11. CONTROL
8.12. SPEAK-COMPLETE
8.13. SPEECH-MARKER
8.14. DEFINE-LEXICON
9. Speech Recognizer Resource
9.1. Recognizer State Machine
9.2. Recognizer Methods
9.3. Recognizer Events
9.4. Recognizer Header Fields
9.5. Recognizer Message Body
9.6. Natural Language Semantic Markup Language
9.7. Enrollment Results
9.8. DEFINE-GRAMMAR
9.9. RECOGNIZE
9.10. STOP
9.11. GET-RESULT
9.12. START-OF-SPEECH
9.13. START-INPUT-TIMERS
9.14. RECOGNITION-COMPLETE
9.15. START-PHRASE-ENROLLMENT
9.16. ENROLLMENT-ROLLBACK
9.17. END-PHRASE-ENROLLMENT
9.18. MODIFY-PHRASE
9.19. DELETE-PHRASE
9.20. INTERPRET
9.21. INTERPRETATION-COMPLETE
9.22. DTMF Detection
10. Recorder Resource
10.1. Recorder State Machine
10.2. Recorder Methods
10.3. Recorder Events
10.4. Recorder Header Fields
10.5. Recorder Message Body
10.6. RECORD
10.7. STOP
10.8. RECORD-COMPLETE
10.9. START-INPUT-TIMERS
11. Speaker Verification and Identification
11.1. Speaker Verification State Machine
11.2. Speaker Verification Methods
11.3. Verification Events
11.4. Verification Header Fields
11.5. Verification Result Elements
11.6. START-SESSION
11.7. END-SESSION
11.8. QUERY-VOICEPRINT
11.9. DELETE-VOICEPRINT
11.10. VERIFY
11.11. VERIFY-FROM-BUFFER
11.12. VERIFY-ROLLBACK
11.13. STOP
11.14. START-INPUT-TIMERS
11.15. VERIFICATION-COMPLETE
11.16. START-OF-SPEECH
11.17. CLEAR-BUFFER
11.18. GET-INTERMEDIATE-RESULT
12. Security Considerations
13. IANA Considerations
13.1. New registries
13.2. NLSML-related registrations
13.3. session URL scheme registration
13.4. SDP parameter registrations
14. Examples:
14.1. Message Flow
14.2. Recognition Result Examples
Normative Reference
Appendix
A.1 ABNF Message Definitions
A.2 XML Schema and DTD
A.2.1 Recognition Results
A.2.2 Enrollment Results
A.2.3 Verification Results
Full Copyright Statement
Intellectual Property
Contributors
Acknowledgements
Editors' Addresses

1. Introduction:

The MRCPv2 protocol is designed to let a client device control media processing resources on the network that process audio/video streams. Examples of such resources are speech recognition engines, speech synthesis engines, and speaker verification or speaker identification engines. This allows a vendor to implement distributed Interactive Voice Response platforms such as VoiceXML [7] browsers.

The SPEECHSC protocol requirements call for a protocol that is capable of reaching a media processing server and setting up communication channels to the media resources, in order to send/receive control messages and media streams to/from the server. The Session Initiation Protocol (SIP) described in [4] meets these requirements and is used to set up and tear down media and control pipes to the server. In addition, the SIP re-INVITE can be used to change the characteristics of these media and control pipes mid-session.

The MRCPv2 protocol is hence designed to leverage and build upon session management protocols such as the Session Initiation Protocol (SIP) and the Session Description Protocol (SDP). SDP is used to describe the parameters of the media pipe associated with that session. It is mandatory to support SIP as the session level protocol to ensure interoperability. Other protocols can be used at the session level by prior agreement.

The MRCPv2 protocol depends on SIP and SDP to create the session and set up the media channels to the server. It also depends on SIP and SDP to establish MRCPv2 control channels between the client and the server for each media processing resource required for that session. The MRCPv2 protocol exchange between the client and the media resource can then happen on that control channel. The MRCPv2 protocol exchange happening on this control channel does not change the state of the SIP session, the media, or other parameters of the session that SIP initiated.
It merely controls and affects the state of the media processing resource associated with that MRCPv2 channel.

The MRCPv2 protocol defines the messages to control the different media processing resources and the state machines required to guide their operation. It also describes how these messages are carried over a transport layer such as TCP, TLS or, in the future, SCTP.

2. Notational Convention

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [9].

Since many of the definitions and syntax are identical to HTTP/1.1, this specification only points to the section where they are defined rather than copying them. For brevity, [HX.Y] is to be taken to refer to Section X.Y of the current HTTP/1.1 specification (RFC 2616 [1]).

All the mechanisms specified in this document are described in both prose and an augmented Backus-Naur form (ABNF). ABNF is described in detail in RFC 2234 [3].

The complete message format in ABNF form is provided in Appendix A.1 and is the normative format definition.

Media Resource
   An entity on the MRCP Server that can be controlled through the MRCP protocol.

MRCP Server
   Aggregate of one or more "Media Resource" entities on a Server, exposed through the MRCP protocol ("Server" for short).

MRCP Client
   An entity controlling one or more Media Resources through the MRCP protocol ("Client" for short).

3. Architecture:

The system consists of a client that requires the generation or processing of media streams, and a media resource server that has the resources or engines to process or generate these streams. The client establishes a session using SIP and SDP with the server to use its media processing resources. A SIP URI refers to the MRCPv2 server.
The session management protocol (SIP) will use SDP with the offer/answer model described in RFC 3264 to describe and set up the MRCPv2 control channels. Separate MRCPv2 control channels are needed for controlling the different media processing resources associated with that session. Within a SIP session, the individual resource control channels for the different resources are added or removed through the SDP offer/answer model and the SIP re-INVITE dialog.

The server, through the SDP exchange, provides the client with a unique channel identifier and a TCP port number. The client MAY then open a new TCP connection with the server using this port number. Multiple MRCPv2 channels can share a TCP connection between the client and the server. All MRCPv2 messages exchanged between the client and the server will also carry the specified channel identifier, which MUST be unique among all MRCPv2 control channels that are active on that server. The client can use this channel to control the media processing resource associated with that channel.

The session management protocol (SIP) will also establish media pipes between the client (or source/sink of media) and the MRCP server using SDP m-lines. A media pipe may be shared by one or more media processing resources under that SIP session, or each media processing resource may have its own media pipe.
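
The relationship between channel identifiers and shared TCP connections described above can be sketched in a few lines of client-side bookkeeping. This is only an illustrative sketch, not part of the protocol definition: the names ControlChannel and ChannelRegistry, the identifier strings, the host name, and the port number are all hypothetical, and the policy of reusing a connection whenever the server address and port match is an assumption (the text only says that channels can share a connection and that the client MAY open a new one).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Endpoint = Tuple[str, int]  # (server host, TCP port taken from the SDP answer)


@dataclass(frozen=True)
class ControlChannel:
    """One control channel as learned from the SDP exchange (hypothetical type)."""
    channel_id: str   # server-assigned identifier; placeholder format in this sketch
    resource: str     # illustrative label for the media processing resource
    endpoint: Endpoint


@dataclass
class ChannelRegistry:
    """Tracks active channels and which of them share a TCP connection."""
    channels: Dict[str, ControlChannel] = field(default_factory=dict)
    shared: Dict[Endpoint, List[str]] = field(default_factory=dict)

    def add(self, channel: ControlChannel) -> bool:
        """Register a channel; return True if a new TCP connection is needed."""
        if channel.channel_id in self.channels:
            # Identifiers must be unique among the active control channels.
            raise ValueError(f"duplicate channel identifier: {channel.channel_id}")
        self.channels[channel.channel_id] = channel
        users = self.shared.setdefault(channel.endpoint, [])
        users.append(channel.channel_id)
        return len(users) == 1  # assumed policy: reuse a connection when possible

    def remove(self, channel_id: str) -> bool:
        """Drop a channel; return True if its TCP connection is now unused."""
        channel = self.channels.pop(channel_id)
        users = self.shared[channel.endpoint]
        users.remove(channel_id)
        if not users:
            del self.shared[channel.endpoint]
            return True
        return False

    def route(self, channel_id: str) -> ControlChannel:
        """Find the channel a message belongs to from the identifier it carries."""
        return self.channels[channel_id]


# Two resources of one session sharing a single connection (all values made up).
registry = ChannelRegistry()
tts = ControlChannel("chan-1", "synthesizer", ("mrcp.example.com", 32416))
asr = ControlChannel("chan-2", "recognizer", ("mrcp.example.com", 32416))
opened_for_tts = registry.add(tts)
opened_for_asr = registry.add(asr)
```

In this sketch the first channel triggers a new connection and the second reuses it; because every message carries the channel identifier, route() is enough to hand an incoming message to the right resource even when several channels share one connection.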

     MRCPv2 client                   MRCPv2 Media Resource Server
|--------------------|             |-----------------------------|
||------------------||             ||---------------------------||
|| Application Layer||             || TTS  | ASR  | SV   | SI   ||
||------------------||             ||Engine|Engine|Engine|Engine||
||Media Resource API||             ||---------------------------||
||------------------||             || Media Resource Management ||
|| SIP  |  MRCPv2   ||             ||---------------------------||
||Stack |           ||             ||   SIP  |    MRCPv2        ||
||      |           ||             ||  Stack |                  ||
||------------------||             ||---------------------------||
||   TCP/IP Stack   ||----MRCPv2---||       TCP/IP Stack        ||
||                  ||             ||                           ||
||------------------||-----SIP-----||---------------------------||
|--------------------|             |-----------------------------|
         |                             /
        SIP                           /
         |                           /
|-------------------|              RTP
|                   |              /
| Media Source/Sink |-------------/
|                   |
|-------------------|

                 Fig 1: Architectural Diagram

MRCPv2 Media Resource Types:

The MRCP server may offer one or more of the following media processing resources to its clients.

Basic Synthesizer
   A speech synthesizer resource with very limited capabilities, which can be achieved by playing out concatenated audio file clips. The speech data is described as SSML data but with limited support for its elements. It MUST support ,