SIPPING                                                     T. Melanchuk
Internet Draft                                               G. Sharratt
Expires: April 26, 2004                                         Convedia
                                                           Oct. 26, 2003

                 Media Sessions Markup Language (MSML)
                    draft-melanchuk-sipping-msml-01

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

The Media Sessions Markup Language (MSML) is used to control and invoke many different types of services on IP media servers. Clients can use it to define how media sessions interact on a media server and to apply services to individual users or groups of users. MSML can be used, for example, to control media server advanced conferencing features, create sidebar conferences, and insert media processing objects into media streams. As well, clients can use MSML with other languages such as the Media Objects Markup Language (MOML) or VoiceXML to interact with individual users or with groups of conference participants.

Table of Contents

   1. Introduction
   2. Glossary

Melanchuk                Expires - April 2004                   [Page 1]
              Media Sessions Markup Language (MSML)             Oct 2003

   3. Media Server Architecture Model
      3.1 Objects
      3.2 Identifiers
      3.3 Media Streams
   4. MSML SIP Usage
   5. Execution Flow
   6. Elements Received by a Media Server
      6.1
      6.2
      6.3
      6.4
      6.5
         6.5.1
      6.6
      6.7
      6.8
      6.9
      6.10
      6.11
   7. Elements Sent by a Media Server
      7.1
      7.2
      7.3
   8. Response Codes
   9. Examples
      9.1 MSML transaction
      9.2 Call Flow
   10. Change Summary
   11. Future Work
   12. XML Schema
   Security Considerations
   References
   Acknowledgments
   Authors' Addresses
   Intellectual Property Statement
   Full Copyright Statement
   Acknowledgement

1. Introduction

Media servers contain dynamic pools of media resources. Application servers and other users of media servers (called media server clients) can define and create many different services based on how they configure and use those resources. Often, that configuration and the ways in which those resources interact will be changed dynamically over the course of a call, to reflect changes in the way that an application interacts with a user.

For example, a call may undergo an initial IVR dialog before being placed into a conference. Calls may be moved from a main conference to a sidebar conference and then back again. Individual calls may be directly bridged to create small n-way calls or simple sidebars. None of these changes the SIP dialog or RTP session, yet each affects the media flow internal to the media server.

The Media Sessions Markup Language (MSML) is an XML language used to change the flow of and services on media streams within a media server. It is used to invoke many different types of services on individual sessions, groups of sessions, and conferences.

MSML allows the creation of conferences, bridging different sessions together, and bridging sessions into conferences. MSML can be used to apply IVR operations and dialogs to sessions or conferences, and to modify the media flowing on a session.
Dialogs may be specified in any suitable language, from VoiceXML [7], which allows complete application interfaces to be executed by a media server, to the Media Objects Markup Language (MOML) [9], which can be used to define individual user dialog commands and widgets controlled by user input.

A network connection is established with the media server using SIP. Media received and transmitted on that connection will flow through different media resources on the media server depending on the requested service. Basic Network Media Services with SIP [6] defines conventions for associating a basic service with a SIP Request-URI. MSML allows services to be dynamically applied and changed by an application server during the lifetime of the SIP dialog.

MSML and MOML have been designed to work closely together: MOML addresses the control and manipulation of media processing operations (e.g., announcement, IVR, play and record, ASR/TTS, fax, video), while MSML addresses the relationships of media streams (e.g., simple and advanced conferencing). Together, MSML and MOML create a general-purpose media server control architecture. MSML can additionally be used to invoke other IVR languages such as VoiceXML.

MSML is currently defined for audio conferences. Future work will add support for different types of video and multimedia conferences.

2. Glossary

media server: a general-purpose platform for executing high-density real-time media processing tasks. It may be a single physical device or a logical function within a physical device.

media server client: an application residing on an external server which originates MSML requests to a media server.

object: the generic term for a media server entity that terminates, originates, or processes media. This specification defines four classes of objects and specifies mechanisms to create them, join them together, and destroy them.

participant object: an object in a media server that sources original media in a call or that receives and terminates media in a call.

intermediary object: an object in a media server that acts on media within a call for the benefit of the participants.

independent object: an object that can exist on a media server independent of other objects.

network connection: a participant object class that represents the termination on a media server of an RTP session and one or more media streams (for example audio and video). Network connections are established and removed using a session establishment protocol such as SIP. Network connections are independent objects.

operator: an intermediary object class that modifies or transforms a media stream. Examples of operators may be gain controls, voice masking, or tone filters. Specific types of operators are not defined within MSML. Operators may be defined in MOML or other similar languages.

dialog: an automated participant object class. Examples of dialogs may be announcement players, IVR interfaces, or voice recorders. Specific types of dialogs are not defined within MSML. Dialogs may be defined in VoiceXML, MOML, or other similar languages.

conference: an intermediary object class that provides multimedia mixing and other advanced conferencing services. This specification currently considers audio conferences, but is extensible to video and other media applications. Conferences are independent objects.

identifier: a name that is used to refer to a specific instance of an object on the media server. Identifiers are composed of one or more terms where each term identifies an object class and instance.

media stream: the media flow between two objects. A media stream may be half-duplex or full-duplex. The duplex nature of the media stream is specified when it is established.

3. Media Server Architecture Model

Media servers are a general-purpose platform for executing real-time media processing tasks.
These tasks range in complexity from simple ones, such as serving announcements, to complex ones, such as speech interfaces, centralized multimedia conferencing, and sophisticated gaming applications.

Sessions are established to a media server using SIP. Clients will often use SIP third party call control (3PCC) [2] to establish media sessions to a media server on behalf of end users. However, MSML does not require that 3PCC be used, only that the client and the media server share a common identifier for the session.

The primary abstractions used by MSML are objects and streams. Objects represent entities which source, sink, or modify media, and streams represent the media flow between objects on a media server. The following subsections define the classes of objects that exist on a media server, the way these are identified in MSML, and the basic properties of media streams.

3.1 Objects

A media object is an endpoint of a media stream. It may be something that terminates an external network connection or a resource which transforms or manipulates the media in some way. MSML defines four classes of media objects. Each class defines the basic properties of how object instances are used within a media server. Some classes require that the function of specific instances be defined by the client, using languages such as VoiceXML or the Media Objects Markup Language (MOML).

The following classes of media processing objects are defined. The class names are given in parentheses:

   o network connection (conn)
   o conference (conf)
   o dialog (dialog)
   o operator (oper)

Network connection is an abstraction for the media processing resources involved in terminating an RTP session from the network. For audio services as described here, network connections present a single full-duplex media stream interface within a media server. Multimedia services may have multiple media streams associated with an RTP session.
Network connections are instantiated when a media session is established to a media server (e.g., through SIP).

Conference represents an n-1 audio mixer and the other media resources and state information required for an audio conference. A conference has multiple inputs and possibly several classes of inputs. Different classes of input allow the conference to offer different treatment to different audio streams. For example, an advanced conference may have some inputs contend to contribute to the mix while others are always able to contribute. A conference has a single logical output consisting of the conference mix, less any contributed audio of an individual participant who receives the output. Conferences are instantiated using the element. The element, and the other elements identified below, are discussed in depth in the following section.

Dialogs are a class of objects that represent automated participants. They are similar to network connections from a media flow perspective and have a single full-duplex media stream as the abstraction for their interface within a media server. Unlike connections, however, dialogs are created and destroyed through MSML, and the media server itself implements the dialog participant.

The function that an instance of a dialog fulfills is defined by a client using a language such as VoiceXML or MOML. As such, "dialog" is a generic reference to the set of resources, both media and control, that are used to create either a simple action, such as an atomic play or record operation, or more complex application interface components, such as a VoiceXML interpreter. Dialogs are instantiated through the element.

Operators are a class of objects that are used to filter or transform a media stream. The function that an instance of an operator fulfills is defined by a client using a language such as MOML. Operators may be half-duplex or full-duplex.
Half-duplex operators reflect simple atomic functions such as automatic gain control or filtering tones from conferences. Half-duplex operators have a single media input, which is connected to the media stream from one object, and a single media output, which is connected to the media stream to a different object.

Full-duplex operators have two media inputs and two media outputs. One media input and output is associated with the stream to one object and the other input and output is associated with a stream to a different object. Full-duplex objects may also treat the media differently in each direction. For example, an operator could be defined which changed the media sent to a connection based upon recognized speech or DTMF received from the connection. Operators are instantiated through the element.

The relationships between the different object classes are shown in the figure below.

           +--------------------------------------+
           |             Media Server             |
           |                                      |
           |------+                      ,---.    |
           |      |      +------+       /     \   |
<== RTP ==>| conn |<---->| oper |<---->( conf )   |
           |      |      +------+       \     /   |
           |------+                      `---'    |
           |   ^                           ^      |
           |   |                           |      |
           |   |   +------+    +------+    |      |
           |   |   |      |    |      |    |      |
           |   +-->|dialog|    |dialog|<---+      |
           |       |      |    |      |           |
           |       +------+    +------+           |
           +--------------------------------------+

A single, full-duplex instance of each object class is shown together with common relationships between them. An operator is shown between a connection and a conference, and dialogs are shown participating both with an individual connection and with a conference.

3.2 Identifiers

Objects are referenced using identifiers which are composed of one or more terms. Each term specifies an object class and names a specific instance within that class. The object class and instance are separated by a colon ":" in an identifier term.

Identifiers are assigned to objects when they are first created.
In general, either the MSML client or a media server may specify the instance name for an object. Objects for which a client does not assign an instance name will be assigned one by a media server. Instance names assigned by a media server are returned to the client as a complete object identifier in the response to the command which created the object.

It is meaningful for some classes of objects to exist independently on a media server. Network connections may be created through SIP at any time. MSML is then used to associate their media with other objects as required to create services requested by clients. Conferences may be created and have specific resources reserved waiting for connections. Objects from these two classes, connections and conferences, are considered independent objects since they can exist on a standalone basis.

Identifiers for independent objects consist of a single term as defined above. For example, identifiers for a conference and connection could be "conf:abc" or "conn:1234" respectively. Clients which choose to assign instance names to independent objects must use globally unique instance names. One way to create globally unique names is to include the domain name of the client as part of the name.

Dialogs and operators are only created to provide some form of service to independent objects. Dialogs may act as a participant in a conference or interact with a connection similar to a two participant call. Operators modify the media flow between other objects, such as between a connection and a conference. As such, dialogs and operators depend upon the existence of independent objects and this is reflected in the composition of their identifiers.

Identifiers for dialogs and operators are composed of a structured list of slash ('/') separated terms. The left-most term of the identifier must specify a conference or connection.
This serves as the root for the identifier. An example of an identifier for a dialog acting as a conference participant could be:

   conf:abc/dialog:recorder

Because operators may exist relative to two independent objects, different identifiers, with each independent object serving as the root, may be used to refer to the same operator. This is discussed further below.

All objects except connections are created using MSML. Connections are created when media sessions are established through SIP. There are several options clients and media servers can use to establish a shared instance name for the media streams of a connection.

Connection instance names can be network identifiers or aliases. A network identifier consists of the IP address and the port number of a specific media stream. Because these are not used in the context of addressing, and are instead intended as opaque media stream identifiers, it is not necessary to distinguish the address family. Either IPv4 or IPv6 addresses may be used. Examples of connection identifiers based upon network identifiers are:

   conn:192.0.26.2:10000
   conn:FF1E:03AD::7F2E:172A:1E24:10000

Clients may also use the element to create a named alias for a connection. Media servers may also support automatic creation of aliases. An automatic alias can use an identifier from the SIP message used to establish the session together with a stream identifier from the SDP. Stream identifiers are required in order to support multiple media streams, such as audio and video, from a single connection.

Any of the fields that are used to uniquely identify a SIP dialog (Call-ID, from-tag, to-tag) could be used to identify the session. Specific media streams can be identified using either the position of the media line describing the stream in SDP or the per media description information lines ("i-lines") when those are present.
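
The two naming options described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the alias layout (Call-ID followed by a media line position) is one of the possibilities the text allows rather than a required format, and counting media lines from 1 is an assumption of this sketch.

```python
def network_conn_id(ip: str, port: int) -> str:
    """Build a connection identifier from a media stream's IP address and port."""
    return f"conn:{ip}:{port}"


def auto_alias_conn_id(call_id: str, media_line_position: int) -> str:
    """Build a connection identifier from a SIP Call-ID and an SDP media line position.

    Hypothetical layout: the draft only says such fields *could* be used.
    """
    return f"conn:{call_id}:{media_line_position}"


def media_line_positions(sdp: str) -> dict:
    """Map each SDP media ("m=") line position to its media type.

    Positions are counted from 1 here, which is an assumption of this sketch.
    """
    positions = {}
    for line in sdp.splitlines():
        if line.startswith("m="):
            media_type = line[2:].split()[0]
            positions[len(positions) + 1] = media_type
    return positions
```

Because MSML attaches no semantics to the contents of an instance name, a client and media server only need to agree on whichever convention they use.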

An example of a connection identifier based upon an explicitly assigned alias is:

   conn:aliasedConnection

An example based upon an automatically assigned alias using the position of the media line could be:

   conn:1234567890@example.com:2

Note that in all cases, there are no MSML semantics to the contents of the identifier.

The preceding discussion on connection identifiers assumes that an MSML client is using 3PCC to establish the media sessions. Thus both the client and the media server have access to the SDP and SIP dialog identifiers. If this is not the case, it would be a simple matter to use an event package to allow a media server to notify a subscribed client of new sessions.

Identifiers as described above allow every object in a media server to be uniquely addressed. They can also be used to refer to multiple objects. There are two ways in which this can currently be done:

   o common instance names
   o wildcards

An operator that is inserted between two independent objects, such as between a conference and a connection, can be identified using either independent object as the root for its identifier, as described above. All operators can thus be uniquely referenced through connections, even if they have the same instance name. An operator identifier that uses a conference as the root may resolve to multiple objects. This allows common control for operators on multiple media streams.

For example, assume that a client has created a conference with an instance name of "abc". Assume also that it has inserted an operator, with an instance name "VolumeGroup1", in a subset of media streams into the conference. In this case, the identifier "conf:abc/oper:VolumeGroup1" identifies multiple objects which, for example, could all have their input volume simultaneously muted. The volume operator for a specific connection would be identified similar to "conn:192.0.26.8:10000/oper:VolumeGroup1".
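
The identifier rules in this section can be illustrated with a small Python sketch. It is not a normative parser: the names `parse_identifier`, `Operator`, and `resolve_operators` are hypothetical, the data model for operators is invented for the example, and the sketch assumes that instance names never contain a slash. It also applies the wildcard restriction that this section goes on to state (an asterisk may not name an independent object).

```python
from typing import List, NamedTuple, Tuple

INDEPENDENT = {"conn", "conf"}   # classes that may serve as an identifier root
DEPENDENT = {"dialog", "oper"}   # classes that need an independent root
WILDCARD = "*"


def parse_identifier(identifier: str) -> List[Tuple[str, str]]:
    """Split an identifier into (class, instance) terms and check the rules."""
    terms = []
    for position, term in enumerate(identifier.split("/")):
        # Only the first colon separates class from instance; the instance
        # itself may contain colons (for example an IP address and port).
        object_class, sep, instance = term.partition(":")
        if not sep or not instance:
            raise ValueError(f"term {term!r} is not of the form class:instance")
        allowed = INDEPENDENT if position == 0 else DEPENDENT
        if object_class not in allowed:
            raise ValueError(f"class {object_class!r} not allowed at position {position}")
        if instance == WILDCARD and object_class in INDEPENDENT:
            raise ValueError("a wildcard cannot name an independent object")
        terms.append((object_class, instance))
    return terms


class Operator(NamedTuple):
    """Hypothetical record of an operator inserted between two independent objects."""
    conn: str   # for example "conn:192.0.26.8:10000"
    conf: str   # for example "conf:abc"
    name: str   # operator instance name, for example "VolumeGroup1"


def resolve_operators(identifier: str, operators: List[Operator]) -> List[Operator]:
    """Return every operator that a root/oper:name identifier refers to."""
    terms = parse_identifier(identifier)
    if len(terms) != 2 or terms[1][0] != "oper":
        raise ValueError("expected an identifier of the form root/oper:name")
    (root_class, root_instance), (_, name) = terms
    root = f"{root_class}:{root_instance}"
    return [op for op in operators
            if root in (op.conn, op.conf) and name in (WILDCARD, op.name)]
```

With two connections that each carry an operator named "VolumeGroup1" toward "conf:abc", resolving "conf:abc/oper:VolumeGroup1" returns both operators, while rooting the same operator name at one connection returns exactly one.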

The other way an identifier can reference multiple objects is when a wildcard is used as an instance name. MSML reserves the instance name consisting of a single asterisk ('*') to mean all objects that have the same identifier root and class. Instance names containing an asterisk cannot be created. Wildcards are not allowed to represent the instance of an independent object, either when identifying an independent object itself, or as a root for other object identifiers.

The following are examples of valid wildcards:

   conf:abc/oper:*
   conn:1234567890@example.com:2/dialog:*

Examples of illegal wildcard usage are:

   conf:*/oper:73849
   conn:*

Although identifiers share a common syntax, MSML elements restrict the class of objects which are valid in a given context. As an example, although it is valid to join two connections together, it is not valid to join two IVR dialogs.

3.3 Media Streams

All objects have at least one input and one output. Each class defines the number of inputs and outputs an object supports. Joining objects consists of connecting the output from one object to the input of another object and vice versa.

When a join is requested to an object that already has a media stream connected to its single input (such as a connection), a media server should automatically bridge the new stream with the existing stream. The maximum number of streams that may be bridged in this manner is implementation-specific, but it is recommended that a media server support bridging at least two streams.

Two reflects the ability to easily create simple three-way calls, and to bridge private announcements with a conference mix that can be heard by an individual conference participant. Other applications with specialized topologies may benefit from more than two or three automatically bridged media streams.
In the case of general conferences, however, it is simpler to create a conference (with associated mixer) and then to join participants to the conference.

4. MSML SIP Usage

SIP is used to create and modify media sessions with a media server according to the procedures defined in RFC 3261 [1]. Often, SIP third party call control will be used to create sessions to a media server on behalf of end users.

MSML is used to define and change the service which a user connected to a media server will receive. As such, MSML clients are expected to be application servers, which must have an authorized security relationship with the media server. MSML itself does not define authorization mechanisms.

MSML transactions are originated based upon events that occur in the application domain. These events may be independent from any media or user interaction. For example, an application may wish to play an announcement to a conference warning that its scheduled completion time is approaching.

Applications themselves are structured in many different ways. Their structure and requirements contribute to their selection of protocols and languages. To accommodate differing application needs, MSML has been designed to be neutral to other languages and independent of the transport used to carry it.

Many alternatives exist for a transport mechanism for MSML. There may be one or many transport channels used to carry MSML, based upon the requirements and structure of applications. SIP INVITE and INFO [3] requests and responses have been chosen to carry MSML in this release of the specification. INFO requests allow asynchronous mid-call messages within SIP with few additional semantics.
In addition, that method has existing, widely deployed implementations; it aids initial developments which are closely coupled with SIP session establishment; and it allows MSML to be directly associated with user dialogs when third party call control is used.

Although INFO is generally not considered to be a suitable general-purpose transport mechanism for messages within SIP, there have been proposals to make it more acceptable [10]. In future releases of this document, MSML is expected to evolve to include other SIP usage and/or to work with other protocols or as a stand-alone protocol established through SIP.

MSML supports several models for client interaction. When clients use 3PCC to establish media sessions on behalf of end users, clients will have a SIP dialog for each media session. MSML may be sent on these dialogs. However, the targets of MSML actions are not inferred from the session associated with the SIP dialog. The targets of MSML actions are always explicitly specified using identifiers as previously defined.

An application, after interacting with a user, may want to affect multiple objects within a media server. For example, tones or messages are often played to a conference when connections are added or removed. A separate message may also be played to a participant as they are joined, or to moderators. Explicit identifiers not inferred from a transport mechanism allow these multiple actions to be easily grouped into a single transaction sent on any SIP dialog.

MSML also supports a model of dedicated control associations. This supports decoupled application architectures where a client can control media server services without also establishing all of the media sessions itself. Control associations are created using SIP but they do not have any associated media session.
Although initially INFO messages will be sent on this SIP dialog, just as with dialogs associated with media sessions, it is expected that in the future, the SIP dialog will be used to establish a separate control session (defined in SDP) that does not use SIP as the transport for MSML messages.

A media server using MSML also sends asynchronous events to a client using SIP INFO. Events are sent based on previous MSML requests.

Events may be generated during the execution of a dialog created by a element. For example, dialogs defined in MOML can send events based on user input. VoiceXML dialogs, on the other hand, generally interact with other servers outside of MSML using HTTP.

An event is also generated when the execution of a dialog terminates. The exact information returned is dependent on the dialog language, the capabilities of the dialog execution environment, and what was specified by the dialog. Both MOML and VoiceXML allow information to be returned when they exit. These events may be sent in a SIP INFO or a SIP BYE.

Conferences may also generate events based upon their configuration. An example of this is the notification of the set of active speakers.

Events are sent to the URI from the Contact header of the SIP request that initiated the dialog execution or created the conference.

5. Execution Flow

MSML assumes a model where there is a single control context within a media server for MSML processing. That context may have one or many SIP dialogs associated with it. It is assumed that any SIP dialogs associated with the MSML control context have been authorized by mechanisms outside the scope of MSML.

A media server control context maintains information about the state of all media objects and media streams within a media server. It receives and processes all MSML requests from authorized SIP dialogs and receives all events generated internally by media objects.
An MSML request is able to create new media objects and streams, and to modify or destroy any existing media objects and streams.

An MSML request may simply specify a single action for a media server to undertake. In this case, the document is very similar to a simple command request. Often, though, it may be more natural for a client to request multiple actions at one time, or the client would like several actions to be closely coordinated by the media server. Multiple MSML elements received in a single request are processed sequentially in document order.

An example of the first scenario would be to create a conference and join it with an initial participant. An example of the second case would be to unjoin one or more participants from a main conference and join them to a sidebar conference. In the first scenario, network latencies may not be an issue, but it is simpler for the client to combine the requests. In the second case, the added network latency between separate requests could mean perceptible audio loss to the participant.

Each MSML request is processed as a single transaction. A media server must ensure that it has the necessary resources available to carry out the complete transaction before executing any elements of the request. If it does not have sufficient resources, it should return a 520 response and not execute the transaction.

Each element is expected to execute immediately. Elements such as , which take time, are "forked" and executed in a separate thread. Once successfully forked, execution continues with the element following the dialog. As such, MSML does not provide mechanisms to sequence or coordinate other operations with dialog elements.

Processing within a transaction stops if any errors occur. Elements that were executed prior to the error are not rolled back.
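
The transaction rules above (resource check first, sequential execution, stop at the first error, no rollback) can be sketched as follows. This is a minimal illustration, not the draft's implementation: `MsmlError`, `Element`, and `run_transaction` are hypothetical names, the boolean `resources_ok` stands in for whatever resource check a media server performs, and the use of 200 as the success code is an assumption because response codes are defined elsewhere in the document. The sketch also tracks the optional "mark" attribute, which the text describes next.

```python
from typing import Callable, List, NamedTuple, Optional


class MsmlError(Exception):
    """Raised by an element's action to signal a failure with a response code."""

    def __init__(self, code: int, text: str):
        super().__init__(text)
        self.code = code


class Element(NamedTuple):
    action: Callable[[], None]   # performs the element; raises MsmlError on failure
    mark: Optional[str] = None   # optional progress token


def run_transaction(elements: List[Element], resources_ok: bool) -> dict:
    """Process one request as a single transaction and return a result summary."""
    if not resources_ok:
        # Insufficient resources: nothing in the request is executed.
        return {"code": 520, "executed": 0, "mark": None}
    last_mark = None
    for index, element in enumerate(elements):
        try:
            element.action()
        except MsmlError as err:
            # Stop at the first error. Earlier elements are NOT rolled back.
            return {"code": err.code, "executed": index, "mark": last_mark}
        # Mark of the last successfully executed element (None if it had none).
        last_mark = element.mark
    # 200 as the success code is an assumption of this sketch.
    return {"code": 200, "executed": len(elements), "mark": None}
```

A client reading such a result can tell from the executed count and the mark which elements took effect and decide which compensating actions, if any, to send in a new request.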
It is the responsibility of the client to determine appropriate actions based upon the results indicated in the response.

Most elements may contain an optional "mark" attribute. The value of that attribute from the last successfully executed element is returned in an error response. Note that errors that occur during the execution of a dialog occur outside the context of an MSML transaction. These errors will be indicated in an asynchronous event.

Transaction results are returned as part of the response to the SIP request. The transaction results indicate the success or failure of the transaction. They also include identifiers for any objects created by a media server for which the client did not provide an instance name. Additionally, if the transaction fails, the reason for the failure is returned, as well as an indication of how much of the transaction was executed before the failure occurred.

6. Elements Received by a Media Server

6.1 is the root element. When received by a media server, it defines the set of actions that form a single MSML transaction. Actions are defined by the remaining elements of this section, most of which may appear zero or more times as children of .

Because MSML transactions execute immediately, the results of the transaction are included as a body in the response to the request. This response will contain any identifiers that the media server assigned to newly created objects. All messages that a media server generates are correlated to those object identifiers.

Attributes:

   version: "1.0". Mandatory.

6.2 Alias is used to assign a name to a connection. The same connection can be assigned multiple aliases.

Attributes:

   name: the instance name of the aliased connection. Mandatory.

   id: an existing connection identifier. Mandatory.

   mark: a token that can be used to identify execution progress in the case of errors.
The value of the mark attribute from the last successfully executed MSML element is returned in an error response. Therefore the value of all mark attributes within an MSML document should be unique.

For example, the following assigns the alias "moderator" to a connection referenced using a network identifier.

6.3 <join>

Join is used to create a media stream between two media objects. A media object may be a network connection or an internal media resource such as a conference. By default a full-duplex media stream is created. However, a half-duplex stream may be created by setting the duplex attribute to "half". Join establishes the specified relationship between the two objects and does not change any pre-existing relationships.

Some media resources may only be half-duplex. It is illegal to join a full-duplex resource to a half-duplex resource without specifying the join to be half-duplex. Otherwise the media server cannot be sure of the intent and will generate an error (441). At most one media stream may be created between the same two objects.

Join is only used to establish media streams between independent objects. Media streams to the media resources for a dialog are established in conjunction with starting the dialog. Operators are only inserted into existing media streams.

attributes:

id1: an identifier of either a connection or conference. Any other object class results in a 440 error.

id2: an identifier of either a connection or conference. Any other object class results in a 440 error.

duplex: "half" or "full". When "half" is specified, the object identified by id1 receives media from the object identified by id2 but not vice versa. Default is "full".

class: only allowed when either id1 or id2 identifies an advanced audio conference.
These conferences have two types of audio inputs: "standard" participants contend to be one of the N-loudest which contribute to the conference mix; "preferred" participants always contribute to the conference mix. The duplex attribute governs whether a media stream is actually established. Default is "standard". When both id1 and id2 are conferences, class is ignored and a preferred input is used.

mark: a token which can be used to identify execution progress in the case of errors. The value of the mark attribute from the last successfully executed MSML element is returned in an error response. Therefore the value of all mark attributes within an MSML document should be unique.

For example, consider a call center coaching scenario where a supervisor can listen to the conversation between an agent and a customer, and provide hints to the agent, which are not heard by the customer. One join establishes a stream between the agent and the customer and another join establishes a stream between the agent and the supervisor. A third join is used to establish a half-duplex stream from the customer to the supervisor. The media server automatically bridges the media streams from the customer and the supervisor for the agent, and from the customer and the agent for the supervisor.

Assuming the following media streams:

supervisor: 192.0.2.2:4680
agent: 192.0.2.4:6428
customer: 192.0.2.6:9684

If all connections were already established on a media server, the following would create the media flows previously described:

6.4 <unjoin>

Removes a media stream between two objects. <unjoin> and <join> may be used together to move a media stream, such as from a main conference to a sidebar conference.

attributes:

id1: an identifier of either a specific connection or a conference. Mandatory.

id2: an identifier of either a specific connection or a conference. Mandatory.
mark: a token which can be used to identify execution progress in the case of errors. The value of the mark attribute from the last successfully executed MSML element is returned in an error response. Therefore the value of all mark attributes within an MSML document should be unique.

The following removes a participant from a conference and plays a leave tone for the remaining participants in the conference.

6.5 <createconference>

This element is used to allocate media processing resources for audio conferences. Two types of audio conferences are defined: audio.basic and audio.advanced. A basic conference is simply an N-1 mix of the participants. An advanced conference supports the following features:

o N-loudest
o preferred speakers
o participant and conference resource reservations
o active speaker notifications

Characteristics of conferences may be changed dynamically during the conference by sending events to the conference identifier. The event name identifies the attribute to set and the valuelist for the event specifies the new value. The type of conference cannot be changed once the conference has been created.

attributes: (all attributes are optional; only the name attribute is allowed if type="audio.basic")

type: audio.basic or audio.advanced. In the future this may be a URL which identifies a description of a conference. MOML may be a candidate language to specify conference descriptions. Default is audio.advanced.

name: the instance name of the conference. If the attribute is not present, the media server will assign a globally unique name for the conference. If the attribute is present but the name is already in use, an error (432) will result and MSML document execution will stop. Any events which the conference generates will be correlated with this name.
n: the number of participants (excluding preferred speakers) who contend to be included in the conference mix based upon their audio energy. Default is 3.

asn: boolean which defines whether active speakers should be reported. Only the participants who are active and eligible to contribute to the mix are reported. Default is false.

ri: the minimum reporting interval, that is, the minimum duration of time which must pass before changes to active speakers will be reported. Default is 5 seconds.

mark: a token which can be used to identify execution progress in the case of errors. The value of the mark attribute from the last successfully executed MSML element is returned in an error response. Therefore the value of all mark attributes within an MSML document should be unique.

An example of creating a conference is shown below. This conference allows at most two participants to contend to be heard and reports the set of active speakers no more frequently than every ten seconds.

If the MSML client later decides that three participants should contend to be mixed, it may do so by:

6.5.1 <reserve>

Conference resources may be reserved by including the <reserve> element as a child of <createconference>. <reserve> allows the specification of a set of resources which a media server will reserve for the conference. Any requests for resources beyond those that have been reserved should be honored on a best-effort basis by a media server.

attributes:

required: boolean that specifies whether <createconference> should fail if the requested resources are not available. When set to false, the conference will be created, with no reserved resources, if the complete reservation cannot be honored. Default true.

Two classes of resources are associated with a conference. Individual resources consist of those resources which are reserved for each participant.
Shared resources are reserved for the entire conference. Examples of shared resources may be announcement players or voice recorders. Each class of resource is defined using its own child element. Each of these accepts an attribute that indicates the number of resource instances that are to be reserved. The contents of each element describe the resources that are to be reserved for each instance. Descriptions are implementation-dependent. Media servers that support MOML may use the elements from that language as the basis for resource descriptions.

In the general case, the resources reserved for each participant must accommodate all of the codecs that a media server supports. However, in some environments, the types of codecs that are expected may be known. In these cases, a media server may support descriptions that constrain reservations to different media formats.

For example, the following creates a conference and reserves individual resources for 20 participants, and reserves 2 instances of the shared resources for use on a conference-wide basis.

The number of individual and shared reservations may be changed during the life of a conference by sending "reserve-ind" and "reserve-shared" events to the conference identifier. The valuelist indicates the number of instances which are now desired. The success of the request is indicated in the response to the event. Requests for increases that cannot be honored do not change the reservation which was previously established.

Individual reservations are consumed when participants are joined to the conference. Applications must either include MSML indicating a <join> to the conference with the initial SIP INVITE, or must use the "conf=" Request-URI conventions defined in [6], to ensure that the conference connection may make use of the reserved resources when the connection is first established.
Otherwise, if a dialog is required to determine the conference or service, the connection contends for available media server resources.

6.6 <destroyconference>

Deletes the conference state and all shared resources. Any media streams between the conference and participants are also removed.

attributes:

id: the identifier for a conference. Mandatory.

term: a boolean value. When true, the media server will send a BYE request on all SIP dialogs still associated with the conference. Setting term to false allows clients to start dialogs on connections once the conference has completed. Default true.

mark: a token which can be used to identify execution progress in the case of errors. The value of the mark attribute from the last successfully executed MSML element is returned in an error response. Therefore the value of all "mark" attributes within an MSML document should be unique.

6.7 <dialogstart>

Dialog start is used to instantiate a media dialog on connections or conferences. A dialog may consist of a simple atomic command or it may be an entire sequence of interactions. Dialogs may be speech or IVR dialogs with human participants, or fax dialogs with a machine. A media server must support MOML to allow command-driven and fax interactions, and should support VoiceXML to allow execution of complex user interfaces. Other dialog languages may also be supported.

The control resources associated with dialogs are separate from the MSML thread of execution. When a dialog is started, a media stream is created between the media resources required for the dialog and the target connection or conference, and then MSML allows the dialog control resources to execute. MSML execution continues without waiting for the dialog to complete.

The media stream between the dialog media resources and the network connection or conference may be full-duplex or half-duplex.
When a dialog is performed on conferences of type audio.advanced, the dialog media stream is connected to the conference mixer as a preferred speaker.

The dialog description may be specified either inline or by a URL. The description must not be inline if the src attribute is specified.

The originator of the dialog is notified using a "msml.dialog.exit" event when the dialog completes. Any results returned by the dialog when it exits should be returned as part of the "msml.dialog.exit" event.

attributes:

target: an identifier of a connection or a conference which will interact with the dialog. Mandatory.

src: the URL of the dialog description. Must not be used if the dialog description is inline; otherwise an error (422) will result and MSML document execution will stop.

type: a MIME type which identifies the type of language used to describe the dialog. application/moml+xml and application/vxml+xml are used to identify MOML and VoiceXML respectively.

name: an instance name for the dialog. If the attribute is not present, the media server will assign an identifier to the dialog. If the attribute is present but the name is already associated with the target, an error (431) will result and MSML document execution will stop. Any results that a dialog generates will be correlated to its identifier.

duplex: "full" means the media stream is full-duplex, "to" means the media stream is half-duplex to the dialog (it receives audio), and "from" means the media stream is half-duplex from the dialog (it sends media). Default "full".

mark: a token which can be used to identify execution progress in the case of errors. The value of the mark attribute from the last successfully executed MSML element is returned in an error response. Therefore the value of all "mark" attributes within an MSML document should be unique.

The following example starts a VoiceXML dialog on a connection.
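
A minimal Python sketch of how a client might assemble such a request is given here, since a concrete document makes the attribute rules easier to follow. It is an illustration under stated assumptions, not the draft's own example: the element names "msml" and "dialogstart", the connection identifier, the dialog name, and the URL are hypothetical, while the attribute names and the MIME type are the ones listed above. The helper also refuses a request that carries both src and an inline description, which a media server would report as error 422.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_dialogstart(target: str, mime_type: str,
                      src: Optional[str] = None,
                      inline: Optional[ET.Element] = None,
                      name: Optional[str] = None,
                      duplex: str = "full") -> str:
    """Serialize a one-element request that starts a dialog (assumed names)."""
    if (src is None) == (inline is None):
        # Both present is the ambiguous case (422); neither is incomplete.
        raise ValueError("give exactly one of src or inline")
    if duplex not in ("full", "to", "from"):
        raise ValueError("duplex must be 'full', 'to', or 'from'")
    root = ET.Element("msml", {"version": "1.0"})
    attrs = {"target": target, "type": mime_type, "duplex": duplex}
    if name is not None:
        attrs["name"] = name
    if src is not None:
        attrs["src"] = src
    dialog = ET.SubElement(root, "dialogstart", attrs)
    if inline is not None:
        dialog.append(inline)   # inline dialog description
    return ET.tostring(root, encoding="unicode")


document = build_dialogstart(
    target="conn:example-connection",               # hypothetical identifier
    mime_type="application/vxml+xml",
    src="http://server.example.com/welcome.vxml",   # hypothetical URL
    name="welcome",                                 # hypothetical instance name
)
print(document)
```

Checking the src/inline rule on the client side is a design choice of the sketch; the text only requires that the media server reject the ambiguous form.
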
6.8 <dialogend>

End dialog is used to terminate a dialog created through <dialogstart> before it completes of its own accord.

The operation of <dialogend> depends on the dialog language being used by the executing context. When that context is VoiceXML, a "connection.disconnected" event will be thrown to the VoiceXML application. When that context is MOML, a "terminate" event will be sent to the MOML context. In both cases, the executing dialog has the opportunity to gracefully complete and return data to the media server client. As for all data sent to the client, it will be sent in a separate message and correlated with the dialog identifier.

attributes:

id: the identifier of a dialog. Mandatory.

mark: a token which can be used to identify execution progress in the case of errors. The value of the mark attribute from the last successfully executed MSML element is returned in an error response. Therefore the value of all "mark" attributes within an MSML document should be unique.

For example, the following would terminate the dialog started by the previous example if it had not already completed.

6.9 <insert>

Insert allows media streams to be modified by placing a media processing operator between two existing objects. Operators may be full-duplex or half-duplex. The existing media stream is replaced by a new stream which includes the inserted operator. When a half-duplex operator is inserted, only the media stream in one direction is affected.

Operators are used to transform, filter, or otherwise modify a media stream. Examples are gain controls, voice masking, or tone filters, although specific operators are not defined within MSML. Operators are defined using MOML, although media servers may also support other languages. Half-duplex operators may be as simple as individual MOML transform primitives which explicitly adjust gain or filter DTMF tones.
More complex half-duplex and full-duplex operators can also be created and used. Media objects may be referenced by a URL or defined inline.

Full-duplex operators need not be symmetric in how they process media. Media in one direction may be treated or used differently from media in the other direction. For example, automatic gain control may be applied to media going to a conference mix but a participant may have the ability to explicitly control the volume of the conference mix that they hear. In this case, a media server uses the left side and right side definition of the inserted operator to know how to orient its inputs and outputs with the media stream.

Insert differs from using explicit unjoin and join operations because the inserted operator is considered to be part of the media stream between the two original objects. If one of those objects terminates, for example a dialog, then the operator is automatically removed as part of deleting the media stream. Multiple operators may be inserted and all become part of the media stream.

It may often be desired to define the same treatment for all of the media streams associated with a conference. When one of the objects identifies a conference, <insert> allows the other object to be specified as "all" to indicate that a copy of the new object should become part of the media stream for all participants of the conference. The keyword "all" affects all current and future streams joined to the conference. The class of media stream that is affected may be specified for conferences that support multiple kinds of participants.

attributes:

id1: an identifier or "all" if id2 specifies a conference. If the inserted operator is half-duplex, id1 identifies the recipient of the media from the operator. If the inserted operator is full-duplex, id1 is connected to the left side of the operator. Mandatory.
id2: an identifier or "all" if id1 specifies a conference. If the inserted operator is half-duplex, id2 identifies the object that will send media to the operator. If the inserted operator is full-duplex, id2 is connected to the right side of the operator. Mandatory.

src: the URL of the media operator. Must not be used if the operator is inline; otherwise an error (422) will result and MSML document execution will stop.

type: a MIME type which identifies the type of language used to describe the operator. The type "application/moml+xml" must be supported.

duplex: "half" or "full". When "half" is specified, the inserted object affects the stream in the direction from id2 to id1 but not vice versa. Default is half.

name: an instance name for the object.

class: defines the class of conference media stream which is affected when a conference supports multiple classes and the "all" keyword is used (see also <join>). If no class is specified, then all media streams are affected. The class attribute must not be used if neither id1 nor id2 specifies "all" media streams.

mark: a token which can be used to identify execution progress in the case of errors. The value of the mark attribute from the last successfully executed MSML element is returned in an error response. Therefore the value of all mark attributes within an MSML document should be unique.

For example, the following creates a conference and specifies that a MOML gain control object, which automatically adjusts the gain on a participant's input to the conference mix, should be inserted for each participant that joins the conference.

6.10 <remove>

The remove element is used to remove objects which have been placed in a media stream using <insert>. Remove restores the original media stream.

attributes:

id: the identifier of the object to remove.
If id refers to multiple objects affecting multiple media streams, then all objects are removed from all affected media streams. Mandatory.

6.11 <send>

Events are used to affect the behavior of different objects within a media server. The <send> element is used to send an event to the specified recipient. An example of using <send> is shown in the definition of <createconference>.

attributes:

event: the name of an event. Mandatory.

target: an object identifier. When the identifier is for a dialog or operator, it may optionally be appended with a slash "/" followed by the MOML target for the event. Connections and conferences of type "audio.basic" do not support events. VoiceXML only supports events in a limited fashion. Mandatory.

valuelist: a list of zero or more parameters that are included with the event.

mark: a token that can be used to identify execution progress in the case of errors. The value of the mark attribute from the last successfully executed MSML element is returned in an error response. Therefore the value of all mark attributes within an MSML document should be unique.

7. Elements Sent by a Media Server

7.1 <msml>

The root element for MSML. When sent by a media server, it encompasses the results of a request or the contents of events.

attributes:

version: "1.0". Mandatory.

7.2 <result>

The <result> element is used to report the results of an MSML transaction. It is included as a body in the final response to the SIP request which initiated the transaction. An optional child element may include text which expands on the meaning of error responses. Response codes are defined in the following section.

attributes:

response: a numeric code indicating the overall success or failure of the transaction, and in the case of failure, an indication of the reason. Mandatory.
mark: in the case of an error, the value of the mark attribute from the last successfully executed element that included the mark attribute.

Two other child elements allow the response to include identifiers for objects that were created by the request but for which the client did not specify instance names: one for objects created through <createconference> and one for objects created through <dialogstart>.

7.3 <event>

The <event> element is used to convey an event to a media server client. Two types of events are defined by MSML: "msml.dialog.exit" and "msml.conf.asn". These correspond to the termination of an executing dialog, and the notification of the current set of active speakers for a conference, respectively. Events may also be generated by an executing dialog. In this case the event type is specified by the dialog.

attributes:

name: the type of event. Mandatory.

id: the identifier of the conference or dialog that generated the event or caused the event to be generated. Mandatory.

<event> has two child elements, which contain the name and the value, respectively, of each namelist item associated with the event.

8. Response Codes

The response codes defined in this section are returned as the value of the response attribute of the <result> element.
Informational

No informational responses are currently defined.

Success

200 Ok

Client Error

400 Bad Request
401 Unknown Element
402 Unsupported Element
403 Missing mandatory element content
404 Forbidden element content
405 Invalid element content
406 Unknown attribute
407 Attribute not supported
408 Missing mandatory attribute
409 Forbidden attribute is present
410 Invalid attribute value
420 Unsupported media description language
421 Unknown media description language
422 Ambiguous request (both URI and inline description)
430 Object does not exist
431 Object instance name already used
432 Conference name already in use
440 Cannot join objects of specified class
441 Objects have wrong duplexity
442 Media stream does not exist to insert
443 Objects already joined with specified duplexity

Server Error

500 Internal media server error
510 Not in service
520 No resource to fulfill request

9. Examples

9.1 MSML transaction

The following example creates a sidebar conference and moves three participants from a main conference to the sidebar using <unjoin> and <join>. It then does a half-duplex join of the main conference to the sidebar, so that those in the sidebar may hear the main conference. However, a gain is inserted so that the volume of the main conference does not intrude on the sidebar.

9.2 Call Flow

The following call flow shows an application server (AS) using a media server (MS) to create a three-person conference. Three terminals (T1, T2, T3) each dial in to a conferencing access number hosted on the AS. The AS uses 3PCC to connect their media streams to the MS and uses the MS to engage in a welcome dialog with each person. That dialog would consist of playing a greeting and asking them for a conference identifier. Either MOML or VoiceXML could be used for dialogs.
The flow presented here assumes that MOML has been used, and that the results are returned to the AS using MSML. Had VoiceXML been used, the results would have been posted to the AS using HTTP.

After the first person has requested the conference, the AS instructs the MS to create the conference and joins in that participant. Subsequent participants are simply joined to the existing conference. Each person records their name and the second and third participants have their name announced to the conference as they join.

T1, T2, T3                    AS                                   MS
----------                    --                                   --
T1 -----------INVITE--------> |
T1 <------------OK----------- |
T1 -----------ACK-----------> |
                              | ---INVITE (MSML dialog[welcome])-> |
                              | <----OK--------------------------- |
                              | ---ACK---------------------------> |
T1 =====================RTP======================================> |
                              | <--INFO (MSML event:conf-id)------ |
                              | -----OK--------------------------> |
                              | ---INFO (MSML dialog[rec name])--> |
                              | <----OK--------------------------- |
                              | <--INFO (MSML event:recordexit)--- |
                              | -----OK--------------------------> |
                              | ---INFO (MSML createconf+join)---> |
                              | <----OK--------------------------- |
    T2 ---------INVITE------> |
    T2 <----------OK--------- |
    T2 ---------ACK---------> |
                              | ---INVITE (MSML dialog[welcome])-> |
                              | <----OK--------------------------- |
                              | ---ACK---------------------------> |
    T2 <================RTP======================================> |
                              | <--INFO (MSML event:conf-id)------ |
                              | -----OK--------------------------> |
                              | ---INFO (MSML dialog[rec name])--> |
                              | <----OK--------------------------- |
                              | <--INFO (MSML event:recordexit)--- |
                              | -----OK--------------------------> |
                              | ---INFO (MSML join)--------------> |
                              | <----OK--------------------------- |
T1 <=========RTP (conf mix)======================================> |
    T2 <=====RTP (conf mix)======================================> |
                              | ---INFO (MSML dialog[play name])-> |
                              | <----OK--------------------------- |
        T3 --------INVITE---> |
        T3 <---------OK------ |
        T3 --------ACK------> |
                              | ---INVITE (MSML dialog[welcome])-> |
                              | <----OK--------------------------- |
                              | ---ACK---------------------------> |
        T3 <============RTP======================================> |
                              | <--INFO (MSML event:conf-id)------ |
                              | -----OK--------------------------> |
                              | ---INFO (MSML dialog[rec name])--> |
                              | <----OK--------------------------- |
                              | <--INFO (MSML event:recordexit)--- |
                              | -----OK--------------------------> |
                              | ---INFO (MSML join)--------------> |
                              | <----OK--------------------------- |
        T3 <=RTP (conf mix)======================================> |
                              | ---INFO (MSML dialog[play name])-> |
                              | <----OK--------------------------- |

10. Change Summary

The following are the primary changes between this version of the draft and the -00 version.

o added a glossary

o rewrote the description of objects to precisely distinguish between classes and instances. All classes are now defined in MSML. The "oper" class replaces "application defined classes".

o rewrote the description of identifiers. All terms must use "class:instance" where the instance may be assigned by the client or media server. "/" replaces ";" as the term separator for identifiers.

o clarified the definition of connection identifiers and require that "conn" be the class for all forms of the identifier.

o '*' wildcard allowed for an instance name in limited situations.

o alias only names a single connection.

o clarified SIP usage and transport neutrality. All actions use mandatory explicit identifiers rather than inferring targets from a SIP dialog.

o changed the attribute name from "id" to "name" for client assigned instance names.

o fixed so that MOML target is appended to the MSML target rather than the MSML event.

o changed xml+moml to moml+xml and xml+vxml to vxml+xml.

o changed "namelist" to "valuelist" in send.
o removed explicit "lhs" / "rhs" labeling of full duplex objects.

o added specification of result codes and when they are returned.

11. Future Work

Some of the likely functions to be added in future releases of MSML include:

o a mechanism for extending the language, similar conceptually to MGCP/MEGACO packages.

o management capabilities such as dedicated management associations between control agent and media server, auditing of state, and support for control agent failover.

o video and multimedia.

12. XML Schema

The MSML schema uses one core schema which includes two other schemas; one defines the MSML datatypes, the other is for MOML which is optionally used for dialogs and is required to define operators.

The core schema is:

Following is the schema which defines the basic datatypes used by the other schema. Note that several regular expressions had to be split across two lines for formatting reasons.

Security Considerations

MSML can be used to affect the service received by any user connected to a media server. The transport used to carry MSML must use appropriate security measures to ensure that only authorized entities are able to use MSML.

References

[1] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. Johnston, J. Peterson, R. Sparks, M. Handley, and E.
Schooler, "SIP: Session Initiation Protocol", RFC 3261, Internet Engineering Task Force, June 2002.

[2] J. Rosenberg, J. Peterson, H. Schulzrinne, and G. Camarillo, "Best Current Practices for Third Party Call Control in the Session Initiation Protocol", Internet Draft, Internet Engineering Task Force, March 2003. Work in progress.

[3] S. Donovan, "The SIP INFO Method", RFC 2976, Internet Engineering Task Force, Oct. 2000.

[4] R. Even, O. Levin, and N. Ismail, "Conferencing Media Policy Requirements", Internet Draft, Internet Engineering Task Force, Feb. 2003. Work in progress.

[5] R. Mahy and N. Ismail, "Media Policy Manipulation in the Conference Policy Control Protocol", Internet Draft, Internet Engineering Task Force, Feb. 2003. Work in progress.

[6] J. Van Dyke, E. Burger, and A. Spitzer, "Basic Network Media Services with SIP", Internet Draft, Internet Engineering Task Force, Mar. 2003. Work in progress.

[7] World Wide Web Consortium, "Voice Extensible Markup Language: VoiceXML, Version 2.0", W3C Candidate Recommendation, Feb. 2003. Work in progress.

[8] World Wide Web Consortium, "Voice Browser Call Control: CCXML Version 1.0", W3C Working Draft, Oct. 2002. Work in progress.

[9] T. Melanchuk, "Media Objects Markup Language (MOML)", Internet Draft, Internet Engineering Task Force, Oct. 2003. Work in progress.

[10] D. Willis, "Packaging and Negotiation of INFO Methods for the Session Initiation Protocol (SIP)", Internet Draft, Internet Engineering Task Force, Jan. 2003. Work in progress.

Acknowledgments

Adnan Saleem and Yong Xin of Convedia have provided key insights, both theoretical and through development experience. Gilles Compienne and Ben Smith, both of Ubiquity Software, provided important feedback on a pre-release version of the -00 draft.
Chris Boulton of Ubiquity and Michael Rice of VocalData helped clarify several issues in the -00 draft, while Bruce Walsh and Kevin Fitzgerald, both of Spectel, provided important feedback on that draft. Pete Danielsen of Lucent provided a thorough and detailed review.

Authors' Addresses

Tim Melanchuk
Convedia
4190 Still Creek Drive, Suite 300
Vancouver, BC, V5C 6C6
Canada
email: timm@convedia.com

Garland Sharratt
Convedia
4190 Still Creek Drive, Suite 300
Vancouver, BC, V5C 6C6
Canada
email: gsharratt@convedia.com

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF Secretariat.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.

Full Copyright Statement

Copyright (C) The Internet Society 2003. All Rights Reserved.
This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

Funding for the RFC Editor function is currently provided by the Internet Society.