RTCWEB J. Rosenberg Internet-Draft M. Kaufman Intended status: Informational M. Hiie Expires: August 12, 2011 F. Audet Skype February 8, 2011 An Architectural Framework for Browser based Real-Time Communications (RTC) draft-rosenberg-rtcweb-framework-00 Abstract This document defines an architectural framework for browser-based real-time communications (RTC). We propose a media component model, where the browser provides an API abstraction which models media components and connections. The underlying protocols within the browser provide for a minimum set of functionality related to transport of media. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on August 12, 2011. Copyright Notice Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect Rosenberg, et al. Expires August 12, 2011 [Page 1] Internet-Draft RTCWEB-fw February 2011 to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. The Media Component Model . . . . . . . . . . . . . . . . . . 4 3. The Role of Signaling . . . . . . . . . . . . . . . . . . . . 5 4. The Role of Media Transport . . . . . . . . . . . . . . . . . 8 5. Benefits of the Media Component Model . . . . . . . . . . . . 9 5.1. Enabling Innovation . . . . . . . . . . . . . . . . . . . 9 5.2. The Importance of Flexibility . . . . . . . . . . . . . . 10 6. Interoperability with Existing VoIP Gear . . . . . . . . . . . 11 7. Informative References . . . . . . . . . . . . . . . . . . . . 12 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 13 Rosenberg, et al. Expires August 12, 2011 [Page 2] Internet-Draft RTCWEB-fw February 2011 1. Introduction Real-time communications (RTC) remains one of the few - if only - classes of desktop applications that is not yet possible using the native capabilities of the web browser. These applications run natively on the desktop, or are powered by plugins. The functionality provided by these desktop clients is rich and complex - ranging from user interface, to real-time notifications, to call signaling and call processing, to instant messaging and presence, and of course - the real-time media stack itself, including codecs, transport, firewall and NAT traversal, security, and so on. Given the breadth of functionality in today's desktop RTC clients, careful consideration needs to be paid to how that functionality manifests in the browser. What functionality lives within the browser itself? What functionality lives on top of it - either in client-side Javascript or within servers? What protocols are spoken by the browser itself? What protocols can be implemented within the Javascript? What protocols need to be standardized, and which do not? Pictorially, the question is what protocols, APIs, and functionality reside within the box marked "Browser RTC Function" in Figure 1. Indeed, the central question is what functionality resides in that box, as the functionality will ultimately dictate the protocols that interface to it, and the APIs which control it. Rosenberg, et al. Expires August 12, 2011 [Page 3] Internet-Draft RTCWEB-fw February 2011 +------------------------+ On-the-wire | | Protocols | Servers |---------> | | | | +------------------------+ ^ | | | HTTP/ | Websockets | | | | +----------------------------+ | Javascript/HTML/CSS | +----------------------------+ Other ^ ^RTC APIs | |APIs +---|-----------------|------+ | | | | | +---------+| | | Browser || On-the-wire | Browser | RTC || Protocols | | Function|-----------> | | || | | || | +---------+| +---------------------|------+ | V Native OS Services Figure 1: Browser Model 2. The Media Component Model It is our position that the functionality that manifests within the box be a media component model. In this model, the browser Rosenberg, et al. Expires August 12, 2011 [Page 4] Internet-Draft RTCWEB-fw February 2011 implements the necessary functionality to perform the real-time processing of media, starting from capture/render, through encapsulation in real-time transport protocols sent over the Internet. This functionality must be built into the browser, rather than within Javascript, due to its tight timing requirements and complexity. Furthermore, the functionality manifest as a set of loosely coupled components, each of which performs some aspect of the real-time processing. Each component has APIs which allow that component to be configured (with sensible defaults where appropriate), along with APIs that allow applications to gather information and statistics about the performance of that module. The modules would include the codec itself, the acoustic echo canceller (AEC), the jitter buffer, audio and video pre-processing modules, and network transport components (including encryption and integrity protection of media) which speak specific transport protocols (such as the Real-Time Transport Protocol (RTP)). The media component model is purposefully minimalistic. It opts for maximizing the functionality that lives outside of the browser itself - within Javascript or servers. In particular, only functionality which is real-time - which cannot be done using Javascript or server functionality - resides within the browser itself. As explained in Section 5, this facilitates innovation, differentiation, and development velocity - all of the key characteristics that have made the web what it is. As an example, a codec component implementing Opus [I-D.ietf-codec-opus] might be represented by a Javascript object with properties that mirror the configuration settings of the codec itself - the sample rate (one of narrowband, mediumband, wideband or super-wideband), the packet rate (number of frames per packet), the bitrate (which can vary between 6 and 40kbps), a slider that adjusts the packet loss resilience, a Boolean which indicates whether inband FEC should be used, and another Boolean which indicates whether to apply silence suppression. Of course, all of these parameters might have reasonable defaults so that non-expert programmers can just make it work. However, an advanced programmer could force a mode or change a setting as needed. After all, the Opus codec itself makes these parameters tunable exactly because there is no one right value; the correct setting depends on the application scenario and needs of the developer. 3. The Role of Signaling It is our view that signaling is accomplished using a combination of existing client-server web protocols (HTTP, COMET, and websockets) and standards-based server-to-server protocols, such as SIP. A view Rosenberg, et al. Expires August 12, 2011 [Page 5] Internet-Draft RTCWEB-fw February 2011 of the "browser RTC Trapezoid" is shown in Figure 2. +-----------+ +-----------+ | Web/ | | Web/ | | SIP | SIP | SIP | | |-------------| | | Server | | Server | | | | | +-----------+ +-----------+ / \ / \ Proprietary over / \ HTTP/Websockets / \ / Proprietary over \ / HTTP/Websockets \ / \ +-----------+ +-----------+ |JS/HTML/CSS| |JS/HTML/CSS| +-----------+ +-----------+ +-----------+ +-----------+ | | | | | | | | | Browser | ------------------------- | Browser | | | Media | | | | | | +-----------+ +-----------+ Figure 2: Browser RTC Trapezoid In this example, a call is placed between two different providers. They use a SIP-based interface to federate between them. However, each of their respective browser-based clients signals to its server using proprietary application protocols built ontop of HTTP and Websockets. For example, provider A might offer simple calling services, and have a very simple web services interface for placing calls: http://calling.providerA.com/call?target=joe@providerB.com&myIP=1.2.3.4:4476 Which takes only the called party and local IP/port as arguments. Provider A's server infrastructure - some combination of web and SIP servers built in any way it likes - uses the identity of the target, along with previously-known information on the capabilities of the caller's browser learned through a web-services registration, to Rosenberg, et al. Expires August 12, 2011 [Page 6] Internet-Draft RTCWEB-fw February 2011 generate a SIP INVITE. This arrives at provider B's server infrastructure, which alerts its browser-based client of the incoming call. Provider B might be an enterprise service provider, and offer much richer features and signaling. Provider B uses a websocket interface to the browser, providing it the identity of the caller, the list of available codecs, and so on. B's service provider offers web-services based APIs for answering the call, declining it, sending to voicemail, redirecting to another number, parking it, and so on. APIs within the browser allow each side to instruct the browsers to send media, including selection of media types and codecs. In this model, there is no SIP in the browser. It is our view that SIP has no place within the browser. SIP is an application protocol - providing call setup, registration, codec negotiation, chat and presence, amongst other features. For each and every new feature that is desired to run between a SIP client and a SIP server, a new standard must be defined and then implemented. The feature set is indeed vast, considering the wealth of potential endpoints, ranging from simple consumer "voice only" clients, to richer videophones, to voice and video multiparty conferencing (including content sharing), to low-end enterprise phones, to high end executive admin phones, to contact centers endpoints, and beyond. Each of those requires more and more SIP extensions in order to function. This has resulted in a growing number of specifications, with diminishing returns of interoperability and feature velocity. As an example, the BLISS working group in IETF was formed to tackle some basic business phone features - including line sharing, park, call queuing, and automated call handling. Each of these individual features requires one or more specifications, and needs to be designed to meet the needs of all of the participants in the process. There are two important consequences of this. First, the requirement of standardization acts as a huge deterrent to innovation. Indeed, in many ways, it is anathema to the very notion of how the web is supposed to work. In the web model, the provider can define arbitrary content to render to users, craft arbitrary UI, and define arbitrary messaging from the browser back to the server, all without standardization or change to the web browser. Google does not need to wait for the browsers to implement IMAP in order to provide mail service. Facebook does not need the browser to have XMPP or SIP to enable presence and instant messaging. Why is call processing any different? Why should Skype or any other real-time communications provider be constrained by standardized application protocols? Each provider should be able to design and innovate what it needs, and not be constrained by the functionalities of the application protocols burned into the browser. Rosenberg, et al. Expires August 12, 2011 [Page 7] Internet-Draft RTCWEB-fw February 2011 While it is true that standardization will be required in order to extend these features between domains, that standardization process can be the successor - not the predecessor - to successsful deployment and usage of the feature within a domain. Furthermore, many features and services do not need to be extended between domains. Many of the BLISS features are good examples of this. Inclusion of SIP in the browser for client to server signaling will also harm interoperability. Unfortunately, SIP interoperability betweend endpoints and servers has been relatively poor; working only for basic call setup, teardown, and basic features. Important concepts like configuration remain poorly standardized and almost never interoperate. The web has certainly had interoperability problems, but the nature of those problems is different. In the web, content providers often need to code differently for different browsers, but at least they can deliver their application functionality. On the other hand, with SIP phones, many cases features simply do not and cannot work, and this cannot be resolved through software development on behalf of the SIP provider. Interoperability is improved when there are fewer standards and not more. Instead of adding SIP and its extensions to the browser, application providers can use the tools that are already there - HTTP and websockets, and then define whatever signaling functions they desire ontop, without interoperability consequences. Make no mistake - SIP remains important as a glue between service providers, and between server infrastructure within service provider networks. However, in a web context, there is simply no need for SIP support in the browser. 4. The Role of Media Transport Unlike signaling, media transport does need to be in the browser, for two important reasons: 1. It operates in real-time and does not fit well with the programming model of Javascript 2. It needs to flow between endpoints directly - over UDP - in order to achieve low latency, and therefore requires standardization in order to interoperate with other providers or endpoints The second point is important. Unlike most other web protocols, real-time media needs to be sent from the browser client to recipients other than the origin server or domain from which the web content came from. This is essential for ensuring low latency operations - one of the key metrics of quality in Voice over IP Rosenberg, et al. Expires August 12, 2011 [Page 8] Internet-Draft RTCWEB-fw February 2011 systems. In some cases, the recipient will be another browser endpoint from the same provider. However, it could be a desktop client or mobile client from the same provider, or as shown in Figure 2, it could be a browser endpoint or desktop endpoint from another service provider. In all cases, a direct connection - indeed a direct UDP connection - is important whenever possible. From a security perspective however, the browser cannot just have an API that tells it to send arbitrary UDP datagrams or even standardized-format voice (or worse - video) media packets to an arbitary IP address. The former introduces the opportunity for malicious JavaScript to craft packets that mimick other application protocols and send them to arbitrary endpoints (for example, an enterprise SNMP server). The latter would introduce a substantial opportunity for denial-of-service attacks. Malicious Javascript could tell the browser to "spam" an unwitting recipient with high bandwidth video. In the voice literature, this is referred to as the voice hammer attack [RFC5245]. In existing voice systems, this attack is possible but not likely due to the closed nature of most of the software and systems. In a web environment, where all it takes is one line of malicious Javascript, the attack becomes almost a certainty. To avoid this attack, a simple handshake can be utilized. The browser should support a simple STUN-based [RFC5389] connection handshake. The exchange of the STUN transaction ID prior to transmission of media prevents the attack. 5. Benefits of the Media Component Model There are several important benefits of the media component model proposed here. 5.1. Enabling Innovation One of the reasons why the Web has been successful as a user interface platform is the short turn-around time to deploy new versions of web-based services. Often, these new versions are experiments that vary small details which are important to make the service successful. It is the fine granularity of user interface elements in HTML and related technologies that allow this experimentation with details. As there is no agreed-upon configuration of real-time audio/video communication technologies that always delivers the best result, we think that it is essential to give the application developers the same benefit of short turn- around time and ability to experiment with details. Therefore, the real-time communication primitives offered by user agents to web Rosenberg, et al. Expires August 12, 2011 [Page 9] Internet-Draft RTCWEB-fw February 2011 applications/services should be fine-grained enough to allow for enhanced configurations and possibly new scenarios. Also, these interfaces to the primitives should allow gathering real-world data in enough detail on how the primitives are operating, to enable the feedback loop of deploy-measure-reconfigure-redeploy. One of the areas where perhaps the most innovation can be expected is signaling - one only needs to look at the plethora of standards around SIP. Proposing user-agent vendors to implement all these standards is a sure way to make the common denominator across user agents marginal. Instead, the browser already has a programmability model (JavaScript) that can handle all these use cases, and more, provided the programming environment has access to the underlying media components as we propose here. Drawing again parallels from user interface development, there is an undecided problem of what should be executed by the user agent, and what by the web servers (e.g. validation). Similar gray boundary between the client and the server exists in the field of real-time communications. Therefore we propose to leave standardization of signaling out of scope for this activity, and let the web service providers define signaling as they see fit. 5.2. The Importance of Flexibility There are obviously tradeoffs between built-in functionality and programmability. It is often tempting to provide the web page author with a simple and relatively inflexible way of expressing their intent so as to minimize the page author's effort and accelerate adoption. As an example, the "" tag was adopted much more rapidly than it would have been if blinking text could have only been implemented by writing a JavaScript timer task to manipulate the DOM style objects. On the other hand, such built-in functionality comes at two important costs. First, each browser implementation must implement the functionality, and the more which is moved from JavaScript to built-in functionality the more code must be present for that implemention. Second, and more important, the page author is now restricted to the subset of functionality which is provided by these browser implementations. The "