Network Working Group                                        Lingli Deng
Internet Draft                                                  Jin Peng
Intended status: Informational                              China Mobile
Expires: January 2013                                       July 3, 2012

         Sender Media Control based on Local Status Detection
                  draft-deng-rtcweb-svccontrol-00.txt

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This Internet-Draft will expire on January 3, 2013.

Copyright Notice

Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Peng, et al.            Expires January 3, 2013                 [Page 1]
Internet-Draft                svc-control                      July 2012

Abstract

This document proposes adding sender media control, based on local status detection by the browser, to further reduce the media bandwidth consumption of multiparty video conferencing with an SVC codec in RTCWEB use cases.

Table of Contents

1. Introduction
2. Problem Statement
   2.1. Existing Proposals in [use-case-draft]
      2.1.1. Receiver-harvested SVC-stream (SVC scheme)
      2.1.2. Receiver-selected multi-stream (Multi-stream scheme)
      2.1.3. Receiver-transcoded HD stream (transcoding scheme)
   2.2. Analysis of existing proposals
3. Enhanced SVC scheme with sender control
   3.1. Overview of the core idea
   3.2. A simple example
   3.3. Extensions to the simple example
   3.4. Applicable Scenarios
4. Derived Requirements
5. Security Considerations
6. IANA Considerations
7. References
   7.1. Normative References
   7.2. Informative References

1. Introduction

As an important RTCWEB use case, multiparty video conferencing is discussed in depth in [use-case-draft], which states the requirement that the receiving party (a participant in the peer-to-peer use case, or the centralized mixing server in the server-based use case) renders the remote video input based on the audio input status of the sending party, i.e. whether that user is the spokesman of the conference at the moment.

Assume an organization needs to hold a multiparty video conference through a video communication system and wants the peers involved to be connected directly, point to point, for the duration of the session. Each conference participant can send audio and video media streams to the other conference participants.
When a participant receives the video and audio streams from the others, he wants the conference spokesman to be presented in a high-definition large window in the middle, and the other participants (including himself, as local playback) in a group of small windows at the side. The spokesman changes frequently as the conference proceeds, yet every participant always wants the video of the current spokesman in a high-definition large window in the local browser and the video of the ordinary participants in small windows. Given the traffic overhead of peer-to-peer video conferencing and the traffic restrictions of terminals, how can the traffic cost of the conference be minimized without degrading the way participants are presented?

To satisfy the requirement that each participant sees one high resolution video in a large window and a number of low resolution videos in smaller windows, [use-case-draft] provides three solutions:

1. The sender sends a high resolution SVC stream, and the receiver/server selects part or all of it for presentation.

2. The sender sends both a high resolution and a low resolution stream, and the receiver selects one to present.

3. The sender sends a high resolution stream, and the server/receiver transcodes it into low or high resolution streams as required.

This document proposes adding sender media control based on determining whether the local participant is speaking. On the sending side, only the current spokesman sends the high resolution media stream, and the other participants send the ordinary media stream. On the receiving side, a participant receives and presents only the HD media stream from the spokesman and the ordinary media streams from the others in the conference. The bandwidth consumption of both sender and receiver is thereby further reduced.

2. Problem Statement

2.1. Existing Proposals in [use-case-draft]

[use-case-draft] provides three solutions to the window resizing requirement in multiparty conferencing scenarios. Despite the differences elaborated below, they share a common nature: they rely on the receiver to do the work, while the sender blindly offers HD media all the time.

2.1.1. Receiver-harvested SVC-stream (SVC scheme)

During a video conference, the sender uses SVC coding and sends SVC HD media streams (base layer plus extended layer), and the receiver/server selects what to present based on the requirements. For example, it presents the media stream from the current conference spokesman in a high resolution full window and shows only part of the media stream (the base layer) from ordinary participants in small windows.

2.1.2. Receiver-selected multi-stream (Multi-stream scheme)

During a video conference, the sender sends both a high resolution and a low resolution media stream, and the receiver chooses one to present in the local browser according to the requirements. For example, the receiver presents the high resolution media stream from the current conference spokesman and shows the low resolution media stream from ordinary participants.

2.1.3. Receiver-transcoded HD stream (transcoding scheme)

During a video conference, the sender sends a high resolution media stream, and the server/receiver transcodes it as required. For example, the receiver transcodes the media stream received from ordinary participants into a low resolution media stream and then presents that stream in the local browser.

2.2. Analysis of existing proposals

The three existing schemes described above can be analyzed along two dimensions: transmission overhead and local processing cost.
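
As a rough illustration of the transmission dimension, the sketch below counts the media streams that one participant handles under each of the three existing schemes. It assumes a full-mesh peer-to-peer conference of n participants and counts an SVC stream carrying base plus extended layers as one high resolution stream. The names StreamCount and per_participant_streams are hypothetical and do not come from [use-case-draft].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamCount:
    """Streams handled by ONE participant (hypothetical helper type)."""
    hd_sent: int
    hd_received: int
    low_sent: int
    low_received: int

    @property
    def hd_total(self) -> int:
        # High resolution streams sent plus received.
        return self.hd_sent + self.hd_received


def per_participant_streams(scheme: str, n: int) -> StreamCount:
    """Count streams for one participant in an n-party full-mesh conference.

    Assumption: every participant sends to, and receives from, each of the
    other n - 1 participants directly (no mixing server).
    """
    if n < 2:
        raise ValueError("a conference needs at least two participants")
    peers = n - 1
    if scheme in ("svc", "transcoding"):
        # The sender always offers HD, whether or not the user is speaking.
        return StreamCount(peers, peers, 0, 0)
    if scheme == "multi-stream":
        # HD plus a separate low resolution stream to and from every peer.
        return StreamCount(peers, peers, peers, peers)
    raise ValueError(f"unknown scheme: {scheme!r}")


if __name__ == "__main__":
    for name in ("svc", "multi-stream", "transcoding"):
        print(name, per_participant_streams(name, 4))
```

For n = 4, the SVC and transcoding schemes each give 3 + 3 = 6 high resolution streams per participant, and the multi-stream scheme adds 3 low resolution streams in each direction on top of that.
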
First, in terms of high resolution transmission cost, the multi-stream scheme is the least attractive. Under the SVC and transcoding schemes, a user's browser sends the high resolution media stream captured by the local devices whether or not the user is speaking. If a peer-to-peer conference has N participants, each participant therefore handles 2*(N-1) high resolution media streams: N-1 high-definition video streams sent to the remote peers and N-1 high-definition video streams received from them. The multi-stream scheme adds the transmission overhead of the additional low resolution media streams on top of this.

Second, we analyze the computation overhead from the perspectives of the sender and the receiver, respectively.

For the sender, the transcoding scheme incurs the lowest cost for encoding the outgoing media stream. By contrast, the multi-stream scheme, which requires two versions of the local media stream to be encoded and sent separately, incurs the highest cost. The encoding cost for a sender under the SVC scheme lies between the two.

For the receiver, the transcoding scheme requires real-time transcoding and therefore places the highest computation overhead on the receiving side (a mixing server or a participating peer). To switch to and from HD in time, a receiver in the multi-stream scheme needs to keep the high and low resolution media streams synchronized, at considerable cost. The SVC scheme adjusts the processing load by adding or removing the extension layer sub-stream data, with minimal extra cost.

3. Enhanced SVC scheme with sender control

3.1. Overview of the core idea

The core idea of this proposal is:

o Both ends use SVC and negotiate with each other to establish a peer-to-peer media stream.

o The sender's browser detects the call status of the local user by calling local devices, for example by detecting sustained voice input.

o The browser adjusts its sending policy for the different coding layers according to the detected local status:

   o If the local user is the spokesman, send the base layer and extended layer media streams.

   o If the local user is in a quiescent state, send only the base layer media stream.

o The receiver's browser has the ability to call the local devices, and presents the decoded input media stream according to SVC.

3.2. A simple example

In a multiparty video conferencing session (three users participate in this illustrative example), the workflow shown in Figure 1 proceeds as follows:

              A               B               C
              |               |               |
              |               |               |
              |<=============>|<=============>|  A/B/C peer-to-
              |<=============================>|  peer connection
              |               |               |
              |               |               |
Detect the +--| Detect the +--| Detect the +--|
local user |  | local user |  | local user |  |
status     +->| status     +->| status     +->|
              |               |               |
        +------------+  +------------+  +------------+
        | Continuous |  |   Silent   |  |   Silent   |
        |speech input|  |   status   |  |   status   |
        +------------+  +------------+  +------------+
              |               |               |
              |               |               |
              | Base layer +  |               |
              |extended layer |               |
              |-------------->|               |
              |               |               |
              | Base layer + extended layer   |
              |------------------------------>|
              |               |  Base layer   |
              |               |-------------->|
              |               |               |
              |  Base layer   |               |
              |<--------------|               |
              |               |               |
              |               |               |
              |               |  Base layer   |
              |               |<--------------|
              |               |               |
              |          Base layer           |
              |<------------------------------|
              |               |               |
              |               |               |
Detect the +--| Detect the +--| Detect the +--|
local user |  | local user |  | local user |  |
status     +->| status     +->| status     +->|
              |               |               |
        +------------+  +------------+  +------------+
        | Continuous |  |   Silent   |  |   Silent   |
        |speech input|  |   status   |  |   status   |
        +------------+  +------------+  +------------+
              |               |               |
        +------------+  +------------+  +------------+
        | Present A  |  | Present A  |  | Present A  |
        |(Large Size)|  |(Large Size)|  |(Large Size)|
        +------------+  +------------+  +------------+
        +------------+  +------------+  +------------+
        | Present B  |  | Present B  |  | Present B  |
        |(Small Size)|  |(Small Size)|  |(Small Size)|
        +------------+  +------------+  +------------+
        +------------+  +------------+  +------------+
        | Present C  |  | Present C  |  | Present C  |
        |(Small Size)|  |(Small Size)|  |(Small Size)|
        +------------+  +------------+  +------------+
              |               |               |
              |               |               |
Detect the +--| Detect the +--| Detect the +--|
local user |  | local user |  | local user |  |
status     +->| status     +->| status     +->|
              |               |               |
        +------------+  +------------+  +------------+
        |   Silent   |  | Continuous |  |   Silent   |
        |   status   |  |speech input|  |   status   |
        +------------+  +------------+  +------------+
              |               |               |
              |               |               |
              |  Base layer   |               |
              |-------------->|               |
              |               |               |
              |          Base layer           |
              |------------------------------>|
              |               |               |
              |               |               |
              |               | Base layer +  |
              |               |extended layer |
              |               |-------------->|
              |               |               |
              | Base layer +  |               |
              |extended layer |               |
              |<--------------|               |
              |               |               |
              |               |               |
              |               |  Base layer   |
              |               |<--------------|
              |               |               |
              |          Base layer           |
              |<------------------------------|
              |               |               |
              |               |               |
Detect the +--| Detect the +--| Detect the +--|
local user |  | local user |  | local user |  |
status     +->| status     +->| status     +->|
              |               |               |
        +------------+  +------------+  +------------+
        |   Silent   |  | Continuous |  |   Silent   |
        |   status   |  |speech input|  |   status   |
        +------------+  +------------+  +------------+
              |               |               |
        +------------+  +------------+  +------------+
        | Present A  |  | Present A  |  | Present A  |
        |(Small Size)|  |(Small Size)|  |(Small Size)|
        +------------+  +------------+  +------------+
        +------------+  +------------+  +------------+
        | Present B  |  | Present B  |  | Present B  |
        |(Large Size)|  |(Large Size)|  |(Large Size)|
        +------------+  +------------+  +------------+
        +------------+  +------------+  +------------+
        | Present C  |  | Present C  |  | Present C  |
        |(Small Size)|  |(Small Size)|  |(Small Size)|
        +------------+  +------------+  +------------+
              |               |               |
              |               |               |

            Figure 1: A/B/C tripartite session flowchart

First, the browser of session participant A establishes peer-to-peer media stream connections with the other participants' browsers using the SVC codec. The browsers then use the capabilities of the local devices to monitor the audio input status of the local user. Since A is the spokesman at the beginning, A's browser detects continuous audio input from the local user and sends both the base and extended layers of the SVC sub-streams to the others (B and C). In the meantime, B and C, as listeners to A, remain silent.
Therefore, B's browser, having detected the silent status by calling the local devices, sends only the SVC base layer media stream to the others (A and C). Similarly, C sends only the base layer to A and B.

The spokesman changes from one participant to another as the conference continues. For example, when A's browser detects a silent status from the local devices, A switches to sending only the base layer to the others (B and C). Conversely, when B's browser detects continuous audio input from the local user, B changes to sending both the base and extended layers to the others (A and C). As C detects no change locally, C continues sending only the base layer to A and B.

3.3. Extensions to the simple example

The sender-controlled SVC transmission mode described above, driven by local audio input status detection, is a simple example of a more general sender media control scheme. In the general scheme, the types of status transition used as control triggers, the mechanism that enforces the media adjustment afterwards, and whether the JS API is given the ability to influence the media behavior may all vary.

In particular, a few possible extensions are:

1. Options for implementing send-side state detection include:

   a. to make use of the DTX module at the voice codec level and decide on video mode switching in accordance with a given strategy (note that the strategy for video mode switching, i.e. whether local high resolution presentation is needed, can differ from the strategy of the voice codec, i.e. silence versus voice signals), or

   b. to infer the user presence status indirectly, by monitoring whether the local audio input is muted from the web page, whether the conferencing page is currently active, etc.

2. Options for implementing send-side media control include:

   a. to adjust the definition of the local camera, or

   b. to adjust the bitrate of the codec in use (for instance, the layer composition in the case of SVC codecs), or

   c. to trigger a re-negotiation for a new codec to replace the one in use.

3. Options for implementation on clients include:

   a. to have the local browser alone detect the transitions and conduct media control, or

   b. to have the browser offer JS callback APIs for both the defined state transition events and media control, so that the JS application can supply SP-specific control behavior.

3.4. Applicable Scenarios

For simplicity, the fully distributed scenario was used as the example to elaborate this proposal. Note that the proposed scheme is also applicable, and beneficial, in a centralized mixer-based conferencing setting.

In a fully distributed P2P conference, without any centralized mixing server on the media plane, each participating node performs the modulation of the video streams locally. As mentioned above, applying this proposal reduces the transmission overhead of both send-side and receive-side nodes, and also reduces the processing and modulation overhead of the receive-side nodes.

In a mixing server-based conference, the dedicated mixer server receives the participants' locally captured video streams and renders a uniform collective screen for display. Applying this proposal significantly reduces the transmission overhead of uploading local video to the server, thus saving the bandwidth of the server as a centralized sink.

4. Derived Requirements

In this proposal, the video conferencing client detects the session state of the local user (speaking/silent) and saves media plane transmission overhead by adjusting the sending rate of the video media stream. To realize this in an RTCWEB setting, the following additional requirements are derived:

1. Function Requirement for Browser

   Fxx: The browser SHOULD be able to detect the audio input status (speaking/silent) of the local user.

2. API Requirements for Browser

   Axx: It SHOULD be possible for the JS to be notified about the audio input status (speaking/silent) of the local user, and to tailor the media control behavior in response.

5. Security Considerations

TBA

6. IANA Considerations

None.

7. References

7.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2234] Crocker, D. and Overell, P. (Editors), "Augmented BNF for Syntax Specifications: ABNF", RFC 2234, Internet Mail Consortium and Demon Internet Ltd., November 1997.

7.2. Informative References

[use-case-draft] Holmberg, C., Hakansson, S. and Eriksson, G., "Web Real-Time Communication Use-cases and Requirements", draft-ietf-rtcweb-use-cases-and-requirements-09 (work in progress), June 27, 2012.

Authors' Addresses

Lingli Deng
China Mobile
Email: denglingli@chinamobile.com

Jin Peng
China Mobile
Email: pengjin@chinamobile.com