Internet DRAFT - draft-deng-rtcweb-svccontrol

draft-deng-rtcweb-svccontrol














    Network Working Group                                      Lingli Deng   
    Internet Draft                                                Jin Peng
    Intended status: Informational                            China Mobile
    Expires: April 2013                                   October 15, 2012



                Sender Media Control based on Local Status Detection
                        draft-deng-rtcweb-svccontrol-01.txt


    Status of this Memo

       This Internet-Draft is submitted in full conformance with the
       provisions of BCP 78 and BCP 79.

       Internet-Drafts are working documents of the Internet Engineering
       Task Force (IETF), its areas, and its working groups.  Note that
       other groups may also distribute working documents as Internet-
       Drafts.

       Internet-Drafts are draft documents valid for a maximum of six
       months and may be updated, replaced, or obsoleted by other documents
       at any time.  It is inappropriate to use Internet-Drafts as
       reference material or to cite them other than as "work in progress."

       The list of current Internet-Drafts can be accessed at
       http://www.ietf.org/ietf/1id-abstracts.txt

       The list of Internet-Draft Shadow Directories can be accessed at
       http://www.ietf.org/shadow.html

       This Internet-Draft will expire on April 24, 2009.

    Copyright Notice

       Copyright (c) 2012 IETF Trust and the persons identified as the
       document authors. All rights reserved.

       This document is subject to BCP 78 and the IETF Trust's Legal
       Provisions Relating to IETF Documents
       (http://trustee.ietf.org/license-info) in effect on the date of
       publication of this document. Please review these documents
       carefully, as they describe your rights and restrictions with
       respect to this document.

    Abstract




    Deng, et al.             Expires April 15, 2013                 [Page 1]
     








    Internet-Draft               svc-control                   October 2012


       This document proposes to add sender media control based on local
       status detection by the browser, to further reduce multiparty video
       conferencing's media bandwidth consumption with SVC codec for RTCWEB
       use-cases.

    Table of Contents


       1. Introduction ................................................. 2
       2. Problem Statement ............................................ 3
          2.1. Existing Proposals in [use-case-draft] .................. 3
             2.1.1. Receiver-harvested SVC-stream (SVC scheme) ......... 3
             2.1.2. Receiver-selected multi-stream (Multi-stream scheme) 4
             2.1.3. Receiver-transcoded HD stream (transcoding scheme) . 4
          2.2. Analysis of existing proposals .......................... 4
       3. Enhanced SVC scheme with sender control ...................... 5
          3.1. Overview of the core idea ............................... 5
          3.2. A simple example ........................................ 5
          3.3. Extensions to the simple example ....................... 11
          3.4. Applicable Scenarios ................................... 12
       4. Derived Requirements......................................... 14
       5. Security Considerations ..................................... 14
       6. IANA Considerations ......................................... 14
       7. References .................................................. 14
          7.1. Normative References ................................... 14
          7.2. Informative References ................................. 15

    1. Introduction

       As an important use-case of RTCWEB, multiparty video conferencing
       has been discussed in depth in [use-case-draft], which states the
       requirement for the receiving party (either a participant in the p2p
       use-case or the centralized mixing server use-case) tend to render
       the remote video input based on the audio input status (i.e. whether
       the user is the spokesman of the conference at this moment) of the
       sending party.

       Assume an organization needs to establish a multiparty video
       conference through a video communication system and hopes to be
       connected point to point directly among the peers involved in the
       process of the video conference session. Each conference participant
       can send audio and video media stream to the other conference
       participants. When one session participant receives the video and
       audio stream from the others, he wants the conference spokesman to
       be presented in a high-definition large window in the middle while
       the other conference participants (including himself as local
       playback) to a group of small windows by the side. The spokesman


    Deng, et al.             Expires April 24, 2013                 [Page 2]
        








    Internet-Draft               svc-control                   October 2012


       changes frequently from one to another with conducting of the
       conference, while the participants always want to present the video
       stream of the current conference spokesman in a high-definition
       large window in the local browser and would like to present the
       video stream of ordinary participants in small windows.

       Taking into account the peer-to-peer video conferencing traffic
       overhead and terminal traffic restrictions, how can we minimize the
       cost of the video conferencing traffic overhead without affecting
       the effects of the participants' presentation in video conference?
       To satisfy the requirement that for each participant one high
       resolution video is displayed in a large window, while a number of
       low resolution videos are displayed in smaller windows, [use-case-
       draft] provides three solutions:

       1. The sender sends the high resolution SVC stream and the
          receiver/server selects part or full presentation.

       2. The sender sends both high resolution and low resolution stream,
          and the receiver selects one to present.

       3. The sender sends a high resolution stream, the server/receiver
          transcodes into low/high resolution streams as required.

          This proposal proposes to add sender media control based on the
       determination on whether the local participant is speaking. For the sender,
       high resolution media stream is only sent by the current spokesman, and the
       ordinary media stream is sent by the other participants. For the receiver,
       only to receive and present the HD media stream from the spokesman while
       the ordinary media stream from the others in the conference. The
       consumption of both the sender and receiver thus is further reduced in this
       way.

    2. Problem Statement

    2.1. Existing Proposals in [use-case-draft]

       [use-case-draft] provides three solutions to the window resizing
       requirement for multiparty conferencing scenarios. Despite of their
       differences elaborated in the following, they share a common nature that
       they rely on the receiver to do the trick, while the sender blindly offers
       HD media all the time.

    2.1.1. Receiver-harvested SVC-stream (SVC scheme)

          During a video conference, the sender uses the SVC coding, sends SVC HD
       media streams (base layer plus extended layer) and the receiver / server


    Deng, et al.             Expires April 24, 2013                 [Page 3]
        








    Internet-Draft               svc-control                   October 2012


       selects the presentation based on the requirements. For example, present
       the media stream from the current conference spokesman in a high resolution
       full window and show part media stream (only base layer) from ordinary
       participants in small windows.

    2.1.2. Receiver-selected multi-stream (Multi-stream scheme)

          During a video conference, the sender sends both high resolution and low
       resolution media stream, the receiver choose one to present in the local
       browser according to requirements. For example, the receiver presents the
       high resolution media stream from the current conference spokesman and
       shows low resolution media stream (only base layer) from ordinary
       participants.

    2.1.3. Receiver-transcoded HD stream (transcoding scheme)

          During a video conference, the sender sends high resolution media stream,
       and the server /receiver transcodes as requirement. For example, the
       receiver transcodes the received media stream from ordinary participants
       into low media stream and then present the stream in local browser.

    2.2. Analysis of existing proposals

          We can analyze the existing three schemes mentioned above from the two
       dimensions of the flow transmission overhead and local processing consume.

          Firstly, from the perspective of high resolution traffic transmission
       consume, multi-stream seems the last thing to do. In terms of the SVC and
       transcoding schemes, whether the user is speaking, his local browser will
       send the high resolution media stream collected by local devices. If the
       number of participants in the video conference based on the style of peer-
       to-peer interaction is N, there would be 2*(N-1) high resolution media
       stream resource consume needed in the transmission (including: the number
       of high-definition video media stream sent by the local participants is N-1;
       the number of high-definition video media stream received by the remote
       peers is N-1, too). While in the multi-stream scheme, transmission overhead
       is raised for additional low resolution media streams.

          Secondly, we analyze their computation overheads from the perspectives
       of the sender and receiver, respectively.

          For the sender, the transcoding scheme incurs the lowest cost to
       encoding the outgoing media stream. On the contrary, the multi-stream
       scheme, which dictates that two versions of the local media stream be
       encoded and sent separately, incurs the highest cost. The encoding cost for
       a local sender to perform SVC scheme is between the above two.



    Deng, et al.             Expires April 24, 2013                 [Page 4]
        








    Internet-Draft               svc-control                   October 2012


          For the receiver, the transcoding scheme needs to do real-time
       transcoding and therefore takes the highest computation overheads to the
       receiver side (mixing server or a participating peer). In order to switch
       from and to HD timely, a receiver in multi-stream scheme needs to
       synchronize both high and low resolution media stream with considerable
       consumption. While SVC scheme adjusts the processing consume by overlaying
       / removing the extension layer sub-stream data with a minimum extra cost.

    3. Enhanced SVC scheme with sender control

    3.1. Overview of the core idea

       The core idea of this proposal is:

       o Both ends use SVC, negotiating with the other side to establish
          peer-to-peer media stream.

       o The sender's browser detects local user call status by calling
          some local devices, such as to detect the sustained voice input.

       o The browser adjusts sending policy of different coding levels
          according to the results of the detected local status:

            o If the local user is the spokesman, send base layer and
              extended layer media stream.

            o If the local user is in a quiescent state, just send the base
              layer media stream.

       o The receiver's browser has the ability to call the local devices,
          and presents the decoded input media stream according to SVC.

    3.2. A simple example

          In the multi-party video conferencing session (For example, in the
       illustrative example, three users participate the video conferencing), the
       workflow as shown in the figure goes as follows:











    Deng, et al.             Expires April 24, 2013                 [Page 5]
        








    Internet-Draft               svc-control                   October 2012




                       A               B               C

                       |               |               |

                       |               |               |

       A/B/C peer-to-  |<=============>|<=============>|

       peer connection |<=============================>|

                       |               |               |

                       |               |               |

         Detect the +--| Detect the +--| Detect the +--|

         local user |  | local user |  | local user |  |

           status   +->|   status   +->|   status   +->|

                       |               |               |

                +------------+  +------------+  +------------+

                | Continuous |  |   Slient   |  |   Slient   |

                |speech input|  |   status   |  |   status   |

                +------------+  +------------+  +------------+

                       |               |               |

                       |               |               |

                       | Base layer +  |               |

                       |extended layer |               |

                       |-------------->|               |

                       |               |               |

                       |  Base layer + extended layer  |

                       |------------------------------>|


    Deng, et al.             Expires April 24, 2013                 [Page 6]
        








    Internet-Draft               svc-control                   October 2012


                       |               |  Base layer   |

                       |               |-------------->|

                       |               |               |

                       |  Base layer   |               |

                       |<--------------|               |

                       |               |               |

                       |               |               |

                       |               |  Base layer   |

                       |               |<--------------|

                       |               |               |

                       |          Base layer           |

                       |<------------------------------|

                       |               |               |

                       |               |               |

         Detect the +--| Detect the +--| Detect the +--|

         local user |  | local user |  | local user |  |

           status   +->|   status   +->|   status   +->|

                       |               |               |

                +------------+  +------------+  +------------+

                | Continuous |  |   Slient   |  |   Slient   |

                |speech input|  |   status   |  |   status   |

                +------------+  +------------+  +------------+

                       |               |               |

                       |------------+  |------------+  |------------+


    Deng, et al.             Expires April 24, 2013                 [Page 7]
        








    Internet-Draft               svc-control                   October 2012


                       | Present A  |  | Present A  |  | Present A  |

                       |(Large Size)|  |(Large Size)|  |(Large Size)|

                       |------------+  |------------+  |------------+

                       |------------+  |------------+  |------------+

                       | Present B  |  | Present B  |  | Present B  |

                       |(Small Size)|  |(Small Size)|  |(Small Size)|

                       |------------+  |------------+  |------------+

                       |------------+  |------------+  |------------+

                       | Present C  |  | Present C  |  | Present C  |

                       |(Small Size)|  |(Small Size)|  |(Small Size)|

                       |------------+  |------------+  |------------+

                       |               |               |

                       |               |               |

         Detect the +--| Detect the +--| Detect the +--|

         local user |  | local user |  | local user |  |

           status   +->|   status   +->|   status   +->|

                       |               |               |

                +------------+  +------------+  +------------+

                |   Slient   |  | Continuous |  |   Slient   |

                |   status   |  |speech input|  |   status   |

                +------------+  +------------+  +------------+

                       |               |               |

                       |               |               |

                       |  Base layer   |               |


    Deng, et al.             Expires April 24, 2013                 [Page 8]
        








    Internet-Draft               svc-control                   October 2012


                       |-------------->|               |

                       |               |               |

                       |          Base layer           |

                       |------------------------------>|

                       |               |               |

                       |               |               |

                       |               | Base layer +  |

                       |               |extended layer |

                       |               |-------------->|

                       |               |               |

                       | Base layer +  |               |

                       |extended layer |               |

                       |<--------------|               |

                       |               |               |

                       |               |               |

                       |               |  Base layer   |

                       |               |<--------------|

                       |               |               |

                       |          Base layer           |

                       |<------------------------------|

                       |               |               |

                       |               |               |

         Detect the +--| Detect the +--| Detect the +--|

         local user |  | local user |  | local user |  |


    Deng, et al.             Expires April 24, 2013                 [Page 9]
        








    Internet-Draft               svc-control                   October 2012


           status   +->|   status   +->|   status   +->|

                       |               |               |

                +------------+  +------------+  +------------+

                |   Slient   |  | Continuous |  |   Slient   |

                |   status   |  |speech input|  |   status   |

                +------------+  +------------+  +------------+

                       |               |               |

                       |------------+  |------------+  |------------+

                       | Present A  |  | Present A  |  | Present A  |

                       |(Small Size)|  |(Small Size)|  |(Small Size)|

                       |------------+  |------------+  |------------+

                       |------------+  |------------+  |------------+

                       | Present B  |  | Present B  |  | Present B  |

                       |(Large Size)|  |(Large Size)|  |(Large Size)|

                       |------------+  |------------+  |------------+

                       |------------+  |------------+  |------------+

                       | Present C  |  | Present C  |  | Present C  |

                       |(Small Size)|  |(Small Size)|  |(Small Size)|

                       |------------+  |------------+  |------------+

                       |               |               |

                       |               |               |

                    Figure 1 A B C Tripartite session flowchart

          First of all, the session participant's browser of A, establishes a
       peer-to-peer media stream connection with the other participants' browsers
       with SVC codec. Subsequently, the browsers utilize local device's


    Deng, et al.             Expires April 24, 2013                [Page 10]
        








    Internet-Draft               svc-control                   October 2012


       capability to monitor the local user audio input status. Since A is the
       spokesman at beginning, A's browser detects a continuous audio input from
       the local user, A's browser sends both the base and extended layers of SVC
       sub-streams to the others(B and C). At the meantime, B and C, as the
       listeners to A, remain silent. Therefore, B's local browser sends only the
       SVC base layer media stream to the others (A and C) according to the result
       of silent status detected by calling the local devices. Similarly C sends
       only the base layer to A and B.

          The spokesman changes from one to another as the conference continues.
       For example, when A detects himself in a silent status from  the local
       devices, A switches to send only the base layer to the others (B and C).
       Conversely, when B detects a continuous audio input from the local user, B
       would change to send both the base and extended layers to the others (A and
       C). While C detects no changes locally, C would continue sending base layer
       to A and B.

    3.3. Extensions to the simple example

          The above description of a sender controlled SVC transmission modes
       based on local audio input status detection, is a simple example of a more
       generalized sender media control scheme, where the types of status
       transition as control triggers, the mechanism to enforce media adjustment
       afterwards, and whether to give JS API the ability to influence the media
       behavior, may vary accordingly. In particular, a few extensions would be in
       terms of:

       1. Options of implementation on send-side state detection, include:

            a. to make use of the DTX module at voice codec level, and
              decide the strategy of video mode switching in accordance
              with a given strategy (Note that the strategy of video mode
              switching (whether need for local high resolution
              presentation) can be different from the strategy of voice
              codec (silence / voice signals),

            b. to defer user presence status indirectly, by monitoring
              whether the local audio input is muted from the web page, or
              whether the conferencing page is currently active, etc.

       2. Options of implementation on send-side media control, include:

            a. to adjust the definition of the local camera, or

            b. to adjust the bitrate of a given codec in use (the layer
              composition in case of SVC codecs, for instance), or



    Deng, et al.             Expires April 24, 2013                [Page 11]
        








    Internet-Draft               svc-control                   October 2012


            c. to trigger a re-negotiation for a new codec instead of the
              one in use.

       3. Options of implementation on clients include:

            a. to enforce the transition detection and conduct media control
              singly by the local browser, or

            b. the browser offers related callback APIs for both defined
              state transition events and media controlling to JS, who may
              provide SP's entailed control behavior.

    3.4. Applicable Scenarios

          For simplicity, we used the fully distributed scenario as an example to
       elaborate our proposal. It should be noticed that the proposed scheme is
       also applicable and would bring benefit to a centralized mixer-based
       conferencing setting.

          In a fully distributed P2P conference, without any centralized mixing
       server at the media plane, each participating node processes the modulation
       of video stream locally. As mentioned above, the application of our
       proposal would reduce the transmission overhead of both send-side nodes and
       receive-side nodes, also reduces the processing and modulation overhead of
       the receive-side nodes.

          While in a mixing server-based conference, the dedicated mixer server
       receives the participants' local captured video stream, and renders a
       uniform rendered collective screen for display. The application of our
       proposal will significantly reduce the transmission overhead for uploading
       local video to the server, thus saving its bandwidth as a centralized sink.

    3.5. Simulation Results

          To further understand both the applicability and effectiveness of the
       proposed sender-control mechanism, we setup a set of simple experiments to
       simulate common P2P conferencing scenarios.

    3.5.1. Setup

          The experiment settings choices and reasoning behind are summarized as
       follows:

       1. Conferencing settings:





    Deng, et al.             Expires April 24, 2013                [Page 12]
        








    Internet-Draft               svc-control                   October 2012


            a. We simulate up to 6-way P2P conferencing, for according to the
              experience gained through a distributed conferencing services, we
              found that most conferences involves 3-4 different participants,
              and the current mainstream laptop supports at most 5-way
              conferencing using H.264 codec.

            b. We use Zipf distribution for a participants' speaking frequency and
              duration values, and differentiate one-to-mul, mul-to-mul, and
              roundtable conferencing, where the first stands for a lecture-like
              conferencing with a dedicated lecturer, while the second for a
              lecture without a dedicated lecturer, and the third for a free
              discussion. In other words, in the first two types of conferences
              there are dedicated listeners.

            c. We use Uniform distribution for the transmission rate between a
              specific pair of peers, and differentiate separate variables for
              HD(uniform(500,900)KBps) and FD(uniform(100,300)KBps) streams
              respectively.

       2. Triggering Timer settings:

            a. We evaluate the proposed scheme with variable triggering timer
              values, ranging from 0 to 30 seconds. In practice, in order to
              avoid frequent vibrations caused by aggressive switching for short-
              lasting activities, an implementation always has to tradeoff
              bandwidth reduction to traffic smoothness.

    3.5.2. Results

          We conduct repeated simulations and compare the mean traffic loads
       between the proposed scheme and the exiting SVC scheme. The findings are
       summarized as follows:

       1. The theoretical traffic reduction (with the triggering timer
          equals to zero) is considerable and ranges from X to Y for
          different conferencing types. For instance, in 3-way scenarios,
          the achieved traffic reduction is 60% for one-to-mul cases; 70%
          for mul-to-mul cases; and 50% for roundtable cases.

       2. Even with a reluctant triggering timer setting, the proposed
          scheme outperforms existing SVC scheme in various conferencing
          types. For instance, the longer timer up to 30 seconds does not
          impact the reduction ratio in a noticeable way.






    Deng, et al.             Expires April 24, 2013                [Page 13]
        








    Internet-Draft               svc-control                   October 2012


    4. Derived Requirements

          In our proposal, video conferencing client detects the local user
       session state (speaking/silence), achieving the purpose of saving media
       plane transmission overhead by adjusting the sending rate of video media
       stream. In order to realize it in a RTCWEB setting, additional requirements
       are derived as follows:

       1. Function Requirement for Browser

          Fxx: The browser SHOULD be able to detect the audio input status
       (speaking/ silent) of the local user.

       2. API Requirements for Browser

       Axx: It SHOULD be possible for the JS to be notified about the audio input
       status (speaking/silent) of the local user, and to entail the media control
       behavior in response.



    5. Security Considerations

       TBA

    6. IANA Considerations

       None.

    7. References

    7.1. Normative References

       [1]  Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

       [2]  Crocker, D. and Overell, P.(Editors), "Augmented BNF for
             Syntax Specifications: ABNF", RFC 2234, Internet Mail
             Consortium and Demon Internet Ltd., November 1997.

       [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
                 Requirement Levels", BCP 14, RFC 2119, March 1997.

       [RFC2234] Crocker, D. and Overell, P.(Editors), "Augmented BNF for
                 Syntax Specifications: ABNF", RFC 2234, Internet Mail
                 Consortium and Demon Internet Ltd., November 1997.



    Deng, et al.             Expires April 24, 2013                [Page 14]
        








    Internet-Draft               svc-control                   October 2012


    7.2. Informative References

       [use-case-draft] Holmberg, C., Hakansson, S. and Eriksson, G., "Web
                        Real-Time Communication Use-cases and
                        Requirements", 
                        draft-ietf-rtcweb-use-cases-and-requirements-09
                        (work in progress), June 27, 2012










































    Deng, et al.             Expires April 24, 2013                [Page 15]
        








    Internet-Draft               svc-control                   October 2012


    Authors' Addresses

       Lingli Deng
       China Mobile
       Email: denglingli@chinamobile.com

       Jin Peng
       China Mobile
       Email: pengjin@chinamobile.com






































    Deng, et al.             Expires April 24, 2013                [Page 16]