CLUE WG                                                       A. Romanow
Internet-Draft                                             Cisco Systems
Intended status: Informational                              M. Duckworth
Expires: January 4, 2012                                         Polycom
                                                            A. Pepperell
                                                              B. Baldino
                                                           Cisco Systems
                                                           M. Goryzinski
                                                 HP Visual Collaboration
                                                            July 3, 2011


                Framework for Telepresence Multi-Streams
                  draft-romanow-clue-framework-00.txt

Abstract

   This memo offers a framework for a protocol that enables devices in a
   telepresence conference to interoperate by specif;ying the
   relationships between multiple RTP streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 4, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect


Romanow, et al.          Expires January 4, 2012                [Page 1]

Internet-Draft         CLUE Telepresence Framework             July 2011


   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.


Romanow, et al.          Expires January 4, 2012                [Page 2]

Internet-Draft         CLUE Telepresence Framework             July 2011


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  5
   2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  6
   3.  Definitions  . . . . . . . . . . . . . . . . . . . . . . . . .  6
   4.  Two Necessary Functions  . . . . . . . . . . . . . . . . . . .  9
   5.  Protocol Features  . . . . . . . . . . . . . . . . . . . . . .  9
   6.  Stream Content . . . . . . . . . . . . . . . . . . . . . . . . 10
     6.1.  Media capture  . . . . . . . . . . . . . . . . . . . . . . 10
     6.2.  Attributes . . . . . . . . . . . . . . . . . . . . . . . . 11
     6.3.  Capture Set  . . . . . . . . . . . . . . . . . . . . . . . 12
   7.  Choosing Streams . . . . . . . . . . . . . . . . . . . . . . . 13
     7.1.  Physical Simultaneity  . . . . . . . . . . . . . . . . . . 14
     7.2.  Encoding Groups  . . . . . . . . . . . . . . . . . . . . . 15
       7.2.1.  Sample video encoding group specification #1 . . . . . 17
       7.2.2.  Sample video encoding group specification #2 . . . . . 18
   8.  Media provider behavior  . . . . . . . . . . . . . . . . . . . 19
   9.  Putting it together - using the Capture Set  . . . . . . . . . 19
   10. Media consumer behaviour . . . . . . . . . . . . . . . . . . . 22
     10.1. One screen receiver configuring the example
           capture-side device above  . . . . . . . . . . . . . . . . 23
     10.2. Two screen receiver configuring the example
           capture-side device above  . . . . . . . . . . . . . . . . 23
     10.3. Three screen receiver configuring the example
           capture-side device above  . . . . . . . . . . . . . . . . 24
     10.4. Configuration of sender streams by a receiver  . . . . . . 24
     10.5. Advertisement of capabilities sent by receiver to
           sender . . . . . . . . . . . . . . . . . . . . . . . . . . 24
   11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 25
   12. IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 25
   13. Security Considerations  . . . . . . . . . . . . . . . . . . . 25
   14. Informative References . . . . . . . . . . . . . . . . . . . . 25
   Appendix A.  Attributes  . . . . . . . . . . . . . . . . . . . . . 26
     A.1.  Purpose  . . . . . . . . . . . . . . . . . . . . . . . . . 26
       A.1.1.  Main . . . . . . . . . . . . . . . . . . . . . . . . . 26
       A.1.2.  Presentation . . . . . . . . . . . . . . . . . . . . . 26
     A.2.  Audio mixed  . . . . . . . . . . . . . . . . . . . . . . . 26
     A.3.  Audio Channel Format . . . . . . . . . . . . . . . . . . . 26
       A.3.1.  Linear Array . . . . . . . . . . . . . . . . . . . . . 26
       A.3.2.  Stereo . . . . . . . . . . . . . . . . . . . . . . . . 27
       A.3.3.  Mono . . . . . . . . . . . . . . . . . . . . . . . . . 27
     A.4.  Audio Linear Position  . . . . . . . . . . . . . . . . . . 27
     A.5.  Video Scale  . . . . . . . . . . . . . . . . . . . . . . . 28
     A.6.  Video composed . . . . . . . . . . . . . . . . . . . . . . 28
     A.7.  Video Auto-switched  . . . . . . . . . . . . . . . . . . . 28
   Appendix B.  Spatial Relationship  . . . . . . . . . . . . . . . . 28
     B.1.  Spatial relationship of audio with video . . . . . . . . . 29
   Appendix C.  Capture sets for the MCU Case . . . . . . . . . . . . 29


Romanow, et al.          Expires January 4, 2012                [Page 3]

Internet-Draft         CLUE Telepresence Framework             July 2011


   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 30


Romanow, et al.          Expires January 4, 2012                [Page 4]

Internet-Draft         CLUE Telepresence Framework             July 2011


1.  Introduction

   Current telepresence systems, though based on open standards such as
   RTP and SIP, cannot easily interoperate with each other.  A major
   factor limiting the interoperability of telepresence systems is the
   lack of a standardized way to describe and negotiate the use of the
   multiple streams of audio and video comprising the media flows.  This
   draft provides a framework for a protocol to enable interoperability
   by handling multiple streams in a standardized way.  It is intended
   to support the use cases described in
   draft-ietf-clue-telepresence-use-cases-00 and to meet the
   requirements in draft-romanow-clue-requirements-xx.

   The solution described here is strongly focused on what is being done
   today, rather than a vision of future conferencing.  However, the
   highest priority has been given to creating an extensible framework
   to make it easy to add new information needed to accommodate future
   conferencing functionality.

   The purpose of this effort is to make it possible to handle multiple
   streams of media in such a way that a satisfactory user experience is
   possible even when participants are on different vendor equipment and
   when they are using devices with different types of communication
   capabilities.  Information about the relationship of media streams
   must be communicated so that audio/video rendering can be done in the
   best possible manner.  In addition, it is necessary to choose which
   media streams are sent.

   This first draft of the CLUE framework is to introduce the basic
   approach.  The draft is deliberately as simple as possible in order
   to make it possible to focus discussion on the basic approach.  Some
   of the more descriptive material has been put into appendices in this
   version, in order to keep the framework material from being
   overwhelmed by detail.  In addition, only the basic mechanism is
   described here.  In subsequent drafts, additional mechanisms
   consistent with the basic approach will be added to handle more use
   cases.

   Several important use cases require such additional mechanism to be
   handled.  Nonetheless, we feel that it is better to go step by step,
   and we are defering that material until the next version of the
   model.  It will provide a good illustration of how to use the
   extensible feature of the framework to handle new use cases.

   If you look at this framework from the perspective of trying to
   catch-it-out and see where it breaks down in a special case, you will
   easily be able to succeed.  But we urge you to hold that perspective
   temporarily in order to concentrate on how this model works in common


Romanow, et al.          Expires January 4, 2012                [Page 5]

Internet-Draft         CLUE Telepresence Framework             July 2011


   cases, and how it can be expanded to other use cases.

   [Edt. Similarly, some of the wording is not as precise and accurate
   as might be possible.  Although of course this is very important, it
   might be useful to postpone definition issues temporarily where
   possible in order to concentrate on the framework.]

   After the following Definitions, two short sections introduce key
   concepts.  The body of the text comprises three sections that deal
   with in turn stream content, choosing streams and an implementation
   example.  The media provider and media consumer behavior are
   described in separate sections as well.  Several appendices describe
   further details for using the framework.


2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].


3.  Definitions

   The definitions marked with an "*" are new; all the others are from
   draft-wenger-clue-definitions-00-01.txt.

      *Audio Capture: Media Capture for audio.  Denoted as ACn.

      Capture Device: A device that converts audio and video input into
      an electrical signal, in most cases to be fed into a media
      encoder.  Cameras and microphones are examples for capture
      devices.

      Capture Scene: the scene that is captured by a collection of
      Capture Devices.  A Capture Scene may be represented by more than
      one type of Media.  A Capture Scene may include more than one
      Media Capture of the same type.  An example of a Capture Scene is
      the video image of a group of people seated next to each other,
      along with the sound of their voices, which could be represented
      by some number of VCs and ACs.  A middle box may also express
      Capture Scenes that it constructs from Media streams it receives.

      A Capture Set includes Media Captures that all represent some
      aspect of the same Capture Scene.  The items (rows) in a Capture
      Set represent different alternatives for representing the same
      Capture Scene.


Romanow, et al.          Expires January 4, 2012                [Page 6]

Internet-Draft         CLUE Telepresence Framework             July 2011


      Conference: used as defined in [RFC4353], A Framework for
      Conferencing within the Session Initiation Protocol (SIP).

      *Encoding Group: A set of encoding parameters representing one or
      more media encoders.  An Encoding Group describes constraints on
      encoding parameters used for mapping Media Captures to encoded
      Streams.

      Endpoint: The logical point of final termination through
      receiving, decoding and rendering, and/or initiation through
      capturing, encoding, and sending of media streams.  An endpoint
      consists of one or more physical devices which source and sink
      media streams, and exactly one [RFC4353] Participant (which, in
      turn, includes exactly one SIP User Agent).  In contrast to an
      endpoint, an MCU may also send and receive media streams, but it
      is not the initiator nor the final terminator in the sense that
      Media is Captured or Rendered.  Endpoints can be anything from
      multiscreen/multicamera rooms to handheld devices.

      Endpoint Characteristics: include placement of Capture and
      Rendering Devices, capture/render angle, resolution of cameras and
      screens, spatial location and mixing parameters of microphones.
      Endpoint characteristics are not specific to individual media
      streams sent by the endpoint.

      Left: to be interpreted as a stage direction, see also
      [StageDirection(Wikipadia)] (Edt. note: needs more clarification)

      MCU: Multipoint Control Unit (MCU) - a device that connects two or
      more endpoints together into one single multimedia conference
      [RFC5117].  An MCU includes an [RFC4353] Mixer.  Edt. Note:
      RFC4353 is tardy in requireing that media from the mixer be sent
      to EACH participant.  I think we have practical use cases where
      this is not the case.  But the bug (if it is one) is in 4353 and
      not herein.

      Media: Any data that, after suitable encoding, can be conveyed
      over RTP, including audio, video or timed text.

      *Media Capture: a source of Media, such as from one or more
      Capture Devices.  A Media Capture may be the source of one or more
      Media streams.  A Media Capture may also be constructed from other
      Media streams.  A middle box can express Media Captures that it
      constructs from Media streams it receives.


Romanow, et al.          Expires January 4, 2012                [Page 7]

Internet-Draft         CLUE Telepresence Framework             July 2011


      *Media Consumer: an Endpoint or middle box that receives Media
      streams

      *Media Provider: an Endpoint or middle box that sends Media
      streams

      Model: a set of assumptions a telepresence system of a given
      vendor adheres to and expects the remote telepresence system(s)
      also to adhere to.

      Right: to be interpreted as stage direction, see also
      [StageDirection(Wikipadia)] (Edt. note: needs more clarification)

      Render: the process of generating a representation from a media,
      such as displayed motion video or sound emitted from loudspeakers.

      *Simultaneous Transmission Set: a set of media captures that can
      be transmitted simultaneously from a Media Sender.

      Spatial Relation: The arrangement in space of two objects, in
      contrast to relation in time or other relationships.  See also
      Left and Right.

      *Stream: RTP stream as in RFC 3550.

      Stream Characteristics: include media stream attributes commonly
      used in non-CLUE SIP/SDP environments (such as: media codec, bit
      rate, resolution, profile/level etc.) as well as CLUE specific
      attributes (which could include for example and depending on the
      solution found: the I-D or spatial location of a capture device a
      stream originates from).

      Telepresence: an environment that gives non co-located users or
      user groups a feeling of (co-located) presence - the feeling that
      a Local user is in the same room with other Local users and the
      Remote parties.  The inclusion of Remote parties is achieved
      through multimedia communication including at least audio and
      video signals of high fidelity.

      *Video Capture: Media Capture for video.  Denoted as VCn.

      Video composite: A single image that is formed from combining
      visual elements from separate sources.


Romanow, et al.          Expires January 4, 2012                [Page 8]

Internet-Draft         CLUE Telepresence Framework             July 2011


4.  Two Necessary Functions

   In simplified terms, here is a description of the functions in a
   telepresence conference.

   1.   Capture media

   2.   FIGURE OUT WHICH MEDIA STREAMS TO SEND (CHOOSING STREAMS)

   3.   Encode it

   4.   ADD SOME NOTES (STREAM CONTENT)

   5.   Package it

   6.   Send it

   7.   Unpack it

   8.   Decode it

   9.   Understand the notes

   10.  Render the stream content according to the notes

   This gross oversimplification is to show clearly that there are only
   2 functions that the CLUE protocol needs to accomplish - choose which
   streams the sender should send to the receiver, and add the right
   information to the streams that get sent.  The framework/model that
   we are presenting can be understood as addressing these two issues.


5.  Protocol Features

   Central to the framework are stream providers and media stream
   consumers.  The provider's job is to advertise its capabilities (as
   described here) to the consumer, whose job it is to configure the
   provider's encoding capabilities (described below).  Both providers
   and consumers can each send and receive information, that is, we do
   not have one party as the sender and one as the receiver exclusively,
   but all parties have both sending and receiving parts to them.  Most
   devices function as both a media provider and as a media consumer.
   For two devices to communicate bidirectionally, with media flowing in
   both directions, both devices act as both a media provider and a
   media consumer.  The protocol exchange shown later in the "Choosing
   Streams" section including hints, announcement and request messages,
   happens twice independently between the 2 bidirectional devices.


Romanow, et al.          Expires January 4, 2012                [Page 9]

Internet-Draft         CLUE Telepresence Framework             July 2011


   For short we will sometimes refer to the media stream provider as the
   "sender" and the media stream consumer as the "receiver".

   Both endpoints and MCUs, or more generally a "middleboxes" can be
   media senders and receivers.

   The protocol resulting from the framework will be declarative rather
   than negotiative.  What this means here is that information is passed
   in either direction, but there is no formalized or explicit agreement
   between participants in the protocol.


6.  Stream Content

   This section describes the structure for communicating information
   between senders and receivers.  Figure illustrates how information to
   be communicated is organized.  Each construct is discussed in the
   sections below.  This diagram is for reference.

   Diagram for Stream Content

                             +---------------+
                             |               |
                             |  Capture Set  |
                             |               |
                             +-------+-------+
                          _..-'      |    ``-._
                      _.-'           |         ``-._
                  _.-'               |              ``-._
         +----------------+  +----------------+  +----------------+
         | Media Capture  |  | Media Capture  |  | Media Capture  |
         | Audio or Video |  | Audio or Video |  | Audio or Video |
         +----------------+  +----------------+  +----------------+
            .'         `.
          .'             `.
      ,-----.         ,---------.
    ,' Encode`.     ,'           `.
   (   Group   )   (  Attributes   )
    `.       ,'     `.           ,'
      `-----'         `---------'

6.1.  Media capture

   A media capture (defined in definitions) is a fundamental concept of
   the model.  Media can be captured in different ways, for example by
   various arrangements of cameras and microphones.  The model uses the
   terms "video capture" (VC) and "audio capture" (AC) to refer to
   sources of media streams.  To distinguish between multiple instances,


Romanow, et al.          Expires January 4, 2012               [Page 10]

Internet-Draft         CLUE Telepresence Framework             July 2011


   they are numbered for example VC1, VC2, and VC3 could refer to three
   different video captures that can be used simultaneously.

   Media captures are dynamic.  They can come and go in a conference -
   and their parameters can change.  A sender can advertise a new list
   of captures at any time.  Both the media sender and media receiver
   can send "their messages" (i.e., capture set advertisements, stream
   configurations) any number of times during a call, and the other end
   is always required to act on any new information received (e.g.,
   stopping streams it had previously configured that are no longer
   valid).

   A media capture can be a media source such as video from a specific
   camera, or it can be more conceptual such as a composite image from
   several cameras, or an automatic dynamically switched capture
   choosing from several cameras depending on who is talking or other
   factors.

   A media capture is described by Attributes and associated with an
   Encode Group.  Audio and video captures are aggregated into Capture
   Sets.

6.2.  Attributes

   Audio and video capture attributes carry the information about
   streams and their relationships that a sender or receiver wants to
   communicate.  [Edt: We do not mean to duplicate SDP, if an SDP
   description can be used, great.]

   The attributes of media streams refer to the current state of a
   stream, rather than the capabilities of a video capture device which
   are described in the encode capabilities, as descried below.

   The mechanism of Attributes make the framework extensible.  Although
   we are defining some attributes now based on the most common use
   cases, new attributes can be added for new use cases as they arise.
   If the model does not do something you want it to, chances are
   defining an attribute will handle your case.

   We describe attributes by variables and their values.  The current
   attributes are listed below.  The variable is shown in parentheses,
   and the values follow after the colon:

   o  (Purpose): main audio, main video, presentation

   o  (Audio mixed): true, false


Romanow, et al.          Expires January 4, 2012               [Page 11]

Internet-Draft         CLUE Telepresence Framework             July 2011


   o  (Audio Channel Format): linear array, mono, stereo, tbd

   o  (Audio linear position): integer 0 to 100

   o  (Video scale): integer indicating scale

   o  (Video composed): true, false

   o  (Video auto-switched): true, false

   The attributes listed here are discussed in Appendix A, in order to
   keep the emphasis of this draft on the overall approach, rather than
   the more specific details.

6.3.  Capture Set

   A sender describes its ability to send alternatives of media streams
   by defining capture sets.

   A capture set is a list of media captures expressed in rows.  Each
   row of the capture set or list consists of either a single capture or
   groups of captures.  A group means the individual captures in the
   group are spatially related, and the order of the captures within the
   group, along with attribute values, defines the spatial ordering of
   the captures.  Spatial relationships are discussed in detail in
   Appendix B.

   The items (rows) in a capture set represent different alternatives
   for representing the same Capture Scene.  For example the following
   are alternative ways of capturing the same Capture Scene - two
   cameras each viewing half of a room, or one camera viewing the whole
   room, or one stream that automatically captures the person in the
   room who is currently speaking.  Each row of the Capture Set contains
   either a single media capture or one group of media captures.

   The following example shows a capture set for an endpoint media
   sender where:

   o  (VC0 - left camera capture, VC1 - center camera capture, VC2 -
      right camera capture)

   o  (VC3 - capture associated with loudest)

   o  (VC4 - zoomed out view of all people in the room.)

   o  (AC0 - room audio)

   The first item in this capture set example is a group of video


Romanow, et al.          Expires January 4, 2012               [Page 12]

Internet-Draft         CLUE Telepresence Framework             July 2011


   captures with a spatial relationship to each other.  VC1 is to the
   left of VC2, and VC0 is to the left of VC1.  VC3 and VC4 are other
   alternatives of how to capture the same room in different ways.  The
   audio capture is included in the same capture set to indicate AC0 is
   associated with those video captures, meaning the audio should be
   rendered along with the video in the same set.

   The idea is to have sets of captures that represent the same
   information ("information" in this context might be a set of people
   and their associated audio / video streams, or might be a
   presentation supplied by a laptop, perhaps with an accompanying audio
   commentary).  Spatial ordering of media captures is imposed here by
   the simplicity of a left to right ordering among media captures in a
   group in the set.

   A media receiver could choose one row of each media type (e.g., audio
   and video) from a capture set.  For example a three stream receiver
   could choose the first video row plus the audio row, while a single
   stream receiver could choose the second or third video row plus the
   audio row.  An MCU receiver might choose to receive multiple rows.

   The simultaneity groups and encoding groups as discussed in the next
   section apply to media captures listed in capture sets.  The
   simultaneity groups and encoding groups MUST allow all the Media
   Captures in a particular group to be used simultaneously.


7.  Choosing Streams

   The following diagram shows the flow of information messages between
   a media provider and a media consumer.  The provider sends
   information about its capabilities (as specified in this section),
   then the consumer chooses which streams it wants, which we refer to
   as "configure".  Optionally, the consumer may send hints to the
   provider about its own capabilities, in which case the provider might
   tailor its announcements to the consumer.

   Diagram for Choosing Streams


Romanow, et al.          Expires January 4, 2012               [Page 13]

Internet-Draft         CLUE Telepresence Framework             July 2011


    Media Receiver                         Media Sender
    --------------                         ------------
          |                                     |
          |------------- Hints ---------------->|
          |                                     |
          |                                     |
          |<---- Capabilities (announce) -------|
          |                                     |
          |                                     |
          |------ Configure (request) --------->|
          |                                     |

   In order for appropriate streams to be sent from senders to
   receivers, certain characteristics of the multiple streams must be
   understood by both senders and receivers.  Two separate aspects of
   streams suffice to describe the necessary information to be shared by
   senders and receivers.  The first aspect we call "physical
   simultaneity" and the other aspect we refer to as "encoding group".
   These are described in the following sections.

7.1.  Physical Simultaneity

   An endpoint or MCU can send multiple captures simultaneously.
   However, there may be constraints that limit which captures can be
   sent simultaneously with other captures.

   Physical or device simultaneity refers to fact that a device may not
   be able to be used in different ways at the same time.  This shapes
   the way that offers are made from the sender.  The offers are made so
   that the receiver will choose one of several possible usages of the
   device.  This is easier to show in an example.

   Consider the example of a room system where there are 3 cameras each
   of which can send a separate capture covering 2 persons each- VC0,
   VC1, VC2.  The middle camera can also zoom out and show all 6
   persons, VC3.  But the middle camera cannot be used in both modes at
   the same time - it has to either show the space where 2 participants
   sit or the whole 6 seats.  We refer to this as a physical device
   simultaneity constraint.

   The following illustration shows 3 cameras with 4 video streams.  The
   middle camera can be used as main video zoomed in on 2 people or it
   could be used in zoomed out mode and capture the whole endpoint.  The
   idea here is that the middle camera cannot be used for both zoomed in
   and zoomed out captures simultaneously.  This is a constraint imposed
   by the physical limitations of the devices.

   Diagram for Simultaneity


Romanow, et al.          Expires January 4, 2012               [Page 14]

Internet-Draft         CLUE Telepresence Framework             July 2011


   `-.   +--------+   VC2
      .-'_Camera 3|---------->
   .-'   +--------+
                       VC3
                     -------->
   `-.   +--------+ /
      .-'|Camera 2|<
   .-'   +--------+ \  VC1
                     -------->

   `-.   +--------+   VC0
      .-'|Camera 1|---------->
   .-'   +--------+

   VC0- video zoomed in on 2 people        VC2- video zoomed in on 2 people
   VC1- video zoomed in on 2 people        VC3- video zoomed out on 6 people

   Simultaneous transmission sets can be expressed as sets of the VCs
   that could physically be transmitted at the same time, though it may
   not make sense to do so.

   In this example the two simultaneous sets are:

   o  {VC0, VC1, VC2}

   o  {VC0, VC3, VC2}

   In this example VC0, VC1 and VC2 can be sent OR VC0, VC3 and VC2.
   Only one set can be transmitted at a time.  These are physical
   capabilities describing what can physically be sent at the same time,
   not what might make sense to send.  For example, in the second set
   both VC0 and VC2 are redundant if VC3 is included.

   In describing its capabilities, the provider must take physical
   simultaneity into account and send a list of its simultaneity groups
   to the consumer.

7.2.  Encoding Groups

   The second aspect of multiple streams that must be understood by
   senders and receivers in order to create the best experience
   possible, i. e., for the "right" or "best" streams to be sent, is the
   encoding characteristics of the possible streams that can be sent.
   Just in the way that there is a constraint imposed on the multiple
   streams due to the physical limitations, there are also constraints
   due to encoding limitations.  These are described in an Encoding
   Group as follows.


Romanow, et al.          Expires January 4, 2012               [Page 15]

Internet-Draft         CLUE Telepresence Framework             July 2011


   An encoding group is an attribute of a video capture (VC) as
   discussed above.

   An encoding group has the following variables, as shown in the
   following table.

   +--------------+----------------------------------------------------+
   | Name         | Description                                        |
   +--------------+----------------------------------------------------+
   | maxBandwidth | Maximum number of bits per second relating to a    |
   |              | single video encoding                              |
   | maxMbps      | Maximum number of macroblocks per second relating  |
   |              | to a single video encoding: ((width + 15) / 16) *  |
   |              | ((height + 15) / 16) * framesPerSecond             |
   | maxWidth     | Video resolution's maximum supported width,        |
   |              | expressed in pixels                                |
   | maxHeight    | Video resolution's maximum supported height,       |
   |              | expressed in pixels                                |
   | maxFrameRate | Maximum supported frame rate                       |
   +--------------+----------------------------------------------------+

   An encoding group is the basic method of describing encoding
   capability.  There may be multiple encoding groups per endpoint.  For
   example, each video capture device might have an associated encoding
   group that describes the video streams that can result from that
   capture.

   An encoding group EG<n> comprises one or more potential encodings
   ENC<n>.  For example,

EG0:  maxMbps=489600, maxBandwidth=6000000
     VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000
     VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000
     AUDIO_ENC0: maxBandwidth=96000
     AUDIO_ENC1: maxBandwidth=96000
     AUDIO_ENC2: maxBandwidth=96000

   Here, the encoding group is EG0.  It can transmit up to two 1080p30
   encodings (Mbps for 1080p = 244800), but it is capable of
   transmitting a maxFrameRate of 60 frames per second (fps).  To
   achieve the maximum resolution (1920 x 1088) the frame rate is
   limited to 30 fps.  However 60 fps can be achieved at a lower
   resolution if required by the receiver.  Although the encoding group
   is capable of transmitting up to 6Mbit/s, no individual video
   encoding can exceed 4Mbit/s.

   This encoding group also allows up to 3 audio encodings,
   AUDIO_ENC<0-2>.  It is not required that audio and video encodings


Romanow, et al.          Expires January 4, 2012               [Page 16]

Internet-Draft         CLUE Telepresence Framework             July 2011


   reside within the same encoding group, but if so then the group's
   overall maxBandwidth value is a limit on the sum of all audio and
   video encodings configured by the receiver.  A system that does not
   wish or need to combine bandwidth limitations in this way should
   instead use separate encoding groups for audio and video in order for
   the bandwidth limitations on audio and video to not interact.

   Here is an example written with separate audio and video encode
   groups.

VIDEO_EG0:  maxMbps=489600, maxBandwidth=6000000
     VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000
     VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000
AUDIO_EG0: maxBandwidth=500000
     AUDIO_ENC0: maxBandwidth=96000
     AUDIO_ENC1: maxBandwidth=96000
     AUDIO_ENC2: maxBandwidth=96000


   The following two sections describe further examples of encoding
   groups.  In the first example, the capability parameters are the same
   across ENCs.  In the second example, they vary.

7.2.1.  Sample video encoding group specification #1

   An endpoint that has 3 similar video capture devices would advertise
   3 encoding groups that can each transmit up to 2 1080p30 encondings,
   as follows:

EG0:  maxMbps = 489600, maxBandwidth=6000000
        ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000
        ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000
EG1:  maxMbps = 489600, maxBandwidth=6000000
        ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000
        ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000
EG2:  maxMbps = 489600, maxBandwidth=6000000
        ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000
        ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000


   A remote receiver configures some or all of the specific encodings
   such that:

   o  The configuration of each active ENC<n> parameter values does not
      cause that encoding's maxWidth, maxHeight, maxFrameRate to be
      exceeded


Romanow, et al.          Expires January 4, 2012               [Page 17]

Internet-Draft         CLUE Telepresence Framework             July 2011


   o  The total bandwidth of the configured ENC <n> encodings does not
      exceed the maxBandwidth of the encoding group

   o  The sum of the "macroblocks per second" values of each configured
      encoding does not exceed the maxMbps of the encoding group

   There is no requirement for all encodings within an encoding group to
   be activated when configured by the receiver.

   Depending on the sender's encoding methods, the receiver may be able
   to request fixed encode values or choose encode values in the range
   less than the maximum offered.  We will discuss receiver behavior in
   more detail in a section below.

7.2.2.  Sample video encoding group specification #2

   An endpoint that has 3 similar video capture devices would advertise
   3 encoding groups that can each transmit up to 2 1080p30 encondings,
   as follows:

EG0:  maxMbps = 489600, maxBandwidth=6000000
        ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000
        ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000
EG1:  maxMbps = 489600, maxBandwidth=6000000
        ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000
        ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000
EG2:  maxMbps = 489600, maxBandwidth=6000000
        ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000
        ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxMbps=244800, maxBandwidth=4000000

   A remote receiver configures some or all of the specific encodings
   such that:

   o  The configuration of each active ENC<n> parameter values does not
      cause that encoding's maxWidth, maxHeight, maxFrameRate to be
      exceeded

   o  The total bandwidth of the configured ENC <n> encodings does not
      exceed the maxBandwidth of the encoding group

   o  The sum of the "macroblocks per second" values of each configured
      encoding does not exceed the maxMbps of the encoding group

   There is no requirement for all encodings within an encoding group to
   be activated when configured by the receiver.

   Depending on the sender's encoding methods, the receiver may be able
   to request fixed encode values or choose encode values in the range


Romanow, et al.          Expires January 4, 2012               [Page 18]

Internet-Draft         CLUE Telepresence Framework             July 2011


   less than the maximum offered.  We will discuss receiver behavior in
   more detail in a section below.


8.  Media provider behavior

   In summary, what is included in the sender capabilities announce
   messing includes:

   o  the list of captures and their attributes

   o  the list of capture sets

   o  the list of physical simultaneity groups

   o  the list of the encoding groups


9.  Putting it together - using the Capture Set

   This section shows how to use the framework to represent a typical
   case for telepresence rooms.

   Appendix B includes an additional example showing the MCU case.
   [Edt. It is in the Appendix just to allow the body of the document to
   focus on the basic ideas.  It can be brought in to the main text in a
   later draft.]

   Consider an endpoint with the following characteristics:

   o  3 cameras, 3 displays, a 6 person table

   o  Each video device can provide one capture for each 1/3 section of
      the table

   o  A single capture representing the active speaker can be provided

   o  A single capture representing the active speaker with the other 2
      captures shown picture in picture within the stream can be
      provided

   o  A capture showing a zoomed out view of all 6 seats in the room can
      be provided

   The audio and video captures for this endpoint can be described as
   follows.  The Encode Group specifications can be found above in
   section 6.2.2, Sample video encoding group specification #2.


Romanow, et al.          Expires January 4, 2012               [Page 19]

Internet-Draft         CLUE Telepresence Framework             July 2011


   Video Captures:

   1.  VC0- (the left camera stream), encoding group:EG0, attributes:
       purpose=main;auto-switched:no

   2.  VC1- (the center camera stream), encoding group:EG1, attributes:
       purpose=main; auto-switched:no

   3.  VC2- (the right camera stream), encoding group:EG2, attributes:
       purpose=main;auto-switched:no

   4.  VC3- (the loudest panel stream), encoding group:EG1, attributes:
       purpose=main;auto-switched:yes

   5.  VC4- (the loudest panel stream with PiPs), encoding group:EG1,
       attributes: purpose=main; composed=true; auto-switched:yes

   6.  VC5- (the zoomed out view of all people in the room), encoding
       group:EG1, attributes: purpose=main;auto-switched:no

   7.  VC6- (presentation stream), encoding group:EG1, attributes:
       purpose=presentation;auto-switched:no

   Summary of video captures - 3 codecs, center one is used for center
   camera stream, presentation stream, auto-switched, and zoomed views.
   [edt.  It is arbitrary that for this example the alternative views
   are on EG1 - they could have been spread out- it was not a necessary
   choice.]

   Audio Captures:

   o  AC0 (left), attributes: purpose=main;channel format=linear array;
      linear position=0;

   o  AC1 (right), attributes: purpose=main;channel format=linear array;
      linear position=100;

   o  AC2 (center) attributes: purpose=main;channel format=linear array;
      linear position=50;

   o  AC3 being a simple pre-mixed audio stream from the room (mono),
      attributes: purpose=main;channel format=linear array; linear
      position=50; mixed=true

   o  AC4 audio stream associated with the presentation video (mono)
      attributes: purpose=presentation;channel format=linear array;
      linear position=50;


Romanow, et al.          Expires January 4, 2012               [Page 20]

Internet-Draft         CLUE Telepresence Framework             July 2011


   The physical simultaneity information is:

      {VC0, VC1, VC2, VC3, VC4, VC6}

      {VC0, VC2, VC5, VC6}

   You can physically do any selection within one set at the same time.
   This is strictly what is possible from the devices.  However, using
   every member in the set simultaneously may not make sense- for
   example VC3(loudest) and VC4 (loudest with PIP).  (In addition, there
   are encoding constraints that make choosing all of the VCs in a set
   impossible.  VC1, VC3, VC4, VC5, VC6 all use EG1 and EG1has only 3
   ENCs.  This constraint shows up in the Capture list, not in the
   physical simultaneity list.)

   In this example there are no restrictions on which audio captures can
   be sent simultaneously.

   The following table represents the capture sets for this sender.
   Recall that a capture set is composed of alternative captures
   covering the same scene.  Capture Set #1 is for the main people
   captures, and Capture Set #2 is for presentation.

                            +----------------+
                            | Capture Set #1 |
                            +----------------+
                            | VC0, VC1, VC2  |
                            | VC3            |
                            | VC4            |
                            | VC5            |
                            | AC0, AC1, AC2  |
                            | AC3            |
                            +----------------+

                            +----------------+
                            | Capture Set #2 |
                            +----------------+
                            | VC6            |
                            | AC4            |
                            +----------------+

   Different capture sets are unique to each other, non-overlapping.  A
   receiver chooses a capture row from each capture set.  In this case
   the three captures VC0, VC1, and VC2 are one way of representing the
   video from the endpoint.  These three captures should appear adjacent
   next to each other.  Alternatively, another way of representing the
   Capture Scene is with the capture VC3, which automatically shows the
   person who is talking.  Similarly for the VC4 and VC5 alternatives.


Romanow, et al.          Expires January 4, 2012               [Page 21]

Internet-Draft         CLUE Telepresence Framework             July 2011


   As in the video case, the different rows of audio in Capture Set #1
   represent the "same thing", in that one way to receive the audio is
   with the 3 linear position audio captures (AC0, AC1, AC2), and
   another way is with the single channel monaural format AC3.  The
   Media Consumer would choose the one audio capture row it is capable
   of receiving.

   The spatial ordering is understood by the left to right ordering
   among the VC7lt;n&gtr;s on the same row of the table.

   The receiver finds a "row" in each capture set #x section of the
   table that it wants.  It configures the streams according to the
   encoding group for the row.

   A Media Receiver would likely want to choose a row to receive based
   in part on how many streams it can simultaneously receive.  A
   receiver that can receive three people streams would probably prefer
   to receive the first row of Capture Set #1 (VC0, VC1, VC2) and not
   receive the other rows.  A receiver that can receive only one people
   stream would probably choose one of the other rows.

   If the receiver can receive a presentation stream too, it would also
   choose to receive the only row from Capture Set #2 (VC6).


10.  Media consumer behaviour

   The receive side of a call needs to balance its requirements, based
   on number of screens and speakers, its decoding capabilities and
   available bandwidth, and the sender's capabilities in order to
   optimally configure the sender's streams.  Typically it would want to
   receive and decode media from each capture set advertised by the
   sender.

   A sane, basic, algorithm might be for the receiver to go through each
   capture set in turn and find the collection of video captures that
   best matches the number of screens it has (this might include
   consideration of screens dedicated to presentation video display
   rather than "people" video) and then decide between alternative rows
   in the video capture sets based either on hard-coded preferences or
   user choice.  Once this choice has been made, the receiver would then
   decide how to configure the sender's encode groups in order to make
   best use of the available network bandwidth and its own decoding
   capabilities.


Romanow, et al.          Expires January 4, 2012               [Page 22]

Internet-Draft         CLUE Telepresence Framework             July 2011


10.1.  One screen receiver configuring the example capture-side device
       above

   The receive side of a call needs to balance its requirements, based
   on number of screens and speakers, its decoding capabilities and
   available bandwidth, and the sender's capabilities in order to
   optimally configure the sender's streams.  Typically it would want to
   receive and decode media from each capture set advertised by the
   sender.

   A sane, basic, algorithm might be for the receiver to go through each
   capture set in turn and find the collection of video captures that
   best matches the number of screens it has (this might include
   consideration of screens dedicated to presentation video display
   rather than "people" video) and then decide between alternative rows
   in the video capture sets based either on hard-coded preferences or
   user choice.  Once this choice has been made, the receiver would then
   decide how to configure the sender's encode groups in order to make
   best use of the available network bandwidth and its own decoding
   capabilities.

10.2.  Two screen receiver configuring the example capture-side device
       above

   Mixing systems with an even number of screens, "2n", and those with
   "2n+1" cameras (and vice versa) is always likely to be the
   problematic case.  In this instance, the behaviour is likely to be
   determined by whether a "2 screen" system is really a "2 decoder"
   system, i.e., whether only one received stream can be displayed per
   screen or whether more than 2 streams can be received and spread
   across the available screen area.  To enumerate 3 possible behaviours
   here for the 2 screen system when it learns that the far end is
   "ideally" expressed via 3 capture streams:

   1.  Fall back to receiving just a single stream (VC3, VC4 or VC5 as
       per the 1 screen receiver case above) and either leave one screen
       blank or use it for presentation if / when a presentation becomes
       active

   2.  Receive 3 streams (VC0, VC1 and VC2) and display across 2 screens
       (either with each capture being scaled to 2/3 of a screen and the
       centre capture being split across 2 screens) or, as would be
       necessary if there were large bezels on the screens, with each
       stream being scaled to 1/2 the screen width and height and there
       being a 4th "blank" panel.  This 4th panel could potentially be
       used for any presentation that became active during the call.


Romanow, et al.          Expires January 4, 2012               [Page 23]

Internet-Draft         CLUE Telepresence Framework             July 2011


   3.  Receive 3 streams, decode all 3, and use control information
       indicating which was the most active to switch between showing
       the left and centre streams (one per screen) and the centre and
       right streams.

   For an endpoint capable of all 3 methods of working described above,
   again it might be appropriate to offer the user the choice of display
   mode.

10.3.  Three screen receiver configuring the example capture-side device
       above

   This is the most straightforward case - the receiver would look to
   identify a set of streams to receive that best matched its available
   screens and so the VC0 plus VC1 plus VC2 should match optimally.  The
   spatial ordering would give sufficient information for the correct
   video capture to be shown on the correct screen, and the receiver
   would either need to divide a single encode group's capability by 3
   to determine what resolution and frame rate to configure the sender
   with or to configure the individual video captures' encode groups
   with what makes most sense (taking into account the receive side
   decode capabilities, overall call bandwidth, the resolution of the
   screens plus any user preferences such as motion vs sharpness).

10.4.  Configuration of sender streams by a receiver

   After receiving a set of video capture information from a sender and
   making its choice of what media streams to receive based on the
   receiver's own capabilities and any sender-side simultaneity
   restrictions, the receiver needs to essentially configure the sender
   to transmit the chosen set.

   The expectation is that this message will enumerate each of the
   encoding groups and potential encoders within those groups that the
   receiver wishes to be active (this may well be a subset of the
   complete set available).  For each such encoder within an encoding
   group, the receiver would specify the video capture (i.e., VC<n&t; as
   described above) along with the specifics of the video encoding
   required, i.e. width, height, frame rate and bit rate.  At this
   stage, the receiver would also provide RTP demultiplexing information
   as required to distinguish each stream from the others being
   configured by the same mechanism.

10.5.  Advertisement of capabilities sent by receiver to sender

   In order for a maximally-capable sender to be able to advertise a
   manageable number of video captures to a receiver, there is a
   potential use for the receiver being able, at the start of CLUE to be


Romanow, et al.          Expires January 4, 2012               [Page 24]

Internet-Draft         CLUE Telepresence Framework             July 2011


   able to inform the sender of its capabilities.  One example here
   would be the video capture attribute set - a receiver could tell the
   sender the complete set of video capture attributes it is able to
   understand and so the sender would be able to reduce the capture set
   it advertises to be tailored to the receiver.


11.  Acknowledgements

   We want to thank Stephen Botzko for helpful discussions on audio.


12.  IANA Considerations

   TBD


13.  Security Considerations

   TBD


14.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
              Session Initiation Protocol (SIP)", RFC 4353,
              February 2006.

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
              January 2008.

   [StageDirection(Wikipedia)]
              Wikipedia, "Blocking (stage), available from http://
              en.wikipedia.org/wiki/Stage_direction#Stage_directions",
              May 2011, <http://en.wikipedia.org/wiki/
              Stage_direction#Stage_directions>.


Romanow, et al.          Expires January 4, 2012               [Page 25]

Internet-Draft         CLUE Telepresence Framework             July 2011


Appendix A.  Attributes

   This section discusses the attributes and their values in more
   detail, and many have additional details provided elsewhere in the
   draft.  In general, the way to extend the solution to handle new
   features is by adding attributes and/or values.

A.1.  Purpose

   A variable with enumerated values describing the purpose or role of
   the Media Capture.  It could be applied to any media type.  Possible
   values: main, presentation, others TBD.

A.1.1.  Main

   The audio or video capture is of one or more people participating in
   a conference (or where they would be if they were there).  It is of
   part or all of the Capture Scene.

A.1.2.  Presentation

A.2.  Audio mixed

A.3.  Audio Channel Format

   The "channel format" attribute of an Audio Capture indicates how the
   meaning of the channels is determined.  It is an enumerated variable
   describing the type of audio channel or channels in the Aucio
   Capture.  The possible values of the "channel format" attribute are:

   o  linear array (linear position)

   o  stereo

   o  TBD - other possible future values (to potentially include other
      things like 3.0, 3.1, 5.1 surround sound and binaural)

   All ACs in the same row of a Capture Set MUST have the same value of
   the "channel format" attribute.

A.3.1.  Linear Array

   An AC with channel format = "linear array" has exactly one audio
   channel.  For the "linear array" channel format, there is another
   required attribute to specify position within the array.  This is the
   "linear position" attribute, which is an integer value within the
   range 0 to 100. 0 means leftmost, 100 means rightmost, with other
   values spaced equally between.  A value of 50 means in the center,


Romanow, et al.          Expires January 4, 2012               [Page 26]

Internet-Draft         CLUE Telepresence Framework             July 2011


   spatially.  Any AC can have any value, even multiple ACs in a capture
   set row can have the same value.  The 0-100 linear position is
   intentionally dimensionless, since we are presuming that receivers
   will use different sized video displays, and the audio spatial
   location can be adjusted at the receiving side to correspond to the
   displays.

   The linear position value is fixed until the receiver asks for a
   different AC from the capture set, which may be triggered by the
   provider sending an updated capture set.

   The streams being sent might be correlated (that is, someone talking
   might be heard in multiple captures from the same room).  Echo
   cancellation and stream synchronization in receivers should take this
   into account.

   With three audio channels representing left, center, and right:

      AC0 - channel format = linear array; linear position = 0

      AC1 - channel format = linear array; linear position = 50

      AC2 - channel format = linear array; linear position = 100

A.3.2.  Stereo

   An AC with channel format = "stereo" has exactly two audio channels,
   left and right, as part of the same AC.  [Edt: should we mention RFC
   3551 here?  The channel format may be related to how Audio Captures
   are mapped to RTP streams.  This stereo is not the same as the effect
   produced from two mono ACs one from the left and one from the right.
   ]

A.3.3.  Mono

   An AC with channel format="mono" has one audio channel.  This can be
   represented by audio linear position with a single member at a single
   integer location.  [Edt. Mono can be represented as an as a
   particular case of linear array (=1]

A.4.  Audio Linear Position

   An integer valued variable from 0 - 100, where 0 signifies the left
   and 100 signifies the right.


Romanow, et al.          Expires January 4, 2012               [Page 27]

Internet-Draft         CLUE Telepresence Framework             July 2011


A.5.  Video Scale

   An optional integer valued variable indicating the spatial scale of
   the video capture, for example centimeters for horizontal image
   width.

A.6.  Video composed

   An optional Boolean variable indicating if the VC is constructed by
   composing multiple other video captures together. stream incorporates
   multiple composed panes (This could indicate for example a continuous
   presence view of multiple images in a grid, or a large image with
   smaller picture-in-picture images in it.)

A.7.  Video Auto-switched

   A Boolean variable.  In this case the offered VC varies depending on
   some rule; it is auto-switched between possible VCs.  The most common
   example of this is sending the video capture associated with the
   "loudest" speaker according to an audio detection algorithm.


Appendix B.  Spatial Relationship

   Here is an example of a simple capture set with three video captures
   and three audio channels, each in a separate row:

      (VC0, VC1, VC2)

      (AC0, AC1, AC2)

   The three ACs together in a row indicate those channels are spatially
   related to each other, and spatially related to the VCs in the same
   capture set.

   Multiple Media Captures of the same media type are often spatially
   related to each other.  Typically multiple Video Captures should be
   rendered next to each other in a particular order, or multiple audio
   channels should be rendered to match different speakers in a
   particular way.  Also, media of different types are often associated
   with each other, for example a group of Video Captures can be
   associated with a group of Audio Captures meaning they should be
   rendered together.

   Media Captures of the same media type are associated with each other
   by grouping them together in a single row of a Capture Set. Media
   Captures of different media types are associated with each other by
   putting them in different rows of the same Capture Set.


Romanow, et al.          Expires January 4, 2012               [Page 28]

Internet-Draft         CLUE Telepresence Framework             July 2011


   For video the spatial relationship is horizontal adjacency in one
   dimension.  So Video Captures can be described as being adjacent to
   each other, in a horizontal row, ordered left to right.  When VCs are
   grouped together in a capture set row, it means they are horizontally
   adjacent to each other, such that when more than one of them are
   rendered together they should be rendered next to each other in the
   proper order.  The first VC in the group is the leftmost (from the
   point of view of a person looking at the rendered images), and so on
   towards the right.

   [Edt: Additional attributes can be added, such as the ability to
   handle two dimensional array instead of just a one dimensional row of
   video images.]

   Audio Captures that are in the same Capture Set with Video Captures
   are related to each other spatially, such that the multiple audio
   channels should be rendered such that the overall audio field covers
   roughly the same horizontal extent as the rendered video.  This gives
   a reasonable spatial correlation between audio and video.  A more
   exact relationship is out of scope of this framework.

B.1.  Spatial relationship of audio with video

   A row of audio is spatially related to a row of video in the same
   capture set.  The audio and video should be rendered such that they
   appear spatially coincident.  Audio with a linear position of 0
   corresponds to the leftmost side of the group of VCs in the same
   capture set.  Audio with a linear position of 50 corresponds to the
   center of the group of VCs.  Audio with a linear position of 100
   corresponds to the rightmost side of the group of VCs.

   Likewise, for stereo audio, the spatial extent of the audio should be
   coincident with the spatial extent of the corresponding video.


Appendix C.  Capture sets for the MCU Case

   This shows how an MCU might express its Capture Sets, intending to
   offer different choices for receivers that can handle different
   numbers of streams.  A single audio capture stream is provided for
   all single and multi-screen configurations that can be associated
   (e.g. lip-synced) with any combination of video captures at the
   receiver.


Romanow, et al.          Expires January 4, 2012               [Page 29]

Internet-Draft         CLUE Telepresence Framework             July 2011


   +--------------------+---------------------------------------------+
   | Capture Set #1     | note                                        |
   +--------------------+---------------------------------------------+
   | VC0                | video capture for single screen receiver    |
   | VC1, VC2           | video capture for 2 screen receiver         |
   | VC3, VC4, VC5      | video capture for 3 screen receiver         |
   | VC6, VC7, VC8, VC9 | video capture for 4 screen receiver         |
   | AC0                | audio capture representing all participants |
   +--------------------+---------------------------------------------+

   If / when a presentation stream becomes active within the conference,
   the MCU might re-advertise the available media as:

         +----------------+--------------------------------------+
         | Capture Set #2 | note                                 |
         +----------------+--------------------------------------+
         | VC10           | video capture for presentation       |
         | AC1            | presentation audio to accompany VC10 |
         +----------------+--------------------------------------+


Authors' Addresses

   Allyn Romanow
   Cisco Systems
   San Jose, CA  95134
   USA

   Email: allyn@cisco.com


   Mark Duckworth
   Polycom
   Andover, MA  01810
   US

   Email: mark.duckworth@polycom.com


   Andrew Pepperell
   Cisco Systems
   Langely, England
   UK

   Email: apeppere@cisco.com


Romanow, et al.          Expires January 4, 2012               [Page 30]

Internet-Draft         CLUE Telepresence Framework             July 2011


   Brian Baldino
   Cisco Systems
   San Jose, CA  95134
   US

   Email: bbaldino@polycom.com


   Mark Goryzinski
   HP Visual Collaboration
   Corvallis, OR
   USA

   Email: mark.gorzynski@hp.com


Romanow, et al.          Expires January 4, 2012               [Page 31]