CLUE WG                                                       A. Romanow
Internet-Draft                                             Cisco Systems
Intended status: Informational                              M. Duckworth
Expires: January 4, 2012                                         Polycom
                                                            A. Pepperell
                                                              B. Baldino
                                                           Cisco Systems
                                                            M. Gorzynski
                                                 HP Visual Collaboration
                                                            July 3, 2011

              Framework for Telepresence Multi-Streams
                  draft-romanow-clue-framework-00.txt

Abstract

   This memo offers a framework for a protocol that enables devices in
   a telepresence conference to interoperate by specifying the
   relationships between multiple RTP streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 4, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Definitions
   4.  Two Necessary Functions
   5.  Protocol Features
   6.  Stream Content
     6.1.  Media capture
     6.2.  Attributes
     6.3.  Capture Set
   7.  Choosing Streams
     7.1.  Physical Simultaneity
     7.2.  Encoding Groups
       7.2.1.  Sample video encoding group specification #1
       7.2.2.  Sample video encoding group specification #2
   8.  Media provider behavior
   9.  Putting it together - using the Capture Set
   10. Media consumer behaviour
     10.1. One screen receiver configuring the example capture-side
           device above
     10.2. Two screen receiver configuring the example capture-side
           device above
     10.3. Three screen receiver configuring the example capture-side
           device above
     10.4. Configuration of sender streams by a receiver
     10.5. Advertisement of capabilities sent by receiver to sender
   11. Acknowledgements
   12. IANA Considerations
   13. Security Considerations
   14. Informative References
   Appendix A.  Attributes
     A.1.  Purpose
       A.1.1.  Main
       A.1.2.  Presentation
     A.2.  Audio mixed
     A.3.  Audio Channel Format
       A.3.1.  Linear Array
       A.3.2.  Stereo
       A.3.3.  Mono
     A.4.  Audio Linear Position
     A.5.  Video Scale
     A.6.  Video composed
     A.7.  Video Auto-switched
   Appendix B.  Spatial Relationship
     B.1.  Spatial relationship of audio with video
   Appendix C.  Capture sets for the MCU Case
   Authors' Addresses

1.  Introduction

   Current telepresence systems, though based on open standards such
   as RTP and SIP, cannot easily interoperate with each other.  A
   major factor limiting the interoperability of telepresence systems
   is the lack of a standardized way to describe and negotiate the use
   of the multiple streams of audio and video comprising the media
   flows.  This draft provides a framework for a protocol to enable
   interoperability by handling multiple streams in a standardized
   way.  It is intended to support the use cases described in
   draft-ietf-clue-telepresence-use-cases-00 and to meet the
   requirements in draft-romanow-clue-requirements-xx.

   The solution described here is strongly focused on what is being
   done today, rather than on a vision of future conferencing.  At the
   same time, the highest priority has been given to creating an
   extensible framework, to make it easy to add new information needed
   to accommodate future conferencing functionality.

   The purpose of this effort is to make it possible to handle
   multiple streams of media in such a way that a satisfactory user
   experience is possible even when participants are on different
   vendor equipment and when they are using devices with different
   types of communication capabilities.  Information about the
   relationship of media streams must be communicated so that audio/
   video rendering can be done in the best possible manner.  In
   addition, it is necessary to choose which media streams are sent.

   This first draft of the CLUE framework introduces the basic
   approach.  The draft is deliberately as simple as possible, in
   order to focus discussion on that basic approach.
   Some of the more descriptive material has been put into appendices
   in this version, in order to keep the framework material from being
   overwhelmed by detail.  In addition, only the basic mechanism is
   described here.  In subsequent drafts, additional mechanisms
   consistent with the basic approach will be added to handle more use
   cases.  Several important use cases require such additional
   mechanisms.  Nonetheless, we feel that it is better to go step by
   step, and we are deferring that material until the next version of
   the model.  It will provide a good illustration of how to use the
   extensible features of the framework to handle new use cases.

   If you examine this framework looking for special cases where it
   breaks down, you will easily find them.  We urge you to hold that
   perspective temporarily, and to concentrate instead on how this
   model works in common cases, and on how it can be expanded to other
   use cases.  [Edt.  Similarly, some of the wording is not as precise
   and accurate as might be possible.  Although of course this is very
   important, it might be useful to postpone definition issues
   temporarily where possible in order to concentrate on the
   framework.]

   After the following Definitions, two short sections introduce key
   concepts.  The body of the text comprises three sections that deal
   in turn with stream content, choosing streams, and an
   implementation example.  The media provider and media consumer
   behavior are described in separate sections as well.  Several
   appendices describe further details for using the framework.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3.  Definitions

   The definitions marked with an "*" are new; all the others are from
   draft-wenger-clue-definitions-00-01.txt.

   *Audio Capture: Media Capture for audio.  Denoted as ACn.

   Capture Device: A device that converts audio and video input into
   an electrical signal, in most cases to be fed into a media encoder.
   Cameras and microphones are examples of capture devices.

   Capture Scene: the scene that is captured by a collection of
   Capture Devices.  A Capture Scene may be represented by more than
   one type of Media.  A Capture Scene may include more than one Media
   Capture of the same type.  An example of a Capture Scene is the
   video image of a group of people seated next to each other, along
   with the sound of their voices, which could be represented by some
   number of VCs and ACs.  A middle box may also express Capture
   Scenes that it constructs from Media streams it receives.

   *Capture Set: includes Media Captures that all represent some
   aspect of the same Capture Scene.  The items (rows) in a Capture
   Set represent different alternatives for representing the same
   Capture Scene.

   Conference: used as defined in [RFC4353], A Framework for
   Conferencing within the Session Initiation Protocol (SIP).

   *Encoding Group: A set of encoding parameters representing one or
   more media encoders.  An Encoding Group describes constraints on
   encoding parameters used for mapping Media Captures to encoded
   Streams.
   Endpoint: The logical point of final termination through receiving,
   decoding and rendering, and/or initiation through capturing,
   encoding, and sending of media streams.  An endpoint consists of
   one or more physical devices which source and sink media streams,
   and exactly one [RFC4353] Participant (which, in turn, includes
   exactly one SIP User Agent).  In contrast to an endpoint, an MCU
   may also send and receive media streams, but it is neither the
   initiator nor the final terminator in the sense that Media is
   Captured or Rendered.  Endpoints can be anything from multiscreen/
   multicamera rooms to handheld devices.

   Endpoint Characteristics: include placement of Capture and
   Rendering Devices, capture/render angle, resolution of cameras and
   screens, spatial location and mixing parameters of microphones.
   Endpoint characteristics are not specific to individual media
   streams sent by the endpoint.

   Left: to be interpreted as a stage direction, see also
   [StageDirection(Wikipedia)] (Edt. note: needs more clarification)

   MCU: Multipoint Control Unit (MCU) - a device that connects two or
   more endpoints together into one single multimedia conference
   [RFC5117].  An MCU includes an [RFC4353] Mixer.  Edt.  Note:
   RFC4353 is tardy in requiring that media from the mixer be sent to
   EACH participant.  I think we have practical use cases where this
   is not the case.  But the bug (if it is one) is in 4353 and not
   herein.

   Media: Any data that, after suitable encoding, can be conveyed over
   RTP, including audio, video or timed text.

   *Media Capture: a source of Media, such as from one or more Capture
   Devices.  A Media Capture may be the source of one or more Media
   streams.  A Media Capture may also be constructed from other Media
   streams.  A middle box can express Media Captures that it
   constructs from Media streams it receives.

   *Media Consumer: an Endpoint or middle box that receives Media
   streams

   *Media Provider: an Endpoint or middle box that sends Media streams

   Model: a set of assumptions a telepresence system of a given vendor
   adheres to and expects the remote telepresence system(s) also to
   adhere to.

   Right: to be interpreted as a stage direction, see also
   [StageDirection(Wikipedia)] (Edt. note: needs more clarification)

   Render: the process of generating a representation from a media,
   such as displayed motion video or sound emitted from loudspeakers.

   *Simultaneous Transmission Set: a set of media captures that can be
   transmitted simultaneously from a Media Sender.

   Spatial Relation: The arrangement in space of two objects, in
   contrast to relation in time or other relationships.  See also Left
   and Right.

   *Stream: RTP stream as in RFC 3550.

   Stream Characteristics: include media stream attributes commonly
   used in non-CLUE SIP/SDP environments (such as: media codec, bit
   rate, resolution, profile/level etc.) as well as CLUE specific
   attributes (which could include, for example and depending on the
   solution found, the ID or spatial location of the capture device a
   stream originates from).

   Telepresence: an environment that gives non co-located users or
   user groups a feeling of (co-located) presence - the feeling that a
   Local user is in the same room with other Local users and the
   Remote parties.  The inclusion of Remote parties is achieved
   through multimedia communication including at least audio and video
   signals of high fidelity.

   *Video Capture: Media Capture for video.  Denoted as VCn.
   Video composite: A single image that is formed from combining
   visual elements from separate sources.

4.  Two Necessary Functions

   In simplified terms, here is a description of the functions in a
   telepresence conference.

   1.   Capture media
   2.   FIGURE OUT WHICH MEDIA STREAMS TO SEND (CHOOSING STREAMS)
   3.   Encode it
   4.   ADD SOME NOTES (STREAM CONTENT)
   5.   Package it
   6.   Send it
   7.   Unpack it
   8.   Decode it
   9.   Understand the notes
   10.  Render the stream content according to the notes

   This gross oversimplification shows clearly that there are only two
   functions the CLUE protocol needs to accomplish - choose which
   streams the sender should send to the receiver, and add the right
   information to the streams that get sent.  The framework/model that
   we are presenting can be understood as addressing these two issues.

5.  Protocol Features

   Central to the framework are media stream providers and media
   stream consumers.  The provider's job is to advertise its
   capabilities (as described here) to the consumer, whose job it is
   to configure the provider's encoding capabilities (described
   below).

   Both providers and consumers can send and receive information; that
   is, we do not have one party exclusively as the sender and one as
   the receiver, but all parties have both sending and receiving parts
   to them.  Most devices function as both a media provider and a
   media consumer.  For two devices to communicate bidirectionally,
   with media flowing in both directions, both devices act as both a
   media provider and a media consumer.  The protocol exchange shown
   later in the "Choosing Streams" section, including hints,
   announcement and request messages, happens twice independently
   between the two bidirectional devices.

   For short we will sometimes refer to the media stream provider as
   the "sender" and the media stream consumer as the "receiver".  Both
   endpoints and MCUs, or more generally "middleboxes", can be media
   senders and receivers.

   The protocol resulting from the framework will be declarative
   rather than negotiated.  What this means here is that information
   is passed in either direction, but there is no formalized or
   explicit agreement between participants in the protocol.

6.  Stream Content

   This section describes the structure for communicating information
   between senders and receivers.  The figure below illustrates how
   the information to be communicated is organized.  Each construct is
   discussed in the sections below.  This diagram is for reference.

   Diagram for Stream Content

                             +---------------+
                             |               |
                             |  Capture Set  |
                             |               |
                             +-------+-------+
                         _..-'       |       `-._
                     _.-'            |           `-._
                 _.-'                |               `-._
   +----------------+      +----------------+      +----------------+
   | Media Capture  |      | Media Capture  |      | Media Capture  |
   | Audio or Video |      | Audio or Video |      | Audio or Video |
   +----------------+      +----------------+      +----------------+
                                .'      `.
                              .'          `.
                         ,--------.     ,------------.
                        ,' Encode `.   ,'              `.
                       (   Group    ) (   Attributes    )
                        `.        ,'   `.              ,'
                          `------'       `------------'

6.1.  Media capture

   A media capture (defined in the definitions) is a fundamental
   concept of the model.  Media can be captured in different ways, for
   example by various arrangements of cameras and microphones.  The
   model uses the terms "video capture" (VC) and "audio capture" (AC)
   to refer to sources of media streams.  To distinguish between
   multiple instances,
   they are numbered; for example, VC1, VC2, and VC3 could refer to
   three different video captures that can be used simultaneously.

   Media captures are dynamic.  They can come and go in a conference,
   and their parameters can change.  A sender can advertise a new list
   of captures at any time.  Both the media sender and media receiver
   can send their messages (i.e., capture set advertisements, stream
   configurations) any number of times during a call, and the other
   end is always required to act on any new information received
   (e.g., stopping streams it had previously configured that are no
   longer valid).

   A media capture can be a media source such as video from a specific
   camera, or it can be more conceptual, such as a composite image
   from several cameras, or an automatic, dynamically switched capture
   choosing from several cameras depending on who is talking or other
   factors.  A media capture is described by Attributes and associated
   with an Encode Group.  Audio and video captures are aggregated into
   Capture Sets.

6.2.  Attributes

   Audio and video capture attributes carry the information about
   streams and their relationships that a sender or receiver wants to
   communicate.  [Edt: We do not mean to duplicate SDP; if an SDP
   description can be used, great.]  The attributes of media streams
   refer to the current state of a stream, rather than to the
   capabilities of a video capture device, which are described in the
   encode capabilities, as described below.

   The mechanism of Attributes makes the framework extensible.
   Although we are defining some attributes now based on the most
   common use cases, new attributes can be added for new use cases as
   they arise.  If the model does not do something you want it to,
   chances are that defining an attribute will handle your case.

   We describe attributes by variables and their values.  The current
   attributes are listed below.  The variable is shown in parentheses,
   and the values follow after the colon:

   o  (Purpose): main audio, main video, presentation
   o  (Audio mixed): true, false
   o  (Audio Channel Format): linear array, mono, stereo, tbd
   o  (Audio linear position): integer 0 to 100
   o  (Video scale): integer indicating scale
   o  (Video composed): true, false
   o  (Video auto-switched): true, false

   The attributes listed here are discussed in Appendix A, in order to
   keep the emphasis of this draft on the overall approach rather than
   on the more specific details.

6.3.  Capture Set

   A sender describes its ability to send alternatives of media
   streams by defining capture sets.  A capture set is a list of media
   captures expressed in rows.  Each row of the capture set or list
   consists of either a single capture or a group of captures.  A
   group means the individual captures in the group are spatially
   related, and the order of the captures within the group, along with
   attribute values, defines the spatial ordering of the captures.
   Spatial relationships are discussed in detail in Appendix B.

   The items (rows) in a capture set represent different alternatives
   for representing the same Capture Scene.  For example, the
   following are alternative ways of capturing the same Capture Scene:
   two cameras each viewing half of a room, or one camera viewing the
   whole room, or one stream that automatically captures the person in
   the room who is currently speaking.
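   As an illustration of this structure only, the following minimal
   sketch shows one way a capture set could be represented in code.
   It is not part of the framework; the Python class and field names
   (MediaCapture, CaptureSet, and so on) are hypothetical, and the
   captures shown correspond to the example that follows.

      from dataclasses import dataclass, field

      @dataclass
      class MediaCapture:
          # A video capture (VCn) or audio capture (ACn).
          name: str                    # e.g. "VC0"
          media_type: str              # "audio" or "video"
          encoding_group: str          # e.g. "EG0"
          attributes: dict = field(default_factory=dict)

      @dataclass
      class CaptureSet:
          # Each row is one alternative for representing the same
          # Capture Scene: either a single capture or a spatially
          # ordered (left-to-right) group of captures.
          rows: list

      vc0 = MediaCapture("VC0", "video", "EG0", {"purpose": "main video"})
      vc1 = MediaCapture("VC1", "video", "EG1", {"purpose": "main video"})
      vc2 = MediaCapture("VC2", "video", "EG2", {"purpose": "main video"})
      vc3 = MediaCapture("VC3", "video", "EG1",
                         {"purpose": "main video", "auto-switched": True})
      ac0 = MediaCapture("AC0", "audio", "AEG0", {"purpose": "main audio"})

      # Rows: a spatially ordered three-camera group, an auto-switched
      # alternative, and the room audio.
      capture_set = CaptureSet(rows=[[vc0, vc1, vc2], [vc3], [ac0]])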
   Each row of the Capture Set contains either a single media capture
   or one group of media captures.  The following example shows a
   capture set for an endpoint media sender where:

   o  (VC0 - left camera capture, VC1 - center camera capture, VC2 -
      right camera capture)
   o  (VC3 - capture associated with the loudest speaker)
   o  (VC4 - zoomed out view of all people in the room)
   o  (AC0 - room audio)

   The first item in this capture set example is a group of video
   captures with a spatial relationship to each other.  VC1 is to the
   left of VC2, and VC0 is to the left of VC1.  VC3 and VC4 are other
   alternatives for capturing the same room in different ways.  The
   audio capture is included in the same capture set to indicate that
   AC0 is associated with those video captures, meaning the audio
   should be rendered along with the video in the same set.

   The idea is to have sets of captures that represent the same
   information ("information" in this context might be a set of people
   and their associated audio/video streams, or might be a
   presentation supplied by a laptop, perhaps with an accompanying
   audio commentary).  Spatial ordering of media captures is imposed
   here by the simplicity of a left to right ordering among media
   captures in a group in the set.

   A media receiver could choose one row of each media type (e.g.,
   audio and video) from a capture set.  For example, a three stream
   receiver could choose the first video row plus the audio row, while
   a single stream receiver could choose the second or third video row
   plus the audio row.  An MCU receiver might choose to receive
   multiple rows.

   The simultaneity groups and encoding groups discussed in the next
   section apply to media captures listed in capture sets.  The
   simultaneity groups and encoding groups MUST allow all the Media
   Captures in a particular group to be used simultaneously.

7.  Choosing Streams

   The following diagram shows the flow of information messages
   between a media provider and a media consumer.  The provider sends
   information about its capabilities (as specified in this section),
   then the consumer chooses which streams it wants, which we refer to
   as "configure".  Optionally, the consumer may send hints to the
   provider about its own capabilities, in which case the provider
   might tailor its announcements to the consumer.

   Diagram for Choosing Streams

      Media Receiver                          Media Sender
      --------------                          ------------
            |                                      |
            |------------- Hints ---------------->|
            |                                      |
            |<---- Capabilities (announce) -------|
            |                                      |
            |------ Configure (request) --------->|
            |                                      |

   In order for appropriate streams to be sent from senders to
   receivers, certain characteristics of the multiple streams must be
   understood by both senders and receivers.  Two separate aspects of
   streams suffice to describe the necessary information to be shared
   by senders and receivers.  The first aspect we call "physical
   simultaneity" and the other aspect we refer to as "encoding group".
   These are described in the following sections.

7.1.  Physical Simultaneity

   An endpoint or MCU can send multiple captures simultaneously.
   However, there may be constraints that limit which captures can be
   sent simultaneously with other captures.  Physical or device
   simultaneity refers to the fact that a device may not be able to be
   used in different ways at the same time.  This shapes the way that
   offers are made from the sender.
   The offers are made so that the receiver will choose one of several
   possible usages of the device.  This is easier to show with an
   example.

   Consider the example of a room system where there are three
   cameras, each of which can send a separate capture covering two
   persons each - VC0, VC1, VC2.  The middle camera can also zoom out
   and show all six persons, VC3.  But the middle camera cannot be
   used in both modes at the same time - it has to show either the
   space where two participants sit or the whole six seats.  We refer
   to this as a physical device simultaneity constraint.

   The following illustration shows three cameras with four video
   streams.  The middle camera can be used as main video zoomed in on
   two people, or it can be used in zoomed out mode to capture the
   whole endpoint.  The idea here is that the middle camera cannot be
   used for both the zoomed in and zoomed out captures simultaneously.
   This is a constraint imposed by the physical limitations of the
   devices.

   Diagram for Simultaneity

      +----------+      VC2
      | Camera 3 |---------->
      +----------+

      +----------+      VC1
      |          |---------->
      | Camera 2 |      VC3
      |          |---------->
      +----------+

      +----------+      VC0
      | Camera 1 |---------->
      +----------+

      VC0 - video zoomed in on 2 people
      VC1 - video zoomed in on 2 people
      VC2 - video zoomed in on 2 people
      VC3 - video zoomed out on 6 people

   Simultaneous transmission sets can be expressed as sets of the VCs
   that could physically be transmitted at the same time, though it
   may not make sense to do so.  In this example the two simultaneous
   sets are:

   o  {VC0, VC1, VC2}
   o  {VC0, VC3, VC2}

   In this example either VC0, VC1 and VC2 can be sent, or VC0, VC3
   and VC2.  Only one set can be transmitted at a time.  These are
   physical capabilities describing what can physically be sent at the
   same time, not what might make sense to send.  For example, in the
   second set both VC0 and VC2 are redundant if VC3 is included.

   In describing its capabilities, the provider must take physical
   simultaneity into account and send a list of its simultaneity
   groups to the consumer.

7.2.  Encoding Groups

   The second aspect of multiple streams that must be understood by
   senders and receivers in order to create the best experience
   possible, i.e., for the "right" or "best" streams to be sent, is
   the encoding characteristics of the possible streams that can be
   sent.  Just as constraints are imposed on the multiple streams by
   physical limitations, there are also constraints due to encoding
   limitations.  These are described in an Encoding Group as follows.

   An encoding group is an attribute of a video capture (VC) as
   discussed above.  An encoding group has the variables shown in the
   following table.
   +--------------+----------------------------------------------------+
   | Name         | Description                                        |
   +--------------+----------------------------------------------------+
   | maxBandwidth | Maximum number of bits per second relating to a    |
   |              | single video encoding                              |
   | maxMbps      | Maximum number of macroblocks per second relating  |
   |              | to a single video encoding: ((width + 15) / 16) *  |
   |              | ((height + 15) / 16) * framesPerSecond             |
   | maxWidth     | Video resolution's maximum supported width,        |
   |              | expressed in pixels                                |
   | maxHeight    | Video resolution's maximum supported height,       |
   |              | expressed in pixels                                |
   | maxFrameRate | Maximum supported frame rate                       |
   +--------------+----------------------------------------------------+

   An encoding group is the basic method of describing encoding
   capability.  There may be multiple encoding groups per endpoint.
   For example, each video capture device might have an associated
   encoding group that describes the video streams that can result
   from that capture.  An encoding group EG comprises one or more
   potential encodings ENC.  For example,

   EG0: maxMbps=489600, maxBandwidth=6000000
        VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                    maxMbps=244800, maxBandwidth=4000000
        VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                    maxMbps=244800, maxBandwidth=4000000
        AUDIO_ENC0: maxBandwidth=96000
        AUDIO_ENC1: maxBandwidth=96000
        AUDIO_ENC2: maxBandwidth=96000

   Here, the encoding group is EG0.  It can transmit up to two 1080p30
   encodings (maxMbps for 1080p30 = 244800), although each encoding is
   capable of a maxFrameRate of 60 frames per second (fps).  To
   achieve the maximum resolution (1920 x 1088) the frame rate is
   limited to 30 fps.  However, 60 fps can be achieved at a lower
   resolution if required by the receiver.  Although the encoding
   group is capable of transmitting up to 6 Mbit/s, no individual
   video encoding can exceed 4 Mbit/s.

   This encoding group also allows up to three audio encodings,
   AUDIO_ENC<0-2>.  It is not required that audio and video encodings
   reside within the same encoding group, but if they do then the
   group's overall maxBandwidth value is a limit on the sum of all
   audio and video encodings configured by the receiver.  A system
   that does not wish or need to combine bandwidth limitations in this
   way should instead use separate encoding groups for audio and
   video, so that the bandwidth limitations on audio and video do not
   interact.

   Here is the same example written with separate audio and video
   encoding groups.

   VIDEO_EG0: maxMbps=489600, maxBandwidth=6000000
        VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                    maxMbps=244800, maxBandwidth=4000000
        VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                    maxMbps=244800, maxBandwidth=4000000

   AUDIO_EG0: maxBandwidth=500000
        AUDIO_ENC0: maxBandwidth=96000
        AUDIO_ENC1: maxBandwidth=96000
        AUDIO_ENC2: maxBandwidth=96000

   The following two sections describe further examples of encoding
   groups.  In the first example, the capability parameters are the
   same across ENCs.  In the second example, they vary.
7.2.1.  Sample video encoding group specification #1

   An endpoint that has three similar video capture devices would
   advertise three encoding groups that can each transmit up to two
   1080p30 encodings, as follows:

   EG0: maxMbps = 489600, maxBandwidth=6000000
        ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
              maxMbps=244800, maxBandwidth=4000000
        ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
              maxMbps=244800, maxBandwidth=4000000

   EG1: maxMbps = 489600, maxBandwidth=6000000
        ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
              maxMbps=244800, maxBandwidth=4000000
        ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
              maxMbps=244800, maxBandwidth=4000000

   EG2: maxMbps = 489600, maxBandwidth=6000000
        ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
              maxMbps=244800, maxBandwidth=4000000
        ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
              maxMbps=244800, maxBandwidth=4000000

   A remote receiver configures some or all of the specific encodings
   such that:

   o  The configured parameter values of each active ENC do not cause
      that encoding's maxWidth, maxHeight or maxFrameRate to be
      exceeded
   o  The total bandwidth of the configured ENC encodings does not
      exceed the maxBandwidth of the encoding group
   o  The sum of the "macroblocks per second" values of each
      configured encoding does not exceed the maxMbps of the encoding
      group

   There is no requirement for all encodings within an encoding group
   to be activated when configured by the receiver.

7.2.2.  Sample video encoding group specification #2

   The same endpoint could instead advertise three encoding groups
   whose two potential encodings differ in capability: each group can
   transmit one encoding at up to full 1080p resolution alongside one
   lower-resolution 720p encoding [edt. the specific parameter values
   in this example are illustrative]:

   EG0: maxMbps = 489600, maxBandwidth=6000000
        ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
              maxMbps=244800, maxBandwidth=4000000
        ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
              maxMbps=108000, maxBandwidth=2000000

   EG1: maxMbps = 489600, maxBandwidth=6000000
        ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
              maxMbps=244800, maxBandwidth=4000000
        ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
              maxMbps=108000, maxBandwidth=2000000

   EG2: maxMbps = 489600, maxBandwidth=6000000
        ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
              maxMbps=244800, maxBandwidth=4000000
        ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
              maxMbps=108000, maxBandwidth=2000000

   The same configuration rules as in the previous example apply: the
   receiver may activate some or all of the encodings, within both the
   per-encoding limits and the group-wide maxBandwidth and maxMbps
   limits.

   Depending on the sender's encoding methods, the receiver may be
   able to request fixed encode values or choose encode values in the
   range less than the maximum offered.
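   To make these configuration rules concrete, here is a minimal
   sketch of the arithmetic involved, using the sample specification
   above.  The function names and dictionary layout are illustrative
   assumptions, not a defined CLUE syntax.

      def mbps(width, height, frames_per_second):
          # Macroblocks per second, per the maxMbps definition above.
          return (((width + 15) // 16) * ((height + 15) // 16)
                  * frames_per_second)

      def valid_configuration(group, configured):
          # 'group' holds the advertised limits; 'configured' lists the
          # (enc, width, height, frame_rate, bandwidth) values the
          # receiver wants for each encoding it activates.
          total_bw = 0
          total_mbps = 0
          for enc, w, h, rate, bw in configured:
              lim = group["encodings"][enc]
              # Per-encoding limits (first rule).
              if (w > lim["maxWidth"] or h > lim["maxHeight"]
                      or rate > lim["maxFrameRate"]
                      or bw > lim["maxBandwidth"]
                      or mbps(w, h, rate) > lim["maxMbps"]):
                  return False
              total_bw += bw
              total_mbps += mbps(w, h, rate)
          # Group-wide bandwidth and macroblock limits (second and
          # third rules).
          return (total_bw <= group["maxBandwidth"]
                  and total_mbps <= group["maxMbps"])

      enc_1080p = {"maxWidth": 1920, "maxHeight": 1088,
                   "maxFrameRate": 60, "maxMbps": 244800,
                   "maxBandwidth": 4000000}
      eg0 = {"maxMbps": 489600, "maxBandwidth": 6000000,
             "encodings": {"ENC0": dict(enc_1080p),
                           "ENC1": dict(enc_1080p)}}

      # Two 1080p30 encodings fit within EG0's group-wide limits.
      assert valid_configuration(eg0, [("ENC0", 1920, 1088, 30, 4000000),
                                       ("ENC1", 1920, 1088, 30, 2000000)])
      # A single 1080p60 encoding exceeds ENC0's per-encoding maxMbps.
      assert not valid_configuration(eg0, [("ENC0", 1920, 1088, 60,
                                            4000000)])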
   We will discuss receiver behavior in more detail in a section
   below.

8.  Media provider behavior

   In summary, the sender's capability announcement includes:

   o  the list of captures and their attributes
   o  the list of capture sets
   o  the list of physical simultaneity groups
   o  the list of the encoding groups

9.  Putting it together - using the Capture Set

   This section shows how to use the framework to represent a typical
   case for telepresence rooms.  Appendix C includes an additional
   example showing the MCU case.  [Edt.  It is in the Appendix just to
   allow the body of the document to focus on the basic ideas.  It can
   be brought in to the main text in a later draft.]

   Consider an endpoint with the following characteristics:

   o  3 cameras, 3 displays, a 6 person table
   o  Each video device can provide one capture for each 1/3 section
      of the table
   o  A single capture representing the active speaker can be provided
   o  A single capture representing the active speaker with the other
      2 captures shown picture in picture within the stream can be
      provided
   o  A capture showing a zoomed out view of all 6 seats in the room
      can be provided

   The audio and video captures for this endpoint can be described as
   follows.  The Encode Group specifications can be found above in
   Section 7.2.2, Sample video encoding group specification #2.

   Video Captures:

   1.  VC0 - (the left camera stream), encoding group: EG0,
       attributes: purpose=main; auto-switched=no
   2.  VC1 - (the center camera stream), encoding group: EG1,
       attributes: purpose=main; auto-switched=no
   3.  VC2 - (the right camera stream), encoding group: EG2,
       attributes: purpose=main; auto-switched=no
   4.  VC3 - (the loudest panel stream), encoding group: EG1,
       attributes: purpose=main; auto-switched=yes
   5.  VC4 - (the loudest panel stream with PiPs), encoding group:
       EG1, attributes: purpose=main; composed=true; auto-switched=yes
   6.  VC5 - (the zoomed out view of all people in the room), encoding
       group: EG1, attributes: purpose=main; auto-switched=no
   7.  VC6 - (presentation stream), encoding group: EG1, attributes:
       purpose=presentation; auto-switched=no

   Summary of video captures - 3 codecs; the center one is used for
   the center camera stream, the presentation stream, the
   auto-switched streams, and the zoomed views.  [edt.  It is
   arbitrary that for this example the alternative views are on EG1 -
   they could have been spread out - it was not a necessary choice.]

   Audio Captures:

   o  AC0 (left), attributes: purpose=main; channel format=linear
      array; linear position=0
   o  AC1 (right), attributes: purpose=main; channel format=linear
      array; linear position=100
   o  AC2 (center), attributes: purpose=main; channel format=linear
      array; linear position=50
   o  AC3, a simple pre-mixed audio stream from the room (mono),
      attributes: purpose=main; channel format=linear array; linear
      position=50; mixed=true
   o  AC4, the audio stream associated with the presentation video
      (mono), attributes: purpose=presentation; channel format=linear
      array; linear position=50

   The physical simultaneity information is:

      {VC0, VC1, VC2, VC3, VC4, VC6}

      {VC0, VC2, VC5, VC6}

   Any selection within one set can physically be transmitted at the
   same time.  This is strictly what is possible from the devices.
   However, using every member in a set simultaneously may not make
   sense - for example VC3 (loudest) and VC4 (loudest with PiPs).
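   A consumer's choice of video captures can be checked against these
   simultaneity sets with a simple subset test.  A sketch
   (illustrative only):

      simultaneous_sets = [{"VC0", "VC1", "VC2", "VC3", "VC4", "VC6"},
                           {"VC0", "VC2", "VC5", "VC6"}]

      def transmittable(chosen):
          # The chosen captures are physically transmittable only if
          # some single simultaneous transmission set contains them
          # all.
          return any(chosen <= s for s in simultaneous_sets)

      assert transmittable({"VC0", "VC1", "VC2"})  # three-camera row
      assert not transmittable({"VC1", "VC5"})     # middle camera conflict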
   (In addition, there are encoding constraints that make choosing all
   of the VCs in a set impossible: VC1, VC3, VC4, VC5 and VC6 all use
   EG1, and EG1 has only two ENCs.  This constraint shows up in the
   capture list, not in the physical simultaneity list.)

   In this example there are no restrictions on which audio captures
   can be sent simultaneously.

   The following table represents the capture sets for this sender.
   Recall that a capture set is composed of alternative captures
   covering the same scene.  Capture Set #1 is for the main people
   captures, and Capture Set #2 is for presentation.

      +----------------+
      | Capture Set #1 |
      +----------------+
      | VC0, VC1, VC2  |
      | VC3            |
      | VC4            |
      | VC5            |
      | AC0, AC1, AC2  |
      | AC3            |
      +----------------+

      +----------------+
      | Capture Set #2 |
      +----------------+
      | VC6            |
      | AC4            |
      +----------------+

   Different capture sets are unique to each other and
   non-overlapping.  A receiver chooses a capture row from each
   capture set.  In this case the three captures VC0, VC1, and VC2 are
   one way of representing the video from the endpoint.  These three
   captures should appear adjacent to each other.  Alternatively,
   another way of representing the Capture Scene is with the capture
   VC3, which automatically shows the person who is talking.
   Similarly for the VC4 and VC5 alternatives.

   As in the video case, the different rows of audio in Capture Set #1
   represent the "same thing", in that one way to receive the audio is
   with the three linear position audio captures (AC0, AC1, AC2), and
   another way is with the single-channel monaural format AC3.  The
   Media Consumer would choose the one audio capture row it is capable
   of receiving.

   The spatial ordering is understood from the left to right ordering
   among the VC<n>s on the same row of the table.

   The receiver finds a "row" in each capture set section of the table
   that it wants.  It configures the streams according to the encoding
   group for the row.

   A Media Receiver would likely want to choose a row to receive based
   in part on how many streams it can simultaneously receive.  A
   receiver that can receive three people streams would probably
   prefer to receive the first row of Capture Set #1 (VC0, VC1, VC2)
   and not receive the other rows.  A receiver that can receive only
   one people stream would probably choose one of the other rows.

   If the receiver can receive a presentation stream too, it would
   also choose to receive the only row from Capture Set #2 (VC6).

10.  Media consumer behaviour

   The receive side of a call needs to balance its requirements, based
   on number of screens and speakers, its decoding capabilities and
   available bandwidth, against the sender's capabilities in order to
   optimally configure the sender's streams.  Typically it would want
   to receive and decode media from each capture set advertised by the
   sender.

   A sane, basic algorithm might be for the receiver to go through
   each capture set in turn and find the collection of video captures
   that best matches the number of screens it has (this might include
   consideration of screens dedicated to presentation video display
   rather than "people" video), and then decide between alternative
   rows in the video capture sets based either on hard-coded
   preferences or on user choice.  Once this choice has been made, the
   receiver would then decide how to configure the sender's encode
   groups in order to make best use of the available network bandwidth
   and its own decoding capabilities.
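   A sketch of such a basic algorithm, reusing the illustrative
   CaptureSet representation from Section 6.3 (again, not a normative
   algorithm):

      def choose_video_rows(capture_sets, num_screens):
          # For each capture set, prefer the video row whose stream
          # count best fills the available screens without exceeding
          # them; fall back to the smallest row if every row has too
          # many streams.
          chosen = []
          for cset in capture_sets:
              video_rows = [row for row in cset.rows
                            if row[0].media_type == "video"]
              fitting = [row for row in video_rows
                         if len(row) <= num_screens]
              if fitting:
                  chosen.append(max(fitting, key=len))
              elif video_rows:
                  chosen.append(min(video_rows, key=len))
          return chosen

   Ties between equally sized rows (e.g., the VC3, VC4 and VC5 rows of
   Capture Set #1) would be broken by the hard-coded preferences or
   user choice mentioned above.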
10.1.  One screen receiver configuring the example capture-side
       device above

   A single screen receiver would choose one of the single-capture
   video rows of Capture Set #1 - VC3 (loudest panel), VC4 (loudest
   panel with PiPs) or VC5 (zoomed out view) - based on hard-coded
   preference or user choice, along with the mixed audio capture AC3,
   and configure one encoding in the corresponding encoding group
   (EG1) for it.  If it can also handle a presentation stream, it
   would additionally choose the row from Capture Set #2 (VC6, AC4).

10.2.  Two screen receiver configuring the example capture-side
       device above

   Mixing systems with an even number of screens ("2n") and systems
   with an odd number of cameras ("2n+1"), and vice versa, is always
   likely to be the problematic case.  In this instance, the behaviour
   is likely to be determined by whether a "2 screen" system is really
   a "2 decoder" system, i.e., whether only one received stream can be
   displayed per screen or whether more than 2 streams can be received
   and spread across the available screen area.  To enumerate 3
   possible behaviours here for the 2 screen system when it learns
   that the far end is "ideally" expressed via 3 capture streams:

   1.  Fall back to receiving just a single stream (VC3, VC4 or VC5 as
       per the 1 screen receiver case above) and either leave one
       screen blank or use it for presentation if / when a
       presentation becomes active.

   2.  Receive 3 streams (VC0, VC1 and VC2) and display them across 2
       screens, either with each capture being scaled to 2/3 of a
       screen and the centre capture being split across the 2 screens,
       or, as would be necessary if there were large bezels on the
       screens, with each stream being scaled to 1/2 the screen width
       and height and there being a 4th "blank" panel.  This 4th panel
       could potentially be used for any presentation that became
       active during the call.

   3.  Receive 3 streams, decode all 3, and use control information
       indicating which was the most active to switch between showing
       the left and centre streams (one per screen) and the centre and
       right streams.

   For an endpoint capable of all 3 methods of working described
   above, it might again be appropriate to offer the user the choice
   of display mode.
10.3.  Three screen receiver configuring the example capture-side
       device above

   This is the most straightforward case - the receiver would look to
   identify a set of streams to receive that best matches its
   available screens, and so the VC0 plus VC1 plus VC2 row should
   match optimally.  The spatial ordering would give sufficient
   information for the correct video capture to be shown on the
   correct screen, and the receiver would either need to divide a
   single encode group's capability by 3 to determine what resolution
   and frame rate to configure the sender with, or to configure the
   individual video captures' encode groups with whatever makes most
   sense (taking into account the receive side decode capabilities,
   overall call bandwidth, the resolution of the screens, plus any
   user preferences such as motion vs. sharpness).

10.4.  Configuration of sender streams by a receiver

   After receiving a set of video capture information from a sender
   and making its choice of what media streams to receive based on the
   receiver's own capabilities and any sender-side simultaneity
   restrictions, the receiver needs to essentially configure the
   sender to transmit the chosen set.  The expectation is that this
   message will enumerate each of the encoding groups and the
   potential encoders within those groups that the receiver wishes to
   be active (this may well be a subset of the complete set
   available).  For each such encoder within an encoding group, the
   receiver would specify the video capture (i.e., VC<n>) it wants
   that encoder to encode.
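   The framework does not yet define a concrete syntax for this
   configure message.  Purely as an illustration, the content of such
   a message for the one screen receiver of Section 10.1 might be
   represented as follows (all field names hypothetical):

      # Activate one encoder in EG1, bound to the auto-switched
      # capture VC4, plus one audio encoding; the other encodings stay
      # unconfigured, since not all encodings in a group need to be
      # activated.
      configure = {
          "EG1": {
              "ENC0": {"capture": "VC4", "width": 1920, "height": 1088,
                       "frameRate": 30, "bandwidth": 4000000},
          },
          "AUDIO_EG0": {
              "AUDIO_ENC0": {"capture": "AC3", "bandwidth": 96000},
          },
      }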
The linear position value is fixed until the receiver asks for a different AC from the capture set, which may be triggered by the provider sending an updated capture set. The streams being sent might be correlated (that is, someone talking might be heard in multiple captures from the same room). Echo cancellation and stream synchronization in receivers should take this into account. With three audio channels representing left, center, and right: AC0 - channel format = linear array; linear position = 0 AC1 - channel format = linear array; linear position = 50 AC2 - channel format = linear array; linear position = 100 A.3.2. Stereo An AC with channel format = "stereo" has exactly two audio channels, left and right, as part of the same AC. [Edt: should we mention RFC 3551 here? The channel format may be related to how Audio Captures are mapped to RTP streams. This stereo is not the same as the effect produced from two mono ACs one from the left and one from the right. ] A.3.3. Mono An AC with channel format="mono" has one audio channel. This can be represented by audio linear position with a single member at a single integer location. [Edt. Mono can be represented as an as a particular case of linear array (=1] A.4. Audio Linear Position An integer valued variable from 0 - 100, where 0 signifies the left and 100 signifies the right. Romanow, et al. Expires January 4, 2012 [Page 27] Internet-Draft CLUE Telepresence Framework July 2011 A.5. Video Scale An optional integer valued variable indicating the spatial scale of the video capture, for example centimeters for horizontal image width. A.6. Video composed An optional Boolean variable indicating if the VC is constructed by composing multiple other video captures together. stream incorporates multiple composed panes (This could indicate for example a continuous presence view of multiple images in a grid, or a large image with smaller picture-in-picture images in it.) A.7. Video Auto-switched A Boolean variable. In this case the offered VC varies depending on some rule; it is auto-switched between possible VCs. The most common example of this is sending the video capture associated with the "loudest" speaker according to an audio detection algorithm. Appendix B. Spatial Relationship Here is an example of a simple capture set with three video captures and three audio channels, each in a separate row: (VC0, VC1, VC2) (AC0, AC1, AC2) The three ACs together in a row indicate those channels are spatially related to each other, and spatially related to the VCs in the same capture set. Multiple Media Captures of the same media type are often spatially related to each other. Typically multiple Video Captures should be rendered next to each other in a particular order, or multiple audio channels should be rendered to match different speakers in a particular way. Also, media of different types are often associated with each other, for example a group of Video Captures can be associated with a group of Audio Captures meaning they should be rendered together. Media Captures of the same media type are associated with each other by grouping them together in a single row of a Capture Set. Media Captures of different media types are associated with each other by putting them in different rows of the same Capture Set. Romanow, et al. Expires January 4, 2012 [Page 28] Internet-Draft CLUE Telepresence Framework July 2011 For video the spatial relationship is horizontal adjacency in one dimension. 
A.3.2.  Stereo

   An AC with channel format = "stereo" has exactly two audio
   channels, left and right, as part of the same AC.  [Edt: should we
   mention RFC 3551 here?  The channel format may be related to how
   Audio Captures are mapped to RTP streams.  This stereo is not the
   same as the effect produced from two mono ACs, one from the left
   and one from the right.]

A.3.3.  Mono

   An AC with channel format = "mono" has one audio channel.  This can
   be represented by audio linear position with a single member at a
   single integer location.  [Edt.  Mono can be represented as a
   particular case of linear array (with a single position).]

A.4.  Audio Linear Position

   An integer valued variable from 0 to 100, where 0 signifies the
   left and 100 signifies the right.

A.5.  Video Scale

   An optional integer valued variable indicating the spatial scale of
   the video capture, for example centimeters for horizontal image
   width.

A.6.  Video composed

   An optional Boolean variable indicating whether the VC is
   constructed by composing multiple other video captures together,
   i.e., the stream incorporates multiple composed panes.  (This could
   indicate for example a continuous presence view of multiple images
   in a grid, or a large image with smaller picture-in-picture images
   in it.)

A.7.  Video Auto-switched

   A Boolean variable.  When true, the offered VC varies depending on
   some rule; it is auto-switched between possible VCs.  The most
   common example of this is sending the video capture associated with
   the "loudest" speaker according to an audio detection algorithm.

Appendix B.  Spatial Relationship

   Here is an example of a simple capture set with three video
   captures and three audio channels, each in a separate row:

      (VC0, VC1, VC2)
      (AC0, AC1, AC2)

   The three ACs together in a row indicate those channels are
   spatially related to each other, and spatially related to the VCs
   in the same capture set.

   Multiple Media Captures of the same media type are often spatially
   related to each other.  Typically multiple Video Captures should be
   rendered next to each other in a particular order, or multiple
   audio channels should be rendered to match different speakers in a
   particular way.  Also, media of different types are often
   associated with each other, for example a group of Video Captures
   can be associated with a group of Audio Captures, meaning they
   should be rendered together.  Media Captures of the same media type
   are associated with each other by grouping them together in a
   single row of a Capture Set.  Media Captures of different media
   types are associated with each other by putting them in different
   rows of the same Capture Set.

   For video, the spatial relationship is horizontal adjacency in one
   dimension.  So Video Captures can be described as being adjacent to
   each other, in a horizontal row, ordered left to right.  When VCs
   are grouped together in a capture set row, it means they are
   horizontally adjacent to each other, such that when more than one
   of them are rendered together they should be rendered next to each
   other in the proper order.  The first VC in the group is the
   leftmost (from the point of view of a person looking at the
   rendered images), and so on towards the right.  [Edt: Additional
   attributes can be added, such as the ability to handle a two
   dimensional array instead of just a one dimensional row of video
   images.]

   Audio Captures that are in the same Capture Set with Video Captures
   are related to them spatially, such that the multiple audio
   channels should be rendered so that the overall audio field covers
   roughly the same horizontal extent as the rendered video.  This
   gives a reasonable spatial correlation between audio and video.  A
   more exact relationship is out of scope of this framework.

B.1.  Spatial relationship of audio with video

   A row of audio is spatially related to a row of video in the same
   capture set.  The audio and video should be rendered such that they
   appear spatially coincident.  Audio with a linear position of 0
   corresponds to the leftmost side of the group of VCs in the same
   capture set.  Audio with a linear position of 50 corresponds to the
   center of the group of VCs.  Audio with a linear position of 100
   corresponds to the rightmost side of the group of VCs.  Likewise,
   for stereo audio, the spatial extent of the audio should be
   coincident with the spatial extent of the corresponding video.

Appendix C.  Capture sets for the MCU Case

   This shows how an MCU might express its Capture Sets, intending to
   offer different choices for receivers that can handle different
   numbers of streams.  A single audio capture stream is provided for
   all single and multi-screen configurations; it can be associated
   (e.g., lip-synced) with any combination of video captures at the
   receiver.

   +--------------------+---------------------------------------------+
   | Capture Set #1     | note                                        |
   +--------------------+---------------------------------------------+
   | VC0                | video capture for single screen receiver    |
   | VC1, VC2           | video capture for 2 screen receiver         |
   | VC3, VC4, VC5      | video capture for 3 screen receiver         |
   | VC6, VC7, VC8, VC9 | video capture for 4 screen receiver         |
   | AC0                | audio capture representing all participants |
   +--------------------+---------------------------------------------+

   If / when a presentation stream becomes active within the
   conference, the MCU might re-advertise the available media as:

   +----------------+--------------------------------------+
   | Capture Set #2 | note                                 |
   +----------------+--------------------------------------+
   | VC10           | video capture for presentation       |
   | AC1            | presentation audio to accompany VC10 |
   +----------------+--------------------------------------+

Authors' Addresses

   Allyn Romanow
   Cisco Systems
   San Jose, CA  95134
   USA

   Email: allyn@cisco.com

   Mark Duckworth
   Polycom
   Andover, MA  01810
   US

   Email: mark.duckworth@polycom.com

   Andrew Pepperell
   Cisco Systems
   Langley, England
   UK

   Email: apeppere@cisco.com
   Brian Baldino
   Cisco Systems
   San Jose, CA  95134
   US

   Email: bbaldino@cisco.com

   Mark Gorzynski
   HP Visual Collaboration
   Corvallis, OR
   USA

   Email: mark.gorzynski@hp.com