CLUE WG A. Romanow Internet-Draft Cisco Systems Intended status: Informational March 5, 2011 Expires: September 6, 2011 Requirements for Telepresence Multi-Streams draft-romanow-clue-telepresence-requirements-00.txt Abstract This document discusses the problems, assumptions, and requirements for handling multiple media streams to enable Telepresence interoperability. Definitions are also covered here as well. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on September 6, 2011. Copyright Notice Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Romanow Expires September 6, 2011 [Page 1] Internet-Draft CLUE Telepresence Requirements March 2011 Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 3. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 3 4. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 5 5. Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . 6 6. Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 7 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 10 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10 9. Security Considerations . . . . . . . . . . . . . . . . . . . 10 10. Informative References . . . . . . . . . . . . . . . . . . . . 10 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 10 Romanow Expires September 6, 2011 [Page 2] Internet-Draft CLUE Telepresence Requirements March 2011 1. Introduction Telepresence systems greatly enable collaboration; however currently different systems do not interoperate, as discussed in detail in the Problem Statement section below. The approach taken here to provide interoperability between IETF protocol based systems, is to exchange adequate information about participating sites so that interworking is possible. The purpose of this document is to set out the assumptions and requirements for a specification that enables interworking between different SIP-based [RFC3261] telepresence systems by exchanging and negotiating appropriate information. [more] The document structure is as follows: Definitions are discussed, followed by a description of the problem of telepresence interoperability. Then assumptions and requirements are enumerated and discussed. 2. Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119]. 3. Definitions Audio mixing - multiple audio captures in a single stream. See RTP Topologies, [RFC5117]. Capture device - A device that captures input from the environment (audio or video) and converts it to a digital form. Usually a camera or a microphone in a room. Conceptual stream - A media stream that is determined by some defined policy, such as the loudest speaker, or the last speaker. Contrast conceptual stream with specific stream - for example, show me the loudest speaker vs. show me the guy on the farthest right. Conference - defined in [RFC4353], A Framework for Conferencing within the Session Initiation Protocol (SIP). Romanow Expires September 6, 2011 [Page 3] Internet-Draft CLUE Telepresence Requirements March 2011 Endpoint - The end system in a telepresence conference, which sources and sinks media streams. Distinct from a middle box that may also send and receive media. Used interchangeably with room. Layout - how media streams are rendered on a single screen and how sources are positioned on multiple screens. See also source placement, Participant - a person in a telepresence conference. Region - a basic unit of space. The capture range of audio and video are associated with a region. It is possible for a capture device (e. g., a camera) to cover multiple regions. Remote - Sender or receiver on the other side, not local. It could be either an endpoint or a middle box. Render - generate a representation, an image or a sound. Render device - usually a display or speaker in a room. Room - an endpoint in a telepresence conference. Does not have to be a physical room. Session - RTP session. An association among a set of participants communicating with RTP. The distinguishing feature of an RTP session is that each maintains a full, separate space of SSRC identifiers. [RFC3550] Source placement (aka layout) - Deciding where to render streams. Includes how spatial relations are represented. Relates to both audio and video sources. Source selection - Deciding which streams should be rendered. Closely related to layout, source placement. Relates to both video and audio sources. Spatial relations - How objects are situated in relationship to each other. Video composite - A single image that is formed from combining visual elements from separate sources. Romanow Expires September 6, 2011 [Page 4] Internet-Draft CLUE Telepresence Requirements March 2011 4. Problem Statement In a Telepresence conference, the idea is to create a feeling of presence - that you are in the same room with the remote parties. In order to create the "being there" or telepresence experience, a number of technical issues need to be solved. These issues are addressed by manipulating multiple media streams, video and audio - by describing them, controlling them, and negotiating about them. The fundamental features of telepresence require handling multiple streams of media and considering additional characteristics of those streams beyond those normally specified in existing videoconferencing standards, such as described in the Use Case draft. [ref] Different telepresence systems approach solving the basic issues differently. They use disparate techniques, and they describe, control and negotiate media in dissimilar fashions. Such diversity creates an interoperability problem. The same issues are solved in different ways by different systems, so that they are not directly interoperable. This makes interworking difficult at best and sometimes impossible. Some degree of interworking is possible through transcoding and translation. This requires additional devices, which are expensive and not entirely automatic. Specialized knowledge is required to operate a telepresence conference where the endpoints use different equipment and a transcoding and translating device is employed for interoperability. Often such conferences are interrupted by difficulties that arise. The general problem that needs to be solved is this. The transmitting side sends audio and video streams based upon a model for rendering a realistic depiction from this information. If the receiving side belongs to the same vendor, it works with the same model and renders the information according to that shared model. However, if the receiver and the sender are from different vendors, the models they each have for rendering presence differ. It is as if Alice and Bob are at different sites. Alice needs to tell Bob information about what her camera and sound equipment see at her site so that Bob's receiver can create a display that will capture the important characteristics of her site. Alice and Bob need to agree on what the salient characteristics are as well as how to represent and communicate them. The telepresence multi-steam work seeks to describe the sender situation in a way that allows the receiver to render it realistically though it may have a different rendering model than the sender; and for the receiver to provide information to the sender in order to help the sender create adequate content for interworking. Romanow Expires September 6, 2011 [Page 5] Internet-Draft CLUE Telepresence Requirements March 2011 5. Assumptions ASMP-1: Endpoints sending spatially related streams should make sure they are disjoint and arranged as announced. ASMP-2: Endpoints rendering spatially related streams should render them according to the source description of the streams. ASMP-3: Layout can be done locally, remotely, or by a combination of the two. ASMP-4: Layout decisions can be made by various mechanisms, such as an algorithm, administratively determined, or local user- based. ASMP-5: Layout can be static, fixed at call setup, or dynamic, changing during runtime, or both. ASMP-6: Different systems do things differently. It is not the intention of CLUE to dictate Telepresence architectural and implementation choices. Rather CLUE enables different systems to interwork by exchanging information that systems can use to be interoperable. * CLUE provides information and the receiver will display it to best advantage. What the receiver does is up to it, not specified by the standard. The sender can give a hint but it is primarily up to receiver how it renders the stream. ASMP-7: It is not mandatory to support all of the features that can be negotiated. ASMP-8: Mechanisms and features necessary for interoperability that do not concern handling multiple RTP media streams are not considered in this standard, although they may be considered by the working group. For example any call setup issues might be discussed in CLUE and handled by an appropriate SIP WG, not here. ASMP-9: For decoders, if a particular codec is advertised, it is assumed that lower resolution codecs are also supported. For example if 1080p30 is advertised, the inference is that 720p30 is also supported. Romanow Expires September 6, 2011 [Page 6] Internet-Draft CLUE Telepresence Requirements March 2011 6. Requirements Requirements are grouped by topic as much as possible. Media Streams: REQMT-1: Provide a means of describing the relationship of one stream to another stream, including spatial information, sender and receiver capabilities and preferences. [more?] Note: A middle box needs to indicate spatial relationships between multiple video composite streams, or mixed audio streams, that it constructs out of streams coming from multiple endpoints. REQMT-2: Support media streams being added or removed or modified throughout the course of the session REQMT-3: Support asymmetric streams, meaning the number of media streams sent from and received at different endpoints are different. For example, in a point to point conference, one side may have 2 cameras generating 2 video streams, and the other side may have 3 cameras, generating 3 video streams. REQMT-4: Specify a mechanism for the receiver to tell the sender what streams it wants to receive. REQMT-5: Support synchronization of multiple audio streams with multiple video streams, including when there is not an equivalent number of audio and video streams. REQMT-6: Support for H.264 baseline. Scale: REQMT-7: CLUE needs to support interoperability between an arbitrary number of media capture devices (e. g., microphones and cameras) and an arbitrary number of media rendering devices (e. g., screens and loudspeakers). [There should be a use case for large number of senders and receivers.] * Is there an upper number? For practicality in implementation, we will need to pick a number for but we don't want to limit the number theoretically. Romanow Expires September 6, 2011 [Page 7] Internet-Draft CLUE Telepresence Requirements March 2011 Negotiation: REQMT-8: Must support a mechanism for negotiation and re- negotiation during a call. In some cases, interaction is necessary, not just a one way flow of information. Diversity: REQMT-9: Heterogeneous devices must be supported in the same conference. For example, this includes large screens, software applications for multi-window conferencing; handheld devices; audio only devices. REQMT-10: A range of video resolutions need to be accommodated in the same conference to work with endpoints that have different capabilities. Similarly, endpoints with a range of audio qualities must be allowed. Classification: REQMT-11: Must specify and support content types necessary for interoperability. [RFC4796] is relevant. REQMT-12: Stream characteristics -- CLUE must specify characteristics of SDP streams necessary for interoperability not covered elsewhere. REQMT-13: Support dynamic changes in stream characteristics, e. g., resolution (tied to available bandwidth). Layout/source placement: REQMT-14: Must exchange appropriate information to enable the layout to be accomplished. REQMT-15: CLUE must support different models of where layout decisions are made - local and remote. * Local does the placement + Remote tells local what streams it has (content type, anything else?? What exactly? [to be filled in] + Local tells the remote which streams it wants Romanow Expires September 6, 2011 [Page 8] Internet-Draft CLUE Telepresence Requirements March 2011 * Remote does the placement + Local tells the remote what its displays are, what its receive positions are, how many streams it can receive + Remote specifies the placement for the local REQMT-16: Support for dynamic change of layout throughout the conference is necessary. Both stream announcements and stream requests might change during a conference, re- negotiation is mandatory. Video Source selection: What is rendered where changes throughout the conference. REQMT-17: Support for dynamic management of choosing which video captures should be rendered. * Different strategies are possible, often use active speaker to decide which video stream to display. * Could indicate priority on the capture stream. Using this would be negotiated. REQMT-18: Support is necessary for video source selection policies, or specified sets of rules, that decide layout. Examples of such sets of rules for selection include: administrative determination, user choice, loudest speaker, former speaker, PIP or not, continuous presence, room switched, segment switched. (Do we need a defined set of implementation profiles?) Audio Source Selection: REQMT-19: Support for dynamic management of choosing which audio captures should be rendered. REQMT-20: Must support various methods of speaker selection - such as Voice Activity Detection (VAD). User experience derived requirements: REQMT-21: Responsiveness. - CLUE must not add latency such that it goes over the bounds for a good user experience. Romanow Expires September 6, 2011 [Page 9] Internet-Draft CLUE Telepresence Requirements March 2011 7. Acknowledgements This draft has benefited substantially from thoughtful review by Espen Berger, Peter Musgrave, Mark Duckworth, Jim Cole, Paul Kyzivat, and Roni Even. It also benefits from Brian Rosen's helpful thoughts on the topic. 8. IANA Considerations TBD 9. Security Considerations TBD 10. Informative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP: Session Initiation Protocol", RFC 3261, June 2002. [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, July 2003. [RFC4353] Rosenberg, J., "A Framework for Conferencing with the Session Initiation Protocol (SIP)", RFC 4353, February 2006. [RFC4796] Hautakorpi, J. and G. Camarillo, "The Session Description Protocol (SDP) Content Attribute", RFC 4796, February 2007. [RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117, January 2008. Romanow Expires September 6, 2011 [Page 10] Internet-Draft CLUE Telepresence Requirements March 2011 Author's Address Allyn Romanow Cisco Systems San Jose, CA 95134 US Email: allyn@cisco.com Romanow Expires September 6, 2011 [Page 11]