DISPATCH WG                                                   A. Romanow
Internet-Draft                                                     Cisco
Intended status: Informational                                 S. Botzko
Expires: January 13, 2011                                        Polycom
                                                           July 12, 2010


            Problem Statement for Telepresence Multi-streams
       draft-romanow-dispatch-telepresence-prob-statement-01.txt

Abstract

   Telepresence systems create a "being there" conferencing experience.
   A number of issues need to be solved largely by manipulating multiple
   audio and video streams.  Different systems take different
   approaches, employ different techniques, and convey information by
   using different vocabularies, making interoperability extremely
   challenging.  This problem statement describes the typical issues
   that must be solved and uses examples to illustrate the kind of
   diversity that makes interworking problematic.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 13, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect


Romanow & Botzko        Expires January 13, 2011                [Page 1]

Internet-Draft       Telepresence Problem Statement            July 2010


   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  4
   3.  Fundamental Issues for Telepresence  . . . . . . . . . . . . .  4
   4.  Manipulating Media Streams . . . . . . . . . . . . . . . . . .  5
   5.  Examples of Interworking Issues  . . . . . . . . . . . . . . .  6
     5.1.  Designating Roles and Positions for transmitted streams  .  6
     5.2.  Multipoint . . . . . . . . . . . . . . . . . . . . . . . .  7
     5.3.  Capability Negotiation . . . . . . . . . . . . . . . . . .  9
     5.4.  Differences in Media Characteristics . . . . . . . . . . .  9
       5.4.1.  Aspect Ratio . . . . . . . . . . . . . . . . . . . . .  9
       5.4.2.  Visual Scale . . . . . . . . . . . . . . . . . . . . . 11
   6.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 12
   7.  Security Considerations  . . . . . . . . . . . . . . . . . . . 12
   8.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 12
   9.  Informative References . . . . . . . . . . . . . . . . . . . . 13
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 13


Romanow & Botzko        Expires January 13, 2011                [Page 2]

Internet-Draft       Telepresence Problem Statement            July 2010


1.  Introduction

   In a Telepresence conference, the idea is to create a feeling of
   presence - that you are in the same room with the remote parties.  In
   order to create the "being there" or telepresence experience, a
   number of technical issues need to be solved.  These issues are
   addressed by manipulating multiple media streams, video and audio -
   by describing them, controlling them, and signaling about them.  The
   fundamental features of telepresence require handling multiple
   streams of media, and considering additional characteristics of those
   streams beyond those normally specified in existing videoconferencing
   standards.

   Different telepresence systems approach solving the basic issues
   differently.  They use disparate techniques, and they describe,
   control and signal media in dissimilar fashions.  Such diversity
   creates an interoperability problem.  The same issues are solved in
   different ways by different systems, so that they are not directly
   interoperable.  This makes interworking difficult at best and
   sometimes impossible.

   Some degree of interworking is possible through transcoding and
   translation.  This requires additional devices, which are expensive
   and not entirely automatic.  Specialized knowledge is required to
   operate a telepresence conference where the endpoints use different
   equipment and a transcoding and translating device is employed for
   interoperability.  Often such conferences are interrupted by
   difficulties that arise.

   The general problem that needs to be solved is this.  The
   transmitting side sends audio and video streams based upon a model
   for rendering a realistic depiction from this information.  If the
   receiving side belongs to the same vendor, it works with the same
   model and renders the information according to that shared model.
   However, if the receiver and the sender are from different vendors,
   the models they each have for rendering presence differ.

   It is as if Alice and Bob are at different sites.  Alice needs to
   tell Bob information about what her camera and sound equipment see at
   her site so that Bob's receiver can create a display that will
   capture the important characteristics of her site.  Alice and Bob
   need to agree on what the salient characteristics are as well as how
   to represent and communicate them.  The telepresence multi-steam work
   seeks to describe the sender situation in a way that allows the
   receiver to render it realistically though it may have a different
   rendering model than the sender.

   This problem statement identifies the fundamental issues that need to


Romanow & Botzko        Expires January 13, 2011                [Page 3]

Internet-Draft       Telepresence Problem Statement            July 2010


   be addressed to provide telepresence in typical use case scenarios.
   We show how different approaches to solving the problems and
   different techniques for handling multiple media create a challenge
   for interoperability.

   This document describes some of the problems that arise, it is not an
   complete list, but rather it is more illustrative than exhaustive.
   Requirements, use cases and solutions are discussed in other
   documents.


2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].


3.  Fundamental Issues for Telepresence

   The fundamental issues that must be handled to produce a typical
   telepresence conference, either point to point or multipoint include:

   1.  Participant display

       A.  Placement of video

       B.  Size

       C.  Angle

       D.  Overlap

       E.  Display technology

   2.  Audio

       A.  Placement, emanating from right place

       B.  Type of audio

   3.  Different number of screens on sender and receiver sides

   4.  Participant display for multipoint

       A.  Placement of video


Romanow & Botzko        Expires January 13, 2011                [Page 4]

Internet-Draft       Telepresence Problem Statement            July 2010


       B.  Continuous presence

       C.  Control of display, how does it change? - automatic, user

   5.  Maintaining eye contact and gaze connection

   6.  Panoramic view for site switching

   7.  Mismatches between media characteristics between sender and
       receiver, such as:

       A.  aspect ratio

       B.  format

       C.  frame rate

       D.  resolution

   8.  Presentation

       A.  What methodology?

   9.  Security

       A.  SRTP?

       B.  Key methodology


4.  Manipulating Media Streams

   In addressing the fundamental issues, multiple media streams are
   handled in the following ways:

   1.   Sender and receiver understand each others capabilities

        A.  Number of video, audio and presentation streams that can be
            sent/received simultaneously

        B.  What media signaling protocol being used (SDP, proprietary,
            etc.)

   2.   Streaming control

   3.   Feedback mechanisms


Romanow & Botzko        Expires January 13, 2011                [Page 5]

Internet-Draft       Telepresence Problem Statement            July 2010


   4.   Signaling about RTP payload

   5.   Media control signaling

        A.  Video refresh

        B.  Flow control

   6.   Signaling media formats and media capabilities

   7.   Signaling content type

   8.   Signaling device type

   9.   Signaling network characteristics per stream

   10.  Floor control signaling


5.  Examples of Interworking Issues

   This section describes several examples that illustrate the kinds of
   incompatibilities that arise when different systems take different
   approaches to an issue.

5.1.  Designating Roles and Positions for transmitted streams

   Senders and receivers need to have the same vocabulary and
   understanding of stream roles and positions in order to place them
   appropriately.  For example one system may define roles as: center,
   left, right, legacy center, legacy right, legacy left, auxiliary 1/5
   fps and auxiliary 30 fps positions.  These roles as defined are a
   combination of "input devices" + "codec type/format" for transmission
   positions, and a combination of "stream decoders/output devices" +
   "codec type/format" for receive positions.  Another system will not
   have the exact same vocabulary and meaning, though it still has to
   accomplish the same placement task.

   How the cameras and encoders are wired determines how the local scene
   is displayed on the remote screen.  In many systems right and left
   need to be exchanged to be seen properly, but this depends on the way
   the equipment is wired.

   In describing how to display the local scene, the language can be
   misleading if there is no agreed upon reference for right and left.
   [for example, more]

   Although often the video is displayed on separate monitors, it is


Romanow & Botzko        Expires January 13, 2011                [Page 6]

Internet-Draft       Telepresence Problem Statement            July 2010


   also possible to use projectors to create a video wall.  In this
   case, there may be an overlap region between cameras which allows for
   projector blending.  Also, although cameras are generally arranged to
   create a seamless panoramic view of the participants, it is also
   possible for there to be gaps between cameras (and corresponding gaps
   between displays).

   There is also no reference for image size.  Some rooms use
   proportionally larger displays, and set the camera field of view to
   show participants either standing or sitting at life size.  Others
   use smaller displays, and set the field of view for sitting
   participants (cropping off heads when people stand).  In order to
   preserve full size display when these systems interoperate, both
   systems must rescale their video.

5.2.  Multipoint

   Multipoint conferences, where there are more than two endpoints,
   create a wealth of technical issues to be solved.  The primary one is
   which participants to display on each screen at each site.  If the
   number of sites is greater than can be shown on the number of
   displays at a site, this adds to the complexity.  There are, of
   course, almost unlimited ways this can be handled.  We discuss the
   common approaches and how they differ.

   The local screens can show all the camera image from the a particular
   remote site (site switching); or each local screen can show a
   participant or two from each of the remote sites (segment switching);
   or local displays can show a composite of remote camera shots
   (continuous presence).  The choice of who to display on a screen can
   be determined by users, or, more often, automated according to voice
   activity level.

   [Add user-controlled personal telepresence scenario.]

   Policies are created and implemented in many ways.  They tend to be
   based on some combination of what H.323 defines as centralized and
   decentralized.  One of the challenges is that the endpoints in the
   conference may have different number of cameras and displays from
   each other so a common mode on the number of streams and their
   priority is required.  Also, the various endpoints might have
   different bandwidth constraints and support different codec profiles.

   A centralized multipoint conference is one in which all participating
   endpoints communicate in a point-to-point fashion with an MCU.  The
   endpoints transmit their control, audio, video, and/or data streams
   to the MCU.  The MCUA centrally manages the conference, processes the
   audio, video and/or data streams, and returns the processed streams


Romanow & Botzko        Expires January 13, 2011                [Page 7]

Internet-Draft       Telepresence Problem Statement            July 2010


   to each endpoint.  In this mode, the MCU will mix the audio streams;
   and if using centralized video, will either use voice activated video
   switch, where everyone will see the active speaker and the speaker
   will see the previous speaker, or will use continuous presence mode,
   where the MCU will create a video stream with sub windows for each of
   the participants.  MCUs can support multiple video layouts and they
   can be created automatically based on the number of participants or
   by a conference management application.

   There are three methods commonly used for video stream distribution
   in centralized multipoint conferences.  The three conference policies
   above can be implemented using any of these technologies.

   Simple video switching (forwarding) has the advantage of low latency
   and low complexity.  It can be used if all systems are capable of
   receiving the encodings used by the sending endpoints (including both
   the video codec and the image resolution/aspect ratio).  In some
   situations it can be wasteful of bandwidth.

   Full video transcoding usually has higher latency than switching It
   does not require system to be capable of receiving identical
   encodings, and different sites can connect with different bandwidths.

   Layered video encoding combines some of the benefits of video
   switching and video transcoding.  It is more complex than video
   switching, but less complex than video transcoding.  Bandwidth and
   resolution can be reduced for each site.  Since this is done by
   filtering out layers of the original encoding, the available
   bandwidths and resolutions are not as fine-grained as full video
   transcoding.

   In decentralized mode or full mesh mode each endpoint creates its
   display mode.  This requires each endpoint to receive multiple
   streams and send its video and audio to all participants, using
   multicast of unicast.

   In practice, multicast is not now being used in commercial systems,
   so the size of a strictly decentralized multipoint conference is
   limited.

   There are analogous issues for audio.  Like video, the audio is
   rotated, so there is no clarity on the meaning of left and right.
   Since the number of streams, microphones, and speakers are not
   matched, the systems need to re-process the received audio in order
   to create the correct sound field for their respective rooms.

   There are two ways in which the audio might be handled in this use
   case:


Romanow & Botzko        Expires January 13, 2011                [Page 8]

Internet-Draft       Telepresence Problem Statement            July 2010


   o  A single stereo audio stream is sent to the remote site, just as
      in standard videoconferencing.

   o  Three monaural audio streams are sent to the remote site, with
      proprietary signaling to associate each audio stream with a video
      stream.

   Microphones and speakers positions vary; and there is no agreed upon
   way to describe their placement.  There is no agreed upon reference
   for audio level.  In addition, audio may be sent as an independent
   stream from each microphone or as a multi-channel channel stream.

5.3.  Capability Negotiation

   Call setup for the telepresence conference will start with a single
   call establishing one video media stream.  After the connection is
   established, a proprietary capability negotiation takes place that
   will enable both sides to identify that they are telepresence
   applications and capable of having two more video sessions and
   provide the connectivity information.  The result is that two or more
   video sessions are established.  The system may use two new SIP call
   legs or just add the two new video streams to the existing dialog.

   [more to be added]

5.4.  Differences in Media Characteristics

   Media characteristics such as video format, aspect ratio, and visual
   scale can be handled differently at different sites creating
   incompatibility.  To interwork, an adaptive strategy is necessary.
   Although differences in media characteristic must also be handled in
   a typical video conference, the problem is made more complex in
   Telepresence due to the multiple screens, cameras and streams.

   Two examples - aspect ratio and visual scale are described here.

5.4.1.  Aspect Ratio

   If the aspect ratios in different sites are not the same, some
   technique needs to be applied to adjust for the difference.  Although
   the same situation arises in normal video conferencing, multiple
   streams in telepresence conferencing causes more difficulties.

   For simplicity let us assume a point to point case - two conference
   room on a point to point call.  Both rooms have 3 screens and 3
   cameras, as in 4.1 above.  Both rooms have identical visual scale -
   the display width and distance between the participants and the
   displays are identical in both rooms.  However the equipment -


Romanow & Botzko        Expires January 13, 2011                [Page 9]

Internet-Draft       Telepresence Problem Statement            July 2010


   cameras and displays - in each room has a different aspect ratio,
   16:9 in one room and 4:3 in the other.

   Although 4:3 is usually associated with standard definition TV and
   16:9 with HDTV, telepresence systems may choose the aspect ratio to
   obtain a particular field of view.  Projecting images in the 16:9
   aspect ratio offers a wider presentation angle that shows fine
   details well (the pixel density is greater than a 4:3 system of the
   same resolution and scale).  In the room with 16:9 media
   characteristic, people are shown at full size when they are seated.
   However, when they stand up the height of the display results in
   their image being cropped so that their heads are not shown.  The
   other room uses projectors to display HD images with 4:3 aspect
   ratios.  This results in an increased image height - the vertical
   field of view is 33% greater than the 16:9 system.  The increased
   height allows most of the population to be shown full size whether
   they are standing or sitting.

   Some strategy is necessary to deal with the case of the two sites
   having a point to point call.  In order to convert formats of unequal
   ratios a variety of techniques can be used, such as: zooming
   (enlarging) and cropping (removing), letterboxing (adding horizontal
   bars), pillarboxing (adding vertical bars) to retain the original
   format's aspect ratio, or scaling (which distorts) in a variety of
   ways.

   For the video sent from the 4:3 room to the 16:9 room, several
   techniques can be used:

   1.  The 16:9 system might simply crop the top 1/4 of each 4:3 image.
       This will result in full size display, eye contact, and gaze
       awareness for the individuals who are seated.  However, the
       standing presenter's head will be cropped.

   2.  The 16:9 system might stretch each to the 4:3 images to fully fit
       the 16:9 display.  This would reduce image height (creating
       geometric distortion) and create eye-contact error.  Continuity
       of the panoramic image would be preserved.

   3.  The 16:9 system could pillarbox each of the 4:3 images, placing
       horizontal borders on the three displays.  This results in
       reducing the image size to less than full size.  It also destroys
       the continuity of the panoramic image, and introduces additional
       error in eye contact and gaze awareness.

   4.  The 16:9 system could pillarbox only the center display.  This
       reduces the size of the presenter who is the focus of the
       meeting.


Romanow & Botzko        Expires January 13, 2011               [Page 10]

Internet-Draft       Telepresence Problem Statement            July 2010


   5.  The 16:9 system could also crop the bottom of the center display.
       Visually this reduces the height of the presenter, but maintains
       full size.  There is a vertical discontinuity in the panoramic
       image.  Whether this is objectionable or not depends on the room
       layout.

   Strategies 4 and 5 could be accomplished in response to a user
   command or automatically.  The details will be discussed in more
   detail in future documents.

   For the video sent from the 16:9 room to the 4:3 room, the receiving
   system simply letterboxes the video displays.  Since the scales are
   identical, this full size image displays in the 4:3 room.

   For the video sent from the 16:9 room to the 4:3 room, the common
   techniques are:

   1.  The 4:3 system places the border above the image.  This maintains
       eye contact for those who are seated, but cannot maintain eye
       contact for the presenter.

   2.  The 4:3 system places the border below the images.  If the 16:9
       system crops the bottom of the center display then this will
       maintain eye contact for the presenter and the remote site.

   3.  The 4:3 system centers the images.  Eye contact suffers for
       everyone, but the worst case eye contact error is better
       controlled.

   In this use case, negotiation between the systems is not strictly
   necessary, no matter which scheme is used.  However, the best user
   experience is obtained if both systems have knowledge about apect
   ratios being used and which participants are standing and which are
   sitting so they can adjust optimally.

5.4.2.  Visual Scale

   The visual scale of displays may differ between sites.  Again, let us
   use the point to point case as a simple example.  Assume two
   conference rooms in a point to point call.  One room is designed for
   6 participants, and has three 16:9 screens and 3 cameras.  This room
   is designed to show participants at their normal size when seated (2
   participants per camera/display).  It does not have adequate display
   height to capture those who are standing.  The second room is also
   designed for 6 participants, but shows 3 participants per camera/
   display also at their full size.  Therefore, it only needs two 16:9
   cameras/display pairs.  Since the field of view in both the vertical
   and horizontal is increased by 50%, it also shows those who are


Romanow & Botzko        Expires January 13, 2011               [Page 11]

Internet-Draft       Telepresence Problem Statement            July 2010


   standing without cropping.

   For the video sent from the 2 screen (larger scale) room to the 3
   screen (smaller scale) room, two approaches can be used:

   1.  The 3 screen system might simply show the participants on two of
       its displays.  Participants will be shown at 67% of their full
       size.  Eye contact and gaze awareness will be lost.

   2.  The 3 screen system might construct and display a vertically
       cropped 3-screen view, showing 2 participants on each screen.
       Participants will be shown at full size, with preservation of eye
       contact and gaze awareness.

   For the video sent from the 3 screen to the 2 screen room, there are
   two analogous approaches:

   1.  The 2 screen system selects 2 streams and simply shows them on
       its displays.  Participants will be shown at 150% of their normal
       size.  Eye contact and gaze awareness will be lost, and some of
       the remote site is lost.

   2.  The 2 screen system might construct and display a 2 screen view
       (with a vertical border on the top) which shows 3 participants on
       each screen.  Participants will be shown at full size, with
       preservation of eye contact and gaze awareness.

   Although there is no need for negotiation between the systems, the
   best user experience is obtained if both systems have knowledge of
   the visual scale, and where individuals are seated, and can then
   choose the best manner of display.


6.  IANA Considerations

   This document contains no IANA considerations.


7.  Security Considerations

   While there are likely to be security considerations for any solution
   for telepresence interoperability, this document has no security
   considerations.


8.  Acknowledgements

   The draft has benefitted from input from a number of people including


Romanow & Botzko        Expires January 13, 2011               [Page 12]

Internet-Draft       Telepresence Problem Statement            July 2010


   Roni Even, Jim Cole, Nermeen Ismail, Nathan Buckles.


9.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.


Authors' Addresses

   Allyn Romanow
   Cisco
   San Jose, CA  95134
   US

   Email: allyn@cisco.com


   Stephen Botzko
   Polycom
   Andover, MA  01810
   US

   Email: stephen.botzko@polycom.com


Romanow & Botzko        Expires January 13, 2011               [Page 13]