RTCWEB                                                         J. Lennox
Internet-Draft                                                     Vidyo
Intended status: Standards Track                              A. Romanow
Expires: May 3, 2012                                            P. Witty
                                                           Cisco Systems
                                                        October 31, 2011


   Real-Time Transport Protocol (RTP) Usage for Telepresence Sessions
                     draft-lennox-clue-rtp-usage-01

Abstract

   This document describes mechanisms and recommended practice for
   transmitting the media streams of telepresence sessions using the
   Real-Time Transport Protocol (RTP).

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 3, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Lennox, et al.             Expires May 3, 2012                  [Page 1]

Internet-Draft         RTP Usage for Telepresence           October 2011

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Source multiplexing - overview
   4.  Use Cases
   5.  Demultiplexing
     5.1.  Using the SSRC for demultiplexing
     5.2.  Multiplex ID
     5.3.  Combined approach
   6.  Transmission of presentation sources
   7.  Other considerations
   8.  Security Considerations
   9.  IANA Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   Telepresence systems, of the architecture described by
   [I-D.ietf-clue-telepresence-use-cases] and
   [I-D.ietf-clue-telepresence-requirements], will send and receive
   multiple media streams, where the number of streams in use is
   potentially large and asymmetric between endpoints, and streams can
   come and go dynamically.  These characteristics lead to a number of
   architectural design choices which, while still within the scope of
   potential architectures envisioned by the Real-Time Transport
   Protocol [RFC3550], must differ considerably from those typically
   implemented by the current generation of voice or video
   conferencing systems.

   This document makes recommendations about how streams should be
   encoded and transmitted in RTP for this telepresence architecture.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119]
   and indicate requirement levels for compliant implementations.

3.  Source multiplexing - overview

   Telepresence sessions have many media streams: easily dozens at a
   time (given, e.g., a continuous presence screen in a multi-point
   conference), potentially out of a possible pool of hundreds.
   Furthermore, endpoints will have an asymmetric number of media
   streams.  In such an environment the usual model of existing SIP
   endpoints -- sending zero or one source (in each direction) per RTP
   session -- doesn't scale, and mapping asymmetric numbers of sources
   to sessions is needlessly complex.

   Therefore, telepresence systems SHOULD use a single RTP session per
   media type, except where there is a need to give sessions different
   transport treatment.  All sources of the same media type are sent
   over this single RTP session.  This architecture (known as "source
   multiplexing") was defined by [RFC3550], but was rarely used until
   its recent adoption by some telepresence systems.

   Multiplexing multiple media streams in this way has additional
   advantages.  It makes traversing middleboxes considerably easier, as
   it allows telepresence devices to work through SIP B2BUAs that do
   not support multiple media lines of the same media type.  It also
   simplifies NAT and firewall traversal by allowing an endpoint to
   deal with only a single address/port mapping per media type rather
   than multiple mappings.

   During call setup, a single RTP session is negotiated for each media
   type.  In SDP, only one media line is negotiated per media type, and
   multiple media streams are sent over the same UDP channel negotiated
   using that media line.
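
   As an illustration of what source multiplexing means for a receiver,
   the following minimal Python sketch separates the packets arriving
   on one RTP session by their SSRC.  The fixed header layout is the
   one defined in [RFC3550]; the function names, the make_packet
   helper, and the payload type 96 are hypothetical and chosen only
   for this example.

```python
import struct
from collections import defaultdict

RTP_FIXED_HEADER_LEN = 12  # size of the RTP fixed header in RFC 3550


def parse_rtp_fixed_header(packet: bytes) -> dict:
    """Parse the 12-byte RTP fixed header (version 2 only)."""
    if len(packet) < RTP_FIXED_HEADER_LEN:
        raise ValueError("packet shorter than the RTP fixed header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    if b0 >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    return {
        "extension": bool(b0 & 0x10),   # X bit
        "csrc_count": b0 & 0x0F,        # CC field
        "marker": bool(b1 & 0x80),      # M bit
        "payload_type": b1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


def group_by_ssrc(packets):
    """Group the packets received on one RTP session by source (SSRC)."""
    sources = defaultdict(list)
    for pkt in packets:
        sources[parse_rtp_fixed_header(pkt)["ssrc"]].append(pkt)
    return dict(sources)


def make_packet(ssrc, seq, payload=b"", pt=96):
    """Hypothetical helper: build a minimal RTP packet for the demo."""
    # 0x80 = version 2, no padding, no extension, no CSRCs
    return struct.pack("!BBHII", 0x80, pt, seq, 0, ssrc) + payload


if __name__ == "__main__":
    session = [make_packet(0x1111, 1), make_packet(0x2222, 1),
               make_packet(0x1111, 2)]
    for ssrc, pkts in group_by_ssrc(session).items():
        print(hex(ssrc), len(pkts))
```

   Grouping by SSRC tells the receiver how many sources are present,
   but not which requested capture each one carries.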

   A number of protocol issues involved in multiplexing RTP streams
   into a session are discussed in
   [I-D.westerlund-avtcore-multiplex-architecture] and
   [I-D.lennox-rtcweb-rtp-media-type-mux].  In this draft we
   concentrate on examining the demultiplexing of RTP streams, in the
   specific context of telepresence systems.

   A key issue to work out is how a receiver interprets the multiple
   streams it receives, and correlates them with the captures it has
   requested.  In some cases, the concept of a "capture" in the CLUE
   Framework [I-D.ietf-clue-framework] maps cleanly to the RTP concept
   of an SSRC, but in others it does not.

   First we consider the relevant use cases.  We then examine the two
   most obvious approaches to demultiplexing, with their pros and cons,
   and finally describe a third possible alternative.

4.  Use Cases

   There are three distinct use cases relevant for telepresence
   systems:

   Static stream choice:  In this case, the streams sent over the
      multiplex are constant over the complete session.  An example is
      a triple-camera system to MCU in which left, center and right
      streams are sent for the duration of the session.

      This describes endpoint to endpoint, endpoint to multipoint
      device, and, equivalently, transcoding multipoint device to
      endpoint.

      This is illustrated in Figure 1.

      [Figure 1: Point to Point Static Streams]

   Dynamic streams from a finite set:  In this case, the receiver has
      requested a smaller number of streams than the number of media
      sources that are available, and expects the sender to switch the
      sources being sent based on criteria chosen by the sender.
      (This is called auto-switched in the CLUE Framework
      [I-D.ietf-clue-framework].)  An example is a triple-camera system
      to two-screen system, in which the sender needs to switch either
      LC -> LR, or CR -> LR.

      This describes endpoint to endpoint, endpoint to multipoint
      device, and transcoding device to endpoint.

      This is illustrated in Figure 2.

      [Figure 2: Point to Point Finite Source Streams]

   Dynamic streams from an infinite set:  This case describes a
      switched multipoint device to endpoint, in which the multipoint
      device can choose to send any streams received from any other
      endpoints within the conference to the endpoint.

      For example, in an MCU to triple-screen system, the MCU could
      send, e.g., LCR of a triple-camera system -> LCR, or CCC of three
      single-camera endpoints -> LCR.

      This is illustrated in Figure 3.

      [Figure 3: Multipoint Infinite Streams]

   Within any of these cases, every stream within the multiplexed
   session MUST have a unique SSRC.  The SSRC is chosen at random
   [RFC3550] so that it is, with high probability, unique within the
   conference, and it contains no meaningful information.

   Any source may choose to restart a stream at any time, resulting in
   a new SSRC.
   For example, a transcoding MCU might, for load-balancing reasons,
   move an encoder to a different DSP and discard all encoding context
   in the process, sending an RTCP BYE message for the old SSRC and
   picking a new SSRC for the stream when it is restarted on the new
   DSP.

   Because the SSRC can change at any time, all of our use cases can be
   considered to be the third and most difficult case, that of dynamic
   streams from an infinite set.  Thus, this is the only case we will
   consider.

5.  Demultiplexing

   There are two obvious choices of demultiplexing key: the SSRC, which
   is guaranteed to be unique for a stream but conveys no intrinsic
   useful information, or an additional multiplex ID tagged on to media
   packets.  There may be other choices, e.g., the payload type number,
   which might be appropriate for multiplexing one audio with one video
   stream on the same RTP session, but this is not relevant for the
   cases discussed here.

   For receivers with limited decoding resources, it is particularly
   important to ensure that the number of streams which the receiver is
   expecting to receive never exceeds the maximum number it has
   requested.  On a change of stream, the receiver can be expected to
   have a one-out, one-in policy, so that the decoder of the stream
   currently being decoded is stopped before starting the decoder for
   the stream replacing it.  The sender should therefore indicate to
   the receiver which stream will be replaced upon a stream change.

5.1.  Using the SSRC for demultiplexing

   Using the SSRC has the advantage that it is already included in each
   RTP packet.  However, there are some disadvantages to consider.

   First, the SSRC needs to be linked to some metadata to associate it
   with the capture stream.  This is because, although it uniquely
   identifies a media stream, it does not indicate which of the
   requested streams each SSRC is tied to.
   If more than one media stream is expected, it is therefore required
   to send some additional metadata to indicate the link between the
   SSRC and the CLUE stream ID.  This is simply a mapping from
   transmitted SSRC to stream ID, updated as new SSRCs replace old
   ones.

   Because of the one-out, one-in codec policy, the receiver must know
   in advance of receiving the media stream how to allocate its
   decoding resources.  Although it could cache incoming media received
   before it knows what multiplex stream it applies to, this will
   require an unknown amount of storage space (particularly if the
   metadata is lost), and could lead to significant latency, after
   which the receiver may not find it possible to catch up because of
   resource constraints, or else it would require an expensive state
   refresh, such as a Full Intra Request (FIR) [RFC5104].

   In addition, a receiver will have to store lookup tables of SSRCs to
   stream IDs, decoders, etc.  Because of the large SSRC space (32
   bits), this will have to be in the form of something like a hash
   map, and a lookup will have to be performed for every incoming
   packet, which may prove costly on the receiver side.

   Consider the choices for where to put the metadata.  The metadata
   could be sent in the CLUE messaging.  The use of a reliable
   transport ensures that the metadata will not be lost, but if this
   reliability is achieved through retransmission, the time taken for
   the metadata to reach all receivers (particularly in a very large
   scale conference, e.g., with thousands of users) could result in
   very poor switching times, providing a bad user experience.

   A second option for sending the metadata is in RTCP, for instance as
   a new SDES item.  This is likely to follow the same path as media,
   and therefore if the metadata is sent slightly in advance of the
   media, it can be expected to be received in advance of the media.
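
   Whichever channel carries the metadata, the receiver-side
   bookkeeping that SSRC-based demultiplexing implies can be sketched
   as follows.  This is a minimal Python illustration under stated
   assumptions, not a definitive design: the class and method names
   are hypothetical, stream IDs are treated as opaque values, and the
   per-SSRC cache of early packets is bounded by an arbitrary limit
   (max_pending), whereas the text above notes that the storage
   actually needed is unknown.

```python
from collections import deque


class SsrcDemux:
    """Route packets by SSRC once metadata binds the SSRC to a stream ID."""

    def __init__(self, max_pending=32):
        self.ssrc_to_stream = {}   # hash map consulted for every packet
        self.pending = {}          # ssrc -> bounded deque of early packets
        self.max_pending = max_pending  # assumed bound, not from the draft

    def on_metadata(self, ssrc, stream_id):
        """Bind ssrc to stream_id and drop the SSRC it replaces.

        Returns the packets that arrived before the metadata did.
        """
        for old_ssrc, sid in list(self.ssrc_to_stream.items()):
            if sid == stream_id and old_ssrc != ssrc:
                del self.ssrc_to_stream[old_ssrc]   # one-out, one-in
        self.ssrc_to_stream[ssrc] = stream_id
        return list(self.pending.pop(ssrc, ()))

    def on_packet(self, ssrc, packet):
        """Return the stream ID for packet, or None if it was cached."""
        stream_id = self.ssrc_to_stream.get(ssrc)
        if stream_id is None:
            queue = self.pending.setdefault(
                ssrc, deque(maxlen=self.max_pending))
            queue.append(packet)   # oldest packets fall off at the bound
        return stream_id


if __name__ == "__main__":
    demux = SsrcDemux(max_pending=2)
    demux.on_packet(0xA, b"early-1")          # metadata not yet seen
    print(demux.on_metadata(0xA, "left"))     # releases the cached packet
    print(demux.on_packet(0xA, b"later"))     # now routed to "left"
```

   Packets dropped from the cache before the metadata arrives are the
   situation in which the receiver would need a state refresh such as
   a FIR.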
   However, because RTCP is lossy, the metadata may not be received for
   some time, resulting in the receiver of the media not knowing how to
   route the received media.  A system of acks and retransmissions
   could mitigate this, but this results in the same high switching
   latency behaviour as discussed for using CLUE as a transport for the
   metadata.

5.2.  Multiplex ID

   The second option is to tag each media packet with an RTP header
   extension [RFC5285] carrying a multiplex ID.  This means that a
   receiver immediately knows how to interpret received media, even
   when an unknown SSRC is seen.  As long as the media carries a known
   multiplex ID, it can be assumed that this media stream will replace
   the stream currently being received with that multiplex ID.

   This gives significant advantages to switching latency, as a switch
   between sources can be achieved without any form of negotiation with
   the receiver.  There is no chance of receiving media without knowing
   to which switched capture it belongs.

   Although multiplex IDs may be chosen by either the sender or
   receiver, the multiplex ID can, if chosen by the receiver, contain
   semantic information relevant to the receiver.  For example, on a
   large multipoint device with many DSPs, the receiver-chosen
   multiplex ID could identify the DSP to which the media should be
   sent, and possibly contain routing information to the DSP.

   However, there are also significant disadvantages in using a
   multiplex ID.  It introduces additional processing costs.  Multiplex
   IDs are scoped only within one hop (i.e., within a cascaded
   conference a multiplex ID that is used from the source to the first
   MCU is not meaningful between two MCUs, or between an MCU and a
   receiver), and so they may need to be modified at every stage.  To
   add or modify the multiplex ID is an expensive operation,
   particularly if SRTP is used to authenticate the packet.
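
   As an illustration of the receiver-side handling described in this
   section, the following Python sketch looks for a multiplex ID
   carried in the one-byte-header form of the RTP header extension
   defined in [RFC5285] and rebinds that ID to the packet's SSRC.
   This document does not define the encoding of the multiplex ID, so
   the extension ID (MUX_EXT_ID), the treatment of the ID as opaque
   bytes, and all function names are assumptions made purely for the
   example.

```python
import struct

MUX_EXT_ID = 1  # hypothetical: a real ID would be negotiated in signalling


def find_mux_id(packet: bytes, ext_id: int = MUX_EXT_ID):
    """Return the multiplex ID bytes from the extension, or None."""
    if len(packet) < 12:
        return None
    b0 = packet[0]
    if b0 >> 6 != 2 or not (b0 & 0x10):       # need RTP v2 with X bit set
        return None
    offset = 12 + 4 * (b0 & 0x0F)             # skip fixed header and CSRCs
    if len(packet) < offset + 4:
        return None
    profile, words = struct.unpack("!HH", packet[offset:offset + 4])
    if profile != 0xBEDE:                     # one-byte-header form only
        return None
    pos, end = offset + 4, offset + 4 + 4 * words
    if end > len(packet):
        return None
    while pos < end:
        byte = packet[pos]
        if byte == 0:                         # padding byte
            pos += 1
            continue
        elem_id, length = byte >> 4, (byte & 0x0F) + 1
        if elem_id == 15 or pos + 1 + length > end:
            return None                       # reserved ID or truncated data
        if elem_id == ext_id:
            return packet[pos + 1:pos + 1 + length]
        pos += 1 + length
    return None


def route(packet: bytes, current: dict):
    """Bind the packet's SSRC to its multiplex ID.

    Returns (mux_id, replaced_ssrc); replaced_ssrc is the SSRC that was
    previously received on that multiplex ID, or None if unchanged.
    """
    mux_id = find_mux_id(packet)
    if mux_id is None:
        return (None, None)
    ssrc = struct.unpack("!I", packet[8:12])[0]
    previous = current.get(mux_id)
    current[mux_id] = ssrc
    return (mux_id, previous if previous != ssrc else None)


def make_tagged_packet(ssrc, mux_id: bytes, ext_id=MUX_EXT_ID):
    """Hypothetical helper: minimal RTP packet tagged with mux_id (1-16 bytes)."""
    element = bytes([(ext_id << 4) | (len(mux_id) - 1)]) + mux_id
    element += b"\x00" * (-len(element) % 4)  # pad to a 32-bit boundary
    ext = struct.pack("!HH", 0xBEDE, len(element) // 4) + element
    # 0x90 = version 2 with the X (extension) bit set
    return struct.pack("!BBHII", 0x90, 96, 0, 0, ssrc) + ext
```

   In this sketch a packet from a previously unseen SSRC is routed as
   soon as it arrives, and the returned replaced_ssrc tells the
   receiver which decoder to stop first under a one-out, one-in
   policy.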
   Modification to the contents of the RTP header requires a
   reauthentication of the complete packet, and this could prove to be
   a limiting factor in the throughput of a multipoint device.
   However, it may be that reauthentication is required in any case due
   to the nature of SDP.  SDP permits the receiver to choose payload
   types, so a switching node may similarly need to modify the payload
   type in the packet header, which also forces reauthentication.

5.3.  Combined approach

   The two major flaws of the above methods (poor switching performance
   of SSRC multiplexing, high computational cost of multiplex IDs on
   switching nodes) can be mitigated with a combined method.  In this,
   the multiplex ID can be included in packets belonging to the first
   frame of media (typically an IDR/GDR), but following this only the
   SSRC is used to demultiplex.

   Because the IDR is already required to be received before any
   further frames can be decoded, this does not create any further
   restrictions on the media stream -- existing mechanisms to ensure
   the reliability of an IDR frame can be used.  It does introduce
   extra complexity on the demultiplex side, requiring a two-stage
   process of inspecting the packet for a multiplex ID, and, if it is
   not present, looking for the SSRC in a table of known streams.

   The solution is somewhat more complex if a source can move from one
   switched capture to another: for instance, in the second example in
   Section 4, when the sender switches from sending LC -> LR to sending
   CR -> LR, the sender's "C" source moves from the receiver's "R"
   multiplex ID to the receiver's "L" multiplex ID.  For reasons of
   coding efficiency, it is desirable in this case to avoid sending a
   new IDR frame for the "C" stream, if the receiver's architecture
   allows the same decoding state to be used for its various multiplex
   IDs.
   In this case, the multiplex ID could be sent for a small number of
   frames after the source's multiplex ID has changed.

6.  Transmission of presentation sources

   Most existing videoconferencing systems use separate RTP sessions
   for main and presentation video sources, distinguished by the SDP
   content attribute [RFC4796].  The use of the CLUE telepresence
   framework [I-D.ietf-clue-framework] to describe multiplexed streams
   can remove this need.

   However, it could still be useful in some cases to make the
   distinction between presentation and main video sources at the
   transport layer.  In particular, if different treatment is desired
   at the transport layer or below (e.g., different VLANs, different
   QoS characteristics, etc.) for main video vs. presentation, the use
   of multiple RTP sessions (m lines with different transport
   addresses) would be necessary.

7.  Other considerations

   As currently defined, H.281 Far-End Camera Control
   [ITU.H281.1994][RFC4573] does not, in SIP-based videoconferences,
   support selecting among multiple remote sources (though it does in
   H.323 conferences controlled by an MCU, which can assign terminal
   IDs to sources).  When RTP sessions contain multiple sources, this
   limitation becomes pressing.  (However, this problem does not appear
   to be in scope of the CLUE working group.)

8.  Security Considerations

   The security considerations for multiplexed RTP do not appear to
   differ from those for non-multiplexed RTP.

9.  IANA Considerations

   This document makes no requests of IANA.

   Note to RFC Editor: please remove this section before publication as
   an RFC.

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

10.2.  Informative References

   [I-D.ietf-clue-framework]
              Romanow, A., Duckworth, M., Pepperell, A., and B.
              Baldino, "Framework for Telepresence Multi-Streams",
              draft-ietf-clue-framework-00 (work in progress),
              October 2011.

   [I-D.ietf-clue-telepresence-requirements]
              Romanow, A. and S. Botzko, "Requirements for Telepresence
              Multi-Streams",
              draft-ietf-clue-telepresence-requirements-01 (work in
              progress), October 2011.

   [I-D.ietf-clue-telepresence-use-cases]
              Romanow, A., Botzko, S., Duckworth, M., Even, R., and I.
              Communications, "Use Cases for Telepresence
              Multi-streams",
              draft-ietf-clue-telepresence-use-cases-01 (work in
              progress), July 2011.

   [I-D.lennox-rtcweb-rtp-media-type-mux]
              Lennox, J. and J. Rosenberg, "Multiplexing Multiple Media
              Types In a Single Real-Time Transport Protocol (RTP)
              Session", draft-lennox-rtcweb-rtp-media-type-mux-00 (work
              in progress), October 2011.

   [I-D.westerlund-avtcore-multiplex-architecture]
              Westerlund, M., Burman, B., and C. Perkins, "RTP
              Multiplexing Architecture",
              draft-westerlund-avtcore-multiplex-architecture-00 (work
              in progress), October 2011.

   [ITU.H281.1994]
              International Telecommunications Union, "A far end camera
              control protocol for videoconferences using H.224",
              ITU-T Recommendation H.281, November 1994.

   [RFC4573]  Even, R. and A. Lochbaum, "MIME Type Registration for RTP
              Payload Format for H.224", RFC 4573, July 2006.

   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
              Protocol (SDP) Content Attribute", RFC 4796,
              February 2007.

   [RFC5104]  Wenger, S., Chandra, U., Westerlund, M., and B. Burman,
              "Codec Control Messages in the RTP Audio-Visual Profile
              with Feedback (AVPF)", RFC 5104, February 2008.

   [RFC5285]  Singer, D. and H. Desineni, "A General Mechanism for RTP
              Header Extensions", RFC 5285, July 2008.

Authors' Addresses

   Jonathan Lennox
   Vidyo, Inc.
   433 Hackensack Avenue
   Seventh Floor
   Hackensack, NJ  07601
   US

   Email: jonathan@vidyo.com


   Allyn Romanow
   Cisco Systems
   San Jose, CA  95134
   USA

   Email: allyn@cisco.com


   Paul Witty
   Cisco Systems
   Langley, England
   UK

   Email: pauwitty@cisco.com