RTCWEB S. Wenger Internet-Draft A. Eleftheriadis Intended status: Standards Track Vidyo Expires: January 5, 2012 July 4, 2011 The Case for Layered Codecs draft-wenger-rtcweb-layered-codec-00 Abstract RTCWEB is in the process of developing a protocol infrastructure and a browser API to support browser-to-browser real-time communication over IP. Real-time communication necessarily requires the use of encoders and decoders (codecs) for media data. The document advocates mandating support for a class of codecs known as scalable or layered codecs, for their superior network adaptivity, error resilience, and application adaptivity. Examples are provided for use cases currently under discussion, focusing on video coding as the, perhaps, most challenging media type currently under consideration. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on January 5, 2012. Copyright Notice Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents Wenger & Eleftheriadis Expires January 5, 2012 [Page 1] Internet-Draft The Case for Layered Codecs July 2011 carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 3. Detour: Suggested additional Requirements . . . . . . . . . . 3 4. Scalable Codecs in the defined Use Cases . . . . . . . . . . . 4 4.1. Use case: Browser to browser use-cases . . . . . . . . . . 5 4.2. Telephony use cases . . . . . . . . . . . . . . . . . . . 7 4.3. Video conferencing use cases . . . . . . . . . . . . . . . 8 4.3.1. Use-case: Multiparty Video Communication . . . . . . . 8 4.3.2. Use-case: Video Conferencing w/ Central Server . . . . 10 4.4. Embedded voice communication use cases . . . . . . . . . . 10 4.5. Bandwidth/QoS/mobility use cases . . . . . . . . . . . . . 11 5. Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . 11 6. Security Considerations . . . . . . . . . . . . . . . . . . . 12 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 12 7.1. Normative References . . . . . . . . . . . . . . . . . . . 12 7.2. Informative References . . . . . . . . . . . . . . . . . . 12 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 13 Wenger & Eleftheriadis Expires January 5, 2012 [Page 2] Internet-Draft The Case for Layered Codecs July 2011 1. Introduction In this document we advocate the mandatory support of scalable coding techniques for RTCWEB. If consensus towards such as position is not achievable for whatever technical, commercial, or legal reasons, we suggest that not having mandatory codecs may be better than mandating the use of an inferior, non-scalable codec. The current status of RTCWEB's use case and requirements discussion is captured in [I-D.ietf-rtcweb-use-cases-and-requirements]. (Note: Version 00 of this I-D was used for the comparison in this document.) We use this document as a guide and attempt to compare the broad categories of scalable and non-scalable codecs in terms of how well they can satisfy the requirements specified therein. We use video codecs as an example, and conclude that scalable codecs are considerably better-suited for RTCWEB than non-scalable codecs. We first address some additional - and, we believe, self-evident - requirements. 2. Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119] and indicate requirement levels for compliant implementations. 3. Detour: Suggested additional Requirements There are certain requirements that, we believe, appear so obvious to the authors of [I-D.ietf-rtcweb-use-cases-and-requirements] that they have not been explicitly captured. One of these requirements is: Fxx: The browser MUST enable communication with a peer (other browser(s), MCUs, etc.) with a latency adequate for "real-time" communication. This is a crucial requirement when selecting a (video) codec for reasons that will become apparent shortly. We believe that RTCWEB should strive for solutions allowing latency (not counting long- distance network delays) below 200 msec. Furthermore, any solution that cannot guarantee a latency below 500 msec (before dropping video altogether) should not be considered. At the time of writing there are mailing list discussions ongoing regarding a requirement that may be differently formulated but addresses the same operational aspect. We don't specifically care about the formulation, but highly Wenger & Eleftheriadis Expires January 5, 2012 [Page 3] Internet-Draft The Case for Layered Codecs July 2011 recommend to go through the pain of agreeing to fixed millisecond numbers. An additional requirement that should be considered relates to heterogeneity: Fxy: The browser must be able to send fully decodable bitstreams, ideally without wasting resources, for a broad range of receivers (handheld to desktop) and, in multi-party cases, for a heterogeneous receiver population. Seems obvious? Yes, to us. Have the implications, however, been thoroughly considered? Not that we are aware of. We discuss this further when addressing the multiparty use case. Finally, an additional requirement that relates to the heterogeneity of peers has to do with the heterogeneity of connection characteristics among peers: Fxz: The browser MUST be able to perform error control on multiple streams, with potentially different error characteristics, simultaneously. In a full-mesh, multipoint scenario, for example, handling of individual peer connection characteristics can be very challenging. Assume, by way of example, that in a five-way conference one of the streams gets damaged or that the corresponding connection is subject to congestion. If the browser choses to produce a more robust stream, then the other three receiving peers will be penalized through the overhead spent for error protection and resulting inferior picture quality, even if they have perfect connections to the sender. If, on the other hand, the browser choses to produce separate streams for each of its peers, then it must be able to handle an increasing computational load. Even if the computational load is not a concern, battery power consumption certainly is. In fact, there is only one thing that grows faster than CPU power (according to Moore's law): per-pixel cycle demands of video codecs. Also, we should bear in mind that even when staying with moderately complex (and efficient) video codecs, video resolution constantly goes up. Worse, resolution grows in both horizontal and vertical dimensions, whereas CPU power grows only in one dimension. 4. Scalable Codecs in the defined Use Cases The concept of scalable coding (for video, audio, or any media) consists of providing a representation of the source at multiple levels of fidelity using a bitstream comprised of a corresponding Wenger & Eleftheriadis Expires January 5, 2012 [Page 4] Internet-Draft The Case for Layered Codecs July 2011 number of layers. Typically these layers are formed in a pyramidal fashion, such that a given level of fidelity requires the availability of some or all of the lower layers (i.e., those that correspond to lower levels of fidelity). The dependency exists because the encoding process uses prediction between layers, thus providing increased compression efficiency. If no such prediction is used, then the layers are independent of one another, and we have what is called simulcasting. Indeed, from a media coding point of view, simulcasting is scalable coding without inter-layer prediction. In simulcast, each layer provides a different level of fidelity and is independently decodable. One example of a standardized codec that uses such pyramidal layering to implement scalable coding is H.264/SVC. MPEG-2 and H.263+ are two other similar examples. The scalable layers in H.264/SVC enhance fidelity in any of three dimensions: temporal rate, spatial resolution, or spatial quality (or signal-to-noise ratio, SNR). More than one such enhancement can exist in a given video at the same time, i.e., scalability enhancements can be combined. Scalable coding has been a well-known design architecture in the media coding community since at least 1993, but it has only recently been used in products. As a relatively recent addition to the commercial audio-video communication arsenal, it is not yet widely known among people who are not video coding experts. An overview of the H.264/SVC standard can be read in [wiegand2007]. A detailed discussion of how it can be used in videoconferencing systems is provided in [eleft2006]. In the following we revisit each of the use cases and discuss how scalable coding, and particularly pyramidal scalable video coding, can be used to provide a significantly improved user experience. 4.1. Use case: Browser to browser use-cases This category has four sub-cases. The sub-cases "simple video communication" and "simple video communication with inter-operator calling" are, from a codec viewpoint, the same, with the exception of the mentioned interoperability argument resulting in a need for a baseline codec. From a (video) codec viewpoint the "Hockey" sub-case appears to consist of two independent video streams being sent by the mobile phone. (We are actually not sure whether this, rather exotic, multi-camera scenario warrants its own use case, but we are not opposed to it, either.) The sub-case "video size change" appears, to us, to introduce a feature that would be relevant to all video- capable use cases. As a result, from a codec viewpoint, all four use cases appear to have similar requirements and can be discussed jointly. Wenger & Eleftheriadis Expires January 5, 2012 [Page 5] Internet-Draft The Case for Layered Codecs July 2011 On the surface, these use cases appear to be trivially supported by any video codec as long as basic rate control and error resilience features are utilized (probably with the help of protocol support such as RTCP receiver reports and feedback messages therein, retransmission, and so on). Let's see, however, how scalable codecs can offer significant advantages over non-scalable codecs. First, temporal scalability (where pictures coded in an enhancement layer can be decoded in combination with a lower frame rate base layer) can greatly enhance error resilience, through techniques such as the one described in [wenger1998]. Especially in conjunction with feedback messages, virtually latency-neutral repair mechanisms can be devised without relying on latency- and bandwidth-unfriendly intra (I) pictures. It is important to note that such "reactive" techniques do not penalize a system when no errors happen to occur. Also, by being reactive, they are amenable to heterogeneous error control support (thus addressing requirement Fxz). (Note: sending intra pictures is typically not a good idea. First, when transmitting video over a bandwidth-limited link that is close to capacity (e.g., a 3G link), an I frame can bring the latency up to the seconds range. Second, I frames are much larger than P or B frames and, therefore, statistically much more likely to be hit by a packet loss. Sending I frames for error control is, in almost all cases, a bad idea. It has been done for many years, but only because few, if any, better error control mechanisms were available.) A second scalability dimension is spatial scalability. Here the enhancement layer enables decoding of the video in a higher resolution than a base layer, but can predict information from the base layer. Spatial scalability can also offer advantages for error control, something that we would gladly elaborate on in the future. In this use case, however, its key advantage lies in the graceful and, if properly implemented, latency-neutral handling of both changes in available network bandwidth (addressing the bandwidth and error characteristics aspects of F23, among others) and receiver-side rendering requirements (addressing F22). Most, if not all, video coding experts will agree that there is nothing better than shedding pixels when running into a bandwidth issue, and this can be trivially done (without sending an I frame!) when using spatially scalable coding techniques. Similarly, recovering from bandwidth shortages can be achieved without sending an I frame, again in those cases where the sender properly implements spatial scalability. These techniques are generally more flexible and often lead to better-quality pictures compared to the rate- Wenger & Eleftheriadis Expires January 5, 2012 [Page 6] Internet-Draft The Case for Layered Codecs July 2011 control techniques used in single-layer codecs (QP adjustment). Another point that speaks for spatial scalability is the option to gracefully and quickly react to rendering size requirements, i.e., users enlarging or "full-screening" the rendering window (F22). Again adding a high resolution scalable layer can be done without sending an I frame (which, as already pointed out, are evil), and, if implemented properly, can be done without interrupting the smoothness of the video playback experience. (Note that even today's most advanced spatial scalability techniques still encode only a small set of discrete picture sizes in their finite, and small, number of layers. As a result, some form of resampling in the rendering interface will probably still be required.) It has been remarked that simulcasting low and high resolution representations of the same picture can have very similar positive effects. This is correct. However, simulcast comes at a price, which is bandwidth and compute cycle requirements. The quite extensive subjective evaluation tests performed by MPEG in conjunction with the H.264/SVC verification tests have shown that a 2006-generation scalable video encoder offers between 17% and 40% bitrate savings over a same generation encoder using simulcast [mpeg2007]. (Note that "subjective" evaluation is the Mercedes of media quality testing, whereas "objective" evaluation compares to a pre-"Volt" Chevrolet :-). Subjective tests, when performed properly, are by no means unscientific and they are, in fact, considered a much better way to assess the quality of a codec compared with objective tests (e.g., the PSNR, or less crude metrics devised by groups such as the Video Quality Experts Group). Unfortunately, subjective evaluations are also orders of magnitude more expensive and time-consuming than objective tests, as it takes many person-hours to evaluate a single test case, whereas objective tests take only CPU cycles.) Further, while a modern scalable encoder creating two spatial layers may require roughly the same number of cycles as two encoders coding the same pixel count, there are savings in the decoder due to what SVC calls "single loop decoding" (i.e., the fact that the receiver only needs to maintain a prediction loop just for the layer that is being decoded, and not for the layers that the to-be-decoded layer may depend on). 4.2. Telephony use cases As the telephony use cases appear to address interaction with legacy telephone equipment only, there is probably little use for the quality scalability that modern sclabale speech/audio codecs offer. Wenger & Eleftheriadis Expires January 5, 2012 [Page 7] Internet-Draft The Case for Layered Codecs July 2011 As formulated, from a codec viewpoint, any speech/audio codec used in telco environments (and many that have been specified outside this environment) ought to be acceptable choices. 4.3. Video conferencing use cases There are two sub-cases in this broad category: full-mesh video conferencing, and centralized conferencing with resolution switching. The use cases also make particular assumptions regarding audio; the first case involves mono audio with panning at the web application, whereas the second involves mono or stereo audio with mixing at the server. We fist point out that these two use cases offer only a small subset of the functionality offered in today's multiparty videoconferencing solutions. In particular, the centralized server sub-case appears to deal with one, not frequently used, aspect of MCU-based communication that is implemented in a crude and unoptimized fashion (see below). We believe that video conferencing use cases in a standard drafted in 2011 should encompass at least as much functionality as is routinely offered in 2005 generation MCUs. We wonder, specifically, why the group is not (yet?) considering other use cases in the same broad category of centralized cponferencing that require more flexibile server architectures. For example, sometimes active speaker size-up is not wanted, i.e., in scenarios involving presentations with questions. Or, one wants to be able to up-size more than one (but less than the whole population) of speakers. Or a full continuous presence conference where it is up to the user of each receiving browser to decide how he/she wishes to render the received signals. All of these capabilities are available with today's MCUs. Many similar scenarios could be described. We plan to contribute more detailed descriptions in the future unless we receive pushback. We now examine each of the two use-case categories. 4.3.1. Use-case: Multiparty Video Communication On the surface, this use case is not very different from the Simple Video use case, at least from a codec viewpoint. Indeed, it appears that a browser just needs to be able to simultaneously process N incoming independent video bitstreams (as spelled out in F12 and F14). A slightly deeper examination will reveal that this use case appears to be a picture-perfect example of why scalable codecs offer superior performance. First off, all the arguments that were made for scalable codecs in Wenger & Eleftheriadis Expires January 5, 2012 [Page 8] Internet-Draft The Case for Layered Codecs July 2011 the Simple Video use case still apply. Let's consider, however, what happens when more than one stream is received. A first consideration is the CPU requirements for decoding all the pictures in full resolution. Suppose a browser is receiving, for example, four video pictures in standard TV resolution and in a non- scalable format. The computational load for decoding these pictures in addition to the load of encoding one's outgoing picture would pretty much max out today's desktop CPUs. In typical multipoint layouts, the reconstructed pictures of most of the peers would be shown in thumbnail format, thus throwing away anywhere between 75-95% of the reconstructed pixel count! What a waste. But waste what? Battery power? Yes, for handheld devices, which do not have the screen real-estate to show all the pictures in full resolution. We come to that in a moment. Electricity? Yes. This is not a joke. A desktop motherboard can easily double its power intake based on CPU load and memory access. Say, a motherboard's power consumption goes up from 100W to 200W, just because video is decoded in unnecessarily high resolution. One kWh of electricity costs at one of the author's home (Bay Area, California, USA, PG&E as the electricity provider) about 26 cents at his consumption level (single f amily house). This means that 8 hours of unnecessarily decoding of full- resolution video that is not rendered at full resolution will cost $0.20. This, by coincidence, is the exact same maximum amount that MPEG-LA charges as licensing fees for using H.264/SVC :-) Never mind the numbers of trees that can be saved... A second consideration is when multiple streams are received, or transmitted, by the browser in a heterogeneous receiver population. Browsers are today ubiquitous and can be found anywhere from handheld devices with QVGA screens to gaming racks with multiple 2K pixel resolution screens. That, incidentally, is a big part of the appeal of the RTCWEB activity. If you are sitting behind a 23-inch 1080p monitor, your display is approximately 120 dpi. A QVGA image, perfectly appropriate for being rendered on the screens of many smartphones, would be just 2-in by 1.5-in on the PC screen. On one of the authors' laptop (a 1080p, 13.3-in model, better than 200 dpi) it would literally be the size of a thumbnail. Not exactly what one expects to see, judging form daily uses of Skype and our own (Vidyo's) desktop video conferencing services. Further, when upsampling to, say, a quarter HD resolution, a QVGA signal simply doesn't look good no matter how well your upsample filter is designed. This means that a transmitting browser that creates bitstreams for a receiver population including a smartphone and a PC, would have to, ideally, generate at least three resolutions: thumbnail, something Wenger & Eleftheriadis Expires January 5, 2012 [Page 9] Internet-Draft The Case for Layered Codecs July 2011 useful for smartphones (e.g., QVGA), and something useful for PC users (VGA or better). More option would be better. There are commercial products available today that ship, using a software codec on a high-end PC, 1080p60 video conferencing. Two years down the road, and a mainstream PC will be able to handle that type of load. It has already been pointed out that the compute cycle requirements for simulcast and scalable encoding are roughly the same, assuming a common set of coding tools (i.e., H.264 Baseline profile vs. H.264 Scalable Baseline profile). The sending bandwidth requirements, however, are not the same: scalable beats simulcast by several tens of percentage points. There are additional considerations for simulcast vs. scalable coding that have to do with RTP packetization and access unit alignment, resolution switching, and error resilience, that require much more extensive analysis than what's intended in this document. 4.3.2. Use-case: Video Conferencing w/ Central Server This particular use case is interesting because, as formulated, it appears to assume that multiple video resolutions are produced by each sending browser and simulcast to the server. As formulated, the receiving browser receives the active speaker in full resolution and the other participants in low resolution. The server selects what to forward based on speech activity. Apparently, what is emulated here is one operation point implemented in many continuous presence MCUs (namely voice activation), without requiring the video transcoding features of a continuous presence MCU. This use case almost begs for use of scalable video coding. The simulcasting alternative would have all the drawbacks discussed in the Simple Video Communication Service. For example, consider what happens when the active speaker changes. How is the receiving participant to switch resolutions between the two simulcast streams? It appears to require the request (from server to sending browser) of an I frame, with all the drawbacks that entails. Actually, as described, there appears to be no value in simulcasting all resolutions to the server, because of the need of such interaction... 4.4. Embedded voice communication use cases The benefits discussed in multiparty video communication can apply here for scalable audio codecs. Scalable audio codecs haven't been mentioned prominently before, as in audio-visual systems the cost of video processing (in terms of compute cycles and bit rate, among others) typically outweighs that of audio. The requirements for today's games are so high that the CPU and bandwidth requirements for Wenger & Eleftheriadis Expires January 5, 2012 [Page 10] Internet-Draft The Case for Layered Codecs July 2011 a couple of audio channels may fall into the category of background noise. However, note that this ceases to be the case as the number of audio channels increases and the quality and (therefore) complexity of the audio codecs grows. We understand that the audio quality here would probably not be restricted to toll quality, and require something like Opus or MP3 for coding. While one or two of these codecs probably would still qualify as "background noise" on a gaming rack, 20 certainly do not. What appears to be useful here would be scalability both in terms of bitrate scalability and complexity scalability. 4.5. Bandwidth/QoS/mobility use cases The use cases as formulated appear to have limited impact on the codec choices beyond aspects already raised above. NIC changes (or other changes that materially affect connectivity) are best addressed by codecs that can flexibly change their operation points without essentially restarting the codec (such as sending I frames in video, or user-generated codebooks in audio). If QoS mechanisms that take advantage of QoS marking are supported in the underlying network - something that is at least questionable in today's Internet, but may be very relevant in the future and in special application fields such as the military - the pyramid structure of scalable codecs makes the selection of bits that are best to be transported at higher QoS trivial: the lower the layer, the higher the desired QoS. We note that scalable systems today emulate such a desirable network behavior by protecting base layers better than enhancement layers, through techniques such as retransmission or FEC. 5. Concluding Remarks Scalability, while known as a concept for decades, is a relatively new technique in the commercial sphere of video and audio communication products. As a result, a lot of people are not familiar with how it works, and how it can be beneficial to real-time communication systems. Most significantly, scalable coding has been successfully used to solve several fundamental, decades-old challenges in packet video communication, and it therefore behooves any standardization activity that considers codec design or adoption to take it into serious consideration. Wenger & Eleftheriadis Expires January 5, 2012 [Page 11] Internet-Draft The Case for Layered Codecs July 2011 6. Security Considerations None 7. References 7.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. 7.2. Informative References [I-D.ietf-rtcweb-use-cases-and-requirements] Holmberg, C., Hakansson, S., and G. Eriksson, "Web Real- Time Communication Use-cases and Requirements", draft-ietf-rtcweb-use-cases-and-requirements-00 (work in progress), June 2011. [eleft2006] Eleftheriadis, A., Cinvanlar, M., and O. Shapiro, "Multipoint Videoconferencing with Scalable Video Coding, Journal of Zhejiang University SCIENCE A, Vol. 7, Nr. 5, April 2006, pp. 696-705. (Proceedings of the Packet Video 2006 Workshop.)", 2006. [mpeg2007] HHI and MPEG, "MPEG Verification Test for SVCd", 2007, . [wenger1998] Wenger, S., "Temporal scalability using P-pictures for low-latency applications, Multimedia Signal Processing Workshop 1998, available from http://www.stewe.org/papers/mmsp98/243/index.htm", 1998. [wiegand2007] Schwarz, H., Marpe, D., and T. Wiegand, "Overview of the Scalable Video Coding Extension of the H.264/AVC Standard, IEEE Transactions on Circuits and Systems for Video Technology, Special Issue on Scalable Video Coding, vol. 17, no. 9, pp. 1103-1120, September 2007, Invited Paper.", 2006. Wenger & Eleftheriadis Expires January 5, 2012 [Page 12] Internet-Draft The Case for Layered Codecs July 2011 Authors' Addresses Stephan Wenger Vidyo, Inc. 2400 SKyfarm Dr. Hillsborough, CA 94010 US Email: stewe@stewe.org Alex Eleftheriadis Vidyo, Inc. 433 Hackensack Avenue Seventh Floor Hackensack, NJ 07601 US Email: alex@vidyo.com Wenger & Eleftheriadis Expires January 5, 2012 [Page 13]