CODEC                                                          C. Hoene 
Internet Draft                                   Universitaet Tuebingen 
Intended status: Informational                         December 3, 2010 
Expires: June 2011 
                                    
 
       Measuring the Quality of an Internet Interactive Audio Codec  
                     draft-hoene-codec-quality-00.txt 


Status of this Memo 

   This Internet-Draft is submitted in full conformance with the 
   provisions of BCP 78 and BCP 79. 

   Internet-Drafts are working documents of the Internet Engineering 
   Task Force (IETF), its areas, and its working groups. Note that 
   other groups may also distribute working documents as Internet-
   Drafts. 

   Internet-Drafts are draft documents valid for a maximum of six 
   months and may be updated, replaced, or obsoleted by other documents 
   at any time. It is inappropriate to use Internet-Drafts as reference 
   material or to cite them other than as "work in progress." 

   The list of current Internet-Drafts can be accessed at 
   http://www.ietf.org/ietf/1id-abstracts.txt 

   The list of Internet-Draft Shadow Directories can be accessed at 
   http://www.ietf.org/shadow.html 

   This Internet-Draft will expire on June 3, 2011. 

Copyright Notice 

   Copyright (c) 2010 IETF Trust and the persons identified as the 
   document authors. All rights reserved. 

   This document is subject to BCP 78 and the IETF Trust's Legal 
   Provisions Relating to IETF Documents 
   (http://trustee.ietf.org/license-info) in effect on the date of 
   publication of this document. Please review these documents 
   carefully, as they describe your rights and restrictions with 
   respect to this document. 


Hoene                    Expires June 3, 2011                  [Page 1] 

Internet-Draft              Codec Quality                 December 2010 
    

Abstract 

   The quality of a codec has to be measured by multiple parameters 
   such as audio quality, speech quality, algorithmic efficiency, 
   latency, coding rates and their respective tradeoffs. During 
   standardization, codecs are tested and evaluated multiple times to 
   ensure a high quality outcome. 

   As the upcoming Internet codec is likely to have unique features, 
   there is a need to develop new quality testing procedures to measure 
   these features. Thus, this draft reviews existing methods on how to 
   measure a codec's qualities, proposes a couple of new methods, and 
   gives suggestions which may be used for testing the Internet 
   Interactive Audio Codec (IIAC). 

   This document is work in progress. 

Conventions used in this document 

   In this document, equations are written in Latex syntax. An equation 
   starts with a dollar sign and ends with a dollar sign. The text in 
   between is an equation following the notation of Latex Version 2e. 
   In the PDF version of this document, as a courtesy to its readers, 
   all Latex equations are already rendered.  

Table of Contents 

   Conventions used in this document.................................2 
   1. Introduction...................................................4 
   2. Optimization Goal..............................................6 
   3. Measuring Speech and Audio Quality.............................7 
      3.1. Formal Subjective Tests...................................7 
         3.1.1. ITU-R Recommendation BS.1116-1.......................7 
         3.1.2. ITU-R Recommendation BS.1534-1 (MUSHRA)..............8 
         3.1.3. ITU-T Recommendation P.800...........................8 
         3.1.4. ITU-T Recommendation P.805...........................8 
         3.1.5. ITU-T Recommendation P.880...........................9 
         3.1.6. Formal Methods Used for Codec Testing at the ITU.....9 
      3.2. Informal Subjective Tests.................................9 
      3.3. Interview and Survey Tests................................9 
      3.4. Web-based Testing........................................10 
      3.5. Call Length and Conversational Quality...................10 
      3.6. Field Studies............................................12 
      3.7. Objective Tests..........................................13 
         3.7.1. ITU-R Recommendation BS.1387-1......................14 
         3.7.2. ITU-T Recommendation P.862..........................14 
         3.7.3. ITU-T Draft P.OLQA..................................15 
 
 
Hoene                    Expires June 3, 2011                  [Page 2] 

Internet-Draft              Codec Quality                 December 2010 
    

   4. Measuring Complexity..........................................15 
      4.1. ITU-T Approaches to Measuring Algorithmic Efficiency.....15 
      4.2. Software Profiling.......................................17 
      4.3. Cycle Accurate Simulation................................18 
      4.4. Typical run time environments............................19 
   5. Measuring Latency.............................................19 
      5.1. ITU-T Recommendation G.114...............................20 
      5.2. Discussion...............................................20 
   6. Measuring Bit and Frame Rates.................................21 
   7. Codec Testing Procedures Used by Other SDOs...................22 
      7.1. ITU-T Recommendation P.830...............................22 
      7.2. Testing procedure for the ITU-T G.719....................24 
   8. Transmission Channel..........................................25 
      8.1. ITU-T G.1050: Network Model for Evaluating Multimedia 
      Transmission Performance over IP (11/2007)....................26 
      8.2. Draft G.1050 / TIA-921B..................................27 
      8.3. Delay and Throughput Distributions on the Global Internet27 
      8.4. Transmission Variability on the Internet.................30 
      8.5. The Effects of Transport Protocols.......................30 
      8.6. The Effect of Jitter Buffers and FEC.....................33 
      8.7. Discussion...............................................33 
   9. Usage Scenarios...............................................34 
      9.1. Point-to-point Calls (VoIP)..............................34 
      9.2. High Quality Interactive Audio Transmissions (AoIP)......35 
      9.3. High Quality Teleconferencing............................35 
      9.4. Interconnecting to Legacy PSTN and VoIP (Convergence)....36 
      9.5. Music streaming..........................................36 
      9.6. Ensemble Performances over a Network.....................36 
      9.7. Push-to-talk like Services (PTT).........................37 
      9.8. Discussion...............................................38 
   10. Recommendations for Testing the IIAC.........................38 
      10.1. During Codec Development................................38 
      10.2. Characterization Phase..................................39 
         10.2.1. Methodology........................................39 
         10.2.2. Material...........................................39 
         10.2.3. Listening Laboratory...............................40 
         10.2.4. Degradation Factors................................40 
      10.3. Application Developers..................................41 
      10.4. Codec Implementers......................................42 
      10.5. End Users...............................................42 
   11. Security Considerations......................................42 
   12. IANA Considerations..........................................42 
   13. References...................................................43 
      13.1. Normative References....................................43 
      13.2. Informative References..................................43 
   14. Acknowledgments..............................................48 
    
 
Hoene                    Expires June 3, 2011                  [Page 3] 

Internet-Draft              Codec Quality                 December 2010 
    

1. Introduction 

   The IETF Working Group CODEC is standardizing an Internet 
   Interactive Audio and Speech Codec (IIAC). If the codec shall be of 
   high quality it is important to measure the codec's quality 
   throughout the entire process of development, standardization, and 
   usage. Thus, this document supports the standardizing process by 
   providing an overview of quality metrics, quality assessment 
   procedures, and other quality control issues and gives suggestions 
   on how to test the IIAC. 

   Quality must be measured by the following stakeholders and in the 
   following phases of the codec's development: 

   o  Codec developers must decide on different algorithms or parameter 
      sets during the development and enhancement of a codec. These 
      might also include the selection among multiple codec candidates 
      that implement different algorithms; however the WG Codec base 
      its work on a common consensus not on a competitive selection of 
      one of multiple codec contributions. Thus, measuring the quality 
      of codecs to select one might not be required. 
      Besides selection, one is obliged to debug the codec software. To 
      find errors and bugs - and programming mistakes are present in 
      any complex software - the developer has to test this software by 
      conducting quality measurements. 

   o  Typically the codec standardization includes a qualification 
      phase that measures the performance of a codec and verifies 
      whether it confirms to predefined quality requirements. In the 
      qualification phase, it becomes obvious whether the codec 
      development and standardization has been successful. Again, in 
      the process of rigorous testing during qualification phase, 
      algorithmic weaknesses and bugs in the implementation may be 
      found. Still, in complex software such as the IIAC, correctness 
      cannot be proved or guaranteed. 


Hoene                    Expires June 3, 2011                  [Page 4] 

Internet-Draft              Codec Quality                 December 2010 
    

   o  Users of the codec need to know how well the codec is performing 
      while manufactures need to decide whether to include the IIAC in 
      their products. Quality measures play an important role in this 
      decision process. Also, the numerous quality measurement results 
      of the quality help developers of the VoIP system to dimension or 
      tune their system to take optimal advantage of a codec. For 
      example, during network planning, operators can predict the 
      amount of bandwidth needed for high quality voice calls.  
      An adaptive VoIP application needs to know which quality is 
      achieved with a different codec parameters set to be able to make 
      an optimal selection of the codec parameters under varying 
      network conditions.  
      As suggested in [50] an RTP payload specification for an IIAC 
      codec should include a rate control. Similar to the performance 
      of the codec, the rate control unit has a big impact on the 
      overall quality of experience. Thus, it should be tested well 
      too.  

   o  Software implementers need to verify whether their particular 
      codec implementation that might be optimized on a specific 
      platform confirms to the standard's reference implementation. 
      This is particularly important as some intellectual property 
      rights might only be granted, if the codec conforms to the 
      standard.  
      As the IIAC must not to be bit conform, which would allow simple 
      comparisons of correctness, other means of conformance testing 
      must be applied.  
      In addition, the standard conformance and interoperability of 
      multiple implementations must be checked.  
      Last but not least, implementers may implement optimized 
      concealment algorithms, jitter buffers or other algorithms. Those 
      algorithms have to be tested, too. 

   o  Since the success of MP3, end users do acknowledge the existence 
      of a high quality codec. It would make sense to use the IIAC in a 
      brand marketing campaign (such as "Intel inside"). A quality 
      comparison between IIAC and other codecs might be part of the 
      marketing. Online testing with user participation might also 
      raise the awareness level. 

   All those stakeholders might have different requirements regarding 
   the codec's quality testing procedures. Thus, this document tries to 
   identify those requirements and shows which of the existing quality 
   measurement procedures can be applied to fulfill those specific 
   demands efficiently. 


Hoene                    Expires June 3, 2011                  [Page 5] 

Internet-Draft              Codec Quality                 December 2010 
    

   In the following section we describe a primary optimization goal: 
   Quality of Experience (QoE). Next, we briefly list the most common 
   methods of how to perform subjective evaluations on speech and audio 
   quality. In Section 4, 5, and 6, we discuss on how to measure 
   complexity, latency, and bit- and frame rates. Section 7 describes 
   how other SDOs have measured the quality of their codecs. As 
   compared IIAC to previous standardized codecs, the IIAC is likely to 
   have different unique requirements and thus needs newly developed 
   quality testing procedures. To achieve this, in Section 8 we 
   describe the properties of Internet transmission paths. Section 9 
   summarizes the usage scenarios, for which the codec is going to be 
   used and finally, in Section 10, we recommend procedures on how to 
   test the IIAC. 

2. Optimization Goal 

   The aim of the Codec WG is to produce a codec of high quality. 
   However, how can quality be measured? The measurement of the 
   features of a codec can be based on many different criteria. Those 
   include complexity, memory consumption, audio quality, speech 
   quality, and others. But in the end, it's the users' opinions that 
   really count since they are the customers. Thus, one important - if 
   not the most important quality measure of the IIAC - shall be the 
   Quality of Experience (QoE).   

   The ITU-T Standards ITU-T P.10/G.100 [22] defines the term "Quality 
   of Experience" as "the overall acceptability of an application or 
   service, as perceived subjectively by the end-user." The ITU-T 
   document G.RQAM [21] extends this definition by noting that "quality 
   of experience includes the complete end-to-end system effects 
   (client, terminal, network, services infrastructure, etc.)" and that 
   the "overall acceptability may be influenced by user expectations 
   and context". 

   These definitions already give guidelines on how to judge the 
   quality of the IIAC: 

   o  The acceptability and the subjective quality impression of 
      endusers have to be measured (Section 3). 

   o  The IIAC codec has to be tested as part of an entire 
      telecommunication system. It must be carefully considered whether 
      to measure the codec's performance just in a stand-alone setup or 
      to evaluate it as part of the overall system (Section 8).  


Hoene                    Expires June 3, 2011                  [Page 6] 

Internet-Draft              Codec Quality                 December 2010 
    

   o  The environments and contexts of particular communication 
      scenarios have to be considered and controlled because they have 
      an impact on the human rating behavior and on quality 
      expectations and requirements (Section 9). 

3. Measuring Speech and Audio Quality 

   The perceived quality of a service can be measured by various means. 
   If humans are interrogated, those quality tests are called 
   subjective. If the tests are conducted by instrumental means (such 
   as an algorithm) they are called objective. Subjective tests are 
   divided up into formal and informal tests. Formal tests follow 
   strictly defined procedures and methods and typically include a 
   large number of subjects. Informal tests are less precise because 
   they are conducted in an uncontrolled manner. 

3.1. Formal Subjective Tests 

   Formal subjective tests must follow a well-defined procedure. 
   Otherwise the results of multiple tests cannot be mutually compared 
   and are not repeatable. Most subjective testing procedures have been 
   standardized by the ITU.  If applied to coding testing, the testing 
   procedures follow the same pattern [26]: 

     "Performing subjective evaluations of digital codecs proceeds 
     via a number of steps: 

       o Preparation of source speech materials, including recording of 
          talkers; 

       o Selection of experimental parameters to exercise the features 
          of the codec that are of interest; 

       o Design of the experiment; 

       o Selection of a test procedure and conduct of the experiment; 

       o Analysis of results." 

   The ITU has standardized different formal subjective tests to 
   measure the quality of speech and audio transmission, which are 
   described in the following. 

3.1.1. ITU-R Recommendation BS.1116-1 

   The ITU-R BS.1116-1 standard [14] is good for audio items with small 
   degradations (stimuli) and uses a continuous scale from 
 
 
Hoene                    Expires June 3, 2011                  [Page 7] 

Internet-Draft              Codec Quality                 December 2010 
    

   imperceptible (5.0) to very annoying (1.0). It is a double blind 
   triple-stimulus with a hidden reference testing method and must be 
   done twice for the degraded sample and the hidden reference. In a 30 
   minutes session, 10-15 sample items can be judged. Overall, about 20 
   subjects shall rate the items. Testing shall take place with 
   loudspeakers in a controlled environment or with headphones in a 
   quiet room. 

3.1.2. ITU-R Recommendation BS.1534-1 (MUSHRA) 

   The ITU-R BS.1534-1 standard [16] defines a method for the 
   subjective assessment of intermediate quality levels. Multiple audio 
   stimuli are compared at the same time. Maximal 12 but preferably 
   only 8 stimuli plus a hidden one with Hidden Reference and an anchor 
   are compared and judged. MUSHRA uses a continuous quality scale 
   (CQS) ranging from 0 to 100 divided into five equal intervals ("bad" 
   to "excellent"). In 30 minutes, about 42 stimuli can be tested. 
   Again, 20 test subjects shall rate the items with either headphones 
   or loudspeakers. 

   The standard recommends using as lower anchor a low-pass filtered 
   version with a bandwidth limit of 3.5 kHz. Additional anchors are 
   recommended, especially if specific distortions are to be tested. 

3.1.3. ITU-T Recommendation P.800 

   The ITU-T P.800 defines multiple testing procedures to assess the 
   speech quality of telephone connections. The most important   
   procedure is called listening-only speech quality of telephone 
   connections. Listeners rate short groups of unrelated sentences. The 
   listeners are taken from the normal telephone-using population (no 
   experts). They use a typical sending system (e.g. a local telephone) 
   that may follow "modified IRS" frequency characteristics. The 
   results is the listening-quality scale, which is an absolute 
   category scale (ACS) ranging from excellent=5 to bad=1. Listeners 
   can judge about 54 stimuli within 30 minutes. 

   Other tests described in P.800 measure listening-effort, loudness-
   preference scale, conversation opinion and difficulty, 
   delectability, degradation, or minimal differences. 

3.1.4. ITU-T Recommendation P.805 

   The P.805 standard [24] extends P.800 and defines precisely how to 
   measure conversational quality.  Subjects have to do conversation 
   tests to evaluate the communication quality of a connected. Expert, 
   experienced or untrained (naive) subjects have to do these tests 
 
 
Hoene                    Expires June 3, 2011                  [Page 8] 

Internet-Draft              Codec Quality                 December 2010 
    

   collaboratively in soundproof cabinets. Typically, 6 transmission 
   conditions can be tested within 30 minutes. Depending on the 
   required precision, these tests have to be made 20 to 40 times. 

3.1.5. ITU-T Recommendation P.880 

   To measure time-variable distortion, a continuous evaluation of 
   speech quality has been defined in P.880 [31]. Subjects have to 
   assess transmitted speech quality consisting of long speech 
   sequences with quality/time fluctuations. The quality is rated on a 
   continuous scale ranging from Excellent=5 to Bad=1 is dynamically 
   changed over the time while the stimuli are played. Stimuli have a 
   length of between 45 seconds and 3 minutes. 

3.1.6. Formal Methods Used for Codec Testing at the ITU 

   In the last year, new narrow and wideband codecs have been tested 
   using ITU-T P.800 (and ITU-T P.830). For the ITU-T G.719 standard, 
   which supports besides speech content also audio, the ITU-R BS.1116-
   1 testing method has been applied during the selection of potential 
   codec candidates. During the qualification phase, the method that 
   was used was the ITU-P BS.1584-1. For the ITU-T G.718 codec, the 
   Absolute Category Rating (ACR) following ITU-T P.800 has been 
   applied.  

3.2. Informal Subjective Tests 

   Besides formal tests, informal subjective tests following less 
   stringent conditions might be taken to judge the quality of stimuli. 
   However, informal tests cannot be easily verified and lack the 
   reliability, accuracy and precision of formal tests. Informal tests 
   are needed if the available number of subjects who are able to 
   conduct the tests is low, or if time or money is limited. 

3.3. Interview and Survey Tests 

   In ITU-T P.800 [23] and [9] interview and survey tests are 
   described. In P.800, it says that "if the rather large amount of 
   effort needed is available and the importance of the study warrants 
   it, transmission quality can be determined by 'service 
   observations'." 

   These service observations are based on statistical surveys common 
   in social science and marketing research. Typically, the questions 
   asked in a survey are structured. 


Hoene                    Expires June 3, 2011                  [Page 9] 

Internet-Draft              Codec Quality                 December 2010 
    

   In addition, according to [23]: "To maintain a high degree of 
   precision a total of at least 100 interviews per condition is 
   required. A disadvantage of the service-observation method for many 
   purposes is that little control is possible over the detailed 
   characteristics of the telephone connections being tested." 

3.4. Web-based Testing 

   If the large-wide scale proliferation of the Internet, researchers 
   suggested testing the speech or audio quality on web sites via web 
   site visitors [43]. A current web site that compares multiple audio 
   codecs has been setup at SoundExpert.org [42]. On this web site, a 
   user can download an audio item that consists of a reference item 
   and a degraded item. Then, the user must identify the reference and 
   rate the ODG of the degraded item. The tests are single-blind as the 
   user does not know which codec he is currently rating. 

   One can anticipate that the visitors of web sites will use similar 
   equipment for testing of audio samples and for conducting VoIP 
   calls. Thus, web site testing can be made realistic in a way that 
   considers the impact of (typically used) loudspeakers and 
   headphones. 

   However, currently used web sites lack a proper identification of 
   outliers. Thus, all ratings of all users are considered despite the 
   fact that they might be (deliberately) faked or that subjects might 
   not be able to hear well the acoustic difference. Thus, one can 
   expect that web based ratings will show a high degree of variation 
   and that many more tests are needed to achieve the same confidence 
   that is gained within formal tests. A profound scientific study on 
   the quality of web based audio rating has not yet been published. 
   Thus, any statements on the validity of web based rating are 
   premature.  

3.5. Call Length and Conversational Quality 

   In the ETSI technical report document ETR-250 [6], a model is 
   presented that discusses various impairments caused in narrow band 
   telephone systems. The ETSI model describes the combinatorial effect 
   of all those impairments. The ETSI model later became the famous E-
   Model described in ITU-T G.107. Both the ETSI- and the E-Model 
   calculate the R factor that ranges from 0 (bad) to 100 (excellent 
   conversational quality). 

   Based on the R factor, the users' reaction to the voice transmission 
   quality of a connection can be predicted. For example, Section 8.3 
   describes the effect that users terminate the call if the quality is 
 
 
Hoene                    Expires June 3, 2011                 [Page 10] 

Internet-Draft              Codec Quality                 December 2010 
    

   bad. More precisely, they summarize it as users who "(i) terminate 
   their calls unusually early, (ii) re-dial or even (iii) actually 
   complain to the network operator". 

   In the ETSI model, the percentage of users "terminating calls 
   early", TME, is given as 

      $TME=100\cdot erf\left(\frac{36-R}{16}\right)\%$ 

   with $erf(X)$ being the sigmoid shaped Gaussian error function and 
   $R$ the R Factor of the E-Model (Figure 1). This relation is based 
   on results from "AT&T Long toll" interviews as cited in [2]. 

   These findings have been confirmed by Holub et al. [12] who have 
   studied the correlation between call length and narrow band speech 
   quality. Birke et al. [1] have also studied the duration of phone 
   calls which show a duration varying with day time and day of the 
   week and also may be affected by pricing schemata. 


Hoene                    Expires June 3, 2011                 [Page 11] 

Internet-Draft              Codec Quality                 December 2010 
    

             100 -+TME.                               +- 5 
                  |..iii.                             | 
        T         |    .ii                            |   
        e         |      ii                        MOS|          
        r         |       i.                     .iiii|          
        m     80 -+       .i.                  .ii.   |          
        i         |        .i                .ii.     +- 4         
        n         |         i.              .i.       |      M   
        a         |         .i            .ii.        |      O   
        t         |          i.          .i.          |      S    
        e     60 -+          .i         .i.           |      |    
                  |           i.       ii.            |      C   
        E         |           .i     .ii              +- 3   Q    
        a         |            i.   .i.               |      E 
        r     40 -+            .i  .i.                |         
        l         |             i..i.                 |         
        y         |             .ii.                  |         
                  |            .il.                   |         
        (         |           .i..i                   +- 2 
        T     20 -+          .i.  i.                  |            
        M         |        .ii.   .i.                 |            
        E         |      .ii.      .i.                |            
        )         |    .ii.         .ii.              |            
                  |MOSlii.           .iiiiiiiiiiiiiTME|          
               0 -+-----------------+-----------------+- 1 
                  |                 |                 |                          
                  0                 50               100 

                                R Factor 

    Figure 1 - Relation between calls terminating early, the R Factor, 
                 and the speech quality given in (MOS-CQE) 

   Whereas bad quality is related to short calls, it remains unproven 
   whether better quality (>4 MOS) results in longer phone calls. There 
   are two factors which might have an opposite effect on the call 
   length. On the one hand, if the quality is superb, the talkers might 
   be more willing to talk because of the pleasure of talking, on the 
   other hand they might fulfill their conversational tasks faster 
   because of the great quality Thus, depending on the context, good 
   speech quality might result either in longer or shorter calls. 

3.6. Field Studies 

   Field studies can be conducted if usage data on calls are collected. 
   Field studies are useful to monitor real user behavior and to 
   collect data about the actual conversational context. 
 
 
Hoene                    Expires June 3, 2011                 [Page 12] 

Internet-Draft              Codec Quality                 December 2010 
    

   Because of highly varying conditions, the precision of those 
   measurements is high and many tests have to be done to get 
   significantly different measurement values. Also, the tests are not 
   repeatable because the conditions are changing with time. 

   For example, Skype has done quality tests in a deployed VoIP system 
   in the field with its users as testers [47]. The subjective tests 
   are done in the following manner. 

   o  Download of test vectors to VoIP clients. Typically, this can be 
      done with an automated software update. 

   o  Delivery changing VoIP configurations (such as the used codecs) 
      so that different calls are subjected to different 
      configurations. The selection of configurations can be done 
      randomly, alternating in time or based on other criteria. 

   o  Collecting feedback from the users. For example, the following 
      parameters can be monitored or recorded: 

       o The call length and other call specific parameters 

       o A user's quality voting (e.g. MOS-ACR) after the call 

       o Other feedback of the user (e.g. via support channels) 

   The field tests have the benefit of being conducted under real 
   conditions with the real users. However, they have some drawbacks. 
   First, the experimental conditions cannot be controlled well. 
   Second, the tests are only valid for the current situations and do 
   not allow predictions for other use cases. Third, the statistical 
   significance might be largely questionable if confidence intervals 
   are overlapping.  

   The costs for running the tests are low because the users are doing 
   the tests for free. However, the operator might lose users after a 
   user experienced a test case causing bad quality.   

3.7. Objective Tests 

   Objective tests, also called instrumental tests, try to predict the 
   human rating behavior with mathematical models and algorithms. They 
   also calculate quality ratings for a given set of audio items. 
   Naturally, they are not rating as precisely as their human 
   counterparts, whom they try to simulate. However, the results are 
   repeatable and less costly than formal subjective testing campaigns.  
   Instrumental methods have a limited precision. That means that their 
 
 
Hoene                    Expires June 3, 2011                 [Page 13] 

Internet-Draft              Codec Quality                 December 2010 
    

   quality ratings do not perfectly match the results of formal 
   listening-only tests. Typically, the correlation between formal 
   results and instrumental calculations are compared using a 
   correlation function. The resulting metric is given as R ranging 
   from 0 (no correlation) to 1 (perfect match). 

   Over the last years, several objective evaluation algorithms have 
   been developed and standardized. We describe them briefly in the 
   following. 

3.7.1. ITU-R Recommendation BS.1387-1 

   The ITU developed an algorithm that is called Perceptual Evaluation 
   of Audio Quality (PEAQ). It was published in the document ITU-R 
   BS.1387 called Method for objective measurements of perceived audio 
   quality in 1998 [15]. PEAQ is intended to predict the quality rating 
   of low-bit-rate coded audio signals. Two different versions of PEAQ 
   are provided: a basic version with lower computational complexity 
   and an advanced version with higher computational complexity. 

   PEAQ calculates a quality grading called "Objective Difference 
   Grade" (ODB) ranging from 0 to -4. Typically, it shows a prediction 
   quality of between R=0.85 and 0.97 when compared to subjective 
   testing results. The ITU-T Study Group 12 assumes that PEAQ can 
   detect auditable differences between two implementations of the same 
   codec [5].   

3.7.2. ITU-T Recommendation P.862 

   The ITU-T PESQ algorithm [27] is intended to judge distortions 
   caused by narrow band speech codecs and other kind of channel and 
   transmission errors. These include also variable delays, filtering 
   and short localize distortions such as those caused by frame loss 
   concealment. For a large number of conditions, the validity and 
   precision of PESQ has been proven. For untested distortions, prior 
   subjective tests must be conducted to verify whether PESQ judges 
   these kinds of distortions precisely. Also, it is recommended to use 
   PESQ for 3.1 kHz (narrow-band) handset telephony and narrow-band 
   speech codecs only. For wide-band operations, a modified filter has 
   to be applied prior to the tests. 

   Furthermore, the ITU-T Recommendation P.862.1 [28] describes how to 
   transfer the PESQ's raw scores, which range from -0.5 to 4.5, to 
   MOS-LQO values similar to those gathered from ACR ratings. Then, as 
   it has been shown, the correlation between a large corpus of testing 
   samplings shows a correlation of R=0.879 (instead of R=0.876) 
   between subjective and MOS-LQO (respective PESQ raw) ratings. The 
 
 
Hoene                    Expires June 3, 2011                 [Page 14] 

Internet-Draft              Codec Quality                 December 2010 
    

   ITU-T Recommendation P.862.2 [29] modifies the PESQ algorithm 
   slightly to support wideband operations. And finally, the ITU-T 
   Recommendation P.862.3 [30] gives detailed hints and recommendations 
   on how and when to use the PESQ algorithms. 

3.7.3. ITU-T Draft P.OLQA 

   The soon-to-be standardized algorithm P.OLQA [40] extends PESQ and 
   will be able to rate narrow to super-wideband speech and the effect 
   of time-varying speech playout. Later distortions are common in 
   modern VoIP systems which stretch and shrink the speech playout 
   during voice activity to adapt it to the delay process of the 
   network. 

4. Measuring Complexity 

   Besides audio and speech quality, the complexity of a codec is of 
   prime importance. Knowing the algorithmic efficiency is important 
   because: 

   .  the complexity has an impact on power consumption and system 
      costs 

   .  the hardware can be selected to fit pre-known complexity 
      requirements and 

   .  different codec proposals can be compared if they show similar 
      performances in other aspects. 

   Before any complexity comparisons can be made, one has to agree on 
   an objective, precise, reliable, and repeatable metric on how to 
   measure the algorithmic efficiency. In the following, we list three 
   different approaches. 

4.1. ITU-T Approaches to Measuring Algorithmic Efficiency 

   Over the last 17 years, the ITU-T Study Group 16 measured the 
   complexity of codecs using a library called ITU-T Basic Operators 
   and described in ITU-T G.191 [19], which counts the kind and number 
   of operations and the amount of memory used. The latest version of 
   the standard supports both fix-point operations of different widths 
   and floating operations.  Each operation can be counted 
   automatically and weighted accordingly. The following source code is 
   an [edited] excerpt from the source file baseop32.h: 
    
    
Hoene                    Expires June 3, 2011                 [Page 15] 

Internet-Draft              Codec Quality                 December 2010 
    

   /* Prototypes for basic arithmetic operators */ 

   /* Short add,           1   */  
   Word16 add (Word16 var1, Word16 var2);    

   /* Short sub,           1   */ 
   Word16 sub (Word16 var1, Word16 var2);    

   /* Short abs,           1   */ 
   Word16 abs_s (Word16 var1);               

   /* Short shift left,    1   */ 
   Word16 shl (Word16 var1, Word16 var2);    

   /* Short shift right,   1   */ 
   Word16 shr (Word16 var1, Word16 var2);    

   ... 

   /* Short division,       18  */ 
   Word16 div_s (Word16 var1, Word16 var2); 

   /* Long norm,             1  */ 
   Word16 norm_l (Word32 L_var1);           

   In the upcoming ITU-T G.GSAD standard another approach has been used 
   as shown in the following code example. For each operation, WMPOS 
   functions have been added, which count the number of operations. If 
   the efficiency of an algorithm has to be measured, the program is 
   started and the operations are counted for a known input length. 

   for (i=0; i<NUM_BAND; i++) 
   { 
   #ifdef WMOPS_FX 
       move32();move32(); 
       move32();move32(); 
   #endif 
       state_fx->band_enrg_long_fx[i] = 30; 
       state_fx->band_enrg_fx[i] = 30; 
       state_fx->band_enrg_bgd_fx[i] = 30;    
       state_fx->min_band_enrg_fx[i] = 30; 
   } 


Hoene                    Expires June 3, 2011                 [Page 16] 

Internet-Draft              Codec Quality                 December 2010 
    

4.2. Software Profiling 

   The previously described methods are well-established procedures on 
   how to measure computational complexity. Still, they have some 
   drawbacks: 

   o  Existing algorithms must be modified manually to include 
      instructions that count arithmetic operations. In complex codecs, 
      this may take substantial time. 

   o  The CPU model is simple as it does not consider memory access 
      (e.g. cache), parallel executions, or other kinds of optimization 
      that are done in modern microprocessors and compilers. Thus, the 
      number of instructions might not correlate to the actual 
      execution time on modern CPUs. 

   Thus, instead of counting instructions manually, run times of the 
   codec can be measured on a real system. In software engineering, 
   this is called profiling. The Wikipedia article on profiling [54] 
   explains profiling as follows: 

     "In software engineering, program profiling, software profiling or 
     simply profiling, a form of dynamic program analysis (as opposed 
     to static code analysis), is the investigation of a program's 
     behavior using information gathered as the program executes. The 
     usual purpose of this analysis is to determine which sections of a 
     program to optimize - to increase its overall speed, decrease its 
     memory requirement or sometimes both. 

       o  A (code) profiler is a performance analysis tool that, most 
          commonly, measures only the frequency and duration of 
          function calls, but there are other specific types of 
          profilers (e.g. memory profilers) in addition to more 
          comprehensive profilers, capable of gathering extensive 
          performance data 

       o  An instruction set simulator which is also - by necessity - a 
          profiler, can measure the totality of a program's behaviour 
          from invocation to termination." 

   Thus, a typical profiler such as the GNU gprof can be used to 
   measure and understand the complexity of a codec implementation. 
   This is precisely the case because it is used on modern computers. 
   However, the execution times depend on the CPU architecture, the PC 
   in general, the OS and parallel running programs. 


Hoene                    Expires June 3, 2011                 [Page 17] 

Internet-Draft              Codec Quality                 December 2010 
    

   To ensure repeatable results, the execution environment (i.e. the 
   computer) must be standardized. Otherwise the results of run times 
   cannot be verified by other parties as the results may differ if 
   done under slightly changed conditions. 

4.3. Cycle Accurate Simulation 

   If reliable and repeatable results are needed, another similar 
   approach can be chosen. Instead of run times, CPU clock cycles on a 
   virtual reference system can be measured. Quoting Wikipedia again 
   [52]: 

     "A Cycle Accurate Simulator (CAS) is a computer program that 
     simulates a microarchitecture cycle-accurate. In contrast 
     an instruction set simulator simulates an Instruction Set 
     Architecture usually faster but not cycle-accurate to a specific 
     implementation of this architecture." 

   With a cycle accurate simulator, the execution times are precise and 
   repeatable for the system that is being studied. If two parties make 
   measurements using different real computers, they still get the same 
   results if they use the same CAS. 

   A cycle accurate simulator is slower than the real CPU by a factor 
   of about 100. Also, it might have a measurement error as compared to 
   the simulated, real CPU because the CPU is typically not perfectly 
   modeled. 

   If an x86-64 architecture shall be simulated, the open-source Cycle 
   accurate simulator called PTLsim can be considered [55]. PTLsim 
   simulates a Pentium IV. On their website, the authors of PTLsim 
   write: 

     "PTLsim is a cycle accurate x86 microprocessor simulator and 
     virtual machine for the x86 and x86-64 instruction sets. PTLsim 
     models a modern superscalar out of order x86-64 compatible 
     processor core at a configurable level of detail ranging from 
     full-speed native execution on the host CPU all the way down 
     to RTL level models of all key pipeline structures." 

   Another cycle accurate simulator called FaCSIM simulated the ARM9E-S 
   processor core and ARM926EJ-S memory subsystem [36]. It is also 
   available as open-source. Texas Instruments also provides as CAS for 
   its C64x+ digital signal processor [44]. 

   To have a metric that is independent of a particular architecture, 
   the results of cycle accurate simulators could be combined. 
 
 
Hoene                    Expires June 3, 2011                 [Page 18] 

Internet-Draft              Codec Quality                 December 2010 
    

4.4. Typical run time environments 

   The IIAC codec will run on various different platforms with quite 
   diverse properties. After discussions on the WG mailing list, a few 
   typical run time environments have been identified.  

   Three of the run time environments are end devices (aka phones). The 
   first one is a PC, either stationary or a portable, having a >2 GHz 
   PCU, >2 GByte of RAM, and a hard disk for permanent storage. 
   Typically, a Windows, MacOS or Linux operating system is running on 
   a PC. The second one is a SmartPhone, for example with an ARM11 500 
   MHz CPU, 192 Mbyte RAM and 256 MByte Flashrom. An example is the HTC 
   Dream Smart phone equipped with Qualcomm MSM7201A chip. Various 
   operating systems are found on those devices such as Symbian, 
   Android, and iOS. The last ones are high end stationary VoIP phones 
   with for example a 275-MHz MIPS32 CPU (with 400 DMIPS) with a 125-
   MHz (250 MIPS) ZSP DSP with dual-MAC. They both have more than 1 
   Mbyte RAM and FlashRom. An exemplary Chip is the BCM1103 [3]. 

   Besides phones, VoIP gateways are frequently needed for conferencing 
   or transcoding to legacy VoIP or PSTN. In this case, two different 
   platforms have been identified. The first one is based on standard 
   PC server platforms. It consists, for example, of an Intel six core 
   Xeon 54XX or 55XX, two 1 GB NIC, 12 GByte RAM, hard disks, and a 
   Linux operating system. Thus, a server can serve from 400 to 10000 
   calls depending on conference mode, codecs used, and ability of user 
   pre-encoded audio [46]. On the other hand, high density, highly 
   optimized voice gateways use a special purpose hardware platform 
   like for example, TNETV3020 chips consisting of six TI C64x+ DSPs 
   with 5.5 MB internal RAM. If they run with a Telogy conference 
   engine, they might serve about 1300 AMR or 3000 G.711 calls per chip 
   [45]. 

5. Measuring Latency 

   Latency is a measure of time delay experienced in a system. Latency 
   can be measured as one-way delay or as round-trip time. The latter 
   one is the one-way latency from a source to destination plus the 
   one-way latency back from destination to source. Latency can be 
   measured at multiple positions, at the network layer or at higher 
   layers [53]. 

   As we aim to increase the Quality of Experience, the mouth-to-ear 
   delay is of importance because it directly correlates with 
   perceptual quality [17]. More precisely, the acoustic round-trip 
   time shall be a means of optimization when studying interactive and 
   conversational application scenarios. 
 
 
Hoene                    Expires June 3, 2011                 [Page 19] 

Internet-Draft              Codec Quality                 December 2010 
    

5.1. ITU-T Recommendation G.114 

   The G.114 standard [45] gives guidelines on how to estimate one-way 
   transmission delays. It describes how the delay introduced by the 
   codec is generated. Because most of the encoders do a processing of 
   frames, the duration of a frame (named "frame size") is the foremost 
   contributor to the overall algorithmic delay. Citing [18]: 

     "In addition, many coders also look into the succeeding frame to 
     improve compression efficiency. The length of this advance look is 
     known as the look-ahead time of the coder. The time required to 
     process an input frame is assumed to be the same as the frame 
     length since efficient use of processor resources will be 
     accomplished when an encoder/decoder pair (or multiple 
     encoder/decoder pairs operating in parallel on multiple input 
     streams) fully uses the available processing power (evenly 
     distributed in the time domain). Thus, the delay through an 
     encoder/decoder pair is normally assumed to be:" 

   $2*frameSize + lookAhead$ 

   In addition, if the link speeds are low, the serialization delay 
   might contribute significantly to the codec delay. 

   Also, if IP transmissions are used and multiple frames are 
   concatenated in one IP packet, further delay is added. Then, "the 
   minimum delay attributable to codec-related processing in IP-based 
   systems with multiple frames per packet is:" 

   $(N+1)*frameSize + lookAhead$ 

   "where N is the number of frames in each packet." 

5.2. Discussion 

   Extensive discussion on the WG mailing list led to the insight that 
   the afore mentioned ITU delay model overestimates the delay 
   introduced by the codec. In the last decade, two developments led to 
   slightly other conditions. 

   First, the processing power of CPU increased significantly (see 
   Section 4.4). Nowadays, even stand-alone VoIPs have CPUs with a 
   speed of 300 MHz. They are capable of doing the encoding and 
   decoding faster than real time. Thus, also the delay introduced by 
   processing is not at 100% anymore but significantly lower. For 
   example, it might be just 10% or less. 

 
Hoene                    Expires June 3, 2011                 [Page 20] 

Internet-Draft              Codec Quality                 December 2010 
    

   Second, even if the CPUs are fully loaded, especially if also other 
   tasks such as a video conference or other calls need to be 
   processed, advantaged scheduling algorithms allow for a timely 
   encoding and decoding. For example, a staggered processing schedule 
   can be used to reduce processing delays [45]. 

   Thus, the impact of processing delay is reduced significantly in 
   most of the cases. 

   Moreover, besides a look-ahead time, the decoder might also 
   contribute to the algorithmic delay e.g. if decoded and concealed 
   periods shall be mixed well. 

6. Measuring Bit and Frame Rates 

   For decades, there was a quest to achieve high quality while keeping 
   the coding rate low. Coding rate, sometimes called multimedia bit 
   rate, is the bit rate that an encoder produces as its output stream. 
   In cases of variable rate encoding, the coding bit rate differs over 
   time. Thus, one has to describe the coding rate statistically. For 
   example, minimal, mean, and maximal coding rates need to be 
   measured. 

   A second parameter is the frame rate as the encoder produces frames 
   at a given rate.  Again, in case of discontinuous transmission modes 
   (DTX), the frame rate can vary and a statistical description is 
   required. 

   Both coding and frame rate influence network related bit rates. For 
   example, the physical layer gross bit rate is the total number of 
   physically transferred bits per second over a communication link, 
   including useful data as well as protocol overhead [51]. It depends 
   on the access technology, the packet rate, and packet sizes. The 
   physical layer net bit rate is measured in a similar way but 
   excludes the physical layer protocol overhead. The network 
   throughput is the maximal throughput of a communication link of an 
   access network. Finally, the goodput or data transfer rate refers to 
   the net bit rate delivered to an application excluding all protocol 
   headers and data link layer retransmissions, etc. Typically, to 
   avoid packet losses or queuing delay, the goodput shall be equally 
   large as the coding rate.  

   The relation between goodput and the physical layer gross bit rate 
   is not trivial.  First of all, the goodput is measured end-to-end. 
   The end-to-end path can consist of multiple physical links, each 
   having a different overhead. Second, the overhead of physical layers 
   may vary with time and load, depending for example on link 
 
 
Hoene                    Expires June 3, 2011                 [Page 21] 

Internet-Draft              Codec Quality                 December 2010 
    

   utilization and link quality. Third, packets may be tunneled through 
   the network and additional headers (such as IPsec) might be added. 
   Fourth, IP header compression might be applied (as in LTE networks) 
   and the overhead might be reduced. Overall, many information about 
   the network connection must be collected to predict what the 
   relation between physical layer gross bit rate and a given coding 
   and frame rate is going to be. Applications, which have only a 
   limited view of the network, can hardly know the precise relation. 

   For example, the DCCP TFRC-SP transport protocol simply estimates a 
   header size on data packets of 36 bytes (20 bytes for the IPv4 
   header and 16 bytes for the DCCP-Data header with 48-bit sequence 
   numbers) [7][8]. Thus, [11] suggested a typical scenario in which 
   one encoded frame is transmitted with the RTP, UDP, IPv4 and IEEE 
   802.3 protocols and thus each packet contains packet headers having 
   12 bytes, 8 bytes, 20 bytes and 18 bytes respectively. The gross bit 
   rate calculates as 

   $r_{gross}=r_{coding}+overhead \cdot framerate$ 

   where $r_{coding}$ is the coding rate of the encoding, $framerate$ 
   is the frame rate of the codec, $overhead$ is the number of bits for 
   protocol headers in each packet (typically 58*8=464), and the 
   $r_{gross}$ is the rate used on physical mediums. 

7. Codec Testing Procedures Used by Other SDOs 

   To ensure quality, each newly standardized codec is rigorously 
   tested. ITU-T Study Group 12 and 16 have developed very good and 
   mature procedures on how to test codecs. The ITU-T Study Group 12 
   has described the testing procedures of narrow- and wide-band codecs 
   in the ITU-T P.830 standard. 

7.1. ITU-T Recommendation P.830 

   The ITU-T P.830 recommendation describes methods and procedures for 
   conducting subjective performance evaluations of digital speech 
   codecs. It recommends for most applications the Absolute Category 
   Rating (ACR) method using the Listening Quality scale. The process 
   of judging the quality of a speech codec consists of five steps, 
   which are described in the following. 

   Step 1: Preparation of Source Speech Materials Including Recording 
   of Talkers. When testing a narrow band codec, the recommendation 
   suggests to use a bandwidth filter before applying sample items to a 
   codec. This bandwidth filter is called modified Intermediate 
   Reference System (IRS) and limits the frequency band to the range 
 
 
Hoene                    Expires June 3, 2011                 [Page 22] 

Internet-Draft              Codec Quality                 December 2010 
    

   between 300 and 3400 Hz. In addition, the recommendation states that 
   "if a wideband system (100-7000 Hz) is to be used for audio-
   conferencing, then the sending end should conform to IEC Publication 
   581.7." 

   It also says that "speech material should consist of simple, short, 
   meaningful sentences." The sentences shall be understandable to a 
   broad audience and sample items should consist of two or three 
   sentences, each of them having a duration of between 2 and 3 
   seconds. Sample items should not contain noise or reverberations 
   longer than 500 ms. The recommendation also makes suggestions on the 
   loudness of the signal: "A typical nominal value for mean active 
   speech level (measured according to Recommendation P.56) is         
   -20 dBm0, corresponding to approximately -26 dBov" 

   Step 2: Selection of Experimental Parameters to Exercise the 
   Features of the Codec That Are of Interest. Various parameters shall 
   be tested. Those include 

   o  Codec Conditions 

       o  Speech input levels ("input levels of 14, 26 and 38 dB below 
          the overload point of the codec") 

       o  Listening levels ("levels should lie 10 dB to either side of 
          the preferred listening level") 

       o  Talkers 

            . Different talkers ("a minimum of two male and two female 
               talkers") 

            . Multiple talkers ("multiple simultaneous voice input 
               signals") 

       o  Errors ("randomly distributed bit errors" or burst-errors) 

       o  Bitrates ("The codec must be tested at all the bit rates") 

       o  Transcodings ("Asynchronous tandeming", "Synchronous 
          tandeming", and "Interoperability with other speech coding 
          standards") 

       o  Mismatch (sender and receiver operate in different modes) 

       o  Environmental noise (sending) ("30 dB for room noise" and "10 
          dB and 20 dB for vehicular noise") 
 
 
Hoene                    Expires June 3, 2011                 [Page 23] 

Internet-Draft              Codec Quality                 December 2010 
    

       o  Network information signals ("signaling tones, conforming to 
          Recommendation Q.35, should be tested subjectively, and the 
          minimum should be proceed to dial tone,  called subscriber 
          ringing tone, called subscriber engaged tone, equipment 
          engaged tone, [and] number unobtainable tone.") 

       o  Music ("to ensure that the music is of reasonable quality") 

   o  Reference conditions ("for making meaningful comparisons") 

       o  Direct (no coding, only input and output filtering) 

       o  Modulated Noise Reference Unit (MNRU) 

       o  Signal-to-Noise Ratio (SNR) (for comparison purposes) 

       o  Reference codecs 

   Step 3: Design of the Experiment. The considerations described in 
   B.3/P.80 apply here. Typically, it is not possible to test each 
   combination of parameters. Thus, recommendation P.830 states that 
   "it is recommended that a minimum set of experiments be conducted, 
   which, although they would not cover every combination, would result 
   in sufficient data to make sensible decisions. [...] Extreme caution 
   should be used when comparing systems with widely differing 
   degradations, e.g. digital codecs, frequency division multiplex 
   systems, vocoders, etc., even within the same test." 

   Step 4: Selection of a Test Procedure and Conduct of the Experiment. 
   Here, the considerations as in B.4/P.80 apply. However, a modified 
   IRS at the receiver shall be used (narrow band) or an IEC 
   Publication 581.7 filter (wideband). Also, "Gaussian noise 
   equivalent to -68 dBmp should be added at the input to the receiving 
   system to reduce noise contrast effects at the onset of speech 
   utterances." 

   Step 5: Analysis of Results. Again, the considerations detailed in 
   B.4.7/P.80 apply. The arithmetic mean (over subjects) is to be 
   calculated for each condition at each listening level. 

7.2. Testing procedure for the ITU-T G.719 

   Recently, the ITU-T has standardized the audio and speech codec ITU-
   T G.719. The G.719 has similar properties as the anticipated IIAC, 
   thus the optimization and characterization of the G.719 is of 
   particular interest. 

 
Hoene                    Expires June 3, 2011                 [Page 24] 

Internet-Draft              Codec Quality                 December 2010 
    

   In the following, we will describe the "Quality Assessment Test 
   Plan" in TD 322 and 323 [33][35]. The ITU Study Group 16 used ITU-R 
   BS.1116 to tests sample items. Audio sample items were sampled at 48 
   kHz mixed down to mono. Speech sample items contain one sentence 
   with a duration of 4 s, mixed content had a duration of 5-6 s and 
   music a duration of between 10 and 15 s. The beginning and ending of 
   the samples were smoothed. Also, a filter was applied to limit the 
   nominal bandwidth of the input signal to the range of 20 to 20000 
   Hz. As for the mixed content, advertisements, film trailers and news 
   (including a jingle) have been selected. For music items, classical 
   and modern styles of music have been selected. Besides the codec 
   under test, test stimuli degraded with LAMP MP3 and G722 were added 
   to the tests. Some test stimuli have been modified to include 
   reverberations or an interfering talker and office noise. Some tests 
   were done studying the effect of a frame erasure rate of 3% having 
   random loss patterns. All listening labs used different sample items 
   and attention paid to not use the same material twice. 

   Listening labs were required to provide the results of 24 
   experienced listeners excluding those listeners, who did not passed 
   a pre- and post-screening. The experienced listeners should "neither 
   have a background in technical implementations of the equipment 
   under test nor do they have detailed knowledge of the influence of 
   these implementations on subjective quality". 

   During the tests, "circum aural headphones - open back for example: 
   STAX Signature SR-404 or Sennheiser HD-600) on both ears (diotic 
   presentation)" were used. The listening levels were -26 dB relative 
   to OVL. 

   Some results of the listening tests are given in TD 341 R1 [34]. In 
   those tests, they also compared the subjective ratings that were 
   made following BS.1116 with the objective ratings of ITU-R BS.1387-
   1. The correlation between objective and subjective ratings was 
   below R=0.9. 

8. Transmission Channel 

   Between speech encoder and decoder lies a transmission channel that 
   effects the transmission. For cellular or wireless phones, the 
   typical transmission channel is assumed to be equal to the wireless 
   link(s). This typically means, that a circuit switch link is assumed 
   (e.g., in GSM, UMTS, DECT). The bandwidth is typically constant in 
   DECT and GSM or variable in a given range depending on the quality 
   of the wireless transmission (UMTS). Bit errors do occur but they 
   don't be equally distributed if unequal bit error correction is 
   applied (UMTS). 
 
 
Hoene                    Expires June 3, 2011                 [Page 25] 

Internet-Draft              Codec Quality                 December 2010 
    

   In the case of the IIAC codec, the transmission channel is the 
   internet. More precisely, it is the packet transmission over the 
   Internet, plus the transport protocol (e.g. UDP, TCP, DCCP), plus 
   potentially Forward Error Correction, and plus dejittering buffers. 

   Also, the transmission channel is reactive. It changes its 
   properties depending on how much data is transmitted. For example, 
   parallel TCP flows reduce their transmission bandwidth in the 
   presence of an unresponsive UDP stream. 

   Overall, one can say that the transmission channel "Internet" is 
   difficult to understand. Thus, in this chapter, we try to shed light 
   on the question of what types of transmission channels a codec has 
   to cope with. 

8.1. ITU-T G.1050: Network Model for Evaluating Multimedia Transmission 
   Performance over IP (11/2007)  

   The current ITU-T G.1050 standard [20] describes layer 3 packet 
   transmission models that can be used to evaluate IP applications. 
   The models are of statistical nature. They consider networks 
   architectures, types of access links, QoS controlled edge routing, 
   MTU size, networks faults, link failures, route flapping, reordered 
   packets, packet loss, one-way delay, variable deploys and background 
   traffics.  

   G.1050 is a network model consisting of three parts, LAN a, LAN b, 
   and an interconnection core. Both LANs can have different rates and 
   occupancy and can be of different types. LAN and core are connected 
   via access technologies, which might vary in data rate, occupancy 
   and MTU size.  

   The core is characterized by route flapping, link failures, one-way 
   delay, jitter, packet loss and reordered packets. Route flaps are 
   repeatedly changed in a transmission path because of alternating 
   routing tables. These routing updates cause incremental changes in 
   the transmission delays. A link failure is a period of consecutive 
   packet loss. Packet losses can be bursty having a high loss rate 
   during bursts and having otherwise a lower loss rate otherwise. 
   Delays are modeled via multiple different jitter models supporting 
   delay spikes, random jitter and filtered random jitters. 

   The standard recommends three profiles, named "Well-managed IP 
   network", "Partially-managed IP network", and "Unmanaged IP Network, 
   Internet", which differ in their connection qualities. 


Hoene                    Expires June 3, 2011                 [Page 26] 

Internet-Draft              Codec Quality                 December 2010 
    

   Limitations to these models are the missing cross-correlation 
   between packet delays and packet loss events, the lack of 
   responsiveness to the tests application flow, and the lack of link 
   qualities that vary with time. 

8.2. Draft G.1050 / TIA-921B   

   Currently, an enhancement to ITU-T G.1050 (11/2007) is being 
   developed (e.g. [13])). It does not use a statistical model but 
   takes advantage of the NS/2 simulator. Thus, most of the above 
   mentioned limitations have been overcome.  

   Despite that, even the new model does not yet give an answer to the 
   question of which distributions of typical Internet connection 
   qualities can be expected. 

8.3. Delay and Throughput Distributions on the Global Internet 

   In general, it is not precisely known how the qualities of end-to-
   end connections are distributed. It is also unclear whether the 
   anticipated IIAC Codec will be used globally or whether its area of 
   usage will be somehow restricted. 

   Despite the fact, that the codec has to be optimized for an unknown 
   Internet, the following scientific publications give an estimate on 
   how different Internet end-to-end paths might behave. One recent 
   example is on studies about the residential broadband Internet 
   access traffic of a major European ISP [37]. 


Hoene                    Expires June 3, 2011                 [Page 27] 

Internet-Draft              Codec Quality                 December 2010 
    

         +------------------------------------------------------------+ 
   p 0.6-+                                                            | 
   r     |  e eDonkey                                                 | 
   o     |                           ee                               | 
   b     |  H HTTP                   e e                              | 
   a     |                          ee e                              | 
   b     |                          e  e                              | 
   i 0.4-+                          e   e                             | 
   l     |                          e   e                             | 
   i     |                         e    e                             | 
   t     |                         e    e     HHHH                    | 
   y     |                         e     e  HHHHHHHHH                 | 
         |                        ee     e HH       HH                | 
   d 0.2-+                        e      eHH         HH               | 
   e     |                        e       H           HH              | 
   n     |                       ee      He            HH             | 
   s     |            ee         e      HH e            HH            | 
   i     |           e  ee      e     HH    e            HHH          | 
   t     |         ee     eeeeee HHHHHH      eeee          HHH        | 
   y 0.0-+ eHeHeHeHHHHHHHHHHHHHHH                eeeeeeeeeeeeeHHHHHHH | 
         +----+---------+---------+--------+---------+---------+------+ 
              |         |         |        |         |         | 
             0.1       1.0       10       100      1000      10000 
    
                              Throughput [kbps] 

    Figure 2 Achieved throughput of flows measured for eDonkey and HTTP 
                             applications [37]  

   Figure 2 displays the throughput distribution of TCP connections for 
   eDonkey peer-to-peer and HTTP applications. It only considers single 
   flow with a length of more than 50 Kbyte. But typically, a web 
   browser uses two to three TCP connections at the same time and an 
   eDonkey client about 10. Still, the throughput of a single HTTP flow 
   is in about an order faster than the of eDonkey flow. In [37], the 
   authors assume this is due to the fact that peer-to-peer connections 
   fill the uplink and that HTTP is used at the faster downlink.  

    
Hoene                    Expires June 3, 2011                 [Page 28] 

Internet-Draft              Codec Quality                 December 2010 
    

         +------------------------------------------------------------+ 
         |                                                            | 
         |                    **                                      | 
   p 0.8-+                    **                                      | 
   r     |                    ***                                     | 
   o     |                    * *                                     | 
   b     |                   ** *                                     | 
   a 0.6-+                   *  *                                     | 
   b     |                   *  **                                    |          
   i     |                   *   *                                    | 
   l     |                   *   *                                    | 
   i     |                   *   *                                    | 
   t 0.4-+                  **   **                                   | 
   y     |                  *     *                                   | 
         |                  *      * ****                             | 
   d     |                  *       *    *                            | 
   e 0.2-+                  *            **                           | 
   n     |                 **             **                          | 
   s     |            **** *               ***                        | 
   i     |          ***  ***                 ***                      | 
   t     |        ***                          **************         | 
   y 0.0-+*********                                  *****************|          
         +-------+-----------------+----------------+-----------------+ 
                 |                 |                |                 | 
                10                100             1000            10000 

                                        RTT [ms] 

                     Figure 3 TCP roundtrip times [36] 

   Figure 3 displays TCP roundtrip times including both access and 
   backbone network. Both graphs can be seen as an indication for the 
   assumption that an application, even in modern Internet access 
   networks, might be subjected to a wide variability of throughput 
   ranging from a few kbits/s up to 10 Gbit/s and TCP round trip times 
   from 5ms up to one of several seconds.  

   Albeit these results are only valid for TCP, similar results should 
   be expected for RTP over UDP - with a small advantage because UDP 
   flows are not always responsive.  

   As a summary, a codec for the Internet should be able to work under 
   these widely varying transmission conditions and should be tested 
   against a wide distribution of expected throughputs.  


Hoene                    Expires June 3, 2011                 [Page 29] 

Internet-Draft              Codec Quality                 December 2010 
    

8.4. Transmission Variability on the Internet 

   Besides effects such as route flapping or link failures modeled in 
   G.1050 [20], the Internet experience in short-time scales sharp 
   changes sharply in bandwidth utilization. For example, [49] and [38] 
   showed that variability of Internet traffic comes in form of spike 
   like traffic increments. Similarly, [32] studied why the Internet is 
   bursty in time scales of between 100 and to 1000 milliseconds. 

   In the light of these results, one can assume that the IIAC's 
   transmission conditions will vary in similar time scales. More 
   precisely, it will be subjected to 

   .  variability due to bursty traffic having a duration of between 
      100 and 1000 milliseconds, 

   .  interruptions due to temporal link failures every minute to every 
      hour that might have a temporal interruption from 64 ms to 
      several seconds [20], and 

   .  route flap events every minute to every hour that have a delay of 
      between 2 and 128 ms [20].  

8.5. The Effects of Transport Protocols 

   Realtime multimedia is not always transported over RTP and UDP. 
   Sometimes it makes sense to use a different transport protocol or an 
   additional rate adaptation. The reasons for that are manifold. 

   .  If a scalable codec shall be supported, RTCP-based feedback 
      information can be utilized to implement a rate control 
      mechanisms [41]. However, RTCP-based feedback suffers from the 
      drawback that RTCP messages are allowed only every 5 s. Thus, 
      implementing a fast responding mechanism is not possible. 

   .  In the presence of restricted firewalls, VoIP can sometimes only 
      be transmitted over TCP. In those cases, the transmission 
      scheduling is not given by the codec but by TCP. TCP algorithms 
      typically don't have a smooth sending rate but frequently send 
      packets in bursts and change the amount of packets sent every 
      round trip time (Figure 4). More precisely, TCP causes the 
      sending schedule to behave in the following way:  

       .  During the Slow Start phase (for example at the beginning of 
          a TCP connection) the transmission rate increases 
          exponentially. 

 
Hoene                    Expires June 3, 2011                 [Page 30] 

Internet-Draft              Codec Quality                 December 2010 
    

       .  If a TCP segment is not acknowledged after about four RTTs, 
          the TCP sending rate starts at one packet per RTT again. 

       .  During congestion avoidance, the sending rate increases 
          steadily by one segment per RTT. 

       .  If a congestion event is then detected, the sending rate is 
          reduced by 50%. 

   p 15-+-------------------------------------------------------------+ 
   a    |                                                             | 
   c    |             **                  **              **          | 
   k    |           ** *                ** *            ** *          | 
   e    |         **   *              **   *          **   *          | 
   t    |       **     *            **     *        **     *        **| 
   s    |     **       *          **       *      **       *      **  | 
      8-+   **         *        **         *    **         *    **    | 
   p    |   *          *      **           *  **           *  **      | 
   e    |   *          *      *            ***             ***        | 
   r    |   *          *      *                                       | 
      4-+  *           *     *                                        | 
   R    |  *           *     *                                        | 
   T  2-+ *            *    *                                         | 
   T  1-+*             *   *                                          | 
        +---------+---------+---------+---------+---------+---------+-+ 
        |         |         |         |         |         |         | 
        0        10        20        30        40        50        60 
          
                      time in round- trip times (RTT) 

             Figure 4 Sending rate of a standard TCP over time  


Hoene                    Expires June 3, 2011                 [Page 31] 

Internet-Draft              Codec Quality                 December 2010 
    

   .  The DCCP transport protocol supports multiple congestion control 
      protocols and gives means to support TCP friendliness without 
      retransmission. Thus, it is suitable for real time multimedia 
      transmissions. DCCP supports a TCP emulation, which shows a 
      similar rate over time as TCP, and the TFRC congestion control, 
      which changes its rate in a smoother way (Figure 5).  
      Besides TFRC, which is intended to transmit packets of maximal 
      size (aka MTU), TFRC-SP is optimized for flows with variable 
      packet sizes such as VoIP. With TFRC-SP, smaller packets can be 
      transmitted at a faster pace than it is the case for larger 
      packets because they contribute less to the gross bandwidth 
      consumption.  
      The TFRC protocol might provide a lower bandwidth and a lower QoE 
      as UDP or TCP, unless if not proper optimizations are taken (see 
      [48]). Also, it is suggested to limit the rate control to 100 
      packets per second. This limit might be too low for an IIAC.  

    
   p 15-+-------------------------------------------------------------+ 
   a    |                                                             | 
   c    |             **                  **                  **      | 
   k    |           **  **              **  **              **  **    | 
   e    |         **      **          **      **          **      **  | 
   t    |       **          **      **          **      **          **| 
   s    |     **              **  **              **  **              | 
      8-+   **                  **                  **                | 
   p    |   *                                                         | 
   e    |   *                                                         | 
   r    |   *                                                         | 
      4-+  *                                                          | 
   R    |  *                                                          | 
   T  2-+ *                                                           | 
   T  1-+*                                                            | 
        +---------+---------+---------+---------+---------+---------+-+ 
        |         |         |         |         |         |         | 
        0        10        20        30        40        50        60 
         
                      time in round- trip times (RTT) 

                Figure 5 Sending rate of the TFRC protocol  

   In general, the transport protocol has a clear influence on the 
   transmission conditions. Coding rates need to be adapted by sharply 
   and smoothly to changed bandwidth estimations. Changes of the 
   bandwidth estimation may occur every RTT. Also, in cases of a TCP 
   timeout, the transmission is halted and the decoding must be 
   stalled.             
 
 
Hoene                    Expires June 3, 2011                 [Page 32] 

Internet-Draft              Codec Quality                 December 2010 
    

8.6. The Effect of Jitter Buffers and FEC 

   Both jitter buffers trade frame losses against delay. In cases of a 
   jitter buffer, frames are delayed before playout. This helps in 
   cases of lately arriving frames that would otherwise be ignored and 
   would have to be concealed. Jitter buffers are adaptive and are 
   changing dynamically to the current loss process on the Internet. 

   Forward Error Correction helps to cope with isolated losses as 
   redundant speech frames are transmitted in the following packets. In 
   the presence of loss, FEC increases the delay because the receiver 
   has to wait for the following packets. Both delay and packet losses 
   are important contributors to the overall Quality of Experience [2].   

   Since the delay process on the Internet often comes in the form of a 
   gamma distribution, thus a statistical monitor of past delays helps 
   to predict the size of future jitter. Then, if the playout schedule 
   does not match the predicted loss process, playout can be 
   accelerated or slowed down. 

   However, due to the reasons described in Section 8.4 not all 
   increments in transmission time might be predictable. This has a 
   profound effect on the jitter buffer as it actually cannot predict 
   well, whether a frame is lost or whether it is going to be delayed. 
   If a frame is scheduled for playout but has not been received, the 
   jitter buffer has to consider two cases. First, the frame is lost 
   and has to be concealed. This typically means that the audio signal 
   needs to be extrapolated or interpolated to conceal the gap due to a 
   lost frame. Second, the frame is delayed and shall be played out at 
   a later point in time. Then, the resulting gap in playout must be 
   concealed by extrapolating the previous audio signal. 

   These issues have an effect on testing the concealment algorithm of 
   the codec. The same concealment function must be tested against time 
   gap concealment and loss concealment. 

8.7. Discussion 

   Judging a codec performance using a realistic model of a 
   transmission channel is difficult. Good models of IP transmission 
   channels are available. However, before a codec can be tested 
   against those channels, further building blocks such as the 
   transport protocol, the jitter buffer, and FEC should be known - at 
   least roughly. 

   Alternatively, a codec can be tested only against of packet loss 
   patterns only without considering any rate adaption or playout 
 
 
Hoene                    Expires June 3, 2011                 [Page 33] 

Internet-Draft              Codec Quality                 December 2010 
    

   rescheduling. But then again, the codec should be additionally 
   tested for those impairments, which occur due to the dynamics of the 
   Internet. These include 

   o  slowing down and speeding up the playout in cases of moderate 
      rescheduling of playout times, 

   o  stalling and resuming the playout in cases of temporal link 
      outages, 

   o  moderately reducing and increasing bit and frame rates during 
      contention periods, and  

   o  sharply reducing (in case of congestion) and fast increasing 
      (during connection establishment) of bit and frame rates. 

   o  Time gap and loss concealment. 

   o  Speeding up and slowing down the playout speed. 

9. Usage Scenarios 

   Quality of Experience is the service quality perceived subjectively 
   by end-users (refer to Section 2) and as ITU-T document G.RQAM [21] 
   states "overall acceptability may be influenced by user expectations 
   and context". Thus, in this section we describe the usage scenarios, 
   in which the IIAC codec will probably be used, and the expectations 
   users have in those communication contexts. We list seven main 
   scenarios and describe their quality requirements. 

9.1. Point-to-point Calls (VoIP) 

   The classic scenario is that of the phone usage to which we will 
   refer in this document as Voice over IP (VoIP). Human speech is 
   transmitted interactively between two Internet hosts. Typically, 
   besides speech some background noise is present, too. 

   The quality of a telephone call is traditionally judged by 
   subjective tests such as those described in [24]. The ACR scale used 
   in MOS-LQS sometimes might not be very suitable for high quality 
   calls, then - for example - the MUSHRA [16] rating can be applied.  

   A telephone call is considered good if it has a maximal mouth-to-ear 
   delay of 150 ms [17] and a speech quality of MOS-LQS 4 or above. 
   However, interhuman communication is still possible if the mounth-
   to-ear delay is much larger.  

 
Hoene                    Expires June 3, 2011                 [Page 34] 

Internet-Draft              Codec Quality                 December 2010 
    

   The effect of delay jitter might not be very well notable in case of 
   speech. Thus, playout rescheduling can happen often take place. 

   In many cases, phone calls are made between mobile devices such as 
   mobile phones and cellular phone. In these cases, energy consumption 
   is crucial and both complexity and transmission rate may be reduced 
   to save resources. 

9.2. High Quality Interactive Audio Transmissions (AoIP) 

   In this scenario we consider a telephone call having a very good 
   audio quality at modest acoustic one-way latencies ranging from 50 
   and 150 ms [17], so that music can be listened to over the telephone 
   while two persons are talking interactively.  

   While delay expectations might be similar to those of classic 
   telephony, the audio quality must meet similar standards as those of 
   consumer Hifi equipment like MP3 and CD players, iPods, etc. 

   If music is played, playout rescheduling events may be heard easily 
   be heard as the rhythm changes. Only a few studies such as [10] have 
   been made to examine the effect of time varying delays on service 
   quality. In general, it can be assumed that the requirements 
   regarding constancies of playout schedules are higher than in case 
   of speech because human beings can notice rhythmic changes easily. 
   Thus, in the presence of music, frequent playout rescheduling shall 
   be avoided. 

9.3. High Quality Teleconferencing 

   Also, for today's teleconferencing and videoconferencing systems 
   there is a strong and increasing demand for audio coding providing 
   the full human auditory bandwidth of 20 Hz to 20 kHz. This rising 
   demand for high quality audio is due to the following reasons:  

   o  Conferencing systems are increasingly used for more elaborated 
      presentations, often including music and sound effects which 
      occupy a wider audio bandwidth than that of speech. For example, 
      Web conferences such as WebEx, GoToMeeting, Adobe Acrobat Connect 
      are based on an IP based transmission. 

   o  The new "Telepresence" video conferencing systems, providing the 
      user with High Definition video and audio quality, create the 
      experience of being in the same room by introducing high quality 
      media delivery (such as from Cisco).  


Hoene                    Expires June 3, 2011                 [Page 35] 

Internet-Draft              Codec Quality                 December 2010 
    

   o  The emerging Digital Living Rooms are to be interconnected and 
      might require a constant high quality acoustic transmission at 
      high qualities. 

   o  Spatial audio teleconference solutions increase the quality 
      because they take advantage of the cocktail-party effect. By 
      taking advantage of 3D audio, participants can be identified by 
      their location in a virtual acoustic environment and multiple 
      talkers can be distinguished from each other. However, these 
      systems require stereo audio, if the spatial audio is rendered 
      for headphones. 

9.4. Interconnecting to Legacy PSTN and VoIP (Convergence) 

   This scenario does not include the use case of using a VoIP-PSTN 
   gateway to connect to legacy telephone systems. In those cases, the 
   gateway would make an audio conversion from broadband Internet voice 
   to the frugal 1930's 3.1 kHz audio bandwidth.  

   The quality requirements in this scenario are low because legacy 
   PSTN typically uses narrow-band voice. Also, in those cases one 
   might expect the codec negotiation might decide on a common codec 
   both for PSTN and VoIP in order to avoid transcoding. 

   However, the complexity requirements might be stringent because 
   central media gateways must scale to a high number of users. In this 
   context, hardware costs are an important criterion and the codec has 
   to operate efficient. 

9.5. Music streaming 

   Music streaming typically does not require low delays. However, in 
   special cases such as live events and in the presence of alternative 
   transmission technologies, low-delay streaming may be demanded.  

   Examples are important sport events, which are streamed both on 
   terrestrial, (analogue) and low delay broadcast networks and on IP- 
   based distribution networks. The latter ones becomes aware (such as 
   when a footballer scores) more lately than the ones their neighbors 
   using terrestrial technology.  

9.6. Ensemble Performances over a Network  

   In some usage scenarios, users want to act simultaneously and not 
   just interactively. For example, if persons sing in a chorus, if 
   musicians jam, or if e-sportsmen play computer games in a team 
   together they need to communicate acoustically.  
 
 
Hoene                    Expires June 3, 2011                 [Page 36] 

Internet-Draft              Codec Quality                 December 2010 
    

   In this scenario, the latency requirements are much harder than for 
   interactive usages. For example, if two musicians are placed more 
   than 10 meters apart, they can hardly stay synchronized. Empirical 
   studies [10] have shown that if ensembles play over networks, the 
   optimal acoustic latency is at around 11.5 ms with a targeted range 
   from 10 to 25 ms.  

   Also, the users demand very high audio quality, very low delay and 
   very few events of playout rescheduling. 

9.7. Push-to-talk like Services (PTT) 

   In spite of the development of broadband access (xDSL), a lot of 
   users do only have service access via PSTN modems or mobile links. 
   Also, on these links the available bandwidth might be shared among 
   multiple flows and is subjected to congestion. Then, even low coding 
   rates of about 8 kbps are too high. 

   If transmission capacity hardly exists, one can still degrade the 
   quality of a telephone call to something like a push-to-talk (PTT) 
   like service having very high latencies. Technically, this scenario 
   takes advantage of bandwidth gains due to disruptive transmission 
   (DTX) modes and very large packets containing multiple speech frames 
   causing a very low packetization overhead. 

   The quality requirements of a push-to-talk like service have hardly 
   been studied. The OMA lists as a requirement of a Push-to-talk over 
   cellular service a transmission delay of 1.6 s and a MOS values of 
   above 3.0 that typically should be kept [39]. However, as long as an 
   understandable transmission of speech is possible, the delay can be 
   even higher. For example, [39] allows a delay of typically up to 4 s 
   for the first talk-burst. Also, [39] describes a maximum duration of 
   speaking. If a participant speaking reaches the time limit, the 
   participant's righttospeak shall be automatically revoked. 

   If the quality of a telephone call is very low, then instead of 
   listening-only speech quality the degree of understandability can be 
   chosen as performance metric. For example, objective tests of the 
   understandability use automatic speech recognition (ASR) systems and 
   measure the amount of correctly detected words. 

   In any case, the participant shall be informed about the quality of 
   connection, the presence of high delays, the half-duplex style of 
   communication, and its (limited) righttospeak. For example this can 
   be achieved by a simulated talker echo. 


Hoene                    Expires June 3, 2011                 [Page 37] 

Internet-Draft              Codec Quality                 December 2010 
    

9.8. Discussion 

   The requirements of the usage scenarios are summarized in the 
   following table. 

                |     Sound Quality  |      Latency       | Complexity 
     Scenario   | low  | avg. | hifi | 10ms | 150ms| high | low  | high  
   -------------+------+------+------+------+------+------+------+----- 
   VoIP         |   X  |      |      |      |   X  |      |   X  |   X 
   AoIP         |      |   X  |   X  |      |   X  |      |      |   X 
   Conference   |      |   X  |      |      |   X  |      |      |   X 
   Convergence  |   X  |      |      |      |   X  |      |   X  |   X 
   Streaming    |      |   X  |   X  |      |      |   X  |      |   X 
   Performances |      |      |   X  |   X  |      |      |      |   X 
   Push-To-Talk |   X  |      |      |      |      |   X  |   X  |   X 

       Figure 6 Different requirements for different usage scenarios 

10. Recommendations for Testing the IIAC 

   The IETF IIAC differs substantially from a classic narrow and 
   wideband codec. Thus, the previously applied codec testing 
   procedures such as ITU P.830 cannot be entirely adopted. Instead, 
   one must check carefully, which of the procedures are used without 
   changes, which procedures are used with minor changes and which 
   procedures are dropped or replaced. 

   In Section 1 we listed five groups of stakeholders, which have 
   different requirements and demands on how to test the quality of an 
   IIAC. In the following, we recommend testing procedures for those 
   stakeholders.  

10.1. During Codec Development 

   The codec development is an innovative process. In general, 
   innovation and research in general benefits from openness and 
   discussion between experts. Thus, format restrictions on how to test 
   the codec might hinder the codec development because innovation may 
   also take place in testing procedures. Instead, many experts both in 
   codec development and codec usage shall be able to participate. If 
   this is the case, they contribute with their expertise, identify 
   weaknesses, and discuss potential codec enhancements. During 
   innovation, openness in participation and discussion is very 
   fruitful and leads to good results. 

   Based on the ongoing experience, codec developers know best on how 
   to tests their codecs. Typically, those tests include informal 
 
 
Hoene                    Expires June 3, 2011                 [Page 38] 

Internet-Draft              Codec Quality                 December 2010 
    

   testing, semiformal testing, and expert interviews. They are 
   intended to find weaknesses in the codec, to identify artifacts or 
   distortions, and to achieve algorithmic progress. 

10.2. Characterization Phase 

   The characterization phase is intended to study the features, the 
   quality tradeoff and the properties of a codec under 
   standardization. It is intended to be an objective measure of the 
   codec's quality to convince third parties of the quality properties 
   of the standardized codec. In order to achieve this aim, a formal 
   testing procedure has to be established.  

   In general, we recommend to base the procedure of the 
   characterization phase on procedures that are similar to those that 
   were used for the G.719 standardization (Section 7.2 and especially 
   [35]). In the following, we describe the suggested testing procedure 
   in the characterization phase. 

10.2.1. Methodology 

   The testing of sound quality can be done using the MUSHRA tests  
   with eight samples and three anchors. One anchor is the known 
   reference, the second one is a hidden reference, and the third one 
   the hidden anchor. It is suggested to use a bandwidth filtered 
   signal with at low-pass filter at 3.5 kHz. However, because a will 
   range of qualities are to be tested ranging from Hifi down to toll 
   quality, it is beneficial to add a further low quality anchor such 
   as a 3.5 kHz bandwidth sample distorted by modulated noise (MNRU) 
   [25], for example with MNRU of a strength of Q=25 dB that 
   corresponds to a MOS value of 1.79 [4]. 

10.2.2. Material 

   Reference samples should be 48 kHz sampled, stereo channel material. 
   The nominal bandwidth of the reference samples shall be limited to 
   the range of 20 to 20000 Hz. Three different kinds of contents shall 
   be tested: speech, music and mixed content.  

   Speech samples shall include different languages including English 
   and tonal languages. The speech samples shall be recorded in a quiet 
   environment without background noise or reverberations. The speech 
   samples shall contain one meaningful sentence having a length of 
   about 4 s.  

   Music samples shall contain a wide variety of music styles including 
   classical music, pop, jazz, and single instruments. The length of 
 
 
Hoene                    Expires June 3, 2011                 [Page 39] 

Internet-Draft              Codec Quality                 December 2010 
    

   samples shall be of between 10 and 15 s. A smoothing of 100 ms both 
   at the beginning and at the end shall be conducted, if required. 

   Mixed content may contain advertisements, film trailers, news with 
   jingles and other mixtures of speech, music and noises. The length 
   may be at about 5-6 s. 

10.2.3. Listening Laboratory 

   Multiple independent laboratories shall conduct the listening tests. 
   They are responsible for generating or selecting reference samples 
   as well as for the pre and post screening of subjects. In the end, 
   the results of about 24 experienced listeners shall be published (in 
   addition to the samples).  

   The tests must be conducted in a quiet listening environment at 
   about NC25 (approximate 35 dBA). For example, an ISOBOOTH room can 
   be used. 

   It is recommended to use a high quality D/A, such as Benchmark DAC, 
   Metric Halo ULN-2, Apogee MiniDAC. High quality headphone amplifiers 
   and playback level calibration shall be used. Playback levels might 
   be measured via Etymotic in-ear microphones. Also, high quality 
   headphones (e.g. AKG 240DF, Sennheiser HD600) are advisable. 

10.2.4. Degradation Factors 

   The IIAC is likely to be highly configurable. However, due to time 
   limits, only a few parameter sets can be tested subjectively. Thus, 
   we recommend to do subjective studies with  

   o  different bit rates (from low to high, 5 tests) 

   o  different frame rates (from low to high, 2 tests) 

   o  different loss pattern (G.1050 profile A, B, and C at low rate 
      with speech content and at high rate with music content. The 
      influence of jitter, delay, and link failures shall be ignored. 
      In total, this would be 6 tests) 

   o  different sample contents 

       o Speech, speech+reverberations, and speech+noise+reverberations 
          at low and medium rates (3 tests).  

       o The speech sample must be tested in different languages 
          (English, Chinese, ...) and with male/female voices (6 tests) 
 
 
Hoene                    Expires June 3, 2011                 [Page 40] 

Internet-Draft              Codec Quality                 December 2010 
    

       o Mixed content and music shall be tested at medium and high 
          rates (about 10 tests). 

   o  A low complexity mode, DTX and the FEC mode shall be tested at 
      low rates because they are typically used on constraint devices 
      (3 tests) 

   o  Abrupt changes in bit and frame rates (reduction by half, 
      exponential start, 2 tests) 

   o  Smooth changes of bit and frame rates (incrementing or degreasing 
      the codec's gross rate by 1.5 kbyte every 100ms, 2 tests)  

   o  Stall and continue operations (20, 200, and 1000 ms, 3 tests) 

   o  Accelerated and slowed down playout (+- 10% for speech at low 
      rates) 

   o  Reference codecs such as LAME MP3, G.719, and AMR each at two 
      coding rate (6 tests) 

   Already, these are 48 different tests that need to be conducted. 

   In addition, for intermediate values objective tests shall be run 
   using PEAQ (for music) and P.OLQA (for speech). The intermediate 
   results shall be mapped on the MUSHRA scale with a quadratic 
   regression because PEAQ and P.OLGA are using an ODG and MOS scale 
   respectively.  

10.3. Application Developers 

   Application developers can take advantage of the results of the 
   qualification phase. They may use the results to develop a quality 
   model, which describes the expected quality of the codec at a given 
   parameter set (refer to [11] for an example). 

   In addition, they can test their system using the draft G.1050 
   simulation model, which is especially useful for optimizing rate 
   control, dejittering buffers and concealment algorithms. Different 
   systems may be tested with quality models, subjective listening 
   tests, conversational listening tests, or with objective measures 
   such as POLQA. 

   Also, field tests may be conducted to test the effect of a real 
   network on the VoIP application. 


Hoene                    Expires June 3, 2011                 [Page 41] 

Internet-Draft              Codec Quality                 December 2010 
    

10.4. Codec Implementers 

   To tests the conformance of a codec, codec implementers can use 
   objective tools like PEAQ or P.OLQA to see, whether the newly 
   implemented codec performs in a way that is similar to the 
   performance of the reference implementation. These tests shall be 
   done for many different parameter sets. 

10.5. End Users 

   End user may be included in the qualification tests. The intentions 
   of these tests are two-fold. First, the awareness of the end-user 
   shall be increased. Second, querying users may be a cost effective 
   way of conducting listening-only tests.  

   However, before the rating results of end users can be considered 
   for further usage, one need to compare between formal and web-based 
   testing results to see, to what extent they differ from each other. 

11. Security Considerations 

   The results of the quality tests shall be convincing. Thus, special 
   care has to be taken to make the tests precise, accurate, repeatable 
   and trustworthy.  

   Some testing houses may have a conflict of interest between accurate 
   quality ratings and promotion of own codecs. Thus, a high degree of 
   openness shall be enforced that requires all of the testing material 
   and results to be published. This way, others may verify the results 
   of testing houses. In addition, some stimuli shall be tested by all 
   the testing houses to compare their quality of rating.  

   Moreover, hidden anchors may help to identify subjects, which rate 
   the quality of samples less precisely. 

12. IANA Considerations 

   This document has no actions for IANA. 


Hoene                    Expires June 3, 2011                 [Page 42] 

Internet-Draft              Codec Quality                 December 2010 
    

13. References 

13.1. Normative References 

13.2. Informative References 

   [1]   R. Birke, M. Mellia, M. Petracca, D. Rossi, "Understanding 
         VoIP from Backbone Measurements", IEEE INFOCOM 2007, 26th IEEE 
         International Conference on Computer Communications, pp.2027-
         2035, May 2007. 

   [2]   C. Boutremans, J.-Y. Le Boudec, "Adaptive joint playout buffer 
         and FEC adjustment for Internet telephony," IEEE Societies 
         INFOCOM 2003. Twenty-Second Annual Joint Conference of the 
         IEEE Computer and Communications., vol.1, pp. 652- 662 vol.1, 
         30 March-3 April 2003. 

   [3]   Broadcom, "BCM1103: GIGABIT IP PHONE CHIP", Jan. 2005, 
         http://www.datasheetcatalog.org/datasheet2/3/07ozspx224dsarq6z
         u13i2ofyqyy.pdf 

   [4]   N. Cote, V. Koehl, V. Gautier-Turbin, A. Raake, S. Moeller, 
         "Reference Units for the Comparison of Speech Quality Test 
         Results", Audio Engineering Society Convention 126, May 2009. 

   [5]   Ericsson, "Analysis of PEAQ's applicability in predicting the 
         quality difference between alternative implementations of the 
         G.722.1FB coding algorithm", ITU-T SG12, Received on 2008-05-
         09, Related to question(s) : Q9/12, Meeting 2008-05-22. 

   [6]   ETSI TC-TM, "ETR 250: Transmission and Multiplexing (TM); 
         Speech communication quality from mouth to ear for 3,1 kHz 
         handset telephony across networks", ETSI Technical Report, 
         July 1996. 

   [7]   S. Floyd, E. Kohler, "Profile for Datagram Congestion Control 
         Protocol (DCCP) Congestion ID 4: TCP-Friendly Rate Control for 
         Small Packets (TFRC-SP)", RFC 5622, August 2009. 

   [8]   S. Floyd, E. Kohler, "TCP Friendly Rate Control (TFRC): The 
         Small-Packet (SP) Variant", RFC 4828, April 2007. 

   [9]   J. Gruber, G. Williams, Transmission Performance of Evolving 
         Telecommunications Networks, Artech House, 1992. 


Hoene                    Expires June 3, 2011                 [Page 43] 

Internet-Draft              Codec Quality                 December 2010 
    

   [10]  M. Gurevich, C. Chafe, G. Leslie, S. Tyan, "Simulation of 
         Networked Ensemble Performance with Varying Time Delays: 
         Characterization of Ensemble Accuracy", Proceedings of the 
         2004 International Computer Music Conference, Miami, USA, 
         2004.  

   [11]  C. Hoene, H. Karl, A. Wolisz, "A perceptual quality model 
         intended adaptive VoIP applications", International Journal of 
         Communication Systems, Wiley, August 2005. 

   [12]  J. Holub, J.G. Beerends, R. Smid, "A dependence between 
         average call duration and voice transmission quality: 
         measurement and applications," Wireless Telecommunications 
         Symposium, 2004, pp. 75- 81, May 2004. 

   [13]  ITU, "Incoming LS: Proposed G.1050/TIA-921B IP Network Model 
         Simulation", ITU-T SG 12, Temporary Document 268-GEN, May 12, 
         2010. 

   [14]  ITU, "ITU-R BS.1116-1: Methods for the subjective assessment 
         of small impairments in audio systems including multichannel 
         sound systems", Recommendation, October 1997. 

   [15]  ITU, "ITU-R BS.1387: Method for objective measurements of 
         perceived audio quality", Recommendation, November 2001. 

   [16]  ITU, "ITU-R BS.1534-1: Method for the subjective assessment of 
         intermediate quality levels of coding systems", 
         Recommendation, January 2003. 

   [17]  ITU, "ITU-T G.107: The E-model: a computational model for use 
         in transmission planning", Recommendation, April 2009. 

   [18]  ITU, "ITU-T G.114: One-way transmission time", Recommendation, 
         May 2003. 

   [19]  ITU, "ITU-T G.191: Software tools for speech and audio coding 
         standardization", Recommendation, March 2010. 

   [20]  ITU, "ITU-T G.1050: Network model for evaluating multimedia 
         transmission performance over Internet Protocol", 
         Recommendation, November 2007. 

   [21]  ITU, "ITU-T G.RQAM, "Reference guide to QoE assessment 
         methodologies", standard draft TD 310rev1, May 2010. 


Hoene                    Expires June 3, 2011                 [Page 44] 

Internet-Draft              Codec Quality                 December 2010 
    

   [22]  ITU, "ITU-T P.10/G.100: Vocabulary and effects of transmission 
         parameters on customer opinion of transmission quality", 
         Recommendation, July 2006. 

   [23]  ITU, "ITU-T P.800: Methods for objective and subjective 
         assessment of quality", Recommendation, August 1996. 

   [24]  ITU, "ITU-T P.805: Subjective evaluation of conversational 
         quality", Recommendation, April 2007. 

   [25]  ITU, "ITU-T P.810: Modulated noise reference unit (MNRU)", 
         Recommendation, February 1996. 

   [26]  ITU, "ITU-T P.830: Subjective performance assessment of 
         telephone-band and wideband digital codecs", Recommendation, 
         February 1996. 

   [27]  ITU, "ITU-T P.862: Perceptual evaluation of speech quality 
         (PESQ): An objective method for end-to-end speech quality 
         assessment of narrow-band telephone networks and speech 
         codecs", Recommendation, February 2001. 

   [28]  ITU, "ITU-T P.862.1: Mapping function for transforming P.862 
         raw result scores to MOS-LQO", Recommendation, November 2003. 

   [29]  ITU, "ITU-T P.862.2: Wideband extension to Recommendation 
         P.862 for the assessment of wideband telephone networks and 
         speech codecs", Recommendation, November 2007. 

   [30]  ITU, "ITU-T P.862.3: Application guide for objective quality 
         measurement based on Recommendations P.862, P.862.1 and 
         P.862.2", Recommendation, November 2007. 

   [31]  ITU, "ITU-T P.880: Continuous evaluation of time-varying 
         speech quality", Recommendation, May 2004. 

   [32]  H. Jiang, C. Dovrolis, "Why is the internet Traffic Bursty in 
         Short Time Scales?" Sigmetrics'05, Banff, Alberta, Canada, 
         June 2005. 

   [33]  C. Lamblin, R. Even, "Processing Test Plan for the ITU-T 
         G.722.1 fullband extension optimization/characterization 
         phase", ITU-T Study Group 16, Temporary Document TD 322 (WP 
         3/16), 22 April - 2 May 2008. 


Hoene                    Expires June 3, 2011                 [Page 45] 

Internet-Draft              Codec Quality                 December 2010 
    

   [34]  C. Lamblin, R. Even, "G.722.1 fullband extension 
         characterization phase test results: objective (ITU-R BS.1387-
         1) and subjective (ITU-R BS.1116) scores", ITU-T Study Group 
         16, Temporary Document TD 341 R1 (WP 3/16), 22 April - 2 May 
         2008. 

   [35]  C. Lamblin, R. Even, "G.722.1 fullband extension 
         optimization/characterization Quality Assessment Test Plan", 
         ITU-T Study Group 16, Temporary Document TD 323 (WP 3/16), 22 
         April - 2 May 2008. 

   [36]  J. Lee, J. Kim, C. Jang, S. Kim, B. Egger, K. Kim, S Han, 
         "FaCSim: A Fast and Cycle-Accurate Architecture Simulator for 
         Embedded Systems", in Proceedings of the International 
         Conference on Languages, Compilers, and Tools for Embedded 
         Systems (LCTES'08), Tucson, Arizona, USA, June 2007, Software 
         available at http://facsim.snu.ac.kr/. 

   [37]  G. Maier, A. Feldmann, V. Paxson, M. Allman, "On Dominant 
         Characteristics of Residential Broadband Internet Traffic", 
         IMC'09, November 4-6, 2009, Chicago, Illinois, USA. 

   [38]  T. Mori, S.  Naito, R. Kawahara, S. Goto, "On the 
         characteristics of internet traffic variability: Spikes and 
         Elephants", SAINT'04, 2004. 

   [39]  Open Mobile Alliance, "Push to talk over Cellular 
         Requirements", Approved Version 1.0, 09 Jun 2006, OMA-RD-PoC-
         V1_0-20060609-A.pdf   

   [40]  OPTICOM, SwissQual, TNO, "Announcement of OPTICOM, SwissQual 
         and TNO to submit a joint P.OLQA model", ITU-T SG 12, 
         Contribution 117, Received on 2010-05-07. Related to 
         question(s): Q9/12. 

   [41]  D. Sisalem, A. Wolisz, "Towards TCP-friendly adaptive 
         multimedia applications based on RTP", IEEE International 
         Symposium on Computers and Communications, pp. 166-172, 1999. 

   [42]  S. Smirnoff, K. Pupkov, "SoundExpert, How it Works, Audio 
         quality measurements in the digital age", 
         http://soundexpert.org/, revived Nov. 2010. 

   [43]  L. Sun, "Speech Quality prediction For Voice Over Internet", 
         PhD thesis, University of Plymouth, January 2004, 
         http://www.tech.plymouth.ac.uk/spmc/people/lfsun/mos/. 

 
Hoene                    Expires June 3, 2011                 [Page 46] 

Internet-Draft              Codec Quality                 December 2010 
    

   [44]  Texas Instruments, "C64x+ CPU Cycle Accurate Simulator", 
         October 2010, 
         http://processors.wiki.ti.com/index.php/C64x%2B_CPU_Cycle_Accu
         rate_Simulator. 

   [45]  Texas Instruments, "TNETV3020: Carrier Infrastructure 
         Platform, Telogy Software products integrated with TI's DSP-
         based high-density communications processor", 2008, 
         http://focus.ti.com/lit/ml/spat174a/spat174a.pdf 

   [46]  TransNexus, "Asterisk V1.4.11 Performance", webpage, accessed 
         Nov. 2010, 
         http://www.transnexus.com/White%20Papers/asterisk_V1-4-
         11_performance.htm 

   [47]  K. Vos, K. Vandborg Sorensen, S. Skak Jensen, J. Spittka, 
         "SILK", presentation at the 77th IETF meeting in the WG Codec, 
         March 22, 2010, Anaheim, USA. 
         http://tools.ietf.org/agenda/77/slides/codec-3.pdf  

   [48]  H. Vlad Balan, L. Eggert, S. Niccolini, M. Brunner, "An 
         Experimental Evaluation of Voice Quality Over the Datagram 
         Congestion Control Protocol," IEEE INFOCOM 2007. 26th IEEE 
         International Conference on Computer Communications. pp. 2009-
         2017, 6-12 May 2007. 

   [49]  J. Wallerich, A. Feldmann, "Capturing the Variability of 
         Internet Flows Across Time", Proceedings INFOCOM 2006. 25th 
         IEEE International Conference on Computer Communications, 23-
         29 April 2006. 

   [50]  M. Westerlund, "How to Write an RTP Payload Format", work in 
         progress, draft-ietf-avt-rtp-howto-06, Internet-draft,    
         March 2, 2009. 

   [51]  Wikipedia contributors, "Bit rate", Wikipedia, The Free 
         Encyclopedia, 10 October 2010, 20:00 UTC, 
         http://en.wikipedia.org/w/index.php?title=Bit_rate&oldid=38993
         1944 

   [52]  Wikipedia contributors, "Cycle accurate simulator", Wikipedia, 
         The Free Encyclopedia, 4 September 2010, 14:27 UTC, 
         http://en.wikipedia.org/w/index.php?title=Cycle_accurate_simul
         ator&oldid=382876676 


Hoene                    Expires June 3, 2011                 [Page 47] 

Internet-Draft              Codec Quality                 December 2010 
    

   [53]  Wikipedia contributors, "Latency (engineering)", The Free 
         Encyclopedia, 15 October 2010, 23:54 UTC, 
         http://en.wikipedia.org/w/index.php?title=Latency_(engineering
         )&oldid=390971153 

   [54]  Wikipedia contributors, "Profiling (computer programming)", 
         Wikipedia, The Free Encyclopedia, 15 August 2010, 03:57 UTC, 
         http://en.wikipedia.org/w/index.php?title=Profiling_(computer_
         programming)&oldid=378987422. 

   [55]  M. T. Yourst, "PTLsim: A cycle accurate full system x86-64 
         microarchitectural simulator", in ISPASS '07, 2007, software 
         available at http://www.ptlsim.org/. 

14. Acknowledgments 

   This document is based on many discussions with experts in the field 
   of codec design, quality of experience and quality management. My 
   special thanks go to Michael Knappe, Sebastian Moeller, Raymond 
   Chen, Jack Douglass, Paul Coverdale, Jean-Marc Valin, Koen Vos, 
   Bilke Ullrich, and all active participants of the Codec WG mailing 
   list. Also, I like to express my appreciation to the members of the 
   ITU-T study groups 12 and 16, with whom I had many fruitful 
   discussions.   


Hoene                    Expires June 3, 2011                 [Page 48] 

Internet-Draft              Codec Quality                 December 2010 
    

Authors' Addresses 

   Christian Hoene 
   Universitaet Tuebingen 
   WSI-ICS 
   Sand 13 
   72076 Tuebingen 
   Germany 
       
   Phone: +49 7071 2970532 
   Email: hoene@uni-tuebingen.de 
    

Hoene                    Expires June 3, 2011                 [Page 49]