Network Working Group                                             K. Vos
Internet-Draft                                                 S. Jensen
Intended status: Standards Track                            K. Soerensen
Expires: January 7, 2010                         Skype Technologies S.A.
                                                            July 6, 2009


                            SILK Speech Codec
                          draft-vos-silk-00.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 7, 2010.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.


Vos, et al.             Expires January 7, 2010                 [Page 1]

Internet-Draft               SILK Speech Codec                 July 2009


Abstract

   This document describes SILK, a speech codec for real-time, packet-
   based voice communications.  Targeting a diverse range of operating
   environments, SILK provides scalability in several dimensions.  Four
   different sampling frequencies are supported for encoding the audio
   input signal.
   Adaptation to network characteristics is provided through control of
   bitrate, packet rate, packet loss resilience, and use of
   discontinuous transmission (DTX).  Several complexity levels let
   SILK take advantage of available processing power without depending
   on it.  Each of these properties can be adjusted during operation of
   the codec on a frame-by-frame basis.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Technical Requirements for Internet Wideband Audio Codec . . .  4
     2.1.  Bitrate  . . . . . . . . . . . . . . . . . . . . . . . . .  4
     2.2.  Sampling Rate  . . . . . . . . . . . . . . . . . . . . . .  4
     2.3.  Complexity . . . . . . . . . . . . . . . . . . . . . . . .  4
     2.4.  Packet Loss Resilience . . . . . . . . . . . . . . . . . .  4
     2.5.  Delay  . . . . . . . . . . . . . . . . . . . . . . . . . .  4
     2.6.  DTX  . . . . . . . . . . . . . . . . . . . . . . . . . . .  4
   3.  Outline of the Codec . . . . . . . . . . . . . . . . . . . . .  6
     3.1.  Encoder  . . . . . . . . . . . . . . . . . . . . . . . . .  6
       3.1.1.   Control Parameters . . . . . . . . . . . . . . . . .  6
       3.1.2.   Voice Activity Detection . . . . . . . . . . . . . .  9
       3.1.3.   High-Pass Filter . . . . . . . . . . . . . . . . . .  9
       3.1.4.   Pitch Analysis . . . . . . . . . . . . . . . . . . . 10
       3.1.5.   Noise Shaping Analysis . . . . . . . . . . . . . . . 11
       3.1.6.   Prefilter  . . . . . . . . . . . . . . . . . . . . . 15
       3.1.7.   Prediction Analysis  . . . . . . . . . . . . . . . . 15
       3.1.8.   LSF Quantization . . . . . . . . . . . . . . . . . . 16
       3.1.9.   LTP Quantization . . . . . . . . . . . . . . . . . . 19
       3.1.10.  Noise Shaping Quantizer  . . . . . . . . . . . . . . 20
       3.1.11.  Range Encoder  . . . . . . . . . . . . . . . . . . . 20
     3.2.  Decoder  . . . . . . . . . . . . . . . . . . . . . . . . . 21
       3.2.1.   Range Decoder  . . . . . . . . . . . . . . . . . . . 22
       3.2.2.   Decode Parameters  . . . . . . . . . . . . . . . . . 22
       3.2.3.   Generate Excitation  . . . . . . . . . . . . . . . . 22
       3.2.4.   LTP Synthesis  . . . . . . . . . . . . . . . . . . . 22
       3.2.5.   LPC Synthesis  . . . . . . . . . . . . . . . . . . . 23
   4.  Reference Implementation . . . . . . . . . . . . . . . . . . . 24
   5.  Security Considerations  . . . . . . . . . . . . . . . . . . . 25
   6.  Informative References . . . . . . . . . . . . . . . . . . . . 26
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 27

1.  Introduction

   A central component in voice communications is the speech codec,
   which compresses the audio signal for efficient transmission over a
   network.  A good speech codec achieves high coding efficiency,
   meaning that it delivers high audio quality at a given bitrate.
   However, for a good user experience in a broad range of
   environments, a speech codec should also be able to adapt its
   operating point to the characteristics and limitations of the
   network, the hardware, and the audio signal.  SILK is a novel speech
   codec for real-time voice communications designed and developed by
   Skype [skype-website] to offer this kind of scalability.  This
   document describes the technical details of SILK.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119.

2.  Technical Requirements for Internet Wideband Audio Codec

   The Internet Wideband Audio Codec MUST be optimized towards real-
   time communications over the Internet, and MUST have the flexibility
   to adjust to the environment it operates in.  Below is a list of
   concrete requirements for the codec.

2.1.  Bitrate

   The codec MUST provide a quality/bitrate trade-off that is
   competitive with other state-of-the-art codecs.  It MUST be capable
   of running at bitrates below 10 kbps.
   At low bitrates it MUST deliver good quality for clean, noisy, or
   hands-free speech in any language.  At high bitrates the quality
   MUST be excellent for any audio signal, including music.  The
   bitrate MUST be adjustable in real-time.

2.2.  Sampling Rate

   The codec MUST support multiple sampling rates, ranging from
   narrowband (8 kHz) to super wideband (24 kHz or more).  Switching
   between sampling rates MUST be carried out in real-time.

2.3.  Complexity

   The codec MUST be capable of running at below 50 MHz of an x86 core
   in wideband mode (16 kHz sampling rate).  The codec SHOULD have a
   complexity that is adjustable in real-time, where a higher
   complexity setting improves the quality/bitrate trade-off.

2.4.  Packet Loss Resilience

   The codec MUST be capable of running with little error propagation,
   meaning that the decoded signal after one or more packet losses is
   close to the decoded signal without packet losses after no more than
   two additional packets.  The codec MUST have a packet loss
   resilience that is adjustable in real-time, where a lower packet
   loss resilience setting improves the quality/bitrate trade-off.

2.5.  Delay

   The codec MUST be capable of running with an algorithmic delay of no
   more than 30 milliseconds.

2.6.  DTX

   The codec SHOULD be capable of using Discontinuous Transmission
   (DTX), where packets are sent at a reduced rate when the input
   signal contains only background noise.

3.  Outline of the Codec

   The SILK codec consists of an encoder and a decoder, as described in
   Section 3.1 and Section 3.2, respectively.

3.1.  Encoder

   We start the description of the encoder by listing the parameters
   that control the operating point of the encoder.  Afterwards, we
   describe the encoder components in detail.

3.1.1.  Control Parameters

   The encoder with control parameters specifying the operating point
   is depicted in Figure 1.  All control parameters can be changed
   during regular operation of the codec, when inputting a frame of
   audio data, without interrupting the audio stream from encoder to
   decoder.  The codec control parameters are described in
   Section 3.1.1.1 to Section 3.1.1.5.

   Sampling rate --------------+
   Bitrate ------------------+ |
   Packet rate -----------+  | |
   Packet loss rate ----+ |  | |
   Complexity --------+ | |  | |
   Use DTX ---------+ | | |  | |
                     | | | | | |
                    \/\/\/\/\/\/
                  +-------------+
   Input signal ->|   Encoder   |--> Bitstream
                  +-------------+

   Block diagram illustrating the control parameters that specify the
   operating point of the SILK encoder.

                                Figure 1

3.1.1.1.  Sampling Rate

   SILK can switch in real-time between audio sampling rates of 8, 12,
   16, and 24 kHz.  A higher sampling rate improves audio quality by
   preserving a larger part of the input signal frequency range, at the
   cost of increased CPU load and bitrate.

3.1.1.2.  Bitrate

   The bitrate can be set between 6 and 40 kbps.  A higher bitrate
   improves audio quality by lowering the amount of quantization noise
   in the decoded signal.  The required bitrate for a given level of
   quantization noise is approximately linear in the sampling rate.
   Good quality is achieved at around 1 bit/sample, and at 1.5 bits/
   sample the quality becomes transparent for most material.

3.1.1.3.  Packet Rate

   SILK encodes frames of 20 milliseconds at a time and can combine 1,
   2, 3, 4, or 5 of these frames in one payload, thus creating one
   packet every 20, 40, 60, 80, or 100 milliseconds.  Because of the
   overhead from IP/UDP/RTP headers, sending fewer packets per second
   reduces the bitrate, but it increases latency and sensitivity to
   packet losses, as losing one packet constitutes the loss of a bigger
   chunk of audio signal.

3.1.1.4.  Packet Loss Resilience

   Speech codecs often exploit inter-frame correlations to reduce the
   bitrate at a cost in error propagation: after losing one packet,
   several packets need to be received before the decoder is able to
   accurately reconstruct the speech signal.  The extent to which SILK
   exploits inter-frame dependencies can be adjusted on the fly to
   choose a trade-off between bitrate and amount of error propagation.

3.1.1.5.  Complexity

   SILK has several optional optimizations that can be enabled to
   reduce the CPU load severalfold, at the cost of increasing the
   bitrate by a few percent.  The most important algorithmic parts
   controlled by the three complexity settings (high, medium, and low)
   are:

   o  The filter order of the whitening filter and the downsampling
      quality in the pitch analysis.

   o  The filter order of the short-term noise shaping filter used in
      the prefilter and noise shaping quantizer.

   o  The accuracy in the prediction analysis, the use of simulated
      output, and adjustment of the number of survivors that are
      carried over between stages in the multi-stage LSF vector
      quantization.

   o  The number of states in delayed decision quantization of the
      residual signal.

   In the following, we focus on the core encoder and describe its
   components.  For simplicity, we will refer to the core encoder
   simply as the encoder in the remainder of this document.  An
   overview of the encoder is given in Figure 2.
   [Encoder block diagram, Figure 2: the high-passed input feeds the
   Voice Activity Detector, Pitch Analysis, Noise Shaping Analysis,
   Prefilter, Prediction Analysis, LSF Quantizer, LTP Scaling Control,
   Gains Processor, Noise Shaping Quantization, and Range Encoder
   blocks, interconnected by the numbered signals listed below.]

   1:  Input speech signal
   2:  High passed input signal
   3:  Voice activity estimate
   4:  Pitch lags (per 5 ms) and voicing decision (per 20 ms)
   5:  Noise shaping quantization coefficients
       -  Short term synthesis and analysis noise shaping coefficients
          (per 5 ms)
       -  Long term synthesis and analysis noise shaping coefficients
          (per 5 ms and for voiced speech only)
       -  Noise shape tilt (per 5 ms)
       -  Quantizer gain/step size (per 5 ms)
   6:  Input signal filtered with analysis noise shaping filters
   7:  Simulated output signal
   8:  Short and long term prediction coefficients, LTP (per 5 ms) and
       LPC (per 20 ms)
   9:  LSF quantization indices
   10: LSF coefficients
   11: Quantized LSF coefficients
   12: Processed gains and synthesis noise shape coefficients
   13: LTP state scaling coefficient, controlling the error
       propagation / prediction gain trade-off
   14: Quantized signal
   15: Range encoded bitstream

   Encoder block diagram.

                                Figure 2

3.1.2.  Voice Activity Detection

   The input signal is processed by a VAD (Voice Activity Detector) to
   produce a measure of voice activity, and also spectral tilt and
   signal-to-noise estimates, for each frame.  The VAD uses a sequence
   of half-band filterbanks to split the signal into four subbands:
   0 - Fs/16, Fs/16 - Fs/8, Fs/8 - Fs/4, and Fs/4 - Fs/2, where Fs is
   the sampling frequency (8, 12, 16, or 24 kHz).  The lowest subband,
   from 0 to Fs/16, is high-pass filtered with a first-order MA (Moving
   Average) filter (with transfer function H(z) = 1 - z^(-1)) to reduce
   the energy at the lowest frequencies.  For each frame, the signal
   energy per subband is computed.  In each subband, a noise level
   estimator tracks the background noise level, and an SNR (Signal-to-
   Noise Ratio) value is computed as the logarithm of the ratio of
   energy to noise level.  Using these intermediate variables, the
   following parameters are calculated for use in other SILK modules:

   o  Average SNR.  The average of the subband SNR values.

   o  Smoothed subband SNRs.  Temporally smoothed subband SNR values.

   o  Speech activity level.  Based on the average SNR and a weighted
      average of the subband energies.

   o  Spectral tilt.  A weighted average of the subband SNRs, with
      positive weights for the low subbands and negative weights for
      the high subbands.

3.1.3.  High-Pass Filter

   The input signal is filtered by a high-pass filter to remove the
   lowest part of the spectrum, which contains little speech energy and
   may contain background noise.  This is a second-order ARMA (Auto
   Regressive Moving Average) filter with a cut-off frequency around
   70 Hz.
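   As an illustration, such a second-order ARMA high-pass can be
   realized as a biquad.  The coefficient design below (an RBJ-style
   Butterworth high-pass) and the 16 kHz / 70 Hz parameters are
   illustrative assumptions; the draft does not specify SILK's actual
   filter coefficients:

```python
import math

def highpass_biquad(fc_hz, fs_hz, q=0.7071):
    """Second-order (biquad) high-pass design, RBJ cookbook style.
    Illustrative only: the draft does not give SILK's coefficients."""
    w0 = 2.0 * math.pi * fc_hz / fs_hz
    alpha = math.sin(w0) / (2.0 * q)
    c = math.cos(w0)
    b = [(1 + c) / 2.0, -(1 + c), (1 + c) / 2.0]
    a = [1 + alpha, -2.0 * c, 1 - alpha]
    return [x / a[0] for x in b], [x / a[0] for x in a]  # a0 -> 1

def filter_arma(b, a, x):
    """Direct Form I difference equation:
    y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]."""
    x1 = x2 = y1 = y2 = 0.0
    y = []
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

# 70 Hz cut-off at a 16 kHz sampling rate (wideband mode)
b, a = highpass_biquad(70.0, 16000.0)
```

   A DC input decays to zero at the filter output, while speech-band
   content well above the cut-off passes with near-unity gain.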
   In the future, a music detector may also be used to lower the cut-
   off frequency when the input signal is detected to be music rather
   than speech.

3.1.4.  Pitch Analysis

   The high-passed input signal is processed by the open loop pitch
   estimator shown in Figure 3.

   [Pitch estimator block diagram, Figure 3: the input is whitened by
   an LPC-based whitening filter and analyzed by time-correlators at
   two downsampling factors (4x and 2x) and at the full rate, with an
   LPC analysis feeding the whitening filter and a speech type
   decision derived from the correlations.  The numbered signals are
   listed below.]

   1: Input signal
   2: Lag candidates from stage 1
   3: Lag candidates from stage 2
   4: Correlation threshold
   5: Voiced/unvoiced flag
   6: Pitch correlation
   7: Pitch lags

   Block diagram of the pitch estimator.

                                Figure 3

   The pitch analysis finds a binary voiced/unvoiced classification
   and, for frames classified as voiced, four pitch lags per frame -
   one for each 5 ms subframe - and a pitch correlation indicating the
   periodicity of the signal.  The input is first whitened using a
   Linear Prediction (LP) whitening filter, where the coefficients are
   computed through standard Linear Prediction Coding (LPC) analysis.
   The order of the whitening filter is 16 for best results, but is
   reduced to 12 for medium complexity and 8 for low complexity modes.
   The whitened signal is analyzed to find pitch lags for which the
   time correlation is high.
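   The whitening step above can be sketched with a textbook
   autocorrelation-method LPC analysis and Levinson-Durbin recursion;
   this floating-point version is an assumption for illustration, and
   SILK's fixed-point implementation differs in its details:

```python
def lpc_autocorr(x, order):
    """LPC coefficients via the autocorrelation method and the
    Levinson-Durbin recursion.  Returns predictor coefficients a[0..p-1]
    (a[k-1] is the draft's a(k)) and the residual energy."""
    n = len(x)
    r = [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]
    a = [0.0] * order
    err = r[0]
    for m in range(order):
        acc = r[m + 1] - sum(a[j] * r[m - j] for j in range(m))
        k = acc / err                      # reflection coefficient
        a_new = a[:]
        a_new[m] = k
        for j in range(m):
            a_new[j] = a[j] - k * a[m - 1 - j]
        a = a_new
        err *= (1.0 - k * k)
    return a, err

def whiten(x, a):
    """Prediction error filter: e[n] = x[n] - sum_k a[k] * x[n-1-k]."""
    p = len(a)
    return [x[n] - sum(a[k] * x[n - 1 - k] for k in range(p) if n - 1 - k >= 0)
            for n in range(len(x))]
```

   Filtering a strongly correlated signal with its own order-8 to
   order-16 predictor removes most of the spectral envelope, leaving a
   residual whose energy is far below that of the input.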
   The analysis consists of three stages, to reduce the complexity:

   o  In the first stage, the whitened signal is downsampled by a
      factor of four, and the current frame is correlated to a signal
      delayed by a range of lags, starting from a shortest lag
      corresponding to 500 Hz to a longest lag corresponding to 56 Hz.

   o  The second stage operates on a two times downsampled signal and
      measures time correlations only near the lags corresponding to
      those that had sufficiently high correlations in the first stage.
      The resulting correlations are adjusted for a small bias towards
      short lags to avoid ending up with a multiple of the true pitch
      lag.  The highest adjusted correlation is compared to a threshold
      depending on:

      *  Whether the previous frame was classified as voiced

      *  The speech activity level

      *  The spectral tilt.

      If the threshold is exceeded, the current frame is classified as
      voiced, and the lag with the highest adjusted correlation is
      stored for a final pitch analysis of the highest precision in the
      third stage.

   o  The last stage operates directly on the whitened input signal to
      compute time correlations for each of the four subframes
      independently, in a narrow range around the lag with highest
      correlation from the second stage.

3.1.5.  Noise Shaping Analysis

   The noise shaping analysis finds gains and filter coefficients used
   in the prefilter and noise shaping quantizer.  These parameters are
   chosen such that they fulfil several requirements:

   o  Balancing quantization noise and bitrate.  The quantization gains
      determine the step size between reconstruction levels of the
      excitation signal.  Therefore, increasing the quantization gain
      amplifies quantization noise, but it also reduces the bitrate by
      lowering the entropy of the quantization indices.

   o  Spectral shaping of the quantization noise; the noise shaping
      quantizer is capable of reducing quantization noise in some parts
      of the spectrum at the cost of increased noise in other parts,
      without substantially changing the bitrate.
      By shaping the noise such that it follows the signal spectrum, it
      becomes less audible.  In practice, best results are obtained by
      making the shape of the noise spectrum slightly flatter than the
      signal spectrum.

   o  Deemphasizing spectral valleys; by using different coefficients
      in the analysis and synthesis part of the prefilter and noise
      shaping quantizer, the levels of the spectral valleys can be
      decreased relative to the levels of the spectral peaks, such as
      speech formants and harmonics.  This reduces the entropy of the
      signal, which is the difference between the coded signal and the
      quantization noise, thus lowering the bitrate.

   o  Matching the levels of the decoded speech formants to the levels
      of the original speech formants; an adjustment gain and a first
      order tilt coefficient are computed to compensate for the effect
      of the noise shaping quantization on the level and spectral tilt.

   [Spectrum illustration, Figure 4: power versus frequency for three
   curves, labeled as follows.]

   1: Input signal spectrum
   2: Deemphasized and level matched spectrum
   3: Quantization noise spectrum

   Noise shaping and spectral de-emphasis illustration.

                                Figure 4

   Figure 4 shows an example of an input signal spectrum (1).  After
   de-emphasis and level matching, the spectrum has deeper valleys (2).
   The quantization noise spectrum (3) more or less follows the input
   signal spectrum, having slightly less pronounced peaks.  The
   entropy, which provides a lower bound on the bitrate for encoding
   the excitation signal, is proportional to the area between the
   deemphasized spectrum (2) and the quantization noise spectrum (3).
   Without de-emphasis, the entropy is proportional to the area between
   input spectrum (1) and quantization noise (3) - clearly higher.
   The transformation from input signal to deemphasized signal can be
   described as a filtering operation with a filter

      H(z) = G * (1 - c_tilt * z^(-1)) * Wana(z) / Wsyn(z),

   having an adjustment gain G, a first order tilt adjustment filter
   with tilt coefficient c_tilt, and where

      Wana(z) = (1 - sum_{k=1}^{16} a_ana(k) * z^(-k))
                  * (1 - z^(-L) * sum_{k=-d}^{d} b_ana(k) * z^(-k))

   is the analysis part of the de-emphasis filter, consisting of the
   short-term shaping filter with coefficients a_ana(k) and the long-
   term shaping filter with coefficients b_ana(k) and pitch lag L.  The
   parameter d determines the number of long-term shaping filter taps.
   Similarly, but without the tilt adjustment, the synthesis part can
   be written as

      Wsyn(z) = (1 - sum_{k=1}^{16} a_syn(k) * z^(-k))
                  * (1 - z^(-L) * sum_{k=-d}^{d} b_syn(k) * z^(-k)).

   All noise shaping parameters are computed and applied per subframe
   of 5 milliseconds.  First, an LPC analysis is performed on a
   windowed signal block of 16 milliseconds.  The signal block has a
   look-ahead of 5 milliseconds relative to the current subframe, and
   the window is an asymmetric sine window.  The LPC analysis is done
   with the autocorrelation method, with an order of 16 for best
   quality or 12 in low complexity operation.

   The quantization gain is found as the square-root of the residual
   energy from the LPC analysis, multiplied by a value inversely
   proportional to the coding quality control parameter and the pitch
   correlation.

   Next we find the two sets of short-term noise shaping coefficients
   a_ana(k) and a_syn(k) by applying different amounts of bandwidth
   expansion to the coefficients found in the LPC analysis.  This
   bandwidth expansion moves the roots of the LPC polynomial towards
   the origin, using the formulas
      a_ana(k) = a(k) * g_ana^k, and

      a_syn(k) = a(k) * g_syn^k,

   where a(k) is the k'th LPC coefficient, and the bandwidth expansion
   factors g_ana and g_syn are calculated as

      g_ana = 0.94 - 0.02 * C, and

      g_syn = 0.94 + 0.02 * C,

   where C is the coding quality control parameter between 0 and 1.
   Applying more bandwidth expansion to the analysis part than to the
   synthesis part gives the desired de-emphasis of spectral valleys in
   between formants.

   The long-term shaping is applied only during voiced frames.  It uses
   three filter taps, described by

      b_ana = F_ana * [0.25, 0.5, 0.25], and

      b_syn = F_syn * [0.25, 0.5, 0.25].

   For unvoiced frames these coefficients are set to 0.  The
   multiplication factors F_ana and F_syn are chosen between 0 and 1,
   depending on the coding quality control parameter, as well as the
   calculated pitch correlation and the smoothed subband SNR of the
   lowest subband.  By having F_ana less than F_syn, the pitch
   harmonics are emphasized relative to the valleys in between the
   harmonics.

   The tilt coefficient c_tilt is chosen as

      c_tilt = 0.4

   for unvoiced frames, and as

      c_tilt = 0.04 + 0.06 * C

   for voiced frames, where C again is the coding quality control
   parameter between 0 and 1.

   The adjustment gain G serves to correct any level mismatch between
   the original and the decoded signal that might arise from the noise
   shaping and de-emphasis.  This gain is computed as the ratio of the
   prediction gains of the short-term analysis and synthesis filter
   coefficients.  The prediction gain of an LPC synthesis filter is the
   square-root of the output energy when the filter is excited by a
   unit-energy impulse on the input.
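   The bandwidth expansion and long-term shaping formulas above are
   simple enough to transcribe directly; this sketch follows the
   equations in the text, with no SILK-specific rounding:

```python
def shaping_coefficients(a, C):
    """Short-term noise shaping coefficients via bandwidth expansion:
    a_ana(k) = a(k)*g_ana^k and a_syn(k) = a(k)*g_syn^k, with
    g_ana = 0.94 - 0.02*C and g_syn = 0.94 + 0.02*C, 0 <= C <= 1.
    a[k-1] holds the draft's a(k)."""
    g_ana = 0.94 - 0.02 * C
    g_syn = 0.94 + 0.02 * C
    a_ana = [ak * g_ana ** (k + 1) for k, ak in enumerate(a)]
    a_syn = [ak * g_syn ** (k + 1) for k, ak in enumerate(a)]
    return a_ana, a_syn

def long_term_taps(F):
    """Three-tap long-term shaping filter, b = F * [0.25, 0.5, 0.25];
    F is F_ana or F_syn, between 0 and 1 (0 for unvoiced frames)."""
    return [F * 0.25, F * 0.5, F * 0.25]
```

   Because g_ana < g_syn, every analysis coefficient is shrunk more
   than the corresponding synthesis coefficient, which is what
   de-emphasizes the spectral valleys between formants.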
   An efficient way to compute the prediction gain is to first compute
   the reflection coefficients from the LPC coefficients through the
   step-down algorithm, and then extract the prediction gain from the
   reflection coefficients as

      predGain = ( prod_{k=1}^{K} (1 - r_k^2) )^(-0.5),

   where r_k is the k'th reflection coefficient.

   Initial values for the quantization gains are computed as the
   square-root of the residual energy of the LPC analysis, adjusted by
   the coding quality control parameter.  These quantization gains are
   later adjusted based on the results of the prediction analysis.

3.1.6.  Prefilter

   In the prefilter the input signal is filtered using the spectral
   valley de-emphasis filter coefficients from the noise shaping
   analysis; see Section 3.1.5.  The filter output is called the
   simulated output signal and is passed on to the prediction analysis.
   Also, by applying only the analysis part of the noise shaping filter
   to the input signal, the prefilter provides the input to the noise
   shaping quantizer.

3.1.7.  Prediction Analysis

   The prediction analysis is performed in one of two ways, depending
   on how the pitch estimator classified the frame.  The processing for
   voiced and unvoiced speech is described in Section 3.1.7.1 and
   Section 3.1.7.2, respectively.  Inputs to this function include the
   pre-whitened signal from the pitch estimator; see Section 3.1.4.

3.1.7.1.  Voiced Speech

   For a frame of voiced speech, the pitch pulses will remain dominant
   in the pre-whitened input signal.  Further whitening is desirable as
   it leads to higher quality at the same available bitrate.  To
   achieve this, a Long-Term Prediction (LTP) analysis is carried out
   to estimate the coefficients of a fifth order LTP filter for each of
   the four subframes.  The LTP coefficients are used to compute an LTP
   residual signal, with the simulated output signal as input, to
   obtain better modelling of the output signal.  This LTP residual
   signal is the input to an LPC analysis where the LPCs are estimated
   using the
   covariance method, such that the residual energy is minimized.  The
   estimated LPCs are converted to a Line Spectral Frequency (LSF)
   vector and quantized as described in Section 3.1.8.  After
   quantization, the quantized LSF vector is converted back to LPC
   coefficients; by using these quantized coefficients, the encoder
   remains fully synchronized with the decoder.  The LTP coefficients
   are quantized using the method described in Section 3.1.9.  The
   quantized LPC and LTP coefficients are then used to filter the
   simulated output signal and measure a residual energy for each of
   the four subframes.

3.1.7.2.  Unvoiced Speech

   For a speech signal that has been classified as unvoiced, there is
   no need for LTP filtering, as it has already been determined that
   the pre-whitened input signal is not periodic enough within the
   allowed pitch period range for an LTP analysis to be worth the cost
   in terms of complexity and rate.  Therefore, the pre-whitened input
   signal is discarded, and instead the simulated output is used for
   LPC analysis using the covariance method.  The resulting LPC
   coefficients are converted to an LSF vector, quantized as described
   in the following section, and transformed back to obtain quantized
   LPC coefficients.  The quantized LPC coefficients are used to filter
   the simulated output signal and measure a residual energy for each
   of the four subframes.

3.1.8.  LSF Quantization

   The purpose of quantization is to significantly lower the bitrate
   at the cost of some introduced distortion.  A higher rate should
   always lead to lower distortion, and lowering the rate will
   generally lead to higher distortion.  A commonly used but generally
   sub-optimal approach is to use a quantization method with a constant
   rate, where only the error is minimized when quantizing.

3.1.8.1.  Rate-Distortion Optimization

   Instead, we minimize an objective function that consists of a
   weighted sum of rate and distortion, and use a codebook with an
   associated non-uniform rate table.  Thus, we take into account that
   the probability mass function for selecting the codebook entries is
   by no means guaranteed to be uniform in our scenario.  This approach
   ensures that rarely used codebook vector centroids, which model
   statistical outliers in the training set, can be quantized with a
   low error but at a relatively high rate, while frequently used
   centroids are modelled with low error and a relatively low rate.
   This approach will lead to equal or lower distortion than the fixed
   rate codebook at any given average rate, provided that the data is
   similar to the data used for training the codebook.

3.1.8.2.  Error Mapping

   Instead of minimizing the error in the LSF domain, we map the errors
   to spectral distortion by applying a weight to the error of each
   element in the error vector.  These weight vectors are calculated
   for each input vector as a linear approximation of the true mapping
   function, which is accurate for small errors.  Consequently, we
   solve the following minimization problem:

      LSF_q = argmin { (LSF - c)' * W * (LSF - c) + mu * rate },
              c in C

   where LSF_q is the quantized vector, LSF is the input vector to be
   quantized, and c is a quantized LSF vector candidate taken from the
   set C of all possible outcomes of the codebook.

3.1.8.3.  Multi-Stage Vector Codebook

   We arrange the codebook in a multiple stage structure to achieve a
   quantizer that is both memory efficient and highly scalable in terms
   of computational complexity; see e.g. [sinervo-norsig].
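   The minimization above can be sketched as follows.  Taking W as a
   diagonal matrix is a simplifying assumption for illustration (the
   text only says a weight is applied per element of the error vector),
   and the codebook and rate values are hypothetical:

```python
def weighted_rd_cost(lsf, c, w_diag, rate, mu):
    """Objective (LSF - c)' * W * (LSF - c) + mu * rate, with W
    diagonal for simplicity."""
    dist = sum(w * (x - ci) ** 2 for w, x, ci in zip(w_diag, lsf, c))
    return dist + mu * rate

def quantize_lsf(lsf, codebook, rates, w_diag, mu):
    """Pick the codebook entry minimizing the weighted rate-distortion
    objective; 'rates' is the associated non-uniform rate table in bits."""
    costs = [weighted_rd_cost(lsf, c, w_diag, r, mu)
             for c, r in zip(codebook, rates)]
    best = min(range(len(codebook)), key=costs.__getitem__)
    return best, costs[best]
```

   With mu = 0 the search reduces to pure weighted-error minimization;
   a larger mu biases the choice towards frequently used, cheaply coded
   entries, which is exactly the rate-distortion trade-off described
   above.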
   In the first stage the input is the LSF vector to be quantized, and
   in any other stage s > 1, the input is the quantization error from
   the previous stage; see Figure 5.

      Stage 1:         Stage 2:              Stage S:
    +----------+     +----------+          +----------+
    | c_{1,1}  |     | c_{2,1}  |          | c_{S,1}  |
    +----------+     +----------+          +----------+
    | c_{1,2}  |     | c_{2,2}  |          | c_{S,2}  |
    +----------+     +----------+          +----------+
        ...              ...                   ...
    +----------+     +----------+          +----------+
    |c_{1,M1-1}|     |c_{2,M2-1}|          |c_{S,MS-1}|
    +----------+     +----------+          +----------+
    | c_{1,M1} |     | c_{2,M2} |          | c_{S,MS} |
    +----------+     +----------+          +----------+

    LSF ---> Stage 1 --res_1--> Stage 2 --> ... --> Stage S,
    with final quantization error res_S = LSF - LSF_q.

   Multi-Stage LSF Vector Codebook Structure.

                                Figure 5

   By storing a total of

      M = sum_{s=1}^{S} M_s

   codebook vectors, where M_s is the number of vectors in stage s, we
   obtain a total of

      T = prod_{s=1}^{S} M_s

   possible combinations for generating the quantized vector.  It is
   for example possible to represent 2^36 unique vectors using only 216
   vectors in memory, as done in SILK for voiced speech at all sampling
   frequencies above 8 kHz.

3.1.8.4.  Survivor Based Codebook Search

   This number of possible combinations is far too high to carry out a
   full search for each frame, so for all stages but the last, i.e.,
   for s smaller than S, only the best min(L, M_s) centroids are
   carried over to stage s+1.  In each stage, the objective function,
   i.e., the weighted sum of accumulated bitrate and distortion, is
   evaluated for each codebook vector entry, and the results are
   sorted.  Only the best paths and the corresponding quantization
   errors are considered in the next stage.  In the last stage, S, the
   single best path through the multistage codebook is determined.
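   The survivor-based search can be sketched as a beam search over the
   stages.  This toy version is an assumption for illustration: it uses
   identity error weighting (no W matrix) and hypothetical codebooks,
   and it ranks paths by the current residual energy plus mu times the
   accumulated rate:

```python
def msvq_search(x, stages, rates, mu, L):
    """Survivor-based multi-stage VQ search: after each stage, keep at
    most L best paths, ranked by residual energy plus mu * bits."""
    # Each path: (cost, accumulated bits, chosen indices, residual)
    paths = [(0.0, 0.0, [], list(x))]
    for cb, cb_rates in zip(stages, rates):
        cand = []
        for _, bits, idx, res in paths:
            for i, c in enumerate(cb):
                new_res = [r - ci for r, ci in zip(res, c)]
                dist = sum(v * v for v in new_res)
                new_bits = bits + cb_rates[i]
                cand.append((dist + mu * new_bits, new_bits,
                             idx + [i], new_res))
        cand.sort(key=lambda t: t[0])
        paths = cand[:L]          # survivors carried to the next stage
    cost, bits, idx, res = paths[0]
    return idx, cost
```

   Setting L large enough to keep every path reproduces the full
   search; L = 1 is the greedy search, with possibly a worse final
   objective value.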
   By varying the maximum number of survivors kept from each stage to
   the next, L, the complexity can be adjusted in real-time, at the
   cost of a potential decrease in the objective function for the
   resulting quantized vector.  This approach scales all the way
   between the two extremes: L = 1 is a greedy search, and L = T/M_S is
   the desirable but infeasible full search.  In fact, performance
   almost as good as that of the infeasible full search can be obtained
   at a substantially lower complexity by using this approach; see e.g.
   [leblanc-tsap].

3.1.8.5.  LSF Stabilization

   If the input is stable, finding the best candidate will usually
   result in the quantized vector also being stable.  However, due to
   the multi-stage approach, it could in theory happen that the best
   quantization candidate is unstable, so there is a need to explicitly
   ensure that the quantized vectors are stable.  Therefore, we apply
   an LSF stabilization method which ensures that the LSF parameters
   are within their valid range, are increasingly sorted, and keep
   minimum distances from each other and from the border values; these
   minimum distances have been pre-determined as the 0.01 percentile
   distance values from a large training set.

3.1.8.6.  Off-Line Codebook Training

   The vectors and rate tables for the multi-stage codebook are trained
   by minimizing the average of the objective function for LSF vectors
   from a large training set.

3.1.9.  LTP Quantization

   For voiced frames, the prediction analysis described in
   Section 3.1.7.1 resulted in four sets (one set per subframe) of five
   LTP coefficients, plus four weighting matrices.  The LTP
   coefficients for each subframe are quantized using entropy
   constrained vector quantization.  A total of three vector codebooks
   are available for quantization, with different rate-distortion
   trade-offs.
   The three codebooks have 10, 20, and 40 vectors and average rates
   of about 3, 4, and 5 bits per vector, respectively.  Consequently,
   the first codebook has larger average quantization distortion at a
   lower rate, whereas the last codebook has smaller average
   quantization distortion at a higher rate.  Given the weighting
   matrix W_ltp and LTP vector b, the weighted rate-distortion measure
   for a codebook vector cb_i with rate r_i is given by

      RD = u * (b - cb_i)' * W_ltp * (b - cb_i) + r_i,

   where u is a fixed, heuristically-determined parameter balancing
   the distortion and rate.  Which codebook gives the best performance
   for a given LTP vector depends on the weighting matrix for that LTP
   vector.  For example, for a low-valued W_ltp, it is advantageous to
   use the codebook with 10 vectors as it has a lower average rate.
   For a large W_ltp, on the other hand, it is often better to use the
   codebook with 40 vectors, as it is more likely to contain the best
   codebook vector.

   The weighting matrix W_ltp depends mostly on two aspects of the
   input signal.  The first is the periodicity of the signal; the more
   periodic the signal, the larger W_ltp.  The second is the change in
   signal energy in the current subframe, relative to the signal one
   pitch lag earlier.  A decaying energy leads to a larger W_ltp than
   an increasing energy.  Neither aspect fluctuates rapidly, so the
   W_ltp matrices for the different subframes of one frame are often
   similar.  As a result, one of the three codebooks typically gives
   good performance for all subframes.  The codebook search for the
   subframe LTP vectors is therefore constrained to choose all
   codebook vectors from the same codebook, resulting in a rate
   reduction.
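The RD measure above is straightforward to evaluate. The sketch below picks the best vector within one codebook; the codebook contents, rates, and the value of u are illustrative only, not SILK's trained values:

```python
import numpy as np

def best_ltp_vector(b, cb, rates, W_ltp, u=0.1):
    """Return the index i minimizing
       RD = u * (b - cb_i)' * W_ltp * (b - cb_i) + r_i."""
    best_i, best_rd = -1, float("inf")
    for i in range(len(cb)):
        e = b - cb[i]
        rd = u * e @ W_ltp @ e + rates[i]   # weighted distortion plus rate
        if rd < best_rd:
            best_i, best_rd = i, rd
    return best_i
```

Note how W_ltp steers the outcome, matching the discussion above: when W_ltp is small, the rate term dominates and a cheap vector wins; when W_ltp is large, distortion dominates and the closest vector wins even at a higher rate.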
   To find the best codebook, each of the three vector codebooks is
   used to quantize all subframe LTP vectors, and a combined weighted
   rate-distortion measure is computed for each vector codebook.  The
   vector codebook with the lowest combined rate-distortion over all
   subframes is chosen.  The quantized LTP vectors are used in the
   noise shaping quantizer, and the index of the chosen codebook plus
   the four indices for the four subframe codebook vectors are passed
   on to the range encoder.

3.1.10.  Noise Shaping Quantizer

   The noise shaping quantizer independently shapes the signal and
   coding noise spectra to obtain perceptually higher quality at the
   same bitrate.

   The prefilter output signal is multiplied with a compensation gain
   G computed in the noise shaping analysis.  Then the output of a
   synthesis shaping filter is added, and the output of a prediction
   filter is subtracted, to create a residual signal.  The residual
   signal is multiplied by the inverse of the quantized quantization
   gain from the noise shaping analysis, and input to a scalar
   quantizer.  The quantization indices of the scalar quantizer
   represent a signal of pulses that is input to the pyramid range
   encoder.  The scalar quantizer also outputs a quantization signal,
   which is multiplied by the quantized quantization gain from the
   noise shaping analysis to create an excitation signal.  The output
   of the prediction filter is added to the excitation signal to form
   the quantized output signal y(n).  The quantized output signal y(n)
   is input to the synthesis shaping and prediction filters.

3.1.11.  Range Encoder

   Range encoding is a well-known method for entropy coding in which a
   bitstream sequence is continually updated with every new symbol,
   based on the probability of that symbol.  It is similar to
   arithmetic coding, but rather than being restricted to generating
   binary output symbols, it can generate symbols in any chosen number
   base.
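The interval-narrowing idea behind range coding can be shown with a toy encoder/decoder pair. This sketch uses exact rational arithmetic for clarity; a real range coder, including SILK's, uses finite-precision integer arithmetic with renormalization, and the cumulative frequency table here is purely illustrative:

```python
from fractions import Fraction

def rc_encode(symbols, cum_freq, total):
    """Narrow [low, low+width) by each symbol's probability interval;
    cum_freq[s]..cum_freq[s+1] out of `total` is symbol s's range."""
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        low += width * Fraction(cum_freq[s], total)
        width *= Fraction(cum_freq[s + 1] - cum_freq[s], total)
    return low + width / 2      # any number inside the final interval

def rc_decode(code, n, cum_freq, total):
    """Invert rc_encode: locate which symbol interval contains the
    code, emit that symbol, and narrow the interval identically."""
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(n):
        target = (code - low) / width * total
        s = max(i for i in range(len(cum_freq) - 1) if cum_freq[i] <= target)
        out.append(s)
        low += width * Fraction(cum_freq[s], total)
        width *= Fraction(cum_freq[s + 1] - cum_freq[s], total)
    return out
```

More probable symbols shrink the interval less and therefore cost fewer bits, which is how the per-parameter cumulative density functions mentioned below translate into compression.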
   In SILK all side information is range encoded.  Each quantized
   parameter has its own cumulative density function based on
   histograms of the quantization indices obtained from a training
   database.

3.1.11.1.  Bitstream Encoding Details

   TBD.

3.2.  Decoder

   At the receiving end, the range decoder splits each received packet
   into the frames it contains, each of which holds the information
   necessary to reconstruct a 20 ms frame of the output signal.  An
   overview of the decoder is given in Figure 6.

             +---+
             | R |
             | a |
             | n |
             | g |
             | e |    +------------+
          -->|   |--->|   Decode   |----------------------------+
           1 | D | 2  | Parameters |----------+                 |
             | e |    +------------+        4 |               5 |
             | c |        3 |                 |                 |
             | o |          \/                \/                \/
             | d |    +------------+    +------------+    +------------+
             | e |    |  Generate  |--->|    LTP     |--->|    LPC     |--->
             | r |    | Excitation |    | Synthesis  |    | Synthesis  | 6
             +---+    +------------+    +------------+    +------------+

   1: Range encoded bitstream
   2: Coded parameters
   3: Pulses and gains
   4: Pitch lags and LTP coefficients
   5: LPC coefficients
   6: Decoded signal

                         Decoder block diagram.

                                Figure 6

3.2.1.  Range Decoder

   The range decoder decodes the encoded parameters from the received
   bitstream.  Its output includes the pulses and gains for generating
   the excitation signal, as well as the LTP and LSF codebook indices
   needed to decode the LTP and LPC coefficients used for LTP and LPC
   synthesis filtering of the excitation signal, respectively.

3.2.2.  Decode Parameters

   Pulses and gains are decoded from the range decoded bitstream in
   the following way... (TBD)

   When a voiced frame is decoded and the LTP codebook selection and
   indices are received, the LTP coefficients are decoded using the
   selected codebook by choosing the vector that corresponds to the
   given codebook index.  This is done for each of the four subframes.
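The voiced-speech signal path of Figure 6, excitation generation (Section 3.2.3), LTP synthesis (Section 3.2.4), and LPC synthesis (Section 3.2.5), can be sketched end to end. All coefficient values below are illustrative, history before the frame is assumed to be zero, and the LPC synthesis is written in its standard all-pole (recursive) form:

```python
def decode_frame(pulses, gain, b, L, a):
    """Toy decoder path.  b holds 2*d+1 LTP coefficients (taps -d..d)
    for pitch lag L; a holds the d_LPC LPC coefficients."""
    d = len(b) // 2
    # 3.2.3: excitation = pulses * quantization gain
    e = [p * gain for p in pulses]
    # 3.2.4: e_LPC(n) = e(n) + sum_{i=-d..d} e(n - L - i) * b_i
    e_lpc = []
    for n in range(len(e)):
        acc = e[n]
        for i in range(-d, d + 1):
            k = n - L - i
            if 0 <= k < len(e):
                acc += e[k] * b[i + d]
        e_lpc.append(acc)
    # 3.2.5: y(n) = e_LPC(n) + sum_{i=1..d_LPC} y(n - i) * a_i
    y = []
    for n in range(len(e_lpc)):
        acc = e_lpc[n]
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                acc += y[n - i] * a[i - 1]
        y.append(acc)
    return y
```

For unvoiced speech the LTP step is skipped (e_LPC(n) = e(n)) and only the LPC synthesis loop applies.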
   The LPC coefficients are decoded from the LSF codebook by first
   adding the chosen vectors, one vector from each stage of the
   codebook.  The resulting LSF vector is stabilized using the same
   method as in the encoder, see Section 3.1.8.5.  The LSF
   coefficients are then converted to LPC coefficients, and passed on
   to the LPC synthesis filter.

3.2.3.  Generate Excitation

   The pulse signal is multiplied by the quantization gain to create
   the excitation signal.

3.2.4.  LTP Synthesis

   For voiced speech, the excitation signal e(n) is input to an LTP
   synthesis filter that recreates the long-term correlation removed
   by the LTP analysis filter and generates an LPC excitation signal
   e_LPC(n), according to

                            d
                           ___
      e_LPC(n) = e(n) +    \   e(n - L - i) * b_i,
                           /__
                          i=-d

   using the pitch lag L and the decoded LTP coefficients b_i.  For
   unvoiced speech, the output signal is simply a copy of the input
   signal, i.e., e_LPC(n) = e(n).

3.2.5.  LPC Synthesis

   In a similar manner, the short-term correlation that was removed by
   the LPC analysis filter is recreated in the LPC synthesis filter.
   The LPC excitation signal e_LPC(n) is filtered using the LPC
   coefficients a_i, according to

                           d_LPC
                            ___
      y(n) = e_LPC(n) +     \   y(n - i) * a_i,
                            /__
                            i=1

   where d_LPC is the LPC synthesis filter order, and y(n) is the
   decoded output signal.

4.  Reference Implementation

   To Be Defined.

5.  Security Considerations

   To Be Defined.

6.  Informative References

   [leblanc-tsap]
              LeBlanc, W., Bhattacharya, B., Mahmoud, S., and V.
              Cuperman, "Efficient Search and Design Procedures for
              Robust Multi-Stage VQ of LPC Parameters for 4 kb/s
              Speech Coding", IEEE Transactions on Speech and Audio
              Processing, Vol. 1, No. 4, October 1993.

   [sinervo-norsig]
              Sinervo, U., Nurminen, J., Heikkinen, A., and J.
              Saarinen, "Evaluation of Split and Multistage Techniques
              in LSF Quantization", NORSIG-2001, Norsk symposium i
              signalbehandling, Trondheim, Norge, October 2001.

   [skype-website]
              "Skype", Skype website http://www.skype.com/.

Authors' Addresses

   Koen Vos
   Skype Technologies S.A.
   Stadsgaarden 6
   Stockholm  11645
   SE

   Phone: +46 855 921 989
   Email: koen.vos@skype.net


   Soeren Skak Jensen
   Skype Technologies S.A.
   Stadsgaarden 6
   Stockholm  11645
   SE

   Phone: +46 855 921 989
   Email: soren.skak.jensen@skype.net


   Karsten Vandborg Soerensen
   Skype Technologies S.A.
   Stadsgaarden 6
   Stockholm  11645
   SE

   Phone: +46 855 921 989
   Email: karsten.vandborg.sorensen@skype.net