Network Working Group                                             K. Vos
Internet-Draft                                                 S. Jensen
Intended status: Standards Track                            K. Soerensen
Expires: January 7, 2010                         Skype Technologies S.A.
                                                            July 6, 2009


                            SILK Speech Codec
                          draft-vos-silk-00.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 7, 2010.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.


Vos, et al.             Expires January 7, 2010                 [Page 1]

Internet-Draft               SILK Speech Codec                 July 2009


Abstract

   This document describes SILK, a speech codec for real-time, packet-
   based voice communications.  Targeting a diverse range of operating
   environments, SILK provides scalability in several dimensions.  Four
   different sampling frequencies are supported for encoding the audio
   input signal.
   Adaptation to network characteristics is provided through control of
   bitrate, packet rate, packet loss resilience, and use of
   discontinuous transmission (DTX).  Several complexity levels let
   SILK take advantage of available processing power without depending
   on it.  Each of these properties can be adjusted during operation of
   the codec on a frame-by-frame basis.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Technical Requirements for Internet Wideband Audio Codec . . .  4
     2.1.  Bitrate  . . . . . . . . . . . . . . . . . . . . . . . . .  4
     2.2.  Sampling Rate  . . . . . . . . . . . . . . . . . . . . . .  4
     2.3.  Complexity . . . . . . . . . . . . . . . . . . . . . . . .  4
     2.4.  Packet Loss Resilience . . . . . . . . . . . . . . . . . .  4
     2.5.  Delay  . . . . . . . . . . . . . . . . . . . . . . . . . .  4
     2.6.  DTX  . . . . . . . . . . . . . . . . . . . . . . . . . . .  4
   3.  Outline of the Codec . . . . . . . . . . . . . . . . . . . . .  6
     3.1.  Encoder  . . . . . . . . . . . . . . . . . . . . . . . . .  6
       3.1.1.   Control Parameters . . . . . . . . . . . . . . . . .  6
       3.1.2.   Voice Activity Detection . . . . . . . . . . . . . .  9
       3.1.3.   High-Pass Filter . . . . . . . . . . . . . . . . . .  9
       3.1.4.   Pitch Analysis . . . . . . . . . . . . . . . . . . . 10
       3.1.5.   Noise Shaping Analysis . . . . . . . . . . . . . . . 11
       3.1.6.   Prefilter  . . . . . . . . . . . . . . . . . . . . . 15
       3.1.7.   Prediction Analysis  . . . . . . . . . . . . . . . . 15
       3.1.8.   LSF Quantization . . . . . . . . . . . . . . . . . . 16
       3.1.9.   LTP Quantization . . . . . . . . . . . . . . . . . . 19
       3.1.10.  Noise Shaping Quantizer  . . . . . . . . . . . . . . 20
       3.1.11.  Range Encoder  . . . . . . . . . . . . . . . . . . . 20
     3.2.  Decoder  . . . . . . . . . . . . . . . . . . . . . . . . . 21
       3.2.1.   Range Decoder  . . . . . . . . . . . . . . . . . . . 22
       3.2.2.   Decode Parameters  . . . . . . . . . . . . . . . . . 22
       3.2.3.   Generate Excitation  . . . . . . . . . . . . . . . . 22
       3.2.4.   LTP Synthesis  . . . . . . . . . . . . . . . . . . . 22
       3.2.5.   LPC Synthesis  . . . . . . . . . . . . . . . . . . . 23
   4.  Reference Implementation . . . . . . . . . . . . . . . . . . . 24
   5.  Security Considerations  . . . . . . . . . . . . . . . . . . . 25
   6.  Informative References . . . . . . . . . . . . . . . . . . . . 26
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 27

1.  Introduction

   A central component in voice communications is the speech codec,
   which compresses the audio signal for efficient transmission over a
   network.  A good speech codec achieves high coding efficiency,
   meaning that it delivers high audio quality at a given bitrate.
   However, for a good user experience in a broad range of
   environments, a speech codec should also be able to adapt its
   operating point to the characteristics and limitations of the
   network, the hardware, and the audio signal.  SILK is a novel speech
   codec for real-time voice communications designed and developed by
   Skype [skype-website] to offer this kind of scalability.  This
   document describes the technical details of SILK.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119.

2.  Technical Requirements for Internet Wideband Audio Codec

   The Internet Wideband Audio Codec MUST be optimized towards real-
   time communications over the Internet, and MUST have the flexibility
   to adjust to the environment it operates in.  Below is a list of
   concrete requirements for the codec.

2.1.  Bitrate

   The codec MUST provide a quality/bitrate trade-off that is
   competitive with other state-of-the-art codecs.  It MUST be capable
   of running at bitrates below 10 kbps.
   At low bitrates it MUST deliver good quality for clean, noisy, or
   hands-free speech in any language.  At high bitrates the quality
   MUST be excellent for any audio signal, including music.  The
   bitrate MUST be adjustable in real-time.

2.2.  Sampling Rate

   The codec MUST support multiple sampling rates, ranging from
   narrowband (8 kHz) to super wideband (24 kHz or more).  Switching
   between sampling rates MUST be carried out in real-time.

2.3.  Complexity

   The codec MUST be capable of running at below 50 MHz of an x86 core
   in wideband mode (16 kHz sampling rate).  The codec SHOULD have a
   complexity that is adjustable in real-time, where a higher
   complexity setting improves the quality/bitrate trade-off.

2.4.  Packet Loss Resilience

   The codec MUST be capable of running with little error propagation,
   meaning that the decoded signal after one or more packet losses is
   close to the decoded signal without packet losses after no more than
   two additional packets.  The codec MUST have a packet loss
   resilience that is adjustable in real-time, where a lower packet
   loss resilience setting improves the quality/bitrate trade-off.

2.5.  Delay

   The codec MUST be capable of running with an algorithmic delay of no
   more than 30 milliseconds.

2.6.  DTX

   The codec SHOULD be capable of using Discontinuous Transmission
   (DTX), where packets are sent at a reduced rate when the input
   signal contains only background noise.

3.  Outline of the Codec

   The SILK codec consists of an encoder and a decoder, as described in
   Section 3.1 and Section 3.2, respectively.

3.1.  Encoder

   We start the description of the encoder by listing the parameters
   that control the operating point of the encoder.  Afterwards, we
   describe the encoder components in detail.

3.1.1.  Control Parameters

   The encoder with control parameters specifying the operating point
   is depicted in Figure 1.  All control parameters can be changed
   during regular operation of the codec, when inputting a frame of
   audio data, without interrupting the audio stream from encoder to
   decoder.  The codec control parameters are described in
   Section 3.1.1.1 to Section 3.1.1.5.

   Sampling rate --------------+
   Bitrate ------------------+ |
   Packet rate -----------+  | |
   Packet loss rate ----+ |  | |
   Complexity --------+ | |  | |
   Use DTX ---------+ | | |  | |
                     | | | | | |
                    \/\/\/\/\/\/
                  +-------------+
   Input signal ->|   Encoder   |--> Bitstream
                  +-------------+

   Block diagram illustrating the control parameters that specify the
   operating point of the SILK encoder.

                                Figure 1

3.1.1.1.  Sampling Rate

   SILK can switch in real-time between audio sampling rates of 8, 12,
   16, and 24 kHz.  A higher sampling rate improves audio quality by
   preserving a larger part of the input signal frequency range, at the
   cost of increased CPU load and bitrate.

3.1.1.2.  Bitrate

   The bitrate can be set between 6 and 40 kbps.  A higher bitrate
   improves audio quality by lowering the amount of quantization noise
   in the decoded signal.  The required bitrate for a given level of
   quantization noise is approximately linear in the sampling rate.
   Good quality is achieved at around 1 bit/sample, and at 1.5 bits/
   sample the quality becomes transparent for most material.

3.1.1.3.  Packet Rate

   SILK encodes frames of 20 milliseconds at a time and can combine 1,
   2, 3, 4, or 5 of these frames in one payload, thus creating one
   packet every 20, 40, 60, 80, or 100 milliseconds.  Because of the
   overhead from IP/UDP/RTP headers, sending fewer packets per second
   reduces the bitrate, but it increases latency and sensitivity to
   packet losses, as losing one packet constitutes the loss of a bigger
   chunk of audio signal.

3.1.1.4.  Packet Loss Resilience

   Speech codecs often exploit inter-frame correlations to reduce the
   bitrate at a cost in error propagation: after losing one packet,
   several packets need to be received before the decoder is able to
   accurately reconstruct the speech signal.  The extent to which SILK
   exploits inter-frame dependencies can be adjusted on the fly to
   choose a trade-off between bitrate and amount of error propagation.

3.1.1.5.  Complexity

   SILK has several optional optimizations that can be enabled to
   reduce the CPU load severalfold, at the cost of increasing the
   bitrate by a few percent.  The most important algorithmic parts
   controlled by the three complexity settings (high, medium, and low)
   are:

   o  The filter order of the whitening filter and the downsampling
      quality in the pitch analysis.

   o  The filter order of the short-term noise shaping filter used in
      the prefilter and noise shaping quantizer.

   o  The accuracy in the prediction analysis, the use of simulated
      output, and adjustment of the number of survivors that are
      carried over between stages in the multi-stage LSF vector
      quantization.

   o  The number of states in delayed decision quantization of the
      residual signal.

   In the following, we focus on the core encoder and describe its
   components.  For simplicity, we will refer to the core encoder
   simply as the encoder in the remainder of this document.  An
   overview of the encoder is given in Figure 2.
   [Encoder block diagram, Figure 2: the high-passed input feeds the
   Voice Activity Detector, Pitch Analysis, Noise Shaping Analysis,
   Prefilter, Prediction Analysis, LSF Quantizer, LTP Scaling Control,
   Gains Processor, Noise Shaping Quantization, and Range Encoder
   blocks, interconnected by the numbered signals listed below.]

   1:  Input speech signal
   2:  High passed input signal
   3:  Voice activity estimate
   4:  Pitch lags (per 5 ms) and voicing decision (per 20 ms)
   5:  Noise shaping quantization coefficients
       -  Short term synthesis and analysis noise shaping coefficients
          (per 5 ms)
       -  Long term synthesis and analysis noise shaping coefficients
          (per 5 ms and for voiced speech only)
       -  Noise shape tilt (per 5 ms)
       -  Quantizer gain/step size (per 5 ms)
   6:  Input signal filtered with analysis noise shaping filters
   7:  Simulated output signal
   8:  Short and long term prediction coefficients, LTP (per 5 ms) and
       LPC (per 20 ms)
   9:  LSF quantization indices
   10: LSF coefficients
   11: Quantized LSF coefficients
   12: Processed gains and synthesis noise shape coefficients
   13: LTP state scaling coefficient, controlling the error
       propagation / prediction gain trade-off
   14: Quantized signal
   15: Range encoded bitstream

   Encoder block diagram.

                                Figure 2

3.1.2.  Voice Activity Detection

   The input signal is processed by a VAD (Voice Activity Detector) to
   produce a measure of voice activity, and also spectral tilt and
   signal-to-noise estimates, for each frame.  The VAD uses a sequence
   of half-band filterbanks to split the signal into four subbands:
   0 - Fs/16, Fs/16 - Fs/8, Fs/8 - Fs/4, and Fs/4 - Fs/2, where Fs is
   the sampling frequency (8, 12, 16, or 24 kHz).  The lowest subband,
   from 0 to Fs/16, is high-pass filtered with a first-order MA (Moving
   Average) filter (with transfer function H(z) = 1 - z^(-1)) to reduce
   the energy at the lowest frequencies.  For each frame, the signal
   energy per subband is computed.  In each subband, a noise level
   estimator tracks the background noise level, and an SNR (Signal-to-
   Noise Ratio) value is computed as the logarithm of the ratio of
   energy to noise level.  Using these intermediate variables, the
   following parameters are calculated for use in other SILK modules:

   o  Average SNR.  The average of the subband SNR values.

   o  Smoothed subband SNRs.  Temporally smoothed subband SNR values.

   o  Speech activity level.  Based on the average SNR and a weighted
      average of the subband energies.

   o  Spectral tilt.  A weighted average of the subband SNRs, with
      positive weights for the low subbands and negative weights for
      the high subbands.

3.1.3.  High-Pass Filter

   The input signal is filtered by a high-pass filter to remove the
   lowest part of the spectrum, which contains little speech energy and
   may contain background noise.  This is a second-order ARMA (Auto
   Regressive Moving Average) filter with a cut-off frequency around
   70 Hz.
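   As an illustration, such a second-order ARMA high-pass can be
   realized as a biquad.  The coefficient design below (an RBJ-style
   Butterworth high-pass) and the 16 kHz / 70 Hz parameters are
   illustrative assumptions; the draft does not specify SILK's actual
   filter coefficients:

```python
import math

def highpass_biquad(fc_hz, fs_hz, q=0.7071):
    """Second-order (biquad) high-pass design, RBJ cookbook style.
    Illustrative only: the draft does not give SILK's coefficients."""
    w0 = 2.0 * math.pi * fc_hz / fs_hz
    alpha = math.sin(w0) / (2.0 * q)
    c = math.cos(w0)
    b = [(1 + c) / 2.0, -(1 + c), (1 + c) / 2.0]
    a = [1 + alpha, -2.0 * c, 1 - alpha]
    return [x / a[0] for x in b], [x / a[0] for x in a]  # a0 -> 1

def filter_arma(b, a, x):
    """Direct Form I difference equation:
    y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]."""
    x1 = x2 = y1 = y2 = 0.0
    y = []
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

# 70 Hz cut-off at a 16 kHz sampling rate (wideband mode)
b, a = highpass_biquad(70.0, 16000.0)
```

   A DC input decays to zero at the filter output, while speech-band
   content well above the cut-off passes with near-unity gain.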
   In the future, a music detector may also be used to lower the cut-
   off frequency when the input signal is detected to be music rather
   than speech.

3.1.4.  Pitch Analysis

   The high-passed input signal is processed by the open loop pitch
   estimator shown in Figure 3.

   [Pitch estimator block diagram, Figure 3: the input is whitened by
   an LPC-based whitening filter and analyzed by time-correlators at
   two downsampling factors (4x and 2x) and at the full rate, with an
   LPC analysis feeding the whitening filter and a speech type
   decision derived from the correlations.  The numbered signals are
   listed below.]

   1: Input signal
   2: Lag candidates from stage 1
   3: Lag candidates from stage 2
   4: Correlation threshold
   5: Voiced/unvoiced flag
   6: Pitch correlation
   7: Pitch lags

   Block diagram of the pitch estimator.

                                Figure 3

   The pitch analysis finds a binary voiced/unvoiced classification
   and, for frames classified as voiced, four pitch lags per frame -
   one for each 5 ms subframe - and a pitch correlation indicating the
   periodicity of the signal.  The input is first whitened using a
   Linear Prediction (LP) whitening filter, where the coefficients are
   computed through standard Linear Prediction Coding (LPC) analysis.
   The order of the whitening filter is 16 for best results, but is
   reduced to 12 for medium complexity and 8 for low complexity modes.
   The whitened signal is analyzed to find pitch lags for which the
   time correlation is high.
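   The whitening step above can be sketched with a textbook
   autocorrelation-method LPC analysis and Levinson-Durbin recursion;
   this floating-point version is an assumption for illustration, and
   SILK's fixed-point implementation differs in its details:

```python
def lpc_autocorr(x, order):
    """LPC coefficients via the autocorrelation method and the
    Levinson-Durbin recursion.  Returns predictor coefficients a[0..p-1]
    (a[k-1] is the draft's a(k)) and the residual energy."""
    n = len(x)
    r = [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]
    a = [0.0] * order
    err = r[0]
    for m in range(order):
        acc = r[m + 1] - sum(a[j] * r[m - j] for j in range(m))
        k = acc / err                      # reflection coefficient
        a_new = a[:]
        a_new[m] = k
        for j in range(m):
            a_new[j] = a[j] - k * a[m - 1 - j]
        a = a_new
        err *= (1.0 - k * k)
    return a, err

def whiten(x, a):
    """Prediction error filter: e[n] = x[n] - sum_k a[k] * x[n-1-k]."""
    p = len(a)
    return [x[n] - sum(a[k] * x[n - 1 - k] for k in range(p) if n - 1 - k >= 0)
            for n in range(len(x))]
```

   Filtering a strongly correlated signal with its own order-8 to
   order-16 predictor removes most of the spectral envelope, leaving a
   residual whose energy is far below that of the input.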
   The analysis consists of three stages, to reduce the complexity:

   o  In the first stage, the whitened signal is downsampled by a
      factor of four, and the current frame is correlated to a signal
      delayed by a range of lags, starting from a shortest lag
      corresponding to 500 Hz to a longest lag corresponding to 56 Hz.

   o  The second stage operates on a two times downsampled signal and
      measures time correlations only near the lags corresponding to
      those that had sufficiently high correlations in the first stage.
      The resulting correlations are adjusted for a small bias towards
      short lags to avoid ending up with a multiple of the true pitch
      lag.  The highest adjusted correlation is compared to a threshold
      depending on:

      *  Whether the previous frame was classified as voiced

      *  The speech activity level

      *  The spectral tilt.

      If the threshold is exceeded, the current frame is classified as
      voiced, and the lag with the highest adjusted correlation is
      stored for a final pitch analysis of the highest precision in the
      third stage.

   o  The last stage operates directly on the whitened input signal to
      compute time correlations for each of the four subframes
      independently, in a narrow range around the lag with highest
      correlation from the second stage.

3.1.5.  Noise Shaping Analysis

   The noise shaping analysis finds gains and filter coefficients used
   in the prefilter and noise shaping quantizer.  These parameters are
   chosen such that they fulfil several requirements:

   o  Balancing quantization noise and bitrate.  The quantization gains
      determine the step size between reconstruction levels of the
      excitation signal.  Therefore, increasing the quantization gain
      amplifies quantization noise, but it also reduces the bitrate by
      lowering the entropy of the quantization indices.

   o  Spectral shaping of the quantization noise; the noise shaping
      quantizer is capable of reducing quantization noise in some parts
      of the spectrum at the cost of increased noise in other parts,
      without substantially changing the bitrate.
      By shaping the noise such that it follows the signal spectrum, it
      becomes less audible.  In practice, best results are obtained by
      making the shape of the noise spectrum slightly flatter than the
      signal spectrum.

   o  Deemphasizing spectral valleys; by using different coefficients
      in the analysis and synthesis part of the prefilter and noise
      shaping quantizer, the levels of the spectral valleys can be
      decreased relative to the levels of the spectral peaks, such as
      speech formants and harmonics.  This reduces the entropy of the
      signal, which is the difference between the coded signal and the
      quantization noise, thus lowering the bitrate.

   o  Matching the levels of the decoded speech formants to the levels
      of the original speech formants; an adjustment gain and a first
      order tilt coefficient are computed to compensate for the effect
      of the noise shaping quantization on the level and spectral tilt.

   [Spectrum illustration, Figure 4: power versus frequency for three
   curves, labeled as follows.]

   1: Input signal spectrum
   2: Deemphasized and level matched spectrum
   3: Quantization noise spectrum

   Noise shaping and spectral de-emphasis illustration.

                                Figure 4

   Figure 4 shows an example of an input signal spectrum (1).  After
   de-emphasis and level matching, the spectrum has deeper valleys (2).
   The quantization noise spectrum (3) more or less follows the input
   signal spectrum, having slightly less pronounced peaks.  The
   entropy, which provides a lower bound on the bitrate for encoding
   the excitation signal, is proportional to the area between the
   deemphasized spectrum (2) and the quantization noise spectrum (3).
   Without de-emphasis, the entropy is proportional to the area between
   input spectrum (1) and quantization noise (3) - clearly higher.
   The transformation from input signal to deemphasized signal can be
   described as a filtering operation with a filter

      H(z) = G * (1 - c_tilt * z^(-1)) * Wana(z) / Wsyn(z),

   having an adjustment gain G, a first order tilt adjustment filter
   with tilt coefficient c_tilt, and where

      Wana(z) = (1 - sum_{k=1}^{16} a_ana(k) * z^(-k))
                  * (1 - z^(-L) * sum_{k=-d}^{d} b_ana(k) * z^(-k))

   is the analysis part of the de-emphasis filter, consisting of the
   short-term shaping filter with coefficients a_ana(k) and the long-
   term shaping filter with coefficients b_ana(k) and pitch lag L.  The
   parameter d determines the number of long-term shaping filter taps.
   Similarly, but without the tilt adjustment, the synthesis part can
   be written as

      Wsyn(z) = (1 - sum_{k=1}^{16} a_syn(k) * z^(-k))
                  * (1 - z^(-L) * sum_{k=-d}^{d} b_syn(k) * z^(-k)).

   All noise shaping parameters are computed and applied per subframe
   of 5 milliseconds.  First, an LPC analysis is performed on a
   windowed signal block of 16 milliseconds.  The signal block has a
   look-ahead of 5 milliseconds relative to the current subframe, and
   the window is an asymmetric sine window.  The LPC analysis is done
   with the autocorrelation method, with an order of 16 for best
   quality or 12 in low complexity operation.

   The quantization gain is found as the square-root of the residual
   energy from the LPC analysis, multiplied by a value inversely
   proportional to the coding quality control parameter and the pitch
   correlation.

   Next we find the two sets of short-term noise shaping coefficients
   a_ana(k) and a_syn(k) by applying different amounts of bandwidth
   expansion to the coefficients found in the LPC analysis.  This
   bandwidth expansion moves the roots of the LPC polynomial towards
   the origin, using the formulas
      a_ana(k) = a(k) * g_ana^k, and

      a_syn(k) = a(k) * g_syn^k,

   where a(k) is the k'th LPC coefficient, and the bandwidth expansion
   factors g_ana and g_syn are calculated as

      g_ana = 0.94 - 0.02 * C, and

      g_syn = 0.94 + 0.02 * C,

   where C is the coding quality control parameter between 0 and 1.
   Applying more bandwidth expansion to the analysis part than to the
   synthesis part gives the desired de-emphasis of spectral valleys in
   between formants.

   The long-term shaping is applied only during voiced frames.  It uses
   three filter taps, described by

      b_ana = F_ana * [0.25, 0.5, 0.25], and

      b_syn = F_syn * [0.25, 0.5, 0.25].

   For unvoiced frames these coefficients are set to 0.  The
   multiplication factors F_ana and F_syn are chosen between 0 and 1,
   depending on the coding quality control parameter, as well as the
   calculated pitch correlation and the smoothed subband SNR of the
   lowest subband.  By having F_ana less than F_syn, the pitch
   harmonics are emphasized relative to the valleys in between the
   harmonics.

   The tilt coefficient c_tilt is chosen as

      c_tilt = 0.4

   for unvoiced frames, and as

      c_tilt = 0.04 + 0.06 * C

   for voiced frames, where C again is the coding quality control
   parameter between 0 and 1.

   The adjustment gain G serves to correct any level mismatch between
   the original and the decoded signal that might arise from the noise
   shaping and de-emphasis.  This gain is computed as the ratio of the
   prediction gains of the short-term analysis and synthesis filter
   coefficients.  The prediction gain of an LPC synthesis filter is the
   square-root of the output energy when the filter is excited by a
   unit-energy impulse on the input.
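   The bandwidth expansion and long-term shaping formulas above are
   simple enough to transcribe directly; this sketch follows the
   equations in the text, with no SILK-specific rounding:

```python
def shaping_coefficients(a, C):
    """Short-term noise shaping coefficients via bandwidth expansion:
    a_ana(k) = a(k)*g_ana^k and a_syn(k) = a(k)*g_syn^k, with
    g_ana = 0.94 - 0.02*C and g_syn = 0.94 + 0.02*C, 0 <= C <= 1.
    a[k-1] holds the draft's a(k)."""
    g_ana = 0.94 - 0.02 * C
    g_syn = 0.94 + 0.02 * C
    a_ana = [ak * g_ana ** (k + 1) for k, ak in enumerate(a)]
    a_syn = [ak * g_syn ** (k + 1) for k, ak in enumerate(a)]
    return a_ana, a_syn

def long_term_taps(F):
    """Three-tap long-term shaping filter, b = F * [0.25, 0.5, 0.25];
    F is F_ana or F_syn, between 0 and 1 (0 for unvoiced frames)."""
    return [F * 0.25, F * 0.5, F * 0.25]
```

   Because g_ana < g_syn, every analysis coefficient is shrunk more
   than the corresponding synthesis coefficient, which is what
   de-emphasizes the spectral valleys between formants.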
   An efficient way to compute the prediction gain is to first compute
   the reflection coefficients from the LPC coefficients through the
   step-down algorithm, and then extract the prediction gain from the
   reflection coefficients as

      predGain = ( prod_{k=1}^{K} (1 - r_k^2) )^(-0.5),

   where r_k is the k'th reflection coefficient.

   Initial values for the quantization gains are computed as the
   square-root of the residual energy of the LPC analysis, adjusted by
   the coding quality control parameter.  These quantization gains are
   later adjusted based on the results of the prediction analysis.

3.1.6.  Prefilter

   In the prefilter the input signal is filtered using the spectral
   valley de-emphasis filter coefficients from the noise shaping
   analysis; see Section 3.1.5.  The filter output is called the
   simulated output signal and is passed on to the prediction analysis.
   Also, by applying only the analysis part of the noise shaping filter
   to the input signal, the prefilter provides the input to the noise
   shaping quantizer.

3.1.7.  Prediction Analysis

   The prediction analysis is performed in one of two ways, depending
   on how the pitch estimator classified the frame.  The processing for
   voiced and unvoiced speech is described in Section 3.1.7.1 and
   Section 3.1.7.2, respectively.  Inputs to this function include the
   pre-whitened signal from the pitch estimator; see Section 3.1.4.

3.1.7.1.  Voiced Speech

   For a frame of voiced speech, the pitch pulses will remain dominant
   in the pre-whitened input signal.  Further whitening is desirable as
   it leads to higher quality at the same available bitrate.  To
   achieve this, a Long-Term Prediction (LTP) analysis is carried out
   to estimate the coefficients of a fifth order LTP filter for each of
   the four subframes.  The LTP coefficients are used to compute an LTP
   residual signal, with the simulated output signal as input, to
   obtain better modelling of the output signal.  This LTP residual
   signal is the input to an LPC analysis where the LPCs are estimated
   using the
   covariance method, such that the residual energy is minimized.  The
   estimated LPCs are converted to a Line Spectral Frequency (LSF)
   vector and quantized as described in Section 3.1.8.  After
   quantization, the quantized LSF vector is converted back to LPC
   coefficients; by using these quantized coefficients, the encoder
   remains fully synchronized with the decoder.  The LTP coefficients
   are quantized using the method described in Section 3.1.9.  The
   quantized LPC and LTP coefficients are then used to filter the
   simulated output signal and measure a residual energy for each of
   the four subframes.

3.1.7.2.  Unvoiced Speech

   For a speech signal that has been classified as unvoiced, there is
   no need for LTP filtering, as it has already been determined that
   the pre-whitened input signal is not periodic enough within the
   allowed pitch period range for an LTP analysis to be worth the cost
   in terms of complexity and rate.  Therefore, the pre-whitened input
   signal is discarded, and instead the simulated output is used for
   LPC analysis using the covariance method.  The resulting LPC
   coefficients are converted to an LSF vector, quantized as described
   in the following section, and transformed back to obtain quantized
   LPC coefficients.  The quantized LPC coefficients are used to filter
   the simulated output signal and measure a residual energy for each
   of the four subframes.

3.1.8.  LSF Quantization

   The purpose of quantization is to significantly lower the bitrate
   at the cost of some introduced distortion.  A higher rate should
   always lead to lower distortion, and lowering the rate will
   generally lead to higher distortion.  A commonly used but generally
   sub-optimal approach is to use a quantization method with a constant
   rate, where only the error is minimized when quantizing.

3.1.8.1.  Rate-Distortion Optimization

   Instead, we minimize an objective function that consists of a
   weighted sum of rate and distortion, and use a codebook with an
   associated non-uniform rate table.  Thus, we take into account that
   the probability mass function for selecting the codebook entries is
   by no means guaranteed to be uniform in our scenario.  This approach
   ensures that rarely used codebook vector centroids, which model
   statistical outliers in the training set, can be quantized with a
   low error but at a relatively high rate, while frequently used
   centroids are modelled with low error and a relatively low rate.
   This approach will lead to equal or lower distortion than the fixed
   rate codebook at any given average rate, provided that the data is
   similar to the data used for training the codebook.

3.1.8.2.  Error Mapping

   Instead of minimizing the error in the LSF domain, we map the errors
   to spectral distortion by applying a weight to the error of each
   element in the error vector.  These weight vectors are calculated
   for each input vector as a linear approximation of the true mapping
   function, which is accurate for small errors.  Consequently, we
   solve the following minimization problem:

      LSF_q = argmin { (LSF - c)' * W * (LSF - c) + mu * rate },
              c in C

   where LSF_q is the quantized vector, LSF is the input vector to be
   quantized, and c is a quantized LSF vector candidate taken from the
   set C of all possible outcomes of the codebook.

3.1.8.3.  Multi-Stage Vector Codebook

   We arrange the codebook in a multiple stage structure to achieve a
   quantizer that is both memory efficient and highly scalable in terms
   of computational complexity; see e.g. [sinervo-norsig].
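   The minimization above can be sketched as follows.  Taking W as a
   diagonal matrix is a simplifying assumption for illustration (the
   text only says a weight is applied per element of the error vector),
   and the codebook and rate values are hypothetical:

```python
def weighted_rd_cost(lsf, c, w_diag, rate, mu):
    """Objective (LSF - c)' * W * (LSF - c) + mu * rate, with W
    diagonal for simplicity."""
    dist = sum(w * (x - ci) ** 2 for w, x, ci in zip(w_diag, lsf, c))
    return dist + mu * rate

def quantize_lsf(lsf, codebook, rates, w_diag, mu):
    """Pick the codebook entry minimizing the weighted rate-distortion
    objective; 'rates' is the associated non-uniform rate table in bits."""
    costs = [weighted_rd_cost(lsf, c, w_diag, r, mu)
             for c, r in zip(codebook, rates)]
    best = min(range(len(codebook)), key=costs.__getitem__)
    return best, costs[best]
```

   With mu = 0 the search reduces to pure weighted-error minimization;
   a larger mu biases the choice towards frequently used, cheaply coded
   entries, which is exactly the rate-distortion trade-off described
   above.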
   In the first stage the input is the LSF vector to be quantized, and
   in any other stage s > 1, the input is the quantization error from
   the previous stage; see Figure 5.

      Stage 1:         Stage 2:              Stage S:
    +----------+     +----------+          +----------+
    | c_{1,1}  |     | c_{2,1}  |          | c_{S,1}  |
    +----------+     +----------+          +----------+
    | c_{1,2}  |     | c_{2,2}  |          | c_{S,2}  |
    +----------+     +----------+          +----------+
        ...              ...                   ...
    +----------+     +----------+          +----------+
    |c_{1,M1-1}|     |c_{2,M2-1}|          |c_{S,MS-1}|
    +----------+     +----------+          +----------+
    | c_{1,M1} |     | c_{2,M2} |          | c_{S,MS} |
    +----------+     +----------+          +----------+

    LSF ---> Stage 1 --res_1--> Stage 2 --> ... --> Stage S,
    with final quantization error res_S = LSF - LSF_q.

   Multi-Stage LSF Vector Codebook Structure.

                                Figure 5

   By storing a total of

      M = sum_{s=1}^{S} M_s

   codebook vectors, where M_s is the number of vectors in stage s, we
   obtain a total of

      T = prod_{s=1}^{S} M_s

   possible combinations for generating the quantized vector.  It is
   for example possible to represent 2^36 unique vectors using only 216
   vectors in memory, as done in SILK for voiced speech at all sampling
   frequencies above 8 kHz.

3.1.8.4.  Survivor Based Codebook Search

   This number of possible combinations is far too high to carry out a
   full search for each frame, so for all stages but the last, i.e.,
   for s smaller than S, only the best min(L, M_s) centroids are
   carried over to stage s+1.  In each stage, the objective function,
   i.e., the weighted sum of accumulated bitrate and distortion, is
   evaluated for each codebook vector entry, and the results are
   sorted.  Only the best paths and the corresponding quantization
   errors are considered in the next stage.  In the last stage, S, the
   single best path through the multistage codebook is determined.
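   The survivor-based search can be sketched as a beam search over the
   stages.  This toy version is an assumption for illustration: it uses
   identity error weighting (no W matrix) and hypothetical codebooks,
   and it ranks paths by the current residual energy plus mu times the
   accumulated rate:

```python
def msvq_search(x, stages, rates, mu, L):
    """Survivor-based multi-stage VQ search: after each stage, keep at
    most L best paths, ranked by residual energy plus mu * bits."""
    # Each path: (cost, accumulated bits, chosen indices, residual)
    paths = [(0.0, 0.0, [], list(x))]
    for cb, cb_rates in zip(stages, rates):
        cand = []
        for _, bits, idx, res in paths:
            for i, c in enumerate(cb):
                new_res = [r - ci for r, ci in zip(res, c)]
                dist = sum(v * v for v in new_res)
                new_bits = bits + cb_rates[i]
                cand.append((dist + mu * new_bits, new_bits,
                             idx + [i], new_res))
        cand.sort(key=lambda t: t[0])
        paths = cand[:L]          # survivors carried to the next stage
    cost, bits, idx, res = paths[0]
    return idx, cost
```

   Setting L large enough to keep every path reproduces the full
   search; L = 1 is the greedy search, with possibly a worse final
   objective value.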
   By varying the maximum number of survivors kept from each stage to
   the next, L, the complexity can be adjusted in real-time, at the
   cost of a potential decrease in the objective function for the
   resulting quantized vector.  This approach scales all the way
   between the two extremes: L = 1 is a greedy search, and L = T/M_S is
   the desirable but infeasible full search.  In fact, performance
   almost as good as that of the infeasible full search can be obtained
   at a substantially lower complexity by using this approach; see e.g.
   [leblanc-tsap].

3.1.8.5.  LSF Stabilization

   If the input is stable, finding the best candidate will usually
   result in the quantized vector also being stable.  However, due to
   the multi-stage approach, it could in theory happen that the best
   quantization candidate is unstable, so there is a need to explicitly
   ensure that the quantized vectors are stable.  Therefore, we apply
   an LSF stabilization method which ensures that the LSF parameters
   are within their valid range, are increasingly sorted, and keep
   minimum distances from each other and from the border values; these
   minimum distances have been pre-determined as the 0.01 percentile
   distance values from a large training set.

3.1.8.6.  Off-Line Codebook Training

   The vectors and rate tables for the multi-stage codebook are trained
   by minimizing the average of the objective function for LSF vectors
   from a large training set.

3.1.9.  LTP Quantization

   For voiced frames, the prediction analysis described in
   Section 3.1.7.1 resulted in four sets (one set per subframe) of five
   LTP coefficients, plus four weighting matrices.  The LTP
   coefficients for each subframe are quantized using entropy
   constrained vector quantization.  A total of three vector codebooks
   are available for quantization, with different rate-distortion
   trade-offs.
   The three codebooks have 10, 20, and 40 vectors and average rates
   of about 3, 4, and 5 bits per vector, respectively.  Consequently,
   the first codebook has larger average quantization distortion at a
   lower rate, whereas the last codebook has smaller average
   quantization distortion at a higher rate.  Given the weighting
   matrix W_ltp and LTP vector b, the weighted rate-distortion measure
   for a codebook vector cb_i with rate r_i is given by

      RD = u * (b - cb_i)' * W_ltp * (b - cb_i) + r_i,

   where u is a fixed, heuristically-determined parameter balancing
   the distortion and rate.  Which codebook gives the best performance
   for a given LTP vector depends on the weighting matrix for that LTP
   vector.  For example, for a low-valued W_ltp, it is advantageous to
   use the codebook with 10 vectors as it has a lower average rate.
   For a large W_ltp, on the other hand, it is often better to use the
   codebook with 40 vectors, as it is more likely to contain the best
   codebook vector.

   The weighting matrix W_ltp depends mostly on two aspects of the
   input signal.  The first is the periodicity of the signal; the more
   periodic the signal, the larger W_ltp.  The second is the change in
   signal energy in the current subframe, relative to the signal one
   pitch lag earlier.  A decaying energy leads to a larger W_ltp than
   an increasing energy.  Neither aspect fluctuates rapidly, so the
   W_ltp matrices for the different subframes of one frame are often
   similar.  As a result, one of the three codebooks typically gives
   good performance for all subframes.  The codebook search for the
   subframe LTP vectors is therefore constrained to choose all
   codebook vectors from the same codebook, resulting in a rate
   reduction.
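The RD measure above is straightforward to evaluate. The sketch below picks the best vector within one codebook; the codebook contents, rates, and the value of u are illustrative only, not SILK's trained values:

```python
import numpy as np

def best_ltp_vector(b, cb, rates, W_ltp, u=0.1):
    """Return the index i minimizing
       RD = u * (b - cb_i)' * W_ltp * (b - cb_i) + r_i."""
    best_i, best_rd = -1, float("inf")
    for i in range(len(cb)):
        e = b - cb[i]
        rd = u * e @ W_ltp @ e + rates[i]   # weighted distortion plus rate
        if rd < best_rd:
            best_i, best_rd = i, rd
    return best_i
```

Note how W_ltp steers the outcome, matching the discussion above: when W_ltp is small, the rate term dominates and a cheap vector wins; when W_ltp is large, distortion dominates and the closest vector wins even at a higher rate.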
   To find the best codebook, each of the three vector codebooks is
   used to quantize all subframe LTP vectors, and a combined weighted
   rate-distortion measure is computed for each vector codebook.  The
   vector codebook with the lowest combined rate-distortion over all
   subframes is chosen.  The quantized LTP vectors are used in the
   noise shaping quantizer, and the index of the chosen codebook plus
   the four indices for the four subframe codebook vectors are passed
   on to the range encoder.

3.1.10.  Noise Shaping Quantizer

   The noise shaping quantizer independently shapes the signal and
   coding noise spectra to obtain perceptually higher quality at the
   same bitrate.

   The prefilter output signal is multiplied with a compensation gain
   G computed in the noise shaping analysis.  Then the output of a
   synthesis shaping filter is added, and the output of a prediction
   filter is subtracted, to create a residual signal.  The residual
   signal is multiplied by the inverse of the quantized quantization
   gain from the noise shaping analysis, and input to a scalar
   quantizer.  The quantization indices of the scalar quantizer
   represent a signal of pulses that is input to the pyramid range
   encoder.  The scalar quantizer also outputs a quantization signal,
   which is multiplied by the quantized quantization gain from the
   noise shaping analysis to create an excitation signal.  The output
   of the prediction filter is added to the excitation signal to form
   the quantized output signal y(n).  The quantized output signal y(n)
   is input to the synthesis shaping and prediction filters.

3.1.11.  Range Encoder

   Range encoding is a well-known method for entropy coding in which a
   bitstream sequence is continually updated with every new symbol,
   based on the probability of that symbol.  It is similar to
   arithmetic coding, but rather than being restricted to generating
   binary output symbols, it can generate symbols in any chosen number
   base.
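The interval-narrowing idea behind range coding can be shown with a toy encoder/decoder pair. This sketch uses exact rational arithmetic for clarity; a real range coder, including SILK's, uses finite-precision integer arithmetic with renormalization, and the cumulative frequency table here is purely illustrative:

```python
from fractions import Fraction

def rc_encode(symbols, cum_freq, total):
    """Narrow [low, low+width) by each symbol's probability interval;
    cum_freq[s]..cum_freq[s+1] out of `total` is symbol s's range."""
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        low += width * Fraction(cum_freq[s], total)
        width *= Fraction(cum_freq[s + 1] - cum_freq[s], total)
    return low + width / 2      # any number inside the final interval

def rc_decode(code, n, cum_freq, total):
    """Invert rc_encode: locate which symbol interval contains the
    code, emit that symbol, and narrow the interval identically."""
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(n):
        target = (code - low) / width * total
        s = max(i for i in range(len(cum_freq) - 1) if cum_freq[i] <= target)
        out.append(s)
        low += width * Fraction(cum_freq[s], total)
        width *= Fraction(cum_freq[s + 1] - cum_freq[s], total)
    return out
```

More probable symbols shrink the interval less and therefore cost fewer bits, which is how the per-parameter cumulative density functions mentioned below translate into compression.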
   In SILK all side information is range encoded.  Each quantized
   parameter has its own cumulative density function based on
   histograms of the quantization indices obtained from a training
   database.

3.1.11.1.  Bitstream Encoding Details

   TBD.

3.2.  Decoder

   At the receiving end, the range decoder splits each received packet
   into the frames it contains, each of which holds the information
   necessary to reconstruct a 20 ms frame of the output signal.  An
   overview of the decoder is given in Figure 6.

             +---+
             | R |
             | a |
             | n |
             | g |
             | e |    +------------+
          -->|   |--->|   Decode   |----------------------------+
           1 | D | 2  | Parameters |----------+                 |
             | e |    +------------+        4 |               5 |
             | c |        3 |                 |                 |
             | o |          \/                \/                \/
             | d |    +------------+    +------------+    +------------+
             | e |    |  Generate  |--->|    LTP     |--->|    LPC     |--->
             | r |    | Excitation |    | Synthesis  |    | Synthesis  | 6
             +---+    +------------+    +------------+    +------------+

   1: Range encoded bitstream
   2: Coded parameters
   3: Pulses and gains
   4: Pitch lags and LTP coefficients
   5: LPC coefficients
   6: Decoded signal

                         Decoder block diagram.

                                Figure 6

3.2.1.  Range Decoder

   The range decoder decodes the encoded parameters from the received
   bitstream.  Its output includes the pulses and gains for generating
   the excitation signal, as well as the LTP and LSF codebook indices
   needed to decode the LTP and LPC coefficients used for LTP and LPC
   synthesis filtering of the excitation signal, respectively.

3.2.2.  Decode Parameters

   Pulses and gains are decoded from the range decoded bitstream in
   the following way... (TBD)

   When a voiced frame is decoded and the LTP codebook selection and
   indices are received, the LTP coefficients are decoded using the
   selected codebook by choosing the vector that corresponds to the
   given codebook index.  This is done for each of the four subframes.
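The voiced-speech signal path of Figure 6, excitation generation (Section 3.2.3), LTP synthesis (Section 3.2.4), and LPC synthesis (Section 3.2.5), can be sketched end to end. All coefficient values below are illustrative, history before the frame is assumed to be zero, and the LPC synthesis is written in its standard all-pole (recursive) form:

```python
def decode_frame(pulses, gain, b, L, a):
    """Toy decoder path.  b holds 2*d+1 LTP coefficients (taps -d..d)
    for pitch lag L; a holds the d_LPC LPC coefficients."""
    d = len(b) // 2
    # 3.2.3: excitation = pulses * quantization gain
    e = [p * gain for p in pulses]
    # 3.2.4: e_LPC(n) = e(n) + sum_{i=-d..d} e(n - L - i) * b_i
    e_lpc = []
    for n in range(len(e)):
        acc = e[n]
        for i in range(-d, d + 1):
            k = n - L - i
            if 0 <= k < len(e):
                acc += e[k] * b[i + d]
        e_lpc.append(acc)
    # 3.2.5: y(n) = e_LPC(n) + sum_{i=1..d_LPC} y(n - i) * a_i
    y = []
    for n in range(len(e_lpc)):
        acc = e_lpc[n]
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                acc += y[n - i] * a[i - 1]
        y.append(acc)
    return y
```

For unvoiced speech the LTP step is skipped (e_LPC(n) = e(n)) and only the LPC synthesis loop applies.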
   The LPC coefficients are decoded from the LSF codebook by first
   adding the chosen vectors, one vector from each stage of the
   codebook.  The resulting LSF vector is stabilized using the same
   method as in the encoder, see Section 3.1.8.5.  The LSF
   coefficients are then converted to LPC coefficients, and passed on
   to the LPC synthesis filter.

3.2.3.  Generate Excitation

   The pulse signal is multiplied by the quantization gain to create
   the excitation signal.

3.2.4.  LTP Synthesis

   For voiced speech, the excitation signal e(n) is input to an LTP
   synthesis filter that recreates the long-term correlation removed
   by the LTP analysis filter and generates an LPC excitation signal
   e_LPC(n), according to

                            d
                           ___
      e_LPC(n) = e(n) +    \   e(n - L - i) * b_i,
                           /__
                          i=-d

   using the pitch lag L and the decoded LTP coefficients b_i.  For
   unvoiced speech, the output signal is simply a copy of the input
   signal, i.e., e_LPC(n) = e(n).

3.2.5.  LPC Synthesis

   In a similar manner, the short-term correlation that was removed by
   the LPC analysis filter is recreated in the LPC synthesis filter.
   The LPC excitation signal e_LPC(n) is filtered using the LPC
   coefficients a_i, according to

                           d_LPC
                            ___
      y(n) = e_LPC(n) +     \   y(n - i) * a_i,
                            /__
                            i=1

   where d_LPC is the LPC synthesis filter order, and y(n) is the
   decoded output signal.

4.  Reference Implementation

   To Be Defined.

5.  Security Considerations

   To Be Defined.

6.  Informative References

   [leblanc-tsap]
              LeBlanc, W., Bhattacharya, B., Mahmoud, S., and V.
              Cuperman, "Efficient Search and Design Procedures for
              Robust Multi-Stage VQ of LPC Parameters for 4 kb/s
              Speech Coding", IEEE Transactions on Speech and Audio
              Processing, Vol. 1, No. 4, October 1993.

   [sinervo-norsig]
              Sinervo, U., Nurminen, J., Heikkinen, A., and J.
              Saarinen, "Evaluation of Split and Multistage Techniques
              in LSF Quantization", NORSIG-2001, Norsk symposium i
              signalbehandling, Trondheim, Norge, October 2001.

   [skype-website]
              "Skype", Skype website http://www.skype.com/.

Authors' Addresses

   Koen Vos
   Skype Technologies S.A.
   Stadsgaarden 6
   Stockholm  11645
   SE

   Phone: +46 855 921 989
   Email: koen.vos@skype.net


   Soeren Skak Jensen
   Skype Technologies S.A.
   Stadsgaarden 6
   Stockholm  11645
   SE

   Phone: +46 855 921 989
   Email: soren.skak.jensen@skype.net


   Karsten Vandborg Soerensen
   Skype Technologies S.A.
   Stadsgaarden 6
   Stockholm  11645
   SE

   Phone: +46 855 921 989
   Email: karsten.vandborg.sorensen@skype.net