New Approaches to Fine-Grain Scalable Audio Coding
Ph.D. Thesis, December 2015
Supervisor: P. Kabal
Bit-rate scalability has been a useful feature in multimedia communications. It allows the quality of a signal to be increased or decreased as more or less of the total bit stream becomes available, without re-encoding the original signal. With scalable coding, there is no need to store multiple versions of a signal encoded at different bit-rates. Scalable coding can also serve users with different constraints or varying channel conditions; i.e., receivers with lower channel capacities can receive signals at lower bit-rates. It is especially useful in client-server applications, where network nodes can drop enhancement-layer bits to satisfy link capacity constraints.
In this dissertation, we provide three contributions to practical scalable audio coding systems. Our first contribution is scalable audio coding using watermarking. The proposed scheme uses watermarking to embed some of the information of each layer into the previous layer. This approach saves bit-rate, so that it outperforms (in terms of rate-distortion) the common scalable audio coding based on reconstruction error quantization (REQ) as used in MPEG-4 audio. Our second contribution concerns scalable coding based on bit-plane coding (BPC). Considering the properties of the residual signal, core-based bit-plane probabilities are derived for MPEG-4 audio scalable to lossless coding (SLS) that match the quantization and coding performed in the core layer. Simulations show that proper consideration of the core-layer parameters improves the bit-plane probability estimates compared to the existing method.
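The bit-plane decomposition underlying BPC can be sketched as follows. This is a generic illustration: the coefficient values, plane count, and the `bit_planes` / `plane_one_probability` helpers are hypothetical, not the SLS implementation. It shows the per-plane "1"-bit probabilities an entropy coder for each plane would be driven by.

```python
import numpy as np

def bit_planes(coeffs, num_planes):
    """Split integer coefficient magnitudes into bit-planes (MSB first)."""
    mags = np.abs(coeffs).astype(np.int64)
    planes = []
    for p in range(num_planes - 1, -1, -1):
        planes.append((mags >> p) & 1)
    return planes

def plane_one_probability(plane):
    """Empirical probability of a '1' bit in one plane."""
    return float(np.mean(plane))

# Toy residual after core-layer quantization (signs coded separately)
residual = np.array([3, -1, 0, 5, 2, 0, -4, 1])
planes = bit_planes(residual, num_planes=3)
probs = [plane_one_probability(p) for p in planes]
```

Core-aware probability estimation, as studied in the thesis, replaces these raw empirical frequencies with estimates conditioned on the core-layer quantization parameters.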
Perhaps the most important contribution, presented last, is a very fine-grain scalable coding approach based on a scalable entropy coder designed with a trellis-based optimization. In the proposed scheme, an entropy coding tree is constructed whose internal nodes can be mapped to reconstruction points; the tree can then be pruned at internal nodes to control the rate-distortion (RD) performance of the encoder in a fine-grain manner. A set of metrics and a trellis-based approach are proposed so that an appropriate path is generated on the RD plane. The results show that the proposed method outperforms scalable audio coding based on reconstruction error quantization as used in practical systems, e.g., scalable advanced audio coding (S-AAC).
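The pruning idea can be illustrated with a small sketch. The `Node` structure, the per-node rate/distortion bookkeeping, and the Lagrangian test J = D + lambda * R are assumptions for illustration, not the thesis' actual trellis optimization: because each internal node carries a reconstruction point, truncating the tree at a node still yields a decodable, coarser stream.

```python
class Node:
    def __init__(self, rate, distortion, children=()):
        self.rate = rate              # bits spent to descend into this node
        self.distortion = distortion  # distortion if decoding stops here
        self.children = list(children)

def prune(node, lam):
    """Prune the subtree in place; return (rate, distortion) of the result."""
    if not node.children:
        return node.rate, node.distortion
    r_keep, d_keep = node.rate, 0.0
    for child in node.children:
        cr, cd = prune(child, lam)
        r_keep += cr
        d_keep += cd
    # Lagrangian comparison: descendants vs. stopping at this node
    if d_keep + lam * r_keep <= node.distortion + lam * node.rate:
        return r_keep, d_keep
    node.children = []                # cut here; reconstruction point suffices
    return node.rate, node.distortion
```

Sweeping `lam` from small to large traces out operating points from high-rate/low-distortion to low-rate/high-distortion, which is the sense in which the pruning generates a path on the RD plane.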
Quantifying and Exploiting Speech Memory for the Improvement of Narrowband Speech Bandwidth Extension
Ph.D. Thesis, December 2013
Supervisor: P. Kabal
Since its standardization in the 1960s, the bandwidth of traditional telephony speech has been limited to the 0.3–3.4 kHz range. Such narrowband speech exhibits not only a quality that is noticeably inferior to its wideband counterpart, but also reduced intelligibility, especially for consonant sounds. Wideband speech reconstruction through artificial bandwidth extension (BWE) attempts to regenerate the highband frequency content above 3.4 kHz at the receiving end, thereby providing backward compatibility with existing networks. Although BWE has been the subject of considerable research, BWE schemes have primarily relied on memoryless mapping to capture the correlation between narrowband and highband spectra. In this thesis, we investigate exploiting speech memory, in reference to the long-term information in segments longer than the conventional 10–30 ms frames, for the purpose of improving the cross-band correlation central to BWE.
With speech durations of up to 600 ms modelled through delta features, we first quantify the correlation between long-term parameterizations of the narrow and high frequency bands using information-theoretic measures in combination with statistical modelling based on Gaussian mixture models (GMMs) and vector quantization. In addition to showing that the inclusion of memory can increase certainty about highband spectral content in joint-band GMMs by over 100%, our information-theoretic investigation also demonstrates that the gains achievable by such acoustic-only memory inclusion saturate at roughly the syllabic duration of 200 ms, coinciding with findings to the same effect in earlier works studying the long-term information content of speech.
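Delta features follow a standard regression formulation over a window of neighbouring frames; the sketch below assumes a generic window length `N` (the thesis' exact parameterization is not reproduced here). With per-frame features spaced 10–20 ms apart, deltas computed over a wide window summarize hundreds of milliseconds of context.

```python
import numpy as np

def delta_features(static, N=2):
    """Regression deltas over frames t-N..t+N.

    static: (T, D) array of per-frame features; returns a (T, D) array.
    Edges are handled by replicating the first/last frame.
    """
    T, D = static.shape
    padded = np.pad(static, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    deltas = np.zeros_like(static, dtype=float)
    for n in range(1, N + 1):
        deltas += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return deltas / denom
```

For a linearly changing feature the interior deltas recover the per-frame slope, and for a constant feature they are zero, which is the sense in which deltas isolate dynamics from static content.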
To translate the highband certainty gains achievable by memory inclusion into tangible BWE performance improvements, we subsequently propose two distinct and novel approaches for memory-inclusive GMM-based BWE, where highband spectra are reconstructed from narrowband input by minimum mean-square error estimation. In the first approach, we incorporate delta features into the feature vector representations whose underlying cross-band correlations are to be modelled by joint-band GMMs. Due to their non-invertibility, however, including delta features in the parameterization frontend in lieu of some of the conventional static features imposes a time-frequency information tradeoff. Accordingly, we propose an empirical optimization process to determine the optimal allocation of the available dimensionalities among static and delta features such that the certainty about static highband content is maximized. Requiring only minimal modifications to our memoryless BWE baseline system, frontend-based memory inclusion optimized in this way yields performance improvements that, while modest, involve no increase in either extension-stage computational cost or training data requirements, thereby providing an easy and convenient means of exploiting speech dynamics to improve BWE performance.
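The minimum mean-square error estimator from a joint-band GMM has the textbook closed form E[y|x] = sum_i p(i|x) (mu_y,i + S_yx,i S_xx,i^{-1} (x - mu_x,i)). A minimal sketch follows; the dimensions and parameter values are illustrative, not trained models from the thesis.

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    """Multivariate normal density at x."""
    d = x - mu
    k = len(mu)
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / \
        np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))

def gmm_mmse(x, weights, mu_x, mu_y, S_xx, S_yx):
    """MMSE estimate of highband y given narrowband x under a joint GMM."""
    resp = np.array([w * gauss_pdf(x, mx, Sxx)
                     for w, mx, Sxx in zip(weights, mu_x, S_xx)])
    resp /= resp.sum()                          # posterior p(i | x)
    y_hat = np.zeros_like(mu_y[0], dtype=float)
    for p, mx, my, Sxx, Syx in zip(resp, mu_x, mu_y, S_xx, S_yx):
        y_hat += p * (my + Syx @ np.linalg.solve(Sxx, x - mx))
    return y_hat
```

Each mixture component contributes a local linear regression from narrowband to highband features, weighted by how likely that component is given the observed narrowband vector.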
In our second approach, we focus on modelling the high-dimensional distributions underlying sequences of joint-band feature vectors as an alternative to the frontend dimensionality-reducing transform used in our first approach above. To that end, we extend the GMM framework by presenting a novel training approach where sequences of past frames are progressively used to estimate the parameters of high-dimensional temporally-extended GMMs in a tree-like time-frequency-localized fashion. By breaking down the infeasible task of modelling high-dimensional distributions into a series of localized modelling operations with considerably lower complexity and fewer degrees of freedom, our proposed tree-like extension algorithm circumvents the complexities associated with GMM parameter estimation in high-dimensional settings. Incorporating novel algorithms for fuzzy GMM-based clustering and weighted Expectation-Maximization, we also present our proposed temporal GMM extension approach in a manner that emphasizes its wide applicability to the general contexts of source-target conversion and high-dimensional modelling. By integrating temporally-extended GMMs into our memoryless BWE baseline system, we show that our model-based memory-inclusive BWE technique can outperform not only our first frontend-based approach, but also other comparable and oft-cited model-based techniques in the literature. Although this superior BWE performance comes at a significant increase in extension-stage computational cost, we show these costs to be within the typical capabilities of modern communication devices such as tablets and smart phones.
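A generic weighted Expectation-Maximization step for a diagonal-covariance GMM can be sketched as follows; per-sample weights simply scale the responsibilities in the E-step. This is only the weighted-EM building block, not the thesis' tree-like temporal extension or fuzzy clustering.

```python
import numpy as np

def weighted_em_step(X, w, means, variances, priors):
    """One weighted EM iteration for a diagonal-covariance GMM.

    X: (N, D) data, w: (N,) per-sample weights,
    means/variances: (K, D), priors: (K,).
    """
    N, D = X.shape
    K = len(priors)
    # E-step: log responsibilities, then scale by sample weights
    log_r = np.zeros((N, K))
    for k in range(K):
        log_r[:, k] = (np.log(priors[k])
                       - 0.5 * np.sum(np.log(2 * np.pi * variances[k]))
                       - 0.5 * np.sum((X - means[k]) ** 2 / variances[k], axis=1))
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    r *= w[:, None]
    # M-step: weighted sufficient statistics
    Nk = r.sum(axis=0)
    new_means = (r.T @ X) / Nk[:, None]
    new_vars = np.stack([(r[:, k:k + 1] * (X - new_means[k]) ** 2).sum(0) / Nk[k]
                         for k in range(K)])
    new_priors = Nk / Nk.sum()
    return new_means, new_vars, new_priors
```

In a localized, tree-like training scheme of the kind described above, such weights would come from the soft cluster memberships of the preceding level, so each localized model is fit mostly to the data it is responsible for.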
Playout Buffering for Conversational Voice over IP
Ph.D. Thesis, October 2012
Supervisor: P. Kabal
In Voice over IP (VoIP), the quality of interactive conversation is important to users. Major factors affecting perceived quality are delay, delay jitter, and missing packets. For conversational VoIP, conversational delay also plays an important role in perceived quality. Large conversational delays can result in double talk, echo, or even the termination of the conversation. In practice, a playout buffer is introduced at the receiver's side to remove delay jitter, so that the voice information carried in packets is available at regular intervals for decoding. A longer buffer reduces the possibility of late packet loss at the expense of increased conversational delay. Since playout buffer delay is a major addition to conversational delay, to preserve conversational interactivity it is desirable to design a playout buffer that is short yet capable of protecting packets against late loss.
In this thesis, we explore playout buffering algorithms with improved conversational quality. We propose a quality-based adaptive playout buffering algorithm with improved voice quality and reduced conversational delays. We use the E-Model R factor as the cost index to obtain playout delays that adapt for each talkspurt. Special steps are taken to reduce conversational delay: (1) immediately play out the stretched speech carried in the first packet of a talkspurt when it is received (stretching provides additional buffer delay for the following packets); (2) compress the speech segments carried in the packets in the playout buffer at the end of a talkspurt (compression reduces the playout delay for those packets).
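The E-Model-driven delay selection can be sketched as follows, using G.107-style approximations for the delay impairment Id and the effective equipment impairment Ie,eff. The codec constants (Ie, Bpl), the candidate buffer delays, and the network-delay samples are illustrative assumptions; a larger playout delay lowers the late-loss probability but raises Id, and the R factor trades the two off.

```python
import numpy as np

def r_factor(one_way_delay_ms, loss_pct, Ie=0.0, Bpl=10.0):
    """Simplified E-Model rating: R = 93.2 - Id - Ie,eff (G.107-style)."""
    d = one_way_delay_ms
    Id = 0.024 * d + (0.11 * (d - 177.3) if d > 177.3 else 0.0)
    Ie_eff = Ie + (95.0 - Ie) * loss_pct / (loss_pct + Bpl)
    return 93.2 - Id - Ie_eff

def best_playout_delay(delay_samples_ms, candidates_ms, network_delay_ms=30.0):
    """Pick the candidate buffer delay maximizing R over observed jitter."""
    best = None
    for buf in candidates_ms:
        late_pct = 100.0 * np.mean(np.asarray(delay_samples_ms) > buf)
        r = r_factor(network_delay_ms + buf, late_pct)
        if best is None or r > best[1]:
            best = (buf, r)
    return best[0]
```

A per-talkspurt algorithm would re-run this selection as the jitter statistics evolve, which is the sense in which the playout delay adapts for each talkspurt.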
Like other quality-based algorithms, our scheme is vulnerable to burst losses. To further improve perceived quality, we use sender-driven repair algorithms, in which the sender transmits redundant information, to mitigate the impact of packets missing due to the network (lost packets) and buffer underflow (late packets) without increasing buffer delays. In this thesis, we develop a new adaptive forward error correction (FEC) scheme that provides redundancy without additional delay and apply it to our adaptive playout buffering algorithm for improved perceived quality. As an alternative sender-based technique for sending redundant information, a path diversity scheme uses multiple paths (here we consider two); redundant information is sent on the second path. We consider four different path diversity schemes (two of them proposed in this work based on the E-model), and design corresponding playout buffering algorithms based on conversational quality, including both calling quality and interactivity.
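The simplest form of sender-driven redundancy is a (k+1)-packet XOR parity, which can recover any single missing packet in a group. The sketch below shows only this basic building block, not the adaptive FEC scheme itself, which chooses the amount of redundancy dynamically.

```python
def xor_parity(packets):
    """Redundancy packet: bytewise XOR of k equal-length payloads."""
    parity = bytearray(len(packets[0]))
    for p in packets:
        for i, b in enumerate(p):
            parity[i] ^= b
    return bytes(parity)

def recover(survivors, parity):
    """Rebuild the single missing packet from the k-1 survivors + parity."""
    return xor_parity(survivors + [parity])
```

XORing the parity with all surviving payloads cancels them out, leaving exactly the missing payload; the cost is one extra packet per group and, in a media stream, the wait for the group to complete.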
Cheick Mohamed Konaté
Enhancing Speech Coder Quality: Improved Noise Estimation for Postfilters
M.Eng. Thesis, June 2011
Supervisor: P. Kabal
ITU-T G.711.1 is a multirate wideband extension of the well-known ITU-T G.711 pulse code modulation of voice frequencies. The extended system is fully interoperable with the legacy narrowband one. When legacy G.711 is used to code a speech signal and G.711.1 is used to decode it, quantization noise may be audible. For this situation, the standard proposes an optional postfilter. Postfiltering requires an estimate of the quantization noise: the more accurate the estimate, the better the postfilter can perform. In this thesis, we propose an improved noise estimator for the postfilter proposed for the G.711.1 codec and assess its performance. The proposed estimator provides a more accurate estimate of the noise with the same computational complexity.
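The basic idea behind such a noise estimate can be sketched for a segmented companding law: a uniform quantizer of step size Delta contributes noise power of about Delta^2 / 12, and mu-law companding makes the step amplitude-dependent, roughly doubling per segment. The segment edges and scaling below are illustrative assumptions, not the estimator proposed in the thesis.

```python
import numpy as np

def mulaw_noise_power(samples):
    """Rough per-sample quantization-noise power for segmented mu-law PCM.

    samples: decoded amplitudes on an illustrative 14-bit linear scale.
    """
    samples = np.abs(np.asarray(samples, dtype=float))
    # 8 segments; the step size doubles from one segment to the next
    edges = np.array([0, 32, 96, 224, 480, 992, 2016, 4064, 8160])
    seg = np.clip(np.searchsorted(edges, samples, side='right') - 1, 0, 7)
    step = 2.0 ** (seg + 1)
    return np.mean(step ** 2 / 12.0)
```

Because the step, and hence the noise power, varies with signal level, a good estimator tracks the local amplitude of the decoded signal rather than assuming a fixed noise floor.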
A Sparse Auditory Envelope Representation with Iterative Reconstruction for Audio Coding
Ph.D. Thesis, April 2011
Supervisor: P. Kabal
Modern audio coding exploits the properties of the human auditory system to efficiently code speech and music signals. Perceptual domain coding is a branch of audio coding in which the signal is stored and transmitted as a set of parameters derived directly from a model of the human auditory system. Often, the perceptual representation is designed such that reconstruction can be achieved with limited resources, but this usually means that some perceptually irrelevant information is included. In this thesis, we investigate perceptual domain coding using a representation designed to contain only the audible information, regardless of whether reconstruction can be performed efficiently. The perceptual representation we use is based on a multichannel basilar membrane model, where each channel is decomposed into envelope and carrier components. We assume that the information in the carrier is also present in the envelopes and therefore discard the carrier components. The envelope components are sparsified using a transmultiplexing masking model and form our basic sparse auditory envelope representation (SAER).
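The envelope/carrier split per channel can be sketched with a generic analytic-signal decomposition; the thesis' basilar-membrane filterbank and masking model are not reproduced here. The envelope is the magnitude of the analytic signal and the carrier is its unit-magnitude phase factor.

```python
import numpy as np

def analytic(x):
    """FFT-based analytic signal (assumes even-length real input)."""
    X = np.fft.fft(x)
    n = len(x)
    h = np.zeros(n)
    h[0] = h[n // 2] = 1.0     # keep DC and Nyquist
    h[1:n // 2] = 2.0          # double positive frequencies, zero negatives
    return np.fft.ifft(X * h)

def envelope_carrier(channel_signal):
    """Split one filterbank channel into envelope and unit-magnitude carrier."""
    a = analytic(channel_signal)
    env = np.abs(a)
    carrier = a / np.maximum(env, 1e-12)
    return env, carrier
```

Discarding `carrier` and keeping only (a sparsified version of) `env` in each channel is the operation that turns the filterbank output into an envelope-only representation.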
An iterative reconstruction algorithm for the SAER is presented that estimates carrier components to match the encoded envelopes. The algorithm is split into two stages. In the first, two sets of envelopes are generated: one expands the sparse envelope samples, while the other provides limits for the iterative reconstruction. In the second stage, the carrier components are estimated using an iterative synthesis-by-analysis method adapted from methods designed for reconstruction from magnitude-only transform coefficients. The overall system is evaluated using subjective and objective testing on speech and audio signals. We find that some types of audio signals are reproduced very well using this method, whereas others exhibit audible distortion. We conclude that, except in some specific cases where part of the carrier information is required, most of the audible information is present in the SAER and can be reconstructed using iterative methods.
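The magnitude-only reconstruction idea that the second stage adapts can be sketched as a Griffin-Lim-style alternating projection. This generic FFT version is an illustration, not the thesis' envelope-matching algorithm: each iteration re-imposes the target magnitude while keeping the current phase estimate, alternating with a projection onto real-valued signals.

```python
import numpy as np

def iterative_reconstruct(target_mag, iters=100, seed=0):
    """Estimate a real signal whose FFT magnitude matches target_mag."""
    rng = np.random.default_rng(seed)
    phase = np.exp(1j * rng.uniform(0, 2 * np.pi, len(target_mag)))
    for _ in range(iters):
        x = np.fft.ifft(target_mag * phase).real   # project onto real signals
        X = np.fft.fft(x)
        phase = np.exp(1j * np.angle(X))           # keep phase, drop magnitude
    return np.fft.ifft(target_mag * phase).real
```

In the SAER setting, the magnitude constraint is replaced by the encoded (and limiting) envelopes in each auditory channel, but the alternate-project structure of the iteration is the same.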