Gilbert A. Soulodre
Adaptive Methods for Removing Camera Noise from Film Soundtracks
Ph.D. Thesis, November 1998
Supervisor: P. Kabal
One of the fundamental problems in signal processing is to enhance a signal which has been corrupted by additive noise. In this thesis, the problem of alleviating the effects of camera noise corrupting the dialog of a film soundtrack is examined. Two methods of noise reduction are investigated: adaptive noise cancellation with a synthesized reference signal, and spectral subtraction. It is found that, due to the relatively low correlation between successive camera noise pulses, the adaptive noise cancellation approach is not effective at reducing camera noise. The spectral subtraction method is shown to reduce camera noise, but the process creates audible artifacts which can be very disturbing to the listener. To overcome this, new methods are proposed for reducing musical noise and time aliasing effects. The use of subbands and sub-frames is shown to significantly improve the performance of the spectral subtraction algorithm by providing a better match of the noise reduction process to the noise. The performance is further improved by incorporating a perceptual model into the spectral subtraction algorithm. The use of subbands, sub-frames, and a perceptual model allows the amount of processing applied to the signal to be minimized, which in turn reduces the level of any artifacts which may result from the noise reduction process. The results of a formal subjective test demonstrate the improved performance of the new noise reduction algorithm.
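For readers unfamiliar with spectral subtraction, the following Python sketch shows only the textbook single-frame form: a noise magnitude estimate is subtracted from each frequency bin and the noisy phase is kept. It is not the subband, sub-frame, or perceptual algorithm developed in the thesis. The function names, the naive DFT, and the `floor` parameter are assumptions made for illustration.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (fine for short illustrative frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each output sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def spectral_subtract(frame, noise_mag, floor=0.0):
    """Subtract a per-bin noise magnitude estimate from one frame.

    frame     -- list of time-domain samples
    noise_mag -- estimated noise magnitude for each DFT bin (hypothetical input)
    floor     -- fraction of the original magnitude kept as a lower bound
    """
    cleaned = []
    for bin_value, noise in zip(dft(frame), noise_mag):
        magnitude = abs(bin_value)
        # Clamp so the magnitude never goes below the spectral floor.
        new_magnitude = max(magnitude - noise, floor * magnitude)
        # Reuse the noisy phase, as basic spectral subtraction does.
        cleaned.append(cmath.rect(new_magnitude, cmath.phase(bin_value)))
    return idft(cleaned)
```

The clamping step is where the residual "musical noise" mentioned in the abstract typically originates: bins that randomly exceed the noise estimate survive as isolated tones.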
Eddie Lun Tik Choy
Waveform Interpolation Speech Coder at 4 kb/s
M.Eng. Thesis, August 1998
Supervisor: P. Kabal
See also: Demonstration
Speech coding at bit rates near 4 kbps is expected to be widely deployed in applications such as visual telephony, mobile and personal communications. This research focuses on developing a speech coder based on the waveform interpolation (WI) scheme, with an attempt to deliver near toll-quality speech at rates around 4 kbps. A WI coder has been simulated in floating-point using the C programming language. The high performance of the WI model has been confirmed by subjective listening tests in which the unquantized coder outperforms the 32 kbps G.726 standard (ADPCM) 98% of the time under clean input speech conditions; the reconstructed speech is perceived to be essentially indistinguishable from the original. When fully quantized, the speech quality of the WI coder at 4.25 kbps has been judged to be equivalent to or better than that of G.729 (the ITU-T toll-quality 8 kbps standard) for 45% of the test sentences. Further refinements of the quantization techniques are warranted to bring the coder closer to the toll-quality benchmark. Yet, the existing implementation has produced good quality coded speech with a high degree of intelligibility and naturalness when compared to the conventional coding schemes operating in the neighbourhood of 4 kbps.
Robust Spectral Parameter Coding in Speech Processing
M.Eng. Thesis, May 1998
Supervisor: P. Kabal
Linear predictive coding (LPC) is employed in many low bit rate speech coders. LPC models the short-term spectral information for a block of speech using an all-pole response. Line spectral frequencies (LSF) have been found to be an effective parametric representation for the all-pole response.
Vector quantization (VQ) is often used to code the coefficients of the response. VQ performs poorly whenever it is coding coefficient vectors which are not well matched to the distribution of its codebooks. A shift in the distribution can be caused by filtering (microphones, filters in communication equipment, etc.) or by speaker or environmental variability (male, female, background noise, etc.). In this thesis, we explore a method for matching the distribution of the vectors representing the incoming speech signal to the distribution of the codebooks. A novel mapping model based on the transformation of codebooks using the mean and the standard deviation of the distributions is used to increase the robustness of vector quantization. The mapping model is optimized in two ways: by choosing the most suitable spectral parameter representation and by seeking the best way to obtain the form of the mapping model. The effectiveness and limitations of this method are investigated through simulation of a split vector quantizer (SVQ) of the LPC coefficients.
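One simple way to picture a mean and standard deviation mapping is a per-dimension affine transform that moves each codevector from the training distribution to the distribution of the incoming vectors. The Python sketch below assumes that form; the abstract does not give the actual mapping model, and the function names are hypothetical.

```python
import statistics


def column_stats(vectors):
    """Per-dimension mean and population standard deviation of a vector set."""
    columns = list(zip(*vectors))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def map_codebook(codebook, train_stats, input_stats):
    """Shift and scale every codevector so the codebook statistics follow
    the statistics of the incoming vectors (assumed affine mapping)."""
    (mu_train, sd_train), (mu_in, sd_in) = train_stats, input_stats
    mapped = []
    for code in codebook:
        new_code = []
        for c, mt, st, mi, si in zip(code, mu_train, sd_train, mu_in, sd_in):
            # Leave the scale untouched when the training spread is zero.
            scale = si / st if st > 0 else 1.0
            new_code.append((c - mt) * scale + mi)
        mapped.append(new_code)
    return mapped
```

In use, `column_stats` would be run once on the codebook training set and again on a window of recent input vectors, and the mapped codebook would replace the original for quantization.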
Mohammad R. Zad-Issa
Smoothing the Evolution of the Spectral Parameters in Speech Coders
M.Eng. Thesis, January 1998
Supervisor: P. Kabal
New generations of speech coders have to achieve two goals: efficient use of bandwidth and high speech quality. The objective of this thesis is to improve the modelling of the speech signal within the constraints of a low bit rate coder.
Many speech coding algorithms use Linear Prediction (LP) coefficients to describe the power spectrum of the speech. These parameters are obtained for blocks of input samples using standard linear prediction analysis techniques. Changes in the speech power spectrum result in the evolution of the LP parameters. However, conventional linear prediction analysis has shortcomings that contribute to the frame-to-frame variation of the LP parameters. These undesired variations affect the performance of the parameter coding and the perceptual quality of the synthesized signal. For voiced speech, efficient coding of the excitation pitch pulses relies on the similarity of successive pitch waveforms. The performance of this coding stage is also jeopardized by LP parameter variations.
The goal of this thesis is to modify the traditional linear prediction analysis in such a way that the fluctuations of the LP coefficients are reduced, while the pitch pulse shape evolves slowly. These modifications can lead to an increase in the coding efficiency and/or an improvement in the speech quality. Two different methods have been developed for this purpose. In the first approach, we derive the LP parameters such that the glottal excitation model matches as closely as possible a target waveform. The latter contains slowly evolving pulses representing voiced speech excitation. The simulation results indicate that the target matching method results in an increase in the pitch prediction gain, which is a measure of the similarity of successive pitch pulses. The frame-to-frame variation of the LP coefficients is also lowered with respect to the conventional linear prediction analysis. In the second method, we enforce smoothness on the evolution of the LP parameters by directly including their variation in the LP error function. A novel scheme to dynamically control the contribution of this additional term is also proposed. Experiments indicate that this method can considerably reduce the fluctuation of LP parameters while the overall prediction gain of the LP filter is maintained.
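The second method can be pictured with a penalized least-squares problem. The Python sketch below assumes a quadratic penalty on the difference from the previous frame's coefficients, so that the normal equations become (R + lam I) a = r + lam a_prev. The exact error function and the dynamic control of the penalty weight used in the thesis are not given in the abstract; the names `smoothed_lp` and `lam` are illustrative.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def smoothed_lp(frame, order, prev_coeffs, lam):
    """Covariance-method LP with a penalty lam * ||a - prev_coeffs||^2.

    The predictor is x[n] ~ sum_k a[k] * x[n-1-k].
    lam = 0 gives ordinary least-squares LP analysis.
    """
    R = [[0.0] * order for _ in range(order)]
    r = [0.0] * order
    for n in range(order, len(frame)):
        past = [frame[n - 1 - k] for k in range(order)]
        for i in range(order):
            r[i] += frame[n] * past[i]
            for j in range(order):
                R[i][j] += past[i] * past[j]
    # The penalty adds lam to the diagonal and pulls toward prev_coeffs.
    for i in range(order):
        R[i][i] += lam
        r[i] += lam * prev_coeffs[i]
    return solve(R, r)
```

Raising `lam` trades prediction gain for smoother coefficient tracks, which is the trade-off a dynamic control scheme would have to manage.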
Delay Estimation for Transform Domain Acoustical Echo Cancellation
M.Eng. Thesis, September 1997
Supervisor: P. Kabal
Acoustic echo cancellation can be used to remove the annoying talker feedback in hands-free (teleconferencing) systems. The echo canceller identifies the response between the loudspeaker and the microphone, and produces an echo replica which is then subtracted from the signal. Adaptive filtering techniques are employed to determine the echo path response. The speech signal (or the reference signal) is used to train the algorithm. Fast convergence and good tracking capabilities cannot be achieved by classical transform domain adaptive filtering algorithms when the reference signal has an autocorrelation matrix of variable rank. In this thesis, we examine the DCT-LMS algorithm and emphasize the role played by the Discrete Cosine Transform. This fixed transformation reduces the eigenvalue spread of the input autocorrelation matrix by partially decorrelating the inputs.
The autocorrelation matrix of speech signals is often rank-deficient. During the low-rank phases, some of the transform-domain tap coefficients become irrelevant to the adaptation process and stop adapting. When the autocorrelation matrix gains full rank, there will no longer be any "frozen" weights. However, the weights that have been frozen are "far" from the optimal point; they require additional convergence time to track the changes in the room impulse response again. In this dissertation, we present a new method that uses the information contained in the other coefficients to move the frozen weights closer to the optimal point and, consequently, reduce the overall convergence time.
By modeling the changes in the impulse response that result from an alteration in the spacing between the microphone and the loudspeaker by a single delay, we were able to develop the "Spectrum Delay Update" (SDU) method. It consists of replacing, during a low-rank phase, each frozen coefficient by a delayed version of the previous full-rank solution. To estimate the corresponding delay, a novel DCT-domain delay estimation algorithm was derived.
Simulation results demonstrate the efficiency of SDU for acoustic echo cancellation: the gain in Echo Return Loss is substantial. The experimental performance analysis confirms the expected reduction in the Euclidean distance between the filter weights and the DCT of the actual room impulse response. Furthermore, it shows that spectrally updating the filter weights reduces the MSE jump when the autocorrelation matrix gains full rank.
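The baseline DCT-LMS structure that the thesis builds on can be sketched in a few lines of Python. This is the generic power-normalized transform-domain LMS, not the Spectrum Delay Update method. The step size `mu`, the smoothing factor `beta`, and the function names are assumptions chosen for the illustration.

```python
import math
import random


def dct_matrix(n):
    """Orthonormal DCT-II matrix; each row is one basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def dct_lms(reference, desired, num_taps, mu=0.05, beta=0.9, eps=1e-8):
    """Adapt transform-domain weights so that weights . DCT(taps) tracks desired."""
    T = dct_matrix(num_taps)
    weights = [0.0] * num_taps
    power = [1.0] * num_taps          # running power estimate per DCT bin
    taps = [0.0] * num_taps           # most recent reference samples
    for x, d in zip(reference, desired):
        taps = [x] + taps[:-1]
        u = [sum(t * v for t, v in zip(row, taps)) for row in T]
        error = d - sum(w * ui for w, ui in zip(weights, u))
        for i, ui in enumerate(u):
            power[i] = beta * power[i] + (1.0 - beta) * ui * ui
            # Per-bin normalization is what equalizes the convergence rates.
            weights[i] += mu * error * ui / (power[i] + eps)
    return weights
```

When a bin's input power stays near zero for a while, its weight effectively stops adapting, which corresponds to the "frozen" weights described in the abstract.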
Improved Pitch Modelling for Low Bit-Rate Speech Coders
M.Eng. Thesis, August 1997
Supervisor: P. Kabal
During the last several years, there has been a dramatic growth of digital services, such as digital wireless and wireline communications, satellite communications and digital voice storage systems. Such services require the use of high-quality low bit-rate coders to efficiently code the speech signal before transmission or storage. The majority of such coders employ algorithms that are based on Code-Excited Linear Prediction (CELP).
The goal of this thesis is to improve the quality of CELP coded speech, while keeping the basic coding format intact. The quality improvement is focused on voiced speech segments. A Pitch Pulse Averaging (PPA) algorithm has been developed to enhance the periodicity of such segments, where during steady state voicing the pitch pulse waveforms in the excitation signal evolve slowly in time. The PPA algorithm extracts a number of such pitch pulse waveforms from the past excitation, aligns them, and then averages them to produce a new pitch pulse waveform with reduced noise.
The PPA algorithm has been simulated and tested on a floating point C-simulation of the G.729 8 kbps CS-ACELP coder. Objective tests verified that the algorithm contributes most during steady state voiced speech. Thus a simple voicing decision mechanism has been developed to deactivate the algorithm during unvoiced segments and voicing onsets of speech. Results verified that the algorithm has generally improved the periodicity of voiced segments by reducing the average of the weighted mean-squared error.
While we were able to demonstrate improvements in objective measures, informal listening tests indicate that the already high perceptual quality of G.729 is generally not audibly altered. Nonetheless, the technique may be useful for improving the quality at lower rates, particularly for next generation low bit-rate coders operating near 4 kbps.
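The extract, align, and average steps of the PPA algorithm can be sketched as follows. This Python fragment is a simplified illustration: it assumes nominal pulse start positions are already known, uses the most recent pulse as the alignment reference, and aligns by maximizing a plain cross-correlation over a small lag range. None of these choices is confirmed by the abstract.

```python
def segment(signal, start, length):
    """Return signal[start:start+length], or None if it runs off either end."""
    if start < 0 or start + length > len(signal):
        return None
    return signal[start:start + length]


def average_pulses(excitation, pulse_starts, pulse_len, max_lag=2):
    """Average pitch pulses after aligning each one to the most recent pulse.

    excitation   -- past excitation samples
    pulse_starts -- nominal start index of each pulse (hypothetical input)
    """
    reference = segment(excitation, pulse_starts[-1], pulse_len)
    aligned = []
    for start in pulse_starts:
        candidates = [segment(excitation, start + lag, pulse_len)
                      for lag in range(-max_lag, max_lag + 1)]
        candidates = [c for c in candidates if c is not None]
        # Keep the shifted segment that correlates best with the reference.
        best = max(candidates,
                   key=lambda c: sum(r * v for r, v in zip(reference, c)))
        aligned.append(best)
    # Sample-by-sample mean across the aligned pulses.
    return [sum(column) / len(aligned) for column in zip(*aligned)]
```

Averaging K aligned pulses leaves the common waveform unchanged while reducing uncorrelated noise power by roughly a factor of K, which is why the benefit is concentrated in steady-state voicing.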
A Pitch Pulse Evolution Model for Linear Predictive Coding of Speech
Ph.D. Thesis, May 1997
Supervisor: P. Kabal
Speech coding is important in the effort to make more efficient use of digital telecommunication networks, particularly wireless systems, and to reduce the memory requirements in speech storage systems. The desire for a low-rate digital representation of speech is often contrary to the demand for a high quality speech reconstruction. In this thesis we present a new speech compression technique designed for near toll quality speech coding at bit rates as low as 4 kb/s.
In low-rate speech coding based on linear prediction (LP), poor modelling of the LP excitation for voiced, quasi-periodic segments contributes to the degradation of the quality of the reconstructed speech. In this dissertation, we present a new speech coding method designed for improved modelling of the LP excitation.
Conceptually, the LP excitation is decomposed into a series of underlying pitch pulses and a simultaneous unvoiced noise-like signal. The underlying pitch pulses are estimated from noisy observations, i.e. the pitch pulses extracted from the LP residual. Since the pulses change little from one time instant to another, we call our representation the Pitch Pulse Evolution (PPE) model. The PPE model provides a framework to analyze and effectively control the periodicity of voiced speech.
We have developed a robust algorithm for extracting noisy pitch pulses from the LP residual based on error minimization with respect to a set of model pulses, and we have examined a number of methods for calculating the underlying pulses. The evolving pitch pulse waveshapes, the pulse positions, and the unvoiced signal are encoded separately. The positions and the shapes of the underlying pulses need only be coded infrequently, and the characteristics of intermediate pulses are obtained by interpolation.
The software implementation of a 4 kb/s PPE coder is described. The main features of the implemented PPE coder are: a novel approach to pitch analysis; estimation of evolving pitch pulses which enables control over the pulse characteristics; and a unique coding scheme which avoids the time dilation and contraction of individual pitch pulses found in other waveform interpolation coders.
Spacializing Simultaneous Speech with Application to Increasing Understanding
B.Eng. (Honours) Thesis, April 1997
Supervisor: P. Kabal
See also: Demonstration
How well we understand speech from several simultaneous speakers depends on several factors, including whether we process the speech in a serial or parallel fashion. Our ability to understand one speaker amongst many is quite strong but a much greater challenge is to understand two speakers at once, especially if we are hearing these speakers over headphones. An enhancement to the speech which will aid us in this endeavour is desired.
One proposed method of increasing this ability is to cause the listener to perceive the speakers as being separated in space. This paper will examine how this can be done using digital signal processing to allow the listener to hear the moved speech over headphones.
Our perception of a speaker's location has to do with the speaker's direction relative to the listener and the environment (room, open space, concert hall, etc.) around them. The speaker's direction can be simulated by filtering the speech through a stereo finite impulse response filter called an HRTF (Head Related Transfer Function). The speaker's distance can be simulated by sending reverberation to the listener. Reverberation is composed of early and late reflections of sound off the surfaces in a room. The ratio of direct sound to reverberant sound is a strong cue to the distance of a sound source.
An algorithm was implemented to perform these transformations and tested with several subjects. The subjects were able to determine direction fairly well, although the well-documented front-back reversal error was often encountered. Distance is difficult to model properly in headphones due to the sound source being right beside the ear. Consequently, tests on subjects' estimation of distance resulted in the judged distance being considerably shorter than the desired distance. Reverberation, however, clearly helped in externalizing the sound from the head.
Spatialization of sound was then applied to the problem of parallel speech understanding, and several tests were performed. The results indicated that parallel speech was indeed easier to understand when the speakers were separated and externalized from the head. Understanding was higher for separated speakers than for one speaker in each ear (one mono-phone speech file per ear) and for both speakers superimposed at a distance external to the head.
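The direction and distance cues described above reduce to two operations: convolving the speech with a left and right head-related impulse response, and mixing in reverberation at a chosen direct-to-reverberant ratio. The Python sketch below assumes an amplitude ratio and a single shared reverberation response; the impulse responses passed in are placeholders, not measured HRTFs, and the function names are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def mix(direct, reverb, reverb_gain):
    """Add a scaled reverberant signal to the direct signal, zero-padding as needed."""
    n = max(len(direct), len(reverb))
    direct = direct + [0.0] * (n - len(direct))
    reverb = reverb + [0.0] * (n - len(reverb))
    return [d + reverb_gain * r for d, r in zip(direct, reverb)]


def spatialize(mono, hrir_left, hrir_right, reverb_ir, direct_to_reverb):
    """Return (left, right) headphone signals for one mono speech signal.

    direct_to_reverb -- amplitude ratio of direct to reverberant sound;
                        a smaller ratio should suggest a more distant source.
    """
    wet = convolve(mono, reverb_ir)
    gain = 1.0 / direct_to_reverb
    left = mix(convolve(mono, hrir_left), wet, gain)
    right = mix(convolve(mono, hrir_right), wet, gain)
    return left, right
```

Two simultaneous talkers would each be passed through `spatialize` with different impulse response pairs, and the resulting left and right signals summed per ear.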
Carl R. Nassar
A Novel Receiver Structure for Data Detection in the Presence of Rapidly Changing Nuisance Parameters
Ph.D. Thesis, December 1996
Supervisors: M. R. Soleymani and M. L. Blostein
This thesis introduces a novel receiver structure for the detection of data in the presence of rapidly changing nuisance parameters. Underlying equations characterizing the novel receiver are presented first. This is followed by a presentation and explanation of the receiver implementation; the implementation uses a parallel structure to facilitate real-time processing.
A theoretical analysis of this receiver is provided. Here, we introduce an existence condition; this condition suggests the broad applicability of our receiver. Next, we present two algorithms, one based on rate distortion theory, and the other on the Generalized Lloyd Algorithm (GLA). These two algorithms facilitate the creation of the novel receiver's variables for many practical applications.
We apply our receiver to four communication environments of practical interest. These environments can be described briefly as follows: (1) an MPSK signal is sent across a channel introducing noise and a phase offset; here, the phase offset is constant over only N symbols, where N is small but greater than two (e.g., N=3); (2) as in (1), an MPSK signal is sent across a channel introducing noise and a phase offset; this time, the phase offset is constant over only N=2 symbols; (3) a coded MPSK signal is sent across a channel adding noise and rapidly changing phase; and, finally, (4) independent data symbols are sent across a channel introducing timing offset and noise; here, the timing offset changes in every received burst of data. We show that, in these environments, our receiver is able to offer gains when compared to the receivers in the current literature; our receiver gains in terms of performance, complexity, or both.
There exists a great potential for future research. The novel receiver introduced in this thesis can be applied to many other communication environments of practical interest.
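For reference, the Generalized Lloyd Algorithm named above alternates between assigning training vectors to their nearest codevector and replacing each codevector by the centroid of its cell. The Python sketch below shows only that generic iteration with squared-error distortion; how the thesis applies it to create the receiver's variables is not described in the abstract, and the names are illustrative.

```python
def nearest_index(codebook, vector):
    """Index of the codevector with the smallest squared error to vector."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2
                                 for c, v in zip(codebook[i], vector)))


def generalized_lloyd(training, codebook, iterations=20):
    """Refine a codebook with a fixed number of Lloyd iterations."""
    for _ in range(iterations):
        # Nearest-neighbour step: partition the training set into cells.
        cells = [[] for _ in codebook]
        for vector in training:
            cells[nearest_index(codebook, vector)].append(vector)
        # Centroid step: empty cells keep their previous codevector.
        codebook = [
            [sum(column) / len(cell) for column in zip(*cell)] if cell else code
            for cell, code in zip(cells, codebook)
        ]
    return codebook
```

The result depends on the initial codebook, so practical designs usually repeat the procedure from several starting points or grow the codebook by splitting.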
James H. Y. Loo
Intraframe and Interframe Coding of Speech Spectral Parameters
M.Eng. Thesis, September 1996
Supervisors: W.-Y. Chan and P. Kabal
Most low bit rate speech coders employ linear predictive coding (LPC) which models the short-term spectral information within each speech frame as an all-pole filter. In this thesis, we examine various methods that can efficiently encode spectral parameters for every 20 ms frame interval. Line spectral frequencies (LSF) are found to be the most effective parametric representation for spectral coding. Product code vector quantization (VQ) techniques such as split VQ (SVQ) and multi-stage VQ (MSVQ) are employed in intraframe where each frame vector is encoded independently from other frames. Depending on the product code structure, "transparent coding" quality is achieved for SVQ at 26-28 bits/frame and for MSVQ at 25-27 bits/frame.
Because speech is quasi-stationary, interframe coding methods such as predictive SVQ (PSVQ) can exploit the correlation between adjacent LSF vectors. Nonlinear PSVQ (NPSVQ) is introduced, in which a nonparametric and nonlinear predictor replaces the linear predictor used in PSVQ. Regardless of predictor type, PSVQ garners a performance gain of 5-7 bits/frame over SVQ. By interleaving intraframe SVQ with PSVQ, error propagation is limited to at most one adjacent frame. At an overall bit rate of about 21 bits/frame, NPSVQ can provide coding quality similar to that of intraframe SVQ at 24 bits/frame (an average gain of 3 bits/frame). The particular form of nonlinear prediction we use incurs virtually no additional encoding computational complexity. Voicing classification is used in classified NPSVQ (CNPSVQ) to obtain an additional average gain of 1 bit/frame for unvoiced frames. Furthermore, switched-adaptive predictive SVQ (SA-PSVQ) provides an improvement of 1 bit/frame over PSVQ, or 6-8 bits/frame over SVQ, but error propagation increases to 3-7 frames. We have verified our comparative performance results using subjective listening tests.
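The intraframe split VQ that serves as the baseline here can be sketched as follows: the parameter vector is cut into sub-vectors, each sub-vector is quantized with its own codebook, and the frame costs the sum of the index lengths. The Python fragment below is a minimal illustration with made-up codebooks and split sizes; it does not include the predictive or nonlinear variants.

```python
import math


def nearest_index(codebook, vector):
    """Index of the codevector closest to vector in squared error."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2
                                 for c, v in zip(codebook[i], vector)))


def svq_encode(vector, split_sizes, codebooks):
    """Quantize each sub-vector independently; returns one index per split."""
    indices, position = [], 0
    for size, codebook in zip(split_sizes, codebooks):
        indices.append(nearest_index(codebook, vector[position:position + size]))
        position += size
    return indices


def svq_decode(indices, codebooks):
    """Concatenate the selected codevectors back into a full vector."""
    return [value
            for index, codebook in zip(indices, codebooks)
            for value in codebook[index]]


def bits_per_frame(codebooks):
    """Total index length, assuming fixed-length codes for each split."""
    return sum(math.log2(len(codebook)) for codebook in codebooks)
```

A predictive variant would apply the same encoder to the difference between the current vector and a prediction formed from the previous decoded frame, which is where the reported bit savings come from.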
Chip Timing Recovery for Indoor Wireless Networks Employing Commutation Signalling
M.Eng. Project, July 1996
Supervisors: H. Leib and P. Kabal
This project considers chip timing recovery for indoor wireless networks employing Commutation Signalling. First, a general analysis is made, based on the Maximum A Posteriori Probability (MAP) concept, which leads to a synchronizer structure. Two cases are examined, one using a raised-cosine chip pulse and the other a half-sine chip pulse. It is shown that, once acquisition has taken place and the difference between the incoming waveform and the locally generated clock is therefore no more than half a chip duration, the synchronizer will lock onto and track the phase of the incoming waveform.
The various blocks are implemented and the system is simulated using Simulink, a system level simulator which is part of Matlab.