Telecommunications & Signal Processing Laboratory

Thesis Abstracts, 2002-2004

Sien Ruan

Lapped Transforms in Perceptual Coding of Wideband Audio

M.Eng. Thesis, December 2004

Supervisor: P. Kabal

Audio coding paradigms depend on time-frequency transformations to remove statistical redundancy in audio signals and reduce the data bit rate, while maintaining high fidelity of the reconstructed signal. Sophisticated perceptual audio coding further exploits perceptual redundancy in audio signals by incorporating perceptual masking phenomena. This thesis investigates different coding transformations that can be used to compute perceptual distortion measures effectively, among them the lapped transform, which is the most widely used transform in present-day audio coders. Moreover, an innovative lapped transform is developed whose overlap percentage can be varied to an arbitrary degree. The new lapped transform is applicable to transient audio, as it captures the time-varying characteristics of the signal.
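As an illustration of the lapped-transform idea in the abstract above, the following is a minimal sketch of the standard MDCT with a fixed 50% overlap and a sine window. It is not the variable-overlap transform developed in the thesis; the function names (`sine_window`, `mdct`, `imdct`, `analyze_synthesize`) and the choice of window are assumptions made for this example. It shows that overlap-adding the inverse-transformed blocks cancels the time-domain aliasing and recovers the input.

```python
import math

def sine_window(N):
    """Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def _basis(N, n, k):
    # MDCT cosine basis for block length 2N and coefficient index k.
    return math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))

def mdct(block, window):
    """Map one windowed block of 2N samples to N coefficients."""
    N = len(block) // 2
    return [sum(window[n] * block[n] * _basis(N, n, k) for n in range(2 * N))
            for k in range(N)]

def imdct(coeffs, window):
    """Map N coefficients back to 2N windowed, time-aliased samples."""
    N = len(coeffs)
    return [2.0 / N * window[n] * sum(coeffs[k] * _basis(N, n, k) for k in range(N))
            for n in range(2 * N)]

def analyze_synthesize(signal, N):
    """Run MDCT analysis and overlap-add synthesis with hop size N."""
    window = sine_window(N)
    tail = (-len(signal)) % N
    # Pad by N zeros on each side so every input sample is covered by two blocks.
    padded = [0.0] * N + list(signal) + [0.0] * (tail + N)
    output = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * N + 1, N):
        coeffs = mdct(padded[start:start + 2 * N], window)
        for n, y in enumerate(imdct(coeffs, window)):
            output[start + n] += y  # aliasing cancels between adjacent blocks
    return output[N:N + len(signal)]
```

Each block of 2N samples yields only N coefficients, so the transform is critically sampled despite the overlap. A coder would quantize the coefficients between `mdct` and `imdct`; that step is omitted here.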

Levent Tosun

Dynamically Adding Redundancy for Improved Error Concealment in Packet Voice Coding

M.Eng. Thesis, December 2004

Supervisor: P. Kabal

Data is sent in packets of bits over the Internet. However, packets may not arrive in order or in time for playout. Packet loss is a frequently encountered problem in Voice-over-IP (VoIP) applications. Modern speech coders use past information to decode current packets in order to reach very low bit-rates. Therefore, when a packet is lost, the effect of this packet loss propagates over several subsequent packets.

In this thesis, a new redundancy-based packet-loss-concealment scheme is presented. Many redundancy-based packet-loss-concealment schemes send a fixed amount of extra information about the current packet as part of the subsequent packet, but not every packet is equally important for packet loss concealment. We have developed an algorithm to determine the importance of packets and we propose that extra information should only be sent for the important packets. This provides a lower average bit-rate compared to sending the same amount of extra information for each and every packet. We use a linear prediction (LP) based speech coder (ITU-T G.723.1) as a test platform and we propose that only the excitation parameters should be sent as extra information since LP parameters of a frame can be estimated using the LP parameters of the previous frame. Furthermore, we propose that excitation parameters of an important frame that are sent as redundant information should be used in the reconstruction of the lost waveform; as a consequence, the states of the subsequent frame will also be updated.

Wei-shou Hsu

Robust Bandwidth Extension of Narrowband Speech

M.Eng. Thesis, November 2004

Supervisor: P. Kabal

Telephone speech often sounds muffled and thin due to its narrowband characteristics. With the increased availability of terminals capable of receiving wideband signals, extending the bandwidth of narrowband telephone speech at the receiver has drawn much research interest. Currently, there exist many methods that can provide good reconstructions of the wideband spectra from narrowband speech; however, they often lack robustness to different channel conditions, and their performances degrade when they operate in unknown environments.

This thesis presents a bandwidth extension algorithm that mitigates the effects of adverse conditions. The proposed system is designed to work with noisy input speech and unknown channel frequency response. To maximize the naturalness of the reconstructed speech, the algorithm estimates the channel and applies equalization to recover the attenuated bands. Artifacts are reduced by employing an adaptive postfilter and a fixed postfilter.

Subjective test results suggest that the proposed scheme is not affected by channel conditions and is able to produce speech with enhanced quality in adverse environments.

Colm Elliott

Stream Synchronization for Voice over IP Conference Bridges

M.Eng. Thesis, November 2004

Supervisor: P. Kabal

Potentially large network delay variability experienced by voice packets when travelling over IP networks complicates the design of a robust voice conference bridge. Intrastream synchronization is required at the conference bridge to smooth out network delay jitter on a given stream and provide a continuous stream of voice packets to the conference bridge core. Interstream synchronization is needed to provide time synchronization between packets in different streams, allowing for a mapping of selected voice streams to the conference bridge output and the creation of a periodic and synchronized output from the conference bridge.

This work presents the design and evaluation of a synchronized conference bridge that maps N input voice streams to M output voice streams representing selected speakers. A conference simulator, designed for this thesis, is used to characterize the performance of this bridge in terms of delay and packet loss, speaker selection accuracy and conference audio quality.

Alexander Wyglinski

Physical Layer Loading Algorithms for Indoor Wireless Multicarrier Systems

Ph.D. Thesis, November 2004

Supervisors: P. Kabal, F. Labeau

The demand for wireless networks has been growing rapidly over the recent past due to improved reliability, higher supported data rates, seamless connectivity between users and the access point, and low deployment costs relative to wireline infrastructure. This increase in demand started with the popular IEEE 802.11b wireless local area network standard. Many recent wireless network standards are now employing multicarrier modulation in their design. Multicarrier modulation reduces the system's susceptibility to the frequency-selective fading channel, due to multipath propagation, by transforming it into a collection of approximately flat subchannels. As a result, this makes it easier to compensate for the distortion introduced by the channel. However, standardized wireless modems, such as the ETSI HiperLAN/2 and the IEEE 802.11a standards, employ the same operating parameters across all subcarriers, and thus do not exploit all the advantages offered by the multicarrier framework.

This dissertation investigates techniques to further enhance system throughput performance by tailoring several operating parameters on a per-subcarrier basis. These parameters are subcarrier modulation schemes, power levels, and equalizer lengths. The idea of tailoring modulation schemes and power levels, known as bit allocation and power allocation, has been studied for many years and for many applications. This work proposes two novel discrete bit allocation algorithms that strive to reach the optimal solution with low computational complexity, while constrained to a specified error performance. A novel power allocation algorithm is proposed that satisfies regulatory requirements by obeying a frequency interval power constraint.

Investigation of the third parameter, subcarrier equalizer lengths, has not been conducted before in the literature. Two algorithms are proposed that vary the lengths of the subcarrier equalizers such that the overall distortion is reduced to some specified amount, while the number of equalizer taps used by the system is kept small. Finally, the use of bit allocation is extended to the case when multiple antennas are employed by the wireless modems. Four algorithms are proposed that perform generalized antenna selection diversity at both the transmitter and receiver, in tandem with discrete bit allocation. Results show that employing two transmit and two receive antennas with discrete bit allocation can achieve an average increase in throughput of up to 33% when compared to a system without bit allocation.
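To make the notion of discrete bit allocation concrete, here is a minimal sketch of a classical greedy bit-loading procedure under a total power budget. It is not one of the algorithms proposed in the dissertation, which are constrained by error performance. The function name `greedy_bit_loading`, the SNR-gap power model `gap * (2**b - 1) / gain`, and all parameter names are assumptions made for this example.

```python
import heapq

def greedy_bit_loading(gains, power_budget, gap=1.0, max_bits=6):
    """Assign bits one at a time to the subcarrier whose next bit is cheapest.

    gains:        per-subcarrier gain-to-noise ratios (hypothetical inputs)
    power_budget: total transmit power available
    gap:          SNR gap that sets the target error performance (assumed model)
    Power to carry b bits on subcarrier i is modelled as gap * (2**b - 1) / gains[i],
    so the increment from b to b + 1 bits costs gap * 2**b / gains[i].
    """
    bits = [0] * len(gains)
    used = 0.0
    # Heap of (incremental power for the next bit, subcarrier index).
    heap = [(gap / g, i) for i, g in enumerate(gains) if g > 0]
    heapq.heapify(heap)
    while heap:
        cost, i = heapq.heappop(heap)
        if used + cost > power_budget:
            break  # the cheapest remaining bit does not fit, so none will
        used += cost
        bits[i] += 1
        if bits[i] < max_bits:
            heapq.heappush(heap, (gap * 2 ** bits[i] / gains[i], i))
    return bits, used
```

For example, `greedy_bit_loading([8.0, 2.0, 0.5], 1.5)` loads the strongest subcarrier first and leaves the weakest one empty, which is the behaviour that a fixed per-subcarrier configuration cannot exploit.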

Sam Vakil

Gaussian Mixture Model Based Coding of Speech and Audio

M.Eng. Thesis, October 2004

Supervisor: P. Kabal

The transmission of speech and audio over communication channels has always required speech and audio coders with reasonable search and computational complexity and good performance relative to the corresponding distortion measure. This work introduces a coding scheme that operates in a perceptual auditory domain. The high-dimensional input frames of audio and speech are transformed to the power spectral domain using either the DFT or the MDCT. The log spectral vectors are then transformed to the excitation domain. In the quantizer section, the vectors are decorrelated with a DCT, which makes it possible to use diagonal covariance matrices in modelling the data. Finally, GMM-based VQ is performed on the vectors. The decoder performs the inverse operations. However, to prevent the inverse perceptual transformation from producing negative power spectrum elements, the decoder uses a nonnegative least squares algorithm instead of direct inversion to return to the frequency domain. For comparison, a reference subband-based "Excitation Distortion coder" is implemented; a comparison of the resulting coded files shows better performance for the proposed GMM-based coder.

Dorothy K. Okello

Resource Management in CDMA-based Satellite Systems

Ph.D. Thesis, April 2004

Supervisor: M. A. Kaplan

There is interest, supported by successful field trials, in the use of satellite communications at the Ka band (30/20 GHz) and beyond to meet emerging demand for broadband interactive multimedia services. The key advantages of operation at Ka band are availability of bandwidth and favorable implications for terminal size, cost and mobility. We study two problems related to bandwidth management of the uplink in a multibeam, CDMA-based, GEO satellite. Our focus is on the delivery of data services with rigid constraints on bit-error rate and elastic constraints on data rate.

The first of the two problems concerns the design of the coverage areas of the satellite beams. We were interested specifically in the adaptation of beam shape to inhomogeneity in the geographic distribution of the user population, and in the impact of beam shaping on the set of transmission rates that are compatible with prescribed constraints on transmission powers and signal-to-interference ratios. Assuming that the spatial distribution of users is known, we construct an algorithm which computes beam coverage regions to equilibrate the per-beam user populations. The impact on the set of feasible bit-rate allocations is quantified through numerical experiments. Comparison with uniform beam shapes suggests that the adaptive approach is superior in terms of the number of concurrent transmissions that can be supported.

The second problem concerns the allocation of bit rates in a setting where user bit-rate requirements are assumed defined by averages over moving windows of constant length. We use a frame-based channel model characterized by fading coefficients which, though statistically variable, are assumed known to the controller at the start of each frame. The implied temporal elasticity in quality-of-service provides opportunity to achieve economies in transmitted power. The value of such opportunity is quantified by comparison of two extreme cases. We develop an approximate system model which allows optimization of the rate allocation when the number of users is small, and a heuristic which is useful when the number of users is not small. The associated performance results confirm the inverse relationship between the per-bit energy required for transmission and the length of the averaging window.

Sanja Kovacevic

SOVA Based on a Sectionalized Trellis of Linear Block Codes

M.Eng. Thesis, January 2004

Supervisor: F. Labeau

The use of block codes is a well-known error-control technique for reliable transmission of digital information over noisy communication channels. However, a practically implementable soft-input soft-output (SISO) decoding algorithm for block codes is still a challenging problem. This thesis examines a new decoding scheme based on the soft-output Viterbi algorithm (SOVA) applied to a sectionalized trellis for linear block codes. The computational complexities of the new SOVA decoder and of the conventional SOVA decoder based on the bit-level trellis are theoretically analyzed and derived. These results are used to obtain the optimum sectionalization of a trellis for SOVA. The optimum sectionalizations of a trellis for the Maximum A Posteriori (MAP), Maximum Logarithm MAP (Max-Log-MAP), and Viterbi algorithms, and their corresponding computational complexities, are included for comparison. The results confirm that SOVA based on a sectionalized trellis is the most computationally efficient SISO decoder examined in this thesis. Simulation results for the bit error rate (BER) over an additive white Gaussian noise (AWGN) channel demonstrate that the BER performance of the new SOVA decoder is not degraded. The BER performance of SOVA used in a serially concatenated block code scheme reveals that the soft outputs of the proposed decoder are the same as those of the conventional SOVA decoder. Iterative decoding of serially concatenated block codes reveals that the quality of the reliability estimates of the proposed SOVA decoder is the same as that of the conventional SOVA decoder.

Khaled H. El-Maleh

Classification-Based Techniques for Digital Coding of Speech-plus-Noise

Ph.D. Thesis, January 2004

Supervisor: P. Kabal

With the increasing demand for wireless voice services and limited bandwidth resources, it is critical to develop and implement coding techniques which use spectrum efficiently. One approach to increasing system capacity is to lower the bit rate of telephone speech. A typical telephone conversation contains approximately 40% speech and 60% silence or background acoustic noise. A reduction of the average coding rate can be achieved by using a Voice Activity Detection (VAD) unit to distinguish speech from silence or background noise. The VAD decision can be used to select different coding modes for speech and noise or to discontinue transmission during speech pauses.

The quality of a telephone conversation using a VAD-based coding system depends on three major modules: the speech coder, the noise coder, and the VAD. Existing schemes for reduced-rate coding of background noise produce a signal that sounds different from the noise at the transmitting side. The frequent changes of the noise character between that produced during talk spurts (noise coded along with the speech) and that produced during speech pauses (noise coded at a reduced rate) are noticeable and can be annoying to the user.

The objective of this thesis is to develop techniques that enhance the output quality of variable-rate and discontinuous-transmission speech coding systems operating in noisy acoustic environments during the pauses between speech bursts. We propose novel excitation models for natural-quality reduced-rate coding of background acoustic noise in voice communication systems. A better representation of the excitation signal in a noise-synthesis model is achieved by classifying the type of acoustic environment noise. Class-dependent residual substitution is used at the receive side to synthesize a background noise that sounds similar to the background noise at the transmit side. The improvement in the quality of synthesized noise during speech gaps helps in preserving noise continuity between talk spurts and speech pauses, and enhances the overall perceived quality of a conversation.
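The role of the VAD unit described in this abstract can be illustrated with a minimal energy-based detector. This is only a sketch of the general idea, not the VAD or the noise classifier used in the thesis. The function names, the 6 dB margin, the smoothing factor, and the assumption that the first frame contains noise only are all choices made for this example.

```python
import math

def frame_energy_db(frame):
    """Mean-square energy of one frame in dB (small offset avoids log of zero)."""
    energy = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(energy + 1e-12)

def energy_vad(frames, margin_db=6.0, floor_smoothing=0.9):
    """Return one boolean per frame: True for speech, False for pause/noise.

    A frame is marked active when its energy exceeds a running noise-floor
    estimate by margin_db. The floor is updated only during inactive frames.
    Assumption: the first frame contains background noise only.
    """
    decisions = []
    noise_floor = frame_energy_db(frames[0])
    for frame in frames:
        level = frame_energy_db(frame)
        active = level > noise_floor + margin_db
        if not active:
            # Track slow changes in the background noise level.
            noise_floor = floor_smoothing * noise_floor + (1 - floor_smoothing) * level
        decisions.append(active)
    return decisions
```

In a variable-rate or discontinuous-transmission system, frames marked `False` would be handed to the reduced-rate noise coder, which is where the noise-synthesis quality discussed above becomes audible.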

Aziz Shallwani

An Adaptive Playout Algorithm with Delay Spike Detection for Real-Time VoIP

M.Eng. Thesis, October 2003

Supervisor: P. Kabal

As the Internet is a best-effort delivery network, audio packets may be delayed or lost en route to the receiver due to network congestion. To compensate for the variation in network delay, audio applications buffer received packets before playing them out. Basic algorithms adjust the packet playout time during periods of silence such that all packets within a talkspurt are equally delayed. Another approach is to scale individual voice packets using a dynamic time-scale modification technique based on the WSOLA algorithm.

In this work, an adaptive playout algorithm based on the normalized least mean square algorithm is improved by introducing a spike-detection mode to rapidly adjust to delay spikes. Simulations on Internet traces show that the enhanced bi-modal playout algorithm improves performance by reducing both the average delay and the loss rate as compared to the original algorithm.
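The bi-modal idea can be sketched as follows: a normalized LMS (NLMS) predictor tracks the network delay in normal mode, and a spike mode takes over when the delay jumps sharply. This is a simplified illustration under stated assumptions, not the algorithm evaluated in the thesis. The function name `nlms_playout`, the spike entry and exit rules, the fixed safety `margin`, and all default parameter values are hypothetical.

```python
def nlms_playout(delays, order=4, mu=0.5, spike_jump=50.0, margin=10.0, eps=1e-6):
    """Return the playout delay chosen for each packet, using earlier packets only.

    delays: observed one-way network delays per packet (e.g. in ms)
    Normal mode: playout delay = NLMS prediction + margin.
    Spike mode:  playout delay = most recent observed delay + margin.
    """
    weights = [1.0] + [0.0] * (order - 1)   # start by predicting the last delay
    history = [delays[0]] * order           # most recent delay first
    baseline = delays[0]                    # delay level just before a spike
    spike = False
    estimates = []
    for d in delays:
        predicted = sum(w * x for w, x in zip(weights, history))
        estimates.append((history[0] if spike else predicted) + margin)
        if not spike:
            if d - history[0] > spike_jump:
                # Sudden jump: enter spike mode and freeze the NLMS weights.
                spike = True
                baseline = history[0]
            else:
                # NLMS update: step size normalized by the input power.
                error = d - predicted
                norm = sum(x * x for x in history) + eps
                weights = [w + mu * error * x / norm
                           for w, x in zip(weights, history)]
        elif d <= baseline + margin:
            spike = False                   # delay has returned to its old level
        history = [d] + history[:-1]
    return estimates
```

The first packet of a spike still arrives later than its playout estimate, since the spike can only be detected once that packet is observed; the benefit of spike mode is that the following packets are scheduled against the elevated delay immediately rather than after the predictor slowly adapts.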

Christopher R. Cave

Perceptual Modelling for Low-Rate Audio Coding

M.Eng. Thesis, June 2002

Supervisor: P. Kabal

Sophisticated audio coding paradigms incorporate human perceptual effects in order to reduce data rates, while maintaining high fidelity of the reconstructed signal. Auditory masking is the phenomenon that is the key to exploiting perceptual redundancy in audio signals. Most auditory models conservatively estimate masking, as they were developed for medium to high rate coders where distortion can be made inaudible. At very low coding rates, more accurate auditory models will be beneficial since some audible distortion is inevitable. This thesis focuses on the application of human perception to low-rate audio coding. A novel auditory model that estimates masking levels is proposed. The new model is based on a study of existing perceptual literature. Among other features, it represents transient masking effects by tracking the temporal evolution of masking components. Moreover, an innovative bit allocation algorithm is developed that considers the excitation of quantization noise in the allocation process. The new adaptive allocation scheme is applicable with any auditory model that is based on the excitation pattern model of masking.

Mark Klein

Signal Subspace Speech Enhancement with Perceptual Post-Filtering

M.Eng. Thesis, January 2002 (2002-05-26 with corrections)

Supervisor: P. Kabal

See also: Demonstration & software

Speech enhancement blocks form a critical part of voice communications systems. Unfortunately, most enhancement schemes have difficulty eliminating noise from speech without introducing distortion or artefacts. Many of the disturbances originate from poor parameter estimation and interframe fluctuations.

This thesis introduces the Enhanced Signal Subspace (ESS) system to mitigate the above problems. Based on a signal subspace framework, ESS has been designed to attenuate disturbances while minimizing audible distortion.

Artefacts are reduced by employing an auditory post-filter to smooth the enhanced speech spectra. This filter performs averaging in a manner that exploits the properties of the human auditory system. As such, distortion of the underlying speech signal is reduced.

Testing shows that listeners prefer the proposed algorithm to traditional signal subspace speech enhancement.

Tarun Agarwal

Pre-Processing of Noisy Speech for Voice Coders

M.Eng. Thesis, January 2002

Supervisor: P. Kabal

Accurate Linear Prediction Coefficient (LPC) estimation is a central requirement in low bit-rate voice coding. Under harsh acoustic conditions, LPC estimation can become unreliable. This results in poor quality of encoded speech and introduces annoying artifacts. The purpose of this thesis is to develop and test a two-branch speech enhancement pre-processing system. This system consists of two denoising blocks. One block will enhance the degraded speech for accurate LPC estimation. The second block will increase the perceptual quality of the speech to be coded. The goals of this research are two-fold: to design the second block, and to compare the performance of other denoising schemes in each of the two branches. Test results show that the two-branch system can provide better perceptual quality of coded speech than conventional one-branch (i.e., one denoising block) speech enhancement techniques in many noisy environments.

Paxton J. Smith

Voice Conferencing over IP Networks

M.Eng. Thesis, January 2002

Supervisors: M. L. Blostein, P. Kabal

See also: Demonstration

Traditional telephone conferencing has been accomplished by way of a centralized conference bridge. An Internet Protocol (IP)-based conference bridge is subject to speech distortions and substantial computational demands due to the tandem arrangement of high compression speech codecs. Decentralized architectures avoid the speech distortions and delay, but lack strong control and have a key dependence on silence suppression for endpoint scalability. One solution is to use centralized speaker selection and forwarding, and decentralized decoding and mixing. This approach eliminates the problem of tandem encodings and maintains tight control, thereby improving the speech quality and scalability of the conference. This thesis considers design options and solutions for this model, and evaluates performance through live conferences with real conferees. Conferees found the speaker selection of the new conference model to be transparent, and strongly preferred the resulting speech quality to that of a centralized IP-based conference bridge.
