Representing Voiced Speech Using Prototype Waveform Interpolation for Low Rate Speech Coding

M.Eng. Thesis, October 1992

Supervisor: P. Kabal

In recent years, research in narrow-band digital speech coding has produced good-quality speech coders at low rates of 4.8 to 8.0 kb/s. This thesis examines prototype waveform interpolation (PWI), a method proposed by W. B. Kleijn for coding the voiced sections of speech efficiently, with the aim of achieving a coder below 4.8 kb/s while maintaining, or even improving, the perceptual quality of current coders.

In examining the PWI method, it was found that although the method generally works very well, there are occasional sections of the reconstructed voiced speech where audible distortion can be heard, even when the prototypes are not quantized. The research undertaken in this thesis focuses on the fundamental principles behind modeling voiced speech using PWI rather than on the bit allocation for encoding the prototypes. Problems in the PWI method are found that might have been dismissed as encoding error if full encoding had been implemented.

Kleijn uses PWI to represent the voiced sections of the excitation signal, which is the residual obtained after the removal of short-term redundancies by a linear predictive filter. The problem with this method is that when the PWI-reconstructed excitation is passed through the inverse filter to synthesize the speech, undesired effects occur due to the time-varying nature of the filter. The reconstructed speech may have undesired envelope variations which result in audible warble.

This thesis proposes an energy fix-up to smooth the synthesized speech envelope when the interpolation procedure fails to provide the smooth linear result that is desired. Further investigation, however, leads to the final proposal in this thesis that PWI should be performed on the clean speech signal instead of the excitation to achieve consistently reliable results for all voiced frames.
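The interpolation idea behind PWI can be illustrated with a minimal sketch. It is a simplification, not the procedure used in the thesis or by Kleijn: it assumes two prototypes of equal length that are already time-aligned, and it cross-fades linearly from one to the other, one pitch cycle at a time. The function name and parameters are hypothetical.

```python
def interpolate_prototypes(p0, p1, n_cycles):
    """Rebuild n_cycles pitch cycles by linear cross-fade from p0 to p1.

    Assumption: p0 and p1 are aligned prototype cycles of equal length.
    A full PWI coder would also align the prototypes and interpolate
    the pitch period; both are omitted here.
    """
    if len(p0) != len(p1):
        raise ValueError("prototypes must have the same length")
    out = []
    for c in range(n_cycles):
        # Weight moves from 0 (pure p0) to 1 (pure p1) across the cycles.
        w = c / (n_cycles - 1) if n_cycles > 1 else 0.0
        out.extend((1.0 - w) * a + w * b for a, b in zip(p0, p1))
    return out


# Usage: four cycles evolving from one prototype shape to another.
signal = interpolate_prototypes([0.0, 1.0, 0.0, -1.0], [0.0, 0.5, 0.0, -0.5], 4)
```

The first reconstructed cycle equals the first prototype and the last equals the second, so the envelope changes gradually in between.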

Shaping Multi-dimensional Signal Spaces

Ph.D. Thesis, May 1992

Supervisor: P. Kabal

In selecting the boundary of a signal constellation used for data transmission, the objective is to minimize the average energy of the set for a given number of points from a given packing. The reduction in average energy obtained by using the region C as the boundary instead of a hypercube is called the shape gain of C. The price to be paid for shaping is: (i) an increase in the factor CER (Constellation-Expansion-Ratio), (ii) an increase in the factor PAR (Peak-to-Average-Power-Ratio), and (iii) an increase in the addressing complexity. In this thesis, the structure of the region which optimizes the tradeoff between the shape gain and the CER, and also between the shape gain and the PAR, in a finite-dimensional space is found. Analytical expressions are derived for the optimum tradeoff. The optimum shaping region can be mapped to a hypercube truncated within a simplex. This mapping has properties which facilitate the addressing of the signal points. We introduce several addressing schemes with low complexity and good performance.

The concept of unsymmetrical shaping is discussed. This is the selection of the boundary of a constellation which has different values of power along different dimensions. The rate of the constellation is maximized subject to some constraints on its power spectrum. This spectral shaping also involves the selection of an appropriate basis (modulating waveform) for the space.

Finally, we discuss the selection of a signal constellation for signaling over a partial-response channel. In the continuous approximation, we introduce a method to select the nonempty dimensions. This method is based on minimizing the degradation caused by the channel memory. In the discrete case, shaping and coding depend on each other. In this case, a combined shaping and coding method is used. This concerns the joint selection of the shaping and coding to minimize the probability of symbol error.
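As a numerical illustration of shape gain, the sketch below evaluates the standard continuous-approximation result for an N-dimensional sphere compared with a hypercube of equal volume, gamma_s(N) = pi (N + 2) / (12 Gamma(N/2 + 1)^(2/N)). The sphere is used here only as a familiar reference region; it is not the truncated region derived in the thesis, and the function names are hypothetical.

```python
import math


def sphere_shape_gain(n):
    """Shape gain (linear ratio) of an n-sphere over an n-cube of equal volume.

    Uses lgamma so that large n does not overflow.
    """
    gamma_pow = math.exp((2.0 / n) * math.lgamma(n / 2.0 + 1.0))
    return math.pi * (n + 2) / (12.0 * gamma_pow)


def to_db(ratio):
    """Convert a power ratio to decibels."""
    return 10.0 * math.log10(ratio)


# Usage: tabulate the gain for a few dimensions.
for n in (1, 2, 4, 16, 64):
    print(n, round(to_db(sphere_shape_gain(n)), 3), "dB")
```

In one dimension the sphere and the cube coincide, so the gain is 1 (0 dB); in two dimensions the ratio is pi/3. As N grows, the gain approaches pi e / 6, about 1.53 dB.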

Low-Delay Speech Coding at 16 kb/s and Below

M.Eng. Thesis, September 1991

Supervisor: P. Kabal

Development of "network quality" speech coders at 16 kb/s and below is an active research area. This thesis focuses on the study of low-delay Code Excited Linear Predictive (CELP) and tree coders. A 16 kb/s stochastic tree coder based on the (M,L) search algorithm suggested by Iyengar and Kabal and a low-delay CELP coder proposed by AT&T (CCITT 16 kb/s standardization candidate) are examined. The first goal is to analyze the particular characteristics which make the two coders different from one another. The final goal is to improve the performance of the coders, particularly with a view to bringing the bit rate below 16 kb/s.

When compared under similar conditions, the two coders showed comparable performance at 16 kb/s. The analysis of the components and particularities of the tree and CELP coders provides new insight for future coders. Higher-performance coder components such as prediction, gain adaptation, and residual signal quantization are needed. Issues in backward adaptive linear prediction analysis for both near-sample and far-sample redundancy removal, such as analysis methods, windowing, ill-conditioning, quantization noise effects and computational complexity, are studied. Several new backward adaptive high-order methods show much better prediction gains than previously reported ones. Besides a better high-order predictor for both coders, suggestions to improve the performance of the coders include a new training scheme for the excitation dictionary and a better gain adaptation strategy for the tree coder. A hybrid "Tree-CELP" coder, taking the best components from the two archetypes, is a good candidate to push coding rates below 16 kb/s.

A Technique for Combining Equalization with Differential Detection

M.Eng. Thesis, August 1991

Supervisors: H. Leib and P. Kabal

A technique for combining equalization and differentially coherent detection is proposed for use in wireless communication when carrier phase recovery is difficult. A decision-feedback differentially coherent scheme, which generates an improved reference phase, is combined with a linear equalizer and the LMS algorithm is used to adapt the equalizer to an unknown channel. In addition, the proposed receiver is simulated for various two-dimensional signal constellations over multipath channels. It is shown that for high SNR, the degradation of this structure is negligible with respect to combined coherent detection and equalization. Therefore, this equalized differentially coherent detection scheme can be used when carrier phase tracking (i.e. coherent detection) is difficult and intersymbol interference is a major obstacle.
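To illustrate why differential detection helps when carrier phase recovery is difficult, here is a minimal sketch of plain differential detection for binary DPSK. It shows only the basic principle: it omits the decision-feedback reference improvement, the equalizer and the LMS adaptation described in the abstract, and the function names are hypothetical.

```python
import cmath


def dbpsk_encode(bits):
    """Differentially encode bits: a 1 flips the phase by pi, a 0 keeps it."""
    symbols = [1 + 0j]  # reference symbol
    for b in bits:
        symbols.append(symbols[-1] * (-1 if b else 1))
    return symbols


def differential_detect(received):
    """Decide each bit from the phase change between consecutive symbols.

    The product r[k] * conj(r[k-1]) cancels any constant phase offset,
    so no carrier phase estimate is needed.
    """
    return [
        1 if (cur * prev.conjugate()).real < 0 else 0
        for prev, cur in zip(received, received[1:])
    ]


# Usage: an unknown constant phase rotation does not affect the decisions.
bits = [1, 0, 1, 1, 0, 0, 1]
rotated = [s * cmath.exp(1.234j) for s in dbpsk_encode(bits)]
decoded = differential_detect(rotated)
```

Because the previous received symbol serves as the phase reference, its noise adds to the decision noise. A decision-feedback scheme that generates an improved reference phase, as in the abstract above, is aimed at this weakness.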

The Hidden Markov Filter Model: Applications for Automatic Speech Processing

M.Eng. Thesis, June 1991

Supervisor: P. Kabal

This thesis examines hidden Markov filter models and their applications in speech segmentation. A method of segmenting the speech waveform is proposed. This method uses the Baum-Welch reestimation algorithm applied to the hidden filter models. Since speech signals are handled at the sample level, the amount of computation needed is very large. We show how this issue can be dealt with effectively by using a staircase approach in the trellis calculations.

The hidden Markov filters are used to segment speech signals. Test results show very consistent locations of phone boundaries. The hidden model fits vocalic segments very well (with normalized prediction errors of less than 0.01), but performs less well on consonants (with normalized prediction errors of up to 0.3).

The speech segmentation by hidden filters is applied at the preprocessing stage of a large-vocabulary, speaker-dependent, isolated-word recognizer. The performance of the recognizer with and without the preprocessor is compared. The results show small improvements in the recognition accuracy.

Time-Scale Modification of Speech: A Time-Frequency Approach

M.Eng. Thesis, April 1991

Supervisor: P. Kabal

Time-scale modification (TSM) is a process whereby signals are compressed or expanded in time in a manner which preserves their original frequency characteristics. This work explores TSM algorithms for sampled speech. A known approach based on the short-time Fourier transform (STFT) is first reviewed, then modified to provide high-quality TSM of speech signals at a lower computational cost. The proposed algorithm resembles the sinusoidal speech model (SSM) based approach, yet incorporates new phase compensatory measures to prevent excessive structural deterioration of the time-scaled signal. In addition, a novel incremental scheme for modifying polar parameters results in substantial computational savings.

Joint Time Delay Estimation and Adaptive Filtering Techniques

Ph.D. Thesis, November 1990

Supervisor: P. Kabal

This thesis studies adaptive filters for the case in which the main input signal is not synchronized with the reference signal. The asynchrony is modeled by a time-varying delay. This delay has to be estimated and compensated. This is accomplished by designing and investigating joint delay estimation and adaptive filtering algorithms. First, a joint maximum likelihood estimator is derived for Gaussian input signals. It is used to define a readily implementable joint estimator, composed of an adaptive delay element and an adaptive filter. Next, two estimation criteria are investigated with that structure. The minimum mean squared error criterion is used with the joint steepest-descent adaptive algorithm. The general convergence conditions of the joint steepest-descent algorithm are derived. The joint LMS algorithm is analyzed in terms of joint convergence in the mean and in the mean square. Finally, a joint recursive least squares adaptive algorithm is investigated in conjunction with the exponentially weighted least squares criterion. Experimental results are obtained for these different adaptive algorithms in order to verify the analyses. The results show that the joint algorithms improve the performance of conventional adaptive filtering techniques.

Quantization of Predictor Coefficients in Speech Coding

M.Eng. Thesis, September 1990

Supervisor: P. Kabal

The purpose of this thesis is to examine techniques of efficiently coding Linear Prediction Coding (LPC) coefficients with 20 to 30 bits per 20 ms speech frame. Scalar quantization is the first approach evaluated. In particular, experiments with LPC quantizers using reflection coefficients and Line Spectral Frequencies (LSF's) are presented. Results in this work show that LSF's require significantly fewer bits than reflection coefficients for comparable performance. The second approach investigated is the use of vector-scalar quantization. In the first stage, vector quantization is performed. The second stage consists of a bank of scalar quantizers which code the vector errors between the original LPC coefficients and the components of the vector of the quantized coefficients.

The new approach in this work is to couple the vector and scalar quantization stages. Every codebook vector is compared to the original LPC coefficient vector to produce error vectors. The components of these error vectors are scalar quantized. The resulting vectors from the overall vector-scalar quantization are all compared to the input vector and the best one is selected. For practical implementations, methods of reducing the computational complexity are examined. The second innovation in vector-scalar quantization is the addition of a small adaptive codebook to the large fixed codebook. Frame-to-frame correlation of the LPC coefficients is exploited at no extra cost in bits. Simple methods of limiting the propagation of error inherent in this partially differential scheme are suggested.
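The coupled search described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes a uniform, clipped scalar quantizer with a hypothetical `step` and `levels`, a plain squared-error distortion, and an exhaustive search without the complexity reductions mentioned. A conventional two-stage search is included for comparison.

```python
def scalar_quantize(x, step, levels):
    """Uniform quantizer clipped to indices in [-levels, levels] (assumed design)."""
    idx = max(-levels, min(levels, round(x / step)))
    return idx * step


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def coupled_vsq(x, codebook, step, levels):
    """Try every codebook vector, scalar quantize its error, keep the best total."""
    best = None
    for i, c in enumerate(codebook):
        recon = [ci + scalar_quantize(xi - ci, step, levels) for xi, ci in zip(x, c)]
        d = sq_dist(x, recon)
        if best is None or d < best[0]:
            best = (d, i, recon)
    return best


def sequential_vsq(x, codebook, step, levels):
    """Conventional search: pick the nearest codebook vector, then quantize its error."""
    i = min(range(len(codebook)), key=lambda k: sq_dist(x, codebook[k]))
    c = codebook[i]
    recon = [ci + scalar_quantize(xi - ci, step, levels) for xi, ci in zip(x, c)]
    return sq_dist(x, recon), i, recon


# Usage with a toy two-dimensional codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.5, -0.5]]
print(coupled_vsq([0.9, 0.1], codebook, step=0.25, levels=1))
```

Because the coupled search evaluates every candidate, including the one the sequential search would pick, its distortion is never larger than the sequential result. The cost is one scalar quantization pass per codebook entry, which is why complexity reduction matters in practice.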

The results of this thesis show that the performance of the vector-scalar quantization with the use of the two new techniques introduced is better than that of the scalar coding techniques currently used in conventional LPC coders. The average spectral distortion is significantly reduced as is the number of outliers.

Bandwidth Efficient Filter Banks for Transmultiplexers

Ph.D. Thesis, September 1990

Supervisor: P. Kabal

This thesis addresses the problem of simultaneously transmitting several data signals across a single channel. For this purpose, a transmultiplexer that uses modulated filter banks is studied. Modulated filter banks comprise filters that are bandpass versions of a lowpass prototype. The filters serve to assign portions of the channel bandwidth to the data signals. The impulse responses of the filters are parameterized by a center frequency, delay and phase factor. The objectives in configuring modulated filter banks are to use the full channel bandwidth for transmission, cancel crosstalk between signals (which arises when signals share bandwidth) and cancel intersymbol interference in each data signal. Assuming an ideal channel, a synthesis procedure is developed by assigning a bandwidth to the lowpass prototype and deriving relationships among the center frequencies, delays and phases such that the entire channel bandwidth is utilized and crosstalk is cancelled. New design procedures for an FIR lowpass prototype are proposed such that the intersymbol interference is suppressed. One design method is based on a minimax criterion. Another approach involves an unconstrained optimization of an error function.

The synthesis procedure leads to five bandwidth efficient transmultiplexers. Three of the systems implement multicarrier Quadrature Amplitude Modulation (QAM) and two multicarrier Vestigial Sideband Modulation (VSB). The performance of the five systems is compared with filters obtained by the new design approaches. Also, the issue of channel distortion is addressed. Finally, the transmultiplexers can be converted into new subband systems.

Low-Rate Analysis-By-Synthesis Wideband Speech Coding

M.Eng. Thesis, August 1990

Supervisor: P. Kabal

This thesis studies low-rate wideband analysis-by-synthesis speech coders. The wideband speech signals have a bandwidth of up to 8 kHz and are sampled at 16 kHz, while the target operating bit rate is 16 kbits/sec. Applications for such a coder range from high-quality voice-mail services to teleconferencing. In order to achieve a low operating rate, the coding places more emphasis on the lower frequencies (0 to 4 kHz), while the higher frequencies (4 to 8 kHz) are coded less precisely but with little perceived degradation.

The study consists of three stages. First, aspects of wideband spectral envelope modeling using Line Spectral Frequencies (LSF's) are studied. Then, the underlying coder structure is derived from a basic Residual Excited Linear Predictive coder (RELP). This structure is enhanced by the addition of a pitch prediction stage, and by the development of full-band and split-band pitch parameter optimization procedures. These procedures are then applied to a Code Excited Linear Prediction (CELP) model. Finally, the performance of full-band and split-band CELP structures is compared.

Enhancement of Acoustically Reverberant Speech Using Cepstral Methods

M.Eng. Thesis, July 1990

Supervisors: M. L. Blostein and P. Kabal

Acoustical reverberation has been shown to degrade the intelligibility and naturalness of speech. In this thesis, we discuss the application of cepstral methods to the enhancement of acoustically reverberant speech.

We first study previously described cepstral techniques for removal of simple echoes from signals. Our results show that these techniques are not directly applicable to the enhancement of speech of indefinite extent. We next recast these techniques specifically for speech. We propose new segmentation and windowing strategies, in combination with cepstral averaging, to accurately identify the acoustical impulse response. We then consider inverse filtering based on an estimated acoustical impulse response, and find that finite impulse response filters designed according to the least mean square error criterion provide satisfactory performance. Finally, we synthesize and test an algorithm for enhancement of reverberant speech. Although significant difficulties remain, we feel that our methods offer a substantial contribution to the solution of the reverberant speech enhancement problem.
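The basic cepstral property used for simple echoes can be shown with a small sketch: a single echo at delay d adds a peak at quefrency d in the real cepstrum. The example is a toy under stated assumptions (a short synthetic signal, one echo, a direct O(N^2) DFT to keep it dependency-free). It does not reproduce the segmentation, windowing or averaging strategies of the thesis, and the function names are hypothetical.

```python
import cmath
import math


def real_cepstrum(x):
    """Real cepstrum: inverse DFT of the log magnitude of the DFT of x."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    log_mag = [math.log(abs(s)) for s in spectrum]  # assumes no spectral zeros
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
        for q in range(n)
    ]


def echo_delay(x, min_quefrency):
    """Locate the largest cepstral peak between min_quefrency and len(x) // 2."""
    c = real_cepstrum(x)
    return max(range(min_quefrency, len(x) // 2 + 1), key=lambda q: c[q])


# Usage: the short signal [1, 0.5] plus a half-amplitude echo 20 samples later.
x = [0.0] * 64
x[0], x[1], x[20], x[21] = 1.0, 0.5, 0.5, 0.25
print(echo_delay(x, min_quefrency=5))
```

The lower search limit skips the low-quefrency region, where the contribution of the signal itself is concentrated.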
