Technical Reports and Papers



Title:
Perceptual Features for Speech Recognition
Candidate: Serajul Haque
Supervisors: Roberto Togneri, Anthony Zaknich
Publication: UWA PhD Thesis, 2008
Contact: Serajul Haque < serajul[@]ee[.]uwa[.]edu[.]au >

Abstract:
Automatic speech recognition (ASR) is one of the most important research areas in the field of speech technology. It is also known as the recognition of speech by a machine or by some form of artificial intelligence. However, in spite of focused research in this field over the past several decades, robust speech recognition with high reliability has not been achieved, as performance degrades in the presence of speaker variability, channel mismatch and noisy environments. The superb ability of the human auditory system has motivated researchers to include features of human perception in the speech recognition process. This dissertation investigates the roles of perceptual features of human hearing in automatic speech recognition in clean and noisy environments. Methods of simplified synaptic adaptation and two-tone suppression by companding are introduced by temporal processing of speech using a zero-crossing algorithm. It is observed that a high frequency enhancement technique such as synaptic adaptation performs better in stationary Gaussian white noise, whereas a low frequency enhancement technique such as two-tone suppression performs better in non-Gaussian, non-stationary noise types. The effects of static compression on ASR parametrization are investigated as observed in the psychoacoustic input/output (I/O) perception curves. A frequency-dependent asymmetric compression technique, that is, higher compression in the higher frequency regions than in the lower frequency regions, is proposed. By asymmetric compression, degradation of the spectral contrast of the low frequency formants due to the added compression is avoided. A novel feature extraction method for ASR based on the auditory processing in the cochlear nucleus is presented.
Synchrony detection, average discharge (mean rate) processing and two-tone suppression are segregated and processed separately at the feature extraction level, following the differential processing scheme observed in the AVCN, PVCN and DCN, respectively, of the cochlear nucleus. It is further observed that improved ASR performance can be achieved by separating the synchrony detection from the synaptic processing. A time-frequency perceptual spectral subtraction method based on several psychoacoustic properties of human audition is developed and evaluated with an ASR front-end. An auditory masking threshold is determined based on these psychoacoustic effects. It is observed that in speech recognition applications, spectral subtraction utilizing psychoacoustics may be used for improved performance in noisy conditions. The performance may be further improved if masking of noise by the tonal components is augmented by spectral subtraction in the masked region.

Download PhD Thesis Adobe PDF (4,309K)


Title: Prosodic Features for a Maximum Entropy Language Model
Candidate: Oscar Chan
Supervisor: Roberto Togneri
Publication: UWA PhD Thesis, 2008
Contact: Oscar Chan < oscar[@]ee[.]uwa[.]edu[.]au >

Abstract:
A statistical language model attempts to characterise the patterns present in a natural language as a probability distribution defined over word sequences. Typically, such models are trained using word co-occurrence statistics from a large sample of text. In some language modelling applications, such as automatic speech recognition (ASR), the availability of acoustic data provides an additional source of knowledge. This contains, amongst other things, the melodic and rhythmic aspects of speech referred to as prosody. Although prosody has been found to be an important factor in human speech recognition, its use in ASR has been limited.
The goal of this research is to investigate how prosodic information can be employed to improve the language modelling component of a continuous speech recognition system. Because prosodic features are largely suprasegmental, operating over units larger than the phonetic segment, the language model is an appropriate place to incorporate such information. The prosodic features and standard language model features are combined under the maximum entropy framework, which provides an elegant solution to modelling information obtained from multiple, differing knowledge sources. We derive features for the model based on perceptually transcribed Tones and Break Indices (ToBI) labels, and analyse their contribution to the word recognition task.
While ToBI has a solid foundation in linguistic theory, the need for human transcribers conflicts with the statistical model’s requirement for a large quantity of training data. We therefore also examine the applicability of features which can be automatically extracted from the speech signal. We develop representations of an utterance’s prosodic context using fundamental frequency, energy and duration features, which can be directly incorporated into the model without the need for manual labelling. Dimensionality reduction techniques are also explored with the aim of reducing the computational costs associated with training a maximum entropy model. Experiments on a prosodically transcribed corpus show that small but statistically significant reductions to perplexity and word error rates can be obtained by using both manually transcribed and automatically extracted features.

Download PhD Thesis Adobe PDF (1,202K)


Title: Feature Extraction for Robust Speech Recognition in Hostile Environments
Candidate: Aik Ming Toh
Supervisors: Roberto Togneri, Sven Nordholm
Publication: UWA PhD Thesis, 2008
Contact: Aik Ming Toh < aikming[@]student[.]uwa[.]edu[.]au >

Abstract:
Speech recognition systems have improved in robustness in recent years with respect to both speaker and acoustical variability. Nevertheless, it is still a challenge to deploy speech recognition systems in real-world applications that are exposed to diverse and significant levels of noise. Robustness and recognition accuracy are the essential criteria in determining the extent to which a speech recognition system can be deployed in real-world applications.
This work involves the development of techniques and extensions to extract robust features from speech and achieve substantial performance in speech recognition. Robustness and recognition accuracy are the top concerns in this research. In this work, the robustness issue is approached through front-end processing, in particular robust feature extraction.
The author proposes a unified framework for robust features and presents a comprehensive evaluation of robustness in speech features. The framework addresses three distinct approaches: robust feature extraction, temporal information inclusion and normalization strategies. The author discusses the issue of robust feature selection primarily in the spectral and cepstral context. Several enhancements and extensions are explored for the purpose of robustness. These include a computationally efficient approach proposed for moment normalization. In addition, a simple back-end approach is incorporated to improve recognition performance in reverberant environments.
Speech features in this work are evaluated in three distinct environments that occur in real-world scenarios. The thesis also discusses the effect of noise on speech features and their parameters. The author has established that statistical properties play an important role in mismatches. The significance of the research is strengthened by the evaluation of robust approaches in more than one scenario and the comparison with the performance of state-of-the-art features. The contributions and limitations of each robust feature in all three environments are highlighted.
The novelty of the work lies in the diverse hostile environments in which speech features are evaluated for robustness. The author has obtained recognition accuracy of more than 98.5% for channel distortion. Recognition accuracy greater than 90.0% has also been maintained for reverberation time 0.4 s and additive babble noise at SNR 10 dB.
The thesis delivers comprehensive research on robust speech features for speech recognition in hostile environments, supported by significant experimental results. Several observations, recommendations and relevant issues associated with robust speech features are presented.

Download PhD Thesis Adobe PDF (8,483K)


Title:
Zero-Crossings with Adaptation for Automatic Speech Recognition
Authors: Serajul Haque, Roberto Togneri, Anthony Zaknich
Publication: Proceedings of PEECS2006, Perth, November 2006
Contact: Serajul Haque < serajul[@]ee[.]uwa[.]edu[.]au >

Abstract:
An auditory model based on zero-crossings with peak amplitudes (ZCPA) was used as a front-end for automatic speech recognition (ASR) with the perceptual property of adaptation as determined by psychoacoustic observations. The performance was evaluated on the isolated digits (TIDIGITS) database using a continuous density HMM recognizer in additive noise. Experimental results indicate that the ASR performance of the ZCPA may be improved with adaptation over the static baseline performance in white Gaussian and factory noise. The perceptual front-end was also evaluated with dynamic (delta and delta-delta) features added to the adaptation. It was observed that adaptation with dynamic features performed better in factory, babble and car noise over a wide range of SNR. The recognition performance was compared with the baseline MFCC. The performance of the dynamic ZCPA with adaptation was better than the dynamic MFCC in white Gaussian noise.
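As a rough illustration of the dynamic (delta and delta-delta) features mentioned above, the sketch below uses the common regression formula for delta coefficients. The function name `delta`, the window half-width of 2 frames and the edge padding are assumptions made for this example; the paper's own settings are not stated in the abstract.

```python
import numpy as np

def delta(features, width=2):
    """Regression-based delta of a (frames, dims) feature matrix.

    Uses d_t = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2), n = 1..width,
    with the first and last frames repeated at the edges (an assumption).
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + num_frames]
        behind = padded[width - n : width - n + num_frames]
        out += n * (ahead - behind)
    return out / denom

# Stack static, delta and delta-delta features for a toy 10-frame, 3-dim input.
static = np.random.default_rng(0).standard_normal((10, 3))
d1 = delta(static)
d2 = delta(d1)
dynamic = np.hstack([static, d1, d2])  # shape (10, 9)
```

Applying `delta` twice gives the delta-delta (acceleration) stream, so the stacked matrix has three times the static dimension.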

Download Paper Adobe PDF (245K)

Title: Combining MLLR Adaptation and Feature Extraction for Robust Speech Recognition in Reverberant Environments
Authors: Aik Ming Toh, Roberto Togneri, Sven Nordholm
Publication: Unpublished
Contact: Aik Ming Toh < aikming[@]ee[.]uwa[.]edu[.]au >

Abstract:
This paper presents an investigation of speech recognition performance in reverberant environments. Reverberant noise has been a major concern in speech recognition systems. Many speech recognition systems, even with state-of-the-art features, fail to respond to reverberant effects and the recognition rate deteriorates. This shows the limitations of robust feature extraction in reverberant environments. The maximum likelihood linear regression (MLLR) adaptation scheme is adopted for reverberant speech recognition on the TI-DIGIT database. The use of adaptation data improved the recognition performance significantly, especially for strong reverberations. The performance of both MFCC_0 and MFCC_0_D_A features improved by more than 10% for reverberation times greater than 0.4 s. This paper also demonstrates the optimal strength of both robust feature extraction and the adaptation scheme for reverberant speech recognition. The recognition performance is maintained above 90% up to a reverberation time of 0.5 s using both schemes.

Download Paper Adobe PDF (89K)

Title: Mel-Entropy Speech Features for Speech Recognition in Hostile Environments
Authors: 
Aik Ming Toh, Roberto Togneri
Publication:
Unpublished
Contact: Aik Ming Toh < aikming[@]ee[.]uwa[.]edu[.]au >

Abstract:
This paper presents an investigation of entropy features, used for voice activity detection, in the context of speech recognition. The concept of entropy shows that the voiced regions of speech have lower entropy since there are clear formants. The flat distribution of silence or noise would induce high entropy values. A novel extension of entropy features based on Mel-spaced filters which demonstrates superior performance in reverberant noise conditions is proposed. The performance of multi-band Mel-entropy features is investigated and evaluated on the TIDIGIT database in this paper. Furthermore, the entropy features are appended to baseline MFCC_0 features and evaluated on a wider range of possible noise scenarios. The results show that Mel-entropy features perform better than spectral entropy features both in the recognition performance and robustness in reverberant noise.

Download Paper Gzipped, PostScript (180K)

Title: Adaptation and Acoustic Matching Scheme for Speech Recognition in Reverberant Environments
Authors:
Roberto Togneri, Aik Ming Toh, Sven Nordholm
Publication:
Unpublished
Contact: Aik Ming Toh < aikming[@]ee[.]uwa[.]edu[.]au >

Abstract:
This paper presents an investigation of speech recognition performance in reverberant environments. Reverberant noise has been a major concern in speech recognition systems. Many speech recognition systems fail to respond to reverberant effects and the recognition rate deteriorates. The maximum likelihood linear regression (MLLR) adaptation scheme is adopted for reverberant speech recognition on the TI-DIGIT database. This work shows that acoustic matching and adaptation play a vital role in reverberant speech recognition. The use of adaptation data improved the recognition performance significantly, especially for strong reverberations. The performance of both MFCC_0 and MFCC_0_D_A features improved by more than 10% for reverberation times greater than 0.4 s. This paper also demonstrates the strength of both robust feature extraction and the adaptation scheme for robust speech recognition. The recognition performance is maintained above 90% up to a reverberation time of 0.5 s using both schemes. Adaptation schemes have been shown to be important for reverberant speech recognition.

Download Paper Gzipped, PostScript (44K)

Title: Zero Crossings with Peak Amplitudes and Perceptual Features for Robust Speech Recognition (2)
Authors:
Serajul Haque, Roberto Togneri, Anthony Zaknich
Publication:
Unpublished
Contact: Serajul Haque < serajul[@]ee[.]uwa[.]edu[.]au >

Abstract:
Several perceptually motivated properties of human hearing, as observed from psychoacoustic behavior, are evaluated for automatic speech recognition (ASR) on the zero crossings with peak amplitudes (ZCPA) model under white, factory and reverberant noise. Experimental results indicate that the ASR performance of the perceptual ZCPA may be significantly improved by the inclusion of a fourth order polynomial lowpass filter for modeling the loss of synchrony at higher frequencies. Some effects of temporal processing on the ZCPA are also investigated. Improved performance may be obtained by including RASTA and dynamic (delta-acceleration) processing with the ZCPA. We show that dynamic ZCPA significantly outperforms dynamic MFCC features in some noise conditions. The performance of the baseline MFCC is also investigated for adaptation under noise by combining it with the temporal information of the perceptual ZCPA.

Download Paper Adobe PDF (73K)

Title: Zero Crossings with Peak Amplitudes and Perceptual Features for Robust Speech Recognition (1)
Authors:
Serajul Haque, Roberto Togneri, Anthony Zaknich
Publication:
Unpublished
Contact: Serajul Haque < serajul[@]ee[.]uwa[.]edu[.]au >

Abstract:
It is known that certain properties of human speech perception are invariant or less affected by additive and reverberant noise. In this paper the zero crossings with peak amplitudes (ZCPA) model is evaluated for speech recognition with several perceptual properties of human hearing. Experimental results indicate that under white Gaussian noise, the maximum performance benefit is obtained by the inclusion of a low-pass filter module for high frequency synchrony reduction. It is shown that an improved recognition rate can be obtained by combining mean rate processing with the ZCPA. The proposed enhancement is more robust and performs well in noisy conditions compared to MFCC, PLP and EIH processing at moderate and low SNRs.

Download Paper Adobe PDF (51K)

Title: A Maximum Entropy Method for Language Modelling
Authors:
Oscar Chan, Roberto Togneri
Publication:
Proceedings of PEECS2005, Perth, September 2005, pp. 66-69
Contact: Oscar Chan < oscar[@]ee[.]uwa[.]edu[.]au >

Abstract:
The language models used for automatic speech recognition (ASR) are often based on very simple Markov models. This paper presents an overview of a more powerful modelling technique, Maximum Entropy (ME), and its application in language modelling. Preliminary results indicate that ME models are viable for this task, and perform slightly better than the traditional models.

Download Paper Adobe PDF (70K)

Title: A zero-crossing perceptual model for robust speech recognition
Authors: Serajul Haque, Roberto Togneri, Anthony Zaknich
Publication: Proceedings of PEECS2005, Perth, September 2005, pp. 60-65
Contact: Serajul Haque < serajul[@]ee[.]uwa[.]edu[.]au >

Abstract:
Traditional speech recognition systems based on linear prediction and spectral/cepstral analysis can only partially fulfil the speech recognition process, as their performance degrades severely under noise and environmental mismatch conditions. Alternatively, auditory models, based on the properties of human sound perception in the peripheral auditory system, depend on the time-frequency response of the basilar membrane, the neural activity pattern and other observed psychophysical properties of hearing. The disadvantages of these models are that they are computationally intensive and depend on several free parameters, such as zero-crossing level values, derivative window lengths and the number of frequency bins, which are frequently selected by trial and error. In this paper, an auditory model based on the zero crossing peak amplitude, which is reasonably low order, computationally efficient and implemented with a minimum choice of parameters, is presented. It is shown that for an isolated digit recognition task, improved performance can be obtained compared to the Mel Frequency Cepstral Coefficients (MFCC) and the Perceptual Linear Prediction (PLP) methods in the presence of additive Gaussian noise.
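The general idea behind a zero-crossing peak amplitude representation can be sketched as follows for a single band-limited signal: the interval between successive upward zero crossings gives a frequency estimate, and the peak amplitude within that interval weights the corresponding histogram bin. This is only an illustrative sketch, not the paper's implementation; the function name `zcpa_histogram`, the logarithmic peak weighting, the bin edges and the single-band setup (no cochlear filterbank) are all assumptions.

```python
import numpy as np

def zcpa_histogram(band_signal, sample_rate, bin_edges_hz):
    """Frequency histogram from upward zero-crossing intervals of one band.

    Each interval between successive upward zero crossings contributes
    log(1 + peak) to the bin containing sample_rate / interval_length.
    """
    x = np.asarray(band_signal, dtype=float)
    # Index of the first non-negative sample after each negative sample.
    up = np.flatnonzero((x[:-1] < 0) & (x[1:] >= 0)) + 1
    hist = np.zeros(len(bin_edges_hz) - 1)
    for start, end in zip(up[:-1], up[1:]):
        freq = sample_rate / (end - start)   # inverse of the interval duration
        peak = np.max(x[start:end])          # peak amplitude within the interval
        b = np.searchsorted(bin_edges_hz, freq, side="right") - 1
        if 0 <= b < len(hist):
            hist[b] += np.log1p(peak)
    return hist

# A 500 Hz tone sampled at 8 kHz should fall in the 250-750 Hz bin.
fs = 8000
t = np.arange(800) / fs
tone = np.sin(2 * np.pi * 500 * t + 0.3)
edges = np.array([0.0, 250.0, 750.0, 1500.0, 4000.0])
hist = zcpa_histogram(tone, fs, edges)
```

In a full front-end this would be repeated per filterbank channel and per frame, with the histograms summed across channels.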

Download Paper Adobe PDF (141K)

Title: Spectral Entropy as Speech Features for Speech Recognition
Authors: Aik Ming Toh, Roberto Togneri, Sven Nordholm
Publication: Proceedings of PEECS2005, Perth, September 2005, pp. 22-25
Contact: Aik Ming Toh < aikming[@]student[.]uwa[.]edu[.]au >

Abstract:
This paper presents an investigation of spectral entropy features, used for voice activity detection, in the context of speech recognition. The entropy is a measure of disorganization and it can be used to measure the peakiness of a distribution. We compute the entropy features from the short-time Fourier transform spectrum, normalized as a PMF. The concept of entropy shows that the voiced regions of speech have lower entropy since there are clear formants. The flat distribution of silence or noise would induce high entropy values. In this paper, we investigate the use of the entropy as speech features for speech recognition purpose. We evaluate different sub-band spectral entropy features on the TI-DIGIT database. We have also explored the use of multi-band entropy features to create higher dimensional entropy features. Furthermore, we append the entropy features to baseline MFCC_0 and evaluate them in clean, additive babble noise and reverberant environments. The results show that entropy features improve the baseline performance and robustness in additive noise.
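The full-band case of the entropy computation described above can be sketched as follows: the short-time power spectrum of a frame is normalized so that it sums to one (a PMF), and the Shannon entropy of that PMF is the feature. The function name `spectral_entropy`, the Hamming window, the FFT size and the small `eps` guard are assumptions for this example; the sub-band and multi-band variants evaluated in the paper are not reproduced here.

```python
import numpy as np

def spectral_entropy(frame, n_fft=256, eps=1e-12):
    """Shannon entropy (in nats) of a frame's normalized power spectrum."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2
    pmf = power / (power.sum() + eps)        # normalize the spectrum as a PMF
    return float(-np.sum(pmf * np.log(pmf + eps)))

# A peaky spectrum (pure tone) has lower entropy than a flat one (white noise).
fs = 8000
n = np.arange(200)
tone_frame = np.sin(2 * np.pi * 1000 * n / fs)
noise_frame = np.random.default_rng(0).standard_normal(200)
h_tone = spectral_entropy(tone_frame)
h_noise = spectral_entropy(noise_frame)
```

The comparison mirrors the abstract's argument: frames with clear spectral peaks score low, while silence or noise with a flat spectrum scores high.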

Download Paper Adobe PDF (56K)

Title: Blind source separation for isolation of motor unit potentials in electromyography
Authors:
David Knezevic, Roberto Togneri
Publication:
Unpublished
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
This paper considers the application of Blind Source Separation (BSS) to electromyographic (EMG) data as an approach to isolating individual motor unit potentials (MUPs). BSS was applied both to needle EMG (nEMG) and surface EMG (sEMG), and experimental results were obtained that demonstrate the effectiveness of this approach. BSS is proposed as a technique to be incorporated into EMG methodology to enable separation of individual MUPs from an EMG interference pattern.

Download Paper Adobe PDF (87K)


Title: Use of Neural Network Mapping and Extended Kalman Filter to Recover Vocal Tract Resonances from the MFCC Parameters of Speech
Authors:
Roberto Togneri, Li Deng
Publication:
Proceedings of ICSLP 2004, October 4-8, 2004, Jeju Island, Korea, No. WeB1201o.4, pp. 1201-1204
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
In this paper, we present a state-space formulation of a neural-network-based hidden dynamic model of speech whose parameters are trained using an approximate EM algorithm. The training makes use of the results of an off-the-shelf formant tracker (during the vowel segments) to simplify the complex sufficient statistics that would be required in the exact EM algorithm. The trained model, consisting of the state equation for the target-directed vocal tract resonance (VTR) dynamics on all classes of speech sounds (including consonant closure) and the observation equation for mapping from the VTR to acoustic measurement, is then used to recover the unobserved VTR using the Extended Kalman Filter. The results demonstrate accurate estimation of the VTRs, especially those during rapid consonant-vowel or vowel-consonant transitions and during consonant closure when the acoustic measurement alone provides weak or no information to infer the VTR values.

Download Paper Adobe PDF (305K)


Title: Convolutive Blind Signal Separation With Post-Processing 
Authors:
Siow Yong Low, Sven Nordholm, Roberto Togneri
Publication:
IEEE Transactions on Speech and Audio Processing, Vol. 12, No. 5, September 2004, pp. 539-548.
Contact:
Siow Yong Low < siowyong[@]watri[.]org[.]au >

Abstract:
A new subband based speech enhancement scheme is presented. It integrates spatial and temporal signal processing methods to enhance speech signals in a noisy environment. The approach makes use of the popular blind signal separation (BSS) to spatially separate the target signal from the interference. Due to the multipath/reverberant environment, BSS has a fundamental limitation in its separation quality. To overcome that, an adaptive noise canceller (ANC) is employed to perform further interference reduction. The reference for the ANC in this case is simply the interference dominant output from the BSS. A higher order statistical method is proposed for the selection of the reference signal. This post processing acts as a spectral decorrelator and experimental results show that even in the under-determined (more sources than elements) case, the structure offers impressive enhancement capability. Further, a remarkable improvement in recognition rate is registered when tested in automatic speech recognition (ASR).
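The adaptive noise canceller stage can be illustrated with a generic time-domain normalized LMS (NLMS) filter, in which a reference signal dominated by the interference is filtered and subtracted from the primary signal. This is a stand-in sketch only: the paper works in subbands and takes its reference from the BSS output, whereas the function name `nlms_cancel`, the filter length, the step size and the synthetic signals below are assumptions.

```python
import numpy as np

def nlms_cancel(primary, reference, taps=16, mu=0.5, eps=1e-8):
    """Subtract the part of `primary` that is predictable from `reference`.

    Returns the error signal of an NLMS adaptive filter, which serves as
    the enhanced output once the filter has converged.
    """
    primary = np.asarray(primary, dtype=float)
    reference = np.asarray(reference, dtype=float)
    w = np.zeros(taps)      # adaptive filter weights
    buf = np.zeros(taps)    # most recent reference samples, newest first
    out = np.zeros(len(primary))
    for n in range(len(primary)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        err = primary[n] - w @ buf                 # cancel the interference estimate
        w += mu * err * buf / (buf @ buf + eps)    # normalized LMS update
        out[n] = err
    return out

# Synthetic check: a weak tone buried in filtered noise, with the raw noise as reference.
rng = np.random.default_rng(0)
num = 4000
noise = rng.standard_normal(num)
interference = np.convolve(noise, [0.6, -0.3, 0.1])[:num]
target = 0.1 * np.sin(2 * np.pi * 0.01 * np.arange(num))
cleaned = nlms_cancel(target + interference, noise)
```

After convergence the residual interference in `cleaned` is much smaller than in the primary input, which is the role the abstract assigns to the post-processing stage.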

Download Paper Adobe PDF (307K)


Title: Spatio-Temporal Processing for Distant Speech Recognition
Authors: Siow Yong Low, Roberto Togneri, Sven Nordholm
Publication: Proceedings of ICASSP 2004, May 2004, Vol. 1, pp. 1001-1004.
Contact: Siow Yong Low < siowyong[@]watri[.]org[.]au >

Abstract:
A new subband based front-end processor for speech recognition is presented. It integrates both spatial and temporal signal processing methods to enhance noisy signals as a means to reduce the mismatch problem in speech recognition. The approach makes use of the popular blind signal separation (BSS) to spatially separate the target signal from the interference. Due to the multipath/reverberant environment, BSS has its fundamental limitation in the separation quality. To overcome that, an adaptive noise canceller (ANC) is employed to perform further interference reduction. Experimental results show that even in an adverse environment, the proposed structure improves the word recognition rate (WRR) by 70% for the connected digit recognition task.

Download Paper Adobe PDF (175K)


Title: Phonetic recognition using a statistical hidden dynamic model of speech
Authors:
Roberto Togneri
Publication:
Unpublished
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
This paper presents new results on evaluation of the statistical coarticulatory hidden dynamic model (HDM) on the TIMIT phone recognition task. We train both the HDM and baseline HMM on the complete TIMIT training data set and evaluate both systems using the N-best rescoring algorithm on the TIMIT test data set and the dr8 dialect subset. We show that with the inclusion of the reference transcription the HDM consistently outperforms the HMM for both 100-best+ref rescoring of the TIMIT test data and 1000-best+ref rescoring of the dr8 dialect subset with a reduction in the WER of between 3% and 6% in all cases. We also verify the plausibility of the HDM paradigm by comparing plots of the model output with the observation data vectors.

Download Paper Adobe PDF (94K)


Title: Joint State and Parameter Estimation for a Target-Directed Nonlinear Dynamic System Model
Authors:
Roberto Togneri, Li Deng
Publication:
IEEE Transactions on Signal Processing, Vol. 51, No. 12, December 2003, pp. 3061-3070
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
In this paper, we present a new approach to joint state and parameter estimation for a target-directed, nonlinear dynamic system model with switching states. The model is also called the hidden dynamic model (HDM), recently proposed for representing speech dynamics. The model parameters subject to statistical estimation consist of the target vector and the system matrix (also called the "time-constants"), as well as the parameters characterizing the non-linear mapping from the hidden state to the observation. These latter parameters are implemented in the current work as the weights of a three-layer feedforward multi-layer perceptron (MLP) network. The new estimation approach presented in this paper is based on the extended Kalman filter (EKF), and its performance is compared with the more traditional approach based on the expectation-maximisation (EM) algorithm. Extensive simulation results are presented for the proposed EKF-based algorithm and the EM algorithm under the typical conditions for employing the HDM for speech modeling. The results demonstrate superior convergence performance of the EKF-based algorithm compared with the EM algorithm, but the former suffers from excessive computational loads when adopted for training the MLP weights. In all cases, the simulation results show that the simulated model output converges to the given observation sequence. However, only in the case where the MLP weights or the target vector are assumed known do the time-constant parameters converge to their true values. We also show that the MLP weights never converge to their true values, thus demonstrating the many-to-one mapping property of the feed-forward MLP. We conclude from these simulation experiments that for the system to be identifiable, restrictions on the parameter space are needed.

Download Paper Adobe PDF (265K)


Title: An EKF-based Algorithm for Learning Statistical Hidden Dynamic Model Parameters for Phonetic Recognition
Authors:
Roberto Togneri, Li Deng
Publication:
International Conference on Acoustics, Speech and Signal Processing, May 2001, Vol. 1.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
This paper presents a new parameter estimation algorithm based on the Extended Kalman Filter (EKF) for the recently proposed statistical coarticulatory Hidden Dynamic Model (HDM). We show how the EKF parameter estimation algorithm unifies and simplifies the estimation of both the state and parameter vectors. Experiments based on N-best rescoring demonstrate superior performance of the (context-independent) HDM over a triphone baseline HMM in the TIMIT phonetic recognition task. We also show that the HDM is capable of generating speech vectors close to those from the corresponding real data.

Download Paper Adobe PDF (19K)


Title: A Robust Speech Understanding System Using Conceptual Relational Grammar
Authors:
Jiping Sun, Roberto Togneri, Li Deng
Publication:
Proceedings of the International Conference on Spoken Language Processing, October 2000, Vol. 2, pp. 879-882.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
We describe a robust speech understanding system based on our newly developed approach to spoken language processing. We show that a robust NLU system can be rapidly developed using a relatively simple speech recognizer to provide sufficient information for database retrieval by spoken language. Our experimental system consists of three components: a speech recognizer based on HMM, a natural language parser based on conceptual relational grammar and a data retrieval system based on the ATIS database. With the use of the robust parsing strategy, database query tasks can be successfully performed.

Download Paper Adobe PDF (77K)


Title: Evolution of Markovian Speech Models
Authors:
Raymond Low, Roberto Togneri
Publication:
CIIPS Technical Report (December 1999)
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
The Hidden Markov Model is a well understood technology for modelling speech, but it makes a number of assumptions making it non-optimal for the task. Alternative models like the trended HMM, variable duration HMM, conditionally Gaussian HMM, and stochastic segment model have been proposed in the literature, all of which modify the basic HMM architecture. This article will show how all the above models are related to and are generalisations of the HMM.

Download Paper Gzipped, PostScript (54K)


Title: Parameter Estimation of a Target-Directed Dynamic System Model with Switching States
Authors:
Roberto Togneri, Jeff Ma and Li Deng
Publication:
Signal Processing, Vol. 81, No. 5, April 2001, pp. 975-987
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
In this paper, we describe an implementation of the extended Kalman filter (EKF) for joint state and parameter estimation for a target-directed, switching state-space nonlinear system model and compare its performance with a maximum-likelihood parameter estimation procedure based on the Expectation-Maximisation (EM) algorithm. The model parameters consist of the target parameter and the time-constant parameter. Simulation results are presented for individual and joint estimation of all model parameters for both algorithms. The results show that both algorithms are able to converge to the true target parameter in the model, with the EKF approach exhibiting faster convergence. This is true even under the target-undershoot condition when the observation sequence is relatively short. However, convergence to the true time-constant parameter is not evident, possibly due to the non-unique nature of the parameter estimation problem. We also show empirically that in the case of joint estimation of the parameters, the EM algorithm diverges after a small number of iterations whereas the EKF approach gives more desirable convergence properties.

Download Paper Gzipped, PostScript (82K)


Title: Speech Recognition Using the Probabilistic Neural Network
Authors:
Raymond Low, Roberto Togneri
Publication:
Proceedings of the ICSLP98/SST98 Student Day, paper 645 (December 1998)
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
A novel technique for speaker independent automated speech recognition is proposed. We take a segment model approach to Automated Speech Recognition (ASR), considering the trajectory of an utterance in vector space, then classify using a modified Probabilistic Neural Network (PNN) and maximum likelihood rule. The system compares favourably with established techniques. Our system achieves in excess of 94% with isolated digit recognition, 88% with isolated alphabetic letters, and 83% with the confusable /e/ set. A favourable compromise between recognition accuracy and computer memory and speed can also be reached by performing clustering on the training data for the PNN.

Download Paper Adobe PDF (45K)


Title: Phoneme Based Vector Quantization in A Discrete HMM Speech Recognizer
Authors:
Yaxin Zhang, Roberto Togneri, Mike Alder
Publication:
IEEE Transactions on Speech and Audio Processing, Vol. 5, No. 1, January 1997, pp. 26-33.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
The quantization distortion of Vector Quantization (VQ) is a key element which affects the performance of a discrete hidden Markov modeling (DHMM) system. Many researchers have recognized this problem and have tried to use integrated features or multiple code books in their systems to offset the disadvantage of conventional VQ. However, the computational complexity of these systems is then significantly increased. Our investigations have shown that the speech signal space can be modeled as a mixture of Gaussian clusters which represent the phoneme data sets from male and female speakers. In this paper we propose an alternative VQ method in which the phoneme is treated as a cluster in the speech space and a Gaussian model is estimated for each phoneme. A Gaussian mixture model (GMM) is generated by the Expectation-Maximization (EM) algorithm for the whole speech space and used as a code book in which the codewords are Gaussian models representing certain acoustic features. An input utterance was classified as a certain phoneme or a set of phonemes based on the maximum likelihood of the trained models. A typical discrete HMM system was used for both phoneme and isolated word recognition. The results show that phoneme based Gaussian modeling vector quantization classifies the speech space more effectively and that significant improvements in the performance of the DHMM system have been achieved.
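
The quantization step described above, where each codeword is a Gaussian and a frame is assigned to the most likely one, might look roughly like the following. This is a minimal sketch assuming diagonal covariances and equal mixture weights; the names `log_gaussian` and `quantize` and the codebook values are hypothetical, and the EM training of the codebook is not shown.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def quantize(frame, codebook):
    """Return the index of the Gaussian codeword with the highest likelihood."""
    scores = [log_gaussian(frame, mean, var) for mean, var in codebook]
    return max(range(len(codebook)), key=scores.__getitem__)


# Toy codebook of (mean, variance) pairs; the values are invented.
codebook = [
    ((0.0, 0.0), (1.0, 1.0)),
    ((3.0, 3.0), (0.5, 0.5)),
    ((-2.0, 4.0), (1.0, 2.0)),
]
frames = [(0.2, -0.1), (2.8, 3.1), (-1.5, 3.5)]

# The resulting symbol sequence is what a discrete HMM would consume.
symbols = [quantize(f, codebook) for f in frames]
print(symbols)
```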

Download Paper Gzipped, PostScript (52K)


Title: Speaker Independent Recognition of Small Vocabulary
Authors:
Jason Chong, Roberto Togneri
Publication:
Sixth Australian International Conference on Speech Science and Technology, December 1996, Adelaide, Australia, pp. 479-484
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
This paper reports on the implementation of a real-time speaker independent isolated word speech recognition program on a PC Windows platform. The overall structure of the recognition engine is based on the Dynamic Time Warping (DTW) paradigm for computational efficiency. Furthermore, to decrease the recognition time and increase the recognition accuracy, the dictionary is limited to under 15 words. This severely restricts the vocabulary. To overcome this restriction, a new technique is introduced. Many dictionaries are linked in a hierarchical structure and each word in each dictionary will activate a new dictionary related to that word. This represents a basic form of language modelling which is suited for the menu driven interface found in many of today's applications. The results show that reasonable performance can be achieved by these methods.
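
The two ideas in this abstract, DTW template matching and dictionaries linked so that a recognised word activates the next dictionary, can be sketched as follows. This is an illustrative sketch only: the one-dimensional templates, the dictionary contents, and the names `dtw_distance`, `recognise` and `recognise_sequence` are assumptions, and the fallback to the top-level dictionary is a design choice of the sketch, not something stated in the paper.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frames being aligned.
            local = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def recognise(utterance, dictionary):
    """Return the word whose template is closest to the utterance under DTW."""
    return min(dictionary, key=lambda w: dtw_distance(utterance, dictionary[w]))


def recognise_sequence(utterances, dictionaries, start="main"):
    """Recognise each utterance against the currently active small dictionary."""
    active, words = start, []
    for utterance in utterances:
        word = recognise(utterance, dictionaries[active])
        words.append(word)
        # A recognised word activates its own dictionary if one exists;
        # otherwise this sketch returns to the top-level dictionary.
        active = word if word in dictionaries else start
    return words


# Hypothetical menu-style hierarchy with invented templates.
dictionaries = {
    "main": {"file": [(0.0,), (1.0,), (2.0,)], "edit": [(2.0,), (1.0,), (0.0,)]},
    "file": {"open": [(0.0,), (0.0,), (3.0,)], "save": [(3.0,), (0.0,), (0.0,)]},
    "edit": {"copy": [(1.0,), (1.0,), (1.0,)], "paste": [(2.0,), (2.0,), (0.0,)]},
}
spoken = [[(0.0,), (0.9,), (1.1,), (2.0,)], [(0.1,), (0.0,), (2.9,)]]
print(recognise_sequence(spoken, dictionaries))
```

Each lookup compares the utterance against only a handful of templates, so the per-word search cost stays small even though the total vocabulary across all dictionaries can be much larger.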

Download Paper Gzipped, PostScript (32K)


Title: Extraction of Speech Signal in the Presence of a Musical Note Signal Using the GRNN
Authors:
K. Chong, R. Togneri
Publication:
Sixth Australian International Conference on Speech Science and Technology, December 1996, Adelaide, Australia, pp. 629-634
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
This paper presents the methodology of extracting a speech signal in the presence of a musical note signal using the GRNN (General Regression Neural Network). An overview of GRNN is presented first, followed by preliminary simulations. Results of extracting speech in the presence of a flute and a cello note are also presented.

Download Paper Gzipped, PostScript (88K)


Title: Investigation of speech and speaker recognition based on trajectory modelling of utterances
Authors:
W. J. Tey, N. P. Jong, R. Togneri
Publication:
Sixth Australian International Conference on Speech Science and Technology, December 1996, Adelaide, Australia, pp. 133-137
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
We present in this paper a modelling technique used to capture the dynamic and temporal behaviour of transitions between phonemes. This model relies on the trajectory instead of the geometrical position of the observations in the parameter space. Transition based models provide an alternative method for acoustic-phonetic modelling of the speech signal. In our modelling technique, the trajectory is modelled by regression analysis of low-order polynomials followed by statistical clustering of these coefficients. This technique is used for both speech recognition and speaker recognition. Results on a small trial set of isolated alphabet sounds and speakers for both speech and speaker recognition are presented. The speech recognition rate using the trajectory model is found to be comparable to that of traditional HMM modelling. However, the poor results for speaker identification suggest that the current trajectory model is not suitable for this recognition task.
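
The regression step can be illustrated by fitting one low-order polynomial per feature dimension over normalised time, so that a whole segment is summarised by a small set of coefficients. The sketch below assumes a quadratic fit solved through the normal equations; the names `polyfit` and `trajectory_coefficients` and the synthetic frames are hypothetical, and the subsequent statistical clustering of the coefficients is not shown.

```python
def polyfit(times, values, order):
    """Least-squares polynomial coefficients, lowest power first."""
    n = order + 1
    # Normal equations A c = b for the polynomial design matrix.
    A = [[sum(t ** (i + j) for t in times) for j in range(n)] for i in range(n)]
    b = [sum(v * t ** i for t, v in zip(times, values)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(A[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (b[row] - tail) / A[row][row]
    return coeffs


def trajectory_coefficients(frames, order=2):
    """Fit one polynomial per feature dimension over time normalised to [0, 1]."""
    steps = len(frames)
    times = [i / (steps - 1) for i in range(steps)]
    dims = len(frames[0])
    return [polyfit(times, [f[d] for f in frames], order) for d in range(dims)]


# Synthetic segment: dimension 0 follows 1 + 2t, dimension 1 follows t^2.
frames = [(1 + 2 * t, t * t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
coeffs = trajectory_coefficients(frames)
print(coeffs)
```

Normalising time to [0, 1] means segments of different durations yield coefficient vectors of the same length, which is what makes them comparable in a later clustering stage.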

Download Paper Gzipped, PostScript (46K)


Title: A Geometric Interpretation of Hidden Markov Model
Authors:
Chee Wee Loke, Roberto Togneri
Publication:
Proceedings of the Fifth Australian International Conference on Speech Science and Technology, Perth, December 1994.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
In this paper, we investigate the relationship between speech trajectories and the hidden Markov model. The speech utterances were transformed into speech feature vectors and the trajectories displayed on a two dimensional space. The hidden Markov models were also displayed on a two dimensional space. By visual examination, each state appears to be associated with a distinct phoneme of the utterance. Therefore, the number of states required in the continuous HMM is related to the number of phonemes in the word to be modelled. In the semi-continuous HMM, it is also possible that the same Gaussian probability density function is shared by the same phoneme sound in different semi-continuous HMMs.

Download Paper Gzipped, PostScript (68K)


Title: A Comparison of PBDHMM and CHMM for Isolated Word Recognition
Authors:
Yaxin Zhang, Chee Wee Loke, Roberto Togneri
Publication:
Proceedings of the Fifth Australian International Conference on Speech Science and Technology, Perth, December 1994.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
Using a phoneme-based Gaussian mixture as a VQ codebook in a DHMM speech recognition system (PBDHMM) is an efficient way to improve system performance. This paper compares the performance of the PBDHMM system with that of the well known continuous HMM (CHMM) system on an isolated word recognition task. The results show that the PBDHMM system obtained better results than the CHMM system, especially for phoneme-distinct data.

Download Paper Gzipped, PostScript (26K)


Title: CDIGITS: A Large Isolated English Digit Database
Authors:
Yaxin Zhang, Mylene Pijpers, Roberto Togneri, Mike Alder
Publication:
Proceedings of the Fifth Australian International Conference on Speech Science and Technology, Perth, December 1994.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
This paper describes a large isolated English digit database which was designed for the training and evaluation of statistical algorithms and neural networks. 1108 speakers (575 males and 533 females) were recorded on the UWA campus in an office environment.

Download Paper Gzipped, PostScript (29K)


Title: Optimization of Phoneme-Based Codebook in a DHMM System
Authors:
Yaxin Zhang, Roberto Togneri, Chris deSilva, Mike Alder
Publication:
Proceedings of the Fifth Australian International Conference on Speech Science and Technology, Perth, December 1994.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
A phoneme-based Gaussian mixture VQ codebook can improve the conventional DHMM system performance significantly. In this paper, an optimization method for the phoneme-based VQ codebook is proposed. The experimental results show that the optimized phoneme-based VQ codebook leads to both an improvement in system performance and a reduction in system complexity.

Download Paper Gzipped, PostScript (78K)


Title: Speaker-Independent Isolated Word Recognition Using Multi-Hidden Markov Models
Authors:
Yaxin Zhang, Chris deSilva, Roberto Togneri, Mike Alder, Yianni Attikiouzel
Publication:
IEE Proceedings on Vision, Image and Signal Processing, Vol 141(3), June 1994, pp. 197-202.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
A multi-HMM speaker-independent isolated word recognition system is described. In this system, three vector quantization methods, the LBG algorithm, the EM algorithm, and a new MGC algorithm, are used for the classification of the speech space. These quantizations of the speech space are then used to produce three HMMs for each word in the vocabulary. In the recognition step, the Viterbi algorithm is used in the three sub-recognizers. The log probabilities of the observation sequences matching the models are multiplied by the weights determined by the recognition accuracies of individual sub-recognizers and summed to give the log probability that the utterance is of a particular word in the vocabulary. This multi-HMM system results in a reduction of about 50 per cent in the error rate in comparison with the single model system.
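
The score combination described above can be sketched as a weighted sum of the per-recogniser Viterbi log probabilities. The abstract says the weights are determined by the recognition accuracies of the sub-recognisers but does not give the exact formula, so normalising the accuracies to sum to one is an assumption of this sketch, as are the function names and all numbers.

```python
def combine_scores(log_probs, weights):
    """Weighted sum of per-recogniser log probabilities for each word.

    log_probs[r][word] is the Viterbi log probability from sub-recogniser r.
    weights[r] is the weight given to sub-recogniser r.
    """
    words = log_probs[0].keys()
    return {
        w: sum(wt * lp[w] for wt, lp in zip(weights, log_probs)) for w in words
    }


def recognise(log_probs, weights):
    """Return the word with the highest combined log probability."""
    combined = combine_scores(log_probs, weights)
    return max(combined, key=combined.get)


# Invented scores from three sub-recognisers (for example LBG, EM and MGC).
log_probs = [
    {"one": -120.0, "nine": -118.0},
    {"one": -110.0, "nine": -125.0},
    {"one": -115.0, "nine": -119.0},
]
# Assumed weighting: accuracies normalised to sum to one.
accuracies = [0.90, 0.95, 0.93]
weights = [a / sum(accuracies) for a in accuracies]
print(recognise(log_probs, weights))
```

In this toy case the first sub-recogniser alone would pick "nine", while the combined score picks "one", which shows how the weighted vote can overrule a single model.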

Download Paper Gzipped, PostScript (74K)


Title: An Improved Training Procedure for Speaker-independent Isolated word recognition
Authors:
Yaxin Zhang, Mike Alder
Publication:
Proceedings of 1994 International Symposium on Speech, Image Processing and Neural networks, April 1994, Hong Kong. pp. 722-725.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
This paper describes an improved training procedure in an HMM/VQ speech recognition system for speaker-independent speech recognition. The phoneme based Gaussian mixture models (GMM) were generated in the first step modeling using the Expectation-Maximization (EM) algorithm. These Gaussians more accurately describe the distribution characteristics of the phonemes in the speech signal space. Therefore better first step modeling is achieved and the performance of the whole recognition system is improved. The new method was used in speaker-independent isolated digit and phoneme recognition tasks. Two English databases were used for the training and testing. Significant improvements have been achieved in comparison with the conventional HMM/VQ system.

Download Paper Gzipped, PostScript (28K)


Title: Using Gaussian Mixture Modeling in Speech Recognition
Authors:
Yaxin Zhang, Mike Alder, Roberto Togneri
Publication:
Proceedings of ICASSP 1994, April 1994, Adelaide, Australia. pp. I613-616.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
This paper describes a speaker-independent isolated word recognition system which uses a well known technique, the combination of vector quantization with hidden Markov modeling. In this system, the conventional vector quantization algorithm is replaced by a statistical clustering algorithm, the Expectation-Maximization algorithm. Based on the investigation of the data space, the phonemes were manually extracted from the training data and were used to generate the Gaussians in a code book in which each code word is a Gaussian rather than a centroid vector of the data class. Word based hidden Markov modeling was then performed. Two English isolated digit databases were investigated and 12 Mel-spaced filter bank coefficients were employed as the input features. Compared with the conventional discrete HMM, our system obtained a significant improvement in recognition accuracy.

Download Paper Gzipped, PostScript (52K)


Title: Finding Structure in the Vowel Space
Authors:
Mylene Pijpers, Michael D. Alder and Roberto Togneri
Publication:
Proceedings of the First Australian and New Zealand Conference on Intelligent Information Processing Systems, Perth, Western Australia, 1-3 December 1993.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
Labelled speech waveform segments from 33 speakers were extracted from the TIMIT database. There were 8 categories of speech data, each representing a vowel sound, with 80 - 130 utterances in each category. The waveform segments were processed by taking an FFT on 32 msec frames and binning the result into 12 frequency bands, so that each frame is represented by 12 values. The frames become points in R^12 and each utterance is a short trajectory in R^12. The 8 vowel categories become 8 clusters of such trajectories. By projecting the clusters onto the screen of a SUN workstation, it was observed that each vowel cluster appeared to be substantially Gaussian. The covariance matrix and the centers were computed. Dimension estimates of each vowel by Principal Components Analysis show that the eight clusters lie close to a plane in the filterbank space, and that the principal axes of the vowel clusters make a small and consistent angle with respect to this plane. This confirms the results of Plomp et al. [4], who found the vowel space to be essentially 2 dimensional. Projecting the centroids of the vowel clusters onto this plane gives a representation of the vowels which is very similar to the configuration obtained by plotting the first and second formants, F1 and F2, of the vowels, and to the front-back and high-low diagram of vowels used in phonetics. This method may determine the vowel class more reliably than formant tracking procedures.
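
The front end described above, one spectrum per 32 msec frame reduced to 12 band values, might be sketched as below. Several details are assumptions, since the abstract does not give them: the bands are taken to be equal-width, no window function is applied, the DC bin is dropped, and a direct DFT stands in for the FFT to keep the sketch dependency-free. The names `power_spectrum` and `band_energies` and the test tone are hypothetical.

```python
import cmath
import math


def power_spectrum(frame):
    """Power at each non-negative frequency bin, computed with a direct DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def band_energies(frame, bands=12):
    """Sum the power spectrum into `bands` equal-width frequency bands."""
    spectrum = power_spectrum(frame)[1:]  # drop the DC bin
    width = len(spectrum) / bands
    energies = [0.0] * bands
    for i, power in enumerate(spectrum):
        energies[min(int(i / width), bands - 1)] += power
    return energies


sample_rate = 16000                      # Hz, the TIMIT sampling rate
frame_len = sample_rate * 32 // 1000     # 32 msec frame
tone_hz = 1000.0                         # invented test signal
frame = [
    math.sin(2.0 * math.pi * tone_hz * t / sample_rate) for t in range(frame_len)
]

# One frame becomes one point in a 12-dimensional space.
point = band_energies(frame)
print(len(point), point.index(max(point)))
```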

Download Paper Gzipped, PostScript (73K)


Title: The Effects of Scaling on Neural Network Classification
Authors:
Mylene Pijpers, Michael D. Alder and Roberto Togneri
Publication:
Proceedings of the First Australian and New Zealand Conference on Intelligent Information Processing Systems, Perth, Western Australia, 1-3 December 1993.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
Artificial Neural Networks are often looked upon as black boxes that can be used for classification tasks. Regarding the ANN as a simple tool to do a final classification, the research efforts tend to be concentrated on preprocessing stages, to improve the quality of the input to the neural network. One such preprocessor is the MSECT algorithm by Zahorian and Jagharghi [1]. It improves vowel classification. Since MSECT applies an affine transformation to the data, it is hard to see why this should make any difference to the end result. By implementing and testing the MSECT algorithm, using a simple backpropagation neural network as a tool or standard to measure the amount of neural network training needed to correctly classify two data clusters, we confirmed the results of Zahorian and Jagharghi [2]. The simple ANN we used to classify the vowel data was not that simple at all. The preprocessing algorithm changes not only the dimension but also the scale of the vowel data. To perform optimally on the original data and on the preprocessed data, the ANN would need different optimal parameters. But because the parameters of the ANN were not modified, this preprocessing could result in better results for one of the data sets. An experiment was done with different scalings of the same data sets. For the parameters of the ANN we used, the optimal results in terms of speed of convergence and accuracy were obtained for data scaled to have their input range between 5 and 18.

Download Paper Gzipped, PostScript (32K)


Title: A geometric interpretation of speech features
Authors:
Gareth Lee and Michael D. Alder
Publication:
Proceedings of the First Australian and New Zealand Conference on Intelligent Information Processing Systems, Perth, Western Australia, 1-3 December 1993.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
Previous publications have concentrated on graphically representing features produced from speech as spectrograms. We present a new geometric interpretation of speech features. This is implemented in the form of the `fview' program which has been placed in the public domain. We use `fview' to consider the problem of generating effective feature sequences.

Download Paper Gzipped, PostScript (64K)


Title: Using Gaussian Mixture Modeling for Phoneme Classification
Authors:
Yaxin Zhang, Mike Alder, Roberto Togneri
Publication:
Proceedings of the First Australian and New Zealand Conference on Intelligent Information Processing Systems, Perth, Western Australia, 1-3 December 1993.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
Phoneme recognition is a key element of a large-vocabulary speech recognition system. Recent reports have shown that the accuracies of speaker-independent English phoneme recognition are around 60% while in the speaker-dependent case the recognition accuracies are under 70%. This paper describes a new method in which Gaussian mixture modeling was employed for speaker-independent phoneme recognition. The Expectation-Maximization (EM) algorithm was used to generate the Gaussian mixture models (GMMs). Two English databases were used for both the system training and testing. The phonemes were manually extracted from 11 isolated digits (from zero to nine and oh). The testing results are higher than those of recent reports. Some related observations are also reported.

Download Paper Gzipped, PostScript (25K)


Title: A Multi-HMM Isolated Word Recognizer
Authors:
Yaxin Zhang, Christopher J. S. deSilva, Roberto Togneri, Mike Alder, and Yianni Attikiouzel
Publication:
Proceedings of the Fourth Australian Conference on Speech Science and Technology. In Brisbane, Australia, 1-3 December 1992, pp. 568-571.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
A multi-HMM speaker-independent isolated word recognition system is described. In this system, three vector quantization methods are used for the classification of speech space. This multi-HMM system results in a reduction of about 50 per cent in the error rate in comparison to the single model system.

Download Paper Gzipped, PostScript (21K)


Title: Affine Transformations of the Speech Space
Authors:
Mylene Pijpers and Michael D. Alder
Publication:
Proceedings of the Fourth Australian Conference on Speech Science and Technology. In Brisbane, Australia, 1-3 December 1992, pp. 124-129.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
The papers "Speaker Normalization of static and dynamic vowel spectral features" (J.A.S.A. 90, July 1991, pp 67-75) and "Minimum Mean-Square Error Transformations of Categorical Data to Target Positions" (IEEE Trans. Sig. Proc. 40, Jan 1992, pp 13-23) by Zahorian and Jagharghi describe an algorithm for transforming the space of speech sounds so as to improve the accuracy of classification. Classification was accomplished by both back-propagation neural nets and by a Bayesian Maximum Likelihood method on the model of each vowel class being specified by a Gaussian distribution. The transformation was an affine transformation obtained by choosing ideal `target' points for each cluster in a second space and minimising the mean square distance of the points in the speech space from the appropriate target. The speech space itself was a space of cepstral coefficients obtained from a Discrete Cosine Transform. These findings are remarkable, indeed almost unbelievable. The reason is that both the maximum likelihood classification on the Gaussian model, and the Neural Net classifier are essentially affine invariant. In the case where the transformation is invertible, this is clearly the case. When the transformation has a non-trivial kernel, it may happen that the classification gets worse, but it cannot get better. A back-propagation neural net in effect classifies by dividing the space into regions by means of hyperplanes. The Gaussian model does so by means of quadratic forms, with quadratic discrimination hypersurfaces. Projecting a hyperplane by any non-zero affine map which is onto the target space will usually give another hyperplane in the target space, and if the second separates points, so will the first. Conversely, if there is a solution in the target space, it can be pulled back to a solution in the domain space. It is not hard to show that similar considerations apply to the case where we use quadratic hypersurfaces.
In this paper, we attempt to account for the results of Zahorian and Jagharghi by investigating vowel data. We describe a simple projection algorithm which may be applied to high dimensional data to give a view on a computer screen of the data and of transformations of it.
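
The affine invariance argument for the Gaussian maximum likelihood classifier can be checked numerically in the simplest setting. The sketch below covers only the one-dimensional invertible case y = a x + b: refitting the Gaussians after the map shifts every class log likelihood by the same constant, -log|a|, so the decision does not change. The class labels, sample values, map parameters, and function names are all invented for illustration.

```python
import math


def fit_gaussian(samples):
    """Maximum likelihood mean and variance of a list of scalars."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var


def log_likelihood(x, model):
    """Log density of a scalar Gaussian model (mean, variance) at x."""
    mean, var = model
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def classify(x, models):
    """Return the label of the model with the highest log likelihood at x."""
    return max(models, key=lambda label: log_likelihood(x, models[label]))


def affine(x, a=3.0, b=-7.0):
    """An arbitrary invertible affine map (a must be non-zero)."""
    return a * x + b


# Invented scalar training data for two vowel classes.
classes = {"i": [0.8, 1.0, 1.3, 0.9], "a": [2.5, 3.1, 2.8, 3.4]}
original = {label: fit_gaussian(s) for label, s in classes.items()}
mapped = {
    label: fit_gaussian([affine(v) for v in s]) for label, s in classes.items()
}

tests = [0.5, 1.7, 2.0, 3.0]
before = [classify(x, original) for x in tests]
after = [classify(affine(x), mapped) for x in tests]
print(before == after)
```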

Download Paper Gzipped, PostScript (75K)


Title: A Comparison of the LBG, LVQ, MLP, SOM and GMM Algorithms for Vector Quantisation and Clustering Analysis
Authors:
R. Togneri, D. Farrokhi, Y. Zhang and Y. Attikiouzel
Publication:
Proceedings of The Fourth Australian International Conference on SST-92, Brisbane, 1-3 December 1992, pp 173-177.

Abstract:
We compare the performance of five algorithms for vector quantisation and clustering analysis: the Self-Organising Map (SOM) and Learning Vector Quantization (LVQ) algorithms of Kohonen, the Linde-Buzo-Gray (LBG) algorithm, the MultiLayer Perceptron (MLP) and the GMM/EM algorithm for Gaussian Mixture Models (GMM). We propose that the GMM/EM provides a better representation of the speech space and demonstrate this by comparing the GMM with the LBG, LVQ, MLP and SOM algorithms in phoneme classification and digit recognition.


Title: A HMM/EM Speaker-Independent Isolated Word Recognizer
Authors:
Yaxin Zhang, Christopher J. S. deSilva, Y. Attikiouzel, M. D. Alder
Publication:
Journal of Electrical and Electronics Engineering, Australia Vol. 12, No.4, Dec 1992, pgs 334-339.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
This paper presents a scheme of speaker-independent isolated word recognition in which Hidden Markov Modelling is used with Vector Quantization codebooks constructed using the Expectation-Maximization (EM) algorithm for Gaussian mixture models. In comparison with conventional vector quantization (the LBG algorithm), the EM algorithm results in a more reasonable clustering of the speech signal space. This is demonstrated by a higher recognition accuracy of the system. Three types of feature parameters of the speech signal were used as input data to the system. The effects of using codebooks of different size were also investigated. Finally, the amalgamation of small codebooks for individual words into a large codebook for the system is discussed.

Download Paper Gzipped, PostScript (35K)


Title: Parallel implementation of the Kohonen algorithm on Transputer
Authors:
Roberto Togneri and Yianni Attikiouzel
Publication:
Proc. IJCNN-91, Vol. 2, pgs. 1717-1722, Singapore, November 1991.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
In this paper a parallel implementation of the Kohonen algorithm is proposed using partitioning of the network. This allows an exact implementation of the Kohonen algorithm as opposed to partitioning the data. By using a simple routing strategy the parallel Kohonen algorithm was tested on a PC based transputer network without the need for any special distributed operating system. The execution time was measured for different sized networks and number of transputers. The execution time decreased as the number of transputers increased. However, for comparatively small sized neural networks the communication overhead caused the execution time to increase when more transputers were used. Thus, the proposed parallel implementation of the Kohonen algorithm is not suitable for massively parallel architectures.


Title: An Isolated Word Recognizer Using the EM Algorithm for Vector Quantization
Authors:
Zhang Yaxin, Christopher J. S. deSilva
Publication:
Proceedings of IREECON'91, Sydney, Australia, September 16-20, 1991. Pages 289-292.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
This paper presents a scheme of speaker-independent isolated word recognition in which Hidden Markov Modelling is used with Vector Quantization codebooks constructed using the Expectation-Maximization (EM) algorithm for Gaussian mixture models. In comparison with conventional vector quantization, the EM algorithm results in greater recognition accuracy.

Download Paper Gzipped, PostScript (27K)


Title: Speech Processing using Artificial Neural Networks
Authors:
Roberto Togneri, M.D. Alder, Yianni Attikiouzel
Publication:
Proceedings of the Third Australian International Conference on Speech Science and Technology, Melbourne, pp. 304-309, November 1990.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
A three layer perceptron network is used to classify the /i/ sound using isolated words from different speakers. A classification accuracy of 97% has been achieved. A map of phonemes is used to trace trajectories of utterances using the self-organising neural network. A crinkle factor is proposed which allows using the self-organising map to determine the inherent dimensionality of a set of points. By this technique speech data has been shown to possess an inherent dimensionality of at least four. A projection of the map and the speech data shows how the self-organising map fits the speech space.

Download Paper Gzipped, PostScript (18K)


Title: Parametrisation of the speech space using the self-organising neural network
Authors:
Roberto Togneri, M.D. Alder, Yianni Attikiouzel
Publication:
Proceedings of the Fourth Australian Joint Conference on Artificial Intelligence, Perth, pp. 274-283, November 1990.
Contact:
Roberto Togneri < roberto[@]ee[.]uwa[.]edu[.]au >

Abstract:
Speech recognition is a difficult problem due to the inability of current systems to cope with connected speech. Neural networks are able to learn some aspects of this task. An unsupervised learning scheme like the self-organising map can be used to both classify and order the speech sounds and provide a front end to higher level processing. A map of phonemes (phonotopic map) is used to trace trajectories of sounds from utterances. The self-organising map provides a means of reducing the inherent dimensionality of the speech data. A crinkle factor which is used to determine how close the dimensionality of the map is to the dimensionality of the speech input shows that speech has an inherent dimensionality of at least three or four. A projection of the map and the speech data shows how the self-organising map fits the speech space.

Download Paper Gzipped, PostScript (23K)