Abstract:
This paper considers the application of Blind Source Separation (BSS) to
electromyographic (EMG) data as an approach to isolating individual motor
unit potentials (MUPs). BSS was applied both to needle EMG (nEMG) and
surface EMG (sEMG), and experimental results were obtained that demonstrate
the effectiveness of this approach. BSS is proposed as a technique to be
incorporated into EMG methodology to enable separation of individual MUPs
from an EMG interference pattern.
Download Paper Adobe PDF (87K)
Abstract:
In this paper, we present a state-space formulation of a neural-network-based
hidden dynamic model of speech whose parameters are trained using an
approximate EM algorithm. The training makes use of the results of an
off-the-shelf formant tracker (during the vowel segments) to simplify the
complex sufficient statistics that would be required in the exact EM algorithm.
The trained model, consisting of the state equation for the target-directed
vocal tract resonance (VTR) dynamics on all classes of speech sounds
(including consonant closure) and the observation equation for mapping from
the VTR to acoustic measurement, is then used to recover the unobserved
VTR using an Extended Kalman Filter. The results demonstrate accurate
estimation of the VTRs, especially those during rapid consonant-vowel or
vowel-consonant transitions and during consonant closure when the
acoustic measurement alone provides weak or no information to infer
the VTR values.
Download Paper Adobe PDF (305K)
Abstract:
A new subband based speech enhancement scheme is presented. It
integrates spatial and temporal signal processing methods to enhance
speech signals in a noisy environment. The approach makes use of the
popular blind signal separation (BSS) to spatially separate the target
signal from the interference. Due to the multipath/reverberant
environment, BSS has its fundamental limitation in its separation
quality. To overcome that, an adaptive noise canceller (ANC) is
employed to perform further interference reduction. The reference for
the ANC in this case is simply the interference dominant output from
the BSS. A higher order statistical method is proposed for the
selection of the reference signal. This post processing acts as a
spectral decorrelator, and experimental results show that even in the
under-determined case (more sources than elements), the structure
offers impressive enhancement capability. Further, a remarkable
improvement in recognition rate is registered when tested in automatic
speech recognition (ASR).
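
The ANC stage described above can be illustrated with a time-domain normalised LMS (NLMS) noise canceller. This is a sketch only: the paper works in subbands and selects the reference with a higher order statistical method, neither of which is reproduced here. The function name `nlms_cancel`, the tap count, the step size and the synthetic signals are all assumptions made for illustration.

```python
import numpy as np

def nlms_cancel(primary, reference, n_taps=8, mu=0.1, eps=1e-8):
    """Subtract the part of `primary` that is predictable from `reference`."""
    primary = np.asarray(primary, dtype=float)
    reference = np.asarray(reference, dtype=float)
    w = np.zeros(n_taps)                 # adaptive filter weights
    out = np.zeros_like(primary)
    for n in range(len(primary)):
        # Most recent n_taps reference samples, newest first, zero-padded at the start.
        x = reference[max(0, n - n_taps + 1):n + 1][::-1]
        x = np.pad(x, (0, n_taps - len(x)))
        e = primary[n] - w @ x           # error signal = enhanced output sample
        w += mu * e * x / (x @ x + eps)  # normalised LMS weight update
        out[n] = e
    return out

# Synthetic stand-ins for the two BSS outputs (assumed, not from the paper):
# `primary` is target-dominant, `noise` is the interference-dominant reference.
rng = np.random.default_rng(0)
t = np.arange(4000)
speech = 0.5 * np.sin(2 * np.pi * 0.01 * t)
noise = rng.normal(size=t.size)
interference = np.convolve(noise, [0.6, -0.3, 0.1])[:t.size]
primary = speech + interference
enhanced = nlms_cancel(primary, noise)
```

Here the interference reaches the primary channel through a short assumed filter, and the canceller learns that filter from the reference alone.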
Abstract:
A new subband based front-end processor for speech
recognition is presented. It integrates both spatial and
temporal signal processing methods to enhance noisy signals
as a means to reduce the mismatch problem in speech
recognition. The approach makes use of the popular blind
signal separation (BSS) to spatially separate the target
signal from the interference. Due to the
multipath/reverberant environment, BSS has its fundamental
limitation in the separation quality. To overcome that, an
adaptive noise canceller (ANC) is employed to perform
further interference reduction. Experimental results show
that even in an adverse environment, the proposed structure
improves the word recognition rate (WRR) by 70%
for the connected digit recognition task.
Download Paper Adobe PDF (175K)
Abstract:
This paper presents new results on the evaluation of the statistical
coarticulatory hidden dynamic model (HDM) on the TIMIT phone recognition
task. We
train both the HDM and baseline HMM on the complete TIMIT training
data set and evaluate both systems using the N-best rescoring algorithm
on the TIMIT test data set and the dr8 dialect subset. We show that
with the inclusion of the reference transcription the HDM consistently
outperforms the HMM for both 100-best+ref rescoring of the TIMIT test
data and 1000-best+ref rescoring of the dr8 dialect subset with a
reduction in the WER of between 3% and 6% in all cases. We also
verify the plausibility of the HDM paradigm by comparing plots of
the model output with the observation data vectors.
Download Paper Adobe PDF (94K)
Abstract:
In this paper, we present a new approach to joint state and parameter
estimation for a target-directed, nonlinear dynamic system model with
switching states. The model is also called the hidden dynamic model (HDM)
recently proposed for representing speech dynamics. The model parameters
subject to statistical estimation consist of the target vector and the
system matrix (also called the "time-constants"), as well as the
parameters characterizing the non-linear mapping from the hidden state to
the observation. These latter parameters are implemented in the current
work as the weights of a three-layer feedforward multi-layer perceptron
(MLP) network. The new estimation approach presented in this paper is
based on the extended Kalman filter (EKF), and its performance is
compared with the more traditional approach based on the
expectation-maximisation (EM) algorithm. Extensive simulation results
are presented for the proposed EKF-based algorithm and the EM algorithm
under the typical conditions for employing the HDM for speech
modeling. The results demonstrate superior convergence performance of the
EKF-based algorithm compared with the EM algorithm, but the former
suffers from excessive computational loads when adopted for training the
MLP weights. In all cases, the simulation results show that the simulated
model output converges to the given observation sequence. However, only
in the case where the MLP weights or the target vector are assumed known
do the time-constant parameters converge to their true values. We also
show that the MLP weights never converge to their true values, thus
demonstrating the many-to-one mapping property of the feed-forward MLP.
We conclude from these simulation experiments that for the system to be
identifiable, restrictions on the parameter space are needed.
Download Paper Adobe PDF (265K)
Abstract:
This paper presents a new parameter estimation algorithm based on the
Extended Kalman Filter (EKF) for the recently proposed statistical
coarticulatory Hidden Dynamic Model (HDM). We show how the EKF parameter
estimation algorithm unifies and simplifies the estimation of both the
state and parameter vectors. Experiments based on N-best rescoring
demonstrate superior performance of the (context-independent) HDM over a
triphone baseline HMM in the TIMIT phonetic recognition task. We also
show that the HDM is capable of generating speech vectors close to those
from the corresponding real data.
Download Paper Adobe PDF (19K)
Abstract:
We describe a robust speech understanding system based on our newly
developed approach to spoken language processing. We show that a robust
NLU system can be rapidly developed using a relatively simple speech
recognizer to provide sufficient information for database retrieval by
spoken language. Our experimental system consists of three components: a
speech recognizer based on HMM, a natural language parser based on
conceptual relational grammar and a data retrieval system based on the
ATIS database. With the use of the robust parsing strategy, database query
tasks can be successfully performed.
Download Paper Adobe PDF (77K)
Abstract:
The Hidden Markov Model is a well understood technology for modelling
speech, but it makes a number of assumptions that make it non-optimal for the
task. Alternative models like the trended HMM, variable duration HMM,
conditionally Gaussian HMM, and stochastic segment model have been proposed
in the literature, all of which modify the basic HMM architecture. This
article will show how all the above models are related to and are
generalisations of the HMM.
Download Paper Gzipped, PostScript (54K)
Abstract:
In this paper, we describe an implementation of the extended Kalman filter
(EKF) for joint state and parameter estimation for a target-directed,
switching state-space nonlinear system model and compare its performance
with a maximum-likelihood parameter estimation procedure based on the
Expectation-Maximisation (EM) algorithm. The model parameters consist of
the target parameter and the time-constant parameter. Simulation results
are presented for individual and joint estimation of all model parameters
for both algorithms. The results show that both algorithms are able to
converge to the true target parameter in the model, with the EKF approach
exhibiting faster convergence. This is true even under the
target-undershoot condition when the observation sequence is relatively
short. However, convergence to the true time-constant parameter is not
evident, possibly due to the non-unique nature of the parameter estimation
problem. We also show empirically that in the case of joint estimation of
the parameters, the EM algorithm diverges after a small number of
iterations whereas the EKF approach gives more desirable convergence
properties.
Download Paper Gzipped, PostScript (82K)
Abstract:
A novel technique for speaker independent automated speech recognition is
proposed. We take a segment model approach to Automated Speech Recognition
(ASR), considering the trajectory of an utterance in vector space, then
classify using a modified Probabilistic Neural Network (PNN) and maximum
likelihood rule. The system compares favourably with established techniques.
Our system achieves recognition accuracy in excess of 94% on isolated digits,
88% on isolated alphabetic letters, and 83% on the confusable /e/ set.
A favourable compromise between recognition accuracy and computer memory
and speed can also be reached by performing clustering on the training data
for the PNN.
Download Paper Adobe PDF (45K)
Abstract:
The quantization distortion of Vector Quantization (VQ) is a key element which
affects the performance of a discrete hidden Markov modeling (DHMM)
system. Many researchers have recognized this problem and tried to use integrated
features or multiple code books in their systems to offset the disadvantage of
the conventional VQ. However, the computational complexity of these systems is
then significantly increased.
Our investigations have shown that the speech signal space can be modeled as a
mixture of Gaussian clusters which represent the phoneme data sets from male
and female speakers. In this paper we propose an alternative VQ method in which
the phoneme is treated as a cluster in the speech space and a Gaussian model is
estimated for each phoneme. A Gaussian mixture model (GMM) is generated by the
Expectation-Maximization (EM) algorithm for the whole speech space and used as
a code book in which the codewords are Gaussian models representing certain
acoustic features. An input utterance was classified as a certain phoneme or a
set of phonemes based on the maximum likelihood of the trained models. A
typical discrete HMM system was used for both phoneme and isolated word
recognition. The results show that phoneme based Gaussian modeling vector
quantization classifies the speech space more effectively and significant
improvements in the performance of the DHMM system have been achieved.
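
The idea of a code book whose codewords are Gaussian models can be sketched as follows: each frame is assigned the index of the Gaussian under which it is most likely. The sketch assumes the Gaussians have already been trained (the EM step is not shown) and uses diagonal covariances for brevity. The function name `quantise` and the toy parameters are assumptions, not details from the paper.

```python
import numpy as np

def quantise(frames, means, variances):
    """Map each frame to the index of its maximum-likelihood Gaussian codeword.

    frames: (T, D); means, variances: (K, D) for K diagonal Gaussians.
    """
    frames = np.asarray(frames, dtype=float)[:, None, :]      # (T, 1, D)
    means = np.asarray(means, dtype=float)                    # (K, D)
    variances = np.asarray(variances, dtype=float)            # (K, D)
    # Log density of every frame under every Gaussian, shape (T, K).
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)
                      + (frames - means) ** 2 / variances).sum(axis=2)
    return np.argmax(log_pdf, axis=1)

# Toy code book with two codewords in a 2-D feature space (assumed values).
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [4.8, 5.3], [0.4, 0.3]]
symbols = quantise(frames, means, variances)
```

The resulting symbol sequence is what a discrete HMM would then take as its observation sequence.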
Download Paper Gzipped, PostScript (52K)
Abstract:
This paper reports on the implementation of a real-time speaker independent
isolated word speech recognition program on a PC Windows platform. The overall
structure of the recognition engine is based on the Dynamic Time Warping (DTW)
paradigm for computational efficiency. Furthermore, to decrease the recognition
time and increase the recognition accuracy, the dictionary is limited to under
15 words. This severely restricts the vocabulary. To overcome this restriction,
a new technique is introduced. Many dictionaries are linked in a hierarchical
structure and each word in each dictionary will activate a new dictionary
related to that word. This represents a basic form of language modelling
which is suited for the menu driven interface found in many of today's
applications. The results show that reasonable performance can be achieved
by these methods.
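
As a rough illustration of DTW matching over linked dictionaries, the sketch below computes a DTW distance between feature sequences, picks the closest template in the active dictionary, and lets each recognised word activate its own dictionary. The function names, the Euclidean local cost, the `menus` layout and the toy templates are assumptions for illustration, not details of the reported system.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW distance between two sequences of feature vectors (T, D)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognise(utterance, dictionary):
    """Return the dictionary word whose template is closest under DTW."""
    return min(dictionary, key=lambda w: dtw_distance(utterance, dictionary[w]))

def recognise_path(utterances, menus, start="root"):
    """Recognise a sequence of words, switching dictionary after each word."""
    active, path = start, []
    for utt in utterances:
        word = recognise(utt, menus[active])
        path.append(word)
        if word in menus:          # a recognised word activates its own dictionary
            active = word
    return path

# Toy one-dimensional templates (assumed); real templates would be feature frames.
menus = {
    "root": {"file": [[0], [1], [2]], "edit": [[2], [1], [0]]},
    "file": {"open": [[5], [5], [6]], "save": [[9], [8], [9]]},
}
path = recognise_path([[[0], [0], [1], [2]], [[5], [6]]], menus)
```

Because only the active dictionary is searched at each step, the number of DTW comparisons per word stays small even when the overall vocabulary grows.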
Download Paper Gzipped, PostScript (32K)
Abstract:
This paper presents the methodology of extracting a speech signal in the
presence of a musical note signal using the GRNN (General Regression Neural
Network). An overview of GRNN is presented first, followed by preliminary
simulations. Results of extracting speech in the presence of a flute and
a cello note are also presented.
Download Paper Gzipped, PostScript (88K)
Abstract:
We present in this paper a modelling technique used to capture the dynamic
and temporal behaviour of transitions between phonemes. This model relies
on the trajectory instead of the geometrical position of the observations in
the parameter space. Transition based models provide an alternative method
for acoustic-phonetic modelling of the speech signal. In our modelling
technique, the trajectory is modelled by regression analysis of low-order
polynomials followed by statistical clustering of these coefficients.
This technique is used for both speech recognition as well as speaker
recognition. Results on a small trial set of isolated alphabet sounds and
speakers for both speech and speaker recognition are presented.
The speech recognition rate using the trajectory model is found comparable
to traditional HMM modelling. However, the poor results for the speaker
identification suggest that the current trajectory model is not
suitable for this recognition task.
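
The regression step can be sketched as a least-squares fit of one low-order polynomial per feature dimension over normalised time, with the coefficients then serving as the trajectory description. The function name `trajectory_coefficients`, the time normalisation to [0, 1] and the polynomial order are assumptions; the subsequent statistical clustering of the coefficients is not shown.

```python
import numpy as np

def trajectory_coefficients(frames, order=2):
    """Fit a polynomial of the given order to each feature dimension.

    frames: (T, D) sequence of feature vectors for one segment.
    Returns a flat vector of D * (order + 1) coefficients.
    """
    frames = np.asarray(frames, dtype=float)
    t = np.linspace(0.0, 1.0, len(frames))        # normalised time axis
    # np.polyfit accepts a 2-D y and fits each column independently.
    coeffs = np.polyfit(t, frames, order)         # shape (order + 1, D)
    return coeffs.T.ravel()

# Synthetic segment: dimension 0 follows t^2, dimension 1 follows 2t + 1.
t = np.linspace(0.0, 1.0, 10)
segment = np.column_stack([t ** 2, 2 * t + 1])
coefficients = trajectory_coefficients(segment, order=2)
```

Normalising time means segments of different durations map to coefficient vectors of the same length, which is what makes them comparable for clustering.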
Download Paper Gzipped, PostScript (46K)
Abstract:
In this paper, we investigate the relationship between speech trajectories
and the hidden Markov model. The speech utterances were transformed into
speech feature vectors and the trajectories displayed on a two dimensional
space. The hidden Markov models were also displayed on a two dimensional
space. By visual examination, each state appears to be
associated with a distinct phoneme of the utterance. Therefore, the number
of states required in the continuous HMM is related to the number of
phonemes in the word to be modelled. In the semi-continuous HMM, it is also
possible that the same Gaussian probability density function is shared by
the same phoneme sound in different semi-continuous HMMs.
Download Paper Gzipped, PostScript (68K)
Abstract:
Using a phoneme-based Gaussian mixture as the VQ codebook in a DHMM speech
recognition system (PBDHMM) is an efficient way to improve system
performance. This paper compares the performance of the PBDHMM system with
that of the well-known continuous HMM (CHMM) system on an isolated word
recognition task. The results show that the PBDHMM system obtained better
results than the CHMM system, especially for phoneme-distinct data.
Download Paper Gzipped, PostScript (26K)
Abstract:
This paper describes a large isolated English digit database which was
designed for the training and evaluation of statistical algorithms and
neural networks. 1108 speakers (575 males and 533 females) were recorded
on the UWA campus in an office environment.
Download Paper Gzipped, PostScript (29K)
Abstract:
A phoneme-based Gaussian mixture VQ codebook can improve the conventional
DHMM system performance significantly. In this paper, an
optimization method for the phoneme-based VQ codebook is proposed.
The experimental results show that the optimized phoneme-based VQ codebook
leads to both an improvement in system performance and a reduction in
system complexity.
Download Paper Gzipped, PostScript (78K)
Abstract:
A multi-HMM speaker-independent isolated word recognition system is described.
In this system, three vector quantization methods, the LBG algorithm, the EM
algorithm, and a new MGC algorithm, are used for the classification of the
speech space. These quantizations of the speech space are then used to
produce three HMMs for each word in the vocabulary. In the recognition step,
the Viterbi algorithm is used in the three sub-recognizers. The log
probabilities of the observation sequences matching the models are
multiplied by the weights determined by the
recognition accuracies of individual sub-recognizers and summed to give the
log probability that the utterance is of a particular word in the vocabulary.
This multi-HMM system results in a reduction of about 50 per cent in the
error rate in comparison with the single model system.
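
The score combination described above can be sketched as a weighted sum of per-recogniser log probabilities. The abstract states only that the weights are determined by the recognition accuracies of the individual sub-recognizers; normalising them to sum to one, the function name `combine_scores` and the example numbers are assumptions.

```python
import numpy as np

def combine_scores(log_probs, accuracies):
    """Combine sub-recogniser scores into one decision.

    log_probs: (n_recognisers, n_words) Viterbi log probabilities.
    accuracies: (n_recognisers,) recognition accuracy of each sub-recogniser.
    Returns the index of the winning word and the combined scores.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    weights = np.asarray(accuracies, dtype=float)
    weights = weights / weights.sum()     # assumption: weights sum to one
    combined = weights @ log_probs        # weighted sum over recognisers
    return int(np.argmax(combined)), combined

# Three sub-recognisers scoring a two-word vocabulary (made-up numbers).
log_probs = [[-10.0, -12.0],
             [-11.0, -9.0],
             [-10.0, -13.0]]
best_word, combined = combine_scores(log_probs, accuracies=[0.9, 0.6, 0.9])
```

In this toy case the less accurate second recogniser prefers the second word, but its lower weight means the combined decision follows the other two.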
Download Paper Gzipped, PostScript (74K)
Abstract:
This paper describes an improved training procedure for an HMM/VQ speech
recognition system for speaker-independent speech
recognition. Phoneme-based Gaussian mixture models (GMMs) were
generated in the first modeling step using
the Expectation-Maximization (EM) algorithm.
These Gaussians more
accurately describe the distribution characteristics of the phonemes in the
speech signal space. Therefore, better first-step modeling
is achieved and the performance of the whole recognition system is improved.
The new method was applied to speaker-independent isolated digit and
phoneme recognition tasks.
Two English databases were used for training and testing.
Significant improvements have been achieved in comparison with the
conventional HMM/VQ system.
Download Paper Gzipped, PostScript (28K)
Abstract:
This paper describes a speaker-independent isolated word recognition
system which uses a well known technique, the combination of vector
quantization with hidden Markov modeling. The conventional vector
quantization algorithm is replaced in this system by a statistical clustering
algorithm, the Expectation-Maximization algorithm.
Based on the investigation of the data space, the
phonemes were manually extracted from the training data and
were used to generate the Gaussians in a code book in which each
code word is a Gaussian rather than a centroid vector of the data class.
Word-based hidden Markov modeling was then performed.
Two English isolated digit databases were investigated, and 12
Mel-spaced filter bank coefficients were employed as the input features.
Compared with the conventional discrete HMM, our system obtained a significant
improvement in recognition accuracy.
Download Paper Gzipped, PostScript (52K)
Abstract:
Labelled speech waveform segments from 33 speakers were extracted from the
TIMIT database. There were 8 categories of speech data, each representing a
vowel sound. Each category contained 80 to 130 utterances. The waveform segments
were processed by taking an FFT on 32 msec frames and binning the result into
12 frequency bands. Each frame is thus represented by 12
values, which form a point in R^12, and each utterance is a short
trajectory in R^12. The 8 vowel categories become 8 clusters of such
trajectories. By projecting the clusters onto the screen of a SUN workstation,
it was observed that each vowel cluster appeared to be substantially Gaussian.
The covariance matrix and the centers were computed. Dimension estimates of
each vowel by Principal Components Analysis show that the eight clusters lie
close to a plane in the filterbank space, and that the principal axes of the
vowel clusters make a small and consistent angle with respect to this plane.
This confirms the results of Plomp et al. [4], who found the vowel space to be
essentially 2 dimensional. Projecting the centroids of the vowel clusters
onto this plane gives a representation of the vowels which is very similar to
the configuration we get when plotting the first and second formant, F1 and
F2, of the vowels and the front-back and high-low diagram of vowels used in
phonetics. This method may determine the vowel class more reliably
than Formant tracking procedures.
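
The front end described above (an FFT on 32 msec frames, binned into 12 frequency bands) can be sketched as follows. The abstract does not give the framing or binning details, so non-overlapping unwindowed frames, equal-width linear bands, summed power and the function name `filterbank_frames` are all assumptions; the 16 kHz default matches the TIMIT sampling rate.

```python
import numpy as np

def filterbank_frames(signal, sample_rate=16000, frame_ms=32, n_bands=12):
    """Turn a waveform into a trajectory of 12-dimensional band-energy vectors."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)        # 512 samples at 16 kHz
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # power spectrum per frame
    # Split the spectrum into n_bands contiguous groups of bins and sum each group.
    bands = np.array_split(power, n_bands, axis=1)
    return np.stack([b.sum(axis=1) for b in bands], axis=1)

# One second of a 440 Hz tone as a stand-in for a vowel segment.
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 440.0 * t)
features = filterbank_frames(tone)      # each row is one point of the trajectory
```

Each row of `features` is one frame in the 12-dimensional space, so an utterance becomes a short trajectory of such rows.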
Download Paper Gzipped, PostScript (73K)
Abstract:
Artificial Neural Networks are often looked upon as black boxes that can be
used for classification tasks. Regarding the ANN as a simple tool to do a
final classification, the research efforts tend to be concentrated on
preprocessing stages, to improve the quality of the input to the neural network.
One such preprocessor is the MSECT algorithm by Zahorian and Jagharghi [1].
It improves vowel classification. Since MSECT applies an affine transformation
to the data, it is hard to see why this should make any difference to the end
result. By implementing and testing the MSECT algorithm, using a simple
backpropagation neural network as a tool or standard to measure the amount of
neural network training needed to correctly classify two data clusters, we
confirmed the results of Zahorian and Jagharghi [2]. The simple ANN we used
to classify the vowel data was not that simple at all. The preprocessing
algorithm changes not only the dimension but also the scale of the vowel data.
To perform optimally on the original data and on the preprocessed data, the ANN
would need different optimal parameters. But because the parameters of the ANN
were not modified, this preprocessing could lead to better results for one of
the data sets. An experiment was done with different scalings of the same data
sets. For the parameters of the ANN we used, the optimal results in terms of
speed of convergence and accuracy were obtained for data scaled to have their
input range between 5 and 18.
Download Paper Gzipped, PostScript (32K)
Abstract:
Previous publications have concentrated on graphically representing features
produced from speech as spectrograms. We present a new geometric
interpretation of speech features. This is implemented in the form of the
`fview' program which has been placed in the public domain. We use `fview' to
consider the problem of generating effective feature sequences.
Download Paper Gzipped, PostScript (64K)
Abstract:
Phoneme recognition is a key component of large-vocabulary speech
recognition systems. Recent reports have shown that the accuracies of
speaker-independent English phoneme recognition are around 60% while in the
speaker-dependent case the recognition accuracies are under 70%. This paper
describes a new method in which Gaussian mixture modeling was employed for
speaker-independent phoneme recognition. The Expectation-Maximization (EM)
algorithm was used to generate the Gaussian mixture models (GMMs). Two
English databases were used for both the system training and testing. The
phonemes were manually extracted from 11 isolated digits (from zero to nine
and oh). The testing results are higher than those of recent reports. Some
related observations are also reported.
Download Paper Gzipped, PostScript (25K)
Abstract:
A multi-HMM speaker-independent isolated word recognition system is described.
In this system, three vector quantization methods are used for the classification
of speech space. This multi-HMM system results in a reduction of
about 50 per cent in the error rate in comparison to the single model system.
Download Paper Gzipped, PostScript (21K)
Abstract:
The papers "Speaker Normalization of static and dynamic vowel spectral features"
(J.A.S.A. 90, July 1991, pp 67-75) and "Minimum Mean-Square Error Transformations
of Categorical Data to Target Positions" (IEEE Trans. Sig. Proc. 40, Jan 1992,
pp 13-23) by Zahorian and Jagharghi describe an algorithm for transforming the
space of speech sounds so as to improve the accuracy of classification.
Classification was accomplished by both back-propagation neural nets and by a
Bayesian Maximum Likelihood method on the model of each vowel class being
specified by a gaussian distribution. The transformation was an affine
transformation obtained by choosing ideal `target' points for each cluster in
a second space and minimising the mean square distance of the points in the
speech space from the appropriate target. The speech space itself was a
space of cepstral coefficients obtained from a Discrete Cosine Transform.
These findings are remarkable, indeed almost unbelievable. The reason is that
both the maximum likelihood classification on the Gaussian model and the neural
net classifier are essentially affine invariant. When the transformation
is invertible, this is clearly the case. When the transformation has
non-trivial kernel, it may happen that the classification gets worse, but it
cannot get better. A back-propagation neural net in effect classifies by
dividing the space into regions by means of hyperplanes. The gaussian model
does so by means of quadratic forms, with quadratic discrimination
hypersurfaces. Projecting a hyperplane by any non-zero affine map which is
onto the target space will usually give another hyperplane in the target
space, and if the second separates points, so will the first. Conversely,
if there is a solution in the target space, it can be pulled back to a
solution in the domain space. It is not hard to show that similar
considerations apply to the case where we use quadratic hypersurfaces.
In this paper, we attempt to account for the results of Zahorian and Jagharghi
by investigating vowel data. We describe a simple projection algorithm which
may be applied to high dimensional data to give a view on a computer screen of
the data and of transformations of it.
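
The affine-invariance argument for the Gaussian classifier can be checked numerically. In the sketch below, a Gaussian is fitted to each class before and after an invertible affine map; every log-likelihood shifts by the same constant, -log|det A|, so the maximum likelihood decision cannot change. The synthetic data, the matrix `A` and the function name are assumptions chosen for illustration; this is not the projection algorithm of the paper.

```python
import numpy as np

def gaussian_log_likelihoods(train_sets, points):
    """Log-likelihood of each point under a Gaussian fitted to each class."""
    scores = []
    for data in train_sets:
        mean = data.mean(axis=0)
        cov = np.cov(data, rowvar=False)
        diff = points - mean
        # Mahalanobis distance of every point to this class.
        maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        _, logdet = np.linalg.slogdet(cov)
        scores.append(-0.5 * (maha + logdet + data.shape[1] * np.log(2 * np.pi)))
    return np.stack(scores, axis=1)       # shape (n_points, n_classes)

rng = np.random.default_rng(0)
classes = [rng.normal(loc=c, scale=1.0, size=(100, 3)) for c in (0.0, 2.0)]
test_points = rng.normal(loc=1.0, scale=1.5, size=(50, 3))

A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.1, 0.0, 1.5]])           # invertible (determinant 3.015)
b = np.array([1.0, -2.0, 0.5])

def affine(x):
    return x @ A.T + b

before = gaussian_log_likelihoods(classes, test_points)
after = gaussian_log_likelihoods([affine(c) for c in classes], affine(test_points))
shift = after - before                    # expected to equal -log|det A| everywhere
```

Since the shift is the same for every class, the ordering of the class scores at each point is preserved, which is the invariance claimed in the text for the invertible case.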
Download Paper Gzipped, PostScript (75K)
Abstract:
We compare the performance of five
algorithms for vector quantisation and clustering analysis: the
Self-Organising Map (SOM) and Learning Vector Quantization (LVQ)
algorithms of Kohonen, the Linde-Buzo-Gray (LBG) algorithm, the
MultiLayer Perceptron (MLP) and the GMM/EM algorithm for Gaussian
Mixture Models (GMM). We propose that the GMM/EM provides a better
representation of the speech space and demonstrate this by comparing
the GMM with the LBG, LVQ, MLP and SOM algorithms in phoneme classification
and digit recognition.
Abstract:
This paper presents a scheme of speaker-independent isolated word recognition in
which Hidden Markov Modelling is used with Vector Quantization codebooks
constructed using the Expectation-Maximization (EM) algorithm for Gaussian
mixture models. In comparison with conventional vector quantization (the LBG
algorithm), the EM algorithm results in a more reasonable clustering of the
speech signal space. This is demonstrated by a higher recognition accuracy of
the system. Three types of feature parameters of the speech signal were used
as input data to the system. The effects of using codebooks of different size
were also investigated. Finally, the amalgamation of small codebooks for
individual words into a large codebook for the system is discussed.
Download Paper Gzipped, PostScript (35K)
Abstract:
In this paper a parallel implementation of the Kohonen algorithm
is proposed using partitioning of the network. This allows an exact
implementation of the Kohonen algorithm as opposed to partitioning the data.
By using a simple routing strategy, the parallel Kohonen algorithm was
tested on a PC based
transputer network without the need for any special distributed operating
system. The execution time was measured for different sized networks and
number of transputers. The execution time decreased as the number of
transputers increased. However, for comparatively small sized neural
networks the communication overhead caused the execution time to
increase when more transputers were used. Thus, the proposed parallel
implementation of the Kohonen algorithm is not suitable for massively
parallel architectures.
Abstract:
This paper presents a scheme of speaker-independent isolated word recognition
in which Hidden Markov Modelling is used with Vector Quantization codebooks
constructed using the Expectation Maximization (EM) algorithm for Gaussian
mixture models. In comparison with conventional vector quantization, the EM
algorithm results in greater recognition accuracy.
Download Paper Gzipped, PostScript (27K)
Abstract:
A three layer perceptron
network is used to classify the /i/ sound using isolated words from
different speakers. A classification
accuracy of 97% has been achieved. A map of phonemes
is used to trace trajectories of utterances using the
self-organising neural network. A crinkle factor is proposed which
allows using the self-organising map to determine the inherent
dimensionality of a set of points. By this technique speech data has
been shown to possess an inherent dimensionality of at least four. A
projection of the map and the speech data shows how the self-organising
map fits the speech space.
Download Paper Gzipped, PostScript (18K)
Abstract:
Speech recognition is a difficult problem due to the inability of
current systems to cope with connected speech. Neural networks are able
to learn some aspects of this task. An unsupervised learning
scheme like the self-organising map can be used to both classify and
order the speech sounds and provide a front end to higher level
processing. A map of phonemes (phonotopic map) is used to trace
trajectories of sounds from utterances. The self-organising map provides a
means of reducing the inherent dimensionality of the speech data. A
crinkle factor which is used to determine how close the dimensionality
of the map is to the dimensionality of the speech input shows that
speech has an inherent dimensionality of at least three or four. A
projection of the map and the speech data shows how the
self-organising map fits the speech space.
Download Paper Gzipped, PostScript (23K)