Results 1 - 9 of 9
A Probabilistic Framework For Segment-Based Speech Recognition
, 2003
Cited by 108 (33 self)
Most current speech recognizers use an observation space based on a temporal sequence of measurements extracted from fixed-length "frames" (e.g., Mel-cepstra). Given a hypothetical word or sub-word sequence, the acoustic likelihood computation always involves all observation frames, though the mapping between individual frames and internal recognizer states will depend on the hypothesized segmentation. There is another type of recognizer whose observation space is better represented as a network, or graph, where each arc in the graph corresponds to a hypothesized variable-length segment that is represented by a fixed-dimensional "feature". In such feature-based recognizers, each hypothesized segmentation will correspond to a segment sequence, or path, through the overall segment-graph that is associated with a subset of all possible feature vectors in the total observation space. In this work we examine a maximum a posteriori decoding strategy for feature-based recognizers and develop a normalization criterion useful for a segment-based Viterbi or A* search. Experiments are reported for both phonetic and word recognition tasks.
A Probabilistic Framework For Feature-Based Speech Recognition
, 1996
Cited by 101 (24 self)
Most current speech recognizers use an observation space which is based on a temporal sequence of "frames" (e.g., Mel-cepstra). There is another class of recognizer which further processes these frames to produce a segment-based network, and represents each segment by fixed-dimensional "features." In such feature-based recognizers the observation space takes the form of a temporal network of feature vectors, so that a single segmentation of an utterance will use a subset of all possible feature vectors. In this work we examine a maximum a posteriori decoding strategy for feature-based recognizers and develop a normalization criterion useful for a segment-based Viterbi or A* search. We report experimental results for the task of phonetic recognition on the TIMIT corpus, where we achieved context-independent and context-dependent (using diphones) results on the core test set of 64.1% and 69.5% respectively.
Phoneme Probability Estimation with Dynamic Sparsely Connected Artificial Neural Networks
, 1997
Cited by 23 (1 self)
This paper presents new methods for training large neural networks for phoneme probability estimation. An architecture combining time-delay windows and recurrent connections is used to capture the important dynamic information of the speech signal. Because the number of connections in a fully connected recurrent network grows super-linearly with the number of hidden units, schemes for sparse connection and connection pruning are explored. It is found that sparsely connected networks outperform their fully connected counterparts with an equal number of connections. The implementation of the combined architecture and training scheme is described in detail. The networks are evaluated in a hybrid HMM/ANN system for phoneme recognition on the TIMIT database, and for word recognition on the WAXHOLM database. The achieved phone error rate, 27.8%, for the standard 39-phoneme set on the core test set of the TIMIT database is in the range of the lowest reported. All training and simulation softwar...
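The scaling argument in this abstract can be made concrete with a little arithmetic: if every hidden unit feeds every hidden unit, the recurrent connection count is the square of the layer width, whereas a fixed fan-in per unit grows only linearly. The sketch below is an illustration of that counting only. The function names, the fixed fan-in, and the random choice of sources are assumptions made for the example, not the paper's connection or pruning scheme.

```python
import random
from typing import List, Set


def full_recurrent_connections(hidden_units: int) -> int:
    """Recurrent connections when every hidden unit feeds every hidden unit."""
    return hidden_units * hidden_units


def sparse_recurrent_mask(hidden_units: int, fan_in: int, seed: int = 0) -> List[Set[int]]:
    """Hypothetical sparse layout: each unit receives `fan_in` random recurrent inputs.

    mask[j] is the set of source units connected to unit j.
    """
    if not 0 <= fan_in <= hidden_units:
        raise ValueError("fan_in must be between 0 and hidden_units")
    rng = random.Random(seed)
    return [set(rng.sample(range(hidden_units), fan_in)) for _ in range(hidden_units)]


def count_connections(mask: List[Set[int]]) -> int:
    """Total number of recurrent connections described by a mask."""
    return sum(len(sources) for sources in mask)
```

Under these assumptions, doubling the layer width quadruples the fully connected count but only doubles the sparse count, which is the budget trade-off the abstract alludes to.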
Probabilistic Segmentation for Segment-Based Speech Recognition
, 1998
Cited by 19 (3 self)
Segment-based speech recognition systems must explicitly hypothesize segment start and end times. The purpose of a segmentation algorithm is to hypothesize those times and to compose a graph of segments from them. During recognition, this graph is an input to a search that finds the optimal sequence of sound units through the graph. The goal of this thesis is to create a high-quality, real-time phonetic segmentation algorithm for segment-based speech recognition. A high-quality segmentation algorithm produces a sparse network of segments that contains most of the actual segments in the speech utterance. A real-time algorithm implies that it is fast, and that it is able to produce an output in a pipelined manner. The approach taken in this thesis is to adopt the framework of a state-of-the-art algorithm that does not operate in real-time, and to make the modifications necessary to enable it to run in real-time. The algorithm adopted as the starting point for this work makes use of a for...
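The search over a segment graph described in this abstract can be sketched as a dynamic program over boundary times. The example below is a minimal illustration, not the thesis's algorithm: the arc format, the integer boundary indices, the labels, and the log scores are all assumptions, and the scores are taken to be already comparable across paths of different lengths.

```python
from typing import Dict, List, Tuple

# Hypothetical arc format: (start_boundary, end_boundary, label, log_score).
Arc = Tuple[int, int, str, float]


def best_path(arcs: List[Arc], start: int, end: int) -> Tuple[float, List[str]]:
    """Return (score, labels) of the highest-scoring path from `start` to `end`.

    Boundaries are integers ordered in time and every arc must go forward in
    time, so the graph is acyclic and arcs can be relaxed in end-boundary order.
    """
    for s, e, _, _ in arcs:
        if s >= e:
            raise ValueError("every arc must go forward in time")

    best: Dict[int, Tuple[float, List[str]]] = {start: (0.0, [])}
    # All arcs ending at s are processed before any arc that starts at s,
    # because an arc starting at s ends strictly later than s.
    for s, e, label, score in sorted(arcs, key=lambda arc: arc[1]):
        if s not in best:
            continue  # start boundary is not reachable from `start`
        candidate = best[s][0] + score
        if e not in best or candidate > best[e][0]:
            best[e] = (candidate, best[s][1] + [label])

    if end not in best:
        raise ValueError("no path from start to end")
    return best[end]


# Toy graph with three competing segmentations of the same utterance.
example_arcs: List[Arc] = [
    (0, 1, "s", -1.0),
    (1, 3, "ih", -2.0),
    (0, 2, "sh", -1.5),
    (2, 3, "iy", -0.5),
    (1, 2, "x", -3.0),
]
```

A sparser graph means fewer arcs to relax in the loop, which is why the abstract treats sparsity as a quality criterion for the segmentation stage.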
Gaussian mixture models of phonetic boundaries for speech recognition
- In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
, 2001
Cited by 4 (1 self)
A new approach to representing temporal correlation in an automatic speech recognition system is described. It introduces an acoustic feature set that captures the dynamics of the speech signal at phoneme boundaries, in combination with the traditional acoustic feature set representing the periods of speech that are assumed to be quasi-stationary. This newly introduced feature set represents an observed random vector associated with the state transition in the HMM. For the same complexity and number of parameters, this approach improves phoneme recognition accuracy by 3.5% compared to context-independent HMM models. Stop consonant recognition accuracy is increased by 40%.
Efficient high-order hidden Markov modelling
- in Proceedings of the International Conference on Spoken Language Processing
, 1998
Cited by 1 (0 self)
Currently, first-order hidden Markov models (HMMs) form the backbone around which most automatic speech processing applications are built. Their higher-order extensions are known to be more powerful but, because of their complexity and computational demands, they are seldom used. It is the purpose of this work to advance their application. In this work we unify HMMs of all orders by deriving and proving the ORder rEDucing (ORED) algorithm. This algorithm will reduce any higher-order HMM (also mixed-order) to an equivalent first-order representation. This makes it possible to process any higher-order HMM using known first-order algorithms, thereby ...
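The idea of an equivalent first-order representation can be illustrated with the textbook state-pair construction for a second-order Markov chain. This is not the ORED algorithm itself, whose details are not given in the abstract. The function name and the dictionary layout below are assumptions made for the example.

```python
from itertools import product
from typing import Dict, Hashable, Sequence, Tuple

State = Hashable
Pair = Tuple[State, State]


def second_to_first_order(
    states: Sequence[State],
    trans2: Dict[Tuple[State, State, State], float],
) -> Dict[Tuple[Pair, Pair], float]:
    """Rewrite second-order transitions as first-order transitions over pairs.

    trans2[(i, j, k)] is P(next = k | previous = i, current = j).
    The pair state (i, j) means "was in i, now in j", so the only allowed
    first-order moves are (i, j) -> (j, k), with the same probability.
    """
    trans1: Dict[Tuple[Pair, Pair], float] = {}
    for i, j, k in product(states, repeat=3):
        p = trans2.get((i, j, k), 0.0)
        if p > 0.0:
            trans1[((i, j), (j, k))] = p
    return trans1


# Toy two-state second-order chain (hypothetical numbers).
toy_states = ["a", "b"]
toy_trans2 = {
    ("a", "a", "a"): 0.25, ("a", "a", "b"): 0.75,
    ("a", "b", "a"): 0.5,  ("a", "b", "b"): 0.5,
    ("b", "a", "a"): 0.5,  ("b", "a", "b"): 0.5,
    ("b", "b", "a"): 0.75, ("b", "b", "b"): 0.25,
}
toy_trans1 = second_to_first_order(toy_states, toy_trans2)
```

In this construction the emission of a pair state (i, j) would be that of the current state j, so standard first-order algorithms can run on the expanded model at the cost of a larger state space.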
Variations on Statistical Phoneme Recognition -- A Hybrid Approach
, 1997
Automatic speech recognition (ASR) is rapidly becoming a mature technology leading to an increasing number of commercial applications. Although great advances have been made in the state of the art of speech recognition over the last 10 years, the holy grail of ASR, namely large-vocabulary, speaker-independent continuous speech recognition with an error rate of less than 1%, still eludes researchers. At the heart of most modern speech recognition systems lies an HMM-based phoneme recognition engine which segments and classifies the incoming acoustic signal into a sequence of phonemes. These phonemes are concatenated to form word models which are processed further to arrive at a transcription of the linguistic message encoded in the speech signal. The final recognition accuracy of the speech recognition system can thus be directly linked to the recognition accuracy of the underlying phoneme recogniser. Two types of features extracted from the speech signal are commonly used for phoneme recognition. These are the supra-segmental knowledge-based features derived from phonetic and phonological theory, and the widely used frame-based cepstral features. Until now, these features have been used separately by researchers, resulting in the loss of valuable discriminative information.
HMMs and OWE Neural Network for Continuous Speech Recognition
- In Proceedings of International Conference on Spoken Language Processing, ICSLP. Philadelphia October
, 1996
The phonetic context has a large effect on stop consonants in a continuous speech signal [1]. Therefore, recognition systems that model allophones using context-dependent Hidden Markov Models have been implemented [3]. HMMs are well suited to segmentation in the temporal domain [4][6] but have some difficulty with recognition because Maximum Likelihood Estimation (MLE) training is not discriminative, whereas discrimination is one of the strengths of Artificial Neural Network models. In the last three years we have developed a new ANN model named OWE (Orthogonal Weight Estimator) [9][10].
The Free Speech Journal, Issue 5 (1997)
This paper presents new methods for training large neural networks for phoneme probability estimation. An architecture combining time-delay windows and recurrent connections is used to capture the important dynamic information of the speech signal. Because the number of connections in a fully connected recurrent network grows super-linearly with the number of hidden units, schemes for sparse connection and connection pruning are explored. It is found that sparsely connected networks outperform their fully connected counterparts with an equal number of connections. The implementation of the combined architecture and training scheme is described in detail. The networks are evaluated in a hybrid HMM/ANN system for phoneme recognition on the TIMIT database, and for word recognition on the WAXHOLM database. The achieved phone error rate, 27.8%, for the standard 39-phoneme set on the core test set of the TIMIT database is in the range of the lowest reported. All training and simulation software used is made freely available by the author, and detailed information about the software and the training process is given in an Appendix.

