Results 1 - 10 of 31
Phoneme Probability Estimation with Dynamic Sparsely Connected Artificial Neural Networks
, 1997
Cited by 23 (1 self)
This paper presents new methods for training large neural networks for phoneme probability estimation. An architecture combining time-delay windows and recurrent connections is used to capture the important dynamic information of the speech signal. Because the number of connections in a fully connected recurrent network grows super-linearly with the number of hidden units, schemes for sparse connection and connection pruning are explored. It is found that sparsely connected networks outperform their fully connected counterparts with an equal number of connections. The implementation of the combined architecture and training scheme is described in detail. The networks are evaluated in a hybrid HMM/ANN system for phoneme recognition on the TIMIT database, and for word recognition on the WAXHOLM database. The achieved phone error-rate, 27.8%, for the standard 39 phoneme set on the core test-set of the TIMIT database is in the range of the lowest reported. All training and simulation softwar...
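The trade-off between fully and sparsely connected recurrent layers at an equal connection budget can be illustrated with a rough count. This is only a sketch: it counts hidden-to-hidden weights alone, treats sparsity as a single density fraction, and the function names and numbers are hypothetical rather than taken from the paper.

```python
import math


def recurrent_connections(hidden_units, density=1.0):
    """Approximate number of hidden-to-hidden weights at a given density."""
    return int(density * hidden_units * hidden_units)


def hidden_units_for_budget(budget, density=1.0):
    """Largest hidden layer whose recurrent weights fit within the budget."""
    return math.isqrt(int(budget / density))


# Hypothetical budget of recurrent connections (not a figure from the paper).
budget = 10_000
dense_units = hidden_units_for_budget(budget)                 # fully connected
sparse_units = hidden_units_for_budget(budget, density=0.25)  # 25% connected
```

Under these assumptions a sparse layer can hold more hidden units than a fully connected one for the same number of connections, which is the comparison the abstract refers to.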
Indexing and Retrieval of Broadcast News
- Speech Communication
, 2000
Cited by 22 (6 self)
This paper describes a spoken document retrieval (SDR) system for British and North American Broadcast News. The system is based on a connectionist large vocabulary speech recognizer and a probabilistic information retrieval system. We discuss the development of a realtime Broadcast News speech recognizer, and its integration into an SDR system. Two advances were made for this task: automatic segmentation and statistical query expansion using a secondary corpus. Precision and recall results using the Text Retrieval Conference (TREC) SDR evaluation infrastructure are reported throughout the paper, and we discuss the application of these developments to a large scale SDR task based on an archive of British English broadcast news.
Keywords: Spoken Document Retrieval; Information Retrieval; Broadcast Speech; Large Vocabulary Speech Recognition.
1 Introduction
Retrieval of audio segments according to their content is a challenging and significant problem. It has been estimated th...
Transcription Of Broadcast Television And Radio News: The 1996 Abbot System
- In DARPA Speech Recognition Workshop
, 1997
Cited by 18 (4 self)
ABBOT is a hybrid connectionist-HMM large vocabulary continuous speech recognition system developed at the Cambridge University Engineering Department. This uses a recurrent neural network acoustic model to map acoustic features into posterior phone probabilities. These posterior probabilities are then converted to scaled likelihoods and used as observation likelihoods for phone HMMs [1, 2]. This paper describes the development of the CUCON system which participated in the 1996 ARPA Hub 4 Evaluations. The system is based on ABBOT. The Hub 4 Evaluation task involves the transcription of broadcast television and radio news programmes. This is an extremely demanding task for state-of-the-art speech recognition systems. Typical programmes include a wide variety of speaking styles and acoustic conditions. These range from read speech recorded in the studio to extemporaneous speech recorded over telephone channels. Results are presented for the system at various stages of development, as we...
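The conversion from posterior probabilities to scaled likelihoods mentioned above is commonly done by dividing each posterior by its class prior, since by Bayes' rule p(x|q) / p(x) = P(q|x) / P(q). The following is a minimal sketch of that standard step, not ABBOT's actual code; the function name, the floor value, and the three-phone example values are assumptions made for illustration.

```python
import math


def scaled_log_likelihoods(posteriors, priors, floor=1e-8):
    """Convert phone posteriors P(q|x) into scaled log-likelihoods.

    Dividing by the prior P(q) gives p(x|q) / p(x), a likelihood scaled by a
    factor that is the same for every phone at a given frame.
    """
    if len(posteriors) != len(priors):
        raise ValueError("posteriors and priors must have the same length")
    return [
        math.log(max(post, floor)) - math.log(max(prior, floor))
        for post, prior in zip(posteriors, priors)
    ]


# Hypothetical values for one frame and three phone classes.
frame_posteriors = [0.7, 0.2, 0.1]
phone_priors = [0.5, 0.3, 0.2]
scaled = scaled_log_likelihoods(frame_posteriors, phone_priors)
```

Working in the log domain and flooring small values are design choices made here to avoid underflow; the priors would typically be estimated from relative class frequencies in the training data.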
Start-synchronous search for large vocabulary continuous speech recognition
- IEEE Trans. Speech and Audio Processing
Cited by 17 (9 self)
In this paper, we present a novel, efficient search strategy for large vocabulary continuous speech recognition. The search algorithm, based on a stack decoder framework, utilizes phone-level posterior probability estimates (produced by a connectionist/hidden Markov model acoustic model) as a basis for phone deactivation pruning, a highly efficient method of reducing the required computation. The single-pass algorithm is naturally factored into the time-asynchronous processing of the word sequence and the time-synchronous processing of the hidden Markov model state sequence. This enables the search to be decoupled from the language model while still maintaining the computational benefits of time-synchronous processing. The incorporation of the language model in the search is discussed and computationally cheap approximations to the full language model are introduced. Experiments were performed on the North American Business News task using a 60,000-word vocabulary and a trigram language model. Results indicate that the computational cost of the search may be reduced by more than a factor of 40 with a relative search error of less than 2% using the techniques discussed in the paper.
Index Terms: Hidden Markov model, large vocabulary continuous speech recognition, phone deactivation pruning, search, stack decoding.
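One plausible reading of phone deactivation pruning is a per-frame threshold on the phone posteriors: phones whose posterior falls below the threshold are dropped from the search at that frame. The sketch below illustrates only that idea. The function name, the threshold value, and the phone labels are hypothetical, and the paper's actual criterion may differ.

```python
def active_phones(frame_posteriors, threshold=1e-3):
    """Return the phones whose posterior at this frame reaches the threshold."""
    return {
        phone
        for phone, prob in frame_posteriors.items()
        if prob >= threshold
    }


# Hypothetical posteriors for a single frame.
frame = {"k": 0.62, "ae": 0.37, "t": 0.0095, "s": 0.0005}
survivors = active_phones(frame, threshold=0.001)
```

Only the surviving phones would need their HMM states extended at this frame, which is where the saving in computation would come from.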
Discriminative Training of Hidden Markov Models
, 1998
Cited by 14 (0 self)
Contents:
Abbreviations
Notation
1 Introduction
2 Hidden Markov Models
2.1 Definition
2.2 HMM Modelling Assumptions
2.3 HMM Topology
2.4 Finding the Best Transcription
2.5 Setting the Parameters
2.6 Summary
3 Objective Functions
3.1 Properties of Maximum Likelihood Estimators
3.2 Maximum Likelihood
3.3 Maximum Mutual Information
3.4 Frame Discrimination
...
Size Matters: An Empirical Study Of Neural Network Training For Large Vocabulary Continuous Speech Recognition
Cited by 13 (5 self)
We have trained and tested a number of large neural networks for the purpose of emission probability estimation in large vocabulary continuous speech recognition. In particular, the problem under test is the DARPA Broadcast News task. Our goal here was to determine the relationship between training time, word error rate, size of the training set, and size of the neural network. In all cases, the network architecture was quite simple, comprising a single large hidden layer with an input window consisting of feature vectors from 9 frames around the current time, with a single output for each of 54 phonetic categories. Thus far, simultaneous increases to the size of the training set and the neural network improve performance; in other words, more data helps, as does the training of more parameters. We continue to be surprised that such a simple system works as well as it does for complex tasks. Given a limitation in training time, however, there appears to be an optimal ratio of training p...
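The architecture described above (a 9-frame input window, one hidden layer, 54 outputs) makes the network size easy to estimate. The sketch below counts parameters and builds the input window. The feature dimension, the hidden layer size, the symmetric window with four frames on each side, and the repetition of edge frames are all assumptions chosen for illustration, not details from the paper.

```python
def mlp_parameter_count(feature_dim, hidden_units, context_frames=9, outputs=54):
    """Weights plus biases of a single-hidden-layer MLP over a frame window."""
    inputs = context_frames * feature_dim
    hidden_params = (inputs + 1) * hidden_units   # input-to-hidden + biases
    output_params = (hidden_units + 1) * outputs  # hidden-to-output + biases
    return hidden_params + output_params


def context_window(frames, centre, context_frames=9):
    """Concatenate the frames around `centre`, repeating frames at the edges."""
    half = context_frames // 2
    last = len(frames) - 1
    window = []
    for t in range(centre - half, centre + half + 1):
        window.extend(frames[min(max(t, 0), last)])
    return window


# Hypothetical sizes: 13 features per frame, 1000 hidden units.
n_params = mlp_parameter_count(feature_dim=13, hidden_units=1000)

# Toy one-dimensional "features" so the window contents are easy to inspect.
toy_frames = [[float(i)] for i in range(20)]
toy_window = context_window(toy_frames, centre=10)
```

With these assumed sizes the hidden layer dominates the parameter count, so growing the hidden layer is the main way such a network scales.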
The 1997 Abbot System For The Transcription Of Broadcast News
- In Proceedings of the 1998 Broadcast News Transcription and Understanding Workshop
, 1998
Cited by 13 (1 self)
This paper describes the development of a connectionist-hidden Markov model (HMM) system for the 1997 DARPA Hub-4E CSR evaluations. We describe both system development and the enhancements designed to improve performance on broadcast news data. Both multilayer perceptron (MLP) and recurrent neural network acoustic models have been investigated. We assess the effect of using gender-dependent acoustic models, and the impact on performance of varying both the number of parameters and the amount of training data used for acoustic modelling. The use of context-dependent phone models is described, and the effect of the number of context classes is investigated. We also describe a method for incorporating syllable boundary information during search. Results are reported on the 1997 DARPA Hub-4E development test set. We then describe the CU-CON evaluation system and report results on the 1997 Hub-4E test set.
Signer-independent Continuous Sign Language Recognition Based on SRN/HMM
- Gesture and Sign Language in Human-Computer Interaction. International Gesture Workshop, volume 2298 of Lecture Notes in Artificial Intelligence
, 2001
Cited by 13 (0 self)
In this paper, a divide-and-conquer approach is presented for signer-independent continuous Chinese Sign Language (CSL) recognition. The problem of continuous CSL recognition is divided into subproblems of isolated CSL recognition. We combine the simple recurrent network (SRN) with hidden Markov models (HMM) in this approach. An improved SRN is introduced for segmentation of continuous CSL. The outputs of the SRN are regarded as the states of the HMM, and the Lattice Viterbi algorithm is employed to search for the best word sequence in the HMM framework. Experimental results show that the SRN/HMM approach performs better than the standard HMM approach.
On Supervised Learning From Sequential Data With Applications For Speech Recognition
, 1999
Cited by 12 (1 self)
visualization of the problem to model human speech. A large number of example sequences of observation vectors (shown connected as continuous trajectories) depending on a given sequence of class labels, with each class representing for example a phoneme (here the name Keiko with given durations). In this synthetic example, the one-dimensional target data would be represented poorly by a uni-modal Gaussian distribution with a constant variance (which corresponds to using the squared-error objective function), which would average the two separate branches, indicated by the fat lines as the mean and constant variance of the single Gaussian. Compare this figure with Figure 3.10, Figure 3.11 and Figure 3.12 to see a subsequent improvement of the model.
Context-Dependent Hybrid HME/HMM Speech . . .
Cited by 10 (5 self)
This paper presents a context-dependent hybrid connectionist speech recognition system that uses a set of generalized hierarchical mixtures of experts (HME) to estimate context-dependent posterior acoustic class probabilities. The connectionist part of the system is organized in a modular fashion, allowing the distributed training of such a system on regular workstations. Context classes are based on polyphonic contexts, clustered using decision trees which we adopt from our continuous density HMM recognizer JANUS [8]. The system is evaluated on ESST, an English speaker-independent spontaneous speech database. Context-dependent modeling is shown to yield significant improvements over simple context-independent modeling, requiring only a small additional overhead in terms of training and decoding time.

