The Use of Context in Large Vocabulary Speech Recognition
, 1995
Cited by 93 (0 self)
decide which contexts are similar and can share parameters. A key feature of this approach is that it allows the construction of models which are dependent upon contextual effects occurring across word boundaries. The use of cross word context dependent models presents problems for conventional decoders. The second part of the thesis therefore presents a new decoder design which is capable of using these models efficiently. The decoder is suitable for use with very large vocabularies and long span language models. It is also capable of generating a lattice of word hypotheses with little computational overhead. These lattices can be used to constrain further decoding, allowing efficient use of complex acoustic and language models. The effectiveness of these techniques has been assessed on a variety of large vocabulary continuous speech recognition tasks and results are presented which analyse performance in terms of computational complexity and recognition accuracy. The experiments dem
A Segmental CRF Approach to Large Vocabulary Continuous Speech Recognition
Cited by 14 (7 self)
This paper proposes a segmental conditional random field framework for large vocabulary continuous speech recognition. Fundamental to this approach is the use of acoustic detectors as the basic input, and the automatic construction of a versatile set of segment-level features. The detector streams operate at multiple time scales (frame, phone, multi-phone, syllable or word) and are combined at the word level in the CRF training and decoding processes. A key aspect of our approach is that features are defined at the word level, and are naturally geared to explain long span phenomena such as formant trajectories, duration, and syllable stress patterns. Generalization to unseen words is possible through the use of decomposable consistency features [1], [2], and our framework allows for the joint or separate discriminative training of the acoustic and language models. An initial evaluation of this framework with voice search data from the Bing Mobile (BM) application results in a 2% absolute improvement over an HMM baseline. Index Terms: speech recognition, conditional random field, direct modeling, detector features
On Supervised Learning From Sequential Data With Applications For Speech Recognition
, 1999
Cited by 12 (1 self)
visualization of the problem to model human speech. A large number of example sequences of observation vectors (shown connected as continuous trajectories) depending on a given sequence of class labels, with each class representing for example a phoneme (here the name Keiko with given durations). In this synthetic example, the one-dimensional target data would be represented poorly by a uni-modal Gaussian distribution with a constant variance (which corresponds to using the squared-error objective function), which would average the two separate branches, indicated by the fat lines as the mean and constant variance of the single Gaussian. Compare this figure with Figure 3.10, Figure 3.11 and Figure 3.12 to see a subsequent improvement of the model.
Large Vocabulary Continuous Speech Recognition: from Laboratory Systems towards Real-World Applications
, 1996
Cited by 6 (4 self)
This paper provides an overview of the state-of-the-art in laboratory speaker-independent, large vocabulary continuous speech recognition (LVCSR) systems with a view towards adapting such technology to the requirements of real-world applications. While in speech recognition the principal concern is to transcribe the speech signal as a sequence of words, the same core technology can be applied to domains other than dictation. The main topics addressed are acoustic-phonetic modeling, lexical representation, language modeling, decoding and model adaptation. After a brief summary of experimental results, some directions towards usable systems are given. In moving from laboratory systems towards real-world applications, different constraints arise which influence the system design. The application imposes limitations on computational resources, constraints on signal capture, requirements for noise and channel compensation, and rejection capability. The difficulties and costs of adapting existing technology to new languages and applications need to be assessed. Near-term applications for LVCSR technology are likely to grow in somewhat limited domains such as spoken language systems for information retrieval, and limited domain dictation. Perspectives on some unresolved problems are given, indicating areas for future research.
SCARF: A Segmental Conditional Random Field Toolkit for Speech Recognition
Cited by 4 (2 self)
This paper describes a new toolkit, SCARF, for doing speech recognition with segmental conditional random fields. It is designed to allow for the integration of numerous, possibly redundant segment-level acoustic features, along with a complete language model, in a coherent speech recognition framework. SCARF performs a segmental analysis, where each segment corresponds to a word, thus allowing for the incorporation of acoustic features defined at the phoneme, multi-phone, syllable and word level. SCARF is designed to make it especially convenient to use acoustic detection events as input, such as the detection of energy bursts, phonemes, or other events. Language modeling is done by associating each state in the SCRF with a state in an underlying n-gram language model, and SCARF supports the joint and discriminative training of language model and acoustic model parameters. SCARF is available for download from
N-Best Breadth Search For Large Vocabulary Continuous Speech Recognition Using A Long Span Language Model
, 1998
Cited by 3 (3 self)
In large vocabulary continuous speech recognition, high level linguistic knowledge can enhance performance. However, integration of high level linguistic knowledge and complex acoustic models under an efficient search scheme is still an open question. In this paper, we propose the n-best breadth search algorithm under the framework of a state space search. The n-best breadth search is a combination of the best first search and the breadth first search, and it efficiently accommodates the long span language models and complex acoustic models. Our pilot experiment shows that the proposed algorithm decreases execution time with little effect on performance. 136th Meeting of Acoustical Society of America. Contents: 1 INTRODUCTION, 2 REVIEW OF DECODING ALGORITHMS, 3 N-BEST BREADTH SEARCH, 4 IMPLEMENTATION ISSUES, 5 EXPERIMENTAL RESULTS, 6 CONCLUSIONS, 7 ACKNOWLEDGMENT. 1 INTRODUCTION In the statistical approach, speech recognition ...
Phonetic Set Hashing: A Novel Scheme For Transforming Phone Sequences To Words
, 1994
Cited by 2 (1 self)
The usefulness of accurate sequence information is re-evaluated in this paper. A novel idea, called phonetic set hashing, of transforming phone sequences to words is then suggested. Phone sequences are mapped onto the corresponding phone sets, and the latter are used as keys for indexing appropriate words. By using data-driven training strategies, the problem of word segmentation has been alleviated. The robustness of phone set hashing towards insertion, deletion, and substitution errors has also been studied. Experiments with subsets of the TIMIT database indicate that phone set hashing is a simple, fast scheme for word pre-selection. 1 Introduction. Lexical modeling is an important aspect of continuous speech recognition (CSR) systems. Typically, the lexicon for a speech recognition system consists of a set of models that represent the pronunciations of words (usually a single pronunciation). Many successful speech recognition models are based on Hidden Markov Models (HMMs) [4]. The compu...
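The core indexing idea in this abstract (phone sequences collapsed to phone sets, which then serve as hash keys for candidate words) can be illustrated with a minimal sketch. The toy lexicon, the phone symbols, and the function names `phone_set_key`, `build_index` and `preselect` are assumptions made for illustration; the paper's data-driven training and its handling of recognition errors are not reproduced here.

```python
from collections import defaultdict

def phone_set_key(phones):
    """Collapse a phone sequence into an order-free, duplicate-free key."""
    return frozenset(phones)

def build_index(lexicon):
    """Map each phone set to every word whose pronunciation yields that set."""
    index = defaultdict(set)
    for word, phones in lexicon.items():
        index[phone_set_key(phones)].add(word)
    return index

def preselect(index, phones):
    """Return the candidate words for an observed phone sequence, sorted."""
    return sorted(index.get(phone_set_key(phones), set()))

# Hypothetical toy lexicon (not taken from TIMIT or from the paper).
LEXICON = {
    "cat": ["k", "ae", "t"],
    "tack": ["t", "ae", "k"],
    "act": ["ae", "k", "t"],
    "dog": ["d", "ao", "g"],
}
INDEX = build_index(LEXICON)

print(preselect(INDEX, ["k", "ae", "t"]))        # words sharing the set {k, ae, t}
print(preselect(INDEX, ["k", "ae", "ae", "t"]))  # a repeated phone maps to the same key
```

Because the key discards order and repetition, words with the same phones in a different order collide, so the lookup acts as a pre-selection step that a later stage would have to disambiguate.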
Nozomi - A Fast, Memory-Efficient Stack Decoder For LVCSR
- in 5th International Conference on Spoken Language Processing (ICSLP
, 1996
Cited by 1 (0 self)
This paper describes some of the implementation details of the "Nozomi" stack decoder for LVCSR. The decoder was tested on a Japanese Newspaper Dictation Task using a 5000 word vocabulary. Using continuous density acoustic models with 2000 and 3000 states trained on the JNAS/ASJ corpora and a 3-gram LM trained on the RWC text corpus, both models provided by the IPA group [7], it was possible to reach more than 95% word accuracy on the standard test set. With computationally cheap acoustic models we could achieve around 89% accuracy in nearly real time on a 300 MHz Pentium II. Using a disk-based LM the memory usage could be optimized to 4 MB in total. 1. INTRODUCTION LVCSR is currently limited to workstations and fast high-end laptops with a lot of memory. To make LVCSR work on PDAs, cellular phones, user-interfaces, wrist watches etc., it is necessary to find time- and memory-efficient algorithms. The goal for implementation of any search engine must be to minimize time and memory requ...
Continuous Speech Dictation in French
- Proceedings ICSLP-94
Cited by 1 (1 self)
A major research activity at LIMSI is multilingual, speaker-independent, large vocabulary speech dictation. In this paper we report on efforts in large vocabulary, speaker-independent continuous speech recognition of French using the BREF corpus. Recognition experiments were carried out with vocabularies containing up to 20k words. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on 38 million words of newspaper text from Le Monde for language modeling. The recognizer uses a time-synchronous graph-search strategy. When a bigram language model is used, recognition is carried out in a single forward pass. A second forward pass, which makes use of a word graph generated with the bigram language model, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models and phone duration models. An average phone accuracy of 86% was achieved. A word accuracy of 84% h...
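The second pass described in this abstract, in which a word graph from the bigram pass is rescored with a trigram language model, can be sketched on a toy graph. Everything below is an assumption made for illustration: the graph layout, the scores, the backoff constant and the names `paths`, `trigram_logp` and `rescore` are hypothetical. For brevity the sketch enumerates all paths exhaustively instead of using the time-synchronous search the abstract mentions.

```python
import math

def paths(graph, node, end):
    """Yield every path from node to end in an acyclic word graph."""
    if node == end:
        yield []
        return
    for word, acoustic_logp, nxt in graph.get(node, []):
        for rest in paths(graph, nxt, end):
            yield [(word, acoustic_logp)] + rest

def trigram_logp(words, trigrams, backoff_logp):
    """Sum trigram log-probabilities, using a flat penalty for unseen trigrams."""
    padded = ["<s>", "<s>"] + words
    return sum(
        trigrams.get(tuple(padded[i - 2:i + 1]), backoff_logp)
        for i in range(2, len(padded))
    )

def rescore(graph, start, end, trigrams, backoff_logp=math.log(1e-4), lm_weight=1.0):
    """Return (score, words) for the path with the best combined score."""
    best = None
    for path in paths(graph, start, end):
        words = [w for w, _ in path]
        score = sum(a for _, a in path)
        score += lm_weight * trigram_logp(words, trigrams, backoff_logp)
        if best is None or score > best[0]:
            best = (score, words)
    return best

# Hypothetical word graph: node -> list of (word, acoustic log-score, next node).
GRAPH = {
    0: [("le", -1.0, 1)],
    1: [("chat", -2.0, 2), ("chas", -1.8, 2)],
    2: [("dort", -1.0, 3)],
}
# Hypothetical trigram log-probabilities.
TRIGRAMS = {
    ("<s>", "<s>", "le"): math.log(0.5),
    ("<s>", "le", "chat"): math.log(0.4),
    ("le", "chat", "dort"): math.log(0.3),
}
BEST = rescore(GRAPH, 0, 3, TRIGRAMS)
print(BEST[1])
```

In this toy graph "chas" has the better acoustic score, but the trigram scores favor "chat", so the rescored best path differs from the acoustics-only one. The graph restricts the second pass to hypotheses the first pass already considered plausible.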

