Network Optimizations for Large Vocabulary Speech Recognition
- Speech Communication, 1998
Cited by 16 (7 self)
The redundancy and the size of networks in large-vocabulary speech recognition systems can have a critical effect on their overall performance. We describe the use of two new algorithms: weighted determinization and minimization [12]. These algorithms transform recognition labeled networks into equivalent ones that require much less time and space in large-vocabulary speech recognition. They are both optimal: weighted determinization reduces the number of alternatives at each state to the minimum, and weighted minimization reduces the size of deterministic networks to the smallest possible number of states and transitions. These algorithms generalize classical automata determinization and minimization to deal properly with the probabilities of alternative hypotheses and with the relationships between units (distributions, phones, words) at different levels in the recognition system. We illustrate their use in several applications, and report the results of our experiments.
Hierarchical search for large vocabulary conversational speech recognition
- IEEE Signal Processing Magazine, 1999
Cited by 15 (5 self)
Speaker-independent speech recognition technology has made significant progress from the days of isolated word recognition. Today, state-of-the-art systems are capable of performing large vocabulary continuous speech recognition (LVCSR) on audio streams derived from complex information sources such as broadcast news and two-way telephone dialogs. A significant contribution to this advancement in technology is the development of search techniques that find suboptimal but accurate solutions in problems involving large search spaces and extremely complex statistical models. Moreover, these search strategies are capable of dynamically integrating information from a number of diverse knowledge sources to determine the correct word hypothesis, and they limit the scope of the search by using a hierarchical search strategy. We refer to this problem as the decoding or search problem. This paper describes the complexity associated with decoding using hierarchical representations for linguistic and acoustic knowledge sources. An extensible object-oriented decoder, available in the public domain, that leverages current state-of-the-art technology is described to illustrate these concepts. This decoder supports efficient handling of acoustic models for cross-word context-dependent phones, multiple pronunciations of words using lexical trees, and rescoring of word graphs based on N-gram language models in a single pass. It employs a state-of-the-art Viterbi-style dynamic programming algorithm, and is equipped with several heuristic pruning criteria to minimize the consumption of computational resources while maintaining good accuracy.
Boosting Gaussian Mixtures In An LVCSR System
- Proceedings of ICASSP 2000, 2000
Cited by 13 (3 self)
In this paper, we apply boosting to the problem of frame-level phone classification, and use the resulting system to perform voicemail transcription. We develop parallel, hierarchical, and restricted versions of the classic AdaBoost algorithm, which enable the technique to be used in large-scale speech recognition tasks with hundreds of thousands of Gaussians and tens of millions of training frames. We report small but consistent improvements in both frame recognition accuracy and word error rate.

1. INTRODUCTION
Boosting is a technique for sequentially training and combining a collection of classifiers in such a way that the later classifiers make up for the deficiencies of the earlier ones. Many variants exist [1, 7, 2, 3], but all follow the same basic strategy. There is a sequence of iterations, and at each iteration a new classifier is trained on a weighted set of the training examples. Initially, every example gets the same weight, but in subsequent iterations, the weights of h...
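The basic boosting strategy described in this introduction can be illustrated with a minimal sketch of textbook two-class AdaBoost over threshold "stumps" on one-dimensional data. This is not the parallel, hierarchical, or restricted variant the paper develops, and it does not use Gaussian mixtures; the function names (`train_stump`, `adaboost`, `predict`) and the toy data are assumptions made purely for illustration.

```python
import math

def stump_predict(x, threshold, polarity):
    """A stump outputs `polarity` when x >= threshold, else the opposite label."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Classic AdaBoost: labels in ys are +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n          # initially every example gets the same weight
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this classifier
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified examples, lower the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data (hypothetical): no single stump separates these labels.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
predictions = [predict(ensemble, x) for x in xs]
```

On this toy set a single stump misclassifies at least two points, while the three-round ensemble labels all six correctly, which is the "later classifiers make up for the deficiencies of the earlier ones" behaviour in miniature.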
Some Results on Search Complexity vs Accuracy
- in DARPA Speech Recognition Workshop, 1997
Cited by 8 (4 self)
This paper presents three different techniques applied in or developed during the 1996 Hub-4 broadcast news transcription task. First, an efficient shortest path graph search algorithm is applied to the word lattice created by Viterbi search, producing a globally optimum result. This reduces the word error rate by about 3-10% (relative), depending on the test set. The execution time is at or close to real time for most utterances. Second, a segmented N-best list generation algorithm is described for producing compact N-best lists for very long utterances. Finally, a temporal smoothing technique is compared to deleted interpolation. On one test set, temporal smoothing reduces the error rate by 3% for an 8% increase in search cost, while deleted interpolation reduces it by 6% for a 50% increase in search cost.

1. Introduction
In this paper we describe the results of a number of search experiments on the 1996 Hub-4 development and evaluation test sets. We have also attempted to document issues that a...
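The idea of a shortest-path search over a word lattice can be sketched as follows. The abstract does not specify the paper's actual algorithm, so this is only a generic dynamic-programming shortest path on a directed acyclic graph; the lattice, its words, and its costs (standing in for negative log scores) are invented for illustration, and the name `best_path` is hypothetical.

```python
def best_path(num_nodes, edges):
    """Lowest-cost path from node 0 to node num_nodes - 1 in a lattice.

    edges: list of (src, dst, word, cost) with src < dst, i.e. the nodes
    are assumed to be numbered in topological (time) order.
    """
    inf = float("inf")
    best_cost = [inf] * num_nodes
    back = [None] * num_nodes        # back[v] = (previous node, word on the arc)
    best_cost[0] = 0.0
    # Sorting by src guarantees a node's incoming arcs are relaxed
    # before any of its outgoing arcs.
    for src, dst, word, cost in sorted(edges):
        if best_cost[src] + cost < best_cost[dst]:
            best_cost[dst] = best_cost[src] + cost
            back[dst] = (src, word)
    # Trace the back-pointers from the final node to recover the words.
    words = []
    node = num_nodes - 1
    while back[node] is not None:
        node, word = back[node]
        words.append(word)
    return best_cost[-1], words[::-1]

# Hypothetical lattice: the locally cheapest first arc ("wreck") does not
# lie on the globally cheapest path.
lattice = [
    (0, 1, "wreck", 1.0),
    (0, 2, "recognize", 1.5),
    (1, 2, "a", 1.0),
    (2, 3, "nice", 1.5),
    (2, 4, "speech", 1.0),
    (3, 4, "beach", 1.0),
]
cost, words = best_path(5, lattice)
```

Because every arc is considered exactly once and each node keeps its best predecessor, the result is the global optimum over all lattice paths rather than a greedy choice.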
Large Vocabulary Continuous Speech Recognition: from Laboratory Systems towards Real-World Applications
1996
Cited by 6 (4 self)
This paper provides an overview of the state-of-the-art in laboratory speaker-independent, large vocabulary continuous speech recognition (LVCSR) systems with a view towards adapting such technology to the requirements of real-world applications. While in speech recognition the principal concern is to transcribe the speech signal as a sequence of words, the same core technology can be applied to domains other than dictation. The main topics addressed are acoustic-phonetic modeling, lexical representation, language modeling, decoding and model adaptation. After a brief summary of experimental results, some directions towards usable systems are given. In moving from laboratory systems towards real-world applications, different constraints arise which influence the system design. The application imposes limitations on computational resources, constraints on signal capture, requirements for noise and channel compensation, and rejection capability. The difficulties and costs of adapting existing technology to new languages and applications need to be assessed. Near-term applications for LVCSR technology are likely to grow in somewhat limited domains such as spoken language systems for information retrieval, and limited domain dictation. Perspectives on some unresolved problems are given, indicating areas for future research.
N-Best Breadth Search For Large Vocabulary Continuous Speech Recognition Using A Long Span Language Model
1998
Cited by 3 (3 self)
In large vocabulary continuous speech recognition, high level linguistic knowledge can enhance performance. However, integration of high level linguistic knowledge and complex acoustic models under an efficient search scheme is still an open question. In this paper, we propose the n-best breadth search algorithm under the framework of a state space search. The n-best breadth search is a combination of the best first search and the breadth first search, and it efficiently accommodates the long span language models and complex acoustic models. Our pilot experiment shows that the proposed algorithm decreases execution time with little effect on performance.

136th Meeting of Acoustical Society of America

Contents: 1 INTRODUCTION, 2 REVIEW OF DECODING ALGORITHMS, 3 N-BEST BREADTH SEARCH, 4 IMPLEMENTATION ISSUES, 5 EXPERIMENTAL RESULTS, 6 CONCLUSIONS, 7 ACKNOWLEDGMENT

1 INTRODUCTION
In the statistical approach, speech recognition ...
An RNN-Based Pre-classification Method for Fast Continuous Mandarin Speech Recognition
- IEEE Trans. Speech Audio Processing
Cited by 3 (0 self)
A novel RNN-based front-end pre-classification scheme for fast continuous Mandarin speech recognition is proposed in this paper. First, an RNN is employed to discriminate each input frame among the three broad classes of initial, final, and silence. A finite state machine (FSM) is then used to classify the input frame into four states: the three stable states of Initial (I), Final (F), and Silence (S), and a Transient (T) state. The decision is made by examining whether the RNN discriminates well between classes. We then restrict the search space for the three stable states in the following DP search to speed up the recognition process. The efficiency of the proposed scheme was examined by simulations in which we incorporated it into an HMM-based continuous 411 Mandarin base-syllables recognizer. Experimental results showed that it can be used in conjunction with the beam search to greatly reduce the computational complexity of the HMM recognizer while keeping the recognition rate a...
Large vocabulary continuous speech recognition using linguistic features and constraints
2005
Cited by 1 (0 self)
Automatic speech recognition (ASR) is a process of applying constraints, as encoded in the computer system (the recognizer), to the speech signal until ambiguity is satisfactorily resolved to the extent that only one sequence of words is hypothesized. Such constraints fall naturally into two categories. One deals with the ordering of words (syntax) and the organization of their meanings (semantics, pragmatics, etc.). The other governs how speech signals are related to words, a process often termed “lexical access”. This thesis studies the Huttenlocher-Zue lexical access model, its implementation in a modern probabilistic speech recognition framework and its application to continuous speech from an open vocabulary. The Huttenlocher-Zue model advocates a two-pass lexical access paradigm. In the first pass, the lexicon is effectively pruned using broad linguistic constraints. In the original Huttenlocher-Zue model, the authors proposed six linguistic features motivated by the manner of pronunciation.
Joint work with
1996
Text and speech processing: hard problems Theory of automata Appropriate level of abstraction
Some Results on Search Complexity vs Accuracy
This paper presents three different techniques applied in or developed during the 1996 Hub-4 broadcast news transcription task. First, an efficient shortest path graph search algorithm is applied to the word lattice created by Viterbi search, producing a globally optimum result. This reduces the word error rate by about 3-10% (relative), depending on the test set. The execution time is at or close to real time for most utterances. Second, a segmented N-best list generation algorithm is described for producing compact N-best lists for very long utterances. Finally, a temporal smoothing technique is compared to deleted interpolation. On one test set, temporal smoothing reduces the error rate by 3% for an 8% increase in search cost, while deleted interpolation reduces it by 6% for a 50% increase in search cost.

