Large language models in machine translation
- In EMNLP, 2007
Cited by 78 (2 self)
This paper reports on the benefits of large-scale statistical language modeling in machine translation. A distributed infrastructure is proposed which we use to train on up to 2 trillion tokens, resulting in language models having up to 300 billion n-grams. It is capable of providing smoothed probabilities for fast, single-pass decoding. We introduce a new smoothing method, dubbed Stupid Backoff, that is inexpensive to train on large data sets and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases.
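The abstract names Stupid Backoff without defining it. The sketch below follows the commonly cited formulation: score a word by its relative frequency when the full n-gram was observed, and otherwise drop the oldest context word and multiply by a fixed factor. The factor of 0.4, the function names, and the toy corpus are assumptions for illustration and are not taken from the abstract. The scores are not normalized probabilities.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor; an assumed value, not stated in the abstract


def count_ngrams(tokens, max_order):
    """Count every n-gram of order 1..max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff(word, context, counts, total_tokens, alpha=ALPHA):
    """Return an unnormalized score S(word | context)."""
    context = tuple(context)
    penalty = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            # Seen n-gram: plain relative frequency, scaled by any backoff so far.
            return penalty * counts[ngram] / counts[context]
        # Unseen n-gram: drop the oldest context word and apply the factor.
        context = context[1:]
        penalty *= alpha
    # No context left: fall back to the unigram relative frequency.
    return penalty * counts[(word,)] / total_tokens


# Hypothetical toy corpus, used only to exercise the functions.
tokens = "the cat sat on the mat the cat ran".split()
counts = count_ngrams(tokens, 3)
seen = stupid_backoff("sat", ["the", "cat"], counts, len(tokens))
backed_off = stupid_backoff("mat", ["on", "cat"], counts, len(tokens))
```

Because no discounting or normalization is computed, training reduces to counting, which is what makes the method cheap on very large corpora.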
Hierarchical search for large vocabulary conversational speech recognition
- IEEE Signal Processing Magazine, 1999
Cited by 15 (5 self)
Speaker-independent speech recognition technology has made significant progress from the days of isolated word recognition. Today, state-of-the-art systems are capable of performing large vocabulary continuous speech recognition (LVCSR) on audio streams derived from complex information sources such as broadcast news and two-way telephone dialogs. A significant contribution to this advancement in technology is the development of search techniques that find suboptimal but accurate solutions in problems involving large search spaces and extremely complex statistical models. Moreover, these search strategies are capable of dynamically integrating information from a number of diverse knowledge sources to determine the correct word hypothesis, and limit the scope of the search by using a hierarchical search strategy. We refer to this problem as the decoding or search problem. This paper describes the complexity associated with decoding using hierarchical representations for linguistic and acoustic knowledge sources. An extensible object-oriented decoder, available in the public domain, that leverages current state-of-the-art technology is described to illustrate these concepts. This decoder supports efficient handling of acoustic models for cross-word context-dependent phones, multiple pronunciations of words using lexical trees, and rescoring of word graphs based on N-gram language models in a single pass. It employs a state-of-the-art Viterbi-style dynamic programming algorithm, and is equipped with several heuristic pruning criteria to minimize the consumption of computational resources while maintaining good accuracy.
Distant-talking continuous speech recognition based on a novel reverberation model in the feature domain
- Proc. INTERSPEECH, 2006
Cited by 10 (9 self)
A novel approach for automatic speech recognition in highly reverberant environments, proposed in [1] for isolated word recognition, is extended to continuous speech recognition (CSR) in this paper. The approach is based on a combined acoustic model consisting of a network of clean speech HMMs and a reverberation model. Because the grammatical information and the information about the acoustic environment are strictly separated in the combined model, a high degree of flexibility for adapting the system to new tasks and new environments is attained. We show that virtually all known CSR search algorithms can be used for decoding the proposed combined model if a few extensions are added. In a simulation of a connected digit recognition task, the proposed method achieves more than 40% reduction of the word error rate compared to a conventional HMM-based system trained on reverberant speech, at the cost of an increased decoding complexity.
Index Terms: robust speech recognition, distant-talking speech recognition, dereverberation.
Joint Video Scene Segmentation And Classification Based On Hidden Markov Model
- ICME-2000, 2000
Cited by 7 (0 self)
Video classification and segmentation are fundamental steps for efficient accessing, retrieving and browsing large amount of video data. We have developed a scene classification scheme using a Hidden Markov Model (HMM)-based classifier. By utilizing the temporal behaviors of different scene classes, HMM classifier can effectively classify video segments into one of the predefined scene classes. In this paper, we describe two approaches for joint video classification and segmentation based on HMM, which works by searching for the most likely class transition path utilizing the dynamic programming technique.
1. INTRODUCTION
Video classification and segmentation are fundamental steps for efficient accessing, retrieving and browsing large amount of video data. Recently, several research groups have developed algorithms to detect scene change by incorporating audio and visual information. Most of these works [1, 2, 3] are based on some prior scene models, (e.g. dialog, setting, etc.) and accomplish ...
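Searching for the most likely class transition path by dynamic programming is the standard Viterbi recursion. The following is a minimal sketch under stated assumptions and is not the paper's two approaches: each video segment is assumed to already have a log-likelihood under every scene class, and the function name, the two-class setup, and all numbers are hypothetical.

```python
import math


def viterbi(log_emissions, log_trans, log_init):
    """Return the most likely class index for each segment.

    log_emissions[t][c]: log-likelihood of segment t under class c
    log_trans[p][c]:     log-probability of moving from class p to class c
    log_init[c]:         log-probability of starting in class c
    """
    n_classes = len(log_init)
    score = [log_init[c] + log_emissions[0][c] for c in range(n_classes)]
    backpointers = []
    for t in range(1, len(log_emissions)):
        new_score, pointers = [], []
        for c in range(n_classes):
            # Best predecessor class for ending segment t in class c.
            best_prev = max(range(n_classes),
                            key=lambda p: score[p] + log_trans[p][c])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][c]
                             + log_emissions[t][c])
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final class.
    state = max(range(n_classes), key=lambda c: score[c])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def logs(rows):
    """Convert a nested list of probabilities to log-probabilities."""
    return [[math.log(p) for p in row] for row in rows]


# Hypothetical per-segment likelihoods for two scene classes.
emissions = logs([[0.8, 0.2], [0.7, 0.3], [0.4, 0.6], [0.2, 0.8], [0.1, 0.9]])
transitions = logs([[0.9, 0.1], [0.1, 0.9]])  # classes tend to persist
initial = [math.log(0.5), math.log(0.5)]
path = viterbi(emissions, transitions, initial)
```

Positions where consecutive entries of `path` differ mark scene boundaries, so one pass yields both the classification and the segmentation.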
Arc Minimization in Finite State Decoding Graphs with Cross-Word Acoustic Context
- In Proc. ICSLP’02, 2002
Cited by 6 (2 self)
Recent approaches to large vocabulary decoding with finite state graphs have focused on the use of state minimization algorithms to produce relatively compact graphs. This paper extends the finite state approach by developing complementary arc-minimization techniques. The use of these techniques in concert with state minimization allows us to statically compile decoding graphs in which the acoustic models utilize a full word of cross-word context. This is in significant contrast to typical systems which use only a single phone. We show that the particular arc-minimization problem that arises is in fact an NP-complete combinatorial optimization problem, and describe the reduction from 3-SAT. We present experimental results that illustrate the moderate sizes and runtimes of graphs for the Switchboard task.
EWAVES: an efficient decoding algorithm for lexical tree based speech recognition
- In Proc. of ICSLP
Cited by 6 (5 self)
We present an optimized implementation of the Viterbi algorithm suitable for small to large vocabulary, and isolated or continuous speech recognition. The Viterbi algorithm is certainly the most popular dynamic programming algorithm used in speech recognition. In this paper we propose a new algorithm that outperforms the Viterbi algorithm in terms of complexity and memory requirements. It is based on the assumption of strictly left-to-right models and explores the lexical tree in an optimal way, such that book-keeping computation is minimized. The tree is encoded such that children of a node are placed contiguously and in increasing order of memory heap so that the proposed algorithm also optimizes cache usage. Even though the algorithm is asymptotically two times faster than the conventional Viterbi algorithm, in our experiments
Dynamic Search-Space Pruning For Time-Constrained Speech Recognition
- In: International Conference on Spoken Language Processing, 2002
Cited by 4 (0 self)
In automatic speech recognition, complex state spaces are searched during the recognition process. By limiting these search spaces the computation time can be reduced, but unfortunately the recognition rate usually decreases as well. However, especially for time-critical recognition tasks, search-space pruning is necessary. Therefore, we developed a dynamic mechanism to optimize the pruning parameters for time-constrained recognition tasks, e.g. speech recognition for robotic systems, with respect to word accuracy and computation time. With this mechanism an automatic speech recognition system can process speech signals with an approximately constant processing rate. Compared to a system without such a dynamic mechanism and the same time available for computation, the variance of the processing rate is decreased greatly without a significant loss of word accuracy. Furthermore, the extended system can be sped up to real-time processing, if desired or necessary.
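The abstract does not say how the pruning parameters are adjusted. As a generic illustration only, and not the authors' mechanism, the sketch below widens or narrows a beam width in proportion to how far the last frame's decoding time fell from a per-frame time budget. The function name, the gain, the beam limits, and the millisecond figures are all hypothetical.

```python
def adapt_beam(beam, elapsed_ms, budget_ms,
               min_beam=50.0, max_beam=400.0, gain=0.5):
    """Return an adjusted beam width after decoding one frame.

    beam:       current beam width (larger keeps more hypotheses)
    elapsed_ms: time the decoder spent on the last frame
    budget_ms:  time available per frame for the target processing rate
    """
    # Positive error means spare time (widen); negative means overrun (narrow).
    error = (budget_ms - elapsed_ms) / budget_ms
    new_beam = beam * (1.0 + gain * error)
    # Clamp so a single outlier frame cannot collapse or explode the search.
    return max(min_beam, min(max_beam, new_beam))


# Hypothetical timings: an overrun frame, a fast frame, and a severe overrun.
narrowed = adapt_beam(200.0, elapsed_ms=20.0, budget_ms=10.0)
widened = adapt_beam(200.0, elapsed_ms=5.0, budget_ms=10.0)
floored = adapt_beam(60.0, elapsed_ms=30.0, budget_ms=10.0)
```

A feedback loop of this kind trades accuracy for time only on the frames that need it, which is one way to keep the processing rate roughly constant.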
Modelling, estimating and compensating low-bit rate coding distortion in speech recognition
- IEEE Trans. on SAP, 2002
Cited by 2 (2 self)
A solution to the problem of speech recognition with signals distorted by low-bit rate coders is presented in this paper. A model for the coding-decoding distortion, an HMM compensation method to include this model, and an EM-based adaptation algorithm to estimate this distortion are proposed here. Medium vocabulary continuous-speech speaker-independent recognition experiments with 8 kbps G.729 (CS-CELP), 13 kbps RPE-LTP (GSM), 5.3 kbps G.723.1, 4.8 kbps FS-1016 and 32 kbps G.726 (ADPCM) coders show that the approach described in this paper is able to dramatically reduce the effect of the coding distortion and, in some cases, gives a word accuracy higher than the baseline system with uncoded speech. Finally, the EM estimation algorithm requires only one adapting utterance and the approach described is certainly
The evolution and popularity of cellular and TCP/IP networks have created the problem of improving the recognition accuracy for speech distorted by low-bit rate coders. The distortion of coding schemes in speech recognizers is difficult to model and is an open problem that cannot be solved by applying conventional noise cancelling techniques [1] such as spectral subtraction [2], cepstral mean subtraction [3] and RASTA
Combining character-based bigram with word-based bigram in contextual post-processing for Chinese script
- ACM TRANS. ASIAN LANGUAGE INFORMATION PROCESSING, 2002
Cited by 2 (0 self)
It is crucial to use contextual information to improve the recognition accuracy of Chinese script in an offline, handwritten Chinese character-recognition system. However, with the increase in the number of candidates given by a character recognizer, contextual postprocessing using a word-based bigram is time-consuming. This article presents a novel contextual postprocessing method that integrates character-based bigram postprocessing with word-based bigram postprocessing in light of the complementary action between Chinese characters and Chinese words. On the basis of isolated character recognition, character-based bigram postprocessing using a forward-backward search is first executed on a big candidate set, which improves both the accuracy and efficiency of the candidate set (the cumulative accuracy of the top ten candidates is greatly boosted). Then, to further improve accuracy, word-based bigram postprocessing (WBP) is executed on a small candidate set. This method obtains high accuracy while paying attention to postprocessing speed at the same time. Experimental results for three Chinese scripts (about 66,000 characters in total) demonstrate the effectiveness of our method: character-based bigram postprocessing improves accuracy from 81.58% to 94.50%, and the cumulative accuracy of the top ten candidates rises from 94.33% to 98.25%. After WBP, 95.75% accuracy is achieved, which is equivalent to the accuracy of WBP executed on a big candidate set. However, our method is more than 100 times faster than WBP executed on a big candidate set.
Semantically object synchronous understanding in SALT for highly interactive user interface
- EUROSPEECH, 2003
Cited by 1 (0 self)
SALT is an industrial standard that enables speech input/output for Web applications. Although the core design is to make simple tasks easy, SALT gives designers ample fine-grained controls to create advanced user interfaces. The paper exploits a speech input mode in which SALT would dynamically report partial semantic parses while audio capturing is still ongoing. The semantic parses can be evaluated and the outcome reported immediately back to the user. The potential impact for dialog systems is that tasks conventionally performed in a system turn can now be carried out in the midst of a user turn, thereby presenting a significant departure from conventional turn-taking. To assess the efficacy of such a highly interactive interface, more user studies are undoubtedly needed. This paper demonstrates how SALT can be employed to facilitate such studies.

