Sphinx-4: A flexible open source framework for speech recognition, 2004
Cited by 48 (0 self)
Sphinx-4 is a flexible, modular and pluggable framework to help foster new innovations in the core research of hidden Markov model (HMM) speech recognition systems. The design of Sphinx-4 is based on patterns that have emerged from the design of past systems as well as new requirements based on areas that researchers currently want to explore. To exercise this framework, and to provide researchers with a “research-ready” system, Sphinx-4 also includes several implementations of both simple and state-of-the-art techniques. The framework and the implementations are all freely available via open source.
PocketSphinx: A free, real-time continuous speech recognition system for hand-held devices
In Proceedings of ICASSP, 2006
Cited by 12 (0 self)
The availability of real-time continuous speech recognition on mobile and embedded devices has opened up a wide range of research opportunities in human-computer interactive applications. Unfortunately, most of the work in this area to date has been confined to proprietary software, or has focused on limited domains with constrained grammars. In this paper, we present a preliminary case study on the porting and optimization of CMU SPHINX-II, a popular open source large vocabulary continuous speech recognition (LVCSR) system, to hand-held devices. The resulting system operates at an average of 0.87 times real-time on a 206 MHz device, 8.03 times faster than the baseline system. To our knowledge, this is the first hand-held LVCSR system available under an open-source license.
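The real-time factor quoted in this abstract is the ratio of processing time to audio duration, so a value below 1.0 means the recognizer keeps up with incoming speech. The following is a minimal sketch of that arithmetic only; the function names are illustrative and are not part of PocketSphinx, and it assumes "8.03 times faster" means a ratio of processing times.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (times real-time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def baseline_factor(optimized_xrt: float, speedup: float) -> float:
    """Recover the baseline real-time factor implied by a speedup ratio.

    Assumption: the speedup is baseline time divided by optimized time.
    """
    return optimized_xrt * speedup


if __name__ == "__main__":
    # Figures taken from the abstract: 0.87 times real-time, 8.03x speedup.
    implied = baseline_factor(0.87, 8.03)
    print(f"implied baseline: about {implied:.2f} times real-time")
```

Under that reading, the baseline system would have run at roughly seven times real-time on the same device.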
Head-driven parsing for word lattices
In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 2004
Cited by 4 (0 self)
We present the first application of the head-driven statistical parsing model of Collins (1999) as a simultaneous language model and parser for large-vocabulary speech recognition. The model is adapted to an online left-to-right chart-parser for word lattices, integrating acoustic, n-gram, and parser probabilities. The parser uses structural and lexical dependencies not considered by n-gram models, conditioning recognition on more linguistically-grounded relationships. Experiments on the Wall Street Journal treebank and lattice corpora show word error rates competitive with the standard n-gram language model while extracting additional structural information useful for speech understanding.
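The abstract does not say how the acoustic, n-gram, and parser probabilities are combined. One common way to integrate several knowledge sources is a weighted sum of log-probabilities. The sketch below assumes that scheme; the weights, the tuple layout, and the function names are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# (word sequence, acoustic log-prob, n-gram log-prob, parser log-prob)
ScoredPath = Tuple[str, float, float, float]


def combined_score(acoustic: float, ngram: float, parser: float,
                   ngram_weight: float = 1.0,
                   parser_weight: float = 1.0) -> float:
    """Weighted sum of log-probabilities (an assumed log-linear scheme)."""
    return acoustic + ngram_weight * ngram + parser_weight * parser


def best_path(paths: List[ScoredPath],
              ngram_weight: float = 1.0,
              parser_weight: float = 1.0) -> str:
    """Return the word sequence of the highest-scoring lattice path."""
    if not paths:
        raise ValueError("paths must not be empty")
    winner = max(
        paths,
        key=lambda p: combined_score(p[1], p[2], p[3],
                                     ngram_weight, parser_weight),
    )
    return winner[0]


if __name__ == "__main__":
    # Made-up scores for two competing paths through a word lattice.
    lattice_paths = [
        ("stocks rose sharply", -10.0, -5.0, -3.0),
        ("stock arose sharply", -8.0, -6.0, -5.0),
    ]
    print(best_path(lattice_paths))
    print(best_path(lattice_paths, parser_weight=0.0))
```

With the made-up scores shown, setting the parser weight to zero changes which path wins, which illustrates how a parser score can alter the recognition result.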
The 1999 CMU 10x real time broadcast news transcription system
Proc. DARPA workshop on Automatic Transcription of Broadcast News, 2000
Cited by 2 (1 self)
CMU's 10X real time system is the HMM-based SPHINX-III system with a newly developed fast decoder. The fast decoder uses a subvector clustered version of the acoustic models for Gaussian computation and a lexical tree search structure. It was developed in September 1999, and is currently a first-pass decoder, capable of generating word lattices. It was designed to optimize speed, recognition accuracy, and memory requirements. For the 1999 Hub 4 evaluation task, the system used two sets of acoustic models: full-bandwidth and narrow-bandwidth. The acoustic models were 6000 senone, 32 Gaussians per state, 3-state HMMs with no skips permitted across states. The system used a single 39-dimensional feature stream consisting of cepstra and cepstral differences. The lattices generated were rescored using a DAG algorithm. The DAG-rescored hypotheses were designated as those of the primary system. The contrastive system consisted of the output of the first-pass Viterbi search, with no DAG rescoring of lattices. A trigram language model consisting of 57,000 unigrams, 10 million bigrams and 14.9 million trigrams was used. No adaptation passes were done. In this paper we describe the various components of the primary system. The first-pass word error rate on the 1998 Hub 4 evaluation set was 20.4% with this system. The overall word error rate scored by NIST for the 1999 Hub 4 evaluation set was 27.6%.
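The word error rates quoted in this abstract follow the usual definition: substitutions, deletions, and insertions divided by the number of reference words. The sketch below computes that quantity with a word-level edit distance. It illustrates the metric only and is not the NIST scoring tool, which applies additional text normalization not modeled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("the news at nine", "the views at"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.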
New Features for Confidence Annotation
In Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP)
In this paper we describe two new confidence measures for estimating the reliability of speech-to-text output: Likelihood Dependence and Neighborhood Dependence. Each word in the speech-to-text output for a given utterance is annotated with these two measures. Likelihood dependence for a given word occurrence indicates how critical that word is to the overall utterance likelihood; i.e., how much worse the likelihood of the next best utterance is if that word is eliminated from the recognition. Neighborhood dependence measures how stable a given word is when neighboring words are changed in the recognition. We show that correct and incorrect words in the recognition behave significantly differently with respect to these measures. We also show that on the broadcast news task they perform better than some of the existing, commonly used confidence measures.
1. Introduction Detecting regions of high and low reliability or confidence in the output of an automatic speech recognizer is an im...
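The likelihood dependence measure described in this abstract can be approximated from an N-best list: for each word in the top hypothesis, find the best competing hypothesis that omits the word and record how far the log-likelihood drops. The sketch below rests on simplifying assumptions that are not from the paper: it uses an N-best list instead of rerunning the recognizer, and it matches words by identity and ignores their position or timing. All names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Hypothesis = Tuple[Sequence[str], float]  # (words, utterance log-likelihood)


def likelihood_dependence(
    nbest: List[Hypothesis],
) -> List[Tuple[str, Optional[float]]]:
    """Approximate per-word likelihood dependence from an N-best list.

    Returns (word, drop) pairs for the top hypothesis, where drop is the
    log-likelihood lost by moving to the best hypothesis without that word.
    The drop is None when every listed hypothesis contains the word.
    """
    if not nbest:
        raise ValueError("nbest must not be empty")
    ranked = sorted(nbest, key=lambda h: h[1], reverse=True)
    best_words, best_score = ranked[0]

    results: List[Tuple[str, Optional[float]]] = []
    for word in best_words:
        # Highest-scoring competitor that does not contain the word at all.
        alternative = next(
            (score for words, score in ranked[1:] if word not in words),
            None,
        )
        drop = None if alternative is None else best_score - alternative
        results.append((word, drop))
    return results


if __name__ == "__main__":
    # Made-up N-best list with log-likelihoods.
    nbest_list = [
        (["show", "me", "flights"], -100.0),
        (["show", "me", "lights"], -104.0),
        (["so", "me", "flights"], -110.0),
    ]
    for word, drop in likelihood_dependence(nbest_list):
        print(word, drop)
```

A large drop suggests that the utterance likelihood depends heavily on that word, while a small drop means a close competitor exists without it.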
The 1996 Hub-4 Sphinx-3 System
In Proc. of DARPA Speech Recognition Workshop, 1996
This paper describes the CMU Sphinx-3 system, and the configuration we used for the 1996 DARPA (Hub-4) evaluation. The model structure, acoustic modeling, language modeling, lexical modeling, and system structure are summarized. We also discuss the experimental results obtained with this system on the most recent DARPA evaluation, along with some subsequent results.
Motivation Past efforts on speech recognition have focused on clean, good quality speech in friendly environments, and DARPA evaluations in past years have followed this agenda. While one must walk before one can run, we have, as a community, developed our technology to the point where we can handle large vocabulary dictation well, and spontaneous speech fairly well. The evaluations have tracked this, with the introduction last year of so-called "found" speech, recorded off the air from commercial broadcasts. For the 1996 Hub-4 evaluation we have continued in this vein, widening our horizons to include the ad...

