Survey of the State of the Art in Human Language Technology
, 1995
Cited by 47 (0 self)
Contents
1 Spoken Language Input (Ron Cole & Victor Zue, chapter editors)
1.1 Overview (Victor Zue & Ron Cole)
1.2 Speech Recognition (Victor Zue, Ron Cole, & Wayne Ward)
1.3 Signal Representation (Melvyn J. Hunt)
1.4 Robust Speech Recognition (Richard M. Stern)
1.5 HMM Methods in Speech Recognition (Renato De Mori & Fabio Brugnara)
1.6 Language Representation (Salim Roukos)
1.7 Speaker Recognition ...
High Performance Speaker-Independent Phone Recognition Using CDHMM
- In Proc. Eurospeech
, 1993
Cited by 41 (11 self)
In this paper we report high phone accuracies on three corpora: WSJ0, BREF and TIMIT. The main characteristics of the phone recognizer are: high dimensional feature vector (48), context- and gender-dependent phone models with duration distribution, continuous density HMM with Gaussian mixtures, and n-gram probabilities for the phonotactic constraints. These models are trained on speech data that have either phonetic or orthographic transcriptions using maximum likelihood and maximum a posteriori estimation techniques. On the WSJ0 corpus with a 46 phone set we obtain phone accuracies of 72.4% and 74.4% using 500 and 1600 CD phone units, respectively. Accuracy on BREF with 35 phones is as high as 78.7% with only 428 CD phone units. On TIMIT using the 61 phone symbols and only 500 CD phone units, we obtain a phone accuracy of 67.2%, which corresponds to 73.4% when the recognizer output is mapped to the commonly used 39 phone set. Making reference to our work on large vocabulary CSR, we show that ...
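For readers unfamiliar with the measure, the sketch below shows one common way a phone accuracy figure can be computed, and how folding a larger phone set into a smaller one before scoring can change it. It assumes accuracy is defined as 1 minus the edit distance (substitutions + deletions + insertions) divided by the reference length; the abstract does not spell out its scoring procedure. The function names and the one-entry `fold` table are hypothetical and are not the actual 61-to-39 mapping.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference symbols to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phone_accuracy(ref, hyp, fold=None):
    """Accuracy = 1 - (S + D + I) / N, optionally after folding phone labels.

    `fold` is a hypothetical dict mapping fine-grained labels to a reduced set;
    labels missing from the dict are kept unchanged.
    """
    if not ref:
        raise ValueError("reference sequence must not be empty")
    fold = fold or {}
    ref = [fold.get(p, p) for p in ref]
    hyp = [fold.get(p, p) for p in hyp]
    return 1.0 - edit_distance(ref, hyp) / len(ref)


# Toy example: one substitution and one insertion against five reference phones.
reference = ["sil", "k", "ae", "t", "sil"]
hypothesis = ["sil", "k", "eh", "t", "t", "sil"]
unfolded = phone_accuracy(reference, hypothesis)
# Illustrative fold only: treating "eh" and "ae" as one class removes the substitution.
folded = phone_accuracy(reference, hypothesis, fold={"eh": "ae"})
```

Scoring after folding can only merge labels, so the folded accuracy is never lower than the unfolded one on the same output, which is consistent with the direction of the 67.2% and 73.4% figures reported for TIMIT.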
The challenge of spoken language systems: Research directions for the nineties
- IEEE Transactions on Speech and Audio Processing
, 1995
Cited by 34 (5 self)
Footnote This article is based on a February 1992 workshop sponsored by the National Science
Bootstrap estimates for confidence intervals in ASR performance evaluation
- Proc. of ICASSP
, 2004
Cited by 34 (2 self)
The field of speech recognition has clearly benefited from precisely defined testing conditions and objective performance measures such as word error rate. In the development and evaluation of new methods, the question arises whether the empirically observed difference in performance is due to a genuine advantage of one system over the other, or just an effect of chance. However, many publications still do not concern themselves with the statistical significance of the results reported. In this paper we present a bootstrap method for significance analysis which is at the same time intuitive, precise and easy to use. Unlike some methods, we make no (possibly ill-founded) approximations and the results are immediately interpretable in terms of word error rate.
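As an illustration of the general idea, a percentile bootstrap interval for word error rate can be sketched as follows. This is a minimal sketch under stated assumptions, not necessarily the procedure of the paper: it assumes per-sentence error counts and reference lengths are available, resamples sentences with replacement, and reads the interval off the sorted resampled error rates. The function name, parameters, and toy data are hypothetical.

```python
import random


def bootstrap_wer_interval(errors, ref_lengths, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for word error rate.

    errors[i]      -- word errors (S + D + I) in sentence i
    ref_lengths[i] -- number of reference words in sentence i (must be > 0)
    """
    if not errors or len(errors) != len(ref_lengths):
        raise ValueError("need equally long, non-empty per-sentence lists")
    if any(n <= 0 for n in ref_lengths):
        raise ValueError("reference lengths must be positive")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(errors)
    rates = []
    for _ in range(n_resamples):
        # Draw n sentences with replacement and recompute the pooled error rate.
        picks = [rng.randrange(n) for _ in range(n)]
        total_errors = sum(errors[i] for i in picks)
        total_words = sum(ref_lengths[i] for i in picks)
        rates.append(total_errors / total_words)

    rates.sort()
    lower = rates[int((alpha / 2) * (n_resamples - 1))]
    upper = rates[int((1 - alpha / 2) * (n_resamples - 1))]
    return lower, upper


# Toy data, invented for illustration only.
sentence_errors = [1, 0, 2, 1, 0, 3]
sentence_lengths = [10, 8, 12, 9, 7, 11]
low, high = bootstrap_wer_interval(sentence_errors, sentence_lengths)
```

The same resampling loop can be applied to the per-sentence difference in errors between two systems evaluated on the same test set, in which case the fraction of resamples where the difference is negative gives a direct estimate of how often one system beats the other.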
Speaker-Independent Continuous Speech Dictation
- SPEECH COMMUNICATION
, 1994
Cited by 26 (12 self)
In this paper we report on progress made at LIMSI in speaker-independent large vocabulary speech dictation using newspaper-based speech corpora in English and French. The recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on newspaper texts for language modeling. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra- and interword), phone duration models, and sex-dependent models. For English, the ARPA Wall Street Journal-based CSR corpus is used; for French, the BREF corpus, which contains recordings of texts from the French newspaper Le Monde, is used. Experiments were carried out with both these corpora at the phone level and at the word level with vocabularies containing up to 20,000 words. Word recognition experiments are also described for the ARPA RM task, which has been widely used to evaluate and compare systems.
Near-Miss Modeling: A Segment-Based Approach to Speech Recognition
, 1998
Cited by 15 (0 self)
Currently, most approaches to speech recognition are frame-based in that they represent speech as a temporal sequence of feature vectors. Although these approaches have been successful, they cannot easily incorporate complex modeling strategies that may further improve speech recognition performance. In contrast, segment-based approaches represent speech as a temporal graph of feature vectors and facilitate the incorporation of a wide range of modeling strategies. However, difficulties in segment-based recognition have impeded the realization of potential advantages in modeling. This thesis
Combining Linguistic with Statistical Methods in Automatic Speech Understanding
, 1994
Cited by 11 (0 self)
this paper will argue, combining knowledge and techniques from the two communities can yield results that neither community alone could achieve
Pegasus: A Spoken Language Interface for On-Line Air Travel Planning
- Speech Communication
, 1994
Cited by 7 (3 self)
This paper describes PEGASUS, a spoken language interface for on-line air travel planning that we have recently developed. PEGASUS leverages off our spoken language technology development in the ATIS domain, and enables users to book flights using the American Airlines EAASY SABRE system. The input query is transformed by the speech understanding system to a frame representation that captures its meaning. The tasks of the System Manager include transforming the semantic representation into an EAASY SABRE command, transmitting it to the application backend, formatting and interpreting the resulting information, and managing the dialogue. Preliminary evaluation results suggest that users can learn to make productive use of PEGASUS for travel planning, although much work remains to be done.
Large Vocabulary Continuous Speech Recognition: from Laboratory Systems towards Real-World Applications
, 1996
Cited by 6 (4 self)
This paper provides an overview of the state-of-the-art in laboratory speaker-independent, large vocabulary continuous speech recognition (LVCSR) systems with a view towards adapting such technology to the requirements of real-world applications. While in speech recognition the principal concern is to transcribe the speech signal as a sequence of words, the same core technology can be applied to domains other than dictation. The main topics addressed are acoustic-phonetic modeling, lexical representation, language modeling, decoding and model adaptation. After a brief summary of experimental results, some directions towards usable systems are given. In moving from laboratory systems towards real-world applications, different constraints arise which influence the system design. The application imposes limitations on computational resources, constraints on signal capture, requirements for noise and channel compensation, and rejection capability. The difficulties and costs of adapting existing technology to new languages and applications need to be assessed. Near-term applications for LVCSR technology are likely to grow in somewhat limited domains such as spoken language systems for information retrieval, and limited domain dictation. Perspectives on some unresolved problems are given, indicating areas for future research.
Comparing the recognition performance of CSRs: in search of an adequate metric and statistical significance test
- In: Proc. 6th Int. Conf. on Spoken Language Processing (ICSLP 2000)
, 2000
Cited by 6 (2 self)
In this paper a new measure of recognition accuracy is introduced which can be used when comparing the performance of two speech recognizers, to establish which is the better one. This metric combines the advantages of previous measures, but excludes their disadvantages. Essentially, the metric is an attempt to quantify the degree of recognition accuracy for each sentence, thus obtaining a more informative measure than either correct or incorrect, in such a way that the statistical significance of the observed differences can be tested. The advantages of our assessment method are illustrated on the basis of both artificial and real performance data of different recognizers.

