High Performance Speaker-Independent Phone Recognition Using CDHMM
In Proc. Eurospeech, 1993
Cited by 41 (11 self)
In this paper we report high phone accuracies on three corpora: WSJ0, BREF and TIMIT. The main characteristics of the phone recognizer are: high-dimensional feature vector (48), context- and gender-dependent phone models with duration distribution, continuous density HMM with Gaussian mixtures, and n-gram probabilities for the phonotactic constraints. These models are trained on speech data that have either phonetic or orthographic transcriptions using maximum likelihood and maximum a posteriori estimation techniques. On the WSJ0 corpus with a 46 phone set we obtain phone accuracies of 72.4% and 74.4% using 500 and 1600 CD phone units, respectively. Accuracy on BREF with 35 phones is as high as 78.7% with only 428 CD phone units. On TIMIT using the 61 phone symbols and only 500 CD phone units, we obtain a phone accuracy of 67.2%, which corresponds to 73.4% when the recognizer output is mapped to the commonly used 39 phone set. Making reference to our work on large vocabulary CSR, we show that ...
Speaker-Independent Continuous Speech Dictation
SPEECH COMMUNICATION, 1994
Cited by 26 (12 self)
In this paper we report on progress made at LIMSI in speaker-independent large vocabulary speech dictation using newspaper-based speech corpora in English and French. The recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on newspaper texts for language modeling. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models. For English the ARPA Wall Street Journal-based CSR corpus is used and for French the BREF corpus containing recordings of texts from the French newspaper Le Monde is used. Experiments were carried out with both these corpora at the phone level and at the word level with vocabularies containing up to 20,000 words. Word recognition experiments are also described for the ARPA RM task which has been widely used to evaluate and compare systems.
Cross-Lingual Experiments with Phone Recognition
Proc. IEEE ICASSP-93
Cited by 15 (9 self)
This paper presents some of the recent research on speaker-independent continuous phone recognition for both French and English. The phone accuracy is assessed on the BREF corpus for French, and on the Wall Street Journal and TIMIT corpora for English. Cross-language differences concerning language properties are presented. It was found that French is easier to recognize at the phone level (the phone error for BREF is 23.6% vs. 30.1% for WSJ), but harder to recognize at the lexical level due to the larger number of homophones. Experiments with signal analysis indicate that a 4 kHz signal bandwidth is sufficient for French, whereas 8 kHz is needed for English. Phone recognition is a powerful technique for language, sex, and speaker identification. With 2 s of speech, the language can be identified with better than 99% accuracy. Sex identification for BREF and WSJ is error-free. Speaker identification accuracies of 98.2% on TIMIT (462 speakers) and 99.1% on BREF (57 speakers) were obtained w...
Identification of Non-Linguistic Speech Features
Proc. ARPA Human Language Technology Workshop, 1993
Cited by 13 (6 self)
Over the last decade technological advances have been made which enable us to envision real-world applications of speech technologies. It is possible to foresee applications where the spoken query is to be recognized without even prior knowledge of the language being spoken, for example, information centers in public places such as train stations and airports. Other applications may require accurate identification of the speaker for security reasons, including control of access to confidential information or for telephone-based transactions. Ideally, the speaker's identity can be verified continually during the transaction, in a manner completely transparent to the user. With these views in mind, this paper presents a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. This technique is shown to be effective for text-independent language, sex, and speaker identification and can enable better and more friendly h...
The LIMSI Continuous Speech Dictation System
Cited by 11 (1 self)
A major axis of research at LIMSI is directed at multilingual, speaker-independent, large vocabulary speech dictation. In this paper the LIMSI recognizer which was evaluated in the ARPA NOV93 CSR test is described, and experimental results on the WSJ and BREF corpora under closely matched conditions are reported. For both corpora word recognition experiments were carried out with vocabularies containing up to 20k words. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. The recognizer uses a time-synchronous graph-search strategy which is shown to still be viable with a 20k-word vocabulary when used with bigram back-off language models. A second forward pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models.
Pronunciation Variants across System Configuration, Language and Speaking Style
1999
Cited by 7 (2 self)
This contribution aims at evaluating the use of pronunciation variants for different recognition system configurations, languages and speaking styles. This study is limited to the use of variants during speech alignment, given an orthographic transcription of the utterance and a phonemically represented lexicon, and is thus focused on the modeling capabilities of the acoustic word models. To measure the need for variants we have defined the variant2+ rate which is the percentage of words in the corpus not aligned with the most common phonemic transcription. This measure may be indicative of the possible need for pronunciation variants in the recognition system.
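The variant2+ rate defined in this abstract can be illustrated with a minimal sketch. It assumes that "most common phonemic transcription" is determined separately for each word from the alignment output itself; the abstract does not say how the reference transcription is chosen. The function name, the input format (one word and aligned transcription pair per token) and the sample pronunciations are hypothetical.

```python
from collections import Counter, defaultdict


def variant2plus_rate(alignments):
    """Percentage of word tokens aligned with a transcription other than
    the most common one for that word.

    alignments: iterable of (word, phonemic_transcription) pairs, one per
    aligned word token (hypothetical input format).
    """
    # Count how often each transcription was chosen for each word.
    counts = defaultdict(Counter)
    for word, pron in alignments:
        counts[word][pron] += 1

    total = sum(sum(c.values()) for c in counts.values())
    if total == 0:
        return 0.0

    # Tokens aligned with the most frequent transcription of their word.
    top = sum(max(c.values()) for c in counts.values())
    return 100.0 * (total - top) / total


# Made-up alignment output: 5 tokens, 2 of them use a non-dominant variant.
sample = [
    ("the", "dh ax"), ("the", "dh ax"), ("the", "dh iy"),
    ("and", "ae n d"), ("and", "ax n"),
]
rate = variant2plus_rate(sample)
```

A high rate under this reading suggests that a single transcription per word does not cover the observed pronunciations well.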
Speech-to-text conversion in French
1994
Cited by 6 (6 self)
Speech-to-text conversion of French necessitates that both the acoustic level recognition and language modeling be tailored to the French language. Work in this area was initiated at LIMSI over 10 years ago. In this paper a summary of the ongoing research in this direction is presented. Included are studies on distributional properties of French text materials; problems of speech-to-text conversion particular to French; studies in phoneme-to-grapheme conversion for continuous, error-free phonemic strings; past work on isolated-word speech-to-text conversion; and more recent work on continuous-speech speech-to-text conversion. Also demonstrated is the use of phone recognition for both language and speaker identification. The
Large Vocabulary Continuous Speech Recognition: from Laboratory Systems towards Real-World Applications
1996
Cited by 6 (4 self)
This paper provides an overview of the state-of-the-art in laboratory speaker-independent, large vocabulary continuous speech recognition (LVCSR) systems with a view towards adapting such technology to the requirements of real-world applications. While in speech recognition the principal concern is to transcribe the speech signal as a sequence of words, the same core technology can be applied to domains other than dictation. The main topics addressed are acoustic-phonetic modeling, lexical representation, language modeling, decoding and model adaptation. After a brief summary of experimental results, some directions towards usable systems are given. In moving from laboratory systems towards real-world applications, different constraints arise which influence the system design. The application imposes limitations on computational resources, constraints on signal capture, requirements for noise and channel compensation, and rejection capability. The difficulties and costs of adapting existing technology to new languages and applications need to be assessed. Near-term applications for LVCSR technology are likely to grow in somewhat limited domains such as spoken language systems for information retrieval, and limited domain dictation. Perspectives on some unresolved problems are given, indicating areas for future research.
Speech Recognition for an Information Kiosk
Proc. ICSLP 96, 1996
Cited by 5 (4 self)
In the context of the ESPRIT MASK project we face the problem of adapting a "state-of-the-art" laboratory speech recognizer for use in the real world with naive users. The speech recognizer is a software-only system that runs in real-time on a standard RISC processor. All aspects of the speech recognizer have been reconsidered from signal capture to adaptive acoustic models and language models. The resulting system includes such features as microphone selection, response cancellation, noise compensation, query rejection capability and decoding strategies for real-time recognition. 1. INTRODUCTION In this paper we address issues that must be faced in adapting a "state-of-the-art" speech recognizer developed in a laboratory for real-world use. All aspects of the speech recognizer must be reconsidered from signal capture to adaptive acoustic and language models. We have confronted these issues in the context of the ESPRIT MASK (Multimodal-Multimedia Automated Service Kiosk) project, aime...
The LIMSI Nov93 WSJ System
In Proc. 1994 ARPA Spoken Language Technology Workshop, 1994
Cited by 1 (0 self)
In this paper we report on the LIMSI Wall Street Journal system which was evaluated in the November 1993 test. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. The decoding is carried out in two forward acoustic passes. The first pass is a time-synchronous graph search, which is shown to still be viable with vocabularies of up to 20k words when used with bigram back-off language models. The second pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models. The official Nov93 evaluation results are given for vocabularies of up to 64,000 words, as well as results on the Nov92 5k and 20k test material. 1. Introduction Our speech recognition research focuses on developing reco...

