The LIMSI Broadcast News Transcription System
- Speech Communication
, 2002
Cited by 84 (5 self)
This paper reports on activities at LIMSI over the last few years directed at the transcription of broadcast news data. We describe our development work in moving from laboratory read speech data to real-world or 'found' speech data in preparation for the ARPA Nov96, Nov97 and Nov98 evaluations. Two main problems needed to be addressed to deal with the continuous flow of inhomogeneous data. These concern the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers).
Lightly Supervised and Unsupervised Acoustic Model Training
- Computer Speech and Language
, 2002
Cited by 34 (2 self)
The last decade has witnessed substantial progress in speech recognition technology, with today's state-of-the-art systems being able to transcribe unrestricted broadcast news audio data with a word error of about 20%.
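The word error figure quoted in this abstract is conventionally the number of substituted, deleted and inserted words divided by the number of reference words. The sketch below assumes that standard definition and a plain Levenshtein alignment over whitespace-separated words; the function name is illustrative and is not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a word error of about 20% corresponds to roughly one error for every five reference words.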
Lightly Supervised Acoustic Model Training
- Proc. ISCA ITRW ASR2000
, 2000
Cited by 20 (6 self)
Although tremendous progress has been made in speech recognition technology, with the capability of today's state-of-the-art systems to transcribe unrestricted continuous speech from broadcast data, these systems rely on the availability of large amounts of manually transcribed acoustic training data. Obtaining such data is both time-consuming and expensive, requiring trained human annotators with substantial amounts of supervision. In this paper we describe some recent experiments using lightly supervised techniques for acoustic model training in order to reduce the system development cost. The strategy we investigate uses a speech recognizer to transcribe unannotated broadcast news data, and optionally combines the hypothesized transcription with associated, but unaligned closed captions or transcripts to create labeled training data. We show that this approach can dramatically reduce the cost of building acoustic models.
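The abstract does not say how hypotheses and closed captions are combined. One plausible reading is to keep only those automatically transcribed segments whose words are largely found, in order, in the caption text. The sketch below illustrates that idea; the function names, the segment representation and the 0.8 threshold are assumptions made for illustration, not details from the paper.

```python
from difflib import SequenceMatcher


def agreement(hyp_words, caption_words):
    """Fraction of hypothesis words that align, in order, with the captions."""
    if not hyp_words:
        return 0.0
    matcher = SequenceMatcher(a=hyp_words, b=caption_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(hyp_words)


def select_training_segments(segments, caption_words, threshold=0.8):
    """Keep (segment_id, words) pairs whose hypothesis agrees with the captions.

    `segments` is a list of (segment_id, list_of_hypothesized_words) pairs;
    `caption_words` is the unaligned caption text as one list of words.
    """
    return [
        (segment_id, words)
        for segment_id, words in segments
        if agreement(words, caption_words) >= threshold
    ]
```

Segments that pass the filter would serve as labeled training data, while the rest would be discarded or left for a later pass with better models.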
Connectionist Language Modeling For Large Vocabulary Continuous Speech Recognition
- In International Conference on Acoustics, Speech and Signal Processing
, 2002
Cited by 16 (1 self)
This paper describes ongoing work on a new approach for language modeling for large vocabulary continuous speech recognition. Almost all state-of-the-art systems use statistical n-gram language models estimated on text corpora. One principal problem with such language models is the fact that many of the n-grams are never observed even in very large training corpora, and therefore it is common to back off to a lower-order model. In this paper we propose to address this problem by carrying out the estimation task in a continuous space, enabling a smooth interpolation of the probabilities. A neural network is used to learn the projection of the words onto a continuous space and to estimate the n-gram probabilities. The connectionist language model is being evaluated on the DARPA HUB5 conversational telephone speech recognition task and preliminary results show consistent improvements in both perplexity and word error rate.
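The projection-then-estimation idea in this abstract can be sketched as a forward pass: each context word is mapped to a continuous vector, the vectors are concatenated, and a softmax output layer yields a probability for every vocabulary word. The class below is a minimal illustration with random, untrained weights. The layer sizes, the tanh hidden layer and the absence of bias terms are assumptions, and the architecture in the paper may differ.

```python
import math
import random


class TinyNeuralLM:
    """Untrained feed-forward n-gram model: embed, hidden layer, softmax."""

    def __init__(self, vocab, context_size=2, embed_dim=4, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        self.vocab = list(vocab)
        self.index = {word: i for i, word in enumerate(self.vocab)}
        self.context_size = context_size

        def matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # One continuous vector per word: the projection onto a continuous space.
        self.embed = matrix(len(self.vocab), embed_dim)
        self.w_hidden = matrix(hidden_dim, context_size * embed_dim)
        self.w_out = matrix(len(self.vocab), hidden_dim)

    def probabilities(self, context):
        """Return P(word | context) for every vocabulary word."""
        if len(context) != self.context_size:
            raise ValueError("context must contain exactly context_size words")
        # Concatenate the continuous representations of the context words.
        x = [value for word in context for value in self.embed[self.index[word]]]
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in self.w_hidden]
        scores = [sum(w * h for w, h in zip(row, hidden)) for row in self.w_out]
        # Numerically stable softmax over the whole vocabulary.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return {word: e / total for word, e in zip(self.vocab, exps)}
```

Because every word receives a non-zero probability from the softmax, an unseen n-gram needs no explicit back-off in this formulation.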
Unsupervised Acoustic Model Training
- in Proceedings of ICASSP
, 2002
Cited by 12 (1 self)
This paper describes some recent experiments using unsupervised techniques for acoustic model training in order to reduce the system development cost. The approach uses a speech recognizer to transcribe unannotated raw broadcast news data. The hypothesized transcription is used to create labels for the training data. Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for success of the approach, and that the acoustic models can be initialized with as little as 10 minutes of manually annotated data. These experiments demonstrate that unsupervised training is a viable training scheme and can dramatically reduce the cost of building acoustic models.
Improved ROVER using Language Model Information
- Proc. ISCA ITRW Workshop on Automatic Speech Recognition: Challenges for the new Millenium
, 2000
Cited by 6 (1 self)
In the standard approach to speech recognition, the goal is to find the sentence hypothesis that maximizes the posterior probability of the word sequence given the acoustic observation. Speech recognizers are usually evaluated by measuring the word error, so there is a mismatch between the training and the evaluation criterion. Recently, algorithms for directly minimizing the word error and other task-specific error criteria have been proposed. This paper presents an extension of the ROVER algorithm for combining outputs of multiple speech recognizers using both a word error criterion and a sentence error criterion. The algorithm has been evaluated on the 1998 and 1999 broadcast news evaluation test sets, as well as the SDR 1999 speech recognition 10 hour subset and consistently outperformed the standard ROVER algorithm. The approach seems to be of particular interest for improving the recognition performance by combining only two or three speech recognizers achieving relative pe...
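For orientation, the voting stage of standard ROVER can be sketched as a majority vote over aligned word slots. The sketch below assumes the hypotheses have already been aligned into slots of equal length, with None marking an empty slot, and it votes by frequency only. It does not implement the language model extension this paper proposes, and the function name is illustrative.

```python
from collections import Counter

NULL = None  # marks a slot where a recognizer output no word


def rover_vote(aligned_hypotheses):
    """Pick the most frequent entry in each aligned slot.

    `aligned_hypotheses` is a list of equal-length word lists, one per
    recognizer. Ties go to the entry seen first, that is, to the
    earliest recognizer in the list.
    """
    if not aligned_hypotheses:
        return []
    n_slots = len(aligned_hypotheses[0])
    if any(len(hyp) != n_slots for hyp in aligned_hypotheses):
        raise ValueError("all hypotheses must be aligned to the same length")

    output = []
    for slot in zip(*aligned_hypotheses):
        word, _count = Counter(slot).most_common(1)[0]
        if word is not NULL:  # a winning empty slot emits nothing
            output.append(word)
    return output
```

With only two or three recognizers, frequency voting produces many ties and weak majorities, which may be why additional knowledge sources are of interest in that setting.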
Investigating Lightly Supervised Acoustic Model Training
- In Proc. ICASSP
, 2001
Cited by 4 (0 self)
The last decade has witnessed substantial progress in speech recognition technology, with today's state-of-the-art systems being able to transcribe broadcast audio data with a word error of about 20%. However, acoustic model development for the recognizers requires large corpora of manually transcribed training data. Obtaining such data is both time-consuming and expensive, requiring trained human annotators with substantial amounts of supervision. In this paper we describe some recent experiments using different levels of supervision for acoustic model training in order to reduce the system development cost. The experiments have been carried out using the DARPA TDT-2 corpus (also used in the SDR99 and SDR00 evaluations). Our experiments demonstrate that light supervision is sufficient for acoustic model development, drastically reducing the development cost.
Automatic Transcription Of Compressed Broadcast Audio
- in Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’01), Vol.1
, 2001
Cited by 3 (0 self)
With increasing volumes of audio and video data broadcast over the web, it is of interest to assess the performance of state-of-the-art automatic transcription systems on compressed audio data for media indexation applications. In this paper the performance of the LIMSI 10x French broadcast news transcription system is measured on a two-hour audio set for a range of MP3 and RealAudio codecs at various bitrates and the GSM codec used for European cellular phone communications. The word error rates are compared with those obtained on high quality PCM recordings prior to compression. For a 6.5 kbps audio bit rate (the most commonly used on the web), word error rates under 40% can be achieved, which makes automatic media monitoring systems over the web a realistic task.
Improving Genericity for Task-Independent Speech Recognition
Cited by 3 (2 self)
Although there have been regular improvements in speech recognition technology over the past decade, speech recognition is far from being a solved problem. Recognition systems are usually tuned to a particular task and porting the system to a new task (or language) is both time-consuming and expensive. In this paper, issues in speech recognizer portability are addressed through the development of generic core speech recognition technology. First, the genericity of wide domain models is assessed by evaluating performance on several tasks. Then, the use of transparent methods for adapting generic models to a specific task is explored. Finally, further techniques are evaluated aiming at enhancing the genericity of the wide domain models. We show that unsupervised acoustic model adaptation and multi-source training can reduce the performance gap between task-independent and task-dependent acoustic models, and for some tasks even outperform task-dependent acoustic models.
Genericity and Adaptability Issues for Task-Independent Speech Recognition
- ISCA ITRW 2001 Adaptation Methods For Speech Recognition, Sophia-Antipolis
, 2001
Cited by 2 (2 self)
The last decade has witnessed major advances in core speech recognition technology, with today's systems able to recognize continuous speech from many speakers without the need for an explicit enrollment procedure. Despite these improvements, speech recognition is far from being a solved problem. Most recognition systems are tuned to a particular task and porting the system to another task or language is both time-consuming and expensive.

