The LIMSI Broadcast News Transcription System
Speech Communication, 2002
Cited by 84 (5 self)
This paper reports on activities at LIMSI over the last few years directed at the transcription of broadcast news data. We describe our development work in moving from laboratory read speech data to real-world or `found' speech data in preparation for the ARPA Nov96, Nov97 and Nov98 evaluations. Two main problems needed to be addressed to deal with the continuous flow of inhomogeneous data. These concern the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers).
The multichannel Wall Street Journal audio–visual corpus (MC-WSJ-AV): Specification and initial experiments
in IEEE Autom. Speech Recognition Understanding Workshop (ASRU), 2005
Cited by 26 (2 self)
The recognition of speech in meetings poses a number of challenges to current Automatic Speech Recognition (ASR) techniques. Meetings typically take place in rooms with non-ideal acoustic conditions and significant background noise, and may contain large sections of overlapping speech. In such circumstances, headset microphones have to date provided the best recognition performance; however, participants are often reluctant to wear them. Microphone arrays provide an alternative to close-talking microphones by providing speech enhancement through directional discrimination. Unfortunately, however, development of array front-end systems for state-of-the-art large vocabulary continuous speech recognition suffers from a lack of necessary resources, as most available speech corpora consist only of single-channel recordings. This paper describes the collection of an audio-visual corpus of read speech from a number of instrumented meeting rooms. The corpus, based on the WSJCAM0 database, is suitable for use in continuous speech recognition experiments and is captured using a variety of microphones, including arrays, as well as close-up and wider angle cameras. The paper also describes some initial ASR experiments on the corpus comparing the use of close-talking microphones with both a fixed and a blind array beamforming technique.
Time-First Search For Large Vocabulary Speech Recognition
1998
Cited by 21 (4 self)
This paper describes a new search technique for large vocabulary speech recognition based on a stack decoder. Considerable memory savings are achieved with the combination of a tree-based lexicon and a new search technique. The search proceeds time-first, that is, partial path hypotheses are extended into the future in the inner loop and a tree walk over the lexicon is performed as an outer loop. Partial word hypotheses are grouped based on language model state. The stack maintains information about groups of hypotheses and whole groups are extended by one word to form new stack entries. An implementation is described of a one-pass decoder employing a 65,000 word lexicon and a disk-based trigram language model. Real time operation is achieved with a small search error, a search space of about 5 Mbyte and a total memory usage of about 35 Mbyte. 1. INTRODUCTION Search is an interesting problem in the field of large vocabulary speech recognition. Typically the acoustic vectors correspondi...
Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches
2000
Cited by 15 (1 self)
This paper proposes a novel combined compound splitting and phrase recombination method that optimizes the composition of the speech recognition lexicon for a given domain. Data-driven compound word splitting is followed by iterative recombination of high-frequency combinations. Language model perplexity and size are the criteria used to identify a balance between compound decomposition, which reduces OOV, and lexical unit recombination, which packs additional context into a fixed-size vocabulary. The method provides a basis for lexicon design for an LVCSR system on the domain of German parliamentary speeches that is to be used as the foundation of a spoken document information retrieval system. The approach achieves a 35% reduction in OOV without a prohibitively large sacrifice in recognition performance. 1. INTRODUCTION The convention of adopting the orthographic word as the basic unit in the LVCSR lexicon is not suited to handling so-called compounding languages like German, Dutch,...
Evaluation Methodologies for Interactive Speech Systems
In First International Conference on Language Resources and Evaluation (LREC), 1998
Cited by 9 (0 self)
In this paper, several criteria and paradigms are described to measure the performance of spoken language systems. The focus is on the evaluation of natural language understanding components. These evaluations are carried out in the domain of spontaneous human-human interaction as supported by automatic translation systems. They are also applied in the domain of spontaneous human-machine interaction typically used in information retrieval applications. Some system response evaluation paradigms for different applications and domains are discussed in more detail. It is also shown that official performance tests and site-specific evaluation paradigms are complementary in use. 1. Introduction This paper describes and discusses methods and paradigms measuring the performance of a spoken language system for different applications and domains and at different stages of the input processing. The focus is on the evaluation of natural language understanding components. A diagram of a generic spo...
Bilingual and Dialectal Adaptation and Retraining
IN: PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING (ICSLP-1998), VOL.15, 30 NOVEMBER-4, 1999
Cited by 4 (1 self)
In this paper, we report our investigations on the use of adaptation and retraining in our bilingual (Italian, German) and multidialectal recognition system. Our approach for bilingual speech recognition is to treat the two languages as one, which is best suited for a task where Italian and German natives speak both languages, resulting in a variety of accents and dialects. We performed adaptation on single speakers and speaker groups built from combinations of spoken and native language. Furthermore, we performed retraining on partitions of the adaptation or training data. Our experiments led to an error rate reduction in all cases: compared to the baseline system, we achieved an overall improvement of 14, 12-14 and 7% for speaker adaptation, speaker group adaptation and retraining, respectively. Furthermore, we found, among other things, that performance is rather stable for Italian between adaptation and retraining, while adaptation for German outperforms retraining by far.
Leeuwen, “N-best: The Northern- and Southern-Dutch benchmark evaluation of speech recognition technology”
in Interspeech, 2007
Cited by 4 (2 self)
In this paper, we describe N-best 2008, the first Large Vocabulary Speech Recognition (LVCSR) benchmark evaluation held for the Dutch language. Both the accent as spoken in the Netherlands (Northern-Dutch) and in Belgium (Southern-Dutch or Flemish) will be evaluated. The evaluation tasks are broadcast news (BN) and conversational telephone speech (CTS). The N-best evaluation will take place in the spring of 2008 and is open to all research institutes and industries on a voluntary basis. The goal of this first N-best evaluation is to define, set up and conduct a Dutch LVCSR benchmark evaluation. In this paper, we will describe the state of the art of Dutch LVCSR, recognition problems that are typical for the Dutch language, and the evaluation protocol. Index Terms: Northern- and Southern-Dutch, large vocabulary speech recognition, benchmark test, evaluation, conversational telephone speech, broadcast news.
A Hybrid Approach To Compounds In LVCSR
In Proc. International Conference on Spoken Language Processing, volume I, 2002
Cited by 3 (2 self)
In several languages compound words form orthographic units, which complicates the task of ensuring good lexical coverage for large vocabulary continuous speech recognition (LVCSR). A common approach to the problem consists of first recognizing the compound constituents, followed by an automatic recompounding process. We describe an accurate compound module, which combines a rule-based approach with statistical pruning. The module is incorporated in a broadcast news recognition task for Dutch and yields an 11% relative decrease in word error rate (WER).
Evaluation and usability of multimodal spoken language dialogue systems
IN: SPEECH COMMUNICATION, 2004
Cited by 3 (0 self)
With the technical advances and market growth in the field, the issues of evaluation and usability of spoken language dialogue systems, unimodal as well as multimodal, are as crucial as ever. This paper discusses those issues by reviewing a series of European and US projects which have produced major results on evaluation and usability. Whereas significant progress has been made on unimodal spoken language dialogue systems evaluation and usability, the emergence of, among others, multimodal, mobile, and domain-oriented systems continues to pose entirely new challenges to research in evaluation and usability.
Speech Processing for Communications: What's New?
MULTITEL ASBL, 1 Copernic Ave, Initialis Scientific Park, B-7000 MONS (**) Faculté Polytechnique de Mons, TCTS Lab, 1 Copernic Ave, Initialis Scientific Park, B-7000, 2001
Cited by 1 (0 self)
Speech is one of the most complex signals an engineer has to handle. It is thus not surprising that its automatic processing has only recently found a wide market. In this paper we analyze the latest developments in speech coding, synthesis and recognition, and show why they were necessary for commercial maturity. Synthesis based on automatic unit selection, robust recognition systems, and mixed excitation coders are among the topics discussed here. Introduction Speech, which is one of the most complex signals an engineer has to handle (although we would need another article to support this claim), is also the easiest way of communication between humans. This is not a paradox: as opposed to telecommunication signals, speech was not invented by engineers. It was there much before them. If engineers had been given the task of designing speech, they sure would not have made it the way it is (chances are we would speak sinusoids, possibly with the help of attached bio-electronic devices...

