Results 1 - 10
of
40
Discriminative Training and Maximum Entropy Models for Statistical Machine Translation
, 2002
"... We present a framework for statistical machine translation of natural languages based on direct maximum entropy models, which contains the widely used source -channel approach as a special case. All knowledge sources are treated as feature functions, which depend on the source language senten ..."
Abstract
-
Cited by 256 (13 self)
- Add to MetaCart
We present a framework for statistical machine translation of natural languages based on direct maximum entropy models, which contains the widely used source -channel approach as a special case. All knowledge sources are treated as feature functions, which depend on the source language sentence, the target language sentence and possible hidden variables.
Recent advances in the automatic recognition of audio-visual speech
- PROC. IEEE
, 2003
"... Visual speech information from the speaker’s mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human computer interface. In this paper, we review the main components of audio-visual automatic speech r ..."
Abstract
-
Cited by 64 (10 self)
- Add to MetaCart
Visual speech information from the speaker’s mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: First, the visual front end design, based on a cascade of linear image transforms of an appropriate video region-of-interest, and subsequently, audio-visual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audio-visual speech asynchrony, and incorporating modality reliability estimates to the bimodal recognition process. We also briefly touch upon the issue of audio-visual adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, though less so for visually challenging environments and large vocabulary tasks.
Audio-visual automatic speech recognition: An overview
- Issues in Visual and Audio-visual Speech Processing
, 2004
"... We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, ASR performance has yet to reach the level required for speech to become a truly per ..."
Abstract
-
Cited by 41 (0 self)
- Add to MetaCart
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, ASR performance has yet to reach the level required for speech to become a truly pervasive user interface. Indeed, even in “clean ” acoustic environments, and for a variety of tasks, state of the art ASR system
The SRI March 2000 Hub-5 conversational speech transcription system
- In Proceedings of the NIST Speech Transcription Workshop
, 2000
"... We describe SRI’s large vocabulary conversational speech recognition system as used in the March 2000 NIST Hub-5E evaluation. The system performs four recognition passes: (1) bigram recognition with phone-loop-adapted, within-word triphone acoustic models, (2) lattice generation with transcription-m ..."
Abstract
-
Cited by 26 (6 self)
- Add to MetaCart
We describe SRI’s large vocabulary conversational speech recognition system as used in the March 2000 NIST Hub-5E evaluation. The system performs four recognition passes: (1) bigram recognition with phone-loop-adapted, within-word triphone acoustic models, (2) lattice generation with transcription-mode-adapted models, (3) trigram lattice recognition with adapted cross-word triphone models, and (4) N-best rescoring and reranking with various additional knowledge sources. The system incorporates two new kinds of acoustic model: triphone models conditioned on speaking rate, and an explicit joint model of within-word phone durations. We also obtained an unusually large improvement from modeling crossword pronunciation variants in “multiword ” vocabulary items. The language model (LM) was enhanced with an “anti-LM ” representing acoustically confusable word sequences. Finally, we applied a generalized ROVER algorithm to combine the N-best hypotheses from several systems based on different acoustic models. 1.
Towards language independent acoustic modeling
- In Proc. ICASSP
, 2000
"... We describe procedures and experimental results using speech from diverse source languages to build an ASR system for a single target language. This work is intended to improve ASR in languages for which large amounts of training data are not available. We have developed both knowledge based and aut ..."
Abstract
-
Cited by 22 (8 self)
- Add to MetaCart
We describe procedures and experimental results using speech from diverse source languages to build an ASR system for a single target language. This work is intended to improve ASR in languages for which large amounts of training data are not available. We have developed both knowledge based and automatic methods to map phonetic units from the source languages to the target language. We employed HMM adaptation techniques and Discriminative Model Combination to combine acoustic models from the individual source languages for recognition of speech in the target language. Experiments are described in which Czech Broadcast News is transcribed using acoustic models trained from small amounts of Czech read speech augmented by English, Spanish, Russian, and Mandarin acoustic models.
Large-Vocabulary Audio-Visual Speech Recognition by Machines and Humans
- of the Johns Hopkins Summer 2000 Workshop,” in Proc. Works. Signal Processing
, 2001
"... We compare automatic recognition with human perception of audio-visual speech, in the large-vocabulary, continuous speech recognition (LVCSR) domain. Specifically, we study the benefit of the visual modality for both machines and humans, when combined with audio degraded by speech-babble noise at va ..."
Abstract
-
Cited by 20 (2 self)
- Add to MetaCart
We compare automatic recognition with human perception of audio-visual speech, in the large-vocabulary, continuous speech recognition (LVCSR) domain. Specifically, we study the benefit of the visual modality for both machines and humans, when combined with audio degraded by speech-babble noise at various signal-to-noise ratios (SNRs). We first consider an automatic speechreading system with a pixel based visual front end that uses feature fusion for bimodal integration, and we compare its performance with an audio-only LVCSR system. We then describe results of human speech perception experiments, where subjects are asked to transcribe audio-only and audiovisual utterances at various SNRs. For both machines and humans, we observe approximately a 6 dB effective SNR gain compared to the audio-only performance at 10 dB, however such gains significantly diverge at other SNRs. Furthermore, automatic audio-visual recognition outperforms human audioonly speech perception at low SNRs. 1.
Weighting Schemes For Audio-Visual Fusion In Speech Recognition
, 2001
"... In this work we demonstrate an improvement in the state-of-theart large vocabulary continuous speech recognition (LVCSR) performance, under clean and noisy conditions, by the use of visual information, in addition to the traditional audio one. We take a decision fusion approach for the audio-visual ..."
Abstract
-
Cited by 19 (4 self)
- Add to MetaCart
In this work we demonstrate an improvement in the state-of-theart large vocabulary continuous speech recognition (LVCSR) performance, under clean and noisy conditions, by the use of visual information, in addition to the traditional audio one. We take a decision fusion approach for the audio-visual information, where the single-modality (audio- and visual- only) HMM classifiers are combined to recognize audio-visual speech. More specifically, we tackle the problem of estimating the appropriate combination weights for each of the modalities. Two different techniques are described: The first uses an automatically extracted estimate of the audio stream reliability in order to modify the weights for each modality (both clean and noisy audio results are reported), while the second is a discriminative model combination approach where weights on pre-defined model classes are optimized to minimize WER (clean audio only results).
Improved language modeling for statistical machine translation
, 2005
"... Statistical machine translation systems use a combination of one or more translation models and a language model. While there is a significant body of research addressing the improvement of translation models, the problem of optimizing language models for a specific translation task has not received ..."
Abstract
-
Cited by 12 (2 self)
- Add to MetaCart
Statistical machine translation systems use a combination of one or more translation models and a language model. While there is a significant body of research addressing the improvement of translation models, the problem of optimizing language models for a specific translation task has not received much attention. Typically, standard word trigram models are used as an out-of-the-box component in a statistical machine translation system. In this paper we apply language modeling techniques that have proved beneficial in automatic speech recognition to the ACL05 machine translation shared data task and demonstrate improvements over a baseline system with a standard language model. 1
Progress In Automatic Meeting Transcription
- in Proceedings of the EUROSPEECH
, 1999
"... In this paper we report recent developments on the meeting transcription task, a large vocabulary conversational speech recognition task. Previous experiments showed this is a very challenging task, with about 50% word error rate (WER) using existing recognizers. The difficulty mostly comes from hig ..."
Abstract
-
Cited by 11 (6 self)
- Add to MetaCart
In this paper we report recent developments on the meeting transcription task, a large vocabulary conversational speech recognition task. Previous experiments showed this is a very challenging task, with about 50% word error rate (WER) using existing recognizers. The difficulty mostly comes from highly disfluent/conversational nature of meetings, and lack of domain specific training data. For the first problem, our SWB(Switchboard) system --- a conversational telephone speech recognizer --- was used to recognize wide-band meeting data; for the latter, we leveraged the large amount of Broadcast News (BN) data to build a robust system. This paper will especially focus on two experiments in the BN system development: model combination and HMM topology/duration modeling. Model combination can be done at various stages of recognition: post-processing schemes such as ROVER can lead to significant improvements; to reduce computation we tried model combination at acoustic score level. We will ...
Integrating multilingual articulatory features into speech recognition
- in Proc. Eurospeech
, 2003
"... The use of articulatory features, such as place and manner of articulation, has been shown to reduce the word error rate of speech recognition systems under different conditions and in different settings. For example recognition systems based on features are more robust to noise and reverberation. I ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
The use of articulatory features, such as place and manner of articulation, has been shown to reduce the word error rate of speech recognition systems under different conditions and in different settings. For example recognition systems based on features are more robust to noise and reverberation. In earlier work we showed that articulatory features can compensate for inter language variability and can be recognized across languages. In this paper we show that using cross- and multilingual detectors to support an HMM based speech recognition system significantly reduces the word error rate. By selecting and weighting the features in a discriminative way, we achieve an error rate reduction that lies in the same range as that seen when using language specific feature detectors. By combining feature detectors from many languages and training the weights discriminatively, we even outperform the case where only monolingual detectors are being used. 1.

