Results 1 - 10 of 32
ESTIMATING CONFIDENCE USING WORD LATTICES
Cited by 52 (3 self)
For many practical applications of speech recognition systems, it is desirable to have an estimate of confidence for each hypothesized word, i.e. to have an estimate of which words of the speech recognizer's output are likely to be correct and which are not reliable. Many of today's speech recognition systems use word lattices as a compact representation of a set of alternative hypotheses. We exploit the use of such word lattices as information sources for the measure-of-confidence tagger JANKA [1]. In experiments on spontaneous human-to-human speech data the use of word lattice related information significantly improves the tagging accuracy.
Language Independent and Language Adaptive Acoustic Modeling for Speech Recognition
Speech Communication, 2001
Cited by 51 (26 self)
With the distribution of speech technology products all over the world, the portability to new target languages becomes a practical concern. As a consequence our research focuses on the question of how to port LVCSR systems in a fast and efficient way. More specifically we want to estimate acoustic models for a new target language using speech data from varied source languages, but only limited data from the target language. For this purpose we introduce different methods for multilingual acoustic model combination and a polyphone decision tree specialization procedure. Recognition results using language dependent, independent and language adaptive acoustic models are presented and discussed in the framework of our GlobalPhone project which investigates LVCSR systems in 15 languages.
Multilingual Speech Recognition
2000
Cited by 43 (2 self)
The speech-to-speech translation system Verbmobil requires a multilingual setting. This consists of recognition engines in the three languages German, English and Japanese that run in one common framework together with a language identification component which is able to switch between these recognizers. This article describes the challenges of multilingual speech recognition and presents different solutions to the problem of the automatic language identification task. The combination of the described components results in a flexible and user-friendly multilingual spoken dialog system.
Error-responsive feedback mechanisms for speech recognizers
1997
Cited by 37 (4 self)
This thesis is about modeling, analyzing, and predicting errorful behavior in large vocabulary continuous speech recognition systems. Because today's state-of-the-art recognizers are not designed to be situated naturally in an error feedback loop, they are ill-positioned for inclusion in multi-modal interfaces, multi-media databases, and other interesting applications. I make improvements to the current approach to predicting and analyzing error behaviors, which is currently based only on the measurement of word error rate. The speech recognizer's functionality is extended to include confidence annotations, which are "meta-level" markings that indicate how certain the recognizer is that it has decoded its input correctly. This is accomplished by feeding externally defined error conditions back to the recognizer. Error feedback enables the construction of statistical models that map measurements of the recognizer's internal states and behaviors to externally defined error conditions.
Speaker Normalization Based On Frequency Warping
1997
Cited by 27 (4 self)
In speech recognition, speaker-dependence of a speech recognition system comes from speaker-dependence of the speech feature, and the variation of vocal tract shape is the major source of inter-speaker variations of the speech feature, though there are some other sources which also contribute. In this paper, we address approaches to speaker normalization which aim at normalizing the speaker's vocal tract length based on Frequency WarPing (FWP). The FWP is implemented in the front-end preprocessing of our speech recognition system. We investigate the formant-based and ML-based FWP in linear and nonlinear warping modes, and compare them in detail. All experimental results are based on our JANUS3 large vocabulary continuous speech recognition system and the Spanish Spontaneous Scheduling Task database (SSST). 1. INTRODUCTION In speech recognition, we are mainly facing three major challenges: (1) speaker-dependence of the speech signal, which leads to speaker-dependence of the speech rec...
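As a rough illustration of what a frequency warp in the front end can look like, the sketch below implements one common piecewise-linear variant. It is not taken from the paper: the function name `warp_frequency`, the warp factor `alpha`, the 8 kHz Nyquist frequency, and the breakpoint ratio of 0.875 are all assumptions chosen for the example, and the paper's own linear and nonlinear modes may differ.

```python
def warp_frequency(freq_hz, alpha, nyquist_hz=8000.0, breakpoint_ratio=0.875):
    """Piecewise-linear frequency warp (illustrative assumption, not the paper's method).

    Frequencies below a breakpoint are scaled by alpha; above it, a second
    linear segment is used so that the Nyquist frequency maps onto itself.
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    if not 0.0 <= freq_hz <= nyquist_hz:
        raise ValueError("freq_hz must lie in [0, nyquist_hz]")

    # Pick the breakpoint so that alpha * breakpoint stays below Nyquist.
    breakpoint_hz = breakpoint_ratio * nyquist_hz * min(1.0, 1.0 / alpha)

    if freq_hz <= breakpoint_hz:
        return alpha * freq_hz

    # Second segment maps [breakpoint, Nyquist] onto [alpha * breakpoint, Nyquist].
    slope = (nyquist_hz - alpha * breakpoint_hz) / (nyquist_hz - breakpoint_hz)
    return alpha * breakpoint_hz + slope * (freq_hz - breakpoint_hz)
```

With `alpha = 1.0` the function is the identity; values above or below 1.0 stretch or compress the lower part of the spectrum while keeping the band edges fixed.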
Word And Acoustic Confidence Annotation For Large Vocabulary Speech Recognition
Cited by 23 (0 self)
We present improvements in confidence annotation of automatic speech recognizer output for large vocabulary, speaker-independent systems. Several strong additions to the set of predictor variables used for this purpose are discussed. Extensions which allow prediction of separate types of errors, as opposed to the simple presence of an error, are presented. A new development, acoustic confidence annotation, is explored, in which a predictor is built that indicates the likely successes and failures of the acoustic models alone. Four separate learning mechanisms are compared in terms of their ability to provide good confidence annotations from the same set of predictor variables. Performance figures are reported on both read news (the North American Business news corpus) and conversational telephone speech (the Switchboard corpus), both in American English. The Sphinx-II system [1] is used for the NAB tests. The Janus system [2] is used for the Switchboard tests. 1. Annotation of Read Spe...
Fast Bootstrapping Of LVCSR Systems With Multilingual Phoneme Sets
1997
Cited by 18 (7 self)
In this paper we describe an efficient method to bootstrap continuously spoken, large vocabulary speech recognition systems by multilingual phoneme sets. To evaluate this technique we collected the multilingual database GlobalPhone which currently consists of 9 different languages. A multilingual recognizer (MULTI) based on the four languages German, English, Japanese and Spanish was developed to serve as a source system. Likewise this system is very useful for language identification and achieves a 100% language identification rate. Based on the MULTI system we evaluated our bootstrap technique on such completely different languages as Chinese, Croatian, and Turkish. 1. INTRODUCTION As the demand for speech recognition and translation systems in multiple languages grows, the development of multilingual systems is of increasing concern. On the one hand a multilingual system can be used as a language independent speech recognition and translation system with integrated automatic langu...
Implementation and evaluation of a constraint-based multimodal fusion system for speech and 3d pointing gestures
2004
Cited by 16 (3 self)
This paper presents an architecture for fusion of multimodal input streams for natural interaction with a humanoid robot as well as results from a user study with our system. The presented fusion architecture consists of an application independent parser of input events, and application specific rules. In the presented user study, people could interact with a robot in a kitchen scenario, using speech and gesture input. In the study, we could observe that our fusion approach is very tolerant of falsely detected pointing gestures. This is because we use speech as the main modality and pointing gestures mainly for disambiguation of objects. In the paper we also report on the temporal correlation of speech and gesture events as observed in the user study.
Improved Methods For Vocal Tract Normalization
In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1999
Cited by 16 (6 self)
This paper presents improved methods for vocal tract normalization (VTN) along with experimental tests on three databases. We propose a new method for VTN in training: by using acoustic models with single Gaussian densities per state for selecting the normalization scales, it is avoided that the models learn the normalization scales of the training speakers. We show that using single Gaussian densities for selecting the normalization scales in training results in lower error rates than using mixture densities. For VTN in recognition, we propose an improvement of the well-known multiple-pass strategy: by using an unnormalized acoustic model for the first recognition pass instead of a normalized model, lower error rates are obtained. In recognition tests, this method is compared with a fast variant of VTN. The multiple-pass strategy is an efficient method but it is suboptimal because the normalization scale and the word sequence are determined sequentially. We found that for telephon...
Grapheme Based Speech Recognition
In Proceedings of the EUROSPEECH, 2003
Cited by 15 (4 self)
Large vocabulary speech recognition systems traditionally represent words in terms of subword units, usually phonemes. This paper investigates the potential of graphemes acting as subunits. In order to develop context dependent grapheme based speech recognizers several decision tree based clustering procedures are performed and compared to each other. Grapheme based speech recognizers in three languages - English, German, and Spanish - are trained and compared to their phoneme based counterparts. The results show that for languages with a close grapheme-to-phoneme relation, grapheme based modeling is as good as the phoneme based one. Furthermore, multilingual grapheme based recognizers are designed to investigate whether grapheme based information can be successfully shared among languages. Finally, some bootstrapping experiments for Swedish were performed to test the potential for rapid language deployment.
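To make the idea of graphemes as subword units concrete, the sketch below builds a pronunciation dictionary in which each word is simply spelled out letter by letter, so no hand-written phoneme lexicon is needed. The helper names `grapheme_pronunciation` and `build_grapheme_lexicon` are hypothetical, and the choices to lowercase words and drop non-letter characters are assumptions for the example, not details taken from the paper.

```python
def grapheme_pronunciation(word):
    """Return the word's letters as its unit sequence (assumed normalization:
    lowercase, non-letter characters such as hyphens or digits dropped)."""
    return [ch for ch in word.lower() if ch.isalpha()]


def build_grapheme_lexicon(words):
    """Map each word to its grapheme sequence, as a stand-in for a
    phoneme-based pronunciation dictionary."""
    return {word: grapheme_pronunciation(word) for word in words}


if __name__ == "__main__":
    lexicon = build_grapheme_lexicon(["Haus", "speech", "e-mail"])
    for word, units in lexicon.items():
        print(word, " ".join(units))
```

A context-dependent recognizer would then cluster these grapheme units with decision trees in the same way phoneme units are clustered; that step is not shown here.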

