Speaker, Environment And Channel Change Detection And Clustering Via The Bayesian Information Criterion
- 1998
Cited by 153 (2 self)
In this paper, we are interested in detecting changes in speaker identity, environmental condition and channel condition; we call this the problem of acoustic change detection. The input audio stream can be modeled as a Gaussian process in the cepstral space. We present a maximum likelihood approach to detect turns of a Gaussian process; the decision of a turn is based on the Bayesian Information Criterion (BIC), a model selection criterion well-known in the statistics literature. The BIC criterion can also be applied as a termination criterion in hierarchical methods for clustering of audio segments: two nodes can be merged only if the merging increases the BIC value. Our experiments on the Hub4 1996 and 1997 evaluation data show that our segmentation algorithm can successfully detect acoustic changes; our clustering algorithm can produce clusters with high purity, leading to improvements in accuracy through unsupervised adaptation as much as the ideal clustering by the true speaker i...
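As a rough illustration of the BIC-based turn detection described in this abstract, the Python sketch below scores each candidate split of a feature sequence by comparing a single Gaussian against two Gaussians, one per side of the split. It is a toy under stated assumptions, not the authors' implementation: it uses diagonal covariances to stay dependency-free, a penalty weight `lam`, and the hypothetical function names `delta_bic` and `detect_change`.

```python
import math

def _log_det_diag(frames):
    """Log-determinant of the ML diagonal covariance of a list of feature vectors."""
    n, d = len(frames), len(frames[0])
    total = 0.0
    for j in range(d):
        col = [f[j] for f in frames]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        total += math.log(max(var, 1e-12))  # floor avoids log(0) on constant data
    return total

def delta_bic(frames, i, lam=1.0):
    """BIC gain of modelling frames[:i] and frames[i:] separately (assumed form)."""
    n, d = len(frames), len(frames[0])
    gain = 0.5 * (n * _log_det_diag(frames)
                  - i * _log_det_diag(frames[:i])
                  - (n - i) * _log_det_diag(frames[i:]))
    # Two diagonal Gaussians have 2*d more parameters than one (d means, d variances).
    penalty = 0.5 * lam * (2 * d) * math.log(n)
    return gain - penalty

def detect_change(frames, margin=10, lam=1.0):
    """Return (best split index or None, best score); a turn is declared only if score > 0."""
    best_i, best_score = None, float("-inf")
    for i in range(margin, len(frames) - margin + 1):
        score = delta_bic(frames, i, lam)
        if score > best_score:
            best_i, best_score = i, score
    if best_score > 0:
        return best_i, best_score
    return None, best_score
```

The `margin` argument keeps both segments long enough for a stable variance estimate; a full-covariance version would replace the per-dimension variances with a covariance determinant and adjust the parameter count accordingly.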
Maximum Likelihood Modeling With Gaussian Distributions For Classification
- Proceedings of ICASSP, 1998
Cited by 81 (26 self)
Maximum Likelihood (ML) modeling of multiclass data for classification often suffers from the following problems: a) data insufficiency implying overtrained or unreliable models, b) large storage requirement, c) large computational requirement, and/or d) ML is not discriminating between classes. Sharing parameters across classes (or constraining the parameters) clearly tends to alleviate the first three problems. In this paper we show that in some cases it can also lead to better discrimination (as evidenced by reduced misclassification error). The parameters considered are the means and variances of the Gaussians and linear transformations of the feature space (or equivalently the Gaussian means). Some constraints on the parameters are shown to lead to Linear Discrimination Analysis (a well-known result) while others are shown to lead to optimal feature spaces (a relatively new result). Applications of some of these ideas to the speech recognition problem are also given.
Speech Recognition in Noisy Environments
- Ph.D. Dissertation, ECE Department, CMU, 1996
Cited by 72 (3 self)
Acknowledgments
Chapter 1 Introduction
1.1. Thesis goals
1.2. Dissertation Outline
Chapter 2 The SPHINX-II Recognition System
2.1. An Overview of the SPHINX-II System
2.1.1. Signal Processing
2.1.2. Hidden Markov Models
2.1.3. Recognition Unit
2.1.4. Training
2.1.5. Recognition
2.2. Experimental Tasks and Corpora ...
ESTIMATING CONFIDENCE USING WORD LATTICES
Cited by 52 (3 self)
For many practical applications of speech recognition systems, it is desirable to have an estimate of confidence for each hypothesized word, i.e. to have an estimate of which words of the speech recognizer's output are likely to be correct and which are not reliable. Many of today's speech recognition systems use word lattices as a compact representation of a set of alternative hypotheses. We exploit the use of such word lattices as information sources for the measure-of-confidence tagger JANKA [1]. In experiments on spontaneous human-to-human speech data the use of word lattice related information significantly improves the tagging accuracy.
Learning bounds for domain adaptation
- In Advances in Neural Information Processing Systems, 2008
Cited by 41 (6 self)
Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to a different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization.
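The convex combination of source and target empirical risk mentioned in this abstract can be made concrete with a small Python sketch. Everything here is an illustrative assumption rather than the paper's setup: the 0-1 loss, the one-dimensional threshold classifiers, the weight name `alpha`, and the function names are all hypothetical.

```python
def empirical_risk(h, data):
    """Fraction of labelled examples (x, y) that classifier h gets wrong (0-1 loss)."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

def alpha_risk(h, source, target, alpha):
    """Convex combination: alpha * target risk + (1 - alpha) * source risk."""
    return alpha * empirical_risk(h, target) + (1 - alpha) * empirical_risk(h, source)

def best_threshold(source, target, alpha, thresholds):
    """Pick the threshold classifier x >= t that minimises the combined risk."""
    def make(t):
        return lambda x: 1 if x >= t else 0
    return min(thresholds, key=lambda t: alpha_risk(make(t), source, target, alpha))

# Toy data: the source domain favours a lower decision threshold than the target.
source = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)]
target = [(2.0, 0), (3.0, 1)]
```

With `alpha = 0` only the plentiful source data is used, with `alpha = 1` only the scarce target data; intermediate values trade one off against the other, which is the trade-off the bounds in the abstract are said to model.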
Confidence Measures For Spontaneous Speech Recognition
- in Proc. ICASSP, 1997
Cited by 33 (1 self)
For many practical applications of speech recognition systems, it is desirable to have an estimate of confidence for each hypothesized word, i.e. to have an estimate of which words of the output of the speech recognizer are likely to be correct and which are not reliable. We describe the development of the measure of confidence tagger JANKA, which is able to provide confidence information for the words in the output of the speech recognizer JANUS-3-SR. On a spontaneous German human-to-human database, JANKA achieves a tagging accuracy of 90% at a baseline word accuracy of 82%.
1. INTRODUCTION
Current speech recognition systems are far from perfect. Unfortunately, the number and location of the errors in their output are usually unknown. This information, however, could be used in a number of applications. Examples of such applications are word selection for unsupervised adaptation schemes like MLLR [1], automatic weighting of additional, non-speech knowledge sources like lip-reading, or ai...
Supervised and unsupervised PCFG adaptation to novel domains
- 2003
Cited by 29 (0 self)
This paper investigates adapting a lexicalized probabilistic context-free grammar (PCFG) to a novel domain, using maximum a posteriori (MAP) estimation. The MAP framework is general enough to include some previous model adaptation approaches, such as corpus mixing in Gildea (2001), for example. Other approaches falling within this framework are more effective. In contrast to the results
Unsupervised Training Of A Speech Recognizer: Recent Experiments
- in Proc. EUROSPEECH
Cited by 28 (0 self)
Current speech recognition systems require large amounts of transcribed data for parameter estimation. The transcription, however, is tedious and expensive. In this work we describe our experiments which are aimed at training a speech recognizer with only a minimal amount (30 minutes) of transcriptions and a large portion (50 hours) of untranscribed data. A recognizer is bootstrapped on the transcribed part of the data and initial transcripts are generated with it for the remainder (the untranscribed part). Using a lattice-based confidence measure, the recognition errors are (partially) detected and the remainder of the hypotheses is used for training. Using this scheme, the word error rate on a broadcast news speech recognition task dropped from more than 32.0% to 21.4%. In a cheating experiment we show that this performance cannot be significantly improved by improving the measure of confidence. By combining the system trained without supervision with our currently best recognizer which ...
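The bootstrapping scheme in this abstract (train on the transcribed part, decode the rest, discard likely errors, retrain) can be outlined as one round of confidence-filtered self-training. The Python sketch below is only a schematic under assumptions: `train` and `decode` are hypothetical callables supplied by the caller, `decode` is assumed to return `(word, confidence)` pairs, and the fixed `threshold` stands in for the lattice-based confidence measure, whose details the abstract does not give.

```python
def self_training_round(train, decode, transcribed, untranscribed, threshold):
    """One round: bootstrap a model, auto-transcribe, keep confident words, retrain.

    train(data) -> model, where data is a list of (utterance, word list) pairs.
    decode(model, utterance) -> list of (word, confidence) pairs.
    """
    model = train(transcribed)                      # bootstrap on the small transcribed set
    automatic = []
    for utterance in untranscribed:
        hypothesis = decode(model, utterance)
        kept = [word for word, conf in hypothesis if conf >= threshold]
        if kept:                                    # skip utterances with nothing reliable
            automatic.append((utterance, kept))
    return train(transcribed + automatic), automatic
```

Returning the automatically labelled data alongside the retrained model makes it easy to inspect how much material survives a given threshold.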
Speaker Clustering And Transformation For Speaker Adaptation In Large-Vocabulary Speech Recognition Systems
- IEEE TRANS. SPEECH AND SIGNAL PROCESSING, 1995
Cited by 27 (3 self)
A speaker adaptation strategy is described that is based on finding a subset of speakers, from the training set, who are acoustically close to the test speaker, and using only the data from these speakers (rather than the complete training corpus) to re-estimate the system parameters. Further, a linear transformation is computed for every one of the selected training speakers to better map the training speaker's data to the test speaker's acoustic space. Finally, the system parameters (Gaussian means) are re-estimated specifically for the test speaker using the transformed data from the selected training speakers. Experiments showed that this scheme is capable of reducing the error rate by 10-15% with the use of as little as 3 sentences of adaptation data.
Recent advances in spontaneous speech recognition and understanding
- In Proc. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003
Cited by 26 (10 self)
How to recognize and understand spontaneous speech is one of the most important issues in state-of-the-art speech recognition technology. In this context, a five-year large-scale national project entitled “Spontaneous Speech: Corpus and Processing Technology” started in Japan in 1999. This paper gives an overview of the project and reports on the major results of experiments that have been conducted so far at Tokyo Institute of Technology, including spontaneous presentation speech recognition, automatic speech summarization, and message-driven speech recognition. The paper also discusses the most important research problems to be solved in order to achieve ultimate spontaneous speech recognition systems.

