Graphical models and automatic speech recognition
Mathematical Foundations of Speech and Language Processing, 2003
Cited by 49 (10 self)
Graphical models provide a promising paradigm to study both existing and novel techniques for automatic speech recognition. This paper first provides a brief overview of graphical models and their uses as statistical models. It is then shown that the statistical assumptions behind many pattern recognition techniques commonly used as part of a speech recognition system can be described by a graph – this includes Gaussian distributions, mixture models, decision trees, factor analysis, principal component analysis, linear discriminant analysis, and hidden Markov models. Moreover, this paper shows that many advanced models for speech recognition and language processing can also be simply described by a graph, including many at the acoustic-, pronunciation-, and language-modeling levels. A number of speech recognition techniques born directly out of the graphical-models paradigm are also surveyed. Additionally, this paper includes a novel graphical analysis regarding why derivative (or delta) features improve hidden Markov model-based speech recognition by improving structural discriminability. It also includes an example where a graph can be used to represent language model smoothing constraints. As will be seen, the space of models describable by a graph is quite large. A thorough exploration of this space should yield techniques that ultimately will supersede the hidden Markov model.
SCANMail: Browsing and Searching Speech Data by Content
2001
Cited by 24 (5 self)
Increasing amounts of public, corporate, and private audio data are available for use, but limited in usefulness by the lack of tools to permit their browsing and search. In this paper, we describe SCANMail, a system that employs automatic speech recognition, information retrieval, information extraction, and human-computer interaction technology to permit users to browse and search their voicemail messages by content through a graphical user interface. The SCANMail client also provides note-taking capabilities as well as browsing and querying features. A CallerId server also proposes caller names from existing caller acoustic models and is trained from user feedback. An Email server sends the original message plus its transcription to a mailing address specified in the user's profile.
VTLN-Based Cross-Language Voice Conversion
In Proc. of the ASRU’03, Virgin Islands, 2003
Cited by 13 (3 self)
In speech recognition, vocal tract length normalization (VTLN) is a well-studied technique for speaker normalization. As cross-language voice conversion aims at the transformation of a source speaker’s voice into that of a target speaker using a different language, we want to investigate whether VTLN is an appropriate method to adapt the voice characteristics. After applying several conventional VTLN warping functions, we extend the conventional piece-wise linear function to several segments, allowing a more detailed warping of the source spectrum. Experiments on cross-language voice conversion are performed on three corpora of two languages and both speaker genders.
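For illustration, the conventional two-segment piece-wise linear warping mentioned in the abstract can be sketched as follows. The function name, the breakpoint `omega0`, and its default value are assumptions made for this sketch and are not taken from the paper; the multi-segment extension the authors describe is not reproduced here.

```python
import math

def piecewise_linear_warp(omega, alpha, omega0=0.875 * math.pi):
    """Warp a normalized frequency omega in [0, pi] by the factor alpha.

    Two-segment form commonly used for VTLN (an assumption, not the paper's
    exact definition): scale by alpha below the breakpoint omega0, then
    follow a straight line that ends at (pi, pi) so bandwidth is preserved.
    """
    if not 0.0 <= omega <= math.pi:
        raise ValueError("omega must lie in [0, pi]")
    if omega <= omega0:
        return alpha * omega
    # Second segment joins (omega0, alpha * omega0) to (pi, pi).
    slope = (math.pi - alpha * omega0) / (math.pi - omega0)
    return alpha * omega0 + slope * (omega - omega0)
```

With `alpha = 1.0` the function reduces to the identity, which is a quick sanity check on any warping of this kind.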
Vocal Tract Length Normalization for Large Vocabulary Continuous Speech Recognition
CMU Computer Science Technical Reports, 1997
Cited by 10 (0 self)
Generally speaking, the speaker-dependence of a speech recognition system stems from speaker-dependent speech features. The variation of vocal tract length and/or shape is one of the major sources of inter-speaker variation. In this paper, we address several methods of vocal tract length normalization (VTLN) for large vocabulary continuous speech recognition: (1) exploring bilinear warping VTLN in the frequency domain; (2) proposing a speaker-specific Bark/Mel scale VTLN in the Bark/Mel domain; (3) investigating adaptation of the normalization factor. Our experimental results show that the speaker-specific Bark/Mel scale VTLN is better than the piecewise/bilinear warping VTLN in the frequency domain. It can reduce the word error rate by up to 12% on our Spanish and English spontaneous speech scheduling task database. For adaptation of the normalization factor, our experimental results show that promising results can be obtained by using not more than three utterances from a new speaker to estimate his/her n...
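As a rough illustration of bilinear frequency warping, the sketch below uses the standard first-order all-pass form. The abstract does not give the formula, so this form, the function name, and the parameter name `alpha` are assumptions rather than the report's own definition.

```python
import math

def bilinear_warp(omega, alpha):
    """Bilinear (first-order all-pass) warping of omega in [0, pi].

    alpha is the warping factor with |alpha| < 1; alpha = 0 leaves the
    frequency axis unchanged. Assumed standard form, not the report's.
    """
    if abs(alpha) >= 1.0:
        raise ValueError("alpha must satisfy |alpha| < 1")
    return omega + 2.0 * math.atan2(
        alpha * math.sin(omega), 1.0 - alpha * math.cos(omega)
    )
```

The endpoints 0 and pi map to themselves for any valid `alpha`, so the warping changes only how the interior of the band is stretched or compressed.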
Speech Recognition System Design Based on Automatically Derived Units
1999
Cited by 10 (0 self)
In most speech recognition systems today, acoustic modeling and lexical modeling are viewed as separable problems. Currently the most popular approach is to manually define canonical word pronunciations in terms of phonetic units and let the acoustic models capture differences between actual spoken and canonical pronunciations implicitly with Gaussian mixture models. As a result, these models can be very broad, particularly for casual spontaneous speech. An alternative approach, explored in this thesis, is to learn a unit inventory and pronunciation dictionary from training data using a maximum likelihood objective function. In particular,
Statistical Modelling in Continuous Speech Recognition (CSR)
In Conference on Uncertainty in Artificial Intelligence, 2001
Cited by 7 (1 self)
Automatic continuous speech recognition (CSR) is sufficiently
The Use of Speaker Correlation Information for Automatic Speech Recognition
1998
Cited by 7 (3 self)
This dissertation addresses the independence of observations assumption which is typically made by today's automatic speech recognition systems. This assumption ignores within-speaker correlations which are known to exist. The assumption clearly damages the recognition ability of standard speaker independent systems, as can be seen by the severe drop in performance exhibited by systems between their speaker dependent mode and their speaker independent mode. The typical solution to this problem is to apply speaker adaptation to the models of the speaker independent system. This approach is examined in this thesis with the explicit goal of improving the rapid adaptation capabilities of the system by incorporating within-speaker correlation information into the adaptation process. This is achieved through the creation of an adaptation technique called reference speaker weighting and the development of a speaker clustering technique called speaker cluster weighting. However, speaker adaptation is just one way in which the independence assumption can be attacked. This dissertation also introduces a novel speech recognition technique called consistency modeling. This technique utilizes a priori knowledge about the within-speaker correlations which exist between different phonetic events for the purpose of incorporating speaker constraint into a speech recognition system without explicitly applying speaker adaptation. These new techniques are implemented within a segment-based speech recognition system and evaluation results are reported on the DARPA Resource Management recognition task.
Automatic Transcription of Voicemail at AT&T
In International Conference on Acoustics, Speech, and Signal Processing, 2001
Cited by 6 (4 self)
This paper reports on the automatic transcription accuracy of voicemail messages. It shows that vocal tract length normalization and adaptation using linear transformations, proven to improve accuracy on the Switchboard task, provide similar accuracy improvements on this task. Direct application of the normalization techniques is complicated by the fragmentation of the data. However, unsupervised clustering was found to be effective in ensuring robust estimation of normalization parameters. Variance adaptation resulted in larger accuracy improvements than adaptation of only mean parameters, probably due to a large variability in channel conditions. The use of semi-tied covariances provides additional gains over using speaker and channel normalization. The combined gain of using various compensation techniques improves the system word error rate from 34.9% for the baseline system to 28.7%.
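To put the reported figures in perspective, the short calculation below separates the absolute from the relative word error rate reduction. The helper name `wer_reduction` is introduced here for illustration only; the two input values are the ones quoted in the abstract.

```python
def wer_reduction(baseline, improved):
    """Return (absolute, relative) WER reduction.

    absolute is in percentage points; relative is a percentage of the
    baseline word error rate.
    """
    absolute = baseline - improved
    relative = 100.0 * absolute / baseline
    return absolute, relative

absolute, relative = wer_reduction(34.9, 28.7)
print(f"absolute: {absolute:.1f} points, relative: {relative:.1f}%")
# prints "absolute: 6.2 points, relative: 17.8%"
```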
Rapid Unsupervised Adaptation to Children's Speech on a Connected-Digit Task
1996
Cited by 4 (0 self)
We are exploring ways in which to rapidly adapt our neural network classifiers to new speakers and conditions using very small amounts of speech, say, one or a few words. Our approach is to perform a speaker-dependent warping of the frequency scale by selecting a Bark offset for each speaker. We choose the offset for a speaker to be the one that maximizes our recognizer output score on the adaptation utterance. We then use the speaker's offset during evaluation of all other utterances by the speaker. To test our approach, we evaluate an adult-speech trained recognizer on children's speech from the same task both before and after adaptation to each child's voice. Using only a single digit for adaptation, we have reduced the word error rate for children's speech from 9.6% to 4.2%. Using a seven-digit utterance further reduced the error rate to 3.5%.
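The offset selection described in this abstract amounts to a search over candidate offsets for the one with the highest recognizer score. A minimal sketch follows, assuming a hypothetical callable `score_fn` that stands in for warping the adaptation utterance and running the recognizer; neither that interface nor the candidate grid comes from the paper.

```python
def select_offset(score_fn, candidate_offsets):
    """Return the candidate Bark offset with the highest recognizer score.

    score_fn is a hypothetical callable: given an offset, it should warp
    the adaptation utterance, run the recognizer, and return its output
    score. It is a placeholder for the real system.
    """
    candidates = list(candidate_offsets)
    if not candidates:
        raise ValueError("need at least one candidate offset")
    return max(candidates, key=score_fn)

# Toy stand-in score that peaks at an offset of 1.0 Bark (illustration only).
def toy_score(offset):
    return -(offset - 1.0) ** 2

best = select_offset(toy_score, [-2.0, -1.0, 0.0, 1.0, 2.0])
```

Once chosen, the same offset would be reused for all other utterances from that speaker, as the abstract states.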
Rapid Speaker Adaptation for Neural Network Speech Recognizers
1997
Cited by 3 (0 self)
1 Introduction
1.1 Thesis Outline
2 Speech Recognition with Neural Networks
2.1 The Speech Recognition Problem
2.2 Hybrid Systems
2.2.1 Architecture
2.2.2 Evaluation and Training
3 Review of Adaptation Literature
3.1 Speaker Adaptation/Normalization
3.1.1 Speaker Categorization Approaches
3.1.2 Data/Feature Transformation Approaches
...

