Lexical Modeling Of Non-Native Speech For Automatic Speech Recognition
, 2000
Cited by 24 (1 self)
This paper examines the recognition of non-native speech in JUPITER, a speaker-independent, spontaneous-speech conversational system. Because the non-native speech in this domain is limited and varied, speaker- and accent-specific methods are impractical. We therefore chose to model all of the non-native data with a single model. In particular, this paper describes an attempt to better model non-native lexical patterns. These patterns are incorporated by applying context-independent phonetic confusion rules, whose probabilities are estimated from training data. Using this approach, the word error rate on a non-native test set is reduced from 20.9% to 18.8%.
1. INTRODUCTION
Speech recognition accuracy has been observed to be drastically lower for non-native speakers of the target language than for native speakers [3, 13, 14]. Research on both non-native accent modeling and dialect-specific modeling shows that large gains in performance can be achieved when the acoustics [1, 9, 14] and ...
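As a rough illustration of the confusion-rule idea in the abstract above, the sketch below estimates context-independent confusion probabilities by relative frequency. The input format (pairs of canonical and surface phone labels), the function names, and the toy data are assumptions made for illustration; the abstract does not say how the paper aligns its data or smooths its estimates. A small helper also computes the relative word error rate reduction implied by the reported 20.9% and 18.8% figures.

```python
from collections import Counter, defaultdict


def estimate_confusion_probs(aligned_pairs):
    """Estimate P(surface phone | canonical phone) by relative frequency.

    aligned_pairs is an iterable of (canonical, surface) phone labels.
    This input format is hypothetical; it stands in for whatever
    alignment of dictionary pronunciations and transcriptions is used.
    """
    counts = defaultdict(Counter)
    for canonical, surface in aligned_pairs:
        counts[canonical][surface] += 1

    probs = {}
    for canonical, surface_counts in counts.items():
        total = sum(surface_counts.values())
        probs[canonical] = {s: c / total for s, c in surface_counts.items()}
    return probs


def relative_reduction(before, after):
    """Relative error reduction, e.g. for word error rates in percent."""
    return (before - after) / before


# Toy aligned data (invented for the example, not from the paper).
pairs = [
    ("th", "th"), ("th", "t"), ("th", "s"), ("th", "th"),
    ("ih", "iy"), ("ih", "ih"),
]
confusion = estimate_confusion_probs(pairs)
wer_gain = relative_reduction(20.9, 18.8)  # roughly 0.10, i.e. about 10% relative
```

Each canonical phone gets its own distribution over surface phones, so the rules ignore neighbouring phones, which is what "context-independent" means here.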
Adaptation methods for non-native speech
- in Proceedings of Multilinguality in Spoken Language Processing
, 2001
Cited by 13 (0 self)
LVCSR performance is consistently poor on low-proficiency non-native speech. While gains from speaker adaptation can often bring recognizer performance on high-proficiency non-native speakers close to that seen for native speakers [12], recognition for lower-proficiency speakers remains low even after individual speaker adaptation [2]. The challenge for accent adaptation is to maximize recognizer performance without collecting large amounts of acoustic data for each native-language/target-language pair. In this paper, we focus on adaptation for lower-proficiency speakers, exploring how acoustic data from up to 15 adaptation speakers can be put to its most effective use.
A Comparison Of Novel Techniques For Instantaneous Speaker Adaptation
- in Proceedings of the European Conference on Speech Communication and Technology
, 1997
Cited by 11 (1 self)
This paper introduces two novel techniques for instantaneous speaker adaptation: reference speaker weighting and consistency modeling. An approach to hierarchical speaker clustering using gender and speaking rate as the clustering criteria is also presented. All three methods attempt to utilize the underlying within-speaker correlations that are present between the acoustic realizations of different phones. By accounting for these correlations, a limited amount of adaptation data can be used to adapt every phonetic acoustic model, including those for phones which have not been observed in the adaptation data. In instantaneous adaptation experiments using the DARPA Resource Management corpus, a reduction in word error rate of 20% has been achieved using a combination of these new techniques.
INTRODUCTION
Speaker adaptation can be viewed as the task of altering the acoustic models of a speech recognition system to match, as closely as possible, the current speaker. Reliable...
Speech Recognition in Mobile Environments
, 2000
Cited by 10 (0 self)
The growth of cellular telephony, combined with recent advances in speech recognition technology, creates sizeable opportunities for mobile speech recognition applications. Classic robustness techniques previously proposed for speech recognition only partly offset the degradation introduced by idiosyncrasies of mobile networks. These sources of degradation include distortion introduced by the speech codec as well as artifacts arising from channel errors and discontinuous transmission. In this thesis we focus on characterizing the distortion introduced to the speech signal by the speech codec, and we propose methods for reducing the detrimental effect of coding on recognition accuracy. The initial focus of this thesis is on the full rate GSM codec (FRGSM). We propose a method to generate recognition features directly from codec parameters. It is shown in this work that by selectively constructing a cepstral feature vector from the GSM codec para...
Some Results on Search Complexity vs Accuracy
- in DARPA Speech Recognition Workshop
, 1997
Cited by 8 (4 self)
This paper presents three different techniques applied in or developed during the 1996 Hub-4 broadcast news transcription task. First, an efficient shortest path graph search algorithm is applied to the word lattice created by Viterbi search, producing a globally optimal result. This reduces the word error rate by about 3-10% (relative), depending on the test set. The execution time is at or close to real time for most utterances. Second, a segmented N-best list generation algorithm is described for producing compact N-best lists for very long utterances. Finally, a temporal smoothing technique is compared to deleted interpolation. On one test set, temporal smoothing reduces the error rate by 3% for an 8% increase in search cost, while deleted interpolation reduces it by 6% for a 50% increase in search cost.
1. Introduction
In this paper we describe the results of a number of search experiments on the 1996 Hub-4 development and evaluation test sets. We have also attempted to document issues that a...
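To make the first technique in the abstract above more concrete, here is a minimal sketch of a shortest path search over a small word lattice. It uses Dijkstra's algorithm as a stand-in, since the abstract does not name the specific algorithm. The lattice layout (a dict of outgoing edges), the interpretation of edge costs as non-negative negative-log scores, and the toy words are all assumptions made for illustration.

```python
import heapq


def best_path(lattice, start, end):
    """Return (total_cost, words) for the cheapest start-to-end path.

    lattice maps a node id to a list of (next_node, word, cost) edges.
    Costs are assumed to be non-negative (for example, negative log
    scores), which is what Dijkstra's algorithm requires.
    """
    heap = [(0.0, start, [])]
    settled = set()
    while heap:
        cost, node, words = heapq.heappop(heap)
        if node == end:
            return cost, words
        if node in settled:
            continue  # a cheaper path to this node was already expanded
        settled.add(node)
        for nxt, word, edge_cost in lattice.get(node, []):
            if nxt not in settled:
                heapq.heappush(heap, (cost + edge_cost, nxt, words + [word]))
    return None  # end node is unreachable


# Toy lattice (invented for the example): nodes are integers 0..3.
toy_lattice = {
    0: [(1, "the", 1.0), (1, "a", 1.5)],
    1: [(2, "whether", 2.5), (2, "weather", 1.0)],
    2: [(3, "today", 0.5)],
}
total_cost, hypothesis = best_path(toy_lattice, 0, 3)
```

Because every path through the lattice is considered implicitly, the returned hypothesis is the best one under the lattice scores, rather than whichever path a pruned first pass happened to keep.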
The Use of Speaker Correlation Information for Automatic Speech Recognition
, 1998
Cited by 7 (3 self)
This dissertation addresses the independence of observations assumption which is typically made by today's automatic speech recognition systems. This assumption ignores within-speaker correlations which are known to exist. The assumption clearly damages the recognition ability of standard speaker-independent systems, as can be seen by the severe drop in performance exhibited by systems between their speaker-dependent mode and their speaker-independent mode. The typical solution to this problem is to apply speaker adaptation to the models of the speaker-independent system. This approach is examined in this thesis with the explicit goal of improving the rapid adaptation capabilities of the system by incorporating within-speaker correlation information into the adaptation process. This is achieved through the creation of an adaptation technique called reference speaker weighting and the development of a speaker clustering technique called speaker cluster weighting. However, speaker adaptation is just one way in which the independence assumption can be attacked. This dissertation also introduces a novel speech recognition technique called consistency modeling. This technique utilizes a priori knowledge about the within-speaker correlations which exist between different phonetic events for the purpose of incorporating speaker constraint into a speech recognition system without explicitly applying speaker adaptation. These new techniques are implemented within a segment-based speech recognition system, and evaluation results are reported on the DARPA Resource Management recognition task.
Recognizing Non-Native Speech: Characterizing and Adapting to Non-Native Usage in LVCSR
, 2001
Cited by 6 (1 self)
Low-proficiency non-native speakers represent a significant challenge for large-vocabulary continuous speech recognition (LVCSR). Acoustic models are confused by a heavy accent
Context-Dependent Modeling in a Segment-Based Speech Recognition System
- S.M. thesis, MIT
, 1997
Cited by 5 (0 self)
The goal of this thesis is to explore various strategies for incorporating contextual information into a segment-based speech recognition system, while maintaining computational costs at a level acceptable for implementation in a real-time system. The latter is achieved by using context-independent models in the search, while context-dependent models are reserved for re-scoring the hypotheses proposed by the context-independent system. Within this framework, several types of context-dependent sub-word units were evaluated, including word-dependent, biphone, and triphone units. In each case, deleted interpolation was used to compensate for the lack of training data for the models. Other types of context-dependent modeling, such as context-dependent boundary modeling and "offset" modeling, were also used successfully in the re-scoring pass. The evaluation of the system was performed using the Resource Management task. Context-dependent segment models were able to reduce the error rate of t...
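The role of deleted interpolation in the abstract above can be sketched as follows. This is a simplified illustration, not the thesis's procedure: it shows only the core step of estimating one interpolation weight on held-out data with EM, whereas full deleted interpolation rotates the held-out part across the training set. The function names and the toy probabilities are assumptions.

```python
def estimate_lambda(heldout, iterations=20, lam=0.5):
    """Estimate the weight of the context-dependent model with EM.

    heldout is a list of (p_cd, p_ci) pairs: the probability each
    held-out sample receives from the context-dependent and the
    context-independent model (hypothetical inputs, both positive).
    """
    for _ in range(iterations):
        # E-step: posterior that each sample came from the CD model.
        posteriors = [
            lam * p_cd / (lam * p_cd + (1.0 - lam) * p_ci)
            for p_cd, p_ci in heldout
        ]
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam


def interpolate(p_cd, p_ci, lam):
    """Smoothed estimate backing off toward the context-independent model."""
    return lam * p_cd + (1.0 - lam) * p_ci


# Toy held-out probabilities (invented for the example).
heldout_samples = [(0.6, 0.2), (0.5, 0.3), (0.1, 0.4)]
weight = estimate_lambda(heldout_samples)
smoothed = interpolate(0.6, 0.2, weight)
```

A context-dependent model that generalizes poorly to held-out data ends up with a small weight, so the smoothed estimate leans on the better-trained context-independent model.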
Dynamically Configurable Acoustic Models For Speech Recognition
- Proc. of ICASSP'98, Seattle
, 1998
Cited by 3 (0 self)
Senones were introduced to share hidden Markov model (HMM) parameters at a sub-phonetic level in [3], and decision trees were incorporated to predict unseen phonetic contexts in [4]. In this paper, we describe two applications of the senonic decision tree: (1) dynamically downsizing a speech recognition system for small platforms and (2) sharing the Gaussian covariances of continuous density HMMs (CHMMs). We experimented with how to balance different parameters to obtain the best trade-off between recognition accuracy and system size. The dynamically downsized system, without retraining, performed even better than the regular Baum-Welch [1] trained system. The shared covariance model performed as well as the unshared full model and thus gave us the freedom to increase the number of Gaussian means to increase the accuracy of the model. Combining the downsizing and covariance sharing algorithms, a total error reduction of 8% was achieved over the Baum-Welch trained ...

