Results 1 - 8 of 8
Moving Beyond the 'Beads-On-A-String' Model of Speech
In Proc. IEEE ASRU Workshop, 1999
Cited by 48 (0 self)
The notion that a word is composed of a sequence of phone segments, sometimes referred to as 'beads on a string', has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives -- automatically derived subword units and linguistically motivated distinctive feature systems -- and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies. 1. INTRODUCTION It has often been noted that automatic speech recognition performance is much worse on spontaneous speech than on carefully planned or r...
Sequential Noise Estimation With Optimal Forgetting For Robust Speech Recognition
2001
Cited by 8 (2 self)
Mismatch is known to degrade the performance of speech recognition systems. In real-life applications mismatch is usually nonstationary, and a general way to compensate for slowly time-varying mismatch is by using sequential algorithms with forgetting. The forgetting factor is usually chosen empirically on some development data, and no optimality criterion is used. In this paper we introduce a framework for obtaining an optimal forgetting factor. The proposed method is applied in conjunction with a sequential noise estimation algorithm, but can be extended to sequential bias or affine transformation estimation. Speech recognition experiments, conducted first under a controlled scenario on the 5K Wall Street Journal task corrupted by different noise types and then under a real-life scenario on speech recorded in a noisy car environment, validate the proposed method.
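As a rough illustration of what "sequential estimation with forgetting" means, the sketch below tracks a slowly drifting value with a fixed exponential forgetting factor. It is not the algorithm of the paper above, which derives an optimal forgetting factor; the function name `sequential_mean`, its parameters, and the default value 0.95 are all hypothetical choices made for this example.

```python
def sequential_mean(frames, forgetting=0.95, initial=0.0):
    """Track a slowly varying mean with exponential forgetting.

    Assumption for illustration only: `forgetting` is a fixed constant in
    [0, 1]; the paper's contribution is choosing it optimally instead.
    """
    estimate = initial
    history = []
    for x in frames:
        # The previous estimate is down-weighted by the forgetting factor,
        # and the new observation receives the remaining weight.
        estimate = forgetting * estimate + (1.0 - forgetting) * x
        history.append(estimate)
    return history
```

A factor close to 1 gives a smooth but slow-moving estimate, while a smaller factor adapts faster at the cost of more variance, which is why the choice is usually tuned on development data.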
Integrating Dynamic Speech Modalities Into Context Decision Trees
Proc. ICASSP, 2000
Cited by 6 (0 self)
Context decision trees are widely used in the speech recognition community. Besides questions about phonetic classes of a phone's context, questions about their position within a word [Lee88] and questions about the gender of the current speaker [RC99] have been used so far. In this paper we additionally incorporate questions about current modalities of the spoken utterance, such as the speaker's dialect, the speaking rate, and the signal-to-noise ratio, the latter two of which may change while speaking one utterance. We present a framework that treats all these modalities in a uniform way. Experiments with the Janus speech recognizer have produced error rate reductions of up to 10% when compared to systems that do not use modality questions. 1. INTRODUCTION 1.1. Context Decision Trees in Janus As described in [FR97] and [Rog97], Janus uses decision trees [Ode92] to assign acoustic models to polyphone segments. The base algorithm of the decoder is described in [Wos98]. Like many other decoder...
Training Of Across-Word Phoneme Models For Large Vocabulary Continuous Speech Recognition
Proc. ICASSP, 2002
Cited by 4 (0 self)
Today's speech recognition systems use across-word context dependent phoneme models to capture coarticulation across word boundaries. While there are several publications about the organization of across-word model search, there are hardly any descriptions about the training of across-word models.
Acoustic Model Clustering Based on Syllable Structure
2002
Cited by 2 (0 self)
Current speech recognition systems perform poorly on conversational speech as compared to read speech, arguably due to the large acoustic variability inherent in conversational speech. Our hypothesis is that there are systematic effects in local context, associated with syllabic structure, that are not being captured in the current acoustic models. Such variation may be modeled using a broader definition of context than in traditional systems which restrict context to be the neighboring phonemes. In this paper, we study the use of word- and syllable-level context conditioning in recognizing conversational speech. We describe a method to extend standard tree-based clustering to incorporate a large number of features, and we report results on the Switchboard task which indicate that syllable structure outperforms pentaphones and incurs less computational cost. It has been hypothesized that previous work in using syllable models for recognition of English was limited because of ignoring the phenomenon of re-syllabification (change of syllable structure at word boundaries), but our analysis shows that accounting for re-syllabification does not impact recognition performance.
The State Based Mixture of Expert HMM with Applications to the Recognition of Spontaneous Speech
2001
Cited by 1 (0 self)
Dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy. Although the performance of speech recognition systems has increased substantially over the last decades, there still remain a number of tasks which pose considerable problems for current state-of-the-art techniques. One of these tasks is the recognition of spontaneous speech which differs from read or planned speech in that its underlying dynamics change frequently over time. The negative effect of changes in acoustic background condition on recognition performance can also be observed in other situations as, for instance, in the case of speech that is corrupted by non-stationary noise. This thesis is concerned with the development of an acoustic model for speech recognition which automatically detects changes in the background condition of a signal and compensates for the model-data mismatch by combining the information of several expert models. These experts are specialised on the different acoustic conditions under consideration and their influence on the recognition process is determined by how well their associated condition matches
Clustering Wide-Contexts and HMM Topologies for Spontaneous Speech Recognition
2001
In most speech recognition systems today, all the acoustic variation associated with a phoneme is characterized in terms of the identity of its neighboring phonemes. The neighbors influence only the state observation density of a fixed Hidden Markov Model. Other sources of variation are captured implicitly by using Gaussian mixture models for the state observations. Consequently, these models can be very broad, particularly for casual spontaneous speech. In this thesis, we explore conditioning of phonemes on higher level linguistic structure, specifically syllable- and word-level structure to learn models for phonemes that are more specific to the context, reporting experimental results on a large vocabulary (35k words) conversational speech task (Switchboard). In particular, this thesis makes three main contributions related to wide context conditioning. First, we demonstrate that syllable- and word-level structure can be incorporated into current acoustic models to improve recognition accuracy over triphones. For a fixed number of parameters, these models are computationally more efficient than pentaphones, both in training and in testing. In addition, use of syllable and word features leads to a small but significant improvement in performance. The wide-contexts used in our acoustic model can implicitly capture re-syllabification effects to a certain extent. However, we find that explicitly modeling re-syllabification does not improve recognition further, because there are only a small number of phones that exhibit acoustic difference after re-syllabification. The second contribution addresses the difficulties that arise when a large number of additional conditioning features are used. As the number of conditioning features increases, the training cost can increase exponentially. 
Moreover, a large fraction of the training labels tends to have too few examples to have reliable statistics associated with them, and this could potentially cause decision trees to learn bad clusters. A new method has been developed for clustering with multiple stages, where each stage clusters a different subset of features, and also has a choice of using the partitions learned in the previous stages. Apart from reducing the risk of unreliable statistics, it is designed to ameliorate the data fragmentation problem and is computationally less expensive. This method was successfully demonstrated with pentaphones, resulting in equivalent performance at a lower cost. Finally, a new algorithm is described to design context-specific HMMs. The idea is to model reduction of a phone for certain contexts, and to learn a more constrained topology. Using contextual information, the algorithm clusters HMM paths where each path has a different number of states. An HMM distance measure has been formulated to prune out the paths which are similar. During decoding, the paths are allocated dynamically for each sub-word unit according to their context. We investigated this algorithm to model phone topologies, finding improved characterization of speech given known word sequences but no significant improvement in word error rate.
A Discriminative Splitting Criterion for Phonetic Decision Trees
Phonetic decision trees are a key concept in acoustic modeling for large vocabulary continuous speech recognition. Although discriminative training has become a major line of research in speech recognition and all state-of-the-art acoustic models are trained discriminatively, the conventional phonetic decision tree approach still relies on the maximum likelihood principle. In this paper we develop a splitting criterion based on the minimization of the classification error. An improvement of more than 10 % relative over a discriminatively trained baseline system on the Wall Street Journal corpus suggests that the proposed approach is promising. Index Terms: discriminative training, phonetic decision trees, state tying, new paradigms 1.
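For context, the conventional maximum likelihood criterion that this abstract contrasts with can be sketched as follows. This is only an illustration of the baseline idea, not the paper's discriminative criterion: it assumes one-dimensional features and a single Gaussian per node, and the names `gaussian_loglik` and `split_gain` and the variance floor are hypothetical choices for this example.

```python
import math


def gaussian_loglik(values, var_floor=1e-6):
    """Log-likelihood of `values` under their own ML-fitted 1-D Gaussian.

    Assumption for illustration: a single Gaussian per tree node and a
    small variance floor to avoid log(0) on degenerate nodes.
    """
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, var_floor)
    # With ML parameters the log-likelihood reduces to this closed form.
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def split_gain(yes_values, no_values):
    """Likelihood gain from splitting a node by a yes/no context question."""
    parent = yes_values + no_values
    return (gaussian_loglik(yes_values)
            + gaussian_loglik(no_values)
            - gaussian_loglik(parent))
```

In a tree-growing loop, the question with the largest gain would typically be chosen at each node; a discriminative criterion replaces this likelihood gain with a measure tied to classification error.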

