Probabilistic Segmentation for Segment-Based Speech Recognition
, 1998
Cited by 19 (3 self)
Segment-based speech recognition systems must explicitly hypothesize segment start and end times. The purpose of a segmentation algorithm is to hypothesize those times and to compose a graph of segments from them. During recognition, this graph is an input to a search that finds the optimal sequence of sound units through the graph. The goal of this thesis is to create a high-quality, real-time phonetic segmentation algorithm for segment-based speech recognition. A high-quality segmentation algorithm produces a sparse network of segments that contains most of the actual segments in the speech utterance. A real-time algorithm implies that it is fast, and that it is able to produce an output in a pipelined manner. The approach taken in this thesis is to adopt the framework of a state-of-the-art algorithm that does not operate in real-time, and to make the modifications necessary to enable it to run in real-time. The algorithm adopted as the starting point for this work makes use of a for...
Segmentation and modeling in segment-based recognition
- In Proc. Eurospeech '97
, 1997
Cited by 16 (4 self)
Recently, we have developed a probabilistic framework for segment-based speech recognition that represents the speech signal as a network of segments and associated feature vectors [2]. Although in general, each path through the network does not traverse all segments, we argued that each path must account for all feature vectors in the network. We then demonstrated an efficient search algorithm that uses a single additional model to account for segments that are not traversed. In this paper, we present two new extensions to our framework. First, we replace our acoustic segmentation algorithm with “segmentation by recognition,” a probabilistic algorithm that can combine multiple contextual constraints towards hypothesizing only the most likely segments. Second, we generalize our framework to “near-miss modeling” and describe a search algorithm that can efficiently use multiple models to enforce contextual constraints across all segments in a network. We report experiments in phonetic recognition on the TIMIT corpus in which we achieve a diphone context-dependent error rate of 26.6% on the NIST core test set over 39 classes. This is a 12.8% reduction in error rate from our best previously reported result.
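The two figures quoted in this abstract can be related with a line of arithmetic. The sketch below assumes the 12.8% figure is a relative reduction (the usual convention, but not stated explicitly in the abstract); the function name is hypothetical and the implied baseline is a derived value, not a number reported here.

```python
def baseline_from_relative_reduction(new_rate, relative_reduction):
    """Return the baseline error rate implied by a relative reduction.

    Assumes new_rate = baseline * (1 - relative_reduction).
    """
    return new_rate / (1.0 - relative_reduction)


new_rate = 26.6      # percent, as quoted in the abstract
reduction = 0.128    # assumed to be a relative reduction
baseline = baseline_from_relative_reduction(new_rate, reduction)
print(f"implied baseline error rate: {baseline:.1f}%")
```

Under that assumption the earlier result would sit near 30.5%; if the reduction were absolute instead, the calculation would not apply.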
Near-Miss Modeling: A Segment-Based Approach to Speech Recognition
, 1998
Cited by 15 (0 self)
Currently, most approaches to speech recognition are frame-based in that they represent speech as a temporal sequence of feature vectors. Although these approaches have been successful, they cannot easily incorporate complex modeling strategies that may further improve speech recognition performance. In contrast, segment-based approaches represent speech as a temporal graph of feature vectors and facilitate the incorporation of a wide range of modeling strategies. However, difficulties in segment-based recognition have impeded the realization of potential advantages in modeling. This thesis
Automatic Acquisition of Language Models for Speech Recognition
, 1994
Cited by 14 (3 self)
This thesis focuses on the automatic acquisition of language structure and the subsequent use of the learned language structure to improve the performance of a speech recognition system. First, we develop a grammar inference process which is able to learn a grammar describing a large set of training sentences. The process of acquiring this grammar is one of generalization so that the resulting grammar predicts likely sentences beyond those contained in the training set. From the grammar we construct a novel probabilistic language model called the phrase class n-gram model (pcng), which is a natural generalization of the word class n-gram model [11] to phrase classes. This model utilizes the grammar in such a way that it maintains full coverage of any test set while at the same time reducing the complexity, or number of parameters, of the resulting predictive model. Positive results are shown in terms of perplexity of the acquired phrase class n-gram models and in terms of reduction of ...
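As background for the word class n-gram model that this abstract generalizes, here is a minimal sketch of a class bigram, which factors P(w | w_prev) as P(class(w) | class(w_prev)) * P(w | class(w)). The toy sentences, the class map, and the function names are illustrative assumptions; the phrase class model from the thesis itself is not reproduced here.

```python
from collections import Counter


def train_class_bigram(sentences, word_to_class):
    """Return prob(word, prev_word) for a maximum-likelihood class bigram.

    Assumes every word belongs to exactly one class (hypothetical setup).
    """
    class_bigrams = Counter()   # counts of (previous class, current class)
    class_history = Counter()   # counts of a class appearing as history
    word_counts = Counter()
    class_counts = Counter()

    for sentence in sentences:
        classes = [word_to_class[w] for w in sentence]
        for w, c in zip(sentence, classes):
            word_counts[w] += 1
            class_counts[c] += 1
        for prev, cur in zip(classes, classes[1:]):
            class_bigrams[(prev, cur)] += 1
            class_history[prev] += 1

    def prob(word, prev_word):
        c, pc = word_to_class[word], word_to_class[prev_word]
        if class_history[pc] == 0:
            return 0.0
        p_class = class_bigrams[(pc, c)] / class_history[pc]
        p_word = word_counts[word] / class_counts[c]
        return p_class * p_word

    return prob


sentences = [
    ["fly", "to", "boston"],
    ["fly", "to", "denver"],
    ["drive", "to", "boston"],
]
word_to_class = {
    "fly": "VERB", "drive": "VERB", "to": "TO",
    "boston": "CITY", "denver": "CITY",
}
prob = train_class_bigram(sentences, word_to_class)
p_denver = prob("denver", "to")   # P(CITY | TO) * P(denver | CITY)
```

Because transition counts are shared across a class, the model needs far fewer parameters than a word bigram, which is the complexity reduction the abstract refers to.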
On Supervised Learning From Sequential Data With Applications For Speech Recognition
, 1999
Cited by 12 (1 self)
visualization of the problem to model human speech. A large number of example sequences of observation vectors (shown connected as continuous trajectories) depending on a given sequence of class labels, with each class representing for example a phoneme (here the name Keiko with given durations). In this synthetic example, the one-dimensional target data would be represented poorly by a uni-modal Gaussian distribution with a constant variance (which corresponds to using the squared-error objective function), which would average the two separate branches, indicated by the fat lines as the mean and constant variance of the single Gaussian. Compare this figure with Figure 3.10, Figure 3.11 and Figure 3.12 to see a subsequent improvement of the model.
Automatic Continuous Speech Recognition with Rapid Speaker Adaptation for Human/Machine Interaction
, 1997
Cited by 7 (0 self)
This thesis presents work in three main directions of the automatic speech recognition field. The work within two of these -- dynamic decoding and hybrid HMM/ANN speech recognition -- has resulted in a real-time speech recognition system, currently in use in the human/machine dialogue demonstration system WAXHOLM, developed at the department. The third direction is fast unsupervised speaker adaptation, where "fast" refers to adaptation with a small amount of adaptation speech. The work in
Context-Dependent Modeling in a Segment-Based Speech Recognition System
- S.M. thesis, MIT
, 1997
Cited by 5 (0 self)
The goal of this thesis is to explore various strategies for incorporating contextual information into a segment-based speech recognition system, while maintaining computational costs at a level acceptable for implementation in a real-time system. The latter is achieved by using context-independent models in the search, while context-dependent models are reserved for re-scoring the hypotheses proposed by the context-independent system. Within this framework, several types of context-dependent sub-word units were evaluated, including word-dependent, biphone, and triphone units. In each case, deleted interpolation was used to compensate for the lack of training data for the models. Other types of context-dependent modeling, such as context-dependent boundary modeling and "offset" modeling, were also used successfully in the re-scoring pass. The evaluation of the system was performed using the Resource Management task. Context-dependent segment models were able to reduce the error rate of t...
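The deleted interpolation mentioned in this abstract can be illustrated in general terms: a sparsely trained context-dependent estimate is mixed with a better-trained general one, and the mixing weight is fit by EM on data held out from training. The sketch below is a generic single-weight version under those assumptions; the function names and toy probabilities are hypothetical and are not taken from the thesis.

```python
import math


def estimate_lambda(held_out, iterations=50, lam=0.5):
    """EM estimate of the weight placed on the context-dependent model.

    held_out: list of (p_specific, p_general) pairs, one per held-out token,
    where each pair gives the probability of that token under each model.
    Assumes at least one of the two probabilities is nonzero for every token.
    """
    for _ in range(iterations):
        # E-step: posterior that each token came from the specific model.
        posteriors = [
            lam * ps / (lam * ps + (1.0 - lam) * pg)
            for ps, pg in held_out
        ]
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam


def interpolate(p_specific, p_general, lam):
    """Linearly interpolated probability."""
    return lam * p_specific + (1.0 - lam) * p_general


def log_likelihood(held_out, lam):
    """Held-out log-likelihood of the interpolated model."""
    return sum(math.log(interpolate(ps, pg, lam)) for ps, pg in held_out)


# Toy held-out probabilities (illustrative values only).
held_out = [(0.6, 0.2), (0.1, 0.3), (0.5, 0.4)]
lam = estimate_lambda(held_out)
```

In the full procedure the training data is split into parts and each part is "deleted" in turn to serve as the held-out set; the sketch shows only the weight estimation for one such split.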
Explicit N-Best Formant Features for Segment-Based Speech Recognition
, 1996
This thesis investigates the use of explicit speech knowledge in computer speech recognition. Speech knowledge is generally expressed in terms of acoustic events occurring near phonetic segment boundaries and the location, shape and dynamics of formant trajectories. This suggests the creation of a segment-based recognition framework and the use of explicit formant features in a flexible integration scheme to ultimately improve the phonetic recognition accuracy. We describe a segmentation algorithm that produces a lattice of segment hypotheses, each with an associated broad phonetic identity. We build a single phonetic segment classifier along with separate vowel/semi-vowel and consonant classifiers based on traditional cepstral features, paying attention to reducing the mismatch between training and deployment conditions. We develop a robust, N-best formant tracking algorithm that generates a list of up to N consistent formant interpretations. The use of the N-best feature paradigm is based on the observation that there are generally only a handful of reasonable interpretations of the given formant information. Instead of finding the best formant interpretation through the use of a global cost function that includes energy maximization and smoothness terms, we delay the selection of the correct formant interpretation until after the segment classification and phonetic search. We use the formant interpretations to extract features for a vowel/semi-vowel segment classifier. The formant trajectories are approximated either by three line segments or by a third-order Legendre polynomial. We show that together with formant amplitude, formant bandwidth, pitch, and segment durations we can produce a classifier of comparable performance to a cepstral-based classifier. We further demonstrate the potential of the N-best classification paradigm and show that a combination of formant and cepstral features further improves the classification accuracy.
Finally, phonetic recognition experiments verify the validity of the entire approach: a segment-based framework, separate classifiers for vowels and consonants, and explicit formant features.
Real-Time Probabilistic Segmentation
, 1998
In this work, we investigate modifications to a probabilistic segmentation algorithm to achieve a real-time, pipelined capability for our segment-based speech recognizer [4]. The existing algorithm used a Viterbi and backwards search to hypothesize phonetic segments [2]. We were able to reduce the computational requirements of this algorithm by reducing the effective search space to acoustic landmarks, and were able to achieve pipelined capability by executing the A* search in blocks defined by reliably detected phonetic boundaries. The new algorithm produces 30% fewer segments, and improves TIMIT phonetic recognition performance by 2.4% over an acoustic segmentation baseline. We were also able to produce 30% fewer segments on a word recognition task in a weather information domain [11].
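To make the idea of a segment network constrained to acoustic landmarks concrete, the sketch below enumerates candidate segments whose start and end times both fall on landmarks, up to a maximum duration. It illustrates only the data structure; it is not the Viterbi or A* procedure from the paper, and the landmark times, the duration limit, and the function name are assumptions chosen for illustration.

```python
def candidate_segments(landmarks, max_duration):
    """List (start, end) segments whose endpoints are both landmarks.

    landmarks: landmark times (here in integer milliseconds).
    max_duration: longest segment allowed, in the same unit.
    """
    times = sorted(landmarks)
    segments = []
    for i, start in enumerate(times):
        for end in times[i + 1:]:
            if end - start > max_duration:
                break  # later landmarks are only further away
            segments.append((start, end))
    return segments


# Hypothetical landmark times in milliseconds.
landmarks = [0, 80, 150, 310, 400]
segments = candidate_segments(landmarks, max_duration=250)
for start, end in segments:
    print(f"{start:4d} ms -> {end:4d} ms")
```

A probabilistic segmenter would then keep only the segments that lie on high-scoring paths, which is how a sparser network than this exhaustive one can be obtained.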
Generation and Minimization of Word Graphs in Continuous
in the forward and backward passes, it is guaranteed that the generated graph is the minimal word-lattice, containing exactly the paths that have a higher score than the threshold. This includes all the different alignments of each word string. For re-scoring using new acoustical models, the word-lattice constructed above is the optimal minimal representation because it is desirable to re-score the different alignments. However, for re-scoring using only a new grammar, a more compact representation is better. Minimizing a word-lattice is equivalent to minimizing a nondeterministic finite-state automaton (NFA), which is a hard problem that cannot in general be solved in polynomial time. Therefore, the problem has been attacked using heuristic methods that reduce the graph but not to the minimal size. In particular, the so-called word-pair approximation has been applied [6]. In this study we instead approached the problem by applying the classical algorithms for: 1) constructing an equiv

