Results 1 - 10
of
20
Modeling Out-Of-Vocabulary Words For Robust Speech Recognition
, 2000
"... This thesis concerns the problem of unknown or out-of-vocabulary (00V) words in continuous speech recognition. Most of today's state-of-the-art speech recognition systems can recognize only words that belong to some predefined finite word vocabulary. When encountering an OOV word, a speech recognize ..."
Abstract
-
Cited by 43 (5 self)
- Add to MetaCart
This thesis concerns the problem of unknown or out-of-vocabulary (00V) words in continuous speech recognition. Most of today's state-of-the-art speech recognition systems can recognize only words that belong to some predefined finite word vocabulary. When encountering an OOV word, a speech recognizer erroneously substitutes the OOV word with a similarly sounding word from its vocabulary. Furthermore, a recognition error due to an OOV word tends to spread errors into neighboring words; dramatically degrading overall recognition performance.
Subword-based Approaches for Spoken Document Retrieval
, 2000
"... This thesis explores approaches to the problem of spoken document retrieval (SDR), which is the task of automatically indexing and then retrieving relevant items from a large collection of recorded speech messages in response to a user specified natural language text query. We investigate the use of ..."
Abstract
-
Cited by 40 (0 self)
- Add to MetaCart
This thesis explores approaches to the problem of spoken document retrieval (SDR), which is the task of automatically indexing and then retrieving relevant items from a large collection of recorded speech messages in response to a user specified natural language text query. We investigate the use of subword unit representations for SDR as an alternative to words generated by either keyword spotting or continuous speech recognition. Our investigation is motivated by the observation that word-based retrieval approaches face the problem of either having to know the keywords to search for a priori, or requiring a very large recognition vocabulary in order to cover the contents of growing and diverse message collections. The use of subword units in the recognizer constrains the size of the vocabulary needed to cover the language; and the use of subword units as indexing terms allows for the detection of new user-specified query terms during retrieval. Four
Heterogeneous Acoustic Measurement And Multiple Classifiers For Speech Recognition
, 1998
"... The acoustic-phonetic modeling component of most current speech recognition systems calculates a small set of homogeneous frame-based measurements at a single, #xed time-frequency resolution. This thesis presents evidence indicating that recognition performance can be signi#cantly improved through a ..."
Abstract
-
Cited by 29 (1 self)
- Add to MetaCart
The acoustic-phonetic modeling component of most current speech recognition systems calculates a small set of homogeneous frame-based measurements at a single, #xed time-frequency resolution. This thesis presents evidence indicating that recognition performance can be signi#cantly improved through a contrasting approach using more detailed and more diverse acoustic measurements, which we refer to as heterogeneous measurements.
Towards unsupervised pattern discovery in speech
- Peter Hagedorn, Wolfgang Konrad and J. Wallaschek, The Journal of Sound and Vibration
, 2005
"... Abstract—We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a prespecified inventory of lexical units (i.e., p ..."
Abstract
-
Cited by 27 (6 self)
- Add to MetaCart
Abstract—We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a prespecified inventory of lexical units (i.e., phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multiword phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream. Index Terms—Speech processing, unsupervised pattern discovery, word acquisition. I.
Lexical Modeling Of Non-Native Speech For Automatic Speech Recognition
, 2000
"... This paper examines the recognition of non-native speech in jupiter, a speaker-independent, spontaneous-speech conversational system. Because the non-native speech in this domain is limited and varied, speaker- and accent-specific methods are impractical. We therefore chose to model all of the non-n ..."
Abstract
-
Cited by 24 (1 self)
- Add to MetaCart
This paper examines the recognition of non-native speech in jupiter, a speaker-independent, spontaneous-speech conversational system. Because the non-native speech in this domain is limited and varied, speaker- and accent-specific methods are impractical. We therefore chose to model all of the non-native data with a single model. In particular, this paper describes an attempt to better model non-native lexical patterns. These patterns are incorporated by applying context-independent phonetic confusion rules, whose probabilities are estimated from training data. Using this approach, the word error rate on a non-native test set is reduced from 20.9% to 18.8%. 1. INTRODUCTION Speech recognition accuracy has been observed to be drastically lower for non-native speakers of the target language than for native speakers [3, 13, 14]. Research on both nonnative accent modeling and dialect-specific modeling shows that large gains in performance can be achieved when the acoustics [1, 9, 14] and ...
Near-Miss Modeling: A Segment-Based Approach to Speech Recognition
, 1998
"... Currently, most approaches to speech recognition are frame-based in that they represent speech as a temporal sequence of feature vectors. Although these approaches have been successful, they cannot easily incorporate complex modeling strategies that may further improve speech recognition performance ..."
Abstract
-
Cited by 15 (0 self)
- Add to MetaCart
Currently, most approaches to speech recognition are frame-based in that they represent speech as a temporal sequence of feature vectors. Although these approaches have been successful, they cannot easily incorporate complex modeling strategies that may further improve speech recognition performance. In contrast, segment-based approaches represent speech as a temporal graph of feature vectors and facilitate the incorporation of a wide range of modeling strategies. However, difficulties in segmentbased recognition have impeded the realization of potential advantages in modeling. This thesis
Automatic Language Identification Using a Segment-Based Approach
- Proc. Eurospeech
, 1993
"... Automatic Language Identification (ALI) is the problem of automatically identifying the language of an utterance through the use of a computer. In 1977, House and Neuburg proposed an approach to ALI which focused on the phonotactic constraints of different languages. Their work suggested that simple ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
Automatic Language Identification (ALI) is the problem of automatically identifying the language of an utterance through the use of a computer. In 1977, House and Neuburg proposed an approach to ALI which focused on the phonotactic constraints of different languages. Their work suggested that simple language models could be used effectively for language identification if an accurate phonetic representation of an utterance could be obtained from the acoustic signal. Our research utilizes House and Neuburg's ideas as the starting point for a new segment-based approach to ALI. To develop a solid theoretical basis for the design of an ALI system, a formal probabilistic framework has been developed. This framework uses House and Neuburg's ideas as its foundation but also utilizes additional information that may be useful for ALI. Specifically, phonotactic, acoustic and prosodic information are all incorporated into the framework which provides the structure for the segment-based system. To ...
Remap - Experiments With Speech Recognition
, 1996
"... In this report we present experimental and theoretical results using a framework for training and modeling continuous speech recognition systems based on the theoretically optimal Maximum a Posteriori (MAP) criterion. This is in constrast to most state-of-the-art systems which are trained according ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
In this report we present experimental and theoretical results using a framework for training and modeling continuous speech recognition systems based on the theoretically optimal Maximum a Posteriori (MAP) criterion. This is in constrast to most state-of-the-art systems which are trained according to a Maximum Likelihood (ML) criterion. Although the algorithm is quite general, we applied it to a particular form of hybrid system combining Hidden Markov Models (HMMs) and Artificial Neural Networks (ANNs) in which the ANN targets and weights are iteratively re-estimated to guarantee the increase of the posterior probability of the correct model, hence actually minimizing the error rate. More specifically, this training approach is applied to a transition-based model that uses local conditional transition probabilities (i.e., the posterior probability of the current state given the current acoustic vector and the previous state) to estimate the posterior probabilities of sentences. Experi...
Modeling Dynamics In Connectionist Speech Recognition - The Time Index Model
- In Proc. Intl. Conf. on Speech and Language Processing
, 1994
"... We are experimenting with an approach to connectionist speech recognition that models the dynamics within a speech segment using temporal position as an explicit variable. Currently, the most common model for human speech production that is used in speech recognition is the Hidden Markov Model (HMM) ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
We are experimenting with an approach to connectionist speech recognition that models the dynamics within a speech segment using temporal position as an explicit variable. Currently, the most common model for human speech production that is used in speech recognition is the Hidden Markov Model (HMM). However, HMMs suffer from well known limitations; most notably, the assumption that the observations generated in a given state are independent and identically distributed (i.i.d.). As an alternative, we are developing a time index model that explicitly conditions the emission probability of a state on the time index, where time index is defined as the number of frames since entering a state till the current frame. Thus, the proposed model does not require the i.i.d. assumption. Our pilot results suggest that the time-index approach can greatly reduce error if we have good information about the phoneme boundary location. 1 INTRODUCTION We briefly review the main approaches to acoustic mo...
A Segment-Based Speaker Verification System Using SUMMIT
, 1997
"... This thesis describes the development of a segment-based speaker verification system. Our investigation is motivated by past observations that speaker-specific cues may manifest themselves differently depending on the manner of articulation of the phonemes. By treating the speech signal as a concate ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
This thesis describes the development of a segment-based speaker verification system. Our investigation is motivated by past observations that speaker-specific cues may manifest themselves differently depending on the manner of articulation of the phonemes. By treating the speech signal as a concatenation of phone-sized units, one may be able to capitalize on measurements for such units more readily. A potential side benefit of such an approach is that one may be able to achieve good performance with unit (i.e., phonetic inventory) and feature sizes that are smaller than what would normally be required for a frame-based system, thus deriving the benefit of reduced computation. To carry out our investigation, we started with the segment-based speech recognition system developed in our group called SUMMIT [44], and modified it to suit our needs. The speech signal was first transformed into a hierarchical segment network using frame-based measurements. Next, acoustic models for each speak...

