Shared-Distribution Hidden Markov Models for Speech Recognition
, 1991
Cited by 227 (5 self)
Parameter sharing plays an important role in statistical modeling since training data are usually limited. On the one hand, we would like to use models that are as detailed as possible. On the other hand, with models too detailed, we can no longer reliably estimate the parameters. Triphone generalization may force two models to be merged together when only parts of the model output distributions are similar, while the rest of the output distributions are different. This problem can be avoided if clustering is carried out at the distribution level. In this paper, a shared-distribution model is proposed to replace generalized triphone models for speaker-independent continuous speech recognition. Here, output distributions in the hidden Markov model are shared with each other if they exhibit acoustic similarity. In addition to detailed representation, it also gives us the freedom to use a large number of states for each phonetic model. Although an increase in the number of states will inc...
The SPHINX-II Speech Recognition System: An Overview
- Computer Speech and Language
, 1992
Cited by 137 (7 self)
In order for speech recognizers to deal with increased task perplexity, speaker variation, and environment variation, improved speech recognition is critical. Steady progress has been made along these three dimensions at Carnegie Mellon. In this paper, we review the SPHINX-II speech recognition system and summarize our recent efforts on improved speech recognition. This research was sponsored by the Defense Advanced Research Projects Agency and monitored by the Space and Naval Warfare Systems Command under Contract N00039-91-C-0158, ARPA Order No. 7239. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. Keywords: Speech recognition, hidden Markov models, SPHINX-II 1. INTRODUCTION At Carnegie Mellon, we have made significant progress in large-vocabulary speaker-independent continuous speech recognition during the past years [1, 2, 3]. SP...
Lexical Modeling in a Speaker Independent Speech Understanding System
, 1993
Cited by 39 (8 self)
Over the past 40 years, significant progress has been made in the fields of speech recognition and speech understanding. Current state-of-the-art speech recognition systems are capable of achieving word-level accuracies of 90% to 95% on continuous speech recognition tasks using 5000 words. Even larger systems, capable of recognizing 20,000 words, are just now being developed. Speech understanding systems have recently been developed that perform fairly well within a restricted domain. While the size and performance of modern speech recognition and understanding systems are impressive, it is evident to anyone who has used these systems that the technology is primitive compared to our own human ability to understand speech. Some of the difficulties hampering progress in the fields of speech recognition and understanding stem from the many sources of variation that occur during human communication. One of the sources of variation that occurs in human communication is the different ways that words can be pronounced. There are many causes of pronunciation variation, such as: the phonetic environment in which the word occurs, the dialect of the speaker,
Predicting Unseen Triphones With Senones
, 1993
Cited by 37 (9 self)
In large-vocabulary speech recognition, the decoder often encounters triphones that are not covered in the training data. These unseen triphones are usually represented by corresponding diphones or context independent monophones. We propose to use decision-tree based senones to generate needed senonic baseforms for unseen triphones. A decision tree is built for each individual Markov state of each phone, and the leaves of the trees constitute the senone codebook. To find the senone a Markov state of any triphone is associated with, we traverse the corresponding tree until we reach a leaf node, where a senone is represented. We used the DARPA 5,000-word speaker-independent Wall Street Journal dictation task to evaluate the proposed method. The word error rate was reduced by 11% when unseen triphones were modeled by the decision-tree based senones. When there were at least 5 unseen triphones in each test utterance, the error rate could be reduced by more than 20%. This research was spons...
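The lookup step described in this abstract (walk the tree for a given Markov state until a leaf is reached, then read off the senone) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the class names, the phone classes `NASALS` and `VOWELS`, the questions asked at each node, and the senone ids are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple, Union


@dataclass
class Leaf:
    senone_id: int  # index into the senone codebook (hypothetical ids)


@dataclass
class Node:
    # A yes/no question about the triphone's left and right context phones.
    question: Callable[[str, str], bool]
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def find_senone(tree: Union[Node, Leaf], left: str, right: str) -> int:
    """Traverse one state's tree until a leaf is reached; return its senone."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(left, right) else node.no
    return node.senone_id


# Hypothetical phone classes used by the questions below.
NASALS = {"m", "n", "ng"}
VOWELS = {"aa", "ae", "iy", "uw"}

# One tree per (phone, Markov state); here a toy tree for state 0 of "k".
trees: Dict[Tuple[str, int], Union[Node, Leaf]] = {
    ("k", 0): Node(
        question=lambda left, right: left in NASALS,
        yes=Leaf(0),
        no=Node(
            question=lambda left, right: right in VOWELS,
            yes=Leaf(1),
            no=Leaf(2),
        ),
    )
}


def senone_for(phone: str, state: int, left: str, right: str) -> int:
    """Look up the senone for one state of a triphone, seen or unseen."""
    return find_senone(trees[(phone, state)], left, right)
```

Because the questions depend only on the context phones, a triphone that never occurred in training still reaches some leaf, which is what lets the tree supply a senone for it.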
On-Line Cursive Handwriting Recognition Using Speech Recognition Methods
, 1994
Cited by 35 (5 self)
A hidden Markov model (HMM) based continuous speech recognition system is applied to on-line cursive handwriting recognition. The base system is unmodified except for using handwriting feature vectors instead of speech. Due to inherent properties of HMMs, segmentation of the handwritten script sentences is unnecessary. A 1.1% word error rate is achieved for a 3050 word lexicon, 52 character, writer-dependent task and 3%-5% word error rates are obtained for six different writers in a 25,595 word lexicon, 86 character, writer-dependent task. Similarities and differences between the continuous speech and on-line cursive handwriting recognition tasks are explored; the handwriting database collected over the past year is described; and specific implementation details of the handwriting system are discussed. 1. INTRODUCTION Traditionally, the first step in handwriting recognition is the segmentation of words into component characters [1]. However, in modern continuous speech recognition ef...
Statistical Trajectory Models for Phonetic Recognition
, 1994
Cited by 27 (3 self)
The main goal of this work is to develop an alternative methodology for acoustic-phonetic modelling of speech sounds. The approach utilizes a segment-based framework to capture the dynamical behavior and statistical dependencies of the acoustic attributes used to represent the speech waveform. Temporal behavior is modelled explicitly by creating dynamic tracks of the acoustic attributes used to represent the waveform, and by estimating the spatio-temporal correlation structure of the resulting errors. The tracks serve as templates from which synthetic segments of the acoustic attributes are generated. Scoring of a hypothesized phonetic segment is then based on the error between the measured acoustic attributes and the synthetic segments generated for each phonetic model.
Extensions to Constraint Dependency Parsing for Spoken Language Processing
- Computer Speech and Language
, 1995
Cited by 21 (10 self)
A text-based and spoken language processing framework based on the Constraint Dependency Grammar (CDG) developed by Maruyama [24, 25] is discussed. The scope of CDG is expanded to allow for the analysis of sentences containing lexically ambiguous words, to allow feature analysis in constraints, and to efficiently process multiple sentence candidates that are likely to arise in spoken language processing. The benefits of the CDG parsing approach are summarized. Additionally, the development of CDG grammars using our grammar tools and parser is discussed.
Segment-Based Stochastic Models Of Spectral Dynamics For Continuous Speech Recognition
, 1992
Cited by 19 (1 self)
This dissertation addresses the problem of modeling the joint time-spectral structure of speech for recognition. Four areas are covered in this work: segment modeling, estimation, recognition search algorithms, and extension to a more general class of models. A unified view of the acoustic models that are currently used in speech recognition is presented; the research is then focused on segment-based models that provide a better framework for modeling the intrasegmental statistical dependencies than the conventional hidden Markov models (HMMs). The validity of a linearity assumption for modeling the intrasegmental statistical dependencies is first checked, and it is shown that the basic assumption of conditionally independent observations given the underlying state sequence that is inherent to HMMs is inaccurate. Based on these results, linear models are chosen for the distribution of the observations within a segment of speech. Motivated by the original work of the stochastic segment model, a dynamical system segment model is proposed for continuous speech recognition. Training of this model is equivalent to the maximum likelihood identification of a stochastic linear system, and a simple alternative to the traditional approach is developed. This procedure is based on the Expectation-Maximization algorithm and is analogous to the Baum-Welch algorithm for HMMs, since the dynamical system segment model can be thought of as a continuous state HMM. Recognition involves computing the probability of the innovations given by Kalman filtering. The large computational complexity of segment-based models is dealt with by the introduction of fast recognition search algorithms as alternatives to the typical Dynamic Programming search. A Split-and-Merge segmentation algorithm is...
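To make "the probability of the innovations given by Kalman filtering" concrete, the sketch below computes that log-probability for a scalar linear-Gaussian system. It is a minimal illustration under assumed notation, not the dissertation's model: the state follows x[t+1] = a*x[t] + w with w ~ N(0, q), the observation is y[t] = c*x[t] + v with v ~ N(0, r), and the prior on the first state is N(m0, p0). All parameter names are hypothetical.

```python
import math
from typing import Sequence


def innovations_log_likelihood(
    ys: Sequence[float],
    a: float,   # state transition coefficient (assumed scalar)
    c: float,   # observation coefficient (assumed scalar)
    q: float,   # process noise variance
    r: float,   # observation noise variance
    m0: float,  # prior mean of the first state
    p0: float,  # prior variance of the first state
) -> float:
    """Sum of Gaussian log-densities of the Kalman filter innovations."""
    m_pred, p_pred = m0, p0
    total = 0.0
    for y in ys:
        # Innovation: observation minus its one-step prediction.
        e = y - c * m_pred
        # Innovation variance.
        s = c * c * p_pred + r
        total += -0.5 * (math.log(2.0 * math.pi * s) + e * e / s)
        # Measurement update (Kalman gain, filtered mean and variance).
        k = p_pred * c / s
        m_filt = m_pred + k * e
        p_filt = (1.0 - k * c) * p_pred
        # Time update: predict the next state.
        m_pred = a * m_filt
        p_pred = a * a * p_filt + q
    return total
```

In a recognizer of this kind, one such score would be computed per candidate model for a hypothesized segment, and the scores compared; a real system would use vector-valued states and observations rather than the scalar case shown here.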
Speaker-Independent Phone Recognition Using BREF
, 1992
Cited by 16 (11 self)
A series of experiments on speaker-independent phone recognition of continuous speech have been carried out using the recently recorded BREF corpus. These experiments are the first to use this large corpus, and are meant to provide a baseline performance evaluation for vocabulary-independent phone recognition of French. The HMM-based recognizer was trained with hand-verified data from 43 speakers. Using 35 context-independent phone models, a baseline phone accuracy of 60% (no phone grammar) was obtained on an independent test set of 7635 phone segments from 19 new speakers. Including phone bigram probabilities as phonotactic constraints resulted in a performance of 63.5%. A phone accuracy of 68.6% was obtained with 428 context dependent models and the bigram phone language model. Vocabulary-independent word recognition results with no grammar are also reported for the same test data. INTRODUCTION This paper reports on a series of experiments for speaker-independent, continuous speech ...
A Speech-Based Route Enquiry System Built From General-Purpose Components
, 1993
Cited by 12 (3 self)
The adaptation of existing general-purpose speech recognition and language understanding systems can greatly reduce the cost of developing applications. However, the components must have appropriate characteristics for this to be possible. Work is in progress to adapt two task-independent components, the AURIX speech recognizer and the CLARE language processor, to create a system allowing spoken queries of the PC-based Autoroute route planning package. Keywords: adaptability, general purpose, speech recognition, language understanding, AURIX, CLARE 1. INTRODUCTION A spoken language understanding system is being built by the reconfiguration of two general purpose components. AURIX is designed to be a reconfigurable speech recognizer generating either a string of words or a lattice. Either input can be fed into CLARE, a general purpose language processor, which can generate suitable commands or database queries for a particular application. In the following sections, we describe first...

