Results 1 - 10
of
43
From HMM's to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition
, 1996
"... ..."
Multi Stream Speech Recognition
, 1996
"... . In this paper, we discuss a new automatic speech recognition (ASR) approach based on independent processing and recombination of several feature streams. In this framework, it is assumed that the speech signal is represented in terms of multiple input streams, each input stream representing a diff ..."
Abstract
-
Cited by 113 (16 self)
- Add to MetaCart
. In this paper, we discuss a new automatic speech recognition (ASR) approach based on independent processing and recombination of several feature streams. In this framework, it is assumed that the speech signal is represented in terms of multiple input streams, each input stream representing a different characteristic of the signal. If the streams are entirely synchronous, they may be accommodated simply (as they usually are in state-of-the-art systems). However, as discussed in the paper, it may be required to permit some degree of asynchrony between streams. This paper introduces the basic framework of a statistical structure that can accommodate multiple (asynchronous) observation streams (possibly exhibiting different frame rates). This approach will then be applied to the particular case of multi-band speech recognition and will be shown to yield significantly better noise robustness. 2 IDIAP--RR 96-07 1 Introduction In current automatic speech recognition (ASR) systems, the a...
A Probabilistic Framework For Segment-Based Speech Recognition
, 2003
"... Most current speech recognizers use an observatE9 space based on atS8VV al sequence of measur extn ct from fixed-lengt "frames" (e.g., Mel-cepst-ce Given ahypot9; ical word or sub-word sequence, te acoustO likelihood computp;VW always involves allobservat ion frames,t,;LI t, mapping beting individ ..."
Abstract
-
Cited by 108 (33 self)
- Add to MetaCart
Most current speech recognizers use an observatE9 space based on atS8VV al sequence of measur extn ct from fixed-lengt "frames" (e.g., Mel-cepst-ce Given ahypot9; ical word or sub-word sequence, te acoustO likelihood computp;VW always involves allobservat ion frames,t,;LI t, mapping beting individual frames andintV nal recognizerstr;E will depend on t;hypotEO; zed segmentme;LH There is anotLO tot of recognizer whoseobservat ion space isbetI r represente as anet ork, or graph, where each arc in t; graph correspondst a hypotL;) zed variable-lengt segment tm is represente by a fixed-dimensional "featO e". In suchfeatSE;)E sed recognizers, eachhypotO99 zed segmentme;L will correspondt a segment sequence, orpatH ttHSV tt overall segme ntme aph th; is associato wit a subset of all possible feat revectI s intV tVLI observatEV space. Int;E work we examine a maximum apostW iori decoding stcodin forfeat ure-based recognizers and develop a normalizat ioncrit9S on useful for a segme ntme; ed VitOLO or A # search. Experiment arereport ed for bot phoneto and word recognitco tcog .
A Probabilistic Framework For Feature-Based Speech Recognition
, 1996
"... Most current speech recognizers use an observation space which is based on a temporal sequence of "frames" (e.g., Mel-cepstra). There is another class of recognizer which further processes these frames to produce a segment-based network, and represents each segment by fixed-dimensional "features." I ..."
Abstract
-
Cited by 101 (24 self)
- Add to MetaCart
Most current speech recognizers use an observation space which is based on a temporal sequence of "frames" (e.g., Mel-cepstra). There is another class of recognizer which further processes these frames to produce a segment-based network, and represents each segment by fixed-dimensional "features." In such feature-based recognizers the observation space takes the form of a temporal network of feature vectors, so that a single segmentation of an utterance will use a subset of all possible feature vectors. In this work we examine amaximuma posteriori decoding strategy for feature-based recognizers and develop a normalization criterion useful for a segmentbased Viterbi or A* search. We report experimental results for the task of phonetic recognition on the TIMIT corpus where we achieved context-independent and context-dependent (using diphones) results on the core test set of 64.1% and 69.5% respectively.
A Hierarchical Stochastic Model for Automatic Prediction of Prosodic Boundary Location
- COMPUTATIONAL LINGUISTICS
, 1994
"... Prosodic phrase structure ..."
Support vector machines for speech recognition
- Proceedings of the International Conference on Spoken Language Processing
, 1998
"... Statistical techniques based on hidden Markov Models (HMMs) with Gaussian emission densities have dominated signal processing and pattern recognition literature for the past 20 years. However, HMMs trained using maximum likelihood techniques suffer from an inability to learn discriminative informati ..."
Abstract
-
Cited by 47 (2 self)
- Add to MetaCart
Statistical techniques based on hidden Markov Models (HMMs) with Gaussian emission densities have dominated signal processing and pattern recognition literature for the past 20 years. However, HMMs trained using maximum likelihood techniques suffer from an inability to learn discriminative information and are prone to overfitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. In this paper, we show that SVMs provide a significant improvement in performance on a static pattern classification task based on the Deterding vowel data. We also describe an application of SVMs to large vocabulary speech recognition, and demonstrate an improvement in error rate on a continuous alphadigit task (OGI Aphadigits) and a large vocabulary conversational speech task (Switchboard). Issues related to the development and optimization of an SVM/HMM hybrid system are discussed.
Statistical Trajectory Models for Phonetic Recognition
, 1994
"... The main goal of this work is to develop an alternative methodology for acoustic-- phonetic modelling of speech sounds. The approach utilizes a segment--based framework to capture the dynamical behavior and statistical dependencies of the acoustic attributes used to represent the speech waveform. Te ..."
Abstract
-
Cited by 27 (3 self)
- Add to MetaCart
The main goal of this work is to develop an alternative methodology for acoustic-- phonetic modelling of speech sounds. The approach utilizes a segment--based framework to capture the dynamical behavior and statistical dependencies of the acoustic attributes used to represent the speech waveform. Temporal behavior is modelled explicitly by creating dynamic tracks of the acoustic attributes used to represent the waveform, and by estimating the spatio--temporal correlation structure of the resulting errors. The tracks serve as templates from which synthetic segments of the acoustic attributes are generated. Scoring of an hypothesized phonetic segment is then based on the error between the measured acoustic attributes and the synthetic segments generated for each phonetic model.
Language Modeling With Sentence-Level Mixtures
, 1994
"... Language models play an important role in improving the accuracy of a continuous speech recognizer. In this thesis, we introduce a new statistical language model which captures long term topic dependencies of words within and across sentences. The model includes two main contributions. First, we dev ..."
Abstract
-
Cited by 23 (1 self)
- Add to MetaCart
Language models play an important role in improving the accuracy of a continuous speech recognizer. In this thesis, we introduce a new statistical language model which captures long term topic dependencies of words within and across sentences. The model includes two main contributions. First, we develop a topic-dependent sentence-level mixture language model which takes advantage of the topic constraints in a sentence or a paragraph. Since this language model is not Markov and has a large search space, it is used only in the last stage of a multi-pass search strategy in the recognizer. Second, we introduce topic-dependent dynamic adaptation techniques in the framework of the mixture model. During the course of this thesis, we also investigate robust parameter estimation techniques, which are extremely important in light of the sparse data problems in language modeling. The model is implemented in the BU speech recognition system and provides a significant improvement in recognition accuracy. An important advantage of the framework of our model is that it is a simple extension of existing language modeling techniques that can easily be integrated with other language modeling advances.
Syllable-Based Large Vocabulary Continuous Speech Recognition
- IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING
, 2001
"... Most large vocabulary continuous speech recognition (LVCSR) systems in the past decade have used a context-dependent phone as the fundamental acoustic unit. In this paper, we present one of the first robust LVCSR systems that uses a syllable-level acoustic unit for LVCSR on telephone-bandwidth speec ..."
Abstract
-
Cited by 22 (0 self)
- Add to MetaCart
Most large vocabulary continuous speech recognition (LVCSR) systems in the past decade have used a context-dependent phone as the fundamental acoustic unit. In this paper, we present one of the first robust LVCSR systems that uses a syllable-level acoustic unit for LVCSR on telephone-bandwidth speech. This effort is motivated by the inherent limitations in phone-based approaches — namely the lack of an easy and efficient way for modeling long-term temporal dependencies. A syllable unit spans a longer time frame, typically three phones, thereby offering a more parsimonious framework for modeling pronunciation variation in spontaneous speech. We present encouraging results which show that a syllable-based system exceeds the performance of a comparable triphone system both in terms of word error rate (WER) and complexity. The WER of the best syllable system reported here is 49.1 % on a standard SWITCHBOARD evaluation, a small improvement over the triphone system. We also report results on a much smaller recognition task, OGI Alphadigits, which was used to validate some of the benefits syllables offer over triphones. The syllable-based system exceeds the performance of the triphone system by nearly 20%, an impressive accomplishment since the alphadigits application consists mostly of phone-level minimal pair distinctions.

