Hidden feature models for speech recognition using dynamic bayesian networks
- in Proc. Eurospeech
, 2003
Cited by 29 (4 self)
In this paper, we investigate the use of dynamic Bayesian networks (DBNs) to explicitly represent models of hidden features, such as articulatory or other phonological features, for automatic speech recognition. In previous work using the idea of hidden features, the representation has typically been implicit, relying on a single hidden state to represent a combination of features. We present a class of DBN-based hidden feature models, and show that such a representation can be not only more expressive but also more parsimonious. We also describe a way of representing the acoustic observation model with fewer distributions using a product of models, each corresponding to a subset of the features. Finally, we describe our recent experiments using hidden feature models on the Aurora 2.0 corpus.
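The saving from a product-of-models observation model can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the paper's model: the three feature streams, their values, the 1-D Gaussian parameters, and the names `MODELS`, `product_log_score`, and `distribution_counts` are all hypothetical. The sketch scores an observation as the sum of per-stream log-likelihoods (an unnormalized product) and counts how many distributions are needed with and without the factorization.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

# Hypothetical feature streams; each value maps to an assumed (mean, variance).
MODELS = {
    "voicing": {"voiced": (1.0, 0.5), "unvoiced": (-1.0, 0.5)},
    "place": {"labial": (0.5, 1.0), "alveolar": (0.0, 1.0), "velar": (-0.5, 1.0)},
    "manner": {"stop": (0.2, 2.0), "fricative": (-0.2, 2.0), "vowel": (0.0, 2.0)},
}

def product_log_score(x, state):
    """Unnormalized log score of x: sum of the per-stream log-likelihoods."""
    return sum(gaussian_logpdf(x, *MODELS[f][v]) for f, v in state.items())

def distribution_counts(models):
    """Distributions needed for one-per-combination vs. one-per-stream-value."""
    joint = math.prod(len(values) for values in models.values())
    factored = sum(len(values) for values in models.values())
    return joint, factored

print(distribution_counts(MODELS))  # (18, 8)
```

With these toy streams, one distribution per feature combination would need 18 distributions, while the factored form needs 8. The product is not normalized, so it is a score rather than a proper density.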
Articulatory Features for Robust Visual Speech Recognition
, 2004
Cited by 12 (4 self)
Visual information has been shown to improve the performance of speech recognition systems in noisy acoustic environments. However, most audio-visual speech recognizers rely on a clean visual signal. In this paper, we explore a novel approach to visual speech modeling, based on articulatory features, which has potential benefits under visually challenging conditions. The idea is to use a set of parallel SVM classifiers to extract different articulatory attributes from the input images, and then combine their decisions to obtain higher-level units, such as visemes or words. We evaluate our approach in a preliminary experiment on a small audio-visual database, using several image noise conditions, and compare it to the standard viseme-based modeling approach.
Articulatory feature-based methods for acoustic and audio-visual speech recognition: Summary from the 2006 JHU summer workshop
- Johns Hopkins University Center for
, 2007
Cited by 11 (6 self)
We report on investigations, conducted at the 2006 JHU Summer Workshop, of the use of articulatory features in automatic speech recognition. We explore the use of articulatory features for both observation and pronunciation modeling, and for both audio-only and audio-visual speech recognition. In the area of observation modeling, we use the outputs of a set of multilayer perceptron articulatory feature classifiers (1) directly, in an extension of hybrid HMM/ANN models, and (2) as part of the observation vector in a standard Gaussian mixture-based model, an extension of the now-popular “tandem” approach. In the area of pronunciation modeling, we explore models consisting of multiple hidden streams of states, each corresponding to a different articulatory feature and having soft synchrony constraints, for both audio-only and audio-visual speech recognition. Our models are implemented as dynamic Bayesian networks, and our
Integrating multilingual articulatory features into speech recognition
- in Proc. Eurospeech
, 2003
Cited by 10 (1 self)
The use of articulatory features, such as place and manner of articulation, has been shown to reduce the word error rate of speech recognition systems under different conditions and in different settings. For example, recognition systems based on features are more robust to noise and reverberation. In earlier work we showed that articulatory features can compensate for inter-language variability and can be recognized across languages. In this paper we show that using cross- and multilingual detectors to support an HMM-based speech recognition system significantly reduces the word error rate. By selecting and weighting the features in a discriminative way, we achieve an error rate reduction that lies in the same range as that seen when using language-specific feature detectors. By combining feature detectors from many languages and training the weights discriminatively, we even outperform the case where only monolingual detectors are used.
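One common way to let feature detectors "support" an HMM is a weighted sum of log scores. The abstract does not give the combination formula, so the sketch below is only an assumed form: the function `combined_score`, the detector names, and the choice to give the HMM the weight left over after the detector weights are all hypothetical, and the discriminative training of the weights is not shown.

```python
def combined_score(hmm_log_score, detector_log_scores, weights):
    """Weighted sum of an HMM log score and per-detector log scores.

    Assumption: the detector weights sum to less than 1 and the HMM
    stream receives the remaining weight.
    """
    hmm_weight = 1.0 - sum(weights.values())
    total = hmm_weight * hmm_log_score
    for name, weight in weights.items():
        total += weight * detector_log_scores[name]
    return total

# Hypothetical scores for one hypothesis and two feature detectors.
score = combined_score(
    hmm_log_score=-10.0,
    detector_log_scores={"voiced": -2.0, "nasal": -4.0},
    weights={"voiced": 0.1, "nasal": 0.05},
)
print(round(score, 3))  # -8.9
```

Setting every detector weight to zero recovers the plain HMM score, so selecting features amounts to choosing which weights are nonzero.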
Multilingual Articulatory Features
- in Proc. ICASSP, Hong Kong
, 2003
Cited by 9 (0 self)
Speech recognition systems based on or aided by articulatory features, such as place and manner of articulation, have been shown to be useful under varying circumstances. Recognizers based on features better compensate for channel and noise variability. In this work we show that it is also possible to compensate for inter-language variability using articulatory feature detectors. We come to the conclusion that articulatory features can be recognized across languages and that using detectors from many languages can improve the classification accuracy of the feature detectors on a single language. We further demonstrate how those multilingual and crosslingual detectors can support an HMM-based recognizer and thereby significantly reduce the word error rate by up to 12.3% relative. We expect that with the use of multilingual articulatory features it is possible to support the rapid deployment of recognition systems for new target languages.
Continuous Electromyographic Speech Recognition with a Multi-Stream Decoding Architecture
- In: International Conference on Communication Audio and Speech Processing
, 2007
Cited by 8 (7 self)
In our previous work, we reported a surface electromyographic (EMG) continuous speech recognition system with a novel EMG feature extraction method, E4, which is more robust to EMG noise than traditional spectral features. In this paper, we show that articulatory feature (AF) classifiers can also benefit from the E4 feature, which improves the F-score of the AF classifiers from 0.492 to 0.686. We also show that the E4 feature is less correlated across EMG channels and thus channel combination yields a larger improvement in F-score. With a stream architecture, the AF classifiers are then integrated into the decoding framework and improve the word error rate by 11.8% relative, from 33.9% to 29.9%. Index Terms: speech recognition, electromyography, articulatory muscles, articulatory features, feature extraction
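The "relative" figure above follows from the two absolute error rates. As a quick check, the helper below (the name `relative_reduction` is ours, not the paper's) computes the relative reduction as the absolute drop divided by the baseline.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to the baseline."""
    return (baseline - improved) / baseline

# Word error rates quoted in the abstract, in percent.
reduction = relative_reduction(33.9, 29.9)
print(round(100.0 * reduction, 1))  # 11.8
```

The absolute reduction here is 4.0 percentage points; dividing by the 33.9% baseline gives the 11.8% relative reduction.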
Modeling Coarticulation in EMG-based Continuous Speech Recognition
- Speech Communication Journal
Cited by 6 (2 self)
This paper discusses the use of surface electromyography for automatic speech recognition. Electromyographic signals captured at the facial muscles record the activity of the human articulatory apparatus and thus make it possible to trace back a speech signal even if it is spoken silently. Since speech is captured before it gets airborne, the resulting signal is not masked by ambient noise. The resulting Silent Speech Interface has the potential to overcome major limitations of conventional speech-driven interfaces: it is not prone to any environmental noise, allows confidential information to be transmitted silently, and does not disturb bystanders. We describe our new approach of phonetic feature bundling for modeling coarticulation in EMG-based speech recognition and report results on the EMG-PIT corpus, a multiple-speaker large-vocabulary database of silent and audible EMG speech recordings, which we recently collected. Our results on speaker-dependent and speaker-independent setups show that modeling the interdependence of phonetic features reduces the word error rate of the baseline system by over 33% relative. Our final system achieves 10% word error rate for the best-recognized speaker on a 101-word vocabulary task, bringing EMG-based speech recognition within a useful range for the application of silent speech interfaces.
Articulatory feature classification using surface electromyography
- in Proc. ICASSP
, 2006
Cited by 5 (5 self)
In this paper, we present an approach for articulatory feature classification based on surface electromyographic signals generated by the facial muscles. With audible speech and electromyographic signals recorded in parallel, experiments are conducted to show the anticipatory behavior of electromyographic signals with respect to speech signals. On average, we found the signals to be time delayed by 0.02 to 0.12 seconds. Furthermore, it is shown that different articulators have different anticipatory behavior. With offset-aligned signals, we improved the average F-score of the articulatory feature classifiers in our baseline system from 0.467 to 0.502.
PARSING SPEECH INTO ARTICULATORY EVENTS
Cited by 5 (0 self)
In this paper, the speech production process state is defined by a number of categorical articulatory features. We describe a detector that outputs a stream (sequence of classes) for each articulatory feature given the Mel frequency cepstral coefficient (MFCC) representation of the input speech. The detector consists of a bank of recurrent neural network (RNN) classifiers, a variable-depth lattice generator, and a Viterbi decoder. A bank of classifiers has been previously used for articulatory feature detection by many researchers. However, we extend their work first by creating variable-depth lattices for each feature and then by combining them into product lattices for rescoring using the Viterbi algorithm. During the rescoring we incorporate language and duration constraints along with the posterior probabilities of classes provided by the RNN classifiers. We present our results for place and manner features using TIMIT data, and compare the results to a baseline system. We report performance improvements both at the frame and segment levels.
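The general idea of Viterbi decoding over frame-level class posteriors can be sketched as follows. This is not the paper's lattice generator or rescoring scheme: it decodes a single feature stream directly from per-frame posteriors, and a flat `switch_penalty` on class changes is a stand-in assumption for the language and duration constraints. The function name, the class labels, and the posterior values are hypothetical, and all posteriors are assumed to be nonzero.

```python
import math

def viterbi(posteriors, switch_penalty):
    """Best class sequence for one feature stream.

    posteriors: list of dicts mapping class -> posterior for each frame.
    switch_penalty: cost (in log units) charged whenever the class changes.
    """
    classes = list(posteriors[0])
    score = {c: math.log(posteriors[0][c]) for c in classes}
    backpointers = []
    for frame in posteriors[1:]:
        new_score, pointer = {}, {}
        for c in classes:
            # Score of arriving in class c from each possible previous class.
            arrival = {p: score[p] - (0.0 if p == c else switch_penalty)
                       for p in classes}
            best_prev = max(classes, key=lambda p: arrival[p])
            new_score[c] = arrival[best_prev] + math.log(frame[c])
            pointer[c] = best_prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final class.
    path = [max(classes, key=lambda c: score[c])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path

frames = [
    {"stop": 0.9, "vowel": 0.1},
    {"stop": 0.4, "vowel": 0.6},
    {"stop": 0.9, "vowel": 0.1},
]
print(viterbi(frames, switch_penalty=0.0))  # ['stop', 'vowel', 'stop']
print(viterbi(frames, switch_penalty=2.0))  # ['stop', 'stop', 'stop']
```

With no penalty the decoder simply follows the frame-wise maximum; with a penalty, the one-frame "vowel" segment is smoothed away, which is the kind of effect a duration constraint is meant to have.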
Feature-Based Pronunciation Modeling for Automatic Speech Recognition
- In Proc. HLT/NAACL
, 2005
Cited by 4 (1 self)
Spoken language, especially conversational speech, is characterized by great variability in word pronunciation, including many variants that differ grossly from dictionary prototypes. This is one factor in the poor performance of automatic speech recognizers on conversational speech. One approach to handling this variation consists of expanding the dictionary with phonetic substitution, insertion, and deletion rules. Common rule sets, however, typically leave many pronunciation variants unaccounted for and increase word confusability due to the coarse granularity of phone units. We present an alternative approach, in which many types of variation are explained by representing a pronunciation as multiple streams of linguistic features rather than a single stream of phones. Features may correspond to the positions of the speech articulators, such as the lips and tongue, or to acoustic or perceptual categories. By

