A Segmental CRF Approach to Large Vocabulary Continuous Speech Recognition
Cited by 14 (7 self)
This paper proposes a segmental conditional random field framework for large vocabulary continuous speech recognition. Fundamental to this approach is the use of acoustic detectors as the basic input, and the automatic construction of a versatile set of segment-level features. The detector streams operate at multiple time scales (frame, phone, multi-phone, syllable, or word) and are combined at the word level in the CRF training and decoding processes. A key aspect of our approach is that features are defined at the word level, and are naturally geared to explain long-span phenomena such as formant trajectories, duration, and syllable stress patterns. Generalization to unseen words is possible through the use of decomposable consistency features [1], [2], and our framework allows for the joint or separate discriminative training of the acoustic and language models. An initial evaluation of this framework with voice search data from the Bing Mobile (BM) application results in a 2% absolute improvement over an HMM baseline. Index Terms: speech recognition, conditional random field, direct modeling, detector features.
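For orientation, a segmental CRF of this kind is usually written as a ratio of sums over segmentations. The abstract does not give the formula, so the notation below is an assumption: s is a word sequence, o the detector streams, q a segmentation with one segment per word, e a segment in q, s_l^e and s_r^e the words to the left and right of e, o(e) the detector events assigned to that segment, f_k the word-level features, and lambda_k their weights.

```latex
P(\mathbf{s} \mid \mathbf{o}) =
\frac{\displaystyle \sum_{\mathbf{q}:\,|\mathbf{q}| = |\mathbf{s}|}
      \exp\Big( \sum_{e \in \mathbf{q}} \sum_{k} \lambda_k \, f_k\big(s_l^{e}, s_r^{e}, o(e)\big) \Big)}
     {\displaystyle \sum_{\mathbf{s}'} \sum_{\mathbf{q}:\,|\mathbf{q}| = |\mathbf{s}'|}
      \exp\Big( \sum_{e \in \mathbf{q}} \sum_{k} \lambda_k \, f_k\big(s_l^{\prime e}, s_r^{\prime e}, o(e)\big) \Big)}
```

Because each f_k sees a whole word-length segment, duration and other long-span cues can enter as ordinary features, and a feature that depends on both s_l^e and s_r^e can play the role of a language model score.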
Code Breaking for Automatic Speech Recognition
Cited by 2 (1 self)
Code Breaking is a divide and conquer approach for sequential pattern recognition tasks where we identify weaknesses of an existing system and then use specialized decoders to strengthen the overall system. We study the technique in the context of Automatic Speech Recognition. Using the lattice cutting algorithm, we first analyze lattices generated by a state-of-the-art speech recognizer to spot possible errors in its first-pass hypothesis. We then train specialized decoders for each of these problems and apply them to refine the first-pass hypothesis. We study the use of Support Vector Machines (SVMs) as discriminative models over each of these problems. The estimation of a posterior distribution over hypotheses in these regions of acoustic confusion is posed as a logistic regression problem. GiniSVMs, a variant of SVMs, can be used as an approximation technique to estimate the parameters of the logistic regression problem. We first validate our approach on a small vocabulary recognition task, namely, alphadigits. We show that the use of GiniSVMs can substantially improve the performance of a well trained MMI-HMM system. We also find that it is possible to derive reliable confidence scores over the GiniSVM hypotheses and that these can be used to good effect in hypothesis combination. We then analyze lattice cutting in terms of its ability to reliably identify, and provide good alternatives for, incorrectly hypothesized words in the Czech MALACH domain, a large vocabulary task. We describe a procedure to train and apply SVMs to strengthen the first-pass system, resulting in small but statistically significant recognition improvements. We conclude with a discussion of methods, including clustering, for obtaining further improvements on large vocabulary tasks.
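The idea of treating a binary region of acoustic confusion as a logistic regression problem can be sketched as follows. This is a minimal illustration using plain gradient-descent logistic regression, not the GiniSVM approximation the abstract describes; the feature vectors, labels, and function names (`train_logistic`, `posterior`) are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the log loss.

    features: list of feature vectors for one confusion pair (hypothetical).
    labels:   1 if the first word of the pair was spoken, 0 for the second.
    """
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def posterior(w, b, x):
    """Posterior probability that the first word of the pair is correct."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy one-dimensional data standing in for acoustic evidence in one region.
toy_features = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
toy_labels = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(toy_features, toy_labels)
```

The posterior for the competing word is one minus this value, which is what makes the output usable as a confidence score when combining the specialized decoder with the first-pass hypothesis.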
Detection of Acoustic-Phonetic Landmarks in Mismatched Conditions Using a Biomimetic Model of Human Auditory Processing
Cited by 1 (1 self)
Acoustic-phonetic landmarks provide robust cues for speech recognition and are relatively invariant between speakers, speaking styles, noise conditions, and sampling rates. The ability to detect acoustic-phonetic landmarks as a front-end for speech recognition has been shown to improve recognition accuracy. Biomimetic inter-spike intervals and average signal level have been shown to accurately convey information about acoustic-phonetic landmarks. This paper explores the use of inter-spike interval and average signal level as input features for landmark detectors trained and tested on mismatched conditions. These detectors are designed to serve as a front-end for speech recognition systems. Results indicate that landmark detectors trained using inter-spike intervals and signal level are relatively robust to both additive channel noise and changes in sampling rate. Mismatched conditions (differences in channel noise between training audio and testing audio) are problematic for computer speech recognition systems. Signal enhancement, mismatch-resistant acoustic features, and architectural compensation within the recognizer
Combining phonetic attributes using conditional random fields, 2006
A Conditional Random Field (CRF) is a mathematical model for sequences that is similar in many ways to a Hidden Markov Model (HMM), but is discriminative rather than generative in nature. In this paper, we explore the application of the CRF model to automatic speech recognition (ASR) by building a system that performs first-pass phonetic recognition using discriminatively trained phonetic features. We show that this CRF model achieves higher accuracy on a phone recognition task than a similarly trained HMM. Index Terms: speech recognition, conditional random fields.
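The discriminative nature of a CRF comes from normalizing over label sequences given the input, rather than modeling the input itself. A minimal sketch of a linear-chain CRF follows, assuming per-frame label scores (`emit`) and label-to-label transition scores (`trans`) have already been computed; in a system like the one described, `emit` would presumably be a weighted combination of phonetic attribute scores, but the numbers and names here are purely illustrative.

```python
import math


def logsumexp(values):
    """Stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emit, trans, labels):
    """Unnormalized log score of one label sequence."""
    score = emit[0][labels[0]]
    for t in range(1, len(labels)):
        score += trans[labels[t - 1]][labels[t]] + emit[t][labels[t]]
    return score


def log_partition(emit, trans):
    """log Z(x): forward algorithm over all label sequences."""
    n_labels = len(emit[0])
    alpha = list(emit[0])
    for t in range(1, len(emit)):
        alpha = [
            logsumexp([alpha[i] + trans[i][j] for i in range(n_labels)]) + emit[t][j]
            for j in range(n_labels)
        ]
    return logsumexp(alpha)


def log_prob(emit, trans, labels):
    """Conditional log probability log p(labels | x)."""
    return sequence_score(emit, trans, labels) - log_partition(emit, trans)


# Illustrative scores: 3 frames, 2 phone labels.
emit = [[0.5, -0.2], [1.0, 0.3], [-0.4, 0.8]]
trans = [[0.2, -0.1], [0.0, 0.4]]
```

Unlike HMM emission probabilities, the entries of `emit` and `trans` are arbitrary real-valued scores; only the final ratio against Z(x) is a probability.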
SPEECH RECOGNITION BASED ON PHONETIC FEATURES AND ACOUSTIC LANDMARKS
A probabilistic and statistical framework is presented for automatic speech recognition based on a phonetic feature representation of speech sounds. In this acoustic-phonetic approach, the speech recognition problem is formulated as a maximization of the joint posterior probability of a set of phonetic features and the corresponding acoustic landmarks. Binary classifiers of the manner phonetic features (syllabic, sonorant, and continuant) are applied for the probabilistic detection of speech landmarks. The landmarks include stop bursts, vowel onsets, syllabic peaks, syllabic dips, fricative onsets and offsets, and sonorant consonant onsets and offsets. The classifiers use automatically extracted knowledge-based acoustic parameters (APs) that are acoustic correlates of those phonetic features. For isolated word recognition with known and limited vocabulary, the landmark sequences are constrained using a manner class pronunciation graph. Probabilistic decisions on place and voicing phonetic features are then made using a separate set of APs extracted using the landmarks. The framework exploits two properties of the knowledge-based acoustic cues of phonetic features: (1) sufficiency of the acoustic cues of a phonetic feature for a decision on that feature and (2) invariance of the acoustic cues with respect to context. The probabilistic framework makes the acoustic-phonetic approach to speech recognition suitable for practical recognition tasks as well as compatible with probabilistic pronunciation and language models. Support vector machines (SVMs) are applied for the binary classification tasks because of two favorable properties: good generalization and the ability to learn from a relatively small amount of high-dimensional data. Performance comparable to Hidden Markov Model (HMM) based systems is obtained on landmark detection as well as isolated word recognition. Applications to rescoring of lattices from a large vocabulary continuous speech recognizer are also presented.
Dual-Channel Acoustic Detection of Nasalization States (INTERSPEECH 2007)
Automatic detection of different oral-nasal configurations during speech is useful for understanding normal nasalization and assessing certain speech disorders. We propose an algorithm to extract nasalization features from dual-channel acoustic signals that are acquired by a simple two-microphone setup. The feature is based on a dual-channel acoustic model and the associated analysis method. We successfully test this feature in speaker-dependent and speaker-independent tasks by comparing it with the conventional single-channel MFCC feature. The proposed feature uniformly performs better in both tasks. Index Terms: speech production, nasalization, speech pathology, velopharyngeal function, nasal resonance
SVM-HMM LANDMARK BASED SPEECH RECOGNITION
Support vector machines (SVMs) are trained to detect acoustic-phonetic landmarks, and to identify both the manner and place of articulation of the phones producing each landmark with high accuracy. The discriminant outputs of these SVMs are used as input features for a standard HMM based ASR system. Using these SVM discriminant features significantly improves both phone and word recognition accuracy over an MFCC based recognizer.

