Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains
- IEEE Transactions on Speech and Audio Processing
, 1994
Cited by 372 (36 self)
In this paper a framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMM) is presented. Three key issues of MAP estimation, namely the choice of prior distribution family, the specification of the parameters of prior densities and the evaluation of the MAP estimates, are addressed. Using HMMs with Gaussian mixture state observation densities as an example, it is assumed that the prior densities for the HMM parameters can be adequately represented as a product of Dirichlet and normal-Wishart densities. The classical maximum likelihood estimation algorithms, namely the forward-backward algorithm and the segmental k-means algorithm, are expanded and MAP estimation formulas are developed. Prior density estimation issues are discussed for two classes of applications: parameter smoothing and model adaptation, and some experimental results are given illustrating the practical interest of this approach. Because of its adaptive nature, Bayesian learning is shown to serve as a unified approach for a wide range of speech recognition applications.
A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition
- IEEE Transactions on Speech and Audio Processing
, 1996
Cited by 86 (14 self)
Ananth Sankar and Chin-Hui Lee, Speech Research Department, AT&T Bell Laboratories, Murray Hill, NJ 07974.
Recently there has been much interest in the problem of improving the performance of automatic speech recognition (ASR) systems in adverse environments. When there is a mismatch between the training and testing environments, ASR systems suffer a degradation in performance. The goal of robust speech recognition is to remove the effect of this mismatch so as to bring the recognition performance as close as possible to the matched conditions. In speech recognition, the speech is usually modeled by a set of hidden Markov models (HMM) X. During recognition the observed utterance Y is decoded using these models. Due to the mismatch between training and testing conditions, this often results in a degradation in performance compared to the matched conditions. The mismatch b...
Robust endpoint detection and energy normalization for real-time speech and speaker recognition
- IEEE Transactions on Speech and Audio Processing
, 2002
Cited by 23 (0 self)
When automatic speech recognition (ASR) and speaker verification (SV) are applied in adverse acoustic environments, endpoint detection and energy normalization can be crucial to the functioning of both systems. In low signal-to-noise ratio (SNR) and nonstationary environments, conventional approaches to endpoint detection and energy normalization often fail and ASR performances usually degrade dramatically. The purpose of this paper is to address the endpoint problem. For ASR, we propose a real-time approach. It uses an optimal filter plus a three-state transition diagram for endpoint detection. The filter is designed utilizing several criteria to ensure accuracy and robustness. It has almost invariant response at various background noise levels. The detected endpoints are then applied to energy normalization sequentially. Evaluation results show that the proposed algorithm significantly reduces the string error rates in low SNR situations. The reduction rates even exceed 50% in several evaluated databases. For SV, we propose a batch-mode approach. It uses the optimal filter plus a two-mixture energy model for endpoint detection. The experiments show that the batch-mode algorithm can detect endpoints as accurately as using HMM forced alignment while the proposed one has much less computational complexity.
Index Terms: change-point detection, edge detection, endpoint detection, optimal filter, robust speech recognition, speaker verification, speech activity detection, speech detection.
Extensions to Constraint Dependency Parsing for Spoken Language Processing
- COMPUTER SPEECH AND LANGUAGE
, 1995
Cited by 21 (10 self)
A text-based and spoken language processing framework based on the Constraint Dependency Grammar (CDG) developed by Maruyama [24, 25] is discussed. The scope of CDG is expanded to allow for the analysis of sentences containing lexically ambiguous words, to allow feature analysis in constraints, and to efficiently process multiple sentence candidates that are likely to arise in spoken language processing. The benefits of the CDG parsing approach are summarized. Additionally, the development of CDG grammars using our grammar tools and parser is discussed.
Filtering the Time Sequences of Spectral Parameters for Speech Recognition
, 1997
Cited by 10 (1 self)
In automatic speech recognition, the signal is usually represented by a set of time sequences of spectral parameters (TSSPs) that model the temporal evolution of the spectral envelope frame-to-frame. Those sequences are then filtered either to make them more robust to environmental conditions or to compute differential parameters (dynamic features) which enhance discrimination. In this paper, we apply frequency analysis to TSSPs in order to provide an interpretation framework for the various types of parameter filters used so far. Thus, the analysis of the average long-term spectrum of the successfully filtered sequences reveals a combined effect of equalization and band selection that provides insights into TSSP filtering. Also, we show in the paper that, when supplementary differential parameters are not used, the recognition rate can be improved even for clean speech, just by properly filtering the TSSPs. To support this claim, a number of experimental results are presented, bot...
The Use Of Wavelet Transforms In Phoneme Recognition
- Proc. ICSLP-96
, 1996
Cited by 6 (1 self)
This study investigates the usefulness of wavelet transforms in phoneme recognition. Both discrete wavelet transforms (DWT) and sampled continuous wavelet transforms (SCWT) are tested. The wavelet transform is used as a part of the front-end processor which extracts feature vectors for a speaker-independent HMM-based phoneme recognizer. The results are evaluated on a portion of the TIMIT corpus consisting of 30293 phoneme tokens for training and 14489 phoneme tokens for testing. The test results suggest that SCWT gives a considerably better recognition rate than DWT. On the other hand, the improvement of SCWT over Mel-scale cepstral coefficients appears to be marginal.
A Study On Task-Independent Subword Selection And Modeling For Speech Recognition
Cited by 6 (3 self)
We study two key issues in task-independent training, namely selection of a universal set of subword units and modeling of the selected units. Since no a priori knowledge about the application vocabulary and syntax was used in the collection of the training corpus and the recognition task is frequently changing, the conventional strategy can no longer provide the best performance across many different tasks. We present a new approach that uses the complete sets of right and left context-dependent units as the basis phone sets. Training of these models is accomplished by a new training criterion that maximizes phone separation between competing models. The proposed phone selection and modeling approach was evaluated across different tasks in American English. Good recognition results were obtained for both context-independent and context-dependent phone models even for unseen tasks. The same strategy has also been applied to two other languages, Mandarin Chinese and Spanish, with similar success.
A survey on automatic speech recognition with an illustrative example on continuous speech recognition of Mandarin
- Computat. Linguistics Chinese Language Processing
, 1996
Cited by 2 (0 self)
For the past two decades, research in speech recognition has been intensively carried out worldwide, spurred on by advances in signal processing, algorithms, architectures, and hardware. Speech recognition systems have been developed for a wide variety of applications, ranging from small vocabulary keyword recognition over dial-up telephone lines, to medium size vocabulary voice interactive command and control systems on personal computers, to large vocabulary speech dictation, spontaneous speech understanding, and limited-domain speech translation. In this paper we review some of the key advances in several areas of automatic speech recognition. We also illustrate, by examples, how these key advances can be used for continuous speech recognition of Mandarin. Finally we elaborate the requirements in designing successful real-world applications and address technical challenges that need to be harnessed in order to reach the ultimate goal of providing an easy-to-use, natural, and flexible voice interface between people and machines.
A Probabilistic Method for Tracking a Vocalist
, 1998
Cited by 2 (0 self)
When a musician gives a recital or concert, the music performed generally includes accompaniment. To render a good performance, the soloist and the accompanist must know the musical score and must follow the other musician's performance. Both performing and rehearsing are limited by constraints on the time and money available for bringing musicians together. Computer systems that automatically provide musical accompaniment offer an inexpensive, readily available alternative. Effective computer accompaniment requires software that can listen to live performers and follow along in a musical score. This work presents an implemented system and method for automatically accompanying a singer given a musical score. Specifically, I offer a method for robust, real-time detection of a singer's score position and tempo. Robust score following requires combining information obtained both from analyzing a complex signal (the singer's performance) and from processing symbolic notation (the score). Unfortunately, the mapping from the available information to score position does not define a function. Consequently, this work investigated a statistical characterization of a singer's score position and a model that combines the available musical information to produce a probabilistic position estimate. By making
In memory of my brother,
, 1955
This thesis addresses the application of automatic speech recognition to the task of offline closed-captioning of television programs, and describes the collection of corpora to support such research and an exploration of issues to be addressed. The use of automatic speech recognition (ASR) for transcription of broadcast speech and as an aid to captioning is reviewed. As background to the task, the methodology for large vocabulary continuous speech recognition (LVCSR) is presented, with particular attention given to the issues of large vocabulary language modelling and consideration of the acoustic complexity arising in broadcast material. A speech corpus of segmented and transcribed speech utterances for ten program episodes was developed for a typical genre of television programming (travelogues) for which offline closed-captions are applied. The development of this corpus demonstrates the feasibility of using existing closed-caption sources for generating labelled acoustic data suitable for speech recognition research. The speech corpus exhibits far greater acoustic complexity and much lower signal-to-noise ratios than occur in broadcast news data (which has been systematically evaluated in ASR research). Noise-tolerant speech recognisers were developed and effectively

