Large Vocabulary Continuous Speech Recognition: from Laboratory Systems towards Real-World Applications
, 1996
Cited by 6 (4 self)
This paper provides an overview of the state of the art in laboratory speaker-independent, large vocabulary continuous speech recognition (LVCSR) systems with a view towards adapting such technology to the requirements of real-world applications. While in speech recognition the principal concern is to transcribe the speech signal as a sequence of words, the same core technology can be applied to domains other than dictation. The main topics addressed are acoustic-phonetic modeling, lexical representation, language modeling, decoding and model adaptation. After a brief summary of experimental results, some directions towards usable systems are given. In moving from laboratory systems towards real-world applications, different constraints arise which influence the system design. The application imposes limitations on computational resources, constraints on signal capture, requirements for noise and channel compensation, and rejection capability. The difficulties and costs of adapting existing technology to new languages and applications need to be assessed. Near-term applications for LVCSR technology are likely to grow in somewhat limited domains such as spoken language systems for information retrieval, and limited-domain dictation. Perspectives on some unresolved problems are given, indicating areas for future research.
Building a highly accurate Mandarin speech recognizer
- In Proc. ASRU
, 2007
Cited by 6 (4 self)
We describe a highly accurate large-vocabulary continuous Mandarin speech recognizer, a collaborative effort among four research organizations. In particular, we build two acoustic models (AMs) with significant differences but similar accuracy for the purposes of cross adaptation and system combination. This paper elaborates on the main differences between the two systems, where one recognizer incorporates a discriminatively trained feature while the other utilizes a discriminative feature transformation. Additionally, we present an improved acoustic segmentation algorithm and topic-based language model (LM) adaptation. Coupled with increased acoustic training data, we reduced the character error rate (CER) on the DARPA GALE 2006 evaluation set from 18.4% to 15.3%. Index Terms: Mandarin, character error rates, multi-layer perceptrons, discriminative features, acoustic segmentation, LM adaptation, out-of-vocabulary.
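As a rough illustration of the metric quoted in this abstract, the sketch below computes a character error rate as edit distance divided by reference length, and the relative reduction implied by the two reported figures. The function names are hypothetical, and the paper's actual scoring tool and text normalization are not described in the abstract, so this is only an assumed, simplified scoring scheme.

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


def relative_reduction(before: float, after: float) -> float:
    """Relative error reduction, e.g. from 18.4 to 15.3."""
    return (before - after) / before
```

Under this definition, going from 18.4% to 15.3% CER corresponds to a relative reduction of roughly 17%.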
Improved Modeling and Efficiency for Automatic Transcription of Broadcast News
, 2000
Cited by 5 (0 self)
Over the last few years, the DARPA-sponsored Hub4 continuous speech recognition evaluations have pushed speech recognition technology for the very interesting and difficult task of automatically transcribing broadcast news. In this paper, we report on our research and progress on this problem. We focus on individual techniques we developed, rather than on descriptions of our evaluation systems. We provide comparative experimental results showing the improvements obtained with the novel approaches we developed.
1 Introduction
In recent years there has been increasing interest in developing large-vocabulary continuous speech recognition (LVCSR) systems for speech found in real sources. Broadcast news, in particular, has been the testbed for the DARPA-sponsored Hub4 continuous speech recognition (CSR) evaluations over the last few years, and represents a significant challenge to speech recognition researchers. Many interesting problems are associated with the automatic recognition of b...
Statistical Modeling Of Co-Articulation In Continuous Speech Based On Data Driven Interpolation
- Int. Conf. in Acoustics, Speech and Signal Processing
, 1997
Cited by 5 (0 self)
Parsimonious modeling of the context-dependent nature of speech due to co-articulation is very important for improving the performance of speech recognition systems. Numerous approaches have been proposed in the literature to address this problem. However, most of the methods are based on the idea of using context-dependent speech units, which inevitably increases the complexity of the model space. This paper presents a new approach to modeling speech co-articulation whose complexity is comparable to that of context-independent models. In this model, the movement of a sequence of speech signals is characterized by a set of anchor points in the feature vector space that correspond to the target phonemic units. The transitions between the phonemic units due to co-articulation are modeled as interpolations between the target vectors. Two types of parameters are involved in the models: the intrinsic parameters in the models of target units and the auxiliary parameters specifying the trans...
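To make the idea of interpolating between target vectors concrete, here is a minimal sketch that moves a feature vector from one anchor point to the next. It assumes plain linear interpolation with evenly spaced weights; the paper's interpolation is data driven, and its actual weighting scheme is not given in the abstract, so the function name and the weights are illustrative assumptions only.

```python
def interpolate_trajectory(target_a, target_b, num_frames):
    """Return num_frames feature vectors moving linearly from target_a to target_b.

    target_a, target_b: anchor vectors (lists of floats) for two adjacent units.
    Linear weights are an assumption; a data-driven model would learn them.
    """
    if len(target_a) != len(target_b):
        raise ValueError("target vectors must have the same dimension")
    if num_frames < 2:
        raise ValueError("need at least two frames to form a transition")
    frames = []
    for t in range(num_frames):
        w = t / (num_frames - 1)  # 0.0 at the first anchor, 1.0 at the second
        frames.append([(1 - w) * a + w * b for a, b in zip(target_a, target_b)])
    return frames
```

Only the anchor vectors and the transition weights need to be stored, which is consistent with the abstract's claim of complexity comparable to context-independent models.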
Investigation on Mandarin Broadcast News speech recognition
- in Proceedings of ICSLP
, 2006
Cited by 4 (1 self)
This paper describes our efforts in building a competitive Mandarin broadcast news speech recognizer. We successfully incorporated the most popular speech technologies into our system. More importantly, we present two novel algorithms for smoothing pitch features and segmenting Chinese characters into word units. Additionally, we propose to borrow the principle of pointwise mutual information for creating a Chinese word lexicon automatically. Our final system achieved 6.0% character error rate (CER) on dev04 and 16.0% on eval04, with simpler acoustic models, less training data, and a simpler decoding architecture than other state-of-the-art systems, yet was equally competitive. Index Terms: Mandarin speech recognition, character error rate, pitch smoothing, word segmentation.
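The pointwise mutual information principle mentioned in this abstract can be sketched as follows: adjacent characters that co-occur more often than their individual frequencies predict receive a high score and become candidates for lexicon entries. The function name, the bigram-only scope, and the absence of smoothing or thresholds are assumptions for illustration; the abstract does not give the authors' exact procedure.

```python
import math
from collections import Counter


def bigram_pmi(corpus):
    """Map each adjacent character pair to its PMI in bits.

    corpus: list of unsegmented strings. PMI(x, y) = log2(p(x, y) / (p(x) p(y))).
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        pmi[pair] = math.log2(p_xy / (p_x * p_y))
    return pmi
```

Pairs scoring above some chosen threshold could then be merged into candidate words and the process repeated for longer units.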
Mutual Information Phone Clustering for Decision Tree Induction
- Proc. Int. Conf. on Spoken Language Processing (ICSLP)
, 2002
Cited by 4 (0 self)
The paper presents an automatic method for devising the question sets used for the induction of classification and regression trees. The algorithm employed is the well-known mutual information based bottom-up clustering applied to phone bigram statistics. The sets of phones at the nodes in the resulting binary tree are used as question sets for clustering context-sensitive (tri-phone) HMM output distributions in a large vocabulary speech recognizer. The algorithm is shown to perform as well as, and sometimes significantly better than, question sets devised by human experts for a Spanish and a German system, each evaluated on several tasks. It eliminates the need for linguistic expertise and provides a faster solution as well.
Hidden Model Sequence Models for Automatic Speech Recognition
, 2001
Cited by 4 (0 self)
Most modern automatic speech recognition systems make use of acoustic models based on hidden Markov models. To obtain reasonable recognition performance within a large vocabulary framework, the acoustic models usually include a pronunciation model, together with complex parameter tying schemes. In many cases the pronunciation model operates on a phoneme level and is derived independently of the underlying models. In contrast, this work is aimed at improving pronunciation modelling on a sub-phone level in a combined framework. The modelling of pronunciation variation is assumed to be of special importance for recognition of spontaneous speech.
How Effective is Unsupervised Data Collection for Children's Speech Recognition
- In Proceedings of ICSLP
, 1998
Cited by 4 (0 self)
Children present a unique challenge to automatic speech recognition. Today’s state-of-the-art speech recognition systems still have problems handling children’s speech because acoustic models are trained on data collected from adult speech. In this paper we describe an inexpensive way to mend this problem. We collected children’s speech while they interacted with an automated reading tutor. These data were subsequently transcribed by a speech recognition system and automatically filtered. We studied how to use these automatically collected data to improve the performance of a children’s speech recognition system. Experiments indicate that automatically collected data can reduce the error rate significantly on children’s speech.
Dynamically Configurable Acoustic Models For Speech Recognition
- Proc. of ICASSP'98, Seattle
, 1998
Cited by 3 (0 self)
Senones were introduced to share Hidden Markov model (HMM) parameters at a sub-phonetic level in [3] and decision trees were incorporated to predict unseen phonetic contexts in [4]. In this paper, we will describe two applications of the senonic decision tree in (1) dynamically downsizing a speech recognition system for small platforms and in (2) sharing the Gaussian covariances of continuous density HMMs (CHMMs). We experimented with how to balance different parameters to obtain the best trade-off between recognition accuracy and system size. The dynamically downsized system, without retraining, performed even better than the regular Baum-Welch [1] trained system. The shared covariance model provided as good a performance as the unshared full model and thus gave us the freedom to increase the number of Gaussian means to increase the accuracy of the model. Combining the downsizing and covariance sharing algorithms, a total of 8% error reduction was achieved over the Baum-Welch trained ...
Analysis Of Acoustic-Phonetic Variations In Fluent Speech Using Timit
- Proc. ICASSP95
, 1995
Cited by 3 (0 self)
In this paper, we propose a hierarchically structured Analysis of Variance (ANOVA) method to analyze, in a quantitative manner, the contributions of various identifiable factors to the overall acoustic variability exhibited in fluent speech data of TIMIT processed in the form of Mel-Frequency Cepstral Coefficients. The results of the analysis show that the greatest acoustic variability in TIMIT data is explained by the difference among distinct phonetic labels in TIMIT, followed by the phonetic context difference given a fixed phonetic label. The variability among sequential sub-segments within each TIMIT-defined phonetic segment is found to be significantly greater than the gender, dialect region, and speaker factors.
1. INTRODUCTION
It has been known from many years of speech research, both theoretical and empirical, that speech variability at the acoustic level is overwhelming. In fact, such variability constitutes the major obstacle to the construction of machines for high-perform...
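The basic step behind an ANOVA of this kind is splitting total variability into a part explained by a factor and a residual part. The sketch below does this for a single factor and a single scalar feature; the paper's method is hierarchical and operates on cepstral vectors, so this one-way, one-dimensional version and its function name are simplifying assumptions, not the authors' procedure.

```python
def variance_decomposition(groups):
    """Split the total sum of squares into between-group and within-group parts.

    groups: dict mapping a factor level (e.g. a phonetic label) to a list of
    scalar feature values. Returns (between_ss, within_ss).
    """
    values = [v for group in groups.values() for v in group]
    grand_mean = sum(values) / len(values)
    between_ss = 0.0
    within_ss = 0.0
    for group in groups.values():
        group_mean = sum(group) / len(group)
        # Variability explained by the factor: group means around the grand mean.
        between_ss += len(group) * (group_mean - grand_mean) ** 2
        # Residual variability: values around their own group mean.
        within_ss += sum((v - group_mean) ** 2 for v in group)
    return between_ss, within_ss
```

A hierarchical analysis would repeat this split inside each group for the next factor, for example phonetic context within a fixed phonetic label.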

