Flexible, Robust, and Efficient Human Speech Recognition
- In Proc. of the XIVth Int. Congress of Phonetic Sciences
, 1997
Cited by 8 (1 self)
In describing human performance in sound perception, in word recognition, in speech understanding, and in dialogue handling, we generally test human limits under controlled conditions and try to understand the underlying mechanisms; however, the human system itself has already been built by nature. In speech and language technology we would like to equal, or perhaps even outrank, human performance, but we will first have to design the system and develop its modules according to certain specifications. This paper emphasizes the flexibility, robustness, and efficiency of human performance at various levels and tries to indicate lessons to be learned for designing speech and language technology systems.
Large Vocabulary Continuous Speech Recognition: from Laboratory Systems towards Real-World Applications
, 1996
Cited by 6 (4 self)
This paper provides an overview of the state of the art in laboratory speaker-independent, large vocabulary continuous speech recognition (LVCSR) systems with a view towards adapting such technology to the requirements of real-world applications. While in speech recognition the principal concern is to transcribe the speech signal as a sequence of words, the same core technology can be applied to domains other than dictation. The main topics addressed are acoustic-phonetic modeling, lexical representation, language modeling, decoding, and model adaptation. After a brief summary of experimental results, some directions towards usable systems are given. In moving from laboratory systems towards real-world applications, different constraints arise which influence the system design. The application imposes limitations on computational resources, constraints on signal capture, requirements for noise and channel compensation, and a need for rejection capability. The difficulties and costs of adapting existing technology to new languages and applications need to be assessed. Near-term applications for LVCSR technology are likely to grow in somewhat limited domains, such as spoken language systems for information retrieval and limited-domain dictation. Perspectives on some unresolved problems are given, indicating areas for future research.
Pronunciation modeling of Mandarin casual speech – final report
- Proceedings of the Johns Hopkins Summer Workshop
, 2000
Cited by 5 (2 self)
Current ASR systems can usually reach an accuracy above 90% when evaluated on carefully read standard speech, but only around 75% on broadcast news speech. Broadcast news consists of utterances in both clear and casual speaking modes, with large variations in pronunciation. Casual speech has high pronunciation variability because speakers tend to articulate more sloppily. Compared to
Pronunciation Modeling in Speech Synthesis
, 1998
Cited by 4 (0 self)
ACKNOWLEDGMENTS: I am very pleased to have had the encouragement and support of a committee of three linguists for whom I have the greatest respect and admiration: Mark Liberman, William Labov and Eugene Buckley. Each of them made my transition back to Penn pleasant after what seemed like a long absence. It was a great pleasure to have Mark Randolph both as an external reader and as a colleague at Motorola. Mark’s work at MIT a decade ago has served as an inspiration to me. Orhan Karaali made this dissertation possible in this millennium. As my manager for over two years at Motorola, Orhan insisted on making my dissertation a priority at work. Harry Bliss provided his voice to this project and our whole group is very grateful for his patience and cooperation. My colleagues at Motorola listened to my ideas and provided technical and theoretical assistance at every turn: Noel
A Surficial Pronunciation Model
- In: Proc. of the ESCA Workshop ‘Modeling Pronunciation Variation for Automatic Speech Recognition’ (see [87
, 1998
Cited by 3 (0 self)
We argue for a surficial pronunciation model: a model without underlying forms. The surficial model outperforms a traditional generative model by a significant margin on conversational speech (Switchboard) as well as on read speech (TIMIT). Our results suggest that the true mapping from underlying forms to surface forms is too complex to be accurately modeled using current techniques, and that we would be best served to model the surface forms directly.
Automatic generation of pronunciation lexicons for Mandarin spontaneous speech
- In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
, 2001
Cited by 2 (2 self)
Pronunciation modeling for large vocabulary speech recognition attempts to improve recognition accuracy by identifying and modeling pronunciations that are not in the ASR system's pronunciation lexicon. Pronunciation variability in spontaneous Mandarin is studied using the newly created CASS corpus of phonetically annotated spontaneous speech. Pronunciation modeling techniques developed for English are applied to this corpus to train pronunciation models, which are then applied to Mandarin Broadcast News transcription.
Improving Tts By Higher Agreement Between
Cited by 1 (1 self)
This paper looks at improving unit selection text-to-speech (TTS) quality by optimizing the agreement between frontend and speech database.
Statistical modeling of pronunciation and production variations for speech recognition
- Proceedings of ICSLP 98, Sydney
, 1998
Cited by 1 (0 self)
In this paper, we propose a procedure for training a pronunciation network with criteria consistent with the optimality objectives of speech recognition systems. In particular, we describe a framework for using maximum likelihood (ML) and minimum classification error (MCE) criteria for pronunciation network optimization. The ML criterion is used to obtain an optimal structure for the pronunciation network based on statistically derived phonological rules. Discrimination among different pronunciation networks is achieved by weighting the pronunciation networks, with the weights optimized by applying the MCE criterion. Experimental results demonstrate improvements in speech recognition accuracy after applying statistically derived phonological rules. It is shown that the impact of pronunciation network weighting on recognition performance is determined by the size of the recognition vocabulary.
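The MCE weighting step can be pictured with a small sketch. It assumes, as a simplification not taken from the paper, that each pronunciation network's discriminant is its acoustic log-likelihood plus a trainable additive log weight; the anti-discriminant, the sigmoid smoothing, the constants `eta` and `gamma`, and the helper names `mce_loss_and_grad` and `train_weights` follow a common generic MCE formulation chosen for illustration.

```python
import math


def mce_loss_and_grad(scores, weights, correct, eta=1.0, gamma=1.0):
    """Smoothed MCE loss for one token and its gradient w.r.t. the log weights.

    scores[k]  : acoustic log-likelihood of the token under network k (assumed given)
    weights[k] : additive log weight of network k (the parameters being trained)
    correct    : index of the network for the correct word; needs >= 2 networks
    """
    # Discriminant per network: acoustic score plus its (hypothetical) log weight.
    g = [s + w for s, w in zip(scores, weights)]
    rivals = [j for j in range(len(g)) if j != correct]
    # Anti-discriminant: (1/eta) * log(mean over rivals of exp(eta * g_j)),
    # computed with max subtraction for numerical stability.
    top = max(eta * g[j] for j in rivals)
    exps = {j: math.exp(eta * g[j] - top) for j in rivals}
    z = sum(exps.values())
    anti = (top + math.log(z / len(rivals))) / eta
    # Misclassification measure: positive when rivals outscore the correct network.
    d = anti - g[correct]
    loss = 1.0 / (1.0 + math.exp(-gamma * d))  # sigmoid-smoothed 0/1 error
    dloss_dd = gamma * loss * (1.0 - loss)
    grad = [0.0] * len(g)
    grad[correct] = -dloss_dd                 # dd/dg_correct = -1
    for j in rivals:
        grad[j] = dloss_dd * exps[j] / z      # dd/dg_j = softmax share of rival j
    return loss, grad


def train_weights(samples, num_networks, lr=0.5, epochs=50, eta=1.0, gamma=1.0):
    """Per-token gradient descent on the log weights; samples are (scores, correct)."""
    weights = [0.0] * num_networks
    for _ in range(epochs):
        for scores, correct in samples:
            _, grad = mce_loss_and_grad(scores, weights, correct, eta, gamma)
            weights = [w - lr * gw for w, gw in zip(weights, grad)]
    return weights
```

With exactly two networks the measure `d` reduces to the rival's score minus the correct network's score, so training simply shifts weight towards the correct network until it wins by a margin; the sigmoid keeps tokens that are already well classified from dominating the updates.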
Estimating Pronunciation Variations from Acoustic Likelihood Score for HMM Reconstruction
- Eurospeech 2001 - Scandinavia
, 2001
It is widely acknowledged that pronunciation modeling is an efficient way to improve recognition performance on spontaneous speech. In pronunciation modeling, almost all methods of generating variation probabilities are based on relative frequency counting from DP alignment. In this paper, we investigate the local model mismatch caused by pronunciation variations and propose to estimate variation probabilities from acoustic likelihood scores. Based on the estimated probabilities, we present a method of reconstructing pre-trained HMMs to include alternate pronunciations by sharing optimal mixture components instead of distributions. Experimental results show that the reconstructed HMM set reduces the syllable error rate by 2.03% absolute compared to the baseline system, and that the accuracy improvement from the proposed method is almost double that obtained with the previous DP-alignment approach.
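The baseline this abstract argues against, relative-frequency counting of variation probabilities from DP alignment, can be sketched in a few lines. This is not the paper's acoustic-likelihood method; it assumes a unit-cost Levenshtein alignment between canonical and observed phone strings, and the names `align`, `variation_probabilities`, and the `-` gap symbol are illustrative.

```python
from collections import Counter, defaultdict

GAP = "-"  # marks an inserted or deleted phone in an alignment


def align(canonical, surface):
    """Unit-cost Levenshtein (DP) alignment of two phone sequences.

    Returns (canonical_phone, surface_phone) pairs; GAP on the surface side
    is a deletion, GAP on the canonical side an insertion.
    """
    n, m = len(canonical), len(surface)
    # dp[i][j] = minimum edit cost aligning canonical[:i] with surface[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if canonical[i - 1] == surface[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # match or substitution
                           dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1)        # insertion
    # Trace back from the bottom-right cell, preferring diagonal moves.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (
                0 if canonical[i - 1] == surface[j - 1] else 1):
            pairs.append((canonical[i - 1], surface[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((canonical[i - 1], GAP))
            i -= 1
        else:
            pairs.append((GAP, surface[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def variation_probabilities(corpus):
    """Relative-frequency estimate of P(surface phone | canonical phone).

    corpus: iterable of (canonical_phones, surface_phones) pairs.
    """
    counts = defaultdict(Counter)
    for canonical, surface in corpus:
        for c, s in align(canonical, surface):
            counts[c][s] += 1
    return {c: {s: k / sum(ctr.values()) for s, k in ctr.items()}
            for c, ctr in counts.items()}
```

For example, aligning canonical `t a t` with the observed realizations `t a d` and `t a t` gives P(d | t) = 0.25 and P(t | t) = 0.75. Because these estimates depend only on symbol counts, they ignore how well the acoustic models actually fit each variant, which is the gap the paper's likelihood-based estimate targets.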
Modeling pronunciation variation using artificial neural networks for English
, 2004
Pronunciation variation in conversational speech causes a significant number of word errors in large vocabulary automatic speech recognition. Rule-based and decision-tree-based approaches have previously been proposed to model pronunciation variation. In this paper, we report our work on modeling pronunciation variation using artificial neural networks (ANN). The results we achieved are significantly better than previously published ones on two different corpora, indicating that ANNs may be better suited for modeling pronunciation variation than other statistical models investigated previously. Our experiments indicate that binary distinctive features can be used to effectively represent the phonological context. We also find that including a pitch accent feature in the input improves the prediction of pronunciation variation on a ToBI-labeled subset of the Switchboard corpus.
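To make the input representation concrete, here is a sketch of encoding a phone's context as binary distinctive features plus a pitch-accent bit, the kind of vector an ANN like the one described could consume. The seven-feature table, the `#` padding symbol, the window width, and `encode_context` are hypothetical simplifications, not the paper's feature set, and the network itself is not shown.

```python
# Illustrative binary distinctive-feature table (hypothetical and heavily
# simplified). "#" pads context positions beyond either edge of the word.
FEATURES = ("consonantal", "voiced", "nasal", "continuant",
            "syllabic", "high", "boundary")

PHONE_FEATURES = {
    "#":  (0, 0, 0, 0, 0, 0, 1),
    "t":  (1, 0, 0, 0, 0, 0, 0),
    "d":  (1, 1, 0, 0, 0, 0, 0),
    "n":  (1, 1, 1, 0, 0, 0, 0),
    "s":  (1, 0, 0, 1, 0, 0, 0),
    "ae": (0, 1, 0, 1, 1, 0, 0),
    "iy": (0, 1, 0, 1, 1, 1, 0),
}


def encode_context(phones, i, pitch_accented, width=1):
    """Binary input vector for predicting the surface form of phones[i].

    Concatenates the feature bundles of phones[i - width] .. phones[i + width]
    (using "#" past either edge), then appends a single pitch-accent bit.
    """
    vec = []
    for k in range(i - width, i + width + 1):
        phone = phones[k] if 0 <= k < len(phones) else "#"
        vec.extend(PHONE_FEATURES[phone])
    vec.append(1 if pitch_accented else 0)
    return vec
```

Compared with a one-hot phone identity, a feature encoding lets phones that share properties (for example, the voiceless stop `t` and the voiceless fricative `s`) share input bits, which is one plausible reason such features generalize well as context.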

