From HMM's to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition
, 1996
Speech Recognition System Design Based on Automatically Derived Units
, 1999
Cited by 10 (0 self)
In most speech recognition systems today, acoustic modeling and lexical modeling are viewed as separable problems. Currently the most popular approach is to manually define canonical word pronunciations in terms of phonetic units and let the acoustic models capture differences between actual spoken and canonical pronunciations implicitly with Gaussian mixture models. As a result, these models can be very broad, particularly for casual spontaneous speech. An alternative approach, explored in this thesis, is to learn a unit inventory and pronunciation dictionary from training data using a maximum likelihood objective function. In particular,
Automatic Generation Of A Pronunciation Dictionary Based On A Pronunciation Network
- European Conference on Speech Communication and Technology (EuroSpeech’97)
, 1997
Cited by 5 (0 self)
In this paper, we propose a method for automatically generating a pronunciation dictionary based on a pronunciation neural network that can predict plausible pronunciations (alternative pronunciations) from the canonical pronunciation. This method can generate multiple forms of alternative pronunciations using the pronunciation network for words that only occur a few times in the database and even for unseen words. Experimental results on spontaneous speech show that the automatically-derived pronunciation dictionaries give consistently higher recognition rates and require less computational time for recognition than a conventional dictionary.
1. INTRODUCTION
In spontaneous speech, word pronunciation varies more than in read speech, but in most spontaneous speech recognition systems, actual pronunciation variations are disregarded and only standard pronunciations in citation form (canonical pronunciations) are used. It has been confirmed that an appropriate pronunciation dictionary ...
Unsupervised learning of acoustic sub-word units
- In Proceedings of ACL-08: HLT, Short Papers
, 2008
Cited by 5 (3 self)
Accurate unsupervised learning of phonemes of a language directly from speech is demonstrated via an algorithm for joint unsupervised learning of the topology and parameters of a hidden Markov model (HMM); states and short state-sequences through this HMM correspond to the learnt sub-word units. The algorithm, originally proposed for unsupervised learning of allophonic variations within a given phoneme set, has been adapted to learn without any knowledge of the phonemes. An evaluation methodology is also proposed, whereby the state-sequence that aligns to a test utterance is transduced in an automatic manner to a phoneme-sequence and compared to its manual transcription. Over 85% phoneme recognition accuracy is demonstrated for speaker-dependent learning from fluent, large-vocabulary speech.
1 Automatic Discovery of Phone(me)s
Statistical models learnt from data are extensively used in modern automatic speech recognition (ASR) systems. Transcribed speech is used to estimate conditional models of the acoustics given a phoneme-sequence. The phonemic pronunciation of words and the phonemes of the language, however, are derived almost entirely from linguistic knowledge. In this paper, we investigate whether the phonemes may be learnt automatically from the speech signal. Automatic learning of phoneme-like units has significant implications for theories of language acquisition in babies, but our considerations here are somewhat more technological. We are interested in developing ASR systems for languages or dialects
Speech recognition via phonetically-featured syllables
- Institute of Phonetics, University of the Saarland
, 2000
Cited by 4 (0 self)
We describe recent work on two new automatic speech recognition systems. The first part of this paper describes the components of a system based on phonological features (which we call Espresso-P) in which the values of these features are estimated from the speech signal before being used as the basis for recognition. In the second part of the paper, another system (which we call Espresso-A) is described in which articulatory parameters are used instead of phonological features and a linear dynamical system model is used to perform recognition from automatically estimated values of these articulatory parameters.
1. Phonological feature-based system: Espresso-P
The first 5 sections of this paper report work on the components of a two stage recognition architecture based on phonological features rather than phones. While phonological features have been proposed before as the basis of a speech recognition system (see section 1.2 for a review), the use of features has been out of favour until recently because there had been little success in extracting them from speech waveforms, and a lack of suitable models with
A survey on automatic speech recognition with an illustrative example on continuous speech recognition of Mandarin
- Computat. Linguistics Chinese Language Processing
, 1996
Cited by 2 (0 self)
For the past two decades, research in speech recognition has been intensively carried out worldwide, spurred on by advances in signal processing, algorithms, architectures, and hardware. Speech recognition systems have been developed for a wide variety of applications, ranging from small vocabulary keyword recognition over dial-up telephone lines, to medium size vocabulary voice interactive command and control systems on personal computers, to large vocabulary speech dictation, spontaneous speech understanding, and limited-domain speech translation. In this paper we review some of the key advances in several areas of automatic speech recognition. We also illustrate, by examples, how these key advances can be used for continuous speech recognition of Mandarin. Finally we elaborate the requirements in designing successful real-world applications and address technical challenges that need to be harnessed in order to reach the ultimate goal of providing an easy-to-use, natural, and flexible voice interface between people and machines.
Speech Recognition Based On Acoustically Derived Segment Units
Cited by 2 (0 self)
This paper describes a new method of word model generation based on acoustically derived segment units (henceforth ASUs). An ASU-based approach has the advantages of growing out of human pre-determined phonemes and of consistently generating acoustic units by using the maximum likelihood (ML) criterion. The former advantage is effective when it is difficult to map acoustics to a phone such as with highly co-articulated spontaneous speech. In order to implement an ASU-based modeling approach in a speech recognition system, we must first solve two points: (1) How do we design an inventory of acoustically-derived segmental units and (2) How do we model the pronunciations of lexical entries in terms of the ASUs. As for the second question, we propose an ASU-based word model generation method by composing the ASU statistics, that is, their means, variances and durations. The effectiveness of the proposed method is shown through spontaneous word recognition experiments. 1. INTRODUCTION In ...
Sparseness Achievement in Hidden Markov Models
Cited by 2 (0 self)
In this paper, a novel learning algorithm for Hidden Markov Models (HMMs) has been devised. The key issue is the achievement of a sparse model, i.e., a model in which all irrelevant parameters are set exactly to zero. Alternatively to standard Maximum Likelihood Estimation (Baum-Welch training), in the proposed approach the parameter estimation problem is cast into a Bayesian framework, with the introduction of a negative Dirichlet prior, which strongly encourages sparseness of the model. A modified Expectation Maximization algorithm has been devised, able to determine a MAP (Maximum A Posteriori probability) estimate of HMM parameters in this Bayesian formulation. Theoretical considerations and experimental comparative evaluations on a 2D shape classification task contribute to validate the proposed technique.
Smart Sofa based on Biometric Pattern Recognition
This paper discusses the user-customized interaction for intelligent home environments. The interactive system is based upon the integrated techniques using both speech and face recognition. For essential modules, the speech recognition and synthesis were basically used for a virtual interaction between user and the proposed system. In experiments, the real-time speech recognizer based on the HM-Net (Hidden Markov Network) was incorporated into the proposed system. Besides, the face identification was adopted to customize home environments for a specific user. In evaluation, the results showed that the proposed system was useful and easy to use for intelligent home environments, even though the performance of the speech recognizer was not better than the simulation results owing to the ambient noisy environments.
Japanese Speech Databases for Robust Speech Recognition
- Proceedings of the ICSLP’96, Philadelphia, PA, Volume 4, pp. 2199–2202
, 1996
At ATR, a next-generation speech translation system is under development towards natural trans-language communication. To cope with the various requirements to speech recognition technology for the new system, further research efforts should emphasize the robustness for large vocabulary, speaking variations often found in fast spontaneous speech and speaker variances. These are key problems to be solved not only for speech translation but also for the general use of speech recognition in real environments. In this paper, three large speech databases are designed to cope with these problems in speech recognition and the current status of data collection is reported.

