Tree-Based State Tying for High Accuracy Acoustic Modelling
, 1994
Cited by 139 (15 self)
The key problem to be faced when building a HMM-based continuous speech recogniser is maintaining the balance between model complexity and available training data. For large vocabulary systems requiring cross-word context dependent modelling, this is particularly acute since many such contexts will never occur in the training data. This paper describes a method of creating a tied-state continuous speech recognition system using a phonetic decision tree. This tree-based clustering is shown to lead to similar recognition performance to that obtained using an earlier data-driven approach but to have the additional advantage of providing a mapping for unseen triphones. State-tying is also compared with traditional model-based tying and shown to be clearly superior. Experimental results are presented for both the Resource Management and Wall Street Journal tasks.
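The "mapping for unseen triphones" mentioned in this abstract can be illustrated with a minimal sketch. The phone classes, the tree shape, and the tied-state names below are hypothetical and chosen only for illustration; the abstract does not give the questions or the clustering criterion the paper uses. The point is only that a tree of yes/no questions about phonetic context assigns a tied state to any triphone, whether or not it occurred in training.

```python
from dataclasses import dataclass
from typing import Set, Union


@dataclass
class Leaf:
    tied_state: str  # hypothetical name of a shared (tied) HMM state


@dataclass
class Question:
    side: str              # which context to test: "left" or "right"
    phone_class: Set[str]  # the question asks: is that context phone in this set?
    yes: "Node"
    no: "Node"


Node = Union[Leaf, Question]


def tied_state_for(tree: Node, left: str, right: str) -> str:
    """Walk the tree for a triphone context (left, right) and return its tied state."""
    node = tree
    while isinstance(node, Question):
        context = left if node.side == "left" else right
        node = node.yes if context in node.phone_class else node.no
    return node.tied_state


# Hypothetical phone classes and tree for one state of one centre phone.
NASALS = {"m", "n", "ng"}
VOWELS = {"aa", "ae", "iy", "uw"}

TREE = Question(
    "left", NASALS,
    yes=Leaf("ih_s2_a"),
    no=Question("right", VOWELS, yes=Leaf("ih_s2_b"), no=Leaf("ih_s2_c")),
)

# A context that never occurred in training still reaches a leaf.
print(tied_state_for(TREE, "zh", "iy"))
```

Because every path ends in a leaf, no triphone is left without parameters, which is the property the abstract contrasts with the earlier data-driven clustering.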
The Use of Context in Large Vocabulary Speech Recognition
, 1995
Cited by 93 (0 self)
... decide which contexts are similar and can share parameters. A key feature of this approach is that it allows the construction of models which are dependent upon contextual effects occurring across word boundaries. The use of cross word context dependent models presents problems for conventional decoders. The second part of the thesis therefore presents a new decoder design which is capable of using these models efficiently. The decoder is suitable for use with very large vocabularies and long span language models. It is also capable of generating a lattice of word hypotheses with little computational overhead. These lattices can be used to constrain further decoding, allowing efficient use of complex acoustic and language models. The effectiveness of these techniques has been assessed on a variety of large vocabulary continuous speech recognition tasks and results are presented which analyse performance in terms of computational complexity and recognition accuracy. The experiments dem...
Whistler: A Trainable Text-To-Speech System
- Proc. ICSLP
Cited by 27 (4 self)
We introduce Whistler, a trainable Text-to-Speech (TTS) system, that automatically learns the model parameters from a corpus. Both prosody parameters and concatenative speech units are derived through the use of probabilistic learning methods that have been successfully used for speech recognition. Whistler can produce synthetic speech that sounds very natural and resembles the acoustic and prosodic characteristics of the original speaker. The underlying technologies used in Whistler can significantly facilitate the process of creating generic TTS systems for a new language, a new voice, or a new speech style.
Automatic Generation Of Synthesis Units For Trainable Text-To-Speech Systems
, 1998
Cited by 23 (2 self)
Whistler Text-to-Speech engine was designed so that we can automatically construct the model parameters from training data. This paper will describe in detail the design issues of constructing the synthesis unit inventory automatically from speech databases. The automatic process includes (1) determining the scalable synthesis unit which can reflect spectral variations of different allophones; (2) segmenting the recording sentences into phonetic segments; (3) selecting good instances for each synthesis unit to generate the best synthesis sentence during run time. These processes are all derived through the use of probabilistic learning methods which are aimed at the same optimization criteria. Through this automatic unit generation, Whistler can automatically produce synthetic speech that sounds very natural and resembles the acoustic characteristics of the original speaker. 1. INTRODUCTION In [4][7], we have presented Whistler: Microsoft's Trainable Text-to-Speech (TTS) System. In contras...
Production Models As A Structural Basis For Automatic Speech Recognition
, 1996
Cited by 21 (1 self)
We postulate in this paper that highly structured speech production models will have much to contribute to the ultimate success of speech recognition in view of the weaknesses of the theoretical foundation underpinning current technology. These weaknesses are analyzed in terms of phonological modeling and of phonetic-interface modeling. We conclude by suggesting that many of the advantages to be gained from interaction between speech production and speech recognition communities will develop from integrating models from the production community with the probabilistic analysis-by-synthesis strategy currently used by the technology community. RÉSUMÉ (translated from French) In this article, we propose that speech production models will contribute greatly to the eventual success of automatic recognition models, which are currently limited by the weaknesses of the theoretical basis of present technology. We analyze these weaknesses at the level of phonological models and ...
Speech Recognition using Neural Networks
, 1995
Cited by 21 (0 self)
This thesis examines how artificial neural networks can benefit a large vocabulary, speaker independent, continuous speech recognition system. Currently, most speech recognition systems are based on hidden Markov models (HMMs), a statistical framework that supports both acoustic and temporal modeling. Despite their state-of-the-art performance, HMMs make a number of suboptimal modeling assumptions that limit their potential effectiveness. Neural networks avoid many of these assumptions, while they can also learn complex functions, generalize effectively, tolerate noise, and support parallelism. While neural networks can readily be applied to acoustic modeling, it is not yet clear how they can be used for temporal modeling. Therefore, we explore a class of systems called NN-HMM hybrids, in which neural networks perform acoustic modeling, and HMMs perform temporal modeling. We argue that a NN-HMM hybrid has several theoretical advantages over a pure HMM system, including better acoustic ...
Deleted Interpolation And Density Sharing For Continuous Hidden Markov Models
- In Proc. ICASSP, Atlanta
, 1996
Cited by 17 (1 self)
As one of the most powerful smoothing techniques, deleted interpolation has been widely used in both discrete and semi-continuous hidden Markov model (HMM) based speech recognition systems. For continuous HMMs, most smoothing techniques are carried out on the parameters themselves such as Gaussian mean or covariance parameters. In this paper, we propose to smooth the probability density values instead of the parameters of continuous HMMs. This allows us to use most of the existing smoothing techniques for both discrete and continuous HMMs. We also point out that our deleted interpolation can be regarded as a parameter sharing technique. We further generalize this sharing to the probability density function (PDF) level, in which each PDF becomes a basic unit and can be freely shared across any Markov state. For a wide range of dictation experiments, deleted interpolation reduced the word error rate by 11% to 23% over other simple parameter smoothing techniques like flooring. Generic PD...
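The idea of smoothing density values rather than parameters can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: it assumes one-dimensional Gaussians, a single interpolation weight, and a plain EM update on held-out samples; the function names and all numbers are hypothetical. The smoothed density is lam * p_detailed(x) + (1 - lam) * p_general(x), so the Gaussian means and variances themselves are never modified.

```python
import math
from typing import Callable, List


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def estimate_weight(detailed: Callable[[float], float],
                    general: Callable[[float], float],
                    held_out: List[float],
                    iterations: int = 20) -> float:
    """EM estimate of the interpolation weight on data not used to train the densities."""
    lam = 0.5
    for _ in range(iterations):
        # E-step: posterior probability that each held-out sample came from the detailed density.
        posteriors = []
        for x in held_out:
            d, g = lam * detailed(x), (1.0 - lam) * general(x)
            posteriors.append(d / (d + g))
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam


def smoothed_density(x: float, lam: float,
                     detailed: Callable[[float], float],
                     general: Callable[[float], float]) -> float:
    """Interpolate the two density values; the underlying parameters stay untouched."""
    return lam * detailed(x) + (1.0 - lam) * general(x)


# Hypothetical densities: a sharply peaked context-dependent one and a broad general one.
def detailed(x: float) -> float:
    return gaussian_pdf(x, 0.0, 0.01)


def general(x: float) -> float:
    return gaussian_pdf(x, 0.0, 1.0)


held_out = [-1.0, -0.3, 0.0, 0.4, 1.2]  # hypothetical held-out observations
lam = estimate_weight(detailed, general, held_out)
print(lam, smoothed_density(0.5, lam, detailed, general))
```

Because the interpolation operates on density values, the same weight-estimation step applies regardless of how each component density is parameterized, which is consistent with the abstract's remark that existing smoothing techniques carry over to continuous HMMs.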
Recent Improvements on Microsoft’s Trainable Text-to-Speech Synthesizer: Whistler
- In ICASSP-97, volume II
, 1997
Cited by 8 (2 self)
Whistler Text-to-Speech engine was designed so that we can automatically construct the model parameters from training data [7]. This paper will focus on recent improvements on prosody and acoustic modeling, which are all derived through the use of probabilistic learning methods. Whistler can produce synthetic speech that sounds very natural and resembles the acoustic and prosodic characteristics of the original speaker. The underlying technologies used in Whistler can significantly facilitate the process of creating generic TTS systems for a new language, a new voice, or a new speech style. Whisper TTS engine supports Microsoft Speech API [10] and requires less than 3 MB of working memory.
Can Continuous Speech Recognizers Handle Isolated Speech?
, 1997
Cited by 8 (1 self)
Continuous speech is far more natural and efficient than isolated speech for communication. However, for current state-of-the-art automatic speech recognition systems, isolated speech recognition (ISR) is far more accurate than continuous speech recognition (CSR). It is common practice in the speech research community to build CSR systems using only CS data. However, slowing of the speaking rate is a natural reaction for a user faced with the high error rates of current CSR systems. Ironically, CSR systems typically have a much higher word error rate when speakers slow down since the acoustic models are usually derived exclusively from continuous speech corpora. In this paper, we summarize our efforts to improve the robustness of our speaker-independent CSR system against speaking styles, without suffering a recognition accuracy penalty. In particular the multi-style trained system described in this paper attains a 7.0% word error rate for a test set consisting of both isolated and con...
Techniques for the Creation and Exploration of Digital Video Libraries
- in Multimedia Tools and Applications, B. Furht, Editor
, 1996
Cited by 7 (0 self)
Introduction The Information Age is fully upon us. A recent article noted that there are perhaps 50 million people using the Internet on a regular basis, and that "the current growth rate is about 15% per month (!) and this could well continue until almost all of those in the `developed world' are connected" [Fenn94, p. 30]. In addition, the digital domain consists not only of text but increasingly of other media representations, from graphics images to audio to motion video. As the amount of information and number of users exponentially escalate, more attention focuses on the basic problems of information management: How do you digitize information? How can you then visualize it and find what you need? How do you use and manipulate it effectively? How is it stored and managed? The proliferation of technical articles and special issues addressing these questions underscores their importance; see for example the special issue on content-based retrieval [Narasimhalu95] or digital ...

