High Performance Speaker-Independent Phone Recognition Using CDHMM
In Proc. Eurospeech, 1993
Cited by 41 (11 self)
In this paper we report high phone accuracies on three corpora: WSJ0, BREF and TIMIT. The main characteristics of the phone recognizer are: high dimensional feature vector (48), context- and gender-dependent phone models with duration distribution, continuous density HMM with Gaussian mixtures, and n-gram probabilities for the phonotactic constraints. These models are trained on speech data that have either phonetic or orthographic transcriptions using maximum likelihood and maximum a posteriori estimation techniques. On the WSJ0 corpus with a 46 phone set we obtain phone accuracies of 72.4% and 74.4% using 500 and 1600 CD phone units, respectively. Accuracy on BREF with 35 phones is as high as 78.7% with only 428 CD phone units. On TIMIT using the 61 phone symbols and only 500 CD phone units, we obtain a phone accuracy of 67.2%, which corresponds to 73.4% when the recognizer output is mapped to the commonly used 39 phone set. Making reference to our work on large vocabulary CSR, we show that ...
A Fast Lattice-Based Approach to Vocabulary Independent Wordspotting
In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1994
Cited by 26 (7 self)
Practical applications of wordspotting, such as spoken message retrieval and browsing, require the ability to process large amounts of speech data at speeds many times faster than real-time. This paper presents a novel approach to this problem in which all of the stored audio material is preprocessed off-line to generate a phoneme lattice. At search time, putative word matches are found in this lattice using symmetric dynamic programming. The paper presents the details of the algorithms used and compares performance with a number of conventional approaches using a 20 keyword vocabulary on the DARPA Resource Management Task. The results show that the proposed method is very much faster yet performs acceptably compared to conventional systems which depend on keyword-specific training or prior knowledge of the test set vocabulary. 1. INTRODUCTION In recent years, computers have become increasingly able to manipulate non-textual data, and applications such as video and voice mail have ari...
A voice-controlled automatic telephone switchboard and directory information system
Speech Communication, 1997
Cited by 17 (5 self)
The Philips automatic telephone switchboard and directory information system PADIS provides a natural-language user interface to a telephone directory database. Using speech recognition and language understanding technologies, the system offers phone numbers, fax numbers, email addresses, and room numbers as well as direct call completion to a desired party. In this paper, we present the underlying probabilistic framework, the system architecture, and the individual modules for speech recognition, language understanding, dialogue control, and speech output. In addition, we report results on performance and user behaviour obtained from a field test in our research lab with a 600-entry database. We derive a new maximum-a-posteriori decision rule which incorporates database knowledge and dialogue history as constraints in speech recognition and language understanding. It has improved speech understanding accuracy by 19% (in terms of concept error rate), and reduced attribute substitution errors (e.g. recognition of a wrong name) by 38%. The decision rule is implemented in a multi-stage approach as a combination of state-of-the-art speech recognition, partial parsing with an attributed stochastic context-free grammar, and an N-best algorithm which is also described in this paper. The system conducts a flexible mixed-initiative dialogue rather than using a rigid form-filling scheme, and incorporates database knowledge to optimize the dialogue flow.
Speech Recognition System Design Based on Automatically Derived Units
1999
Cited by 10 (0 self)
In most speech recognition systems today, acoustic modeling and lexical modeling are viewed as separable problems. Currently the most popular approach is to manually define canonical word pronunciations in terms of phonetic units and let the acoustic models capture differences between actual spoken and canonical pronunciations implicitly with Gaussian mixture models. As a result, these models can be very broad, particularly for casual spontaneous speech. An alternative approach, explored in this thesis, is to learn a unit inventory and pronunciation dictionary from training data using a maximum likelihood objective function. In particular,
State Tying For Context Dependent Phoneme Models
Cited by 9 (6 self)
In this paper several modifications of two methods for parameter reduction of Hidden Markov Models by state tying are described. The two methods are a data-driven clustering of triphone states with a bottom-up algorithm [3, 9], and a top-down method growing decision trees for triphone states [2, 10]. We investigate several aspects of state tying, such as the possible reduction of the word error rate by state tying, the consequences of different distance measures for the data-driven approach, and modifications of the original decision tree approach such as node merging. The tests were performed on the test corpora for the 5,000-word vocabulary of the WSJ November 92 task and on the evaluation corpora for the 3,000-word VERBMOBIL '95 task. The word error rate was reduced by state tying by 14% for the WSJ task and by 5% for the VERBMOBIL task.
Improved On-Line Handwriting Recognition Using Context Dependent Hidden Markov Models
In Proc. Int. Conference on Document Analysis and Recognition (ICDAR), 1997
Cited by 8 (3 self)
This paper presents the introduction of context dependent Hidden Markov Models for cursive, unconstrained handwriting recognition with large vocabularies. Since context dependent models were successfully introduced to speech recognition ([1], [2], [3]), it seems obvious that the use of trigraphs could also lead to improved on-line handwriting recognition systems [4]. In analogy to triphones in speech recognition, trigraphs are context dependent sub-word units representing a single written character in its left and right context. The tests were conducted on a writer dependent system with three different writers and two different vocabulary sizes (1000 words and 30000 words). The results we obtained with the trigraph-based system compared to the monograph system are very encouraging: a mean relative error reduction of 46% for the 1000 word handwriting recognition system and a mean relative error reduction of 37% for the same system with the 30000 word vocabulary. We believe that this r...
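As a rough illustration of the trigraph idea described in this abstract, the sketch below expands a word into one context-dependent unit per character. The label format `left-char+right` (borrowed from a common triphone naming convention), the boundary symbol `#`, and the function name `to_trigraphs` are assumptions made for illustration; the abstract does not specify how the authors label their units.

```python
def to_trigraphs(word, boundary="#"):
    """Expand a word into trigraph labels of the form 'left-char+right'.

    Each label stands for one written character together with its left and
    right neighbours. The word edges use a hypothetical boundary symbol.
    """
    padded = boundary + word + boundary
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


if __name__ == "__main__":
    print(to_trigraphs("cat"))  # ['#-c+a', 'c-a+t', 'a-t+#']
```

Each distinct label would then index its own model (or a tied model), whereas a monograph system would use a single model per character regardless of context.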
Experiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition
In Proceedings of the Broadcast News Transcription and Understanding Workshop, 1998
Cited by 8 (2 self)
It is well known that the expectation-maximization (EM) algorithm, commonly used to estimate hidden Markov model (HMM) parameters for speech recognition, is sensitive to the initial model parameter values, making appropriate parameter initialization important. We investigate the use of iterative Gaussian splitting and EM training to initialize the desired number of Gaussians per HMM state (or state cluster). We then study merging of Gaussians which contain little training data as an approach to robust parameter estimation. Finally, Gaussian merging and splitting are combined to form the Gaussian Merging-Splitting (GMS) algorithm. Detailed experimental studies show that Gaussian splitting gives similar performance to our previous training algorithm, even though the two algorithms give very different parameter values. The robust parameter estimation from Gaussian merging results in better performance than our old algorithm for speaker-independent models that have a large number of paramete...
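The two elementary operations named in this abstract can be sketched for a one-dimensional mixture component. This is a minimal sketch under stated assumptions, not the GMS algorithm itself: the split perturbs the mean by a fraction `eps` of the standard deviation and halves the weight, and the merge uses moment matching. Both are common textbook choices; the abstract does not say which criteria the authors use, and the names `Gaussian`, `split`, and `merge` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    weight: float  # mixture weight
    mean: float
    var: float     # variance


def split(g, eps=0.2):
    """Split one component into two with means shifted by +/- eps * std."""
    shift = eps * math.sqrt(g.var)
    half = g.weight / 2.0
    return (Gaussian(half, g.mean - shift, g.var),
            Gaussian(half, g.mean + shift, g.var))


def merge(a, b):
    """Merge two components by matching the first two moments of the pair."""
    w = a.weight + b.weight
    mean = (a.weight * a.mean + b.weight * b.mean) / w
    second_moment = (a.weight * (a.var + a.mean ** 2)
                     + b.weight * (b.var + b.mean ** 2)) / w
    return Gaussian(w, mean, second_moment - mean ** 2)


if __name__ == "__main__":
    left, right = split(Gaussian(1.0, 0.0, 4.0), eps=0.5)
    print(left, right)
    print(merge(left, right))
```

In a training loop, splits would typically be followed by EM re-estimation, and merges would be applied to components with too little training data.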
Phonetic Context-Dependency In a Hybrid ANN/HMM Speech Recognition System
1997
Cited by 8 (0 self)
This report uses a bark scale, which has been replaced here with a mel-scale. ...
Large Vocabulary Continuous Speech Recognition: from Laboratory Systems towards Real-World Applications
1996
Cited by 6 (4 self)
This paper provides an overview of the state-of-the-art in laboratory speaker-independent, large vocabulary continuous speech recognition (LVCSR) systems with a view towards adapting such technology to the requirements of real-world applications. While in speech recognition the principal concern is to transcribe the speech signal as a sequence of words, the same core technology can be applied to domains other than dictation. The main topics addressed are acoustic-phonetic modeling, lexical representation, language modeling, decoding and model adaptation. After a brief summary of experimental results, some directions towards usable systems are given. In moving from laboratory systems towards real-world applications, different constraints arise which influence the system design. The application imposes limitations on computational resources, constraints on signal capture, requirements for noise and channel compensation, and rejection capability. The difficulties and costs of adapting existing technology to new languages and applications need to be assessed. Near-term applications for LVCSR technology are likely to grow in somewhat limited domains such as spoken language systems for information retrieval, and limited domain dictation. Perspectives on some unresolved problems are given, indicating areas for future research.
Adaptive Training for Large Vocabulary Continuous Speech Recognition
2006
Cited by 6 (2 self)
Summary: In recent years, there has been a trend towards training large vocabulary continuous speech recognition (LVCSR) systems on a large amount of found data. Found data is recorded from spontaneous speech without careful control of the recording acoustic conditions, for example, conversational telephone speech. Hence, it typically has greater variability in terms of speaker and acoustic conditions than specially collected data. Thus, in addition to the desired speech variability required to discriminate between words, it also includes various non-speech variabilities, for example, the change of speakers or acoustic environments. The standard approach to handle this type of data is to train hidden Markov models (HMMs) on the whole data set as if all data comes from a single acoustic condition. This is referred to as multi-style training, for example speaker-independent training. Effectively, the non-speech variabilities are ignored. Though good performance has been obtained with multi-style systems, these systems account for all variabilities. Improvement may be obtained if the two types of variabilities in the found data are modelled separately. Adaptive training has been proposed for this purpose. In contrast to multi-style training, a set of transforms is used to represent the non-speech variabilities. A canonical

