Recognition of conversational telephone speech using the Janus Speech Engine
- IN PROCEEDINGS OF THE ICASSP’97
, 1997
Cited by 31 (12 self)
Recognition of conversational speech is one of the most challenging speech recognition tasks to date. While recognition error rates of 10% or lower can now be reached on speech dictation tasks over vocabularies in excess of 60,000 words, recognition of conversational speech has persistently resisted most attempts at improvements by way of the proven techniques to date. Difficulties arise from shorter words, telephone channel degradation, and highly disfluent and coarticulated speech. In this paper, we describe the application, adaptation, and performance evaluation of our JANUS speech recognition engine to the Switchboard conversational speech recognition task. Through a number of algorithmic improvements, we have been able to reduce error rates from more than 50% word error to 38%, measured on the official 1996 NIST evaluation test set. Improvements include vocal tract length normalization, polyphonic modeling, label boosting, speaker adaptation with and without confidence measures, and speaking mode dependent pronunciation modeling.
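The error figures in this abstract are word error rates. As a rough illustration only, the sketch below computes word error rate as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; official NIST scoring applies text normalization rules that are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch: tokens are split on whitespace, with no
    text normalization (an assumption, not the official scoring).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.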
Comparison of Two Tree-Structured Approaches for Grapheme-to-Phoneme Conversion
, 1996
Cited by 8 (0 self)
Recently, we described a two-step self-learning approach for grapheme-to-phoneme (G2P) conversion [1]. In the first step, grapheme and phoneme strings in the training data are aligned via an iterative Viterbi procedure that may insert graphemic and phonemic nulls where required. In the second step, a Trie structure encoding pronunciation rules is generated. In this paper we describe the alignment module, and give alignment accuracies on the NETtalk database. We also compare transcription accuracies for two approaches to the second step on three databases: the NETtalk database, the CMU dictionary and the French part of the ONOMASTICA lexicon. The two transcription approaches applied in this research are a Trie approach [1] and an approach based on binary decision trees grown by means of the Gelfand-Ravishankar-Delp algorithm [2,3,4]. We discuss the choice of questions for these decision trees - it may be possible to formulate questions about groups of characters (e.g., "is the next lette...
Wide Context Acoustic Modeling In Read Vs. Spontaneous Speech
, 1997
Cited by 7 (2 self)
Context-dependent acoustic models have been applied in speech recognition research for many years, and have been shown to increase the recognition accuracy significantly. The most common approach is to use triphones. Recently, several speech recognition groups have started investigating the use of larger phonetic context windows when building acoustic models. In this paper we discuss some of the computational problems arising from wide context modeling (polyphonic modeling) and present methods to cope with these problems. A two stage decision tree based polyphonic clustering approach is described which implements a more flexible parameter tying scheme. The new clustering approach gave us significant improvement across all tasks - WSJ, SWB, and Spontaneous Scheduling Task - and across all languages involved (German, Spanish, English). We report recognition results based on the JANUS speech recognition toolkit [2, 8] on two tasks comparing acoustic context phenomena in English read versu...
Optimal Tying of HMM Mixture Densities using Decision Trees
, 1996
Cited by 1 (0 self)
Decision trees have been used in speech recognition with large numbers of context-dependent HMM models, to provide models for contexts not seen in training. Trees are usually created by successive node splitting decisions, based on how well a single Gaussian or Poisson density fits the data associated with a node. We introduce a new node splitting criterion, derived from the maximum likelihood fitting of the complex node distributions with Gaussian tied-mixture densities. We also carry the use of decision trees for tying HMM models a step further. In addition to questions about phonetic class of neighbouring phonemes, we allow questions about the HMM model state to be asked. The resulting decision tree maximizes the likelihood by adjusting the amount of parameter tying simultaneously across state and context. Accuracy improvement and model size reduction were evaluated on a gender-dependent 5K closed-vocabulary WSJ task, using the SI-84 and SI-284 training sets, for tied-mixture and continuous HMM models. The new decision trees are shown to reduce both error rate and model size, while being computationally cheap enough to allow consideration of two preceding and two following phones for the context.
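For orientation, the conventional single-Gaussian splitting criterion that the abstract describes as the usual practice can be sketched as follows. This is not the tied-mixture criterion the paper introduces. The sketch assumes scalar observations (real acoustic features are vectors), and the names `gaussian_log_likelihood`, `split_gain`, `best_split`, and the example questions are hypothetical.

```python
import math


def gaussian_log_likelihood(values, var_floor=1e-6):
    """Log-likelihood of scalar values under one maximum-likelihood Gaussian.

    The variance is floored (an assumption of this sketch) so that
    near-constant data does not produce log(0).
    """
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((v - mean) ** 2 for v in values)
    var = max(sum_sq / n, var_floor)
    return -0.5 * n * math.log(2.0 * math.pi * var) - 0.5 * sum_sq / var


def split_gain(parent, left, right):
    """Likelihood gain from modeling the two children separately."""
    return (gaussian_log_likelihood(left)
            + gaussian_log_likelihood(right)
            - gaussian_log_likelihood(parent))


def best_split(samples, questions):
    """Pick the question with the largest likelihood gain.

    samples: list of (context, value) pairs; questions: dict mapping a
    question name to a yes/no predicate on the context. Both formats are
    hypothetical choices for this example.
    """
    parent = [value for _, value in samples]
    best_name, best_gain = None, 0.0
    for name, predicate in questions.items():
        yes = [value for ctx, value in samples if predicate(ctx)]
        no = [value for ctx, value in samples if not predicate(ctx)]
        if not yes or not no:
            continue  # a question that does not divide the data is skipped
        gain = split_gain(parent, yes, no)
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain


if __name__ == "__main__":
    # Toy data: the left-context phone and one scalar feature value.
    samples = [("a", 0.0), ("a", 0.2), ("e", 0.1),
               ("t", 5.0), ("k", 5.1), ("s", 4.9)]
    questions = {
        "left_is_vowel": lambda ctx: ctx in "aeiou",
        "left_is_a": lambda ctx: ctx == "a",
    }
    print(best_split(samples, questions))
```

In the toy data the vowel question separates the two clusters cleanly, so it yields the larger gain. A full tree builder would apply `best_split` recursively and stop when the gain or the sample count falls below a threshold.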
Recognizing Careless Speech
, 1996
Careless speech is a problem for speech recognition. Spoken quickly, strings of words take on a form that is not well represented by concatenations of canonical phonetic transcriptions; fast speech is full of reductions and deletions, even of entire syllables, and coarticulations. Moreover, spontaneous speech is not uniform in speed, and it is difficult to categorize an utterance as fast or slow even with accurate methods of determining rate. The goal of this project has been to characterize careless speech as it actually occurs in the Switchboard corpus and explore different approaches to processing it. The

