Toward the realization of spontaneous speech recognition – Introduction of a Japanese Priority Program and preliminary results
Proc. ICSLP, Beijing, 2000
Cited by 35 (6 self)
Although state-of-the-art speech recognition technology achieves high accuracy for speech read from a written text or similar, its accuracy is quite poor for freely spoken spontaneous speech. From this perspective, a five-year national project for raising the technological level of speech recognition and understanding commenced in Japan in 1999. The project focuses on building a large-scale spontaneous speech corpus and on acoustic and linguistic modeling for spontaneous speech recognition and summarization. This paper reports results of preliminary experiments conducted at Tokyo Institute of Technology. The experimental results show that acoustic and language modeling based on the actual spontaneous speech corpus is far more effective than modeling based on read speech. It was also shown that the proposed automatic speech summarization method could effectively extract relatively important information and remove redundant and irrelevant information.
Recent Development of Open-Source Speech Recognition Engine Julius
Cited by 2 (1 self)
Julius is an open-source large-vocabulary speech recognition software used for both academic research and industrial applications. It executes real-time speech recognition of a 60k-word dictation task on low-spec PCs with a small footprint, and even on embedded devices. Julius supports standard language models such as statistical N-gram models and rule-based grammars, as well as Hidden Markov Models (HMMs) as acoustic models. Users can build a speech recognition system for their own purposes, or integrate the speech recognition capability into a variety of applications using Julius. This article gives an overview of Julius, describes its major features and specifications, and summarizes the developments conducted in recent years.
GMM and HMM Training by Aggregated EM Algorithm with Increased Ensemble Sizes for Robust Parameter Estimation
Cited by 1 (1 self)
In order to compensate for the weakness of the expectation maximization (EM) algorithm to over-training and to improve model performance for new data, we have recently proposed the aggregated EM (Ag-EM) algorithm, which introduces a bagging-like approach in the framework of the EM algorithm, and have shown that it gives similar improvements to cross-validation EM (CV-EM) over conventional EM. However, a limitation of those experiments was that the number of models used in the aggregation operation, that is, the ensemble size, was fixed to a small value. Here, we investigate the relationship between the ensemble size and the performance, and give a theoretical discussion of the order of the computational cost. The algorithm is first analyzed using simulated data and then applied to large vocabulary speech recognition on oral presentations. Both of these experiments show that Ag-EM outperforms CV-EM when larger ensemble sizes are used. Index Terms — Expectation maximization algorithm, ensemble training, bagging, sufficient statistics, hidden Markov model
Spontaneous Speech Recognition Using a Massively Parallel Decoder
Proc. Interspeech-ICSLP, Jeju, Korea, vol. 3, 2001
Since spontaneous utterances include many variations, speaker- and task-independent general models do not work well. This paper proposes combining cluster-based language and acoustic models based on the framework of the Massively Parallel Decoder (MPD). The MPD is a parallel decoder that has a large number of decoding units, in which each unit is assigned to a combination of element models. It runs efficiently on a parallel computer, and thus the turnaround time is comparable to that of conventional decoders using a single model and a single processor. In the experiments conducted using lecture speeches from the Corpus of Spontaneous Japanese, two types of cluster models have been investigated: lecture-based cluster models and utterance-based cluster models. It has been confirmed that utterance-based cluster models give a significantly lower recognition error rate than lecture-based cluster models in both language and acoustic modeling. It has also been shown that roughly 100 decoding units are enough in terms of recognition rate, and in the best setting, a 12% reduction in word error rate was obtained in comparison with the conventional decoder.
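The word error rate quoted in the abstract above is conventionally computed from a word-level edit distance between a reference transcript and the recognizer output. The following is a minimal sketch of that standard computation, not the paper's scoring tool. The function names and the example numbers are hypothetical, and reading the 12% figure as a relative reduction is an assumption, since the abstract does not say whether it is relative or absolute.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of a new system over a baseline (assumed reading)."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only; they are not from the paper.
print(word_error_rate("a b c d", "a x c"))
print(relative_reduction(0.25, 0.22))
```

Under the relative reading, a baseline at 25% WER and a new system at 22% WER would correspond to a 12% reduction; under an absolute reading the same figure would mean a drop of 12 percentage points.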
Analysis On Individual Differences In Automatic Transcription Of
Proc. IEEE ICASSP, 2002
This paper reports an analysis of individual differences in spontaneous presentation speech recognition performance. Ten minutes from each presentation given by 50 male speakers, for a total of 500 minutes, were automatically recognized for the analysis. Correlation and regression analyses were applied to the word recognition accuracy and various speaker attributes. A restricted set of speaker attributes comprising the speaking rate, the out-of-vocabulary rate and the repair rate was found to contribute most significantly to individual differences in the word accuracy. Unsupervised MLLR speaker adaptation worked well for improving the word accuracy but did not change the structure of the individual differences. Approximately half of the variance in the word accuracy was explained by a regression model using the limited set of three attributes.
Towards Automatic Transcription of Spontaneous Presentations
2001
This paper reports various investigations on recognizing spontaneous presentation speech in connection with the "Spontaneous Speech" national project started in 1999. Presentation speech uttered by 10 male speakers, of approximately 4.5 hours duration, has been recognized. Experimental results show that acoustic and language modeling based on an actual spontaneous speech corpus is far more effective than conventional modeling based on read speech. The recognition accuracy shows wide speaker-to-speaker variability according to the speaking rate, the number of fillers, the number of repairs, etc. It was confirmed that unsupervised speaker adaptation of acoustic models was effective in improving the recognition accuracy. The recognition accuracy for spontaneous speech is, however, still rather low, and a large number of research issues remain.
Aggregated Cross-validation and Its Efficient Application to Gaussian Mixture Optimization
We have previously proposed a cross-validation (CV) based Gaussian mixture optimization method that efficiently optimizes the model structure based on CV likelihood. In this study, we propose aggregated cross-validation (AgCV), which introduces a bagging-like approach in the CV framework to reinforce the model selection ability. While a single model is used in CV to evaluate a held-out subset, AgCV uses multiple models to reduce the variance in the score estimation. By integrating AgCV instead of CV in the Gaussian mixture optimization algorithm, an AgCV likelihood based Gaussian mixture optimization algorithm is obtained. The algorithm works efficiently by using sufficient statistics and can be applied to large models such as Gaussian mixture HMMs. The proposed algorithm is evaluated by speech recognition experiments on oral presentations, and it is shown that lower word error rates are obtained by the AgCV optimization method when compared to CV and MDL based methods.
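The role of sufficient statistics in a CV likelihood, as mentioned in the abstract above, can be illustrated on a single 1-D Gaussian: the model for each held-out fold is obtained by subtracting that fold's statistics from the totals, so no data is re-read. The sketch below is an illustration under stated assumptions, not the authors' algorithm. In particular, the aggregation scheme in `aggregated_cv_log_likelihood` (averaging over models that each leave out one additional fold) is a hypothetical stand-in, since the abstract does not specify how the multiple models are formed, and all function names are invented for this example.

```python
import math
import random


def stats(xs):
    """Sufficient statistics of a 1-D Gaussian: count, sum, sum of squares."""
    return (len(xs), sum(xs), sum(x * x for x in xs))


def subtract(a, b):
    """Remove the statistics b from the statistics a."""
    return tuple(p - q for p, q in zip(a, b))


def gaussian_from_stats(s, var_floor=1e-6):
    """Maximum-likelihood mean and (floored) variance from statistics."""
    n, sx, sxx = s
    mean = sx / n
    var = max(sxx / n - mean * mean, var_floor)
    return mean, var


def log_likelihood(xs, model):
    mean, var = model
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var) for x in xs
    )


def cv_log_likelihood(folds):
    """K-fold CV: each fold is scored by one model built from the other folds."""
    fold_stats = [stats(f) for f in folds]
    total = tuple(sum(c) for c in zip(*fold_stats))
    return sum(
        log_likelihood(f, gaussian_from_stats(subtract(total, s)))
        for f, s in zip(folds, fold_stats)
    )


def aggregated_cv_log_likelihood(folds):
    """ASSUMED aggregation scheme (needs at least 3 folds): fold k is scored
    by every model that leaves out fold k plus one other fold j, and the
    resulting log-likelihoods are averaged."""
    fold_stats = [stats(f) for f in folds]
    total = tuple(sum(c) for c in zip(*fold_stats))
    score = 0.0
    for k, fold in enumerate(folds):
        without_k = subtract(total, fold_stats[k])
        scores = [
            log_likelihood(fold, gaussian_from_stats(subtract(without_k, s)))
            for j, s in enumerate(fold_stats)
            if j != k
        ]
        score += sum(scores) / len(scores)
    return score


# Synthetic data for illustration only.
random.seed(0)
data = [random.gauss(1.0, 2.0) for _ in range(60)]
folds = [data[i::4] for i in range(4)]
print(cv_log_likelihood(folds), aggregated_cv_log_likelihood(folds))
```

The same subtraction trick extends to Gaussian mixtures when the per-fold statistics are accumulated per component, which is what makes a CV-style score affordable for large models.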

