Experiments in automatic meeting transcription using JRTK
- In Proceedings of ICASSP'98
Cited by 6 (5 self)
In this paper we describe our early exploration of automatic recognition of conversational speech in meetings for use in automatic summarizers and browsers to produce meeting minutes effectively and rapidly. To achieve optimal performance we started from two different baseline English recognizers adapted to meeting conditions and tested resulting performance. The data were found to be highly disfluent (conversational human to human speech), noisy (due to lapel microphones and environment), and overlapped with background noise, resulting in error rates comparable so far to those on the CallHome conversational database (40-50% WER). A meeting browser is presented that allows the user to search and skim through highlights from a meeting efficiently despite the recognition errors.
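The WER figure quoted above is a word-level edit distance normalized by the reference length. A minimal sketch of that metric follows; the function name and the whitespace tokenization are assumptions made for illustration, not details taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()   # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))  # prints 0.5
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.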
Using chunk based partial parsing of spontaneous speech in unrestricted domains for reducing word error rate in speech recognition
- In Proceedings of COLING-ACL 98
, 1998
Cited by 6 (1 self)
In this paper, we present a chunk based partial parsing system for spontaneous, conversational speech in unrestricted domains. We show that the chunk parses produced by this parsing system can be usefully applied to the task of reranking N-best lists from a speech recognizer, using a combination of chunk-based n-gram model scores and chunk coverage scores. The input for the system is N-best lists generated from speech recognizer lattices. The hypotheses from the N-best lists are tagged for part of speech, "cleaned up" by a preprocessing pipe, parsed by a part of speech based chunk parser, and rescored using a backpropagation neural net trained on the chunk based scores. Finally, the reranked N-best lists are generated. The results of a system evaluation are promising in that a chunk accuracy of 87.4% is achieved and the best performance on a randomly selected test set is a decrease in word error rate of 0.3 percent (absolute), measured on the new first hypotheses in the reranked N-best lists.
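The reranking step can be pictured as re-sorting an N-best list by a combined score. The sketch below is illustrative only: the abstract describes a backpropagation neural net as the combiner, whereas this example swaps in a fixed linear weighting, and the `Hypothesis` fields, the use of a recognizer score, and all weights and values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    recognizer_score: float  # assumed: log score from the recognizer
    chunk_lm_score: float    # assumed: chunk-based n-gram log score
    chunk_coverage: float    # assumed: fraction of words inside chunks, 0..1


def rerank(nbest, w_rec=1.0, w_lm=0.5, w_cov=2.0):
    """Sort hypotheses by a linear combination of scores, best first.

    The linear combination stands in for the trained neural net mentioned
    in the abstract; the weights are hypothetical.
    """
    def combined(h):
        return (w_rec * h.recognizer_score
                + w_lm * h.chunk_lm_score
                + w_cov * h.chunk_coverage)
    return sorted(nbest, key=combined, reverse=True)


nbest = [
    Hypothesis("uh i want go there", -10.0, -8.0, 0.4),
    Hypothesis("i want to go there", -10.5, -5.0, 1.0),
]
print(rerank(nbest)[0].text)  # prints "i want to go there"
```

In this toy list the second hypothesis starts with a slightly worse recognizer score but moves to the top once the chunk-based scores are added.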
Applying Divide and Conquer to Large Scale Pattern Recognition Tasks
, 1996
Cited by 6 (0 self)
Rather than presenting a specific trick, this paper aims at providing a methodology for large scale, real-world classification tasks involving thousands of classes and millions of training patterns. Such problems arise in speech recognition, handwriting recognition and speaker or writer identification, just to name a few. Given the typically very large number of classes to be distinguished, many approaches focus on parametric methods to independently estimate class conditional likelihoods. In contrast, we demonstrate how the principles of modularity and hierarchy can be applied to directly estimate posterior class probabilities in a connectionist framework. Apart from offering better discrimination capability, we argue that a hierarchical classification scheme is crucial in tackling the above mentioned problems. Furthermore, we discuss training issues that have to be addressed when an almost infinite amount of training data is available.
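A hierarchical scheme of this kind can be read as factoring the class posterior into a posterior over groups times a posterior over classes within each group. The following is a minimal two-level sketch under that reading; the softmax estimators, the raw scores, and the function names are assumptions standing in for the trained networks.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_posteriors(group_scores, class_scores_per_group):
    """Compute p(group, class | x) = p(group | x) * p(class | group, x).

    group_scores: raw scores of a (hypothetical) top-level estimator.
    class_scores_per_group: one list of raw scores per group, from
    (hypothetical) per-group estimators.
    """
    posteriors = {}
    group_post = softmax(group_scores)
    for g, (p_g, class_scores) in enumerate(zip(group_post, class_scores_per_group)):
        for c, p_c_given_g in enumerate(softmax(class_scores)):
            posteriors[(g, c)] = p_g * p_c_given_g
    return posteriors


posts = hierarchical_posteriors([0.2, -0.1], [[1.0, 0.0, -1.0], [0.5, 0.5]])
print(round(sum(posts.values()), 6))  # prints 1.0
```

Each estimator only has to discriminate among a few alternatives, yet the products along the paths still form a proper distribution over all leaf classes.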
Recognizing Non-Native Speech: Characterizing and Adapting to Non-Native Usage in LVCSR
, 2001
Cited by 6 (1 self)
Low-proficiency non-native speakers represent a significant challenge for large-vocabulary continuous speech recognition (LVCSR). Acoustic models are confused by a heavy accent
Effective structural adaptation of LVCSR systems to unseen domains using hierarchical connectionist acoustic models
- In Proceedings of the ICSLP
, 1998
Cited by 5 (2 self)
We present an approach to efficiently and effectively downsize and adapt the structure of large vocabulary conversational speech recognition (LVCSR) systems to unseen domains, requiring only small amounts of transcribed adaptation data. Our approach aims at bringing today's mostly task dependent systems closer to the goal of domain independence. To achieve this, we rely on the ACID/HNN framework [2, 3], a hierarchical connectionist modeling paradigm that allows a tree structured modeling hierarchy to be dynamically adapted to the differing specificity of phonetic context in new domains. Experimental validation of the proposed approach has been carried out by adapting the size and structure of ACID/HNN based acoustic models trained on Switchboard to two quite different, unseen domains, Wall Street Journal and an English Spontaneous Scheduling Task. In both cases, our approach yields considerably downsized acoustic models with performance improvements of up to 18% over the unadapted baseline models.
You're Not From Round Here, Are You? Naive Bayes Detection of Non-native Utterance Text
- In Proc. of the Second NAACL
, 2001
Cited by 4 (1 self)
Native and non-native use of language differs, depending on the proficiency of the speaker, in clear and quantifiable ways. It has been shown that customizing the acoustic and language models of a natural language understanding system can significantly improve handling of non-native input; in order to make such a switch, however, the nativeness status of the user must be known. In this paper, we show that naive Bayes classification can be used to identify non-native utterances of English. The advantage of our method is that it relies on text, not on acoustic features, and can be used when the acoustic source is not available. We demonstrate that both read and spontaneous utterances can be classified with high accuracy, and that classification of errorful speech recognizer hypotheses is more accurate than classification of perfect transcriptions. We also characterize part-of-speech sequences that play a role in detecting non-native speech.
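Text-only naive Bayes classification of this kind can be sketched with word counts and add-one smoothing. Everything in the example below is an assumption made for illustration: the class name, the word-token features (the abstract also mentions part-of-speech sequences), and the toy training utterances, which are invented and not drawn from the paper's data.

```python
import math
from collections import Counter


class NaiveBayesText:
    """Multinomial naive Bayes over word tokens with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = {}        # label -> Counter of tokens
        self.doc_counts = Counter()  # label -> number of training utterances
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy utterances, for illustration only.
clf = NaiveBayesText().fit(
    ["i would like to book a flight", "could you tell me the time",
     "i want book flight", "you tell me time please"],
    ["native", "native", "non-native", "non-native"],
)
print(clf.predict("i want book time"))  # prints "non-native"
```

Since the classifier sees only token sequences, it can run on recognizer hypotheses or transcripts when no audio is available, which is the advantage the abstract points out.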
Streamlining The Front End Of A Speech Recognizer
, 2000
Cited by 2 (1 self)
In this paper we seek to streamline various operations within the front end of a speech recognizer, both to reduce unnecessary computation and to simplify the conceptual framework. First, a novel view of the front end in terms of linear transformations is presented. Then we study the invariance property of recognition performance with respect to linear transformations (LT) at the front end. Analysis reveals that several LT steps can be consolidated into a single LT, which effectively eliminates the Discrete Cosine Transform (DCT) step, part of the traditional MFCC (Mel-Frequency Cepstral Coefficient) front end. Moreover, a highly simplified, data-driven front-end scheme is proposed as a direct generalization of this idea. The new setup has no Mel-scale filtering, another part of the MFCC front end. Experimental results show a 5% relative improvement on the Broadcast News task. 1. LINEAR TRANSFORMATIONS IN THE TRADITIONAL FRONT END The front end is a relatively independent component ...
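The consolidation argument rests on the fact that applying one linear transformation after another equals applying their matrix product once. A small sketch of that identity follows, using an unnormalized DCT-II matrix followed by a second matrix; the second matrix `T`, the dimensions, and the input values are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(a, x):
    """Matrix-vector product."""
    return [sum(row[k] * x[k] for k in range(len(x))) for row in a]


def dct_matrix(n_out, n_in):
    """Unnormalized DCT-II basis: n_out cosine rows over n_in inputs."""
    return [[math.cos(math.pi * i * (j + 0.5) / n_in) for j in range(n_in)]
            for i in range(n_out)]


log_filterbank = [1.2, 0.7, -0.3, 0.5]      # hypothetical input frame
D = dct_matrix(3, 4)                        # the DCT step
T = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]   # hypothetical later LT step

two_step = matvec(T, matvec(D, log_filterbank))   # DCT, then T
one_step = matvec(matmul(T, D), log_filterbank)   # single consolidated LT
print(all(abs(u - v) < 1e-9 for u, v in zip(two_step, one_step)))  # prints True
```

Once the product `matmul(T, D)` is precomputed, the DCT no longer appears as a separate run-time step.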
Conversational Speech Systems For On-Board Car Navigation And Assistance
- ICSLP '98
, 1998
Cited by 2 (1 self)
This paper describes our latest efforts in building a speech recognizer for operating a navigation system through speech instead of typed input. Compared to conventional speech recognition for navigation systems, where the input is usually restricted to a fixed
Fuzzy Class Rescoring: A Part-Of-Speech Language Model
Cited by 1 (0 self)
Current speech recognition systems usually use word-based trigram language models. More elaborate models are applied to word lattices or N-best lists in a rescoring pass following the acoustic decoding process. In this paper we consider techniques for dealing with class-based language models in the lattice rescoring framework of our JANUS large vocabulary speech recognizer. We demonstrate how to interpolate with a Part-of-Speech (POS) tag-based language model as an example of a class-based model, where a word can be a member of many different classes. Here the actual class membership of a word in the lattice becomes a hidden event of the A* algorithm used for rescoring. A forward type of algorithm is defined as an extension of the lattice rescorer to handle these hidden events in a mathematically sound fashion. Applying the mixture of Viterbi and forward kind of rescoring procedure to the German Spontaneous Scheduling Task (GSST) yields some improvement in word accuracy. Above all, the resc...
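The difference between Viterbi-style and forward-style scoring with hidden class membership can be shown on a single word string: the forward score sums over all class sequences, while the Viterbi score keeps only the best one. The sketch below uses a class bigram model on one word sequence rather than a lattice with A* search, and every probability in it is invented for illustration.

```python
def class_lm_scores(words, classes_of, emit, trans, start):
    """Return (forward_total, viterbi_best) for a word sequence.

    classes_of[w]: classes the word w may belong to.
    emit[(w, c)]: P(w | c); trans[(c_prev, c)]: P(c | c_prev); start[c]: P(c).
    """
    first = words[0]
    # forward[c]: total probability of all class sequences ending in class c
    forward = {c: start[c] * emit[(first, c)] for c in classes_of[first]}
    # viterbi[c]: probability of the single best class sequence ending in c
    viterbi = dict(forward)
    for w in words[1:]:
        new_f, new_v = {}, {}
        for c in classes_of[w]:
            e = emit[(w, c)]
            new_f[c] = e * sum(p * trans[(prev, c)] for prev, p in forward.items())
            new_v[c] = e * max(p * trans[(prev, c)] for prev, p in viterbi.items())
        forward, viterbi = new_f, new_v
    return sum(forward.values()), max(viterbi.values())


# Invented toy model: "plan" and "works" can each be a noun or a verb.
classes_of = {"the": ["DET"], "plan": ["NOUN", "VERB"], "works": ["NOUN", "VERB"]}
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
         ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
         ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4}
emit = {("the", "DET"): 0.5, ("plan", "NOUN"): 0.1, ("plan", "VERB"): 0.05,
        ("works", "NOUN"): 0.02, ("works", "VERB"): 0.04}

total, best = class_lm_scores(["the", "plan", "works"], classes_of, emit, trans, start)
print(total > best)  # prints True
```

For this toy model the forward total is about 0.00096 and the best single path about 0.000756; the gap is the probability mass of the alternative class assignments that a pure Viterbi score discards.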
Hierarchies of Neural Networks for Connectionist Speech Recognition
- Proceedings of the European Symposium on Artificial Neural Networks (ESANN ’98), Bruges
, 1998
Cited by 1 (0 self)
We present a principled framework for context-dependent hierarchical connectionist HMM speech recognition. Based on a divide-and-conquer strategy, our approach uses an Agglomerative Clustering algorithm based on Information Divergence (ACID) to automatically design a soft classifier tree for an arbitrarily large number of HMM states. Nodes in the classifier tree are instantiated with small estimators of local conditional posterior probabilities, in our case feed-forward neural networks. Our framework represents an effective decomposition of state posteriors with advantages over traditional acoustic models. We evaluate the effectiveness of our Hierarchies of Neural Networks (HNN) on the Switchboard large vocabulary conversational speech recognition (LVCSR) corpus.
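Agglomerative clustering driven by a divergence measure can be sketched as repeatedly merging the two closest clusters until a single tree remains. The example below is not the ACID algorithm itself: it assumes discrete distributions, a symmetric Kullback-Leibler divergence as the distance, and weight-averaged merging, all chosen only to illustrate the bottom-up tree construction.

```python
import math


def sym_kl(p, q):
    """Symmetric KL divergence, KL(p||q) + KL(q||p), for positive entries."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def agglomerate(dists, weights):
    """Build a binary merge tree (nested tuples of the original indices)."""
    clusters = [(i, list(d), w) for i, (d, w) in enumerate(zip(dists, weights))]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest divergence.
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: sym_kl(clusters[ij[0]][1], clusters[ij[1]][1]))
        ta, da, wa = clusters[a]
        tb, db, wb = clusters[b]
        # The merged distribution is the weight-averaged mixture (an assumption).
        merged = ((ta, tb),
                  [(wa * x + wb * y) / (wa + wb) for x, y in zip(da, db)],
                  wa + wb)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0][0]


# Three toy "states": the first two are similar, the third is different.
tree = agglomerate([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]], [1, 1, 1])
print(tree)  # prints (2, (0, 1))
```

Each internal node of the resulting tree is a point where a small local estimator could be placed, in the spirit of the classifier tree the abstract describes.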

