Web-based models for natural language processing
- ACM Transactions on Speech and Language Processing, 2005
Cited by 48 (0 self)
Previous work demonstrated that Web counts can be used to approximate bigram counts, suggesting that Web-based frequencies should be useful for a wide variety of Natural Language Processing (NLP) tasks. However, only a limited number of tasks have so far been tested using Web-scale data sets. The present article overcomes this limitation by systematically investigating the performance of Web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the Web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine Web counts and corpus counts. However, unsupervised Web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that Web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.
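As a rough illustration of the interpolation idea mentioned in the abstract above, the sketch below linearly mixes a relative-frequency estimate taken from Web counts with one taken from corpus counts. The function names, the mixing weight `lam`, and the toy counts are all assumptions made for illustration; the abstract does not say how the article actually combines the two sources.

```python
def relative_frequency(bigram_count, history_count):
    """Maximum-likelihood estimate of P(word | history); 0.0 if the history is unseen."""
    if history_count == 0:
        return 0.0
    return bigram_count / history_count


def interpolated_probability(web_bigram, web_history,
                             corpus_bigram, corpus_history, lam=0.5):
    """Linear interpolation: lam * P_web + (1 - lam) * P_corpus.

    lam is a hypothetical mixing weight; in practice it would be tuned
    on held-out data.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    p_web = relative_frequency(web_bigram, web_history)
    p_corpus = relative_frequency(corpus_bigram, corpus_history)
    return lam * p_web + (1.0 - lam) * p_corpus


# Toy counts (invented for this example, not taken from the article).
p = interpolated_probability(web_bigram=1200, web_history=48000,
                             corpus_bigram=3, corpus_history=200, lam=0.7)
```

With `lam=1.0` the estimate relies on Web counts alone, and with `lam=0.0` on corpus counts alone, so the weight controls how much each source is trusted.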
Further progress in meeting recognition: The ICSI-SRI spring 2005 speech-to-text evaluation system
- In Proceedings of the
, 2005
Cited by 24 (11 self)
We describe the development of our speech recognition system for the National Institute of Standards and Technology (NIST) Spring 2005 Meeting Rich Transcription (RT-05S) evaluation, highlighting improvements made since last year [1]. The system is based on the SRI-ICSI-UW RT-04F conversational telephone speech (CTS) recognition system, with meeting-adapted models and various audio preprocessing steps. This year’s system features better delay-sum processing of distant microphone channels and energy-based crosstalk suppression for close-talking microphones. Acoustic modeling is improved by virtue of various enhancements to the background (CTS) models, including added training data, decision-tree based state tying, and the inclusion of discriminatively trained phone posterior features estimated by multilayer perceptrons. In particular, we make use of adaptation of both acoustic models and MLP features to the meeting domain. For distant microphone recognition we obtained considerable gains by combining and cross-adapting narrow-band (telephone) acoustic models with broadband (broadcast news) models. Language models (LMs) were improved with the inclusion of new meeting and web data. In spite of a lack of training data, we created effective LMs for the CHIL lecture domain. Results are reported on RT-04S and RT-05S meeting data. Measured on RT-04S conference data, we achieved an overall improvement of 17% relative in both MDM and IHM conditions compared to last year’s evaluation system. Results on lecture data are comparable to the best reported results for that task.
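The delay-sum processing of distant microphone channels mentioned above can be sketched in its simplest textbook form: shift each channel by its arrival delay, then average. This is only an illustration under strong assumptions (known, non-negative integer sample delays, equal-length channels, uniform weights). The function name and toy signals are hypothetical, and the abstract does not describe how the evaluation system estimates delays or weights channels.

```python
def delay_and_sum(channels, delays):
    """Align channels by integer sample delays and average them.

    channels: list of equal-length lists of samples, one per microphone.
    delays:   delays[i] is how many samples later the source reaches
              microphone i (assumed known and non-negative here).
    """
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    if not channels:
        return []
    n = len(channels[0])
    if any(len(signal) != n for signal in channels):
        raise ValueError("all channels must have the same length")

    output = []
    for t in range(n):
        total = 0.0
        for signal, delay in zip(channels, delays):
            index = t + delay          # advance the later-arriving channel
            if 0 <= index < n:         # samples past the end count as silence
                total += signal[index]
        output.append(total / len(channels))
    return output


# Toy example: the same pulse reaches mic_b two samples after mic_a.
source = [0.0, 1.0, 0.0, -1.0, 0.0]
mic_a = source + [0.0, 0.0]
mic_b = [0.0, 0.0] + source
aligned = delay_and_sum([mic_a, mic_b], delays=[0, 2])
```

Once the channels are aligned, the source signal adds coherently while uncorrelated noise on the individual microphones tends to average out, which is the motivation for this kind of processing.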
The use of a linguistically motivated language model in conversational speech recognition
- In Proc. ICASSP, 2004
Cited by 22 (4 self)
Structured language models have recently been shown to give significant improvements in large-vocabulary recognition relative to traditional word N-gram models, but typically imply a heavy computational burden and have not been applied to large training sets or complex recognition systems. In previous work, we developed a linguistically motivated and computationally efficient almost-parsing language model using a data structure derived from Constraint Dependency Grammar parses that tightly integrates knowledge of words, lexical features, and syntactic constraints. In this paper we show that such a model can be used effectively and efficiently in all stages of a complex, multi-pass conversational telephone speech recognition system. Compared to a state-of-the-art 4-gram interpolated word- and class-based language model, we obtained a 6.2% relative word error reduction (a 1.6% absolute reduction) on a recent NIST evaluation set.
Using MLP features in SRI’s conversational speech recognition system
- In Proc. Interspeech, 2005
Cited by 21 (4 self)
We describe the development of a speech recognition system for conversational telephone speech (CTS) that incorporates acoustic features estimated by multilayer perceptrons (MLPs). The acoustic features are based on frame-level phone posterior probabilities, obtained by merging two different MLP estimators, one based on PLP-Tandem features, the other based on hidden activation TRAPs (HATs) features. This paper focuses on the challenges arising when incorporating these nonstandard features into a full-scale speech-to-text (STT) system, as used by SRI in the Fall 2004 DARPA STT evaluations. First, we developed a series of time-saving techniques for training feature MLPs on 1800 hours of speech. Second, we investigated which components of a multipass, multi-front-end recognition system are most profitably augmented with MLP features for best overall performance. The final system achieved a 2% absolute (10% relative) WER reduction over a comparable baseline system that did not include Tandem/HATs MLP features.
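The absolute and relative WER reductions quoted above are linked by simple arithmetic: the relative reduction is the absolute reduction divided by the baseline WER. The sketch below uses that relation to back out the baseline implied by the two figures. The helper names are made up for this example, and because the quoted percentages are rounded, the implied baseline is only approximate and is not a number reported in the abstract.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(absolute_reduction, relative_reduction_fraction):
    """Baseline WER (in % points) implied by an absolute/relative pair."""
    return absolute_reduction / relative_reduction_fraction


# Figures quoted in the abstract: 2% absolute, 10% relative.
baseline = implied_baseline(absolute_reduction=2.0,
                            relative_reduction_fraction=0.10)
improved = baseline - 2.0

# Consistency check: recompute the relative reduction from the two WERs.
check = relative_reduction(baseline, improved)
```

Under these rounded figures the baseline works out to roughly 20% WER and the improved system to roughly 18%.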
Training LVCSR systems on thousands of hours of data
- In Proc. ICASSP, 2005
Cited by 18 (5 self)
Typical systems for large vocabulary conversational speech recognition (LVCSR) have been trained on a few hundred hours of carefully transcribed acoustic training data. This paper describes an LVCSR system for the conversational telephone speech (CTS) task trained on more than 2000 hours of data for which only approximate transcriptions were available. The challenges of dealing with such a large data set and the accuracy improvements over the small baseline system are discussed. The effect on both acoustic and language modelling performance is studied. Overall, increasing the training data size from 360h to 2200h and optimising the training procedure reduced the word error rate on the DARPA/NIST 2003 eval set by about 20% relative.
The 2005 AMI system for the transcription of speech
- In Proc. MLMI’05, 2005
Cited by 18 (3 self)
The automatic processing of speech collected in conference-style meetings has attracted considerable interest, with several large-scale projects devoted to this area. This paper describes the development of a baseline automatic speech transcription system for meetings in the context of the AMI (Augmented Multiparty Interaction) project. We present several techniques important to processing of this data and show the performance in terms of word error rates (WERs). An important aspect of transcription of this data is the necessary flexibility in terms of audio pre-processing. Real-world systems have to deal with flexible input, for example by using microphone arrays or randomly placed microphones in a room. Automatic segmentation and microphone array processing techniques are described and the effect on WERs is discussed. The system and its components presented in this paper yield competitive performance and form a baseline for future research in this domain.
Transcription of Conference Room Meetings: an Investigation
- In Proceedings Interspeech, 2005
Cited by 15 (8 self)
The automatic processing of speech collected in conference-style meetings has attracted considerable interest, with several large-scale projects devoted to this area. In this paper we explore the use of various meeting corpora for the purpose of automatic speech recognition. In particular, we investigate the similarity of these resources and how to use them efficiently in the construction of a meeting transcription system. The analysis shows distinctive features for each resource. However, the benefit in pooling data, and hence the similarity, seems sufficient to speak of a generic "conference meeting domain". In this context, this paper also presents work on development for the AMI meeting transcription system, a joint effort by seven sites working on the AMI (augmented multi-party interaction) project.
Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription
Cited by 12 (0 self)
Deploying an automatic speech recognition system with reasonable performance requires expensive and time-consuming in-domain transcription. Previous work demonstrated that non-professional annotation through Amazon’s Mechanical Turk can match professional quality. We use Mechanical Turk to transcribe conversational speech for as little as one thirtieth the cost of professional transcription. The higher disagreement of non-professional transcribers does not have a significant effect on system performance. While previous work demonstrated that redundant transcription can improve data quality, we found that resources are better spent collecting more data. Finally, we describe a quality control method that does not need professional transcription.
Development of the 2003 CU-HTK Conversational Telephone Speech Transcription System
- In Proc. ICASSP, 2004
Cited by 11 (4 self)
This paper describes the development of the 2003 CU-HTK large vocabulary speech recognition system for Conversational Telephone Speech (CTS). The system was designed based on a multipass, multi-branch structure where the output of all branches is combined using system combination. A number of advanced modelling techniques such as Speaker Adaptive Training, Heteroscedastic Linear Discriminant Analysis, Minimum Phone Error estimation and specially constructed Single Pronunciation dictionaries were employed. The effectiveness of each of these techniques and their potential contribution to the result of system combination was evaluated in the framework of a state-of-the-art LVCSR system with sophisticated adaptation. The final 2003 CU-HTK CTS system constructed from some of these models is described and its performance on the DARPA/NIST 2003 Rich Transcription (RT-03) evaluation test set is discussed.
The 2005 AMI system for the transcription of speech in meetings
- In Proc. of the NIST RT05s workshop, 2005
Cited by 11 (6 self)
In this paper we describe the 2005 AMI system for the transcription of speech in meetings used for participation in the 2005 NIST RT evaluations. The system was designed for participation in the speech-to-text part of the evaluations, in particular for transcription of speech recorded with multiple distant microphones and independent headset microphones. System performance was tested on both conference room and lecture-style meetings. Although input sources are processed using different front-ends, the recognition process is based on a unified system architecture. The system operates in multiple passes and makes use of state-of-the-art technologies such as discriminative training, vocal tract length normalisation, heteroscedastic linear discriminant analysis, speaker adaptation with maximum likelihood linear regression and minimum word error rate decoding. In this paper we describe the system performance on the official development and test sets for the NIST RT05s evaluations. The system was jointly developed in less than 10 months by a multi-site team and was shown to achieve very competitive performance.

