A Neural Probabilistic Language Model
- JOURNAL OF MACHINE LEARNING RESEARCH
, 2003
Cited by 81 (8 self)
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.
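The two components named in the abstract above, a feature vector per word and a probability function over those vectors, can be sketched as a single forward pass. This is a minimal illustration, not the authors' implementation: the tanh hidden layer, the softmax output, the omission of bias terms, and all names (`make_model`, `next_word_probs`, `C`, `H`, `U`) are assumptions, and the weights are random rather than trained.

```python
import math
import random


def make_model(vocab_size, embed_dim, context_size, hidden_dim, seed=0):
    """Build randomly initialised parameters (hypothetical layout, untrained)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "C": mat(vocab_size, embed_dim),                 # one feature vector per word
        "H": mat(hidden_dim, context_size * embed_dim),  # input-to-hidden weights
        "U": mat(vocab_size, hidden_dim),                # hidden-to-output weights
    }


def next_word_probs(model, context_ids):
    """Return a probability for every vocabulary word given the context word ids."""
    # Concatenate the feature vectors of the context words.
    x = [v for i in context_ids for v in model["C"][i]]
    # Hidden layer with a tanh nonlinearity (an assumption of this sketch).
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in model["H"]]
    # One score per vocabulary word, normalised with a numerically stable softmax.
    scores = [sum(w * hi for w, hi in zip(row, h)) for row in model["U"]]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


model = make_model(vocab_size=5, embed_dim=3, context_size=2, hidden_dim=4)
probs = next_word_probs(model, [1, 3])
```

Because the context enters only through the rows of `C`, two words with nearby feature vectors yield nearby output distributions, which is the source of generalization the abstract describes.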
A Bit of Progress in Language Modeling
, 2001
Cited by 70 (1 self)
Language modeling is the art of determining the probability of a sequence of words. This is useful in a large variety of areas including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Hull, 1992; Kernighan et al., 1990; Srihari and Baltus, 1992). The most commonly used language models are very simple (e.g. a Katz-smoothed trigram model). There are many improvements over this simple model however, including caching, clustering, higher-order n-grams, skipping models, and sentence-mixture models, all of which we will describe below. Unfortunately, these more complicated techniques have rarely been examined in combination. It is entirely possible that two techniques that work well separately will not work well together, and, as we will show, even possible that some techniques will work better together than either one does by itself. In this...
The 1998 Htk System For Transcription Of Conversational Telephone Speech
- IN: PROCEEDINGS INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING
, 1998
Cited by 25 (7 self)
This paper describes the 1998 HTK large vocabulary speech recognition system for conversational telephone speech as used in the NIST 1998 Hub5E evaluation. Front-end and language modelling experiments conducted using various training and test sets from both the Switchboard and Callhome English corpora are presented. Our complete system includes reduced bandwidth analysis, side-based cepstral feature normalisation, vocal tract length normalisation (VTLN), triphone and quinphone hidden Markov models (HMMs) built using speaker adaptive training (SAT), maximum likelihood linear regression (MLLR) speaker adaptation and a confidence score based system combination. A detailed description of the complete system together with experimental results for each stage of our multi-pass decoding scheme is presented. The word error rate obtained is almost 20% better than our 1997 system on the development set.
The 1997 HTK Broadcast News Transcription System
, 1998
Cited by 19 (5 self)
This paper presents the recent development of the HTK broadcast news transcription system. Previously we have used data type specific modelling based on adapted Wall Street Journal trained HMMs. However, we are now using data for which no manual preclassification or segmentation is available and therefore automatic techniques are required and compatible acoustic modelling strategies must be adopted. A number of recognition experiments are presented that compare data-type specific and non-specific models; differing amounts of training data; the use of gender-dependent modelling and the effects of automatic data-type classification. Based on these experiments, the HTK system for the 1997 broadcast news evaluation was designed. A detailed description of this system is given which includes a class-based language modelling component. The complete system yields an overall word error rate of 22.0% on the 1996 unpartitioned broadcast news development test data and just 15.8% on the 1997 evalua...
The CU-HTK March 2000 Hub5E Transcription System
, 2000
Cited by 18 (1 self)
This paper describes the Cambridge University HTK (CU-HTK) system developed for the NIST March 2000 evaluation of English conversational telephone speech transcription (Hub5E). A range of new features have been added to the HTK system used in the 1998 Hub5 evaluation, and the changes taken together have resulted in an 11% relative decrease in word error rate on the 1998 evaluation test set. Major changes include the use of maximum mutual information estimation in training as well as conventional maximum likelihood estimation; the use of a full variance transform for adaptation; the inclusion of unigram pronunciation probabilities; and word-level posterior probability estimation using confusion networks for use in minimum word error rate decoding, confidence score estimation and system combination. On the March 2000 Hub5 evaluation set the CU-HTK system gave an overall word error rate of 25.4%, which was the best performance by a statistically significant margin. This paper describes th...
The Use of Clustering Techniques for Language Modeling - Application to Asian Languages
Cited by 15 (11 self)
Cluster-based n-gram modeling is a variant of normal word-based n-gram modeling. It attempts to make use of the similarities between words. In this paper, we present an empirical study of clustering techniques for Asian language modeling. Clustering is used to improve the performance (i.e. perplexity) of language models as well as to compress language models. Experimental tests are presented for cluster-based trigram models on a Japanese newspaper corpus, and on a Chinese heterogeneous corpus.
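One common way to exploit word similarity in a cluster-based model is to factor the word probability into a cluster transition and a word-given-cluster term, P(w | w_prev) = P(c | c_prev) * P(w | c). The bigram sketch below assumes that factorization, hand-assigned clusters, and unsmoothed maximum-likelihood counts; the abstract does not specify any of these, and the names `train_cluster_bigram` and `word_to_cluster` are hypothetical.

```python
from collections import Counter


def train_cluster_bigram(tokens, word_to_cluster):
    """Return prob(prev_word, word) for an unsmoothed cluster-based bigram model."""
    clusters = [word_to_cluster[w] for w in tokens]
    word_counts = Counter(tokens)
    cluster_counts = Counter(clusters)
    cluster_bigrams = Counter(zip(clusters, clusters[1:]))
    # Count each cluster only where it is followed by another token.
    prev_counts = Counter(clusters[:-1])

    def prob(prev_word, word):
        c_prev = word_to_cluster[prev_word]
        c = word_to_cluster[word]
        if prev_counts[c_prev] == 0 or cluster_counts[c] == 0:
            return 0.0
        p_cluster = cluster_bigrams[(c_prev, c)] / prev_counts[c_prev]  # P(c | c_prev)
        p_word = word_counts[word] / cluster_counts[c]                  # P(w | c)
        return p_cluster * p_word

    return prob


tokens = "the cat sat a dog ran".split()
word_to_cluster = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN",
    "sat": "VERB", "ran": "VERB",
}
prob = train_cluster_bigram(tokens, word_to_cluster)
p_the_dog = prob("the", "dog")  # 0.5, although "the dog" never occurs in tokens
```

The word bigram "the dog" is absent from the toy data, yet it receives probability 0.5 because "the" shares a cluster with "a" and "dog" shares one with "cat". The parameter table is also indexed by clusters rather than words, which is where the compression mentioned in the abstract comes from.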
Automatic Transcription of Conversational Telephone Speech - Development of the CU-HTK 2002 System
- IEEE Transactions on Acoustics, Speech and Signal Processing
, 2003
Cited by 11 (2 self)
This paper discusses the Cambridge University HTK (CU-HTK) system for the automatic transcription of conversational telephone speech. A detailed discussion of the most important techniques in front-end processing, acoustic modelling and model training, language and pronunciation modelling is presented. These include the use of conversation side based cepstral normalisation, vocal tract length normalisation, heteroscedastic linear discriminant analysis for feature projection, Minimum Phone Error Training and speaker adaptive training, lattice-based model adaptation, confusion network based decoding and confidence score estimation, pronunciation selection, language model interpolation and class based language models.
Hierarchical probabilistic neural network language model
- AISTATS’05
, 2005
Cited by 11 (1 self)
In recent years, variants of a neural network architecture for statistical language modeling have been proposed and successfully applied, e.g. in the language modeling component of speech recognizers. The main advantage of these architectures is that they learn an embedding for words (or other symbols) in a continuous space that helps to smooth the language model and provide good generalization even when the number of training examples is insufficient. However, these models are extremely slow in comparison to the more commonly used n-gram models, both for training and recognition. As an alternative to an importance sampling method proposed to speed-up training, we introduce a hierarchical decomposition of the conditional probabilities that yields a speed-up of about 200 both during training and recognition. The hierarchical decomposition is a binary hierarchical clustering constrained by the prior knowledge extracted from the WordNet semantic hierarchy.
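The hierarchical decomposition described above can be illustrated by writing a word's probability as a product of binary decisions along its path in a tree, so that scoring one word costs on the order of log2(V) node evaluations instead of V. The sketch below is an assumption-laden toy: it uses a balanced tree over word ids rather than a WordNet-derived clustering, a sigmoid of a dot product at each node, random weights, and hypothetical names (`word_path`, `word_prob`, `node_weights`).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def word_path(word_id, depth):
    """Binary code of word_id, most significant bit first (one bit per tree level)."""
    return [(word_id >> (depth - 1 - k)) & 1 for k in range(depth)]


def word_prob(word_id, context_vec, node_weights, depth):
    """Multiply the left/right decision probabilities along the word's path."""
    prob = 1.0
    node = 1  # heap-style numbering: root is 1, children of n are 2n and 2n + 1
    for bit in word_path(word_id, depth):
        weights = node_weights[node]
        p_right = sigmoid(sum(w * c for w, c in zip(weights, context_vec)))
        prob *= p_right if bit else 1.0 - p_right
        node = 2 * node + bit
    return prob


depth = 3  # 2**3 = 8 words at the leaves
rng = random.Random(0)
# One weight vector per internal node (nodes 1 .. 2**depth - 1), untrained.
node_weights = {
    n: [rng.uniform(-1.0, 1.0) for _ in range(4)] for n in range(1, 2 ** depth)
}
context_vec = [0.5, -0.2, 0.1, 0.9]  # stand-in for a learned context representation
probs = [word_prob(w, context_vec, node_weights, depth) for w in range(2 ** depth)]
```

Each internal node splits its probability mass between its two children, so the leaf probabilities sum to one without ever computing a normalizer over the whole vocabulary.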
Recent Advances in Broadcast News Transcription
- in Proc. IEEE ASRU Workshop
, 2003
Cited by 10 (3 self)
This paper describes recent advances in the CU-HTK Broadcast News English (BN-E) transcription system and its performance in the DARPA/NIST Rich Transcription 2003 Speech-to-Text (RT03) evaluation. Heteroscedastic linear discriminant analysis (HLDA) and discriminative training, which were previously developed in the context of the recognition of conversational telephone speech, have been successfully applied to the BN-E task for the first time. A number of new features have also been added. These include gender-dependent (GD) discriminative training; and modified discriminative training using lattice re-generation and combination. On the 2003 evaluation set the system gave an overall word error rate of 10.7% in less than 10 times real time (10RT).
The CUHTK-Entropic 10xRT Broadcast News Transcription System
, 1999
Cited by 9 (4 self)
This paper describes the development of the CUHTK-Entropic 10xRT Broadcast News Transcription System. Previous HTK broadcast news transcription systems have focused on maximising accuracy with few constraints on compute power available. In order to develop a system running in under 10 times real time on a single CPU, detailed investigation and optimisation of the system architecture and mode of operation was required. This paper outlines those developments and discusses the way in which operation under 10xRT was ensured despite variability of the data to be recognised. On the 1998 test the system produced an average word error rate of 16.1% running in 9.5xRT.

