Results 21 - 30
of
234
The Use of Context in Large Vocabulary Speech Recognition
, 1995
"... decide which contexts are similar and can share parameters. A key feature of this approach is that it allows the construction of models which are dependent upon contextual effects occurring across word boundaries. The use of cross word context dependent models presents problems for conventional dec ..."
Abstract
-
Cited by 93 (0 self)
- Add to MetaCart
decide which contexts are similar and can share parameters. A key feature of this approach is that it allows the construction of models which are dependent upon contextual effects occurring across word boundaries. The use of cross word context dependent models presents problems for conventional decoders. The second part of the thesis therefore presents a new decoder design which is capable of using these models efficiently. The decoder is suitable for use with very large vocabularies and long span language models. It is also capable of generating a lattice of word hypotheses with little computational overhead. These lattices can be used to constrain further decoding, allowing efficient use of complex acoustic and language models. The effectiveness of these techniques has been assessed on a variety of large vocabulary continuous speech recognition tasks and results are presented which analyse performance in terms of computational complexity and recognition accuracy. The experiments dem
Unsupervised Learning from Dyadic Data
, 1998
"... Dyadic data refers to a domain with two finite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This includes event co-occurrences, histogram data, and single stimulus preference data as special cases. Dyadic data arises naturally in many applic ..."
Abstract
-
Cited by 89 (9 self)
- Add to MetaCart
Dyadic data refers to a domain with two finite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This includes event co-occurrences, histogram data, and single stimulus preference data as special cases. Dyadic data arises naturally in many applications ranging from computational linguistics and information retrieval to preference analysis and computer vision. In this paper, we present a systematic, domain-independent framework for unsupervised learning from dyadic data by statistical mixture models. Our approach covers different models with flat and hierarchical latent class structures and unifies probabilistic modeling and structure discovery. Mixture models provide both, a parsimonious yet flexible parameterization of probability distributions with good generalization performance on sparse data, as well as structural information about data-inherent grouping structure. We propose an annealed version of the standard Expectation Maximization algorithm for model fitting which is empirically evaluated on a variety of data sets from different domains.
Best-first Model Merging for Hidden Markov Model Induction
, 1994
"... This report describes a new technique for inducing the structure of Hidden Markov Models from data which is based on the general `model merging' strategy (Omohundro 1992). The process begins with a maximum likelihood HMM that directly encodes the training data. Successively more general models are p ..."
Abstract
-
Cited by 86 (7 self)
- Add to MetaCart
This report describes a new technique for inducing the structure of Hidden Markov Models from data which is based on the general `model merging' strategy (Omohundro 1992). The process begins with a maximum likelihood HMM that directly encodes the training data. Successively more general models are produced by merging HMM states. A Bayesian posterior probability criterion is used to determine which states to merge and when to stop generalizing. The procedure may be considered a heuristic search for the HMM structure with the highest posterior probability. We discuss a variety of possible priors for HMMs, as well as a number of approximations which improve the computational efficiency of the algorithm. We studied three applications to evaluate the procedure. The first compares the merging algorithm with the standard Baum-Welch approach in inducing simple finitestate languages from small, positive-only training samples. We found that the merging procedure is more robust and accurate, part...
Part-of-Speech Tagging and Partial Parsing
- Corpus-Based Methods in Language and Speech
, 1996
"... m we can carve o# next. `Partial parsing' is a cover term for a range of di#erent techniques for recovering some but not all of the information contained in a traditional syntactic analysis. Partial parsing techniques, like tagging techniques, aim for reliability and robustness in the face of the va ..."
Abstract
-
Cited by 85 (0 self)
- Add to MetaCart
m we can carve o# next. `Partial parsing' is a cover term for a range of di#erent techniques for recovering some but not all of the information contained in a traditional syntactic analysis. Partial parsing techniques, like tagging techniques, aim for reliability and robustness in the face of the vagaries of natural text, by sacrificing completeness of analysis and accepting a low but non-zero error rate. 1 Tagging The earliest taggers [35, 51] had large sets of hand-constructed rules for assigning tags on the basis of words' character patterns and on the basis of the tags assigned to preceding or following words, but they had only small lexica, primarily for exceptions to the rules. TAGGIT [35] was used to generate an initial tagging of the Brown corpus, which was then hand-edited. (Thus it provided the data that has since been used to train other taggers [20].) The tagger described by Garside [56, 34], CLAWS, was a probabilistic version of TAGGIT, and the DeRose tagger improved on
Classifier Combination for Improved Lexical Disambiguation
, 1998
"... One of the most exciting recent directions in machine learning is the discovery that the combination of multiple classifiers often results in significantly better performance than what can be achieved with a single classifier. In this paper, we first show that the errors made from three differ ..."
Abstract
-
Cited by 82 (1 self)
- Add to MetaCart
One of the most exciting recent directions in machine learning is the discovery that the combination of multiple classifiers often results in significantly better performance than what can be achieved with a single classifier. In this paper, we first show that the errors made from three different state of the art part of speech taggers are strongly complementary. Next, we show how this complementary behavior can be used to our advantage. By using contextual cues to guide tagger combination, we are able to derive a new tagger that achieves performance significantly greater than any of the individual taggers.
A Neural Probabilistic Language Model
- JOURNAL OF MACHINE LEARNING RESEARCH
, 2003
"... A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen ..."
Abstract
-
Cited by 81 (8 self)
- Add to MetaCart
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.
Large language models in machine translation
- In EMNLP
, 2007
"... This paper reports on the benefits of largescale statistical language modeling in machine translation. A distributed infrastructure is proposed which we use to train on up to 2 trillion tokens, resulting in language models having up to 300 billion n-grams. It is capable of providing smoothed probabi ..."
Abstract
-
Cited by 78 (2 self)
- Add to MetaCart
This paper reports on the benefits of largescale statistical language modeling in machine translation. A distributed infrastructure is proposed which we use to train on up to 2 trillion tokens, resulting in language models having up to 300 billion n-grams. It is capable of providing smoothed probabilities for fast, single-pass decoding. We introduce a new smoothing method, dubbed Stupid Backoff, that is inexpensive to train on large data sets and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases. 1
A survey of smoothing techniques for ME models
- IEEE Transactions on Speech and Audio Processing
, 2000
"... Abstract—In certain contexts, maximum entropy (ME) modeling can be viewed as maximum likelihood (ML) training for exponential models, and like other ML methods is prone to overfitting of training data. Several smoothing methods for ME models have been proposed to address this problem, but previous r ..."
Abstract
-
Cited by 75 (1 self)
- Add to MetaCart
Abstract—In certain contexts, maximum entropy (ME) modeling can be viewed as maximum likelihood (ML) training for exponential models, and like other ML methods is prone to overfitting of training data. Several smoothing methods for ME models have been proposed to address this problem, but previous results do not make it clear how these smoothing methods compare with smoothing methods for other types of related models. In this work, we survey previous work in ME smoothing and compare the performance of several of these algorithms with conventional techniques for smoothing-gram language models. Because of the mature body of research in-gram model smoothing and the close connection between ME and conventional-gram models, this domain is well-suited to gauge the performance of ME smoothing methods. Over a large number of data sets, we find that fuzzy ME smoothing performs as well as or better than all other algorithms under consideration. We contrast this method with previous-gram smoothing methods to explain its superior performance. Index Terms—Exponential models, language modeling, maximum entropy, minimum divergence,-gram models, smoothing.
Improvements In Part-of-Speech Tagging With an Application To German
- In Proceedings of the ACL SIGDAT-Workshop
, 1995
"... This paper presents a couple of extensions to a basic Markov Model tagger (called TreeTagger) which improve its accuracy when trained on small corpora. The basic tagger was originally developed for English [Schmid, 1994]. The extensions together reduced error rates on a German test corpus by more th ..."
Abstract
-
Cited by 73 (1 self)
- Add to MetaCart
This paper presents a couple of extensions to a basic Markov Model tagger (called TreeTagger) which improve its accuracy when trained on small corpora. The basic tagger was originally developed for English [Schmid, 1994]. The extensions together reduced error rates on a German test corpus by more than a third.
Similarity-based models of word cooccurrence probabilities
- Machine Learning
, 1999
"... Abstract. In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations “eat a peach ” and “eat a beach ” is more likely. Statistical NLP met ..."
Abstract
-
Cited by 70 (0 self)
- Add to MetaCart
Abstract. In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations “eat a peach ” and “eat a beach ” is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on “most similar ” words. We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudo-word disambiguation. In the language modeling task, a similarity-based model is used to improve probability estimates for unseen bigrams in a back-off language model. The similaritybased method yields a 20 % perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error. We also compare four similarity-based estimation methods against back-off and maximum-likelihood estimation methods on a pseudo-word sense disambiguation task in which we controlled for both unigram and bigram frequency to avoid giving too much weight to easy-to-disambiguate high-frequency configurations. The similaritybased methods perform up to 40 % better on this particular task.

