A Maximum Entropy approach to Natural Language Processing
Computational Linguistics, 1996
Cited by 846 (5 self)
The concept of maximum entropy can be traced back along multiple threads to Biblical times. Only recently, however, have computers become powerful enough to permit the widescale application of this concept to real world problems in statistical estimation and pattern recognition. In this paper we describe a method for statistical modeling based on maximum entropy. We present a maximum-likelihood approach for automatically constructing maximum entropy models and describe how to implement this approach efficiently, using as examples several problems in natural language processing.
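As a minimal sketch of the model family the abstract refers to, a conditional maximum entropy model has the exponential form p(y | x) proportional to exp(sum_i lambda_i * f_i(x, y)). The function name, the single feature, and the toy labels below are hypothetical illustrations, not taken from the paper, and the weight is set by hand rather than estimated.

```python
import math

def maxent_conditional(weights, features, x, labels):
    """Return p(y | x) proportional to exp(sum_i weights[i] * features[i](x, y))."""
    scores = {
        y: sum(w * f(x, y) for w, f in zip(weights, features))
        for y in labels
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp_scores.values())  # normalizing constant Z(x)
    return {y: e / z for y, e in exp_scores.items()}

# Hypothetical binary feature: fires when the candidate output is "en"
# and the word following the source word is "April".
features = [lambda x, y: 1.0 if y == "en" and x["next_word"] == "April" else 0.0]
labels = ["dans", "en"]

# Weight chosen by hand for illustration only.
probs = maxent_conditional([1.5], features, {"next_word": "April"}, labels)
```

With all weights at zero the distribution is uniform over the labels; a positive weight shifts probability toward the outputs on which the feature fires.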
Applying Co-Training methods to Statistical Parsing
2001
Cited by 48 (3 self)
We propose a novel Co-Training method for statistical parsing. The algorithm takes as input a small corpus (9695 sentences) annotated with parse trees, a dictionary of possible lexicalized structures for each word in the training set and a large pool of unlabeled text. The algorithm iteratively labels the entire data set with parse trees. Using empirical results based on parsing the Wall Street Journal corpus we show that training a statistical parser on the combined labeled and unlabeled data strongly outperforms training only on the labeled data.
A Maximum-Entropy Partial Parser for Unrestricted Text
1998
Cited by 30 (1 self)
This paper describes a partial parser that assigns syntactic structures to sequences of part-of-speech tags. The program uses the maximum entropy parameter estimation method, which allows a flexible combination of different knowledge sources: the hierarchical structure, parts of speech and phrasal categories. In effect, the parser goes beyond simple bracketing and recognises even fairly complex structures. We give accuracy figures for different applications of the parser.
Dynamic Programming Search for Continuous Speech Recognition
1999
Cited by 30 (0 self)
Initially introduced in the late 1960s and early 1970s, dynamic programming algorithms have become increasingly popular in automatic speech recognition. There are two reasons why this has occurred: First, the dynamic programming strategy can be combined with a very efficient and practical pruning strategy so that very large search spaces can be handled. Second, the dynamic programming strategy has turned out to be extremely flexible in adapting to new requirements. Examples of such requirements are the lexical tree organization of the pronunciation lexicon and the generation of a word graph instead of the single best sentence. In this paper, we attempt to systematically review the use of dynamic programming search strategies for small-vocabulary and large-vocabulary continuous speech recognition. The following methods are described in detail: search using a linear lexicon, search using a lexical tree, language-model look-ahead and word graph generation.
Exploiting Syntactic Structure for Natural Language Modeling
2000
Cited by 27 (0 self)
The thesis presents an attempt at using the syntactic structure in natural language for improved language models for speech recognition. The structured language model merges techniques in automatic parsing and language modeling using an original probabilistic parameterization of a shift-reduce parser. A maximum likelihood reestimation procedure belonging to the class of expectation-maximization algorithms is employed for training the model. Experiments on the Wall Street Journal, Switchboard and Broadcast News corpora show improvement in both perplexity and word error rate -- word lattice rescoring -- over the standard 3-gram language model. The significance of the thesis lies in presenting an original approach to language modeling that uses the hierarchical -- syntactic -- structure in natural language to improve on current 3-gram modeling techniques for large vocabulary speech recognition.
Grammatical Bigrams
Advances in Neural Information Processing Systems 14, 2001
Cited by 17 (0 self)
Unsupervised learning algorithms have been derived for several statistical models of English grammar, but their computational complexity makes applying them to large data sets intractable. This paper presents a probabilistic model of English grammar that is much simpler than conventional models, but which admits an efficient EM training algorithm. The model is based upon grammatical bigrams, i.e., syntactic relationships between pairs of words.
A Comparison of Criteria for Maximum Entropy/Minimum Divergence Feature Selection
In Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing (EMNLP), 1998
Cited by 16 (1 self)
In this paper we study the gain, a naturally-arising statistic from the theory of memd modeling [2], as a figure of merit for selecting features for an memd language model. We compare the gain with two popular alternatives (empirical activation and mutual information) and argue that the gain is the preferred statistic, on the grounds that it directly measures a feature's contribution to improving upon the base model.
Introduction
Maximum entropy / minimum divergence (memd) modeling is a powerful technique for building statistical models of linguistic phenomena. It has been applied to problems as diverse as machine translation [2], parsing [10], word morphology [5] and language modeling [6, 11, 3, 9]. The heart of the method is to choose a collection of informative features, each encoding some linguistically significant event, and then to incorporate these features into a family of conditional models. A fundamental issue in applying this technique is the criterion used to select f...
Word Triggers and the EM Algorithm
In Proceedings of the Workshop Computational Natural Language Learning (CoNLL 97), 1997
Cited by 9 (3 self)
In this paper, we study the use of so-called word trigger pairs to improve an existing language model, which is typically a trigram model in combination with a cache component. A word trigger pair is defined as a long-distance word pair. We present two methods to select the most significant single word trigger pairs. The selected trigger pairs are used in a combined model where the interpolation parameters and trigger interaction parameters are trained by the EM algorithm.
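To illustrate the idea of training an interpolation parameter with EM, here is a minimal sketch for a two-component linear interpolation p(w) = (1 - lam) * p_base(w) + lam * p_trigger(w). It is a simplification, not the paper's model: it estimates a single weight, leaves out the trigger interaction parameters, and assumes each token already has a strictly positive probability under both components. The function names and the toy probabilities are hypothetical.

```python
import math

def em_interpolation_weight(p_base, p_trigger, iterations=50, lam=0.5):
    """Estimate lam for p(w) = (1 - lam) * p_base(w) + lam * p_trigger(w).

    p_base and p_trigger hold the probability each component assigns
    to the same sequence of held-out tokens (assumed strictly positive).
    """
    for _ in range(iterations):
        # E-step: posterior probability that each token was generated
        # by the trigger component under the current weight.
        posteriors = [
            lam * pt / ((1.0 - lam) * pb + lam * pt)
            for pb, pt in zip(p_base, p_trigger)
        ]
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam

def log_likelihood(p_base, p_trigger, lam):
    """Log-likelihood of the held-out tokens under the interpolated model."""
    return sum(
        math.log((1.0 - lam) * pb + lam * pt)
        for pb, pt in zip(p_base, p_trigger)
    )

# Toy held-out probabilities, invented for illustration.
p_base = [0.1, 0.2, 0.05, 0.3]
p_trigger = [0.3, 0.1, 0.2, 0.1]
lam = em_interpolation_weight(p_base, p_trigger)
```

Each EM iteration cannot decrease the held-out log-likelihood, so comparing `log_likelihood` before and after training is a quick sanity check.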
Pattern-Based Disambiguation for Natural Language Processing
2000
Cited by 5 (0 self)
A wide range of natural language problems can be viewed as disambiguating between a small set of alternatives based upon the string context surrounding the ambiguity site. In this paper we demonstrate that classification accuracy can be improved by invoking a more descriptive feature set than what is typically used. We present a technique that disambiguates by learning regular expressions describing the string contexts in which the ambiguity sites appear.

