Results 1 - 10 of 12
Inducing Features of Random Fields
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997
Cited by 465 (14 self)
We present a technique for constructing random fields from a set of training samples. The learning paradigm builds increasingly complex fields by allowing potential functions, or features, that are supported by increasingly large subgraphs. Each feature has a weight that is trained by minimizing the Kullback-Leibler divergence between the model and the empirical distribution of the training data. A greedy algorithm determines how features are incrementally added to the field and an iterative scaling algorithm is used to estimate the optimal values of the weights. The random field models and techniques introduced in this paper differ from those common to much of the computer vision literature in that the underlying random fields are non-Markovian and have a large number of parameters that must be estimated. Relations to other learning approaches, including decision trees, are given. As a demonstration of the method, we describe its application to the problem of automatic word classifica...
Maximum Entropy Models for Natural Language Ambiguity Resolution
1998
Cited by 167 (1 self)
The best aspect of a research environment, in my opinion, is the abundance of bright people with whom you argue, discuss, and nurture your ideas. I thank all of the people at Penn and elsewhere who have given me the feedback that has helped me to separate the good ideas from the bad ideas. I hope that I have kept the good ideas in this thesis, and left the bad ideas out! I would like to acknowledge the following people for their contribution to my education: I thank my advisor Mitch Marcus, who gave me the intellectual freedom to pursue what I believed to be the best way to approach natural language processing, and also gave me direction when necessary. I also thank Mitch for many fascinating conversations, both personal and professional, over the last four years at Penn. I thank all of my thesis committee members: John Lafferty from Carnegie Mellon University, Aravind Joshi, Lyle Ungar, and Mark Liberman, for their extremely valuable suggestions and comments about my thesis research. I thank Mike Collins, Jason Eisner, and Dan Melamed, with whom I've had many stimulating and impromptu discussions in the LINC lab. I owe them much gratitude for their valuable feedback on numerous rough drafts of papers and thesis chapters.
A Simple Introduction to Maximum Entropy Models for Natural Language Processing
Cited by 63 (0 self)
Many problems in natural language processing can be viewed as linguistic classification problems, in which linguistic contexts are used to predict linguistic classes. Maximum entropy models offer a clean way to combine diverse pieces of contextual evidence in order to estimate the probability of a certain linguistic class occurring with a certain linguistic context. This report demonstrates the use of a particular maximum entropy model on an example problem, and then proves some relevant mathematical facts about the model in a simple and accessible manner. This report also describes an existing procedure called Generalized Iterative Scaling, which estimates the parameters of this particular model. The goal of this report is to provide enough detail to re-implement the maximum entropy models described in [Ratnaparkhi, 1996, Reynar and Ratnaparkhi, 1997, Ratnaparkhi, 1997] and also to provide a simple explanation of the maximum entropy formalism. 1 Introduction Many problems in natural...
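As a rough illustration of the Generalized Iterative Scaling procedure mentioned in this abstract, the following is a minimal sketch and not the report's own implementation. It fits a simplified unconditional model p(x) proportional to exp(sum_i lambda_i f_i(x)) over a small finite outcome set, whereas the report works with conditional models of classes given contexts. The names `gis` and `model_probs`, the slack feature, and the toy data are all assumptions introduced here.

```python
import math


def model_probs(rows, lambdas):
    """Return p(x) proportional to exp(sum_i lambda_i * f_i(x)) for each row of feature values."""
    scores = [math.exp(sum(l * v for l, v in zip(lambdas, row))) for row in rows]
    z = sum(scores)
    return [s / z for s in scores]


def gis(outcomes, features, empirical, iterations=200):
    """Fit feature weights by Generalized Iterative Scaling (illustrative sketch).

    outcomes:  list of outcomes x
    features:  list of functions f_i(x) returning non-negative numbers
    empirical: dict mapping each outcome to its empirical probability
    """
    base = [[f(x) for f in features] for x in outcomes]
    # GIS assumes the features sum to the same constant C on every outcome,
    # so a slack feature tops each outcome up to C.
    C = max(sum(row) for row in base)
    rows = [row + [C - sum(row)] for row in base]
    n = len(features) + 1

    # Empirical expectation of each feature (the constraint targets).
    target = [sum(empirical[x] * row[i] for x, row in zip(outcomes, rows))
              for i in range(n)]
    if any(t <= 0 for t in target):
        raise ValueError("this sketch assumes every feature has a positive empirical expectation")

    lambdas = [0.0] * n
    for _ in range(iterations):
        probs = model_probs(rows, lambdas)
        expected = [sum(p * row[i] for p, row in zip(probs, rows)) for i in range(n)]
        # GIS update: move each weight by (1/C) * log(target / model expectation).
        lambdas = [lam + math.log(t / e) / C
                   for lam, t, e in zip(lambdas, target, expected)]
    return dict(zip(outcomes, model_probs(rows, lambdas)))


# Hypothetical toy problem: two binary features over four outcomes.
toy_features = [lambda x: 1.0 if x in ("a", "b") else 0.0,
                lambda x: 1.0 if x in ("a", "c") else 0.0]
toy_empirical = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
fitted = gis(["a", "b", "c", "d"], toy_features, toy_empirical)
```

After fitting, the model's expected value of each feature should agree with its empirical expectation, which is the constraint that defines the maximum entropy solution.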
Cluster Expansions And Iterative Scaling For Maximum Entropy Language Models
Maximum Entropy and Bayesian Methods, 1995
Cited by 20 (1 self)
The maximum entropy method has recently been successfully introduced to a variety of natural language applications. In each of these applications, however, the power of the maximum entropy method is achieved at the cost of a considerable increase in computational requirements. In this paper we present a technique, closely related to the classical cluster expansion from statistical mechanics, for reducing the computational demands necessary to calculate conditional maximum entropy language models. 1. Introduction In this paper we present a computational technique that can enable faster calculation of maximum entropy models. The starting point for our method is an algorithm [1] for constructing maximum entropy distributions that is an extension of the generalized iterative scaling algorithm of Darroch and Ratcliff [2,3]. The extended algorithm relaxes the assumption of [2,3] that the constraint functions sum to a constant, and results in a set of decoupled polynomial equations, one fo...
Maximum Entropy Modeling Toolkit
1997
Cited by 11 (2 self)
for not-for-profit academic or research purposes is hereby granted, provided that the above copyright notice appears in all copies, and that both the copyright notice and this permission notice and warranty disclaimer appear in supporting documentation, and that use of this software for research purposes is explicitly acknowledged and cited in all relevant reports and publications via the following citation form
Language Model Adaptation for Automatic Speech Recognition and Statistical Machine Translation
2004
Cited by 1 (0 self)
Language modeling is critical and indispensable for many natural language applications such as automatic speech recognition and machine translation. Due to the complexity of natural language grammars, it is almost impossible to construct language models by a set of linguistic rules; therefore statistical techniques have been dominant for language modeling over the last few decades. All statistical modeling techniques, in principle, work under some conditions: 1) a reasonable amount of training data is available and 2) the training data comes from the same population as the test data to which we want to apply our model. Based on observations from the training data, we build statistical models and therefore, the success of a statistical model is crucially dependent on the training data. In other words, if we don’t have enough data for training, or the training data is not matched with the test data, we are not able to build accurate statistical models. This thesis presents novel methods to cope with those problems in language modeling: language model adaptation.
Iterative proportional scaling via decomposable submodels for contingency tables
2006
Cited by 1 (1 self)
We propose iterative proportional scaling (IPS) via decomposable submodels for maximizing the likelihood function of a hierarchical model for contingency tables. In ordinary IPS the proportional scaling is performed by cycling through the members of the generating class of a hierarchical model. We propose to adjust more marginals at each step. This is accomplished by expressing the generating class as a union of decomposable submodels and cycling through the decomposable models. We prove convergence of our proposed procedure, if the amount of scaling is adjusted properly at each step. We also analyze the proposed algorithms around the maximum likelihood estimate (MLE) in detail. Faster convergence of our proposed procedure is illustrated by numerical examples. Keywords and phrases: decomposable model, hierarchical model, I-projection, iterative proportional fitting, Kullback-Leibler divergence.
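For orientation, here is a minimal sketch of the ordinary IPS baseline that this abstract describes (cycling through the members of the generating class and rescaling the fitted table to match each observed marginal in turn). It does not implement the paper's decomposable-submodel variant. The function names `marginal` and `ips`, the dict-of-cells table representation, and the example counts are assumptions made for illustration.

```python
def marginal(table, axes):
    """Sum a table (dict: cell tuple -> value) down to the given axes."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + value
    return out


def ips(counts, generating_class, sweeps=50):
    """Ordinary iterative proportional scaling (illustrative sketch).

    counts:           dict mapping cell index tuples to observed counts
    generating_class: tuple of axis tuples, e.g. ((0, 1), (0, 2), (1, 2))
    """
    total = sum(counts.values())
    # Start from a uniform table with the same total as the data.
    fitted = {cell: total / len(counts) for cell in counts}
    targets = {axes: marginal(counts, axes) for axes in generating_class}
    for _ in range(sweeps):
        # One sweep cycles through every member of the generating class.
        for axes in generating_class:
            current = marginal(fitted, axes)
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                if current[key] > 0:
                    # Rescale so this marginal matches the observed one.
                    fitted[cell] *= targets[axes][key] / current[key]
    return fitted


# Hypothetical 2x2x2 table fitted under the no-three-way-interaction model.
example_counts = {
    (0, 0, 0): 10.0, (0, 0, 1): 20.0, (0, 1, 0): 30.0, (0, 1, 1): 40.0,
    (1, 0, 0): 15.0, (1, 0, 1): 25.0, (1, 1, 0): 35.0, (1, 1, 1): 45.0,
}
example_fit = ips(example_counts, ((0, 1), (0, 2), (1, 2)))
```

Each inner step adjusts a single marginal, which is the behavior the proposed method generalizes by adjusting several marginals per step.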
Gibbs-Markov Models
In Computing Science and Statistics: Proceedings of the 27th Symposium on the Interface. Interface Foundation, 1995
In this paper we present a framework for building probabilistic automata parameterized by context-dependent probabilities. Gibbs distributions are used to model state transitions and output generation, and parameter estimation is carried out using an EM algorithm where the M-step uses a generalized iterative scaling procedure. We discuss relations with certain classes of stochastic feedforward neural networks, a geometric interpretation for parameter estimation, and a simple example of a statistical language model constructed using this methodology. 1. Introduction Standard statistical approaches to speech and language processing problems use hidden Markov models, or more general probabilistic automata such as stochastic context-free grammars, taking advantage of their well-understood properties and efficient training algorithms. But such models are limited in their ability to incorporate contextual information and long-distance dependencies. Because of the Markov assumption, all predi...
Document Preprocessing For Naive Bayes Classification and Clustering with Mixture of Multinomials
Industry/Government Track Poster
The Naive Bayes classifier has long been used for text categorization tasks. Its sibling from the unsupervised world, the mixture of multinomial models, has likewise been successfully applied to text clustering problems. Despite the strong independence assumptions that these models make, their attractiveness comes from low computational cost, relatively low memory consumption, as well as the ability to handle heterogeneous features and multiple classes. Recently, there have been several attempts to improve the accuracy of Naive Bayes by performing heuristic feature transformations, such as IDF, normalization by the length of the documents and taking the logarithms of the counts. We justify the use of these techniques and apply them to two problems: classification of products in Yahoo! Shopping and clustering the vectors of collocated terms in user queries to Yahoo! Search. The experimental evaluation allows us to draw conclusions about the promise that these transformations carry with regard to alleviating the strong assumptions of the multinomial model.
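The three heuristic transformations named in this abstract (logarithms of counts, IDF, and length normalization) can be sketched as below. The abstract does not give formulas, so the specific choices here, log(1 + count), IDF as log(N / document frequency), and Euclidean length normalization, as well as the function name `transform_documents`, are assumptions chosen as commonly used forms and may differ from what the paper does.

```python
import math
from collections import Counter


def transform_documents(docs):
    """Apply log-count, IDF, and length normalization to tokenized documents.

    docs: list of documents, each a list of term strings.
    Returns one dict per document mapping term -> transformed weight.
    """
    counts = [Counter(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for c in counts for term in c)

    transformed = []
    for c in counts:
        # Dampen raw counts with a logarithm, then scale by IDF (assumed forms).
        weights = {term: math.log(1 + n) * math.log(n_docs / doc_freq[term])
                   for term, n in c.items()}
        # Normalize by the Euclidean length of the weighted document vector.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        transformed.append({term: (w / norm if norm > 0 else 0.0)
                            for term, w in weights.items()})
    return transformed


# Hypothetical toy corpus of two tokenized documents.
example_vectors = transform_documents([["a", "a", "b"], ["b", "c"]])
```

The transformed weights would then replace raw term counts as the input to a multinomial Naive Bayes classifier or a mixture-of-multinomials clustering model.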

