A Model of Lexical Attraction and Repulsion
 In Proceedings of the ACL
, 1997
Abstract

Cited by 50 (8 self)
This paper introduces new methods based on exponential families for modeling the correlations between words in text and speech. While previous work assumed the effects of word co-occurrence statistics to be constant over a window of several hundred words, we show that their influence is nonstationary on a much smaller time scale. Empirical data drawn from English and Japanese text, as well as conversational speech, reveals that the "attraction" between words decays exponentially, while stylistic and syntactic constraints create a "repulsion" between words that discourages close co-occurrence. We show that these characteristics are well described by simple mixture models based on two-stage exponential distributions which can be trained using the EM algorithm. The resulting distance distributions can then be incorporated as penalizing features in an exponential language model.
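The mixture estimation the abstract describes can be sketched as a two-component EM fit: an exponential component for the short-range "attraction" and an Erlang-2 ("two-stage exponential") component whose density vanishes at distance zero, mimicking "repulsion". This is an illustrative sketch, not the paper's implementation; all names are ours and the data is synthetic.

```python
import math
import random

def gamma_pdf(d, k, rate):
    # Gamma density with integer shape k.  k=1 is a plain exponential;
    # k=2 (sum of two exponentials) is zero at d=0, creating "repulsion".
    return rate**k * d**(k - 1) * math.exp(-rate * d) / math.factorial(k - 1)

def em_two_stage_mixture(distances, iters=200):
    """EM for a mixture of an exponential (k=1) and a two-stage
    exponential (k=2) component.  Sketch only."""
    w, r1, r2 = 0.5, 1.0, 1.0          # mixing weight and component rates
    for _ in range(iters):
        # E-step: responsibility of the k=1 component for each distance
        resp = []
        for d in distances:
            p1 = w * gamma_pdf(d, 1, r1)
            p2 = (1 - w) * gamma_pdf(d, 2, r2)
            resp.append(p1 / (p1 + p2))
        # M-step: closed-form rate updates for gamma with fixed shape
        n1 = sum(resp)
        n2 = len(distances) - n1
        w = n1 / len(distances)
        r1 = n1 / sum(g * d for g, d in zip(resp, distances))
        r2 = 2 * n2 / sum((1 - g) * d for g, d in zip(resp, distances))
    return w, r1, r2

random.seed(0)
# Synthetic inter-word distances: a short-range exponential component
# plus a longer-range two-stage component.
data = [random.expovariate(2.0) for _ in range(500)] + \
       [random.expovariate(0.5) + random.expovariate(0.5) for _ in range(500)]
w, r1, r2 = em_two_stage_mixture(data)
```

After convergence the exponential component captures the short distances (larger rate r1), the two-stage component the longer ones, matching the attraction/repulsion picture above.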
Using Unlabeled Data to Improve Text Classification
, 2001
Abstract

Cited by 49 (0 self)
One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data, labeled and unlabeled. These generative models do not capture all the intricacies of text; however, on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse. Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.
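The basic labeled-plus-unlabeled EM procedure can be sketched with a multinomial naive Bayes model: initialize from the labeled documents, then alternate soft-labeling the unlabeled documents (E-step) and retraining on everything (M-step). A toy sketch under our own naming, not the dissertation's code:

```python
import math
from collections import Counter

def train_nb(docs_with_weights, vocab, n_classes):
    """Multinomial naive Bayes from (doc, class-weight-vector) pairs.
    Weights may be soft labels, as in the EM procedure above."""
    prior = [1.0] * n_classes                      # Laplace smoothing
    counts = [Counter() for _ in range(n_classes)]
    for doc, weights in docs_with_weights:
        for c, wgt in enumerate(weights):
            prior[c] += wgt
            for word in doc:
                counts[c][word] += wgt
    total = [sum(counts[c].values()) + len(vocab) for c in range(n_classes)]
    def posterior(doc):
        scores = [math.log(prior[c]) +
                  sum(math.log((counts[c][w] + 1) / total[c]) for w in doc)
                  for c in range(n_classes)]
        z = max(scores)
        ps = [math.exp(s - z) for s in scores]
        return [p / sum(ps) for p in ps]
    return posterior

def em_nb(labeled, unlabeled, vocab, n_classes, iters=5):
    # Initialize from labeled data only, then alternate:
    # E-step: soft-label unlabeled docs; M-step: retrain on all data.
    hard = [(d, [1.0 if c == y else 0.0 for c in range(n_classes)])
            for d, y in labeled]
    clf = train_nb(hard, vocab, n_classes)
    for _ in range(iters):
        soft = [(d, clf(d)) for d in unlabeled]
        clf = train_nb(hard + soft, vocab, n_classes)
    return clf

labeled = [(["ball", "score"], 0), (["vote", "law"], 1)]
unlabeled = [["ball", "goal"], ["goal", "score"],
             ["law", "senate"], ["vote", "senate"]]
vocab = {"ball", "score", "goal", "vote", "law", "senate"}
clf = em_nb(labeled, unlabeled, vocab, 2)
probs = clf(["goal"])   # "goal" never appears in the labeled data
```

Note that "goal" is classified correctly only because unlabeled documents tie it to the labeled sports vocabulary, which is the effect the abstract describes.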
Additive Models, Boosting, and Inference for Generalized Divergences
 In Proc. 12th Annu. Conf. on Comput. Learning Theory
, 1999
Abstract

Cited by 39 (3 self)
We present a framework for designing incremental learning algorithms derived from generalized entropy functionals. Our approach is based on the use of Bregman divergences together with the associated class of additive models constructed using the Legendre transform. A particular one-parameter family of Bregman divergences is shown to yield a family of loss functions that includes the log-likelihood criterion of logistic regression as a special case, and that closely approximates the exponential loss criterion used in the AdaBoost algorithms of Schapire et al. as the natural parameter of the family varies. We also show how the quadratic approximation of the gain in Bregman divergence results in a weighted least-squares criterion. This leads to a family of incremental learning algorithms that builds upon and extends the recent interpretation of boosting in terms of additive models proposed by Friedman, Hastie, and Tibshirani.
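The common definition underlying this framework is the Bregman divergence D_φ(p, q) = φ(p) − φ(q) − φ′(q)(p − q) for a convex generator φ. A minimal numeric check of two standard special cases (our own illustration, not the paper's code): φ(x) = x² recovers squared error, and φ(x) = x log x − x recovers the unnormalized KL divergence.

```python
import math

def bregman(phi, dphi, p, q):
    # D_phi(p, q) = phi(p) - phi(q) - phi'(q) * (p - q)
    return phi(p) - phi(q) - dphi(q) * (p - q)

# phi(x) = x^2 gives squared Euclidean distance: D(3, 1) = (3 - 1)^2
d_sq = bregman(lambda x: x * x, lambda x: 2 * x, 3.0, 1.0)

# phi(x) = x log x - x gives unnormalized KL:
# D(p, q) = p log(p/q) - p + q, here 2 log 2 - 1
kl = bregman(lambda x: x * math.log(x) - x, math.log, 2.0, 1.0)
```

Varying the generator φ is exactly what produces the one-parameter family of losses interpolating between logistic and exponential criteria described above.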
Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications, manuscript, available at wwwstat.wharton.upenn.edu/~buja
, 2005
Abstract

Cited by 33 (1 self)
What are the natural loss functions or fitting criteria for binary class probability estimation? This question has a simple answer: so-called “proper scoring rules”, that is, functions that score probability estimates in view of data in a Fisher-consistent manner. Proper scoring rules comprise most loss functions currently in use: log-loss, squared error loss, boosting loss, and, as limiting cases, cost-weighted misclassification losses. Proper scoring rules have a rich structure:
• Every proper scoring rule is a mixture (limit of sums) of cost-weighted misclassification losses. The mixture is specified by a weight function (or measure) that describes which misclassification cost weights are most emphasized by the proper scoring rule.
• Proper scoring rules permit Fisher scoring and Iteratively Reweighted LS algorithms for model fitting. The weights are derived from a link function and the above weight function.
• Proper scoring rules are in a one-to-one correspondence with information measures for tree-based classification.
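Properness means the expected loss is minimized by forecasting the true probability. A small numeric check of this defining property for the two most familiar rules, log-loss and squared error (Brier score); the grid search is ours, purely for illustration:

```python
import math

def expected_score(loss, p, q):
    # Expected loss of forecast q when the true positive rate is p.
    return p * loss(1, q) + (1 - p) * loss(0, q)

log_loss = lambda y, q: -math.log(q if y == 1 else 1 - q)
brier = lambda y, q: (y - q) ** 2

p = 0.3                                  # true positive rate
grid = [i / 100 for i in range(1, 100)]  # candidate forecasts
best_log = min(grid, key=lambda q: expected_score(log_loss, p, q))
best_brier = min(grid, key=lambda q: expected_score(brier, p, q))
# Both proper rules are minimized at the true probability q = p.
```

A cost-weighted misclassification loss, by contrast, is minimized anywhere on the correct side of its cost threshold, which is why proper rules arise only as mixtures of them.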
Statistical Learning Algorithms Based on Bregman Distances
, 1997
Abstract

Cited by 23 (1 self)
We present a class of statistical learning algorithms formulated in terms of minimizing Bregman distances, a family of generalized entropy measures associated with convex functions. The inductive learning scheme is akin to growing a decision tree, with the Bregman distance filling the role of the impurity function in tree-based classifiers. Our approach is based on two components. In the feature selection step, each linear constraint in a pool of candidate features is evaluated by the reduction in Bregman distance that would result from adding it to the model. In the constraint satisfaction step, all of the parameters are adjusted to minimize the Bregman distance subject to the chosen constraints. We introduce a new iterative estimation algorithm for carrying out both the feature selection and constraint satisfaction steps, and outline a proof of the convergence of these algorithms.
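The feature selection step can be sketched for the KL (maximum entropy) case: with a binary feature and a uniform base model over a finite outcome space, the log-likelihood gain from adding the feature is the KL divergence between the empirical and uniform distributions of the feature's value. The outcome space, data, and candidate features below are invented for illustration:

```python
import math

def gain(samples, space, feature):
    # Per-sample log-likelihood gain from adding one binary feature to
    # the uniform model over `space`: KL between the empirical and
    # uniform distributions of the feature value.
    e1 = sum(feature(x) for x in samples) / len(samples)
    u1 = sum(feature(x) for x in space) / len(space)
    g = 0.0
    for e, u in ((e1, u1), (1 - e1, 1 - u1)):
        if e > 0:
            g += e * math.log(e / u)
    return g

space = list(range(8))               # small outcome space
samples = [0, 0, 1, 1, 2, 3, 0, 1]   # empirical data
candidates = {
    "is_small": lambda x: 1 if x < 4 else 0,
    "is_even":  lambda x: 1 if x % 2 == 0 else 0,
}
# Greedy selection: pick the candidate with the largest gain.
best = max(candidates, key=lambda name: gain(samples, space, candidates[name]))
```

Here "is_small" is selected because all samples fall below 4, while "is_even" matches the uniform model exactly and yields zero gain.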
Philosophy and the practice of Bayesian statistics
, 2010
Abstract

Cited by 13 (5 self)
A substantial school in the philosophy of science identifies Bayesian inference with inductive inference and even rationality as such, and seems to be strengthened by the rise and practical success of Bayesian statistics. We argue that the most successful forms of Bayesian statistics do not actually support that particular philosophy but rather accord much better with sophisticated forms of hypothetico-deductivism. We examine the actual role played by prior distributions in Bayesian models, and the crucial aspects of model checking and model revision, which fall outside the scope of Bayesian confirmation theory. We draw on the literature on the consistency of Bayesian updating and also on our experience of applied work in social science. Clarity about these matters should benefit not just philosophy of science, but also statistical practice. At best, the inductivist view has encouraged researchers to fit and compare models without checking them; at worst, theorists have actively discouraged practitioners from performing model checking because it does not fit into their framework.
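The model checking emphasized here is concrete statistical practice; one standard form is the posterior predictive check. A minimal sketch for a Beta-Binomial coin model (the example, prior, and test statistic are our choices, not the paper's):

```python
import random
random.seed(1)

data = [1] * 9 + [0] * 1          # observed: 9 heads in 10 flips
heads, n = sum(data), len(data)

# Beta(1,1) prior -> Beta(1 + heads, 1 + n - heads) posterior.
def draw_replicate():
    theta = random.betavariate(1 + heads, 1 + n - heads)
    return sum(1 for _ in range(n) if random.random() < theta)

# Test statistic: number of heads.  The posterior predictive p-value is
# the fraction of replicated data sets at least as extreme as observed.
reps = [draw_replicate() for _ in range(5000)]
p_value = sum(1 for r in reps if r >= heads) / len(reps)
```

An extreme p-value would flag a misfit and prompt model revision, the hypothetico-deductive step the authors argue falls outside Bayesian confirmation theory.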
Combining statistical language models via the latent maximum entropy principle
 Machine Learning
, 2005
Abstract

Cited by 3 (2 self)
We present a unified probabilistic framework for statistical language modeling which can simultaneously incorporate various aspects of natural language, such as local word interaction, syntactic structure and semantic document information. Our approach is based on a recent statistical inference principle we have proposed, the latent maximum entropy principle, which allows relationships over hidden features to be effectively captured in a unified model. Our work extends previous research on maximum entropy methods for language modeling, which only allow observed features to be modeled. The ability to conveniently incorporate hidden variables allows us to extend the expressiveness of language models while alleviating the necessity of preprocessing the data to obtain explicitly observed features. We describe efficient algorithms for marginalization, inference and normalization in our extended models. We then use these techniques to combine two standard forms of language models: local lexical models (Markov N-gram models) and global document-level semantic models (probabilistic latent semantic analysis). Our experimental results on the Wall Street Journal corpus show that we obtain an 18.5% reduction in perplexity compared to the baseline trigram model with Good-Turing smoothing. Keywords: language modeling, N-gram models, latent semantic analysis, maximum entropy, latent variables
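The evaluation metric behind the reported 18.5% figure is perplexity. A minimal sketch of computing it for a smoothed unigram model and a simple two-component mixture; note this linear interpolation is only a stand-in for illustration, not the latent maximum entropy combination the paper actually uses, and the toy corpus is ours:

```python
import math
from collections import Counter

def perplexity(prob, text):
    # Perplexity = exp of the average negative log-probability per token.
    return math.exp(-sum(math.log(prob(w)) for w in text) / len(text))

train = "the cat sat on the mat the cat ran".split()
test = "the cat sat".split()
vocab = set(train)

counts = Counter(train)
# Add-one smoothed unigram model (the "local lexical" component).
unigram = lambda w: (counts[w] + 1) / (len(train) + len(vocab))
# A second component; here just uniform, standing in for a
# document-level semantic model.
uniform = lambda w: 1 / len(vocab)
mixture = lambda w: 0.7 * unigram(w) + 0.3 * uniform(w)

pp_uni = perplexity(unigram, test)
pp_mix = perplexity(mixture, test)
```

Lower perplexity means the model assigns higher probability to held-out text; comparing pp_uni and pp_mix on a real corpus is how such reductions are measured.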
A System For Multilingual Sentiment Learning On Large Data Sets
Abstract
Classifying documents according to the sentiment they convey (whether positive or negative) is an important problem in computational linguistics. There has not been much work done in this area on general techniques that can be applied effectively to multiple languages, nor have very large data sets been used in empirical studies of sentiment classifiers. We present an empirical study of the effectiveness of several sentiment classification algorithms when applied to nine languages (including Germanic, Romance, and East Asian languages). The algorithms are implemented as part of a system that can be applied to multilingual data. We trained and tested the system on a data set that is substantially larger than that typically encountered in the literature. We also consider a generalization of the n-gram model and a variant that reduces memory consumption, and evaluate their effectiveness.
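A sketch of the feature side of such a system: extract word n-grams up to a fixed order, and bound memory with feature hashing. The hashing trick is our assumed example of a memory-reducing variant; the abstract does not specify which variant the system uses.

```python
def ngrams(tokens, n):
    # All contiguous word n-grams of length 1..n.
    feats = []
    for k in range(1, n + 1):
        for i in range(len(tokens) - k + 1):
            feats.append(" ".join(tokens[i:i + k]))
    return feats

def hashed_features(tokens, n, buckets=2**10):
    # Feature hashing keeps the model size fixed regardless of how many
    # distinct n-grams a large multilingual corpus contains.
    vec = [0] * buckets
    for f in ngrams(tokens, n):
        vec[hash(f) % buckets] += 1
    return vec

tokens = "not a good movie".split()
feats = ngrams(tokens, 2)          # 4 unigrams + 3 bigrams
vec = hashed_features(tokens, 2)
```

Bigrams such as "not a" are what let a sentiment classifier capture negation that unigrams alone miss, which is one motivation for generalizing beyond the unigram model.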
A Co-classification Approach to Learning from
Abstract
We address the problem of learning text categorization from a corpus of multilingual documents. We propose a multi-view learning, co-regularization approach, in which we consider each language as a separate source, and minimize a joint loss that combines monolingual classification losses in each language while ensuring consistency of the categorization across languages. We derive training algorithms for logistic regression and boosting, and show that the resulting categorizers outperform models trained independently on each language and even, most of the time, models trained on the joint bilingual data. Experiments are carried out on a multilingual extension of the RCV2 corpus, which will be made available for benchmarking.
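The joint objective described here can be sketched for the logistic regression case: each language's log-loss plus a penalty on disagreement between the two views' predictions on aligned document pairs. The exact penalty form and all names below are our illustration, not the paper's formulation:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def joint_loss(w1, w2, pairs, lam=1.0):
    """Co-regularized objective: monolingual logistic losses plus a
    squared-difference agreement penalty across the two language views."""
    total = 0.0
    for x1, x2, y in pairs:             # x1, x2: features in each language
        p1 = sigmoid(sum(a * b for a, b in zip(w1, x1)))
        p2 = sigmoid(sum(a * b for a, b in zip(w2, x2)))
        total += -math.log(p1 if y else 1 - p1)   # language-1 loss
        total += -math.log(p2 if y else 1 - p2)   # language-2 loss
        total += lam * (p1 - p2) ** 2             # cross-view agreement
    return total

# Two aligned document pairs with toy 2-dimensional features per view.
pairs = [([1.0, 0.0], [0.0, 1.0], 1),
         ([0.0, 1.0], [1.0, 0.0], 0)]
agree = joint_loss([2.0, -2.0], [-2.0, 2.0], pairs)     # views consistent
disagree = joint_loss([2.0, -2.0], [2.0, -2.0], pairs)  # views conflict
```

Minimizing this joint loss pushes the per-language models toward weight settings like the first pair, where both views predict consistently, which is the consistency constraint the abstract describes.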