Results 1–10 of 97
Thumbs up? Sentiment Classification using Machine Learning Techniques
In Proceedings of EMNLP, 2002
Abstract

Cited by 703 (7 self)
We consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. Using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. However, the three machine learning methods we employed (Naive Bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. We conclude by examining factors that make the sentiment classification problem more challenging.
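The pipeline this abstract describes can be sketched in miniature. The tiny "reviews" below are invented for illustration, and this bag-of-words Naive Bayes with add-one smoothing is only one of the three classifiers the paper compares:

```python
import math
from collections import Counter

# Invented toy training data; real experiments used full movie reviews.
train = [
    ("a wonderful touching film with great acting", "pos"),
    ("brilliant and moving, a great story", "pos"),
    ("dull plot and terrible acting, a boring mess", "neg"),
    ("awful film, a waste of time and terrible", "neg"),
]

# Per-class word counts and class priors.
word_counts = {"pos": Counter(), "neg": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set(w for c in word_counts.values() for w in c)

def classify(text):
    """Pick the class maximizing log P(c) + sum log P(w|c), add-one smoothed."""
    best, best_score = None, float("-inf")
    for c in word_counts:
        score = math.log(class_counts[c] / sum(class_counts.values()))
        total = sum(word_counts[c].values())
        for w in text.split():
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

print(classify("great acting and a wonderful story"))  # pos
print(classify("boring and terrible"))                 # neg
```

On data this small the classifier only memorizes word-class associations; the paper's point is how such associations behave at scale on sentiment versus topic.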
Two decades of statistical language modeling: Where do we go from here
Proceedings of the IEEE, 2000
Abstract

Cited by 170 (1 self)
Statistical Language Models estimate the distribution of various natural language phenomena for the purpose of speech recognition and other language technologies. Since the first significant model was proposed in 1980, many attempts have been made to improve the state of the art. We review them here, point to a few promising directions, and argue for a Bayesian approach to integration of linguistic theories with data.

1. OUTLINE

Statistical language modeling (SLM) is the attempt to capture regularities of natural language for the purpose of improving the performance of various natural language applications. By and large, statistical language modeling amounts to estimating the probability distribution of various linguistic units, such as words, sentences, and whole documents. Statistical language modeling is crucial for a large variety of language technology applications. These include speech recognition (where SLM got its start), machine translation, document classification and routing, optical character recognition, information retrieval, handwriting recognition, spelling correction, and many more. In machine translation, for example, purely statistical approaches have been introduced in [1]. But even researchers using rule-based approaches have found it beneficial to introduce some elements of SLM and statistical estimation [2]. In information retrieval, a language modeling approach was recently proposed by [3], and a statistical/information-theoretical approach was developed by [4]. SLM employs statistical estimation techniques using language training data, that is, text. Because of the categorical nature of language, and the large vocabularies people naturally use, statistical techniques must estimate a large number of parameters, and consequently depend critically on the availability of large amounts of training data.
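The core estimation task described above, assigning probabilities to word sequences, can be illustrated with a minimal bigram model. The toy corpus, the add-one smoothing, and the `<s>`/`</s>` boundary markers are illustrative choices, not anything from the survey:

```python
import math
from collections import Counter

# Invented three-sentence corpus.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

unigrams = Counter()   # counts of each context word
bigrams = Counter()    # counts of adjacent word pairs
for line in corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(set(w for line in corpus for w in line.split()) | {"</s>"})

def prob(prev, word):
    """Add-one smoothed bigram probability P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_logprob(sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(prob(p, w)) for p, w in zip(tokens, tokens[1:]))

# A sentence made of observed bigrams scores higher than a shuffled one.
print(sentence_logprob("the cat sat on the mat"))
print(sentence_logprob("mat the on sat cat the"))
```

The survey's themes, smoothing, longer histories, and integrating linguistic structure, are all about doing better than this crude count-and-smooth baseline.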
Accurate information extraction from research papers using conditional random fields
In Proceedings of HLT-NAACL, 2004
Abstract

Cited by 144 (12 self)
With the increasing use of research paper search engines, such as CiteSeer, for both literature search and hiring decisions, the accuracy of such systems is of paramount importance. This paper employs Conditional Random Fields (CRFs) for the task of extracting various common fields from the headers and citations of research papers. The basic theory of CRFs is becoming well understood, but best practices for applying them to real-world data require additional exploration. This paper makes an empirical exploration of several factors, including variations on Gaussian, exponential, and hyperbolic-L1 priors for improved regularization, and several classes of features and Markov order. On a standard benchmark data set, we achieve new state-of-the-art performance, reducing error in average F1 by 36% and word error rate by 78% in comparison with the previous best SVM results. Accuracy compares even more favorably against HMMs.
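One concrete piece of such a field extractor is the decoding step shared by linear-chain CRFs and HMMs: given per-token label scores and label-transition scores, Viterbi finds the best label sequence. In this hedged sketch the tokens, labels, and all scores are hand-invented rather than learned, which is the part the paper's training and priors would supply:

```python
# Invented header tokens and scores; "Jane Doe" is a hypothetical author.
tokens = ["Accurate", "Extraction", "by", "Jane", "Doe"]
labels = ["TITLE", "AUTHOR"]
emission = {
    "Accurate":   {"TITLE": 2.0, "AUTHOR": 0.0},
    "Extraction": {"TITLE": 2.0, "AUTHOR": 0.0},
    "by":         {"TITLE": 0.5, "AUTHOR": 0.5},
    "Jane":       {"TITLE": 0.0, "AUTHOR": 2.0},
    "Doe":        {"TITLE": 0.0, "AUTHOR": 2.0},
}
transition = {("TITLE", "TITLE"): 1.0, ("TITLE", "AUTHOR"): 0.5,
              ("AUTHOR", "AUTHOR"): 1.0, ("AUTHOR", "TITLE"): 0.0}

def viterbi(tokens, labels, emission, transition):
    """Highest-scoring label sequence under additive emission+transition scores."""
    # best[i][y]: best score of any labeling of tokens[:i+1] ending in label y
    best = [{y: emission[tokens[0]][y] for y in labels}]
    back = [{}]
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda yp: best[i - 1][yp] + transition[(yp, y)])
            best[i][y] = best[i - 1][prev] + transition[(prev, y)] + emission[tokens[i]][y]
            back[i][y] = prev
    # Trace back from the best final label.
    y = max(labels, key=lambda yy: best[-1][yy])
    path = [y]
    for i in range(len(tokens) - 1, 0, -1):
        y = back[i][y]
        path.append(y)
    return list(reversed(path))

print(viterbi(tokens, labels, emission, transition))
```

The transition scores let the ambiguous token "by" inherit the TITLE label from its context, the same role the Markov structure plays in the paper's models.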
Boosting and Maximum Likelihood for Exponential Models
In Advances in Neural Information Processing Systems, 2001
Abstract

Cited by 85 (6 self)
Recent research has considered the relationship between boosting and more standard statistical methods, such as logistic regression, concluding that AdaBoost is similar but somehow still very different from statistical methods in that it minimizes a different loss function. In this paper we derive an equivalence between AdaBoost and the dual of a convex optimization problem. In this setting, it is seen that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels; the two methods minimize the same Kullback-Leibler divergence objective function subject to identical feature constraints. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets, comparing exponential loss and maximum likelihood for parallel and sequential update algorithms, confirm our theoretical analysis, indicating that AdaBoost and maximum likelihood typically yield identical results as the number of features increases to allow the models to fit the training data.
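The loss-function view above can be made concrete numerically: AdaBoost's reweighting is coordinate descent on the exponential loss sum_i exp(-y_i F(x_i)). This minimal sketch uses invented 1-D data, threshold stumps, and a fixed round count, and tracks that loss across rounds:

```python
import math

# Invented, non-separable 1-D dataset.
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [+1, +1, -1, -1, +1, +1]

def best_stump(weights):
    """Weak learner: threshold stump minimizing weighted training error."""
    best = None
    for t in (0.5, 1.5, 2.5, 3.5, 4.5):
        for s in (+1, -1):
            preds = [s if x < t else -s for x in X]
            err = sum(w for w, p, yi in zip(weights, preds, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, preds)
    return best

F = [0.0] * len(X)                        # running ensemble margin F(x_i)
weights = [1.0 / len(X)] * len(X)
losses = []
for _ in range(5):
    err, preds = best_stump(weights)
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))  # closed-form step size
    F = [f + alpha * p for f, p in zip(F, preds)]
    # w_i is proportional to exp(-y_i F(x_i)): each example's exponential loss.
    raw = [math.exp(-yi * f) for yi, f in zip(y, F)]
    losses.append(sum(raw))
    weights = [r / losses[-1] for r in raw]

print([round(l, 3) for l in losses])      # strictly decreasing
```

Each round multiplies the total exponential loss by 2*sqrt(err*(1-err)) < 1, which is exactly the descent property the paper's duality argument reinterprets as (unnormalized) maximum likelihood.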
Offline recognition of unconstrained handwritten texts using HMMs and statistical language models
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004
Abstract

Cited by 69 (9 self)
This paper presents a system for the offline recognition of large-vocabulary unconstrained handwritten texts. The only assumption made about the data is that it is written in English. This allows the application of Statistical Language Models in order to improve the performance of our system. Several experiments have been performed using both single- and multiple-writer data. Lexica of variable size (from 10,000 to 50,000 words) have been used. The use of language models is shown to improve the accuracy of the system (when the lexicon contains 50,000 words, the error rate is reduced by ~50% for single-writer data and by ~25% for multiple-writer data). Our approach is described in detail and compared with other methods presented in the literature to deal with the same problem. An experimental setup to correctly deal with unconstrained text recognition is proposed.
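How a language model "improves the accuracy of the system" typically comes down to combining an optical-model score with a weighted LM score when ranking hypotheses. A toy sketch in which all scores, the hypothesis set, and the interpolation weight are invented:

```python
# Invented per-hypothesis log-probabilities.
lm_logprob = {"the cat": -2.0, "the cot": -9.0, "tho cat": -12.0}

def combined_score(hypothesis, optical_logprob, lm_weight=0.8):
    """Optical score plus weighted LM score; unknown strings get a floor."""
    return optical_logprob + lm_weight * lm_logprob.get(hypothesis, -20.0)

# The optical model slightly prefers the misreading "the cot",
# but the language model overrides it.
hypotheses = {"the cot": -1.0, "the cat": -1.5, "tho cat": -2.0}
best = max(hypotheses, key=lambda h: combined_score(h, hypotheses[h]))
print(best)
```

The LM weight plays the role of the scaling factor such systems tune on held-out data; the paper's gains come from exactly this kind of rescoring at much larger vocabulary sizes.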
Statistical language model adaptation: review and perspectives
Speech Communication, 2004
Abstract

Cited by 62 (0 self)
Speech recognition performance is severely affected when the lexical, syntactic, or semantic characteristics of the discourse in the training and recognition tasks differ. The aim of language model adaptation is to exploit specific, albeit limited, knowledge about the recognition task to compensate for this mismatch. More generally, an adaptive language model seeks to maintain an adequate representation of the current task domain under changing conditions involving potential variations in vocabulary, syntax, content, and style. This paper presents an overview of the major approaches proposed to address this issue, and offers some perspectives regarding their comparative merits and associated tradeoffs.
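One of the simplest adaptation schemes covered by such surveys is linear interpolation of a task-specific model with a background model: P_adapt(w) = lam * P_task(w) + (1 - lam) * P_background(w). A sketch with invented probabilities and an assumed mixing weight `lam` (in practice tuned on held-out task data):

```python
# Invented unigram probabilities; the task model imagines medical dictation.
background = {"patient": 0.001, "the": 0.05, "treatment": 0.0005}
task = {"patient": 0.02, "the": 0.04, "treatment": 0.01}

def adapted_prob(word, lam=0.3):
    """Mixture of task and background estimates for one word."""
    return lam * task.get(word, 0.0) + (1 - lam) * background.get(word, 0.0)

# Domain words get boosted relative to the background model alone.
print(adapted_prob("patient"))
print(background["patient"])
```

The survey's more elaborate techniques (cache models, topic mixtures, MDI/MAP adaptation) can be read as increasingly sophisticated ways of choosing what to mix and how strongly.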
Exponential Priors for Maximum Entropy Models
In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2003
Abstract

Cited by 58 (0 self)
this paper. Finally, thanks to Stan Chen and Roni Rosenfeld: our derivation for Exponential priors closely follows the text of their derivation for Gaussian priors.
Performance guarantees for regularized maximum entropy density estimation
In Proceedings of the 17th Annual Conference on Computational Learning Theory, 2004
Abstract

Cited by 55 (8 self)
We consider the problem of estimating an unknown probability distribution from samples using the principle of maximum entropy (maxent). To alleviate overfitting with a very large number of features, we propose applying the maxent principle with relaxed constraints on the expectations of the features. By convex duality, this turns out to be equivalent to finding the Gibbs distribution minimizing a regularized version of the empirical log loss. We prove non-asymptotic bounds showing that, with respect to the true underlying distribution, this relaxed version of maxent produces density estimates that are almost as good as the best possible. These bounds are in terms of the deviation of the feature empirical averages relative to their true expectations, a number that can be bounded using standard uniform-convergence techniques. In particular, this leads to bounds that drop quickly with the number of samples, and that depend very moderately on the number or complexity of the features. We also derive and prove convergence for both sequential-update and parallel-update algorithms. Finally, we briefly describe experiments on data relevant to the modeling of species geographical distributions.
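The unrelaxed version of the estimation problem can be sketched directly: find the Gibbs distribution p(x) proportional to exp(sum_j lam_j f_j(x)) whose feature expectations match the empirical averages, via gradient ascent on the concave dual. The domain, features, samples, step size, and iteration count below are all invented, and this sketch omits the paper's relaxation/regularization:

```python
import math

domain = [0, 1, 2, 3]                                   # toy discrete domain
features = [lambda x: float(x >= 2), lambda x: float(x % 2 == 0)]
samples = [0, 2, 2, 3, 3, 3]                            # invented sample
emp = [sum(f(x) for x in samples) / len(samples) for f in features]

lam = [0.0, 0.0]
for _ in range(2000):
    # Current Gibbs distribution for the weights lam.
    weights = [math.exp(sum(l * f(x) for l, f in zip(lam, features))) for x in domain]
    Z = sum(weights)
    p = [w / Z for w in weights]
    # Dual gradient: empirical expectation minus model expectation.
    model = [sum(pi * f(x) for pi, x in zip(p, domain)) for f in features]
    lam = [l + 0.5 * (e - m) for l, e, m in zip(lam, emp, model)]

print([round(m, 3) for m in model], "target", [round(e, 3) for e in emp])
```

The paper's relaxed constraints would replace exact matching of `emp` with a tolerance box around it, which by the duality stated in the abstract corresponds to an L1-style penalty on `lam`.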
A maximum entropy approach to species distribution modeling
In Proceedings of the Twenty-First International Conference on Machine Learning, 2004
Abstract

Cited by 52 (7 self)
We study the problem of modeling species geographic distributions, a critical problem in conservation biology. We propose the use of maximum-entropy techniques for this problem, specifically, sequential-update algorithms that can handle a very large number of features. We describe experiments comparing maxent with a standard distribution-modeling tool, called GARP, on a dataset containing observation data for North American breeding birds. We also study how well maxent performs as a function of the number of training examples and training time, analyze the use of regularization to avoid overfitting when the number of examples is small, and explore the interpretability of models constructed using maxent.
The (non)utility of predicateargument frequencies for pronoun interpretation
In Proceedings of the 2004 North American Chapter of the Association for Computational Linguistics Annual Meeting, 2004
Abstract

Cited by 40 (1 self)
State-of-the-art pronoun interpretation systems rely predominantly on morphosyntactic contextual features. While the use of deep knowledge and inference to improve these models would appear technically infeasible, previous work has suggested that predicate-argument statistics mined from naturally occurring data could provide a useful approximation to such knowledge. We test this idea in several system configurations, and conclude from our results and subsequent error analysis that such statistics offer little or no predictive information above that provided by morphosyntax.
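The kind of predicate-argument statistic evaluated here can be sketched as a simple selectional-preference score: count (verb, object) pairs from parsed text and rank antecedent candidates by P(candidate | verb). The tiny "corpus" of parsed tuples below is invented:

```python
from collections import Counter

# Invented (verb, object) tuples standing in for parsed corpus data.
verb_object = Counter([
    ("drink", "coffee"), ("drink", "water"), ("drink", "coffee"),
    ("park", "car"), ("park", "truck"), ("drive", "car"),
])
verb_totals = Counter(v for v, _ in verb_object.elements())

def selectional_score(verb, candidate):
    """P(object = candidate | verb); zero if the verb was never observed."""
    if verb_totals[verb] == 0:
        return 0.0
    return verb_object[(verb, candidate)] / verb_totals[verb]

# "I poured the coffee into the cup and drank it."
# Which antecedent better fits the object slot of "drank"?
print(selectional_score("drink", "coffee"))  # 2/3
print(selectional_score("drink", "cup"))     # 0.0
```

The paper's negative result says that when such scores are added as features alongside morphosyntactic ones, they contribute little; sparsity of the mined counts is one plausible culprit the error analysis examines.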