Results 1 - 10 of 15
Exponential Priors for Maximum Entropy Models
In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2003
Cited by 42 (0 self)
this paper. Finally, thanks to Stan Chen and Roni Rosenfeld: our derivation for Exponential priors closely follows the text of their derivation for Gaussian priors.
A maximum entropy approach to collaborative filtering in dynamic, sparse, high-dimensional domains
2002
Cited by 24 (6 self)
We develop a maximum entropy (maxent) approach to generating recommendations in the context of a user’s current navigation stream, suitable for environments where data is sparse, high-dimensional, and dynamic (conditions typical of many recommendation applications). We address sparsity and dimensionality reduction by first clustering items based on user access patterns so as to attempt to minimize the a priori probability that recommendations will cross cluster boundaries and then recommending only within clusters. We address the inherent dynamic nature of the problem by explicitly modeling the data as a time series; we show how this representational expressivity fits naturally into a maxent framework. We conduct experiments on data from ResearchIndex, a popular online repository of over 470,000 computer science documents. We show that our maxent formulation outperforms several competing algorithms in offline tests simulating the recommendation of documents to ResearchIndex users.
Ensembles of nested dichotomies for multi-class problems
In Proc. 21st International Conference on Machine Learning, 2004
Cited by 13 (2 self)
Nested dichotomies are a standard statistical technique for tackling certain polytomous classification problems with logistic regression. They can be represented as binary trees that recursively split a multi-class classification task into a system of dichotomies and provide a statistically sound way of applying two-class learning algorithms to multi-class problems (assuming these algorithms generate class probability estimates). However, there are usually many candidate trees for a given problem and in the standard approach the choice of a particular tree is based on domain knowledge that may not be available in practice. An alternative is to treat every system of nested dichotomies as equally likely and to form an ensemble classifier based on this assumption. We show that this approach produces more accurate classifications than applying C4.5 and logistic regression directly to multi-class problems. Our results also show that ensembles of nested dichotomies produce more accurate classifiers than pairwise classification if both techniques are used with C4.5, and comparable results for logistic regression. Compared to error-correcting output codes, they are preferable if logistic regression is used, and comparable in the case of C4.5. An additional benefit is that they generate class probability estimates. Consequently they appear to be a good general-purpose method for applying binary classifiers to multi-class problems.
Probabilistic user behavior models
In Proceedings of the IEEE International Conference on Data Mining, pp. 203–210, 2003
Cited by 12 (0 self)
We present a mixture model based approach for learning individualized behavior models for Web users. We investigate the use of maximum entropy and Markov mixture models for generating probabilistic behavior models. We first build a global behavior model for the entire population and then personalize this global model for the existing users by assigning each user individual component weights for the mixture model. We then use these individual weights to group the users into behavior model clusters. We show that the clusters generated in this manner are interpretable and able to represent dominant behavior patterns. We conduct offline experiments on around two months' worth of data from CiteSeer, an online digital library for computer science research papers currently storing more than 470,000 documents. We show that both maximum entropy and Markov based personal user behavior models are strong predictive models. We also show that the maximum entropy based mixture model outperforms Markov mixture models in recognizing complex user behavior patterns.
Hierarchical probabilistic neural network language model
AISTATS’05, 2005
Cited by 11 (1 self)
In recent years, variants of a neural network architecture for statistical language modeling have been proposed and successfully applied, e.g. in the language modeling component of speech recognizers. The main advantage of these architectures is that they learn an embedding for words (or other symbols) in a continuous space that helps to smooth the language model and provide good generalization even when the number of training examples is insufficient. However, these models are extremely slow in comparison to the more commonly used n-gram models, both for training and recognition. As an alternative to an importance sampling method proposed to speed up training, we introduce a hierarchical decomposition of the conditional probabilities that yields a speed-up of about 200 both during training and recognition. The hierarchical decomposition is a binary hierarchical clustering constrained by the prior knowledge extracted from the WordNet semantic hierarchy.
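The idea of a hierarchical decomposition can be illustrated with a small sketch. It is not the paper's model: it assumes a plain linear scorer at each tree node, and the four-word vocabulary, the tree, the node weights, and the context vector are all invented for illustration. The probability of a word is the product of the binary decisions along its path from the root, so scoring one word touches only the nodes on that path rather than every word in the vocabulary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def word_probability(path, context, node_vectors):
    """P(word | context) as a product of binary decisions along a tree path.

    path: list of (node_id, bit) pairs from root to leaf; bit 1 means "go right".
    context: list of floats (hypothetical context feature vector).
    node_vectors: dict mapping node_id to a weight list (hypothetical).
    """
    prob = 1.0
    for node_id, bit in path:
        score = sum(w * c for w, c in zip(node_vectors[node_id], context))
        p_right = sigmoid(score)
        prob *= p_right if bit == 1 else 1.0 - p_right
    return prob


# Balanced binary tree over 4 words: root "r", internal nodes "n0" and "n1".
paths = {
    "cat": [("r", 0), ("n0", 0)],
    "dog": [("r", 0), ("n0", 1)],
    "car": [("r", 1), ("n1", 0)],
    "bus": [("r", 1), ("n1", 1)],
}
node_vectors = {"r": [0.5, -1.0], "n0": [1.0, 0.2], "n1": [-0.3, 0.8]}
context = [0.4, 0.9]
distribution = {
    word: word_probability(path, context, node_vectors)
    for word, path in paths.items()
}
```

With a balanced tree the path length grows with the logarithm of the vocabulary size, which is where the saving over a flat output layer comes from; the leaf probabilities still sum to one by construction.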
Improving Classification When a Class Hierarchy is Available Using a Hierarchy-Based Prior
Bayesian Analysis, 2007
Cited by 10 (4 self)
We introduce a new method for building classification models when we have prior knowledge of how the classes can be arranged in a hierarchy, based on how easily they can be distinguished. The new method uses a Bayesian form of the multinomial logit (MNL, a.k.a. “softmax”) model, with a prior that introduces correlations between the parameters for classes that are nearby in the tree. We compare the performance on simulated data of the new method, the ordinary MNL model, and a model that uses the hierarchy in a different way. We also test the new method on a document labelling problem, and find that it performs better than the other methods, particularly when the amount of training data is small.
Collaborative Filtering with Maximum Entropy
IEEE Intelligent Systems, 2004
Cited by 7 (2 self)
We describe a novel maximum entropy (maxent) approach for generating online recommendations as a user navigates through a collection of documents. We show how to handle high-dimensional sparse data and represent it as a collection of ordered sequences of document requests. Our representation and the maxent approach have several advantages: (1) we can naturally model long-term interactions and dependencies in the data sequences; (2) we can query the model quickly once it is learned, which makes the method applicable to high-volume Web servers; and (3) we obtain empirically high quality recommendations. Although maxent learning is computationally infeasible if implemented in the straightforward way, we explore data clustering and several algorithmic techniques to make learning practical even in high dimensions. We present several methods for combining the predictions of maxent models learned in different clusters. We conduct offline tests using over six months' worth of data from ResearchIndex, a popular online repository of over 470,000 computer science documents. We show that our maxent algorithm is arguably one of the most accurate recommenders, as compared to such techniques as correlation, mixture of Markov models, mixture of multinomial models, individual similarity-based recommenders currently available on ResearchIndex, and even various combinations of current ResearchIndex recommenders.
Sequence Modeling with Mixtures of Conditional Maximum Entropy Distributions
In Proceedings of the Third IEEE International Conference on Data Mining (ICDM’03), 2003
Cited by 5 (1 self)
We present a novel approach to modeling sequences using mixtures of conditional maximum entropy distributions. Our method generalizes the mixture of first-order Markov models by including the "long-term" dependencies in model components. The "long-term" dependencies are represented by probabilistic triggers or rules of the kind frequently used in the natural language processing (NLP) domain (such as "A occurred k positions back ⇒ the current symbol is B with probability P"). The maximum entropy framework is then used to create a coherent probabilistic model from all triggers selected for modeling. In order to represent hidden or unobserved effects in the data we use probabilistic mixtures with maximum entropy models as components. We demonstrate how our mixture of conditional maximum entropy models can be learned from data using an EM algorithm that scales linearly in the dimensions of the data and the number of mixture components. We present empirical results on simulated and real-world data sets and demonstrate that the proposed approach enables us to create better quality models than the mixtures of first-order Markov models and to resist the overfitting and curse of dimensionality that would inevitably present themselves for higher order Markov models.
Reduction of Maximum Entropy Models to Hidden Markov Models
2001
Cited by 2 (0 self)
We show that maximum entropy models can be modeled with certain kinds of Hidden Markov Models (HMMs). This allows us to easily construct maximum entropy-style models with hidden variables, hidden state sequences, or other characteristics. The resulting models can be easily trained using standard algorithms with guaranteed locally, and in some cases globally, optimal parameter settings. We also give experimental results showing that a maximum entropy model with a hidden variable outperforms conventional techniques on subject-verb agreement.
Efficient subsampling for training complex language models
In Proceedings of EMNLP, 2011
Cited by 1 (0 self)
We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. Empirical results show that we can train MELM and NNLM at 1% to 5% of the standard complexity with no loss in performance.
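The binarization step described above can be sketched as follows. This is an illustrative reading, not the paper's procedure: the function name, the data layout, and in particular the 1/keep_rate weight attached to retained negatives (a standard importance-weighting correction) are assumptions, and the abstract does not say how the authors compensate for subsampling.

```python
import random


def binarize_with_subsampling(examples, target_word, keep_rate, seed=0):
    """Build a binary training set for one word's classifier.

    examples: list of (context, next_word) pairs.
    Returns a list of (context, label, weight). Every positive example is kept;
    each negative is kept with probability keep_rate and, as an assumption of
    this sketch, up-weighted by 1 / keep_rate.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    rng = random.Random(seed)
    binary_examples = []
    for context, word in examples:
        if word == target_word:
            binary_examples.append((context, 1, 1.0))
        elif rng.random() < keep_rate:
            binary_examples.append((context, 0, 1.0 / keep_rate))
    return binary_examples


# Tiny made-up corpus of (context, next word) pairs.
corpus = [
    (("the",), "cat"),
    (("a",), "dog"),
    (("the",), "cat"),
    (("my",), "bus"),
]
full_set = binarize_with_subsampling(corpus, "cat", keep_rate=1.0)
small_set = binarize_with_subsampling(corpus, "cat", keep_rate=0.5)
```

One such binary set would be built per vocabulary word; lowering keep_rate shrinks the negative portion of each set while leaving the positives untouched.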

