Results 1–10 of 41
Hierarchical probabilistic neural network language model
 In AISTATS
, 2005
"... In recent years, variants of a neural network architecture for statistical language modeling have been proposed and successfully applied, e.g. in the language modeling component of speech recognizers. The main advantage of these architectures is that they learn an embedding for words (or other symbo ..."
Abstract

Cited by 94 (4 self)
In recent years, variants of a neural network architecture for statistical language modeling have been proposed and successfully applied, e.g. in the language modeling component of speech recognizers. The main advantage of these architectures is that they learn an embedding for words (or other symbols) in a continuous space that helps to smooth the language model and provide good generalization even when the number of training examples is insufficient. However, these models are extremely slow in comparison to the more commonly used n-gram models, both for training and recognition. As an alternative to an importance sampling method proposed to speed up training, we introduce a hierarchical decomposition of the conditional probabilities that yields a speedup of about 200 both during training and recognition. The hierarchical decomposition is a binary hierarchical clustering constrained by the prior knowledge extracted from the WordNet semantic hierarchy.
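The hierarchical decomposition described above replaces one large softmax over the vocabulary with a chain of binary decisions down a tree, which is where the roughly 200x speedup comes from (O(log V) decisions instead of O(V) scores). The sketch below illustrates the idea only; the function and variable names are assumptions, not the paper's code, and the tree here is any binary hierarchy over the vocabulary (the paper constrains it with WordNet).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_word_prob(h, path):
    """P(word | context) as a product of binary branching decisions along
    the word's path in a binary hierarchy over the vocabulary.

    `path` is a list of (node_vector, go_left) pairs for the word's
    root-to-leaf path; `h` is the context representation.
    Illustrative sketch -- names are assumptions, not the paper's code.
    """
    p = 1.0
    for node_vec, go_left in path:
        q = sigmoid(np.dot(node_vec, h))   # P(branch left | node, context)
        p *= q if go_left else (1.0 - q)
    return p
```

Because each internal node's left/right probabilities sum to one, the leaf probabilities automatically form a normalized distribution without ever summing over the full vocabulary.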
Exponential Priors for Maximum Entropy Models
 In Proceedings of the Annual Meeting of the Association for Computational Linguistics
, 2003
"... ..."
A bit of progress in language modeling — extended version
, 2001
"... 1.1 Overview Language modeling is the art of determining the probability of a sequence of words. This is useful in a large variety of areas including speech recognition, ..."
Abstract

Cited by 59 (1 self)
1.1 Overview Language modeling is the art of determining the probability of a sequence of words. This is useful in a large variety of areas including speech recognition, ...
Strategies for Training Large Scale Neural Network Language Models
"... Abstract—We describe how to effectively train neural network based language models on large data sets. Fast convergence during training and better overall performance is observed when the training data are sorted by their relevance. We introduce hashbased implementation of a maximum entropy model, ..."
Abstract

Cited by 41 (4 self)
Abstract—We describe how to effectively train neural network based language models on large data sets. Fast convergence during training and better overall performance is observed when the training data are sorted by their relevance. We introduce a hash-based implementation of a maximum entropy model that can be trained as a part of the neural network model. This leads to a significant reduction of computational complexity. We achieved around 10% relative reduction of word error rate on an English Broadcast News speech recognition task, against a large 4-gram model trained on 400M tokens.
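The hash-based maximum entropy model mentioned above works by mapping the unbounded set of n-gram histories into a fixed-size weight array via hashing, so the maxent "direct connections" stay memory-bounded no matter how much data is seen. The following is a minimal sketch of that hashing idea only; the function names and the choice of CRC32 as the hash are assumptions, not the authors' implementation.

```python
import zlib

def hashed_ngram_features(history, hash_size):
    """Map each n-gram suffix of the word history to an index in a
    fixed-size table via hashing, so maximum-entropy weights fit in one
    flat array. Sketch only; names and hash choice are assumptions."""
    idxs = []
    for n in range(1, len(history) + 1):
        ngram = " ".join(history[-n:])
        idxs.append(zlib.crc32(ngram.encode("utf-8")) % hash_size)
    return idxs

def maxent_score(history, weights, hash_size):
    # Direct (maximum-entropy) connections: sum the hashed feature weights.
    return sum(weights[i] for i in hashed_ngram_features(history, hash_size))
```

Hash collisions trade a small amount of accuracy for a hard memory bound, which is what makes training on hundreds of millions of tokens practical.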
Probabilistic user behavior models
 In: Proceedings of the IEEE International Conference on Data Mining. (2003) 203–210
, 2003
"... We present a mixture model based approach for learning individualized behavior models for the Web users. We investigate the use of maximum entropy and Markov mixture models for generating probabilistic behavior models. We first build a global behavior model for the entire population and then persona ..."
Abstract

Cited by 40 (0 self)
We present a mixture model based approach for learning individualized behavior models for Web users. We investigate the use of maximum entropy and Markov mixture models for generating probabilistic behavior models. We first build a global behavior model for the entire population and then personalize this global model for the existing users by assigning each user individual component weights for the mixture model. We then use these individual weights to group the users into behavior model clusters. We show that the clusters generated in this manner are interpretable and able to represent dominant behavior patterns. We conduct offline experiments on around two months' worth of data from CiteSeer, an online digital library for computer science research papers currently storing more than 470,000 documents. We show that both maximum entropy and Markov based personal user behavior models are strong predictive models. We also show that the maximum entropy based mixture model outperforms Markov mixture models in recognizing complex user behavior patterns.
A maximum entropy approach to collaborative filtering in dynamic, sparse, high-dimensional domains
, 2002
"... We develop a maximum entropy (maxent) approach to generating recommendations in the context of a user’s current navigation stream, suitable for environments where data is sparse, highdimensional, and dynamic—conditions typical of many recommendation applications. We address sparsity and dimensionali ..."
Abstract

Cited by 38 (6 self)
We develop a maximum entropy (maxent) approach to generating recommendations in the context of a user’s current navigation stream, suitable for environments where data is sparse, high-dimensional, and dynamic—conditions typical of many recommendation applications. We address sparsity and dimensionality reduction by first clustering items based on user access patterns so as to attempt to minimize the a priori probability that recommendations will cross cluster boundaries and then recommending only within clusters. We address the inherent dynamic nature of the problem by explicitly modeling the data as a time series; we show how this representational expressivity fits naturally into a maxent framework. We conduct experiments on data from ResearchIndex, a popular online repository of over 470,000 computer science documents. We show that our maxent formulation outperforms several competing algorithms in offline tests simulating the recommendation of documents to ResearchIndex users.
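At the core of a conditional maxent recommender like the one above, the probability of recommending document d given the navigation history is proportional to exp of the summed weights of the features that fire for (history, d). A minimal sketch of that scoring, where the feature function and weight names are illustrative assumptions rather than the paper's formulation:

```python
import math

def maxent_probs(history, candidates, features, weights):
    """Conditional maximum-entropy model:
        P(d | history) ∝ exp(sum of weights of features firing for (history, d)).

    `features(history, d)` returns the active feature names; `weights` maps
    feature names to learned weights. Sketch only; names are assumptions."""
    scores = {d: sum(weights.get(f, 0.0) for f in features(history, d))
              for d in candidates}
    z = sum(math.exp(s) for s in scores.values())   # partition function
    return {d: math.exp(s) / z for d, s in scores.items()}
```

The time-series modeling in the abstract enters through the feature functions, e.g. features that pair the most recent requests in the history with the candidate document.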
Ensembles of nested dichotomies for multiclass problems
 In Proc. 21st International Conference on Machine Learning
, 2004
"... Nested dichotomies are a standard statistical technique for tackling certain polytomous classification problems with logistic regression. They can be represented as binary trees that recursively split a multiclass classification task into a system of dichotomies and provide a statistically sound wa ..."
Abstract

Cited by 31 (5 self)
Nested dichotomies are a standard statistical technique for tackling certain polytomous classification problems with logistic regression. They can be represented as binary trees that recursively split a multiclass classification task into a system of dichotomies and provide a statistically sound way of applying two-class learning algorithms to multiclass problems (assuming these algorithms generate class probability estimates). However, there are usually many candidate trees for a given problem and in the standard approach the choice of a particular tree is based on domain knowledge that may not be available in practice. An alternative is to treat every system of nested dichotomies as equally likely and to form an ensemble classifier based on this assumption. We show that this approach produces more accurate classifications than applying C4.5 and logistic regression directly to multiclass problems. Our results also show that ensembles of nested dichotomies produce more accurate classifiers than pairwise classification if both techniques are used with C4.5, and comparable results for logistic regression. Compared to error-correcting output codes, they are preferable if logistic regression is used, and comparable in the case of C4.5. An additional benefit is that they generate class probability estimates. Consequently they appear to be a good general-purpose method for applying binary classifiers to multiclass problems.
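In a single nested dichotomy, the class probability estimate is the product of the two-class probability estimates along the path from the root to the class; the ensemble then averages these over many random trees. The sketch below shows only the single-tree recursion; the dictionary-based tree layout and field names are illustrative assumptions, not the paper's code.

```python
def nd_class_prob(node, x, target):
    """P(target | x) in one nested dichotomy: multiply the two-class
    probability estimates along the root-to-leaf path for `target`.

    Each internal node holds a two-class model and left/right subtrees;
    leaves hold a single class. Tree layout is an illustrative assumption."""
    if node["classes"] == {target}:
        return 1.0
    p_left = node["model"](x)   # two-class estimate P(left class subset | x)
    if target in node["left"]["classes"]:
        return p_left * nd_class_prob(node["left"], x, target)
    return (1.0 - p_left) * nd_class_prob(node["right"], x, target)
```

Because the two branch probabilities at each node sum to one, the leaf probabilities of any single tree form a proper distribution over the classes, which is the "statistically sound" property the abstract refers to.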
Improving Classification When a Class Hierarchy is Available Using a Hierarchy-Based Prior
 Bayesian Analysis
, 2007
"... Abstract. We introduce a new method for building classification models when we have prior knowledge of how the classes can be arranged in a hierarchy, based on how easily they can be distinguished. The new method uses a Bayesian form of the multinomial logit (MNL, a.k.a. “softmax”) model, with a pri ..."
Abstract

Cited by 21 (4 self)
Abstract. We introduce a new method for building classification models when we have prior knowledge of how the classes can be arranged in a hierarchy, based on how easily they can be distinguished. The new method uses a Bayesian form of the multinomial logit (MNL, a.k.a. “softmax”) model, with a prior that introduces correlations between the parameters for classes that are nearby in the tree. We compare the performance on simulated data of the new method, the ordinary MNL model, and a model that uses the hierarchy in a different way. We also test the new method on a document labelling problem, and find that it performs better than the other methods, particularly when the amount of training data is small.
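One standard way to realize such a correlation-inducing prior, consistent with the abstract's description, is to express each class's coefficient vector as the sum of independent Gaussian effects attached to the nodes on its root-to-leaf path: classes that share ancestors then share terms and are correlated a priori. This sketch assumes that construction; all names are illustrative, and the exact prior in the paper may differ in detail.

```python
import numpy as np

def class_coefficients(class_paths, node_effects):
    """Hierarchy-based prior, sketched: a class's coefficient vector is the
    sum of per-node effect vectors along its root-to-leaf path, so classes
    sharing ancestors are correlated a priori. Names are assumptions."""
    return {c: sum(node_effects[n] for n in path)
            for c, path in class_paths.items()}

def mnl_probs(x, betas):
    """Multinomial logit (softmax) over the class scores beta_c . x."""
    scores = {c: float(b @ x) for c, b in betas.items()}
    m = max(scores.values())                     # stabilize the exponentials
    exps = {c: np.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}
```

With little training data, the shared ancestor effects let observations for one class inform the parameters of its siblings, which is why the hierarchy helps most in the small-data regime the abstract highlights.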
LSTM Neural Networks for Language Modeling
"... Neural networks have become increasingly popular for the task of language modeling. Whereas feedforward networks only exploit a fixed context length to predict the next word of a sequence, conceptually, standard recurrent neural networks can take into account all of the predecessor words. On the ot ..."
Abstract

Cited by 14 (1 self)
Neural networks have become increasingly popular for the task of language modeling. Whereas feedforward networks only exploit a fixed context length to predict the next word of a sequence, conceptually, standard recurrent neural networks can take into account all of the predecessor words. On the other hand, it is well known that recurrent networks are difficult to train and therefore are unlikely to show the full potential of recurrent models. These problems are addressed by the Long Short-Term Memory neural network architecture. In this work, we analyze this type of network on an English and a large French language modeling task. Experiments show improvements of about 8% relative in perplexity over standard recurrent neural network LMs. In addition, we gain considerable improvements in WER on top of a state-of-the-art speech recognition system. Index Terms: language modeling, recurrent neural networks, LSTM neural networks
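The LSTM's answer to the training difficulties of standard recurrent networks is a gated cell state that can carry information across many time steps. The following is a sketch of one step of a standard LSTM cell in the usual formulation; it illustrates the general architecture the paper analyzes, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell. W has shape (4H, len(x) + H),
    packing the input (i), forget (f), output (o), and candidate (g)
    blocks; H is the hidden size. Sketch of the standard formulation."""
    z = W @ np.concatenate([x, h_prev]) + b
    H = h_prev.size
    i, f, o, g = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # gated cell update
    h = sigmoid(o) * np.tanh(c)                        # exposed hidden state
    return h, c
```

For language modeling, `h` at each step feeds a softmax over the vocabulary to predict the next word, while `c` carries longer-range context than a plain recurrent hidden state can.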
Collaborative Filtering with Maximum Entropy
 IEEE Intelligent Systems
, 2004
"... We describe a novel maximum entropy (maxent) approach for generating online recommendations as a user navigates through a collection of documents. We show how to handle highdimensional sparse data and represent it as a collection of ordered sequences of document requests. Our representation and the ..."
Abstract

Cited by 11 (3 self)
We describe a novel maximum entropy (maxent) approach for generating online recommendations as a user navigates through a collection of documents. We show how to handle high-dimensional sparse data and represent it as a collection of ordered sequences of document requests. Our representation and the maxent approach have several advantages: (1) we can naturally model long-term interactions and dependencies in the data sequences; (2) we can query the model quickly once it is learned, which makes the method applicable to high-volume Web servers; and (3) we obtain empirically high quality recommendations. Although maxent learning is computationally infeasible if implemented in the straightforward way, we explore data clustering and several algorithmic techniques to make learning practical even in high dimensions. We present several methods for combining the predictions of maxent models learned in different clusters. We conduct offline tests using over six months' worth of data from ResearchIndex, a popular online repository of over 470,000 computer science documents. We show that our maxent algorithm is arguably one of the most accurate recommenders, as compared to such techniques as correlation, mixture of Markov models, mixture of multinomial models, individual similarity-based recommenders currently available on ResearchIndex, and even various combinations of current ResearchIndex recommenders.