Results 1  10
of
677
Latent dirichlet allocation
 Journal of Machine Learning Research
, 2003
"... We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a threelevel hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, ..."
Abstract

Cited by 2350 (63 self)
 Add to MetaCart
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a threelevel hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model. 1.
Probabilistic Latent Semantic Analysis
 In Proc. of Uncertainty in Artificial Intelligence, UAI’99
, 1999
"... Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of twomode and cooccurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Sema ..."
Abstract

Cited by 529 (6 self)
 Add to MetaCart
Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of twomode and cooccurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Semantic Analysis which stems from linear algebra and performs a Singular Value Decomposition of cooccurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics. In order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered EM. Our approach yields substantial and consistent improvements over Latent Semantic Analysis in a number of experiments.
Opinion Mining and Sentiment Analysis
, 2008
"... An important part of our informationgathering behavior has always been to find out what other people think. With the growing availability and popularity of opinionrich resources such as online review sites and personal blogs, new opportunities and challenges arise as people now can, and do, active ..."
Abstract

Cited by 367 (4 self)
 Add to MetaCart
An important part of our informationgathering behavior has always been to find out what other people think. With the growing availability and popularity of opinionrich resources such as online review sites and personal blogs, new opportunities and challenges arise as people now can, and do, actively use information technologies to seek out and understand the opinions of others. The sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has thus occurred at least in part as a direct response to the surge of interest in new systems that deal directly with opinions as a firstclass object. This survey covers techniques and approaches that promise to directly enable opinionoriented informationseeking systems. Our focus is on methods that seek to address the new challenges raised by sentimentaware applications, as compared to those that are already present in more traditional factbased analysis. We include materialon summarization of evaluative text and on broader issues regarding privacy, manipulation, and economic impact that the development of opinionoriented informationaccess services gives rise to. To facilitate future work, a discussion of available resources, benchmark datasets, and evaluation campaigns is also provided.
Concept Decompositions for Large Sparse Text Data using Clustering
 Machine Learning
, 2000
"... . Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as highdimensional and sparse vectorsa few thousand dimensions and a sparsity of 95 to 99 ..."
Abstract

Cited by 303 (28 self)
 Add to MetaCart
. Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as highdimensional and sparse vectorsa few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical kmeans algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the highdimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractallike" and "selfsimilar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the leastsquares approximation onto the linear subspace spanned...
Unsupervised learning of human action categories using spatialtemporal words
 In Proc. BMVC
, 2006
"... Imagine a video taken on a sunny beach, can a computer automatically tell what is happening in the scene? Can it identify different human activities in the video, such as water surfing, people walking and lying on the beach? To automatically classify or localize different actions in video sequences ..."
Abstract

Cited by 290 (6 self)
 Add to MetaCart
Imagine a video taken on a sunny beach, can a computer automatically tell what is happening in the scene? Can it identify different human activities in the video, such as water surfing, people walking and lying on the beach? To automatically classify or localize different actions in video sequences is very useful for a variety of tasks, such as video surveillance, objectlevel video summarization, video indexing, digital library organization, etc. However, it remains a challenging task for computers to achieve robust action recognition due to cluttered background, camera motion, occlusion, and geometric and photometric variances of objects. For example, in a live video of a skating competition, the skater moves rapidly across the rink, and the camera also moves to follow the skater. With moving camera, nonstationary background, and moving target, few vision algorithms could identify, categorize and
Eigentaste: A Constant Time Collaborative Filtering Algorithm
, 2000
"... Eigentaste is a collaborative filtering algorithm that uses universal queries to elicit realvalued user ratings on a common set of items and applies principal component analysis (PCA) to the resulting dense subset of the ratings matrix. PCA facilitates dimensionality reduction for offline clusterin ..."
Abstract

Cited by 276 (4 self)
 Add to MetaCart
Eigentaste is a collaborative filtering algorithm that uses universal queries to elicit realvalued user ratings on a common set of items and applies principal component analysis (PCA) to the resulting dense subset of the ratings matrix. PCA facilitates dimensionality reduction for offline clustering of users and rapid computation of recommendations. For a database of n users, standard nearestneighbor techniques require O(n) processing time to compute recommendations, whereas Eigentaste requires O(1) (constant) time. We compare Eigentaste to alternative algorithms using data from Jester, an online joke recommending system. Jester has collected approximately 2,500,000 ratings from 57,000 users. We use the Normalized Mean Absolute Error (NMAE) measure to compare performance of different algorithms. In the Appendix we use Uniform and Normal distribution models to derive analytic estimates of NMAE when predictions are random. On the Jester dataset, Eigentaste computes recommendations two ...
Parsimonious Language Models for Information Retrieval
 In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 2004
"... We systematically investigate a new approach to estimating the parameters of language models for information retrieval, called parsimonious language models. Parsimonious language models explicitly address the relation between levels of language models that are typically used for smoothing. As such, ..."
Abstract

Cited by 250 (38 self)
 Add to MetaCart
We systematically investigate a new approach to estimating the parameters of language models for information retrieval, called parsimonious language models. Parsimonious language models explicitly address the relation between levels of language models that are typically used for smoothing. As such, they need fewer (nonzero) parameters to describe the data. We apply parsimonious models at three stages of the retrieval process:1) at indexing time; 2) at search time; 3) at feedback time. Experimental results show that we are able to build models that are significantly smaller than standard models, but that still perform at least as well as the standard approaches.
InformationTheoretic CoClustering
 In KDD
, 2003
"... Twodimensional contingency or cooccurrence tables arise frequently in important applications such as text, weblog and marketbasket data analysis. A basic problem in contingency table analysis is coclustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views ..."
Abstract

Cited by 249 (11 self)
 Add to MetaCart
Twodimensional contingency or cooccurrence tables arise frequently in important applications such as text, weblog and marketbasket data analysis. A basic problem in contingency table analysis is coclustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the coclustering problem as an optimization problem in information theory  the optimal coclustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters.
The AuthorTopic Model for Authors and Documents
"... We introduce the authortopic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial dist ..."
Abstract

Cited by 233 (15 self)
 Add to MetaCart
We introduce the authortopic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics
that is a mixture of the distributions associated with the authors. We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. Exact
inference is intractable for these datasets and
we use Gibbs sampling to estimate the topic
and author distributions. We compare the performance with two other generative models for documents, which are special cases of the authortopic model: LDA (a topic model)
and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. We show topics recovered by the authortopic model, and demonstrate applications
to computing similarity between authors and
entropy of author output.
Learning object categories from google’s image search
 In ICCV
, 2005
"... Current approaches to object category recognition require datasets of training images to be manually prepared, with varying degrees of supervision. We present an approach that can learn an object category from just its name, by utilizing the raw output of image search engines available on the Intern ..."
Abstract

Cited by 231 (16 self)
 Add to MetaCart
Current approaches to object category recognition require datasets of training images to be manually prepared, with varying degrees of supervision. We present an approach that can learn an object category from just its name, by utilizing the raw output of image search engines available on the Internet. We develop a new model, TSIpLSA, which extends pLSA (as applied to visual words) to include spatial information in a translation and scale invariant manner. Our approach can handle the high intraclass variability and large proportion of unrelated images returned by search engines. We evaluate the models on standard test sets, showing performance competitive with existing methods trained on hand prepared datasets. 1.