Results 1  10
of
732
Text Classification from Labeled and Unlabeled Documents using EM
 Machine Learning
, 1999
"... . This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large qua ..."
Abstract

Cited by 804 (17 self)
 Add to MetaCart
. This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of ExpectationMaximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve ...
Probabilistic Latent Semantic Indexing
, 1999
"... Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized ..."
Abstract

Cited by 795 (8 self)
 Add to MetaCart
Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domainspecific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methodsaswell as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
Unsupervised learning of finite mixture models
 IEEE Transactions on pattern analysis and machine intelligence
, 2002
"... AbstractÐThis paper proposes an unsupervised algorithm for learning a finite mixture model from multivariate data. The adjective ªunsupervisedº is justified by two properties of the algorithm: 1) it is capable of selecting the number of components and 2) unlike the standard expectationmaximization ..."
Abstract

Cited by 271 (20 self)
 Add to MetaCart
AbstractÐThis paper proposes an unsupervised algorithm for learning a finite mixture model from multivariate data. The adjective ªunsupervisedº is justified by two properties of the algorithm: 1) it is capable of selecting the number of components and 2) unlike the standard expectationmaximization (EM) algorithm, it does not require careful initialization. The proposed method also avoids another drawback of EM for mixture fitting: the possibility of convergence toward a singular estimate at the boundary of the parameter space. The novelty of our approach is that we do not use a model selection criterion to choose one among a set of preestimated candidate models; instead, we seamlessly integrate estimation and model selection in a single algorithm. Our technique can be applied to any type of parametric mixture model for which it is possible to write an EM algorithm; in this paper, we illustrate it with experiments involving Gaussian mixtures. These experiments testify for the good performance of our approach. Index TermsÐFinite mixtures, unsupervised learning, model selection, minimum message length criterion, Bayesian methods, expectationmaximization algorithm, clustering. æ 1
Approximation Algorithms for Projective Clustering
 Proceedings of the ACM SIGMOD International Conference on Management of data, Philadelphia
, 2000
"... We consider the following two instances of the projective clustering problem: Given a set S of n points in R d and an integer k ? 0; cover S by k hyperstrips (resp. hypercylinders) so that the maximum width of a hyperstrip (resp., the maximum diameter of a hypercylinder) is minimized. Let w ..."
Abstract

Cited by 246 (21 self)
 Add to MetaCart
We consider the following two instances of the projective clustering problem: Given a set S of n points in R d and an integer k ? 0; cover S by k hyperstrips (resp. hypercylinders) so that the maximum width of a hyperstrip (resp., the maximum diameter of a hypercylinder) is minimized. Let w be the smallest value so that S can be covered by k hyperstrips (resp. hypercylinders), each of width (resp. diameter) at most w : In the plane, the two problems are equivalent. It is NPHard to compute k planar strips of width even at most Cw ; for any constant C ? 0 [50]. This paper contains four main results related to projective clustering: (i) For d = 2, we present a randomized algorithm that computes O(k log k) strips of width at most 6w that cover S. Its expected running time is O(nk 2 log 4 n) if k 2 log k n; it also works for larger values of k, but then the expected running time is O(n 2=3 k 8=3 log 4 n). We also propose another algorithm that computes a c...
Survey of clustering algorithms
 IEEE TRANSACTIONS ON NEURAL NETWORKS
, 2005
"... Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the ..."
Abstract

Cited by 238 (3 self)
 Add to MetaCart
Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure, and cluster validation, are also discussed.
Automating the Construction of Internet Portals with Machine Learning
 Information Retrieval
, 2000
"... Domainspecific internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible ..."
Abstract

Cited by 170 (3 self)
 Add to MetaCart
Domainspecific internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible with general, Webwide search engines. Unfortunately these portals are difficult and timeconsuming to maintain. This paper advocates the use of machine learning techniques to greatly automate the creation and maintenance of domainspecific Internet portals. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies. Using these techniques, we have built a demonstration system: a portal for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. These techniques are ...
Learning with Labeled and Unlabeled Data
, 2001
"... In this paper, on the one hand, we aim to give a review on literature dealing with the problem of supervised learning aided by additional unlabeled data. On the other hand, being a part of the author's first year PhD report, the paper serves as a frame to bundle related work by the author as well as ..."
Abstract

Cited by 166 (3 self)
 Add to MetaCart
In this paper, on the one hand, we aim to give a review on literature dealing with the problem of supervised learning aided by additional unlabeled data. On the other hand, being a part of the author's first year PhD report, the paper serves as a frame to bundle related work by the author as well as numerous suggestions for potential future work. Therefore, this work contains more speculative and partly subjective material than the reader might expect from a literature review. We give a rigorous definition of the problem and relate it to supervised and unsupervised learning. The crucial role of prior knowledge is put forward, and we discuss the important notion of inputdependent regularization. We postulate a number of baseline methods, being algorithms or algorithmic schemes which can more or less straightforwardly be applied to the problem, without the need for genuinely new concepts. However, some of them might serve as basis for a genuine method. In the literature revi...
Learning to Classify Text from Labeled and Unlabeled Documents
, 1998
"... . This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is significant because in many important text classification problems obtaining classification labels is expensi ..."
Abstract

Cited by 162 (18 self)
 Add to MetaCart
. This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is significant because in many important text classification problems obtaining classification labels is expensive, while large quantities of unlabeled documents are readily available. We present a theoretical argument showing that, under common assumptions, unlabeled data contain information about the target function. We then introduce an algorithm for learning from labeled and unlabeled text, based on the combination of ExpectationMaximization with a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates. Experimental results, obtained using text from three different realworld tasks, show that the use of unlabeled...
Unsupervised Learning from Dyadic Data
, 1998
"... Dyadic data refers to a domain with two finite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This includes event cooccurrences, histogram data, and single stimulus preference data as special cases. Dyadic data arises naturally in many applic ..."
Abstract

Cited by 100 (9 self)
 Add to MetaCart
Dyadic data refers to a domain with two finite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This includes event cooccurrences, histogram data, and single stimulus preference data as special cases. Dyadic data arises naturally in many applications ranging from computational linguistics and information retrieval to preference analysis and computer vision. In this paper, we present a systematic, domainindependent framework for unsupervised learning from dyadic data by statistical mixture models. Our approach covers different models with flat and hierarchical latent class structures and unifies probabilistic modeling and structure discovery. Mixture models provide both, a parsimonious yet flexible parameterization of probability distributions with good generalization performance on sparse data, as well as structural information about datainherent grouping structure. We propose an annealed version of the standard Expectation Maximization algorithm for model fitting which is empirically evaluated on a variety of data sets from different domains.