Results 1 - 10 of 20
Empirical Analysis of Predictive Algorithms for Collaborative Filtering
 Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence
, 1998
Text Classification from Labeled and Unlabeled Documents using EM
 Machine Learning
, 1999
Cited by 803 (17 self)
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve ...
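The basic procedure in this abstract (train on labeled documents, probabilistically label the unlabeled ones, retrain on everything, repeat) might be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes a multinomial naive Bayes model with add-one smoothing, uses a fixed iteration count in place of a convergence test, and leaves out the two extensions the abstract mentions. The names `train_nb`, `posterior`, and `em_naive_bayes` are hypothetical.

```python
import math
from collections import defaultdict


def train_nb(docs, weights, classes, vocab):
    """M-step: smoothed priors and word probabilities from (possibly
    fractional) per-document class weights."""
    prior, word_prob = {}, {}
    total = sum(sum(w.values()) for w in weights)
    for c in classes:
        class_mass = sum(w[c] for w in weights)
        prior[c] = (1.0 + class_mass) / (len(classes) + total)
        counts = defaultdict(float)
        for doc, w in zip(docs, weights):
            for tok in doc:
                counts[tok] += w[c]
        denom = len(vocab) + sum(counts.values())
        word_prob[c] = {t: (1.0 + counts[t]) / denom for t in vocab}
    return prior, word_prob


def posterior(doc, prior, word_prob, classes):
    """E-step for one document: P(class | doc); unknown tokens are skipped."""
    logp = {
        c: math.log(prior[c])
        + sum(math.log(word_prob[c][t]) for t in doc if t in word_prob[c])
        for c in classes
    }
    top = max(logp.values())
    unnorm = {c: math.exp(logp[c] - top) for c in classes}
    z = sum(unnorm.values())
    return {c: unnorm[c] / z for c in classes}


def em_naive_bayes(labeled, unlabeled, classes, iterations=10):
    """labeled: list of (tokens, class); unlabeled: list of token lists."""
    vocab = {t for doc, _ in labeled for t in doc}
    vocab |= {t for doc in unlabeled for t in doc}
    l_docs = [doc for doc, _ in labeled]
    # Labeled documents keep hard (0/1) class weights throughout.
    l_weights = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    prior, word_prob = train_nb(l_docs, l_weights, classes, vocab)
    for _ in range(iterations):
        # Probabilistically label the unlabeled documents, then retrain on all.
        u_weights = [posterior(d, prior, word_prob, classes) for d in unlabeled]
        prior, word_prob = train_nb(
            l_docs + unlabeled, l_weights + u_weights, classes, vocab
        )
    return prior, word_prob
```

In a toy setting, a word that never appears in a labeled document can still become informative when it co-occurs with labeled vocabulary in the unlabeled pool.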
Automatic Subspace Clustering of High Dimensional Data
 Data Mining and Knowledge Discovery
, 2005
Cited by 561 (12 self)
Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high-dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high-dimensional datasets.
Duplicate record detection: A survey
 IEEE Transactions on Knowledge and Data Engineering
, 2007
Cited by 256 (7 self)
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats or any combination of these factors. In this article, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with a coverage of existing tools and with a brief discussion of the big open problems in the area.
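As one illustration of a field-level similarity metric of the kind this abstract refers to, the sketch below uses character edit (Levenshtein) distance. The choice of metric, the `similar` helper, and its `max_ratio` threshold are assumptions made for illustration; the abstract itself does not name specific metrics.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + 1,               # delete ca
                    cur[j - 1] + 1,            # insert cb
                    prev[j - 1] + (ca != cb),  # substitute (free if equal)
                )
            )
        prev = cur
    return prev[-1]


def similar(a, b, max_ratio=0.25):
    """Hypothetical matching rule: case-insensitive edit distance, normalized
    by the longer field, must not exceed max_ratio."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a.lower(), b.lower()) / longest <= max_ratio
```

Under this rule, `similar("Jon Smith", "John Smith")` holds because a single insertion separates the two fields, while unrelated names fall well above the threshold.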
Clustering Based On Association Rule Hypergraphs
Cited by 88 (16 self)
Clustering in data mining is a discovery process that groups a set of data such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized. These discovered clusters are used to explain the characteristics of the data distribution. In this paper we propose a new methodology for clustering related items using association rules, and clustering related transactions using clusters of items. Our approach is linearly scalable with respect to the number of transactions. The frequent itemsets used to derive association rules are also used to group items into a hypergraph edge, and a hypergraph partitioning algorithm is used to find the clusters. Our experiments indicate that clustering using association rule hypergraphs holds great promise in several application domains. Our experiments with stock-market data and congressional voting data show that this clustering scheme is able to successfully group items that belong to the same group. Clustering of items can ...
An experimental comparison of several clustering and initialization methods
, 1998
Cited by 78 (0 self)
We examine methods for clustering in high dimensions. In the first part of the paper, we perform an experimental comparison between three batch clustering algorithms: the Expectation–Maximization (EM) algorithm, a “winner take all” version of the EM algorithm reminiscent of the K-means algorithm, and model-based hierarchical agglomerative clustering. We learn naive-Bayes models with a hidden root node, using high-dimensional discrete-variable data sets (both real and synthetic). We find that the EM algorithm significantly outperforms the other methods, and proceed to investigate the effect of various initialization schemes on the final solution produced by the EM algorithm. The initializations that we consider are (1) parameters sampled from an uninformative prior, (2) random perturbations of the marginal distribution of the data, and (3) the output of hierarchical agglomerative clustering. Although the methods are substantially different, they lead to learned models that are strikingly similar in quality.
Clustering Transactions Using Large Items
Cited by 63 (5 self)
In traditional data clustering, similarity of a cluster of objects is measured by pairwise similarity of objects in that cluster. We argue that such measures are not appropriate for transactions that are sets of items. We propose the notion of large items, i.e., items contained in some minimum fraction of transactions in a cluster, to measure the similarity of a cluster of transactions. The intuition of our clustering criterion is that there should be many large items within a cluster and little overlapping of such items across clusters. We discuss the rationale behind our approach and its implication on providing a better solution to the clustering problem. We present a clustering algorithm based on the new clustering criterion and evaluate its effectiveness.
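The notion of large items in this abstract can be made concrete with a short sketch. It computes the large items of each cluster for a given minimum fraction and counts the items that are large in more than one cluster. The function names and the `min_support` parameter are hypothetical, and the sketch does not reproduce the paper's actual clustering criterion or algorithm.

```python
from collections import Counter


def large_items(cluster, min_support):
    """Items contained in at least a min_support fraction of the
    transactions in one cluster (each transaction is a set of items)."""
    counts = Counter(item for txn in cluster for item in set(txn))
    threshold = min_support * len(cluster)
    return {item for item, n in counts.items() if n >= threshold}


def overlap(clusters, min_support):
    """Number of items that are large in more than one cluster."""
    seen = Counter()
    for cluster in clusters:
        seen.update(large_items(cluster, min_support))
    return sum(1 for n in seen.values() if n > 1)
```

Following the intuition stated above, a good clustering would give each cluster many large items while keeping `overlap` small.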
Model-based clustering and visualization of navigation patterns on a web site
 Data Mining and Knowledge Discovery
, 2003
Cited by 53 (0 self)
We present a new methodology for exploring and analyzing navigation patterns on a web site. The patterns that can be analyzed consist of sequences of URL categories traversed by users. In our approach, we first partition site users into clusters such that users with similar navigation paths through the site are placed into the same cluster. Then, for each cluster, we display these paths for users within that cluster. The clustering approach we employ is model-based (as opposed to distance-based) and partitions users according to the order in which they request web pages. In particular, we cluster users by learning a mixture of first-order Markov models using the Expectation-Maximization algorithm. The runtime of our algorithm scales linearly with the number of clusters and with the size of the data, and our implementation easily handles hundreds of thousands of user sessions in memory. In the paper, we describe the details of our method and a visualization tool based on it called WebCANVAS. We illustrate the use of our approach on user-traffic data from msnbc.com. Keywords: Model-based clustering, sequence clustering, data visualization, Internet, web
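A mixture of first-order Markov models fitted by EM, as named in this abstract, might look roughly like the sketch below. It is a sketch under stated assumptions rather than the WebCANVAS implementation: page categories are integers in `range(n_states)`, every sequence is non-empty, the pseudo-count `alpha` and the random soft initialization are illustrative choices, and all function names are hypothetical.

```python
import math
import random


def normalize(row):
    total = sum(row)
    return [x / total for x in row]


def sequence_log_prob(seq, init, trans):
    """Log-probability of one sequence under a first-order Markov chain."""
    logp = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp


def fit_markov_mixture(seqs, n_states, n_clusters, iterations=20, alpha=1.0, seed=0):
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    rng = random.Random(seed)
    # Random soft assignments break the symmetry between clusters.
    resp = [normalize([rng.random() + 1e-3 for _ in range(n_clusters)]) for _ in seqs]
    for _ in range(iterations):
        # M-step: responsibility-weighted counts plus alpha pseudo-counts.
        weights, inits, transs = [], [], []
        for k in range(n_clusters):
            mass = sum(r[k] for r in resp)
            weights.append((mass + alpha) / (len(seqs) + n_clusters * alpha))
            init = [alpha] * n_states
            trans = [[alpha] * n_states for _ in range(n_states)]
            for seq, r in zip(seqs, resp):
                init[seq[0]] += r[k]
                for prev, cur in zip(seq, seq[1:]):
                    trans[prev][cur] += r[k]
            inits.append(normalize(init))
            transs.append([normalize(row) for row in trans])
        # E-step: posterior cluster membership of each sequence.
        resp = []
        for seq in seqs:
            logs = [
                math.log(weights[k]) + sequence_log_prob(seq, inits[k], transs[k])
                for k in range(n_clusters)
            ]
            top = max(logs)
            resp.append(normalize([math.exp(v - top) for v in logs]))
    return weights, inits, transs, resp
```

Each pass touches every transition of every sequence once per cluster, which is consistent with the linear scaling in the number of clusters and the data size noted in the abstract.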
Accelerating EM for large databases
 Machine Learning
, 2001
Cited by 35 (1 self)
The EM algorithm is a popular method for parameter estimation in a variety of problems involving missing data. However, the EM algorithm often requires significant computational resources and has been dismissed as impractical for large databases. We present two approaches that significantly reduce the computational cost of applying the EM algorithm to databases with a large number of cases, including databases with large dimensionality. Both approaches are based on partial E-steps for which we can use the results of Neal and Hinton (1998) to obtain the standard convergence guarantees of EM. The first approach is a version of the incremental EM, described in Neal and Hinton (1998), which cycles through data cases in blocks. The number of cases in each block dramatically affects the efficiency of the algorithm. We provide a method for selecting a near optimal block size. The second approach, which we call lazy EM, will, at scheduled iterations, evaluate the significance of each data case and then proceed for several iterations actively using only the significant cases. We demonstrate that both methods can significantly reduce computational costs through their application to high-dimensional real-world and synthetic mixture modeling problems for large databases. Keywords: Expectation Maximization Algorithm, incremental EM, lazy EM, online EM, data blocking, mixture models, clustering.
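The block-wise incremental variant described in this abstract can be illustrated on a toy problem. The sketch below assumes a one-dimensional Gaussian mixture, a fixed `block_size` supplied by the caller, and hypothetical function names. It shows only the partial E-step idea (recompute the expected sufficient statistics of one block, swap them into the running totals, then run an M-step), not the paper's block-size selection method or lazy EM.

```python
import math


def e_step_block(block, weights, means, variances):
    """Partial E-step: expected sufficient statistics for one block.
    Per component: [sum r, sum r*x, sum r*x^2]."""
    k = len(weights)
    stats = [[0.0, 0.0, 0.0] for _ in range(k)]
    for x in block:
        logs = [
            math.log(weights[j])
            - 0.5 * math.log(2 * math.pi * variances[j])
            - (x - means[j]) ** 2 / (2 * variances[j])
            for j in range(k)
        ]
        top = max(logs)
        unnorm = [math.exp(v - top) for v in logs]
        total = sum(unnorm)
        for j in range(k):
            r = unnorm[j] / total
            stats[j][0] += r
            stats[j][1] += r * x
            stats[j][2] += r * x * x
    return stats


def m_step(total, n_cases, min_var):
    """M-step from accumulated sufficient statistics."""
    weights, means, variances = [], [], []
    for s0, s1, s2 in total:
        n = max(s0, 1e-12)  # guard against an empty component
        mean = s1 / n
        weights.append(n / n_cases)
        means.append(mean)
        variances.append(max(s2 / n - mean * mean, min_var))
    return weights, means, variances


def incremental_em(data, block_size, means, variances, passes=5, min_var=1e-6):
    k = len(means)
    weights = [1.0 / k] * k
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # One full E-step initializes the per-block statistics and their totals.
    block_stats = [e_step_block(b, weights, means, variances) for b in blocks]
    total = [[sum(bs[j][s] for bs in block_stats) for s in range(3)] for j in range(k)]
    weights, means, variances = m_step(total, len(data), min_var)
    for _ in range(passes):
        for i, block in enumerate(blocks):
            new = e_step_block(block, weights, means, variances)
            # Swap this block's old statistics for the new ones.
            for j in range(k):
                for s in range(3):
                    total[j][s] += new[j][s] - block_stats[i][j][s]
            block_stats[i] = new
            weights, means, variances = m_step(total, len(data), min_var)
    return weights, means, variances
```

Smaller blocks mean more frequent parameter updates but more M-steps per pass, which is one way to see why the block size matters for efficiency.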