• Documents
  • Authors
  • Tables
  • Other Seers ▼
    RefSeer AckSeer CollabSeer SeerSeer
  • Log in
  • Sign up
  • MetaCart

CiteSeerX logo

Advanced Search Include Citations
Advanced Search Include Citations | Disambiguate

An experimental comparison of several clustering methods, Microsoft Research Report MSR-TR-98-06 (0)

by M Meila, D Heckerman
Add To MetaCart

Tools

Sorted by:
Results 11 - 20 of 57
Next 10 →

Performance Evaluation of Some Clustering Algorithms and Validity Indices

by Ujjwal Maulik, Sanghamitra B - IEEE Transactions on Pattern Analysis and Machine Intelligence , 2002
"... Abstract—In this article, we evaluate the performance of three clustering algorithms, hard K-Means, single linkage, and a simulated annealing (SA) based technique, in conjunction with four cluster validity indices, namely Davies-Bouldin index, Dunn’s index, Calinski-Harabasz index, and a recently de ..."
Abstract - Cited by 40 (0 self) - Add to MetaCart
Abstract—In this article, we evaluate the performance of three clustering algorithms, hard K-Means, single linkage, and a simulated annealing (SA) based technique, in conjunction with four cluster validity indices, namely Davies-Bouldin index, Dunn’s index, Calinski-Harabasz index, and a recently developed index I. Based on a relation between the index I and the Dunn’s index, a lower bound of the value of the former is theoretically estimated in order to get unique hard K-partition when the data set has distinct substructures. The effectiveness of the different validity indices and clustering methods in automatically evolving the appropriate number of clusters is demonstrated experimentally for both artificial and real-life data sets with the number of clusters varying from two to ten. Once the appropriate number of clusters is determined, the SA-based clustering technique is used for proper partitioning of the data into the said number of clusters.

Pulse: Mining Customer Opinions from Free Text

by Michael Gamon, Anthony Aue, Simon Corston-oliver, Eric Ringger - In Proc. of the 6th International Symposium on Intelligent Data Analysis , 2005
"... Abstract. We present a prototype system, code-named Pulse, for mining topics and sentiment orientation jointly from free text customer feedback. We describe the application of the prototype system to a database of car reviews. Pulse enables the exploration of large quantities of customer free text. ..."
Abstract - Cited by 40 (0 self) - Add to MetaCart
Abstract. We present a prototype system, code-named Pulse, for mining topics and sentiment orientation jointly from free text customer feedback. We describe the application of the prototype system to a database of car reviews. Pulse enables the exploration of large quantities of customer free text. The user can examine customer opinion “at a glance ” or explore the data at a finer level of detail. We describe a simple but effective technique for clustering sentences, the application of a bootstrapping approach to sentiment classification, and a novel user-interface. 1

Correlation Clustering with Partial Information

by Erik D. Demaine, et al. , 2003
"... We consider the following general correlation-clustering problem [1]: given a graph with real edge weights (both positive and negative), partition the vertices into clusters to minimize the total absolute weight of cut positive edges and uncut negative edges. Thus, large positive weights (represent ..."
Abstract - Cited by 39 (1 self) - Add to MetaCart
We consider the following general correlation-clustering problem [1]: given a graph with real edge weights (both positive and negative), partition the vertices into clusters to minimize the total absolute weight of cut positive edges and uncut negative edges. Thus, large positive weights (representing strong correlations between endpoints) encourage those endpoints to belong to a common cluster; large negative weights encourage the endpoints to belong to different clusters; and weights with small absolute value represent little information. In contrast to most clustering problems, correlation clustering specifies neither the desired number of clusters nor a distance threshold for clustering; both of these parameters are effectively chosen to be the best possible by the problem definition. Correlation clustering was introduced by Bansal, Blum, and Chawla [1], motivated by both document clustering and agnostic learning. They proved NP-hardness and gave constant-factor approximation algorithms for the special case in which the graph is complete (full information) and every edge has weight +1 or-1. We give an O(log n)-approximation algorithm for the general case based on a linear-programming rounding and the "region-growing " technique. We also prove that this linear program has a gap of Ω(log n), and therefore our approximation is tight under this approach. We also give an O(r³)-approximation algorithm for Kr,r-minor-free graphs. On the other hand, we show that the problem is APX-hard, and any o(log n)-approximation would require improving the best approximation algorithms known for minimum multicut.

Scaling EM (Expectation-Maximization) Clustering to Large Databases

by Paul S. Bradley, Usama M. Fayyad, Cory A. Reina, P. S. Bradley, Usama Fayyad, Cory Reina , 1999
"... Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require the ..."
Abstract - Cited by 35 (0 self) - Add to MetaCart
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require the access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes and methods have been developed to cluster either numerical or categorical data. Unlike distancebased algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...

The effectiveness of lloyd-type methods for the k-means problem

by Rafail Ostrovsky, Yuval Rabani - In 47th IEEE Symposium on the Foundations of Computer Science (FOCS , 2006
"... We investigate variants of Lloyd’s heuristic for clustering high dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data s ..."
Abstract - Cited by 32 (3 self) - Add to MetaCart
We investigate variants of Lloyd’s heuristic for clustering high dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd’s heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd’s heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd’s method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration. 1.

Learning spectral clustering, with application to speech separation

by Francis R. Bach, Michael I. Jordan - JOURNAL OF MACHINE LEARNING RESEARCH , 2006
"... Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost fun ..."
Abstract - Cited by 30 (4 self) - Add to MetaCart
Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled datasets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram.

Automatic construction of multifaceted browsing interfaces

by Wisam Dakka - In CIKM , 2005
"... Databases of text and text-annotated data constitute a significant fraction of the information available in electronic form. Searching and browsing are the typical ways that users locate items of interest in such databases. Interfaces that use multifaceted hierarchies represent a new powerful browsi ..."
Abstract - Cited by 23 (2 self) - Add to MetaCart
Databases of text and text-annotated data constitute a significant fraction of the information available in electronic form. Searching and browsing are the typical ways that users locate items of interest in such databases. Interfaces that use multifaceted hierarchies represent a new powerful browsing paradigm which has been proven to be a successful complement to keyword searching. Thus far, multifaceted hierarchies have been created manually or semi-automatically, making it difficult to deploy multifaceted interfaces over a large number of databases. We present automatic and scalable methods for creation of multifaceted interfaces. Our methods are integrated with traditional relational databases and can scale well for large databases. Furthermore, we present methods for selecting the best portions of the generated hierarchies when the screen space is not sufficient for displaying all the hierarchy at once. We apply our technique to a range of large data sets, including annotated images, television programming schedules, and web pages. The results are promising and suggest directions for future research.

Model-Based Hierarchical Clustering

by Shivakumar Vaithyanathan, Byron Dom - In Proc. 16th Conf. Uncertainty in Artificial Intelligence , 2000
"... We present an approach to model-based hierarchical clustering by formulating an objective function based on a Bayesian analysis. This model organizes the data into a cluster hierarchy while specifying a complex feature-set partitioning that is a key component of our model. Features can have ei ..."
Abstract - Cited by 21 (0 self) - Add to MetaCart
We present an approach to model-based hierarchical clustering by formulating an objective function based on a Bayesian analysis. This model organizes the data into a cluster hierarchy while specifying a complex feature-set partitioning that is a key component of our model. Features can have either a unique distribution in every cluster or a common distribution over some (or even all) of the clusters. The cluster subsets over which these features have such a common distribution correspond to the nodes (clusters) of the tree representing the hierarchy. We apply this general model to the problem of document clustering for which we use a multinomial likelihood function and Dirichlet priors. Our algorithm consists of a two-stage process wherein we first perform a flat clustering followed by a modified hierarchical agglomerative merging process that includes determining the features that will have common distributions over the merged clusters. The regularization induced...

Learning recursive Bayesian multinets for data clustering by means of constructive induction

by J.M. Peña, J.A. Lozano, P. Larrañaga , 2001
"... This paper introduces and evaluates a new class of knowledge model, the recursive Bayesian multinet (RBMN), which encodes the joint probability distribution of a given database. RBMNs extend Bayesian networks (BNs) as well as partitional clustering systems. Briefly, a RBMN is a decision tree with co ..."
Abstract - Cited by 18 (7 self) - Add to MetaCart
This paper introduces and evaluates a new class of knowledge model, the recursive Bayesian multinet (RBMN), which encodes the joint probability distribution of a given database. RBMNs extend Bayesian networks (BNs) as well as partitional clustering systems. Briefly, a RBMN is a decision tree with component BNs at the leaves. A RBMN is learnt using a greedy, heuristic approach akin to that used by many supervised decision tree learners, but where BNs are learnt at leaves using constructive induction. A key idea is to treat expected data as real data. This allows us to complete the database and to take advantage of a closed form for the marginal likelihood of the expected complete data that factorizes into separate marginal likelihoods for each family (a node and its parents). Our approach is evaluated on synthetic and real-world databases.

Organizing Structured Web Sources by Query Schemas: A Clustering Approach

by Bin He Tao, Tao Tao, Kevin Chen-chuan Chang - In CIKM Conference , 2004
"... In the recent years, the Web has been rapidly "deepened" with the prevalence of databases online. On this deep Web, many sources are structured by providing structured query interfaces and results. Organizing such structured sources into a domain hierarchy is one of the critical steps toward the int ..."
Abstract - Cited by 17 (2 self) - Add to MetaCart
In the recent years, the Web has been rapidly "deepened" with the prevalence of databases online. On this deep Web, many sources are structured by providing structured query interfaces and results. Organizing such structured sources into a domain hierarchy is one of the critical steps toward the integration of heterogeneous Web sources. We observe that, for structured Web sources, query schemas (i.e., attributes in query interfaces) are discriminative representatives of the sources and thus can be exploited for source characterization. In particular, by viewing query schemas as a type of categorical data, we abstract the problem of source organization into the clustering of categorical data. Our approach hypothesizes that "homogeneous sources" are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a new objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation over hundreds of real sources indicates that (1) the schemabased clustering accurately organizes sources by object domains (e.g., Books, Movies), and (2) on clustering Web query schemas, the model-differentiation function outperforms existing ones, such as likelihood, entropy, and context linkages, with the hierarchical agglomerative clustering algorithm.
The National Science Foundation
  • About CiteSeerX
  • Submit Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University