Results 1–10 of 58
Survey of clustering algorithms
 IEEE Transactions on Neural Networks, 2005
Cited by 231 (3 self)
Data analysis plays an indispensable role in understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, draws on research developed across a wide variety of communities. This diversity, on the one hand, equips us with many tools; on the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications on some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several closely related topics, proximity measures and cluster validation, are also discussed.
Gaussian Processes for Active Data Mining of Spatial Aggregates
 In Proceedings of the SIAM International Conference on Data Mining, 2005
Cited by 18 (1 self)
We present an active data mining mechanism for qualitative analysis of spatial datasets, integrating identification and analysis of structures in spatial data with targeted collection of additional samples. The mechanism is designed around the spatial aggregation language (SAL) for qualitative spatial reasoning, and seeks to uncover high-level spatial structures from only a sparse set of samples. This approach is important for applications in domains such as aircraft design, wireless system simulation, fluid dynamics, and sensor networks. The mechanism employs Gaussian processes, a formal mathematical model for reasoning about spatial data, in order to build surrogate models from sparse data, reason about the uncertainty of estimation at unsampled points, and formulate objective criteria for closing the loop between data collection and data analysis. It optimizes sample selection using entropy-based functionals defined over spatial aggregates instead of the traditional approach of sampling to minimize estimated variance. We apply this mechanism to a global optimization benchmark comprising a test bank of 2D functions, as well as to data from wireless system simulations. The results reveal that the proposed sampling strategy makes more judicious use of data points by selecting locations that clarify high-level structures in data, rather than choosing points that merely improve the quality of function approximation.
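The closed loop described above (fit a Gaussian process surrogate, quantify uncertainty at unsampled points, sample where uncertainty is highest) can be sketched in a few lines. This is a generic variance-driven illustration, not the paper's aggregate-level entropy criterion; the RBF kernel, length scale, and candidate grid are all assumptions:

```python
import math

def rbf(x, y, ls=1.0):
    # Squared-exponential covariance between two 1-D locations.
    return math.exp(-((x - y) ** 2) / (2 * ls * ls))

def solve(A, b):
    # Gaussian elimination with partial pivoting (stdlib-only linear solve).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gp_variance(x_star, xs, noise=1e-6):
    # GP predictive variance at x_star given sampled locations xs; for
    # uncertainty-driven selection the observed values are not needed.
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(x_star, a) for a in xs]
    v = solve(K, k_star)  # K^{-1} k_*
    return rbf(x_star, x_star) - sum(ks * vi for ks, vi in zip(k_star, v))

def next_sample(candidates, sampled):
    # Close the loop: pick the candidate location with the highest
    # predictive uncertainty (equivalently, highest Gaussian entropy).
    return max(candidates, key=lambda x: gp_variance(x, sampled))

print(next_sample([0.1, 0.5, 0.9, 2.0], [0.0, 1.0]))  # → 2.0
```

The selected point is the one farthest, in kernel terms, from everything already sampled; the paper's contribution is to score entire spatial aggregates rather than single points this way.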
Using Trees to Depict a Forest
 PVLDB
Cited by 9 (0 self)
When a database query has a large number of results, the user can only be shown one page of results at a time. One popular approach is to rank results such that the “best” results appear first. However, standard database query results comprise a set of tuples, with no associated ranking. It is typical to allow users to sort results on selected attributes, but no actual ranking is defined. An alternative approach to the first page is not to try to show the best results, but instead to help users learn what is available in the whole result set and direct them to finding what they need. In this paper, we demonstrate through a user study that a page comprising one representative from each of k clusters (generated through a k-medoid clustering) is superior to multiple alternative candidate methods for generating representatives of a data set. Users often refine query specifications based on returned results. Traditional clustering may lead to completely new representatives after a refinement step. Furthermore, clustering can be computationally expensive. We propose a tree-based method for efficiently generating the representatives, and smoothly adapting them with query refinement. Experiments show that our algorithms outperform the state of the art in both result quality and efficiency.
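A minimal sketch of the k-medoid representative idea the study evaluates: cluster the result tuples and show one medoid per cluster as the first page. This is plain alternating k-medoids with an assumed Manhattan distance and toy data, not the paper's tree-based incremental algorithm:

```python
import random

def k_medoids(points, k, dist, iters=20, seed=0):
    # Classic alternating k-medoids: assign each point to its nearest medoid,
    # then move each medoid to the cluster member minimizing total distance.
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    for _ in range(iters):
        clusters = {m: [] for m in medoids}
        for p in points:
            clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
        new = [min(c, key=lambda cand: sum(dist(cand, q) for q in c))
               for c in clusters.values() if c]
        if set(new) == set(medoids):
            break
        medoids = new
    return medoids

# One representative tuple per cluster for a first page of results.
tuples = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
reps = k_medoids(tuples, 2, lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1]))
print(sorted(reps))  # → [(1, 1), (10, 10)]
```

Unlike a k-means centroid, each representative is an actual tuple from the result set, which is what makes it showable on the first page.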
Antipole Tree indexing to support range search and K-nearest neighbor search in metric spaces
 IEEE/TKDE, 2005
Cited by 7 (0 self)
Range and k-nearest neighbor searching are core problems in pattern recognition. Given a database S of objects in a metric space M and a query object q in M, in a range searching problem the target is to find the objects of S within some threshold distance to q, whereas in a k-nearest neighbor searching problem, the k elements of S closest to q must be produced. These problems can obviously be solved with a linear number of distance calculations, by comparing the query object against every object in the database. However, the goal is to solve such problems much faster. We combine and extend ideas from the M-Tree, the Multi-Vantage-Point structure, and the FQ-Tree to create a new structure in the “bisector tree” class, called the Antipole Tree. Bisection is based on the proximity to an “Antipole” pair of elements generated by a suitable linear randomized tournament. The final winners a, b of such a tournament are far enough apart to approximate the diameter of the splitting set. If dist(a, b) is larger than the chosen cluster diameter threshold, then the cluster is split. The proposed data structure is an indexing scheme suitable for (exact and approximate) best match searching on generic metric spaces. The Antipole Tree compares very well with existing structures such as List of Clusters, M-Trees and others, and in many cases it achieves better results.
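The linear randomized tournament that produces the Antipole pair can be sketched as follows; the group size, seed, and use of 1-D points are illustrative assumptions, and the returned pair only approximates the true diameter:

```python
import random

def antipole_pair(points, dist, seed=0):
    # Linear randomized tournament: shuffle, split into small groups, keep
    # each group's farthest-apart pair, and repeat until two winners remain.
    rng = random.Random(seed)
    pool = points[:]
    while len(pool) > 2:
        rng.shuffle(pool)
        winners = []
        for i in range(0, len(pool), 3):  # groups of up to 3
            group = pool[i:i + 3]
            if len(group) == 1:
                winners.append(group[0])
            else:
                a, b = max(((p, q) for j, p in enumerate(group)
                            for q in group[j + 1:]),
                           key=lambda pq: dist(*pq))
                winners += [a, b]
        pool = winners
    return tuple(pool)

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9, 10.0]
a, b = antipole_pair(pts, lambda x, y: abs(x - y))
# dist(a, b) approximates the diameter of the splitting set; per the
# abstract, the cluster is split when it exceeds the diameter threshold.
should_split = abs(a - b) > 4.0
```

Each round keeps only each group's farthest pair, so the work stays linear in the number of points while the surviving pair tends toward the set's diameter.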
On Effective Presentation of Graph Patterns: A Structural Representative Approach
 In Proc. 2008 ACM Conf. on Information and Knowledge Management (CIKM'08), 2008
Cited by 6 (1 self)
In the past, quite a few fast algorithms have been developed to mine frequent patterns over graph data, with the large spectrum covering many variants of the problem. However, the real bottleneck for knowledge discovery on graphs is neither efficiency nor scalability, but the usability of the patterns that are mined. Currently, what state-of-the-art techniques give is a lengthy list of exact patterns, which are undesirable in two aspects: (1) on the micro side, due to various inherent noises or data diversity, exact patterns are often of limited use in real applications; and (2) on the macro side, the rigid structural requirement often generates an excessive number of patterns that are only slightly different from each other, which easily overwhelms users. In this paper, we study the presentation problem of graph patterns, where structural representatives are the key mechanism to make the whole strategy effective. As a solution to fill the usability gap, we adopt a two-step smoothing-clustering framework, with the first step adding error tolerance to individual patterns (the micro side), and the second step reducing output cardinality by collapsing multiple structurally similar patterns into one representative (the macro side). This integrative approach, not tried in previous studies, essentially rolls up our attention to a more appropriate level that no longer examines every minute detail. The framework is general: it applies under various settings and admits many extensions. Empirical studies indicate that a compact group of informative delegates can be achieved on real datasets and that the proposed algorithms are both efficient and scalable.
The Application of K-medoids and PAM to the Clustering of Rules
Cited by 4 (0 self)
Abstract. Earlier research produced an ‘all-rules’ algorithm for data mining that generates all conjunctive rules above given confidence and coverage thresholds. While this is a useful tool, it may produce a large number of rules. This paper describes the application of two clustering algorithms to these rules, in order to identify sets of similar rules and to better understand the data.
Creating streaming iterative soft clustering algorithms
 In NAFIPS '07, 2007
Cited by 4 (2 self)
Abstract — There are an increasing number of large labeled and unlabeled data sets available. Clustering algorithms are best suited to making sense of unlabeled data. However, scaling iterative clustering algorithms to large amounts of data has been a challenge. The computation time can be substantial, and for data sets that will not fit in even the largest memory, only carefully chosen subsets of the data can practically be clustered. We present a general approach that enables iterative fuzzy/possibilistic clustering algorithms to be turned into algorithms that can handle arbitrary amounts of streaming data. The computation time is also reduced for very large data sets, while the clustering results are very similar to clustering with all the data, were that possible. We introduce transformed equations for fuzzy c-means, possibilistic c-means, and the Gustafson-Kessel algorithm, and show excellent performance with a streaming fuzzy c-means implementation. The resulting clusters are both sensible and, for comparable data sets (those that fit in memory), almost identical to those obtained with the original clustering algorithm.
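The chunked strategy the abstract describes (cluster each incoming chunk together with weighted summaries of all earlier data, then condense the chunk into updated weighted centroids) can be sketched for 1-D data. This is a generic weighted fuzzy c-means sketch, not the paper's transformed equations; the quantile initialization and fixed iteration count are assumptions:

```python
def wfcm(points, weights, c, m=2.0, iters=40, eps=1e-9):
    # Weighted fuzzy c-means on 1-D data, initialized at spread-out quantiles.
    s = sorted(points)
    centers = [s[i * (len(s) - 1) // (c - 1)] for i in range(c)]
    u = []
    for _ in range(iters):
        u = []
        for x in points:
            d = [abs(x - v) + eps for v in centers]
            u.append([1.0 / sum((d[i] / d[k]) ** (2 / (m - 1)) for k in range(c))
                      for i in range(c)])
        centers = [sum(w * (u[j][i] ** m) * x
                       for j, (x, w) in enumerate(zip(points, weights)))
                   / sum(w * (u[j][i] ** m)
                         for j, (x, w) in enumerate(zip(points, weights)))
                   for i in range(c)]
    return centers, u

def streaming_fcm(chunks, c):
    # Cluster each incoming chunk together with the weighted centroids that
    # summarize all earlier data, then fold the chunk into updated summaries.
    summary_pts, summary_w = [], []
    for chunk in chunks:
        pts = summary_pts + chunk
        ws = summary_w + [1.0] * len(chunk)
        centers, u = wfcm(pts, ws, c)
        summary_pts = centers
        # each summary centroid carries the fuzzy mass assigned to it
        summary_w = [sum(ws[j] * u[j][i] for j in range(len(pts)))
                     for i in range(c)]
    return summary_pts

chunks = [[0.1, 0.2, 9.8], [0.0, 10.0, 10.2], [0.3, 9.9]]
centers = sorted(streaming_fcm(chunks, 2))  # one center near 0, one near 10
```

Only c weighted centroids survive each chunk, so memory stays constant no matter how much data streams past, which is the point of the construction.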
A Hierarchical Projection Pursuit Clustering Algorithm
 In 17th International Conference on Pattern Recognition (ICPR), 2004
Cited by 3 (2 self)
We define a cluster to be characterized by regions of high density separated by regions that are sparse. By observing the downward closure property of density, the search for interesting structure in a high-dimensional space can be reduced to a search for structure in lower-dimensional subspaces. We present a Hierarchical Projection Pursuit Clustering (HPPC) algorithm that repeatedly bipartitions the dataset based on the discovered properties of interesting 1-dimensional projections. We describe a projection search procedure and a projection pursuit index function based on Cho, Haralick and Yi's improvement of the Kittler and Illingworth optimal threshold technique. The output of the algorithm is a decision tree whose nodes store a projection and threshold and whose leaves represent the clusters (classes). Experiments with various real and synthetic datasets show the effectiveness of the approach.
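A sketch of the core bipartition step: project the data onto a direction and split at an automatically chosen threshold. For brevity this uses a within-class-variance (Otsu-style) score as a stand-in for the Kittler and Illingworth minimum-error criterion the paper refines, and the projection direction is given rather than found by projection pursuit:

```python
import statistics

def best_threshold(values):
    # Scan candidate thresholds between consecutive sorted values and keep
    # the one minimizing total within-class variance (a stand-in for the
    # Kittler-Illingworth minimum-error criterion used by HPPC).
    s = sorted(values)
    best_t, best_score = None, float("inf")
    for i in range(1, len(s)):
        left, right = s[:i], s[i:]
        score = (len(left) * statistics.pvariance(left)
                 + len(right) * statistics.pvariance(right))
        if score < best_score:
            best_t, best_score = (s[i - 1] + s[i]) / 2, score
    return best_t

def bipartition(points, direction):
    # Project each point onto the direction vector and split at the threshold;
    # a tree node would store (direction, t) and recurse on each side.
    proj = [sum(p * d for p, d in zip(pt, direction)) for pt in points]
    t = best_threshold(proj)
    lo = [pt for pt, v in zip(points, proj) if v <= t]
    hi = [pt for pt, v in zip(points, proj) if v > t]
    return lo, hi

pts = [(0, 0), (0, 1), (1, 0), (8, 8), (9, 8), (8, 9)]
lo, hi = bipartition(pts, (1, 1))
print(len(lo), len(hi))  # → 3 3
```

Recursing on `lo` and `hi` until no projection yields an interesting split produces exactly the decision tree the abstract describes: internal nodes hold a projection and threshold, leaves hold clusters.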
A Novel Density-based Improved k-means Clustering Algorithm – Dbkmeans
Cited by 3 (0 self)
Abstract: Mining knowledge from large amounts of spatial data is known as spatial data mining. It has become a highly demanding field because huge amounts of spatial data have been collected in applications ranging from geospatial data to biomedical knowledge, and the amount being collected is increasing exponentially, far exceeding humans' ability to analyze it. Recently, clustering has been recognized as a primary data mining method for knowledge discovery in spatial databases. A database can be clustered in many ways depending on the clustering algorithm employed, the parameter settings used, and other factors. Multiple clusterings can be combined so that the final partitioning of the data provides better clustering. In this paper, a novel density-based k-means clustering algorithm is proposed to overcome the drawbacks of the DBSCAN and k-means clustering algorithms. The result is an improved version of the k-means clustering algorithm that performs better than DBSCAN when handling clusters of circularly distributed data points and slightly overlapping clusters. Keywords: Clustering, DBSCAN, k-means, Dbkmeans.
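One plausible way to combine density estimation with k-means along the lines this abstract sketches: seed the centroids at high-density points that are mutually well separated, then run ordinary Lloyd iterations. The neighbor-counting density estimate, the eps value, and the separation rule are illustrative assumptions, not necessarily the paper's exact Dbkmeans procedure:

```python
import math

def density_seeds(points, k, eps):
    # Seed centroids at high-density points (those with many eps-neighbors),
    # skipping candidates too close to an already chosen seed.
    def n_neighbors(p):
        return sum(1 for q in points if math.dist(p, q) <= eps)
    seeds = []
    for p in sorted(points, key=n_neighbors, reverse=True):
        if all(math.dist(p, s) > 2 * eps for s in seeds):
            seeds.append(p)
        if len(seeds) == k:
            break
    return seeds

def kmeans(points, centroids, iters=20):
    # Standard Lloyd iterations started from the density-derived seeds.
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        centroids = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                     else centroids[j] for j, cl in enumerate(clusters)]
    return centroids

pts = [(0, 0), (0.2, 0), (0, 0.2), (5, 5), (5.2, 5), (5, 5.2), (2.5, 2.5)]
centroids = sorted(kmeans(pts, density_seeds(pts, 2, eps=0.5)))
```

Density-aware seeding is what protects k-means from the arbitrary initial centroids that cause its usual instability, while keeping k-means's efficiency on compact, slightly overlapping clusters where DBSCAN struggles.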
Approximate kernel k-means: Solution to large-scale kernel clustering
 In Proceedings of the International Conference on Knowledge Discovery and Data Mining
Cited by 3 (1 self)
Digital data explosion mandates the development of scalable tools to organize the data in a meaningful and easily accessible form. Clustering is a commonly used tool for data organization. However, many clustering algorithms designed to handle large data sets assume linear separability of data and hence do not perform well on real-world data sets. While kernel-based clustering algorithms can capture the nonlinear structure in data, they do not scale well in terms of speed and memory requirements when the number of objects to be clustered exceeds tens of thousands. We propose an approximation scheme for kernel k-means, termed approximate kernel k-means, that reduces both the computational complexity and the memory requirements by employing a randomized approach. We show both analytically and empirically that the performance of approximate kernel k-means is similar to that of the kernel k-means algorithm, but with dramatically reduced runtime complexity and memory requirements.
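A landmark-based sketch of the randomized idea: sample m points, represent every object by its kernel similarities to those landmarks, and run ordinary k-means in that m-dimensional space, so only n*m kernel entries are computed instead of the full n-by-n matrix. This is a simplified stand-in for the paper's formulation (which restricts cluster centers to the span of the sampled points); the RBF kernel, farthest-point initialization, and toy blobs are assumptions:

```python
import math, random

def rbf(x, y, gamma=0.5):
    # RBF kernel between two points given as coordinate tuples.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def approx_kernel_kmeans(points, k, m, iters=10, seed=1):
    # Sample m landmarks, embed each point as its kernel-similarity vector
    # to the landmarks, then run ordinary k-means in that m-dim space:
    # O(n*m) kernel evaluations instead of the O(n^2) full kernel matrix.
    rng = random.Random(seed)
    landmarks = rng.sample(points, m)
    feats = [tuple(rbf(p, l) for l in landmarks) for p in points]
    # farthest-point initialization keeps this sketch deterministic
    centroids = [feats[0]]
    while len(centroids) < k:
        centroids.append(max(feats, key=lambda f: min(math.dist(f, c)
                                                      for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(f, centroids[c]))
                  for f in feats]
        for c in range(k):
            members = [f for f, lab in zip(feats, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels

blob_a = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1)]
blob_b = [(4, 4), (4.1, 4), (4, 4.1), (4.1, 4.1)]
labels = approx_kernel_kmeans(blob_a + blob_b, k=2, m=3)
print(labels)  # → [0, 0, 0, 0, 1, 1, 1, 1]
```

Because m can stay in the hundreds while n grows to millions, both the kernel evaluations and the per-iteration memory stay linear in n, which is the scaling argument the abstract makes.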