Results 1 - 10 of 227
Survey of clustering algorithms
IEEE Transactions on Neural Networks, 2005
Cited by 399 (3 self)
Data analysis plays an indispensable role in understanding various phenomena. Cluster analysis, a primitive form of exploration that needs little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, such as proximity measures and cluster validation, are also discussed.
Survey of clustering data mining techniques
2002
Cited by 357 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
Clustering binary data streams with K-means
In Proc. ACM SIGMOD Data Mining and Knowledge Discovery Workshop, 2003
Cited by 62 (9 self)
Clustering data streams is an interesting data mining problem. This article presents three variants of the K-means algorithm to cluster binary data streams. The variants include Online K-means, Scalable K-means, and Incremental K-means, a proposed variant that finds higher quality solutions in less time. Higher quality solutions are obtained with a mean-based initialization and incremental learning. The speedup is achieved through a simplified set of sufficient statistics and operations with sparse matrices. A summary table of clusters is maintained online. The K-means variants are compared with respect to quality of results and speed. The proposed algorithms can be used to monitor transactions.
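To illustrate how a cluster summary can be maintained online from sufficient statistics, here is a minimal Python sketch of a generic online K-means pass over binary vectors. It is not the paper's Incremental K-means: the function names, the tie-breaking rule, and the choice to let the first assigned point overwrite the initial centroid are all assumptions made for this example. Each cluster keeps only a point count and a per-dimension sum of ones, and its centroid is their ratio.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def online_kmeans_binary(stream, init_centroids):
    """One pass of a generic online K-means over binary vectors.

    Hypothetical helper, not the paper's algorithm. Per cluster it keeps
    two sufficient statistics: counts[j] (points seen) and sums[j]
    (per-dimension count of ones). The centroid is sums[j] / counts[j].
    """
    centroids = [[float(v) for v in c] for c in init_centroids]
    k = len(centroids)
    d = len(centroids[0])
    counts = [0] * k
    sums = [[0] * d for _ in range(k)]
    labels = []
    for x in stream:
        # Assign the point to the nearest centroid (first one wins ties).
        j = min(range(k), key=lambda c: squared_distance(centroids[c], x))
        # Update the summary; the raw point is not stored.
        counts[j] += 1
        for t, bit in enumerate(x):
            sums[j][t] += bit
        # Assumption: the initial centroid is replaced once data arrives.
        centroids[j] = [s / counts[j] for s in sums[j]]
        labels.append(j)
    return centroids, counts, labels


stream = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0], [0, 0, 0, 1]]
centroids, counts, labels = online_kmeans_binary(
    stream, init_centroids=[[1, 1, 0, 0], [0, 0, 1, 1]]
)
print(labels, counts, centroids)
```

Because only counts and sums are stored, memory use is independent of the stream length, which is the property the abstract attributes to its summary table.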
A fuzzy k-modes algorithm for clustering categorical data
IEEE Transactions on Fuzzy Systems, 1999
Cited by 52 (5 self)
Similarity Measures for Categorical Data: A Comparative Evaluation
2008
Cited by 48 (3 self)
Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well understood, but for categorical data, the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances, but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates others for all types of problems, some measures are able to have consistently high performance.
Efficient k-anonymization using clustering techniques
In DASFAA, 2007
Cited by 43 (6 self)
k-anonymization techniques have been the focus of intense research in the last few years. An important requirement for such techniques is to ensure anonymization of data while at the same time minimizing the information loss resulting from data modifications. In this paper we propose an approach that uses the idea of clustering to minimize information loss and thus ensure good data quality. The key observation here is that data records that are naturally similar to each other should be part of the same equivalence class. We thus formulate a specific clustering problem, referred to as the k-member clustering problem. We prove that this problem is NP-hard and present a greedy heuristic, the complexity of which is in O(n^2). As part of our approach we develop a suitable metric to estimate the information loss introduced by generalizations, which works for both numeric and categorical data.
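The idea of growing equivalence classes of at least k similar records can be sketched in Python as follows. This is only an illustration under stated assumptions, not the paper's heuristic or its metric: the function names are hypothetical, the loss function is a simplified stand-in (normalized range for numeric attributes, a 0/1 penalty for categorical attributes with no taxonomy), and the seed of each cluster is simply the next unassigned record.

```python
def info_loss(cluster, numeric_ranges):
    """Illustrative loss of generalizing a cluster to one shared record.

    Assumption: numeric attribute a costs (max - min) / numeric_ranges[a];
    any other attribute costs 1 if the cluster holds more than one value.
    The total is multiplied by the cluster size.
    """
    cost = 0.0
    for a in range(len(cluster[0])):
        values = [r[a] for r in cluster]
        if a in numeric_ranges:
            cost += (max(values) - min(values)) / numeric_ranges[a]
        else:
            cost += 0.0 if len(set(values)) == 1 else 1.0
    return len(cluster) * cost


def added_loss(cluster, record, numeric_ranges):
    """Increase in loss caused by adding one record to a cluster."""
    return (info_loss(cluster + [record], numeric_ranges)
            - info_loss(cluster, numeric_ranges))


def greedy_k_member(records, k, numeric_ranges):
    """Greedy sketch: every returned cluster has at least k records."""
    if len(records) < k:
        raise ValueError("need at least k records")
    remaining = list(records)
    clusters = []
    while len(remaining) >= k:
        cluster = [remaining.pop(0)]  # assumption: next record is the seed
        while len(cluster) < k:
            # Add the record that keeps the cluster's loss smallest.
            best = min(range(len(remaining)),
                       key=lambda i: info_loss(cluster + [remaining[i]],
                                               numeric_ranges))
            cluster.append(remaining.pop(best))
        clusters.append(cluster)
    for record in remaining:  # fewer than k records are left over
        best = min(range(len(clusters)),
                   key=lambda i: added_loss(clusters[i], record,
                                            numeric_ranges))
        clusters[best].append(record)
    return clusters


records = [(25, "M"), (26, "M"), (60, "F"), (62, "F"), (27, "M")]
print(greedy_k_member(records, k=2, numeric_ranges={0: 37}))
```

Each cluster is built by scanning the unassigned records once per added member, which gives quadratic behaviour in the number of records when k is treated as a constant. The result depends on record order, a known weakness of seeding this naively.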
Recent advances in clustering: A brief survey
WSEAS Trans. Inform. Sci. Appl.
Cited by 30 (0 self)
Unsupervised learning (clustering) deals with instances that have not been preclassified in any way and so do not have a class attribute associated with them. The aim of applying clustering algorithms is to discover useful but unknown classes of items. Unsupervised learning is an approach to learning in which instances are automatically placed into meaningful groups based on their similarity. This paper introduces the fundamental concepts of unsupervised learning and surveys recent clustering algorithms. Moreover, recent advances in unsupervised learning, such as ensembles of clustering algorithms and distributed clustering, are described.
Clustering Web Sessions by Sequence Alignment
In Proceedings of the 13th International Workshop on Database and Expert Systems Applications (DEXA 2002), Aix-en-Provence, 2002
Cited by 29 (0 self)
Clustering means grouping similar objects into groups such that objects within the same group bear similarity to each other while objects in different groups are dissimilar to each other. As an important component of data mining, much research on clustering has been conducted in different disciplines. In the context of web mining, clustering could be used to cluster similar clickstreams to determine learning behaviours in the case of e-learning, or general site access behaviours in e-commerce or other online applications. Most of the algorithms presented in the literature to deal with clustering web sessions treat sessions as sets of visited pages within a time period and don't consider the sequence of the clickstream visitation. This has a significant consequence when comparing similarities between web sessions. We propose in this paper a new algorithm based on sequence alignment to measure similarities between web sessions where sessions are chronologically ordered sequences of page accesses.
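The contrast between set-based and order-aware session similarity can be seen in a small Python sketch. It uses a standard global alignment (Needleman-Wunsch style dynamic programming) with exact page matching; the scoring constants, the normalization, and the function names are assumptions for illustration and are not taken from the paper, which may score page pairs differently.

```python
def alignment_score(s, t, match=2, mismatch=-1, gap=-1):
    """Global alignment score of two page sequences (dynamic programming).

    Assumed scoring: +2 for identical pages, -1 for a mismatch, -1 per gap.
    """
    n, m = len(s), len(t)
    prev = [j * gap for j in range(m + 1)]  # row for the empty prefix of s
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[m]


def session_similarity(s, t, match=2, mismatch=-1, gap=-1):
    """Alignment score scaled to [0, 1] by the best achievable score."""
    if not s and not t:
        return 1.0
    best = match * max(len(s), len(t))
    return max(0.0, alignment_score(s, t, match, mismatch, gap) / best)


def jaccard(s, t):
    """Set-based similarity that ignores the order of page visits."""
    a, b = set(s), set(t)
    return len(a & b) / len(a | b) if (a | b) else 1.0


forward = ["home", "catalog", "item", "cart"]
backward = ["cart", "item", "catalog", "home"]
print(jaccard(forward, backward))             # same pages, order ignored
print(session_similarity(forward, backward))  # order is taken into account
```

The two sessions visit the same pages in opposite order, so the set-based measure treats them as identical while the alignment-based measure does not.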
Improving the Accuracy and Efficiency of the k-means Clustering Algorithm
Cited by 29 (0 self)
The emergence of modern techniques for scientific data collection has resulted in large-scale accumulation of data pertaining to diverse fields. Conventional database querying methods are inadequate to extract useful information from huge data banks. Cluster analysis is one of the major data analysis methods, and the k-means clustering algorithm is widely used for many practical applications. But the original k-means algorithm is computationally expensive, and the quality of the resulting clusters heavily depends on the selection of initial centroids. Several methods have been proposed in the literature for improving the performance of the k-means clustering algorithm. This paper proposes a method for making the algorithm more effective and efficient, so as to get better clustering with reduced complexity.
Entropy-Based Criterion in Categorical Clustering
Proc. of Intl. Conf. on Machine Learning (ICML), 2004
Cited by 28 (4 self)
Entropy-type measures for the heterogeneity of clusters have been used for a long time. This paper studies the entropy-based criterion in clustering categorical data. It first shows that the entropy-based criterion can be derived in the formal framework of probabilistic clustering models and establishes the connection between the criterion and the approach based on dissimilarity coefficients.
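As a rough illustration of what an entropy-based criterion measures, the Python sketch below computes a commonly used form: the size-weighted sum of per-attribute Shannon entropies within each cluster. Whether the paper uses exactly this weighting and normalization is an assumption here, and the function names are hypothetical. Lower values indicate more homogeneous clusters.

```python
from collections import Counter
from math import log2


def cluster_entropy(cluster):
    """Sum over attributes of the Shannon entropy (in bits) of the values
    that attribute takes inside one cluster of categorical records."""
    n = len(cluster)
    total = 0.0
    for a in range(len(cluster[0])):
        counts = Counter(record[a] for record in cluster)
        total -= sum((c / n) * log2(c / n) for c in counts.values())
    return total


def expected_entropy(clusters):
    """Size-weighted average of cluster entropies for a whole partition.

    Assumed form of the criterion: sum_k (n_k / n) * H(cluster_k).
    """
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


pure = [[("red", "small"), ("red", "small")],
        [("blue", "large"), ("blue", "large")]]
mixed = [[("red", "small"), ("blue", "large")],
         [("red", "small"), ("blue", "large")]]
print(expected_entropy(pure))   # homogeneous clusters
print(expected_entropy(mixed))  # heterogeneous clusters score higher
```

A partition whose clusters each contain a single value per attribute scores zero, so minimizing this quantity favours clusters of near-identical records.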