Results 1  10
of
125
Data Clustering: A Review
 ACM COMPUTING SURVEYS
, 1999
"... Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exp ..."
Abstract

Cited by 1413 (13 self)
 Add to MetaCart
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities has made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify crosscutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.
Voronoi diagrams  a survey of a fundamental geometric data structure
 ACM COMPUTING SURVEYS
, 1991
"... This paper presents a survey of the Voronoi diagram, one of the most fundamental data structures in computational geometry. It demonstrates the importance and usefulness of the Voronoi diagram in a wide variety of fields inside and outside computer science and surveys the history of its development. ..."
Abstract

Cited by 621 (5 self)
 Add to MetaCart
This paper presents a survey of the Voronoi diagram, one of the most fundamental data structures in computational geometry. It demonstrates the importance and usefulness of the Voronoi diagram in a wide variety of fields inside and outside computer science and surveys the history of its development. The paper puts particular emphasis on the unified exposition of its mathematical and algorithmic properties. Finally, the paper provides the first comprehensive bibliography on Voronoi diagrams and related structures.
BIRCH: an efficient data clustering method for very large databases
 In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD
, 1996
"... Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely st,udied problems in this area is the identification of clusters, or deusel y populated regions, in a multidir nensional clataset. Prior work does not adequately address the problem of ..."
Abstract

Cited by 476 (2 self)
 Add to MetaCart
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely st,udied problems in this area is the identification of clusters, or deusel y populated regions, in a multidir nensional clataset. Prior work does not adequately address the problem of large datasets and minimization of 1/0 costs. This paper presents a data clustering method named Bfll (;”H (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and clynamicall y clusters incoming multidimensional metric data points to try to produce the best quality clustering with the available resources (i. e., available memory and time constraints). BIRCH can typically find a goocl clustering with a single scan of the data, and improve the quality further with a few aclditioual scans. BIRCH is also the first clustering algorithm proposerl in the database area to handle “noise) ’ (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH’S time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparisons of BIR (;’H versus CLARA NS, a clustering method proposed recently for large datasets, and S11OW that BIRCH is consistently 1
FastMap: A Fast Algorithm for Indexing, DataMining and Visualization of Traditional and Multimedia Datasets
, 1995
"... A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in kd space, using k featureextraction functions, provided by a domain expert [25]. Thus, we can subsequently use highly finetuned spatial access methods (SAMs), to answer several types ..."
Abstract

Cited by 434 (22 self)
 Add to MetaCart
(Show Context)
A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in kd space, using k featureextraction functions, provided by a domain expert [25]. Thus, we can subsequently use highly finetuned spatial access methods (SAMs), to answer several types of queries, including the `Query By Example' type (which translates to a range query); the `all pairs' query (which translates to a spatial join [8]); the nearestneighbor or bestmatch query, etc. However, designing feature extraction functions can be hard. It is relatively easier for a domain expert to assess the similarity/distance of two objects. Given only the distance information though, it is not obvious how to map objects into points. This is exactly the topic of this paper. We describe a fast algorithm to map objects into points in some kdimensional space (k is userdefined), such that the dissimilarities are preserved. There are two benefits from this mapping: (a) efficient ret...
Survey of clustering data mining techniques
, 2002
"... Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in math ..."
Abstract

Cited by 286 (0 self)
 Add to MetaCart
(Show Context)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
Survey of clustering algorithms
 IEEE TRANSACTIONS ON NEURAL NETWORKS
, 2005
"... Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the ..."
Abstract

Cited by 270 (3 self)
 Add to MetaCart
(Show Context)
Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure, and cluster validation, are also discussed.
The TVtree  an index structure for highdimensional data
 VLDB Journal
, 1994
"... We propose a file structure to index highdimensionality data, typically, points in some feature space. The idea is to use only a few of the features, utilizing additional features whenever the additional discriminatory power is absolutely necessary. We present in detail the design of our tree struc ..."
Abstract

Cited by 208 (7 self)
 Add to MetaCart
(Show Context)
We propose a file structure to index highdimensionality data, typically, points in some feature space. The idea is to use only a few of the features, utilizing additional features whenever the additional discriminatory power is absolutely necessary. We present in detail the design of our tree structure and the associated algorithms that handle such `varying length' feature vectors. Finally we report simulation results, comparing the proposed structure with the R tree, which is one of the most successful methods for lowdimensionality spaces. The results illustrate the superiority of our method, with up to 80% savings in disk accesses. Type of Contribution: New Index Structure, for highdimensionality feature spaces. Algorithms and performance measurements. Keywords: Spatial Index, Similarity Retrieval, Query by Content 1 Introduction Many applications require enhanced indexing, capable of performing similarity searching on several, nontraditional (`exotic') data types. The targ...
Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences
 In SIGMOD
, 1997
"... Ad hoc querying is difficult on very large datasets, since it is usually not possible to have the entire dataset on disk. While compression can be used to decrease the size of the dataset, compressed data is notoriously difficult to index or access. In this paper we consider a very large dataset com ..."
Abstract

Cited by 108 (15 self)
 Add to MetaCart
Ad hoc querying is difficult on very large datasets, since it is usually not possible to have the entire dataset on disk. While compression can be used to decrease the size of the dataset, compressed data is notoriously difficult to index or access. In this paper we consider a very large dataset comprising multiple distinct time sequences. Each point in the sequence is a numerical value. We show how to compress such a dataset into a format that supports ad hoc querying, provided that a small error can be tolerated when the data is uncompressed. Experiments on large, real world datasets (AT&T customer calling patterns) show that the proposed method achieves an average of less than 5% error in any data value after compressing to a mere 2.5% of the original space (i.e., a 40:1 compression ratio), with these numbers not very sensitive to dataset size. Experiments on aggregate queries achieved a 0.5% reconstruction error with a space requirement under 2%. 1 Introduction The bulk of the data...
Network tomography: recent developments
 Statistical Science
, 2004
"... Today's Int ernet is a massive, dist([/#][ net work which cont inuest o explode in size as ecommerce andrelatH actH]M/# grow. Thehet([H(/#]H( and largelyunregulatS stregula of t/ Int/HH3 renderstnde such as dynamicroutc/[ opt2]3fl/ service provision, service level verificatflH( and det(2][/ of ..."
Abstract

Cited by 92 (4 self)
 Add to MetaCart
(Show Context)
Today's Int ernet is a massive, dist([/#][ net work which cont inuest o explode in size as ecommerce andrelatH actH]M/# grow. Thehet([H(/#]H( and largelyunregulatS stregula of t/ Int/HH3 renderstnde such as dynamicroutc/[ opt2]3fl/ service provision, service level verificatflH( and det(2][/ of anomalous/malicious behaviorext/[(22 challenging. The problem is compounded bytS fact tct onecannot rely ont[ cooperatH2 of individual servers and routSS t aid intS collect[3 of net workt/[S measurement vits fort/]3 t/]3] In many ways, net workmonit]/#[ and inference problems bear a st[fl[ resemblancet otnc "inverse problems" in which key aspect of asystfl are not direct/ observable. Familiar signal processing orst[]23/#[S problems such ast omographic imagereconst[/#[S] and phylogenet# tog identn/HH2[M have int erest3/ connect[HU t tonn arising in net working. This artflMM int/ ducesnet workt/H3]S]/ y, a new field which we believe will benefit greatU from tm wealt of stH2](/#S( ttH2 andalgorit#S( It focuses especially on recent development s int2 field includingtl applicat[fl of pseudolikelihoodmetfl ds andt reeestfl3](/# formulat]M23 Keyw ords:Net workt/HflS33/ y, pseudolikelihood,t opology identn/]H22(/ tn est/]H tst 1 Introducti6 Nonet work is an island, ent/S ofitS[S] everynet work is a piece of an int/]SS work, a part of t/ main . Alt[]][ administHSHSS of smallscale net works can monit( localt ra#ccondit][/ and ident ify congest/# point s and performance botU((2/ ks, very few net works are complet/# # Rui Castroan Robert Nowak are with theDepartmen t of Electricalan ComputerEnterX Rice Unc ersity,Houston TX; Mark Coates is with the Departmen t of Electricalan ComputerEnterX McGill UnG ersity,Mon treal, Quebec,Can Gan Lian an Bin Yu are with theDepartmen t of Statistics,...
Parallel Algorithms for Hierarchical Clustering
 Parallel Computing
, 1995
"... Hierarchical clustering is a common method used to determine clusters of similar data points in multidimensional spaces. O(n 2 ) algorithms are known for this problem [3, 4, 10, 18]. This paper reviews important results for sequential algorithms and describes previous work on parallel algorithms f ..."
Abstract

Cited by 84 (1 self)
 Add to MetaCart
(Show Context)
Hierarchical clustering is a common method used to determine clusters of similar data points in multidimensional spaces. O(n 2 ) algorithms are known for this problem [3, 4, 10, 18]. This paper reviews important results for sequential algorithms and describes previous work on parallel algorithms for hierarchical clustering. Parallel algorithms to perform hierarchical clustering using several distance metrics are then described. Optimal PRAM algorithms using n log n processors are given for the average link, complete link, centroid, median, and minimum variance metrics. Optimal butterfly and tree algorithms using n log n processors are given for the centroid, median, and minimum variance metrics. Optimal asymptotic speedups are achieved for the best practical algorithm to perform clustering using the single link metric on a n log n processor PRAM, butterfly, or tree. Keywords. Hierarchical clustering, pattern analysis, parallel algorithm, butterfly network, PRAM algorithm. 1 In...