## A Customizable Hybrid Approach to Data Clustering (2003)

Venue: Proc. of the 2003 ACM Symposium on Applied Computing

Citations: 4 (1 self)

### BibTeX

@INPROCEEDINGS{Qian03acustomizable,
  author = {Yu Qian and Kang Zhang},
  title = {A Customizable Hybrid Approach to Data Clustering},
  booktitle = {Proc. of the 2003 ACM Symposium on Applied Computing},
  year = {2003},
  pages = {485--489}
}

### Abstract

Most current data clustering algorithms in data mining are based on a distance calculation in a certain metric space. For Spatial Database Systems (SDBS), the Euclidean distance between two data points is often used to represent the relationship between data points. However, in some spatial settings and many other applications, distance alone is not enough to represent all the attributes of the relation between data points. We need a more powerful model to record more relational information between data objects. This paper adopts a graph model by which a database is regarded as a graph: each vertex of the graph represents a data point, and each edge, weighted or unweighted, is used to record the relation between the two data points it connects. Based on the graph model, this paper presents a set of cluster analysis criteria to guide data clustering. The criteria can be used to measure clustering results and help improve the quality of clustering. Further, a customizable algorithm using the criteria is proposed and implemented. This algorithm can produce clusters according to users' specifications. Preliminary experiments show encouraging results.
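The graph model described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `build_graph`, `near`, and `connected_components`, the distance threshold, and the sample points do not come from the paper. The grouping step uses plain connected components as a naive stand-in and is not the customizable algorithm the paper proposes.

```python
from itertools import combinations
from math import dist


def build_graph(points, relation):
    """Return an adjacency dict {i: {j: weight}} over point indices.

    `relation` is a hypothetical callback: it returns an edge weight for a
    pair of points, or None when the two points should not be connected.
    """
    graph = {i: {} for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        weight = relation(points[i], points[j])
        if weight is not None:
            graph[i][j] = weight
            graph[j][i] = weight
    return graph


def near(p, q, threshold=1.5):
    """Assumed relation: Euclidean distance, kept only within a threshold."""
    d = dist(p, q)
    return d if d <= threshold else None


def connected_components(graph):
    """Naive grouping (not the paper's method): one group per component."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


# Illustrative data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
graph = build_graph(points, near)
clusters = connected_components(graph)
```

Because the edge weight is produced by a caller-supplied function, the same structure could record relations other than distance, which is the motivation the abstract gives for moving from a pure distance model to a graph.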

### Citations

1096 | A density-based algorithm for discovering clusters in large spatial databases with noise
- Ester, Kriegel, et al.
- 1996
Citation Context ... [16]. On the other hand, many clustering algorithms use some parameters or thresholds to control the clustering process or results. They include: CHAMELEON [15], BIRCH [12], CLARANS [10], DBSCAN [8], STING [11], and Random Walk [14]. The parameters or thresholds in these algorithms can be regarded as channels between the clustering algorithm and the external environment. ...

596 | Efficient and Effective Clustering Methods for Spatial Data Mining
- Ng, Han
- 1994
Citation Context ...h is AUTOCLUST [16]. On the other hand, many clustering algorithms use some parameters or thresholds to control the clustering process or results. They include: CHAMELEON [15], BIRCH [12], CLARANS [10], DBSCAN [8], STING [11], and Random Walk [14]. The parameters or thresholds in these algorithms can be regarded as channels between the clustering algorithm and the external environment. ...

394 | Data Mining: An Overview from a Database Perspective
- Chen, Han, et al.
- 1996
Citation Context ...mpared with current data clustering methods, this approach has two major features: first, it is based on graph structure analysis. Current clustering analyses are mainly based on distance computation [13]. In many applications, distance alone is not enough to represent all the attributes of the relation between data points. By modeling data with a graph, the approach proposed in this paper can contro...

234 | STING: a statistical information grid approach to spatial data mining
- Wang, Yang, et al.
- 1997
Citation Context ...he other hand, many clustering algorithms use some parameters or thresholds to control the clustering process or results. They include: CHAMELEON [15], BIRCH [12], CLARANS [10], DBSCAN [8], STING [11], and Random Walk [14]. The parameters or thresholds in these algorithms can be regarded as channels between the clustering algorithm and the external environment. ...

216 | BIRCH: an efficient data clustering method for very large databases
- Zhang, Ramakrishnan, et al.
- 1996
Citation Context ...tomatic approach is AUTOCLUST [16]. On the other hand, many clustering algorithms use some parameters or thresholds to control the clustering process or results. They include: CHAMELEON [15], BIRCH [12], CLARANS [10], DBSCAN [8], STING [11], and Random Walk [14]. The parameters or thresholds in these algorithms can be regarded as channels between the clustering algorithm and the external enviro...

212 | CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling
- Karypis, Han, et al.
- 1999
Citation Context ...esentative automatic approach is AUTOCLUST [16]. On the other hand, many clustering algorithms use some parameters or thresholds to control the clustering process or results. They include: CHAMELEON [15], BIRCH [12], CLARANS [10], DBSCAN [8], STING [11], and Random Walk [14]. The parameters or thresholds in these algorithms can be regarded as channels between the clustering algorithm and the ex...

79 | A fast multi-scale method for drawing large graphs
- Harel, Koren
- 2000
Citation Context ...de-disjoint cliques or sub-graphs that are almost-cliques (almost complete graphs). Node-disjoint cliques and almost-cliques are called groups, which form sets of high-cohesive nodes. Harel and Koren [3] provide a heuristic to help decide which vertices should be drawn closely. The heuristic is based on the observation that a nice layout of a graph should convey visually the relational information...

57 | Exact and approximation algorithms for clustering
- Agarwal, Procopiuc
Citation Context ... cluster is $c$, $n = cp$, $n$ is the number of total vertices in the given graph, and $p$ is the number of clusters. The size of cluster $C_i$ must satisfy $(1 - \alpha/2)c \le |C_i| \le c(1 + \alpha)$, $1 \le i \le p$, $0 \le \alpha < 1$. Agarwal [6] proposes another definition of cluster size based on geometry space: the size of a cluster $C_i$ is the maximum distance between a fixed-point fp, called the center of the cluster, and any vertex of C ...

47 | Problem decomposition and data reorganization by a clustering technique
- McCormick, Schweitzer, et al.
- 1972
Citation Context ... Step 2 is merely a process of creating the adjacency matrix for a graph we will skip it and explain Step 3. 4.2 The Bond Energy Algorithm The bond energy algorithm (BEA), proposed by McCormick et al. [19], is a cluster-analysis method for identifying natural groups and clusters in complex data arrays. It introduces the concept of measure of effectiveness (ME), aiming at maximizing the summed bond ene...

29 | Clustering Spatial Data Using Random Walks
- Harel, Koren
Citation Context ...rge number of relationships between data points [17] while non-graph-based approaches cannot. Recently published graph-based approaches include: CHAMELEON [15], AUTOCLUST [16], Subdue [17] and Random Walk [14]. From the perspective of whether user inputs play a role in the cluster analysis, we can categorize clustering methods into automatic clustering and semi-automatic clustering. Automatic clustering i...

21 | AUTOCLUST: Automatic Clustering via Boundary Extraction for Mining Massive PointData
- Estivill-Castro, Lee
- 2000
Citation Context ..., has the ability to represent a large number of relationships between data points [17] while non-graph-based approaches cannot. Recently published graph-based approaches include: CHAMELEON [15], AUTOCLUST [16], Subdue [17] and Random Walk [14]. From the perspective of whether user inputs play a role in the cluster analysis, we can categorize clustering methods into automatic clustering and semi-automatic...

8 | Clustering for Mining
- Ester, Kriegel, et al.
- 1998

6 | Effective graph visualization via node grouping
- Six, Tollis
- 2003
Citation Context ...uts. (what's the role of $R^+$?) 2.2 Graph Theoretical Analysis A clique (or a complete graph) is a simple graph in which there is one edge between every pair of vertices. The purpose of node grouping [7] is to abstract small node-disjoint cliques or sub-graphs that are almost-cliques (almost complete graphs). Node-disjoint cliques and almost-cliques are called groups, which form sets of high-cohesive...

3 | Partitioning approaches to clustering
- Batagelj, Mrvar, et al.
- 2000
Citation Context ... i. Let S be a set of n nodes in a metric space; a k-clustering of S is a partition C of S into k subsets $C_1, C_2, \ldots, C_k$. The size of C is the maximum size of a cluster in C. A functional approach [2] describes the clustering criterion as a function $P: \Phi \to R^+$. The process of finding the best clustering is to determine the clustering $C^* \in \Phi$ for which $P(C^*) = \min_{C \in \Phi} P(C)$, where $\Phi$ is a set of fe...

2 | Clustering in Trees: Optimizing Cluster Sizes and Number of Subtrees
- Hambrusch, Liu, et al.
- 2000
Citation Context ...me of them are based on computational geometry, some for spatial database system applications, while others focus on graph theoretical properties. 2.1 Computational Geometry Analysis Hambrusch et al. [1] define that the ideal size (the number of vertices) of a cluster is $c$, $n = cp$, $n$ is the number of total vertices in the given graph, and $p$ is the number of clusters. The size of cluster $C_i$ must satis...

1 | Vistool: A Tool For Visualizing Graphs
- May-Six
- 2000
Citation Context ...y meets the user's requirements. There are two important observations that may help: (1) the small cliques ($K_3$, $K_4$, $K_5$) appear often in the typical structures laid out by graph visualization tools [5]; (2) a clique of degree k+1 can possibly exist only in a k-core. This implies that if a k-clique exists in the given graph, it must be a sub-graph of the (k+1)-core of the given graph. It is no doub...

1 | Locality Metrics and Program Physical Structures
- Zhang, Gorla
- 2000
Citation Context ...nt. Although finding such a permutation requires exponential time, a near-optimal algorithm in $O(n^4)$ time produces results that are close to those of exhaustive search, according to Zhang and Gorla [20]. Compared with other data grouping methods, the BEA is accurate and produces good results. More importantly, BEA can be measured and customized by users' specifications and requirements. The procedu...