## Convex Group Clustering of Large Geo-referenced Data Sets (1999)

Venue: | In Abstracts for the Eleventh Canadian Conference on Computational Geometry (CCCG'99) |

Citations: | 1 - 0 self |

### BibTeX

@INPROCEEDINGS{Estivill-Castro99convexgroup,

author = {Vladimir Estivill-Castro},

title = {Convex Group Clustering of Large Geo-referenced Data Sets},

booktitle = {In Abstracts for the Eleventh Canadian Conference on Computational Geometry (CCCG'99)},

year = {1999},

url = {http://www.cs.ubc.ca}

}

### Abstract

Clustering partitions a data set $S = \{s_1, \ldots, s_n\} \subset \Re^m$ into groups of nearby points. Distance-based clustering uses optimisation criteria for defining the quality of the partition. Formulations using representatives (means or medians of groups) have received much more attention than minimisation of the total within group distance (TWGD). However, this non-representative approach has attractive properties while remaining distance-based. While representative approaches produce partitions with non-overlapping clusters, TWGD does not. We investigate the restriction of TWGD to producing convex-hull disjoint groups and show that this problem is NP-complete in the Euclidean case as soon as $m \geq 2$. Nevertheless we provide efficient algorithms for solving it approximately.

Keywords: clustering, optimisation, computational geometry, problem complexity, data mining in spatial databases.

1 Introduction

Clustering is a fundamental task in data analysis since it identifies groups in heterog...
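The two notions the abstract relies on, the TWGD value of a partition and the convex-hull disjointness (CH-disjoint) restriction, can be illustrated with a minimal sketch for planar points ($m = 2$). It is not the paper's algorithm. It assumes that each unordered pair inside a group is counted once in the TWGD sum, that every group is non-empty, and that hulls are compared with a separating-axis test. The function names `twgd`, `convex_hull`, `hulls_disjoint` and `is_ch_disjoint` are hypothetical and chosen only for this illustration.

```python
from itertools import combinations
from math import dist


def twgd(groups):
    """Total within-group distance (assumption: each unordered pair counted once)."""
    return sum(dist(a, b) for g in groups for a, b in combinations(g, 2))


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(reversed(pts))


def _axes(hull):
    """Candidate separating directions: edge normals, plus the edge direction for a segment."""
    axes = []
    n = len(hull)
    if n >= 2:
        for i in range(n):
            (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % n]
            axes.append((y1 - y2, x2 - x1))
        if n == 2:
            (x1, y1), (x2, y2) = hull
            axes.append((x2 - x1, y2 - y1))
    return axes


def hulls_disjoint(a, b):
    """True if some direction strictly separates the two convex hulls."""
    ha, hb = convex_hull(a), convex_hull(b)
    # The coordinate axes are added so that two single-point groups are handled too.
    for ax, ay in _axes(ha) + _axes(hb) + [(1, 0), (0, 1)]:
        pa = [ax * x + ay * y for x, y in ha]
        pb = [ax * x + ay * y for x, y in hb]
        if max(pa) < min(pb) or max(pb) < min(pa):
            return True
    return False


def is_ch_disjoint(groups):
    """A partition is CH-disjoint if all pairs of group hulls are disjoint."""
    return all(hulls_disjoint(a, b) for a, b in combinations(groups, 2))


# Four corners of a square, paired along the diagonals or along two sides.
crossing = [[(0, 0), (4, 4)], [(0, 4), (4, 0)]]
side_by_side = [[(0, 0), (0, 4)], [(4, 4), (4, 0)]]
print(twgd(crossing), is_ch_disjoint(crossing))
print(twgd(side_by_side), is_ch_disjoint(side_by_side))
```

In this toy instance the diagonal pairing has hulls that cross at the centre of the square, so it is not CH-disjoint, while the side pairing is CH-disjoint and also has the smaller TWGD value.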

### Citations

10928 |
Computers and Intractability: A Guide to the Theory of NP-Completeness
- Garey, Johnson
- 1979
(Show Context)
Citation Context ... S with n points in the plane and integers p and B. The decision question is if there exists a CH-disjoint partition of S into p parts whose TWGD value is less than B. The proof is a component design =-=[20]-=- proof that reduces 3-satisfiability to the CH-disjoint-restricted TWGD problem. In fact, the proof is similar to other proofs for Euclidean clustering problems [30]. An instance of 3-SAT is converted... |

3921 |
Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
(Show Context)
Citation Context ...ta mining (KDDM) [12, 14, 37, 47] have contributed with algorithms for many clustering approaches. Hierarchical bottom-up approaches form groups by composition or merging items that are close together =-=[10, 29]-=-. However, top-down partition approaches to clustering are also interesting, in particular for spatial data mining [12, 37, 48]. This perspective defines clustering as partitioning a heterogeneous dat... |

1329 |
Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons
- Kaufman, Rousseeuw
- 1990
(Show Context)
Citation Context ...s are restricted to be data points (this is referred to as the facility location problem by theoreticians [1]). In clustering for spatial data, medians that are data points have been referred to as medoids =-=[11, 12, 37, 26]-=-. Nevertheless, in the case m = 1, the problems are typically solved by different variants of dynamic programming in $O(n^2 p)$ time [6, 30, and references]. For example, when m = 1, the problem in Equa... |

643 | Knowledge acquisition via incremental conceptual clustering
- Fisher
- 1987
(Show Context)
Citation Context ...analysis since it identifies groups in heterogeneous data. Clustering can be seen as a concept formation or class delineation problem. At least the fields of statistics [44, 46], machine intelligence =-=[5, 15, 32]-=- and more recently knowledge discovery and data mining (KDDM) [12, 14, 37, 47] have contributed with algorithms for many clustering approaches. Hierarchical bottom-up approaches form groups by composit... |

625 |
Statistical Analysis of Finite Mixture Distributions
- Titterington, Smith, et al.
- 1985
(Show Context)
Citation Context ... is a fundamental task in data analysis since it identifies groups in heterogeneous data. Clustering can be seen as a concept formation or class delineation problem. At least the fields of statistics =-=[44, 46]-=-, machine intelligence [5, 15, 32] and more recently knowledge discovery and data mining (KDDM) [12, 14, 37, 47] have contributed with algorithms for many clustering approaches. Hierarchical bottom-up ... |

596 | Efficient and Effective Clustering Methods for Spatial Data Mining
- Ng, Han
- 1994
(Show Context)
Citation Context ... be seen as a concept formation or class delineation problem. At least the fields of statistics [44, 46], machine intelligence [5, 15, 32] and more recently knowledge discovery and data mining (KDDM) =-=[12, 14, 37, 47]-=- have contributed with algorithms for many clustering approaches. Hierarchical bottom-up approaches form groups by composition or merging items that are close together [10, 29]. However, top-down parti... |

281 |
Clustering to minimize the maximum intercluster distance
- Gonzalez
- 1985
(Show Context)
Citation Context ...r, the TWGD problem is NP-complete (in its graph-theoretic decision formulation [4] and in the Euclidean formulation [27]). Although most other distance-based clustering problems are also NP-complete =-=[4, 22, 28]-=-, they are more commonly solved approximately in clustering applications. We believe that this is due to several factors. 1. The TWGD problem, when formulated as an integer programming problem, is very d... |

270 | Pattern Classification and Scene Analysis - Duda, Hart, et al. - 1973 |

241 |
AutoClass: A bayesian classification system
- Cheeseman, Kelly, et al.
- 1988
(Show Context)
Citation Context ...analysis since it identifies groups in heterogeneous data. Clustering can be seen as a concept formation or class delineation problem. At least the fields of statistics [44, 46], machine intelligence =-=[5, 15, 32]-=- and more recently knowledge discovery and data mining (KDDM) [12, 14, 37, 47] have contributed with algorithms for many clustering approaches. Hierarchical bottom-up approaches form groups by composit... |

234 | STING: a statistical information grid approach to spatial data mining
- Wang, Yang, et al.
- 1997
(Show Context)
Citation Context ... be seen as a concept formation or class delineation problem. At least the fields of statistics [44, 46], machine intelligence [5, 15, 32] and more recently knowledge discovery and data mining (KDDM) =-=[12, 14, 37, 47]-=- have contributed with algorithms for many clustering approaches. Hierarchical bottom-up approaches form groups by composition or merging items that are close together [10, 29]. However, top-down parti... |

229 |
Future paths for integer programming and links to artificial intelligence
- Glover
- 1986
(Show Context)
Citation Context ...l. The TaB heuristic forbids the reconsideration of $\vec{x}_i$ for inclusion until all data points have been considered as well. The heuristic can therefore be regarded as a local variant of tabu search =-=[21]-=-. TaB's careful design balances the need to explore a variety of possible interchanges against the `greedy' desire to improve the solution as quickly as possible. The TaB heuristic has been remarkably... |

216 |
BIRCH: an efficient data clustering method for very large databases
- Zhang, Ramakrishnan, et al.
- 1996
(Show Context)
Citation Context ...proaches form groups by composition or merging items that are close together [10, 29]. However, top-down partition approaches to clustering are also interesting, in particular for spatial data mining =-=[12, 37, 48]-=-. This perspective defines clustering as partitioning a heterogeneous data set into smaller more homogeneous groups [2, 19, 40]. Clustering typically uses a metric (or distance) to determine the dissi... |

206 |
An Algorithmic Approach to Network Location Problems II: The p-medians
- Kariv, Hakimi
- 1979
(Show Context)
Citation Context ...$\alpha)^{1/\alpha})$. At least this is the case in the computational geometry community and the theoretical computer science community [1, and references] (perhaps after Megiddo [30] or after Kariv and Hakimi =-=[25]-=- or perhaps by analogy to the case m = 1). The computational geometry community reserves the name p-centres problem for when $\sum$ is replaced by max in Equation (3). This is because we are minimising the... |

187 |
Estimation and inference by compact coding
- Wallace, Freeman
- 1987
(Show Context)
Citation Context ... is a fundamental task in data analysis since it identifies groups in heterogeneous data. Clustering can be seen as a concept formation or class delineation problem. At least the fields of statistics =-=[44, 46]-=-, machine intelligence [5, 15, 32] and more recently knowledge discovery and data mining (KDDM) [12, 14, 37, 47] have contributed with algorithms for many clustering approaches. Hierarchical bottom-up ... |

159 |
Maintenance of Configurations in the Plane
- Overmars, van Leeuwen
- 1981
(Show Context)
Citation Context ...a structures for dynamically maintaining the p convex hulls (with respect to insertions and deletions) are possible. Overmars and van Leeuwen structures will suffice since they require $O(\log^2 n)$ time =-=[38]-=-. We should mention that, alternatively, the restricted CH-disjoint TWGD problem that we have presented here may be amenable to polynomial approximation schemes [1]. These randomised algorithms produc... |

137 |
Data mining techniques: for marketing, sales, and customer relationship management
- Berry, Linoff
- 2004
(Show Context)
Citation Context ... clustering are also interesting, in particular for spatial data mining [12, 37, 48]. This perspective defines clustering as partitioning a heterogeneous data set into smaller more homogeneous groups =-=[2, 19, 40]-=-. Clustering typically uses a metric (or distance) to determine the dissimilarity between the items to be clustered. Here we consider the clustering problem in the context of spatial databases, those ... |

117 |
Facility Layout and Location: An Analytical Approach, 2nd ed
- Francis, McGinnis, et al.
- 1992
(Show Context)
Citation Context ...ly Equation (3)) as the within groups sum of squares problem [40]. In particular, for geographical data, it is very important to note this aspect. This squared version is known as the gravity problem =-=[17]-=- (or as the centroid problem [18]) in facility location literature because $\mathrm{Euclid}^2(\vec{x}, X)$ is minimised by the centre of gravity. However, if the problem involves the direct Euclidean distance as in ... |

117 | On the complexity of some common geometric location problems
- Megiddo, Supowit
- 1984
(Show Context)
Citation Context ...ll elements x in set X. 2 The contiguous restriction Unidimensional distance-based clustering problems (the case m = 1) are `easy' in the sense that they are solved optimally by polynomial algorithms =-=[30]-=-. In this section we contrast how distance-based clustering changes as we progress to two dimensions. It is now appropriate to rewrite the problem in Equation (2). Consider $\mathrm{Euclid}^2(\vec{x}, X) = \sum_{\vec{x} \in X} d$... |

114 | Approximation schemes for euclidean k-medians and related problems
- Arora, Raghavan, et al.
- 1998
(Show Context)
Citation Context ...st the operations research community [24, 41] for the problem in Equation (2) when representatives are restricted to be data points (this is referred to as the facility location problem by theoreticians =-=[1]-=-). In clustering for spatial data, medians that are data points have been referred to as medoids [11, 12, 37, 26]. Nevertheless, in the case m = 1, the problems are typically solved by different variants... |

70 |
Mining Very Large Databases with Parallel Processing
- Freitas, Lavington
- 1998
(Show Context)
Citation Context ... clustering are also interesting, in particular for spatial data mining [12, 37, 48]. This perspective defines clustering as partitioning a heterogeneous data set into smaller more homogeneous groups =-=[2, 19, 40]-=-. Clustering typically uses a metric (or distance) to determine the dissimilarity between the items to be clustered. Here we consider the clustering problem in the context of spatial databases, those ... |

64 | Initialisation of Iterative Refinement Clustering Algorithms
- Fayyad, Reina, et al.
- 1998
(Show Context)
Citation Context ... be seen as a concept formation or class delineation problem. At least the fields of statistics [44, 46], machine intelligence [5, 15, 32] and more recently knowledge discovery and data mining (KDDM) =-=[12, 14, 37, 47]-=- have contributed with algorithms for many clustering approaches. Hierarchical bottom-up approaches form groups by composition or merging items that are close together [10, 29]. However, top-down parti... |

60 |
Heuristic Methods for Estimating the Generalized Vertex Median of a Weighted Graph
- Teitz, Bart
- 1968
(Show Context)
Citation Context ...s efficient than previous attempts for solving the unrestricted TWGD clustering problem. We illustrate this point now. We adapt well-studied local search hill-climbers known as interchange heuristics =-=[8, 24, 34, 43]-=- to TWGD restricted to CH-disjoint partitions. These heuristics are typically used for the p-medians problem (solving Equation (2) with the added restriction that the representative be data points) Co... |

53 |
On Grouping for Maximum Homogeneity
- Fisher
- 1958
(Show Context)
Citation Context ...example, when m = 1, the problem in Equation (2) is solvable in $O(n^2 p)$ time. The dynamic programming strategy dates back to 1958, when Fisher observed the so-called contiguous partition restrictions =-=[16]-=-. This simply states that, in the optimal solution, the groups do not overlap each other. We say that a partition $P = X_1 | \ldots | X_p$ is CH-disjoint if the convex hull $CV(X_i)$ of $X_i$ does not interse... |

43 |
On the Complexity of Clustering Problems
- Brucker
- 1978
(Show Context)
Citation Context ...t included in the sum are those where the items belong to different groups. Therefore, it is minimising coupling. However, the TWGD problem is NP-complete (in its graph-theoretic decision formulation =-=[4]-=- and in the Euclidean formulation [27]). Although most other distance-based clustering problems are also NP-complete [4, 22, 28], they are more commonly solved approximately in clustering applications... |

42 |
Finding tailored partitions
- Hershberger, Suri
- 1991
(Show Context)
Citation Context ...2, 27, 28, 30, 39]. Work concentrated on suitable approximation algorithms for them [8, 41]. Others concentrated in special cases where polynomial algorithms can be found (for example, the case p = 2 =-=[23]-=-). More recently, theoretical results have concentrated on polynomial approximation schemes for the representative-based clustering approaches [1, and references]. 3 TWGD restricted to CH-disjoint par... |

36 | Efficient and Effective Clustering Methods for Spatial Data - Ng, Han - 1994 |

34 |
from incomplete data via the EM algorithm, J.Royal Stat
- Dempster, Laird, et al.
- 1977
(Show Context)
Citation Context ...noise and outliers, as well as to the initial random clustering. 3. The method is statistically biased (this has favoured the emergence of other statistical methods such as `expectation maximization' =-=[7]-=-) and statistically inconsistent (this has favoured the emergence of Bayesian and Minimum Message Length methods[9]). However, these alternative methods require the user to define a probabilistic mode... |

31 |
Cluster analysis and mathematical programming
- Rao
- 1971
(Show Context)
Citation Context ... clustering are also interesting, in particular for spatial data mining [12, 37, 48]. This perspective defines clustering as partitioning a heterogeneous data set into smaller more homogeneous groups =-=[2, 19, 40]-=-. Clustering typically uses a metric (or distance) to determine the dissimilarity between the items to be clustered. Here we consider the clustering problem in the context of spatial databases, those ... |

25 |
An efficient Tabu search procedure for the p-median problem
- Rolland, Schilling, et al.
- 1996
(Show Context)
Citation Context ... centred at the p representatives to cover the n points in the data set. There is a p-median problem amongst the operations research community =-=[24, 41]-=- for the problem in Equation (2) when representatives are restricted to be data points (this is referred to as the facility location problem by theoreticians [1]). In clustering for spatial data, medians... |

23 |
Worst-case and probabilistic analysis of a geometric location problem
- Papadimitriou
- 1981
(Show Context)
Citation Context ...on drifted away from the TWGD problem as the NP-hardness results emerged for the graphical and geometric (Euclidean and even bidimensional, i.e. m = 2) versions of representative clustering problems =-=[4, 22, 27, 28, 30, 39]-=-. Work concentrated on suitable approximation algorithms for them [8, 41]. Others concentrated in special cases where polynomial algorithms can be found (for example, the case p = 2 [23]). More recent... |

20 |
Integer programming and the theory of grouping
- Vinod
- 1969
(Show Context)
Citation Context ...that describes clustering as an optimisation problem. For example, the criterion may consist of minimising the total dissimilarity in the groups. This criterion has received the name of the Grouping =-=[45]-=-, Total Within Group Distance [40], the Full-Exchange [42] and the Interactions [35]. Here we will refer to this criterion as the Total Within Group Distance, since this seems the most descriptive of t... |

17 |
Cluster discovery techniques for exploratory spatial data analysis
- Murray, Estivill-Castro
- 1998
(Show Context)
Citation Context ...y consist of minimising the total dissimilarity in the groups. This criterion has received the name of the Grouping [45], Total Within Group Distance [40], the Full-Exchange [42] and the Interactions =-=[35]-=-. Here we will refer to this criterion as the Total Within Group Distance, since this seems the most descriptive of the above. Definition 1.1 Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of n objects and l... |

17 | Future paths for integer programming and links to artificial intelligence - Glover - 1986 |

16 |
A more efficient heuristic for solving large p-median problems
- Densham, Rushton
- 1992
(Show Context)
Citation Context ...al and geometric (Euclidean and even bidimensional, i.e. m = 2) versions of representative clustering problems [4, 22, 27, 28, 30, 39]. Work concentrated on suitable approximation algorithms for them =-=[8, 41]-=-. Others concentrated in special cases where polynomial algorithms can be found (for example, the case p = 2 [23]). More recently, theoretical results have concentrated on polynomial approximation sch... |

14 |
1998]: ‘Point Estimation using the Kullback-Leibler Loss Function and
- Dowe, Baxter, et al.
(Show Context)
Citation Context ...voured the emergence of other statistical methods such as `expectation maximization' [7]) and statistically inconsistent (this has favoured the emergence of Bayesian and Minimum Message Length methods=-=[9]-=-). However, these alternative methods require the user to define a probabilistic model of the classes and their high sensitivity to the initial random solution has prompted researchers to incorporate ... |

13 |
Discovering Associations in Spatial Data - An Efficient Medoid Based Approach
- Estivill-Castro, Murray
- 1998

13 | Comments on “Parallel Algorithms for Hierarchical Clustering and Cluster Validity” - Murtagh - 1992 |

13 | Maintenance of configurations in the plane - Overmars, van Leeuwen - 1981 |

10 |
Robust clustering of large geo-referenced data sets
- Estivill-Castro, Houle
- 1999
(Show Context)
Citation Context ...s are restricted to be data points (this is referred to as the facility location problem by theoreticians [1]). In clustering for spatial data, medians that are data points have been referred to as medoids =-=[11, 12, 37, 26]-=-. Nevertheless, in the case m = 1, the problems are typically solved by different variants of dynamic programming in $O(n^2 p)$ time [6, 30, and references]. For example, when m = 1, the problem in Equa... |

10 |
Companion to Concrete Mathematics; Mathematical Techniques and Various Applications
- Melzak
- 1973
(Show Context)
Citation Context ...lem involves the direct Euclidean distance as in Minimise $\mathrm{Euclid}(P) = \sum_{i=1}^{n} w_i \, d_E(\vec{x}_i, \mathrm{rep}[\vec{x}_i, C])$ (4), then it receives the name of the Weber problem [17] (or the minimum distance problem =-=[31]-=-). This non-squared Euclid problem has no simple algebraic solution [17] and, even in the case p = 1, no algorithm can find the exact solution [31]. The difference between the Weber problem and Eucli... |

10 | Autoclass: A bayesian classification system - Cheeseman, Kelly, et al. - 1988 |

9 | Statistical analysis of finite mixtures distributions - Titterington, Smith, et al. - 1985 |

8 |
Applying simulated annealing to location-planning models
- Murray, Church
- 1996
(Show Context)
Citation Context ...s efficient than previous attempts for solving the unrestricted TWGD clustering problem. We illustrate this point now. We adapt well-studied local search hill-climbers known as interchange heuristics =-=[8, 24, 34, 43]-=- to TWGD restricted to CH-disjoint partitions. These heuristics are typically used for the p-medians problem (solving Equation (2) with the added restriction that the representative be data points) Co... |

6 | Hybrid genetic algorithm for solving the p-median problem - Estivill-Castro, Torres-Velázquez - 1999 |

6 |
Analysis and computational schemes for p-median heuristics, Environment and Planning A
- Horn
- 1996
(Show Context)
Citation Context ... centred at the p representatives to cover the n points in the data set. There is a p-median problem amongst the operations research community =-=[24, 41]-=- for the problem in Equation (2) when representatives are restricted to be data points (this is referred to as the facility location problem by theoreticians [1]). In clustering for spatial data, medians... |

5 |
A comparison of optimal classification strategies for choroplethic displays of spatially aggregated data
- Cromley
- 1996
(Show Context)
Citation Context ...solutions to the TWGD when data items are referenced in one and two dimensions (the case m = 2). This has applications in GIS. The unidimensional case is used for the construction of choroplethic maps =-=[6]-=- while the bidimensional case has applications in analysis of the spread in zone patterns [35]. The rest of the paper is organised as follows. Section 2 presents terminology from several communities r... |

5 | Birch: an efficient data clustering method for very large databases - Zhang, Ramakrishnan, et al. - 1996 |

4 |
Comments on "Parallel algorithms for hierarchical clustering and cluster validity"
- Murtagh
- 1992
(Show Context)
Citation Context ...tions,st, $m, p \ll n$ resulting in $O(n)$ time. Moreover, KMeans is very easy to implement. This contrasts favourably with hierarchical clustering algorithms whose computational complexity is in $O(n^2)$ =-=[36]-=-, or with the recent $O(n \log n)$ algorithms [29]. This paper will present an approach to efficiently find approximate solutions to the TWGD when data items are referenced in one and two dimensions (th... |

4 | Automatische Klassifikation. Vandenhoek - Bock - 1974 |

4 | Initialization of iterative refinement clustering algorithms - Fayyad, Reina, et al. |