## Tree-Based Partitioning Querying: A Methodology for Computing Medoids in Large Spatial Datasets

Venue: VLDB J

Citations: 9 (4 self)

### BibTeX

@ARTICLE{Mouratidis_tree-basedpartitioning,
    author = {Kyriakos Mouratidis and Dimitris Papadias and Spiros Papadimitriou},
    title = {Tree-Based Partitioning Querying: A Methodology for Computing Medoids in Large Spatial Datasets},
    journal = {VLDB J},
    year = {}
}

### Abstract

Besides traditional domains (e.g., resource allocation, data mining applications), algorithms for medoid computation and related problems will play an important role in numerous emerging fields, such as location-based services and sensor networks. Since the k-medoid problem is NP-hard, all existing work deals with approximate solutions on relatively small datasets. This paper aims at efficient methods for very large spatial databases, motivated by: (i) the high and ever-increasing availability of spatial data, and (ii) the need for novel query types and improved services. The proposed solutions exploit the intrinsic grouping properties of a data partition index in order to read only a small part of the dataset. Compared to

### Citations

10995 | Computers and Intractability: A Guide to the Theory of NP-Completeness - Garey, Johnson - 1979 |
Citation Context ...ocations correspond to the medoids. Efficient solutions to medoid queries are essential in several applications related to resource allocation and spatial decision making. Since the problem is NP-hard [GJ79], research has focused on approximate algorithms. Despite a bulk of methods for small and moderate size datasets, currently there exists no technique applicable to very large databases. More formally,...

2074 | The Elements of Statistical Learning - Hastie, Tibshirani, et al. - 2001 |

1517 | Clustering Algorithms - Hartigan - 1975 |

984 | The R*-tree: An efficient and robust access method for points and rectangles - Beckmann, Kriegel, et al. - 1990 |
Citation Context ...ental results using both real and synthetic datasets. Finally, Section 9 concludes the paper. 2. Background Although our techniques can be used with any data partition method, here we assume R*-trees [BKSS90] due to their popularity. Section 2.1 overviews R*-trees and their application to nearest neighbor queries. Section 2.2 presents existing algorithms for k-medoids and related problems. 2.1. R-trees and...

595 | Efficient and Effective Clustering Methods for Spatial Data Mining - Ng, Han - 1994 |
Citation Context ...culations, PAM is prohibitively expensive for large |P|. Clustering large applications (CLARA) [KR90] alleviates the problem by generating random samples from P and executing PAM on those. Ng and Han [NH94] propose clustering large applications based on randomized search (CLARANS) as an extension to PAM. CLARANS draws a random sample of size maxneighbors from all the k·(|P|-k) possible neighbor sets Ri'...

563 | CURE: An Efficient Clustering Algorithm for Large Databases - Guha, Rastogi, et al. - 1998 |
Citation Context ...anchise chain decides to add a new branch in a given area). The k-medoid problem is related to clustering. Clustering methods designed for large databases include DBSCAN [EKSX96], BIRCH [ZRL96], CURE [GRS98] and OPTICS [ABKS99]. However, the objective of clustering is to partition data objects in groups (clusters) such that objects within the same group are more similar to each other than to points in ...

511 | The X-tree: an index structure for high-dimensional data - Berchtold, Keim, et al. - 1996 |

320 | Polynomial time approximation schemes for Euclidean TSP and other geometric problems - Arora - 1996 |
Citation Context ...d incremental (the number of NNs does not need to be known in advance). 2.2. k-Medoids and Related Problems A number of approximation schemes for k-medoids and related problems appear in the literature [ARR98]. Most of this work, however, is largely theoretical in nature. Kaufmann and Rousseeuw [KR90] propose partitioning around medoids (PAM), a practical algorithm based on the hill climbing paradigm. In p...

293 | Distance browsing in spatial databases - Hjaltason, Samet - 1999 |
Citation Context ...||e-q||. Similarly, when backtracking to the upper level, node N2 is also excluded and the process terminates with e as the result. The extension to k (>1) NNs is straightforward. Hjaltason and Samet [HS99] propose a best-first variation which is I/O optimal (i.e., it only visits nodes that may contain NNs) and incremental (the number of NNs does not need to be known in advance). 2.2. k-Medoids and Related ...

285 | Clustering to minimize the maximum intercluster distance - Gonzalez - 1985 |
Citation Context ...ees with only 50,559 points and 1,027 leaf nodes). Regarding the max case, to the best of our knowledge, there does not exist any method for disk-resident data. For in-memory processing, the method of [G85] answers max k-medoid queries in O(k·|P|) time with an approximation factor of 2. In other words, the returned medoid set is guaranteed to achieve a maximum distance C(R) that is no more than two time...
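
The O(k·|P|) bound and the factor-2 guarantee quoted in the context above match the greedy farthest-first traversal commonly associated with [G85]. The sketch below is a minimal in-memory illustration in Python, not code from the paper; the function name `farthest_first`, the choice of the first point as the seed, and the use of Euclidean distance are assumptions made for the example.

```python
import math

def farthest_first(points, k):
    """Greedy farthest-first traversal (illustrative sketch).

    Assumes 1 <= k <= len(points) and that points are coordinate tuples.
    Returns the chosen centers and the largest point-to-nearest-center distance.
    """
    centers = [points[0]]  # assumption: seed with the first point
    # dist[i] holds the distance from points[i] to its nearest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # pick the point that is currently farthest from every chosen center
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        # one linear pass per new center gives O(k * |P|) distance computations
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return centers, max(dist)

centers, radius = farthest_first([(0, 0), (1, 0), (10, 0), (11, 0)], 2)
```

Each round touches every point once, which is where the linear-per-medoid cost comes from; the returned `radius` is the maximum distance that the factor-2 guarantee refers to.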

275 | X-means: Extending k-means with efficient estimation of the number of clusters - Pelleg, Moore - 2000 |

219 | On Packing R-Trees - Kamel, Faloutsos - 1993 |

217 | BIRCH: an efficient data clustering method for very large databases - Livny - 1996 |
Citation Context ...ng (e.g., a franchise chain decides to add a new branch in a given area). The k-medoid problem is related to clustering. Clustering methods designed for large databases include DBSCAN [EKSX96], BIRCH [ZRL96], CURE [GRS98] and OPTICS [ABKS99]. However, the objective of clustering is to partition data objects in groups (clusters) such that objects within the same group are more similar to each other than...

173 | Smallest enclosing disks (balls and ellipsoids), in: New Results and New Trends - Welzl - 1991 |
Citation Context ...onnecting s.c and e.c. In the max case, the new slot center is computed as the center of the minimum circle enclosing e.c and all the entry centers currently in s. We use the incremental algorithm of [W91] that finds the new slot center in expected constant time. Function InsertEntry (extended entry e, slot s): 1. s.c = (e.w·e.c + s.w·s.c) / (e.w + s.w); 2. s.w = e.w + s.w; 3. s.E = s.E ∪ {e}. Figure 3.1:...
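
The InsertEntry steps quoted above for the avg case amount to a weighted-centroid update. A minimal Python sketch, assuming 2D centers, positive weights, and hypothetical `Entry` and `Slot` containers whose fields mirror the names e.c, e.w, s.c, s.w and s.E in the pseudocode (the max case, which relies on the enclosing-circle algorithm of [W91], is not covered):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Entry:
    # hypothetical container for an extended entry e
    c: Tuple[float, float]  # entry center (e.c)
    w: float                # entry weight (e.w), assumed positive

@dataclass
class Slot:
    # hypothetical container for a slot s; an empty slot has zero weight
    c: Tuple[float, float] = (0.0, 0.0)
    w: float = 0.0
    E: List[Entry] = field(default_factory=list)

def insert_entry(e: Entry, s: Slot) -> None:
    """Mirror the three quoted steps of InsertEntry for the avg case."""
    total = e.w + s.w
    # step 1: the new slot center is the weighted mean of the two centers
    s.c = tuple((e.w * ec + s.w * sc) / total for ec, sc in zip(e.c, s.c))
    # step 2: accumulate the weight
    s.w = total
    # step 3: record the entry in the slot
    s.E.append(e)

slot = Slot()
insert_entry(Entry((0.0, 0.0), 1.0), slot)
insert_entry(Entry((4.0, 0.0), 3.0), slot)
```

After the two insertions the slot center sits three quarters of the way toward the heavier entry, since the update only needs the running center and running weight rather than the full entry list.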

106 | Accelerating exact k-means algorithms with geometric reasoning - Pelleg, Moore - 1999 |

92 | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification - Ester, Kriegel, et al. - 1995 |

89 | Monitoring k-nearest neighbor queries over moving objects - Yu, Pu, et al. - 2005 |

83 | Integrated coverage and connectivity configuration for energy conservation in sensor networks - Xing, Wang, et al. - 2005 |

77 | Conceptual partitioning: an efficient method for continuous nearest neighbor monitoring - Mouratidis, Papadias, et al. - 2005 |

47 | Efficient cost models for spatial queries using R-trees - Theodoridis, Stefanakis, et al. - 2000 |
Citation Context ...minimum bounding rectangle (MBR) enclosing all the points in its sub-tree. The nodes of an R*-tree are meant to be compact, have small margin and achieve minimal overlap among nodes of the same level [TSS00]. Additionally, in practice, nodes at the same level contain a similar number of data points, due to a minimum utilization constraint (typically, 40%). These properties imply that the R*-tree (or any ...

46 | A density-based algorithm for discovering clusters in large spatial databases with noise - Ester, Kriegel, et al. - 1996 |
Citation Context ...emental processing (e.g., a franchise chain decides to add a new branch in a given area). The k-medoid problem is related to clustering. Clustering methods designed for large databases include DBSCAN [EKSX96], BIRCH [ZRL96], CURE [GRS98] and OPTICS [ABKS99]. However, the objective of clustering is to partition data objects in groups (clusters) such that objects within the same group are more similar to ...

28 | Range aggregate processing in spatial databases - Tao, D |
Citation Context ...k could be the number of its residents). Processing in the max case is identical to its un-weighted counterpart. However, the application of TPAQ to avg weighted queries requires an aggregate R*-tree [TP04], or any other aggregate data partition method. The aggregate R*-tree has the same structure and update algorithms as the regular R*-tree, except that each entry also stores the sum of weights of the d...

24 | Generating Seeded Trees From Data Sets - Lo, Ravishankar - 1995 |

18 | A Database Interface for Clustering - Ester, Kriegel, et al. - 1995 |
Citation Context ...ire dataset in order to extract the representatives. Furthermore, in very large databases, the leaf level population may still be too high for the efficient application of CLARANS (the experiments of [EKX95a] use R-trees with only 50,559 points and 1,027 leaf nodes). Regarding the max case, to the best of our knowledge, there does not exist any method for disk-resident data. For in-memory processing, the m...

12 | The design and implementation of seeded trees: an efficient method for spatial joins - Lo, Ravishankar - 1998 |

7 | Minkowski-type theorems and least square partitioning - Aurenhammer, Hoffmann, et al. - 1992 |
Citation Context ...ids in the way described in the previous sections (depending on the problem type; i.e., k-medoid, MA or MO). In a second step, TPAQ computes the assignment of the data points similar to the method of [AHA92]. The algorithm of [AHA92] computes a weight $a_i$ for each medoid so that if each point p is assigned according to distance function $\mathrm{pow}(p, r(p)) = \|p - r(p)\|^2 - a_i$ (where $a_i$ is the weight of r(p)), t...

1 | Learning the k in k-means. NIPS - Hamerly, Elkan - 2003 |