## Parallel Algorithms for High-dimensional Proximity Joins

### BibTeX

@MISC{_parallelalgorithms,

author = {},

title = {Parallel Algorithms for High-dimensional Proximity Joins},

year = {}

}

### Abstract

We consider the problem of parallelizing high-dimensional proximity joins. We present a parallel multidimensional join algorithm based on the epsilon-kdB tree and compare it with the more common approach of space partitioning. An evaluation of the algorithms on an IBM SP2 shared-nothing multiprocessor is presented using both synthetic and real-life datasets. We also examine the effectiveness of the algorithms in the context of a specific data-mining problem, that of finding similar time-series. The empirical results show that our algorithm exhibits good performance and scalability, as well as an ability to handle data skew.
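For readers unfamiliar with the term, a proximity join reports every pair of points that lie within a user-chosen distance of each other. The following is a minimal brute-force sketch of that definition only, not the epsilon-kdB algorithm or the space-partitioning approach evaluated in the paper. The function name `proximity_self_join`, the choice of Euclidean distance, and the sample points are assumptions made for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def proximity_self_join(points, eps):
    """Return index pairs (i, j), i < j, whose points lie within eps of each other.

    Hypothetical helper: compares all pairs, so it costs O(n^2) distance
    computations and serves only to illustrate what the join computes.
    """
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if dist(p, q) <= eps
    ]


if __name__ == "__main__":
    # Assumed sample data: three 3-dimensional points, two of them close together.
    sample = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (1.0, 1.0, 1.0)]
    print(proximity_self_join(sample, eps=0.1))
```

The quadratic cost of this all-pairs comparison is the kind of overhead that index-based and partitioning-based join algorithms aim to avoid.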

### Citations

433 | Fast subsequence matching in time-series databases
- Faloutsos, Ranganathan, et al.
- 1994
Citation Context ...hafer Rakesh Agrawal IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120 The work presented in this paper was motivated by the particular data-mining problem of finding similar time-series [1][7]. In [2], an algorithm was proposed that first finds all similar "atomic" subsequences, and then stitches together the atomic subsequence matches to obtain larger similar subsequences. A sliding window of...

335 | Efficient processing of spatial joins using R-trees
- Brinkhoff, Kriegel, et al.
- 1993
Citation Context ...osen such that two buckets will fit entirely in memory. 2.2 Index Based Considerable recent work in multidimensional joins has focused on using indices to aid the join. This includes R-trees as used in [4], [3] and [2], PMR quadtrees in [9], and seeded trees in [13]. Whatever the index used, they follow the same schema whereby two sets of multidimensional objects are joined by doing a synchronized dept...

301 | MPI2: A message-passing interface standard
- Forum
- 1998
Citation Context ...ary, but not x3. 4 Performance Evaluation We have implemented both the parallel ε-kdB and space-partitioning proximity join algorithms on an IBM SP2 [10] using the MPI-standard communication primitives [8]. The use of MPI allows our implementation to be portable to other shared-nothing parallel architectures, including workstation clusters. Experiments were conducted on a 16-node IBM SP2 Model 302. Eac...

238 | The gamma database machine project
- Dewitt, Ghandeharizadeh, et al.
- 1990
Citation Context ...each of N processors has private memory and disks. The processors are connected by a communication network and can communicate only by passing messages. Examples of such parallel machines include GAMMA [5] and IBM's SP2 [10]. We assume that the data to be joined is distributed equally over the local disks of the multiprocessor. 3.1 Previous Work Virtually all of the existing work on parallelizing multid...

204 | Fast similarity search in the presence of noise, scaling, and translation
- Agrawal, Lin, et al.
- 1995
Citation Context ...kesh Agrawal IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120 The work presented in this paper was motivated by the particular data-mining problem of finding similar time-series [1][7]. In [2], an algorithm was proposed that first finds all similar "atomic" subsequences, and then stitches together the atomic subsequence matches to obtain larger similar subsequences. A sliding window of size w ...

194 | Linear clustering of objects with multiple attributes
- Jagadish
- 1990
Citation Context ...ve"; assigned cell-numbers are called "Z values". Other space-filling curves include the Gray code [6] and the Hilbert curve [15]. Of these three, the Hilbert curve has been shown to cluster space better [11]. A shortcoming of space-filling curves is that some proximity information is always lost, so nearby objects may have very different Z values. This complicates the join algorithm. This approach works best...

173 | Partition based spatial-merge join
- Patel, DeWitt
- 1996
Citation Context ...ding all pairs of w-dimensional points that lie within ε-distance of each other, where ε is a user-specified parameter. While parallel algorithms for performing joins on spatial data already exist (e.g. [18], [3], [9]), they have mainly concentrated on joining map data where spaces are typically limited to only two or three dimensions. Furthermore, these algorithms have been designed primarily to perform...

148 | A Class of Data Structures for Associative Searching
- Orenstein, Merrett
Citation Context ... overlaps, a <cell-number, object-pointer> pair is created. Standard relational indices and techniques for computing joins can now be used on the tuples' one-dimensional cell values. This approach in [17] uses a space-filling curve known as the "Z curve"; assigned cell-numbers are called "Z values". Other space-filling curves include the Gray code [6] and the Hilbert curve [15]. Of these three, the Hilber...

146 | Analysis of the clustering properties of the Hilbert space-filling curve
- Moon, Jagadish, et al.
Citation Context ...ll values. This approach in [17] uses a space-filling curve known as the "Z curve"; assigned cell-numbers are called "Z values". Other space-filling curves include the Gray code [6] and the Hilbert curve [15]. Of these three, the Hilbert curve has been shown to cluster space better [11]. A shortcoming of space-filling curves is that some proximity information is always lost, so nearby objects may have very d...

97 | Spatial hashjoins
- Lo, Ravishankar
- 1996
Citation Context ...ith itself. Non-self joins are handled by using the same partitioning scheme on each dataset and then joining corresponding data buckets. This approach falls within the general framework presented in [14]. In that framework, an algorithm defines bucket extents to hold data objects and an assignment function that maps data objects to buckets. Bucket extents may or may not be immutable and the assignment ...

62 | Size separation spatial join
- Koudas, Sevcik
- 1997
Citation Context ...ely to retain the advantage since partitioning of space in that algorithm is dynamic and automatic. Recently, another serial spatial-join algorithm (the Size Separation Spatial Join) was presented in [12]. It is a space-partitioning algorithm but differs in that it uses multiple levels of partitioning with increasing degrees of granularity. The algorithm appears to perform well on two-dimensional point...

33 | High-dimensional Similarity Joins
- Shim, Srikant, et al.
- 1997
Citation Context ...ithm exhibits good performance and scalability, as well as an ability to handle data skew. 1 Introduction Many emerging applications require efficient processing of proximity joins on high-dimensional points [20]. Typical queries in these applications include: Find all pairs of similar images (often as a prelude to clustering the images). Retrieve music scores similar to a target music score. Discover all sto...

31 | Multiattribute hashing using Gray codes
- Faloutsos
- 1985
Citation Context ...uples' one-dimensional cell values. This approach in [17] uses a space-filling curve known as the "Z curve"; assigned cell-numbers are called "Z values". Other space-filling curves include the Gray code [6] and the Hilbert curve [15]. Of these three, the Hilbert curve has been shown to cluster space better [11]. A shortcoming of space-filling curves is that some proximity information is always lost, so nea...

24 | Generating seeded trees from data sets
- Lo, Ravishankar
- 1995
Citation Context ...the join phase, an algorithm identifies pairs of buckets to be joined (termed join-bucket pairs) and effects each join in turn. An example of this framework was presented in [14] where bootstrap seeding [13] and sampling were used to obtain the initial bucket extents. Another space partitioning algorithm (PBSM) was recently presented in [18] and to some degree, it also fits within the above framework. To a...

23 | The grid file: an adaptable, symmetric multikey file structure
- Nievergelt, Hinterberger, et al.
- 1984
Citation Context ... not well-suited to our motivating application of similar time-series since the data to be joined is typically generated "on-the-fly". Other drawbacks include skew-handling capabilities. In the Grid File [16], skewed data can cause rapid growth in the size of the directory structures. For other indices such as the R tree, skew-handling typically requires maintaining height-balanced trees so that range que...

19 | Performance of Data-Parallel Spatial Operations
- Hoel, Samet
- 1994
Citation Context ...irs of w-dimensional points that lie within ε-distance of each other, where ε is a user-specified parameter. While parallel algorithms for performing joins on spatial data already exist (e.g. [18], [3], [9]), they have mainly concentrated on joining map data where spaces are typically limited to only two or three dimensions. Furthermore, these algorithms have been designed primarily to perform intersect...

17 | Parallel algorithms for high-dimensional similarity joins for data mining applications
- Shafer, Agrawal
- 1997
Citation Context ...nts use an ε value of 0.1 unless otherwise noted. Further experimental results studying the performance characteristics of the parallel ε-kdB algorithm can be found in an expanded version of this paper ([19]). 4.1 Algorithm Comparison In this section, we compare the performance of the parallel ε-kdB algorithm with that of space-partitioning. Due to the ε-kdB tree's ability to dynamically adjust to data ske...

7 | Efficient similarity search in sequence databases
- Agrawal, Faloutsos, et al.
- 1993
Citation Context .... Shafer Rakesh Agrawal IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120 The work presented in this paper was motivated by the particular data-mining problem of finding similar time-series [1][7]. In [2], an algorithm was proposed that first finds all similar "atomic" subsequences, and then stitches together the atomic subsequence matches to obtain larger similar subsequences. A sliding window...

3 | Parallel processing of spatial joins using R-trees
- Brinkhoff, Kriegel, et al.
Citation Context ...ll pairs of w-dimensional points that lie within ε-distance of each other, where ε is a user-specified parameter. While parallel algorithms for performing joins on spatial data already exist (e.g. [18], [3], [9]), they have mainly concentrated on joining map data where spaces are typically limited to only two or three dimensions. Furthermore, these algorithms have been designed primarily to perform inte...