The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries (1997)

by Norio Katayama, et al.
Results 1 - 10 of 228

Content-based image retrieval at the end of the early years

by Arnold W. M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, Ramesh Jain - IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000
Cited by 873 (16 self)
The paper presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.

A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces

by Roger Weber, Hans-J. Schek, Stephen Blott, 1998
Cited by 413 (12 self)
For similarity search in high-dimensional vector spaces (or "HDVSs"), researchers have proposed a number of new methods (or adaptations of existing methods) based, in the main, on data-space partitioning. However, the performance of these methods generally degrades as dimensionality increases. Although this phenomenon, known as the "dimensional curse", is well known, little or no quantitative analysis of the phenomenon is available. In this paper, we provide a detailed analysis of partitioning and clustering techniques for similarity search in HDVSs. We show formally that these methods exhibit linear complexity at high dimensionality, and that existing methods are outperformed on average by a simple sequential scan if the number of dimensions exceeds around 10. Consequently, we come up with an alternative organization based on approximations to make the unavoidable sequential scan as fast as possible. We describe a simple vector approximation scheme, called VA-file, and report on an ...
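
The idea of scanning compact approximations before touching full vectors can be sketched as follows. This is a simplified filter-and-refine illustration, not the VA-file algorithm as published: the function names, the 2-bits-per-dimension default, and the assumption that coordinates lie in [0, 1] are all choices made for this example.

```python
import math

def quantize(vec, bits=2):
    """Map each coordinate in [0, 1] to a cell index using `bits` bits per dimension."""
    cells = 1 << bits
    return tuple(min(int(v * cells), cells - 1) for v in vec)

def lower_bound(query, approx, bits=2):
    """Smallest possible Euclidean distance from query to any vector in the cell."""
    cells = 1 << bits
    total = 0.0
    for q, c in zip(query, approx):
        lo, hi = c / cells, (c + 1) / cells
        gap = max(lo - q, 0.0, q - hi)  # zero when q falls inside the cell's range
        total += gap * gap
    return math.sqrt(total)

def nearest(query, vectors, bits=2):
    """Visit candidates by increasing lower bound; compute exact distances only when needed."""
    approximations = [quantize(v, bits) for v in vectors]  # assumed precomputed in practice
    bounds = [lower_bound(query, a, bits) for a in approximations]
    order = sorted(range(len(vectors)), key=lambda i: bounds[i])
    best_dist, best_idx, exact_checks = math.inf, None, 0
    for i in order:
        if bounds[i] >= best_dist:
            break  # every remaining candidate has an even larger lower bound
        exact_checks += 1
        d = math.dist(query, vectors[i])
        if d < best_dist:
            best_dist, best_idx = d, i
    return best_idx, best_dist, exact_checks

vectors = [[0.1, 0.1], [0.9, 0.9], [0.5, 0.4]]
idx, dist, checks = nearest([0.45, 0.45], vectors)
print(idx, round(dist, 4), checks)
```

In this toy run only one exact distance is computed, because the lower bounds of the other two cells already exceed the best distance found.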

Similarity search in high dimensions via hashing

by Aristides Gionis, Piotr Indyk, Rajeev Motwani, 1999
Cited by 275 (11 self)
The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points

Distance Browsing in Spatial Databases

by Gísli R. Hjaltason, Hanan Samet, 1999
Cited by 240 (17 self)
Two different techniques of browsing through a collection of spatial objects stored in an R-tree spatial data structure on the basis of their distances from an arbitrary spatial query object are compared. The conventional approach is one that makes use of a k-nearest neighbor algorithm where k is known prior to the invocation of the algorithm. Thus if m > k neighbors are needed, the k-nearest neighbor algorithm needs to be reinvoked for m neighbors, thereby possibly performing some redundant computations. The second approach is incremental in the sense that having obtained the k nearest neighbors, the (k+1)st neighbor can be obtained without having to calculate the k+1 nearest neighbors from scratch. The incremental approach finds use when processing complex queries where one of the conditions involves spatial proximity (e.g., the nearest city to Chicago with population greater than a million), in which case a query engine can make use of a pipelined strategy. A general incremental nearest neighbor algorithm is presented that is applicable to a large class of hierarchical spatial data structures. This algorithm is adapted to the R-tree and its performance is compared to an existing k-nearest neighbor algorithm for R-trees [45]. Experiments show that the incremental nearest neighbor algorithm significantly outperforms the k-nearest neighbor algorithm for distance browsing queries in a spatial database that uses the R-tree as a spatial index. Moreover, the incremental nearest neighbor algorithm also usually outperforms the k-nearest neighbor algorithm when applied to the k-nearest neighbor problem for the R-tree, although the improvement is not nearly as large as for distance browsing queries. In fact, we prove informally that, at any step in its execution, the incremental...
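
One common way to realize incremental retrieval over a hierarchy is a best-first traversal driven by a priority queue, sketched below. The `Node` class is a hypothetical stand-in for an R-tree node (it assumes every node contains at least one point), and nothing here is taken from the paper's actual implementation.

```python
import heapq
import itertools
import math

class Node:
    """Hypothetical tree node holding child nodes and/or points, plus a bounding box."""
    def __init__(self, children=None, points=None):
        self.children = children or []
        self.points = points or []
        pts = list(self._all_points())
        dims = range(len(pts[0]))
        self.lo = [min(p[d] for p in pts) for d in dims]
        self.hi = [max(p[d] for p in pts) for d in dims]

    def _all_points(self):
        yield from self.points
        for child in self.children:
            yield from child._all_points()

def min_dist(query, node):
    """Smallest possible distance from query to anything inside the node's bounding box."""
    return math.sqrt(sum(max(lo - q, 0.0, q - hi) ** 2
                         for q, lo, hi in zip(query, node.lo, node.hi)))

def incremental_nn(root, query):
    """Lazily yield (distance, point) pairs in order of increasing distance."""
    counter = itertools.count()  # tie-breaker so the heap never compares Node objects
    heap = [(min_dist(query, root), next(counter), root)]
    while heap:
        dist, _, item = heapq.heappop(heap)
        if isinstance(item, Node):
            # Expand the node: queue its children and points with their own keys.
            for child in item.children:
                heapq.heappush(heap, (min_dist(query, child), next(counter), child))
            for p in item.points:
                heapq.heappush(heap, (math.dist(query, p), next(counter), p))
        else:
            yield dist, item  # nothing left in the queue can be closer

leaf_a = Node(points=[(0.0, 0.0), (1.0, 1.0)])
leaf_b = Node(points=[(5.0, 5.0), (6.0, 5.0)])
root = Node(children=[leaf_a, leaf_b])
stream = incremental_nn(root, (0.9, 0.9))
print(next(stream))  # nearest neighbor
print(next(stream))  # next neighbor, obtained without restarting the search
```

Because the generator keeps its queue between calls, asking for one more neighbor resumes the traversal instead of re-running a k-nearest-neighbor query with a larger k.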

When Is "Nearest Neighbor" Meaningful?

by Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft - In Int. Conf. on Database Theory, 1999
Cited by 222 (1 self)
We explore the effect of dimensionality on the "nearest neighbor" problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. To provide a practical perspective, we present empirical results on both real and synthetic data sets that demonstrate that this effect can occur for as few as 10-15 dimensions. These results should not be interpreted to mean that high-dimensional indexing is never meaningful; we illustrate this point by identifying some high-dimensional workloads for which this effect does not occur. However, our results do emphasize that the methodology used almost universally in the database literature to evaluate high-dimensional indexing techniques is flawed, and should be modified. In particular, most such techniques proposed in the literature are not evaluated versus simple...
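
The effect described above can be observed with a small experiment. The sketch below assumes i.i.d. uniform data and a single random query, which is only one narrow case of the conditions the abstract refers to; the function name and parameters are illustrative.

```python
import math
import random

def relative_contrast(dim, n_points=500, seed=0):
    """Return (d_max - d_min) / d_min for one random query against uniform random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min

# The contrast between farthest and nearest point shrinks as dimensionality grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

A value near zero means the nearest and farthest points are almost equally far from the query, which is the situation in which a nearest-neighbor answer carries little information.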

Approximation Algorithms for Projective Clustering

by Pankaj K. Agarwal, Cecilia M. Procopiuc - Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, 2000
Cited by 196 (14 self)
We consider the following two instances of the projective clustering problem: Given a set $S$ of $n$ points in $\mathbb{R}^d$ and an integer $k > 0$, cover $S$ by $k$ hyper-strips (resp. hyper-cylinders) so that the maximum width of a hyper-strip (resp. the maximum diameter of a hyper-cylinder) is minimized. Let $w$ be the smallest value so that $S$ can be covered by $k$ hyper-strips (resp. hyper-cylinders), each of width (resp. diameter) at most $w$. In the plane, the two problems are equivalent. It is NP-hard to compute $k$ planar strips of width even at most $Cw$, for any constant $C > 0$ [50]. This paper contains four main results related to projective clustering: (i) For $d = 2$, we present a randomized algorithm that computes $O(k \log k)$ strips of width at most $6w$ that cover $S$. Its expected running time is $O(nk^2 \log^4 n)$ if $k^2 \log k \le n$; it also works for larger values of $k$, but then the expected running time is $O(n^{2/3} k^{8/3} \log^4 n)$. We also propose another algorithm that computes a c...

MindReader: Querying databases through multiple examples

by Yoshiharu Ishikawa, Ravishankar Subramanya, and Christos Faloutsos - In Proc. of the 24th VLDB Conference, 1998
Cited by 159 (1 self)
Users often cannot easily express their queries. For example, in a multimedia/image-by-content setting, the user might want photographs with sunsets; in current systems, like QBIC, the user has to give a sample query and specify the relative importance of color, shape and texture. Even worse, the user might want correlations between attributes; for example, in a traditional medical record database, a medical researcher might want to find "mildly overweight patients", where the implied query would be "weight/height ≈ 4 lb/inch". Our goal is to provide a user-friendly but theoretically solid method to handle such queries. We allow the user to give several examples and, optionally, their 'goodness' scores, and we propose a novel method to "guess" which attributes are important, which correlations are important, and with what weight. Our contributions are twofold: (a) we formalize the problem as a minimization problem and show how to solve for the optimal solution, completely av...

On the Surprising Behavior of Distance Metrics in High Dimensional Space

by Charu C. Aggarwal, Alexander Hinneburg, Daniel A. Keim - Lecture Notes in Computer Science, 2001
Cited by 107 (2 self)
In recent years, the effect of the curse of high dimensionality has been studied in great detail on several problems such as clustering, nearest neighbor search, and indexing. In high dimensional space the data becomes sparse, and traditional indexing and algorithmic techniques fail from an efficiency and/or effectiveness perspective. Recent research results show that in high dimensional space, the concept of proximity, distance or nearest neighbor may not even be qualitatively meaningful. In this paper, we view the dimensionality curse from the point of view of the distance metrics which are used to measure the similarity between objects. We specifically examine the behavior of the commonly used L_k norm and show that the problem of meaningfulness in high dimensionality is sensitive to the value of k. For example, this means that the Manhattan distance metric (L_1 norm) is consistently preferable to the Euclidean distance metric (L_2 norm) for high dimensional data mining applications. Using the intuition derived from our analysis, we introduce and examine a natural extension of the L_k norm to fractional distance metrics. We show that the fractional distance metric provides more meaningful results both from the theoretical and empirical perspective. The results show that fractional distance metrics can significantly improve the effectiveness of standard clustering algorithms such as the k-means algorithm.
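
The family of distances discussed above can be written as one function parameterized by k. This is a minimal sketch with an illustrative function name; for 0 < k < 1 the result is the fractional variant, which no longer satisfies the triangle inequality and so is not a metric in the strict sense.

```python
def minkowski(x, y, k):
    """L_k-style distance between two equal-length vectors; k may be fractional."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)

a, b = [0.0, 0.0, 0.0], [1.0, 2.0, 2.0]
for k in (0.5, 1, 2):
    # k = 1 is the Manhattan distance, k = 2 the Euclidean distance.
    print(k, minkowski(a, b, k))
```

Swapping such a function into a nearest-neighbor or clustering routine is one way to experiment with the sensitivity to k that the abstract reports.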

Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces

by Kaushik Chakrabarti, Sharad Mehrotra, 2000
Cited by 97 (1 self)
Many emerging application domains require database systems to support efficient access over highly multidimensional datasets. The current state-of-the-art technique for indexing high dimensional data is to first reduce the dimensionality of the data using Principal Component Analysis and then index the reduced dimensionality space using a multidimensional index structure. The above technique, referred to as global dimensionality reduction (GDR), works well when the data set is globally correlated, i.e., most of the variation in the data can be captured by a few dimensions. In practice, datasets are often not globally correlated. In such cases, reducing the data dimensionality using GDR causes significant loss of distance information, resulting in a large number of false positives and hence a high query cost. Even when a global correlation does not exist, there may exist subsets of data that are locally correlated. In this paper, we propose a technique called Local Dime...

The Hybrid Tree: An Index Structure for High Dimensional Feature Spaces

by Kaushik Chakrabarti, Sharad Mehrotra - In Proceedings of ICDE’99, 1999
Cited by 93 (11 self)
Feature based similarity search is emerging as an important search paradigm in database systems. The technique used is to map the data items as points into a high dimensional feature space which is indexed using a multidimensional data structure. Similarity search then corresponds to a range search over the data structure. Although several data structures have been proposed for feature indexing, none of them is known to scale beyond 10-15 dimensional spaces. This paper introduces the hybrid tree, a multidimensional data structure for indexing high dimensional feature spaces. Unlike other multidimensional data structures, the hybrid tree cannot be classified as either a pure data partitioning (DP) index structure (e.g., R-tree, SS-tree, SR-tree) or a pure space partitioning (SP) one (e.g., KDB-tree, hB-tree); rather, it “combines” positive aspects of the two types of index structures into a single data structure to achieve search performance more scalable to high dimensionalities than either of the above techniques (hence the name “hybrid”). Furthermore, unlike many data structures (e.g., distance based index structures like SS-tree, SR-tree), the hybrid tree can support queries based on arbitrary distance functions. Our experiments on “real” high dimensional large size feature databases demonstrate that the hybrid tree scales well to high dimensionality and large database sizes. It significantly outperforms both purely DP-based and SP-based index mechanisms as well as linear scan at all dimensionalities for large sized databases.