Results 1 - 10 of 48
iDistance: An Adaptive B+-tree Based Indexing Method for Nearest Neighbor Search
2005.
"... In this article, we present an efficient B +-tree based indexing method, called iDistance, for K-nearest neighbor (KNN) search in a high-dimensional metric space. iDistance partitions the data based on a space- or data-partitioning strategy, and selects a reference point for each partition. The data ..."
Abstract
-
Cited by 93 (10 self)
- Add to MetaCart
In this article, we present an efficient B+-tree based indexing method, called iDistance, for K-nearest neighbor (KNN) search in a high-dimensional metric space. iDistance partitions the data based on a space- or data-partitioning strategy, and selects a reference point for each partition. The data points in each partition are transformed into a single-dimensional value based on their similarity with respect to the reference point. This allows the points to be indexed using a B+-tree structure and KNN search to be performed using one-dimensional range search. The choice of partition and reference points adapts the index structure to the data distribution. We conducted extensive experiments to evaluate the iDistance technique, and report results demonstrating its effectiveness. We also present a cost model for iDistance KNN search, which can be exploited in query optimization.
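
To make the key transformation concrete, here is a minimal Python sketch of the iDistance mapping, assuming Euclidean distance as the metric and a sorted list standing in for the B+-tree; the function names and the separation constant c are illustrative, not from the paper.

import bisect
import math

def dist(a, b):
    # Euclidean distance; iDistance works with any metric.
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def build_idistance(points, refs, c=10_000.0):
    # Key each point by i*c + dist(p, refs[i]), where refs[i] is its
    # nearest reference point; c keeps partitions disjoint on the 1-D
    # line (it must exceed any within-partition distance).
    keyed = []
    for p in points:
        i = min(range(len(refs)), key=lambda j: dist(p, refs[j]))
        keyed.append((i * c + dist(p, refs[i]), p))
    keyed.sort()  # a sorted list stands in for the B+-tree
    return keyed

def probe(keyed, refs, q, r, c=10_000.0):
    # One step of KNN search with radius r: scan the 1-D interval
    # [d_i - r, d_i + r] in each partition i, where d_i = dist(q, refs[i]).
    keys = [k for k, _ in keyed]
    hits = []
    for i, o in enumerate(refs):
        d = dist(q, o)
        lo = bisect.bisect_left(keys, i * c + max(d - r, 0.0))
        hi = bisect.bisect_right(keys, i * c + d + r)
        hits.extend(p for _, p in keyed[lo:hi] if dist(q, p) <= r)
    return hits

A full search would enlarge r until at least K hits survive; the real structure also prunes partitions whose data sphere the query ball cannot intersect.
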
An Efficient Cost Model for Optimization of Nearest Neighbor Search in Low and Medium Dimensional Spaces
IEEE TKDE, 2004.
"... Existing models for nearest neighbor search in multi-dimensional spaces are not appropriate for query optimization because they either lead to erroneous estimation, or involve complex equations that are expensive to evaluate in real-time. This paper proposes an alternative method that captures the p ..."
Abstract
-
Cited by 38 (3 self)
- Add to MetaCart
Existing models for nearest neighbor search in multi-dimensional spaces are not appropriate for query optimization because they either lead to erroneous estimation, or involve complex equations that are expensive to evaluate in real-time. This paper proposes an alternative method that captures the performance of nearest neighbor queries using approximation. For uniform data, our model involves closed formulae that are very efficient to compute and accurate for up to 10 dimensions. Further, the proposed equations can be applied on non-uniform data with the aid of histograms. We demonstrate the effectiveness of the model by using it to solve several optimization problems related to nearest neighbor search.
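
The paper's own formulae are not reproduced here, but the flavor of a closed-form estimate for uniform data can be illustrated with the textbook approximation below: the radius at which a ball around a random query in the unit d-cube is expected to contain k of the N uniform points. Ignoring boundary effects is an assumption that breaks down near the cube's edges, which is exactly the kind of error such models must control.

import math

def expected_knn_distance(n, d, k=1):
    # Solve n * vol_ball(r, d) = k for r, with
    # vol_ball(r, d) = pi^(d/2) * r^d / Gamma(d/2 + 1).
    unit_ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return (k / (n * unit_ball)) ** (1.0 / d)

# e.g. expected_knn_distance(1_000_000, 2) is tiny, but the estimate
# grows quickly with d, one face of the dimensionality curse.
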
Fast Feature Selection Using Fractal Dimension
2000.
"... Dimensionalitycurse and dimensionality reduction are two issues that have retained high interest for data mining, machine learning, multimedia indexing, and clustering. We present a fast, scalable algorithm to quickly select the most important attributes (dimensions) for a given set of n-dimensional ..."
Abstract
-
Cited by 35 (9 self)
- Add to MetaCart
The dimensionality curse and dimensionality reduction are two issues that have retained high interest for data mining, machine learning, multimedia indexing, and clustering. We present a fast, scalable algorithm to quickly select the most important attributes (dimensions) for a given set of n-dimensional vectors. In contrast to older methods, our method has the following desirable properties: (a) it does not do rotation of attributes, thus leading to easy interpretation of the resulting attributes; (b) it can spot attributes that have nonlinear correlations; (c) it requires a constant number of passes over the dataset; (d) it gives a good estimate of how many attributes we should keep. The idea is to use the 'fractal' dimension of a dataset as a good approximation of its intrinsic dimension, and to drop attributes that do not affect it. We applied our method on real and synthetic datasets, where it gave fast and good results.
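
As a rough illustration of the idea (not the authors' implementation; the grid sizes and the greedy elimination loop are assumptions), the sketch below estimates the correlation fractal dimension by box counting and then drops, one at a time, the attribute whose removal changes that estimate least.

import numpy as np

def fractal_dimension(X, grids=(2, 4, 8, 16, 32)):
    # Correlation fractal dimension D2: slope of log(sum of squared
    # cell occupancies) versus log(cell side) over several grid sizes.
    span = X.max(0) - X.min(0) + 1e-12
    U = (X - X.min(0)) / span                  # scale into the unit cube
    xs, ys = [], []
    for g in grids:
        cells = np.minimum((U * g).astype(int), g - 1)
        _, counts = np.unique(cells, axis=0, return_counts=True)
        xs.append(np.log(1.0 / g))
        ys.append(np.log(((counts / len(X)) ** 2).sum()))
    return np.polyfit(xs, ys, 1)[0]

def drop_attributes(X, keep_count):
    # Backward elimination: repeatedly drop the attribute whose removal
    # moves the fractal dimension the least.
    keep = list(range(X.shape[1]))
    while len(keep) > keep_count:
        base = fractal_dimension(X[:, keep])
        j = min(keep, key=lambda a: abs(base - fractal_dimension(
            X[:, [c for c in keep if c != a]])))
        keep.remove(j)
    return keep
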
Fast indexing and visualization of metric datasets using Slim-trees
IEEE Transactions on Knowledge and Data Engineering (TKDE), 2002.
"... AbstractÐMany recent database applications must deal with similarity queries. For such applications, it is important to measure the similarity between two objects using the distance between them. Focusing on this problem, this paper proposes the Slim-tree, a new dynamic tree for organizing metric da ..."
Abstract
-
Cited by 35 (10 self)
- Add to MetaCart
(Show Context)
Many recent database applications must deal with similarity queries. For such applications, it is important to measure the similarity between two objects using the distance between them. Focusing on this problem, this paper proposes the Slim-tree, a new dynamic tree for organizing metric data sets in pages of fixed size. The Slim-tree uses the triangle inequality to prune distance calculations needed to answer similarity queries over objects in metric spaces. The proposed insertion algorithm uses new policies to select the nodes where incoming objects are stored. When a node overflows, the Slim-tree uses a Minimal Spanning Tree to help with the split. The new insertion algorithm leads to a tree with high storage utilization and improved query performance. The Slim-tree is the first metric access method to tackle the problem of overlap between nodes in metric spaces and to propose a technique to minimize it. The proposed "fat-factor" is a way to quantify whether a given tree can be improved and also to compare two trees. We show how to use the fat-factor to achieve accurate estimates of the search performance and also how to improve the performance of a metric tree through the proposed "Slim-down" algorithm. This paper also presents a new tool in the arsenal of resources of Slim-tree aimed at visualizing it. Visualization is a powerful tool for interactive data mining and for the visual tracking of the behavior of a tree under updates. Finally, we present a formula to estimate the number of disk accesses in range queries. Results from experiments with real and synthetic data sets show that the new algorithms of the Slim-tree lead to performance improvements. These results show that the Slim-tree outperforms the M-tree by up to 200 percent for range queries. For insertion and split, the Minimal-Spanning-Tree-based algorithm achieves insertions up to 40 times faster. We observed improvements of up to 40 percent in range queries after applying the Slim-down algorithm.
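
A minimal sketch of the MST-based split, under stated assumptions: entries is the overflowing node's object list (at least two entries), dist is the metric, and the real Slim-tree additionally chooses node representatives and handles unbalanced cuts, which this sketch omits.

def mst_split(entries, dist):
    # Build the minimal spanning tree over the entries (Prim's algorithm),
    # drop the longest MST edge, and let the two remaining components
    # become the two new nodes.
    n = len(entries)
    edges = []
    best = {j: (dist(entries[0], entries[j]), 0) for j in range(1, n)}
    while best:
        j = min(best, key=lambda k: best[k][0])
        d, i = best.pop(j)
        edges.append((d, i, j))
        for k in best:
            dk = dist(entries[j], entries[k])
            if dk < best[k][0]:
                best[k] = (dk, j)
    edges.sort()
    adj = {i: set() for i in range(n)}
    for _, i, j in edges[:-1]:          # keep all but the longest edge
        adj[i].add(j)
        adj[j].add(i)
    comp, stack = {edges[-1][1]}, [edges[-1][1]]
    while stack:                        # flood-fill one component
        for v in adj[stack.pop()] - comp:
            comp.add(v)
            stack.append(v)
    return ([entries[i] for i in comp],
            [entries[i] for i in range(n) if i not in comp])
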
The IGrid index: Reversing the dimensionality curse for similarity indexing in high dimensional space
In Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining, 2000.
"... The similarity searc h and indexing problem is w ell kno wn to be a di cult one for high dimensional applications. Most indexing structures show a rapid degradation with increasing dimensionality whic hleads to an access of the en tire database for each query. F urthermore, recen t research results ..."
Abstract
-
Cited by 30 (5 self)
- Add to MetaCart
(Show Context)
The similarity search and indexing problem is well known to be a difficult one for high dimensional applications. Most indexing structures show a rapid degradation with increasing dimensionality, which leads to an access of the entire database for each query. Furthermore, recent research results show that in high dimensional space, even the concept of similarity may not be very meaningful. In this paper, we propose the IGrid-index, a method for similarity indexing which uses a distance function whose meaningfulness is retained with increasing dimensionality. In addition, this technique shows performance which is unique among all known index structures: the percentage of data accessed is inversely proportional to the overall data dimensionality. Thus, this technique relies on the dimensionality to be high in order to provide performance-efficient similarity results. The IGrid-index can also support a special kind of query which we refer to as projected range queries, a query which is increasingly relevant for very high dimensional data mining applications.
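
A simplified rendering of the grid idea (the bucket count kd, the helper names, and the exact proximity formula are my assumptions; the paper defines its own similarity function): each dimension is cut into equi-depth ranges, and only dimensions where two points fall in the same range contribute to their similarity, so the contribution per dimension stays meaningful as dimensionality grows.

import numpy as np

def equi_depth_edges(X, kd):
    # Per-dimension equi-depth partitioning into kd ranges.
    qs = np.linspace(0.0, 1.0, kd + 1)
    return [np.quantile(X[:, j], qs) for j in range(X.shape[1])]

def bucket(e, v):
    # Index of the range containing v, clamped to valid buckets.
    return int(np.clip(np.searchsorted(e, v, side='right') - 1,
                       0, len(e) - 2))

def grid_similarity(x, y, edges, p=2):
    # Only dimensions where x and y share an equi-depth range contribute;
    # closer pairs within a range contribute more.
    total = 0.0
    for j, e in enumerate(edges):
        bx, by = bucket(e, x[j]), bucket(e, y[j])
        if bx == by:
            width = max(e[bx + 1] - e[bx], 1e-12)
            total += (1.0 - abs(x[j] - y[j]) / width) ** p
    return total ** (1.0 / p)
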
On the Effects of Dimensionality Reduction on High Dimensional Similarity Search
In ACM PODS Conference Proceedings, 2001.
"... The dimensionality curse has profound effects on the effectiveness of high-dimensional similarity indexing from the performance perspective. One of the well known techniques for improving the indexing performance is the method of dimensionality reduction. In this technique, the data is transformed t ..."
Abstract
-
Cited by 26 (1 self)
- Add to MetaCart
(Show Context)
The dimensionality curse has profound effects on the effectiveness of high-dimensional similarity indexing from the performance perspective. One of the well-known techniques for improving indexing performance is dimensionality reduction. In this technique, the data is transformed to a lower dimensional space by finding a new axis system in which most of the data variance is preserved in a few dimensions. This reduction may also have a positive effect on the quality of similarity for certain data domains such as text. For other domains, it may lead to loss of information and degradation of search quality. Recent research indicates that the improvement for the text domain is caused by the reinforcement of the semantic concepts in the data. In this paper, we provide an intuitive model of the effects of dimensionality reduction on arbitrary high dimensional problems. We provide an effective diagnosis of the causality behind the qualitative effects of dimensionality reduction on a given data set. The analysis suggests that these effects are very data dependent. Our analysis also indicates that currently accepted techniques of picking the reduction which results in the least loss of information are useful for maximizing precision and recall, but are not necessarily optimal from a qualitative perspective. We demonstrate that by making simple changes to the implementation details of dimensionality reduction techniques, we can considerably improve the quality of similarity search.
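
For reference, the variance-preserving reduction the paper analyzes is typically implemented as SVD-based PCA; a generic sketch (standard technique, not code from the paper):

import numpy as np

def pca_reduce(X, k):
    # Project onto the k directions of greatest variance via SVD and
    # report the fraction of variance preserved.
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    kept_variance = float((S[:k] ** 2).sum() / (S ** 2).sum())
    return Xc @ Vt[:k].T, kept_variance

# Usage: Y, kept = pca_reduce(np.random.rand(1000, 50), 10)

The paper's point is that maximizing kept_variance is not the same as maximizing the quality of similarity search in the reduced space.
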
Redundant Bit Vectors for Quickly Searching High-Dimensional Regions
In Deterministic and Statistical Methods in Machine Learning, 2005.
"... Abstract. Applications such as audio fingerprinting require search in high dimensions: find an item in a database that is similar to a query. An important property of this search task is that negative answers are very frequent: much of the time, a query does not correspond to any database item. We p ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
(Show Context)
Applications such as audio fingerprinting require search in high dimensions: find an item in a database that is similar to a query. An important property of this search task is that negative answers are very frequent: much of the time, a query does not correspond to any database item. We propose Redundant Bit Vectors (RBVs): a novel method for quickly solving this search problem. RBVs rely on three key ideas: 1) approximate the high-dimensional regions/distributions as tightened hyperrectangles, 2) partition the query space to store each item redundantly in an index, and 3) use bit vectors to store and search the index efficiently. We show that our method is the preferred method for very large databases or when the queries are often not in the database. Our method is 109 times faster than linear scan, and 48 times faster than locality-sensitive hashing on a data set of 239369 audio fingerprints.
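
A compact sketch of the three ideas, under assumptions of my own (the bin count, the helper names, and numpy boolean arrays standing in for packed bit vectors): each item's hyperrectangle is registered redundantly in every bin it overlaps in each dimension, and a query ANDs one bit vector per dimension before verifying the few survivors exactly.

import numpy as np

def build_rbv(points, eps, bins_per_dim=32):
    # Treat each item as the hyperrectangle point +/- eps; per
    # (dimension, bin), store a bit vector of the items whose
    # rectangle overlaps that bin (hence "redundant").
    n, d = points.shape
    lo, hi = points.min(0) - eps, points.max(0) + eps
    edges = [np.linspace(lo[j], hi[j], bins_per_dim + 1) for j in range(d)]
    index = [[np.zeros(n, dtype=bool) for _ in range(bins_per_dim)]
             for _ in range(d)]
    for i, p in enumerate(points):
        for j in range(d):
            a = np.searchsorted(edges[j], p[j] - eps, side='right') - 1
            b = np.searchsorted(edges[j], p[j] + eps, side='right') - 1
            for cell in range(max(a, 0), min(b, bins_per_dim - 1) + 1):
                index[j][cell][i] = True
    return index, edges

def query_rbv(q, points, eps, index, edges):
    # AND one bit vector per dimension, then verify the survivors.
    mask = np.ones(len(points), dtype=bool)
    for j in range(len(edges)):
        cell = int(np.clip(np.searchsorted(edges[j], q[j], side='right') - 1,
                           0, len(index[j]) - 1))
        mask &= index[j][cell]
    return [i for i in np.flatnonzero(mask)
            if np.all(np.abs(points[i] - q) <= eps)]

The ANDed mask is a superset of the true answers, so a frequent negative answer usually dies after a handful of cheap bit operations, which is the property the paper exploits.
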
Modeling High-Dimensional Index Structures using Sampling
In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2001.
"... A large number of index structures for high-dimensional data have been proposed previously. In order to tune and compare such index structures, it is vital to have efficient cost prediction techniques for these structures. Previous techniques either assume uniformity of the data or are not applicabl ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
(Show Context)
A large number of index structures for high-dimensional data have been proposed previously. In order to tune and compare such index structures, it is vital to have efficient cost prediction techniques for these structures. Previous techniques either assume uniformity of the data or are not applicable to high-dimensional data. We propose the use of sampling to predict the number of accessed index pages during a query execution. Sampling is independent of the dimensionality and preserves clusters, which is important for representing skewed data. We present a general model for estimating the index page layout using sampling and show how to compensate for errors. We then give an implementation of our model under restricted memory assumptions and show that it performs well even under these constraints. Errors are minimal and the overall prediction time is up to two orders of magnitude below the time for building and probing the full index without sampling.
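
A toy version of the approach, under strong assumptions of mine (sort-and-chunk pages as a crude stand-in for the real page layout, and no error compensation): give the sample proportionally smaller pages so it has about as many pages as the full index would, then count the sample-page MBRs that intersect the query box.

import numpy as np

def estimate_page_accesses(sample, n_total, page_cap, query_lo, query_hi):
    # Scale the page capacity down so the sample yields roughly
    # n_total / page_cap pages, mirroring the full index.
    cap = max(1, round(page_cap * len(sample) / n_total))
    order = np.argsort(sample[:, 0])   # crude space-filling order
    accessed = 0
    for start in range(0, len(sample), cap):
        page = sample[order[start:start + cap]]
        mbr_lo, mbr_hi = page.min(0), page.max(0)
        # Count the page if its MBR intersects the query box; the
        # sample's MBRs approximate the full index's because sampling
        # preserves clusters.
        if np.all(mbr_lo <= query_hi) and np.all(query_lo <= mbr_hi):
            accessed += 1
    return accessed
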
Tri-Plots: Scalable Tools for Multidimensional Data Mining
In Proc. of ACM KDD, 2001.
"... We focus on the problem of finding patterns across two large, multidimensional datasets. For example, given feature vectors of healthy and of non-healthy patients, we want to answer the following questions: Are the two clouds of points separable? What is the smallest/largest pair-wise distance acros ..."
Abstract
-
Cited by 11 (3 self)
- Add to MetaCart
We focus on the problem of finding patterns across two large, multidimensional datasets. For example, given feature vectors of healthy and of non-healthy patients, we want to answer the following questions: Are the two clouds of points separable? What is the smallest/largest pair-wise distance across the two datasets? Which of the two clouds does a new point (feature vector) come from? We propose a new tool, the tri-plot, and its generalization, the pq-plot, which help us answer the above questions. We provide a set of rules on how to interpret a tri-plot, and we apply these rules on synthetic and real datasets. We also show how to use our tool for classification, when traditional methods (nearest neighbor, classification trees) may fail.
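
The counts behind a tri-plot can be sketched as follows (brute force, whereas the paper's contribution is computing them scalably; the radius grid is an assumption): for each radius r, count self-pairs within A, self-pairs within B, and cross pairs between A and B, then plot the three counts against r on log-log axes and read the curves against the paper's interpretation rules.

import numpy as np

def triplot_counts(A, B, radii):
    # For each radius r: (r, self-pairs of A within r, self-pairs of B
    # within r, cross pairs (a in A, b in B) within r).
    def pairs(X, Y, r, self_pairs):
        d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
        c = int(np.count_nonzero(d <= r))
        # For self-pairs, drop the zero diagonal and halve the
        # symmetric double count.
        return (c - len(X)) // 2 if self_pairs else c
    return [(r, pairs(A, A, r, True), pairs(B, B, r, True),
             pairs(A, B, r, False)) for r in radii]

If the cross-pair curve tracks the self-pair curves, the two clouds overlap; a cross-pair curve that lags at small radii signals separable clouds.
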