Density-Based Indexing for Approximate Nearest-Neighbor Queries
ACM SIGKDD Conference Proceedings, 1999
Cited by 28 (2 self)
We consider the problem of performing nearest-neighbor queries efficiently over large high-dimensional databases. Assuming that a full database scan to determine the nearest neighbor entries is not acceptable, we study the possibility of constructing an index structure over the database. It is well-accepted that traditional database indexing algorithms fail for high-dimensional data (say d > 10 or 20 depending on the scheme). Some arguments have advocated that nearest-neighbor queries do not even make sense for high-dimensional data since the ratio of maximum to minimum distance goes to 1 as dimensionality increases. We show that these arguments are based on over-restrictive assumptions, and that in the general case it is meaningful and possible to perform such queries. We present an approach for deriving a multidimensional index to support approximate nearest-neighbor queries over large databases. Our approach, called DBIN, scales to high-dimensional databases by exploiting sta...
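The distance-ratio argument mentioned in this abstract can be illustrated with a small experiment. The sketch below is not taken from the paper: it assumes uniformly distributed, independent coordinates (our choice for the illustration; the abstract does not say which assumptions it considers over-restrictive), and the function name and parameters are hypothetical.

```python
import math
import random

def max_min_distance_ratio(dim, n_points=200, seed=0):
    """Ratio of the farthest to the nearest distance from one random query.

    Assumption: query and data points are drawn uniformly and independently
    from the unit hypercube [0, 1]^dim.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return max(dists) / min(dists)

# Compare a low, a medium, and a high dimensionality.
ratios = {d: max_min_distance_ratio(d) for d in (2, 20, 200)}
for d, r in ratios.items():
    print(f"d={d:4d}  max/min distance ratio = {r:.2f}")
```

Under these assumptions the ratio shrinks toward 1 as the dimensionality grows. Data with cluster structure need not behave this way, which is consistent with the abstract's claim that such queries remain meaningful in the general case.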
High Dimensional Similarity Search With Space Filling Curves
In Proceedings of the 17th International Conference on Data Engineering, 2000
Cited by 14 (1 self)
We present a new approach for approximate nearest neighbor queries for sets of high dimensional points under any L_t-metric, t = 1,2,3,... The proposed algorithm is efficient and simple to implement. The algorithm uses multiple shifted copies of the data points and stores them in up to (d + 1) B-trees where d is the dimensionality of the data, sorted according to their position along a space filling curve. This is done in a way that allows us to guarantee that a neighbor within an O(d^(1+1/t)) factor of the exact nearest can be returned with at most (d + 1) log_p n page accesses, where p is the branching factor of the B-trees. In practice, for real data sets, our approximate technique finds the exact nearest neighbor between 87% and 99% of the time and a point no farther than the third nearest neighbor between 98% and 100% of the time. Our solution is dynamic, allowing insertion or deletion of points in O(d log_p n) page accesses and generalizes easily to find approximate k-nea...
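The idea of keeping several shifted copies of the data, each sorted along a space-filling curve, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the abstract does not name the curve, so a Z-order (Morton) curve is swapped in; sorted Python lists stand in for B-trees; and the shift values, the `window` size, and all function names are hypothetical. No approximation guarantee is claimed for this sketch.

```python
import bisect

BITS = 8  # assumed resolution: integer coordinates in [0, 2**BITS)

def z_order_key(point, bits=BITS + 1):
    """Interleave coordinate bits into one integer (Z-order / Morton key)."""
    key = 0
    for bit in range(bits - 1, -1, -1):
        for coord in point:
            key = (key << 1) | ((coord >> bit) & 1)
    return key

def lt_distance(a, b, t=2):
    """L_t distance between two points."""
    return sum(abs(x - y) ** t for x, y in zip(a, b)) ** (1.0 / t)

def build_indexes(points, shifts):
    """One sorted (key, point_id) list per shifted copy of the data."""
    indexes = []
    for s in shifts:
        entries = sorted(
            (z_order_key([c + s for c in p]), i) for i, p in enumerate(points)
        )
        indexes.append((s, entries))
    return indexes

def approx_nearest(query, points, indexes, window=2, t=2):
    """Check a few curve neighbors of the query in every shifted copy."""
    best = None
    for s, entries in indexes:
        key = z_order_key([c + s for c in query])
        pos = bisect.bisect_left(entries, (key, -1))
        for _, i in entries[max(0, pos - window): pos + window]:
            d = lt_distance(query, points[i], t)
            if best is None or d < best[0]:
                best = (d, i)
    return best  # (distance, point_id), or None if there are no points

points = [(10, 10), (200, 30), (12, 11), (100, 250)]
# d + 1 = 3 copies for d = 2; the shift amounts are illustrative only.
indexes = build_indexes(points, shifts=(0, 85, 170))
result = approx_nearest((11, 11), points, indexes)
```

Each lookup touches only a handful of entries per copy, which mirrors the abstract's bound of a logarithmic number of page accesses per B-tree.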
CUBIST: A New Algorithm for Improving the Performance of Ad-hoc OLAP Queries
In DOLAP, 2000
Cited by 8 (3 self)
Being able to efficiently answer arbitrary OLAP queries that aggregate along any combination of dimensions over numerical and categorical attributes has been a continued, major concern in data warehousing. In this paper, we introduce a new data structure, called Statistics Tree (ST), together with an efficient algorithm called CubiST, for evaluating ad-hoc OLAP queries on top of a relational data warehouse. We are focusing on a class of queries called cube queries, which generalize the data cube operator. CubiST represents a drastic departure from existing relational (ROLAP) and multi-dimensional (MOLAP) approaches in that it does not use the familiar view lattice to compute and materialize new views from existing views in some heuristic fashion. CubiST is the first OLAP algorithm that needs only one scan over the detailed data set and can efficiently answer any cube query without additional I/O when the ST fits into memory. We have implemented CubiST and our experiments have demonstrated significant improvements in performance and scalability over existing ROLAP/MOLAP approaches.
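The one-scan idea behind answering cube queries from precomputed statistics can be sketched in a few lines. This is a simplified illustration, not the Statistics Tree itself: a flat dictionary keyed by value combinations stands in for the tree, only COUNT and SUM are kept, queries are restricted to a single value or "all values" per dimension, and the names `build_aggregates`, `cube_query`, and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import product

STAR = "*"  # stands for "all values" of a dimension

def build_aggregates(rows):
    """Single pass over (dims, measure) rows; returns {key: [count, total]}."""
    stats = defaultdict(lambda: [0, 0.0])
    for dims, measure in rows:
        # Update every combination of "keep the value" or "roll up to STAR".
        for key in product(*[(v, STAR) for v in dims]):
            cell = stats[key]
            cell[0] += 1
            cell[1] += measure
    return stats

def cube_query(stats, key):
    """Answer an aggregate query from memory, without rescanning the rows."""
    count, total = stats.get(tuple(key), (0, 0.0))
    return {"count": count, "sum": total}

rows = [
    (("ford", "red", "2000"), 10.0),
    (("ford", "blue", "2000"), 20.0),
    (("audi", "red", "2001"), 5.0),
]
stats = build_aggregates(rows)
red_sales = cube_query(stats, ("*", "red", "*"))
print(red_sales)
```

The dictionary grows with the number of distinct value combinations, so this sketch only makes sense when, as the abstract puts it, the structure fits into memory.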
On optimizing nearest neighbor queries in high-dimensional data spaces
In Proceedings of the 8th International Conference on Database Theory (ICDT), 2001
Cited by 8 (0 self)
Nearest-neighbor queries in high-dimensional space are of high importance in various applications, especially in content-based indexing of multimedia data. For an optimization of the query processing, accurate models for estimating the query processing costs are needed. In this paper, we propose a new cost model for nearest neighbor queries in high-dimensional space, which we apply to enhance the performance of high-dimensional index structures. The model is based on new insights into effects occurring in high-dimensional space and provides a closed formula for the processing costs of nearest neighbor queries depending on the dimensionality, the block size and the database size. From the wide range of possible applications of our model, we select two interesting samples: First, we use the model to prove the known linear complexity of the nearest neighbor search problem in high-dimensional space, and second, we provide a technique for optimizing the block size. For data of medium dimensionality, the optimized block size allows significant speed-ups of the query processing time when compared to traditional block sizes and to the linear scan.
Efficient top-k hyperplane query processing for multimedia information retrieval
In Proc. ACM Multimedia, 2006
Cited by 4 (0 self)
A query can be answered by a binary classifier, which separates the instances that are relevant to the query from the ones that are not. When kernel methods are employed to train such a classifier, the class boundary is represented as a hyperplane in a projected space. Data instances that are farthest from the hyperplane are deemed to be most relevant to the query, and those that are nearest to the hyperplane are deemed to be most uncertain. In this paper, we address the twin problems of efficient retrieval of the approximate set of instances (a) farthest from and (b) nearest to a query hyperplane. Retrieval of instances for this hyperplane-based query scenario is mapped to the range-query problem, allowing for the reuse of existing index structures. Empirical evaluation on large image datasets confirms the effectiveness of our approach.
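The two query types can be made concrete with a brute-force baseline, which is the full scan an index is meant to avoid rather than the paper's method. The sketch assumes a linear hyperplane w·x + b = 0 in the original space (the kernel projection is omitted) and assumes that "farthest" means farthest on the positive side; the function names and sample data are hypothetical.

```python
import heapq
import math

def signed_distance(x, w, b):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def top_k_farthest(data, w, b, k):
    """The k instances deepest on the positive side (most relevant)."""
    return heapq.nlargest(k, data, key=lambda x: signed_distance(x, w, b))

def top_k_nearest(data, w, b, k):
    """The k instances closest to the boundary (most uncertain)."""
    return heapq.nsmallest(k, data, key=lambda x: abs(signed_distance(x, w, b)))

data = [(3.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (1.0, 1.0)]
w, b = (1.0, 0.0), 0.0
most_relevant = top_k_farthest(data, w, b, k=1)
most_uncertain = top_k_nearest(data, w, b, k=1)
```

Both functions touch every instance, so their cost grows linearly with the dataset size.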
Supporting Subseries Nearest Neighbor Search via Approximation
In Proceedings of the 9th ACM CIKM Int'l Conference on Information and Knowledge, 2000
Cited by 1 (0 self)
Searching for nearest neighbors in a large set of time series is an important data mining task. This paper studies the following type of time series nearest neighbor queries: Given a query series and a starting time, among all the subseries (of a collection of data series) that have the same length as the query series and start at the given time, find the K subseries that are closest to the query series. To support such queries, the paper develops a technique that uses a fixed number of values to approximate each whole data series, and obtains the approximation of any required subseries at query time. The paper then proposes three subseries search algorithms and compares them with the naive method that sequentially scans the whole data set, as well as a method adapted from a state-of-the-art subseries search algorithm. Experiments are conducted on both a real-life data set and a synthetic one. Results show that the proposed methods access only a small portion of the precise data and outperform the others in run time.
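The query type defined in this abstract, and the naive sequential scan used as a comparison baseline, can be sketched as follows. The sketch assumes Euclidean distance (the abstract does not name the measure) and skips series that are too short to contain a full-length subseries at the given start; the function name and sample data are hypothetical. The approximation technique proposed in the paper is not shown.

```python
import heapq
import math

def subseries_knn_scan(series_list, query, start, k):
    """Naive scan: K subseries starting at `start` that are closest to `query`.

    Returns a list of (distance, series_index) pairs, nearest first.
    """
    m = len(query)
    candidates = []
    for idx, series in enumerate(series_list):
        if start + m > len(series):
            continue  # no full-length subseries at this start time
        sub = series[start:start + m]
        candidates.append((math.dist(sub, query), idx))
    return heapq.nsmallest(k, candidates)

series_list = [
    [0, 1, 2, 3, 4],
    [5, 5, 5, 5, 5],
    [0, 2, 2, 2, 0],
    [1, 2],  # too short for start=2, length 2
]
neighbors = subseries_knn_scan(series_list, query=[2, 3], start=2, k=2)
```

Every series is read in full precision here, which is the cost the proposed methods aim to reduce.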
Evaluating Continuous Nearest Neighbor Queries for Streaming Time . . .
2002
For many applications, it is important to quickly locate the nearest neighbor of a given time series. When the given time series is a streaming one, nearest neighbors may need to be found continuously at all time positions.
CUBIST++: A New Approach to Improving the Performance of Ad-Hoc Cube Queries
2001
We provide a new approach to speeding up the evaluation of cube queries, an important class of OLAP queries which return aggregated values rather than sets of tuples. Our new algorithm called CubiST (Cubing with Statistics Trees) represents a drastic departure from existing approaches in that it does not use the familiar view lattice approach to compute and materialize new views from existing views. Instead, CubiST computes and stores all possible aggregate views in the leaves of a statistics tree during a one-time scan of the detailed data. To be able to handle queries that involve dimension hierarchies, we have developed an improved version of CubiST called CubiST++ which uses a family of trees instead of a single statistics tree. CubiST++ applies a greedy strategy to select a family of candidate trees which represent superviews for the different hierarchy levels of the dimensions. In addition, we have developed an algorithm to compute and materialize the candidate trees that make up the family starting from a single statistics tree (base tree). Given an input query, our new query evaluation algorithm selects the smallest tree in the family which can answer the query. CubiST++ significantly reduced I/O time and improved in-memory performance when compared with CubiST. For cube queries that contain holistic operations (e.g., median, top N), we have reduced the 1-dimensional holistic cubing to quantiling and selection problems. To implement holistic operations efficiently, we have developed two new algorithms, namely, deterministic bucketing (DB) and random bucketing (RB). Experimental evaluations of our CubiST++ prototype implementation have demonstrated its superior run-time performance and scalability when compared to existing OLAP systems.
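The step of selecting the smallest tree in the family that can answer a query can be sketched as below. The representation is our assumption, not the paper's: each tree is described by one hierarchy level per dimension (0 being the finest), a tree can answer a query only if it is at least as fine as the query on every dimension, and `size` is a hypothetical node count. The greedy construction of the family is not shown.

```python
def can_answer(tree_levels, query_levels):
    """True if the tree is at least as fine as the query on every dimension.

    Assumption: level 0 is the finest granularity; larger numbers are coarser.
    """
    return all(t <= q for t, q in zip(tree_levels, query_levels))

def pick_tree(family, query_levels):
    """Smallest tree in the family able to answer the query, or None."""
    usable = [t for t in family if can_answer(t["levels"], query_levels)]
    return min(usable, key=lambda t: t["size"]) if usable else None

# Hypothetical family derived from a base tree over two dimensions.
family = [
    {"name": "base", "levels": (0, 0), "size": 1000},
    {"name": "t1", "levels": (1, 0), "size": 300},
    {"name": "t2", "levels": (1, 1), "size": 80},
]
chosen = pick_tree(family, query_levels=(1, 0))
print(chosen["name"])
```

In this toy family a query at levels (1, 0) cannot use `t2`, which is too coarse on the second dimension, so the smaller of the two remaining trees is chosen.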
Exploiting Geometry for Support Vector Machine Indexing
Support Vector Machines (SVMs) have been adopted by many data-mining and information-retrieval applications for learning a mining or query concept, and then retrieving the “top-k” best matches to the concept. However, when the dataset is large, naively scanning the entire dataset to find the top matches is not scalable. In this work, we propose a kernel indexing strategy to substantially prune the search space and thus improve the performance of top-k queries. Our kernel indexer (KDX) takes advantage of the underlying geometric properties and quickly converges on an approximate set of top-k instances of interest. More importantly, once the kernel (e.g., Gaussian kernel) has been selected and the indexer has been constructed, the indexer can work with different kernel-parameter settings (e.g., γ and σ) without performance compromise. Through theoretical analysis, and empirical studies on a wide variety of datasets, we demonstrate KDX to be very effective.

