Results 1–10 of 12
Density-based indexing for approximate nearest-neighbor queries
In Proc. KDD, 1999
Abstract

Cited by 33 (2 self)
We consider the problem of performing nearest-neighbor queries efficiently over large high-dimensional databases. To avoid a full database scan, we target constructing a multidimensional index structure. It is well-accepted that traditional database indexing algorithms fail for high-dimensional data (say d > 10 or 20, depending on the scheme). Some arguments have advocated that nearest-neighbor queries do not even make sense for high-dimensional data. We show that these arguments are based on over-restrictive assumptions, and that in the general case it is meaningful and possible to build an index for such queries. Our approach, called DBIN, scales to high-dimensional databases by exploiting statistical properties of the data. The approach is based on statistically modeling the density of the content of the data table. DBIN uses the density model to derive a single index over the data table and requires physically rewriting data in a new table sorted by the newly created index (i.e., creating a clustered index). The indexing scheme produces a mapping between a query point (a data record) and an ordering on the clustered index values. Data is then scanned according to the index. We present theoretical and empirical justification for DBIN. The scheme supports a family of distance functions which includes the traditional Euclidean distance measure.
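The scan-ordering idea behind this abstract can be sketched with a toy clustered index: partition the table into clusters, then answer a query by visiting clusters in order of centroid distance, pruning any cluster whose lower-bound distance cannot beat the best match so far. This is a minimal sketch only; k-means stands in for DBIN's statistical density model, and all function names are hypothetical.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_index(points, k=4, iters=15, seed=0):
    """Cluster the table (k-means here; DBIN fits a statistical density
    model) and record each cluster's radius for pruning."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    buckets = [[] for _ in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        centroids = [
            tuple(sum(p[d] for p in b) / len(b) for d in range(len(points[0])))
            if b else centroids[i]
            for i, b in enumerate(buckets)
        ]
    radii = [max((dist(c, p) for p in b), default=0.0)
             for c, b in zip(centroids, buckets)]
    return centroids, radii, buckets

def nearest(q, index):
    """Scan clusters in order of centroid distance; by the triangle
    inequality, a cluster with dist(q, c) - radius >= best_d holds no
    closer point, so it is skipped."""
    centroids, radii, buckets = index
    best, best_d = None, float("inf")
    for i in sorted(range(len(centroids)), key=lambda i: dist(q, centroids[i])):
        if dist(q, centroids[i]) - radii[i] >= best_d:
            continue
        for p in buckets[i]:
            d = dist(q, p)
            if d < best_d:
                best, best_d = p, d
    return best
```

Because the pruning bound is exact, this sketch returns the true nearest neighbor; DBIN's contribution is doing the equivalent over a disk-resident clustered table.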
High Dimensional Similarity Search With Space Filling Curves
In Proceedings of the 17th International Conference on Data Engineering, 2000
Abstract

Cited by 15 (1 self)
We present a new approach for approximate nearest neighbor queries for sets of high dimensional points under any L_t metric, t = 1, 2, 3, ... The proposed algorithm is efficient and simple to implement. The algorithm uses multiple shifted copies of the data points and stores them in up to (d + 1) B-trees, where d is the dimensionality of the data, sorted according to their position along a space-filling curve. This is done in a way that allows us to guarantee that a neighbor within an O(d^(1+1/t)) factor of the exact nearest can be returned with at most (d + 1) log_p n page accesses, where p is the branching factor of the B-trees. In practice, for real data sets, our approximate technique finds the exact nearest neighbor between 87% and 99% of the time and a point no farther than the third nearest neighbor between 98% and 100% of the time. Our solution is dynamic, allowing insertion or deletion of points in O(d log_p n) page accesses, and generalizes easily to find approximate k-nea...
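The shifted space-filling-curve idea can be sketched in a few lines: keep one list of points per shift, sorted by a curve key, and answer a query by binary-searching the query's key in each list and examining a small window of curve neighbors. A Morton (Z-order) key stands in for the paper's curve, plain sorted lists stand in for its B-trees, and `z_value`, the shift values, and the window size are illustrative assumptions.

```python
import bisect
import math

def z_value(p, bits=10):
    # Interleave the coordinates' bits: the Morton (Z-order) key.
    z = 0
    for b in range(bits):
        for d, x in enumerate(p):
            z |= ((x >> b) & 1) << (b * len(p) + d)
    return z

def build(points, shifts, bits=10):
    # One sorted (key, point) list per shift, standing in for (d+1) B-trees.
    return [(s, sorted((z_value(tuple(x + s for x in p), bits), p) for p in points))
            for s in shifts]

def approx_nn(q, index, window=8, bits=10):
    """Probe each shifted copy at the query's Z-value and check a small
    window of curve neighbors; return the best candidate seen."""
    best, best_d = None, float("inf")
    for s, keyed in index:
        zq = z_value(tuple(x + s for x in q), bits)
        i = bisect.bisect_left(keyed, (zq,))
        for j in range(max(0, i - window), min(len(keyed), i + window)):
            p = keyed[j][1]
            d = math.dist(q, p)
            if d < best_d:
                best, best_d = p, d
    return best
```

The shifts matter because a single Z-curve can place close points far apart along the curve; a point pair split by one copy's grid boundaries is usually kept together by another copy.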
CubiST: A New Algorithm for Improving the Performance of Ad-hoc OLAP Queries
In DOLAP, 2000
Abstract

Cited by 9 (3 self)
Being able to efficiently answer arbitrary OLAP queries that aggregate along any combination of dimensions over numerical and categorical attributes has been a continued, major concern in data warehousing. In this paper, we introduce a new data structure, called Statistics Tree (ST), together with an efficient algorithm called CubiST, for evaluating ad-hoc OLAP queries on top of a relational data warehouse. We focus on a class of queries called cube queries, which generalize the data cube operator. CubiST represents a drastic departure from existing relational (ROLAP) and multidimensional (MOLAP) approaches in that it does not use the familiar view lattice to compute and materialize new views from existing views in some heuristic fashion. CubiST is the first OLAP algorithm that needs only one scan over the detailed data set and can efficiently answer any cube query without additional I/O when the ST fits into memory. We have implemented CubiST, and our experiments have demonstrated significant improvements in performance and scalability over existing ROLAP/MOLAP approaches.
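The one-scan property claimed above can be sketched with a flat dictionary: a single pass over the detailed data accumulates an aggregate (COUNT here) for every generalization of each row, after which any cube query is a constant-time lookup. The actual Statistics Tree stores the same information compactly as a tree with an extra "ALL" branch per level; the dictionary, the `ALL` marker, and all names below are simplifying assumptions.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # the generalized "aggregate over this dimension" coordinate

def build_cube(rows, n_dims):
    """One scan over the detailed data: for each row, increment the COUNT
    of every generalization (any subset of dimensions replaced by ALL)."""
    cube = defaultdict(int)
    for row in rows:
        for r in range(n_dims + 1):
            for keep in combinations(range(n_dims), r):
                key = tuple(row[i] if i in keep else ALL for i in range(n_dims))
                cube[key] += 1
    return cube

def cube_query(cube, selector):
    # e.g. ("red", ALL) -> number of rows with dim 1 = "red", any dim 2,
    # answered without touching the base data again.
    return cube[tuple(selector)]
```

The per-row work is exponential in the number of dimensions, which is exactly why a compact tree layout (and, in CubiST++, a family of trees for hierarchies) matters in practice.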
On optimizing nearest neighbor queries in high-dimensional data spaces
In Proceedings of the 8th International Conference on Database Theory (ICDT), 2001
Abstract

Cited by 9 (0 self)
Nearest-neighbor queries in high-dimensional space are of high importance in various applications, especially in content-based indexing of multimedia data. To optimize query processing, accurate models for estimating the query processing costs are needed. In this paper, we propose a new cost model for nearest neighbor queries in high-dimensional space, which we apply to enhance the performance of high-dimensional index structures. The model is based on new insights into effects occurring in high-dimensional space and provides a closed formula for the processing costs of nearest neighbor queries depending on the dimensionality, the block size and the database size. From the wide range of possible applications of our model, we select two interesting samples: first, we use the model to prove the known linear complexity of the nearest neighbor search problem in high-dimensional space, and second, we provide a technique for optimizing the block size. For data of medium dimensionality, the optimized block size allows significant speedups of the query processing time when compared to traditional block sizes and to the linear scan.
Efficient top-k hyperplane query processing for multimedia information retrieval
In Proc. ACM Multimedia, 2006
Abstract

Cited by 5 (0 self)
A query can be answered by a binary classifier, which separates the instances that are relevant to the query from those that are not. When kernel methods are employed to train such a classifier, the class boundary is represented as a hyperplane in a projected space. Data instances farthest from the hyperplane are deemed most relevant to the query, and those nearest to the hyperplane are deemed most uncertain with respect to the query. In this paper, we address the twin problems of efficiently retrieving the approximate set of instances (a) farthest from and (b) nearest to a query hyperplane. Retrieval of instances for this hyperplane-based query scenario is mapped to the range-query problem, allowing the reuse of existing index structures. Empirical evaluation on large image datasets confirms the effectiveness of our approach.
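The ranking this paper accelerates can be stated directly: score each instance by the kernel decision value f(x) = Σᵢ αᵢ K(sᵢ, x) + b, then take the k largest scores for "farthest from" (most relevant) or the k smallest absolute scores for "nearest to" (most uncertain) the hyperplane. The naive version below scans every instance, which is exactly what the paper's index avoids; the RBF kernel and all names are assumptions, not the paper's API.

```python
import math

def rbf(a, b, gamma=1.0):
    # Gaussian (RBF) kernel, a common choice for such classifiers.
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def decision(x, support, alphas, b, gamma=1.0):
    """Kernel decision value f(x); proportional to the signed distance of
    x from the hyperplane in the projected space (up to 1/||w||)."""
    return sum(a * rbf(s, x, gamma) for a, s in zip(alphas, support)) + b

def top_k_farthest(data, k, support, alphas, b, gamma=1.0):
    # Most relevant instances: largest positive margin.
    return sorted(data, key=lambda x: decision(x, support, alphas, b, gamma),
                  reverse=True)[:k]

def top_k_nearest(data, k, support, alphas, b, gamma=1.0):
    # Most uncertain instances: smallest absolute margin.
    return sorted(data, key=lambda x: abs(decision(x, support, alphas, b, gamma)))[:k]
```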
Supporting Subseries Nearest Neighbor Search via Approximation
In Proceedings of the 9th ACM CIKM International Conference on Information and Knowledge Management, 2000
Abstract

Cited by 1 (0 self)
Searching for nearest neighbors in a large set of time series is an important data mining task. This paper studies the following type of time series nearest neighbor query: given a query series and a starting time, among all the subseries (of a collection of data series) that have the same length as the query series and start at the given time, find the K subseries that are closest to the query series. To support such queries, the paper develops a technique that uses a fixed number of values to approximate each whole data series, and obtains the approximation of any required subseries at query time. The paper then proposes three subseries search algorithms and compares them with the naive method that sequentially scans the whole data set, as well as a method adapted from a state-of-the-art subseries search algorithm. Experiments are conducted on both a real-life data set and a synthetic one. Results show that the proposed methods access only a small portion of the precise data and outperform the others in run time.
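The "approximate the whole series once, derive any subseries approximation at query time" idea can be sketched with prefix sums and piecewise aggregate means. Piecewise aggregate approximation (PAA) is a standard stand-in here; the paper's exact approximation scheme may differ, and all names are hypothetical.

```python
import math

def prefix_sums(series):
    """One-time preprocessing per whole data series."""
    ps = [0.0]
    for v in series:
        ps.append(ps[-1] + v)
    return ps

def subseries_paa(ps, start, length, m):
    """m segment means of series[start:start+length], computed from the
    whole-series prefix sums in O(m) at query time -- no per-subseries
    precomputation is needed."""
    out = []
    for i in range(m):
        lo = start + round(i * length / m)
        hi = start + round((i + 1) * length / m)
        out.append((ps[hi] - ps[lo]) / (hi - lo))
    return out

def paa_lower_bound(a, b, seg_len):
    """For aligned equal-length segments, this distance on the
    approximations lower-bounds the true Euclidean distance, so most
    candidate subseries can be pruned without reading the precise data."""
    return math.sqrt(seg_len * sum((x - y) ** 2 for x, y in zip(a, b)))
```

A K-nearest-neighbor search would rank candidates by the lower bound and fetch precise data only for those that might still beat the current K-th best, which matches the "access only a small portion of the precise data" result.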
Evaluating Continuous Nearest Neighbor Queries for Streaming Time . . .
2002
Abstract
For many applications, it is important to quickly locate the nearest neighbor of a given time series. When the given time series is a streaming one, nearest neighbors may need to be found continuously at all time positions.
CubiST++: A New Approach to Improving the Performance of Ad-Hoc Cube Queries
2001
Abstract
We provide a new approach to speeding up the evaluation of cube queries, an important class of OLAP queries which return aggregated values rather than sets of tuples. Our new algorithm, called CubiST (Cubing with Statistics Trees), represents a drastic departure from existing approaches in that it does not use the familiar view-lattice approach to compute and materialize new views from existing views. Instead, CubiST computes and stores all possible aggregate views in the leaves of a statistics tree during a one-time scan of the detailed data. To be able to handle queries that involve dimension hierarchies, we have developed an improved version of CubiST called CubiST++, which uses a family of trees instead of a single statistics tree. CubiST++ applies a greedy strategy to select a family of candidate trees which represent super-views for the different hierarchy levels of the dimensions. In addition, we have developed an algorithm to compute and materialize the candidate trees that make up the family, starting from a single statistics tree (base tree). Given an input query, our new query evaluation algorithm selects the smallest tree in the family which can answer the query. CubiST++ significantly reduced I/O time and improved in-memory performance when compared with CubiST. For cube queries that contain holistic operations, e.g., median, top-N, etc., we have reduced the 1-dimensional holistic cubing to quantiling and selection problems. To implement holistic operations efficiently, we have developed two new algorithms, namely deterministic bucketing (DB) and random bucketing (RB). Experimental evaluations of our CubiST++ prototype implementation have demonstrated its superior runtime performance and scalability when compared to existing OLAP systems.
Exploiting Geometry for Support Vector Machine Indexing
Abstract
Support Vector Machines (SVMs) have been adopted by many data-mining and information-retrieval applications for learning a mining or query concept, and then retrieving the “top-k” best matches to the concept. However, when the dataset is large, naively scanning the entire dataset to find the top matches is not scalable. In this work, we propose a kernel indexing strategy to substantially prune the search space and thus improve the performance of top-k queries. Our kernel indexer (KDX) takes advantage of the underlying geometric properties and quickly converges on an approximate set of top-k instances of interest. More importantly, once the kernel (e.g., Gaussian kernel) has been selected and the indexer has been constructed, the indexer can work with different kernel-parameter settings (e.g., γ and σ) without performance compromise. Through theoretical analysis and empirical studies on a wide variety of datasets, we demonstrate KDX to be very effective.