Results 1–10 of 18
Density-based indexing for approximate nearest-neighbor queries
In Proc. KDD, 1999
Abstract

Cited by 35 (2 self)
We consider the problem of performing nearest-neighbor queries efficiently over large high-dimensional databases. To avoid a full database scan, we target constructing a multidimensional index structure. It is well-accepted that traditional database indexing algorithms fail for high-dimensional data (say d > 10 or 20, depending on the scheme). Some arguments have advocated that nearest-neighbor queries do not even make sense for high-dimensional data. We show that these arguments are based on over-restrictive assumptions, and that in the general case it is meaningful and possible to build an index for such queries. Our approach, called DBIN, scales to high-dimensional databases by exploiting statistical properties of the data. The approach is based on statistically modeling the density of the content of the data table. DBIN uses the density model to derive a single index over the data table and requires physically rewriting the data into a new table sorted by the newly created index (i.e., creating a clustered index). The indexing scheme produces a mapping between a query point (a data record) and an ordering on the clustered index values. Data is then scanned according to the index. We present theoretical and empirical justification for DBIN. The scheme supports a family of distance functions which includes the traditional Euclidean distance measure.
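The cluster-then-scan idea behind DBIN can be sketched roughly as follows. This is a hedged illustration only: k-means stands in for DBIN's statistical density model, and a fixed cluster budget replaces its probabilistic stopping criterion; the function names are mine, not the paper's.

```python
# Rough sketch of a DBIN-style clustered scan. Assumptions: k-means stands in
# for DBIN's statistical density model, and a fixed cluster budget replaces
# its probabilistic stopping rule. Not the paper's actual algorithm.
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_clustered_index(points, k, iters=10, seed=0):
    """Cluster the table and 'rewrite' it grouped by cluster (clustered index)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: dist2(p, centers[c]))].append(p)
        centers = [tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, groups

def nn_query(query, centers, groups, budget=2):
    """Scan clusters in order of center proximity, stopping after `budget`."""
    order = sorted(range(len(centers)), key=lambda c: dist2(query, centers[c]))
    best, best_d = None, float('inf')
    for c in order[:budget]:
        for p in groups[c]:
            if dist2(p, query) < best_d:
                best, best_d = p, dist2(p, query)
    return best
```

In the real scheme the grouped data lives on disk sorted by the clustered index, so "scanning a cluster" is sequential I/O rather than an in-memory loop.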
High Dimensional Similarity Search With Space Filling Curves
In Proceedings of the 17th International Conference on Data Engineering, 2000
Abstract

Cited by 25 (1 self)
We present a new approach for approximate nearest neighbor queries for sets of high dimensional points under any L_t metric, t = 1, 2, 3, ... The proposed algorithm is efficient and simple to implement. The algorithm uses multiple shifted copies of the data points and stores them in up to (d + 1) B-trees, where d is the dimensionality of the data, sorted according to their position along a space filling curve. This is done in a way that allows us to guarantee that a neighbor within an O(d^(1+1/t)) factor of the exact nearest can be returned with at most (d + 1) log_p n page accesses, where p is the branching factor of the B-trees. In practice, for real data sets, our approximate technique finds the exact nearest neighbor between 87% and 99% of the time and a point no farther than the third nearest neighbor between 98% and 100% of the time. Our solution is dynamic, allowing insertion or deletion of points in O(d log_p n) page accesses, and generalizes easily to find approximate knea...
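The shifted-copies idea can be sketched in a few lines. This is a hedged toy, not the paper's construction: a Z-order (Morton) curve stands in for its space-filling curve, sorted Python lists stand in for B-trees, and the shifts are random rather than the paper's deterministic choice.

```python
# Hedged sketch of the shifted space-filling-curve idea: a Z-order (Morton)
# curve stands in for the paper's curve, sorted lists stand in for B-trees,
# and the shift amounts are random rather than the paper's choice.
import bisect
import random

BITS = 10  # bits per coordinate; coordinates assumed integers in [0, 2**BITS)

def morton_key(point):
    """Interleave the coordinate bits into a single Z-order key."""
    key, d = 0, len(point)
    for bit in range(BITS):
        for j, x in enumerate(point):
            key |= ((x >> bit) & 1) << (bit * d + j)
    return key

def build_index(points, num_shifts):
    """One sorted (key, point) list per shifted copy of the data set."""
    d = len(points[0])
    shifts = [tuple(random.randrange(2 ** BITS) for _ in range(d))
              for _ in range(num_shifts)]
    lists = [sorted((morton_key(tuple((x + dx) % 2 ** BITS
                                      for x, dx in zip(p, s))), p)
                    for p in points)
             for s in shifts]
    return shifts, lists

def approx_nn(q, shifts, lists):
    """Probe a few curve-neighbors of q in every shifted list; keep the closest."""
    best, best_d = None, float('inf')
    for s, lst in zip(shifts, lists):
        qk = morton_key(tuple((x + dx) % 2 ** BITS for x, dx in zip(q, s)))
        i = bisect.bisect_left(lst, (qk,))
        for j in range(max(0, i - 3), min(len(lst), i + 3)):
            p = lst[j][1]
            d2 = sum((a - b) ** 2 for a, b in zip(p, q))
            if d2 < best_d:
                best, best_d = p, d2
    return best
```

Each shifted copy gives points that are close on the curve a chance to be close in space; probing all copies is what yields the paper's approximation guarantee, which this toy does not reproduce.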
Clustering techniques for large data sets – from the past to the future
In Tutorial Notes for ACM SIGKDD 1999 International Conference on Knowledge Discovery and Data Mining, 1999
Abstract

Cited by 20 (4 self)
Application Example: Marketing. Given: a large database of customer data containing their properties and past buying records.
Evaluating continuous nearest neighbor queries for streaming time series via prefetching
In Proceedings of the International Conference on Information and Knowledge Management (ACM CIKM), 2002
Abstract

Cited by 18 (1 self)
For many applications, it is important to quickly locate the nearest neighbor of a given time series. When the given time series is a streaming one, nearest neighbors may need to be found continuously at all time positions. Such a standing request is called a continuous nearest neighbor query. This paper seeks fast evaluation of such continuous queries on large databases. The initial strategy is to use the result of one evaluation to restrict the search space for the next. A more fundamental idea is to extend the existing indexing methods, used in many traditional nearest neighbor algorithms, with prefetching. Specifically, prefetching predicts the next value of the stream before it arrives and processes the query as if the predicted value were the real one, in order to load the needed index pages and time series into the allocated cache memory. Furthermore, if the prefetched candidates cannot fit into the cache memory, they are stored in a sequential file to facilitate fast access. Experiments show that prefetching improves the response time greatly over the direct use of traditional algorithms, even when the caching provided by the operating system is taken into consideration.
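The prefetching loop can be sketched as follows. The details here are assumptions, not the paper's design: a two-point linear predictor for the next stream value, and a tiny in-memory candidate cache in place of prefetched index pages.

```python
# Minimal sketch of prefetch-based continuous NN evaluation. Assumptions
# (mine, not the paper's design): a two-point linear predictor for the next
# stream value, and a small candidate cache in place of index-page prefetch.
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_next(stream):
    """Linearly extrapolate the next value from the last two observations."""
    return 2 * stream[-1] - stream[-2] if len(stream) >= 2 else stream[-1]

def continuous_nn(stream, database, w, cache_size=3):
    """Answer a nearest-neighbor query over the last w values at every time
    position; between arrivals, prefetch candidates for the predicted window."""
    cache = list(database)          # first query has nothing prefetched
    results = []
    for t in range(w, len(stream) + 1):
        window = tuple(stream[t - w:t])
        results.append(min(cache, key=lambda s: dist(s, window)))
        # prefetch: rank candidates against the *predicted* next window
        predicted = tuple(stream[t - w + 1:t]) + (predict_next(stream[:t]),)
        cache = sorted(database, key=lambda s: dist(s, predicted))[:cache_size]
    return results
```

When the predictor is accurate the prefetched cache already contains the answer by the time the real value arrives; a real system would fall back to the full index when it is not.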
CubiST: A New Algorithm for Improving the Performance of Ad-hoc OLAP Queries
In DOLAP, 2000
Abstract

Cited by 12 (4 self)
Being able to efficiently answer arbitrary OLAP queries that aggregate along any combination of dimensions over numerical and categorical attributes has been a continued, major concern in data warehousing. In this paper, we introduce a new data structure, called Statistics Tree (ST), together with an efficient algorithm called CubiST, for evaluating ad-hoc OLAP queries on top of a relational data warehouse. We focus on a class of queries called cube queries, which generalize the data cube operator. CubiST represents a drastic departure from existing relational (ROLAP) and multidimensional (MOLAP) approaches in that it does not use the familiar view lattice to compute and materialize new views from existing views in some heuristic fashion. CubiST is the first OLAP algorithm that needs only one scan over the detailed data set and can efficiently answer any cube query without additional I/O when the ST fits into memory. We have implemented CubiST, and our experiments have demonstrated significant improvements in performance and scalability over existing ROLAP/MOLAP approaches.
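The one-scan property can be illustrated with a toy in the spirit of a Statistics Tree. This flat-dictionary simplification is mine, not the paper's actual tree layout: one pass over the detailed records maintains an aggregate for every value-or-ALL combination, after which any cube query is a single lookup.

```python
# Illustrative toy in the spirit of a Statistics Tree (a flat-dictionary
# simplification, not the paper's actual ST layout): one scan over the
# detailed records populates an aggregate for every value-or-ALL combination,
# after which any cube query is answered by a single lookup.
from itertools import product

ALL = '*'

def build_st(records):
    """records: iterable of ((v1, ..., vd), measure). One pass; each record
    updates the 2^d cells obtained by replacing any subset of values with ALL."""
    st = {}
    for values, measure in records:
        for cell in product(*[(v, ALL) for v in values]):
            st[cell] = st.get(cell, 0) + measure
    return st

def cube_query(st, spec):
    """spec: a concrete value or ALL per dimension, e.g. ('NY', ALL)."""
    return st.get(tuple(spec), 0)
```

The 2^d updates per record are the price of answering every aggregation combination without further I/O; the real ST organizes these cells as a tree to share structure across dimensions.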
On optimizing nearest neighbor queries in high-dimensional data spaces
In Proceedings of the 8th International Conference on Database Theory (ICDT), 2001
Abstract

Cited by 9 (0 self)
Nearest-neighbor queries in high-dimensional space are of high importance in various applications, especially in content-based indexing of multimedia data. For an optimization of the query processing, accurate models for estimating the query processing costs are needed. In this paper, we propose a new cost model for nearest neighbor queries in high-dimensional space, which we apply to enhance the performance of high-dimensional index structures. The model is based on new insights into effects occurring in high-dimensional space and provides a closed formula for the processing costs of nearest neighbor queries depending on the dimensionality, the block size and the database size. From the wide range of possible applications of our model, we select two interesting samples: first, we use the model to prove the known linear complexity of the nearest neighbor search problem in high-dimensional space, and second, we provide a technique for optimizing the block size. For data of medium dimensionality, the optimized block size allows significant speedups of the query processing time when compared to traditional block sizes and to the linear scan.
Efficient top-k hyperplane query processing for multimedia information retrieval
In Proceedings of the 14th ACM International Conference on Multimedia, 2006
Abstract

Cited by 5 (0 self)
A query can be answered by a binary classifier, which separates the instances that are relevant to the query from the ones that are not. When kernel methods are employed to train such a classifier, the class boundary is represented as a hyperplane in a projected space. Data instances that are farthest from the hyperplane are deemed most relevant to the query, and those that are nearest to the hyperplane most uncertain. In this paper, we address the twin problems of efficiently retrieving the approximate set of instances (a) farthest from and (b) nearest to a query hyperplane. Retrieval of instances for this hyperplane-based query scenario is mapped to the range-query problem, allowing the reuse of existing index structures. Empirical evaluation on large image datasets confirms the effectiveness of our approach.
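The two rankings the abstract describes can be sketched directly. This is the plain linear-space version, a hedged simplification: the paper operates on a hyperplane in a kernel-induced feature space and maps retrieval to range queries on an index rather than sorting all points.

```python
# Hedged sketch of ranking by distance to a separating hyperplane w.x + b = 0.
# Plain linear-space version; the paper works with a hyperplane in a
# kernel-induced space and answers the query via range queries on an index.
import math

def hyperplane_distance(w, b, x):
    """Unsigned distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def topk_farthest(points, w, b, k):
    """Most relevant instances: farthest from the class boundary."""
    return sorted(points, key=lambda x: -hyperplane_distance(w, b, x))[:k]

def topk_nearest(points, w, b, k):
    """Most uncertain instances: nearest to the class boundary."""
    return sorted(points, key=lambda x: hyperplane_distance(w, b, x))[:k]
```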
Supporting Subseries Nearest Neighbor Search via Approximation
In Proceedings of the 9th ACM CIKM International Conference on Information and Knowledge Management, 2000
Abstract

Cited by 1 (0 self)
Searching for nearest neighbors in a large set of time series is an important data mining task. This paper studies the following type of time series nearest neighbor queries: Given a query series and a starting time, among all the subseries (of a collection of data series) that have the same length as the query series and start at the given time, find the K subseries that are closest to the query series. To support such queries, the paper develops a technique that uses a fixed number of values to approximate each whole data series, and obtains the approximation of any required subseries at the query time. The paper then proposes three subseries search algorithms and compares them with the naive method that sequentially scans the whole data set, as well as a method adapted from a state-of-the-art subseries search algorithm. Experiments are conducted on both a real-life data set and a synthetic one. Results show that the proposed methods access only a small portion of the precise data and outperform the others in run time.
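The filter-then-verify pattern the abstract describes can be sketched as follows. The segment-means summary here is a PAA-style stand-in for the paper's approximation, and the candidate budget is an assumption of mine, not the paper's algorithm.

```python
# Hedged sketch: summarize subseries with a fixed number of segment means
# (a PAA-style stand-in for the paper's approximation), rank candidates on
# the coarse values, then verify only the top few against the precise data.
def segment_means(series, m):
    """Approximate a series by the means of m equal-length segments."""
    n = len(series)
    bounds = [i * n // m for i in range(m + 1)]
    return [sum(series[bounds[i]:bounds[i + 1]]) / (bounds[i + 1] - bounds[i])
            for i in range(m)]

def subseries_knn(query, series_list, start, k, m=4):
    """K closest subseries series[start:start+len(query)], found by ranking
    on segment means first and verifying a small candidate set precisely."""
    w = len(query)
    q_approx = segment_means(list(query), m)
    scored = []
    for s in series_list:
        sub = s[start:start + w]
        approx = segment_means(sub, m)
        scored.append((sum((a - b) ** 2 for a, b in zip(approx, q_approx)), sub))
    scored.sort(key=lambda t: t[0])
    top = [sub for _, sub in scored[:2 * k]]   # candidate budget: assumption
    top.sort(key=lambda sub: sum((a - b) ** 2 for a, b in zip(sub, query)))
    return top[:k]
```

The point of the paper's scheme is that the coarse values for any subseries are derived at query time from the fixed per-series summary, so only the surviving candidates touch the precise data.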
Extending High-Dimensional Indexing Techniques Pyramid and iMinMax(θ): Lessons
Abstract

Cited by 1 (0 self)
Pyramid Technique and iMinMax(θ) are two popular high-dimensional indexing approaches that map points in a high-dimensional space to a single-dimensional index. In this work, we perform the first independent experimental evaluation of the Pyramid Technique and iMinMax(θ), and discuss in detail promising extensions for testing k-Nearest Neighbor (kNN) and range queries. For datasets with skewed distributions, the parameters of these algorithms must be tuned to maintain balanced partitions. We show that, by using the medians of the distribution, we can optimize these parameters. For the Pyramid Technique, different approximate median methods for data space partitioning are experimentally compared using kNN queries. For iMinMax(θ), the default parameter setting and parameters tuned using the distribution median are experimentally compared using range queries. Also, as proposed in the iMinMax(θ) paper, we investigate the benefit of maintaining a separate parameter to account for the skewness of each dimension instead of a single parameter over all dimensions.
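One common presentation of the iMinMax(θ) single-dimensional mapping can be sketched as below; this is a simplified rendering (tie-breaking and edge handling omitted), and points are assumed normalized to [0, 1]^d.

```python
# Sketch of the iMinMax(theta) one-dimensional mapping (simplified; tie
# handling and normalization details omitted). Points assumed in [0, 1]^d.
def iminmax_key(point, theta=0.0):
    d_min = min(range(len(point)), key=lambda i: point[i])
    d_max = max(range(len(point)), key=lambda i: point[i])
    x_min, x_max = point[d_min], point[d_max]
    # theta tunes which points map to their minimum vs. maximum edge,
    # which is exactly the knob the evaluation tunes via distribution medians
    if x_min + theta < 1 - x_max:
        return d_min + x_min  # integer part: dimension; fraction: coordinate
    return d_max + x_max
```

The integer part of the key identifies a partition (a dimension's edge) and the fractional part orders points within it, so a standard B+-tree over the keys supports the range and kNN query processing discussed above.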