Results 1  10
of
536
Similarity search in high dimensions via hashing
, 1999
"... The nearest or nearneighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over highdimensional data, e.g., image dat ..."
Abstract

Cited by 461 (13 self)
 Add to MetaCart
The nearest or nearneighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over highdimensional data, e.g., image databases, document collections, timeseries databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the \curse of dimensionality. &quot; That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in kd trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than bruteforce linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should su ce for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points
PeertoPeer Information Retrieval Using SelfOrganizing Semantic Overlay Networks
, 2003
"... Contentbased fulltext search is a challenging problem in PeertoPeer (P2P) systems. Traditional approaches have either been centralized or use flooding to ensure accuracy of the results returned. In this paper, we present pSearch, a decentralized nonflooding P2P information retrieval system. pSea ..."
Abstract

Cited by 215 (7 self)
 Add to MetaCart
Contentbased fulltext search is a challenging problem in PeertoPeer (P2P) systems. Traditional approaches have either been centralized or use flooding to ensure accuracy of the results returned. In this paper, we present pSearch, a decentralized nonflooding P2P information retrieval system. pSearch distributes document indices through the P2P network based on document semantics generated by Latent Semantic Indexing (LSI). The search cost (in terms of different nodes searched and data transmitted) for a given query is thereby reduced, since the indices of semantically related documents are likely to be colocated in the network. We also describe techniques that help distribute the indices more evenly across the nodes, and further reduce the number of nodes accessed using appropriate index distribution as well as using index samples and recently processed queries to guide the search. Experiments show that pSearch can achieve performance comparable to centralized information retrieval systems by searching only a small number of nodes. For a system with 128,000 nodes and 528,543 documents (from news, magazines, etc.), pSearch searches only 19 nodes and transmits only 95.5KB data during the search, whereas the top 15 documents returned by pSearch and LSI have a 91.7% intersection.
Discovering similar multidimensional trajectories
 In ICDE
, 2002
"... We investigate techniques for analysis and retrieval of object trajectories in a two or three dimensional space. Such kind of data usually contain a great amount of noise, that makes all previously used metrics fail. Therefore, here we formalize nonmetric similarity functions based on the Longest C ..."
Abstract

Cited by 188 (6 self)
 Add to MetaCart
We investigate techniques for analysis and retrieval of object trajectories in a two or three dimensional space. Such kind of data usually contain a great amount of noise, that makes all previously used metrics fail. Therefore, here we formalize nonmetric similarity functions based on the Longest Common Subsequence (LCSS), which are very robust to noise and furthermore provide an intuitive notion of similarity between trajectories by giving more weight to the similar portions of the sequences. Stretching of sequences in time is allowed, as well as global translating of the sequences in space. Efficient approximate algorithms that compute these similarity measures are also provided. We compare these new methods to the widely used Euclidean and Time Warping distance functions (for real and synthetic data) and show the superiority of our approach, especially under the strong presence of noise. We prove a weaker version of the triangle inequality and employ it in an indexing structure to answer nearest neighbor queries. Finally, we present experimental results that validate the accuracy and efficiency of our approach. 1
3D shape histograms for similarity search and classification in spatial databases
 SSD'99
, 1999
"... Classification is one of the basic tasks of data mining in modern database applications including molecular biology, astronomy, mechanical engineering, medical imaging or meteorology. The underlying models have to consider spatial properties such as shape or extension as well as thematic attributes ..."
Abstract

Cited by 151 (10 self)
 Add to MetaCart
Classification is one of the basic tasks of data mining in modern database applications including molecular biology, astronomy, mechanical engineering, medical imaging or meteorology. The underlying models have to consider spatial properties such as shape or extension as well as thematic attributes. We introduce 3D shape histograms as an intuitive and powerful similarity model for 3D objects. Particular flexibility is provided by using quadratic form distance functions in order to account for errors of measurement, sampling, and numerical rounding that all may result in small displacements and rotations of shapes. For query processing, a general filterrefinement architecture is employed that efficiently supports similarity search based on quadratic forms. An experimental evaluation in the context of molecular biology demonstrates both, the high classification accuracy of more than 90 % and the good performance of the approach.
Continuous Nearest Neighbor Search
, 2002
"... A continuous nearest neighbor query retrieves the nearest neighbor (NN) of every point on a line segment (e.g., "find all my nearest gas stations during my route from point s to point e"). The result contains a set of <point, interval> tuples, such that point is the NN of all po ..."
Abstract

Cited by 126 (10 self)
 Add to MetaCart
A continuous nearest neighbor query retrieves the nearest neighbor (NN) of every point on a line segment (e.g., "find all my nearest gas stations during my route from point s to point e"). The result contains a set of <point, interval> tuples, such that point is the NN of all points in the corresponding interval. Existing methods for continuous nearest neighbor search are based on the repetitive application of simple NN algorithms, which incurs significant overhead. In this paper we propose techniques that solve the problem by performing a single query for the whole input segment. As a result the cost, depending on the query and dataset characteristics, may drop by orders of magnitude.
What is the Nearest Neighbor in High Dimensional Spaces?
, 2000
"... Nearest neighbor search in high dimensional spaces is an interesting and important problem which is relevant for a wide variety of novel database applications. As recent results show, however, the problem is a very difficult one, not only with regards to the performance issue but also to the quality ..."
Abstract

Cited by 120 (9 self)
 Add to MetaCart
Nearest neighbor search in high dimensional spaces is an interesting and important problem which is relevant for a wide variety of novel database applications. As recent results show, however, the problem is a very difficult one, not only with regards to the performance issue but also to the quality issue. In this paper, we discuss the quality issue and identify a new generalized notion of nearest neighbor search as the relevant problem in high dimensional space. In contrast to previous approaches, our new notion of nearest neighbor search does not treat all dimensions equally but uses a quality criterion to select relevant dimensions (projections) with respect to the given query. As an example for a useful quality criterion, we rate how well the data is clustered around the query point within the selected projection. We then propose an efficient and effective algorithm to solve the generalized nearest neighbor problem. Our experiments based on a number of real and synthetic data sets show that our new approach provides new insights into the nature of nearest neighbor search on high dimensional data.
Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces
, 2000
"... Many emerging application domains require database systems to support efficient access over highly multidimensional datasets. The current stateoftheart technique to indexing high dimensional data is to first reduce the dimensionality of the data using Principal Component Analysis and then in ..."
Abstract

Cited by 110 (1 self)
 Add to MetaCart
Many emerging application domains require database systems to support efficient access over highly multidimensional datasets. The current stateoftheart technique to indexing high dimensional data is to first reduce the dimensionality of the data using Principal Component Analysis and then indexing the reduced dimensionality space using a multidimensional index structure. The above technique, referred to as global dimensionality reduction (GDR), works well when the data set is globally correlated, i.e. most of the variation in the data can be captured by a few dimensions. In practice, datasets are often not globally correlated. In such cases, reducing the data dimensionality using GDR causes significant loss of distance information resulting in a large number of false positives and hence a high query cost. Even when a global correlation does not exist, there may exist subsets of data that are locally correlated. In this paper, we propose a technique called Local Dime...
The Hybrid Tree: An Index Structure for High Dimensional Feature Spaces
 In Proceedings of ICDE’99
, 1999
"... Feature based similarity search is emerging as an important search paradigm in database systems. The technique used is to map the data items as points into a high dimensional feature space which is indexed using a multidimensional data structure. Similarity search then corresponds to a range search ..."
Abstract

Cited by 109 (11 self)
 Add to MetaCart
Feature based similarity search is emerging as an important search paradigm in database systems. The technique used is to map the data items as points into a high dimensional feature space which is indexed using a multidimensional data structure. Similarity search then corresponds to a range search over the data structure. Although several data structures have been proposed for feature indexing, none of them is known to scale beyond 1015 dimensional spaces. This paper introduces the hybrid tree – a multidimensional data structure for indexing high dimensional feature spaces. Unlike other multidimensional data structures, the hybrid tree cannot be classified as either a pure data partitioning (DP) index structure (e.g., Rtree, SStree, SRtree) or a pure space partitioning (SP) one (e.g., KDBtree, hBtree); rather, it “combines ” positive aspects of the two types of index structures a single data structure to achieve search performance more scalable to high dimensionalities than either of the above techniques (hence, the name “hybrid”). Furthermore, unlike many data structures (e.g., distance based index structures like SStree, SRtree), the hybrid tree can support queries based on arbitrary distance functions. Our experiments on “real” high dimensional large size feature databases demonstrate that the hybrid tree scales well to high dimensionality and large database sizes. It significantly outperforms both purely DPbased and SPbased index mechanisms as well as linear scan at all dimensionalities for large sized databases. 1.
The Atree: An Index Structure for HighDimensional Spaces Using Relative Approximation
, 2000
"... We propose a novel index structure, Atree (Approximation tree), for similarity search of highdimensional data. The basic idea of the Atree is the introduction of Virtual Bounding Rectangles (VBRs), which contain and approximate MBRs and data objects. VBRs can be represented rather compactly, and ..."
Abstract

Cited by 103 (0 self)
 Add to MetaCart
We propose a novel index structure, Atree (Approximation tree), for similarity search of highdimensional data. The basic idea of the Atree is the introduction of Virtual Bounding Rectangles (VBRs), which contain and approximate MBRs and data objects. VBRs can be represented rather compactly, and thus affect the tree configuration both quantitatively and qualitatively. Firstly, since tree nodes can install large number of entries of VBRs, fanout of nodes becomes large, thus leads to fast search. More importantly, we have a free hand in arranging MBRs and VBRs in tree nodes. In the Atrees, nodes contain entries of an MBR and its children VBRs. Therefore, by fetching a node of an Atree, we can obtain the information of exact position of a parent MBR and approximate position of its children. We have performed experiments using both synthetic and real data sets. For the real data sets, the Atree outperforms the SRtree and the VAFile in all range of dimensionality up to 64 dimension, which is the highest dimension in our experiments. The Atree achieves 77.3 % (77.7%, resp.) savings in page accesses compared to the SRtree (the VAFile, resp.) for 64dimensional real data.
Robust and fast similarity search for moving object trajectories
 In Proc. ACM SIGMOD Int. Conf. on Management of Data
, 2005
"... An important consideration in similaritybased retrieval of moving object trajectories is the definition of a distance function. The existing distance functions are usually sensitive to noise, shifts and scaling of data that commonly occur due to sensor failures, errors in detection techniques, dis ..."
Abstract

Cited by 93 (14 self)
 Add to MetaCart
An important consideration in similaritybased retrieval of moving object trajectories is the definition of a distance function. The existing distance functions are usually sensitive to noise, shifts and scaling of data that commonly occur due to sensor failures, errors in detection techniques, disturbance signals, and different sampling rates. Cleaning data to eliminate these is not always possible. In this paper, we introduce a novel distance function, Edit Distance on Real sequence (EDR) which is robust against these data imperfections. Analysis and comparison of EDR with other popular distance functions, such as Euclidean distance, Dynamic Time Warping (DTW), Edit distance with Real Penalty (ERP), and Longest Common Subsequences (LCSS), indicate that EDR is more robust than Euclidean distance, DTW and ERP, and it is on average 50% more accurate than LCSS. We also develop three pruning techniques to improve the retrieval efficiency of EDR and show that these techniques can be combined effectively in a search, increasing the pruning power significantly. The experimental results confirm the superior efficiency of the combined methods. 1.