Results 1-10 of 104
Similarity search in high dimensions via hashing
, 1999
Cited by 461 (13 self)
The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points...
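The abstract above is cut off before it describes the hash functions, so the following is only a generic sketch of hashing-based approximate search, not the paper's scheme. It assumes random-hyperplane sign hashes (a swapped-in, commonly used locality-sensitive family); the names `LSHIndex`, `make_hash`, `n_tables`, and `n_bits` are hypothetical. Points that collide with the query in at least one table become candidates, and only those candidates are compared exactly.

```python
import random
from collections import defaultdict


def make_hash(dim, n_bits, rng):
    """Build one hash function: the sign pattern of n_bits random projections."""
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def h(v):
        return tuple(int(sum(p * x for p, x in zip(plane, v)) >= 0.0)
                     for plane in planes)

    return h


class LSHIndex:
    """Illustrative index: several hash tables, each keyed by a bit signature."""

    def __init__(self, dim, n_tables=8, n_bits=6, seed=0):
        rng = random.Random(seed)
        self.hashes = [make_hash(dim, n_bits, rng) for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.points = []

    def add(self, v):
        idx = len(self.points)
        self.points.append(v)
        for h, table in zip(self.hashes, self.tables):
            table[h(v)].append(idx)
        return idx

    def query(self, q):
        """Return (index, squared distance) of the best candidate, or None."""
        candidates = set()
        for h, table in zip(self.hashes, self.tables):
            candidates.update(table.get(h(q), ()))
        if not candidates:
            return None  # no collision in any table: the search is approximate

        def sqdist(i):
            return sum((a - b) ** 2 for a, b in zip(self.points[i], q))

        best = min(candidates, key=sqdist)
        return best, sqdist(best)
```

More tables raise the chance that a true neighbor collides with the query at least once; more bits per signature shrink the buckets and therefore the number of exact distance computations.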
Distance Browsing in Spatial Databases
, 1999
Cited by 315 (19 self)
Two different techniques of browsing through a collection of spatial objects stored in an R-tree spatial data structure on the basis of their distances from an arbitrary spatial query object are compared. The conventional approach is one that makes use of a k-nearest neighbor algorithm where k is known prior to the invocation of the algorithm. Thus if m > k neighbors are needed, the k-nearest neighbor algorithm needs to be reinvoked for m neighbors, thereby possibly performing some redundant computations. The second approach is incremental in the sense that having obtained the k nearest neighbors, the (k+1)st neighbor can be obtained without having to calculate the k+1 nearest neighbors from scratch. The incremental approach finds use when processing complex queries where one of the conditions involves spatial proximity (e.g., the nearest city to Chicago with population greater than a million), in which case a query engine can make use of a pipelined strategy. A general incremental nearest neighbor algorithm is presented that is applicable to a large class of hierarchical spatial data structures. This algorithm is adapted to the R-tree and its performance is compared to an existing k-nearest neighbor algorithm for R-trees [45]. Experiments show that the incremental nearest neighbor algorithm significantly outperforms the k-nearest neighbor algorithm for distance browsing queries in a spatial database that uses the R-tree as a spatial index. Moreover, the incremental nearest neighbor algorithm also usually outperforms the k-nearest neighbor algorithm when applied to the k-nearest neighbor problem for the R-tree, although the improvement is not nearly as large as for distance browsing queries. In fact, we prove informally that, at any step in its execution, the incremental...
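The incremental idea can be sketched with a single priority queue that holds both tree nodes and points, keyed by distance to the query. This is a minimal illustration under stated assumptions, not the paper's R-tree implementation: the hierarchy here is a simple median-split tree of bounding boxes standing in for an R-tree, and `Node` and `incremental_nn` are hypothetical names.

```python
import heapq
import itertools
import math


class Node:
    """Bounding-box tree built by median splits (a stand-in for an R-tree)."""

    def __init__(self, points, leaf_size=4, depth=0):
        dim = len(points[0])
        self.lo = [min(p[d] for p in points) for d in range(dim)]
        self.hi = [max(p[d] for p in points) for d in range(dim)]
        self.points, self.children = [], []
        if len(points) <= leaf_size:
            self.points = list(points)
        else:
            axis = depth % dim
            pts = sorted(points, key=lambda p: p[axis])
            mid = len(pts) // 2
            self.children = [Node(pts[:mid], leaf_size, depth + 1),
                             Node(pts[mid:], leaf_size, depth + 1)]

    def min_dist(self, q):
        """Distance from q to the closest location inside this node's box."""
        nearest = [min(max(x, l), h) for x, l, h in zip(q, self.lo, self.hi)]
        return math.dist(nearest, q)


def incremental_nn(root, q):
    """Lazily yield (distance, point) pairs in non-decreasing distance order."""
    tie = itertools.count()  # unique tie-breaker so Node objects are never compared
    heap = [(root.min_dist(q), next(tie), root, None)]
    while heap:
        dist, _, node, point = heapq.heappop(heap)
        if point is not None:
            yield dist, point  # nothing left in the queue can be closer
            continue
        for p in node.points:
            heapq.heappush(heap, (math.dist(p, q), next(tie), None, p))
        for child in node.children:
            heapq.heappush(heap, (child.min_dist(q), next(tie), child, None))
```

Because the generator keeps its queue between calls, a consumer can pull neighbors one at a time and stop as soon as a condition is met, for example `next(p for d, p in incremental_nn(root, q) if is_large_city(p))` with a hypothetical predicate `is_large_city`. This is the pipelined use the abstract describes.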
Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases
 In Proceedings of the ACM SIGMOD Conference on Management of Data
, 2002
Cited by 252 (28 self)
Similarity search in large time series databases has attracted much research interest recently. It is a difficult problem because of the typically high dimensionality of the data. The most promising solutions involve performing dimensionality reduction on the data, then indexing the reduced data with a multidimensional index structure. Many dimensionality reduction techniques have been proposed, including Singular Value Decomposition (SVD), the Discrete Fourier Transform (DFT), and the Discrete Wavelet Transform (DWT). In this work we introduce a new dimensionality reduction technique which we call Adaptive Piecewise Constant Approximation (APCA). While previous techniques (e.g., SVD, DFT and DWT) choose a common representation for all the items in the database that minimizes the global reconstruction error, APCA approximates each time series by a set of constant value segments of varying lengths such that their individual reconstruction errors are minimal. We show how APCA can be indexed using a multidimensional index structure. We propose two distance measures in the indexed space that exploit the high fidelity of APCA for fast searching: a lower-bounding Euclidean distance approximation, and a non-lower-bounding but very tight Euclidean distance approximation, and show how they can support fast exact searching and even faster approximate searching on the same index structure. We theoretically and empirically compare APCA to all the other techniques and demonstrate its superiority.
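To make "constant value segments of varying lengths" concrete, here is a minimal sketch that builds such a representation by greedy bottom-up merging. The abstract does not say how the segments are computed, so this merge strategy is an assumption and is not presented as the paper's algorithm. The function name `apca_greedy` and the output format (segment mean, index of the segment's last sample) are choices made for this illustration.

```python
def apca_greedy(series, n_segments):
    """Approximate `series` with n_segments constant pieces of varying length.

    Starts with one segment per sample and repeatedly merges the adjacent
    pair whose merge adds the least squared reconstruction error.
    Returns a list of (mean, right_endpoint_index) pairs.
    """
    # Each segment is [start, end_exclusive, sum, sum_of_squares].
    segs = [[i, i + 1, x, x * x] for i, x in enumerate(series)]

    def sse(s):
        # Squared error of approximating the segment by its mean.
        n = s[1] - s[0]
        return s[3] - s[2] * s[2] / n

    def merge(a, b):
        return [a[0], b[1], a[2] + b[2], a[3] + b[3]]

    while len(segs) > n_segments:
        best_i, best_cost = 0, None
        for i in range(len(segs) - 1):
            a, b = segs[i], segs[i + 1]
            cost = sse(merge(a, b)) - sse(a) - sse(b)
            if best_cost is None or cost < best_cost:
                best_i, best_cost = i, cost
        segs[best_i:best_i + 2] = [merge(segs[best_i], segs[best_i + 1])]

    return [(s[2] / (s[1] - s[0]), s[1] - 1) for s in segs]
```

The scan over all adjacent pairs makes this quadratic in the series length, which is acceptable for a demonstration but not for large databases. Unlike a fixed-frame scheme, flat stretches end up as long segments and busy stretches as short ones.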
Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases
, 2000
Cited by 174 (18 self)
The problem of similarity search in large time series databases has attracted much attention recently. It is a nontrivial problem because of the inherent high dimensionality of the data. The most promising solutions involve first performing dimensionality reduction on the data, and then indexing the reduced data with a spatial access method. Three major dimensionality reduction techniques have been proposed, Singular Value Decomposition (SVD), the Discrete Fourier transform (DFT), and more recently the Discrete Wavelet Transform (DWT). In this work we introduce a new dimensionality reduction technique which we call Piecewise Aggregate Approximation (PAA). We theoretically and empirically compare it to the other techniques and demonstrate its superiority. In addition to being competitive with or faster than the other methods, our approach has numerous other advantages. It is simple to understand and to implement, it allows more flexible distance measures, including weighted Euclidean queries, and the index can be built in linear time.
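The abstract names PAA without defining it. The sketch below assumes the usual description of the technique: split a length-n series into N equal-length frames and keep the mean of each frame. The distance function assumes the commonly stated lower bound, sqrt(n/N) times the Euclidean distance between the two reduced vectors. The function names `paa` and `paa_distance` are hypothetical.

```python
import math


def paa(series, n_segments):
    """Reduce `series` to n_segments frame means (equal-length frames only)."""
    n = len(series)
    if n % n_segments != 0:
        raise ValueError("this sketch assumes len(series) is a multiple of n_segments")
    w = n // n_segments
    return [sum(series[i:i + w]) / w for i in range(0, n, w)]


def paa_distance(a_paa, b_paa, n):
    """Distance in the reduced space, scaled so it never exceeds the true
    Euclidean distance between the original length-n series."""
    frame_len = n / len(a_paa)
    return math.sqrt(frame_len * sum((x - y) ** 2 for x, y in zip(a_paa, b_paa)))
```

A single pass over the data is enough to compute every frame mean, which is consistent with the abstract's claim that the index can be built in linear time. The lower-bound property is what lets an index discard candidates without ever missing a true match.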
Finding Generalized Projected Clusters in High Dimensional Spaces
Cited by 152 (8 self)
High dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that in high dimensional data, even the concept of proximity or clustering may not be meaningful. We discuss very general techniques for projected clustering which are able to construct clusters in arbitrarily aligned subspaces of lower dimensionality. The subspaces are specific to the clusters themselves. This definition is substantially more general and realistic than currently available techniques which limit the method to only projections from the original set of attributes. The generalized projected clustering technique may also be viewed as a way of trying to redefine clustering for high dimensional applications by searching for hidden subspaces with clusters which are created by inter-attribute correlations. We provide a new concept of using extended cluster feature vectors in order to make the algorithm scalable for very large databases. The running time and space requirements of the algorithm are adjustable, and are likely to trade off with better accuracy.
Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces
, 2000
Cited by 110 (1 self)
Many emerging application domains require database systems to support efficient access over highly multidimensional datasets. The current state-of-the-art technique for indexing high-dimensional data is to first reduce the dimensionality of the data using Principal Component Analysis and then index the reduced-dimensionality space using a multidimensional index structure. The above technique, referred to as global dimensionality reduction (GDR), works well when the data set is globally correlated, i.e., most of the variation in the data can be captured by a few dimensions. In practice, datasets are often not globally correlated. In such cases, reducing the data dimensionality using GDR causes significant loss of distance information, resulting in a large number of false positives and hence a high query cost. Even when a global correlation does not exist, there may exist subsets of data that are locally correlated. In this paper, we propose a technique called Local Dime...
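The abstract is truncated before the local technique is described, so the sketch below covers only the global baseline it criticizes: one PCA fitted to the whole dataset. To stay dependency-free it finds just the first principal component by power iteration; this method, and the names `top_component` and `project`, are assumptions made for this example.

```python
import math


def top_component(rows, n_iter=100):
    """Return (mean, unit vector) for the direction of largest variance."""
    dim = len(rows[0])
    mean = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
    centered = [[r[d] - mean[d] for d in range(dim)] for r in rows]
    v = [1.0 / math.sqrt(dim)] * dim  # starting guess; must not be orthogonal to the answer
    for _ in range(n_iter):
        # One multiplication by X^T X, done without forming the covariance matrix.
        scores = [sum(c * x for c, x in zip(row, v)) for row in centered]
        w = [sum(s * row[d] for s, row in zip(scores, centered)) for d in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate data or unlucky start; keep the previous v
        v = [x / norm for x in w]
    return mean, v


def project(row, mean, v):
    """Coordinate of `row` along the component: its 1-D reduced representation."""
    return sum((x - m) * c for x, m, c in zip(row, mean, v))
```

When the points lie close to a single line, the one retained coordinate preserves distances almost exactly. When different subsets of the data vary along different directions, one global component discards much of the distance information, which is the failure mode the abstract attributes to GDR.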
Emergent Semantics Through Interaction in Image Databases
 IEEE Transactions on Knowledge and Data Engineering
, 2001
Cited by 73 (8 self)
In this paper we discuss briefly some aspects of image semantics and the role that it plays for the design of Image Databases. We argue that images don't have an intrinsic meaning, but that they are endowed with a meaning by placing them in the context of other images and by the user interaction. From this observation, we conclude that in an image database users should be allowed to manipulate not only the individual images, but also the relation between them. We present an interface model based on the manipulation of configurations of images. 1 Introduction In this paper we propose some new ideas on image semantics, and study some of their consequences on the interaction with, and the organization of, image databases. Many current Content Based Image Retrieval (CBIR) systems follow a semantic model derived from traditional databases according to which the meaning of a record is a compositional function of its syntactic structure and of the meaning of its elementary constituents. W...
Generalized Low Rank Approximations of Matrices
 MACHINE LEARNING
, 2004
Cited by 64 (6 self)
We consider the problem of computing low rank approximations of matrices. The novelty of our approach is that the low rank approximations are on a sequence of matrices. Unlike the
Similarity search over time series data using wavelets
 In ICDE
, 2002
Cited by 63 (0 self)
We consider the use of wavelet transformations as a dimensionality reduction technique to permit efficient similarity search over high-dimensional time-series data. While numerous transformations have been proposed and studied, the only wavelet that has been shown to be effective for this application is the Haar wavelet. In this work, we observe that a large class of wavelet transformations (not only orthonormal wavelets but also biorthonormal wavelets) can be used to support similarity search. This class includes the most popular and most effective wavelets being used in image compression. We present a detailed performance study of the effects of using different wavelets on the performance of similarity search for time-series data. We include several wavelets that outperform both the Haar wavelet and the best known non-wavelet transformations for this application. To ensure our results are usable by an application engineer, we also show how to configure an indexing strategy for the best performing transformations. Finally, we identify classes of data that can be indexed efficiently using these wavelet transformations.
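As a reference point for the baseline the abstract mentions, here is a minimal sketch of the orthonormal Haar transform. It covers only Haar, not the other wavelets the paper studies. The function name `haar_transform` and the coarse-to-fine output ordering are choices made for this example, and the input length is assumed to be a power of two.

```python
import math


def haar_transform(series):
    """Orthonormal Haar wavelet transform, coefficients ordered coarse to fine."""
    n = len(series)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two length")
    out = [float(x) for x in series]
    length = n
    root2 = math.sqrt(2.0)
    while length > 1:
        half = length // 2
        # Pairwise scaled sums (smooth part) and differences (detail part).
        smooth = [(out[2 * i] + out[2 * i + 1]) / root2 for i in range(half)]
        detail = [(out[2 * i] - out[2 * i + 1]) / root2 for i in range(half)]
        out[:length] = smooth + detail
        length = half  # recurse on the smooth half only
    return out
```

Because the transform is orthonormal, Euclidean distances between full coefficient vectors equal those between the original series. Keeping only the first few coefficients can therefore only shrink the distance, which is what lets a reduced index filter candidates without false dismissals.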
Cluster-Preserving Embedding of Proteins
, 1999
Cited by 53 (0 self)
Similarity searching in protein sequence databases is a standard technique for biologists dealing with a newly sequenced protein. Exhaustive search in such databases is prohibitive because of the large sizes of these databases and because pairwise comparisons are slow. Heuristic techniques, such as FASTA and BLAST, are useful because they are fast and accurate, though it has been shown that exhaustive search is more accurate. Therefore, there are times when one would like to perform an exhaustive search.