Results 1  10
of
78
Searching in metric spaces
, 2001
"... The problem of searching the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather gen ..."
Abstract

Cited by 436 (38 self)
 Add to MetaCart
The problem of searching the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. Many solutions have been proposed in different areas, in many cases without crossknowledge. Because of this, the same ideas have been reconceived several times, and very different presentations have been given for the same approaches. We present some basic results that explain the intrinsic difficulty of the search problem. This includes a quantitative definition of the elusive concept of “intrinsic dimensionality. ” We also present a unified
Robust and efficient fuzzy match for online data cleaning
 In SIGMOD
, 2003
"... To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the p ..."
Abstract

Cited by 193 (9 self)
 Add to MetaCart
To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the prerecorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets. 1.
Indexdriven similarity search in metric spaces
 ACM Transactions on Database Systems
, 2003
"... Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search th ..."
Abstract

Cited by 192 (8 self)
 Add to MetaCart
Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search that make the general assumption that similarity is represented with a distance metric d. Existing methods for handling similarity search in this setting typically fall into one of two classes. The first directly indexes the objects based on distances (distancebased indexing), while the second is based on mapping to a vector space (mappingbased approach). The main part of this article is dedicated to a survey of distancebased indexing methods, but we also briefly outline how search occurs in mappingbased methods. We also present a general framework for performing search based on distances, and present algorithms for common types of queries that operate on an arbitrary “search hierarchy. ” These algorithms can be applied on each of the methods presented, provided a suitable search hierarchy is defined.
Nearestneighbor searching and metric space dimensions
 In NearestNeighbor Methods for Learning and Vision: Theory and Practice
, 2006
"... Given a set S of n sites (points), and a distance measure d, the nearest neighbor searching problem is to build a data structure so that given a query point q, the site nearest to q can be found quickly. This paper gives a data structure for this problem; the data structure is built using the distan ..."
Abstract

Cited by 107 (0 self)
 Add to MetaCart
(Show Context)
Given a set S of n sites (points), and a distance measure d, the nearest neighbor searching problem is to build a data structure so that given a query point q, the site nearest to q can be found quickly. This paper gives a data structure for this problem; the data structure is built using the distance function as a “black box”. The structure is able to speed up nearest neighbor searching in a variety of settings, for example: points in lowdimensional or structured Euclidean space, strings under Hamming and edit distance, and bit vector data from an OCR application. The data structures are observed to need linear space, with a modest constant factor. The preprocessing time needed per site is observed to match the query time. The data structure can be viewed as an application of a “kdtree ” approach in the metric space setting, using Voronoi regions of a subset in place of axisaligned boxes. 1
Robust Identification of Fuzzy Duplicates
 In ICDE
, 2005
"... Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples which represent the same realworld entity. We propose two novel criteria that enable characterization of fuzzy duplicates more a ..."
Abstract

Cited by 74 (0 self)
 Add to MetaCart
(Show Context)
Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples which represent the same realworld entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm. 1.
A compact space decomposition for effective metric indexing
 Pattern Recognition Letters
, 2005
"... Abstract The metric space model abstracts many proximity search problems, from nearestneighborclassifiers to textual and multimedia information retrieval. In this context, an index is a data structure that speeds up proximity queries. However, indexes lose their efficiency as the intrinsicdata dime ..."
Abstract

Cited by 41 (8 self)
 Add to MetaCart
(Show Context)
Abstract The metric space model abstracts many proximity search problems, from nearestneighborclassifiers to textual and multimedia information retrieval. In this context, an index is a data structure that speeds up proximity queries. However, indexes lose their efficiency as the intrinsicdata dimensionality increases. In this paper we present a simple index called list of clusters (LC), which is based on a compact partitioning of the data set. The LC is shown to require little space,to be suitable both for main and secondary memory implementations, and most importantly, to be very resistant to the intrinsic dimensionality of the data set. In this aspect our structure isunbeaten. We finish with a discussion of the role of unbalancing in metric space searching, and how it permits trading memory space for construction time. 1 Introduction The problem of proximity searching has received much attention in recent times, due to an increasing interest in manipulating and retrieving the more and more common multimedia data. Multimedia data have to be classified, forecasted, filtered, organized, and so on. Their manipulation poses new challenges to classifiers and function approximators. The wellknown knearest neighbor (knn) classifier is a favorite candidate for this task for being simple enough and well understood. One of the main obstacles, however, of using this classifier for massive data classification is its linear complexity to find a set of k neighbors for a given query.
Effective Proximity Retrieval by Ordering Permutations
, 2007
"... We introduce a new probabilistic proximity search algorithm for range and Knearest neighbor (KNN) searching in both coordinate and metric spaces. Although there exist solutions for these problems, they boil down to a linear scan when the space is intrinsically highdimensional, as is the case in m ..."
Abstract

Cited by 34 (6 self)
 Add to MetaCart
We introduce a new probabilistic proximity search algorithm for range and Knearest neighbor (KNN) searching in both coordinate and metric spaces. Although there exist solutions for these problems, they boil down to a linear scan when the space is intrinsically highdimensional, as is the case in many pattern recognition tasks. This, for example, renders the KNN approach to classification rather slow in large databases. Our novel idea is to predict closeness between elements according to how they order their distances towards a distinguished set of anchor objects. Each element in the space sorts the anchor objects from closest to farthest to it, and the similarity between orders turns out to be an excellent predictor of the closeness between the corresponding elements. We present extensive experiments comparing our method against stateoftheart exact and approximate techniques, both in synthetic and real, metric and nonmetric databases, measuring both CPU time and distance computations. The experiments demonstrate that our technique almost always improves upon the performance of alternative techniques, in some cases by a wide margin.
Techniques for Similarity Searching in Multimedia Databases
, 2010
"... Techniques for similarity searching in multimedia databases are reviewed. This includes a discussion of the curse of dimensionality, as well as multidimensional indexing, distancebased indexing, and the actual search process which is realized by nearest neighbor finding. ..."
Abstract

Cited by 27 (3 self)
 Add to MetaCart
Techniques for similarity searching in multimedia databases are reviewed. This includes a discussion of the curse of dimensionality, as well as multidimensional indexing, distancebased indexing, and the actual search process which is realized by nearest neighbor finding.
Fully Dynamic Spatial Approximation Trees
 In Proceedings of the 9th International Symposium on String Processing and Information Retrieval (SPIRE 2002), LNCS 2476
, 2002
"... The Spatial Approximation Tree (satree) is a recently proposed data structure for searching in metric spaces. It has been shown that it compares favorably against alternative data structures in spaces of high dimension or queries with low selectivity. Its main drawbacks are: costly construction ..."
Abstract

Cited by 27 (13 self)
 Add to MetaCart
(Show Context)
The Spatial Approximation Tree (satree) is a recently proposed data structure for searching in metric spaces. It has been shown that it compares favorably against alternative data structures in spaces of high dimension or queries with low selectivity. Its main drawbacks are: costly construction time, poor performance in low dimensional spaces or queries with high selectivity, and the fact of being a static data structure, that is, once built, one cannot add or delete elements.
Fixed Queries Array: A Fast and Economical Data Structure for Proximity Searching
, 2001
"... . Pivotbased algorithms are effective tools for proximity searching in metric spaces. They allow trading space overhead for number of distance evaluations performed at query time. With additional search structures (that pose extra space overhead) they can also reduce the amount of side computations ..."
Abstract

Cited by 25 (10 self)
 Add to MetaCart
. Pivotbased algorithms are effective tools for proximity searching in metric spaces. They allow trading space overhead for number of distance evaluations performed at query time. With additional search structures (that pose extra space overhead) they can also reduce the amount of side computations. We introduce a new data structure, the Fixed Queries Array (FQA), whose novelties are (1) it permits sublinear extra CPU time without any extra data structure; (2) it permits trading number of pivots for their precision so as to make better use of the available memory. We show experimentally that the FQA is an efficient tool to search in metric spaces and that it compares favorably against other state of the art approaches. Its simplicity converts it into a simple yet effective tool for practitioners seeking for a blackbox method to plug in their applications. Keywords: Metric spaces, similarity search, range search, fixed queries tree. 1.