Results 1  10
of
374
A Quantitative Analysis and Performance Study for SimilaritySearch Methods in HighDimensional Spaces
, 1998
"... For similarity search in highdimensional vector spaces (or `HDVSs'), researchers have proposed a number of new methods (or adaptations of existing methods) based, in the main, on dataspace partitioning. However, the performance of these methods generally degrades as dimensionality increases. Altho ..."
Abstract

Cited by 487 (12 self)
 Add to MetaCart
For similarity search in highdimensional vector spaces (or `HDVSs'), researchers have proposed a number of new methods (or adaptations of existing methods) based, in the main, on dataspace partitioning. However, the performance of these methods generally degrades as dimensionality increases. Although this phenomenonknown as the `dimensional curse'is well known, little or no quantitative analysis of the phenomenon is available. In this paper, we provide a detailed analysis of partitioning and clustering techniques for similarity search in HDVSs. We show formally that these methods exhibit linear complexity at high dimensionality, and that existing methods are outperformed on average by a simple sequential scan if the number of dimensions exceeds around 10. Consequently, we come up with an alternative organization based on approximations to make the unavoidable sequential scan as fast as possible. We describe a simple vector approximation scheme, called VAfile, and report on an ...
OPTICS: Ordering Points To Identify the Clustering Structure
, 1999
"... Cluster analysis is a primary method for database mining. It is either used as a standalone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a preprocessing step for other algorithms operating on the detected clusters. Almost all of ..."
Abstract

Cited by 340 (45 self)
 Add to MetaCart
Cluster analysis is a primary method for database mining. It is either used as a standalone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a preprocessing step for other algorithms operating on the detected clusters. Almost all of the wellknown clustering algorithms require input parameters which are hard to determine but have a significant influence on the clustering result. Furthermore, for many realdata sets there does not even exist a global parameter setting for which the result of the clustering algorithm describes the intrinsic clustering structure accurately. We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly; but instead creates an augmented ordering of the database representing its densitybased clustering structure. This clusterordering contains information which is equivalent to the densitybased clusterings corresponding to a broad range of parameter settings. It is a versatile basis for both automatic and interactive cluster analysis. We show how to automatically and efficiently extract not only ‘traditional ’ clustering information (e.g. representative points, arbitrary shaped clusters), but also the intrinsic clustering structure. For medium sized data sets, the clusterordering can be represented graphically and for very large data sets, we introduce an appropriate visualization technique. Both are suitable for interactive exploration of the intrinsic clustering structure offering additional insights into the distribution and correlation of the data.
Searching in Metric Spaces
, 1999
"... The problem of searching the elements of a set which are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather ge ..."
Abstract

Cited by 321 (34 self)
 Add to MetaCart
The problem of searching the elements of a set which are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. A large number of solutions have been proposed in different areas, in many cases without crossknowledge. Because of this, the same ideas have been reinvented several times, and very different presentations have been given for the same approaches. We
Distance Browsing in Spatial Databases
, 1999
"... Two different techniques of browsing through a collection of spatial objects stored in an Rtree spatial data structure on the basis of their distances from an arbitrary spatial query object are compared. The conventional approach is one that makes use of a knearest neighbor algorithm where k is kn ..."
Abstract

Cited by 291 (19 self)
 Add to MetaCart
Two different techniques of browsing through a collection of spatial objects stored in an Rtree spatial data structure on the basis of their distances from an arbitrary spatial query object are compared. The conventional approach is one that makes use of a knearest neighbor algorithm where k is known prior to the invocation of the algorithm. Thus if m#kneighbors are needed, the knearest neighbor algorithm needs to be reinvoked for m neighbors, thereby possibly performing some redundant computations. The second approach is incremental in the sense that having obtained the k nearest neighbors, the k +1 st neighbor can be obtained without having to calculate the k +1nearest neighbors from scratch. The incremental approach finds use when processing complex queries where one of the conditions involves spatial proximity (e.g., the nearest city to Chicago with population greater than a million), in which case a query engine can make use of a pipelined strategy. A general incremental nearest neighbor algorithm is presented that is applicable to a large class of hierarchical spatial data structures. This algorithm is adapted to the Rtree and its performance is compared to an existing knearest neighbor algorithm for Rtrees [45]. Experiments show that the incremental nearest neighbor algorithm significantly outperforms the knearest neighbor algorithm for distance browsing queries in a spatial database that uses the Rtree as a spatial index. Moreover, the incremental nearest neighbor algorithm also usually outperforms the knearest neighbor algorithm when applied to the knearest neighbor problem for the Rtree, although the improvement is not nearly as large as for distance browsing queries. In fact, we prove informally that, at any step in its execution, the incremental...
Image retrieval: ideas, influences, and trends of the new age
 ACM COMPUTING SURVEYS
, 2008
"... We have witnessed great interest and a wealth of promise in contentbased image retrieval as an emerging technology. While the last decade laid foundation to such promise, it also paved the way for a large number of new techniques and systems, got many new people involved, and triggered stronger ass ..."
Abstract

Cited by 270 (8 self)
 Add to MetaCart
We have witnessed great interest and a wealth of promise in contentbased image retrieval as an emerging technology. While the last decade laid foundation to such promise, it also paved the way for a large number of new techniques and systems, got many new people involved, and triggered stronger association of weakly related fields. In this article, we survey almost 300 key theoretical and empirical contributions in the current decade related to image retrieval and automatic image annotation, and in the process discuss the spawning of related subfields. We also discuss significant challenges involved in the adaptation of existing image retrieval techniques to build systems that can be useful in the real world. In retrospect of what has been achieved so far, we also conjecture what the future may hold for image retrieval research.
Optimal MultiStep kNearest Neighbor Search
, 1998
"... For an increasing number of modern database applications, efficient support of similarity search becomes an important task. Along with the complexity of the objects such as images, molecules and mechanical parts, also the complexity of the similarity models increases more and more. Whereas algorithm ..."
Abstract

Cited by 166 (19 self)
 Add to MetaCart
For an increasing number of modern database applications, efficient support of similarity search becomes an important task. Along with the complexity of the objects such as images, molecules and mechanical parts, also the complexity of the similarity models increases more and more. Whereas algorithms that are directly based on indexes work well for simple mediumdimensional similarity distance functions, they do not meet the efficiency requirements of complex highdimensional and adaptable distance functions. The use of a multistep query processing strategy is recommended in these cases, and our investigations substantiate that the number of candidates which are produced in the filter step and exactly evaluated in the refinement step is a fundamental efficiency parameter. After revealing the strong performance shortcomings of the stateoftheart algorithm for knearest neighbor search [Korn et al. 1996], we present a novel multistep algorithm which is guaranteed to produce the minim...
Robust and efficient fuzzy match for online data cleaning
 In SIGMOD
, 2003
"... To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the p ..."
Abstract

Cited by 153 (7 self)
 Add to MetaCart
To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the prerecorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets. 1.
Indexdriven similarity search in metric spaces
 ACM Transactions on Database Systems
, 2003
"... Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search th ..."
Abstract

Cited by 133 (6 self)
 Add to MetaCart
Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search that make the general assumption that similarity is represented with a distance metric d. Existing methods for handling similarity search in this setting typically fall into one of two classes. The first directly indexes the objects based on distances (distancebased indexing), while the second is based on mapping to a vector space (mappingbased approach). The main part of this article is dedicated to a survey of distancebased indexing methods, but we also briefly outline how search occurs in mappingbased methods. We also present a general framework for performing search based on distances, and present algorithms for common types of queries that operate on an arbitrary “search hierarchy. ” These algorithms can be applied on each of the methods presented, provided a suitable search hierarchy is defined.
Incremental Clustering for Mining in a Data Warehousing Environment
 PROC. 24TH INT. CONF. VERY LARGE DATA BASES, VLDB
, 1998
"... Data warehouses provide a great deal of opportunities for performing data mining tasks such as classification and clustering. Typically, updates are collected and applied to the data warehouse periodically in a batch mode, e.g., during the night. Then, all patterns derived from the warehouse by some ..."
Abstract

Cited by 109 (7 self)
 Add to MetaCart
Data warehouses provide a great deal of opportunities for performing data mining tasks such as classification and clustering. Typically, updates are collected and applied to the data warehouse periodically in a batch mode, e.g., during the night. Then, all patterns derived from the warehouse by some data mining algorithm have to be updated as well. Due to the very large size of the databases, it is highly desirable to perform these updates incrementally. In this paper, we present the first incremental clustering algorithm. Our algorithm is based on the clustering algorithm DBSCAN which is applicable to any database containing data from a metric space, e.g., to a spatial database or to a WWWlog database. Due to the densitybased nature of DBSCAN, the insertion or deletion of an object affects the current clustering only in the neighborhood of this object. Thus, efficient algorithms can be given for incremental insertions and deletions to an existing clustering. Based on the formal definition of clusters, it can be proven that the incremental algorithm yields the same result as DBSCAN. A performance evaluation of IncrementalDBSCAN on a spatial database as well as on a WWWlog database is presented, demonstrating the efficiency of the proposed algorithm. IncrementalDBSCAN yields significant speedup factors over DBSCAN even for large numbers of daily updates in a data warehouse.
VoronoiBased K Nearest Neighbor Search for Spatial Network Databases
 In VLDB
, 2004
"... A frequent type of query in spatial networks (e.g., road networks) is to find the K nearest neighbors (KNN) of a given query object. ..."
Abstract

Cited by 102 (15 self)
 Add to MetaCart
A frequent type of query in spatial networks (e.g., road networks) is to find the K nearest neighbors (KNN) of a given query object.