Results 1–10 of 34
What is the Nearest Neighbor in High Dimensional Spaces?
, 2000
Abstract

Cited by 135 (10 self)
Nearest neighbor search in high dimensional spaces is an interesting and important problem which is relevant for a wide variety of novel database applications. As recent results show, however, the problem is a very difficult one, not only with regard to performance but also to the quality of the results. In this paper, we discuss the quality issue and identify a new generalized notion of nearest neighbor search as the relevant problem in high dimensional space. In contrast to previous approaches, our new notion of nearest neighbor search does not treat all dimensions equally but uses a quality criterion to select relevant dimensions (projections) with respect to the given query. As an example of a useful quality criterion, we rate how well the data is clustered around the query point within the selected projection. We then propose an efficient and effective algorithm to solve the generalized nearest neighbor problem. Our experiments, based on a number of real and synthetic data sets, show that our new approach provides new insights into the nature of nearest neighbor search on high dimensional data.
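The dimension-selection idea in this abstract can be sketched with a toy quality criterion: rank each dimension by how tightly the data concentrates around the query coordinate, keep the best few, and run an exact NN search in that projection. This is only a hedged illustration, not the paper's actual algorithm; `select_relevant_dims`, the mean-absolute-deviation criterion, and the sample data are all made up for the sketch.

```python
import math

def select_relevant_dims(points, query, num_dims):
    """Rank dimensions by how tightly the data clusters around the
    query coordinate (smaller mean absolute deviation = more
    relevant) and keep the best num_dims."""
    d = len(query)
    spread = []
    for j in range(d):
        mad = sum(abs(p[j] - query[j]) for p in points) / len(points)
        spread.append((mad, j))
    spread.sort()
    return [j for _, j in spread[:num_dims]]

def projected_nn(points, query, dims):
    """Exact 1-NN, but measured only in the selected projection."""
    def dist(p):
        return math.sqrt(sum((p[j] - query[j]) ** 2 for j in dims))
    return min(range(len(points)), key=lambda i: dist(points[i]))

points = [
    [0.1, 5.0, 0.2],    # close to the query in dims 0 and 2
    [0.0, -5.0, 0.1],
    [3.0, 0.0, 3.0],
]
query = [0.0, 0.0, 0.0]
dims = select_relevant_dims(points, query, 2)   # picks dims 0 and 2
best = projected_nn(points, query, dims)
```

In this toy data, dimension 1 has high spread around the query and is dropped; the NN is then decided by the two "clustered" dimensions only.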
Efficient discovery of error-tolerant frequent itemsets in high dimensions
 In SIGKDD 2001
, 2001
Abstract

Cited by 63 (0 self)
We present a generalization of frequent itemsets allowing for the notion of errors in the itemset definition. We motivate the problem and present an efficient algorithm that identifies error-tolerant frequent clusters of items in transactional data (customer-purchase data, web browsing data, text, etc.). The algorithm exploits sparseness of the underlying data to find large groups of items that are correlated over database records (rows). The notion of transaction coverage allows us to extend the algorithm and view it as a fast clustering algorithm for discovering segments of similar transactions in binary sparse data. We evaluate the new algorithm on three real-world applications: clustering high-dimensional data, query selectivity estimation, and collaborative filtering. Results show that the algorithm consistently uncovers structure in large sparse databases that other traditional clustering algorithms fail to find.
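The error-tolerance idea can be sketched under a deliberately simplified definition (a row "supports" an itemset if it contains at least a (1 - eps) fraction of its items); the paper's actual weak/strong ETI definitions differ in detail, and the function and data below are hypothetical.

```python
def is_error_tolerant_itemset(rows, itemset, min_support, eps):
    """Simplified check: count rows containing at least a (1 - eps)
    fraction of the itemset's items, and compare against min_support."""
    support = 0
    for row in rows:
        present = sum(1 for item in itemset if item in row)
        if present >= (1 - eps) * len(itemset):
            support += 1
    return support >= min_support

# Toy transactions (sets of item ids); the third row misses item 'c'.
rows = [{'a', 'b', 'c'}, {'a', 'b', 'c'}, {'a', 'b', 'd'}, {'x', 'y'}]
ok = is_error_tolerant_itemset(rows, ['a', 'b', 'c'], min_support=3, eps=0.4)
strict = is_error_tolerant_itemset(rows, ['a', 'b', 'c'], min_support=3, eps=0.0)
```

With eps = 0.4 the imperfect third row still counts, so {a, b, c} reaches support 3; with eps = 0 (classical frequent itemsets) it does not.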
Scaling EM (Expectation-Maximization) Clustering to Large Databases
, 1999
Abstract

Cited by 52 (1 self)
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
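The decomposition the authors mention can be illustrated for a one-dimensional Gaussian mixture: the E-step only ever adds to three per-component accumulators (responsibility totals and weighted sums of x and x²), so a scan can consume the data chunk by chunk without holding it in memory. A hedged sketch of one EM iteration under that observation; the paper's method additionally compresses regions of data, which is not shown here.

```python
import math

def gauss(x, mu, var):
    """Univariate normal density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step_chunked(chunks, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture, accumulating only
    sufficient statistics per component, one chunk at a time."""
    k = len(means)
    n = [0.0] * k          # sum of responsibilities
    sx = [0.0] * k         # responsibility-weighted sum of x
    sxx = [0.0] * k        # responsibility-weighted sum of x^2
    total = 0
    for chunk in chunks:                     # each chunk is one block of a scan
        for x in chunk:
            total += 1
            dens = [weights[j] * gauss(x, means[j], variances[j]) for j in range(k)]
            z = sum(dens)
            for j in range(k):
                r = dens[j] / z              # responsibility of component j for x
                n[j] += r
                sx[j] += r * x
                sxx[j] += r * x * x
    new_w = [n[j] / total for j in range(k)]
    new_mu = [sx[j] / n[j] for j in range(k)]
    new_var = [max(sxx[j] / n[j] - new_mu[j] ** 2, 1e-9) for j in range(k)]
    return new_w, new_mu, new_var

chunks = [[0.0, 0.1, -0.1], [10.0, 9.9, 10.1]]   # two well-separated groups
w, mu, var = em_step_chunked(chunks, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
```

The update is identical to in-memory EM; only the access pattern changes, which is the property the scalable method exploits.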
Clustering for approximate similarity search in high-dimensional spaces
 IEEE Transactions on Knowledge and Data Engineering
, 2002
Abstract

Cited by 52 (0 self)
Abstract—In this paper, we present a clustering and indexing paradigm (called Clindex) for high-dimensional search spaces. The scheme is designed for approximate similarity searches, where one would like to find many of the data points near a target point, but where one can tolerate missing a few near points. For such searches, our scheme can find near points with high recall in very few I/Os and perform significantly better than other approaches. Our scheme is based on finding clusters and, then, building a simple but efficient index for them. We analyze the tradeoffs involved in clustering and building such an index structure, and present extensive experimental results. Index Terms—Approximate search, clustering, high-dimensional index, similarity search.
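The cluster-then-index scheme can be caricatured in a few lines: cluster the data offline, then answer a query by scanning only the few clusters whose centroids are closest, accepting that a near point in an unprobed cluster may be missed. The sketch below is that simplification only; the real Clindex builds an actual per-cluster index, and `approx_nn` and `nprobe` are invented names.

```python
def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))

def approx_nn(clusters, query, nprobe):
    """Probe only the nprobe clusters whose centroids are closest to
    the query; a linear scan inside those clusters stands in for the
    per-cluster index."""
    cents = [centroid(c) for c in clusters]
    order = sorted(range(len(clusters)), key=lambda j: dist2(query, cents[j]))
    best, best_d = None, float('inf')
    for j in order[:nprobe]:
        for p in clusters[j]:
            d = dist2(p, query)
            if d < best_d:
                best, best_d = p, d
    return best

clusters = [
    [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)],   # cluster near the origin
    [(5.0, 5.0), (5.2, 4.9)],               # far cluster, never probed here
]
nearest = approx_nn(clusters, (0.15, 0.1), nprobe=1)
```

With `nprobe=1` only the near cluster is read, which is where the I/O savings come from.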
The concentration of fractional distances
 IEEE Trans. on Knowledge and Data Engineering
, 2007
Abstract

Cited by 51 (1 self)
Abstract—Nearest neighbor search and many other numerical data analysis tools most often rely on the use of the Euclidean distance. When data are high dimensional, however, the Euclidean distances seem to concentrate; all distances between pairs of data elements seem to be very similar. Therefore, the relevance of the Euclidean distance has been questioned in the past, and fractional norms (Minkowski-like norms with an exponent less than one) were introduced to fight the concentration phenomenon. This paper justifies the use of alternative distances to fight concentration by showing that the concentration is indeed an intrinsic property of the distances and not an artifact from a finite sample. Furthermore, an estimation of the concentration as a function of the exponent of the distance and of the distribution of the data is given. It leads to the conclusion that, contrary to what is generally admitted, fractional norms are not always less concentrated than the Euclidean norm; a counterexample is given to prove this claim. Theoretical arguments are presented, which show that the concentration phenomenon can appear for real data that do not match the hypotheses of the theorems, in particular, the assumption of independent and identically distributed variables. Finally, some insights about how to choose an optimal metric are given. Index Terms—Nearest neighbor search, high-dimensional data, distance concentration, fractional distances.
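The concentration effect itself is easy to reproduce empirically. The sketch below is illustrative only (the paper analyzes the relative variance, not this min/max contrast): it compares the spread of distances-to-origin for uniform data in low and high dimension, with a `minkowski` helper that also accepts fractional exponents 0 < p < 1.

```python
import random

def minkowski(a, b, p):
    """Minkowski distance; 0 < p < 1 gives a fractional "norm"."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def relative_contrast(dim, p, n=200, seed=1):
    """(max - min) / min over the distances from n uniform points in
    [0,1]^dim to the origin; small values mean concentration."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    origin = [0.0] * dim
    dists = [minkowski(x, origin, p) for x in pts]
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(dim=2, p=2)      # distances vary widely
high = relative_contrast(dim=200, p=2)   # distances concentrate
```

In 2 dimensions the contrast is large; in 200 dimensions the Euclidean distances cluster tightly around their mean, exactly the phenomenon whose dependence on p the paper quantifies.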
Constrained K-Means Clustering
MSR-TR-2000-65, Microsoft Research
, 2000
Abstract

Cited by 39 (0 self)
We consider practical methods for adding constraints to the K-Means clustering algorithm in order to avoid local solutions with empty clusters or clusters having very few points. We often observe this phenomenon when applying K-Means to datasets where the number of dimensions is n ≥ 10 and the number of desired clusters is k ≥ 20. We propose explicitly adding k constraints to the underlying clustering optimization problem requiring that each cluster have at least a minimum number of points in it. We then investigate the resulting cluster assignment step. Preliminary numerical tests on real datasets indicate the constrained approach is less prone to poor local solutions, producing a better summary of the underlying data. 1 Introduction The K-Means clustering algorithm [5] has become a workhorse for the data analyst in many diverse fields. One drawback to the algorithm occurs when it is applied to datasets with m data points in n ≥ 10 dimensional real spac...
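The paper solves the constrained assignment step exactly as a minimum-cost network flow problem; the sketch below substitutes a much cruder greedy rule (each cluster first claims its `min_size` closest free points) purely to illustrate how a minimum-size constraint prevents empty clusters. All names and the toy data are hypothetical.

```python
def constrained_assign(points, centers, min_size):
    """Greedy stand-in for the paper's min-cost-flow assignment:
    phase 1 gives every cluster its min_size nearest free points,
    phase 2 sends the rest to their nearest center."""
    def d2(p, c):
        return sum((x - y) ** 2 for x, y in zip(p, c))
    n, k = len(points), len(centers)
    assign = [None] * n
    taken = set()
    # Phase 1: each cluster claims its min_size closest free points.
    for j in range(k):
        free = sorted((i for i in range(n) if i not in taken),
                      key=lambda i: d2(points[i], centers[j]))
        for i in free[:min_size]:
            assign[i] = j
            taken.add(i)
    # Phase 2: everyone else goes to the nearest center.
    for i in range(n):
        if assign[i] is None:
            assign[i] = min(range(k), key=lambda j: d2(points[i], centers[j]))
    return assign

points = [(0.0,), (0.1,), (0.2,), (5.0,)]
centers = [(0.0,), (20.0,)]   # second center is far from every point
assign = constrained_assign(points, centers, min_size=1)
```

Plain nearest-center assignment would leave the second cluster empty here; the constraint forces it to keep its closest point, which is the failure mode the paper addresses.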
Quality and Efficiency in High Dimensional Nearest Neighbor Search
Abstract

Cited by 30 (1 self)
Nearest neighbor (NN) search in high dimensional space is an important problem in many applications. Ideally, a practical solution (i) should be implementable in a relational database, and (ii) its query cost should grow sublinearly with the dataset size, regardless of the data and query distributions. Despite the bulk of NN literature, no solution fulfills both requirements, except locality sensitive hashing (LSH). The existing LSH implementations are either rigorous or ad hoc. Rigorous-LSH ensures good quality of query results, but requires expensive space and query cost. Although ad-hoc-LSH is more efficient, it abandons quality control, i.e., the neighbor it outputs can be arbitrarily bad. As a result, currently no method is able to ensure both quality and efficiency simultaneously in practice. Motivated by this, we propose a new access method called the locality sensitive B-tree (LSB-tree) that enables fast high-dimensional NN search with excellent quality. The combination of several LSB-trees leads to a structure called the LSB-forest that ensures the same result quality as rigorous-LSH, but reduces its space and query cost dramatically. The LSB-forest also outperforms ad-hoc-LSH, even though the latter has no quality guarantee. Besides its appealing theoretical properties, the LSB-tree itself also serves as an effective index that consumes linear space, and supports efficient updates. Our extensive experiments confirm that the LSB-tree is faster than (i) the state of the art of exact NN search by two orders of magnitude, and (ii) the best (linear-space) method of approximate retrieval by an order of magnitude, and at the same time, returns neighbors with much better quality.
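For context, the classical p-stable LSH scheme that both the rigorous and ad-hoc variants build on can be sketched as the textbook hash family h(v) = floor((a·v + b) / w) with Gaussian a. This is not the LSB-tree itself, just the underlying hashing idea; the class name and parameter defaults are invented for the sketch.

```python
import random

class L2LSH:
    """Minimal p-stable (Gaussian projection) LSH table for Euclidean
    NN. Points hashing to the same bucket under every projection are
    returned as candidates."""
    def __init__(self, dim, num_hashes=4, w=2.0, seed=0):
        rng = random.Random(seed)
        self.w = w
        self.a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_hashes)]
        self.b = [rng.uniform(0, w) for _ in range(num_hashes)]
        self.table = {}

    def _key(self, v):
        # One bucket id per projection: floor((a . v + b) / w).
        return tuple(
            int((sum(ai * vi for ai, vi in zip(a, v)) + b) // self.w)
            for a, b in zip(self.a, self.b)
        )

    def insert(self, v):
        self.table.setdefault(self._key(v), []).append(v)

    def candidates(self, q):
        return self.table.get(self._key(q), [])

index = L2LSH(dim=3)
index.insert((0.0, 0.0, 0.0))
index.insert((10.0, 10.0, 10.0))
cands = index.candidates((0.0, 0.0, 0.0))
```

A real deployment uses many such tables to boost recall; the LSB-tree's contribution is organizing these hash values so a B-tree can answer the query with guaranteed quality.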
The IGrid index: Reversing the dimensionality curse for similarity indexing in high dimensional space
 In Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining
, 2000
Abstract

Cited by 30 (5 self)
The similarity search and indexing problem is well known to be a difficult one for high dimensional applications. Most indexing structures show a rapid degradation with increasing dimensionality, which leads to an access of the entire database for each query. Furthermore, recent research results show that in high dimensional space, even the concept of similarity may not be very meaningful. In this paper, we propose the IGrid-index, a method for similarity indexing which uses a distance function whose meaningfulness is retained with increasing dimensionality. In addition, this technique shows performance which is unique among all known index structures: the percentage of data accessed is inversely proportional to the overall data dimensionality. Thus, this technique relies on the dimensionality being high in order to provide performance-efficient similarity results. The IGrid-index can also support a special kind of query which we refer to as projected range queries, a query which is increasingly relevant for very high dimensional data mining applications.
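The flavor of a dimensionality-resistant distance can be conveyed with a toy version: partition each dimension into equi-depth ranges and let only the dimensions where two points share a range contribute to the similarity. This is a loose sketch of the idea, not the IGrid-index's actual similarity function; all function names and data are invented.

```python
def equidepth_edges(values, k):
    """Cut points splitting the sorted values into k equi-depth ranges."""
    s = sorted(values)
    n = len(s)
    return [s[(i * n) // k] for i in range(1, k)]

def bucket(x, edges):
    """Index of the equi-depth range that x falls into."""
    b = 0
    for e in edges:
        if x >= e:
            b += 1
    return b

def igrid_similarity(p, q, edges_per_dim):
    """Count dimensions where p and q fall in the same range; only
    those dimensions contribute, which keeps the score meaningful
    as dimensionality grows."""
    return sum(
        1 for j, edges in enumerate(edges_per_dim)
        if bucket(p[j], edges) == bucket(q[j], edges)
    )

data = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.1), (0.8, 0.2)]
edges_per_dim = [equidepth_edges([row[j] for row in data], 2) for j in range(2)]
```

Because a dimension either matches or is ignored, adding irrelevant dimensions does not wash out the signal the way a full Euclidean sum does.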
Clustering in Massive Data Sets
 Handbook of massive data sets
, 1999
Abstract

Cited by 17 (0 self)
We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Sections 2 to 6 relate to nearest neighbor searching, an elemental form of clustering, and a basis for the clustering algorithms that follow. Sections 7 to 11 review a number of families of clustering algorithms. Sections 12 to 14 relate to visual or image representations of data sets, from which a number of interesting algorithmic developments arise.