Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
1998
Cited by 533 (28 self)
The nearest neighbor problem is the following: Given a set of n points $P = \{p_1, \ldots, p_n\}$ in some metric space $X$, preprocess $P$ so as to efficiently answer queries which require finding the point in $P$ closest to a query point $q \in X$. We focus on the particularly interesting case of the d-dimensional Euclidean space where $X = \mathbb{R}^d$ under some $\ell_p$ norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large d, in theory or in practice, they provide little improvement over the brute-force algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem, which is: Find a point $p \in P$ that is an $\epsilon$-approximate nearest neighbor of the query $q$ in that for all $p' \in P$, $d(p, q) \le (1 + \epsilon)\, d(p', q)$. We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a trul...
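The definition of an approximate nearest neighbor in the abstract above can be checked directly on a small example. The following is a minimal sketch, assuming the Euclidean norm and plain Python tuples as points; the function names are hypothetical and it illustrates only the brute-force baseline and the approximation condition, not the algorithms of the paper.

```python
import math

def dist(p, q):
    """Euclidean (l_2) distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_neighbor(points, q):
    """Brute-force exact search: compare the query with every data point."""
    return min(points, key=lambda p: dist(p, q))

def is_approx_nearest(p, points, q, eps):
    """True if d(p, q) <= (1 + eps) * d(p', q) holds for every p' in points."""
    return all(dist(p, q) <= (1 + eps) * dist(other, q) for other in points)

# Illustrative data (an assumption, not from the paper).
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
query = (1.0, 0.0)
exact = nearest_neighbor(points, query)
```

The exact nearest neighbor always satisfies the condition with eps = 0; a farther point qualifies only once eps is large enough to cover the gap.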
The geometry of graphs and some of its algorithmic applications
Combinatorica, 1995
Cited by 376 (16 self)
In this paper we explore some implications of viewing graphs as geometric objects. This approach offers a new perspective on a number of graph-theoretic and algorithmic problems. There are several ways to model graphs geometrically and our main concern here is with geometric representations that respect the metric of the (possibly weighted) graph. Given a graph G we map its vertices to a normed space in an attempt to (i) keep down the dimension of the host space and (ii) guarantee a small distortion, i.e., make sure that distances between vertices in G closely match the distances between their geometric images. In this paper we develop efficient algorithms for embedding graphs low-dimensionally with a small distortion. Further algorithmic applications include:
• A simple, unified approach to a number of problems on multicommodity flows, including the Leighton-Rao Theorem [29] and some of its extensions.
• For graphs embeddable in low-dimensional spaces with a small distortion, we can find low-diameter decompositions (in the sense of [4] and [34]). The parameters of the decomposition depend only on the dimension and the distortion and not on the size of the graph.
• In graphs embedded this way, small balanced separators can be found efficiently.
Faithful low-dimensional representations of statistical data allow for meaningful and efficient clustering, which is one of the most basic tasks in pattern-recognition. For the (mostly heuristic) methods used
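The notion of distortion used in the abstract above can be made concrete on a toy graph. The sketch below assumes an unweighted, connected graph given as an adjacency dict and an injective embedding into the Euclidean plane, and takes distortion to be the largest expansion times the largest contraction over all vertex pairs. The function names and the example are illustrative assumptions, not the paper's algorithm.

```python
import itertools
import math
from collections import deque

def shortest_path_lengths(adj, source):
    """BFS distances from source in an unweighted graph (adjacency dict)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distortion(adj, embedding):
    """Max expansion times max contraction of the embedding over all pairs.

    Assumes the graph is connected and no two vertices share an image.
    """
    graph_dist = {u: shortest_path_lengths(adj, u) for u in adj}
    expansion = contraction = 0.0
    for u, v in itertools.combinations(adj, 2):
        d_graph = graph_dist[u][v]
        d_image = math.dist(embedding[u], embedding[v])
        expansion = max(expansion, d_image / d_graph)
        contraction = max(contraction, d_graph / d_image)
    return expansion * contraction

# The 4-cycle drawn as a unit square: edges keep length 1, but the two
# diagonals shrink from graph distance 2 to Euclidean distance sqrt(2).
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
square = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0), 3: (0.0, 1.0)}
c4_distortion = distortion(cycle, square)
```

For this drawing the distortion equals sqrt(2), since no pair is stretched and the diagonal pairs are contracted by a factor of sqrt(2).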
Latent semantic indexing: A probabilistic analysis
1998
Cited by 210 (8 self)
Latent semantic indexing (LSI) is an information retrieval technique based on the spectral analysis of the term-document matrix, whose empirical success had heretofore been without rigorous prediction and explanation. We prove that, under certain conditions, LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance. We also propose the technique of random projection as a way of speeding up LSI. We complement our theorems with encouraging experimental results. We also argue that our results may be viewed in a more general framework, as a theoretical basis for the use of spectral methods in a wider class of applications such as collaborative filtering.
A Simple Proof of the Restricted Isometry Property for Random Matrices
Constructive Approximation, 2008
Cited by 165 (48 self)
We give a simple technique for verifying the Restricted Isometry Property (as introduced by Candès and Tao) for random matrices that underlies Compressed Sensing. Our approach has two main ingredients: (i) concentration inequalities for random inner products that have recently provided algorithmically simple proofs of the Johnson–Lindenstrauss lemma; and (ii) covering numbers for finite-dimensional balls in Euclidean space. This leads to an elementary proof of the Restricted Isometry Property and brings out connections between Compressed Sensing and the Johnson–Lindenstrauss lemma. As a result, we obtain simple and direct proofs of Kashin’s theorems on widths of finite balls in Euclidean space (and their improvements due to Gluskin) and proofs of the existence of optimal Compressed Sensing measurement matrices. In the process, we also prove that these measurements have a certain universality with respect to the sparsity-inducing basis.
Two Algorithms for Nearest-Neighbor Search in High Dimensions
1997
Cited by 150 (0 self)
Representing data as points in a high-dimensional space, so as to use geometric methods for indexing, is an algorithmic technique with a wide array of uses. It is central to a number of areas such as information retrieval, pattern recognition, and statistical data analysis; many of the problems arising in these applications can involve several hundred or several thousand dimensions. We consider the nearest-neighbor problem for d-dimensional Euclidean space: we wish to pre-process a database of n points so that given a query point, one can efficiently determine its nearest neighbors in the database. There is a large literature on algorithms for this problem, in both the exact and approximate cases. The more sophisticated algorithms typically achieve a query time that is logarithmic in n at the expense of an exponential dependence on the dimension d; indeed, even the average-case analysis of heuristics such as k-d trees reveals an exponential dependence on d in the query time. In this wor...
Database-friendly Random Projections
2001
Cited by 113 (2 self)
A classic result of Johnson and Lindenstrauss asserts that any set of n points in d-dimensional Euclidean space can be embedded into k-dimensional Euclidean space (where k is logarithmic in n and independent of d) so that all pairwise distances are maintained within an arbitrarily small factor. All known constructions of such embeddings involve projecting the n points onto a random k-dimensional hyperplane. We give a novel construction of the embedding, suitable for database applications, which amounts to computing a simple aggregate over k random attribute partitions.
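A minimal sketch of a random projection in the spirit of the result described above. It assumes a projection matrix with independent +1/-1 entries scaled by 1/sqrt(k), so that each output coordinate is the sum of one random subset of attributes minus the sum of the rest. The abstract does not spell out the construction, so this choice, the function names, and the dimensions are illustrative assumptions.

```python
import math
import random

def random_sign_matrix(d, k, rng):
    """d x k matrix of independent entries, +1 or -1 with probability 1/2 each.

    Assumption for illustration: the abstract does not state the entry
    distribution.
    """
    return [[rng.choice((-1.0, 1.0)) for _ in range(k)] for _ in range(d)]

def project(x, R):
    """Map a d-dimensional point x to k dimensions: (1 / sqrt(k)) * x R."""
    k = len(R[0])
    scale = 1.0 / math.sqrt(k)
    return [scale * sum(x[i] * R[i][j] for i in range(len(x)))
            for j in range(k)]

rng = random.Random(0)   # fixed seed so the run is repeatable
d, k = 1000, 200         # hypothetical source and target dimensions
R = random_sign_matrix(d, k, rng)
u = [rng.gauss(0, 1) for _ in range(d)]
v = [rng.gauss(0, 1) for _ in range(d)]

# Ratio of projected distance to original distance; expected to be near 1.
ratio = math.dist(project(u, R), project(v, R)) / math.dist(u, v)
```

Each projected coordinate needs only additions and subtractions of attribute values, which is what makes this style of projection easy to express as an aggregate query.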
Random projection in dimensionality reduction: Applications to image and text data
In Knowledge Discovery and Data Mining, 2001
Cited by 99 (0 self)
Random projections have recently emerged as a powerful method for dimensionality reduction. Theoretical results indicate that the method preserves distances quite nicely; however, empirical results are sparse. We present experimental results on using random projection as a dimensionality reduction tool in a number of cases, where the high dimensionality of the data would otherwise lead to burdensome computations. Our application areas are the processing of both noisy and noiseless images, and information retrieval in text documents. We show that projecting the data onto a random lower-dimensional subspace yields results comparable to conventional dimensionality reduction methods such as principal component analysis: the similarity of data vectors is preserved well under random projection. However, using random projections is computationally significantly less expensive than using, e.g., principal component analysis. We also show experimentally that using a sparse random matrix gives additional computational savings in random projection.
Approximating the Bandwidth Via Volume Respecting Embeddings
1999
Cited by 86 (3 self)
A linear arrangement of an n-vertex graph is a one-to-one mapping of its vertices to the integers $\{1, \ldots, n\}$. The bandwidth of a linear arrangement is the maximum difference between mapped values of adjacent vertices. The problem of finding a linear arrangement with smallest possible bandwidth is NP-hard. We present a randomized algorithm that runs in nearly linear time and outputs a linear arrangement whose bandwidth is within a polylogarithmic multiplicative factor of optimal. Our algorithm is based on a new notion, called volume respecting embeddings, which is a natural extension of small distortion embeddings of Bourgain and of Linial, London and Rabinovich.
1 Introduction
We consider the problem of minimizing the bandwidth of an undirected connected graph $G(V, E)$, where $n = |V|$ and $m = |E|$. One needs to find a linear arrangement of the vertices, namely, a one-to-one mapping $f : V \to \{1, 2, \ldots, n\}$, for which the bandwidth, i.e. $\max_{(i,j) \in E} |f(i) - f(j)|$, i...
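The bandwidth definition above translates directly into code. This is a minimal sketch, assuming graphs given as edge lists and arrangements as dicts from vertex to integer; the exhaustive search is included only to show the objective on tiny graphs and is not the randomized algorithm of the paper. All names are hypothetical.

```python
import itertools

def bandwidth(edges, arrangement):
    """Maximum |f(i) - f(j)| over edges (i, j) for a one-to-one map f."""
    return max(abs(arrangement[i] - arrangement[j]) for i, j in edges)

def min_bandwidth_brute_force(vertices, edges):
    """Try all n! arrangements; feasible only for very small graphs."""
    best = None
    for perm in itertools.permutations(range(1, len(vertices) + 1)):
        f = dict(zip(vertices, perm))
        b = bandwidth(edges, f)
        if best is None or b < best[0]:
            best = (b, f)
    return best

# A path a-b-c-d: laying it out in order gives bandwidth 1.
path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
in_order = {"a": 1, "b": 2, "c": 3, "d": 4}
shuffled = {"a": 1, "b": 4, "c": 2, "d": 3}

# A star with center "c" and four leaves: the center has four neighbors,
# and only two positions lie at distance 1 from it, so bandwidth is >= 2.
star_vertices = ["c", "w", "x", "y", "z"]
star_edges = [("c", "w"), ("c", "x"), ("c", "y"), ("c", "z")]
star_best, star_arrangement = min_bandwidth_brute_force(star_vertices, star_edges)
```

The factorial cost of the exhaustive search is consistent with the problem being NP-hard, which is why the paper settles for an approximation.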
Data mining for hypertext: A tutorial survey
ACM SIGKDD Explorations, 2000
Cited by 61 (0 self)
With over 800 million pages covering most areas of human endeavor, the World-wide Web is a fertile ground for data mining research to make a difference to the effectiveness of information search. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching via keyword queries. This process is often tentative and unsatisfactory. Better support is needed for expressing one's information need and dealing with a search result in more structured ways than available now. Data mining and machine learning have significant roles to play towards this end. In this paper we will survey recent advances in learning and mining problems related to hypertext in general and the Web in particular. We will review the continuum of supervised to semi-supervised to unsupervised learning problems, highlight the specific challenges which distinguish data mining in the hypertext domain from data mining in the context of data warehouses, and summarize the key areas of ...
Improved approximation algorithms for large matrices via random projections
In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science
Cited by 56 (1 self)
Recently several results appeared that show significant reduction in time for matrix multiplication, singular value decomposition as well as linear ($\ell_2$) regression, all based on data dependent random sampling. Our key idea is that low dimensional embeddings can be used to eliminate data dependence and provide more versatile, linear time pass efficient matrix computation. Our main contribution is summarized as follows.
• Independent of the recent results of Har-Peled and of Deshpande and Vempala, one of the first – and to the best of our knowledge the most efficient – relative-error $(1 + \epsilon)\|A - A_k\|_F$ approximation algorithms for the singular value decomposition of an $m \times n$ matrix $A$ with $M$ non-zero entries that requires 2 passes over the data and runs in time $O\left(\left(\frac{Mk}{\epsilon} + (n + m)\frac{k^2}{\epsilon^2}\right)\log\frac{1}{\delta}\right)$.
• The first $o(nd^2)$ time $(1 + \epsilon)$ relative-error approximation algorithm for $n \times d$ linear ($\ell_2$) regression.
• A matrix multiplication algorithm that easily applies to implicitly given matrices.

