Results 1  10
of
79
Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
, 1998
"... The nearest neighbor problem is the following: Given a set of n points P = fp 1 ; : : : ; png in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point q 2 X. We focus on the particularly interesting case of the ddimens ..."
Abstract

Cited by 715 (33 self)
 Add to MetaCart
The nearest neighbor problem is the following: Given a set of n points P = fp 1 ; : : : ; png in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point q 2 X. We focus on the particularly interesting case of the ddimensional Euclidean space where X = ! d under some l p norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large d, in theory or in practice, they provide little improvement over the bruteforce algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem, which is: Find a point p 2 P that is an fflapproximate nearest neighbor of the query q in that for all p 0 2 P , d(p; q) (1 + ffl)d(p 0 ; q). We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a trul...
The geometry of graphs and some of its algorithmic applications
 Combinatorica
, 1995
"... In this paper we explore some implications of viewing graphs as geometric objects. This approach offers a new perspective on a number of graphtheoretic and algorithmic problems. There are several ways to model graphs geometrically and our main concern here is with geometric representations that r ..."
Abstract

Cited by 457 (19 self)
 Add to MetaCart
In this paper we explore some implications of viewing graphs as geometric objects. This approach offers a new perspective on a number of graphtheoretic and algorithmic problems. There are several ways to model graphs geometrically and our main concern here is with geometric representations that respect the metric of the (possibly weighted) graph. Given a graph G we map its vertices to a normed space in an attempt to (i) Keep down the dimension of the host space and (ii) Guarantee a small distortion, i.e., make sure that distances between vertices in G closely match the distances between their geometric images. In this paper we develop efficient algorithms for embedding graphs lowdimensionally with a small distortion. Further algorithmic applications include: 0 A simple, unified approach to a number of problems on multicommodity flows, including the LeightonRae Theorem [29] and some of its extensions. 0 For graphs embeddable in lowdimensional spaces with a small distortion, we can find lowdiameter decompositions (in the sense of [4] and [34]). The parameters of the decomposition depend only on the dimension and the distortion and not on the size of the graph. 0 In graphs embedded this way, small balanced separators can be found efficiently. Faithful lowdimensional representations of statistical data allow for meaningful and efficient clustering, which is one of the most basic tasks in patternrecognition. For the (mostly heuristic) methods used
A Simple Proof of the Restricted Isometry Property for Random Matrices
 CONSTR APPROX
, 2008
"... We give a simple technique for verifying the Restricted Isometry Property (as introduced by Candès and Tao) for random matrices that underlies Compressed Sensing. Our approach has two main ingredients: (i) concentration inequalities for random inner products that have recently provided algorithmical ..."
Abstract

Cited by 301 (56 self)
 Add to MetaCart
We give a simple technique for verifying the Restricted Isometry Property (as introduced by Candès and Tao) for random matrices that underlies Compressed Sensing. Our approach has two main ingredients: (i) concentration inequalities for random inner products that have recently provided algorithmically simple proofs of the Johnson–Lindenstrauss lemma; and (ii) covering numbers for finitedimensional balls in Euclidean space. This leads to an elementary proof of the Restricted Isometry Property and brings out connections between Compressed Sensing and the Johnson–Lindenstrauss lemma. As a result, we obtain simple and direct proofs of Kashin’s theorems on widths of finite balls in Euclidean space (and their improvements due to Gluskin) and proofs of the existence of optimal Compressed Sensing measurement matrices. In the process, we also prove that these measurements have a certain universality with respect to the sparsityinducing basis.
Latent semantic indexing: A probabilistic analysis
, 1998
"... Latent semantic indexing (LSI) is an information retrieval technique based on the spectral analysis of the termdocument matrix, whose empirical success had heretofore been without rigorous prediction and explanation. We prove that, under certain conditions, LSI does succeed in capturing the underl ..."
Abstract

Cited by 248 (8 self)
 Add to MetaCart
Latent semantic indexing (LSI) is an information retrieval technique based on the spectral analysis of the termdocument matrix, whose empirical success had heretofore been without rigorous prediction and explanation. We prove that, under certain conditions, LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance. We also propose the technique of random projection as a way of speeding up LSI. We complement our theorems with encouraging experimental results. We also argue that our results may be viewed in a more general framework, as a theoretical basis for the use of spectral methods in a wider class of applications such as collaborative filtering.
Two Algorithms for NearestNeighbor Search in High Dimensions
, 1997
"... Representing data as points in a highdimensional space, so as to use geometric methods for indexing, is an algorithmic technique with a wide array of uses. It is central to a number of areas such as information retrieval, pattern recognition, and statistical data analysis; many of the problems aris ..."
Abstract

Cited by 169 (0 self)
 Add to MetaCart
Representing data as points in a highdimensional space, so as to use geometric methods for indexing, is an algorithmic technique with a wide array of uses. It is central to a number of areas such as information retrieval, pattern recognition, and statistical data analysis; many of the problems arising in these applications can involve several hundred or several thousand dimensions. We consider the nearestneighbor problem for ddimensional Euclidean space: we wish to preprocess a database of n points so that given a query point, one can efficiently determine its nearest neighbors in the database. There is a large literature on algorithms for this problem, in both the exact and approximate cases. The more sophisticated algorithms typically achieve a query time that is logarithmic in n at the expense of an exponential dependence on the dimension d; indeed, even the averagecase analysis of heuristics such as kd trees reveals an exponential dependence on d in the query time. In this wor...
Databasefriendly Random Projections
, 2001
"... A classic result of Johnson and Lindenstrauss asserts that any set of n points in ddimensional Euclidean space can be embedded into kdimensional Euclidean space  where k is logarithmic in n and independent of d  so that all pairwise distances are maintained within an arbitrarily small factor. Al ..."
Abstract

Cited by 158 (3 self)
 Add to MetaCart
A classic result of Johnson and Lindenstrauss asserts that any set of n points in ddimensional Euclidean space can be embedded into kdimensional Euclidean space  where k is logarithmic in n and independent of d  so that all pairwise distances are maintained within an arbitrarily small factor. All known constructions of such embeddings involve projecting the n points onto a random kdimensional hyperplane. We give a novel construction of the embedding, suitable for database applications, which amounts to computing a simple aggregate over k random attribute partitions.
Random projection in dimensionality reduction: Applications to image and text data
 in Knowledge Discovery and Data Mining
, 2001
"... Random projections have recently emerged as a powerful method for dimensionality reduction. Theoretical results indicate that the method preserves distances quite nicely; however, empirical results are sparse. We present experimental results on using random projection as a dimensionality reduction t ..."
Abstract

Cited by 137 (0 self)
 Add to MetaCart
Random projections have recently emerged as a powerful method for dimensionality reduction. Theoretical results indicate that the method preserves distances quite nicely; however, empirical results are sparse. We present experimental results on using random projection as a dimensionality reduction tool in a number of cases, where the high dimensionality of the data would otherwise lead to burdensome computations. Our application areas are the processing of both noisy and noiseless images, and information retrieval in text documents. We show that projecting the data onto a random lowerdimensional subspace yields results comparable to conventional dimensionality reduction methods such as principal component analysis: the similarity of data vectors is preserved well under random projection. However, using random projections is computationally signicantly less expensive than using, e.g., principal component analysis. We also show experimentally that using a sparse random matrix gives additional computational savings in random projection.
Approximating the Bandwidth Via Volume Respecting Embeddings
, 1999
"... A linear arrangement of an nvertex graph is a onetoone mapping of its vertices to the integers f1; : : : ; ng. The bandwidth of a linear arrangement is the maximum difference between mapped values of adjacent vertices. The problem of finding a linear arrangement with smallest possible bandwidt ..."
Abstract

Cited by 93 (3 self)
 Add to MetaCart
A linear arrangement of an nvertex graph is a onetoone mapping of its vertices to the integers f1; : : : ; ng. The bandwidth of a linear arrangement is the maximum difference between mapped values of adjacent vertices. The problem of finding a linear arrangement with smallest possible bandwidth in NPhard. We present a randomized algorithm that runs in nearly linear time and outputs a linear arrangement whose bandwidth is within a polylogarithmic multiplicative factor of optimal. Our algorithm is based on a new notion, called volume respecting embeddings, which is a natural extension of small distortion embeddings of Bourgain and of Linial, London and Rabinovich. 1 Introduction We consider the problem of minimizing the bandwidth of an undirected connected graph G(V; E), where n = jV j and m = jEj. One needs to find a linear arrangement of the vertices, namely, a onetoone mapping f : V \Gamma! f1; 2; : : : ng, for which the bandwidth, i.e. max (i;j)2E jf(i) \Gamma f(j)j, i...
Improved approximation algorithms for large matrices via random projections
 in Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science
"... Recently several results appeared that show significant reduction in time for matrix multiplication, singular value decomposition as well as linear (ℓ2) regression, all based on data dependent random sampling. Our key idea is that low dimensional embeddings can be used to eliminate data dependence a ..."
Abstract

Cited by 93 (3 self)
 Add to MetaCart
Recently several results appeared that show significant reduction in time for matrix multiplication, singular value decomposition as well as linear (ℓ2) regression, all based on data dependent random sampling. Our key idea is that low dimensional embeddings can be used to eliminate data dependence and provide more versatile, linear time pass efficient matrix computation. Our main contribution is summarized as follows. • Independent of the recent results of HarPeled and of Deshpande and Vempala, one of the first – and to the best of our knowledge the most efficient – relativeerror (1 + ɛ) ‖A − Ak‖F approximation algorithms for the singular value decomposition of an m × n matrix A with M nonzero entries that requires 2 passes over the data and runs in time O M k + (n + m)k2 ɛ ɛ2) log 1 δ • The first o(nd 2) time (1+ɛ) relativeerror approximation algorithm for n×d linear (ℓ2) regression. • A matrix multiplication algorithm that easily applies to implicitly given matrices. 1
Data mining for hypertext: A tutorial survey
 ACM SIGKDD Explorations
, 2000
"... With over 800 million pages covering most areas of human endeavor, the Worldwide Web is a fertile ground for data mining research to make a difference to the effectiveness of information search. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching ..."
Abstract

Cited by 65 (0 self)
 Add to MetaCart
With over 800 million pages covering most areas of human endeavor, the Worldwide Web is a fertile ground for data mining research to make a difference to the effectiveness of information search. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching via keyword queries. This process is often tentative and unsatisfactory. Better support is needed for expressing one's information need and dealing with a search result in more structured ways than available now. Data mining and machine learning have significant roles to play towards this end. In this paper we will survey recent advances in learning and mining problems related to hypertext in general and the Web in particular. We will review the continuum of supervised to semisupervised to unsupervised learning problems, highlight the specific challenges which distinguish data mining in the hypertext domain from data mining in the context of data warehouses, and summarize the key areas of ...