Disorder inequality: a combinatorial approach to nearest neighbor search (2008)

by N Goyal, Y Lifshits, H Schutze
Venue: WSDM
Results 1 - 10 of 11

Similarity-based Classification: Concepts and Algorithms

by Yihua Chen, Eric K. Garcia, Maya R. Gupta, Luca Cazzanti, Ali Rahimi, 2008
Cited by 8 (1 self)
This report reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and the most comprehensive set of experimental results to date. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning.
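To make the idea of weighting nearest neighbors by similarity concrete, here is a minimal sketch. It is not the weighting scheme proposed in the report: it assumes a plain similarity-weighted vote over the k most similar training samples, uses Jaccard similarity on sets purely as an example, and the names jaccard and weighted_knn_predict are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Example similarity on sets (an assumption; any similarity function works)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def weighted_knn_predict(query, train, labels, sim, k=3):
    """Predict a label by a similarity-weighted vote of the k most similar samples."""
    # Order training indices from most to least similar to the query.
    order = sorted(range(len(train)), key=lambda i: sim(query, train[i]), reverse=True)
    votes = defaultdict(float)
    for i in order[:k]:
        # Each neighbor votes with a weight equal to its raw similarity.
        votes[labels[i]] += sim(query, train[i])
    return max(votes, key=votes.get)
```

Only similarity values are used here; no metric properties such as the triangle inequality are required of sim.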

Combinatorial algorithms for nearest neighbors, near-duplicates and small-world design

by Yury Lifshits, Shengyu Zhang - In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’09, 2009
Cited by 6 (1 self)
We study the so-called combinatorial framework for algorithmic problems in similarity spaces. Namely, the input dataset is represented by a comparison oracle that, given three points x, y, y′, answers whether y or y′ is closer to x. We assume that the similarity order of the dataset satisfies the four variations of the following disorder inequality: if x is the a’th most similar object to y and y is the b’th most similar object to z, then x is among the D(a + b) most similar objects to z, where D is a relatively small disorder constant. Though the oracle gives much less information compared to the standard general metric space model where distance values are given, one can still design very efficient algorithms for various fundamental computational tasks. For nearest neighbor search we present a deterministic and exact algorithm with almost linear time and space complexity of preprocessing, and near-logarithmic time complexity of search. Then, for near-duplicate detection we present the first known deterministic algorithm that requires just near-linear time plus time proportional to the size of the output. Finally, we show that for any dataset satisfying the disorder inequality a visibility graph can be constructed: all outdegrees are near-logarithmic and greedy routing deterministically converges to the nearest neighbor of a target in a logarithmic number of steps. The latter result is the first known work-around for Navarro’s impossibility of generalizing Delaunay graphs. The technical contribution of the paper consists of handling “false positives” in data structures and an algorithmic technique called up-aside-down-filter.
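The disorder inequality can be checked directly on a small dataset. The following is a minimal brute-force sketch, not the authors' algorithm: it assumes numeric distances are available to build the rank lists (the combinatorial framework itself needs only comparisons), assumes there are no ties, and tests only the single variation stated in the abstract rather than all four. The names rank_table and disorder_constant are hypothetical.

```python
from itertools import permutations


def rank_table(points, dist):
    """rank[y][x] = 1-based position of x among the other points, ordered by distance to y."""
    n = len(points)
    rank = [[0] * n for _ in range(n)]
    for y in range(n):
        others = sorted(
            (i for i in range(n) if i != y),
            key=lambda i: dist(points[y], points[i]),
        )
        for pos, x in enumerate(others, start=1):
            rank[y][x] = pos
    return rank


def disorder_constant(points, dist):
    """Smallest D with rank_z(x) <= D * (rank_y(x) + rank_z(y)) over all distinct x, y, z."""
    rank = rank_table(points, dist)
    worst = 0.0
    for x, y, z in permutations(range(len(points)), 3):
        a = rank[y][x]  # x is the a'th most similar object to y
        b = rank[z][y]  # y is the b'th most similar object to z
        worst = max(worst, rank[z][x] / (a + b))
    return worst
```

The triple loop takes cubic time, so this is only useful for inspecting small samples of a dataset.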

On Nonmetric Similarity Search Problems in Complex Domains

by Tomáš Skopal, Benjamin Bustos
Cited by 3 (3 self)
The task of similarity search is widely used in various areas of computing, including multimedia databases, data mining, bioinformatics, social networks, etc. In fact, retrieval of semantically unstructured data entities requires a form of aggregated qualification that selects entities relevant to a query. A popular type of such a mechanism is similarity querying. For a long time, the database-oriented applications of similarity search employed the definition of similarity restricted to metric distances. Due to its topological properties, metric similarity can be effectively used to index a database which can be then queried efficiently by so-called metric access methods. However, together with the increasing complexity of data entities across various domains, in recent years there appeared many similarities that were not metrics – we call them nonmetric similarity functions. In this paper we survey domains employing nonmetric functions for effective similarity search, and methods for efficient nonmetric similarity search. First, we show that the ongoing research in many of these domains requires complex representations of data entities. Simultaneously, such complex representations allow us to model also complex and computationally expensive similarity functions (often represented by various matching algorithms). However, the more complex similarity function one develops, the more likely it will be a nonmetric. Second, we review the state-of-the-art techniques for efficient (fast) nonmetric similarity search, concerning both exact and approximate search. Finally, we discuss some open problems and possible future research trends.

Approximate nearest neighbor search through comparisons

by Suhas Diggavi - In ArXiv preprint’09
Cited by 2 (0 self)
This paper addresses the problem of finding the nearest neighbor (or one of the R-nearest neighbors) of a query object q in a database of n objects. In contrast with most existing approaches, we can only access the “hidden” space in which the objects live through a similarity oracle. The oracle, given two reference objects and a query object, returns the reference object closest to the query object. The oracle attempts to model the behavior of human users, capable of making statements about similarity, but not of assigning meaningful numerical values to distances between objects. Using such an oracle, the best we can hope for is to obtain, for every object u in the database, a sorted list of the other objects according to their distance to u. We call the position of object v in this list the rank of v with respect to u. The difficulty of searching using such an oracle depends on the non-homogeneities of the underlying space. We use two different characterizations of the underlying space to capture this property. The first one, rank distortion, relates pairwise ranks to the average difference in ranks w.r.t. other objects (a more precise definition is given in Section II). The second one, the combinatorial framework (a notion from [1], [2]), defines approximate triangle inequalities on ranks (a more precise definition is given in Section II). Roughly speaking, it defines a multiplicative factor D by which the triangle inequality on ranks
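The notion of rank described in this abstract can be illustrated with a short sketch. It assumes a hidden distance function that only the oracle can see and assumes no ties; make_oracle, ranked_list, and rank are hypothetical names, and the code illustrates the definition only, not the paper's search algorithm.

```python
from functools import cmp_to_key


def make_oracle(hidden_dist):
    """Build oracle(q, a, b), which returns whichever of a, b is closer to q.

    Callers never see distance values, only the outcome of the comparison.
    """
    def oracle(q, a, b):
        return a if hidden_dist(q, a) <= hidden_dist(q, b) else b
    return oracle


def ranked_list(u, objects, oracle):
    """Sort all other objects by closeness to u using only oracle answers."""
    others = [v for v in objects if v != u]

    def compare(a, b):
        # a sorts before b when the oracle says a is closer to u.
        return -1 if oracle(u, a, b) == a else 1

    return sorted(others, key=cmp_to_key(compare))


def rank(u, v, objects, oracle):
    """1-based position of v in u's ranked list."""
    return ranked_list(u, objects, oracle).index(v) + 1
```

Building one ranked list this way costs on the order of n log n oracle questions, which is why the abstract treats the full set of lists as the most a comparison-based method can learn.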

Similarity search via combinatorial nets

by Yury Lifshits, Shengyu Zhang
Cited by 1 (0 self)
We consider the Nearest Neighbor Search problem in the so-called combinatorial framework: only direct comparisons between two pairwise similarity values are allowed. We assume that the similarity order for the input dataset has the following consistency property: if x is the a’th most similar object to y and y is the b’th most similar object to z, then x is among the D(a + b) most similar objects to z. Though the oracle gives much less information compared to the standard general metric space model where distance values are given, it turns out that one can still design a deterministic preprocessing algorithm with almost linear time and space complexity, and answer queries deterministically in near-logarithmic time. A key procedure of our main algorithm is the efficient construction of combinatorial nets. We show that this data structure is useful for solving other important problems. For example, motivated by navigability questions, we show that for any dataset a visibility graph can be constructed: all out-degrees are near-logarithmic and greedy routing deterministically converges to the nearest neighbor in a logarithmic number of steps. Also, for the near-duplicate detection problem we present the first known deterministic algorithm that requires just near-linear time plus time proportional to the size of the output.

Combinatorial framework for similarity search

by Yury Lifshits - In Proc. 2nd International Workshop on Similarity Search and Applications (SISAP’09). IEEE Computer Society
Cited by 1 (0 self)
We present an overview of the combinatorial framework for similarity search. An algorithm is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Namely, the input dataset is represented by a comparison oracle that, given any three points x, y, z, answers whether y or z is closer to x. We assume that the similarity order of the dataset satisfies the four variations of the following disorder inequality: if x is the a’th most similar object to y and y is the b’th most similar object to z, then x is among the D(a + b) most similar objects to z, where D is a relatively small disorder constant. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use the triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. Ranwalk, the first known combinatorial solution for nearest neighbors, is a randomized, exact, zero-error algorithm with query time that is logarithmic in the number of objects. But Ranwalk preprocessing time is quadratic. Later on, another solution, called combinatorial nets, was discovered. It is a deterministic and exact algorithm with near-linear time and space complexity of preprocessing, and near-logarithmic time complexity of search. Combinatorial nets also have a number of side applications. For near-duplicate detection they lead to the first known deterministic algorithm that requires just near-linear time plus time proportional to the size of the output. For any dataset with small disorder, combinatorial nets can be used to construct a visibility graph: one in which greedy routing deterministically converges to the nearest neighbor of a target in a logarithmic number of steps. The latter result is the first known work-around for Navarro’s impossibility of generalizing Delaunay graphs. Keywords: nearest neighbors, similarity search

DEFINITION

by Paolo Ciaccia
MM indexing

Maximal Intersection Queries in Randomized Input Models

by Benjamin Hoffmann, Yury Lifshits, Dirk Nowotka
Consider a family of sets and a single set, called the query set. How can one quickly find a member of the family which has a maximal intersection with the query set? Time constraints on the query and on a possible preprocessing of the set family make this problem challenging. Such maximal intersection queries arise in a wide range of applications, including web search, recommendation systems, and distributing on-line advertisements. In general, maximal intersection queries are computationally expensive. We investigate two well-motivated distributions over all families of sets and propose an algorithm for each of them. We show that with very high probability an almost optimal solution is found in time which is logarithmic in the size of the family. Moreover, we point out a threshold phenomenon on the probabilities of intersecting sets in each of our two input models which leads to the efficient algorithms mentioned above.
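For reference, the problem statement corresponds to the following linear-scan baseline. This is a minimal sketch of the naive approach whose cost motivates the paper, not either of the algorithms the authors propose; the name max_intersection is hypothetical.

```python
def max_intersection(family, query):
    """Return (index, size) of the family member sharing the most elements with query.

    Scans every set, so the query time is linear in the size of the family.
    The first member reaching the maximum wins ties.
    """
    query = frozenset(query)
    best_index, best_size = -1, -1
    for i, member in enumerate(family):
        size = len(query & member)
        if size > best_size:
            best_index, best_size = i, size
    return best_index, best_size
```

An empty family yields (-1, -1), which callers would need to handle explicitly.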

Randomized Algorithms for Comparison-based Search

by Payam Delgosha, Suhas Diggavi, Soheil Mohajer
This paper addresses the problem of finding the nearest neighbor (or one of the R-nearest neighbors) of a query object q in a database of n objects, when we can only use a comparison oracle. The comparison oracle, given two reference objects and a query object, returns the reference object most similar to the query object. The main problem we study is how to search the database for the nearest neighbor (NN) of a query, while minimizing the questions. The difficulty of this problem depends on properties of the underlying database. We show the importance of a characterization: combinatorial disorder D, which defines approximate triangle inequalities on ranks. We present a lower bound of $\Omega(D \log \frac{n}{D} + D^2)$ average number of questions in the search phase for any randomized algorithm, which demonstrates the fundamental role of D for worst case behavior. We develop a randomized scheme for NN retrieval in $O(D^3 \log^2 n + D \log^2 n \log\log n^{D^3})$ questions. The learning requires asking $O(n D^3 \log^2 n + D \log^2 n \log\log n^{D^3})$ questions and $O(n \log^2 n / \log(2D))$ bits to store.
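To make the cost measure concrete, the sketch below counts oracle questions for the simplest possible strategy, a linear scan that keeps the current best candidate. It is a baseline only, not the paper's randomized scheme: it uses n - 1 questions per query and no preprocessing. The class CountingOracle and the function nearest_by_scan are hypothetical names, and the hidden distance is an assumption used to simulate the oracle.

```python
class CountingOracle:
    """Comparison oracle that records how many questions it has been asked."""

    def __init__(self, hidden_dist):
        self._dist = hidden_dist  # never exposed to the search code
        self.questions = 0

    def closer(self, q, a, b):
        """Return whichever of the reference objects a, b is more similar to q."""
        self.questions += 1
        return a if self._dist(q, a) <= self._dist(q, b) else b


def nearest_by_scan(q, database, oracle):
    """Find the nearest neighbor of q by comparing each object with the best so far."""
    best = database[0]
    for candidate in database[1:]:
        best = oracle.closer(q, best, candidate)
    return best
```

Comparing oracle.questions against n - 1 gives a simple yardstick for any scheme that spends questions in a learning phase to reduce the per-query count.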

Curse of Dimensionality in the Application of Pivot-based Indexes to the Similarity Search Problem

by Ilya Volnyansky , 905
In this work we study the validity of the so-called curse of dimensionality for indexing of databases for similarity search. We perform an asymptotic analysis, with a test model based on a sequence of metric spaces $(\Omega_d)$ from which we pick datasets $X_d$ in an i.i.d. fashion. We call the subscript $d$ the dimension of the space $\Omega_d$ (e.g. for $\mathbb{R}^d$ the dimension is just the usual one) and we allow the size of the dataset $n = n_d$ to be such that $d$ is superlogarithmic but subpolynomial in $n$. We study the asymptotic performance of pivot-based indexing schemes where the number of pivots is $o(n/d)$. We pick the relatively simple cost model of similarity search where we count each distance calculation as a single computation and disregard the rest. We demonstrate that if the spaces $\Omega_d$ exhibit the (fairly common) concentration of measure phenomenon, the performance of similarity search using such indexes is asymptotically linear in $n$. That is, for large enough $d$ the difference between using such an index and performing a search without an index at all is negligible. Thus we confirm the curse of dimensionality in this setting.
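A pivot-based index and the cost model described here can be sketched as follows. This is a generic illustration of pivot filtering under stated assumptions, not the specific scheme analyzed in the work: it answers a range query, prunes with the triangle inequality, and counts each distance calculation at query time as one computation. The names build_pivot_table and range_query are hypothetical.

```python
def build_pivot_table(data, pivots, dist):
    """Precompute the distance from every data point to every pivot."""
    return [[dist(x, p) for p in pivots] for x in data]


def range_query(q, radius, data, pivots, table, dist):
    """Return (hits, computations) for all x in data with dist(q, x) <= radius."""
    q_to_pivot = [dist(q, p) for p in pivots]
    computations = len(pivots)  # one distance calculation per pivot
    hits = []
    for x, row in zip(data, table):
        # Triangle inequality: |d(q, p) - d(x, p)| <= d(q, x) for every pivot p,
        # so a lower bound above the radius rules x out without computing d(q, x).
        if any(abs(dq - dx) > radius for dq, dx in zip(q_to_pivot, row)):
            continue
        computations += 1
        if dist(q, x) <= radius:
            hits.append(x)
    return hits, computations
```

When distances concentrate around a common value, the lower bounds |d(q, p) - d(x, p)| become small for nearly every point, few candidates are pruned, and the computation count approaches that of a full scan.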