Similarity-based Classification: Concepts and Algorithms
, 2008
Abstract

Cited by 24 (2 self)
This report reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and the most comprehensive set of experimental results to date. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning.
Combinatorial algorithms for nearest neighbors, near-duplicates and small-world design
 In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’09
, 2009
Abstract

Cited by 11 (1 self)
We study the so-called combinatorial framework for algorithmic problems in similarity spaces. Namely, the input dataset is represented by a comparison oracle that, given three points x, y, y′, answers whether y or y′ is closer to x. We assume that the similarity order of the dataset satisfies the four variations of the following disorder inequality: if x is the a’th most similar object to y and y is the b’th most similar object to z, then x is among the D(a + b) most similar objects to z, where D is a relatively small disorder constant. Though the oracle gives much less information compared to the standard general metric space model where distance values are given, one can still design very efficient algorithms for various fundamental computational tasks. For nearest neighbor search we present a deterministic and exact algorithm with almost linear time and space complexity of preprocessing, and near-logarithmic time complexity of search. Then, for near-duplicate detection we present the first known deterministic algorithm that requires just near-linear time + time proportional to the size of output. Finally, we show that for any dataset satisfying the disorder inequality a visibility graph can be constructed: all out-degrees are near-logarithmic and greedy routing deterministically converges to the nearest neighbor of a target in a logarithmic number of steps. The latter result is the first known workaround for Navarro’s impossibility of generalizing Delaunay graphs. The technical contribution of the paper consists of handling “false positives” in data structures and an algorithmic technique up-aside-down-filter.
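The comparison oracle and the disorder inequality described in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the names `make_oracle`, `rank`, and `disorder_constant` are hypothetical, the oracle is simulated from a hidden distance function, ranks start at 1 for the most similar other object, ties are assumed absent, and only one of the four variations of the inequality is checked, by brute force.

```python
from itertools import permutations

def make_oracle(dist):
    """Build a comparison oracle from a hidden distance (assumption: no ties).

    oracle(x, y, y2) is True when y is closer to x than y2 is.
    """
    def oracle(x, y, y2):
        return dist(x, y) < dist(x, y2)
    return oracle

def rank(oracle, points, x, y):
    """Rank of x as seen from y: 1 means x is the most similar object to y."""
    closer = sum(1 for p in points if p != x and p != y and oracle(y, p, x))
    return 1 + closer

def disorder_constant(oracle, points):
    """Smallest D such that rank_z(x) <= D * (rank_y(x) + rank_z(y)).

    Brute force over all ordered triples of distinct points; intended only
    for tiny illustrative datasets.
    """
    d = 0.0
    for x, y, z in permutations(points, 3):
        a = rank(oracle, points, x, y)  # x is the a'th most similar to y
        b = rank(oracle, points, y, z)  # y is the b'th most similar to z
        c = rank(oracle, points, x, z)  # x must be within the D(a + b) closest to z
        d = max(d, c / (a + b))
    return d

if __name__ == "__main__":
    pts = [0, 1, 3, 7]
    oracle = make_oracle(lambda p, q: abs(p - q))
    print(disorder_constant(oracle, pts))
```

The brute-force check needs a number of oracle calls that grows with the fourth power of the dataset size, so it only serves to make the definition concrete.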
Approximate nearest neighbor search through comparisons
 In ArXiv preprint’09
Abstract

Cited by 7 (1 self)
This paper addresses the problem of finding the nearest neighbor (or one of the R-nearest neighbors) of a query object q in a database of n objects. In contrast with most existing approaches, we can only access the “hidden” space in which the objects live through a similarity oracle. The oracle, given two reference objects and a query object, returns the reference object closest to the query object. The oracle attempts to model the behavior of human users, capable of making statements about similarity, but not of assigning meaningful numerical values to distances between objects. Using such an oracle, the best we can hope for is to obtain, for every object u in the database, a sorted list of the other objects according to their distance to u. We call the position of object v in this list the rank of v with respect to u. The difficulty of searching using such an oracle depends on the non-homogeneities of the underlying space. We use two different characterizations of the underlying space to capture this property. The first one, rank distortion, relates pairwise ranks to the average difference in ranks w.r.t. other objects (a more precise definition is given in Section II). The second one, the combinatorial framework (a notion from [1], [2]), defines approximate triangle inequalities on ranks (a more precise definition is given in Section II). Roughly speaking, it defines a multiplicative factor D by which the triangle inequality on ranks
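The sorted list and the rank defined in this abstract can be sketched as follows. This is an illustration under stated assumptions rather than the authors' method: `make_hidden_oracle`, `sorted_by_similarity`, and `rank_of` are hypothetical names, the oracle is simulated from a hidden distance that the search code never reads directly, and no two objects are assumed to be equally distant from the same object.

```python
from functools import cmp_to_key

def make_hidden_oracle(dist):
    """Simulate the similarity oracle: given two reference objects and a
    query object, return the reference object closest to the query."""
    def oracle(ref_a, ref_b, query):
        return ref_a if dist(ref_a, query) <= dist(ref_b, query) else ref_b
    return oracle

def sorted_by_similarity(oracle, objects, u):
    """Sort every object other than u from most to least similar to u,
    using only oracle answers (assumption: no ties)."""
    def compare(v, w):
        return -1 if oracle(v, w, u) == v else 1
    others = [o for o in objects if o != u]
    return sorted(others, key=cmp_to_key(compare))

def rank_of(oracle, objects, v, u):
    """Position (starting at 1) of v in the list sorted around u."""
    return sorted_by_similarity(oracle, objects, u).index(v) + 1

if __name__ == "__main__":
    objects = [10, 2, 7, 3]
    oracle = make_hidden_oracle(lambda p, q: abs(p - q))
    print(sorted_by_similarity(oracle, objects, 3))
    print(rank_of(oracle, objects, 7, 3))
```

Sorting around a single object takes on the order of n log n oracle questions, which is why building such lists for every object is treated as a learning phase separate from search.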
On Nonmetric Similarity Search Problems in Complex Domains
, 2010
Abstract

Cited by 6 (3 self)
The task of similarity search is widely used in various areas of computing, including multimedia databases, data mining, bioinformatics, social networks, etc. In fact, retrieval of semantically unstructured data entities requires a form of aggregated qualification that selects entities relevant to a query. A popular type of such a mechanism is similarity querying. For a long time, the database-oriented applications of similarity search employed the definition of similarity restricted to metric distances. Due to its topological properties, metric similarity can be effectively used to index a database which can then be queried efficiently by so-called metric access methods. However, together with the increasing complexity of data entities across various domains, in recent years there appeared many similarities that were not metrics; we call them nonmetric similarity functions. In this paper we survey domains employing nonmetric functions for effective similarity search, and methods for efficient nonmetric similarity search. First, we show that the ongoing research in many of these domains requires complex representations of data entities. Simultaneously, such complex representations also allow us to model complex and computationally expensive similarity functions (often represented by various matching algorithms). However, the more complex a similarity function one develops, the more likely it is to be nonmetric. Second, we review the state-of-the-art techniques for efficient (fast) nonmetric similarity search, concerning both exact and approximate search. Finally, we discuss some open problems and possible future research trends.
Content search through comparisons
, 2010
Abstract

Cited by 5 (3 self)
We study the problem of navigating through a database of similar objects using comparisons under heterogeneous demand, a problem closely related to small-world network design. We show that, under heterogeneous demand, the small-world network design problem is NP-hard. Given the above negative result, we propose a novel mechanism for small-world network design and provide an upper bound on its performance under heterogeneous demand. The above mechanism has a natural equivalent in the context of content search through comparisons, again under heterogeneous demand; we use this to establish both upper and lower bounds on content search through comparisons.
Facebrowsing: Search and navigation through comparisons
 In ITA workshop
, 2010
Abstract

Cited by 3 (0 self)
This paper addresses the problem of finding the nearest neighbor (or one of the R-nearest neighbors) of a query object in a database which is only accessible through a comparison oracle. The comparison oracle, given two reference objects and a query object, returns the reference object closest to the query object. The oracle attempts to model the behavior of human users, capable of making statements about similarity, but not of assigning meaningful numerical values to distances between objects. We develop nearest-neighbor search algorithms and analyze their performance for such oracles. Using such a comparison oracle, the best we can hope for is to obtain, for every object in the database, a ranking of the other objects according to their distance to it. The difficulty of searching using such an oracle depends on the non-homogeneities of the underlying space. We introduce the new idea of a rank-sensitive hash (RSH) function which gives the same hash value for “similar” objects based on the rank-value of the objects obtained from the similarity oracle. As one application of RSH, we demonstrate that we can retrieve one of the (1 + ε)r-nearest neighbors of a query point in time complexity depending on an underlying property (termed rank-distortion) of the search space. We use this idea to implement a navigation system for an image database of human faces. In particular, we design a database for images that is organized adaptively based on both baseline comparisons using eigenfaces and refined using selected human input. We present a preliminary implementation of this system which seeks to minimize the number of questions asked to a (human) oracle.
Randomized Algorithms for Comparison-based Search
Abstract

Cited by 2 (0 self)
This paper addresses the problem of finding the nearest neighbor (or one of the R-nearest neighbors) of a query object q in a database of n objects, when we can only use a comparison oracle. The comparison oracle, given two reference objects and a query object, returns the reference object most similar to the query object. The main problem we study is how to search the database for the nearest neighbor (NN) of a query, while minimizing the questions. The difficulty of this problem depends on properties of the underlying database. We show the importance of a characterization: combinatorial disorder D which defines approximate triangle inequalities on ranks. We present a lower bound of Ω(D log(n/D) + D^2) average number of questions in the search phase for any randomized algorithm, which demonstrates the fundamental role of D for worst case behavior. We develop a randomized scheme for NN retrieval in O(D^3 log^2 n + D log^2 n log log n^(D^3)) questions. The learning requires asking O(n D^3 log^2 n + D log^2 n log log n^(D^3)) questions and O(n log^2 n / log(2D)) bits to store.
Similarity search via combinatorial nets
Abstract

Cited by 1 (0 self)
We consider the Nearest Neighbor Search problem in the so-called combinatorial framework: only direct comparisons between two pairwise similarity values are allowed. We assume that the similarity order for the input dataset has the following consistency property: if x is the a’th most similar object to y and y is the b’th most similar object to z, then x is among the D(a + b) most similar objects to z. Though the oracle gives much less information compared to the standard general metric space model where distance values are given, it turns out that one can still design a deterministic preprocessing algorithm with almost linear time and space complexity, and answer queries deterministically in near-logarithmic time. A key procedure of our main algorithm is the efficient construction of combinatorial nets. We show that this data structure is useful for solving other important problems. For example, motivated by navigability questions we show that for any dataset a visibility graph can be constructed: all out-degrees are near-logarithmic and greedy routing deterministically converges to the nearest neighbor in a logarithmic number of steps. Also, for the near-duplicate detection problem we present the first known deterministic algorithm that requires just near-linear time + time proportional to the size of output.
Combinatorial framework for similarity search
 In Proc. 2nd International Workshop on Similarity Search and Applications (SISAP’09). IEEE Computer Society
Abstract

Cited by 1 (0 self)
We present an overview of the combinatorial framework for similarity search. An algorithm is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Namely, the input dataset is represented by a comparison oracle that, given any three points x, y, z, answers whether y or z is closer to x. We assume that the similarity order of the dataset satisfies the four variations of the following disorder inequality: if x is the a’th most similar object to y and y is the b’th most similar object to z, then x is among the D(a + b) most similar objects to z, where D is a relatively small disorder constant. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. Ranwalk, the first known combinatorial solution for nearest neighbors, is a randomized, exact, zero-error algorithm with query time that is logarithmic in the number of objects. But Ranwalk preprocessing time is quadratic. Later on, another solution, called combinatorial nets, was discovered. It is a deterministic and exact algorithm with near-linear time and space complexity of preprocessing, and near-logarithmic time complexity of search. Combinatorial nets also have a number of side applications. For near-duplicate detection they lead to the first known deterministic algorithm that requires just near-linear time + time proportional to the size of output. For any dataset with small disorder, combinatorial nets can be used to construct a visibility graph: the one in which greedy routing deterministically converges to the nearest neighbor of a target in a logarithmic number of steps. The latter result is the first known workaround for Navarro’s impossibility of generalizing Delaunay graphs. Keywords: nearest neighbors, similarity search
Compression with Graphical Constraints: An Interactive Browser
Abstract

Cited by 1 (1 self)
We study the problem of searching for a given element in a set of objects using a membership oracle. The membership oracle, given a subset of objects A and a target object t, determines whether A contains t or not. The goal is to find the target object with the minimum number of questions asked from the oracle. This problem is known to be strongly related to lossless source compression. In fact, the optimum strategy is provided by Huffman coding with the average number of questions very close to the entropy H(P) of the object set. The membership oracle aims at modelling interactive methods (i.e., those incorporating human feedback) and has many real-life applications. Due to practical constraints imposed by such applications, not every subset A of objects can be queried. It is known that in general finding the optimum strategy with such constraints is NP-complete. Given this negative result, we restrict attention to the cases represented by graphical models: a graph G whose nodes are the database objects is given, and the queries are restricted to be those subsets A that are connected in G. We show that when G itself is connected, there is a search algorithm that finds the target in 4H(P) + 2 queries on the average. Since entropy is the trivial lower bound, our algorithm performs within a constant gap from the optimum strategy.
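The link between unconstrained membership queries and Huffman coding mentioned in this abstract can be checked numerically with a short sketch. It covers only the unconstrained case, not the graph-constrained algorithm of the paper, and the function names are hypothetical. Each bit of an object's Huffman codeword is read as one membership question, so the codeword length equals the number of questions needed to identify that object.

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Codeword length per object for a binary Huffman code.

    Each merge of two subtrees adds one question (one bit) for every
    object contained in them.
    """
    # Heap entries: (probability, unique tie-breaker, object indices in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        for i in group1 + group2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next_id, group1 + group2))
        next_id += 1
    return lengths

def entropy(probs):
    """Shannon entropy H(P) in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

def expected_questions(probs):
    """Average number of membership questions under the Huffman strategy."""
    return sum(p * n for p, n in zip(probs, huffman_lengths(probs)))

if __name__ == "__main__":
    target_probs = [0.4, 0.3, 0.2, 0.1]
    print(entropy(target_probs), expected_questions(target_probs))
```

For any distribution the average printed by `expected_questions` lies between H(P) and H(P) + 1, which is the sense in which the unconstrained optimum is "very close to the entropy".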