Results 1–10 of 103

Effective Proximity Retrieval by Ordering Permutations, 2007
"... We introduce a new probabilistic proximity search algorithm for range and Knearest neighbor (KNN) searching in both coordinate and metric spaces. Although there exist solutions for these problems, they boil down to a linear scan when the space is intrinsically highdimensional, as is the case in m ..."
Abstract

Cited by 21 (4 self)
 Add to MetaCart
We introduce a new probabilistic proximity search algorithm for range and K-nearest neighbor (KNN) searching in both coordinate and metric spaces. Although there exist solutions for these problems, they boil down to a linear scan when the space is intrinsically high-dimensional, as is the case in many pattern recognition tasks. This, for example, renders the KNN approach to classification rather slow in large databases. Our novel idea is to predict closeness between elements according to how they order their distances towards a distinguished set of anchor objects. Each element in the space sorts the anchor objects from closest to farthest from it, and the similarity between orders turns out to be an excellent predictor of the closeness between the corresponding elements. We present extensive experiments comparing our method against state-of-the-art exact and approximate techniques, in both synthetic and real, metric and non-metric databases, measuring both CPU time and distance computations. The experiments demonstrate that our technique almost always improves upon the performance of alternative techniques, in some cases by a wide margin.
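The ordering idea can be sketched in a few lines. The sketch below is illustrative, not the paper's implementation: the function names are ours, and the Spearman footrule is one natural way to compare two anchor orderings.

```python
import math
import random

def anchor_order(point, anchors):
    """Indices of the anchors sorted from closest to farthest from the point."""
    return sorted(range(len(anchors)), key=lambda i: math.dist(point, anchors[i]))

def footrule(order_a, order_b):
    """Spearman footrule: total displacement of each anchor between two orders."""
    pos_a = {a: i for i, a in enumerate(order_a)}
    pos_b = {a: i for i, a in enumerate(order_b)}
    return sum(abs(pos_a[a] - pos_b[a]) for a in pos_a)

random.seed(0)
anchors = [(random.random(), random.random()) for _ in range(8)]
p = (0.2, 0.2)
q_near = (0.201, 0.199)   # very close to p: should order the anchors similarly
q_far = (0.9, 0.9)        # far from p: should order them quite differently

d_near = footrule(anchor_order(p, anchors), anchor_order(q_near, anchors))
d_far = footrule(anchor_order(p, anchors), anchor_order(q_far, anchors))
```

Close elements see the anchors in almost the same order, so a small footrule value predicts spatial closeness without computing the actual distance between the elements.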
Large-Scale Malware Indexing Using Function-Call Graphs
"... A major challenge of the antivirus (AV) industry is how to effectively process the huge influx of malware samples they receive every day. One possible solution to this problem is to quickly determine if a new malware sample is similar to any previouslyseen malware program. In this paper, we design ..."
Abstract

Cited by 19 (0 self)
 Add to MetaCart
A major challenge of the anti-virus (AV) industry is how to effectively process the huge influx of malware samples it receives every day. One possible solution to this problem is to quickly determine if a new malware sample is similar to any previously-seen malware program. In this paper, we design, implement and evaluate a malware database management system called SMIT (Symantec Malware Indexing Tree) that can efficiently make such a determination based on malware's function-call graphs, a structural representation known to be less susceptible to the instruction-level obfuscations commonly employed by malware writers to evade detection by AV software. Because each malware program is represented as a graph, the problem of searching a database for the malware program most similar to a given sample is cast as a nearest-neighbor search problem in a graph database. To speed
Query answering and ontology population: An inductive approach
 In Proc. ESWC 2008
"... Abstract. In order to overcome the limitations of deductive logicbased approaches to deriving operational knowledge from ontologies, especially when data come from distributed sources, inductive (instancebased) methods may be better suited, since they are usually efficient and noisetolerant. In th ..."
Abstract

Cited by 17 (12 self)
 Add to MetaCart
In order to overcome the limitations of deductive logic-based approaches to deriving operational knowledge from ontologies, especially when data come from distributed sources, inductive (instance-based) methods may be better suited, since they are usually efficient and noise-tolerant. In this paper we propose an inductive method for improving instance retrieval and enriching ontology population. Casting retrieval as a classification problem, with the goal of assessing individual class-memberships w.r.t. the query concepts, we propose an extension of the k-Nearest Neighbor algorithm for OWL ontologies based on an entropic distance measure. The procedure can classify individuals w.r.t. the known concepts, and it can also be used to retrieve individuals belonging to query concepts. Experimentally, we show that the behavior of the classifier is comparable with that of a standard reasoner. Moreover, we show that new knowledge (not logically derivable) is induced, which can be suggested to the knowledge engineer for validation during the ontology population task.
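As a rough illustration of the instance-based idea (this is not the paper's entropic distance measure, and all names are hypothetical), a plain k-NN vote on class membership looks like this:

```python
from collections import Counter

def knn_membership(query, examples, k, dist):
    """Classify a query individual by majority vote over its k nearest
    labeled examples. `examples` is a list of (individual, label) pairs;
    `dist` is any dissimilarity function over individuals."""
    nearest = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 1-D "individuals": members cluster near 0, non-members near 1.
examples = [(0.0, "member"), (0.1, "member"), (0.2, "member"),
            (1.0, "nonmember"), (1.1, "nonmember")]
d = lambda a, b: abs(a - b)
```

For an ontology, the individuals would be OWL instances and `dist` a semantic measure over them; the point is that membership is predicted from neighbors rather than deduced, which is what lets the method return answers a reasoner cannot derive.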
Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions
 International Journal of Mathematical Models and Methods in Applied Sciences, 2007
"... Distance or similarity measures are essential to solve many pattern recognition problems such as classification, clustering, and retrieval problems. Various distance/similarity measures that are applicable to compare two probability density functions, pdf in short, are reviewed and categorized in b ..."
Abstract

Cited by 17 (0 self)
 Add to MetaCart
Distance or similarity measures are essential to solve many pattern recognition problems such as classification, clustering, and retrieval problems. Various distance/similarity measures that are applicable to compare two probability density functions, pdf in short, are reviewed and categorized in both syntactic and semantic relationships. A correlation coefficient and a hierarchical clustering technique are adopted to reveal similarities among numerous distance/similarity measures.
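Two commonly surveyed measures make the distinction concrete (function names are ours): the Kullback-Leibler divergence is asymmetric and unbounded, while the Hellinger distance is symmetric and bounded.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence between discrete pdfs (assumes q[i] > 0
    wherever p[i] > 0); asymmetric and unbounded."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hellinger(p, q):
    """Hellinger distance between discrete pdfs; symmetric, bounded in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                               for pi, qi in zip(p, q)))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
```

Which measure is appropriate depends on the task: retrieval and clustering usually want a symmetric distance, whereas coding and inference problems often call for a divergence.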
Combinatorial algorithms for nearest neighbors, near-duplicates and small-world design
 In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA'09, 2009
"... We study the so called combinatorial framework for algorithmic problems in similarity spaces. Namely, the input dataset is represented by a comparison oracle that given three points x, y, y ′ answers whether y or y ′ is closer to x. We assume that the similarity order of the dataset satisfies the fo ..."
Abstract

Cited by 11 (1 self)
 Add to MetaCart
We study the so-called combinatorial framework for algorithmic problems in similarity spaces. Namely, the input dataset is represented by a comparison oracle that, given three points x, y, y′, answers whether y or y′ is closer to x. We assume that the similarity order of the dataset satisfies the four variations of the following disorder inequality: if x is the a-th most similar object to y and y is the b-th most similar object to z, then x is among the D(a + b) most similar objects to z, where D is a relatively small disorder constant. Though the oracle gives much less information than the standard general metric space model, where distance values are given, one can still design very efficient algorithms for various fundamental computational tasks. For nearest neighbor search we present a deterministic and exact algorithm with almost linear time and space complexity of preprocessing, and near-logarithmic time complexity of search. Then, for near-duplicate detection, we present the first known deterministic algorithm that requires just near-linear time plus time proportional to the size of the output. Finally, we show that for any dataset satisfying the disorder inequality a visibility graph can be constructed: all out-degrees are near-logarithmic and greedy routing deterministically converges to the nearest neighbor of a target in a logarithmic number of steps. The latter result is the first known workaround for Navarro's impossibility of generalizing Delaunay graphs. The technical contribution of the paper consists of handling "false positives" in data structures and an algorithmic technique, the upside-down filter.
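The disorder inequality is easy to check empirically on a toy dataset. The sketch below is hypothetical code (Euclidean distance stands in for the abstract comparison oracle) and computes the smallest empirical D for one variation of the inequality:

```python
import math
import random
from itertools import permutations

random.seed(1)
points = [(random.random(), random.random()) for _ in range(15)]
n = len(points)

# ranks[i][j] = rank of object j in object i's similarity order (0 is i itself).
ranks = []
for i in range(n):
    order = sorted(range(n), key=lambda k: math.dist(points[i], points[k]))
    ranks.append({j: r for r, j in enumerate(order)})

# Smallest empirical D such that: if x is the a-th most similar object to y
# and y is the b-th most similar object to z, then x is among the D*(a+b)
# most similar objects to z.
D = max(ranks[z][x] / (ranks[y][x] + ranks[z][y])
        for x, y, z in permutations(range(n), 3))
```

A small D means similarity orders are transitive enough for rank arithmetic to replace distance values, which is what the framework's algorithms exploit.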
A fast search algorithm for a large fuzzy database
 IEEE Trans. on Inf. Forensics and Security, 2008
"... Abstract—In this paper, we propose a fast search algorithm for a large fuzzy database that stores iris codes or data with a similar binary structure. The fuzzy nature of iris codes and their high dimensionality render many modern search algorithms, mainly relying on sorting and hashing, inadequate. ..."
Abstract

Cited by 9 (1 self)
 Add to MetaCart
In this paper, we propose a fast search algorithm for a large fuzzy database that stores iris codes or data with a similar binary structure. The fuzzy nature of iris codes and their high dimensionality render many modern search algorithms, mainly relying on sorting and hashing, inadequate. The algorithm used in all current public deployments of iris recognition is based on a brute-force exhaustive search through a database of iris codes, looking for a match that is close enough. Our new technique, Beacon Guided Search (BGS), tackles this problem by dispersing a multitude of "beacons" in the search space. Despite random bit errors, iris codes from the same eye are more likely to collide with the same beacons than those from different eyes. By counting the number of collisions, BGS shrinks the search range dramatically with a negligible loss of precision. We evaluate this technique using 632,500 iris codes enrolled in the United Arab Emirates (UAE) border control system, showing a substantial improvement in search speed with a negligible loss of accuracy. In addition, we demonstrate that the empirical results match theoretical predictions. Index Terms—Biometric search, iris scanning and recognition (BIOIRIS), multiple colliding segments principle.
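The beacon idea is close in spirit to locality-sensitive hashing. The toy sketch below (all names and parameters are ours, not the paper's) treats each beacon space as a small random subset of bit positions; two codes collide in a space when they agree on all of its sampled bits.

```python
import random

random.seed(42)
CODE_LEN, N_SPACES, BITS_PER_SPACE = 256, 32, 10

# Each beacon space samples a fixed random subset of bit positions.
spaces = [random.sample(range(CODE_LEN), BITS_PER_SPACE) for _ in range(N_SPACES)]

def beacons(code):
    """The beacon hit by `code` in each space: the tuple of its sampled bits."""
    return [tuple(code[i] for i in posns) for posns in spaces]

def collisions(a, b):
    """Number of beacon spaces in which the two codes hit the same beacon."""
    return sum(x == y for x, y in zip(beacons(a), beacons(b)))

enrolled = [random.randrange(2) for _ in range(CODE_LEN)]
same_eye = list(enrolled)
for i in random.sample(range(CODE_LEN), 10):  # ~4% random bit errors
    same_eye[i] ^= 1
other_eye = [random.randrange(2) for _ in range(CODE_LEN)]
```

With inverted lists mapping each beacon to the codes that hit it, a query only has to score the codes it collides with, which is how the search range shrinks below a full scan.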
Spatial selection of sparse pivots for similarity search in metric spaces
 In: SOFSEM 2007: 33rd Conference on Current Trends in Theory and Practice of Computer Science, LNCS 4362, 2007
"... Similarity search is a fundamental operation for applications that deal with unstructured data sources. In this paper we propose a new pivotbased method for similarity search, called Sparse Spatial Selection (SSS). The main characteristic of this method is that it guarantees a good pivot selection ..."
Abstract

Cited by 8 (2 self)
 Add to MetaCart
Similarity search is a fundamental operation for applications that deal with unstructured data sources. In this paper we propose a new pivot-based method for similarity search, called Sparse Spatial Selection (SSS). The main characteristic of this method is that it guarantees a good pivot selection more efficiently than previously proposed methods. In addition, SSS adapts itself to the dimensionality of the metric space at hand, without the need to specify in advance the number of pivots to use. Furthermore, SSS is dynamic: it supports object insertions in the database efficiently, it can work with both continuous and discrete distance functions, and it is suitable for secondary-memory storage. In this work we provide experimental results that confirm the advantages of the method on several vector and metric spaces. We also show that the efficiency of our proposal is similar to that of existing alternatives over vector spaces, while it is better over general metric spaces.
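Published descriptions of SSS select pivots greedily: an object becomes a new pivot when it lies farther than α·M from every current pivot, where M is the maximum distance in the space and α is a tunable fraction. A minimal sketch under that assumption (names ours, details simplified):

```python
import math

def sss_pivots(objects, dist, max_dist, alpha=0.4):
    """Sparse Spatial Selection sketch: scan objects in insertion order and
    promote an object to pivot when it is farther than alpha*max_dist from
    every pivot selected so far. No pivot count is fixed in advance."""
    pivots = []
    for obj in objects:
        if all(dist(obj, p) > alpha * max_dist for p in pivots):
            pivots.append(obj)
    return pivots

grid = [(x / 9, y / 9) for x in range(10) for y in range(10)]  # unit square
pivots = sss_pivots(grid, math.dist, max_dist=math.sqrt(2), alpha=0.5)
```

Because the rule is purely incremental, a newly inserted object can be tested against the current pivots at insertion time, which is what makes the index dynamic; the number of pivots selected grows with the intrinsic dimensionality of the space rather than being fixed by hand.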
Similarity search using sparse pivots for efficient multimedia information retrieval
 In: Proc. of the 8th IEEE International Symposium on Multimedia (ISM'06), 2006
"... Similarity search is a fundamental operation for applications that deal with unstructured data sources. In this paper we propose a new pivotbased method for similarity search, called Sparse Spatial Selection (SSS). This method guarantees a good pivot selection more efficiently than other methods pr ..."
Abstract

Cited by 8 (2 self)
 Add to MetaCart
Similarity search is a fundamental operation for applications that deal with unstructured data sources. In this paper we propose a new pivot-based method for similarity search, called Sparse Spatial Selection (SSS). This method guarantees a good pivot selection more efficiently than previously proposed methods. In addition, SSS adapts itself to the dimensionality of the metric space at hand, and it is not necessary to specify in advance the number of pivots to extract. Furthermore, SSS is dynamic: it supports object insertions in the database efficiently, it can work with both continuous and discrete distance functions, and it is suitable for secondary-memory storage. In this work we provide experimental results that confirm the advantages of the method on several vector and metric spaces.
A Dynamic Pivot Selection Technique for Similarity Search
"... All pivotbased algorithms for similarity search use a set of reference points called pivots. The pivotbased search algorithm precomputes some distances to these reference points, which are used to discard objects during a search without comparing them directly with the query. Though most of the al ..."
Abstract

Cited by 8 (4 self)
 Add to MetaCart
All pivot-based algorithms for similarity search use a set of reference points called pivots. The pivot-based search algorithm precomputes some distances to these reference points, which are used to discard objects during a search without comparing them directly with the query. Though most of the algorithms proposed to date select these reference points at random, previous work has shown the importance of intelligently selecting these points for index performance. However, the proposed pivot selection techniques need to know the complete database beforehand to obtain good results, which inevitably makes the index static. More recent works have addressed this problem, proposing techniques that dynamically select pivots as the database grows. This paper presents a new technique for choosing pivots that combines the good properties of previous proposals with the recently proposed dynamic selection. The experimental evaluation provided in this paper shows that the new technique outperforms state-of-the-art methods for selecting pivots.
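The discarding step rests on the triangle inequality: for any pivot p, |d(q, p) − d(x, p)| ≤ d(q, x), so an object x cannot lie within radius r of the query q if that lower bound exceeds r for some pivot. A minimal sketch of pivot filtering (hypothetical names):

```python
import math

def build_index(db, pivots, dist):
    """Precompute the distance from every database object to every pivot."""
    return [[dist(x, p) for p in pivots] for x in db]

def range_search(q, r, db, pivots, index, dist):
    """Answer a range query, discarding objects via the pivot lower bound
    |d(q,p) - d(x,p)| <= d(q,x) before any direct comparison with q."""
    qp = [dist(q, p) for p in pivots]
    out = []
    for x, xp in zip(db, index):
        if any(abs(a - b) > r for a, b in zip(qp, xp)):
            continue  # discarded by a pivot: no distance to q computed
        if dist(q, x) <= r:
            out.append(x)
    return out

db = [(i / 10, j / 10) for i in range(11) for j in range(11)]
pivots = [(0.0, 0.0), (1.0, 1.0)]
index = build_index(db, pivots, math.dist)
res = range_search((0.5, 0.5), 0.15, db, pivots, index, math.dist)
```

The quality of the pivots determines how tight the lower bound is, and therefore how many objects are discarded without a direct distance computation, which is why selection strategies matter so much for index performance.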
Graphs for Metric Space Searching, 2008
"... The problem of Similarity Searching consists in finding the elements from a set which are similar to a given query under some criterion. If the similarity is expressed by means of a metric, the problem is called Metric Space Searching. In this thesis we present new methodologies to solve this prob ..."
Abstract

Cited by 8 (6 self)
 Add to MetaCart
The problem of Similarity Searching consists in finding the elements from a set which are similar to a given query under some criterion. If the similarity is expressed by means of a metric, the problem is called Metric Space Searching. In this thesis we present new methodologies to solve this problem using graphs G(V, E) to represent the metric database. In G, the set V corresponds to the objects from the metric space and E to a small subset of edges from V × V, whose weights are computed according to the metric of the space under consideration. In particular, we study k-nearest neighbor graphs (knngs). The knng is a weighted graph connecting each element from V — or equivalently, each object from the metric space — to its k nearest neighbors. We develop algorithms both to construct knngs in general metric spaces, and to use
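For small datasets the knng can be built by brute force, which also pins down the definition; the thesis's contribution is constructing it efficiently in general metric spaces, which this sketch (names ours) does not attempt:

```python
import math

def knn_graph(points, k, dist):
    """Brute-force k-nearest-neighbor graph: connect each object to its k
    nearest neighbors, so every vertex has out-degree exactly k."""
    graph = {}
    for i, p in enumerate(points):
        nbrs = sorted((j for j in range(len(points)) if j != i),
                      key=lambda j: dist(p, points[j]))
        graph[i] = nbrs[:k]
    return graph

# Four corners of a unit square plus one distant outlier.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
g = knn_graph(pts, 2, math.dist)
```

Note the graph is directed: the outlier points to its two nearest corners, but no corner points back at it, which is exactly the asymmetry search algorithms over knngs have to cope with.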