Results 1–10 of 174
Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions
 INTERNATIONAL JOURNAL OF MATHEMATICAL MODELS AND METHODS IN APPLIED SCIENCES
, 2007
Abstract

Cited by 71 (0 self)
Distance or similarity measures are essential for solving many pattern recognition problems such as classification, clustering, and retrieval. Various distance/similarity measures applicable to comparing two probability density functions (pdfs for short) are reviewed and categorized in both syntactic and semantic relationships. A correlation coefficient and a hierarchical clustering technique are adopted to reveal similarities among the numerous distance/similarity measures.
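The survey catalogues measures such as the Kullback-Leibler divergence and the Hellinger distance. As a minimal sketch (illustrative, not code from the paper), both can be computed for discrete pdfs as follows:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete pdfs.
    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hellinger(p, q):
    """Hellinger distance, a true metric bounded in [0, 1]."""
    s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q), hellinger(p, q))
```

Note that KL divergence is not symmetric and hence not a metric, which is exactly the kind of property such a survey can use to categorize measures.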
Large-Scale Malware Indexing Using Function-Call Graphs
Abstract

Cited by 46 (0 self)
A major challenge for the anti-virus (AV) industry is how to effectively process the huge influx of malware samples it receives every day. One possible solution to this problem is to quickly determine whether a new malware sample is similar to any previously seen malware program. In this paper, we design, implement, and evaluate a malware database management system called SMIT (Symantec Malware Indexing Tree) that can efficiently make such a determination based on malware's function-call graphs, a structural representation known to be less susceptible to the instruction-level obfuscations commonly employed by malware writers to evade detection by AV software. Because each malware program is represented as a graph, the problem of searching a database for the malware program most similar to a given sample is cast as a nearest-neighbor search problem in a graph database.
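SMIT itself uses a graph-similarity measure rooted in graph edit distance; as a much simpler toy sketch of comparing function-call graphs (purely illustrative, not the paper's method), one can take the Jaccard overlap of the labelled call edges:

```python
def callgraph_similarity(edges_a, edges_b):
    """Toy similarity between two function-call graphs given as iterables of
    (caller, callee) edges: Jaccard overlap of the labelled edge sets.
    A system like SMIT relies on a far more robust graph-edit-distance-based
    measure; this overlap only illustrates the structural comparison idea."""
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

g1 = [("main", "parse"), ("parse", "read"), ("main", "write")]
g2 = [("main", "parse"), ("parse", "read"), ("main", "send")]
print(callgraph_similarity(g1, g2))  # 2 shared edges out of 4 total -> 0.5
```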
Effective Proximity Retrieval by Ordering Permutations
, 2007
Abstract

Cited by 30 (5 self)
We introduce a new probabilistic proximity search algorithm for range and K-nearest neighbor (KNN) searching in both coordinate and metric spaces. Although there exist solutions for these problems, they boil down to a linear scan when the space is intrinsically high-dimensional, as is the case in many pattern recognition tasks. This, for example, renders the KNN approach to classification rather slow in large databases. Our novel idea is to predict closeness between elements according to how they order their distances towards a distinguished set of anchor objects. Each element in the space sorts the anchor objects from closest to farthest, and the similarity between these orders turns out to be an excellent predictor of the closeness between the corresponding elements. We present extensive experiments comparing our method against state-of-the-art exact and approximate techniques, on both synthetic and real, metric and non-metric databases, measuring both CPU time and distance computations. The experiments demonstrate that our technique almost always improves upon the performance of alternative techniques, in some cases by a wide margin.
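The core prediction step can be sketched in a few lines (a minimal reading of the idea, here using the Spearman footrule as the order-similarity; which order metric works best is an empirical question the paper itself studies):

```python
def anchor_order(x, anchors, dist):
    """Permutation of anchor indices, sorted from closest to farthest from x."""
    return sorted(range(len(anchors)), key=lambda i: dist(x, anchors[i]))

def footrule(order_a, order_b):
    """Spearman footrule: total displacement of each anchor between two orders.
    A small footrule means the two elements see the anchors similarly,
    which predicts that the elements themselves are close."""
    pos_b = {a: i for i, a in enumerate(order_b)}
    return sum(abs(i - pos_b[a]) for i, a in enumerate(order_a))

anchors = [0.0, 5.0, 10.0]
dist = lambda a, b: abs(a - b)
print(footrule(anchor_order(1.0, anchors, dist),
               anchor_order(9.0, anchors, dist)))  # prints 4
```

At query time, database elements would be ranked by footrule against the query's permutation, and only a small prefix of that ranking verified with real distance computations.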
M-Chord: A scalable distributed similarity search structure
 In Proceedings of INFOSCALE 2006, Hong Kong, 2006
Abstract

Cited by 29 (14 self)
The need for retrieval based not on attribute values but on the actual data content has recently led to the rise of metric-based similarity search. The computational complexity of such retrieval and the large volumes of processed data call for distributed processing, which makes scalability achievable. In this paper, we propose M-Chord, a distributed data structure for metric-based similarity search. The structure takes advantage of the idea of the vector index method iDistance in order to transform the issue of similarity searching into the problem of interval search in one dimension. The proposed peer-to-peer organization, based on the Chord protocol, distributes the storage space and parallelizes the execution of similarity queries. The promising features of the structure are validated by experiments on a prototype implementation and two real-life datasets.
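The iDistance-style one-dimensional mapping the structure builds on can be sketched as follows (an illustrative reconstruction, not the authors' code; `c` is a stretching constant assumed to exceed every distance in the space so the key ranges of different pivots never overlap):

```python
def idistance_key(obj, pivots, dist, c):
    """One-dimensional iDistance key for obj: the object is assigned to its
    closest pivot i and mapped to i * c + dist(obj, pivots[i]), so a
    similarity range query becomes a union of interval searches on this key."""
    i = min(range(len(pivots)), key=lambda j: dist(obj, pivots[j]))
    return i * c + dist(obj, pivots[i])

pivots = [0.0, 100.0]
dist = lambda a, b: abs(a - b)
print(idistance_key(3.0, pivots, dist, 1000))   # closest to pivot 0 -> 3.0
print(idistance_key(97.0, pivots, dist, 1000))  # closest to pivot 1 -> 1003.0
```

In the distributed setting described by the abstract, such one-dimensional keys are then placed on the Chord ring, so each peer becomes responsible for an interval of the key space.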
On scalability of the similarity search in the world of peers
 In Proceedings of First International Conference on Scalable Information Systems (INFOSCALE 2006), Hong Kong, May 30
Abstract

Cited by 27 (20 self)
Due to the increasing complexity of current digital data, similarity search has become a fundamental computational task in many applications. Unfortunately, its costs are still high, and the linear scalability of single-server implementations prevents efficient searching in large data volumes. In this paper, we briefly describe four recent scalable distributed similarity search techniques and study their performance when executing queries on three different datasets. Though all the methods employ parallelism to speed up query execution, the experiments identified different advantages for different objectives. The reported results can be exploited for choosing the best implementation for a specific application. They can also be used for designing new and better indexing structures in the future.
Query answering and ontology population: An inductive approach
 IN PROC. ESWC2008
Abstract

Cited by 22 (14 self)
In order to overcome the limitations of deductive logic-based approaches to deriving operational knowledge from ontologies, especially when data come from distributed sources, inductive (instance-based) methods may be better suited, since they are usually efficient and noise-tolerant. In this paper we propose an inductive method for improving instance retrieval and enriching ontology population. By casting retrieval as a classification problem with the goal of assessing individual class-memberships w.r.t. the query concepts, we propose an extension of the k-Nearest Neighbor algorithm for OWL ontologies based on an entropic distance measure. The procedure can classify individuals w.r.t. the known concepts, but it can also be used to retrieve individuals belonging to query concepts. Experimentally, we show that the behavior of the classifier is comparable with that of a standard reasoner. Moreover, we show that new knowledge (not logically derivable) is induced; it can be suggested to the knowledge engineer for validation during the ontology population task.
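Stripped of the entropic distance measure (the paper's key ingredient), the retrieval-as-classification step reduces to a vote among the k nearest training individuals. A minimal sketch, with hypothetical labels +1 (member) and -1 (non-member); under the open-world assumption a third "unknown" label would also be possible:

```python
from collections import Counter

def knn_membership(query, examples, dist, k):
    """Classify the class-membership of `query` by majority vote over the k
    training individuals nearest under `dist`. `examples` is a list of
    (individual, label) pairs; any hashable labels work."""
    nearest = sorted(examples, key=lambda e: dist(query, e[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

examples = [(1.0, +1), (1.2, +1), (5.0, -1), (5.5, -1)]
dist = lambda a, b: abs(a - b)
print(knn_membership(1.1, examples, dist, 3))  # prints 1
```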
MESSIF: Metric similarity search implementation framework
 In DELOS’07
, 2007
Abstract

Cited by 15 (7 self)
Similarity search has become a fundamental computational task in many applications. One of the mathematical models of similarity – the metric space – has drawn the attention of many researchers, resulting in several sophisticated metric-indexing techniques. An important part of research in this area is typically a prototype implementation and subsequent experimental evaluation of the proposed data structure. This paper describes an implementation framework called MESSIF that eases the task of building such prototypes. It provides a number of modules, from basic storage management to automatic collection of performance statistics. Due to its open and modular design, it is also easy to implement additional modules if necessary. MESSIF also offers several ready-to-use generic clients that allow one to control and test the index structures and measure their performance.
Spatial selection of sparse pivots for similarity search in metric spaces
 IN: SOFSEM 2007: 33RD CONFERENCE ON CURRENT TRENDS IN THEORY AND PRACTICE OF COMPUTER SCIENCE. LNCS (4362)
, 2007
Abstract

Cited by 15 (4 self)
Similarity search is a fundamental operation for applications that deal with unstructured data sources. In this paper we propose a new pivot-based method for similarity search, called Sparse Spatial Selection (SSS). The main characteristic of this method is that it guarantees a good pivot selection more efficiently than previously proposed methods. In addition, SSS adapts itself to the dimensionality of the metric space we are working with, without it being necessary to specify the number of pivots in advance. Furthermore, SSS is dynamic: it can support object insertions in the database efficiently, it can work with both continuous and discrete distance functions, and it is suitable for secondary memory storage. In this work we provide experimental results that confirm the advantages of the method on several vector and metric spaces. We also show that the efficiency of our proposal is similar to that of existing alternatives over vector spaces, although it is better over general metric spaces.
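The selection rule at the heart of SSS fits in a few lines (an illustrative reconstruction, not the authors' code; `M` is the estimated maximum distance of the space, and the `alpha=0.35` default is an assumption of this sketch rather than a value taken from the abstract):

```python
def sss_pivots(objects, dist, M, alpha=0.35):
    """Sparse Spatial Selection sketch: scan the collection once, promoting an
    object to pivot whenever it lies farther than alpha * M from every pivot
    chosen so far. The pivot count thus adapts to the space rather than being
    fixed in advance, and later insertions can extend the pivot set, which is
    what makes the index dynamic."""
    pivots = []
    for o in objects:
        if all(dist(o, p) > alpha * M for p in pivots):
            pivots.append(o)
    return pivots

points = [0.0, 0.1, 5.0, 5.1, 10.0]
print(sss_pivots(points, lambda a, b: abs(a - b), M=10.0))  # [0.0, 5.0, 10.0]
```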
A Dynamic Pivot Selection Technique for Similarity Search
Abstract

Cited by 14 (7 self)
All pivot-based algorithms for similarity search use a set of reference points called pivots. The pivot-based search algorithm precomputes some distances to these reference points, which are used to discard objects during a search without comparing them directly with the query. Though most of the algorithms proposed to date select these reference points at random, previous works have shown the importance of intelligently selecting these points for index performance. However, the proposed pivot selection techniques need to know the complete database beforehand to obtain good results, which inevitably makes the index static. More recent works have addressed this problem, proposing techniques that dynamically select pivots as the database grows. This paper presents a new technique for choosing pivots that combines the good properties of previous proposals with the recently proposed dynamic selection. The experimental evaluation provided in this paper shows that the new technique outperforms state-of-the-art methods for selecting pivots.
Local Learning Based Feature Selection for High Dimensional Data Analysis
Abstract

Cited by 14 (2 self)
This paper considers feature selection for data classification in the presence of a huge number of irrelevant features. We propose a new feature selection algorithm that addresses several major issues with prior work, including problems with algorithm implementation, computational complexity, and solution accuracy. The key idea is to decompose an arbitrarily complex nonlinear problem into a set of locally linear ones through local learning, and then learn feature relevance globally within the large-margin framework. The proposed algorithm is based on well-established machine learning and numerical analysis techniques, without making any assumptions about the underlying data distribution. It is capable of processing many thousands of features within minutes on a personal computer, while maintaining a very high accuracy that is nearly insensitive to a growing number of irrelevant features. Theoretical analyses of the algorithm's sample complexity suggest that the algorithm has a logarithmic sample complexity with respect to the number of features. Experiments on eleven synthetic and real-world data sets demonstrate the viability of our formulation of the feature selection problem for supervised learning and the effectiveness of our algorithm. Index Terms: feature selection, local learning, logistic regression, ℓ1 regularization, sample complexity.
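The paper's algorithm learns feature weights within a large-margin framework; as a far simpler relative of the same local learning idea (a Relief-style sketch, not the authors' method), each feature can be scored by how well it separates every sample from its nearest opposite-class neighbour relative to its nearest same-class neighbour:

```python
def relief_weights(X, y):
    """Relief-style local learning sketch: for each sample, find its nearest
    neighbour of the same class (hit) and of a different class (miss) under
    L1 distance, and credit each feature f with |x_f - miss_f| - |x_f - hit_f|.
    Relevant features accumulate positive weight; noise features do not."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for i in range(n):
        l1 = lambda j: sum(abs(X[i][f] - X[j][f]) for f in range(d))
        hit = min((j for j in range(n) if j != i and y[j] == y[i]), key=l1)
        miss = min((j for j in range(n) if y[j] != y[i]), key=l1)
        for f in range(d):
            w[f] += abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])
    return w

# Feature 0 tracks the class label; feature 1 is noise.
X = [[0.0, 5.0], [0.1, 9.0], [1.0, 6.0], [1.1, 8.0]]
y = [0, 0, 1, 1]
print(relief_weights(X, y))  # feature 0 outscores feature 1
```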