Results 1–10 of 20
Similarity-based Learning via Data Driven
Abstract
Cited by 11 (1 self)
We consider the problem of classification using similarity/distance functions over data. Specifically, we propose a framework for defining the goodness of a (dis)similarity function with respect to a given learning task and propose algorithms that have guaranteed generalization properties when working with such good functions. Our framework unifies and generalizes the frameworks proposed by [1] and [2]. An attractive feature of our framework is its adaptability to data: we do not promote a fixed notion of goodness but rather let the data dictate it. We show, by giving theoretical guarantees, that the goodness criterion best suited to a problem can itself be learned, which makes our approach applicable to a variety of domains and problems. We propose a landmarking-based approach to obtaining a classifier from such learned goodness criteria. We then provide a novel diversity-based heuristic to perform task-driven selection of landmark points instead of random selection. We demonstrate the effectiveness of our goodness criteria learning method, as well as the landmark selection heuristic, on a variety of similarity-based learning datasets and benchmark UCI datasets, on which our method consistently outperforms existing approaches by a significant margin.
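As a rough illustration of the landmarking idea described in this abstract, a sketch follows: each example is mapped to its vector of similarities with a small set of landmark points, and a linear classifier can then be trained on those features. The similarity function, the random landmark choice, and the data below are all hypothetical stand-ins; the paper itself learns the goodness criterion and selects landmarks with a diversity-based heuristic.

```python
import numpy as np

def landmark_features(X, landmarks, sim):
    """Map each example to its similarities with a set of landmark points."""
    return np.array([[sim(x, l) for l in landmarks] for x in X])

# An illustrative similarity function (the paper instead learns what "good" means).
def rbf_sim(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
landmarks = X[:4]  # naive random landmark selection, for the sketch only
F = landmark_features(X, landmarks, rbf_sim)
print(F.shape)  # (20, 4): one similarity score per landmark
```

Any off-the-shelf linear learner applied to `F` then plays the role of the classifier built on top of the landmark map.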
Similarity Learning for Provably Accurate Sparse Linear Classification
 In ICML
, 2012
Abstract
Cited by 9 (2 self)
In recent years, the crucial importance of metrics in machine learning algorithms has led to increasing interest in optimizing distance and similarity functions. Most of the state of the art focuses on learning Mahalanobis distances (required to fulfill a positive semidefiniteness constraint) for use in a local kNN algorithm. However, no theoretical link is established between the learned metrics and their performance in classification. In this paper, we make use of the formal framework of (ε, γ, τ)-good similarities introduced by Balcan et al. to design an algorithm for learning a non-PSD linear similarity optimized in a nonlinear feature space, which is then used to build a global linear classifier. We show that our approach has uniform stability and derive a generalization bound on the classification error. Experiments performed on various datasets confirm the effectiveness of our approach compared to state-of-the-art methods and provide evidence that it (i) is fast, (ii) is robust to overfitting, and (iii) produces very sparse classifiers.
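The (ε, γ)-goodness notion underlying this line of work can be checked empirically: a similarity K is good when all but an ε fraction of points have average signed similarity margin at least γ. A minimal sketch of that empirical check, with a hypothetical Gaussian similarity and toy two-cluster data (the papers' actual criteria also involve the landmark fraction τ and learned similarities):

```python
import numpy as np

def goodness_margins(X, y, sim):
    """Empirical margin of each point under a Balcan-style criterion:
    m(x) = average over x' of y(x) * y(x') * K(x, x'), estimated on the sample."""
    n = len(X)
    K = np.array([[sim(X[i], X[j]) for j in range(n)] for i in range(n)])
    return y * (K @ y) / n

def empirical_epsilon(margins, gamma):
    """Fraction of points whose margin falls below gamma (the epsilon of (eps, gamma)-goodness)."""
    return float(np.mean(margins < gamma))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (15, 2)), rng.normal(2, 1, (15, 2))])
y = np.array([-1] * 15 + [1] * 15)
sim = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
m = goodness_margins(X, y, sim)
print(empirical_epsilon(m, gamma=0.01))
```

On well-separated clusters like these, most margins are positive, so the estimated ε is small; a similarity that mixes the classes would drive it toward 1.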
Learning good edit similarities with generalization guarantees
 Machine Learning and Knowledge Discovery in Databases
, 2011
Abstract
Cited by 5 (3 self)
Similarity and distance functions are essential to many learning algorithms, so training them has attracted a lot of interest. When it comes to dealing with structured data (e.g., strings or trees), edit similarities are widely used, and there exist a few methods for learning them. However, these methods offer no theoretical guarantee as to the generalization performance and discriminative power of the resulting similarities. Recently, a theory of learning with (ε, γ, τ)-good similarity functions was proposed. This new theory bridges the gap between the properties of a similarity function and its performance in classification. In this paper, we propose a novel edit similarity learning approach (GESL) driven by the idea of (ε, γ, τ)-goodness, which allows us to derive generalization guarantees using the notion of uniform stability. We experimentally show that edit similarities learned with our method induce classification models that are both more accurate and sparser than those induced by the edit distance or by edit similarities learned with a state-of-the-art method.
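For context, the edit distance that GESL builds on is computed by a standard dynamic program; learned edit similarities replace its uniform operation costs with a learned cost matrix. A sketch of the weighted version with per-operation costs (the uniform defaults below are illustrative, not the learned costs of the paper):

```python
def weighted_edit_distance(s, t, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    """Dynamic program for edit distance with per-operation weights.
    (GESL learns a full per-symbol cost matrix; uniform costs here are a stand-in.)"""
    m, n = len(s), len(t)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + del_cost          # delete all of s[:i]
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins_cost          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0.0 if s[i - 1] == t[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j - 1] + match,  # substitute / match
                          D[i - 1][j] + del_cost,   # delete s[i-1]
                          D[i][j - 1] + ins_cost)   # insert t[j-1]
    return D[m][n]

print(weighted_edit_distance("kitten", "sitting"))  # 3.0 with unit costs
```

An edit *similarity* is then typically obtained by a decreasing transform of this distance, e.g. exp(-distance).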
Theory and algorithms for learning with dissimilarity functions
 Neural Computation
Abstract
Cited by 2 (1 self)
We study the problem of classification when only a dissimilarity function between objects is accessible. That is, data samples are represented not by feature vectors but in terms of their pairwise dissimilarities. We establish sufficient conditions for dissimilarity functions to allow building accurate classifiers. The theory immediately suggests a learning paradigm: construct an ensemble of simple classifiers, each depending on a pair of examples, then find a convex combination of them that achieves a large margin. We next develop a practical algorithm, referred to as Dissimilarity-based Boosting (DBoost), for learning with dissimilarity functions under this theoretical guidance. Experiments on a variety of databases demonstrate that the DBoost algorithm is promising for several dissimilarity measures widely used in practice.
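The ensemble-of-pair-classifiers paradigm in this abstract can be sketched as follows. Each weak classifier depends on one pair of examples (one per class) and votes for whichever prototype the query is less dissimilar to; a convex combination of such votes gives the final prediction. The pair-stump form, the dissimilarity, and the uniform weights below are hypothetical simplifications of DBoost, which learns the weights by boosting:

```python
import numpy as np

def pair_stump(d, x_pos, x_neg):
    """Weak classifier from one example pair: predict +1 if x is less
    dissimilar to the positive prototype than to the negative one."""
    return lambda x: 1.0 if d(x, x_pos) < d(x, x_neg) else -1.0

def convex_vote(stumps, weights, x):
    """Convex combination of weak classifiers; the sign is the prediction."""
    return np.sign(sum(w * h(x) for w, h in zip(weights, stumps)))

d = lambda a, b: abs(a - b)                      # toy 1-D dissimilarity
pos, neg = [2.0, 2.5], [-2.0, -1.5]              # toy prototypes per class
stumps = [pair_stump(d, p, n) for p in pos for n in neg]
weights = [1 / len(stumps)] * len(stumps)        # uniform convex weights (DBoost learns these)
print(convex_vote(stumps, weights, 1.0), convex_vote(stumps, weights, -1.0))
```

Note that `d` needs no metric properties here, which is the point of the dissimilarity-based setting.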
Sparse domain adaptation in projection spaces based on good similarity functions
 In IEEE ICDM
, 2011
Abstract
Cited by 1 (0 self)
Abstract—We address the problem of domain adaptation for binary classification, which arises when the distributions generating the source learning data and the target test data are somewhat different. We consider the challenging case where no labeled target data is available. From a theoretical standpoint, a classifier has better generalization guarantees when the two domain marginal distributions are close. We study a new direction based on a recent framework of Balcan et al. that allows learning linear classifiers in an explicit projection space built from similarity functions that may be neither symmetric nor positive semidefinite. We propose a general method for learning a good classifier on target data with generalization guarantees, and we improve its efficiency with an iterative procedure that reweights the similarity function, while remaining compatible with the Balcan et al. framework, to bring the two distributions closer in a new projection space. Hyperparameters and reweighting quality are controlled by a reverse validation procedure. Our approach is based on a linear programming formulation and shows good adaptation performance with very sparse models. We evaluate it on a synthetic problem and on a real image annotation task.
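A minimal sketch of the quantity this adaptation idea targets: project source and target samples into the same similarity-based landmark space and measure how far apart their marginals sit there. The mean-gap proxy, similarity function, and data below are hypothetical; the paper's method iteratively reweights the similarity to shrink such a gap, under a reverse validation procedure.

```python
import numpy as np

def projection(X, landmarks, sim):
    """Explicit projection space: one similarity coordinate per landmark."""
    return np.array([[sim(x, l) for l in landmarks] for x in X])

def marginal_gap(S, T, landmarks, sim):
    """Crude proxy for the distance between two marginal distributions:
    the gap between the mean source and mean target images in projection space."""
    return float(np.linalg.norm(projection(S, landmarks, sim).mean(axis=0)
                                - projection(T, landmarks, sim).mean(axis=0)))

rng = np.random.default_rng(2)
S = rng.normal(0.0, 1, (30, 2))     # source sample
T = rng.normal(1.5, 1, (30, 2))     # shifted target sample (domain gap)
landmarks = S[:5]
sim = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
gap_same = marginal_gap(S, S, landmarks, sim)
gap_shifted = marginal_gap(S, T, landmarks, sim)
print(gap_same, gap_shifted)
```

Identical samples give a gap of zero, while the shifted target yields a positive gap; reweighting the similarity so this kind of discrepancy shrinks is the intuition behind the iterative procedure.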
Rapid Execution of Weighted Edit Distances
Abstract
The comparison of large numbers of strings plays a central role in ontology matching, record linkage, and link discovery. While several standard string distance and similarity measures have been developed with these explicit goals in mind, similarities and distances learned from the data have been shown to often perform better with respect to the F-measure they can achieve. Still, the practical use of data-specific measures is often hindered by one major factor: their runtime. While time-efficient algorithms that scale to millions of strings have been developed for standard metrics over the last years, data-specific versions of these measures are usually slow to run and require significantly more time for the same task. In this paper, we present an approach for the time-efficient execution of weighted edit distances. Our approach is based on a sequence of efficient filters that reduce the number of candidate pairs for which the weighted edit distance has to be computed. We also show how existing time-efficient deduplication approaches based on the edit distance can be extended to deal with weighted edit distances. We compare our approach with such an extension of PassJoin on benchmark data and show that we outperform it by more than one order of magnitude.
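The filtering strategy described here can be illustrated with the simplest such filter, a length bound: under unit costs, two strings whose lengths differ by more than a threshold θ cannot have edit distance ≤ θ, so the expensive dynamic program can be skipped for that pair. This sketch is a hypothetical simplification; the paper's filters additionally handle weighted costs, where the bound must be scaled by the minimum insertion/deletion weight.

```python
def length_filter(pairs, theta):
    """Cheap pre-filter: keep only pairs whose length difference is within theta.
    If |len(a) - len(b)| > theta, the unit-cost edit distance already exceeds
    theta, so the full edit-distance computation can be skipped for that pair."""
    return [(a, b) for a, b in pairs if abs(len(a) - len(b)) <= theta]

pairs = [("alpha", "alphas"), ("alpha", "a"), ("beta", "betas")]
survivors = length_filter(pairs, theta=1)
print(survivors)  # [('alpha', 'alphas'), ('beta', 'betas')]
```

Chaining several such filters of increasing cost before the exact computation is what keeps the overall matching time-efficient.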