Results 1–10 of 11
Coresets, sparse greedy approximation and the Frank-Wolfe algorithm
 Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms
Abstract

Cited by 30 (1 self)
The problem of maximizing a concave function f(x) in a simplex S can be solved approximately by a simple greedy algorithm. For given k, the algorithm can find a point x(k) on a k-dimensional face of S, such that f(x(k)) ≥ f(x∗) − O(1/k). Here f(x∗) is the maximum value of f in S. This algorithm and analysis were known before, and are related to problems of statistics and machine learning, such as boosting, regression, and density mixture estimation. In other work, coming from computational geometry, the existence of ε-coresets was shown for the minimum enclosing ball problem, by means of a simple greedy algorithm. Similar greedy algorithms, which are special cases of the Frank-Wolfe algorithm, were described for other enclosure problems. Here these results are tied together, stronger convergence results are reviewed, and several coreset bounds are generalized or strengthened.
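As a concrete illustration of the greedy scheme this abstract describes, here is a minimal Frank-Wolfe sketch: each step mixes in the best simplex vertex with step size 2/(k+2), so after k steps the iterate lies on a k-dimensional face and the gap to the optimum is O(1/k). The objective, dimension, and iteration count are illustrative choices, not taken from the paper.

```python
import math

# Toy concave objective on the 5-simplex: f(x) = -||x - c||^2 with c the
# barycenter, so max f = 0 at x* = c. (Illustrative choice, not from the paper.)
d = 5
c = [1.0 / d] * d

def f(x):
    return -sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def grad_f(x):
    return [-2.0 * (xi - ci) for xi, ci in zip(x, c)]

# Frank-Wolfe / sparse greedy: each step moves toward the best vertex e_i,
# so after k steps x(k) lies on a k-dimensional face of the simplex.
x = [1.0] + [0.0] * (d - 1)                 # start at a vertex
for k in range(1, 400):
    g = grad_f(x)
    i = max(range(d), key=lambda j: g[j])   # linear maximization over S
    gamma = 2.0 / (k + 2)                   # standard step size
    x = [(1 - gamma) * xj for xj in x]
    x[i] += gamma

gap = 0.0 - f(x)                            # f(x*) - f(x(k)), O(1/k)
```

The only work per step is one gradient evaluation and one argmax over coordinates, which is why the same template reappears in the boosting and coreset settings the abstract mentions.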
Activized Learning: Transforming Passive to Active with Improved Label Complexity
Abstract

Cited by 8 (4 self)
Active learning methods often achieve improved performance using fewer labels compared to passive learning methods. A variety of practically successful active learning algorithms use a passive learning algorithm as a subroutine, and the essential role of the active component is to construct data sets to feed into the passive subroutine. This general idea is appealing for a variety of reasons, as it may be able …
Coresets for Polytope Distance
Abstract

Cited by 5 (2 self)
Following recent work of Clarkson, we translate the coreset framework to the problems of finding the point closest to the origin inside a polytope, finding the shortest distance between two polytopes, Perceptrons, and soft- as well as hard-margin Support Vector Machines (SVM). We prove asymptotically matching upper and lower bounds on the size of coresets, stating that ε-coresets of size ⌈(1 + o(1))E∗/ε⌉ always exist as ε → 0, and that this is best possible. The crucial quantity E∗ is what we call the excentricity of a polytope, or of a pair of polytopes. Additionally, we prove linear convergence speed of Gilbert's algorithm, one of the earliest known approximation algorithms for polytope distance, and generalize both the algorithm and the proof to the two-polytope case. Interestingly, our coreset bounds also imply that we can, for the first time, prove matching upper and lower bounds for the sparsity of Perceptron and SVM solutions.
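Gilbert's algorithm, whose convergence this abstract analyzes, can be sketched in a few lines: repeatedly pick the polytope vertex that most improves the current iterate, then line-search toward it. The 2-D point set below is a made-up example whose hull avoids the origin; its closest hull point is (2, 1), at distance √5.

```python
import math

# Made-up polytope (vertex list); conv(P) does not contain the origin.
P = [(2.0, 1.0), (3.0, -1.0), (2.5, 2.0)]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

x = list(P[2])                        # start at any vertex of the polytope
for _ in range(500):
    # Support point: the vertex minimizing <x, p> (best improving vertex).
    p = min(P, key=lambda q: dot(x, q))
    d = (p[0] - x[0], p[1] - x[1])
    denom = dot(d, d)
    if denom < 1e-18:                 # no progress possible: converged
        break
    # Exact line search for min ||x + t d|| over t in [0, 1].
    t = max(0.0, min(1.0, -dot(x, d) / denom))
    x = [x[0] + t * d[0], x[1] + t * d[1]]

dist = math.hypot(x[0], x[1])         # distance from origin to conv(P)
```

Each iteration touches only one support point, which is what makes the coreset interpretation in the abstract possible: the vertices ever selected form a small witness set for the distance.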
Structured prediction by joint kernel support estimation
 Machine Learning
, 2009
Abstract

Cited by 5 (2 self)
Discriminative techniques, such as conditional random fields (CRFs) or structure-aware maximum-margin techniques (maximum margin Markov networks (M3N), structured output support vector machines (SSVM)), are state-of-the-art in the prediction of structured data. However, to achieve good results these techniques require complete and reliable ground truth, which is not always available in realistic problems. Furthermore, training either CRFs or margin-based techniques is computationally costly, because the runtime of current training methods depends not only on the size of the training set but also on properties of the output space to which the training samples are assigned. We propose an alternative model for structured output prediction, Joint Kernel Support Estimation (JKSE), which is rather generative in nature, as it relies on estimating the joint probability density of samples and labels in the training set. This makes it tolerant against incomplete or incorrect labels and also opens the possibility of learning in situations where more than one output label can be considered correct. At the same time, we avoid typical problems of generative models, as we do not attempt to learn the full joint probability distribution, but model only its support in a joint reproducing …
Active Learning as Non-Convex Optimization
Abstract

Cited by 2 (0 self)
We propose a new view of active learning algorithms as optimization. We show that many online active learning algorithms can be viewed as stochastic gradient descent on non-convex objective functions. Variations of some of these algorithms and objective functions have been previously proposed without noting this connection. We also point out a connection between the standard min-margin offline active learning algorithm and non-convex losses. Finally, we discuss and show empirically how viewing active learning as non-convex loss minimization helps explain two previously observed phenomena: certain active learning algorithms achieve better generalization error than passive learning algorithms on certain data sets (Schohn and Cohn, 2000; Bordes et al., 2005), and on other data sets many active learning algorithms are prone to local minima (Schütze et al., 2006).
Streamed Learning: One-Pass SVMs
Abstract

Cited by 1 (0 self)
We present a streaming model for large-scale classification (in the context of ℓ2-SVM) by leveraging connections between learning and computational geometry. The streaming model imposes the constraint that only a single pass over the data is allowed. The ℓ2-SVM is known to have an equivalent formulation in terms of the minimum enclosing ball (MEB) problem, and an efficient algorithm based on the idea of core sets exists (CVM) [Tsang et al., 2005]. CVM learns a (1+ε)-approximate MEB for a set of points and yields an approximate solution to the corresponding SVM instance. However, CVM works in batch mode, requiring multiple passes over the data. This paper presents a single-pass SVM which is based on the minimum enclosing ball of streaming data. We show that the MEB updates for the streaming case can be easily adapted to learn the SVM weight vector in a way similar to using online stochastic gradient updates. Our algorithm performs polylogarithmic computation per example, and requires very small and constant storage. Experimental results show that, even in such restrictive settings, we can learn efficiently in just one pass and get accuracies comparable to other state-of-the-art SVM solvers (batch and online). We also give an analysis of the algorithm, and discuss some open issues and possible extensions.
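To convey the single-pass MEB flavor the abstract builds on, here is a classic streaming sketch: keep a center and radius, and when a point falls outside the ball, shift the center toward it and grow the radius just enough to cover it. This simple rule is a well-known streaming 3/2-approximation for MEB; it is shown only as an illustration and is not the paper's algorithm.

```python
import math

def streaming_meb(points):
    """One-pass enclosing-ball sketch (classic 3/2-approximation)."""
    it = iter(points)
    c = list(next(it))                 # center starts at the first point
    r = 0.0
    for p in it:
        d = math.dist(c, p)
        if d > r:                      # p lies outside the current ball
            shift = (d - r) / (2.0 * d)
            c = [ci + shift * (pi - ci) for ci, pi in zip(c, p)]
            r = (d + r) / 2.0          # smallest ball covering old ball and p
    return c, r

# Usage: four points on the unit circle; optimal MEB is radius 1 about 0.
c, r = streaming_meb([(1, 0), (-1, 0), (0, 1), (0, -1)])
```

Per-point work and storage are constant, mirroring the "very small and constant storage" regime the paper targets, though its actual update rule is tied to the SVM weight vector.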
A Linearly Convergent Linear-Time First-Order Algorithm for Support Vector Classification with a Core Set Result
Abstract
We present a simple, first-order approximation algorithm for the support vector classification problem. Given a pair of linearly separable data sets and ε ∈ (0, 1), the proposed algorithm computes a separating hyperplane whose margin is within a factor of (1 − ε) of that of the maximum-margin separating hyperplane. We discuss how our algorithm can be extended to nonlinearly separable and inseparable data sets. The running time of our algorithm is linear in the number of data points and in 1/ε. In particular, the number of support vectors computed by the algorithm is bounded above by O(ζ/ε) for all sufficiently small ε > 0, where ζ is the square of the ratio of the distances between the farthest and closest points in the two data sets. Furthermore, we establish that our algorithm exhibits linear convergence. We adopt the real number model of computation in our analysis.
New Approximation Algorithms for Minimum Enclosing Convex Shapes
Abstract
Given n points in a d-dimensional Euclidean space, the Minimum Enclosing Ball (MEB) problem is to find the ball with the smallest radius which contains all n points. We give two approximation algorithms for producing an enclosing ball whose radius is at most ε away from the optimum. The first requires O(ndL/√ε) effort, where L is a constant that depends on the scaling of the data. The second is an O∗(ndQ/√ε) approximation algorithm, where Q is an upper bound on the norm of the points. This is in contrast with coreset-based algorithms, which yield an O(nd/ε) greedy algorithm. Finding the Minimum Enclosing Convex Polytope (MECP) is a related problem wherein a convex polytope of a fixed shape is given and the aim is to find the smallest magnification of the polytope which encloses the given points. For this problem we present O(mndL/ε) and O∗(mndQ/ε) approximation algorithms, where m is the number of faces of the polytope. Our algorithms borrow heavily from convex duality and recently developed techniques in nonsmooth optimization, and are in contrast with existing methods which rely on geometric arguments. In particular, we specialize the excessive gap framework of Nesterov [19] to obtain our results.
Interactive Learning Protocols for Natural Language Applications
, 2009
Abstract
Statistical machine learning has become an integral technology for solving many informatics applications. In particular, corpus-based statistical techniques have emerged as the dominant paradigm for core natural language processing (NLP) tasks such as parsing, machine translation, and information extraction, amongst others. However, while supervised machine learning is well understood, its successful application to practical scenarios is predicated on obtaining large annotated corpora and performing significant feature engineering, both notably expensive undertakings. Interactive learning protocols offer one promising solution for reducing these costs by allowing the learner and domain expert to interact during learning, in an effort to both reduce sample complexity and improve system performance. By specifying a method by which the learner may request targeted information, the domain expert is focused on providing the most useful information. This work formalizes a general framework for interactive learning and examines two interactive learning protocols, with particular attention to natural language scenarios. We first examine active learning for structured output spaces, the scenario where there are multiple predictions which must be composed into a structurally coherent global prediction. Secondly, we examine active learning for pipeline models, where a complex prediction is decomposed into a sequence of predictions …
Unsupervised SVMs: On the complexity of the Furthest Hyperplane Problem
 Journal of Machine Learning Research
Abstract
This paper introduces the Furthest Hyperplane Problem (FHP), which is an unsupervised counterpart of Support Vector Machines. Given a set of n points in R^d, the objective is to produce the hyperplane (passing through the origin) which maximizes the separation margin, that is, the minimal distance between the hyperplane and any input point. To the best of our knowledge, this is the first paper achieving provable results regarding FHP. We provide both lower and upper bounds for this NP-hard problem. First, we give a simple randomized algorithm whose running time is n^O(1/θ²), where θ is the optimal separation margin. We show that its exponential dependency on 1/θ² is tight, up to subpolynomial factors, assuming SAT cannot be solved in subexponential time. Next, we give an efficient approximation algorithm. For any α ∈ [0, 1], the algorithm produces a hyperplane whose distance from at least a 1 − 3α fraction of the points is at least α times the optimal separation margin. Finally, we show that FHP does not admit a PTAS, by presenting a gap-preserving reduction from a particular version of the PCP theorem.
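To make the FHP objective concrete, here is a naive random-search sketch: sample random unit normals w and keep the one maximizing min_i |⟨w, x_i⟩|. This only illustrates the quantity being maximized; it is far weaker than the paper's n^O(1/θ²) algorithm, and the point set is made up.

```python
import math
import random

random.seed(0)
# Made-up points clustered near (1, 0), so a hyperplane with normal close
# to (1, 0) keeps all of them far away (margin about 0.9).
X = [(1.0, 0.2), (0.9, -0.1), (1.1, 0.0)]

best_w, best_margin = None, -1.0
for _ in range(1000):
    a = random.uniform(0.0, 2.0 * math.pi)
    w = (math.cos(a), math.sin(a))            # random unit normal
    # FHP objective: minimal (unsigned) distance from any point to the
    # hyperplane through the origin with normal w.
    margin = min(abs(w[0] * x0 + w[1] * x1) for x0, x1 in X)
    if margin > best_margin:
        best_margin, best_w = margin, w
```

Note the absolute value: unlike supervised SVMs, no labels fix the side of the hyperplane, which is exactly the unsupervised twist that makes the problem NP-hard.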