Training Linear SVMs in Linear Time (2006)

by Thorsten Joachims
Results 1 - 10 of 549

LIBLINEAR: A Library for Large Linear Classification

by Rong-en Fan, Kai-wei Chang, Cho-jui Hsieh, Xiang-rui Wang, Chih-jen Lin, 2008
"... LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced u ..."
Abstract - Cited by 1416 (41 self) - Add to MetaCart
LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.
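The abstract above describes LIBLINEAR as a library exposed through command-line tools and library calls. As a minimal, hedged illustration of the kind of large-scale linear classification it targets, the sketch below trains an L2-regularized linear SVM through scikit-learn's LinearSVC, which calls the LIBLINEAR solver internally; the synthetic data and parameter values are placeholders, not taken from the paper.

    # Minimal sketch (not from the paper): linear SVM training through
    # scikit-learn's LinearSVC, which uses the LIBLINEAR solver internally.
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=10000, n_features=200, random_state=0)
    clf = LinearSVC(C=1.0)          # L2-regularized linear SVM (LIBLINEAR backend)
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))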

Citation Context

...er such as LIBSVM would take several hours. Moreover, LIBLINEAR is competitive with or even faster than state of the art linear classifiers such as Pegasos (Shalev-Shwartz et al., 2007) and SVM perf (Joachims, 2006). The software is available at http://www.csie.ntu.edu.tw/~cjlin/liblinear. This article is organized as follows. In Sections 2 and 3, we discuss the design and implementation of LIBLINEAR. We show ...

Pegasos: Primal Estimated sub-gradient solver for SVM

by Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, Andrew Cotter
"... We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ɛ is Õ(1/ɛ), where each iteration operates on a singl ..."
Abstract - Cited by 542 (20 self) - Add to MetaCart
We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ɛ is Õ(1/ɛ), where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require Ω(1/ɛ²) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is Õ(d/(λɛ)), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.
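The abstract summarizes the Pegasos update: at iteration t one training example is drawn at random, a sub-gradient step with learning rate 1/(λt) is taken on the regularized hinge loss, and the iterate is optionally projected onto the ball of radius 1/√λ. The sketch below is a plain-NumPy rendering of that update under those assumptions; the function name, dense data layout, and iteration count are illustrative.

    import numpy as np

    def pegasos(X, y, lam=0.01, n_iter=100000, seed=0):
        """Sketch of the Pegasos single-example sub-gradient update.
        X: (n, d) array, y: labels in {-1, +1}, lam: regularization parameter."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, n_iter + 1):
            i = rng.integers(n)                  # draw one training example
            eta = 1.0 / (lam * t)                # step size 1/(lambda * t)
            if y[i] * X[i].dot(w) < 1:           # margin violated: hinge sub-gradient
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * lam) * w
            norm = np.linalg.norm(w)             # optional projection onto the
            if norm > 1 / np.sqrt(lam):          # ball of radius 1/sqrt(lambda)
                w *= 1 / (np.sqrt(lam) * norm)
        return w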

Anomaly Detection: A Survey

by Varun Chandola, Arindam Banerjee, Vipin Kumar, 2007
"... Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and c ..."
Abstract - Cited by 540 (5 self) - Add to MetaCart
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.

Improving the Fisher kernel for large-scale image classification.

by Florent Perronnin, Jorge Sánchez, Thomas Mensink - In ECCV, 2010
"... Abstract. The Fisher kernel (FK) is a generic framework which combines the benefits of generative and discriminative approaches. In the context of image classification the FK was shown to extend the popular bag-of-visual-words (BOV) by going beyond count statistics. However, in practice, this enric ..."
Abstract - Cited by 362 (20 self) - Add to MetaCart
Abstract. The Fisher kernel (FK) is a generic framework which combines the benefits of generative and discriminative approaches. In the context of image classification the FK was shown to extend the popular bag-of-visual-words (BOV) by going beyond count statistics. However, in practice, this enriched representation has not yet shown its superiority over the BOV. In the first part we show that with several well-motivated modifications over the original framework we can boost the accuracy of the FK. On PASCAL VOC 2007 we increase the Average Precision (AP) from 47.9% to 58.3%. Similarly, we demonstrate state-of-the-art accuracy on CalTech 256. A major advantage is that these results are obtained using only SIFT descriptors and costless linear classifiers. Equipped with this representation, we can now explore image classification on a larger scale. In the second part, as an application, we compare two abundant resources of labeled images to learn classifiers: ImageNet and Flickr groups. In an evaluation involving hundreds of thousands of training images we show that classifiers learned on Flickr groups perform surprisingly well (although they were not intended for this purpose) and that they can complement classifiers learned on more carefully annotated datasets.
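Among the modifications the paper applies to the original Fisher kernel framework are a signed power normalization and an L2 normalization of the Fisher vector. The sketch below shows only those two normalization steps applied to an already-computed Fisher vector; the function name and the default exponent of 0.5 are illustrative assumptions, not code from the paper.

    import numpy as np

    def normalize_fisher_vector(fv, alpha=0.5):
        """Sketch: signed power normalization followed by L2 normalization,
        applied to an already-computed Fisher vector `fv`."""
        fv = np.sign(fv) * np.abs(fv) ** alpha   # power ("signed square-root") step
        norm = np.linalg.norm(fv)
        return fv / norm if norm > 0 else fv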

Citation Context

...O(N²) and O(N³) – where N is the number of training images – and becomes impractical for N in the tens or hundreds of thousands. This is in contrast with linear SVMs whose training cost is in O(N) [11,12] and which can therefore be efficiently learned with large quantities of images [13]. However linear SVMs have been repeatedly reported to be inferior to non-linear SVMs on BOV histograms [14–17]. Sev...

Random features for large-scale kernel machines

by Ali Rahimi, Ben Recht - In Neural Information Processing Systems, 2007
"... To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. Our randomized features are designed so that the inner products of the transformed data are approximately equal to those in the f ..."
Abstract - Cited by 258 (4 self) - Add to MetaCart
To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. Our randomized features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shift-invariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms that use these features outperform state-of-the-art large-scale kernel machines.
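For a shift-invariant kernel such as the Gaussian/RBF kernel, the random Fourier features described above draw frequencies from the kernel's Fourier transform and random phases, so that inner products of the transformed data approximate the kernel. A minimal NumPy sketch for the RBF kernel exp(-γ‖x − y‖²) follows; the feature count D and the parameter names are illustrative assumptions.

    import numpy as np

    def random_fourier_features(X, D=500, gamma=1.0, seed=0):
        """Sketch: random Fourier features z(x) such that z(x).dot(z(y))
        approximates the RBF kernel exp(-gamma * ||x - y||^2)."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))  # frequencies drawn from the kernel's Fourier transform
        b = rng.uniform(0, 2 * np.pi, size=D)                  # random phases
        return np.sqrt(2.0 / D) * np.cos(X @ W + b)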

Efficient Additive Kernels via Explicit Feature Maps

by Andrea Vedaldi, Andrew Zisserman
"... Maji and Berg [13] have recently introduced an explicit feature map approximating the intersection kernel. This enables efficient learning methods for linear kernels to be applied to the non-linear intersection kernel, expanding the applicability of this model to much larger problems. In this paper ..."
Abstract - Cited by 245 (9 self) - Add to MetaCart
Maji and Berg [13] have recently introduced an explicit feature map approximating the intersection kernel. This enables efficient learning methods for linear kernels to be applied to the non-linear intersection kernel, expanding the applicability of this model to much larger problems. In this paper we generalize this idea, and analyse a large family of additive kernels, called homogeneous, in a unified framework. The family includes the intersection, Hellinger’s, and χ² kernels commonly employed in computer vision. Using the framework we are able to: (i) provide explicit feature maps for all homogeneous additive kernels along with closed form expression for all common kernels; (ii) derive corresponding approximate finite-dimensional feature maps based on the Fourier sampling theorem; and (iii) quantify the extent of the approximation. We demonstrate that the approximations have indistinguishable performance from the full kernel on a number of standard datasets, yet greatly reduce the train/test times of SVM implementations. We show that the χ² kernel, which has been found to yield the best performance in most applications, also has the most compact feature representation. Given these train/test advantages we are able to obtain a significant performance improvement over current state of the art results based on the intersection kernel.
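scikit-learn ships a sampled approximate feature map for the additive χ² kernel as AdditiveChi2Sampler, in the spirit of the maps described above; chaining it with a linear SVM reduces additive-kernel classification to linear training. The sketch below uses that estimator on a small built-in, histogram-like dataset purely as an illustration; the dataset and parameter values are not from the paper.

    # Illustrative only: approximate additive chi^2 feature map + linear SVM.
    from sklearn.datasets import load_digits
    from sklearn.kernel_approximation import AdditiveChi2Sampler
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    X, y = load_digits(return_X_y=True)          # non-negative, histogram-like features
    clf = make_pipeline(AdditiveChi2Sampler(sample_steps=2), LinearSVC(C=1.0))
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))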

Citation Context

...th algorithms optimised for the linear case, including standard SVM solvers such as LIBSVM [38], stochastic gradient algorithms, on-line algorithms, and cutting-plane algorithms for structural models [6]. These algorithms apply unchanged; however, if data storage is a concern, the homogeneous kernel map can be computed on the fly inside the solver due to its speed. Compared to the addKPCA features of...

A dual coordinate descent method for large-scale linear SVM.

by Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, S. Sundararajan - In ICML, 2008
"... Abstract In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) is one of the most popular tools to deal with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1-and L2-l ..."
Abstract - Cited by 207 (20 self) - Add to MetaCart
In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) is one of the most popular tools to deal with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1- and L2-loss functions. The proposed method is simple and reaches an ε-accurate solution in O(log(1/ε)) iterations. Experiments indicate that our method is much faster than state of the art solvers such as Pegasos, TRON, SVM perf, and a recent primal coordinate descent implementation.
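For the L1-loss case, the dual is a box-constrained quadratic, and the method updates one dual variable α_i at a time while maintaining w = Σ_i α_i y_i x_i, so each update costs on the order of the number of non-zeros in x_i. The sketch below is a dense NumPy rendering of that basic coordinate update, without the shrinking heuristics of the full method; the function name and defaults are illustrative.

    import numpy as np

    def dual_cd_l1_svm(X, y, C=1.0, n_epochs=10, seed=0):
        """Sketch of dual coordinate descent for the L1-loss linear SVM dual:
        min_a 0.5 a^T Q a - e^T a subject to 0 <= a_i <= C, Q_ij = y_i y_j x_i^T x_j.
        Maintains w = sum_i a_i y_i x_i so each coordinate update is cheap."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        alpha = np.zeros(n)
        w = np.zeros(d)
        Qii = np.einsum('ij,ij->i', X, X)        # diagonal of Q, i.e. x_i^T x_i
        for _ in range(n_epochs):
            for i in rng.permutation(n):
                if Qii[i] == 0:
                    continue
                G = y[i] * X[i].dot(w) - 1.0     # gradient of the dual along coordinate i
                new_alpha = min(max(alpha[i] - G / Qii[i], 0.0), C)
                w += (new_alpha - alpha[i]) * y[i] * X[i]
                alpha[i] = new_alpha
        return w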

Citation Context

...ios. For L1-SVM, Zhang (2004), Shalev-Shwartz et al. (2007), Bottou (2007) propose various stochastic gradient descent methods. Collins et al. (2008) apply an exponentiated gradient method. SVM perf (Joachims, 2006) uses a cutting plane technique. Smola et al. (2008) apply bundle methods, and view SVM perf as a special case. For L2-SVM, Keerthi and DeCoste (2005) propose modified Newton methods. A trust region ...

Hogwild!: A lock-free approach to parallelizing stochastic gradient descent

by Feng Niu, Benjamin Recht, Stephen J. Wright, 2011
"... Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work a ..."
Abstract - Cited by 161 (9 self) - Add to MetaCart
Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.
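The scheme lets every worker read and write the shared weight vector without locks, relying on sparsity so that concurrent updates rarely touch the same coordinates. The Python sketch below only illustrates that update pattern with threads and a shared NumPy array; it is not the authors' implementation, CPython's GIL prevents truly parallel numeric updates, and the regularization handling is simplified.

    import threading
    import numpy as np

    def hogwild_sgd(X, y, lam=1e-3, lr=0.1, n_threads=4, updates_per_thread=5000, seed=0):
        """Illustrative lock-free SGD on the L2-regularized hinge loss: all threads
        update the shared vector `w` with no locking. This only demonstrates the
        update pattern, not the speedup reported in the paper."""
        n, d = X.shape
        w = np.zeros(d)                              # shared state, written without locks

        def worker(tid):
            rng = np.random.default_rng(seed + tid)
            for _ in range(updates_per_thread):
                i = rng.integers(n)
                nz = np.flatnonzero(X[i])            # touch only non-zero coordinates,
                g = lam * w[nz]                      # the sparsity HOGWILD! relies on
                if y[i] * X[i, nz].dot(w[nz]) < 1:   # hinge sub-gradient
                    g -= y[i] * X[i, nz]
                w[nz] -= lr * g                      # unsynchronized write to shared w

        threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return w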

Citation Context

...y_1), ..., (z_|E|, y_|E|)} where z ∈ R^n and y is a label for each (z, y) ∈ E. minimize_x Σ_{α∈E} max(1 − y_α x^T z_α, 0) + λ‖x‖²_2, (2) and we know a priori that the examples z_α are very sparse (see for example [14]). To write this cost function in the form of (1), let e_α denote the components which are non-zero in z_α and let d_u denote the number of training examples which are non-zero in component u (u = 1, 2,.....

Trust region Newton method for large-scale logistic regression

by Chih-jen Lin, Ruby C. Weng, S. Sathiya Keerthi - In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007
"... Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in ..."
Abstract - Cited by 98 (22 self) - Add to MetaCart
Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi-Newton approach for logistic regression. We also compare it with existing linear SVM implementations.
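A trust region Newton approach for this problem needs the regularized logistic loss, its gradient, and Hessian-vector products. The sketch below wires those three pieces into SciPy's generic trust-region Newton-CG solver (method='trust-ncg') as a stand-in for the paper's specialized solver; it assumes labels in {-1, +1} and an L2-regularized objective with an illustrative parameter C.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit

    def trust_region_logreg(X, y, C=1.0):
        """Sketch: L2-regularized logistic regression solved with SciPy's generic
        trust-region Newton-CG solver. Objective: 0.5*||w||^2 + C*sum_i log(1 + exp(-y_i w^T x_i)).
        Only a Hessian-vector product is supplied, not the full Hessian."""
        n, d = X.shape

        def fun(w):
            z = y * (X @ w)
            return 0.5 * w.dot(w) + C * np.logaddexp(0, -z).sum()

        def grad(w):
            sigma = expit(y * (X @ w))
            return w + C * (X.T @ (y * (sigma - 1.0)))

        def hessp(w, s):
            sigma = expit(y * (X @ w))
            D = sigma * (1.0 - sigma)                # diagonal weights of the Hessian's data term
            return s + C * (X.T @ (D * (X @ s)))

        res = minimize(fun, np.zeros(d), jac=grad, hessp=hessp, method='trust-ncg')
        return res.x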

Citation Context

... directly solving bigger optimization problems. We refer to such cases as linear SVM, and considerable efforts have been made on its fast training (e.g., (Kao et al., 2004; Keerthi and DeCoste, 2005; Joachims, 2006)). L1-SVM involves the optimization of a non-differentiable function of w, so unconstrained optimization techniques cannot be directly applied. For L2-SVM, the training objective function (3) is diff...

Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks

by Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, Peter L. Bartlett, 2008
"... Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large dat ..."
Abstract - Cited by 94 (2 self) - Add to MetaCart
Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1/ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log(1/ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be
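The core EG update is multiplicative: each dual variable is scaled by exp(-η · gradient) and the result is renormalized so the vector stays on the simplex. The sketch below applies that batch update to a generic simplex-constrained quadratic of the form min_a 0.5 a^T Q a - b^T a as a stand-in for the log-linear or max-margin dual; it omits the structured factorization the paper describes, and the step size and iteration count are illustrative.

    import numpy as np

    def eg_minimize(Q, b, eta=0.1, n_iter=200):
        """Sketch: exponentiated gradient updates for min_a 0.5*a^T Q a - b^T a
        over the probability simplex. Each step multiplies a by exp(-eta * gradient)
        and renormalizes, so the iterate always stays on the simplex."""
        k = len(b)
        a = np.full(k, 1.0 / k)                  # start at the simplex center
        for _ in range(n_iter):
            g = Q @ a - b                        # gradient of the quadratic objective
            a = a * np.exp(-eta * g)
            a /= a.sum()                         # renormalize back onto the simplex
        return a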

Citation Context

...t are slower than our O( 1 ɛ ), and no extension to structured learning (or multi-class) is discussed. Recently, several new algorithms have been presented, along with a rate of convergence analysis (=-=Joachims, 2006-=-; Shalev-Shwartz et al., 2007; Teo et al., 2007; Tsochantaridis et al., 2004). All of these algorithms are similar to ours in having a relatively low dependence on n in terms of memory and computation...
