## A scalable modular convex solver for regularized risk minimization (2007)

Venue: KDD, ACM

Citations: 61 (14 self)

### BibTeX

@INPROCEEDINGS{Teo07ascalable,

author = {Choon Hui Teo and Quoc Le and Alex Smola and S. V. N. Vishwanathan},

title = {A scalable modular convex solver for regularized risk minimization},

booktitle = {KDD},

publisher = {ACM},

year = {2007}

}

### Abstract

A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a highly scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as ℓ1 and ℓ2 penalties. At present, our solver implements 20 different estimation problems, can be easily extended, scales to millions of observations, and is up to 10 times faster than specialized solvers for many applications. The open source code is freely available as part of the ELEFANT toolbox.
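The modular structure described in the abstract can be illustrated with a small sketch. It assumes the objective has the form J(w) = λ Ω(w) + R_emp(w), where R_emp is the average loss over the observations; this form, and every name below (`regularized_risk`, `hinge_loss`, `logistic_loss`, `l1_regularizer`, `l2_regularizer`, `lam`), is a hypothetical choice for illustration and not the ELEFANT API. The solver side only needs a value and a (sub)gradient from each interchangeable piece.

```python
import math


def hinge_loss(f, y):
    """Soft-margin loss max(0, 1 - y f) and one subgradient with respect to f."""
    margin = 1.0 - y * f
    if margin > 0.0:
        return margin, -float(y)
    return 0.0, 0.0


def logistic_loss(f, y):
    """Logistic loss log(1 + exp(-y f)) and its derivative -y / (1 + exp(y f))."""
    z = -y * f
    # Numerically stable evaluation of log(1 + exp(z)).
    value = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    # Numerically stable evaluation of 1 / (1 + exp(-z)) = 1 / (1 + exp(y f)).
    if z >= 0.0:
        weight = 1.0 / (1.0 + math.exp(-z))
    else:
        weight = math.exp(z) / (1.0 + math.exp(z))
    return value, -y * weight


def l2_regularizer(w):
    """Value and gradient of (1/2) ||w||_2^2."""
    return 0.5 * sum(wi * wi for wi in w), list(w)


def l1_regularizer(w):
    """Value and one subgradient (the sign vector) of ||w||_1."""
    signs = [float((wi > 0) - (wi < 0)) for wi in w]
    return sum(abs(wi) for wi in w), signs


def regularized_risk(w, data, loss, regularizer, lam):
    """Return J(w) = lam * Omega(w) + R_emp(w) and one subgradient of J at w.

    Assumption: `data` is a list of (x, y) pairs with x a list of floats.
    """
    m = len(data)
    risk = 0.0
    grad = [0.0] * len(w)
    for x, y in data:
        f = sum(wi * xi for wi, xi in zip(w, x))  # f = <w, x>
        value, dl_df = loss(f, y)
        risk += value / m
        for j, xj in enumerate(x):
            grad[j] += dl_df * xj / m  # chain rule: dl/dw = l'(f, y) * x
    reg_value, reg_grad = regularizer(w)
    total = lam * reg_value + risk
    return total, [lam * r + g for r, g in zip(reg_grad, grad)]
```

Under these assumptions, passing `hinge_loss` with `l2_regularizer` gives the linear SVM objective, `logistic_loss` with `l2_regularizer` gives regularized logistic regression, and swapping in `l1_regularizer` gives a Lasso-type penalty, without any change to `regularized_risk` itself.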

### Citations

3844 | Libsvm: A library for support vector machines
- Chang, Lin
- 2011
Citation Context ...8], graphical models [14], exponential families [3], and generalized linear models [17]. Traditionally, specialized solvers have been developed for solving the kernel version of (1) in the dual, e.g. [9, 23]. These algorithms construct the Lagrange dual, and solve for the Lagrange multipliers efficiently. Only recently, research focus has shifted back to solving (1) in the primal, e.g. [10, 25, 36]. This...

2506 | Conditional random fields: probabilistic modeling for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context ...ession [37], ordinal regression [21], ranking [15], maximization of multivariate performance measures [24], structured estimation [38, 40], Gaussian Process regression [43], conditional random fields [28], graphical models [14], exponential families [3], and generalized linear models [17]. Traditionally, specialized solvers have been developed for solving the kernel version of (1) in the dual, e.g. [9...

2066 | Regression shrinkage and selection via the Lasso
- Tibshirani
- 1996
Citation Context ...ression. Extensions of these loss functions allow us to handle structure in the output space [1]. Changing the regularizer $\Omega(w)$ to the sparsity inducing $\|w\|_1$ leads to Lasso-type estimation algorithms [30, 39, 8]. The kernel trick is widely used to transform many of these algorithms into ones operating on a Reproducing Kernel Hilbert Space (RKHS). One lifts w into an RKHS and replaces all inner product comput...

1563 | Making large-scale SVM learning practical
- Joachims
- 1999
Citation Context ...8], graphical models [14], exponential families [3], and generalized linear models [17]. Traditionally, specialized solvers have been developed for solving the kernel version of (1) in the dual, e.g. [9, 23]. These algorithms construct the Lagrange dual, and solve for the Lagrange multipliers efficiently. Only recently, research focus has shifted back to solving (1) in the primal, e.g. [10, 25, 36]. This...

1461 | Statistics for Spatial Data
- Cressie
- 1993
Citation Context ...ve [41]: $\max(0, |f - y| - \epsilon)$; $0$ if $|f - y| \le \epsilon$ and $\operatorname{sgn}(f - y)$ otherwise. Huber’s robust loss [31]: $\frac{1}{2}(f - y)^2$ if $|f - y| < 1$, else $|f - y| - \frac{1}{2}$; $f - y$ if $|f - y| \le 1$, else $\operatorname{sgn}(f - y)$. Poisson regression [16]: $\exp(f) - yf$; $\exp(f) - y$. Table 2: Vectorial loss functions and their derivatives, depending on the vector $f := Wx$ and on $y$. Loss; derivative. Soft Margin [38]: $\max_{y'}(f_{y'} - f_y + \Delta(y, y'))$; $e_{y^*} - e_y$, whe...

704 | Decoding by linear programming
- Candes, Tao
- 2005
Citation Context ...ression. Extensions of these loss functions allow us to handle structure in the output space [1]. Changing the regularizer $\Omega(w)$ to the sparsity inducing $\|w\|_1$ leads to Lasso-type estimation algorithms [30, 39, 8]. The kernel trick is widely used to transform many of these algorithms into ones operating on a Reproducing Kernel Hilbert Space (RKHS). One lifts w into an RKHS and replaces all inner product comput...

551 | Estimating the support of a high-dimensional distribution
- Schölkopf, Platt, et al.
Citation Context ... if $yf \ge 1$ and $-y$ otherwise. Squared Soft Margin [10]: $\frac{1}{2}\max(0, 1 - yf)^2$; $0$ if $yf \ge 1$ and $f - y$ otherwise. Exponential [14]: $\exp(-yf)$; $-y\exp(-yf)$. Logistic [13]: $\log(1 + \exp(-yf))$; $-y/(1 + \exp(yf))$. Novelty [32]: $\max(0, 1 - f)$; $0$ if $f \ge 0$ and $-1$ otherwise. Least mean squares [43]: $\frac{1}{2}(f - y)^2$; $f - y$. Least absolute deviation: $|f - y|$; $\operatorname{sgn}(f - y)$. Quantile regression [27]: $\max(\tau(f - y), (1 - \tau)(y - f))$; $\tau$ if $f > y$ and ...

487 | Quantile regression
- Koenker, KF
Citation Context ...3]: $\log(1 + \exp(-yf))$; $-y/(1 + \exp(yf))$. Novelty [32]: $\max(0, 1 - f)$; $0$ if $f \ge 0$ and $-1$ otherwise. Least mean squares [43]: $\frac{1}{2}(f - y)^2$; $f - y$. Least absolute deviation: $|f - y|$; $\operatorname{sgn}(f - y)$. Quantile regression [27]: $\max(\tau(f - y), (1 - \tau)(y - f))$; $\tau$ if $f > y$ and $\tau - 1$ otherwise. ɛ-insensitive [41]: $\max(0, |f - y| - \epsilon)$; $0$ if $|f - y| \le \epsilon$ and $\operatorname{sgn}(f - y)$ otherwise. Huber’s robust loss [31]: $\frac{1}{2}(f - y)^2$ if $|f - y| < 1$, els...

473 | Shallow Parsing with Conditional Random Fields
- Sha, Pereira
- 2003
Citation Context ...the past n gradients (n is user defined). LBFGS is known to perform well on continuously differentiable problems, such as logistic regression, least-mean-squares problems, or conditional random fields [34]. But, if the functions are not continuously differentiable (e.g., the hinge loss and its variants) then LBFGS may fail. Empirically, we observe that LBFGS does converge well even for the hinge losses...

471 | Convex Analysis and Minimization Algorithms
- Hiriart-Urruty, Lemarechal
- 1993
Citation Context ...3) This gives rise to the hope that if we have a set $W = \{w_1, \ldots, w_n\}$ of locations where we compute such a Taylor approximation, we should be able to obtain an ever-improving approximation of $g(w)$ [22]. See Figure 2 for an illustration. Formally, we have $g(w) \ge \max_{\bar{w} \in W} \left[ g(\bar{w}) + \langle w - \bar{w}, \partial_w g(\bar{w}) \rangle \right]$ (4), which means that $g(w)$ can be lower-bounded by a piecewise linear function. Moreover, the appro...

461 | Max-margin markov networks
- Taskar, Guestrin, et al.
- 2003
Citation Context ...[41], novelty detection [33], Huber’s robust regression, quantile regression [37], ordinal regression [21], ranking [15], maximization of multivariate performance measures [24], structured estimation [38, 40], Gaussian Process regression [43], conditional random fields [28], graphical models [14], exponential families [3], and generalized linear models [17]. Traditionally, specialized solvers have been de...

399 | Large margin methods for structured and interdependent output variables
- Tsochantaridis, Joachims, et al.
- 2005
Citation Context ...[41], novelty detection [33], Huber’s robust regression, quantile regression [37], ordinal regression [21], ranking [15], maximization of multivariate performance measures [24], structured estimation [38, 40], Gaussian Process regression [43], conditional random fields [28], graphical models [14], exponential families [3], and generalized linear models [17]. Traditionally, specialized solvers have been de...

353 | Training linear SVMs in linear time
- Joachims
- 2006
Citation Context ... link². The time reported for the experiments are the CPU time. One exception is for parallel experiments where we report the CPU and network communication time. 5.1 Datasets. We use the datasets in [25, 36] for classification tasks. For regression tasks, we pick some of the largest datasets in Luís Torgo’s website³. Since some of the regression datasets ² http://nf.apac.edu.au/facilities/ac/hardware.p...

347 | Multivariate Statistical Modelling Based on Generalized Linear Models
- Fahrmeir, Tutz
- 1994
Citation Context ...ormance measures [24], structured estimation [38, 40], Gaussian Process regression [43], conditional random fields [28], graphical models [14], exponential families [3], and generalized linear models [17]. Traditionally, specialized solvers have been developed for solving the kernel version of (1) in the dual, e.g. [9, 23]. These algorithms construct the Lagrange dual, and solve for the Lagrange multi...

320 | A limited memory algorithm for bound constrained optimization
- Byrd, Lu, et al.
- 1995
Citation Context ...tleneck. 4.2 Off-the-shelf Methods. Since our architecture is modular (see figure 2), we show as a proof of concept that it can deal with different types of solvers, such as an implementation of LBFGS [6] from TAO [5]. There are two additional requirements: First, we need to provide a subdifferential and value of the regularizer $\Omega(w)$. This is easily achieved via $\partial_w \frac{1}{2}\|w\|_2^2 = w$ and $\partial_w \|w\|_1 \ni \operatorname{sgn} w$. ...

318 | Large margin rank boundaries for ordinal regression
- Herbrich, Graepel, et al.
- 2000
Citation Context ...which employ the kernel trick (but essentially still solve (1)) include Support Vector regression [41], novelty detection [33], Huber’s robust regression, quantile regression [37], ordinal regression [21], ranking [15], maximization of multivariate performance measures [24], structured estimation [38, 40], Gaussian Process regression [43], conditional random fields [28], graphical models [14], exponen...

224 | Robust linear programming discrimination of two linearly inseparable sets
- Bennett, Mangasarian
- 1992
Citation Context ...es, depending on $f := \langle w, x \rangle$, and $y$. Loss $l(f, y)$; derivative $l'(f, y)$. Hinge [20]: $\max(0, -yf)$; $0$ if $yf \ge 0$ and $-y$ otherwise. Squared Hinge [26]: $\frac{1}{2}\max(0, -yf)^2$; $0$ if $yf \ge 0$ and $f$ otherwise. Soft Margin [4]: $\max(0, 1 - yf)$; $0$ if $yf \ge 1$ and $-y$ otherwise. Squared Soft Margin [10]: $\frac{1}{2}\max(0, 1 - yf)^2$; $0$ if $yf \ge 1$ and $f - y$ otherwise. Exponential [14]: $\exp(-yf)$; $-y\exp(-yf)$. Logistic [13]: $\log(1 + \exp(-yf))$; −y/(1 + ...

213 | Logistic regression, AdaBoost and Bregman distances. Machine Learning 48
- Collins, Schapire, et al.
- 2002
Citation Context ...d $f$ otherwise. Soft Margin [4]: $\max(0, 1 - yf)$; $0$ if $yf \ge 1$ and $-y$ otherwise. Squared Soft Margin [10]: $\frac{1}{2}\max(0, 1 - yf)^2$; $0$ if $yf \ge 1$ and $f - y$ otherwise. Exponential [14]: $\exp(-yf)$; $-y\exp(-yf)$. Logistic [13]: $\log(1 + \exp(-yf))$; $-y/(1 + \exp(yf))$. Novelty [32]: $\max(0, 1 - f)$; $0$ if $f \ge 0$ and $-1$ otherwise. Least mean squares [43]: $\frac{1}{2}(f - y)^2$; $f - y$. Least absolute deviation: $|f - y|$; $\operatorname{sgn}(f - y)$. Quantile regression [2...

210 | Support vector method for function approximation, regression estimation, and signal processing
- Vapnik, Golowich, et al.
- 1997

208 | A support vector method for multivariate performance measures
- Joachims
- 2005
Citation Context ...e Support Vector regression [41], novelty detection [33], Huber’s robust regression, quantile regression [37], ordinal regression [21], ranking [15], maximization of multivariate performance measures [24], structured estimation [38, 40], Gaussian Process regression [43], conditional random fields [28], graphical models [14], exponential families [3], and generalized linear models [17]. Traditionally, ...

203 | Prediction with Gaussian processes: From linear regression to linear prediction and beyond
- Williams
- 1999

197 | Efficient SVM training using low-rank kernel representations
- Fine, Scheinberg
- 2001
Citation Context ...arse features (e.g. the bag of words representation of a document). Second, many kernels (e.g. kernels on strings [42]) can effectively be linearized, and third, efficient factorization methods (e.g. [18]) can be used for a low rank representation of the kernel matrix thereby effectively rendering the problem linear. For each of the above estimation problems specialized solvers exist, and the common a...

190 | PETSc users manual
- Balay, Buschelman, et al.
- 2004
Citation Context ...nd ordinal regression, and a particular regularizer $\Omega$, namely quadratic regularization, both methods are equivalent. The advantage in our solver is the use of efficient linear algebra tools via PETSc [2], the modular structure, the considerably higher generality in both loss functions and regularizers, and the fact that data may be decentralized. Moreover, our work is related to [11], where MapReduce...

181 | Information and Exponential Families in Statistical Theory
- Barndorff-Nielsen
- 1978
Citation Context ..., maximization of multivariate performance measures [24], structured estimation [38, 40], Gaussian Process regression [43], conditional random fields [28], graphical models [14], exponential families [3], and generalized linear models [17]. Traditionally, specialized solvers have been developed for solving the kernel version of (1) in the dual, e.g. [9, 23]. These algorithms construct the Lagrange du...

162 | Map-reduce for machine learning on multicore
- Chu, Kim, et al.
Citation Context ...a tools via PETSc [2], the modular structure, the considerably higher generality in both loss functions and regularizers, and the fact that data may be decentralized. Moreover, our work is related to [11], where MapReduce is used to accelerate machine learning on parallel computers. We use similar parallelization techniques to distribute the computation of values and gradients of the empirical risk Re...

144 | Predicting Time Series with Support Vector Machines
- Müller, Smola, et al.
- 1999
Citation Context ...$\operatorname{sgn}(f - y)$. Quantile regression [27]: $\max(\tau(f - y), (1 - \tau)(y - f))$; $\tau$ if $f > y$ and $\tau - 1$ otherwise. ɛ-insensitive [41]: $\max(0, |f - y| - \epsilon)$; $0$ if $|f - y| \le \epsilon$ and $\operatorname{sgn}(f - y)$ otherwise. Huber’s robust loss [31]: $\frac{1}{2}(f - y)^2$ if $|f - y| < 1$, else $|f - y| - \frac{1}{2}$; $f - y$ if $|f - y| \le 1$, else $\operatorname{sgn}(f - y)$. Poisson regression [16]: $\exp(f) - yf$; $\exp(f) - y$. Table 2: Vectorial loss functions and their derivatives, depending o...

138 | Tools for privacy preserving distributed data mining, SIGKDD Explorations 4
- Clifton, Kantarcioglou, et al.
Citation Context ...essary that individual nodes share the data, since all communication revolves around sharing only values and gradients of $R_{\mathrm{emp}}(w)$. • This has the added benefit of preserving a large degree of privacy [12] between the individual database owners and the system using the solver. At every step the data owner will only return a gradient which is the linear combination of a set of observations. Assuming tha...

122 | Hierarchical document categorization with support vector machines. CIKM
- Cai, Hofmann
- 2004
Citation Context ... a matrix of the dimensionality of the number of classes. Let us discuss the following two cases: Ontologies for Structured Estimation: For hierarchical labels, e.g. whenever we deal with an ontology [7], we can use a decomposition of the coefficient vector along the hierarchy of categories. Let d denote the depth of the hierarchy tree, and assume that each leaf of this tree corresponds to a label. W...

101 | Training a support vector machine in the primal
- Chapelle
- 2007
Citation Context ...dual, e.g. [9, 23]. These algorithms construct the Lagrange dual, and solve for the Lagrange multipliers efficiently. Only recently, research focus has shifted back to solving (1) in the primal, e.g. [10, 25, 36]. This spurt in research interest is due to three main reasons: First, many interesting problems in diverse areas such as text classification, word-sense disambiguation, and drug design already employ...

91 | Fast kernels for string and tree matching
- Vishwanathan, Smola
- 2003
Citation Context ... large datasets (with the number of data points of the order of a million) and very sparse features (e.g. the bag of words representation of a document). Second, many kernels (e.g. kernels on strings [42]) can effectively be linearized, and third, efficient factorization methods (e.g. [18]) can be used for a low rank representation of the kernel matrix thereby effectively rendering the problem linear....

88 | A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
- Keerthi, DeCoste
- 2005
Citation Context ...ted. Table 1: Scalar loss functions and their derivatives, depending on $f := \langle w, x \rangle$, and $y$. Loss $l(f, y)$; derivative $l'(f, y)$. Hinge [20]: $\max(0, -yf)$; $0$ if $yf \ge 0$ and $-y$ otherwise. Squared Hinge [26]: $\frac{1}{2}\max(0, -yf)^2$; $0$ if $yf \ge 0$ and $f$ otherwise. Soft Margin [4]: $\max(0, 1 - yf)$; $0$ if $yf \ge 1$ and $-y$ otherwise. Squared Soft Margin [10]: $\frac{1}{2}\max(0, 1 - yf)^2$; $0$ if $yf \ge 1$ and $f - y$ otherwise. Exponential [14]: ex...

63 | Predicting Structured Data
- Bakir, Hofmann, et al.
- 2007
Citation Context ...regularizer but changing the loss function to $l(x_i, y_i, w) = \log(1 + \exp(-y_i \langle w, x_i \rangle))$, yields logistic regression. Extensions of these loss functions allow us to handle structure in the output space [1]. Changing the regularizer $\Omega(w)$ to the sparsity inducing $\|w\|_1$ leads to Lasso-type estimation algorithms [30, 39, 8]. The kernel trick is widely used to transform many of these algorithms into ones ope...

61 | Linear and nonlinear separation of patterns by linear programming
- Mangasarian
- 1965
Citation Context ...ression. Extensions of these loss functions allow us to handle structure in the output space [1]. Changing the regularizer $\Omega(w)$ to the sparsity inducing $\|w\|_1$ leads to Lasso-type estimation algorithms [30, 39, 8]. The kernel trick is widely used to transform many of these algorithms into ones operating on a Reproducing Kernel Hilbert Space (RKHS). One lifts w into an RKHS and replaces all inner product comput...

57 | Large scale semi-supervised linear SVMs
- Sindhwani, Keerthi
- 2006
Citation Context ... of predicting binary valued labels $y \in \{\pm 1\}$, we may set $\Omega(w) = \frac{1}{2}\|w\|^2$, and the loss $l(x_i, y_i, w)$ to be the hinge loss, $\max(0, 1 - y_i \langle w, x_i \rangle)$, which recovers linear Support Vector Machines (SVMs) [25, 36]. On the other hand, using the same regularizer but changing the loss function to $l(x_i, y_i, w) = \log(1 + \exp(-y_i \langle w, x_i \rangle))$, yields logistic regression. Extensions of these loss functions allow us to h...

39 | Linear hinge loss and average margin
- Gentile, Warmuth
- 1999
Citation Context ...d gradients of l can be computed in linear time, once f is sorted. Table 1: Scalar loss functions and their derivatives, depending on $f := \langle w, x \rangle$, and $y$. Loss $l(f, y)$; derivative $l'(f, y)$. Hinge [20]: $\max(0, -yf)$; $0$ if $yf \ge 0$ and $-y$ otherwise. Squared Hinge [26]: $\frac{1}{2}\max(0, -yf)^2$; $0$ if $yf \ge 0$ and $f$ otherwise. Soft Margin [4]: $\max(0, 1 - yf)$; $0$ if $yf \ge 1$ and $-y$ otherwise. Squared Soft Margin [10]: $\frac{1}{2}$ max(0, ...

30 | Nonparametric quantile estimation
- Takeuchi, Le, et al.
- 2006
Citation Context .... Examples of algorithms which employ the kernel trick (but essentially still solve (1)) include Support Vector regression [41], novelty detection [33], Huber’s robust regression, quantile regression [37], ordinal regression [21], ranking [15], maximization of multivariate performance measures [24], structured estimation [38, 40], Gaussian Process regression [43], conditional random fields [28], graph...

28 | DSDP4 users manual
- Benson, Ye
- 2002
Citation Context ...deals with the regularizer $\Omega(w)$ and is able to query the loss function for values of $R_{\mathrm{emp}}(w)$ and $\partial_w R_{\mathrm{emp}}(w)$ as needed. This is very similar to the design of the Toolkit for Advanced Optimization (TAO) [5]. Depending on the type of loss function, computing $R_{\mathrm{emp}}$ can be very costly. This is particularly true in cases where $l(x, y, w)$ is the log-likelihood of an intractable conditional random fields or of...

21 | Online learning meets optimization in the dual
- Shalev-Shwartz, Singer
- 2006
Citation Context ... $\frac{1}{2}\|w\|^2$. Then the bundle method produces a duality gap of at most $\epsilon$ after $t$ steps, where $t \le \log_2 \lambda R_{\mathrm{emp}}(0) - 2 \log_2 G + \frac{8G^2}{\lambda\epsilon} - 4$ (13). Note that this bound is significantly better than that of [40, 35], since it only depends logarithmically on the value of the loss and offers an $O(1/\epsilon)$ rate of convergence rather than the $O(1/\epsilon^2)$ rate in previous papers. This is largely due to an improved analysis...

20 | Direct optimization of ranking measures
- Smola
- 2007
Citation Context ...$y^*$ is the argmax of the loss. Softmax [14]: $\log \sum_{y'} \exp(f_{y'}) - f_y$; $\left[\sum_{y'} e_{y'} \exp(f_{y'})\right] / \sum_{y'} \exp(f_{y'}) - e_y$. Multivariate Regression: $\frac{1}{2}(f - y)^\top M (f - y)$ where $M \succeq 0$; $M(f - y)$. Document Ranking: [29] show that a large number of ranking scores (normalized discounted cumulative gain, mean reciprocal rank, expected rank utility, etc.) can be optimized directly by minimizing the following loss: l(X, ...

10 | Online ranking by projecting
- Crammer, Singer
- 2005
Citation Context ...he kernel trick (but essentially still solve (1)) include Support Vector regression [41], novelty detection [33], Huber’s robust regression, quantile regression [37], ordinal regression [21], ranking [15], maximization of multivariate performance measures [24], structured estimation [38, 40], Gaussian Process regression [43], conditional random fields [28], graphical models [14], exponential families ...

4 | Probabilistic Networks and Expert Systems
- Cowell, Dawid, et al.
Citation Context ...gression [21], ranking [15], maximization of multivariate performance measures [24], structured estimation [38, 40], Gaussian Process regression [43], conditional random fields [28], graphical models [14], exponential families [3], and generalized linear models [17]. Traditionally, specialized solvers have been developed for solving the kernel version of (1) in the dual, e.g. [9, 23]. These algorithms...