## OPTIMIZATION FOR SPARSE AND ACCURATE CLASSIFIERS (2010)

Citations: 1 (1 self)

### BibTeX

```
@MISC{Goldberg10optimizationfor,
  author = {Noam Goldberg},
  title = {OPTIMIZATION FOR SPARSE AND ACCURATE CLASSIFIERS},
  year = {2010}
}
```


### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...46, 47], and the first algorithms were suggested by Schapire [61], and later Freund [34] and Freund and Schapire [36]. Support Vector Machines (SVMs) were first suggested by Vapnik and his colleagues [69], and together with boosting they have been the most widely used algorithms for selecting weighted voting classifiers. 1.1 Weighted voting classification In binary classification problems we are given...

2308 | A decision-theoretic generalization of online learning and an application to boosting. EuroCOLT
- Freund, Schapire
- 1995
Citation Context: ...eoretic probably approximately correct (PAC) learning framework by Kearns and Valiant [46, 47], and the first algorithms were suggested by Schapire [61], and later Freund [34] and Freund and Schapire [36]. Support Vector Machines (SVMs) were first suggested by Vapnik and his colleagues [69], and together with boosting they have been the most widely used algorithms for selecting weighted voting classif...

2171 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context: ...ectly minimizing or penalizing the L0 norm of λ. In order to avoid a hard combinatorial optimization problem, the authors of various methods such as LP-Boost, Lasso and Support Vector Machines (SVMs) [26, 37, 23, 11] instead suggest using the L1 or L2 norms of λ. Minimizing any one of the common loss functions plus a complexity penalty proportional to the L1 or L2 norm of λ has the computational advantage of bein...

1631 | Experiments with a new boosting algorithm
- Freund, Schapire
- 1996
Citation Context: ...em. On the other hand, for some of the simple base classifiers considered in the literature, such as monomials of fixed degree, or decision trees of a single test (also known as “decision stumps”) [35], maximization of (4.1) can be performed in polynomial time by simple enumeration. We now consider the case where U corresponds to abstaining monomials, and extend the branch-and-bound algorithm of Ch...

895 | Approximation Algorithms
- Vazirani
- 2001
Citation Context: ...ity gap of a MIP relaxation is defined as sup_{H,y} z(H, y)/z_R(H, y), where z(H, y) and z_R(H, y) are the optimal solution values of the SMDH MIP and its continuous relaxation, respectively (see Vazirani [70]). In order to show a lower bound for the integrality gap we will consider a particular construction of a simple SMDH instance with C = 1 and diag(y)H = I, where I is the identity matrix, meaning each...

721 | Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods, The Annals of Statistics
- Bartlett, Freund, et al.
Citation Context: ...λ_u, and finally outputs the weighted voting (also known as the strong) classifier g(·) that either exactly or approximately maximizes the L1 margin (equivalently the L∞ distance to the closest point) [63, 60, 73]. The base classifier h_u, for u ∈ U, is selected by a “black box” base (also known as weak) learning algorithm. A potential difficulty with this approach is that U can be very large and often exponent...

665 | The strength of weak learnability
- Schapire
- 1990
Citation Context: ...of research. Boosting was first considered in the theoretic probably approximately correct (PAC) learning framework by Kearns and Valiant [46, 47], and the first algorithms were suggested by Schapire [61], and later Freund [34] and Freund and Schapire [36]. Support Vector Machines (SVMs) were first suggested by Vapnik and his colleagues [69], and together with boosting they have been the most widely u...

640 | UCI machine learning repository
- Asuncion, Newman
- 2007
Citation Context: ...is capable of parallelism). Table 2.1 and Figures 2.4-2.8 show our experimental results using Algorithm 1 as a weak learning algorithm inside the LP-Boost boosting procedure [26], applied to the UCI [3] binary dataset SPECTHRT and binarized versions of additional UCI datasets. We configured our binarization procedure to obtain a larger number of variables (as indicated by N in the table) than is cus...

590 | Learning with kernels: support vector machines, regularization, optimization, and beyond
- Schölkopf, Smola
- 1999
Citation Context: ...with minimum L0 norm is known to be NP-hard [55]. In practice, the common approach is to minimize the L1 or L2 norms. Greedy heuristics for minimizing L0 have also been suggested by several authors [66]. Natarajan proposes a greedy algorithm with an approximation guarantee in terms of an Lp-norm of the input matrix and right-hand side [55]. Similarly, in classification model selection, overfitting i...

423 | Boosting a weak learning algorithm by majority
- Freund
- 1995
Citation Context: ...as first considered in the theoretic probably approximately correct (PAC) learning framework by Kearns and Valiant [46, 47], and the first algorithms were suggested by Schapire [61], and later Freund [34] and Freund and Schapire [36]. Support Vector Machines (SVMs) were first suggested by Vapnik and his colleagues [69], and together with boosting they have been the most widely used algorithms for sele...

419 | Reducing multiclass to binary: A unifying approach for margin classifiers
- Allwein, Schapire, et al.
Citation Context: ...the case of more than two classes has been reduced to one or more problems of binary classification, with the error rate of the multi-class problem bounded in terms of the error of the binary problem [1, 31]. By optimally sparse and accurate classifiers we mean to consider the problem of how to strike an ideal balance between sparsity and accuracy of a classifier on the given set of data. In this dissert...

325 | New support vector algorithms
- Schölkopf, Smola, et al.
- 2000
Citation Context: ...action of margin errors, and 1 − ν as an upper bound on the fraction of points with margin larger than ρ; they provide a variational and graphical sketch of a proof, respectively. Schölkopf et al. [65] prove a similar theorem for an SVM quadratic programming formulation where the L2-norm of λ is being minimized; in particular they show that ν is an upper bound on the fraction of errors and a lower ...

319 | The boosting approach to machine learning: An overview
- Schapire
- 2002
Citation Context: ...g algorithm pricing the best column within the column generation framework: that is, how to find the best feature in each iteration of a boosting algorithm. While much of the boosting literature (see [36, 64, 59, 58, 26, 54, 62]) has focused on run-time analysis as well as learning generalization bounds with respect to the boosting algorithm, not much work has been done in the area of developing new base learning algorithms ...

316 | Sparse approximate solutions to linear systems
- Natarajan
- 1995
Citation Context: ...ally sparse solutions correspond to solutions with minimum L0 norm, defined as the number of nonzero components of the solution vector. Finding a solution with minimum L0 norm is known to be NP-hard [55]. In practice, the common approach is to minimize the L1 or L2 norms. Greedy heuristics for minimizing L0 have also been suggested by several authors [66]. Natarajan proposes a greedy algorithm with a...

315 | What size net gives valid generalization
- Baum, Haussler
- 1989
Citation Context: ...sk (also known as generalization) bound in terms of the number of rounds, T, that the AdaBoost algorithm is run. In particular, they extend a previous result of Baum and Haussler for Neural Networks [8] to show that the VC-dimension of the classifier can be bounded in terms of the sum of the VC-dimensions of the base classifiers and T. Theorem 1.3.1 (Freund and Schapire). The VC-dimension of a set of fu...

306 | Cryptographic limitations on learning boolean formulae and finite automata
- Kearns, Valiant
- 1994
Citation Context: ...ighted voting classifiers have been an especially active area of research. Boosting was first considered in the theoretic probably approximately correct (PAC) learning framework by Kearns and Valiant [46, 47], and the first algorithms were suggested by Schapire [61], and later Freund [34] and Freund and Schapire [36]. Support Vector Machines (SVMs) were first suggested by Vapnik and his colleagues [69], a...

257 | Rademacher and Gaussian complexities: Risk bounds and structural results
- Bartlett, Mendelson
Citation Context: ...functions in place of (3.22) and different values of the penalty constant in future work. Particularly, complexity measures that can be computed based on the input data, such as Rademacher complexity [7] could be the subject of further investigation...

254 | Soft margins for AdaBoost
- Rätsch, Onoda, et al.
- 2001
Citation Context: ...corresponding to U. By “soft margin” it is meant that not all observations need to be separated and satisfy the same requirement of distance from the hyperplane. Graepel et al. [41] and Rätsch et al. [58] adapted the quadratic optimization formulation of SVMs using “soft margins” to a linear programming formulation. Demiriz, Bennett and Shawe-Taylor [26] use a linear programming formulation based on t...

254 | Structural risk minimization over data-dependent hierarchies
- Shawe-Taylor, Bartlett, et al.
- 1998
Citation Context: ...earance of the data) hierarchy of complexity classes S1, S2, . . . A choice of penalties based on a decomposition into classes of risk based on the actual data is known as a (un)luckiness function [68], which is used in general to encode bias in favor of certain classifiers, for example those corresponding to hyperplanes with a larger margin over others. A different kind of risk bound that we will ...

211 | Branch-and-price: Column generation for solving huge integer programs
- Barnhart, Johnson, et al.
- 1998
Citation Context: ...ly branch on the variables ξ and µ. However, we are also interested in instances where the dimension of µ can be very large. In this case, we may be interested in implementing a branch-and-price approach [5] for generating the base classifiers and the corresponding nonzero variables λ and µ. Branch-and-price involves sophisticated branching schemes for preventing the regeneration of a column by column ge...

210 | Robust linear programming discrimination of two linearly inseparable sets
- Bennett, Mangasarian
- 1992
Citation Context: ...arating hyperplanes for data classification applications by Mangasarian [52, 53]. Bennett and Mangasarian have also suggested robust linear programming formulations for data that may be non-separable [12]. Hammer’s Logical Analysis of Data (LAD) [25, 15, 16] applies the theory of Boolean functions to data analysis. LAD consists of a systematic approach to enumerating monomials (or interactions of vari...

204 | Logistic regression, AdaBoost and Bregman distances
- Collins, Schapire, et al.
- 2004
Citation Context: ...m, or equivalently the average, of losses over all observations. The original AdaBoost algorithm by Freund and Schapire [36] has been shown to minimize the sum of (1.4) over the training observations [21], for certain optimal base learners. Logistic regression, which is a common classification model selection procedure in statistics, minimizes the sum of (1.5). The problem of minimizing (1.2) has been...

202 | From sparse solutions of systems of equations to sparse modeling of signals and images
- Bruckstein, Donoho, et al.
- 2009
Citation Context: ...ed compression interpretation of learning within statistical learning theory. In signal detection and compressed sensing, one faces a related problem of solving an under-determined linear system (see [72, 18]). Optimally sparse solutions correspond to solutions with minimum L0 norm, defined as the number of nonzero components of the solution vector. Finding a solution with minimum L0 norm is known to be N...

199 | The Analysis of Binary Data
- Cox
- 1970
Citation Context: ...sformation to the binary response variable in order to apply general linear model techniques. These techniques are not explored further in this dissertation, but the reader may refer to Cox and Snell [24] for more detail. In the field of operations research, linear and nonlinear programs have been suggested for finding separating hyperplanes for data classification applications by Mangasarian [52, 53]...

194 | Toward efficient agnostic learning
- Kearns, Schapire, et al.
- 1994
Citation Context: ...iables corresponding to constraints (1.12b) and (1.12c), respectively. The same objective, in fact, has been long known in machine learning and computational geometry communities as maximum agreement [48, 9, 40] or maximum bi-chromatic discrepancy [27]. A probabilistic argument for arriving at the same objective can be made for abstaining base classifiers. A base classifier (or hypothesis) is said to abstain...

192 | Computational limitations on learning from examples
- Pitt, Valiant
- 1988
Citation Context: ...mum agreement problems was first investigated by Pitt and Valiant, who proved that it is NP-hard to decide if there is a 2-term Disjunctive Normal Form that agrees with all observations of a dataset [57]. Kearns, Schapire and Sellie [48], following work by Kearns and Li [45], investigated minimum disagreement with Boolean monomials and showed that the problem is NP-hard. The inapproximability of max...

178 | The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network
- Bartlett
- 1998
Citation Context: ...ing the VC dimension of linear combinations of base classifiers in terms of the VC dimension of the base classifiers used, and ||λ||_0, as shown by Theorem 1.3.1. Following later results by Bartlett [6], Schapire et al. [63] and Vapnik (see [66] section 5.5.6) providing risk bounds in terms of the margin of separation, algorithms for finding weighted voting classifiers have been mostly motivated by ...

167 | Learning in the presence of malicious errors
- Kearns, Li
- 1993
Citation Context: ...roved that it is NP-hard to decide if there is a 2-term Disjunctive Normal Form that agrees with all observations of a dataset [57]. Kearns, Schapire and Sellie [48], following work by Kearns and Li [45], investigated minimum disagreement with Boolean monomials and showed that the problem is NP-hard. The inapproximability of maximum agreement problems has been investigated by Ben-David, Eiron and Lo...

154 | The hardness of approximate optima in lattices, codes, and systems of linear equations
- Arora, Babai, et al.
- 1993

142 | The Minimum Description Length Principle
- Grünwald
- 2007
Citation Context: ...prehensive account of MDL, its information theoretic foundations, and the equivalence of different variants of MDL to well known principles and methods of statistical inference is given by Grünwald [42]. Here we focus on the more intuitive aspects of MDL, and the related compression interpretation of learning within statistical learning theory. In signal detection and compressed sensing, one faces a...

102 | Linear programming boosting via column generation
- Demiriz, Bennett, et al.
- 2002
Citation Context: ...ectly minimizing or penalizing the L0 norm of λ. In order to avoid a hard combinatorial optimization problem, the authors of various methods such as LP-Boost, Lasso and Support Vector Machines (SVMs) [26, 37, 23, 11] instead suggest using the L1 or L2 norms of λ. Minimizing any one of the common loss functions plus a complexity penalty proportional to the L1 or L2 norm of λ has the computational advantage of bein...

96 | A linear programming approach to the cutting-stock problem
- Gilmore, Gomory
- 1963
Citation Context: ...Column generation has been used to solve large LPs since the early 1960s, with a wide variety of applications starting with the seminal application of Gilmore and Gomory to the cutting stock problem [39, 51]. The method is successful in practice when there are many more columns than rows and the number of variables that need to be generated tends to be small. In the formulation (1.12), the objective f...

93 | A Simple, Fast, and Effective Rule Learner
- Cohen, Singer
- 1999
Citation Context: ...ase’. However, monomial hypotheses have been found especially useful for constructing weighted voting classifiers. In particular, boosting of monomial hypotheses has been suggested by several authors [20, 38, 40]. The experimental results in [40] show that boosting optimal monomial hypotheses, as opposed to heuristically generated monomials (e.g. as in SLIPPER [20]), can improve the classification performance...

84 | Robust trainability of single neurons
- Hoffgen, Horn, et al.
- 1995
Citation Context: ...ure space), as well as of the size of the data M. The problem of finding a hyperplane g that minimizes the sum of (1.2) over i = 1, . . . , M is known as the minimum disagreement halfspace problem (MDH) [44, 2]. When considering the 0/1 loss (1.2), it turns out that even when U is given in the input, the problem of finding a loss-minimizing hyperplane is NP-hard [44, 9]. This problem has been solved using ...

72 | Selected topics in column generation
- Lübbecke, Desrosiers
- 2005
Citation Context: ...Column generation has been used to solve large LPs since the early 1960s, with a wide variety of applications starting with the seminal application of Gilmore and Gomory to the cutting stock problem [39, 51]. The method is successful in practice when there are many more columns than rows and the number of variables that need to be generated tends to be small. In the formulation (1.12), the objective f...

71 | Improved boosting algorithms using confidence-rated predictions
- Schapire, Singer
- 1999
Citation Context: ...g algorithm pricing the best column within the column generation framework: that is, how to find the best feature in each iteration of a boosting algorithm. While much of the boosting literature (see [36, 64, 59, 58, 26, 54, 62]) has focused on run-time analysis as well as learning generalization bounds with respect to the boosting algorithm, not much work has been done in the area of developing new base learning algorithms ...

69 | Boosting as a Regularized Path to a Maximum Margin Classifier
- Zhu, et al.
- 2004

55 | Linear and Nonlinear Separation of Patterns by Linear Programming
- Mangasarian
- 1965
Citation Context: ...t to unseen data in a probabilistic setting. We study linear programming formulations for finding a hyperplane that separates two sets of points. Such formulations were initially given by Mangasarian [52] for the separable case, and more recently extended by “soft margin” formulations that maximize the margin of separation subject to a penalty proportional to the sum of margin violations. LP-Boost is ...

53 | Predictive learning via rule ensembles
- Friedman, Popescu
- 2008
Citation Context: ...ase’. However, monomial hypotheses have been found especially useful for constructing weighted voting classifiers. In particular, boosting of monomial hypotheses has been suggested by several authors [20, 38, 40]. The experimental results in [40] show that boosting optimal monomial hypotheses, as opposed to heuristically generated monomials (e.g. as in SLIPPER [20]), can improve the classification performance...

44 | Logical analysis of numerical data
- Boros, Hammer, et al.
- 1997
Citation Context: ...plications by Mangasarian [52, 53]. Bennett and Mangasarian have also suggested robust linear programming formulations for data that may be non-separable [12]. Hammer’s Logical Analysis of Data (LAD) [25, 15, 16] applies the theory of Boolean functions to data analysis. LAD consists of a systematic approach to enumerating monomials (or interactions of variables) and using them to model cause-and-effect relati...

42 | Boosting with early stopping: convergence and consistency, Annals of Statistics 33
- Zhang, Yu
- 2005
Citation Context: ...λ_u, and finally outputs the weighted voting (also known as the strong) classifier g(·) that either exactly or approximately maximizes the L1 margin (equivalently the L∞ distance to the closest point) [63, 60, 73]. The base classifier h_u, for u ∈ U, is selected by a “black box” base (also known as weak) learning algorithm. A potential difficulty with this approach is that U can be very large and often exponent...

39 | Misclassification minimization
- Mangasarian
- 1994
Citation Context: ...ell [24] for more detail. In the field of operations research, linear and nonlinear programs have been suggested for finding separating hyperplanes for data classification applications by Mangasarian [52, 53]. Bennett and Mangasarian have also suggested robust linear programming formulations for data that may be non-separable [12]. Hammer’s Logical Analysis of Data (LAD) [25, 15, 16] applies the theory of...

38 | Computing the maximum bichromatic discrepancy, with applications to computer graphics and machine learning
- Dobkin, Gunopulos, et al.
- 1996
Citation Context: ...(1.12c), respectively. The same objective, in fact, has been long known in machine learning and computational geometry communities as maximum agreement [48, 9, 40] or maximum bi-chromatic discrepancy [27]. A probabilistic argument for arriving at the same objective can be made for abstaining base classifiers. A base classifier (or hypothesis) is said to abstain on i ∈ {1, . . . , M} if h_u(A_i) = 0 (see...

38 | Smooth boosting and learning with malicious noise
- Servedio
Citation Context: ...opposed to heuristically generated monomials (e.g. as in SLIPPER [20]), can improve the classification performance when using a sufficiently robust boosting algorithm, such as Servedio’s SmoothBoost [67]. Monomial hypotheses (also called logical patterns) are also a basic building block in the logical analysis of data (LAD) methodology [16], where linear programming as well as other techniques are su...

36 | Classification on proximity data with lp-machines
- Graepel, Herbrich, et al.
- 1999

29 | Ruling Out PTAS for Graph Min-Bisection, Dense k-Subgraph, and Bipartite Clique
- Khot

28 | Combinatorial Optimization
- Korte, Vygen
- 2002
Citation Context: ...ce the number of cuts (3.13d) is at most |Ω+||Ω−|, the running time of Algorithm 2 can be shown to be polynomial in the dimensions M and U, and the size of the encoding of the coefficients ρ and c [50]. A disadvantage of our formulation may be that we may not know how to set the required margin ρ. An alternative may be to try to maximize the margin within the optimization problem. Let z(ρ) denote t...

27 | Learning Boolean formulae or finite automata is as hard as factoring
- Kearns, Valiant
- 1988
Citation Context: ...ighted voting classifiers have been an especially active area of research. Boosting was first considered in the theoretic probably approximately correct (PAC) learning framework by Kearns and Valiant [46, 47], and the first algorithms were suggested by Schapire [61], and later Freund [34] and Freund and Schapire [36]. Support Vector Machines (SVMs) were first suggested by Vapnik and his colleagues [69], a...

24 | The maximum box problem and its application to data analysis
- Eckstein, Hammer, et al.
Citation Context: ...e problem consists of points in {0, 1}^N, then any axis-aligned rectangle corresponds to a monomial m_{J,C}. Another related (but not equivalent) problem for real-valued data is the maximum box problem [29]. For the special case of input data in {0, 1}^N, the (weighted) maximum box problem can be stated as the problem of finding a box or subcube (J, C) that maximizes w(Ω+ ∩ Cover(J, C)), and such that...

21 | A compression approach to support vector model selection
- Luxburg, Bousquet, et al.
Citation Context: ...expressed in terms of the margin or equivalently the L1 or L2 norm of λ. Luxburg, Bousquet and Schölkopf have investigated the connection between statistical learning theory and compression for SVMs [71]. Although SVMs are designed to maximize the margin of separation in the space of features subject to a soft margin penalty, Luxburg et al. find that their compression-based bounds often perform bette...