## Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning (2002)

Venue: PhD thesis, MIT

Citations: 88 (6 self)

### BibTeX

```bibtex
@PHDTHESIS{Rifkin02everythingold,
  author = {Ryan Michael Rifkin},
  title  = {Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning},
  school = {MIT},
  year   = {2002}
}
```


### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ... experimental results exploring SvmFu's performance and comparing SvmFu to other popular codes. In recent years, the Support Vector Machine (SVM) has become an important technique in machine learning [20, 112, 113]. Loosely speaking, an SVM for classification attempts to find a hyperplane that maximizes the margin between positive and negative examples, while simultaneously minimizing training set misclassifica...

2868 | UCI Repository of Machine Learning Databases
- Merz, Murphy
- 1996
Citation Context: ... automatically a good solution to this problem, this is fairly obvious. In the second set of experiments, they compare their algorithm to a one-vs-all scheme on five data sets from the UCI repository [75] (the data sets used were iris, wine, glass, soy, and vowel). They find that on two of the five data sets, the multiclass support vector machine performs substantially better, and on the remaining thr...
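The one-vs-all scheme this excerpt compares against is simple to state: train one binary machine per class (class k vs. the rest) and assign a new point to the class whose machine outputs the largest real-valued score. A minimal NumPy sketch, using a hypothetical regularized-least-squares linear scorer as the binary learner (any real-valued binary learner could be substituted):

```python
import numpy as np

def train_one_vs_all(X, y, train_binary):
    """Train one binary classifier per class: class k vs. the rest.
    `train_binary(X, t)` must return a real-valued scoring function f(x)."""
    classes = np.unique(y)
    return {k: train_binary(X, np.where(y == k, 1.0, -1.0)) for k in classes}

def predict_one_vs_all(machines, X):
    """Assign each point to the class whose machine outputs the largest score."""
    classes = sorted(machines)
    scores = np.column_stack([machines[k](X) for k in classes])
    return np.array([classes[i] for i in scores.argmax(axis=1)])

# Hypothetical binary learner: regularized least squares on a linear model.
def rls_binary(X, t, lam=1e-3):
    Xb = np.hstack([X, np.ones((len(X), 1))])        # append a bias feature
    w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ t)
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ w
```

The binary learner here is an illustrative stand-in; the excerpt's comparison uses SVMs.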

2308 | A decision-theoretic generalization of online learning and an application to boosting. EuroCOLT
- Freund, Schapire
- 1995
Citation Context: ...and that we can easily relate the margin loss function $L(yf(x))$ to the loss function $V(f(x), y)$ we considered in Chapters 2 and 3. Allwein et al. are also very interested in the AdaBoost algorithm [34, 93], which builds a function $f(x)$ that is a weighted linear combination of base hypotheses $h_t$: $f(x) = \sum_t \alpha_t h_t(x)$ (4.40), where the $h_t$ are selected by a (weak) base learning algorithm, and referenc...

2284 | A tutorial on support vector machines for pattern recognition
- Burges
- 1998
Citation Context: ...ation attempts to find a hyperplane that maximizes the margin between positive and negative examples, while simultaneously minimizing training set misclassifications (for excellent introductions, see [16] or [27]). SVMs are motivated by strong theoretical arguments, and their observed empirical performance is extremely good. However, training an SVM requires the solution of a quadratic program in as m...

2171 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context: ... experimental results exploring SvmFu's performance and comparing SvmFu to other popular codes. In recent years, the Support Vector Machine (SVM) has become an important technique in machine learning [20, 112, 113]. Loosely speaking, an SVM for classification attempts to find a hyperplane that maximizes the margin between positive and negative examples, while simultaneously minimizing training set misclassifica...

2028 | Learning with Kernels
- Schölkopf, Smola
- 2002
Citation Context: ... an overview of the mathematics of Support Vector Machines, focusing on the details necessary to understand the algorithms used to train SVMs. More details can be found in a wide variety of references [16, 27, 79, 113, 95, 51]. Whereas most developments start from a geometric viewpoint emphasizing separating hyperplanes and "margin", we begin with the idea of regularization, which allows us to easily develop both the prima...

1441 | Making large-Scale SVM Learning Practical
- Joachims
- 1999
Citation Context: ..., the largest and smallest valid imputed values of b. When these two values become sufficiently close, the algorithm has converged. 2.3.3 Joachims' SVMLight: Thorsten Joachims' SVMLight system [53, 55] introduced several additional key innovations. His system is publicly available, and is quite popular. Like Osuna's system, Joachims solved a large SVM problem by decomposing it into a series of smal...

1291 | A training algorithm for optimal margin classifiers
- Boser, Guyon, et al.
Citation Context: ...a penalty linear in the amount by which we fail to satisfy the constraint. The quantity $y_i f(x_i)$ is also known as the margin (see below). The now-classical SVM algorithm as developed by Vapnik et al. [12, 112, 113] uses the hinge loss, and this shall be our focus for most of the present chapter. Since we are interested in solving binary classification problems, the most natural loss would be the 0-1 loss, which s...

1011 | Fast training of support vector machines using sequential minimal optimization
- Platt
- 1998
Citation Context: ...h a fast algorithm and the flexibility to easily adapt its operating characteristics to individual problems. Conceptually, SvmFu can be viewed as a unification of ideas presented by Osuna [80], Platt [82], and Joachims [55], with additional extensions. We describe and justify our approach, and present experimental results exploring SvmFu's performance and comparing SvmFu to other popular codes. In rec...

976 | Fast effective rule induction
- Cohen
- 1995
Citation Context: ...es, which is of course our actual goal. Fürnkranz: Very recently, Fürnkranz published a paper on Round Robin Classification [40], which is his name for all-vs-all classification. He used Ripper [18], a rule-based learner, as his underlying binary learner. He experimentally found that an all-vs-all system had improved performance compared to a one-vs-all scheme. His data sets included the satimag...

943 |
An Introduction to Support Vector Machines
- Cristianini, Shawe-Taylor
- 2000
Citation Context: ...tempts to find a hyperplane that maximizes the margin between positive and negative examples, while simultaneously minimizing training set misclassifications (for excellent introductions, see [16] or [27]). SVMs are motivated by strong theoretical arguments, and their observed empirical performance is extremely good. However, training an SVM requires the solution of a quadratic program in as many vari...

824 |
Solution of Ill-posed Problems
- Tikhonov, Arsenin
Citation Context: ...sed in Chapter 1, the problem of learning a function that will generalize well on new examples is ill-posed. The classical approach to restoring well-posedness to learning is regularization theory [46, 30, 28, 108, 116, 8, 9]. This leads to the following regularized learning problem: $\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_K^2$ (2.1). Here, $\|f\|_K^2$ is the norm in a Reproducing Kernel Hilbert Space $\mathcal{H}$ defined by a p...
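With the square loss, problem (2.1) has a closed-form solution: by the representer theorem, $f(x) = \sum_i c_i K(x_i, x)$, and the coefficients c satisfy a single linear system, $(K + \lambda \ell I)c = y$. A minimal sketch in NumPy (the RBF kernel and the parameter values are illustrative choices, not taken from the thesis):

```python
import numpy as np

def rbf_kernel(A, B, gamma=2.0):
    # K(x, z) = exp(-gamma * ||x - z||^2), a positive definite kernel
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def train_rlsc(X, y, lam=1e-3, gamma=2.0):
    """Square-loss instance of (2.1): solve (K + lam*l*I) c = y,
    then f(x) = sum_i c_i K(x_i, x)."""
    l = len(X)
    K = rbf_kernel(X, X, gamma)
    c = np.linalg.solve(K + lam * l * np.eye(l), y)
    return lambda Z: rbf_kernel(Z, X, gamma) @ c
```

Classification takes the sign of f; the λℓ scaling of the regularizer mirrors the 1/ℓ averaging of the loss in (2.1).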

803 |
Estimation of Dependencies Based on Empirical Data
- Vapnik
- 1979
Citation Context: ...the optimal separating hyperplane has a normal vector parallel (or anti-parallel) to v, and passes through $(x_+ + x_-)/2$. An algorithm based on this observation was derived by Vapnik in the 1970's [111], and another such algorithm with extensions to the nonseparable case with squared slack variables has recently been proposed by Keerthi et al. [58]. 2.2.4 The SVM Dual: The SVM dual problem is substan...

777 |
Theory of reproducing kernels
- Aronszajn
- 1950
Citation Context: ... $\sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j Q_{ij}$ (2.49) subject to: $\sum_{i=1}^{\ell} y_i \alpha_i = 0$ (2.50), $0 \le \alpha_i \le C,\ \forall i$ (2.51). Here, $Q_{ij} \equiv y_i y_j K(x_i, x_j)$, where K is a user-supplied positive semidefinite kernel function [3, 117]. Choosing a kernel function is equivalent to choosing a (possibly high- or infinite-dimensional) feature space in which to embed the examples. Kernels commonly in use include linear, polynomial and r...
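A quick sketch of the kernels mentioned and the matrix $Q_{ij} = y_i y_j K(x_i, x_j)$ from (2.49): since $Q = \operatorname{diag}(y)\,K\,\operatorname{diag}(y)$, positive semidefiniteness of K carries over to Q, so the dual is a concave maximization. Parameter values below are illustrative:

```python
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def poly_kernel(A, B, degree=3, c=1.0):
    return (A @ B.T + c) ** degree

def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def dual_Q(X, y, kernel):
    """Q_ij = y_i y_j K(x_i, x_j): the Hessian of the SVM dual objective."""
    K = kernel(X, X)
    return (y[:, None] * y[None, :]) * K
```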

721 | Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods, The Annals of Statistics
- Bartlett, Freund, et al.
Citation Context: ...generalization performance of multiclass loss-based schemes in the particular case when the underlying binary classifier is AdaBoost. The arguments are extensions of those given by Schapire et al. in [92], but are beyond the scope of this thesis. The remainder of the paper is devoted to experiments on both toy and UCI Repository data sets, using AdaBoost and SVMs as the base learners. The two stat...

716 | Nonlinear Programming: Theory and Algorithms
- Bazaraa, Shetty
Citation Context: ...c (2.13) subject to: $y_i(\sum_{j=1}^{\ell} c_j K(x_i, x_j) + b) \ge 1 - \xi_i$, $i = 1, \dots, \ell$ (2.14); $\xi_i \ge 0$, $i = 1, \dots, \ell$ (2.15). We derive the Wolfe dual quadratic program using Lagrange multiplier techniques [6, 10, 73, 32]. Because the primal is a feasible convex quadratic programming problem, strong duality holds: the dual problem will also be feasible and convex, and the optimal objective values of the primal and d...

698 | Improved Boosting Algorithms using Confidence-rated Predictions
- Schapire, Singer
- 1999
Citation Context: ...and that we can easily relate the margin loss function $L(yf(x))$ to the loss function $V(f(x), y)$ we considered in Chapters 2 and 3. Allwein et al. are also very interested in the AdaBoost algorithm [34, 93], which builds a function $f(x)$ that is a weighted linear combination of base hypotheses $h_t$: $f(x) = \sum_t \alpha_t h_t(x)$ (4.40), where the $h_t$ are selected by a (weak) base learning algorithm, and referenc...

639 | Networks for approximation and learning
- Poggio, Girosi
- 1990
Citation Context: ...s [96] also used regularization. These authors considered regression problems rather than classification, and did not use Reproducing Kernel Hilbert Spaces as regularizers. In 1989, Girosi and Poggio [46, 83] considered regularized classification and regression problems with the square loss. They used pseudodifferential operators as their stabilizers; these are essentially equivalent to using the norm in ...

455 | Nonparametric Regression and Generalized Linear Models: A roughness penalty approach
- Green, Silverman
- 1994
Citation Context: ...ix G, without actually solving $\ell - 1$ training problems. The above statement is not too difficult to prove; we present here an argument given in the very readable monograph of Green and Silverman [48]. Define $f^i$ to be the vector whose jth coordinate is $f_{S^i}(x_j)$, the value of the function obtained by training on a dataset with the ith point removed, applied to the jth point. Define $Y^i$ to be a vect...

455 | Parallel networks that learn to pronounce english text
- Sejnowski, Rosenberg
- 1987
Citation Context: ...and each row of the matrix, choosing the minimizer: $f(x) = \arg\min_{r \in \{1,\dots,N\}} \sum_{i=1}^{F} \frac{1 - \operatorname{sign}(M_{ri} f_i(x))}{2}$ (4.37). This representation had been previously used by Sejnowski and Rosenberg [97], but in their case, the matrix M was chosen so that a column of M corresponded to the presence or absence of some specific feature across the given classes. For example (taken from [29]), in a digit ...

423 | Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation
- Craven, Wahba
- 1979
Citation Context: ...herefore, this relationship, while exact, does not represent a practical approach to computing the leave-one-out error for large RLSC problems. An alternative approach, introduced by Craven and Wahba [26], is known as the generalized approximate cross-validation, or GACV for short. Instead of actually using the entries of the inverse matrix G directly, an approximation to the leave-one-out value is ma...
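The exact relationship referenced above (for square-loss regularization, via the Green and Silverman argument cited at [48]) reads the leave-one-out residual off the smoother matrix: with $G = K(K + \lambda I)^{-1}$, $y_i - f^i(x_i) = (y_i - f(x_i))/(1 - G_{ii})$, so one matrix inverse replaces ℓ retrainings. A sketch (kernel and λ are illustrative):

```python
import numpy as np

def rls_loo_residuals(K, y, lam):
    """Exact leave-one-out residuals for square-loss regularization:
    y_i - f^i(x_i) = (y_i - f(x_i)) / (1 - G_ii), G = K (K + lam*I)^{-1}."""
    G = K @ np.linalg.inv(K + lam * np.eye(len(y)))   # smoother matrix
    f = G @ y                                         # in-sample predictions
    return (y - f) / (1.0 - np.diag(G))
```

The test below verifies the identity against brute-force retraining on each held-out point.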

419 | Reducing multiclass to binary: A unifying approach for margin classifiers
- Allwein, Schapire, et al.
Citation Context: ...ustrated graphically, and no comparisons to other methods are made. In the other example, a comparison to one-vs-all is made. The training data consists of 200 one-dimensional points (in the interval [0, 1]) from three overlapping classes, and the test data consists of 10,000 independent test points from the distribution. The distributions are chosen so that class 2 never has a conditional probability ...

365 | On the algorithmic implementation of multiclass kernel-based vector machines
- Crammer, Singer
Citation Context: ...SVM system performs (slightly) better than their single-machine system. Crammer and Singer: Crammer and Singer consider a similar but not identical single-machine approach to multiclass classification [24]. This work is a specific case of a general method for solving multiclass problems, discussed in [23, 22, 25] and in Section 4.2.2 below. The method can be viewed as a simple modification of the appro...

349 | Introduction to Linear Optimization, Athena Scientific
- Bertsimas, Tsitsiklis
- 1997
Citation Context: ...c (2.13) subject to: $y_i(\sum_{j=1}^{\ell} c_j K(x_i, x_j) + b) \ge 1 - \xi_i$, $i = 1, \dots, \ell$ (2.14); $\xi_i \ge 0$, $i = 1, \dots, \ell$ (2.15). We derive the Wolfe dual quadratic program using Lagrange multiplier techniques [6, 10, 73, 32]. Because the primal is a feasible convex quadratic programming problem, strong duality holds: the dual problem will also be feasible and convex, and the optimal objective values of the primal and d...

341 |
Interior point polynomial algorithms in convex programming
- Nesterov, Nemirovskii
- 1994
Citation Context: ...have a big impact in the field. There exist polynomial time algorithms for solving large classes of convex programming problems, which include convex quadratic programming problems as a special case [78]. For many quadratic programming problems, interior point methods have led to the most efficient algorithms in practice as well as theory. However, the quadratic programming problem arising from Suppo...

305 | An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
- SHEWCHUK
- 1994
Citation Context: ... $\frac{1}{2} x^T A x + b^T x$). Although the Conjugate Gradient algorithm is simple to state, understanding its workings is quite difficult. For a good tutorial on the Conjugate Gradient, see [109] or especially [98]. Here, we will restrict ourselves to stating the algorithm, and discussing why it is especially useful in machine learning problems such as Regularized Least Squares Classification. Conjugate Gradien...
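For reference, the algorithm itself is short. A sketch of CG solving $Ax = b$ (equivalently, minimizing the quadratic form above) for symmetric positive definite A; it touches A only through matrix-vector products, which is why it suits large dense kernel matrices:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve Ax = b for symmetric positive definite A by Conjugate Gradient."""
    x = np.zeros_like(b)
    r = b - A @ x                    # residual = negative gradient
    p = r.copy()                     # initial search direction
    rs = r @ r
    for _ in range(max_iter or 10 * len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)        # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p    # A-conjugate update of the direction
        rs = rs_new
    return x
```

In exact arithmetic CG converges in at most n steps; in practice it is stopped once the residual norm falls below a tolerance.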

290 | Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines”, Microsoft Research
- Platt
Citation Context: ...vely straightforward and widely known in the literature, it was nevertheless "rediscovered" and analyzed extensively by Lin, Lee and Wahba [67, 68]. 2.3.2 Platt's Sequential Minimal Optimization: Platt [81, 82] extended Osuna's ideas with the additional intriguing observation that one could set the working set size in the decomposition algorithm to 2. The resulting two point QPs could be solved analytically...

283 | Using the Nyström method to speed up kernel machines
- Williams, Seeger
- 2001
Citation Context: ...nlinear RLSC 3.7.1 Low-Rank Kernel Approximations: Several authors have suggested the use of low-rank approximations to the kernel matrix in order to avoid explicit storage of the entire kernel matrix [100, 66, 119, 43, 31]. These techniques can be used in a variety of methods, including Regularized Least Squares Classification, Gaussian Process regression and classification, interior point approaches to Support Vector ...
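One of the simplest such schemes is the Nyström approximation of Williams and Seeger: sample m columns of K and set $K \approx C W^{+} C^{T}$, where C is the ℓ×m block of sampled kernel columns and W the m×m block on the sampled points, so the full ℓ×ℓ matrix is never formed. A sketch (the RBF kernel and uniform sampling are illustrative choices):

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom(X, kernel, m, rng=None):
    """Low-rank factors of K ~ C @ W_pinv @ C.T from m sampled points."""
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(X), size=m, replace=False)
    C = kernel(X, X[idx])                 # l x m: sampled kernel columns
    W_pinv = np.linalg.pinv(C[idx, :])    # pseudo-inverse of the m x m block
    return C, W_pinv
```

Sampling all ℓ points recovers K exactly; in practice m is much smaller than ℓ, trading accuracy for O(ℓm) storage.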

267 | Regularization networks and support vector machines
- Evgeniou, Pontil, et al.
- 2000
Citation Context: ...sed in Chapter 1, the problem of learning a function that will generalize well on new examples is ill-posed. The classical approach to restoring well-posedness to learning is regularization theory [46, 30, 28, 108, 116, 8, 9]. This leads to the following regularized learning problem: $\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_K^2$ (2.1). Here, $\|f\|_K^2$ is the norm in a Reproducing Kernel Hilbert Space $\mathcal{H}$ defined by a p...

259 | Least squares support vector machine classifiers
- Suykens, Vandewalle
- 1999
Citation Context: ... name of Least Squares Support Vector Machines. He has been extremely prolific on the topic, publishing literally dozens of papers; the papers most relevant to the topics discussed in this thesis are [106, 107, 104, 105, 115, 102, 42, 103]. However, these papers do not seem to advance either the theory or practice of RLSC. Like the work of Fung and Mangasarian, Suykens derives RLSC as a modification of Support Vector Machines, rather t...

253 | Improved Training Algorithm for Support Vector
- Osuna, Freund, et al.
- 1997
Citation Context: ...roviding both a fast algorithm and the flexibility to easily adapt its operating characteristics to individual problems. Conceptually, SvmFu can be viewed as a unification of ideas presented by Osuna [80], Platt [82], and Joachims [55], with additional extensions. We describe and justify our approach, and present experimental results exploring SvmFu's performance and comparing SvmFu to other popular c...

252 | SVMTorch: Support Vector Machines for Large-Scale Regression Problems
- Collobert, Bengio
- 2001
Citation Context: ...ptimality guarantee on the final solution. Nevertheless, we conjecture that it has little effect on the optimal solution. This problem is faced by other codes which use shrinking as well [53, 19]. Additionally, we take a sophisticated approach to caching in an attempt to avoid redundant kernel computations. There are essentially two kinds of cache in SvmFu (although they are both part of the ...

230 |
Multiclass cancer diagnosis using tumor gene expression signatures
- Ramaswamy, Tamayo, et al.
Citation Context: ...ll tasks, whereas SVM would be only 100 seconds faster. 3.6.2 Linear RLSC Application: Tumor Classification: We also mention briefly an application to multiclass molecular tumor classification. In [86], the author (among others) considered multiclass tumor classification. The data consisted of 190 tumors divided into 14 classes. Each example is the 16,063 dimensional output of a DNA microarray. The...

224 | On the mathematical foundations of learning
- Cucker, Smale
Citation Context: ...sed in Chapter 1, the problem of learning a function that will generalize well on new examples is ill-posed. The classical approach to restoring well-posedness to learning is regularization theory [46, 30, 28, 108, 116, 8, 9]. This leads to the following regularized learning problem: $\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_K^2$ (2.1). Here, $\|f\|_K^2$ is the norm in a Reproducing Kernel Hilbert Space $\mathcal{H}$ defined by a p...

217 | Multi-class support vector machines
- Weston, Watkins
- 1998
Citation Context: ...ely. 4.2.1 Single Machine Approaches: Vapnik, Weston and Watkins: The single machine approach was introduced simultaneously in the 1998 book of Vapnik [113] and a technical report by Weston and Watkins [118]. The formulations introduced in these two sources are essentially identical. The approach is a multiclass generalization of Support Vector Machines. A standard SVM finds a function $f(x) = \sum_{j=1}^{\ell} c_j$...

203 | An equivalence between sparse approximation and support vector machines
- Girosi
- 1998
Citation Context: ...t is nonconstructive: it tells us the form of the solution, but does not tell us how to actually find the $c_i$. In the specific case of the square loss, where $V(f(x), y) = (y - f(x))^2$, Girosi [45] includes a constructive proof of the form of the $c_i$; such a proof is very similar to the derivation of the RLSC algorithm given in Chapter 3, although there we begin by assuming the form of the solut...

191 | Improvements to Platt's SMO algorithm for SVM classifier design, Neural Computation 13
- Keerthi, Shevade, et al.
- 2001
Citation Context: ... term [36], penalizing the bias term (or, equivalently, adding an extra dimension of all 1's to the data [71]), and adding an extra dimension for each data point, thereby making the problem separable [35, 59]. Evaluating whether these alternatives work as well or better than the standard formulation is an open research question; Olvi Mangasarian has made progress in addressing these issues [72]. Chapter 3...

179 | Ill-posed problems in early vision
- Bertero, Poggio, et al.
- 1987

177 | Support vector machines: Training and applications
- Osuna, Freund, et al.
- 1997
Citation Context: ... an overview of the mathematics of Support Vector Machines, focusing on the details necessary to understand the algorithms used to train SVMs. More details can be found in a wide variety of references [16, 27, 79, 113, 95, 51]. Whereas most developments start from a geometric viewpoint emphasizing separating hyperplanes and "margin", we begin with the idea of regularization, which allows us to easily develop both the prima...

177 | Sparse greedy matrix approximation for machine learning
- Smola, Schölkopf
- 2000
Citation Context: ...nlinear RLSC 3.7.1 Low-Rank Kernel Approximations: Several authors have suggested the use of low-rank approximations to the kernel matrix in order to avoid explicit storage of the entire kernel matrix [100, 66, 119, 43, 31]. These techniques can be used in a variety of methods, including Regularized Least Squares Classification, Gaussian Process regression and classification, interior point approaches to Support Vector ...

176 | Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data
- Lee, Lee, et al.
- 2004
Citation Context: ...total number of support vectors corresponding to different data points, so it is also impossible to draw any conclusions about which formulation is faster at testing new points. Lee, Lin and Wahba: In [64] and [65], Lee, Lin and Wahba present a substantially different single-machine approach to multiclass classification. The work has its roots in an earlier paper by Lin [69] on the asymptotic propertie...

163 | On the learnability and design of output codes for multiclass problems
- Crammer, Singer
- 2000
Citation Context: ...d Singer consider a similar but not identical single-machine approach to multiclass classification [24]. This work is a specific case of a general method for solving multiclass problems, discussed in [23, 22, 25] and in Section 4.2.2 below. The method can be viewed as a simple modification of the approach of Weston and Watkins [118]. Weston and Watkins start from the idea that if a point x is in class i, we s...

150 | Support vector machines, reproducing kernel hilbert spaces and randomized gacv
- Wahba
- 1998
Citation Context: ... $\sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j Q_{ij}$ (2.49) subject to: $\sum_{i=1}^{\ell} y_i \alpha_i = 0$ (2.50), $0 \le \alpha_i \le C,\ \forall i$ (2.51). Here, $Q_{ij} \equiv y_i y_j K(x_i, x_j)$, where K is a user-supplied positive semidefinite kernel function [3, 117]. Choosing a kernel function is equivalent to choosing a (possibly high- or infinite-dimensional) feature space in which to embed the examples. Kernels commonly in use include linear, polynomial and r...

148 | Spline models for observational data, volume 59
- Wahba
- 1990

147 | Practical Methods of Optimization, second edition
- Fletcher
- 2000

136 | Another Approach to Polychotomous Classification
- Friedman
- 1996
Citation Context: ... additional binary learners. The SvmFu implementation currently includes this strategy. If the data is high-dimensional, and the time to compute the kernel products dominates the training time, this can have a substantial effect on the total computation time. [Footnote: This fact was stated by Friedman in 1996 [37] and proved at great length by Fürnkranz in 2002 [40].] Unfortunately, because the different ...

135 | A generalized representer theorem
- Schölkopf, Herbrich, et al.
Citation Context: ...herently infinite-dimensional problem of finding the best function in an RKHS is transformed into a tractable problem of finding $\ell$ parameters $c_i$. The proof we present here is due to Schölkopf et al. [94], although the proof is "implicit" in much earlier work of Wahba [116]. Recall from Appendix A that we define $\Phi(x_i)$ to be the projection of $x_i$ into a high (possibly infinite) dimensional feature s...

112 | Proximal Support Vector Machine Classifiers,” Knowledge Discovery and Data Mining
- Fung, Mangasarian
- 2001
Citation Context: ...exercise of deriving the dual may seem somewhat pointless, its value will become clear in later sections, when we use this dual approach to make connections to other ideas. Additionally, some authors [39, 38] who have published on this topic derive the dual in order to find the solution, possibly in order to make their least-squares formulation look as much like the standard SVM formulation as possible...

109 | On the Convergence of the Decomposition Method for Support Vector Machines
- Lin
- 2001
Citation Context: ...to the optimal solution has not been formally established, the algorithm seems to converge on all real-world examples. Attempts to study the convergence of SVM algorithms have been made: Chang et al. [17] study general algorithms, but rely on the algorithm finding the examples which maximize the step size at each iteration. Keerthi and Gilbert [57] provide a general convergence proof for SMO-style alg...

97 | Estimating the generalization performance of a SVM efficiently
- Joachims
- 2000
Citation Context: ...ludes Support Vector Machines (without an unregularized bias term, although Joachims extended the bound to include SVMs with a free "b" term, at the expense of an algebraically complicated derivation [54, 56]), but does not include Regularized Least Squares Classification, where the signs of the coefficients $\alpha_i$ are unconstrained. The proof given by Jaakkola and Haussler does depend crucially on the sign ...