## On reoptimizing multi-class classifiers (2008)

DOI: 10.1007/s10994-008-5056-8

### BibTeX

```bibtex
@ARTICLE{Bourke2008reoptimizing,
  author  = {Chris Bourke and Kun Deng and Stephen D. Scott and N. V. Vinodchandran},
  title   = {On reoptimizing multi-class classifiers},
  journal = {Machine Learning},
  volume  = {71},
  pages   = {219--242},
  year    = {2008},
  doi     = {10.1007/s10994-008-5056-8}
}
```

### Abstract

Significant changes in the instance distribution or associated cost function of a learning problem require one to reoptimize a previously-learned classifier to work under new conditions. We study the problem of reoptimizing a multi-class classifier based on its ROC hypersurface and a matrix describing the costs of each type of prediction error. For a binary classifier, it is straightforward to find an optimal operating point based on its ROC curve and the relative cost of true positive to false positive error. However, the corresponding multi-class problem (finding an optimal operating point based on a ROC hypersurface and cost matrix) is more challenging, and until now it was unknown whether an efficient algorithm existed that found an optimal solution. We answer this question by first proving that the decision version of this problem is NP-complete. As a complementary positive result, we give an algorithm that finds an optimal solution in polynomial time if the number of classes n is a constant. We also present several heuristics for this problem, including linear, nonlinear, and quadratic programming formulations, genetic algorithms, and a customized algorithm. Empirical results suggest that under both uniform and non-uniform cost models, simple greedy methods outperform more sophisticated methods. Preliminary results appeared in Deng et al. (2006). Editor: Tom Fawcett.
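For the two-class case the abstract calls straightforward, the optimal operating point can be found by sweeping a threshold over the classifier's confidence scores and keeping the one that minimizes expected cost. A minimal illustrative sketch (not the paper's code; function and variable names are ours):

```python
def optimal_threshold(scores, labels, cost_fp, cost_fn):
    """Pick the confidence threshold minimizing total cost.

    scores: classifier confidences for the positive class
    labels: 1 for positive, 0 for negative
    cost_fp / cost_fn: cost of a false positive / false negative
    """
    # Evaluating at each observed score, plus one value above the
    # maximum ("predict all negative"), covers every distinct
    # confusion matrix the classifier can realize.
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    best_t, best_cost = None, float("inf")
    for t in candidates:
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```

Changing the cost ratio simply reweights the two error counts, which moves the chosen operating point along the ROC curve; it is this step that has no known efficient analogue for n > 2 classes.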

### Citations

3666 | Convex Optimization - Boyd, Vandenberghe - 2004

> Citation context: "…: I_{i,j} w_j f_j(x_i) ≥ γ_i I_{i,j} (14), w_k f_k(x_i) ≤ γ_i (15), for all i ∈ {1,…,n} and j,k ∈ {1,…,m}, where each γ_i is a new variable. Obviously (15) is a linear constraint, but (14) is not even quasiconvex (Boyd and Vandenberghe 2004). The complexity of this optimization problem motivates us to reformulate it a bit further. Let us assume that f_k(x_i) ∈ (0, 1] (e.g. if f_k(·) are probability estimates from naïve Bayes or logistic re…"
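The reweighting formulation quoted in the context above rescales each class score f_k(x) by a weight w_k and predicts the class with the largest weighted score; constraints (14)-(15) encode that choice via the auxiliary variables γ_i. A tiny sketch of evaluating a candidate weight vector under a cost matrix (illustrative only; all names are ours):

```python
def reweighted_predict(w, scores):
    """Predict the class k maximizing w[k] * f_k(x) for one instance."""
    vals = [wk * fk for wk, fk in zip(w, scores)]
    return max(range(len(vals)), key=vals.__getitem__)

def total_cost(w, score_rows, labels, cost):
    """Sum cost[y][yhat] over instances under weight vector w."""
    return sum(cost[y][reweighted_predict(w, row)]
               for row, y in zip(score_rows, labels))
```

All of the paper's heuristics can be read as different search strategies over the weight vector w that this evaluation function scores.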

2868 | UCI Repository of Machine Learning Databases - Merz, Murphy - 1996

> Citation context: "…class is in the metaclass) and (1/|C′|) Σ_{j∈C′} c(y,j) otherwise. The MetaClass algorithm is presented as Algorithm 1. Figure 1 gives an example of a tree built by the MetaClass algorithm on the UCI (Blake and Merz 2005) data set Nursery, a 5-class data set. At the root, the classes are divided into two metaclasses, each with about the same number of training examples represented in their respective classes. In this…"
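The context above describes the root split of the MetaClass tree: classes are divided into two metaclasses holding roughly equal numbers of training examples. One simple way to produce such a balanced split is a greedy bin-packing pass (an illustrative sketch under that reading, not necessarily the paper's exact procedure):

```python
def split_metaclasses(class_counts):
    """Greedily split classes into two metaclasses with roughly
    balanced training-example counts.

    class_counts: dict mapping class label -> number of examples.
    Returns (left, right) sets of class labels.
    """
    left, right = set(), set()
    n_left = n_right = 0
    # Place largest classes first; always add to the lighter side.
    for cls, n in sorted(class_counts.items(), key=lambda kv: -kv[1]):
        if n_left <= n_right:
            left.add(cls)
            n_left += n
        else:
            right.add(cls)
            n_right += n
    return left, right
```

Recursing on each side until metaclasses are singletons yields a binary tree like the one described for the 5-class Nursery data set.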

2028 | Learning with Kernels - Schölkopf, Smola - 2002

260 | Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions - Provost, Fawcett - 1997

> Citation context: "…finding the optimal operating point of a classifier given a ratio of true positive cost to false positive cost and has a straightforward solution via Receiver Operating Characteristic (ROC) analysis (Provost and Fawcett 1997, 1998, 2001; Lachiche and Flach 2003). ROC analysis takes a classifier F that outputs confidences in its predictions (i.e. a ranking classifier), and precisely describes the tradeoffs between true po…"

253 | Robust Classification for Imprecise Environments - Provost, Fawcett

245 | Statistical comparisons of classifiers over multiple data sets - Demšar

145 | Multiple Comparison Procedures - Hochberg, Tamhane - 1987

> Citation context: "…t some set of algorithms outperform other methods with a high degree of certainty. Specific rankings can be found in Table 3. We also performed a post-hoc analysis using a Tukey-Kramer pairwise test (Hochberg and Tamhane 1987). The results of this analysis can be found in Fig. 2. For each table, the pairwise Tukey-Kramer test induces a partial order among the heuristics such that a relation from algorithm A to algorithm B…"
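The Tukey-Kramer test mentioned above compares every pair of algorithms against a studentized-range critical value, with a correction for unequal group sizes. A stdlib-only sketch (the critical value `q_crit` would come from a studentized-range table or e.g. `scipy.stats.studentized_range.ppf`; all names are ours):

```python
import math

def tukey_kramer(means, ns, mse, q_crit):
    """Return pairs (i, j) whose mean difference is significant under
    the Tukey-Kramer test (handles unequal group sizes).

    means: per-algorithm mean scores; ns: group sizes;
    mse: mean squared error from the ANOVA;
    q_crit: studentized-range critical value q_{alpha; k, df}.
    """
    sig = []
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            # Tukey-Kramer critical difference for this pair.
            hsd = q_crit * math.sqrt(mse / 2.0 * (1.0 / ns[i] + 1.0 / ns[j]))
            if abs(means[i] - means[j]) > hsd:
                sig.append((i, j))
    return sig
```

The significant pairs are exactly the edges of the partial order the paper draws in Fig. 2.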

126 | A Simple Generalisation of the Area under the ROC Curve to Multiple Class Classification Problems - Hand, Till - 2001

> Citation context: "…ly) choice is to reoptimize F based on a new (possibly smaller) data set. Such problems have been studied extensively (Fieldsend and Everson 2005; Ferri et al. 2003; […] Hand and Till 2001; Lachiche and Flach 2003; Mossman 1999; O’Brien and Gray 2005; Srinivasan 1999). For learning tasks with only n = 2 classes, this problem is equivalent to that of finding the optimal operating point…"

62 | Convexity and Optimization in Finite Dimensions - Stoer, Witzgall - 1970

> Citation context: "…ich the objective function and the inequality constraint functions are convex and the equality constraint functions are affine. The theory of convex programming is well-established (Rockafellar 1970; Stoer and Witzgall 1996). For a convex program, a local optimum is the global optimum and there are well-studied efficient algorithms to find such a global optimum. We tried several quadratic programming methods based on th…"

50 | Robust classification systems for imprecise environments - Provost, Fawcett

46 | Improving accuracy and cost of two-class and multi-class probabilistic classifiers under ROC curves - Lachiche, Flach - 2003

> Citation context: "…ptimize F based on a new (possibly smaller) data set. Such problems have been studied extensively (Fieldsend and Everson 2005; Ferri et al. 2003; […] Hand and Till 2001; Lachiche and Flach 2003; Mossman 1999; O’Brien and Gray 2005; Srinivasan 1999). For learning tasks with only n = 2 classes, this problem is equivalent to that of finding the optimal operating point of a classifier given a r…"

42 | 1BC: A first-order Bayesian classifier - Flach, Lachiche - 1999

> Citation context: "…(nm log m). Though there is no guarantee that this approach can find an optimal solution, they gave empirical results suggesting that it works well for optimizing 1BC, a logic-based Bayes classifier (Lachiche and Flach 1999). Although only briefly mentioned by Lachiche and Flach (2003), this ROC thresholding technique is extensible to cost-sensitive scenarios. O’Brien and Gray (2005) investigated the role of a cost matr…"

33 | Note on the location of optimal classifiers in n-dimensional ROC space - Srinivasan - 1999

> Citation context: "…blems have been studied extensively (Fieldsend and Everson 2005; Ferri et al. 2003; […] Hand and Till 2001; Lachiche and Flach 2003; Mossman 1999; O’Brien and Gray 2005; Srinivasan 1999). For learning tasks with only n = 2 classes, this problem is equivalent to that of finding the optimal operating point of a classifier given a ratio of true positive cost to false positive cost and…"

26 | The minimum satisfiability problem - Kohli, Krishnamurti, et al. - 1994

> Citation context: "…y the correct label y_i. To prove the hardness of REWEIGHT, we will reduce from the minimum satisfiability problem MINSAT, shown to be NP-complete by Kohli et al. (1994). Definition 2 (Problem MINSAT (Kohli et al. 1994)) Given: a set of disjunctions of pairs of literals ℓ_{11} ∨ ℓ_{12}, ℓ_{21} ∨ ℓ_{22}, …, ℓ_{m1} ∨ ℓ_{m2}, where each ℓ_{ij} is a boolean variable x_i or its negation ¬x_i. We are also given an integer K. Question: does there…"
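A candidate assignment for the MINSAT instance defined above is easy to verify: count the satisfied two-literal clauses and compare against K (the NP-hardness lies in finding such an assignment, not in checking one). An illustrative checker, using our own encoding of literals:

```python
def minsat_decide(clauses, assignment, K):
    """Check whether `assignment` satisfies at most K clauses.

    clauses: list of ((var, negated), (var, negated)) literal pairs,
             each pair encoding a disjunction l_{i1} OR l_{i2}.
    assignment: dict mapping variable name -> bool.
    """
    def lit(var, negated):
        return (not assignment[var]) if negated else assignment[var]

    satisfied = sum(1 for (a, b) in clauses if lit(*a) or lit(*b))
    return satisfied <= K
```

For example, under x = y = True the instance (x ∨ y) ∧ (¬x ∨ ¬y) satisfies exactly one clause, so the decision is yes for K = 1 and no for K = 0.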

22 | Convex programming - Boyd, Vandenberghe - 2005

> Citation context: "…nnegative). Upon termination, a direct pattern search is performed using the best solution from the GA. The final column is the quadratic program (QP) from Sect. 5.1.3 using the CVX software package (Grant et al. 2006). For all columns, values in italics indicate a worse performance than the baseline. Entries in bold indicate a significant difference from the baseline with at least a 95% confidence according to a…"

21 | Three-way ROCs - Mossman - 1999

> Citation context: "…(possibly smaller) data set. Such problems have been studied extensively (Fieldsend and Everson 2005; Ferri et al. 2003; […] Hand and Till 2001; Lachiche and Flach 2003; Mossman 1999; O’Brien and Gray 2005; Srinivasan 1999). For learning tasks with only n = 2 classes, this problem is equivalent to that of finding the optimal operating point of a classifier given a ratio of true p…"

20 | Volume under the ROC surface for multi-class problems - Ferri, Hernández-Orallo, et al. - 2003

> Citation context: "…ry restrictions. In this case the best (or perhaps only) choice is to reoptimize F based on a new (possibly smaller) data set. Such problems have been studied extensively (Fieldsend and Everson 2005; Ferri et al. 2003; […] Hand and Till 2001; Lachiche and Flach 2003; Mossman 1999; O’Brien and Gray 2005; Srinivasan 1999). For learning tasks with only n = 2 classes, this problem is equi…"

12 | NP-hardness of linear multiplicative programming and related problems - Matsui - 1996

7 | Formulation and comparison of multi-class ROC surfaces - Fieldsend, Everson - 2005

> Citation context: "…ilable, say due to proprietary restrictions. In this case the best (or perhaps only) choice is to reoptimize F based on a new (possibly smaller) data set. Such problems have been studied extensively (Fieldsend and Everson 2005; Ferri et al. 2003; […] Hand and Till 2001; Lachiche and Flach 2003; Mossman 1999; O’Brien and Gray 2005; Srinivasan 1999). For learning tasks with only n = 2 classes, t…"

4 | New algorithms for optimizing multiclass classifiers via ROC surfaces - Deng, Bourke, et al. - 2006

4 | Improving classification performance by exploring the role of cost matrices in partitioning the estimated class probability space - O’Brien, Gray - 2005

3 | Genetic algorithm and direct search toolbox. http://www.mathworks.com - Abramson - 2005

3 | Convex Analysis, 2nd Edn - Rockafellar - 1999

> Citation context: "…programming in which the objective function and the inequality constraint functions are convex and the equality constraint functions are affine. The theory of convex programming is well-established (Rockafellar 1970; Stoer and Witzgall 1996). For a convex program, a local optimum is the global optimum and there are well-studied efficient algorithms to find such a global optimum. We tried several quadratic progra…"

2 |
The MOSEK optimization tools version 3.2. http://www.mosek.com
- ApS
- 2005
(Show Context)
Citation Context ...les). The results of the experiments on our heuristics can be found in the last five columns of each table. Here,“MC”istheMetaClass algorithm (Algorithm 1). “LP” is a linear programming solver (MOSEK =-=ApS 2005-=-) on(21) with η = 10 −6 . “GA1” is the sum of linear fractional functions formulation (22) using a genetic algorithm. “GA2” is a genetic algorithm optimization of (1). Both experiments used the GA imp... |