## 2.2 Oblique Trees, MSM-T (2007)

### BibTeX

```
@MISC{Toth07oblique,
  author = {Norbert Tóth},
  title  = {2.2 Oblique Trees, MSM-T},
  year   = {2007}
}
```

### Abstract

In this technical report a novel method is proposed that extends the decision tree framework, allowing standard decision tree classifiers to provide a unique certainty value for every input sample they classify. This value is calculated individually for each input sample and represents the classifier's certainty in that classification. The proposed algorithm is not limited to axis-parallel trees; it can be applied to any kind of decision tree where the decisions are

### Citations

4934 |
C4.5: Programs for Machine Learning
- Quinlan
- 1993
Citation Context ...in 1973 by Morgan and Messenger which handled nominal or categorical responses. The THAID program created classification trees. Now several decision tree approaches exist, e.g.: CART, ID3 [4], C4.5 [5], C5, THAID, CHAID, TREEDISC, etc. One of the most frequently used decision tree frameworks – upon which many approaches are based – is the Classification and Regression Trees (CART, 1984) [6] developed by... |

3666 |
Convex Optimization
- Boyd, Vandenberghe
- 2004
Citation Context ...gions. Most decision tree building algorithms define hyperplanes (either axis-parallel or oblique) at the nodes of the tree. Each of these hyperplanes splits the input space into two convex halfspaces [34]. Every leaf of the decision tree is an intersection of these halfspaces, which defines a region in the input space: a polyhedron. A polyhedron is defined as the solution set of a finite number of inequa... |

3354 | Induction of Decision Trees
- Quinlan
- 1986
Citation Context ...rithm [3] in 1973 by Morgan and Messenger which handled nominal or categorical responses. The THAID program created classification trees. Now several decision tree approaches exist, e.g.: CART, ID3 [4], C4.5 [5], C5, THAID, CHAID, TREEDISC, etc. One of the most frequently used decision tree frameworks – upon which many approaches are based – is the Classification and Regression Trees (CART, 1984) [6] de... |

1922 |
Pattern Classification
- Duda, Hart, et al.
- 2000
Citation Context ...lassification confidence value attached to each individual input sample. To get true / false classification probability estimates, one can use a set of training samples and density estimation [37][38][40] to approximate the probability density of the true and false classification based on the sample’s distance calculated from the decision boundary [15][33]. The oldest and simplest density estimator is... |

539 |
The meaning and use of the area under a receiver operating characteristic (ROC) curve
- Hanley, McNeil
- 1982
Citation Context ...es. In this sense a classifier must not only classify a case, it must also provide class probability estimates (or other certainty information that can be used for ranking purposes). It has been shown [21] that the Area Under the ROC (Receiver Operating Characteristics) Curve (AUC) of a classifier system measures its ranking performance [19] – [23]. It is now a widely used measure to compare classifier... |
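The AUC-as-ranking-performance result quoted above can be made concrete: the AUC equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (the Mann–Whitney statistic). A minimal sketch (function name and data are illustrative, not from the report):

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs ranked correctly, ties counting half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# a ranking that places every positive above every negative
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # → 1.0
```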

250 | A system for induction of oblique decision trees
- Murthy, Kasif, et al.
- 1994
Citation Context ...s for building oblique decision trees. In 1984 Breiman et al. [6] proposed a method to induce oblique decision trees with linear combinations. Their work was further improved by Murthy et al. in 1994 [7]. They developed the OC1 algorithm (based on CART) that adds randomization to avoid local minima. These methods used heuristics to find the decision boundary. Finding the optimal decision boundary pos... |

126 |
A Simple Generalisation of the Area under the ROC Curve to Multiple Class Classification Problems

- Hand, Till
- 2001
Citation Context ...ion that can be used for ranking purposes). It has been shown [21] that the Area Under the ROC (Receiver Operating Characteristics) Curve (AUC) of a classifier system measures its ranking performance [19] – [23]. It is now a widely used measure to compare classifiers. The original form of the ROC analysis was developed for binary (two class) classification problems. Although many extensions have been ... |

124 | Breast cancer diagnosis and prognosis via linear programming
- Mangasarian, Street, et al.
- 1995
Citation Context ...nce to the origin and e is a vector of ones of appropriate dimension. However the two classes are usually not separable and the inequalities (2.11) do not hold. Therefore the following linear program [15] tries to satisfy them as much as possible by minimizing the average distance of the misclassified points to the decision boundary: minimize eᵀy/m + eᵀz/k subject to Aw + y ≥ eγ + e, Bw − z... |
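The program quoted above is cut off by the text extraction. Written out in the standard robust-LP form of the cited linear-programming literature (a reconstruction, not recovered from this page), it reads:

```latex
\begin{aligned}
\min_{w,\gamma,y,z}\quad & \frac{e^{T}y}{m} + \frac{e^{T}z}{k} \\
\text{subject to}\quad   & Aw + y \ge e\gamma + e, \\
                         & Bw - z \le e\gamma - e, \\
                         & y \ge 0,\; z \ge 0,
\end{aligned}
```

where the rows of A and B hold the m and k training points of the two classes, and the slack vectors y and z measure the constraint violations averaged in the objective.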

107 |
Problems in the analysis of survey data and a proposal
- Morgan, Sonquist
- 1963
Citation Context ...tion tree with three univariate splits and two classes. The Origin of Decision Tree Classifiers The origin of decision trees dates back to 1963, when the AID (Automatic Interaction Detection) program [1][2] was developed at the Institute for Social Research, University of Michigan, by Morgan and Sonquist. They proposed a method for fitting trees to predict a quantitative variable. The AID algorithm c... |

97 | Obtaining calibrated probability estimates from decision trees and naive Bayes classifiers - Zadrozny, Elkan |

96 |
The attribute selection problem in decision tree generation
- Fayyad, Irani
- 1992
Citation Context ...to various criteria. W. N. Street (2005) proposed the use of nonlinear programming to find optimal separators according to the orthogonality measure [17] introduced by Fayyad and Irani earlier (1992) [12]. This method creates oblique trees for n-class classification problems. Another approach to extend the two-class limitation of standard linear programming approaches was proposed by Bennett and Manga... |

49 | Applied Multivariate Statistical Analysis
- Härdle, Simar
- 2000
Citation Context ...tion of the kernel function and the window width h . Selection of the Bandwidth There are several density estimation methods. One of the most widely used and simplest is Silverman’s rule of thumb [37][41]. The idea is to choose h to minimize the difference to some reference distribution. For the Gaussian kernel and the normal reference distribution, the rule of thumb is to choose h = (4σ̂⁵ / (3n))^{1/5}, where... |
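As a concrete reading of the rule of thumb described above, here is a pure-Python sketch; the constant (4/(3n))^{1/5} ≈ 1.06 · n^{-1/5} is the standard Gaussian-reference factor, and the function name is illustrative:

```python
import statistics

def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth h = (4 / (3 n))^(1/5) * sigma_hat for a
    Gaussian kernel under a normal reference distribution."""
    n = len(samples)
    sigma_hat = statistics.stdev(samples)  # sample standard deviation
    return (4.0 / (3.0 * n)) ** 0.2 * sigma_hat
```

The bandwidth shrinks like n^{-1/5}, so larger samples get narrower kernels.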

46 | Nonparametric density estimation: toward computational tractability - Gray, Moore |

39 |
Classification and Regression Trees

- Breiman, Friedman, et al.
- 1984
Citation Context ...D3 [4], C4.5 [5], C5, THAID, CHAID, TREEDISC, etc. One of the most frequently used decision tree frameworks – upon which many approaches are based – is the Classification and Regression Trees (CART, 1984) [6] developed by Breiman et al. Their work is based on the original ideas of the AID and the THAID algorithms. Their work was used as basis for the enhancements proposed in this discussion. Axis-parallel... |

39 |
Decision tree construction via linear programming
- Bennett
- 1992
Citation Context ... decision boundary poses difficult challenges, and has been a subject of the mathematical-programming community since the 1960s [8]-[18]. Bennett (1992) proposed a framework called Multisurface Method Tree (MSM-T) [13][14], which is a variant of the Multisurface Method [9][14]. This is one of the most widely used and documented methods. It will be briefly introduced as it will be used by the demonstrative examples th... |

39 | Retrofitting decision tree classifiers using kernel density estimation
- Smyth, Gray, et al.
- 1995
Citation Context ...ce falling to the same leaf of the decision tree. The other two approaches try to estimate conditional class probabilities, considering the location of input samples at the leaves. Smyth et al. (1995) [29] proposed the use of kernel density estimation at the leaves of the decision tree to provide probability estimates. They used trees with larger leaves containing more sample points. The major drawback o... |

30 | Decision tree with better ranking
- Ling, Yan
Citation Context ...Figure 3.1. The ROC space regions. Probability Estimation Trees (PETs) There are numerous approaches to extend the decision tree framework to provide ranking information [24] – [32]. The most obvious approach turns decision trees into probability estimators by using the absolute class frequencies at every leaf: p_i = n_i / N, (3.4) where n_i is the number of the observed... |
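The frequency estimate p_i = n_i / N in the quoted passage is the unsmoothed probability-estimation-tree baseline; a minimal sketch (names illustrative):

```python
def leaf_probabilities(class_counts):
    """Raw leaf estimate p_i = n_i / N, where n_i is the number of
    training samples of class i at the leaf and N their total."""
    total = sum(class_counts)
    return [n / total for n in class_counts]

# a leaf holding 8 samples of class 0 and 2 samples of class 1
print(leaf_probabilities([8, 2]))  # → [0.8, 0.2]
```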

24 |
Density estimation for statistics and data analysis
- Silverman
- 1986
Citation Context ...unique classification confidence value attached to each individual input sample. To get true / false classification probability estimates, one can use a set of training samples and density estimation [37][38][40] to approximate the probability density of the true and false classification based on the sample’s distance calculated from the decision boundary [15][33]. The oldest and simplest density esti... |

22 | Convex programming
- Boyd, Vandenberghe
- 2005
Citation Context ...rticular cell. 3.2 Distance from the Induced Decision Boundaries Distance from sections – leaves of the tree – of the input space can be measured by solving a set of constrained quadratic programs [34] – [36], which are a special case of convex optimization problems. A convex optimization problem [34] is given as: minimize f₀(x) subject to fᵢ(x) ≤ 0, i = 1…m, aᵢᵀx = bᵢ, i = 1…p, (3.9) where f₀, …, f... |
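The quoted passage reduces distance-to-leaf computation to constrained quadratic programs over halfspace intersections. As a self-contained illustration (not the report's own solver), Dykstra's alternating-projection scheme solves exactly that QP, min ‖z − x‖² subject to aᵢ·z ≤ bᵢ:

```python
import math

def project_halfspace(x, a, b):
    """Euclidean projection of x onto the halfspace {z : a.z <= b}."""
    viol = sum(ai * xi for ai, xi in zip(a, x)) - b
    if viol <= 0:
        return list(x)                   # already feasible
    scale = viol / sum(ai * ai for ai in a)
    return [xi - scale * ai for xi, ai in zip(x, a)]

def project_polyhedron(x, halfspaces, iters=200):
    """Dykstra's alternating projections: converges to the solution of
    the QP  min ||z - x||^2  s.t.  a_i . z <= b_i  for all i,
    i.e. the Euclidean projection of x onto the polyhedron."""
    p = list(x)
    corrections = [[0.0] * len(x) for _ in halfspaces]
    for _ in range(iters):
        for i, (a, b) in enumerate(halfspaces):
            y = [pi + ci for pi, ci in zip(p, corrections[i])]
            p = project_halfspace(y, a, b)
            corrections[i] = [yi - pi for yi, pi in zip(y, p)]
    return p

# distance from the point (2, 2) to the unit box [0,1]^2,
# written as the intersection of four halfspaces
box = [([1, 0], 1), ([-1, 0], 0), ([0, 1], 1), ([0, -1], 0)]
proj = project_polyhedron([2.0, 2.0], box)
print(round(math.dist([2.0, 2.0], proj), 6))  # → 1.414214
```

Plain alternating projections would only find *a* feasible point; the correction terms are what make Dykstra converge to the nearest one.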

16 | The Use of the Area under the ROC - Bradley - 1997 |

15 |
THAID: A sequential analysis program for the analysis of nominal scale dependent variables

- Morgan, Messenger
- 1973
Citation Context ...gan, by Morgan and Sonquist. They proposed a method for fitting trees to predict a quantitative variable. The AID algorithm created regression trees. A modification to the AID was the THAID algorithm [3] in 1973 by Morgan and Messenger which handled nominal or categorical responses. The THAID program created classification trees. Now several decision tree approaches exist, e.g.: CART, ID3 [4], C4.5... |

8 |
The Case Against Accuracy Estimation for Comparing Induction Algorithms

- Provost, Fawcett, Kohavi
Citation Context ...t can be used for ranking purposes). It has been shown [21] that the Area Under the ROC (Receiver Operating Characteristics) Curve (AUC) of a classifier system measures its ranking performance [19] – [23]. It is now a widely used measure to compare classifiers. The original form of the ROC analysis was developed for binary (two class) classification problems. Although many extensions have been develop... |

8 |
Tree induction for probability based ranking
- Provost, Domingos
Citation Context ...proaches that aim at improving PETs can be categorized into two groups. The first group of methods alters the growth process of the tree to make it more suitable for probability estimation [25][28]; they define different splitting criteria or pruning techniques. The other approaches try to obtain better probability estimates without altering the tree. This second group of approaches is i... |

7 | Ranking cases with decision trees: a geometric method that preserves intelligibility
- Alvarez, Bernard
- 2005
Citation Context ...Figure 3.1. The ROC space regions. Probability Estimation Trees (PETs) There are numerous approaches to extend the decision tree framework to provide ranking information [24] – [32]. The most obvious approach turns decision trees into probability estimators by using the absolute class frequencies at every leaf: p_i = n_i / N, (3.4) where n_i is the number of the observed sample... |

6 | Explaining the result of a decision tree to the end-user
- Alvarez
- 2004
Citation Context ... render the method useless in high dimensions or without a huge number of test samples (to estimate the densities). See chapter 3.6 for discussion on the “curse-of-dimensionality”. Alvarez et al. (2004) [30] – [32] proposed a geometric ranking method that is based on calculating the input sample's distance to the decision boundary induced by the decision tree. The input sample is projected onto ever... |

4 | Decision trees for ranking: effect of new smoothing methods, new splitting criteria and simple pruning methods. Technical report, UPV/DSIC
- Ferri, Flach, et al.
- 2003
Citation Context ...estimation at the leaves. One way to improve the probability estimates provided by decision trees is the use of smoothing techniques, which is one of the most widely used methods to create PETs [25][26]. Mostly the Laplace smoothing correction is used, where the class probability estimates take the form of: p_i = (n_i + 1) / (N + C), (3.5) where C is the number of classes. The use of Laplace-correction i... |
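The Laplace correction in the quoted passage, p_i = (n_i + 1)/(N + C), can be sketched directly (illustrative code, not the cited implementation):

```python
def laplace_estimate(class_counts):
    """Laplace-corrected leaf estimate p_i = (n_i + 1) / (N + C):
    N training samples at the leaf, C classes; pulls small-leaf
    frequency estimates toward the uniform distribution 1/C."""
    total, n_classes = sum(class_counts), len(class_counts)
    return [(n + 1) / (total + n_classes) for n in class_counts]

# a pure leaf no longer claims probability 1.0 for its class
print(laplace_estimate([8, 0]))  # → [0.9, 0.1]
```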

3 | Oblique multicategory decision trees using nonlinear programming
- Street
Citation Context ...ovement seek to find optimal hyperplanes according to various criteria. W. N. Street (2005) proposed the use of nonlinear programming to find optimal separators according to the orthogonality measure [17] introduced by Fayyad and Irani earlier (1992) [12]. This method creates oblique trees for n-class classification problems. Another approach to extend the two-class limitation of standard linear progr... |

3 | Model assessment with ROC curves, in: The Encyclopedia of Data Warehousing - Hamel - 2008 |

2 |
Keep the Decision Tree and Estimate the Class Probabilities Using its Decision Boundary
- Alvarez, Bernard, et al.
- 2007
Citation Context ...ance values are the better ones). This assumption is based on the hypothesis that the classification certainty increases with the distance from the decision boundary. Later in 2007 Alvarez et al. [33] combined the geometric ranking method with kernel density estimation to provide class probability estimates while at the same time avoiding the “curse-of-dimensionality” coming from the nature of k... |

1 |
Linear and Nonlinear Separation of Patterns by Linear Programming

- Mangasarian
- 1965
Citation Context ...nima. These methods used heuristics to find the decision boundary. Finding the optimal decision boundary poses difficult challenges, and has been a subject of the mathematical-programming community since the 1960s [8]-[18]. Bennett (1992) proposed a framework called Multisurface Method Tree (MSM-T) [13][14], which is a variant of the Multisurface Method [9][14]. This is one of the most widely used and documented me... |

1 |
Multisurface Method of Pattern Separation

- Mangasarian
- 1968
Citation Context ... subject of the mathematical-programming community since the 1960s [8]-[18]. Bennett (1992) proposed a framework called Multisurface Method Tree (MSM-T) [13][14], which is a variant of the Multisurface Method [9][14]. This is one of the most widely used and documented methods. It will be briefly introduced as it will be used by the demonstrative examples through the following sections. The main difference of M... |

1 |
Multicategory Discrimination via Linear Programming

- Bennett, Mangasarian
- 1992
Citation Context ... to extend the two-class limitation of standard linear programming approaches was proposed by Bennett and Mangasarian to extend the linear programming approaches to multicategory separation (MSMT-MC) [10]. A very big portion of the improvements aims to improve the class probability estimates of decision trees. This topic is addressed in the next chapter: Uncertainty Using Decision Tree Classifiers. |

1 | Robust Linear Programming Discrimination of Two Linearly Inseparable Sets - Bennett, Mangasarian - 1991 |

1 |
Mathematical Programming in Neural Networks

- Mangasarian
- 1992
Citation Context ...ision boundary poses difficult challenges, and has been a subject of the mathematical-programming community since the 1960s [8]-[18]. Bennett (1992) proposed a framework called Multisurface Method Tree (MSM-T) [13][14], which is a variant of the Multisurface Method [9][14]. This is one of the most widely used and documented methods. It will be briefly introduced as it will be used by the demonstrative examples throug... |

1 |
Linear Programming: Foundations and Extensions

- Vanderbei
Citation Context .... These methods used heuristics to find the decision boundary. Finding the optimal decision boundary poses difficult challenges, and has been a subject of the mathematical-programming community since the 1960s [8]-[18]. Bennett (1992) proposed a framework called Multisurface Method Tree (MSM-T) [13][14], which is a variant of the Multisurface Method [9][14]. This is one of the most widely used and documented methods... |

1 |
Improved Class Probability Estimates from Decision Tree Models. Research note

- Margineantu
Citation Context ...ches that aim at improving PETs can be categorized into two groups. The first group of methods alters the growth process of the tree to make it more suitable for probability estimation [25][28]; they define different splitting criteria or pruning techniques. The other approaches try to obtain better probability estimates without altering the tree. This second group of approaches is in th... |

1 |
On the generalised distance in statistics

- Mahalanobis
- 1936
Citation Context ... polyhedron to a point s, which is the minimal distance projection of s to P, is different for the Euclidean and the Mahalanobis metric. The Mahalanobis distance was introduced by P. C. Mahalanobis [39][44]. It differs from the Euclidean distance in that it takes into account the correlations between the input variables, and is scale invariant. The Mahalanobis distance between two random vectors x a... |
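The Mahalanobis distance described above is d(x, y) = √((x − y)ᵀ S⁻¹ (x − y)) for covariance matrix S. A minimal 2-D sketch (the closed-form 2×2 inverse and the function name are for illustration only):

```python
import math

def mahalanobis_2d(x, y, cov):
    """d(x, y) = sqrt((x - y)^T S^-1 (x - y)) for 2-D vectors,
    inverting the 2x2 covariance matrix S in closed form."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [xi - yi for xi, yi in zip(x, y)]
    return math.sqrt(sum(diff[i] * inv[i][j] * diff[j]
                         for i in range(2) for j in range(2)))

# with S = I the Mahalanobis distance reduces to the Euclidean one
print(mahalanobis_2d([3, 4], [0, 0], [[1, 0], [0, 1]]))  # → 5.0
```

Scaling a coordinate and its variance together leaves the distance unchanged, which is the scale invariance the passage mentions.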

1 |
Non Asymptotic Binomial Confidence Intervals. Statistics Research Associates
- Harde
Citation Context ...of the test samples correctly classified is the maximum-likelihood estimate for p: p̂ = k/n. (3.33) For the binomial proportion p, confidence intervals can be calculated using the F distribution [42]. Define the 100(1−α)% confidence interval for p as (φ, ψ). The lower bound can be calculated as: φ = 0 if k = 0; φ = k / (k + (n − k + 1) F_{2(n−k+1), 2k, 1−α/2}) if k ≠ 0. (3.34) The upper bound... |
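The F-quantile formula (3.34) quoted above is the Clopper–Pearson exact interval. The F form is algebraically equivalent to solving the binomial tail equations, which the sketch below does by bisection instead, so no F quantiles are needed (names illustrative):

```python
import math

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact 100(1 - alpha)% interval (phi, psi) for a binomial
    proportion: phi solves P(X >= k | p) = alpha/2 and psi solves
    P(X <= k | p) = alpha/2, found by bisecting the monotone tails."""
    def solve(f, target):                # find p with f(p) = target
        lo, hi = 0.0, 1.0
        for _ in range(80):              # bisection; f increasing in p
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(mid) < target else (lo, mid)
        return (lo + hi) / 2
    phi = 0.0 if k == 0 else solve(lambda p: binom_sf(k, n, p), alpha / 2)
    psi = 1.0 if k == n else solve(lambda p: binom_sf(k + 1, n, p),
                                   1 - alpha / 2)
    return phi, psi
```

For k = n the lower bound collapses to (α/2)^{1/n}, matching the text's special-casing of the boundary counts.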