## Mathematical Programming in Data Mining (1996)

Venue: Data Mining and Knowledge Discovery

Citations: 26 (3 self)

### BibTeX

```bibtex
@article{Mangasarian96mathematicalprogramming,
  author  = {O. L. Mangasarian},
  title   = {Mathematical Programming in Data Mining},
  journal = {Data Mining and Knowledge Discovery},
  year    = {1996},
  volume  = {42},
  pages   = {183--201}
}
```

### Abstract

Mathematical programming approaches to three fundamental problems will be described: feature selection, clustering and robust representation. The feature selection problem considered is that of discriminating between two sets while recognizing irrelevant and redundant features and suppressing them. This creates a lean model that often generalizes better to new unseen data. Computational results on real data confirm improved generalization of leaner models. Clustering is exemplified by the unsupervised learning of patterns and clusters that may exist in a given database and is a useful tool for knowledge discovery in databases (KDD). A mathematical programming formulation of this problem is proposed that is theoretically justifiable and computationally implementable in a finite number of steps. A resulting k-Median Algorithm is utilized to discover very useful survival curves for breast cancer patients from a medical database. Robust representation is concerned with minimizing trained m...

### Citations

8980 | Statistical Learning Theory - Vapnik - 1998 |

3265 | Convex Analysis - Rockafellar - 1996 |

2649 | Introduction to Statistical Pattern Recognition, 2nd edition - Fukunaga - 1990
Citation Context ...idered as an application of Occam's "law of parsimony", also known as Occam's Razor [65, 9], which states: "What can be done with fewer [assumptions] is done in vain with more". There are statistical [22], machine learning [39, 33] as well as mathematical programming [12, 45, 10] approaches to the feature selection problem. In this work we shall deal principally with the latter because of the novelty ... |

2152 | Algorithms for Clustering Data - Jain, Dubes - 1988
Citation Context ...consin Prognosis Breast Cancer Database (WPBC), distinct and clinically important survival curves were discovered from the database by the k-Median Algorithm, whereas the traditional k-Mean Algorithm [32, 64], which uses the square of the 2-norm distance, thus emphasizing outliers, failed to obtain such distinct survival curves for the same database. On four other publicly available databases each of the ... |
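The median-versus-mean distinction behind this snippet is easy to see in a few lines: the coordinatewise median minimizes the sum of 1-norm distances (the k-Median center step), while the coordinatewise mean minimizes the sum of squared 2-norm distances (the k-Mean center step) and is dragged by outliers. An illustrative sketch, not the paper's code; the toy cluster and its gross outlier are invented:

```python
import numpy as np

# A tight cluster of four points plus one gross outlier.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                   [100.0, 100.0]])

# k-Median center step: minimizer of sum_i ||a_i - c||_1 is the coordinatewise median.
median_center = np.median(points, axis=0)

# k-Mean center step: minimizer of sum_i ||a_i - c||_2^2 is the coordinatewise mean.
mean_center = np.mean(points, axis=0)

print(median_center)  # stays near the tight cluster: [1. 1.]
print(mean_center)    # pulled toward the outlier:    [20.4 20.4]
```

The single outlier moves the mean by roughly 20 units per coordinate but leaves the median inside the tight cluster, which is exactly the outlier emphasis the snippet attributes to the squared 2-norm.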

1827 | Robust Statistics - Huber - 1981
Citation Context ...tween point A_i^T and center C_ℓ, and e is an n×1 vector of ones in R^n. Hence e^T D_iℓ bounds the 1-norm distance between A_i and C_ℓ. We note that just as in the case of robust regression [31], [27, pp 82-87], the use of the 1-norm here to measure the error criterion leads to insensitivity to outliers such as those resulting from distributions with pronounced tails. We also note that since ... |
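The outlier insensitivity of the 1-norm error criterion mentioned here shows up clearly in a least-absolute-deviations line fit, which (like the formulations in the paper) reduces to a linear program. A sketch using `scipy.optimize.linprog`; the helper name and toy data are ours:

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(X, y):
    """1-norm (least-absolute-deviations) fit: min_b sum_i |y_i - X_i.b|,
    cast as the LP  min e't  s.t.  Xb - t <= y,  -Xb - t <= -y,  t >= 0."""
    m, n = X.shape
    c = np.concatenate([np.zeros(n), np.ones(m)])          # variables [b, t]
    A_ub = np.block([[X, -np.eye(m)], [-X, -np.eye(m)]])
    b_ub = np.concatenate([y, -y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n + [(0, None)] * m)
    return res.x[:n]

# Points on y = 2x with one gross outlier at x = 4.
x = np.arange(5.0)
X = np.column_stack([x, np.ones(5)])                       # [slope, intercept]
y = np.array([0.0, 2.0, 4.0, 6.0, 100.0])

coef_1norm = lad_fit(X, y)                      # recovers approximately [2, 0]
coef_lsq = np.linalg.lstsq(X, y, rcond=None)[0] # squared error is pulled far off
print(coef_1norm, coef_lsq)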

1774 | Introduction to the Theory of Neural Computation - Hertz, Palmer - 1991
Citation Context ...ints [10]. A step function that appears in the objective function is approximated by a concave exponential on the nonnegative real line instead of the conventional sigmoid function of neural networks [28]. This leads to a very fast iterative linear-programming-based algorithm for solving the problem that terminates in a finite number of steps. On the Wisconsin Prognosis Breast Cancer (WPBC) [72, 51] d... |
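The concave exponential mentioned here replaces the step function (0 at x = 0, 1 for x > 0) by 1 − exp(−αx) on the nonnegative reals; at any fixed x > 0 the gap is exactly exp(−αx), so the approximation tightens as α grows. A small numeric check — the α values are chosen arbitrarily for the demo:

```python
import math

def step(x):
    # Step function on the nonnegative real line: 0 at x = 0, 1 for x > 0.
    return 1.0 if x > 0 else 0.0

def concave_exp(x, alpha):
    # Concave exponential approximation of the step for x >= 0.
    return 1.0 - math.exp(-alpha * x)

x = 0.5
for alpha in (1.0, 5.0, 25.0):
    gap = step(x) - concave_exp(x, alpha)  # equals exp(-alpha*x), shrinks with alpha
    print(alpha, round(gap, 6))
```

Unlike a sigmoid, this approximation is concave on x ≥ 0, which is what lets the paper's algorithm solve a sequence of linear programs.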

1040 | Theory of Games and Economic Behavior - Neumann, Morgenstern - 1944
Citation Context ...straints, is a broad discipline that has been applied to a great variety of theoretical and applied problems such as operations research [29, 54], network problems [60, 53], game theory and economics [71, 35], engineering mechanics [57, 37] and more recently to machine learning [3, 61, 23, 48, 46]. In this paper we describe three recent mathematical-programming-based developments that are relevant to data... |

1038 | Integer and Combinatorial Optimization - Nemhauser, Wolsey - 1988
Citation Context ...that the problem features must be real numbers or easily mapped into real numbers. If some of the features are discrete and can be represented as integers, then the techniques of integer programming [24, 55, 14] can be employed. Integer programming approaches have been applied for example to clustering problems [64, 1], but will not be described here, principally because the combinatorial approach is fundame... |

1027 | Nonparametric estimation from incomplete observations - Kaplan, Meier
Citation Context ...stic Breast Cancer Database (WPBC) in order to discover medical knowledge. For such medical databases, extracting well-separated survival curves provides an essential prognostic tool. Survival curves [34, 38] give expected percent of surviving patients as a function of time. The k-Median Algorithm was applied to WPBC to extract such curves. Survival curves were constructed for 194 patients using two clini... |
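Survival curves of this kind are the Kaplan-Meier product-limit estimate introduced in the cited reference: at each death time t, the survival probability is multiplied by (n_t − d_t)/n_t, where n_t patients are still at risk and d_t die at t. A minimal sketch on invented toy data (`event = 1` marks a death, `0` a censored observation):

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit survival estimate: at each death time t,
    S(t) <- S(t-) * (n_t - d_t) / n_t, with n_t the number still at risk."""
    times, events = np.asarray(times), np.asarray(events)
    surv, curve = 1.0, {}
    for t in np.unique(times):
        d = int(np.sum((times == t) & (events == 1)))  # deaths at t
        n = int(np.sum(times >= t))                    # at risk just before t
        if d > 0:
            surv *= (n - d) / n
            curve[float(t)] = surv
    return curve

# Five patients: deaths at t = 1, 2, 4; censored follow-up at t = 3 and 5.
km = kaplan_meier([1, 2, 3, 4, 5], [1, 1, 0, 1, 0])
print(km)  # survival steps down at the death times t = 1, 2, 4
```

Censored patients (t = 3, 5) never reduce the curve directly; they only shrink the risk set for later death times, which is what makes the estimate valid under incomplete observation.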

811 | Linear Programming and Extensions - Dantzig - 1963
Citation Context ...of real variables, f is a real-valued function of x, g and h are finite dimensional vector functions of x. If all the functions f, g and h are linear then the problem simplifies to a linear program [15, 52, 69] which is the classical problem of mathematical Mathematical Programming Technical Report 96-05, August 1996 -- Revised November 1996 & March 1997. This material is based on research supported by Nati... |

741 | UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html - Murphy, Aha - 1992
Citation Context ...works [28]. This leads to a very fast iterative linear-programming-based algorithm for solving the problem that terminates in a finite number of steps. On the Wisconsin Prognosis Breast Cancer (WPBC) [72, 51] database the proposed algorithm reduced cross-validation error on a cancer prognosis database by 35.4% while reducing problem features from 32 to 4. 2. Clustering The clustering problem considered in... |

740 | Nonlinear Programming. Athena Scientific - Bertsekas - 1999
Citation Context ...y effective applications of the latter to the former. We will, however, point out other approaches that are mostly not based on mathematical programming. The fundamental nonlinear programming problem [6, 44] consists of minimizing an objective function subject to inequality and equality constraints and is typically written as follows: min_x f(x) subject to g(x) ≤ 0, h(x) = 0, (1) where x is an n-dimensional... |
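Problem (1) in this snippet, min f(x) subject to g(x) ≤ 0 and h(x) = 0, can be prototyped with any constrained solver. A sketch on an invented toy instance using `scipy.optimize.minimize`, which expects inequality constraints in the form fun(x) ≥ 0, so g is negated:

```python
from scipy.optimize import minimize

# Toy instance of (1): minimize f(x) = (x0-1)^2 + (x1-2)^2
# subject to g(x) = x0 + x1 - 2 <= 0 and h(x) = x0 - x1 = 0.
res = minimize(
    lambda x: (x[0] - 1) ** 2 + (x[1] - 2) ** 2,
    x0=[0.0, 0.0],
    constraints=[
        {"type": "ineq", "fun": lambda x: -(x[0] + x[1] - 2)},  # -g(x) >= 0
        {"type": "eq",   "fun": lambda x: x[0] - x[1]},         #  h(x) == 0
    ],
)
print(res.x)  # optimum at approximately [1, 1]
```

The equality constraint pins x0 = x1 = t and the inequality caps t at 1, so the constrained minimizer is (1, 1) even though the unconstrained minimum of f sits at (1, 2).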

643 | Knowledge acquisition via incremental conceptual clustering - Fisher - 1987
Citation Context ...gnment of elements of a given set into groups or clusters of like points, is the objective of cluster analysis. There are many approaches to this problem, including statistical [32], machine learning [20], integer and mathematical programming approaches [64, 1, 58, 11]. We shall describe here the recent approach of [11] that utilizes a fast bilinear programming approach: minimizing the product of two ... |

619 | Parallel and Distributed Computation: Numerical Methods - Bertsekas, Tsitsiklis - 1989
Citation Context ...atabases are amenable to the proposed linear-programming-based methods. In addition there is a large body of literature on the parallel solution and decomposition of large scale mathematical programs [7, 16, 17, 19] where for many of the algorithms only part of the data is loaded into memory at a time. Such parallel and decomposition algorithms extend further the applicability of the proposed methods to very lar... |

595 | Irrelevant Features and the Subset Selection Problem - John, Kohavi, et al. - 1994
Citation Context ...n of Occam's "law of parsimony", also known as Occam's Razor [65, 9], which states: "What can be done with fewer [assumptions] is done in vain with more". There are statistical [22], machine learning [39, 33] as well as mathematical programming [12, 45, 10] approaches to the feature selection problem. In this work we shall deal principally with the latter because of the novelty of the approach and its effe... |

555 | Stacked generalization - Wolpert
Citation Context ...changes. This problem is closely related to the generalization problem of machine learning of how to train a system on a given training set so as to improve generalization on a new unseen testing set [39, 63, 73]. We use here a simple linear model [67] and will show that if a sufficiently small error is purposely tolerated in constructing the model, then for a broad class of perturbations the model will be... |

419 | Optimal brain damage - LeCun, Denker, et al. - 1990
Citation Context ...changes. This problem is closely related to the generalization problem of machine learning of how to train a system on a given training set so as to improve generalization on a new unseen testing set [39, 63, 73]. We use here a simple linear model [67] and will show that if a sufficiently small error is purposely tolerated in constructing the model, then for a broad class of perturbations the model will be... |

324 | Robust estimation of a location parameter - Huber - 1964
Citation Context ...two point sets [70, Section 5.4]. Bennett and Bredensteiner [2] have formulated a similar problem using linear programming. Vapnik [70, Section 5.9] also makes use of Huber's robust regression ideas [30] and extends the latter's robust regression loss function [70, p 152] by adding an ε-insensitive zone to it. This ε-insensitive zone is similar to our tolerance zone (see (17) and (18) below) wh... |

281 | An Introduction to Operations Research - Hillier, Lieberman - 1995
Citation Context ...n Mathematical programming, that is optimization subject to constraints, is a broad discipline that has been applied to a great variety of theoretical and applied problems such as operations research [29, 54], network problems [60, 53], game theory and economics [71, 35], engineering mechanics [57, 37] and more recently to machine learning [3, 61, 23, 48, 46]. In this paper we describe three recent mathem... |

264 | The KDD Process for Extracting Useful Knowledge from Data - Fayyad, Piatetsky-Shapiro, et al. - 1996
Citation Context ...that is not valley-like), a fast finite k-Median Algorithm consisting of solving few linear programs in closed form leads to a stationary point. Computational testing of this algorithm as a KDD tool [18] has been quite encouraging. On the Wisconsin Prognosis Breast Cancer Database (WPBC), distinct and clinically important survival curves were discovered from the database by the k-Median Algorithm, wh... |

213 | Fundamentals of Artificial Neural Networks - Hassoun - 1995 |

210 | Robust linear programming discrimination of two linearly inseparable sets - Bennett, Mangasarian - 1992
Citation Context ...solution in general, one resorts to satisfying them in some best approximate sense by minimizing an average sum of their violations. This leads to the following Robust Linear Programming formulation [4]:

RLP: min_{w,γ,y,z} { e^T y / m + e^T z / k : −Aw + eγ + e ≤ y, Bw − eγ + e ≤ z, y ≥ 0, z ≥ 0 } (7)

Robustness here refers to the fact that the useless null vector (w = 0) is naturally excluded as a solution of (7), which is not the case in other linear programming formulations of this problem ... |
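Formulation (7) is itself just a linear program, so it can be prototyped directly. A sketch with `scipy.optimize.linprog` on two invented, linearly separable toy sets; the function name and data are ours, and only the stacking of the variables [w, γ, y, z] is specific to this sketch:

```python
import numpy as np
from scipy.optimize import linprog

def rlp(A, B):
    """Solve RLP (7): min e'y/m + e'z/k  s.t.
    -Aw + e*gamma + e <= y,  Bw - e*gamma + e <= z,  y >= 0, z >= 0."""
    m, n = A.shape
    k = B.shape[0]
    # Decision variables stacked as [w (n), gamma (1), y (m), z (k)].
    c = np.concatenate([np.zeros(n + 1), np.ones(m) / m, np.ones(k) / k])
    top = np.hstack([-A, np.ones((m, 1)), -np.eye(m), np.zeros((m, k))])
    bot = np.hstack([B, -np.ones((k, 1)), np.zeros((k, m)), -np.eye(k)])
    res = linprog(c, A_ub=np.vstack([top, bot]), b_ub=-np.ones(m + k),
                  bounds=[(None, None)] * (n + 1) + [(0, None)] * (m + k))
    return res.x[:n], res.x[n]

A = np.array([[2.0, 2.0], [3.0, 2.5], [2.5, 3.0]])   # class A points (rows)
B = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 0.2]])   # class B points (rows)
w, gamma = rlp(A, B)
# For separable sets the optimum is 0 and the plane w.x = gamma separates them:
print(np.all(A @ w >= gamma + 1 - 1e-6), np.all(B @ w <= gamma - 1 + 1e-6))
```

Because these toy sets are separable, the minimum average violation is zero and the recovered plane puts all of A on one side and all of B on the other with unit margin.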

197 | A new polynomial time algorithm for linear programming - Karmarkar - 1984
Citation Context ..." [68, p 1]. From the point of view of applicability to large-scale data mining problems, the proposed algorithms employ either linear programming (Sections 2 and 3) which is polynomial-time-solvable [36, 69], or convex quadratic programming (Section 4) which is also polynomial-time-solvable [69]. Extremely fast linear and quadratic programming codes [14] that are capable of solving linear programs with m... |

193 | Nonlinear Programming - Mangasarian - 1994
Citation Context ...ently nonlinear problems where for example the parameters of a separating surface appear nonlinearly, one may have to resort to nonlinear models and the theory and algorithms of nonlinear programming [42, 6]. This would again be a promising problem to pursue. We conclude with the hope that the problems solved demonstrate the theoretical and computational potential of mathematical programming as a versati... |

142 | Linear Programming: Foundations and Extensions - Vanderbei - 2008
Citation Context ...of real variables, f is a real-valued function of x, g and h are finite dimensional vector functions of x. If all the functions f, g and h are linear then the problem simplifies to a linear program [15, 52, 69] which is the classical problem of mathematical Mathematical Programming Technical Report 96-05, August 1996 -- Revised November 1996 & March 1997. This material is based on research supported by Nati... |

124 | Breast cancer diagnosis and prognosis via linear programming - Mangasarian, Street, et al. - 1995
Citation Context ...(6), or equivalently (5), is exactly satisfied. The linear programming formulation (7) which has a number of natural theoretical properties including robustness is also very effective computationally [4, 49]. However it does not address the problem of suppressing irrelevant features. In order to suppress such features, the objective function of (7), which merely measures the average sum of the violations of... |

121 | K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality - Selim, Ismail - 1984
Citation Context ...screte and can be represented as integers, then the techniques of integer programming [24, 55, 14] can be employed. Integer programming approaches have been applied for example to clustering problems [64, 1], but will not be described here, principally because the combinatorial approach is fundamentally different than the analytical approach of optimization with real variables. Stochastic optimization me... |

119 | Interior point methods for linear programming: Ready for production use - Marsten, Shanno - 1990
Citation Context ...programming (Section 4) which is also polynomial-time-solvable [69]. Extremely fast linear and quadratic programming codes [14] that are capable of solving linear programs with millions of variables [8, 40] and very large quadratic programs, make the proposed algorithms easily scalable and effective for solving a wide range of problems. One limitation however is that the problem features must be real nu... |

119 | Overfitting avoidance as bias - Schaffer - 1993
Citation Context ...changes. This problem is closely related to the generalization problem of machine learning of how to train a system on a given training set so as to improve generalization on a new unseen testing set [39, 63, 73]. We use here a simple linear model [67] and will show that if a sufficiently small error is purposely tolerated in constructing the model, then for a broad class of perturbations the model will be... |

116 | Integer Programming - Garfinkel, Nemhauser - 1972
Citation Context ...that the problem features must be real numbers or easily mapped into real numbers. If some of the features are discrete and can be represented as integers, then the techniques of integer programming [24, 55, 14] can be employed. Integer programming approaches have been applied for example to clustering problems [64, 1], but will not be described here, principally because the combinatorial approach is fundame... |

98 | Network Flows and Monotropic Optimization - Rockafellar - 1984
Citation Context ...that is optimization subject to constraints, is a broad discipline that has been applied to a great variety of theoretical and applied problems such as operations research [29, 54], network problems [60, 53], game theory and economics [71, 35], engineering mechanics [57, 37] and more recently to machine learning [3, 61, 23, 48, 46]. In this paper we describe three recent mathematical-programming-based de... |

60 | Feature selection via mathematical programming - Bradley, Mangasarian, et al.
Citation Context ..., 37] and more recently to machine learning [3, 61, 23, 48, 46]. In this paper we describe three recent mathematical-programming-based developments that are relevant to data mining: feature selection [45, 10], clustering [11] and robust representation [67]. We note at the outset that we do not plan to survey either the fields of data mining or mathematical programming, but rather highlight some recent and... |

55 | Linear and Nonlinear Separation of Patterns by Linear Programming - Mangasarian - 1965
Citation Context ...(7) Robustness here refers to the fact that the useless null vector (w = 0) is naturally excluded as a solution of (7), which is not the case in other linear programming formulations of this problem [41, 66, 26, 25]. Note that because of the constraints of the problem, the variables y and z will satisfy the following conditions: y ≥ max{0, −Aw + eγ + e} and z ≥ max{0, Bw − eγ + e}. Hence minimizing e^T ... |

54 | Nonlinear perturbation of linear programs - Mangasarian, Meyer - 1979
Citation Context ...force suppression of unnecessary components of w. As described below in Algorithm 2.1 the parameter is chosen to give the best cross-validated error. For small values of this parameter it can be shown theoretically [47] that the minimization problem (8) picks that solution of the Robust Linear Program (7) that minimizes the exponential term of (8), and hence solves the RLP (7) while suppressing redundant components ... |

49 | Clustering via Concave Minimization - Bradley, Mangasarian, et al. - 1997
Citation Context ...ers of like points, is the objective of cluster analysis. There are many approaches to this problem, including statistical [32], machine learning [20], integer and mathematical programming approaches [64, 1, 58, 11]. We shall describe here the recent approach of [11] that utilizes a fast bilinear programming approach: minimizing the product of two linear functions on a set defined by linear inequalities. A princ... |

49 | Linear programming - Murty - 1983
Citation Context ...that is optimization subject to constraints, is a broad discipline that has been applied to a great variety of theoretical and applied problems such as operations research [29, 54], network problems [60, 53], game theory and economics [71, 35], engineering mechanics [57, 37] and more recently to machine learning [3, 61, 23, 48, 46]. In this paper we describe three recent mathematical-programming-based de... |

43 | A Tabu Search Approach to Clustering Problems - Al-Sultan - 1995
Citation Context ...screte and can be represented as integers, then the techniques of integer programming [24, 55, 14] can be employed. Integer programming approaches have been applied for example to clustering problems [64, 1], but will not be described here, principally because the combinatorial approach is fundamentally different than the analytical approach of optimization with real variables. Stochastic optimization me... |

40 | Mathematical programming in neural networks - Mangasarian - 1993
Citation Context ...ne that attempts to separate R^n into two halfspaces such that each open halfspace contains points mostly of A or B. Alternatively, a separating plane can be interpreted as a classical perceptron [62, 43] with a threshold determined by the distance of the plane to the origin, and the incoming arc weights of the perceptron determined by the components of the normal vector to the plane. 2 Feature Select... |

36 | Neural network training via linear programming - Bennett, Mangasarian - 1992
Citation Context ...oretical and applied problems such as operations research [29, 54], network problems [60, 53], game theory and economics [71, 35], engineering mechanics [57, 37] and more recently to machine learning [3, 61, 23, 48, 46]. In this paper we describe three recent mathematical-programming-based developments that are relevant to data mining: feature selection [45, 10], clustering [11] and robust representation [67]. We no... |

36 | Very large-scale linear programming: A case study in combining interior point and simplex method - Bixby, Gregory, et al. - 1992
Citation Context ...programming (Section 4) which is also polynomial-time-solvable [69]. Extremely fast linear and quadratic programming codes [14] that are capable of solving linear programs with millions of variables [8, 40] and very large quadratic programs, make the proposed algorithms easily scalable and effective for solving a wide range of problems. One limitation however is that the problem features must be real nu... |

35 | Bilinear separation of two sets in n-space - Bennett, Mangasarian - 1992 |

34 | Parallel Variable Distribution - Ferris, Mangasarian - 1994
Citation Context ...atabases are amenable to the proposed linear-programming-based methods. In addition there is a large body of literature on the parallel solution and decomposition of large scale mathematical programs [7, 16, 17, 19] where for many of the algorithms only part of the data is loaded into memory at a time. Such parallel and decomposition algorithms extend further the applicability of the proposed methods to very lar... |

31 | Cluster analysis and mathematical programming - Rao - 1971
Citation Context ...ers of like points, is the objective of cluster analysis. There are many approaches to this problem, including statistical [32], machine learning [20], integer and mathematical programming approaches [64, 1, 58, 11]. We shall describe here the recent approach of [11] that utilizes a fast bilinear programming approach: minimizing the product of two linear functions on a set defined by linear inequalities. A princ... |

30 | Improved linear programming models for discriminant analysis. Decision Sciences - Glover - 1990
Citation Context ...(7) Robustness here refers to the fact that the useless null vector (w = 0) is naturally excluded as a solution of (7), which is not the case in other linear programming formulations of this problem [41, 66, 26, 25]. Note that because of the constraints of the problem, the variables y and z will satisfy the following conditions: y ≥ max{0, −Aw + eγ + e} and z ≥ max{0, Bw − eγ + e}. Hence minimizing e^T ... |

27 | Machine learning via polyhedral concave minimization - Mangasarian - 1995
Citation Context ..., 37] and more recently to machine learning [3, 61, 23, 48, 46]. In this paper we describe three recent mathematical-programming-based developments that are relevant to data mining: feature selection [45, 10], clustering [11] and robust representation [67]. We note at the outset that we do not plan to survey either the fields of data mining or mathematical programming, but rather highlight some recent and... |

27 | Serial and parallel backpropagation convergence via nonmonotone perturbed minimization - Mangasarian, Solodov - 1994
Citation Context ...oretical and applied problems such as operations research [29, 54], network problems [60, 53], game theory and economics [71, 35], engineering mechanics [57, 37] and more recently to machine learning [3, 61, 23, 48, 46]. In this paper we describe three recent mathematical-programming-based developments that are relevant to data mining: feature selection [45, 10], clustering [11] and robust representation [67]. We no... |

27 | A quadratically convergent method for minimizing a sum of Euclidean norms - Overton - 1983
Citation Context ...m except that ‖D_iℓ‖_2 is replaced by ‖D_iℓ‖_2^2 and thus favoring outliers. Without this squared distance term, the subproblem of the k-Mean Algorithm becomes the considerably harder Weber problem [56, 13] which locates a center in R^n closest in sum of Euclidean distances (not their squares!) to a finite set of given points. The Weber problem has no closed form solution. However, using the mean as a c... |

26 | Nonlinear and Mixed-Integer Optimization - Floudas - 1995
Citation Context ...of problems that can be handled. Nevertheless the proposed methods can be applied to problems with discrete variables if one is willing to use the techniques of integer and mixed integer programming [55, 21] which are more difficult. In fact one of the proposed algorithms, the k-Median Algorithm, whose finite termination is established for problems with real variables, is directly applicable with no chan... |

23 | A polynomial time algorithm for the construction and training of a class of multilayer perceptrons, Neural Networks 6 - Roy, Kim, et al. - 1993
Citation Context ...oretical and applied problems such as operations research [29, 54], network problems [60, 53], game theory and economics [71, 35], engineering mechanics [57, 37] and more recently to machine learning [3, 61, 23, 48, 46]. In this paper we describe three recent mathematical-programming-based developments that are relevant to data mining: feature selection [45, 10], clustering [11] and robust representation [67]. We no... |

20 | Pattern Classifier Design by Linear Programming - Smith - 1968
Citation Context ...(7) Robustness here refers to the fact that the useless null vector (w = 0) is naturally excluded as a solution of (7), which is not the case in other linear programming formulations of this problem [41, 66, 26, 25]. Note that because of the constraints of the problem, the variables y and z will satisfy the following conditions: y ≥ max{0, −Aw + eγ + e} and z ≥ max{0, Bw − eγ + e}. Hence minimizing e^T ... |