## Mathematical Programming in Neural Networks (1993)

Venue: ORSA Journal on Computing

Citations: 40 (13 self)

### BibTeX

```bibtex
@ARTICLE{Mangasarian93mathematicalprogramming,
  author  = {O. L. Mangasarian},
  title   = {Mathematical Programming in Neural Networks},
  journal = {ORSA Journal on Computing},
  year    = {1993},
  volume  = {5},
  pages   = {349--360}
}
```

### Abstract

This paper highlights the role of mathematical programming, particularly linear programming, in training neural networks. A neural network description is given in terms of separating planes in the input space that suggests the use of linear programming for determining these planes. A more standard description in terms of a mean square error in the output space is also given, which leads to the use of unconstrained minimization techniques for training a neural network. The linear programming approach is demonstrated by a brief description of a system for breast cancer diagnosis that has been in use for the last four years at a major medical facility.

1 What is a Neural Network? A neural network is a representation of a map between an input space and an output space. A principal aim of such a map is to discriminate between the elements of a finite number of disjoint sets in the input space. Typically one wishes to discriminate between the elements of two disjoint point sets in the n-dim...
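The separating-plane view of a neural network sketched in the abstract reduces, in its simplest form, to a linear threshold unit y(x) = s(wx − θ): output 1 on one side of a plane, 0 on the other. A minimal Python sketch; the weights, threshold, and test points below are illustrative assumptions, not taken from the paper:

```python
def ltu(w, theta, x):
    """Linear threshold unit: y(x) = 1 if w·x >= theta, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

# An LTU realizing logical AND (illustrative weights and threshold):
w, theta = (1.0, 1.0), 1.5
results = [ltu(w, theta, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(results)  # -> [0, 0, 0, 1]
```

The plane w·x = θ here is the separating plane; training amounts to choosing w and θ so the two point sets fall on opposite sides.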

### Citations

1774 | Introduction to the Theory of Neural Computation - Hertz, Krogh, Palmer - 1991

Citation Context: ...ining the weights and thresholds of a feedforward neural network with a single hidden layer as an unconstrained optimization problem, and relate this problem to the standard backpropagation algorithm [41, 44, 23] for training such a neural network. We consider the neural network depicted in Figure 16 with an input vector x in R^n; h hidden LTUs with threshold values θ_i ∈ R; incoming arc weights w_i ∈ R^n; ...

1225 | Multilayer feedforward networks are universal approximators. Neural Networks - Hornik, Stinchcombe, et al. - 1989

Citation Context: ...th the input and output layers. The presence of such a hidden layer is crucial and endows a neural network with a complexity that is not possessed by a single layer of LTUs. In fact, it can be shown [24, 29] that such a neural network can separate any two disjoint sets in R^n given a sufficient number of hidden units. (See Theorem 2.2 below.) This is equivalent to the multisurface method separating such ...

942 | Numerical Methods for Unconstrained Optimization and Nonlinear Equations - Dennis, Schnabel - 1983

Citation Context: ...based on line search techniques similar to those of [22]. Batch BP on the other hand can be treated as an ordinary unconstrained minimization gradient method, and all the machinery of search methods [12] can be applied to it. In fact, by considering the whole objective function f instead of its components f_i separately, one can apply a whole variety of first and second order methods to the problem o...
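The batch-vs-online distinction drawn in this excerpt can be illustrated on a toy objective f(z) = (1/2) Σ_i (z − t_i)²; the targets, stepsizes, and iteration counts below are illustrative assumptions, not from the paper:

```python
# Batch vs. online gradient steps on f(z) = (1/2) * sum_i (z - t_i)^2.
# Illustrative targets: the minimizer of f is their mean, 3.0.
targets = [1.0, 2.0, 6.0]

def grad_i(z, t):
    """Gradient of the i-th component f_i(z) = (z - t)^2 / 2."""
    return z - t

# Batch BP: each step uses the full gradient sum (ordinary descent on f).
z_batch = 0.0
for _ in range(200):
    z_batch -= 0.1 * sum(grad_i(z_batch, t) for t in targets)
print(round(z_batch, 6))  # -> 3.0

# Online BP: one step per component f_i, with a diminishing stepsize.
z_online = 0.0
for epoch in range(1, 201):
    for t in targets:
        z_online -= (0.5 / epoch) * grad_i(z_online, t)
print(round(z_online, 2))
```

Batch descent is exactly the "ordinary unconstrained minimization gradient method" the excerpt mentions; the online variant updates after every component, which is why its convergence analysis needs stochastic-gradient-style stepsize conditions.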

716 | A logical calculus of the ideas immanent in nervous activity - McCulloch, Pitts - 1943

515 | Perceptrons: An Introduction to Computational Geometry - Minsky, Papert - 1969

Citation Context: ...dle the linearly inseparable case one has to employ a more complex map than that provided by an LTU. This was made evident in the early days of neural network development by Minsky and Papert in 1969 [32] when they presented their now-classical exclusive-or (XOR) counterexample, which is not linearly separable and hence for which no LTU will work. (See Figure 2.) This essentially brought the early deve...
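The XOR counterexample discussed in this excerpt can be checked directly: no choice of weights and threshold makes a single LTU reproduce XOR. The brute-force grid below is an illustrative sketch (the algebraic contradiction in the comments is the actual proof):

```python
from itertools import product

def ltu(w1, w2, theta, x1, x2):
    return 1 if w1 * x1 + w2 * x2 >= theta else 0

xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# XOR would require w1 >= theta, w2 >= theta, 0 < theta, and
# w1 + w2 < theta -- but w1 + w2 >= 2*theta > theta: contradiction.
# A finite grid search is consistent with this:
grid = [i / 2 for i in range(-8, 9)]  # -4.0 .. 4.0 in steps of 0.5
found = any(
    all(ltu(w1, w2, t, *x) == y for x, y in xor.items())
    for w1, w2, t in product(grid, repeat=3)
)
print(found)  # -> False
```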

509 | Beyond Regression: New Tools for Prediction and Analysis - Werbos

Citation Context: ...counterexample which is not linearly separable and hence for which no LTU will work. (See Figure 2.) This essentially brought the early development of neural networks to a halt until it was realized [47, 41] that a more complex function than that represented by an LTU was needed to correctly map these simple four points into the set {0, 1}. Curiously enough, however, it should be noted that even before M...

419 | Optimal brain damage - LeCun, Denker, et al. - 1990

Citation Context: ...ed neural network does on unseen data. One approach is to obtain a least value of h for which an approximate solution to Problem 3.1 is satisfactory in the sense that y(x) has a tolerable error in it [26, 23]. Note that both of our Algorithms MSMT 2.1 and MSM 2.2 essentially take this approach. In the sequel, however, we shall assume that h is fixed. An interesting empirical study of generalization in mac...

210 | Robust linear programming discrimination of two linearly inseparable sets - Bennett, Mangasarian - 1992

203 | Training a 3-node neural network is NP-complete - Blum, Rivest - 1992

Citation Context: ...rating surface, as measured by the number of planes constituting it, cannot be guaranteed. Unfortunately, the problem of deciding whether two sets are separable by as few planes as two is NP-complete [31, 10]. Although there are effective bilinear programming algorithms for solving the bilinear separability problem [5], there are no methods for directly obtaining piecewise-linear separators other than the...

150 | Learning Machines - Nilsson - 1965

132 | Multisurface method of pattern separation for medical diagnosis applied to breast cytology - Wolberg, Mangasarian - 1990

Citation Context: ...in R^n. We note that MSM is a greedy algorithm that generates various pieces of a separating surface between the sets A and B sequentially. Although MSM has produced very effective practical results [49, 29, 7], optimality of the separating surface, as measured by the number of planes constituting it, cannot be guaranteed. Unfortunately, the problem of deciding whether two sets are separable by as few plane...

128 | First- and Second-order Methods for Learning: Between Steepest Descent and Newton's Method - Battiti

Citation Context: ...f its components f_i separately, one can apply a whole variety of first and second order methods to the problem of minimizing f. First and second order methods for various BP algorithms are given in [2, 3, 36]. More recently Gaivoronski [16] gave a convergence proof for online BP (17) under minimal conditions using stochastic gradient ideas [15], in which he established convergence of {‖∇f(z^ℓ)‖} to zero...

97 | Improving the Convergence of Back-Propagation Learning with Second Order Methods - Becker, LeCun - 1988

Citation Context: ...f its components f_i separately, one can apply a whole variety of first and second order methods to the problem of minimizing f. First and second order methods for various BP algorithms are given in [2, 3, 36]. More recently Gaivoronski [16] gave a convergence proof for online BP (17) under minimal conditions using stochastic gradient ideas [15], in which he established convergence of {‖∇f(z^ℓ)‖} to zero...

93 | MINOS 5.0 User's Guide - Murtagh, Saunders - 1983

Citation Context: ...eparation depicted in Figure 3 for the XOR example and hence the equivalent neural network of Figure 4. By letting A = [1 0; 0 1], B = [0 0; 1 1] and solving the linear program (6) using MINOS 5.3 [34] we obtain the solution w = (−2, 2), θ = 1, y = (4, 0), z = (0, 0) (7). (We note in passing that w = (0, 0), θ = 0, y = (1, 1), z = (1, 1) (8) is also a solution, but is neither unique in w nor is it t...
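The linear program solved with MINOS in this excerpt can be reproduced with an off-the-shelf LP solver. The sketch below assumes the robust-LP form associated with reference [8] (minimize the average violations y, z subject to Aw − eθ + y ≥ e and −Bw + eθ + z ≥ e, y, z ≥ 0); scipy stands in for MINOS, and the exact formulation should be treated as an assumption:

```python
from scipy.optimize import linprog

# XOR data: rows of A are the "1" points, rows of B are the "0" points.
# Variables: x = [w1, w2, theta, y1, y2, z1, z2]; minimize average slacks.
c = [0, 0, 0, 0.5, 0.5, 0.5, 0.5]

# a·w - theta + y_i >= 1  and  -b·w + theta + z_i >= 1, rewritten as <=.
A_ub = [
    [-1,  0,  1, -1,  0,  0,  0],   # point (1, 0) in A
    [ 0, -1,  1,  0, -1,  0,  0],   # point (0, 1) in A
    [ 0,  0, -1,  0,  0, -1,  0],   # point (0, 0) in B
    [ 1,  1, -1,  0,  0,  0, -1],   # point (1, 1) in B
]
b_ub = [-1, -1, -1, -1]
bounds = [(None, None)] * 3 + [(0, None)] * 4  # w, theta free; y, z >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(round(res.fun, 6))  # minimum average error; 2.0 for XOR
```

Summing the four constraint right-hand sides shows y1+y2+z1+z2 ≥ 4 for any w, θ, so the optimal objective is exactly 2: consistent with the two degenerate optimal solutions (7) and (8) quoted in the excerpt.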

81 | Principles of Neurodynamics (Spartan Books) - Rosenblatt - 1962

Citation Context: ...fire when their input exceeds their threshold. The simplest and earliest neural network is the linear threshold unit (LTU), first proposed by McCulloch and Pitts in 1943 [30] and for which Rosenblatt [39, 38] proposed his iterative perceptron training algorithm in 1957. Such an LTU represents the following (nonlinear) step function y from the n-dimensional real space R^n into {0, 1}: y(x) = s(wx − θ)...

69 | Pattern recognition via linear programming: Theory and application to medical diagnosis. Large-scale numerical optimization - Mangasarian, Setiono, et al. - 1990

Citation Context: ...or exposition purposes. In order to cover possible degenerate cases that are not usually encountered in practice, one has to provide for the possibility of both A_i = A_{i+1} and B_i = B_{i+1} for some i [28, 29]. One such procedure is given below. 2.4 Degeneracy Procedure for MSM Algorithm: Insert before step (c) of MSM Algorithm 2.2 the following step: (b1) If A_{i+1} = A_i and B_{i+1} = B_i, find new w_i, θ_i...

55 | Linear and Nonlinear Separation of Patterns by Linear Programming - Mangasarian - 1965

Citation Context: ...separation, by considering a slightly different linear program (6) as shown by Theorem 2.1 below. The first linear programming formulations for the linearly separable case were given in 1964 and 1965 [11, 27], but they also suffered from the null-solution difficulty for the linearly inseparable case. In order to handle the linearly inseparable case one has to employ a more complex map than that provided by...

52 | The perceptron: A perceiving and recognizing automaton - Rosenblatt - 1957

Citation Context: ...fire when their input exceeds their threshold. The simplest and earliest neural network is the linear threshold unit (LTU), first proposed by McCulloch and Pitts in 1943 [30] and for which Rosenblatt [39, 38] proposed his iterative perceptron training algorithm in 1957. Such an LTU represents the following (nonlinear) step function y from the n-dimensional real space R^n into {0, 1}: y(x) = s(wx − θ)...

42 | The relaxation method for linear inequalities - Motzkin, Schoenberg - 1954

Citation Context: ...ble. Among the earliest algorithms for obtaining such a separating plane or an LTU was Rosenblatt's perceptron algorithm, which turns out to be a version of the Motzkin-Schoenberg iterative algorithm [33] for solving a system of linear inequalities. This algorithm terminates in a finite number of steps...
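The perceptron iteration mentioned here (a relaxation step applied to the linear inequalities w·x ≥ θ for one set and w·x < θ for the other) can be sketched as follows; the training data, learning rate, and epoch cap are illustrative assumptions:

```python
def train_perceptron(points, labels, epochs=100, lr=1.0):
    """Perceptron / relaxation iteration: on each misclassified point,
    shift (w, theta) toward satisfying its inequality."""
    n = len(points[0])
    w, theta = [0.0] * n, 0.0
    for _ in range(epochs):
        updated = False
        for x, y in zip(points, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta >= 0 else 0
            if pred != y:
                sign = 1 if y == 1 else -1
                w = [wi + sign * lr * xi for wi, xi in zip(w, x)]
                theta -= sign * lr  # threshold moves opposite to the weights
                updated = True
        if not updated:  # finite termination on linearly separable data
            break
    return w, theta

# Logical AND is linearly separable, so the iteration terminates:
pts = [(0, 0), (0, 1), (1, 0), (1, 1)]
w, theta = train_perceptron(pts, [0, 0, 0, 1])
preds = [1 if w[0] * x1 + w[1] * x2 - theta >= 0 else 0 for x1, x2 in pts]
print(preds)  # -> [0, 0, 0, 1]
```

On linearly inseparable data (e.g. XOR) the same loop simply exhausts its epoch budget without converging, matching the boundedness-without-termination behavior the Block-Levin context below describes.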

39 | Decision tree construction via linear programming - Bennett - 1992

Citation Context: ...z = (0, 0) (7). (We note in passing that w = (0, 0), θ = 0, y = (1, 1), z = (1, 1) (8) is also a solution, but is neither unique in w nor is it the one given by MINOS.) The multisurface method tree (MSMT) [4] can be employed to obtain the complete separation of Figure 3 as follows. Note that the plane wx = θ for this problem, −2x_1 + 2x_2 = 1 (9), obtained from the solution (7) and depicted in Figure...

38 | Multi-surface method of pattern separation - Mangasarian - 1968

Citation Context: ...g. Curiously enough, however, it should be noted that even before Minsky-Papert proposed their classical XOR counterexample, a linear-programming-based piecewise-linear separator was proposed in 1968 [28] that could easily and correctly handle this problem, and which in fact can be represented as a neural network [7]. (See Figure 14 and discussion following Algorithm 2.2 below.) We shall now use this ...

38 | On the complexity of polyhedral separability - Megiddo - 1986

Citation Context: ...rating surface, as measured by the number of planes constituting it, cannot be guaranteed. Unfortunately, the problem of deciding whether two sets are separable by as few planes as two is NP-complete [31, 10]. Although there are effective bilinear programming algorithms for solving the bilinear separability problem [5], there are no methods for directly obtaining piecewise-linear separators other than the...

38 | Nuclear Feature Extraction for Breast Tumor Diagnosis - Street, Wolberg, et al. - 1993

Citation Context: ...ces of an experienced oncologist for making the measurements. An automated system has been developed, and recently put into use, that completely eliminates the subjective assessment by the oncologist [48]. Instead, vision techniques are used to draw boundaries around the nuclei of a few cells, from which 30 numerical features are extracted. Three of these features enable a neural network, as simple as a...

36 | Neural network training via linear programming - Bennett, Mangasarian - 1992

Citation Context: ...rexample, a linear-programming-based piecewise-linear separator was proposed in 1968 [28] that could easily and correctly handle this problem, and which in fact can be represented as a neural network [7]. (See Figure 14 and discussion following Algorithm 2.2 below.) We shall now use this example to motivate a general multisurface method (MSM) for separating the sets A and B of the XOR example or any ...

35 | Bilinear separation of two sets in n-space - Bennett, Mangasarian - 1992

Citation Context: ...of deciding whether two sets are separable by as few planes as two is NP-complete [31, 10]. Although there are effective bilinear programming algorithms for solving the bilinear separability problem [5], there are no methods for directly obtaining piecewise-linear separators other than the proposed greedy MSM algorithms [28, 29, 8]. We note that in these greedy MSM algorithms, each piece of the sepa...

33 | Simulated Annealing and Boltzmann Machines - Aarts, Korst - 1990

Citation Context: ...ue, is an interesting but not completely settled question [23, pages 76--79], even though there have been a number of interesting applications of neural networks to optimization problems, for example [37, 1, 14] and [23, pages 71--87]. Acknowledgement: I wish to thank my colleagues Yann le Cun, Renato De Leone, Laurence Dixon, Greg Madey, Jim Noyes, Boris Polyak, Steve Robinson and Jude Shavlik for helpful r...

30 | Improved linear programming models for discriminant analysis. Decision Sciences - Glover - 1990

Citation Context: ...is purpose, we utilize the linear program introduced recently in [8], which has the following desirable features, not all of which are possessed by any other previous linear programming formulation [11, 27, 28, 45, 20, 19]: (i) a strict separating plane (that is, neither set lies on the separating plane) for linearly separable sets A and B; (ii) an error-minimizing plane is obtained when the sets A and B are linearly ins...

27 | Artificial neural systems - Simpson - 1990

Citation Context: ...ining the weights and thresholds of a feedforward neural network with a single hidden layer as an unconstrained optimization problem, and relate this problem to the standard backpropagation algorithm [41, 44, 23] for training such a neural network. We consider the neural network depicted in Figure 16 with an input vector x in R^n; h hidden LTUs with threshold values θ_i ∈ R; incoming arc weights w_i ∈ R^n; ...

20 | Pattern Classifier Design by Linear Programming - Smith - 1968

Citation Context: ...is purpose, we utilize the linear program introduced recently in [8], which has the following desirable features, not all of which are possessed by any other previous linear programming formulation [11, 27, 28, 45, 20, 19]: (i) a strict separating plane (that is, neither set lies on the separating plane) for linearly separable sets A and B; (ii) an error-minimizing plane is obtained when the sets A and B are linearly ins...

15 | Multicategory separation via linear programming - Bennett, Mangasarian - 1992

Citation Context: ...k disjoint point sets in R^n, and show how it can be set up as a single linear program. Smith [46] proposed solving k linear programs separating each set from the remaining k − 1 sets. However, in [6] a single linear program is solved to obtain a convex k-piece piecewise-linear surface that exactly separates k disjoint sets under certain conditions; otherwise an error-minimizing approximate separa...

15 | Decomposition into functions in the minimization problem. Avtomatika i Telemekhanika - Kibardin - 1979

Citation Context: ...implest conditions under which a rigorous mathematical convergence of the online BP has been established. The complete result will appear in [17]. We note also that, in a little-known paper, Kibardin [25] used similar conditions to (19) for establishing convergence of (17) for convex f_i(z), i = 1, ..., k. Unfortunately the convexity assumption does not hold here. We can relate the unconstrained min...

10 | Global convergence and stabilization of unconstrained minimization methods without derivatives - Grippo, Lampariello, et al. - 1988

Citation Context: ...de acceptability and success of the method. Recently, however, Grippo [21] has proposed a promising convergence proof under certain assumptions and based on line search techniques similar to those of [22]. Batch BP on the other hand can be treated as an ordinary unconstrained minimization gradient method, and all the machinery of search methods [12] can be applied to it. In fact, by considering the wh...

6 | On the boundedness of an iterative procedure for solving a system of linear inequalities - Block, Levin - 1970

Citation Context: ...etwork, a perceptron or linear threshold unit (LTU), characterized by the weight vector w in R^n and threshold θ in R, if the sets A and B are linearly separable [32, pages 164-175], [35, Chapter 5], [9]. However if A and B are linearly inseparable the algorithm need not terminate, but the iterates are bounded [32, pages 181-187], [9] and hence merely have an accumulation point. By contrast consider t...

6 | Some fundamental theorems of perceptron theory and their geometry - Charnes - 1964

Citation Context: ...separation, by considering a slightly different linear program (6) as shown by Theorem 2.1 below. The first linear programming formulations for the linearly separable case were given in 1964 and 1965 [11, 27], but they also suffered from the null-solution difficulty for the linearly inseparable case. In order to handle the linearly inseparable case one has to employ a more complex map than that provided by...

5 | Predicting salinity in the Chesapeake Bay using backpropagation - DeSilets, Golden, et al. - 1992

Citation Context: ...nd speech recognition, finance problems such as bank failure prediction and credit evaluation, oil drilling, as well as medical diagnosis and prognosis problems. See for example [44, 23], references therein, and [43, 13]. Most of these applications use neural networks trained by backpropagation or variants thereof. We shall describe here a successful linear-programming-based application to breast cancer diagnosis tha...

5 | Mathematical methods for pattern classification - Grinold - 1972

Citation Context: ...is purpose, we utilize the linear program introduced recently in [8], which has the following desirable features, not all of which are possessed by any other previous linear programming formulation [11, 27, 28, 45, 20, 19]: (i) a strict separating plane (that is, neither set lies on the separating plane) for linearly separable sets A and B; (ii) an error-minimizing plane is obtained when the sets A and B are linearly ins...

3 | Comments on hidden nodes in neural nets - Georgiou - 1991

Citation Context: ...generally, we can think of a feedforward neural network with a single hidden layer of h LTUs as a representation of h planes in R^n that divide the space into p polyhedral regions, p ≤ Σ_{i=0}^{n} (h choose i) [18], each containing elements of only one set A or B. These planes, which are represented by the vector-weighted incoming arcs to the hidden LTUs together with their threshold values, map each polyhedra...
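The region count in this excerpt is the classical hyperplane-arrangement bound: h planes partition R^n into at most Σ_{i=0}^{n} C(h, i) polyhedral regions. A small sketch of the bound (the test values are illustrative):

```python
from math import comb

def region_bound(h, n):
    """Max number of regions h hyperplanes create in R^n:
    p <= sum_{i=0}^{n} C(h, i)."""
    return sum(comb(h, i) for i in range(n + 1))

# 3 lines split the plane into at most 7 regions, 2 lines into at most 4,
# and 5 points split the real line into 6 intervals.
print(region_bound(3, 2), region_bound(2, 2), region_bound(5, 1))  # 7 4 6
```

When h ≤ n the sum is 2^h, i.e. every subset of hidden units can in principle define its own region.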

3 | Neural Networks for Selecting Vehicle Routing Heuristics - Nygard, Juell, et al. - 1990

Citation Context: ...ue, is an interesting but not completely settled question [23, pages 76--79], even though there have been a number of interesting applications of neural networks to optimization problems, for example [37, 1, 14] and [23, pages 71--87]. Acknowledgement: I wish to thank my colleagues Yann le Cun, Renato De Leone, Laurence Dixon, Greg Madey, Jim Noyes, Boris Polyak, Steve Robinson and Jude Shavlik for helpful r...

3 | Pattern classification using linear programming - Roy, Mukhopadhyay - 1990

Citation Context: ...ch piece of the separating surface can be nonlinear as long as it is linear in its parameters, such as a quadratic surface for instance. The separation can still be achieved by solving a linear program [27, 40] for each piece. We turn now to the case of multicategory discrimination, that is, discriminating between the elements of k disjoint point sets in R^n, and show how it can be set up as a single linear p...

3 | Symbolic and neural network learning algorithms: An experimental comparison. Machine Learning 6:111-143 - Shavlik, Mooney, et al. - 1991

Citation Context: ...ke this approach. In the sequel, however, we shall assume that h is fixed. An interesting empirical study of generalization in machine learning has been carried out on several real-world data sets in [43]. For the neural network depicted in Figure 16 we can define the mean square error for a given set of weights and thresholds w_i ∈ R^n, v_i ∈ R, θ_i ∈ R, i = 1, ..., h, τ ∈ R as follows: f(w, θ, ...

3 | Design of multicategory pattern classifiers with two-category classifier design procedures - Smith - 1969

Citation Context: ...e. We turn now to the case of multicategory discrimination, that is, discriminating between the elements of k disjoint point sets in R^n, and show how it can be set up as a single linear program. Smith [46] proposed solving k linear programs separating each set from the remaining k − 1 sets. However, in [6] a single linear program is solved to obtain a convex k-piece piecewise-linear surface that e...

1 | Neural nets for massively parallel optimisation - Dixon, Mills - 1992

Citation Context: ...ue, is an interesting but not completely settled question [23, pages 76--79], even though there have been a number of interesting applications of neural networks to optimization problems, for example [37, 1, 14] and [23, pages 71--87]. Acknowledgement: I wish to thank my colleagues Yann le Cun, Renato De Leone, Laurence Dixon, Greg Madey, Jim Noyes, Boris Polyak, Steve Robinson and Jude Shavlik for helpful r...

1 | Stochastic gradient methods for neural networks - Gaivoronski - 1994

Citation Context: ...(19). This condition appears to be one of the simplest conditions under which a rigorous mathematical convergence of the online BP has been established. The complete result will appear in [17]. We note also that, in a little-known paper, Kibardin [25] used similar conditions to (19) for establishing convergence of (17) for convex f_i(z), i = 1, ..., k. Unfortunately the convexity assumpt...

1 | Private communication - Grippo - 1992

Citation Context: ...are, to the best knowledge of the author, no published convergence proofs for the method. This is a curious fact in view of the wide acceptability and success of the method. Recently, however, Grippo [21] has proposed a promising convergence proof under certain assumptions and based on line search techniques similar to those of [22]. Batch BP on the other hand can be treated as an ordinary unconstrain...

1 | Neural network optimization methods - Noyes - 1991

Citation Context: ...f its components f_i separately, one can apply a whole variety of first and second order methods to the problem of minimizing f. First and second order methods for various BP algorithms are given in [2, 3, 36]. More recently Gaivoronski [16] gave a convergence proof for online BP (17) under minimal conditions using stochastic gradient ideas [15], in which he established convergence of {‖∇f(z^ℓ)‖} to zero...