## Cost functions to estimate a posteriori probabilities in multiclass problems (1999)

Venue: IEEE Trans. Neural Networks

Citations: 4 (2 self)

### BibTeX

@ARTICLE{Cid-sueiro99costfunctions,
  author  = {Jesús Cid-Sueiro and Juan Ignacio Arribas and Sebastián Urbán-Muñoz and Aníbal R. Figueiras-Vidal},
  title   = {Cost functions to estimate a posteriori probabilities in multiclass problems},
  journal = {IEEE Trans. Neural Networks},
  year    = {1999},
  volume  = {10},
  pages   = {645--656}
}

### Abstract

The problem of designing cost functions to estimate a posteriori probabilities in multiclass problems is addressed in this paper. We establish necessary and sufficient conditions that these costs must satisfy in one-class one-output networks whose outputs are consistent with probability laws. We focus our attention on a particular subset of the corresponding cost functions: those that satisfy two properties of usual interest, symmetry and separability (well-known cost functions, such as the quadratic cost or the cross entropy, are particular cases in this subset). Finally, we present a universal stochastic gradient learning rule for single-layer networks, in the sense of minimizing a general version of these cost functions for a wide family of nonlinear activation functions.

Index Terms: Neural networks, pattern classification, probability estimation.
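As a quick illustration of the abstract's central claim (a sketch, not the paper's actual learning rule): minimizing a cost such as the cross-entropy over hard 0/1 labels drives a free probability estimate toward the class frequency, i.e. the a posteriori probability. The value of `p_true`, the step size, and the iteration count here are illustrative choices.

```python
import numpy as np

# Sketch: minimizing the cross-entropy cost over 0/1 labels drives a
# free estimate q toward P(class 1 | x), the a posteriori probability.
rng = np.random.default_rng(0)
p_true = 0.7                                    # assumed P(class 1 | x)
d = (rng.random(20000) < p_true).astype(float)  # hard 0/1 labels

q = 0.5                                         # initial estimate
for _ in range(500):
    # dC/dq for C = mean(-d log q - (1 - d) log(1 - q))
    grad = np.mean(-(d / q) + (1 - d) / (1 - q))
    q -= 0.01 * q * (1 - q) * grad              # step scaled by q(1 - q)

print(q)  # close to p_true
```

The scaled step makes the update reduce to `q -= 0.01 * (q - d.mean())`, so the minimizer is exactly the empirical class frequency.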

### Citations

9458 | The nature of statistical learning theory
- Vapnik
- 1995
Citation Context: ...estimate a posteriori probabilities by means of a supervised training [3]–[6]. The estimation of a posteriori probabilities is not required in order to make a decision. Indeed, Vapnik [7] remarks that to include this estimation increases the complexity of the process. This fact was the origin of Rosenblatt’s Perceptron Rule, along with its many variants which deal with the lack of con...

4057 |
Pattern classification and scene analysis
- Duda, Hart
- 1973
Citation Context: ...serve to balance all them. Once the decision costs are established, data likelihoods and a priori probabilities or a posteriori probabilities of the hypotheses are enough to obtain optimal results [1], [2]. In practice, with the exception of cases in which the “mechanics” of the problem are known (as in many transmission problems), likelihoods or a posteriori probabilities must be estimated. In particu...

343 | Connectionist learning procedures
- Hinton
- 1989
Citation Context: ...Reference [13] presents a very complete overview, and [14] proposes using Csiszár measures [15] in several forms, following the idea of minimizing divergences introduced by Hopfield [16] and Hinton [17]. However, in many cases, proposing cost functions for strict decision purposes is not enough [e.g., in (auxiliary) automatic diagnostic systems for clinical applications, in financial problems, or wh...

274 | Neural network classifiers estimate Bayesian a posteriori probabilities
- Richard, Lippmann
- 1991

229 |
Adaptive filter theory, 3rd ed
- Haykin
- 1996
Citation Context: ...network. Since the goal is to estimate probabilities, the use of networks whose outputs are constrained by (13) is more adequate. A multioutput network with a softmax output activation function [23], [24] is an example. We consider in this paper networks satisfying this probability law. III. GENERAL SSB COST FUNCTIONS This section provides theoretical results on the conditions that SSB cost functions ...
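The softmax output activation mentioned in this context can be sketched as follows (an illustrative implementation, not code from the paper): it maps arbitrary network activations to outputs that obey the probability law, i.e. nonnegative values summing to one, as required of a posteriori probability estimates.

```python
import numpy as np

# Minimal softmax sketch: arbitrary activations -> outputs that are
# nonnegative and sum to one (the probability-law constraint).
def softmax(a):
    e = np.exp(a - np.max(a))  # subtract max for numerical stability
    return e / e.sum()

y = softmax(np.array([2.0, 1.0, -1.0]))
print(y, y.sum())  # three nonnegative outputs summing to 1
```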

178 |
Entropy optimization principles with applications
- Kapur, Kesavan
- 1992
Citation Context: ...other objective functions have appeared to address robustness, training speed, decision performance, etc. Reference [13] presents a very complete overview, and [14] proposes using Csiszár measures [15] in several forms, following the idea of minimizing divergences introduced by Hopfield [16] and Hinton [17]. However, in many cases, proposing cost functions for strict decision purposes is not enough...

168 |
Neural Network Learning and Expert Systems
- Gallant
- 1993
Citation Context: ...this estimation increases the complexity of the process. This fact was the origin of Rosenblatt’s Perceptron Rule, along with its many variants which deal with the lack of convergence in nonseparable situations [8]. Even structural and training generalizations, such as the decision-based neural networks [9] have difficulties; to solve them is an NP-complete problem [10] and only suboptimal procedures can be app...

128 | Comparing Support Vector Machines with Gaussian Kernels to Radial Basis Function Classifiers
- Schölkopf, Sung, et al.
- 1997
Citation Context: ...done by quadratic (or linear) programming in minimizing combinations of performance and generalization measures to build support vector machines (see [7] and subsequent papers on the subject, such as [30]). The use of a varying adaptation step can also be immediately related to the above discussion. A form could be included as a factor in , and a joint discussion will be possible. Most of the variable...

90 |
The multilayer perceptron as an approximation to a Bayes optimal discriminant function
- Ruck, Rogers, et al.
- 1990
Citation Context: ...requires an SSB cost function. For instance, the well-known cross-entropy (logarithmic cost) function given by (11) is SSB [19], [20]. Not every cost function is SSB; for instance, an norm is SSB only for [4], [21]. Besides a learning algorithm, an SSB classifier requires a network structure able to construct class probabilities and optimal decisions. In the following, we will assume that the network sati...

56 | Equivalence proofs for multi-layer perceptron classifiers and the Bayesian discriminant function
- Hampshire, Pearlmutter
- 1990
Citation Context: ...estimate a posteriori probabilities by means of a supervised training [3]–[6]. The estimation of a posteriori probabilities is not required in order to make a decision. Indeed, Vapnik [7] remarks that to include this estimation increases the complexity of the process. This...

37 |
Learning algorithms and probability distributions in feed-forward and feedback networks
- Hopfield
- 1987
Citation Context: ...performance, etc. Reference [13] presents a very complete overview, and [14] proposes using Csiszár measures [15] in several forms, following the idea of minimizing divergences introduced by Hopfield [16] and Hinton [17]. However, in many cases, proposing cost functions for strict decision purposes is not enough [e.g., in (auxiliary) automatic diagnostic systems for clinical applications, in financial...

34 |
Neural network classification: A bayesian interpretation
- Wan
- 1990
Citation Context: ...an SSB cost function. For instance, the well-known cross-entropy (logarithmic cost) function given by (11) is SSB [19], [20]. Not every cost function is SSB; for instance, an norm is SSB only for [4], [21]. Besides a learning algorithm, an SSB classifier requires a network structure able to construct class probabilities and optimal decisions. In the following, we will assume that the network satisfies ...

27 | Conditional distribution learning with neural networks and its application to channel equalization
- Adali, Liu, et al.
- 1997
Citation Context: ...al nonlinearity, . This is a well-known factor responsible for slow training when the quadratic cost function is used, since the adaptation is slow when is near zero or one (for example, see [19] or [25]). This is one of the justifications for using cross-entropy. In this case, (see [6]), , and (34) Wittner and Denker [26] provide theoretical results showing that cross-entropy behaves better than the...
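The slow-training effect discussed in this context can be sketched numerically (an illustration under standard assumptions, not the paper's derivation): with a sigmoid output `y` and target `d`, the quadratic-cost gradient with respect to the pre-activation carries the derivative factor `y(1 - y)`, which vanishes when the output saturates, while the cross-entropy gradient reduces to `y - d` and does not.

```python
import numpy as np

# Sketch: compare gradients w.r.t. the pre-activation a, with y = sigmoid(a).
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d = 1.0                          # target class label
for a in (-6.0, 0.0):            # saturated (wrong) vs. undecided output
    y = sigmoid(a)
    grad_quadratic = (y - d) * y * (1 - y)  # d/da of (y - d)^2 / 2
    grad_xentropy = y - d                   # d/da of -d log y - (1-d) log(1-y)
    print(a, grad_quadratic, grad_xentropy)
```

At `a = -6` the quadratic gradient is nearly zero even though the output is badly wrong, while the cross-entropy gradient stays close to -1, which is the usual argument for its faster training.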

21 |
Decision-Based neural networks with signal/image classification applications
- Kung, Taur
- 1995
Citation Context: ...Perceptron Rule, along with its many variants which deal with the lack of convergence in nonseparable situations [8]. Even structural and training generalizations, such as the decision-based neural networks [9] have difficulties; to solve them is an NP-complete problem [10] and only suboptimal procedures can be applied in practice. Since perceptron-rule-based algorithms have these drawbacks and generalize p...

15 |
Backpropagation and stochastic gradient descent method
- Amari
- 1993
Citation Context: ...According to the above definitions, an SSB detector requires an SSB cost function. For instance, the well-known cross-entropy (logarithmic cost) function given by (11) is SSB [19], [20]. Not every cost function is SSB; for instance, an norm is SSB only for [4], [21]. Besides a learning algorithm, an SSB classifier requires a network structure able to construct class probabilities an...

14 |
Repeat until bored: A pattern selection strategy
- Munro
- 1992
Citation Context: ...convergence of the algorithm. Another subject directly related to the above discussion, although it has not been analyzed from this point of view, is that of selecting samples. It was proposed first in [28], as a procedure according to which the samples that are difficult to learn are most frequently applied (after reaching a reasonable degree of convergence). Curiously, many subsequent selection strate...

11 |
Classification of linearly non-separable patterns by linear threshold elements
- Roychowdhury, Siu, et al.
- 1995
Citation Context: ...of convergence in nonseparable situations [8]. Even structural and training generalizations, such as the decision-based neural networks [9] have difficulties; to solve them is an NP-complete problem [10] and only suboptimal procedures can be applied in practice. Since perceptron-rule-based algorithms have these drawbacks and generalize poorly, other authors have followed alternative routes for minimu...

11 |
A new error criterion for posterior probability estimation with neural nets
- El-Jaroudi, Makhoul
- 1990
Citation Context: ...According to the above definitions, an SSB detector requires an SSB cost function. For instance, the well-known cross-entropy (logarithmic cost) function given by (11) is SSB [19], [20]. Not every cost function is SSB; for instance, an norm is SSB only for [4], [21]. Besides a learning algorithm, an SSB classifier requires a network structure able to construct class probabilit...

10 |
Pedagogical pattern selection strategies
- Cachin
- 1994
Citation Context: ...a procedure according to which the samples that are difficult to learn are most frequently applied (after reaching a reasonable degree of convergence). Curiously, many subsequent selection strategies ([29] presents a significant number of options) take the perspective of selecting samples close to the decision borders. Of course, this is (qualitatively) equivalent to selecting an “equivalent” cost func...

6 |
Objective functions for probability estimation
- Smyth, Miller, et al.
- 1991
Citation Context: ...estimate a posteriori probabilities by means of a supervised training [3]–[6]. The estimation of a posteriori probabilities is not required in order to make a decision. Indeed, Vapnik [7] remarks that to include this estimation increases the complexity of the process. This fac...

5 |
Detection, Estimation and Modulation Theory, vol
- Van Trees
- 1971
Citation Context: ...serve to balance all them. Once the decision costs are established, data likelihoods and a priori probabilities or a posteriori probabilities of the hypotheses are enough to obtain optimal results [1], [2]. In practice, with the exception of cases in which the “mechanics” of the problem are known (as in many transmission problems), likelihoods or a posteriori probabilities must be estimated. In pa...

3 |
The softmax nonlinearity: Derivation using statistical mechanics and useful properties as a multiterminal analog circuit element
- Elfadel, Wyatt Jr.
- 1994
Citation Context: ...kind of network. Since the goal is to estimate probabilities, the use of networks whose outputs are constrained by (13) is more adequate. A multioutput network with a softmax output activation function [23], [24] is an example. We consider in this paper networks satisfying this probability law. III. GENERAL SSB COST FUNCTIONS This section provides theoretical results on the conditions that SSB cost func...

3 |
Strategies for teaching layered neural networks classification tasks
- Wittner, Denker
- 1988
Citation Context: ...since the adaptation is slow when is near zero or one (for example, see [19] or [25]). This is one of the justifications for using cross-entropy. In this case, (see [6]), , and (34) Wittner and Denker [26] provide theoretical results showing that cross-entropy behaves better than the quadratic cost because it is a well-formed cost function. A cost function is said to be well formed if it satisfies the ...

2 |
Csiszar's generalized error measures for gradient-descent-based optimizations in neural networks using the backpropagation algorithm
- Neelakanta, Abusalah, et al.
- 1996
Citation Context: ...activation output [11], [12]. At the same time, other objective functions have appeared to address robustness, training speed, decision performance, etc. Reference [13] presents a very complete overview, and [14] proposes using Csiszár measures [15] in several forms, following the idea of minimizing divergences introduced by Hopfield [16] and Hinton [17]. However, in many cases, proposing cost functions for s...

2 | Concurrent methodology on LRD simulation studies
- unknown authors
- 1995

1 |
Implementing the minimum-misclassification-error energy function for target recognition
- Telfer, Szu
- 1992
Citation Context: ...generalize poorly, other authors have followed alternative routes for minimum error decision: Telfer and Szu, for example, use Minkowski’s -norm minimization and a very steep sigmoidal activation output [11], [12]. At the same time, other objective functions have appeared to address robustness, training speed, decision performance, etc. Reference [13] presents a very complete overview, and [14] proposes ...

1 |
Energy functions for minimizing misclassification error with minimum-complexity networks
- Telfer, Szu
- 1994
Citation Context: ...generalize poorly, other authors have followed alternative routes for minimum error decision: Telfer and Szu, for example, use Minkowski’s -norm minimization and a very steep sigmoidal activation output [11], [12]. At the same time, other objective functions have appeared to address robustness, training speed, decision performance, etc. Reference [13] presents a very complete overview, and [14] proposes using ...

1 |
Neural Networks for Optimization and Control
- Cichocki, Unbehauen
- 1993
Citation Context: ...minimization and a very steep sigmoidal activation output [11], [12]. At the same time, other objective functions have appeared to address robustness, training speed, decision performance, etc. Reference [13] presents a very complete overview, and [14] proposes using Csiszár measures [15] in several forms, following the idea of minimizing divergences introduced by Hopfield [16] and Hinton [17]. However, i...

1 |
A method of generating objective functions for probability estimation
- Billa, El-Jaroudi
- 1996
Citation Context: ...by parameterizing and giving values to these parameters, for instance by using the Taylor series expansion of as suggested in [22]. The multiclass problem can be addressed using different network configurations: 1) single output networks where classes are denoted with scalars , , and is the number of classes; 2) a network with o...