Results 1–10 of 42
The Evidence Framework applied to Classification Networks
Neural Computation, 1992
Abstract

Cited by 152 (10 self)
Three Bayesian ideas are presented for supervised adaptive classifiers. First, it is argued that the output of a classifier should be obtained by marginalising over the posterior distribution of the parameters; a simple approximation to this integral is proposed and demonstrated. This involves a `moderation' of the most probable classifier's outputs, and yields improved performance. Second, it is demonstrated that the Bayesian framework for model comparison described for regression models in (MacKay, 1992a, 1992b) can also be applied to classification problems. This framework successfully chooses the magnitude of weight decay terms, and ranks solutions found using different numbers of hidden units. Third, an information-based data selection criterion is derived and demonstrated within this framework. 1 Introduction A quantitative Bayesian framework has been described for learning of mappings in feedforward networks (MacKay, 1992a, 1992b). It was demonstrated that this `evidence' fram...
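For a logistic output unit, this `moderation' has a well-known closed-form approximation: if the activation's posterior is treated as Gaussian with variance s², the marginalised output is approximately sigmoid(κ(s)·a) with κ(s) = (1 + πs²/8)^(−1/2). A minimal sketch, assuming the activation `a` and its posterior variance `s2` have already been computed (the names are ours):

```python
import math

def moderated_output(a, s2):
    """Approximate marginalised output of a logistic classifier:
    instead of sigmoid(a) at the most probable weights, compute
    sigmoid(kappa * a), where kappa = 1/sqrt(1 + pi*s2/8) shrinks the
    activation according to its posterior variance s2."""
    kappa = 1.0 / math.sqrt(1.0 + math.pi * s2 / 8.0)
    return 1.0 / (1.0 + math.exp(-kappa * a))
```

With s2 = 0 this reduces to the ordinary most-probable output; increasing s2 pulls the prediction toward 0.5, which is how moderation tempers over-confident outputs.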
A tutorial on energy-based learning
Predicting Structured Data, 2006
Abstract

Cited by 42 (6 self)
Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables. Inference consists in clamping the value of observed variables and finding configurations of the remaining variables that minimize the energy. Learning consists in finding an energy function in which observed configurations of the variables are given lower energies than unobserved ones. The EBM approach provides a common theoretical framework for many learning models, including traditional discriminative and generative approaches, as well as graph-transformer networks, conditional random fields, maximum margin Markov networks, and several manifold learning methods. Probabilistic models must be properly normalized, which sometimes requires evaluating intractable integrals over the space of all possible variable configurations. Since EBMs have no requirement for proper normalization, this problem is naturally circumvented. EBMs can be viewed as a form of non-probabilistic factor graphs, and they provide considerably more flexibility in the design of architectures and training criteria than probabilistic approaches.
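The clamp-and-minimise inference the abstract describes can be sketched for a toy discrete EBM by exhaustive search; the `infer` helper and the toy energy function below are illustrative, not from the tutorial:

```python
from itertools import product

def infer(energy, observed, free_vars, domains):
    """Energy-based inference: clamp the observed variables and search
    the remaining (free) variables for the minimum-energy configuration.
    Exhaustive search is only feasible for small discrete domains."""
    best, best_e = None, float("inf")
    for values in product(*(domains[v] for v in free_vars)):
        config = dict(observed)            # clamped variables
        config.update(zip(free_vars, values))  # candidate completion
        e = energy(config)
        if e < best_e:
            best, best_e = dict(config), e
    return best, best_e

# Toy energy: configurations where y matches x get lower energy.
energy = lambda c: 0.0 if c["y"] == c["x"] else 1.0
cfg, e = infer(energy, {"x": 1}, ["y"], {"y": [0, 1]})
# cfg["y"] == 1 and e == 0.0
```

Note that no normalization constant is ever computed; only energy differences matter, which is the flexibility the abstract emphasises.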
Lexical Modeling in a Speaker Independent Speech Understanding System
1993
Abstract

Cited by 41 (8 self)
Over the past 40 years, significant progress has been made in the fields of speech recognition and speech understanding. Current state-of-the-art speech recognition systems are capable of achieving word-level accuracies of 90% to 95% on continuous speech recognition tasks using 5,000 words. Even larger systems, capable of recognizing 20,000 words, are just now being developed. Speech understanding systems have recently been developed that perform fairly well within a restricted domain. While the size and performance of modern speech recognition and understanding systems are impressive, it is evident to anyone who has used these systems that the technology is primitive compared to our own human ability to understand speech. Some of the difficulties hampering progress in the fields of speech recognition and understanding stem from the many sources of variation that occur during human communication. One of the sources of variation that occurs in human communication is the different ways that words can be pronounced. There are many causes of pronunciation variation, such as the phonetic environment in which the word occurs, the dialect of the speaker,
Learning compatibility coefficients for relaxation labeling processes
IEEE Trans. Pattern Anal. Machine Intell., 1994
Abstract

Cited by 39 (5 self)
Relaxation labeling processes have been widely used in many different domains, including image processing, pattern recognition, and artificial intelligence. They are iterative procedures that aim at reducing local ambiguities and achieving global consistency through a parallel exploitation of contextual information, which is quantitatively expressed in terms of a set of "compatibility coefficients." The problem of determining compatibility coefficients has received considerable attention in the past, and many heuristic, statistically based methods have been suggested. In this paper, we propose a rather different viewpoint to solve this problem: we derive them by attempting to optimize the performance of the relaxation algorithm over a sample of training data; no statistical interpretation is given: compatibility coefficients are simply interpreted as real numbers for which performance is optimal. Experimental results over a novel application of relaxation are given, which demonstrate the effectiveness of the proposed approach. Index Terms: Compatibility coefficients, constraint satisfaction, gradient projection, learning, neural networks, nonlinear
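For concreteness, one step of the classical relaxation labeling update driven by compatibility coefficients can be sketched as follows (a generic sketch of the standard update, with nested lists standing in for the paper's notation: `p[i][l]` is the probability of label `l` at object `i`, and `r[i][j][l][m]` are the coefficients being learned):

```python
def relaxation_step(p, r):
    """One step of the standard relaxation labeling update.
    Support q[i][l] aggregates contextual evidence via the
    compatibility coefficients; probabilities are then reweighted
    and renormalized."""
    n, L = len(p), len(p[0])
    q = [[sum(r[i][j][l][m] * p[j][m]
              for j in range(n) for m in range(L))
          for l in range(L)] for i in range(n)]
    new_p = []
    for i in range(n):
        support = [p[i][l] * (1.0 + q[i][l]) for l in range(L)]
        z = sum(support)
        new_p.append([s / z for s in support])
    return new_p
```

With coefficients that reward label agreement, repeated application drives each object's distribution toward the contextually consistent label, which is the "global consistency" the abstract refers to.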
Feedforward Neural Nets as Models for Time Series Forecasting
ORSA Journal on Computing, 1993
Abstract

Cited by 39 (3 self)
We have studied neural networks as models for time series forecasting, and our research compares the Box-Jenkins method against the neural network method for long- and short-term memory series. Our work was inspired by previously published works that yielded inconsistent results about comparative performance. We have since experimented with 16 time series of differing complexity using neural networks. The performance of the neural networks is compared with that of the Box-Jenkins method. Our experiments indicate that for time series with long memory, both methods produce comparable results. However, for series with short memory, neural networks outperform the Box-Jenkins model. Because neural networks can easily be built for multiple-step-ahead forecasting, they present a better long-term forecasting model than the Box-Jenkins method. We discuss the representation ability, the model building process, and the applicability of the neural net approach. Neural networks appear to provide a ...
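The multiple-step-ahead scheme alluded to above is usually the recursive one: apply a one-step-ahead network repeatedly, feeding each prediction back in as a lagged input. A sketch, with any callable standing in for the trained feedforward net:

```python
def multistep_forecast(model, history, horizon, lags):
    """Recursive multiple-step-ahead forecasting: `model` maps a vector
    of the last `lags` values to the next value; each prediction is
    appended to the input window for the following step."""
    window = list(history[-lags:])
    out = []
    for _ in range(horizon):
        y = model(window)
        out.append(y)
        window = window[1:] + [y]  # slide the lag window forward
    return out

# Toy stand-in model with an AR(2)-like rule: y_t = 0.5*y_{t-1} + 0.5*y_{t-2}
forecast = multistep_forecast(lambda w: 0.5 * w[-1] + 0.5 * w[-2],
                              [1.0, 1.0], 3, 2)
# forecast == [1.0, 1.0, 1.0]
```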
Relative Loss Bounds for Single Neurons
IEEE Transactions on Neural Networks, 1996
Abstract

Cited by 36 (4 self)
We analyze and compare the well-known Gradient Descent algorithm and the more recent Exponentiated Gradient algorithm for training a single neuron with an arbitrary transfer function. Both algorithms are easily generalized to larger neural networks, and the generalization of Gradient Descent is the standard backpropagation algorithm. In this paper we prove worst-case loss bounds for both algorithms in the single neuron case. Since local minima make it difficult to prove worst-case bounds for gradient-based algorithms, we must use a loss function that prevents the formation of spurious local minima. We define such a matching loss function for any strictly increasing differentiable transfer function and prove worst-case loss bounds for any such transfer function and its corresponding matching loss. For example, the matching loss for the identity function is the square loss, and the matching loss for the logistic transfer function is the entropic loss. The different forms of the two algori...
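The matching loss for a strictly increasing transfer function φ can be written as the integral of (φ(z) − y) dz between φ⁻¹(y) and φ⁻¹(ŷ). A numerical sketch of that definition (midpoint rule; for the identity this recovers the square loss up to the usual factor of 1/2, and for the logistic it recovers the entropic loss):

```python
import math

def matching_loss(phi, phi_inv, y, y_hat, steps=100000):
    """Matching loss for a strictly increasing transfer function phi:
    the integral of (phi(z) - y) dz from phi_inv(y) to phi_inv(y_hat),
    approximated here with the midpoint rule."""
    a, b = phi_inv(y), phi_inv(y_hat)
    h = (b - a) / steps
    return sum((phi(a + (k + 0.5) * h) - y) * h for k in range(steps))
```

Because the integrand changes sign exactly where the prediction equals the target, this construction cannot create spurious local minima, which is what the worst-case analysis relies on.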
Phoneme Probability Estimation with Dynamic Sparsely Connected Artificial Neural Networks
1997
Abstract

Cited by 27 (1 self)
This paper presents new methods for training large neural networks for phoneme probability estimation. An architecture combining time-delay windows and recurrent connections is used to capture the important dynamic information of the speech signal. Because the number of connections in a fully connected recurrent network grows superlinearly with the number of hidden units, schemes for sparse connection and connection pruning are explored. It is found that sparsely connected networks outperform their fully connected counterparts with an equal number of connections. The implementation of the combined architecture and training scheme is described in detail. The networks are evaluated in a hybrid HMM/ANN system for phoneme recognition on the TIMIT database, and for word recognition on the WAXHOLM database. The achieved phone error rate, 27.8%, for the standard 39-phoneme set on the core test set of the TIMIT database is in the range of the lowest reported. All training and simulation softwar...
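Connection pruning of the kind explored here is often implemented as magnitude pruning: zero all but the largest-magnitude fraction of weights. A generic sketch (not the paper's exact scheme):

```python
def prune_by_magnitude(weights, keep_fraction):
    """Keep only the largest-magnitude `keep_fraction` of weights and
    zero the rest, turning a dense weight list into a sparse one.
    Ties at the threshold are all kept."""
    ranked = sorted((abs(w) for w in weights), reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    threshold = ranked[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]
```

The pruned positions define a sparse connectivity mask; retraining with that mask fixed is what lets a sparsely connected network match or beat a dense one at equal connection count.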
Exponentially Many Local Minima for Single Neurons
1995
Abstract

Cited by 27 (6 self)
We show that for a single neuron with the logistic function as the transfer function, the number of local minima of the error function based on the square loss can grow exponentially in the dimension. 1 INTRODUCTION Consider a single artificial neuron with d inputs. The neuron has d weights w ∈ R^d. The output of the neuron for an input pattern x ∈ R^d is y = φ(x · w), where φ : R → R is a transfer function. For a given sequence of training examples ⟨(x_t, y_t)⟩, 1 ≤ t ≤ m, each consisting of a pattern x_t ∈ R^d and a desired output y_t ∈ R, the goal of the training phase for neural networks consists of minimizing the error function with respect to the weight vector w ∈ R^d. This function is the sum of the losses between the outputs of the neuron and the desired outputs, summed over all training examples. In notation, the error function is E(w) = Σ_{t=1}^{m} L(y_t, φ(x_t · w)), where L : R × R → [0, ∞) is the loss function. A common example of a transfer function...
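The error function E(w) above is straightforward to write down for the logistic transfer function and the square loss:

```python
import math

def squared_error(w, examples):
    """E(w) for a single logistic neuron: the sum over training
    examples (x_t, y_t) of the square loss between phi(x_t . w)
    and the target y_t."""
    phi = lambda a: 1.0 / (1.0 + math.exp(-a))          # logistic transfer
    dot = lambda x, w: sum(xi * wi for xi, wi in zip(x, w))
    return sum((y - phi(dot(x, w))) ** 2 for x, y in examples)
```

It is this non-convex surface, a sum of sigmoid-shaped terms, whose local minima the paper counts.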
Understanding Neural Networks as Statistical Tools
The American Statistician, 1996
Abstract

Cited by 24 (0 self)
Neural networks have received a great deal of attention over the last few years. They are being used in the areas of prediction and classification; areas where regression models and other related statistical techniques have traditionally been used. In this paper, we discuss neural networks and compare them to regression models. We start by exploring the history of neural networks. This includes a review of relevant literature on the topic of neural networks. Neural network nomenclature is then introduced and the backpropagation algorithm, the most widely used learning algorithm, is derived and explained in detail. A comparison between regression analysis and neural networks in terms of notation and implementation is conducted to aid the reader in understanding neural networks. We compare the performance of regression analysis with that of neural networks on two simulated examples and one example on a large data set. We show that neural networks act as a type of nonparametric regression...
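The backpropagation algorithm the paper derives can be sketched for a one-hidden-layer network with sigmoid hidden units, a linear output, and square loss (a minimal single-example version; the weight layout and the absence of biases are simplifications of ours):

```python
import math

def backprop_step(x, t, W1, W2, lr=0.1):
    """One backpropagation step under square loss E = 0.5*(y - t)^2.
    W1 is a list of hidden-unit weight rows; W2 the output weights.
    Returns the updated weights and the pre-update loss."""
    sig = lambda a: 1.0 / (1.0 + math.exp(-a))
    # Forward pass
    h = [sig(sum(wi * xi for wi, xi in zip(row, x))) for row in W1]
    y = sum(v * hj for v, hj in zip(W2, h))
    err = y - t
    # Output layer: dE/dW2_j = err * h_j
    new_W2 = [v - lr * err * hj for v, hj in zip(W2, h)]
    # Hidden layer: chain rule through the sigmoid, h_j*(1 - h_j)
    new_W1 = [[w - lr * err * v * hj * (1 - hj) * xi
               for w, xi in zip(row, x)]
              for row, v, hj in zip(W1, W2, h)]
    return new_W1, new_W2, 0.5 * err * err
```

Seen this way, the network is just an adaptive-basis regression fit by gradient descent, which is the statistical reading the paper develops.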