Results 1–10 of 14
Efficient Agnostic Learning of Neural Networks with Bounded Fan-in
1996
"... We show that the class of two layer neural networks with bounded fanin is efficiently learnable in a realistic extension to the Probably Approximately Correct (PAC) learning model. In this model, a joint probability distribution is assumed to exist on the observations and the learner is required to ..."
Abstract

Cited by 68 (18 self)
 Add to MetaCart
We show that the class of two-layer neural networks with bounded fan-in is efficiently learnable in a realistic extension to the Probably Approximately Correct (PAC) learning model. In this model, a joint probability distribution is assumed to exist on the observations and the learner is required to approximate the neural network which minimizes the expected quadratic error. As special cases, the model allows learning real-valued functions with bounded noise, learning probabilistic concepts, and learning the best approximation to a target function that cannot be well approximated by the neural network. The networks we consider have real-valued inputs and outputs, an unlimited number of threshold hidden units with bounded fan-in, and a bound on the sum of the absolute values of the output weights. The number of computation ... (This work was supported by the Australian Research Council and the Australian Telecommunications and Electronics Research Board. The material in this paper was pres...)
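The hypothesis class described above can be made concrete with a small sketch. The following Python is purely illustrative, not the paper's algorithm: it builds a two-layer network of threshold hidden units, each restricted to a bounded fan-in subset of the inputs, enforces a bound on the L1 norm of the output weights, and estimates the expected quadratic error empirically. All names and the bound value are assumptions.

```python
import numpy as np

# Illustrative sketch (not the paper's algorithm) of the hypothesis class:
# a two-layer network of threshold hidden units, each with bounded fan-in,
# and a bound on the sum of absolute output weights. All names are made up.

def threshold_unit(x, idx, w, b):
    """Threshold (Heaviside) hidden unit that sees only the inputs x[idx]."""
    return 1.0 if np.dot(w, x[idx]) - b >= 0 else 0.0

def two_layer_net(x, units, out_w):
    """Weighted sum of threshold units; each unit is (input indices, weights, bias)."""
    return sum(a * threshold_unit(x, idx, w, b)
               for a, (idx, w, b) in zip(out_w, units))

def expected_quadratic_error(X, y, units, out_w, weight_bound=10.0):
    """Empirical estimate of the quadratic error E[(f(X) - Y)^2]."""
    # the model additionally bounds the L1 norm of the output weights
    assert np.sum(np.abs(out_w)) <= weight_bound
    preds = np.array([two_layer_net(x, units, out_w) for x in X])
    return float(np.mean((preds - y) ** 2))
```

A unit with index set `idx` of size at most k is what "fan-in bounded by k" means here; the agnostic learner's task is to come close to the minimizer of this quadratic error over all such networks.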
Noisy Time Series Prediction using a Recurrent Neural Network and Grammatical Inference
Machine Learning, 2001
"... Financial forecasting is an example of a signal processing problem which is challenging due to small sample sizes, high noise, nonstationarity, and nonlinearity. Neural networks have been very successful in a number of signal processing applications. We discuss fundamental limitations and inherent ..."
Abstract

Cited by 47 (0 self)
 Add to MetaCart
Financial forecasting is an example of a signal processing problem which is challenging due to small sample sizes, high noise, nonstationarity, and nonlinearity. Neural networks have been very successful in a number of signal processing applications. We discuss fundamental limitations and inherent difficulties when using neural networks for the processing of high noise, small sample size signals. We introduce a new intelligent signal processing method which addresses the difficulties. The method proposed uses conversion into a symbolic representation with a self-organizing map, and grammatical inference with recurrent neural networks. We apply the method to the prediction of daily foreign exchange rates, addressing difficulties with nonstationarity, overfitting, and unequal a priori class probabilities, and we find significant predictability in comprehensive experiments covering 5 different foreign exchange rates. The method correctly predicts the direction of change for th...
Memory-Universal Prediction of Stationary Random Processes
IEEE Trans. Inform. Theory, 1998
"... We consider the problem of onestepahead prediction of a realvalued, stationary, strongly mixing random process fX i g i=01 . The best meansquare predictor of X0 is its conditional mean given the entire infinite past fX i g i=01 . Given a sequence of observations X1 X2 111 XN, we propose estimato ..."
Abstract

Cited by 26 (1 self)
 Add to MetaCart
We consider the problem of one-step-ahead prediction of a real-valued, stationary, strongly mixing random process {X_i}_{i=-∞}^{∞}. The best mean-square predictor of X_0 is its conditional mean given the entire infinite past {X_i}_{i=-∞}^{-1}. Given a sequence of observations X_1, X_2, ..., X_N, we propose estimators for the conditional mean based on sequences of parametric models of increasing memory and of increasing dimension, for example, neural networks and Legendre polynomials. The proposed estimators select both the model memory and the model dimension, in a data-driven fashion, by minimizing certain complexity-regularized least squares criteria. When the underlying predictor function has a finite memory, we establish that the proposed estimators are memory-universal: the proposed estimators, which do not know the true memory, deliver the same statistical performance (rates of integrated mean-squared error) as that delivered by estimators that know the true memory. Furthermore, when the underlying predictor function does not have a finite memory, we establish that the estimator based on Legendre polynomials is consistent.
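The data-driven selection described in this abstract can be sketched in a few lines. The sketch below fits one-step-ahead predictors over increasing memory d (number of lags) and dimension m (number of polynomial basis terms), then picks (d, m) by minimizing a penalized least-squares score. The penalty form c·p·log(N)/N is an illustrative stand-in, not the paper's exact complexity term, and plain polynomial features stand in for a tensor-product Legendre basis.

```python
import numpy as np

# Hedged sketch of complexity-regularized model selection over
# (memory d, dimension m); penalty form is an assumption.

def fit_predictor(x, d, m):
    """Least-squares fit of x[t+d] on degree-(m-1) polynomial features of the last d values."""
    N = len(x) - d
    H = np.array([np.vander(x[t:t + d], m, increasing=True).ravel()
                  for t in range(N)])          # N x (d*m) design matrix
    y = x[d:]
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    resid = float(np.mean((H @ coef - y) ** 2))
    return coef, resid, H.shape[1]

def select_model(x, max_d=3, max_m=3, c=1.0):
    """Minimize empirical MSE plus a complexity penalty over (memory, dimension)."""
    N = len(x)
    best = None
    for d in range(1, max_d + 1):
        for m in range(1, max_m + 1):
            _, resid, p = fit_predictor(x, d, m)
            score = resid + c * p * np.log(N) / N
            if best is None or score < best[0]:
                best = (score, d, m)
    return best[1], best[2]
```

"Memory-universal" then means: the (d, m) chosen this way achieves the same error rate as if the true memory d had been known in advance.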
Hardness Results for Neural Network Approximation Problems
1999
"... Introduction Previous negative results for learning twolayer neural network classifiers show that it is difficult to find a network that correctly classifies all examples in a training set. However, for learning to a particular accuracy it is only necessary to approximately solve this problem, tha ..."
Abstract

Cited by 20 (2 self)
 Add to MetaCart
Introduction: Previous negative results for learning two-layer neural network classifiers show that it is difficult to find a network that correctly classifies all examples in a training set. However, for learning to a particular accuracy it is only necessary to approximately solve this problem, that is, to find a network that correctly classifies most examples in a training set. In this paper, we show that this approximation problem is hard for several neural network classes. The hardness of PAC-style learning is a very natural question that has been addressed from a variety of viewpoints. The strongest non-learnability conclusions are those stating that no matter what type of algorithm a learner may use, as long as its computational resources are limited, it would not be able to predict a previously unseen label (with probability significantly better than that of a random guess). Such results have been derived by noticing that, in some precise sense, learning...
Minimum Complexity Regression Estimation with Weakly Dependent Observations
IEEE Trans. Inform. Theory, 1996
"... Parameter Spaces and Abstract Complexities For each integer rt _> 1, let % denote a model dimension, for example, see (2), and let S, denote a compact subset of ]R The set S, will serve as a collection of parameters associated with the model dimension %, for example, see (5). For every v S,, let f( ..."
Abstract

Cited by 20 (1 self)
 Add to MetaCart
Parameter Spaces and Abstract Complexities: For each integer n ≥ 1, let r_n denote a model dimension (for example, see (2)), and let S_n denote a compact subset of ℝ^{r_n}. The set S_n will serve as a collection of parameters associated with the model dimension r_n (for example, see (5)). For every v ∈ S_n, let f(n, v) denote a real-valued function on B_X parameterized by (n, v) (for example, see (3)). The following condition is required to invoke the exponential inequalities in Theorems 4.2 and 4.3.
Noisy time series prediction using symbolic representation and recurrent neural network grammatical inference
1996
"... Financial forecasting is an example of a signal processing problem which is challenging due to small sample sizes, high noise, nonstationarity, and nonlinearity. Neural networks have been very successful in a number of signal processing applications. We discuss fundamental limitations and inherent ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
Financial forecasting is an example of a signal processing problem which is challenging due to small sample sizes, high noise, nonstationarity, and nonlinearity. Neural networks have been very successful in a number of signal processing applications. We discuss fundamental limitations and inherent difficulties when using neural networks for the processing of high noise, small sample size signals. We introduce a new intelligent signal processing method which addresses the difficulties. The method uses conversion into a symbolic representation with a self-organizing map, and grammatical inference with recurrent neural networks. We apply the method to the prediction of daily foreign exchange rates, addressing difficulties with nonstationarity, overfitting, and unequal a priori class probabilities, and we find significant predictability in comprehensive experiments covering 5 different foreign exchange rates. The method correctly predicts the direction of change for the next day with an error rate of 47.1%. The error rate reduces to around 40% when rejecting examples where the system has low confidence in its prediction. The symbolic representation aids the extraction of symbolic knowledge from the recurrent neural networks in the form of deterministic finite state automata. These automata explain the operation of the system and are often relatively simple. Rules related to well known behavior such as trend following and mean reversal are extracted.
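The symbolization step described in this abstract can be sketched as follows: quantize noisy returns into a small alphabet with a one-dimensional self-organizing map, so that a discrete-sequence learner (in the paper, recurrent networks performing grammatical inference) operates on symbols rather than raw values. The function names, map size, and learning schedule below are all assumptions for illustration.

```python
import numpy as np

# Illustrative 1-D self-organizing map for symbolizing a noisy series;
# parameter values are assumptions, not the paper's settings.

def train_som_1d(values, n_symbols=3, epochs=50, lr=0.2, seed=0):
    rng = np.random.default_rng(seed)
    codebook = rng.standard_normal(n_symbols) * np.std(values)
    for epoch in range(epochs):
        radius = max(1.0 - epoch / epochs, 0.01)  # shrinking neighborhood
        for v in rng.permutation(values):
            winner = int(np.argmin(np.abs(codebook - v)))
            for j in range(n_symbols):
                # neighborhood function pulls the winner and its neighbors toward v
                h = np.exp(-((j - winner) ** 2) / (2 * radius ** 2))
                codebook[j] += lr * h * (v - codebook[j])
    return np.sort(codebook)

def symbolize(values, codebook):
    """Map each value to the index of its nearest codebook entry (its symbol)."""
    return [int(np.argmin(np.abs(codebook - v))) for v in values]
```

The resulting symbol sequence is what a downstream grammatical-inference model would consume; it is also what makes the extraction of finite state automata mentioned above possible.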
Presenting and analyzing the results of AI experiments: Data averaging and data snooping
Proc. of the Fourteenth Natl. Conf. on Artificial Intelligence, 1997
"... Experimental results reported in the machine learning AI literature can be misleading. This paper investigates the common processes of data averaging (reporting results in terms of the mean and standard deviation of the results from multiple trials) and data snooping in the context of neural network ..."
Abstract

Cited by 6 (1 self)
 Add to MetaCart
Experimental results reported in the machine learning AI literature can be misleading. This paper investigates the common processes of data averaging (reporting results in terms of the mean and standard deviation of the results from multiple trials) and data snooping in the context of neural networks, one of the most popular AI machine learning models. Both of these processes can result in misleading results and inaccurate conclusions. We demonstrate how easily this can happen and propose techniques for avoiding these very important problems. For data averaging, common presentation assumes that the distribution of individual results is Gaussian. However, we investigate the distribution for common problems and find that it often does not approximate the Gaussian distribution, may not be symmetric, and may be multimodal. We show that assuming Gaussian distributions can significantly affect the interpretation of results, especially those of comparison studies. For a controlled task, we find that the distribution of performance is skewed towards better performance for smoother target functions and skewed towards worse performance for more complex target functions. We propose new guidelines for reporting performance which provide more information about the actual distribution (e.g. box-whisker plots). For data snooping, we demonstrate that optimization of performance via experimentation with multiple parameters can lead to significance being assigned to results which are due to chance. We suggest that precise descriptions of experimental techniques can be very important to the evaluation of results, and that we need to be aware of potential data snooping biases when formulating these experimental techniques (e.g. selecting the test procedure). Additionally, it is important to only rely on appropriate statistical tests and to ensure that any assumptions made in the tests are valid (e.g. normality of the distribution).
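The data-averaging pitfall is easy to reproduce. The snippet below simulates a skewed, non-Gaussian distribution of per-trial errors (the distribution is purely illustrative) and contrasts the mean ± std summary with the median and quartiles that a box-whisker plot would show:

```python
import numpy as np

# Illustrative only: a skewed per-trial error distribution, e.g. most runs
# converge well while a few get stuck in bad local minima.
rng = np.random.default_rng(1)
errors = np.concatenate([rng.normal(0.10, 0.01, 90),
                         rng.normal(0.40, 0.05, 10)])

mean, std = errors.mean(), errors.std()
q1, median, q3 = np.percentile(errors, [25, 50, 75])

print(f"mean +/- std : {mean:.3f} +/- {std:.3f}")
print(f"median [IQR] : {median:.3f} [{q1:.3f}, {q3:.3f}]")
# The mean sits well above the median, dragged up by the minority of bad
# runs; reading mean +/- std as if the distribution were Gaussian would
# misstate both the typical result and the spread.
```

This is exactly why the paper recommends reporting summaries that expose the actual distribution rather than assuming normality.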
On the Distribution of Performance from Multiple Neural-Network Trials
1997
"... The performance of neuralnetwork simulations is often reported in terms of the mean and standard deviation of a number of simulations performed with different starting conditions. However, in many cases, the distribution of the individual results does not approximate a Gaussian distribution, may no ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
The performance of neural-network simulations is often reported in terms of the mean and standard deviation of a number of simulations performed with different starting conditions. However, in many cases, the distribution of the individual results does not approximate a Gaussian distribution, may not be symmetric, and may be multimodal. We present the distribution of results for practical problems and show that assuming Gaussian distributions can significantly affect the interpretation of results, especially those of comparison studies. For a controlled task which we consider, we find that the distribution of performance is skewed toward better performance for smoother target functions and skewed toward worse performance for more complex target functions. We propose new guidelines for reporting performance which provide more information about the actual distribution.
Agnostic Learning and Single Hidden Layer Neural Networks
1996
"... This thesis is concerned with some theoretical aspects of supervised learning of realvalued functions. We study a formal model of learning called agnostic learning. The agnostic learning model assumes a joint probability distribution on the observations (inputs and outputs) and requires the learnin ..."
Abstract

Cited by 5 (0 self)
 Add to MetaCart
This thesis is concerned with some theoretical aspects of supervised learning of real-valued functions. We study a formal model of learning called agnostic learning. The agnostic learning model assumes a joint probability distribution on the observations (inputs and outputs) and requires the learning algorithm to produce a hypothesis with performance close to that of the best function within a specified class of functions. It is a very general model of learning which includes function learning, learning with additive noise, and learning the best approximation in a class of functions as special cases. Within the agnostic learning model, we concentrate on learning functions which can be well approximated by single hidden layer neural networks. Artificial neural networks are often used as black box models for modelling phenomena for which very little prior knowledge is available. Agnostic learning is a natural model for such learning problems. The class of single hidden layer neural netwo...
On the Consistency of Boosting Algorithms
2001
"... Boosting algorithms have been shown to perform well on many realworld problems, although they sometimes tend to overfit in noisy situations. While excellent finite sample bounds are known, it has not been clear whether boosting is statistically consistent, implying asymptotic convergence to the opti ..."
Abstract

Cited by 4 (0 self)
 Add to MetaCart
Boosting algorithms have been shown to perform well on many real-world problems, although they sometimes tend to overfit in noisy situations. While excellent finite sample bounds are known, it has not been clear whether boosting is statistically consistent, implying asymptotic convergence to the optimal classification rule. Recent work has provided sufficient conditions for the consistency of boosting for one-dimensional problems. In this work we provide sufficient conditions for the consistency of boosting in the multivariate case. These conditions require nontrivial geometric concepts, which play no role in the one-dimensional setting. An interesting connection to the recently introduced notion of kernel alignment is pointed out.
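For readers unfamiliar with the procedure whose consistency is analyzed here, a minimal AdaBoost sketch with decision stumps makes it concrete. This is textbook AdaBoost on toy data, not the paper's construction; all names are illustrative.

```python
import numpy as np

# Textbook AdaBoost with axis-aligned decision stumps (illustrative sketch).

def stump_predict(X, feat, thresh, sign):
    """A stump votes +/-1 based on one feature crossing one threshold."""
    return sign * np.where(X[:, feat] > thresh, 1.0, -1.0)

def fit_stump(X, y, w):
    """Pick the stump minimizing the weighted 0-1 error."""
    best = (None, np.inf)
    for feat in range(X.shape[1]):
        for thresh in np.unique(X[:, feat]):
            for sign in (1.0, -1.0):
                err = np.sum(w * (stump_predict(X, feat, thresh, sign) != y))
                if err < best[1]:
                    best = ((feat, thresh, sign), err)
    return best

def adaboost(X, y, rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        (feat, thresh, sign), err = fit_stump(X, y, w)
        err = max(err, 1e-12)                      # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)      # weak learner's vote weight
        pred = stump_predict(X, feat, thresh, sign)
        w *= np.exp(-alpha * y * pred)             # up-weight mistakes
        w /= w.sum()
        ensemble.append((alpha, feat, thresh, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(X, f, t, s) for a, f, t, s in ensemble)
    return np.sign(score)
```

Statistical consistency asks whether, as the sample size grows (with the number of rounds suitably controlled), the risk of this kind of voted classifier converges to the Bayes-optimal risk; the abstract above gives sufficient conditions for that in the multivariate case.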