Results 1–10 of 66
Arcing Classifiers
, 1998
Abstract

Cited by 277 (6 self)
Recent work has shown that combining multiple versions of unstable classifiers such as trees or neural nets results in reduced test set error. One of the more effective methods is bagging (Breiman [1996a]). Here, modified training sets are formed by resampling from the original training set; classifiers are constructed using these training sets and then combined by voting. Freund and Schapire [1995, 1996] propose an algorithm the basis of which is to adaptively resample and combine (hence the acronym arcing) so that the weights in the resampling are increased for those cases most often misclassified and the combining is done by weighted voting. Arcing is more successful than bagging in test set error reduction. We explore two arcing algorithms, compare them to each other and to bagging, and try to understand how arcing works. We introduce the definitions of bias and variance for a classifier as components of the test set error. Unstable classifiers can have low bias on a large range of data sets....
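The adaptive-resampling idea can be sketched with Breiman's arc-x4 rule, under which a case's resampling probability grows with how often it has been misclassified so far (the function name and the counts below are illustrative):

```python
import numpy as np

def arc_x4_weights(miss_counts):
    # Breiman's arc-x4 resampling probabilities: p_i proportional to
    # 1 + m_i^4, where m_i counts how often case i has been
    # misclassified by the classifiers built so far.
    m = np.asarray(miss_counts, dtype=float)
    w = 1.0 + m ** 4
    return w / w.sum()

# Frequently misclassified cases dominate the next resample.
p = arc_x4_weights([0, 0, 2, 5])
```

With counts [0, 0, 2, 5] the hardest case receives roughly 97% of the resampling mass, which is what forces later classifiers to concentrate on it.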
PCFG Models of Linguistic Tree Representations
 Computational Linguistics
, 1998
Abstract

Cited by 211 (9 self)
This paper points out that the Penn II treebank representations are of the kind predicted to have such an effect, and describes a simple node relabeling transformation that improves a treebank PCFG-based parser's average precision and recall by around 8%, or approximately half of the performance difference between a simple PCFG model and the best broad-coverage parsers available today. This performance variation comes about because any PCFG, and hence the corpus of trees from which the PCFG is induced, embodies independence assumptions about the distribution of words and phrases. The particular independence assumptions implicit in a tree representation can be studied theoretically and investigated empirically by means of a tree transformation/detransformation process.
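A well-known node relabeling of the kind the abstract describes is parent annotation, where each nonterminal is relabeled with its parent's category so the induced PCFG no longer treats, say, subject and object NPs identically. A minimal sketch, using a hypothetical `(label, children)` tuple representation of trees:

```python
def parent_annotate(tree):
    # Relabel each nonterminal with its parent's label (NP under S
    # becomes NP^S), weakening the PCFG independence assumptions.
    # Trees are (label, children) tuples; leaves are plain strings.
    def walk(node, parent_label):
        if isinstance(node, str):  # terminal: leave words untouched
            return node
        label, children = node
        new_label = f"{label}^{parent_label}" if parent_label else label
        return (new_label, [walk(child, label) for child in children])
    return walk(tree, None)

t = ("S", [("NP", ["John"]), ("VP", [("V", ["runs"])])])
annotated = parent_annotate(t)
```

Detransformation is just stripping the `^...` suffix after parsing, so evaluation is still against the original treebank labels.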
The neural basis of cognitive development: A constructivist manifesto
 Behavioral and Brain Sciences
, 1997
Abstract

Cited by 128 (2 self)
Quartz, S. & Sejnowski, T.J. (1997). The neural basis of cognitive development: A constructivist manifesto.
Lazy Decision Trees
, 1996
Abstract

Cited by 96 (5 self)
Lazy learning algorithms, exemplified by nearest-neighbor algorithms, do not induce a concise hypothesis from a given training set; the inductive process is delayed until a test instance is given. Algorithms for constructing decision trees, such as C4.5, ID3, and CART, create a single "best" decision tree during the training phase, and this tree is then used to classify test instances. The tests at the nodes of the constructed tree are good on average, but there may be better tests for classifying a specific instance. We propose a lazy decision tree algorithm, LazyDT, that conceptually constructs the "best" decision tree for each test instance. In practice, only a path needs to be constructed, and a caching scheme makes the algorithm fast. The algorithm is robust with respect to missing values without resorting to the complicated methods usually seen in induction of decision trees. Experiments on real and artificial problems are presented. ...
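A minimal sketch of the lazy idea (not the paper's exact algorithm): grow only the single path relevant to the given test instance, at each step choosing the binary feature with the best information gain and discarding the training cases that disagree with the instance on that feature:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # Shannon entropy (bits) of an integer label array.
    counts = np.bincount(labels)
    p = counts[counts > 0] / len(labels)
    return -(p * np.log2(p)).sum()

def lazy_classify(X, y, x):
    # Sketch of a LazyDT-style classifier for binary features: build
    # just the path matching test instance x, then predict the
    # majority label of the surviving training cases.
    X, y, x = np.asarray(X), np.asarray(y), np.asarray(x)
    used = set()
    while len(set(y)) > 1 and len(used) < X.shape[1]:
        best, best_gain = None, 0.0
        for f in range(X.shape[1]):
            if f in used:
                continue
            mask = X[:, f] == x[f]
            if mask.all() or not mask.any():
                continue  # split would not separate anything
            gain = entropy(y) - (mask.mean() * entropy(y[mask])
                                 + (~mask).mean() * entropy(y[~mask]))
            if gain > best_gain:
                best, best_gain = f, gain
        if best is None:
            break
        keep = X[:, best] == x[best]
        X, y = X[keep], y[keep]
        used.add(best)
    return Counter(y).most_common(1)[0][0]

pred = lazy_classify([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 1, 1], [1, 0])  # -> 1
```

Because only the cases matching the instance are kept at each step, the chosen tests can differ from instance to instance, which is exactly what a single eagerly built tree cannot do.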
Combining Multiple Classifiers By Averaging Or By Multiplying?
, 2000
Abstract

Cited by 77 (2 self)
In classification tasks it may be wise to combine observations from different sources. Not only does it decrease the training time, it can also increase the robustness and the performance of the classification. Combining is often done by just (weighted) averaging of the outputs of the different classifiers. Using equal weights for all classifiers then results in the mean combination rule. This works very well in practice, but the combination strategy lacks a fundamental basis, as it cannot readily be derived from the joint probabilities. This contrasts with the product combination rule, which can be obtained from the joint probability under the assumption of independence. In this paper we will show differences and similarities between this mean combination rule and the product combination rule in theory and in practice. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
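The two rules the abstract contrasts are easy to state for estimated class posteriors (the matrix of posteriors below is made up for illustration):

```python
import numpy as np

def mean_rule(posteriors):
    # Equal-weight averaging of classifier outputs: the mean
    # combination rule. posteriors: (n_classifiers, n_classes).
    return np.mean(posteriors, axis=0)

def product_rule(posteriors):
    # Product combination rule, which follows from the joint
    # probability under an independence assumption; renormalized
    # over classes.
    p = np.prod(posteriors, axis=0)
    return p / p.sum()

# Three classifiers, two classes; all lean toward class 0.
P = np.array([[0.6, 0.4],
              [0.7, 0.3],
              [0.9, 0.1]])
m = mean_rule(P)
pr = product_rule(P)
```

On this example the product rule is much more confident in class 0 than the mean rule, illustrating its sharper (and, with poorly estimated posteriors, riskier) behavior.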
Constructing Deterministic Finite-State Automata in Recurrent Neural Networks
 Journal of the ACM
, 1996
Abstract

Cited by 70 (16 self)
Recurrent neural networks that are trained to behave like deterministic finite-state automata (DFAs) can show deteriorating performance when tested on long strings. This deteriorating performance can be attributed to the instability of the internal representation of the learned DFA states. The use of a sigmoidal discriminant function together with the recurrent structure contributes to this instability. We prove that a simple algorithm can construct second-order recurrent neural networks with a sparse interconnection topology and sigmoidal discriminant function such that the internal DFA state representations are stable, i.e., the constructed network correctly classifies strings of arbitrary length. The algorithm is based on encoding strengths of weights directly into the neural network. We derive a relationship between the weight strength and the number of DFA states for robust string classification. For a DFA with n states and m input alphabet symbols, the constructive algorithm genera...
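The weight-encoding idea can be sketched for a tiny parity DFA: one sigmoid neuron per state and second-order weights set to +H for the correct transition and -H otherwise, where H plays the role of the weight strength the abstract analyzes (the value H=8 and all names here are illustrative, not the paper's construction in detail):

```python
import numpy as np

def build_weights(delta, n_states, n_symbols, H=8.0):
    # W[i, j, k] = +H if delta(j, k) = i, else -H: DFA transitions
    # encoded directly into second-order recurrent weights.
    W = -H * np.ones((n_states, n_states, n_symbols))
    for (j, k), i in delta.items():
        W[i, j, k] = H
    return W

def run(W, start, symbols):
    # Second-order update s_i <- sigmoid(sum_j W[i, j, k] * s_j) for
    # each input symbol k. The saturated sigmoid keeps the near
    # one-hot state representation stable even on long strings.
    s = np.zeros(W.shape[0])
    s[start] = 1.0
    for k in symbols:
        s = 1.0 / (1.0 + np.exp(-(W[:, :, k] @ s)))
    return int(np.argmax(s))

# Parity DFA over {0, 1}: state flips on symbol 1, stays on symbol 0.
parity = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
W = build_weights(parity, n_states=2, n_symbols=2)
```

Running this on hundreds of symbols keeps the active neuron saturated near 1 and the rest near 0, which is the stability property the paper proves for its construction.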
Selecting Input Variables Using Mutual Information and Nonparametric Density Estimation
, 1996
Abstract

Cited by 47 (2 self)
In learning problems where a connectionist network is trained with a finite-sized training set, better generalization performance is often obtained when unneeded weights in the network are eliminated. One source of unneeded weights comes from the inclusion of input variables that provide little information about the output variables. We propose a method for identifying and eliminating these input variables. The method first determines the relationship between input and output variables using nonparametric density estimation and then measures the relevance of input variables using the information-theoretic concept of mutual information. We present results from our method on a simple toy problem and a nonlinear time series.

1 INTRODUCTION

Generalization performance on a fixed-size training set is closely related to the number of free parameters in a network. Selecting too many free parameters can lead to poor generalization performance (Baum & Haussler, 1989; Geman, Bienenstock, & Dours...
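The relevance test can be sketched with a crude histogram estimator of mutual information; the paper itself uses nonparametric (kernel) density estimates, so this is only the shape of the idea, with made-up data:

```python
import numpy as np

def mutual_information(x, y, bins=8):
    # Histogram estimate of I(X; Y) in bits. A kernel density
    # estimate, as in the paper, would be smoother; this is a
    # deliberately simple stand-in.
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
relevant = rng.normal(size=2000)
irrelevant = rng.normal(size=2000)
target = relevant + 0.1 * rng.normal(size=2000)

mi_rel = mutual_information(relevant, target)
mi_irr = mutual_information(irrelevant, target)
```

Input variables whose estimated mutual information with the target falls below a threshold would be the candidates for elimination.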
A unified bias-variance decomposition for zero-one and squared loss
 In AAAI’00
Abstract

Cited by 44 (0 self)
The bias-variance decomposition is a very useful and widely used tool for understanding machine-learning algorithms. It was originally developed for squared loss. In recent years, several authors have proposed decompositions for zero-one loss, but each has significant shortcomings. In particular, all of these decompositions have only an intuitive relationship to the original squared-loss one. In this paper, we define bias and variance for an arbitrary loss function, and show that the resulting decomposition specializes to the standard one for the squared-loss case, and to a close relative of Kong and Dietterich’s (1995) one for the zero-one case. The same decomposition also applies to variable misclassification costs. We show a number of interesting consequences of the unified definition. For example, Schapire et al.’s (1997) notion of “margin” can be expressed as a function of the zero-one bias and variance, making it possible to formally relate a classifier ensemble’s generalization error to the base learner’s bias and variance on training examples. Experiments with the unified definition lead to further insights.
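The unified definitions can be sketched at a single test point: the "main" prediction is the most frequent prediction across training sets, bias is the loss of the main prediction against the true label, and variance is the average loss of the individual predictions against the main prediction. A toy rendering of those definitions for zero-one loss (inputs are made up):

```python
from collections import Counter

def zero_one_bias_variance(true_label, predictions):
    # Sketch of the unified decomposition at one test point under
    # zero-one loss. predictions: labels produced by models trained
    # on different training sets.
    main = Counter(predictions).most_common(1)[0][0]
    bias = 0 if main == true_label else 1          # loss(y*, main)
    variance = sum(p != main for p in predictions) / len(predictions)
    return main, bias, variance

main, b, v = zero_one_bias_variance('a', ['a', 'a', 'b', 'a', 'b'])
```

Here the main prediction is 'a', the bias is 0 (it matches the true label), and the variance is 0.4, the fraction of predictions deviating from the main one.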
A New Metric-Based Approach to Model Selection
 In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97)
, 1997
Abstract

Cited by 42 (5 self)
We introduce a new approach to model selection that performs better than the standard complexity-penalization and holdout error estimation techniques in many cases. The basic idea is to exploit the intrinsic metric structure of a hypothesis space, as determined by the natural distribution of unlabeled training patterns, and use this metric as a reference to detect whether the empirical error estimates derived from a small (labeled) training sample can be trusted in the region around an empirically optimal hypothesis. Using simple metric intuitions, we develop new geometric strategies for detecting overfitting and performing robust yet responsive model selection in spaces of candidate functions. These new metric-based strategies dramatically outperform previous approaches in experimental studies of classical polynomial curve fitting. Moreover, the technique is simple, efficient, and can be applied to most function learning tasks. The only requirement is access to an auxiliary collection ...
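One way to sketch the metric intuition: unlabeled data defines a pseudo-metric between hypotheses, and a triangle-inequality check flags training-error estimates that cannot all be trusted. This is a simplified, hypothetical rendering of the idea, not the paper's actual selection procedure; the polynomial hypotheses and error values below are invented:

```python
import numpy as np

def unlabeled_distance(f, g, X_unlabeled):
    # Pseudo-metric between hypotheses, estimated from unlabeled
    # patterns only: average disagreement of their outputs.
    return np.mean(np.abs(f(X_unlabeled) - g(X_unlabeled)))

def triangle_violated(f, g, err_f, err_g, X_unlabeled):
    # If d(f, g) > err_f + err_g, the claimed errors and the
    # observed distance cannot satisfy the triangle inequality, so
    # at least one training-error estimate is untrustworthy
    # (a sign of overfitting).
    return unlabeled_distance(f, g, X_unlabeled) > err_f + err_g

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=500)          # unlabeled patterns
f = np.poly1d([1.0, 0.0])                  # simple fit: x
g = np.poly1d([5.0, 0.0, -3.0, 0.0])       # wigglier fit: 5x^3 - 3x
# Two tiny reported training errors that the unlabeled data exposes
# as incompatible:
flag = triangle_violated(f, g, 0.01, 0.01, X)
```

Only unlabeled patterns are needed for the distance, which is what makes the reference cheap: the "auxiliary collection" the abstract mentions is unlabeled data.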