Results 1–10 of 11
The role of Occam’s Razor in knowledge discovery
Data Mining and Knowledge Discovery, 1999
"... Abstract. Many KDD systems incorporate an implicit or explicit preference for simpler models, but this use of “Occam’s razor ” has been strongly criticized by several authors (e.g., Schaffer, 1993; Webb, 1996). This controversy arises partly because Occam’s razor has been interpreted in two quite di ..."
Abstract

Cited by 78 (3 self)
 Add to MetaCart
Abstract. Many KDD systems incorporate an implicit or explicit preference for simpler models, but this use of “Occam’s razor” has been strongly criticized by several authors (e.g., Schaffer, 1993; Webb, 1996). This controversy arises partly because Occam’s razor has been interpreted in two quite different ways. The first interpretation (simplicity is a goal in itself) is essentially correct, but is at heart a preference for more comprehensible models. The second interpretation (simplicity leads to greater accuracy) is much more problematic. A critical review of the theoretical arguments for and against it shows that it is unfounded as a universal principle, and demonstrably false. A review of empirical evidence shows that it also fails as a practical heuristic. This article argues that its continued use in KDD risks causing significant opportunities to be missed, and should therefore be restricted to the comparatively few applications where it is appropriate. The article proposes and reviews the use of domain constraints as an alternative for avoiding overfitting, and examines possible methods for handling the accuracy–comprehensibility tradeoff.
A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering
In Proceedings of the Eighteenth International Conference on Machine Learning, 2001
"... We propose to scale learning algorithms to arbitrarily large databases by the following method. First derive an upper bound for the learner's loss as a function of the number of examples used in each step of the algorithm. ..."
Abstract

Cited by 64 (3 self)
 Add to MetaCart
We propose to scale learning algorithms to arbitrarily large databases by the following method. First derive an upper bound for the learner's loss as a function of the number of examples used in each step of the algorithm.
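The abstract cuts off before the details, but the loss bound the authors invoke is a Hoeffding-style concentration bound, which underlies such sub-sampling schemes: once the observed gap between two candidate decisions exceeds the bound for the current sample size, more data is very unlikely to reverse the decision. A minimal sketch of that idea (function names and the decision rule are our illustration, not the paper's algorithm):

```python
import math

def hoeffding_bound(value_range: float, n: int, delta: float) -> float:
    """Hoeffding bound: with probability >= 1 - delta, the empirical mean
    of n i.i.d. observations taking values in a range of width `value_range`
    lies within this epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def enough_examples(gain_a: float, gain_b: float, value_range: float,
                    n: int, delta: float = 1e-6) -> bool:
    """Decide whether n examples already suffice to prefer option A over B:
    the observed gap must exceed the bound, so that (with high probability)
    seeing more examples could not reverse the choice."""
    return gain_a - gain_b > hoeffding_bound(value_range, n, delta)
```

A scheme in this spirit processes examples in blocks, stopping as soon as `enough_examples` holds for every pending decision, so the number of examples used per step is bounded independently of the database size.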
Logical-Shapelets: An Expressive Primitive for Time Series Classification
"... Time series shapelets are small, local patterns in a time series that are highly predictive of a class and are thus very useful features for building classifiers and for certain visualization and summarization tasks. While shapelets were introduced only recently, they have already seen significant a ..."
Abstract

Cited by 7 (2 self)
 Add to MetaCart
Time series shapelets are small, local patterns in a time series that are highly predictive of a class and are thus very useful features for building classifiers and for certain visualization and summarization tasks. While shapelets were introduced only recently, they have already seen significant adoption and extension in the community. Despite their immense potential as a data mining primitive, there are two important limitations of shapelets. First, their expressiveness is limited to simple binary presence/absence questions. Second, even though shapelets are computed offline, the time taken to compute them is significant. In this work, we address the latter problem by introducing a novel algorithm that finds shapelets in less time than current methods by an order of magnitude. Our algorithm is based on intelligent caching and reuse of computations, and the admissible pruning of the search space. Because our algorithm is so fast, it creates an opportunity to consider more expressive shapelet queries. In particular, we show for the first time an augmented shapelet representation that distinguishes the data based on conjunctions or disjunctions of shapelets. We call our novel representation Logical-Shapelets. We demonstrate the efficiency of our approach on the classic benchmark datasets used for these problems, and show several case studies where logical shapelets significantly outperform the original shapelet representation and other time series classification techniques. We demonstrate the utility of our ideas in domains as diverse as gesture recognition, robotics, and biometrics.
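The shapelet primitive reduces to a minimum subsequence distance plus a threshold test, and conjunctions of such tests give the logical variant described above. A small sketch of the primitive (brute-force distance with none of the paper's caching or pruning; function names are ours):

```python
import math

def subsequence_distance(series, shapelet):
    """Minimum Euclidean distance between the shapelet and any
    equal-length sliding window of the series."""
    m = len(shapelet)
    best = math.inf
    for start in range(len(series) - m + 1):
        d = math.sqrt(sum((series[start + i] - shapelet[i]) ** 2
                          for i in range(m)))
        best = min(best, d)
    return best

def contains(series, shapelet, threshold):
    """The binary presence/absence question: is the shapelet 'in' the series?"""
    return subsequence_distance(series, shapelet) <= threshold

def logical_and(series, s1, t1, s2, t2):
    """A conjunctive rule in the logical-shapelet spirit:
    both patterns must be present in the series."""
    return contains(series, s1, t1) and contains(series, s2, t2)
```

A disjunctive rule is the same sketch with `or` in place of `and`; the paper's contribution is finding such shapelets and thresholds efficiently, not evaluating them.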
Expected Error Analysis for Model Selection
International Conference on Machine Learning (ICML), 1999
"... In order to select a good hypothesis language (or model) from a collection of possible models, one has to assess the generalization performance of the hypothesis which is returned by a learner that is bound to use some particular model. This paper deals with a new and very efficient way of assessing ..."
Abstract

Cited by 5 (1 self)
 Add to MetaCart
In order to select a good hypothesis language (or model) from a collection of possible models, one has to assess the generalization performance of the hypothesis which is returned by a learner that is bound to use some particular model. This paper deals with a new and very efficient way of assessing this generalization performance. We present a new analysis which characterizes the expected generalization error of the hypothesis with least training error in terms of the distribution of error rates of the hypotheses in the model. This distribution can be estimated very efficiently from the data which immediately leads to an efficient model selection algorithm. The analysis predicts learning curves with a very high precision and thus contributes to a better understanding of why and when overfitting occurs. We present empirical studies (controlled experiments on Boolean decision trees and a large-scale text categorization problem) which show that the model selection algorithm leads to err...
Average-Case Analysis of Classification Algorithms for Boolean Functions and Decision Trees
, 2000
"... We conduct an averagecase analysis of the generalization error rate of classification algorithms with finite model classes. Unlike worstcase approaches, we do not rely on bounds that hold for all possible learning problems. Instead, we study the behavior of a learning algorithm for a given problem ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
We conduct an average-case analysis of the generalization error rate of classification algorithms with finite model classes. Unlike worst-case approaches, we do not rely on bounds that hold for all possible learning problems. Instead, we study the behavior of a learning algorithm for a given problem, taking properties of the problem and the learner into account. The solution depends only on known quantities (e.g., the sample size), and the histogram of error rates in the model class which we determine for the case that the sought target is a randomly drawn Boolean function. We then discuss how the error histogram can be estimated from a given sample and thus show how the analysis can be applied approximately in the more realistic scenario that the target is unknown. Experiments show that our analysis can predict the behavior of decision tree algorithms fairly accurately even if the error histogram is estimated from a sample.
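The analysis above is closed-form, but a Monte Carlo simulation illustrates its central object: the histogram of true error rates in a finite model class, together with the sample size, determines the expected true error of the hypothesis with least training error. The simulation below is our illustration of that relationship, not the paper's derivation:

```python
import random

def expected_error_of_erm(error_histogram, sample_size, trials=2000, seed=0):
    """Monte Carlo estimate of the expected TRUE error of the hypothesis
    with least TRAINING error, given the histogram of true error rates
    of the hypotheses in a finite model class.

    error_histogram: list of (true_error_rate, count) pairs.
    """
    rng = random.Random(seed)
    rates = [r for r, count in error_histogram for _ in range(count)]
    total = 0.0
    for _ in range(trials):
        best_train, best_true = sample_size + 1, None
        for r in rates:
            # A hypothesis with true error r makes Binomial(n, r) training errors.
            train_errors = sum(rng.random() < r for _ in range(sample_size))
            if train_errors < best_train:
                best_train, best_true = train_errors, r
        total += best_true
    return total / trials
```

With one good hypothesis (true error 0.1) among ten bad ones (true error 0.5), fifty examples almost always suffice for empirical risk minimization to find the good one, while two examples frequently do not; that gap between sample sizes is exactly the overfitting behavior the analysis quantifies.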
The Biases of Decision Tree Pruning Strategies
Advances in Intelligent Data Analysis: Proc. 3rd Intl. Symp., 1999
"... Post pruning of decision trees has been a successful approach in many realworld experiments, but over all possible concepts it does not bring any inherent improvement to an algorithm's performance. This work explores how a PACproven decision tree learning algorithm fares in comparison with two var ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
Post-pruning of decision trees has been a successful approach in many real-world experiments, but over all possible concepts it does not bring any inherent improvement to an algorithm's performance. This work explores how a PAC-proven decision tree learning algorithm fares in comparison with two variants of the normal top-down induction of decision trees. The algorithm does not prune its hypothesis per se, but it can be understood to do pre-pruning of the evolving tree. We study a backtracking search algorithm, called Rank, for learning rank-minimal decision trees. Our experiments follow closely those performed by Schaffer [20]. They confirm the main findings of Schaffer: in learning concepts with a simple description, pruning works; for concepts with a complex description, and when all concepts are equally likely, pruning is injurious, rather than beneficial, to the average performance of the greedy top-down induction of decision trees. Pre-pruning, as a gentler technique, settles in the ...
Nonparametric Regularization of Decision Trees
2000
"... We discuss the problem of choosing the complexity of a decision tree (measured in the number of leaf nodes) that gives us highest generalization performance. We first discuss an analysis of the generalization error of decision trees that gives us a new perspective on the regularization parameter tha ..."
Abstract
 Add to MetaCart
We discuss the problem of choosing the complexity of a decision tree (measured in the number of leaf nodes) that gives us highest generalization performance. We first discuss an analysis of the generalization error of decision trees that gives us a new perspective on the regularization parameter that is inherent to any regularization (e.g., pruning) algorithm. There is an optimal setting of this parameter for every learning problem; a setting that does well for one problem will inevitably do poorly for others. We will see that the optimal setting can in fact be estimated from the sample, without "trying out" various settings on holdout data. This leads us to a nonparametric decision tree regularization algorithm that can, in principle, work well for all learning problems.
Predicting the Generalization Performance of Cross Validatory Model Selection Criteria
2000
"... We conduct an averagecase analysis of the generalization error rate of holdout testing and nfold cross validation "wrappers" for model selection. Unlike previous approaches, we do not rely on worstcase bounds that hold for all possible learning problems. Instead, we study the behavior of a learni ..."
Abstract
 Add to MetaCart
We conduct an average-case analysis of the generalization error rate of holdout testing and n-fold cross validation "wrappers" for model selection. Unlike previous approaches, we do not rely on worst-case bounds that hold for all possible learning problems. Instead, we study the behavior of a learning algorithm with a cross-validation wrapper for a given problem, taking properties of the problem (that can be estimated using the sample) into account. We have to pay for this (and the efficiency of our solution) by having to make some approximations. Experiments show that our analysis can nevertheless predict the behavior of cross-validation wrappers fairly accurately.
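The cross-validation wrapper being analyzed is itself standard; a minimal sketch of k-fold error estimation and wrapper-based model selection may help fix terms (function and parameter names are ours, not the paper's):

```python
import random

def k_fold_cv_error(examples, labels, train_fn, k=10, seed=0):
    """Estimate a learner's generalization error by k-fold cross validation.
    train_fn(examples, labels) must return a predictor: example -> label."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = train_fn([examples[i] for i in train],
                         [labels[i] for i in train])
        errors += sum(model(examples[i]) != labels[i] for i in fold)
    return errors / len(examples)

def select_model(examples, labels, learners, k=10):
    """The CV 'wrapper': pick the learner with lowest cross-validated error."""
    return min(learners, key=lambda fn: k_fold_cv_error(examples, labels, fn, k))
```

The paper's point is that the behavior of this wrapper (how often it picks the truly better learner, and at what cost in accuracy when it does not) can be predicted analytically from estimable properties of the problem, rather than only measured empirically.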
Predicting the Relation between Model Class, Domain, and Error Rate
"... One of the questions central to the field of MetaLearning is how domain properties, the model class that a learner uses, and the resulting error rate relate. The histogram of error rates in a given model class is a joint property of model class and domain. For a certain class of learners, this prop ..."
Abstract
 Add to MetaCart
One of the questions central to the field of Meta-Learning is how domain properties, the model class that a learner uses, and the resulting error rate relate. The histogram of error rates in a given model class is a joint property of model class and domain. For a certain class of learners, this property (together with the sample size) can be shown to approximately determine the resulting generalization error. In many cases, this result can be exploited to determine a model class which is close to optimal for a given domain. In my talk, I will summarize the analysis that quantifies the generalization error in terms of this histogram, sketch applications of the analysis to model selection problems, and discuss a list of open questions.
A Method for Discovering the Insignificance of One's
2002
"... Consider the following common scenario: a datamining practitioner tries various specialized classification algorithms on a new dataset of unknown difficulty and selects the apparent best. ..."
Abstract
 Add to MetaCart
Consider the following common scenario: a data-mining practitioner tries various specialized classification algorithms on a new dataset of unknown difficulty and selects the apparent best.