Results 1–10 of 13
The role of Occam’s Razor in knowledge discovery
Data Mining and Knowledge Discovery, 1999
Cited by 78 (3 self)
Abstract. Many KDD systems incorporate an implicit or explicit preference for simpler models, but this use of “Occam’s razor” has been strongly criticized by several authors (e.g., Schaffer, 1993; Webb, 1996). This controversy arises partly because Occam’s razor has been interpreted in two quite different ways. The first interpretation (simplicity is a goal in itself) is essentially correct, but is at heart a preference for more comprehensible models. The second interpretation (simplicity leads to greater accuracy) is much more problematic. A critical review of the theoretical arguments for and against it shows that it is unfounded as a universal principle, and demonstrably false. A review of empirical evidence shows that it also fails as a practical heuristic. This article argues that its continued use in KDD risks causing significant opportunities to be missed, and should therefore be restricted to the comparatively few applications where it is appropriate. The article proposes and reviews the use of domain constraints as an alternative for avoiding overfitting, and examines possible methods for handling the accuracy–comprehensibility tradeoff.
A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering
In Proceedings of the Eighteenth International Conference on Machine Learning, 2001
Cited by 65 (3 self)
We propose to scale learning algorithms to arbitrarily large databases by the following method. First derive an upper bound for the learner's loss as a function of the number of examples used in each step of the algorithm.
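The abstract's idea is to let a statistical bound decide how many examples each step of the algorithm really needs. A minimal Python sketch of that general idea, assuming a Hoeffding-style bound on the per-step loss; the helper names are hypothetical illustrations, not the paper's API:

```python
import math

def hoeffding_bound(value_range, confidence, n):
    """Hoeffding bound: with probability >= 1 - confidence, the mean of n
    i.i.d. observations in [0, value_range] lies within this epsilon of
    the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / confidence) / (2.0 * n))

def enough_examples(best, second_best, value_range, confidence, n):
    """Stop sampling for this step once the observed gap between the two
    best candidate decisions exceeds the bound, so the observed winner is
    (with high probability) the true winner on the full database."""
    return (best - second_best) > hoeffding_bound(value_range, confidence, n)
```

Under this scheme the number of examples used per step is driven by the required confidence, not by the database size, which is what makes scaling to arbitrarily large databases possible.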
Logical-Shapelets: An Expressive Primitive for Time Series Classification
Cited by 7 (2 self)
Time series shapelets are small, local patterns in a time series that are highly predictive of a class and are thus very useful features for building classifiers and for certain visualization and summarization tasks. While shapelets were introduced only recently, they have already seen significant adoption and extension in the community. Despite their immense potential as a data mining primitive, there are two important limitations of shapelets. First, their expressiveness is limited to simple binary presence/absence questions. Second, even though shapelets are computed offline, the time taken to compute them is significant. In this work, we address the latter problem by introducing a novel algorithm that finds shapelets in less time than current methods by an order of magnitude. Our algorithm is based on intelligent caching and reuse of computations, and the admissible pruning of the search space. Because our algorithm is so fast, it creates an opportunity to consider more expressive shapelet queries. In particular, we show for the first time an augmented shapelet representation that distinguishes the data based on conjunctions or disjunctions of shapelets. We call our novel representation Logical-Shapelets. We demonstrate the efficiency of our approach on the classic benchmark datasets used for these problems, and show several case studies where logical shapelets significantly outperform the original shapelet representation and other time series classification techniques. We demonstrate the utility of our ideas in domains as diverse as gesture recognition, robotics, and biometrics.
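To make the primitive concrete: a shapelet "occurs" in a series when some sliding window of the series is close to it, and a conjunctive (logical) query requires several such occurrences at once. A minimal sketch, assuming Euclidean distance over sliding windows; function names are hypothetical, and this omits the paper's caching and pruning entirely:

```python
def shapelet_distance(series, shapelet):
    """Distance from a time series to a shapelet: the minimum Euclidean
    distance over all sliding windows of the shapelet's length."""
    m = len(shapelet)
    return min(
        sum((series[i + j] - shapelet[j]) ** 2 for j in range(m)) ** 0.5
        for i in range(len(series) - m + 1)
    )

def conjunction_match(series, shapelets, thresholds):
    """A logical-shapelet-style test: the series matches only if every
    shapelet occurs somewhere within its distance threshold (conjunction);
    replacing all() with any() would give the disjunctive variant."""
    return all(
        shapelet_distance(series, s) <= t for s, t in zip(shapelets, thresholds)
    )
```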
Expected Error Analysis for Model Selection
In International Conference on Machine Learning (ICML), 1999
Cited by 5 (1 self)
In order to select a good hypothesis language (or model) from a collection of possible models, one has to assess the generalization performance of the hypothesis which is returned by a learner that is bound to use some particular model. This paper deals with a new and very efficient way of assessing this generalization performance. We present a new analysis which characterizes the expected generalization error of the hypothesis with least training error in terms of the distribution of error rates of the hypotheses in the model. This distribution can be estimated very efficiently from the data, which immediately leads to an efficient model selection algorithm. The analysis predicts learning curves with a very high precision and thus contributes to a better understanding of why and when overfitting occurs. We present empirical studies (controlled experiments on Boolean decision trees and a large-scale text categorization problem) which show that the model selection algorithm leads to err...
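The quantity the paper characterizes analytically — the expected true error of the hypothesis with least training error, given the error-rate distribution of the model — can also be illustrated by simulation. The sketch below is a Monte Carlo illustration of that quantity under an i.i.d. assumption, not the authors' closed-form analysis:

```python
import random

def expected_error_of_erm(true_errors, n_train, trials=2000, seed=0):
    """Monte Carlo estimate of the expected true error of the hypothesis
    with least training error. `true_errors` lists the true error rate of
    each hypothesis in the model; training errors are simulated as
    binomial draws on n_train i.i.d. examples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        best_emp, best_true = None, None
        for p in true_errors:
            emp = sum(rng.random() < p for _ in range(n_train)) / n_train
            if best_emp is None or emp < best_emp:
                best_emp, best_true = emp, p
        total += best_true
    return total / trials
```

Comparing this expectation across candidate models (each with its own error-rate distribution) is the model selection step the abstract describes.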
Average-Case Analysis of Classification Algorithms for Boolean Functions and Decision Trees
2000
Cited by 2 (1 self)
We conduct an average-case analysis of the generalization error rate of classification algorithms with finite model classes. Unlike worst-case approaches, we do not rely on bounds that hold for all possible learning problems. Instead, we study the behavior of a learning algorithm for a given problem, taking properties of the problem and the learner into account. The solution depends only on known quantities (e.g., the sample size) and the histogram of error rates in the model class, which we determine for the case that the sought target is a randomly drawn Boolean function. We then discuss how the error histogram can be estimated from a given sample and thus show how the analysis can be applied approximately in the more realistic scenario that the target is unknown. Experiments show that our analysis can predict the behavior of decision tree algorithms fairly accurately even if the error histogram is estimated from a sample.
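For intuition about the central object, the error histogram of a finite model class against a fixed target can be computed directly; for a randomly drawn Boolean target the histogram over all Boolean functions is binomial. A small sketch (hypothetical names, hypotheses encoded as tuples of output bits over the enumerated input points):

```python
def error_histogram(model_class, target, n_inputs):
    """Histogram of error rates of hypotheses in a finite model class
    against a target. Each hypothesis and the target are tuples of output
    bits, one per input point; the error rate of a hypothesis is the
    fraction of input points on which it disagrees with the target."""
    hist = {}
    for h in model_class:
        err = sum(a != b for a, b in zip(h, target)) / n_inputs
        hist[err] = hist.get(err, 0) + 1
    return hist
```

With the model class set to all 16 Boolean functions of 2 inputs, the histogram against any fixed target has counts 1, 4, 6, 4, 1 at error rates 0, 1/4, 1/2, 3/4, 1 — the binomial shape the random-target analysis exploits.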
The Biases of Decision Tree Pruning Strategies
In Advances in Intelligent Data Analysis: Proc. 3rd Intl. Symp., 1999
Cited by 2 (0 self)
Post-pruning of decision trees has been a successful approach in many real-world experiments, but over all possible concepts it does not bring any inherent improvement to an algorithm's performance. This work explores how a PAC-proven decision tree learning algorithm fares in comparison with two variants of the normal top-down induction of decision trees. The algorithm does not prune its hypothesis per se, but it can be understood to do pre-pruning of the evolving tree. We study a backtracking search algorithm, called Rank, for learning rank-minimal decision trees. Our experiments follow closely those performed by Schaffer [20]. They confirm the main findings of Schaffer: in learning concepts with a simple description pruning works; for concepts with a complex description, and when all concepts are equally likely, pruning is injurious, rather than beneficial, to the average performance of the greedy top-down induction of decision trees. Pre-pruning, as a gentler technique, settles in the ...
Efficient Sampling in Relational Feature Spaces
Abstract. State-of-the-art algorithms implementing the ‘extended transformation approach’ to propositionalization use backtracking depth-first search for the construction of relational features (first-order atom conjunctions) complying with the user’s mode/type declarations and a few basic syntactic conditions. As such they incur a complexity factor exponential in the maximum allowed feature size. Here I present an alternative based on an efficient reduction of the feature construction problem to the propositional satisfiability (SAT) problem, such that the latter involves only Horn clauses and is therefore tractable: a model of a propositional Horn theory can be found without backtracking in time linear in the number of literals contained. This reduction allows one either to efficiently enumerate the complete set of correct features (if their total number is polynomial in the maximum feature size), or otherwise to efficiently obtain a random sample from the uniform distribution on the feature space. The proposed sampling method can also efficiently provide an unbiased estimate of the total number of correct features entailed by the user's language declaration.
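The tractability claim rests on the classical fact that a definite Horn theory has a least model computable by forward chaining, with no backtracking. A naive quadratic sketch of that forward chaining (the linear-time version additionally keeps an unsatisfied-premise counter per clause; the names here are hypothetical):

```python
def horn_minimal_model(facts, rules):
    """Least model of a definite Horn theory by forward chaining: start
    from the unit facts and fire each rule (body -> head) once every atom
    in its body is already in the model. Monotone, so no backtracking."""
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in model and all(b in model for b in body):
                model.add(head)
                changed = True
    return model
```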
Predicting the Generalization Performance of Cross Validatory Model Selection Criteria
2000
We conduct an average-case analysis of the generalization error rate of holdout testing and n-fold cross-validation "wrappers" for model selection. Unlike previous approaches, we do not rely on worst-case bounds that hold for all possible learning problems. Instead, we study the behavior of a learning algorithm with a cross-validation wrapper for a given problem, taking properties of the problem (that can be estimated using the sample) into account. We have to pay for this (and the efficiency of our solution) by having to make some approximations. Experiments show that our analysis can nevertheless predict the behavior of cross-validation wrappers fairly accurately.
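A generic cross-validation wrapper of the kind analyzed here can be sketched as follows; this is an illustration of the object of study, not of the paper's analysis, and `train_and_error` is a hypothetical callback assumed to return the error rate on the held-out split:

```python
def cross_val_select(models, data, labels, train_and_error, n_folds=5):
    """n-fold cross-validation 'wrapper' for model selection: return the
    model whose average held-out error over the folds is lowest. Any
    trailing examples beyond n_folds equal folds are ignored."""
    fold = len(data) // n_folds

    def cv_error(model):
        errs = []
        for k in range(n_folds):
            lo, hi = k * fold, (k + 1) * fold
            test = list(zip(data[lo:hi], labels[lo:hi]))
            train = list(zip(data[:lo] + data[hi:], labels[:lo] + labels[hi:]))
            errs.append(train_and_error(model, train, test))
        return sum(errs) / n_folds

    return min(models, key=cv_error)
```

The expense being avoided by the paper's analysis is visible here: the wrapper retrains each candidate model n_folds times.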
Predicting the Relation between Model Class, Domain, and Error Rate
One of the questions central to the field of Meta-Learning is how domain properties, the model class that a learner uses, and the resulting error rate relate. The histogram of error rates in a given model class is a joint property of model class and domain. For a certain class of learners, this property (together with the sample size) can be shown to approximately determine the resulting generalization error. In many cases, this result can be exploited to determine a model class which is close to optimal for a given domain. In my talk, I will summarize the analysis that quantifies the generalization error in terms of this histogram, sketch applications of the analysis to model selection problems, and discuss a list of open questions.