Results 1–10 of 16
Towards parameter-free data mining
In: Proc. 10th ACM SIGKDD Intn’l Conf. Knowledge Discovery and Data Mining, 2004
Cited by 118 (18 self)
Most data mining algorithms require the setting of many input parameters. Two main dangers of working with parameter-laden algorithms are the following. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should have as few parameters as possible, ideally none. A parameter-free algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics and computational theory hold great promise for a parameter-free data-mining paradigm. The results are motivated by observations in Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen or so lines of code. We will show that this approach is competitive or superior to the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/video datasets.
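The "dozen or so lines of code" the abstract refers to is a compression-based dissimilarity measure of the form C(xy) / (C(x) + C(y)). A minimal sketch using Python's zlib as the off-the-shelf compressor (the function names and toy sequences below are illustrative, not taken from the paper):

```python
import zlib

def csize(s: bytes) -> int:
    """Compressed size of s using an off-the-shelf compressor (zlib here)."""
    return len(zlib.compress(s, 9))

def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity: C(xy) / (C(x) + C(y)).
    Closer to 0.5 when x and y share most of their structure;
    closer to 1.0 when they share little."""
    return csize(x + y) / (csize(x) + csize(y))

# Toy sequences, purely illustrative.
dna_a = b"atcgatcgttaaccgg" * 30
dna_b = b"atcgatcgttaaccgg" * 30   # same repeating motif as dna_a
dna_c = b"ggcatttacgcgtaac" * 30   # different motif

print(cdm(dna_a, dna_b))   # lower: concatenation compresses well together
print(cdm(dna_a, dna_c))   # higher: little shared structure
```

Because the measure is driven entirely by the compressor, the only "parameter" is the choice of compression algorithm itself, which is the point of the paradigm.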
The role of Occam’s Razor in knowledge discovery
Data Mining and Knowledge Discovery, 1999
Cited by 78 (3 self)
Many KDD systems incorporate an implicit or explicit preference for simpler models, but this use of “Occam’s razor” has been strongly criticized by several authors (e.g., Schaffer, 1993; Webb, 1996). This controversy arises partly because Occam’s razor has been interpreted in two quite different ways. The first interpretation (simplicity is a goal in itself) is essentially correct, but is at heart a preference for more comprehensible models. The second interpretation (simplicity leads to greater accuracy) is much more problematic. A critical review of the theoretical arguments for and against it shows that it is unfounded as a universal principle, and demonstrably false. A review of empirical evidence shows that it also fails as a practical heuristic. This article argues that its continued use in KDD risks causing significant opportunities to be missed, and should therefore be restricted to the comparatively few applications where it is appropriate. The article proposes and reviews the use of domain constraints as an alternative for avoiding overfitting, and examines possible methods for handling the accuracy–comprehensibility trade-off.
Occam's Two Razors: The Sharp and the Blunt
In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1998
Cited by 27 (3 self)
Occam's razor has been the subject of much controversy. This paper argues that this is partly because it has been interpreted in two quite different ways, the first of which (simplicity is a goal in itself) is essentially correct, while the second (simplicity leads to greater accuracy) is not. The paper reviews the large variety of theoretical arguments and empirical evidence for and against the "second razor," and concludes that the balance is strongly against it. In particular, it builds on the case of (Schaffer, 1993) and (Webb, 1996) by considering additional theoretical arguments and recent empirical evidence that the second razor fails in most domains. A version of the first razor more appropriate to KDD is proposed, and we argue that continuing to apply the second razor risks causing significant opportunities to be missed.

1 Occam's Two Razors. William of Occam's famous razor states that "Nunquam ponenda est pluralitas sin necesitate," which, approximately translated, means "En...
Microchoice Bounds and Self-Bounding Learning Algorithms
Machine Learning, 2001
Cited by 15 (0 self)
A major topic in machine learning is to determine good upper bounds on the true error rates of learned hypotheses based upon their empirical performance on training data. In this paper, we demonstrate new adaptive bounds designed for learning algorithms that operate by making a sequence of choices. These bounds, which we call Microchoice bounds, are similar to Occam-style bounds and can be used to make learning algorithms self-bounding in the style of Freund [Fre98]. We then show how to combine these bounds with Freund's query-tree approach, producing a version of Freund's query-tree structure that can be implemented with much more algorithmic efficiency.
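For intuition, the textbook Occam-style bound for a finite hypothesis class H (Hoeffding inequality plus a union bound, not the paper's exact statement) reads: with probability at least 1 − δ over a sample of size m,

```latex
\forall h \in H:\quad
\operatorname{err}(h) \;\le\; \widehat{\operatorname{err}}(h)
  \;+\; \sqrt{\frac{\ln\lvert H\rvert + \ln(1/\delta)}{2m}}
```

Roughly speaking, the microchoice refinement charges a hypothesis only for the choices made along the path that produced it, replacing the global $\ln\lvert H\rvert$ term by a sum $\sum_{i=1}^{k} \ln\lvert C_i\rvert$ over the choice sets $C_1,\dots,C_k$ the algorithm actually traversed, which for the hypothesis it outputs can be far smaller than $\ln\lvert H\rvert$.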
Process-Oriented Estimation of Generalization Error
In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999
Cited by 13 (2 self)
Methods to avoid overfitting fall into two broad categories: data-oriented (using separate data for validation) and representation-oriented (penalizing complexity in the model). Both have limitations that are hard to overcome. We argue that fully adequate model evaluation is only possible if the search process by which models are obtained is also taken into account. To this end, we recently proposed a method for process-oriented evaluation (POE), and successfully applied it to rule induction [Domingos, 1998b]. However, for the sake of simplicity this treatment made a number of rather artificial assumptions. In this paper the assumptions are removed, and a simple formula for error estimation is obtained. Empirical trials show the new, better-founded form of POE to be as accurate as the previous one, while further reducing theory sizes.

1 Introduction. Overfitting avoidance is a central problem in machine learning. If a learner is sufficiently powerful, whatever repre...
Frankenstein classifiers: Some experiments on the Sisyphus data set
In Proceedings of IDDM-01, Workshop on Integration of Data Mining, Decision Support, and Meta-Learning, 2001
Cited by 8 (3 self)
We present some empirical results on the use of two methods for integrating different classifiers into a hybrid classifier that should perform better than each of its constituent classifiers. The main point of these methods is that instead of combining full classifiers, they combine pieces of them. One of the methods is based on ROC analysis; the other is based on augmenting the data with features derived from earlier learned classifiers. Experimental results are presented that suggest that these approaches can yield improvements over combining full classifiers.
Inducing Interpretable Voting Classifiers Without Trading Accuracy for Simplicity: Theoretical Results, Approximation Algorithms, and Experiments
JAIR, 2004
Cited by 5 (3 self)
Recent advances in the study of voting classification algorithms have brought empirical and theoretical results clearly showing the discrimination power of ensemble classifiers. It has been previously argued that the search for this classification power in the design of the algorithms has marginalized the need to obtain interpretable classifiers. Therefore, the question of whether one might have to dispense with interpretability in order to keep classification strength is being raised in a growing number of machine learning and data mining papers. The purpose of this paper is to study the problem both theoretically and empirically. First, we provide numerous results giving insight into the hardness of the simplicity–accuracy trade-off for voting classifiers. Then we provide an efficient "top-down and prune" induction heuristic, WIDC, mainly derived from recent results on the weak learning and boosting frameworks. It is to our knowledge the first attempt to build a voting classifier as a base formula using the weak learning framework (the one which was previously highly successful for decision tree induction), and not the strong learning framework (as usual for such classifiers with boosting-like approaches). While it uses a well-known induction scheme previously successful in other classes of concept representations, thus making it easy to implement and compare, WIDC also relies on recent or new results we give about particular cases of boosting known as partition boosting and ranking-loss boosting. Experimental results on thirty-one domains, most of which are readily available, tend to display the ability of WIDC to produce small, accurate, and interpretable decision committees.
Expected Error Analysis for Model Selection
International Conference on Machine Learning (ICML), 1999
Cited by 5 (1 self)
In order to select a good hypothesis language (or model) from a collection of possible models, one has to assess the generalization performance of the hypothesis which is returned by a learner that is bound to use some particular model. This paper deals with a new and very efficient way of assessing this generalization performance. We present a new analysis which characterizes the expected generalization error of the hypothesis with least training error in terms of the distribution of error rates of the hypotheses in the model. This distribution can be estimated very efficiently from the data, which immediately leads to an efficient model selection algorithm. The analysis predicts learning curves with a very high precision and thus contributes to a better understanding of why and when overfitting occurs. We present empirical studies (controlled experiments on Boolean decision trees and a large-scale text categorization problem) which show that the model selection algorithm leads to err...
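The core idea, predicting the true error of the least-training-error hypothesis from the distribution of error rates in the model class, can be illustrated with a toy Monte-Carlo sketch (the finite class, its error histogram, and all numbers below are invented for illustration; the paper itself derives a closed-form analysis rather than a simulation):

```python
import random

random.seed(0)

# Hypothetical finite model class, described only by the true error
# rates of its hypotheses (a crude "error histogram").
true_errors = [0.10] * 5 + [0.25] * 50 + [0.45] * 445   # 500 hypotheses
m = 50          # training-set size
trials = 2000   # Monte-Carlo repetitions

def erm_true_error():
    """Draw an empirical error for each hypothesis on a fresh sample of
    size m, then return the TRUE error of the hypothesis that achieved
    the least training error (ties keep the earlier hypothesis)."""
    best_emp, best_true = 2.0, None
    for e in true_errors:
        emp = sum(random.random() < e for _ in range(m)) / m
        if emp < best_emp:
            best_emp, best_true = emp, e
    return best_true

expected = sum(erm_true_error() for _ in range(trials)) / trials
print(f"best true error in class : {min(true_errors):.3f}")
print(f"expected error of ERM    : {expected:.3f}")
```

The expected error of the minimum-training-error pick typically exceeds the best true error in the class, because with many mediocre hypotheses one occasionally looks good on a small sample; characterizing this gap from the error histogram is what makes the analysis usable for model selection.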
Average-Case Analysis of Classification Algorithms for Boolean Functions and Decision Trees
2000
Cited by 2 (1 self)
We conduct an average-case analysis of the generalization error rate of classification algorithms with finite model classes. Unlike worst-case approaches, we do not rely on bounds that hold for all possible learning problems. Instead, we study the behavior of a learning algorithm for a given problem, taking properties of the problem and the learner into account. The solution depends only on known quantities (e.g., the sample size) and the histogram of error rates in the model class, which we determine for the case that the sought target is a randomly drawn Boolean function. We then discuss how the error histogram can be estimated from a given sample, and thus show how the analysis can be applied approximately in the more realistic scenario that the target is unknown. Experiments show that our analysis can predict the behavior of decision tree algorithms fairly accurately even if the error histogram is estimated from a sample.