Knowledge Discovery Through Induction with Randomization Testing (1991)

by David Jensen

Results 1 - 9 of 9

On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach

by Steven L. Salzberg, Usama Fayyad - Data Mining and Knowledge Discovery, 1997
Cited by 120 (0 self)
Abstract. An important component of many data mining projects is finding a good classification algorithm, a process that requires very careful thought about experimental design. If not done very carefully, comparative studies of classification and other types of algorithms can easily result in statistically invalid conclusions. This is especially true when one is using data mining techniques to analyze very large databases, which inevitably contain some statistically unlikely data. This paper describes several phenomena that can, if ignored, invalidate an experimental comparison. These phenomena and the conclusions that follow apply not only to classification, but to computational experiments in almost any aspect of data mining. The paper also discusses why comparative analysis is more important in evaluating some types of algorithms than for others, and provides some suggestions about how to avoid the pitfalls suffered by many experimental studies.

Multiple Comparisons in Induction Algorithms

by David Jensen, Paul R. Cohen - Machine Learning, 1998
Cited by 67 (9 self)
A single mechanism is responsible for three pathologies of induction algorithms: attribute selection errors, overfitting, and oversearching. In each pathology, induction algorithms compare multiple items based on scores from an evaluation function and select the item with the maximum score. We call this a multiple comparison procedure (MCP). We analyze the statistical properties of MCPs and show how failure to adjust for these properties leads to the pathologies. We also discuss approaches that can control pathological behavior, including Bonferroni adjustment, randomization testing, and cross-validation. Keywords: inductive learning, overfitting, oversearching, attribute selection, hypothesis testing, parameter estimation.
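The selection step described in this abstract (score many candidates, keep the maximum) can be illustrated with a small sketch. This is not the authors' code: the scoring function, the function names, and the default of 200 trials are assumptions chosen for illustration. The randomization test shuffles the labels and recomputes the maximum score each time, so the reference distribution already accounts for the number of candidates compared; the Bonferroni helper shows the simpler per-comparison threshold.

```python
import random


def score(attr, labels):
    """Hypothetical evaluation function: absolute difference in the
    positive-label rate between rows where attr == 1 and attr == 0."""
    on = [y for a, y in zip(attr, labels) if a == 1]
    off = [y for a, y in zip(attr, labels) if a == 0]
    if not on or not off:
        return 0.0
    return abs(sum(on) / len(on) - sum(off) / len(off))


def max_score(attrs, labels):
    """The multiple comparison step: score every candidate, keep the best."""
    return max(score(a, labels) for a in attrs)


def randomization_p_value(attrs, labels, trials=200, seed=0):
    """Estimate how often a maximum score at least as large as the observed
    one arises when labels are shuffled (no real attribute-label relation)."""
    rng = random.Random(seed)
    observed = max_score(attrs, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if max_score(attrs, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level under Bonferroni adjustment."""
    return alpha / n_comparisons
```

Because the maximum is recomputed on every shuffle, adding more candidate attributes raises the bar the observed maximum must clear, which is the adjustment an unadjusted single-attribute test lacks.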

Large Datasets Lead to Overly Complex Models: An Explanation and a Solution

by Tim Oates, David Jensen, 1998
Cited by 40 (4 self)
This paper explores unexpected results that lie at the intersection of two common themes in the KDD community: large datasets and the goal of building compact models. Experiments with many different datasets and several model construction algorithms (including tree learning algorithms such as c4.5 with three different pruning methods, and rule learning algorithms such as c4.5rules and ripper) show that increasing the amount of data used to build a model often results in a linear increase in model size, even when that additional complexity results in no significant increase in model accuracy. Despite the promise of better parameter estimation held out by large datasets, as a practical matter, models built with large amounts of data are often needlessly complex and cumbersome. In the case of decision trees, the cause of this pathology is identified as a bias inherent in several common pruning techniques. Pruning errors made low in the tree, where there is insufficient data to make accurate parameter estimates, are propagated and magnified higher in the tree, working against the accurate parameter estimates that are made possible there by abundant data. We propose a general solution to this problem based on a statistical technique known as randomization testing, and empirically evaluate its utility.

GA-MINER: Parallel Data Mining with Hierarchical Genetic Algorithms - Final Report

by Ian W. Flockhart, 1995
Cited by 16 (1 self)
Many organisations now routinely gather vast and ever-increasing amounts of data in the ordinary course of their business. While much of this information is collected for day-to-day operational reasons, many businesses are now realising that this data has much additional value for improving operational processes. Large databases can form the basis of decision support systems, often based around a data warehouse. Such systems may then be used for a variety of applications such as trend spotting, pattern recognition, behavioral modeling and customer worth assessment. Against this backdrop, the term data mining is used to refer to the process of searching through a large volume of data to discover interesting and useful information. The authors have traditionally sought to divide data mining into three types or levels---undirected or pure data mining, where the system is left almost entirely unconstrained to discover patterns in the data free of prejudices from the user; directed data mi...

Data Mining At The Interface Of Computer Science And Statistics

by Padhraic Smyth, 2001
Cited by 6 (0 self)
This chapter is written for computer scientists, engineers, mathematicians, and scientists who wish to gain a better understanding of the role of statistical thinking in modern data mining. Data mining has attracted considerable attention both in the research and commercial arenas in recent years, involving the application of a variety of techniques from both computer science and statistics. The chapter discusses how computer scientists and statisticians approach data from different but complementary viewpoints and highlights the fundamental differences between statistical and computational views of data mining. In doing so we review the historical importance of statistical contributions to machine learning and data mining, including neural networks, graphical models, and flexible predictive modeling. The primary conclusion is that closer integration of computational methods with statistical thinking is likely to become increasingly important in data mining applications. Keywords: Data mining, statistics, pattern recognition, transaction data, correlation.

KDD-93: Progress and Challenges in Knowledge Discovery in Databases

by Gregory Piatetsky-Shapiro, Christopher Matheus, Padhraic Smyth, Ramasamy Uthurusamy - Artificial Intelligence Magazine, 1994
Cited by 4 (0 self)
this article. We thank Peter Patel-Schneider for his editorial guidance and rapid processing of this article.

Tell Me Something I Don’t Know: Randomization Strategies for Iterative Data Mining

by Sami Hanhijärvi, Markus Ojala, Niko Vuokko, Kai Puolamäki, Nikolaj Tatti, Heikki Mannila
Cited by 2 (0 self)
There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
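The local-swap idea mentioned in this abstract can be sketched for the simplest case. The code below is an assumption-laden illustration, not the paper's method: it handles only a 0/1 matrix and only the constraint that row and column sums stay fixed, so every valid swap is accepted. The Metropolis sampler described in the abstract would additionally accept or reject each swap according to how well the other chosen statistics (for example cluster centers) are preserved. The function name and parameters are hypothetical.

```python
import random


def swap_randomize(matrix, attempts, seed=0):
    """Return a randomized copy of a 0/1 matrix with the same row and
    column sums, produced by repeated local swaps."""
    rng = random.Random(seed)
    m = [row[:] for row in matrix]  # work on a copy, leave the input intact
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(attempts):
        r1, r2 = rng.randrange(n_rows), rng.randrange(n_rows)
        c1, c2 = rng.randrange(n_cols), rng.randrange(n_cols)
        # A swap is possible only on a "checkerboard" pattern:
        #   m[r1][c1] = 1, m[r1][c2] = 0
        #   m[r2][c1] = 0, m[r2][c2] = 1
        if (m[r1][c1] == 1 and m[r2][c2] == 1
                and m[r1][c2] == 0 and m[r2][c1] == 0):
            # Flip the four cells; each row and column keeps its sum.
            m[r1][c1] = m[r2][c2] = 0
            m[r1][c2] = m[r2][c1] = 1
    return m
```

A statistic computed on many such randomized matrices gives a reference distribution for judging whether a pattern found in the original data is explained by the margins alone.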

Methodological Note On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach

by Steven L. Salzberg, Usama Fayyad, 1996
Cited by 1 (0 self)
Abstract. An important component of many data mining projects is finding a good classification algorithm, a process that requires very careful thought about experimental design. If not done very carefully, comparative studies of classification and other types of algorithms can easily result in statistically invalid conclusions. This is especially true when one is using data mining techniques to analyze very large databases, which inevitably contain some statistically unlikely data. This paper describes several phenomena that can, if ignored, invalidate an experimental comparison. These phenomena and the conclusions that follow apply not only to classification, but to computational experiments in almost any aspect of data mining. The paper also discusses why comparative analysis is more important in evaluating some types of algorithms than for others, and provides some suggestions about how to avoid the pitfalls suffered by many experimental studies.

THE SMALLEST SET OF CONSTRAINTS THAT EXPLAINS THE DATA: A RANDOMIZATION APPROACH

by Jefrey Lijffijt, Panagiotis Papapetrou, Niko Vuokko, Kai Puolamäki
Aalto University School of Science and Technology, Faculty of Information and Natural Sciences, Department of Information and Computer Science. Distribution: