## Automating Exploratory Data Analysis for Efficient Data Mining (2000)

Venue: KDD '00 (Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining)

Citations: 5 (2 self)

### BibTeX

@INPROCEEDINGS{Becher00automatingexploratory,
  author    = {Jonathan D. Becher and Pavel Berkhin and Edmund Freeman},
  title     = {Automating Exploratory Data Analysis for Efficient Data Mining},
  booktitle = {Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '00)},
  year      = {2000},
  pages     = {424--429},
  publisher = {ACM}
}

### Abstract

Having access to large data sets for the purpose of predictive data mining does not guarantee good models, even when the size of the training data is virtually unlimited. Instead, careful data preprocessing is required, including data cleansing, handling missing values, attribute representation and encoding, and generating derived attributes. In particular, the selection of the most appropriate subset of attributes to include is a critical step in building an accurate and efficient model. We describe an automated approach to the exploration, preprocessing, and selection of the optimal attribute subset whose goal is to simplify the KDD process and dramatically shorten the time to build a model. Our implementation finds inappropriate and suspicious attributes, performs target dependency analysis, determines optimal attribute encoding, generates new derived attributes, and provides a flexible approach to attribute selection. We present results generated by an industrial KDD environment called the Accrue Decision Series on several real-world Web data sets.

### Citations

8563 | Elements of Information Theory - Cover, Thomas - 1991

Citation Context: ...mutual information I(X, Y) is defined as I(X, Y) = Σ_jq P_jq log(P_jq / (P_j. P.q)), where we use base-two logarithms if the information units are bits. This measure is widely used in information theory [3] and machine learning [25]. It is sometimes referred to as information gain, due to the property that it equals the decrease in entropy H(Y) - H(Y | X) caused by knowing X, where H(X) = -Σ_q P.q ...
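The mutual-information formula quoted in this snippet can be checked with a short sketch. The function name and the joint-table representation below are illustrative, not code from the paper:

```python
import math

def mutual_information(joint):
    """Mutual information I(X, Y) in bits from a joint probability table.

    `joint[j][q]` holds P_jq; the marginals P_j. and P.q are derived
    from it, matching the formula I(X, Y) = sum_jq P_jq log2(P_jq / (P_j. P.q)).
    """
    pj = [sum(row) for row in joint]            # row marginals P_j.
    pq = [sum(col) for col in zip(*joint)]      # column marginals P.q
    mi = 0.0
    for j, row in enumerate(joint):
        for q, pjq in enumerate(row):
            if pjq > 0:                         # 0 * log 0 is taken as 0
                mi += pjq * math.log2(pjq / (pj[j] * pq[q]))
    return mi

# Independent variables carry no information; identical binary variables
# share exactly one bit.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))   # → 0.0
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))       # → 1.0
```

The base-two logarithm gives the result in bits, as the snippet notes; using the natural logarithm instead would give nats.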

4934 | C4.5: Programs for Machine Learning - Quinlan - 1993

Citation Context: ...e and descriptive engines, much of our focus has been on the efficient building of models using standard predictive techniques: neural networks [12, 11], classification and regression decision trees [2, 25], and Bayesian learning [7]. Due to the practical limitations of commercial mining, we have tried to achieve a balance between the time spent on data exploration and the gains we get in this process. ...

3909 | Classification and Regression Trees - Breiman, Friedman, et al. - 1984

Citation Context: ...e and descriptive engines, much of our focus has been on the efficient building of models using standard predictive techniques: neural networks [12, 11], classification and regression decision trees [2, 25], and Bayesian learning [7]. Due to the practical limitations of commercial mining, we have tried to achieve a balance between the time spent on data exploration and the gains we get in this process. ...

2014 | Principal Component Analysis - Jolliffe - 2002

1774 | Introduction to the Theory of Neural Computation - Hertz, Palmer - 1991

1240 | On information and sufficiency - Kullback, Leibler - 1951

1031 | Wrappers for feature subset selection - Kohavi, John - 1997

Citation Context: ...isualization environment for EDA is discussed in [4]. For attribute selection in unsupervised learning see [5]. Two models, filter and wrapper, exist for attribute selection and both are described in [14]. For an earlier work on attribute selection see [16]. The Markov Blanket attribute selection algorithm is a modification of the algorithm introduced in [18]. Inconsistency rate, utilized in iR attr...

881 | Exploratory Data Analysis - Tukey - 1977

Citation Context: ...s were obtained using a boosted naïve Bayesian classifier; a classification tree induction technique produced analogous results. 8 Related Work Data preprocessing is a standard practice in statistics [28], pattern recognition and data mining [26]. Generic data cleansing techniques are well described in [10]. Grouping of categorical values as it relates to tree induction techniques is discussed in [25...

653 | Multi-interval discretization of continuous-valued attributes for classification learning - Fayyad, Irani - 1993

Citation Context: ...butes are processed to determine the most appropriate representation. This step handles outliers, missing values, and encoding. Continuous attributes are encoded by thresholding (a.k.a. discretizing) [8] the original values into a small number of value ranges. For categorical attributes, encoding merges several values (categories) together. This grouping is similar to a subset option in C4.5 [25]. As...
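As a hypothetical illustration of the thresholding idea this snippet describes, the sketch below discretizes a continuous attribute into equal-frequency value ranges. This is one simple way to pick cut points; the paper's own optimization is more involved, and all names here are illustrative:

```python
def equal_frequency_cuts(values, n_bins):
    """Pick cut points so that each value range holds roughly the same
    number of cases (equal-frequency binning), a common baseline for
    thresholding a continuous attribute."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[k * n // n_bins] for k in range(1, n_bins)]
    # Repeated values can produce duplicate cut points; drop them.
    return sorted(set(cuts))

def discretize(value, cuts):
    """Map a continuous value to the index of its thresholding interval."""
    for i, c in enumerate(cuts):
        if value < c:
            return i
    return len(cuts)

cuts = equal_frequency_cuts(range(100), 4)
print(cuts)                   # → [25, 50, 75]
print(discretize(60, cuts))   # → 2
```

A lower bound on cases per interval, as the paper imposes, would simply reject cut sets whose smallest bin falls below the bound.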

362 | Towards optimal feature selection - Koller, Sahami - 1996

Citation Context: ...e and multivariate transformations. When all original and new derived attributes are cleansed, confirmed to be appropriate, and discretized, EDA uses two independent approaches to attribute selection [18, 13], both of which are based on filter model selection [14]. Using two algorithms provides additional flexibility and increases our confidence in the results. 3 Inappropriate and Suspicious Attributes...

258 | The feature selection problem: Traditional methods and a new algorithm - Kira, Rendell - 1992

109 | A probabilistic approach to feature selection - a filter solution - Liu, Setiono - 1996

Citation Context: ...e and multivariate transformations. When all original and new derived attributes are cleansed, confirmed to be appropriate, and discretized, EDA uses two independent approaches to attribute selection [18, 13], both of which are based on filter model selection [14]. Using two algorithms provides additional flexibility and increases our confidence in the results. 3 Inappropriate and Suspicious Attributes...

97 | The Annealing Algorithm - Otten, Ginneken - 1989

Citation Context: ...mber of groups or thresholds and their location. For a continuous attribute and a fixed number of thresholding intervals, the corresponding cut points are optimized by means of an annealing algorithm [23]. In practice, we also impose a lower bound on the number of cases per thresholding interval to ensure that the ranges are relevant. For a categorical attribute and a fixed number of groups (less than...
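The cut-point optimization this snippet mentions can be sketched as a toy simulated-annealing search. Everything below is an assumption for illustration: the paper does not publish its objective function, so this sketch scores a cut set by a simple target-purity measure on a binary target rather than by the paper's actual criterion:

```python
import random

def anneal_cuts(values, targets, n_cuts, steps=2000, seed=0):
    """Toy annealing search for thresholding cut points: repeatedly perturb
    a candidate cut set and keep changes that improve a purity score.
    The scoring choice and all names are illustrative."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)

    def score(cuts):
        # Fraction of cases whose interval's majority class matches them.
        bins = {}
        for v, t in zip(values, targets):
            b = sum(v >= c for c in cuts)      # interval index of v
            bins.setdefault(b, []).append(t)
        total = 0.0
        for ts in bins.values():
            p = sum(ts) / len(ts)
            total += len(ts) * max(p, 1 - p)   # majority-class fraction
        return total / len(values)

    cuts = sorted(rng.uniform(lo, hi) for _ in range(n_cuts))
    cur_s = score(cuts)
    best, best_s = cuts, cur_s
    temp = 1.0
    for _ in range(steps):
        cand = sorted(c + rng.gauss(0, (hi - lo) * 0.05) for c in cuts)
        s = score(cand)
        # Accept improvements, and occasionally worse moves while "hot".
        if s > cur_s or rng.random() < temp * 0.05:
            cuts, cur_s = cand, s
            if s > best_s:
                best, best_s = cand, s
        temp *= 0.999
    return best, best_s

# Target flips at value 5, so a good single cut should land near 5.
vals = [i / 2 for i in range(20)]
tgts = [1 if v >= 5 else 0 for v in vals]
cuts, s = anneal_cuts(vals, tgts, n_cuts=1)
print(cuts, s)   # purity of the best cut set found, between 0.5 and 1.0
```

The lower bound on cases per interval that the paper mentions could be folded in by rejecting candidate cut sets whose smallest bin is too small.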

75 | Pattern Classification: A Unified View of Statistical and Neural Approaches - Schurmann - 1996

Citation Context: ...rget beyond a specified threshold, that transformation is retained for further processing. EDA relies on the fact that the concept of correlation can be generalized to a continuous-categorical couple [26] so that these transforms can be used regardless of whether the target is continuous or categorical. EDA also supports exploring functions of several continuous attributes, including linear combinatio...

64 | Discretizing Continuous Attributes While Learning Bayesian Networks - Friedman, Goldszmidt - 1996

Citation Context: ...chniques are well described in [10]. Grouping of categorical values as it relates to tree induction techniques is discussed in [25], while thresholding of continuous variables is discussed in [8]. In [9], information-based thresholding of continuous attributes is augmented by the use of the minimum description length principle. A comprehensive introduction to information theory is contained in [3]. Fo...

30 | Signal Detection Theory and ROC - Egan - 1975

Citation Context: ...e measures was significant, we ran multiple models with different training and verification sets. We calculated from formulas that the top 5% lift had a standard deviation of 0.11 and the ROC metrics [20, 6] had a standard deviation of 0.0037. Second, these results were obtained using a boosted naïve Bayesian classifier; a classification tree induction technique produced analogous results. 8 Related Work...

25 | An Interactive Visualization Environment for Data Exploration - Derthick, Kolojejchick, et al. - 1997

Citation Context: ...iterative, interactive endeavor is advocated in [27]. While we agree with this philosophy in some aspects, our primary focus is on an automatic process. A visualization environment for EDA is discussed in [4]. For attribute selection in unsupervised learning see [5]. Two models, filter and wrapper, exist for attribute selection and both are described in [14]. For an earlier work on attribute selection see...

13 | Anytime exploratory data analysis for massive data sets - Smyth, Wolpert - 1997

Citation Context: ...near statistical modeling, many similar approaches have been used; most notably, principal component analysis [15]. The idea that EDA is inherently an iterative, interactive endeavor is advocated in [27]. While we agree with this philosophy in some aspects, our primary focus is on an automatic process. A visualization environment for EDA is discussed in [4]. For attribute selection in unsupervised learning...

11 | Data mining for direct marketing - Ling, Li - 1998

Citation Context: ...e measures was significant, we ran multiple models with different training and verification sets. We calculated from formulas that the top 5% lift had a standard deviation of 0.11 and the ROC metrics [20, 6] had a standard deviation of 0.0037. Second, these results were obtained using a boosted naïve Bayesian classifier; a classification tree induction technique produced analogous results. 8 Related Work...

10 | Feature engineering and classifier selection: A case study in Venusian volcano detection - Asker, Maclin - 1997

Citation Context: ...second association measure based on the chi-squared statistic χ²(X, Y) = N Σ_jq (P_jq - P_j. P.q)² / (P_j. P.q), where N is the total number of cases. A well-known normalization of the chi-squared statistic scaled to [0, 1], which represents the strength of association, is Cramér's V [24]: V(X, Y) = √(χ²(X, Y) / (N (min(Q, J) - 1))). The third measure of association used in EDA is the Goodman-Kruskal association index. Consider a ...
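The chi-squared statistic and Cramér's V quoted in this snippet can be computed from a contingency table of counts. The sketch below is illustrative (function name and table layout are assumptions), and it follows the standard definition of Cramér's V with the square root:

```python
import math

def chi2_and_cramers_v(table):
    """Chi-squared statistic and Cramér's V for a J×Q contingency table
    of raw counts. With P_jq = count/N, this matches the quoted formula
    χ² = N Σ (P_jq - P_j. P.q)² / (P_j. P.q), and
    V = sqrt(χ² / (N (min(Q, J) - 1))), which lies in [0, 1]."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]            # N * P_j.
    col_tot = [sum(col) for col in zip(*table)]      # N * P.q
    chi2 = 0.0
    for j, row in enumerate(table):
        for q, obs in enumerate(row):
            exp = row_tot[j] * col_tot[q] / n        # expected count
            chi2 += (obs - exp) ** 2 / exp
    v = math.sqrt(chi2 / (n * (min(len(row_tot), len(col_tot)) - 1)))
    return chi2, v

# Perfect association on a 2×2 table: χ² equals N and V equals 1.
chi2, v = chi2_and_cramers_v([[50, 0], [0, 50]])
print(chi2, v)   # → 100.0 1.0
```

Note the sketch assumes every row and column has at least one case, so that no expected count is zero.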

8 | Efficient Feature Selection - Devaney, Ram - 1997

Citation Context: ...le we agree with this philosophy in some aspects, our primary focus is on an automatic process. A visualization environment for EDA is discussed in [4]. For attribute selection in unsupervised learning see [5]. Two models, filter and wrapper, exist for attribute selection and both are described in [14]. For an earlier work on attribute selection see [16]. The Markov Blanket attribute selection algorithm i...

1 | Boosting and Naïve Bayesian Learning - Elkan - 1997

Citation Context: ...of our focus has been on the efficient building of models using standard predictive techniques: neural networks [12, 11], classification and regression decision trees [2, 25], and Bayesian learning [7]. Due to the practical limitations of commercial mining, we have tried to achieve a balance between the time spent on data exploration and the gains we get in this process. In this paper, we assume th...

1 | Neural Networks: A Comprehensive Foundation - Haykin - 1999

1 | Wrappers for Feature Subset Selection - Kohavi, John - 1997

1 | On Feature Selection: Learning with Exponentially Many Irrelevant Features as Training Examples - Ng - 1998