Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey
Data Mining and Knowledge Discovery, 1997
Cited by 121 (1 self)
Decision trees have proved to be valuable tools for the description, classification and generalization of data. Work on constructing decision trees from data exists in multiple disciplines such as statistics, pattern recognition, decision theory, signal processing, machine learning and artificial neural networks. Researchers in these disciplines, sometimes working on quite different problems, identified similar issues and heuristics for decision tree construction. This paper surveys existing work on decision tree construction, attempting to identify the important issues involved, directions the work has taken and the current state of the art.
Keywords: classification, tree-structured classifiers, data compaction
1. Introduction
Advances in data collection methods, storage and processing technology are providing a unique challenge and opportunity for automated data exploration techniques. Enormous amounts of data are being collected daily from major scientific projects e.g., Human Genome...
Multiple Comparisons in Induction Algorithms
Machine Learning, 1998
Cited by 67 (9 self)
David Jensen and Paul R. Cohen, Experimental Knowledge Systems Laboratory, Department of Computer Science, Box 34610 LGRC, University of Massachusetts, Amherst, MA 01003-4610, 413-545-3613
A single mechanism is responsible for three pathologies of induction algorithms: attribute selection errors, overfitting, and oversearching. In each pathology, induction algorithms compare multiple items based on scores from an evaluation function and select the item with the maximum score. We call this a multiple comparison procedure (MCP). We analyze the statistical properties of MCPs and show how failure to adjust for these properties leads to the pathologies. We also discuss approaches that can control pathological behavior, including Bonferroni adjustment, randomization testing, and cross-validation.
Keywords: Inductive learning, overfitting, oversearching, attribute selection, hypothesis testing, parameter estimation
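The selection effect described in this abstract can be illustrated with a small simulation. This is a minimal sketch and not the authors' experiment: the function names, the Gaussian noise model, and the sample sizes are all assumptions made for illustration. Every candidate item is equally useless (true score 0), yet the maximum observed score grows with the number of items compared. The last function shows the standard Bonferroni adjustment, which divides the significance level by the number of comparisons.

```python
import random
import statistics


def max_score_of_noise(n_items, n_samples, rng):
    """Score n_items pure-noise items and return the best observed mean score."""
    best = float("-inf")
    for _ in range(n_items):
        # Each item's score is the mean of n_samples draws with true mean 0.
        score = statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n_samples))
        best = max(best, score)
    return best


def average_max_score(n_items, n_samples=30, trials=200, seed=0):
    """Average, over repeated trials, of the maximum score among n_items."""
    rng = random.Random(seed)
    return statistics.fmean(
        max_score_of_noise(n_items, n_samples, rng) for _ in range(trials)
    )


def bonferroni_alpha(alpha, n_items):
    """Per-comparison significance level under Bonferroni adjustment."""
    return alpha / n_items


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(average_max_score(n), 3), bonferroni_alpha(0.05, n))
```

With a single item the average score stays near zero. With more items the selected maximum drifts upward although no item is better than any other, which is why an unadjusted comparison overstates the quality of the winner.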
Fast Sequential and Parallel Algorithms for Association Rule Mining: A Comparison
1995
Cited by 61 (0 self)
The field of knowledge discovery in databases, or "Data Mining", has received increasing attention during recent years as large organizations have begun to realize the potential value of the information that is stored implicitly in their databases. One specific data mining task is the mining of Association Rules, particularly from retail data. The task is to determine patterns (or rules) that characterize the shopping behavior of customers from a large database of previous consumer transactions. The rules can then be used to focus marketing efforts such as product placement and sales promotions. Because early algorithms required an unpredictably large number of I/O operations, reducing I/O cost has been the primary target of the algorithms presented in the literature. One of the most recently proposed algorithms, called PARTITION, uses a new TID-list data representation and a new partitioning technique. The partitioning technique reduces I/O cost to a constant amount by processing one datab...
Techniques for Dealing with Missing Values in Classification
1997
Cited by 19 (0 self)
A brief overview of the history of the development of decision tree induction algorithms is followed by a review of techniques for dealing with missing attribute values in the operation of these methods. The technique of dynamic path generation is described in the context of tree-based classification methods. The waste of data which can result from casewise deletion of missing values in statistical algorithms is discussed and alternatives proposed.
Keywords: Missing values, Dynamic path generation, Intelligent data analysis, Inductive learning, Knowledge discovery, Data mining, Machine learning.
1 Introduction
In the information age, data is generated almost everywhere: satellites orbiting the moons of Jupiter; submarines in the deepest ocean trench; even electronic point of sale machines in the high street produce data. All of these systems generate millions of megabytes of data every day. Some of these data contain information that could lead to important discoveries in science; s...
Integrating Induction and Case-Based Reasoning: Methodological Approach and First Evaluations
Proc. 17th Conference of the GfKl, 1994
Cited by 12 (1 self)
We propose in this paper a general framework for integrating inductive and case-based reasoning (CBR) techniques for diagnosis tasks. We present a set of practical integrated approaches realised between the KATE-Induction decision tree builder and the PATDEX case-based reasoning system. The integration is based on a deep understanding of the weak and strong points of each technology. This theoretical knowledge makes it possible to specify the structural possibilities of a sound integration between the relevant components of each approach. We define different levels of integration called "cooperative", "workbench" and "seamless". They realise respectively a tight, medium and strong link between both techniques. Experimental results show the appropriateness of these integrated approaches for the treatment of noisy or unknown data.
Constructing New Attributes for Decision Tree Learning
1996
Cited by 7 (3 self)
A well-known fundamental limitation of selective induction algorithms is that when task-supplied attributes are not adequate for, or directly relevant to, describing hypotheses, their performance in terms of prediction accuracy and/or theory complexity is poor. One solution to this problem is constructive induction. It constructs, by using task-supplied attributes, new attributes that are expected to be more appropriate than the task-supplied attributes for describing the target concepts. This thesis focuses on constructive induction with decision trees as the theory description language. It explores: (1) novel approaches to constructing new binary attributes using existing constructive operators, and (2) novel methods of constructing new nominal and new continuous-valued attributes based on a newly proposed constructive operator. The thesis investigates a fixed rule-based approach to constructing new binary attributes for decision tree learning. It generates conjunctions from producti...
Machine Learning Techniques for Civil Engineering Problems
1997
Cited by 6 (3 self)
The growing volume of information databases presents opportunities for advanced data analysis techniques from machine learning (ML) research. Practical applications of ML are very different from theoretical or empirical studies, involving organizational and human aspects, and various other constraints. Despite the importance of applied ML, little has been discussed in the general ML literature on this topic. In order to remedy this situation, we studied practical applications of ML and developed a proposal for a seven-step process that can guide practical applications of ML in engineering. The process is illustrated by relevant applications of ML in civil engineering. This illustration shows that the potential of ML has only begun to be explored, but also cautions that in order to be successful, the application process must carefully address the issues related to the seven-step process.
1 Introduction
Over the last several decades we have witnessed an explosion in information generat...
A note on split selection bias in classification trees
Computational Statistics and Data Analysis, 2004
Cited by 4 (0 self)
A common approach to split selection in classification trees is to search through all possible splits generated by predictor variables. A splitting criterion is then used to evaluate those splits and the one with the largest criterion value is usually chosen to actually channel samples into corresponding subnodes. However, this greedy method is biased in variable selection when the numbers of the available split points for each variable are different. Such a result may thus hamper the intuitively appealing nature of classification trees. The problem of the split selection bias for two-class tasks with numerical predictors is examined. The statistical explanation of its existence is given and a solution based on the P-values is provided, when the Pearson chi-square statistic is used as the splitting criterion.
Keywords: Cramér's V² statistic; Kolmogorov-Smirnov statistic; P-value; Pearson chi-square statistic
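The bias described in this abstract can be reproduced with a small simulation. This is a minimal sketch under stated assumptions and not the paper's procedure: the function names, sample sizes, and data-generating setup are hypothetical. Two predictors are generated independently of a two-class label. One offers a single split point and the other offers many. Choosing the split with the largest Pearson chi-square value then favors the many-valued predictor, even though neither carries any information. The P-value correction mentioned in the abstract is not implemented here.

```python
import random


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def best_split_statistic(x, y):
    """Largest chi-square value over all splits of the form x <= cut."""
    best = 0.0
    for cut in sorted(set(x))[:-1]:  # the largest value yields no split
        a = sum(1 for xi, yi in zip(x, y) if xi <= cut and yi == 1)
        b = sum(1 for xi, yi in zip(x, y) if xi <= cut and yi == 0)
        c = sum(1 for xi, yi in zip(x, y) if xi > cut and yi == 1)
        d = sum(1 for xi, yi in zip(x, y) if xi > cut and yi == 0)
        best = max(best, chi_square_2x2(a, b, c, d))
    return best


def fraction_many_valued_wins(trials=200, n=60, seed=0):
    """How often the many-valued noise predictor beats the two-valued one."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        y = [rng.randint(0, 1) for _ in range(n)]    # two-class label
        few = [rng.randint(0, 1) for _ in range(n)]  # at most one split point
        many = [rng.random() for _ in range(n)]      # up to n - 1 split points
        if best_split_statistic(many, y) > best_split_statistic(few, y):
            wins += 1
    return wins / trials


if __name__ == "__main__":
    print(fraction_many_valued_wins())
```

An unbiased selection rule would pick each uninformative predictor about half of the time, so a fraction well above 0.5 indicates the bias toward variables with more available split points.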
Conservation of Generalization: A Case Study
1995
Cited by 2 (0 self)
We present results of a study of the applicability of information loss as an attribute selection criterion in decision tree induction. These results illustrate basic, though sometimes counter-intuitive, consequences of the conservation law for generalization performance and also suggest new avenues for research.
1 Introduction
In many branches of mathematics, the construction and study of counterexamples is well accepted as a productive mode for research. The premise of the work reported here is that a similar tack may be useful in our study of inductive concept learning. An essential property of the problem of inductive generalization is that it admits no general solution. An algorithm that is good for learning certain sets of concepts must necessarily be bad for learning others. Moreover, no algorithm strictly dominates any other. If two learners differ in generalization performance, there must be problems for which each is superior to the other. As a consequence, every algorith...
Reducing The Number Of Binary Splits In Decision Tree Induction, . . .
The main problem considered in this paper consists of binarizing categorical (nominal) attributes having a very large number of values (20 4 in our application). A small number of relevant binary attributes are gathered from each initial attribute. The significant idea consists in grouping the values of an attribute by means of a hierarchical classification method. The similarity between values is associated with the classes to be predicted. The solution that we propose is independent of the number of these classes and can be applied to various situations. A specific use of the obtained classification tree reduces very significantly the number of binary splits of the attribute value set that have to be retained. In fact and for complexity reasons, the hierarchical classification method is combined with formal decomposition and recomposition of the attribute value set. The ARCADE method that we have set up is mainly a powerful hybridization of the celebrated CART method, by our above out...

