Results 1–10 of 21
Efficient Progressive Sampling
, 1999
Cited by 113 (10 self)
Having access to massive amounts of data does not necessarily imply that induction algorithms must use them all. Samples often provide the same accuracy with far less computational cost. However, the correct sample size is rarely obvious. We analyze methods for progressive sampling: starting with small samples and progressively increasing them as long as model accuracy improves. We show that a simple, geometric sampling schedule is efficient in an asymptotic sense. We then explore the notion of optimal efficiency: what is the absolute best sampling schedule? We describe the issues involved in instantiating an "optimally efficient" progressive sampler. Finally, we provide empirical results comparing a variety of progressive sampling methods. We conclude that progressive sampling often is preferable to analyzing all data instances.
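The schedule the abstract describes is easy to sketch. In the snippet below, the geometric schedule (n, 2n, 4n, ...), the starting size, and the plateau tolerance are illustrative assumptions, and `accuracy` is a stand-in for training and evaluating a real induction algorithm on an n-instance sample; this is a sketch of the sampling schedule, not the paper's implementation.

```python
# Sketch of geometric progressive sampling: grow the sample geometrically
# and stop once accuracy stops improving by more than a tolerance.

def progressive_sample(accuracy, n0=100, factor=2, n_max=100_000, tol=1e-3):
    """Return the (sample_size, accuracy) pairs visited by the schedule."""
    schedule = []
    n = n0
    prev_acc = -1.0
    while n <= n_max:
        acc = accuracy(n)
        schedule.append((n, acc))
        if acc - prev_acc < tol:       # accuracy has plateaued: stop early
            break
        prev_acc = acc
        n *= factor                    # geometric schedule: n0, 2*n0, 4*n0, ...
    return schedule

# Toy learning curve (an assumption): accuracy saturates at 0.95 as n grows.
def curve(n):
    return 0.95 - 8.0 / (n + 10)

steps = progressive_sample(curve)
print(steps[-1][0])                    # stopped well before n_max = 100_000
```

Because accuracy gains shrink as n grows, the loop halts long before exhausting the data, which is the paper's central point.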
A Survey of Methods for Scaling Up Inductive Algorithms
 Data Mining and Knowledge Discovery
, 1999
Cited by 107 (11 self)
One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. This paper summarizes, categorizes, and compares existing work on scaling up inductive algorithms. We concentrate on algorithms that build decision trees and rule sets, in order to provide focus and specific details; the issues and techniques generalize to other types of data mining. We begin with a discussion of important issues related to scaling up. We highlight similarities among scaling techniques by categorizing them into three main approaches. For each approach, we then describe, compare, and contrast the different constituent techniques, drawing on specific examples from published papers. Finally, we use the preceding analysis to suggest how to proceed when dealing with a large problem, and where to focus future research.
Keywords: scaling up, inductive learning, decision trees, rule learning
Distributed Data Mining: Scaling up and beyond
 In Advances in Distributed and Parallel Knowledge Discovery
, 1999
Cited by 21 (0 self)
In this chapter I begin by discussing Distributed Data Mining (DDM) for scaling up, beginning by asking what scaling up means, questioning whether it is necessary, and then presenting a brief survey of what has been done to date. I then provide motivation beyond scaling up, arguing that DDM is a more natural way to view data mining generally. DDM eliminates many difficulties encountered when coalescing already-distributed data for monolithic data mining, such as those associated with heterogeneity of data and with privacy restrictions. By viewing data mining as inherently distributed, important open research issues come into focus, issues that currently are obscured by the lack of explicit treatment of the process of producing monolithic data sets. I close with a discussion of the necessity of DDM for an efficient process of knowledge discovery.
Modelling Classification Performance for Large Data Sets: An Empirical Study
, 2001
Cited by 7 (0 self)
For many learning algorithms, learning accuracy increases as the size of the training data increases, forming the well-known learning curve. Usually a learning curve can be fitted by interpolating or extrapolating some points on it with a specified model. The obtained learning curve can then be used to predict the maximum achievable learning accuracy, or to estimate the amount of data needed to achieve an expected learning accuracy, both of which are especially meaningful for data mining on large data sets. Although some models have been proposed for learning curves, most have not been tested for applicability to large data sets. In this paper, we focus on this issue. We empirically compare six potentially useful models by fitting learning curves of two typical classification algorithms, C4.5 (decision tree) and LOG (logistic discrimination), on eight large UCI benchmark data sets. By using all available data for learning, we fit a full-length learning curve; by using a small portion of the data, we fit a part-length learning curve. The models are then compared on two measures: (1) how well they fit a full-length learning curve, and (2) how well a fitted part-length learning curve predicts learning accuracy at the full length. Experimental results show that the power law (y = a - b*x^(-c)) is the best of the six models on both measures, for both algorithms and all the data sets. These results support the applicability of learning curves to data mining.
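The winning model above, y = a - b*x^(-c), can be fitted without a curve-fitting library by grid-searching the exponent c and solving (a, b) in closed form by least squares, since y is linear in u = x^(-c). The grid range and the synthetic data points below are assumptions for illustration; in practice something like scipy.optimize.curve_fit would be used instead.

```python
# Dependency-free sketch of fitting the power-law learning curve y = a - b*x**(-c).

def fit_power_law(xs, ys):
    """Grid-search c; for each c, solve (a, b) by ordinary least squares."""
    best = None
    for c in [i / 100 for i in range(1, 201)]:          # candidate exponents
        us = [x ** (-c) for x in xs]                    # y = a - b*u is linear in u
        n = len(xs)
        su, sy = sum(us), sum(ys)
        suu = sum(u * u for u in us)
        suy = sum(u * y for u, y in zip(us, ys))
        m = (n * suy - su * sy) / (n * suu - su * su)   # slope of y on u equals -b
        b = -m
        a = (sy + b * su) / n                           # intercept
        sse = sum((a - b * u - y) ** 2 for u, y in zip(us, ys))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]

# Synthetic learning curve (an assumption): accuracy = 0.9 - 0.5 * n**(-0.5)
xs = [100, 400, 1600, 6400, 25600]
ys = [0.9 - 0.5 * x ** -0.5 for x in xs]
a, b, c = fit_power_law(xs, ys)
print(a, b, c)                          # recovers the generating coefficients
```

The fitted asymptote `a` is the predicted maximum achievable accuracy, and extrapolating the fitted curve beyond the sampled sizes is exactly the part-length prediction task the abstract evaluates.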
Computational models of surprise in evaluating creative design
 in Proceedings of the International Conference on Computational Creativity (ICCC)
Cited by 3 (0 self)
In this paper we consider how to evaluate whether a design or other artifact is creative. Creativity and its evaluation have been studied as a social process, a creative arts practice, and as a design process with guidelines for people to judge creativity. However, there are few approaches that seek to evaluate creativity computationally. In prior work we presented novelty, value, and surprise as a set of necessary conditions when identifying creative designs. In this paper we focus on the least studied of these, surprise. Surprise occurs when expectations are violated, suggesting that there is a temporal component when evaluating how surprising an artifact is. This paper presents an approach to quantifying surprise by projecting into the future. We illustrate this approach ...
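As a hedged illustration of surprise as expectation violation, and not the authors' actual model, one can project a simple linear trend over past attribute values and score how strongly a new artifact deviates from the projection. The attribute series and the residual-normalized deviation score below are assumptions.

```python
# Sketch: surprise as deviation from an expectation projected into the future.

def surprise(history, new_value):
    """Fit a linear trend to `history`, project one step ahead, and score
    the new value's deviation relative to the trend's residual spread."""
    n = len(history)
    xs = range(n)
    mx, my = sum(xs) / n, sum(history) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, history))
             / sum((x - mx) ** 2 for x in xs))
    expected = my + slope * (n - mx)               # projection one step ahead
    residuals = [y - (my + slope * (x - mx)) for x, y in zip(xs, history)]
    spread = max((sum(r * r for r in residuals) / n) ** 0.5, 1e-9)
    return abs(new_value - expected) / spread      # large => expectation violated

trend = [10, 13, 14, 15, 18]        # past values of some design attribute (assumed)
print(surprise(trend, 20))          # roughly continues the trend: low surprise
print(surprise(trend, 40))          # jumps far off the trend: much higher surprise
```

The temporal component the abstract mentions shows up here directly: surprise is defined only relative to a forecast built from what came before.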
The Association between the Disclosure and the Realization of Information Security Risk Factors
, 2009
Cited by 2 (0 self)
Firms often disclose information security risk factors in public filings such as 10-K reports. The internal information associated with disclosures may be positive or negative. In this paper, we are interested in evaluating how the nature of the security risk factors disclosed, which is believed to represent the internal information regarding information security, is associated with future breach announcements. For this purpose, we build a decision tree model, which classifies the occurrence of future security breaches based on the textual contents of the disclosed security risk factors. The model is able to accurately associate disclosure characteristics with breach announcements about 77% of the time. We further explore the contents of the security risk factors using text mining techniques to provide a richer interpretation of the results. The results show that security risk factors with action-oriented terms and phrases are less likely to be related to future incidents. We also conduct a cross-sectional analysis to study how the market interprets the nature of information security risk factors in annual reports at different points in time. We find that the market reaction following the security breach announcement is different ...
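The classification idea above can be sketched in miniature. The action-oriented term list, the toy filings, and the single-feature decision stump below are all assumptions standing in for the paper's full decision-tree model and its actual textual features.

```python
# Toy sketch: score a disclosure by its count of action-oriented terms, then
# learn the score threshold that best separates breach from no-breach filings.

ACTION_TERMS = {"implement", "monitor", "encrypt", "train", "audit"}  # assumed list

def action_score(text):
    return sum(w.strip(".,") in ACTION_TERMS for w in text.lower().split())

def fit_stump(filings):
    """filings: list of (text, breached). Predict a breach when the
    action score falls below a threshold; pick the most accurate threshold."""
    best = (0, -1.0)                                   # (threshold, accuracy)
    for t in range(0, 6):
        correct = sum((action_score(txt) < t) == breached
                      for txt, breached in filings)
        acc = correct / len(filings)
        if acc > best[1]:
            best = (t, acc)
    return best

filings = [                                            # toy labeled data (assumed)
    ("we implement and monitor controls and audit vendors", False),
    ("we train staff and audit systems regularly", False),
    ("attacks may disrupt operations and harm our reputation", True),
    ("security incidents could materially affect results", True),
]
threshold, acc = fit_stump(filings)
print(threshold, acc)
```

On this toy data, filings with action-oriented language end up on the no-breach side of the learned threshold, mirroring the paper's finding in caricature.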
PMCRI: A Parallel Modular Classification Rule Induction Framework
 in Machine Learning and Data Mining in Pattern Recognition
, 2009
Cited by 2 (0 self)
In a world where massive amounts of data are recorded on a large scale, we need data mining technologies to gain knowledge from the data in a reasonable time. The Top Down Induction of Decision Trees (TDIDT) algorithm is a very widely used technology to predict the classification of newly recorded data. However, alternative technologies have been derived that often produce better rules but do not scale well to large datasets. One such alternative to TDIDT is the PrismTCS algorithm. PrismTCS performs particularly well on noisy data but does not scale well to large datasets. In this paper we introduce Prism and investigate its scaling behaviour. We describe how we improved the scalability of the serial version of Prism and investigate its limitations. We then describe our work to overcome these limitations by developing a framework to parallelise algorithms of the Prism family and similar algorithms. We also present the scale-up results of a first prototype implementation.
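For context, the Prism family mentioned above follows a covering strategy: for a target class, greedily add the attribute-value test with the highest precision until the covered subset is pure, then remove the covered instances and repeat. A minimal serial sketch follows; the toy weather-style dataset is an assumption, and real Prism variants (PrismTCS, PMCRI) add ordering, pruning, and parallelism not shown here.

```python
# Minimal serial sketch of Prism-style covering rule induction.

def learn_rules(data, target):
    """data: list of (attribute_dict, class_label). Returns a list of rules,
    each a dict of attribute-value conditions, covering the target class."""
    rules, remaining = [], list(data)
    while any(label == target for _, label in remaining):
        rule, covered = {}, remaining
        while any(label != target for _, label in covered):   # not yet pure
            best, best_prec = None, -1.0
            for attrs, _ in covered:                          # candidate tests
                for a, v in attrs.items():
                    if a in rule:
                        continue
                    sub = [(x, y) for x, y in covered if x[a] == v]
                    prec = sum(y == target for _, y in sub) / len(sub)
                    if prec > best_prec:
                        best, best_prec = (a, v), prec
            rule[best[0]] = best[1]                           # specialize the rule
            covered = [(x, y) for x, y in covered if x[best[0]] == best[1]]
        rules.append(dict(rule))
        remaining = [(x, y) for x, y in remaining             # drop covered rows
                     if not all(x[a] == v for a, v in rule.items())]
    return rules

data = [                                                      # toy dataset (assumed)
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rain", "windy": "no"}, "stay"),
    ({"outlook": "overcast", "windy": "yes"}, "play"),
]
rules = learn_rules(data, "play")
print(rules)
```

Each rule is modular, it tests only the attributes it needs, which is the qualitative advantage over TDIDT trees that the Prism papers emphasize.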
P-Prism: A Computationally Efficient Approach to Scaling up Classification Rule Induction
 in IFIP International Conference on Artificial Intelligence, 2008
Cited by 2 (1 self)
... used method of constructing a model from a dataset in the form of classification rules to classify previously unseen data. Alternative algorithms have been developed, such as the Prism algorithm. Prism constructs modular rules which are qualitatively better than the rules induced by TDIDT. However, along with the increasing size of databases, many existing rule learning algorithms have proved to be computationally expensive on large datasets. To tackle the problem of scalability, parallel classification rule induction algorithms have been introduced. As TDIDT is the most popular classifier, even though there are strongly competitive alternative algorithms, most parallel approaches to inducing classification rules are based on TDIDT. In this paper we describe work on a distributed classifier that induces classification rules in a parallel manner based on Prism.
SAMPLE SIZE AND MODELING ACCURACY WITH DECISION-TREE BASED DATA MINING TOOLS
Cited by 2 (0 self)
Given the cost associated with modeling very large datasets and the overfitting issues of decision-tree-based models, sample-based models are an attractive alternative, provided that the sample-based models have a predictive accuracy approximating that of models based on all available data. This paper presents results for sets of decision-tree models generated across progressively larger sample sizes. The models were applied to two sets of actual client data using each of six prominent commercial data mining tools. The results suggest that model accuracy improves at a decreasing rate with increasing sample size. When a power curve was fitted to accuracy estimates across various sample sizes, more than 80 percent of the time accuracy within 0.5 percent of the expected terminal accuracy (that of a theoretically infinite sample) was achieved by the time the sample size reached 10,000 records. Based on these results, fitting a power curve to progressive samples and using it to establish an appropriate sample size appears to be a promising mechanism to support sample-based modeling for a large dataset.
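The stopping rule above has a closed form once the power curve is fitted. With acc(n) = a - b*n^(-c), the gap to the asymptote is a - acc(n) = b*n^(-c), so the smallest sample within a target gap is n = (b/gap)^(1/c). The fitted coefficients below are assumptions chosen for illustration, not values from the paper.

```python
# Sketch: solve a fitted power curve acc(n) = a - b*n**(-c) for the sample
# size whose predicted accuracy is within `gap` of the terminal accuracy a.
from math import ceil

def required_sample_size(b, c, gap=0.005):
    """Smallest n with a - acc(n) = b*n**(-c) <= gap, i.e. n = (b/gap)**(1/c)."""
    return ceil((b / gap) ** (1.0 / c))

# Assumed fitted coefficients: acc(n) = 0.92 - 0.6 * n**(-0.55)
n_star = required_sample_size(b=0.6, c=0.55, gap=0.005)
print(n_star)
```

With coefficients in this range the answer lands in the thousands of records, consistent in spirit with the paper's observation that roughly 10,000 records often suffice.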
Modeling Performance of Different Classification Methods: Deviation from the Power Law
, 2005
Cited by 2 (0 self)
This project studied the effect of varying the training set size for different classification techniques. The learning curves were then regressed using four common equations. In the restricted domain which was studied, the logarithmic equation was the best fit. This contradicts the earlier work on decision trees, in which performance was best modeled by the power law. The other classification techniques studied in this project were K-Nearest Neighbors, Support Vector Machines, and Artificial Neural Networks, which had not yet been included in such a study. A preliminary study of how the modeling can be used for predicting performance was also undertaken. The equations which best predicted performance were not the same as the ones which best fit the final curve, and depended on the classification method more than the dataset.