Results 1 – 9 of 9
Efficient Progressive Sampling
, 1999
"... Having access to massiveamounts of data does not necessarily imply that induction algorithms must use them all. Samples often provide the same accuracy with far less computational cost. However, the correct sample size is rarely obvious. We analyze methods for progressive samplingstarting with ..."
Abstract

Cited by 91 (9 self)
Having access to massive amounts of data does not necessarily imply that induction algorithms must use them all. Samples often provide the same accuracy with far less computational cost. However, the correct sample size is rarely obvious. We analyze methods for progressive sampling: starting with small samples and progressively increasing them as long as model accuracy improves. We show that a simple, geometric sampling schedule is efficient in an asymptotic sense. We then explore the notion of optimal efficiency: what is the absolute best sampling schedule? We describe the issues involved in instantiating an "optimally efficient" progressive sampler. Finally, we provide empirical results comparing a variety of progressive sampling methods. We conclude that progressive sampling often is preferable to analyzing all data instances.
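The geometric progressive-sampling idea in this abstract can be sketched as follows; the function names, the plateau-based stopping test, and the schedule parameters are illustrative assumptions, not the paper's exact procedure:

```python
def geometric_schedule(n0, factor, n_max):
    """Yield sample sizes n0, n0*factor, n0*factor**2, ..., capped at n_max."""
    n = n0
    while n < n_max:
        yield n
        n = int(n * factor)
    yield n_max

def progressive_sample(train_and_score, n_max, n0=100, factor=2, tol=1e-3):
    """Grow the sample along a geometric schedule; stop once accuracy no
    longer improves by more than tol (a stand-in for convergence detection)."""
    best = -1.0
    for n in geometric_schedule(n0, factor, n_max):
        acc = train_and_score(n)  # user-supplied: train on n instances, return accuracy
        if acc - best <= tol:
            return n, acc         # accuracy has plateaued; stop early
        best = acc
    return n_max, best
```

With a toy accuracy curve that plateaus, e.g. `train_and_score = lambda n: 0.9 - 50.0 / n`, the sampler stops well before `n_max`, which is the point of the method.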
A Survey of Methods for Scaling Up Inductive Algorithms
 Data Mining and Knowledge Discovery
, 1999
"... . One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. This paper summarizes, categorizes, and compares existing work on scaling up inductive algorithms. We concentrate on algorithms that build decision trees and rule ..."
Abstract

Cited by 85 (10 self)
One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. This paper summarizes, categorizes, and compares existing work on scaling up inductive algorithms. We concentrate on algorithms that build decision trees and rule sets, in order to provide focus and specific details; the issues and techniques generalize to other types of data mining. We begin with a discussion of important issues related to scaling up. We highlight similarities among scaling techniques by categorizing them into three main approaches. For each approach, we then describe, compare, and contrast the different constituent techniques, drawing on specific examples from published papers. Finally, we use the preceding analysis to suggest how to proceed when dealing with a large problem, and where to focus future research.
Keywords: scaling up, inductive learning, decision trees, rule learning
Distributed Data Mining: Scaling up and beyond
 In Advances in Distributed and Parallel Knowledge Discovery
, 1999
"... In this chapter I begin by discussing Distributed Data Mining (DDM) for scaling up, beginning by asking what scaling up means, questioning whether it is necessary, and then presenting a brief survey of what has been done to date. I then provide motivation beyond scaling up, arguing that DDM is a mor ..."
Abstract

Cited by 14 (0 self)
In this chapter I begin by discussing Distributed Data Mining (DDM) for scaling up, beginning by asking what scaling up means, questioning whether it is necessary, and then presenting a brief survey of what has been done to date. I then provide motivation beyond scaling up, arguing that DDM is a more natural way to view data mining generally. DDM eliminates many difficulties encountered when coalescing already-distributed data for monolithic data mining, such as those associated with heterogeneity of data and with privacy restrictions. By viewing data mining as inherently distributed, important open research issues come into focus, issues that currently are obscured by the lack of explicit treatment of the process of producing monolithic data sets. I close with a discussion of the necessity of DDM for an efficient process of knowledge discovery.
Modelling Classification Performance for Large Data Sets: An Empirical Study
, 2001
"... . For many learning algorithms, their learning accuracy will increase as the size of training data increases, forming the wellknown learning curve. Usually a learning curve can be fitted by interpolating or extrapolating some points on it with a specified model. The obtained learning curve can ..."
Abstract

Cited by 5 (0 self)
For many learning algorithms, learning accuracy increases as the size of the training data increases, forming the well-known learning curve. Usually a learning curve can be fitted by interpolating or extrapolating some points on it with a specified model. The obtained learning curve can then be used to predict the maximum achievable learning accuracy or to estimate the amount of data needed to achieve an expected learning accuracy, both of which are especially meaningful for data mining on large data sets. Although some models have been proposed for learning curves, most of them have not been tested for applicability to large data sets. In this paper, we focus on this issue. We empirically compare six potentially useful models by fitting learning curves of two typical classification algorithms, C4.5 (decision tree) and LOG (logistic discrimination), on eight large UCI benchmark data sets. By using all available data for learning, we fit a full-length learning curve; by using a small portion of the data, we fit a part-length learning curve. The models are then compared on two criteria: (1) how well they fit a full-length learning curve, and (2) how well a fitted part-length learning curve can predict learning accuracy at the full length. Experimental results show that the power law (y = a − b · x^(−c)) is the best of the six models on both criteria, for both algorithms and all the data sets. These results support the applicability of learning curves to data mining.
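The power-law form reported in this abstract can be fitted with a simple grid search over the exponent, since the model is linear in x^(−c) once c is fixed; all names here are illustrative, and the paper's actual fitting procedure is not specified in the abstract:

```python
def fit_power_law(xs, ys, c_grid=None):
    """Fit y = a - b * x**(-c) by grid search over the exponent c.

    For each candidate c, the model is linear in z = x**(-c), so a and b
    have a closed-form least-squares solution; the c with the smallest
    squared error wins.
    """
    if c_grid is None:
        c_grid = [i / 100.0 for i in range(1, 201)]  # c in (0, 2]
    n = len(xs)
    best = None
    for c in c_grid:
        z = [x ** (-c) for x in xs]
        zm, ym = sum(z) / n, sum(ys) / n
        szz = sum((zi - zm) ** 2 for zi in z)
        szy = sum((zi - zm) * (yi - ym) for zi, yi in zip(z, ys))
        slope = szy / szz              # regression slope of y on z; equals -b
        a, b = ym - slope * zm, -slope
        sse = sum((a - b * zi - yi) ** 2 for zi, yi in zip(z, ys))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]   # a, b, c
```

On synthetic accuracies generated from a known curve, the grid search recovers the generating parameters, which is a quick sanity check before trusting extrapolations from a part-length curve.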
SAMPLE SIZE AND MODELING ACCURACY WITH DECISION-TREE BASED DATA MINING TOOLS
"... Given the cost associated with modeling very large datasets and overfitting issues of decisiontree based models, sample based models are an attractive alternative – provided that the sample based models have a predictive accuracy approximating that of models based on all available data. This paper ..."
Abstract

Cited by 2 (0 self)
Given the cost associated with modeling very large datasets and the overfitting issues of decision-tree based models, sample-based models are an attractive alternative – provided that the sample-based models have a predictive accuracy approximating that of models based on all available data. This paper presents results of sets of decision-tree models generated across progressive sets of sample sizes. The models were applied to two sets of actual client data using each of six prominent commercial data mining tools. The results suggest that model accuracy improves at a decreasing rate with increasing sample size. When a power curve was fitted to accuracy estimates across various sample sizes, more than 80 percent of the time accuracy within 0.5 percent of the expected terminal accuracy (that of a theoretical infinite sample) was achieved by the time the sample size reached 10,000 records. Based on these results, fitting a power curve to progressive samples and using it to establish an appropriate sample size appears to be a promising mechanism to support sample-based modeling for a large dataset.
The Association between the Disclosure and the Realization of Information Security Risk Factors
, 2009
"... Firms often disclose information security risk factors in public filings such as 10K reports. The internal information associated with disclosures may be positive or negative. In this paper, we are interested in evaluating how the nature of security risk factors disclosed, which is believed to repr ..."
Abstract

Cited by 2 (0 self)
Firms often disclose information security risk factors in public filings such as 10-K reports. The internal information associated with disclosures may be positive or negative. In this paper, we are interested in evaluating how the nature of the security risk factors disclosed, which is believed to represent the internal information regarding information security, is associated with future breach announcements. For this purpose, we build a decision tree model, which classifies the occurrence of future security breaches based on the textual contents of the disclosed security risk factors. The model is able to accurately associate disclosure characteristics with breach announcements about 77% of the time. We further explore the contents of the security risk factors using text mining techniques to provide a richer interpretation of the results. The results show that security risk factors with action-oriented terms and phrases are less likely to be related to future incidents. We also conduct a cross-sectional analysis to study how the market interprets the nature of information security risk factors in annual reports at different time points. We find that the market reaction following the security breach announcement is different ...
Modeling Performance of Different Classification Methods: Deviation from the Power Law
, 2005
"... Abstract – This project studied the effect of varying the training size for different classification techniques. The learning curves were then regressed using four common equations. In the restricted domain which was studied, the logarithmic equation was the best fit. This contradicts the earlier wo ..."
Abstract

Cited by 1 (0 self)
This project studied the effect of varying the training-set size for different classification techniques. The learning curves were then regressed using four common equations. In the restricted domain that was studied, the logarithmic equation was the best fit. This contradicts the earlier work on decision trees, in which performance was best modeled by the power law. The other classification techniques studied in this project were K-Nearest Neighbors, Support Vector Machines, and Artificial Neural Networks, which had not yet been included in such a study. A preliminary study of how the modeling can be used to predict performance was also undertaken. The equations which best predicted performance were not the same as the ones which best fit the final curve, and depended on the classification method more than on the dataset.
Intelligent agents in electronic markets for information goods: customization, preference revelation and pricing
"... www.elsevier.com/locate/dsw Intelligent agents in electronic markets for information goods: customization, preference revelation and pricing ..."
Abstract
Intelligent agents in electronic markets for information goods: customization, preference revelation and pricing
Predicting and Optimizing Classifier Utility with the Power Law
 In Seventh IEEE International Conference on Data Mining Workshops
"... When data collection is costly and/or takes a significant amount of time, an early prediction of the classifier performance is extremely important for the design of the data mining process. Power law has been shown in the past to be a good predictor of decisiontree error rates as a function of the s ..."
Abstract
When data collection is costly and/or takes a significant amount of time, an early prediction of classifier performance is extremely important for the design of the data mining process. The power law has been shown in the past to be a good predictor of decision-tree error rates as a function of the sample size. In this paper, we show that the optimal training set size for a given dataset can be computed from a learning curve characterized by a power law. Such a curve can be approximated using a small subset of the potentially available data and then used to estimate the expected tradeoff between the error rate and the amount of additional observations. The proposed approach to projected optimization of classifier utility is demonstrated and evaluated on several benchmark datasets.
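One way to read the proposed tradeoff: given a fitted power-law accuracy curve y(n) = a − b · n^(−c), the reducible error at sample size n is b · n^(−c), and an "optimal" size balances that against the cost of acquiring more data. The function name, the linear cost model, and the candidate grid below are illustrative assumptions, not the paper's method:

```python
def optimal_sample_size(b, c, cost_per_instance, candidate_sizes):
    """Pick the training-set size minimizing reducible error plus data cost.

    Assumes a fitted power-law accuracy curve y(n) = a - b * n**(-c), so the
    reducible error term is b * n**(-c); the constant part of the error does
    not affect the argmin and is omitted. The linear per-instance cost is an
    illustrative stand-in for the real cost of additional observations.
    """
    def total_loss(n):
        return b * n ** (-c) + cost_per_instance * n
    return min(candidate_sizes, key=total_loss)
```

Raising the per-instance cost shifts the chosen size downward, which matches the intuition that expensive data justifies stopping on a flatter part of the learning curve.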