Results 1 - 6 of 6
Efficient Progressive Sampling
- In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, 1999
Cited by 76 (8 self)
Having access to massive amounts of data does not necessarily imply that induction algorithms must use them all. Samples often provide the same accuracy with far less computational cost. However, the correct sample size rarely is obvious. We analyze methods for progressive sampling, that is, using progressively larger samples as long as model accuracy improves. We explore several notions of efficient progressive sampling. We analyze efficiency relative to induction with all instances; we show that a simple, geometric sampling schedule is asymptotically optimal, and we describe how best to take into account prior expectations of accuracy convergence. We then describe the issues involved in instantiating an efficient progressive sampler, including how to detect convergence. Finally, we provide empirical results comparing a variety of progressive sampling methods. We conclude that progressive sampling can be remarkably efficient.
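The geometric schedule mentioned in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical callback `evaluate(n)` that trains a model on `n` instances and returns its accuracy, and it uses a simple "gain below `min_gain`" stopping rule as a stand-in; the abstract does not say how the paper detects convergence, so that rule and all names here are assumptions for illustration.

```python
def geometric_schedule(n0, ratio, n_total):
    """Sample sizes n0, n0*ratio, n0*ratio**2, ..., ending at n_total."""
    if n0 < 1 or ratio <= 1 or n_total < 1:
        raise ValueError("need n0 >= 1, ratio > 1, n_total >= 1")
    sizes = []
    n = float(n0)
    while n < n_total:
        size = int(n)
        # Skip repeats that can occur when a fractional ratio rounds down.
        if not sizes or size != sizes[-1]:
            sizes.append(size)
        n *= ratio
    if not sizes or sizes[-1] != n_total:
        sizes.append(n_total)  # the final step uses all instances
    return sizes


def progressive_sample(evaluate, n0, ratio, n_total, min_gain=0.001):
    """Grow the sample geometrically until accuracy stops improving.

    `evaluate` is a hypothetical callback: it takes a sample size and
    returns the accuracy of a model trained on that many instances.
    The stopping rule (gain < min_gain) is an assumed placeholder.
    """
    prev_acc = None
    n = n_total
    for n in geometric_schedule(n0, ratio, n_total):
        acc = evaluate(n)
        if prev_acc is not None and acc - prev_acc < min_gain:
            return n, acc
        prev_acc = acc
    return n, prev_acc


if __name__ == "__main__":
    # Toy learning curve standing in for real training runs.
    toy_curve = lambda n: 0.9 - 0.5 * n ** -0.5
    print(geometric_schedule(100, 2, 1000))
    print(progressive_sample(toy_curve, 100, 2, 100000, min_gain=0.005))
```

Because each sample is a constant factor larger than the previous one, the work spent on the smaller samples stays a bounded multiple of the work on the last one, which is the intuition behind calling such a schedule efficient.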
A Survey of Methods for Scaling Up Inductive Algorithms
- Data Mining and Knowledge Discovery, 1999
Cited by 74 (10 self)
One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. This paper summarizes, categorizes, and compares existing work on scaling up inductive algorithms. We concentrate on algorithms that build decision trees and rule sets, in order to provide focus and specific details; the issues and techniques generalize to other types of data mining. We begin with a discussion of important issues related to scaling up. We highlight similarities among scaling techniques by categorizing them into three main approaches. For each approach, we then describe, compare, and contrast the different constituent techniques, drawing on specific examples from published papers. Finally, we use the preceding analysis to suggest how to proceed when dealing with a large problem, and where to focus future research. Keywords: scaling up, inductive learning, decision trees, rule learning
Distributed Data Mining: Scaling up and beyond
- In Advances in Distributed and Parallel Knowledge Discovery, 1999
Cited by 11 (0 self)
In this chapter I begin by discussing Distributed Data Mining (DDM) for scaling up, beginning by asking what scaling up means, questioning whether it is necessary, and then presenting a brief survey of what has been done to date. I then provide motivation beyond scaling up, arguing that DDM is a more natural way to view data mining generally. DDM eliminates many difficulties encountered when coalescing already-distributed data for monolithic data mining, such as those associated with heterogeneity of data and with privacy restrictions. By viewing data mining as inherently distributed, important open research issues come into focus, issues that currently are obscured by the lack of explicit treatment of the process of producing monolithic data sets. I close with a discussion of the necessity of DDM for an efficient process of knowledge discovery.
Modelling Classification Performance for Large Data Sets - An Empirical Study
- 2001
Cited by 2 (0 self)
For many learning algorithms, learning accuracy increases as the size of the training data increases, forming the well-known learning curve. Usually a learning curve can be fitted by interpolating or extrapolating some points on it with a specified model. The obtained learning curve can then be used to predict the maximum achievable learning accuracy or to estimate the amount of data needed to achieve an expected learning accuracy, both of which will be especially meaningful to data mining on large data sets. Although some models have been proposed to model learning curves, most of them do not test their applicability to large data sets. In this paper, we focus on this issue. We empirically compare six potentially useful models by fitting learning curves of two typical classification algorithms, C4.5 (decision tree) and LOG (logistic discrimination), on eight large UCI benchmark data sets. By using all available data for learning, we fit a full-length learning curve; by using a small portion of the data, we fit a part-length learning curve. The models are then compared in terms of two criteria: (1) how well they fit a full-length learning curve, and (2) how well a fitted part-length learning curve can predict learning accuracy at the full length. Experimental results show that the power law ($y = a - b x^{-c}$) is the best among the six models on both criteria for the two algorithms and all the data sets. These results support the applicability of learning curves to data mining.
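As a rough illustration of fitting the power law named in this abstract, the sketch below estimates `a`, `b`, and `c` in y = a - b * x^(-c) from (sample size, accuracy) pairs. The fitting procedure is an assumption chosen to keep the example dependency-free: for each candidate `c` on a grid the model is linear in `a` and `b`, so ordinary least squares gives them in closed form, and the `c` with the smallest squared error wins. The abstract does not describe the paper's own fitting method; the function names and the grid are hypothetical.

```python
def fit_power_law(sizes, accs, c_grid=None):
    """Fit y = a - b * x**(-c); returns (a, b, c).

    Assumed approach: grid search over c, closed-form least squares
    for a and b at each c. Not taken from the paper.
    """
    if len(sizes) != len(accs) or len(sizes) < 3:
        raise ValueError("need at least three (size, accuracy) pairs")
    if c_grid is None:
        c_grid = [k / 100 for k in range(1, 301)]  # c in 0.01 .. 3.00
    m = len(sizes)
    y_mean = sum(accs) / m
    best = None
    for c in c_grid:
        z = [x ** -c for x in sizes]  # transformed feature
        z_mean = sum(z) / m
        szz = sum((zi - z_mean) ** 2 for zi in z)
        if szz == 0:
            continue
        slope = sum((zi - z_mean) * (yi - y_mean)
                    for zi, yi in zip(z, accs)) / szz
        a = y_mean - slope * z_mean   # intercept = asymptotic accuracy
        b = -slope                    # model is y = a - b * z
        sse = sum((a - b * zi - yi) ** 2 for zi, yi in zip(z, accs))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    if best is None:
        raise ValueError("could not fit: degenerate inputs")
    return best[1], best[2], best[3]


def predict(a, b, c, x):
    """Accuracy predicted by the fitted curve at sample size x."""
    return a - b * x ** -c


if __name__ == "__main__":
    # Synthetic points from a known curve, standing in for measured accuracies.
    sizes = [100, 200, 400, 800, 1600, 3200]
    accs = [0.9 - 0.5 * x ** -0.5 for x in sizes]
    a, b, c = fit_power_law(sizes, accs)
    print(a, b, c, predict(a, b, c, 100000))
```

Fitting on a small portion of the sizes and calling `predict` at the full data size mirrors the "part-length curve predicts full-length accuracy" comparison described in the abstract.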
SAMPLE SIZE AND MODELING ACCURACY WITH DECISION-TREE BASED DATA MINING TOOLS
Cited by 1 (0 self)
Given the cost associated with modeling very large datasets and over-fitting issues of decision-tree based models, sample based models are an attractive alternative – provided that the sample based models have a predictive accuracy approximating that of models based on all available data. This paper presents results of sets of decision-tree models generated across progressive sets of sample sizes. The models were applied to two sets of actual client data using each of six prominent commercial data mining tools. The results suggest that model accuracy improves at a decreasing rate with increasing sample size. When a power curve was fitted to accuracy estimates across various sample sizes, more than 80 percent of the time accuracy within 0.5 percent of the expected terminal (accuracy of a theoretical infinite sample) was achieved by the time the sample size reached 10,000 records. Based on these results, fitting a power curve to progressive samples and using it to establish an appropriate sample size appears to be a promising mechanism to support sample based modeling for a large dataset.
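The idea of using a fitted power curve to choose a sample size can be sketched as follows. The abstract does not give the functional form of its power curve, so this example assumes y = a - b * n^(-c), where `a` is the terminal accuracy (the accuracy of a theoretical infinite sample). Under that assumption the gap to the terminal accuracy at size n is b * n^(-c), and the smallest n with a gap of at most `tol` is (b / tol)^(1/c). The function name and parameters are hypothetical.

```python
import math


def sample_size_for_tolerance(b, c, tol):
    """Smallest n with b * n**(-c) <= tol under the assumed curve
    y = a - b * n**(-c). The asymptote a cancels out of the gap."""
    if b <= 0 or c <= 0 or tol <= 0:
        raise ValueError("b, c and tol must all be positive")
    return max(1, math.ceil((b / tol) ** (1.0 / c)))


if __name__ == "__main__":
    # Illustrative parameters only, not values from the paper.
    print(sample_size_for_tolerance(b=0.5, c=0.5, tol=0.0078125))
```

In practice `b` and `c` would come from fitting the curve to accuracies measured on a few progressive samples, and `tol` would be the acceptable shortfall (the abstract uses 0.5 percent).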
The Association between the Disclosure and the Realization of Information Security Risk Factors
Firms often disclose information security risk factors in public filings such as 10-K reports. The internal information associated with disclosures may be positive or negative. In this paper, we are interested in evaluating how the nature of security risk factors disclosed, which is believed to represent the internal information regarding information security, is associated with future breach announcements. For this purpose, we build a decision tree model, which classifies the occurrence of future security breaches based on the textual contents of the disclosed security risk factors. The model is able to accurately associate disclosure characteristics with breach announcements about 77% of the time. We further explore the contents of the security risk factors using text mining techniques to provide a richer interpretation of the results. The results show that the security risk factors with action-oriented terms and phrases are less likely to be related to future incidents. We also conduct a cross-sectional analysis to study how the market interprets the nature of information security risk factors in annual reports at different time points. We find that the market reaction following the security breach announcement is different

