Robust order statistics based ensemble for distributed data mining
 In Advances in Distributed and Parallel Knowledge Discovery
, 2000
Cited by 13 (4 self)
Integrating the outputs of multiple classifiers via combiners or meta-learners has led to substantial improvements in several difficult pattern recognition problems. In the typical setting investigated so far, each classifier is trained on data taken or resampled from a common data set, or on randomly selected partitions thereof, and thus experiences training data of similar quality. However, in distributed data mining involving heterogeneous databases, the nature, quality and quantity of data available to each site/classifier may vary substantially, leading to large discrepancies in their performance. In this chapter we introduce and investigate a family of meta-classifiers based on order statistics, for robust handling of such cases. Based on a mathematical model of how the decision boundaries are affected by order statistic combiners, we derive expressions for the reductions in error expected when such combiners are used. We show analytically that selecting the median, the maximum and, in general, the ith order statistic improves classification performance. Furthermore, we introduce the trim and spread combiners, both based on linear combinations of the ordered classifier outputs, and show empirically that they are significantly superior in the presence of outliers or uneven classifier performance. They can therefore be fruitfully applied to several heterogeneous distributed data mining situations, especially when it is ...
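The combiners named in this abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the function names, the `trim` parameter, and the exact weights are assumptions. In particular, "trim" is taken here to be a trimmed mean and "spread" the average of the smallest and largest outputs, both of which are linear combinations of the ordered outputs.

```python
from statistics import median


def order_statistic_combiners(outputs, trim=1):
    """Combine the scores that several classifiers assign to ONE class.

    outputs: list of floats, one score per classifier.
    trim: how many values to drop from each end for the trimmed mean
          (hypothetical parameter, chosen for illustration).
    """
    ordered = sorted(outputs)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one classifier output")
    if 2 * trim >= n:
        raise ValueError("trim would discard every output")
    kept = ordered[trim:n - trim]
    return {
        "min": ordered[0],
        "median": median(ordered),
        "max": ordered[-1],
        # Assumed definition: mean of the outputs left after trimming.
        "trim": sum(kept) / len(kept),
        # Assumed definition: average of the two extreme order statistics.
        "spread": 0.5 * (ordered[0] + ordered[-1]),
    }


def classify(scores_per_class, combiner="median", trim=1):
    """Pick the class whose combined score is largest.

    scores_per_class: dict mapping class label -> list of per-classifier scores.
    """
    combined = {
        label: order_statistic_combiners(scores, trim)[combiner]
        for label, scores in scores_per_class.items()
    }
    return max(combined, key=combined.get)
```

With one badly trained site reporting 0.1 for a class that the other four sites score between 0.7 and 0.9, the median and trimmed mean ignore the outlier, whereas a plain average would be pulled down by it.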
Probabilistic estimation based data mining for discovering insurance risks
 IEEE Intelligent Systems
, 1999
Cited by 12 (7 self)
The UPA (Underwriting Profitability Analysis) application embodies a new approach to mining Property & Casualty (P&C) insurance policy and claims data for the purpose of constructing predictive models for insurance risks. UPA utilizes the ProbE (Probabilistic Estimation) predictive modeling data mining kernel to discover risk characterization rules by analyzing large and noisy insurance data sets. Each rule defines a distinct risk group and its level of risk. To satisfy regulatory constraints, the risk groups are mutually exclusive and exhaustive. The rules generated by ProbE are statistically rigorous, interpretable, and credible from an actuarial standpoint. Our approach to modeling insurance risks and the implementation of that approach have been validated in an actual engagement with a P&C firm. The benefit assessment of the results suggests that this methodology provides significant value to the P&C insurance risk management process.
Decision tree learning on very large data sets
 In IEEE Conference on Systems, Man and Cybernetics
, 1998
Cited by 11 (4 self)
Consider a labeled data set of 1 terabyte in size. A salient subset might depend upon the user's interests. Clearly, browsing such a large data set to find interesting areas would be very time consuming. An intelligent agent which, for a given class of user, could provide hints on areas of the data that might interest the user would be very useful. Given large data sets having categories of salience for different user classes attached to the data in them, these labeled sets of data can be used to train a decision tree to label unseen data examples with a category of salience. The training set will be much larger than usual. This paper describes an approach to generating the rules for an agent from a large training set. A set of decision trees is built in parallel on tractable-size training data sets that are subsets of the original data. Each learned decision tree is reduced to a set of rules, conflicting rules are resolved, and the resultant rules are merged into one set. Results from cross-validation experiments on a data set suggest this approach may be effectively applied to large sets of data.
Toward a Theoretical Understanding of Why and When Decision Tree Pruning Algorithms Fail
, 1999
Cited by 9 (0 self)
Recent empirical studies revealed two surprising pathologies of several common decision tree pruning algorithms. First, tree size is often a linear function of training set size, even when additional tree structure yields no increase in accuracy. Second, building trees with data in which the class label and the attributes are independent often results in large trees.
Data snooping, dredging and fishing: The dark side of data mining, a SIGKDD99 panel report
 SIGKDD Explorations
, 2000
Cited by 7 (0 self)
This article briefly describes a panel discussion at SIGKDD99.
A Comparative Evaluation of Meta-Learning Strategies over Large and Distributed Data Sets
 In Workshop on Metalearning, Sixteenth Intl. Conf. Machine Learning
, 1999
Cited by 6 (3 self)
There has been considerable interest recently in various approaches to scaling up machine learning systems to large and distributed data sets. We have been studying approaches based upon the parallel application of multiple learning programs at distributed sites, followed by a meta-learning stage to combine the multiple models in a principled fashion. In this paper, we empirically determine the "best" data partitioning scheme for a selected data set to compose "appropriately-sized" subsets, and we evaluate and compare three different strategies, Voting, Stacking and Stacking with Correspondence Analysis (SCANN), for combining classification models trained over these subsets. We seek to find ways to efficiently scale up to large data sets while maintaining or improving predictive performance measured by the error rate, a cost model, and the TP-FP spread. Keywords: classification, multiple models, meta-learning, stacking, voting, correspondence analysis, data partitioning
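Of the three combining strategies listed, voting is the simplest to show. The Python sketch below is only an illustration under assumptions: the function name and the tie-breaking rule are hypothetical, and Stacking and SCANN are not covered.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most base classifiers for one example.

    predictions: list of labels, one per classifier trained on a data subset.
    Ties go to the earliest classifier in the list (an assumed rule,
    not one stated in the abstract).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


def error_rate(true_labels, predicted_labels):
    """Fraction of examples whose combined prediction is wrong."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)
```

Each base classifier here stands for a model trained on one partition of the data. The combined predictions can then be scored with `error_rate`, one of the performance measures the abstract mentions.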
Reduced-Error Pruning With Significance Tests
 Available: http://libra.msra.cn/paperdetail.aspx?id=305368
, 1998
Cited by 6 (0 self)
When building classification models, it is common practice to prune them to counter spurious effects of the training data: this often improves performance and reduces model size. "Reduced-error pruning" is a fast pruning procedure for decision trees that is known to produce small and accurate trees. Apart from the data from which the tree is grown, it uses an independent "pruning" set, and pruning decisions are based on the model's error rate on this fresh data. Recently it has been observed that reduced-error pruning overfits the pruning data, producing unnecessarily large decision trees. This paper investigates whether standard statistical significance tests can be used to counter this phenomenon. The problem of overfitting to the pruning set highlights the need for significance testing. We investigate two classes of test, "parametric" and "nonparametric." The standard chi-squared statistic can be used both in a parametric test and as the basis for a nonparametric permutation tes...
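The two uses of the chi-squared statistic mentioned above can be sketched in Python as follows. This is an illustration under assumptions, not the paper's procedure: the function names, the `n_permutations` and `seed` parameters, and the way branch assignments are represented are all hypothetical, and how the test is wired into pruning decisions is not shown.

```python
import random


def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0:
        raise ValueError("empty contingency table")
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def contingency(branches, labels):
    """Count pruning-set instances per (branch, class label) pair."""
    branch_ids = sorted(set(branches))
    label_ids = sorted(set(labels))
    table = [[0] * len(label_ids) for _ in branch_ids]
    for branch, label in zip(branches, labels):
        table[branch_ids.index(branch)][label_ids.index(label)] += 1
    return table


def permutation_p_value(branches, labels, n_permutations=1000, seed=0):
    """Nonparametric test: shuffle the class labels and count how often the
    shuffled statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = chi_squared_statistic(contingency(branches, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if chi_squared_statistic(contingency(branches, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)
```

In the parametric variant, the statistic is compared against the chi-squared distribution directly. For a 2x2 table (one degree of freedom), the 5% critical value is about 3.84.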
Performance Controlled Data Reduction for Knowledge Discovery in Distributed Databases
 Proc. Pacific-Asia Conf. on Knowledge Discovery and Data Mining
Cited by 4 (3 self)
The objective of data reduction is to obtain a compact representation of a large data set to facilitate repeated use of non-redundant information with complex and slow learning algorithms and to allow efficient data transfer and storage. For a user-controllable allowed accuracy loss we propose an effective data reduction procedure based on guided sampling for identifying a minimal-size representative subset, followed by a model-sensitivity analysis for determining an appropriate compression level for each attribute. Experiments were performed on three large data sets and, depending on an allowed accuracy loss margin ranging from 1% to 5% of the ideal generalization, the achieved compression rates ranged between 95 and 12,500 times. These results indicate that transferring reduced data sets from multiple locations to a centralized site for efficient and accurate knowledge discovery might often be possible in practice.
Efficiently Determine the Starting Sample Size for Progressive Sampling
Cited by 3 (0 self)
Given a large data set and a classification learning algorithm, Progressive Sampling (PS) learns from progressively larger random samples until model accuracy no longer improves. The technique has been shown to be remarkably efficient compared to using the entire data set. However, how to set the starting sample size for PS is still an open problem. We show that an improper starting sample size can still make PS computationally expensive, because the learning algorithm is run on a large number of instances (over a sequence of random samples before convergence is achieved) and because excessive database scans are needed to fetch the sample data. Using a suitable starting sample size can further improve the efficiency of PS. In this paper, we present a statistical approach that is able to find such a size efficiently. We call it the Statistical Optimal Sample Size (SOSS), in the sense that a sample of this size sufficiently resembles the entire data. We introduce an information-based measure of this resemblance (Sample Quality) to define the SOSS and show that it can be efficiently obtained in one scan of the data. We prove that learning on a sample of SOSS will produce model accuracy that asymptotically approaches the highest achievable accuracy on the entire data. Empirical results on a number of large data sets from the UCI KDD repository show that SOSS is a suitable starting size for Progressive Sampling.
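The Progressive Sampling loop that this abstract builds on can be sketched in Python as below. The sketch makes several assumptions: the geometric `growth` schedule, the `tolerance` stopping rule, and the `train_and_score` callable are hypothetical choices for illustration, and the SOSS computation itself is not implemented because the abstract does not give its definition.

```python
import random


def progressive_sampling(data, train_and_score, start_size,
                         growth=2.0, tolerance=0.001, seed=0):
    """Learn on growing random samples until accuracy stops improving.

    data: list of instances.
    train_and_score: hypothetical callable, sample -> accuracy in [0, 1].
    start_size: first sample size (the quantity SOSS is meant to choose).
    Returns the list of (sample_size, accuracy) pairs that were evaluated.
    """
    if not 0 < start_size <= len(data):
        raise ValueError("start_size must be between 1 and len(data)")
    rng = random.Random(seed)
    size = start_size
    best = None
    history = []
    while True:
        sample = rng.sample(data, size)
        accuracy = train_and_score(sample)
        history.append((size, accuracy))
        # Stop once a larger sample no longer improves on the best accuracy.
        if best is not None and accuracy - best <= tolerance:
            break
        best = accuracy if best is None else max(best, accuracy)
        if size == len(data):
            break
        # Grow geometrically, always by at least one instance.
        size = min(len(data), max(size + 1, int(size * growth)))
    return history
```

Each entry in the returned history corresponds to one run of the learning algorithm. A start size that is too small therefore shows up directly as extra entries before convergence, which is the cost the abstract attributes to an improper starting sample size.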