Privacy Preserving Association Rule Mining in Vertically Partitioned Data
- In The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002
Cited by 295 (21 self)
Privacy considerations often constrain data mining projects. This paper addresses the problem of association rule mining where transactions are distributed across sources. Each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. However, the sites must not reveal individual transaction data. We present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values.
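As a concrete reading of the vertically partitioned setting, assume each site stores one 0/1 column per attribute it holds, with rows aligned by transaction. The support count of an itemset that spans both sites is then the dot product of the two sites' indicator vectors. The Python sketch below computes that quantity in the clear, purely to show what has to be computed; it is not the paper's privacy-preserving protocol, and the function names and toy data are hypothetical.

```python
from typing import Dict, List, Sequence


def indicator(columns: Dict[str, List[int]], items: Sequence[str]) -> List[int]:
    """Return 1 for each transaction that contains all of `items` at this site."""
    n = len(next(iter(columns.values())))
    return [int(all(columns[item][t] for item in items)) for t in range(n)]


def support_count(x: Sequence[int], y: Sequence[int]) -> int:
    """Dot product of two aligned 0/1 vectors = number of joint occurrences."""
    return sum(a * b for a, b in zip(x, y))


# Hypothetical toy data: five transactions, attributes split across two sites.
site_a = {"bread": [1, 1, 0, 1, 1], "milk": [1, 0, 1, 1, 1]}
site_b = {"butter": [1, 1, 0, 1, 0]}

x = indicator(site_a, ["bread", "milk"])  # site A's part of the itemset
y = indicator(site_b, ["butter"])         # site B's part of the itemset

count = support_count(x, y)
min_count = 2  # assumed minimum support, expressed as an absolute count
is_frequent = count >= min_count
```

A privacy-preserving variant would have to obtain the same count without either site revealing its vector to the other.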
Tools for Privacy Preserving Distributed Data Mining
- ACM SIGKDD Explorations, 2003
Cited by 189 (8 self)
Privacy preserving mining of distributed data has numerous applications. Each application poses different constraints: What is meant by privacy, what are the desired results, how is the data distributed, what are the constraints on collaboration and cooperative computing, etc. We suggest that the solution to this is a toolkit of components that can be combined for specific privacy-preserving data mining applications. This paper presents some components of such a toolkit, and shows how they can be used to solve several privacy-preserving data mining problems.
A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees
- International Journal of Hybrid Intelligent Systems, 2004
Cited by 46 (17 self)
This paper motivates and precisely formulates the problem of learning from distributed data; describes a general strategy for transforming traditional machine learning algorithms into algorithms for learning from distributed data; demonstrates the application of this strategy to devise algorithms for decision tree induction from distributed data; and identifies the conditions under which the algorithms in the distributed setting are superior to their centralized counterparts in terms of time and communication complexity. The resulting algorithms are provably exact in that the decision tree constructed from distributed data is identical to that obtained in the centralized setting. Some natural extensions leading to algorithms for learning from heterogeneous distributed data and learning under privacy constraints are outlined.
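One way to see why exactness is possible, assuming the data is horizontally partitioned (each site holds complete rows): the class counts per attribute value are sufficient statistics for information gain, and counts simply add across sites. The Python sketch below illustrates this with hypothetical helper names and toy data; it is not the authors' implementation.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[str, str]  # (attribute value, class label)


def local_counts(rows: List[Row]) -> Counter:
    """Sufficient statistics a site would send: counts of (value, label) pairs."""
    return Counter(rows)


def entropy(counts: Iterable[int]) -> float:
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(joint: Counter) -> float:
    """Information gain of the attribute, computed from joint counts only."""
    class_totals: Counter = Counter()
    value_totals: Counter = Counter()
    for (value, label), c in joint.items():
        class_totals[label] += c
        value_totals[value] += c
    n = sum(joint.values())
    remainder = 0.0
    for value, n_value in value_totals.items():
        per_class = [c for (v, _), c in joint.items() if v == value]
        remainder += n_value / n * entropy(per_class)
    return entropy(class_totals.values()) - remainder


# Hypothetical toy data held at two sites.
site_1 = [("sunny", "no"), ("sunny", "no"), ("rain", "yes")]
site_2 = [("rain", "yes"), ("sunny", "yes"), ("rain", "no")]

# Distributed: each site reports counts, the learner adds them up.
distributed = local_counts(site_1) + local_counts(site_2)
# Centralized: all rows are pooled first.
centralized = local_counts(site_1 + site_2)

gain_distributed = information_gain(distributed)
gain_centralized = information_gain(centralized)
```

Because the summed counts equal the pooled counts, any split criterion that depends only on these counts selects the same attribute in both settings, and only the counts have to cross the network.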
Ensemble Pruning Via Semi-definite Programming
- Journal of Machine Learning Research, 2006
Cited by 42 (2 self)
An ensemble is a group of learning models that jointly solve a problem. However, the ensembles generated by existing techniques are sometimes unnecessarily large, which can lead to extra memory usage, computational costs, and occasional decreases in effectiveness. The purpose of ensemble pruning is to search for a good subset of ensemble members that performs as well as, or better than, the original ensemble. This subset selection problem is a combinatorial optimization problem and thus finding the exact optimal solution is computationally prohibitive. Various heuristic methods have been developed to obtain an approximate solution. However, most of the existing heuristics use simple greedy search as the optimization method, which lacks either theoretical or empirical quality guarantees. In this paper, the ensemble subset selection problem is formulated as a quadratic integer programming problem. By applying semi-definite programming (SDP) as a solution technique, we are able to get better approximate solutions. Computational experiments show that this SDP-based pruning algorithm outperforms other heuristics in the literature. Its application in a classifier-sharing study also demonstrates the effectiveness of the method.
Bias-variance analysis of support vector machines for the development of SVM-based ensemble methods
- Journal of Machine Learning Research, 2004
Cited by 37 (0 self)
Bias-variance analysis provides a tool to study learning algorithms and can be used to properly design ensemble methods well tuned to the properties of a specific base learner. Indeed the effectiveness of ensemble methods critically depends on accuracy, diversity and learning characteristics of base learners. We present an extended experimental analysis of bias-variance decomposition of the error in Support Vector Machines (SVMs), considering Gaussian, polynomial and dot product kernels. A characterization of the error decomposition is provided, by means of the analysis of the relationships between bias, variance, kernel type and its parameters, offering insights into the way SVMs learn. The results show that the expected trade-off between bias and variance is sometimes observed, but more complex relationships can be detected, especially in Gaussian and polynomial kernels. We show that the bias-variance decomposition offers a rationale to develop ensemble methods using SVMs as base learners, and we outline two directions for developing SVM ensembles, exploiting the SVM bias characteristics and the bias-variance dependence on the kernel parameters. Keywords: Bias-variance analysis, support vector machines, ensemble methods, multi-classifier systems.
Decision Tree Induction from Distributed Heterogeneous Autonomous Data Sources
- In Proceedings of the Conference on Intelligent Systems Design and Applications (ISDA 03), 2003
Cited by 19 (4 self)
This paper presents an approach to the design of systems for learning from distributed, autonomous and heterogeneous data sources. We precisely formulate a class of distributed learning problems; present a general strategy for transforming a large class of traditional machine learning algorithms into distributed learning algorithms; and demonstrate the application of this strategy to devise algorithms for decision tree induction (using a variety of splitting criteria) from distributed data. The resulting algorithms are provably exact in that the decision tree constructed from distributed data is identical to that obtained by the corresponding algorithm in the batch setting (i.e., when all of the data is available in a central location).
When distribution is part of the semantics: A new problem class for distributed knowledge discovery
- In Ubiquitous Data Mining for Mobile and Distributed Environments Workshop associated with the Joint 12th European Conference on Machine Learning (ECML’01) and 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’01), 2001
Cited by 19 (1 self)
Within a research project at DaimlerChrysler we use vehicles as mobile data sources for distributed knowledge discovery. We realized that current approaches are not suitable for our purposes. They aim to infer a global model and try to approximate the results one would get from a single joined data source. Thus, they treat distribution as a technical issue only and ignore that the distribution itself may have a meaning and that models depend on the context in which they were derived. The main contribution of this paper is the identification of a practically relevant new problem class for distributed knowledge discovery which addresses the semantics of distribution. We show that this problem class is the proper framework for many important applications in which it should become an integral part of the knowledge discovery process, affecting the results as well as the process itself. We outline a novel solution, called Knowledge Discovery from Models, which uses models as primary input and combines content driven and context driven analyses. Finally, we discuss challenging research questions, which are raised by the new problem class.
Robust order statistics based ensemble for distributed data mining
- In Advances in Distributed and Parallel Knowledge Discovery, 2000
Cited by 18 (5 self)
Integrating the outputs of multiple classifiers via combiners or meta-learners has led to substantial improvements in several difficult pattern recognition problems. In the typical setting investigated till now, each classifier is trained on data taken or resampled from a common data set, or randomly selected partitions thereof, and thus experiences similar quality of training data. However, in distributed data mining involving heterogeneous databases, the nature, quality and quantity of data available to each site/classifier may vary substantially, leading to large discrepancies in their performance. In this chapter we introduce and investigate a family of meta-classifiers based on order statistics, for robust handling of such cases. Based on a mathematical modeling of how the decision boundaries are affected by order statistic combiners, we derive expressions for the reductions in error expected when such combiners are used. We show analytically that the selection of the median, the maximum and in general, the ith order statistic improves classification performance. Furthermore, we introduce the trim and spread combiners, both based on linear combinations of the ordered classifier outputs, and empirically show that they are significantly superior in the presence of outliers or uneven classifier performance. So they can be fruitfully applied to several heterogeneous distributed data mining situations, specially when it is
Multi-Database Mining
- 2003
Cited by 15 (2 self)
Multi-database mining is an important research area because (1) there is an urgent need for analyzing data in different sources, (2) there are essential differences between mono- and multi-database mining, and (3) there are limitations in existing multi-database mining efforts. This paper designs a new multi-database mining process. Some research issues involving mining multi-databases, including database clustering and local pattern analysis, are discussed.
Ontology-Driven Information Extraction and Knowledge Acquisition from Heterogeneous, Distributed, Autonomous Biological Data Sources
- In Proceedings of the IJCAI-2001 Workshop on Knowledge Discovery from Heterogeneous, Distributed, Autonomous, Dynamic Data and Knowledge Sources, 2001
Cited by 13 (6 self)
Scientific discovery in data rich domains (e.g., biological sciences, atmospheric sciences) presents several challenges in information extraction and knowledge acquisition from heterogeneous, distributed, autonomously operated, dynamic data sources. This paper describes these problems and outlines the key elements of algorithmic and systems solutions for computer assisted scientific discovery in such domains. These include: ontology-assisted approaches to customizable data integration and information extraction from heterogeneous, distributed data sources; distributed data mining algorithms for knowledge acquisition from large, distributed data sets which obviate the need for transmitting large volumes of data across the network; ontology-driven approaches to exploratory data analysis from alternative ontological perspectives; and modular and extensible agent-based implementations of the algorithms within a platform-independent agent infrastructure. Prototype implementations of ...