Results 1 – 10 of 59
Privacy Preserving Association Rule Mining in Vertically Partitioned Data
 In The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, 2002
Abstract

Cited by 207 (20 self)
Privacy considerations often constrain data mining projects. This paper addresses the problem of association rule mining where transactions are distributed across sources. Each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. However, the sites must not reveal individual transaction data. We present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values.
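The key observation behind such a two-party algorithm is that, under vertical partitioning, the global support count of an itemset whose attributes span both sites equals the scalar product of the two sites' boolean presence vectors; the paper's contribution is computing that product securely. The sketch below shows only the counting identity, with hypothetical data; the secure scalar-product protocol itself is omitted.

```python
# Each site holds some attributes of every transaction (vertical
# partitioning). For an itemset split across two sites, each site builds a
# 0/1 vector marking the transactions that contain its share of the itemset.
# The itemset's global support count is the scalar product of the two
# vectors -- the quantity the paper computes with a *secure* scalar-product
# protocol so that neither site sees the other's vector.

def presence_vector(transactions, items):
    """1 if the transaction contains every item in `items`, else 0."""
    return [1 if items <= t else 0 for t in transactions]

# Hypothetical data: 5 transactions; attributes A, B live at site 1, C at site 2.
site1 = [{"A", "B"}, {"A"}, {"A", "B"}, {"B"}, {"A", "B"}]
site2 = [{"C"}, {"C"}, set(), {"C"}, {"C"}]

x = presence_vector(site1, {"A", "B"})   # site 1's share of itemset {A, B, C}
y = presence_vector(site2, {"C"})        # site 2's share

support = sum(a * b for a, b in zip(x, y))
print(support)  # number of transactions containing A, B and C
```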
Tools for Privacy Preserving Distributed Data Mining
 ACM SIGKDD Explorations
, 2003
Abstract

Cited by 121 (7 self)
Privacy preserving mining of distributed data has numerous applications. Each application poses different constraints: what is meant by privacy, what are the desired results, how is the data distributed, what are the constraints on collaboration and cooperative computing, etc. We suggest that the solution to this is a toolkit of components that can be combined for specific privacy-preserving data mining applications. This paper presents some components of such a toolkit, and shows how they can be used to solve several privacy-preserving data mining problems.
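One component such a toolkit typically includes is secure sum: sites jointly compute the sum of their private values without any site learning another's input, assuming the sites do not collude. A minimal sketch, with illustrative names and modulus:

```python
import random

# Secure sum over a ring of sites: the initiating site masks its value with
# a random offset, each site adds its own value in turn, and the initiator
# removes the mask at the end. In the real protocol the running total is
# passed from site to site; here the loop stands in for that message ring.
# The result is exact as long as the true sum is below `modulus`.

def secure_sum(values, modulus=1_000_000):
    initiator_mask = random.randrange(modulus)
    running = (initiator_mask + values[0]) % modulus  # site 0 starts, masked
    for v in values[1:]:                              # each site adds its value
        running = (running + v) % modulus
    return (running - initiator_mask) % modulus       # initiator unmasks

private_values = [12, 7, 30]   # hypothetical per-site inputs
print(secure_sum(private_values))  # 49
```

Each intermediate `running` value is uniformly distributed, so no single site can infer another site's input from what it receives.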
A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees
 International Journal of Hybrid Intelligent Systems
, 2004
Abstract

Cited by 37 (16 self)
This paper motivates and precisely formulates the problem of learning from distributed data; describes a general strategy for transforming traditional machine learning algorithms into algorithms for learning from distributed data; demonstrates the application of this strategy to devise algorithms for decision tree induction from distributed data; and identifies the conditions under which the algorithms in the distributed setting are superior to their centralized counterparts in terms of time and communication complexity. The resulting algorithms are provably exact in that the decision tree constructed from distributed data is identical to that obtained in the centralized setting. Some natural extensions, leading to algorithms for learning from heterogeneous distributed data and learning under privacy constraints, are outlined.
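The strategy rests on the fact that split selection in decision tree induction needs only counts (sufficient statistics), not raw records: each site reports its (attribute value, class) counts, the coordinator sums them and evaluates the splitting criterion exactly as a centralized learner would, so the chosen split is identical. A sketch with hypothetical counts, using information gain as the criterion:

```python
import math
from collections import Counter

# Only statistics cross the network: each site sends, per candidate
# attribute, its (attribute value, class) counts. Summing the per-site
# Counters yields exactly the counts a centralized learner would compute,
# so the information gain -- and hence the split -- is identical.

def entropy(counter):
    n = sum(counter.values())
    return -sum(c / n * math.log2(c / n) for c in counter.values() if c)

def info_gain(joint_counts):
    """joint_counts: Counter over (attribute_value, class) pairs."""
    total = Counter()
    by_value = {}
    for (v, c), k in joint_counts.items():
        total[c] += k
        by_value.setdefault(v, Counter())[c] += k
    n = sum(total.values())
    rem = sum(sum(cc.values()) / n * entropy(cc) for cc in by_value.values())
    return entropy(total) - rem

# Hypothetical counts from two sites for a candidate attribute "outlook".
site_a = Counter({("sunny", "no"): 3, ("rain", "yes"): 2})
site_b = Counter({("sunny", "yes"): 1, ("rain", "yes"): 2, ("rain", "no"): 1})

global_counts = site_a + site_b   # coordinator merges the sufficient statistics
print(round(info_gain(global_counts), 3))
```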
Bias-variance analysis of support vector machines for the development of SVM-based ensemble methods
 Journal of Machine Learning Research
, 2004
Abstract

Cited by 25 (0 self)
Bias-variance analysis provides a tool to study learning algorithms and can be used to properly design ensemble methods well tuned to the properties of a specific base learner. Indeed, the effectiveness of ensemble methods critically depends on the accuracy, diversity and learning characteristics of the base learners. We present an extended experimental analysis of the bias-variance decomposition of the error in Support Vector Machines (SVMs), considering Gaussian, polynomial and dot-product kernels. A characterization of the error decomposition is provided by means of an analysis of the relationships between bias, variance, kernel type and its parameters, offering insights into the way SVMs learn. The results show that the expected trade-off between bias and variance is sometimes observed, but more complex relationships can be detected, especially for Gaussian and polynomial kernels. We show that the bias-variance decomposition offers a rationale for developing ensemble methods using SVMs as base learners, and we outline two directions for developing SVM ensembles, exploiting the SVM bias characteristics and the bias-variance dependence on the kernel parameters.
Keywords: bias-variance analysis, support vector machines, ensemble methods, multi-classifier systems.
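One common way to estimate such a decomposition for 0/1 loss (a Domingos-style estimator, not necessarily the exact protocol of this paper) is to train the learner on many bootstrap samples, take the majority ("main") prediction per test point, then measure bias as disagreement of the main prediction with the truth and variance as disagreement of individual predictions with the main prediction. The sketch below uses a toy decision stump instead of an SVM to stay self-contained; the learner, data and bootstrap counts are all illustrative.

```python
import random
from collections import Counter

def stump_threshold(train):
    """Toy base learner: pick the threshold on x best separating the labels."""
    best, best_err = 0.0, len(train) + 1
    for t, _ in train:
        err = sum((x > t) != y for x, y in train)
        if err < best_err:
            best, best_err = t, err
    return best

def bias_variance(train_pool, test, n_boot=50, seed=0):
    rng = random.Random(seed)
    preds = []  # preds[b][i] = prediction of bootstrap model b on test point i
    for _ in range(n_boot):
        boot = [rng.choice(train_pool) for _ in train_pool]
        t = stump_threshold(boot)
        preds.append([x > t for x, _ in test])
    bias = variance = 0.0
    for i, (_, y) in enumerate(test):
        column = [p[i] for p in preds]
        main = Counter(column).most_common(1)[0][0]   # majority prediction
        bias += (main != y) / len(test)               # main prediction wrong?
        variance += sum(p != main for p in column) / (n_boot * len(test))
    return bias, variance

# Hypothetical 1-D task: label is True when x > 0.5, plus one noisy point
# that makes the bootstrap models disagree.
train_pool = [(x / 10, x / 10 > 0.5) for x in range(10)] + [(0.45, True)]
test = [(x / 10, x / 10 > 0.5) for x in range(10)]
b, v = bias_variance(train_pool, test)
print(b, v)
```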
Ensemble Pruning Via Semidefinite Programming
 Journal of Machine Learning Research
, 2006
Abstract

Cited by 19 (2 self)
An ensemble is a group of learning models that jointly solve a problem. However, the ensembles generated by existing techniques are sometimes unnecessarily large, which can lead to extra memory usage, computational costs, and occasional decreases in effectiveness. The purpose of ensemble pruning is to search for a good subset of ensemble members that performs as well as, or better than, the original ensemble. This subset selection problem is a combinatorial optimization problem and thus finding the exact optimal solution is computationally prohibitive. Various heuristic methods have been developed to obtain an approximate solution. However, most of the existing heuristics use simple greedy search as the optimization method, which lacks either theoretical or empirical quality guarantees. In this paper, the ensemble subset selection problem is formulated as a quadratic integer programming problem. By applying semidefinite programming (SDP) as a solution technique, we are able to get better approximate solutions. Computational experiments show that this SDP-based pruning algorithm outperforms other heuristics in the literature. Its application in a classifier-sharing study also demonstrates the effectiveness of the method.
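In this formulation, a matrix G records how often pairs of ensemble members err together (with individual error rates on the diagonal), and pruning minimizes x'Gx over 0/1 vectors x with a fixed number of ones. The SDP relaxation is what makes this tractable at scale; for a tiny hypothetical ensemble the objective can simply be enumerated, which shows what it rewards (accuracy plus diversity):

```python
from itertools import combinations

# G[i][j] = fraction of examples misclassified by both member i and member j;
# the diagonal G[i][i] is member i's own error rate. Minimizing x'Gx over
# fixed-size subsets favors members that are individually accurate AND that
# do not repeat each other's mistakes. Brute force stands in here for the
# paper's SDP relaxation, which handles large ensembles.

def error_matrix(mistakes):
    """mistakes[i][n] = 1 if member i misclassifies example n."""
    m, n = len(mistakes), len(mistakes[0])
    return [[sum(mistakes[i][t] * mistakes[j][t] for t in range(n)) / n
             for j in range(m)] for i in range(m)]

def prune(mistakes, k):
    G = error_matrix(mistakes)
    def objective(subset):
        return sum(G[i][j] for i in subset for j in subset)  # x'Gx
    return min(combinations(range(len(mistakes)), k), key=objective)

# Three members: 0 and 1 are accurate but err on the same examples;
# 2 errs more often but on different examples (more diverse).
mistakes = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
]
print(prune(mistakes, 2))  # pairing one of {0, 1} with 2 minimizes x'Gx
```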
Decision Tree Induction from Distributed Heterogeneous Autonomous Data Sources
 In Proceedings of the Conference on Intelligent Systems Design and Applications (ISDA'03)
, 2003
Abstract

Cited by 16 (4 self)
This paper presents an approach to the design of systems for learning from distributed, autonomous and heterogeneous data sources. We precisely formulate a class of distributed learning problems; present a general strategy for transforming a large class of traditional machine learning algorithms into distributed learning algorithms; and demonstrate the application of this strategy to devise algorithms for decision tree induction (using a variety of splitting criteria) from distributed data. The resulting algorithms are provably exact in that the decision tree constructed from distributed data is identical to that obtained by the corresponding algorithm in the batch setting (i.e., when all of the data is available in a central location).
When distribution is part of the semantics: A new problem class for distributed knowledge discovery
 In Ubiquitous Data Mining for Mobile and Distributed Environments, a workshop associated with the Joint 12th European Conference on Machine Learning (ECML'01) and 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'01)
, 2001
Abstract

Cited by 14 (1 self)
Within a research project at DaimlerChrysler we use vehicles as mobile data sources for distributed knowledge discovery. We realized that current approaches are not suitable for our purposes. They aim to infer a global model and try to approximate the results one would get from a single joined data source. Thus, they treat distribution as a technical issue only and ignore that the distribution itself may have a meaning and that models depend on the context in which they were derived. The main contribution of this paper is the identification of a practically relevant new problem class for distributed knowledge discovery which addresses the semantics of distribution. We show that this problem class is the proper framework for many important applications in which it should become an integral part of the knowledge discovery process, affecting the results as well as the process itself. We outline a novel solution, called Knowledge Discovery from Models, which uses models as primary input and combines content driven and context driven analyses. Finally, we discuss challenging research questions, which are raised by the new problem class.
Multi-Database Mining
, 2003
Abstract

Cited by 13 (2 self)
Multi-database mining is an important research area because (1) there is an urgent need for analyzing data in different sources, (2) there are essential differences between mono-database and multi-database mining, and (3) there are limitations in existing multi-database mining efforts. This paper designs a new multi-database mining process. Some research issues involved in mining multi-databases, including database clustering and local pattern analysis, are discussed.
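Local pattern analysis, one of the issues discussed, avoids joining all databases: each database mines patterns locally and a synthesis step combines the local results. The sketch below uses one simple synthesizing rule, a size-weighted average of local supports; this weighting is illustrative, not necessarily the one the paper proposes, and it silently treats a pattern unreported by a database as having zero support there.

```python
# Each database mines rules locally and reports only (pattern, local support)
# pairs; the synthesis step estimates global support without ever joining
# the raw data. Patterns a database did not report (below its local
# threshold) contribute 0 from that database -- a known source of
# underestimation in local pattern analysis.

def synthesize(local_results, db_sizes):
    """local_results: one {pattern: local_support} dict per database."""
    total = sum(db_sizes)
    global_support = {}
    for result, size in zip(local_results, db_sizes):
        for pattern, supp in result.items():
            global_support[pattern] = (
                global_support.get(pattern, 0.0) + supp * size / total
            )
    return global_support

# Hypothetical local mining output from three databases.
local = [
    {"bread->milk": 0.40, "beer->chips": 0.10},
    {"bread->milk": 0.20},
    {"beer->chips": 0.50},
]
sizes = [1000, 3000, 1000]

est = synthesize(local, sizes)
print(est)
```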
Robust order statistics based ensemble for distributed data mining
 In Advances in Distributed and Parallel Knowledge Discovery
, 2000
Abstract

Cited by 13 (4 self)
Integrating the outputs of multiple classifiers via combiners or meta-learners has led to substantial improvements in several difficult pattern recognition problems. In the typical setting investigated until now, each classifier is trained on data taken or resampled from a common data set, or randomly selected partitions thereof, and thus experiences a similar quality of training data. However, in distributed data mining involving heterogeneous databases, the nature, quality and quantity of data available to each site/classifier may vary substantially, leading to large discrepancies in their performance. In this chapter we introduce and investigate a family of meta-classifiers based on order statistics, for robust handling of such cases. Based on a mathematical model of how the decision boundaries are affected by order-statistic combiners, we derive expressions for the reductions in error expected when such combiners are used. We show analytically that selecting the median, the maximum and, in general, the i-th order statistic improves classification performance. Furthermore, we introduce the trim and spread combiners, both based on linear combinations of the ordered classifier outputs, and empirically show that they are significantly superior in the presence of outliers or uneven classifier performance, so they can be fruitfully applied to several heterogeneous distributed data mining situations, especially when it is
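An order-statistic combiner sorts, per class, the scores produced by the individual classifiers and keeps a chosen order statistic (median, maximum, i-th), or a linear combination of the ordered outputs such as a trimmed mean. A minimal sketch with hypothetical posteriors, where one classifier plays the role of the outlier:

```python
import statistics

# Per class, sort the classifiers' scores and keep an order statistic.
# "trim" drops the extremes before averaging, so a single badly trained
# classifier cannot drag the combined score far off.

def combine(outputs, rule="median", trim=1):
    """outputs: per-classifier lists of class scores; returns combined scores."""
    per_class = list(zip(*outputs))           # gather each class's scores
    combined = []
    for scores in per_class:
        s = sorted(scores)
        if rule == "median":
            combined.append(statistics.median(s))
        elif rule == "max":
            combined.append(s[-1])
        elif rule == "trim":                  # drop `trim` lowest and highest
            kept = s[trim:len(s) - trim]
            combined.append(sum(kept) / len(kept))
    return combined

# Hypothetical posteriors from 5 classifiers over 2 classes; classifier 4
# is an outlier (e.g., trained on poor-quality data at its site).
outputs = [[0.8, 0.2], [0.7, 0.3], [0.9, 0.1], [0.75, 0.25], [0.1, 0.9]]

print(combine(outputs, "median"))  # robust to the outlier
print(combine(outputs, "trim"))
```

A plain average of the class-0 scores would be pulled down to 0.65 by the outlier, while both the median and the trimmed mean stay at 0.75.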
Learning classifiers from distributed, semantically heterogeneous, autonomous data sources
, 2004
Abstract

Cited by 9 (3 self)
Recent advances in computing, communications, and digital storage technologies, together with the development of high-throughput data acquisition technologies, have made it possible to gather and store large volumes of data in digital form. These developments have resulted in unprecedented opportunities for large-scale data-driven knowledge acquisition, with the potential for fundamental gains in scientific understanding (e.g., characterization of macromolecular structure-function relationships in biology) in many data-rich domains. In such applications, the data sources of interest are typically physically distributed, semantically heterogeneous and autonomously owned and operated, which makes it impossible to use traditional machine learning algorithms for knowledge acquisition.
However, we observe that most of the learning algorithms use only certain statistics computed from data in the process of generating the hypothesis that they output and we use this observation to design a general strategy for transforming traditional algorithms for learning from data into algorithms for learning from distributed data. The resulting algorithms are provably exact in that the classifiers produced by them are identical to those obtained by the corresponding algorithms in the centralized setting (i.e., when all of the data is available in a central location) and they compare favorably to their centralized counterparts in terms of time and communication complexity.
To deal with the semantic heterogeneity problem, we introduce ontology-extended data sources and define a user perspective consisting of an ontology and a set of interoperation constraints between the data source ontologies and the user ontology. We show how these constraints can be used to define the mappings and conversion functions needed to answer statistical queries over semantically heterogeneous data viewed from a particular user perspective. This is further used to extend our approach for learning from distributed data into a theoretically sound approach to learning from semantically heterogeneous data.
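A minimal illustration of this idea: interoperation constraints map each source's attribute values into user-ontology terms, and the derived conversion lets a statistical (count) query be answered uniformly across semantically heterogeneous sources. All ontologies, terms and data below are hypothetical.

```python
# Interoperation constraints: each source's attribute values map to (are
# subsumed by) terms in the user's ontology. Sources use different local
# vocabularies for the same user-level categories.
constraints = {
    "source_A": {"grad_student": "student", "undergrad": "student",
                 "professor": "staff"},
    "source_B": {"student": "student", "faculty": "staff",
                 "postdoc": "staff"},
}

def count_query(records_by_source, user_term):
    """Count records whose source-level value maps to `user_term`."""
    total = 0
    for source, records in records_by_source.items():
        mapping = constraints[source]   # conversion derived from constraints
        total += sum(1 for value in records if mapping.get(value) == user_term)
    return total

# Hypothetical data held at each source, in its own vocabulary.
data = {
    "source_A": ["grad_student", "undergrad", "professor", "undergrad"],
    "source_B": ["student", "postdoc", "faculty"],
}
print(count_query(data, "student"))  # three from A, one from B
```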
The work described above contributed to the design and implementation of AirlDM, a collection of data-source-independent machine learning algorithms built by means of sufficient statistics and data source wrappers, and to the design of INDUS, a federated, query-centric system for knowledge acquisition from distributed, semantically heterogeneous, autonomous data sources.