Results 1-10 of 41
Collective Data Mining: A New Perspective Toward Distributed Data Analysis
 Advances in Distributed and Parallel Knowledge Discovery
, 1999
Cited by 83 (14 self)
This paper introduces the collective data mining (CDM) framework, a new approach toward distributed data mining (DDM) from heterogeneous sites. It points out that naive approaches to distributed data analysis in a heterogeneous environment may result in ambiguous or incorrect global data models. It also notes that any function can be expressed in a distributed fashion using a set of appropriate basis functions, and that orthogonal basis functions can be effectively used for developing a general DDM framework that guarantees correct local analysis and correct aggregation of local data models with minimal data communication. This paper develops the foundation of CDM, discusses decision tree learning and polynomial regression in CDM for discrete and continuous variables, and describes BODHI, a CDM-based experimental system for distributed knowledge discovery. 1 Introduction Distributed data mining (DDM) is a fast growing area that deals with the problem of finding data patterns in a...
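The orthogonal-basis idea in this abstract can be illustrated with a small sketch. It is not the paper's algorithm. It assumes a Walsh basis over two Boolean features, a made-up target function, and two sites that each hold one feature column plus the label column. The names `walsh`, `coefficient`, `target`, and `predict` are hypothetical. Coefficients whose basis function depends on one site's feature are computed locally; only the cross term needs joined rows.

```python
import itertools

def walsh(j, x):
    """Walsh basis function psi_j(x) = (-1)^(j . x) for equal-length bit tuples."""
    return -1 if sum(a & b for a, b in zip(j, x)) % 2 else 1

def coefficient(j, rows, labels):
    """Estimate w_j as the mean of f(x) * psi_j(x) over the observed rows."""
    return sum(y * walsh(j, x) for x, y in zip(rows, labels)) / len(rows)

def target(x1, x2):
    """Hypothetical function to learn; the x1*x2 term spans both sites."""
    return 2 * x1 + 4 * x1 * x2

# Conceptual full table; in this sketch no single site stores both features.
table = list(itertools.product([0, 1], repeat=2))
labels = [target(x1, x2) for x1, x2 in table]

# Assumption: site A stores only x1, site B only x2; both see the labels.
site_a = [(x1,) for x1, _ in table]
site_b = [(x2,) for _, x2 in table]

coeffs = {
    (0, 0): coefficient((0,), site_a, labels),  # constant term, any site
    (1, 0): coefficient((1,), site_a, labels),  # depends on x1 only
    (0, 1): coefficient((1,), site_b, labels),  # depends on x2 only
    # The cross term needs joined rows, so these rows must be communicated.
    (1, 1): coefficient((1, 1), table, labels),
}

def predict(x):
    """Global model: sum of w_j * psi_j(x) over the collected coefficients."""
    return sum(w * walsh(j, x) for j, w in coeffs.items())

reconstructed = [predict(x) for x in table]
```

With only four rows the cross term here uses the whole table; in a larger setting one would estimate it from a sample of joined rows instead.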
Distributed Clustering Using Collective Principal Component Analysis
 Knowledge and Information Systems
, 1999
Cited by 49 (9 self)
This paper considers distributed clustering of high dimensional heterogeneous data using a distributed Principal Component Analysis (PCA) technique called the Collective PCA. It presents the Collective PCA technique that can be used independent of the clustering application. It shows a way to integrate the Collective PCA with a given off-the-shelf clustering algorithm in order to develop a distributed clustering technique. It also presents experimental results using different test data sets including an application for web mining.
Distributed Data Mining: Algorithms, Systems, and Applications
, 2002
Cited by 49 (4 self)
This paper presents a brief overview of DDM algorithms, systems, applications, and emerging research directions. The paper is organized as follows. We first present the related research of DDM and illustrate data distribution scenarios. Then DDM algorithms are reviewed. Subsequently, the architectural issues in DDM systems and future directions are discussed.
Collective, Hierarchical Clustering from Distributed, Heterogeneous Data
, 1999
Cited by 44 (8 self)
This paper presents the Collective Hierarchical Clustering (CHC) algorithm for analyzing distributed, heterogeneous data. This algorithm first generates local cluster models and then combines them to generate the global cluster model of the data. The proposed algorithm runs in O(|S|n^2) time, with an O(|S|n) space requirement and an O(n) communication requirement, where n is the number of elements in the data set and |S| is the number of data sites. This approach shows significant improvement over naive methods with O(n^2) communication costs in the case that the entire distance matrix is transmitted and O(nm) communication costs to centralize the data, where m is the total number of features. A specific implementation based on the single link clustering and results comparing its performance with that of a centralized clustering algorithm are presented. An analysis of the algorithm complexity, in terms of overall computation time and communication requirements, is pres...
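To give a feel for the communication bounds quoted in this abstract, the sketch below plugs example sizes into the three asymptotic expressions. The function name `communication_costs` and the values of n and m are assumptions for illustration. Constant factors are ignored, so the numbers only indicate relative scale.

```python
def communication_costs(n, m):
    """Return order-of-magnitude counts of values sent under each scheme.

    n: number of elements in the data set
    m: total number of features
    Constant factors are ignored (assumption of this sketch).
    """
    return {
        "collective, O(n)": n,
        "distance matrix, O(n^2)": n * n,
        "centralized data, O(nm)": n * m,
    }

# Hypothetical sizes chosen only for illustration.
costs = communication_costs(n=10_000, m=50)
for scheme, values_sent in costs.items():
    print(f"{scheme}: {values_sent:,} values")
```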
Efficient phrase-based document indexing for Web document clustering
 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 2004
Cited by 37 (2 self)
Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, more informative features including phrases and their weights are particularly important in such scenarios. Document clustering is particularly useful in many applications such as automatic categorization of documents, grouping search engine results, building a taxonomy of documents, and others. This paper presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the Document Index Graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pairwise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
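The incremental phrase matching described in this abstract can be sketched roughly as follows. This is a much-simplified illustration, not the paper's Document Index Graph. The class `PhraseIndex`, its `add` method, and the whitespace tokenization are all assumptions. It indexes adjacent word pairs and, as each document is added, reports the multi-word phrases it shares with previously indexed documents.

```python
from collections import defaultdict

class PhraseIndex:
    """Hypothetical incremental index over adjacent word pairs (edges)."""

    def __init__(self):
        # edge (w1, w2) -> {doc_id: [positions of w1 in that document]}
        self.edges = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        """Index a document; return phrases shared with earlier documents."""
        words = text.lower().split()
        shared = defaultdict(set)
        # active[(other_doc, pos)] = length in words of a running match whose
        # last matched edge starts at position pos in other_doc.
        active = {}

        def flush(finished, end):
            # Record matches that end at word index `end` of this document.
            for (other, _), length in finished.items():
                shared[other].add(" ".join(words[end + 1 - length:end + 1]))

        for i in range(len(words) - 1):
            edge = (words[i], words[i + 1])
            next_active = {}
            for other, positions in self.edges.get(edge, {}).items():
                for p in positions:
                    # Extend a match that ended one position earlier, or
                    # start a new two-word match.
                    next_active[(other, p)] = active.get((other, p - 1), 1) + 1
            # Matches that were not extended by this edge are complete.
            done = {k: v for k, v in active.items()
                    if (k[0], k[1] + 1) not in next_active}
            flush(done, i)
            active = next_active
        flush(active, len(words) - 1)

        # Only now add this document's own edges, so it never matches itself.
        for i in range(len(words) - 1):
            self.edges[(words[i], words[i + 1])][doc_id].append(i)
        return dict(shared)

index = PhraseIndex()
first = index.add("d1", "river rafting trips are fun")
second = index.add("d2", "wild river rafting trips today")
```

A similarity measure could then weight the shared phrases by length or frequency; that step is left out because the abstract does not specify it.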
Papyrus: A System for Data Mining over Local and Wide Area Clusters and Super-Clusters
, 1999
Cited by 30 (4 self)
In this paper, we introduce a system called Papyrus for distributed data mining over commodity and high performance networks and give some preliminary experimental results about its performance. We are particularly interested in data mining over clusters of workstations, distributed clusters connected by high performance networks (super-clusters), and distributed clusters and super-clusters connected by commodity networks (meta-clusters).
Distributed Multivariate Regression Using Wavelet-based Collective Data Mining
 Journal of Parallel and Distributed Computing
, 1999
Cited by 25 (8 self)
This paper presents a method for distributed multivariate regression using wavelet-based Collective Data Mining (CDM). The method seamlessly blends machine learning and information theory with the statistical methods employed in multivariate regression to provide an effective data mining technique for use in a distributed data and computation environment. Evaluation of the method in terms of model accuracy as a function of appropriateness of the selected wavelet function, relative number of nonlinear cross-terms, and sample size demonstrates that accurate multivariate regression models can be generated from distributed, heterogeneous data sets with minimal data communication overhead compared to that required to aggregate a centralized data set. Application of this method to Linear Discriminant Analysis, which is closely related to multivariate regression, produced classification results on the Iris data set that are comparable to those obtained with centralized data analysis.
The Preliminary Design of Papyrus: A System for High Performance, Distributed Data Mining over Clusters, Meta-Clusters and Super-Clusters
 In Proceedings of the Workshop on Distributed Data Mining, along with KDD98
, 1999
Cited by 23 (6 self)
Data mining is a problem for which cluster computing provides a competitive alternative to specialized high performance computers for mining large data sets. Distributed clusters provide a natural infrastructure for mining large distributed data sets. Distributed clusters can be connected by commodity networks to form what we call meta-clusters and by high performance networks to form what we call super-clusters. In this paper, we describe the design of a system called Papyrus, which is designed for mining data distributed over clusters, meta-clusters, and super-clusters. We also describe some experimental results of a preliminary implementation.
Agent-Based Distributed Data Mining: The KDEC Scheme
Cited by 20 (1 self)
One key aspect of exploiting the huge amount of autonomous and heterogeneous data sources on the Internet is not only how to retrieve, collect and integrate relevant information, but also how to discover previously unknown, implicit and valuable knowledge. In recent years several approaches to distributed data mining and knowledge discovery have been developed, but only a few of them make use of intelligent agents.