Results 1–10 of 13
A middleware for developing parallel data mining implementations
In Proceedings of the First SIAM Conference on Data Mining, 2001
Cited by 21 (13 self)
Abstract: Data mining is an interdisciplinary field, with applications in diverse areas such as bioinformatics, medical informatics, scientific data analysis, financial analysis, and consumer profiling. In each of these application domains, the amount of data available for analysis has exploded in recent years, making the scalability of data …
A local facility location algorithm for sensor networks
In DCOSS ’05, 2005
Cited by 15 (4 self)
Abstract: In this paper we address a well-known facility location problem (FLP) in a sensor network environment. The problem deals with finding the optimal way to provide service to a (possibly) very large number of clients. We show that a variation of the problem can be solved using a local algorithm. Local algorithms are extremely useful in a sensor network scenario, because they allow the communication range of each sensor to be restricted to a minimum, they can operate in routerless networks, and they allow complex problems to be solved on the basis of very little information gathered from nearby sensors. The local facility location algorithm we describe is entirely asynchronous, seamlessly supports failures and changes in the data during the calculation, poses modest memory and computational requirements, and can provide an anytime solution that is guaranteed to converge to exactly the solution a centralized algorithm would compute given the entire data.
A Local Facility Location Algorithm for Large-Scale Distributed Systems
Journal of Grid Computing, 2007
Cited by 11 (2 self)
Abstract: In the facility location problem (FLP) we are given a set of facilities and a set of clients, each of which is to be served by one facility. The goal is to decide which subset of facilities to open such that the clients are served at minimal cost. In this paper we investigate the FLP in a setting where the cost depends on data known only to peer nodes. This setting typifies modern distributed systems: peer-to-peer file-sharing networks, grid systems, and wireless sensor networks. All of them need to perform network organization, data placement, collective power management, and other tasks of this kind. We propose a local and efficient algorithm that solves the FLP in these settings. The algorithm presented here is extremely scalable, entirely decentralized, requires no routing capabilities, and is resilient to failures and changes in the data throughout its execution.
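The local, decentralized algorithm these two papers describe is not reproduced in the abstracts. As a rough illustration of the objective being optimized, here is a minimal sketch of a classical centralized greedy heuristic for the uncapacitated FLP; the `unserved` penalty and the cost-matrix layout are illustrative assumptions, not details from the papers.

```python
def greedy_flp(open_costs, conn_costs, unserved=1e9):
    """Greedy centralized baseline for the uncapacitated FLP.

    open_costs[i]    -- cost of opening facility i
    conn_costs[i][j] -- cost of serving client j from facility i
    'unserved' is a large penalty for a client with no open facility.
    Returns (set of opened facilities, total cost).
    """
    n_cli = len(conn_costs[0])
    opened, best = set(), [unserved] * n_cli
    while True:
        cand, cand_gain = None, 0.0
        for i in range(len(open_costs)):
            if i in opened:
                continue
            # Net saving from opening facility i, given what is open now.
            gain = sum(max(0.0, best[j] - conn_costs[i][j])
                       for j in range(n_cli)) - open_costs[i]
            if gain > cand_gain:
                cand, cand_gain = i, gain
        if cand is None:          # no remaining facility reduces total cost
            break
        opened.add(cand)
        best = [min(best[j], conn_costs[cand][j]) for j in range(n_cli)]
    total = sum(open_costs[i] for i in opened) + sum(best)
    return opened, total
```

The local algorithms in the papers are designed to converge to the same solution a centralized procedure would compute, using only information gathered from nearby nodes.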
Learning classifiers from distributed, semantically heterogeneous, autonomous data sources
2004
Cited by 9 (3 self)
Abstract: Recent advances in computing, communications, and digital storage technologies, together with the development of high-throughput data acquisition technologies, have made it possible to gather and store large volumes of data in digital form. These developments have resulted in unprecedented opportunities for large-scale data-driven knowledge acquisition, with the potential for fundamental gains in scientific understanding (e.g., characterization of macromolecular structure-function relationships in biology) in many data-rich domains. In such applications, the data sources of interest are typically physically distributed, semantically heterogeneous, and autonomously owned and operated, which makes it impossible to use traditional machine learning algorithms for knowledge acquisition.
However, we observe that most learning algorithms use only certain statistics computed from the data in the process of generating their output hypothesis, and we use this observation to design a general strategy for transforming traditional algorithms for learning from data into algorithms for learning from distributed data. The resulting algorithms are provably exact, in that the classifiers they produce are identical to those obtained by the corresponding algorithms in the centralized setting (i.e., when all of the data is available in a central location), and they compare favorably to their centralized counterparts in terms of time and communication complexity.
To deal with the problem of semantic heterogeneity, we introduce ontology-extended data sources and define a user perspective consisting of an ontology and a set of interoperation constraints between data source ontologies and the user ontology. We show how these constraints can be used to define the mappings and conversion functions needed to answer statistical queries over semantically heterogeneous data viewed from a given user perspective. This is further used to extend our approach for learning from distributed data into a theoretically sound approach to learning from semantically heterogeneous data.
The work described above contributed to the design and implementation of AirlDM, a collection of data-source-independent machine learning algorithms built by means of sufficient statistics and data source wrappers, and to the design of INDUS, a federated, query-centric system for knowledge acquisition from distributed, semantically heterogeneous, autonomous data sources.
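The thesis's algorithms are not given in the abstract; a minimal sketch of the sufficient-statistics idea, using naive Bayes with binary features as an assumed example learner, shows why such transformations can be provably exact: each site sends only counts, the coordinator sums them, and the resulting classifier is identical to one trained on the pooled data.

```python
from collections import Counter

def local_counts(rows):
    """Sufficient statistics from one site: class counts and
    (class, feature index, feature value) counts."""
    cls, feat = Counter(), Counter()
    for label, x in rows:
        cls[label] += 1
        for j, v in enumerate(x):
            feat[(label, j, v)] += 1
    return cls, feat

def merge(stats):
    """Coordinator: sum the per-site statistics."""
    cls, feat = Counter(), Counter()
    for c, f in stats:
        cls.update(c)
        feat.update(f)
    return cls, feat

def nb_predict(cls, feat, x, alpha=1.0):
    """Naive Bayes prediction from merged counts (Laplace smoothing,
    binary features assumed). Identical to training on pooled data,
    because the counts are the only statistics the learner needs."""
    total = sum(cls.values())
    best, best_p = None, 0.0
    for label, n in cls.items():
        p = n / total
        for j, v in enumerate(x):
            p *= (feat[(label, j, v)] + alpha) / (n + 2 * alpha)
        if p > best_p:
            best, best_p = label, p
    return best
```

The exactness claim falls out directly: merging per-site counts yields the same statistics as counting over the union of the sites' data.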
Approximate kernel k-means: Solution to large scale kernel clustering
In Proceedings of the International Conference on Knowledge Discovery and Data Mining
Cited by 3 (1 self)
Abstract: The digital data explosion mandates the development of scalable tools to organize the data in a meaningful and easily accessible form. Clustering is a commonly used tool for data organization. However, many clustering algorithms designed to handle large data sets assume linear separability of the data and hence do not perform well on real-world data sets. While kernel-based clustering algorithms can capture the nonlinear structure in data, they do not scale well in terms of speed and memory requirements when the number of objects to be clustered exceeds tens of thousands. We propose an approximation scheme for kernel k-means, termed approximate kernel k-means, that reduces both the computational complexity and the memory requirements by employing a randomized approach. We show, both analytically and empirically, that the performance of approximate kernel k-means is similar to that of the kernel k-means algorithm, but with dramatically reduced run-time complexity and memory requirements.
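The paper's exact approximation scheme is not spelled out in the abstract. A sketch in the same spirit (a common landmark-style shortcut, not necessarily the paper's construction): sample m random landmarks and work with the n x m kernel matrix against the landmarks instead of the full n x n matrix. The RBF kernel, gamma value, and farthest-point initialization below are illustrative assumptions.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    """Pairwise RBF kernel between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def landmark_kernel_kmeans(X, k, m, iters=20, seed=0):
    """Cluster n points using kernel values against only m << n random
    landmarks: an n x m kernel matrix instead of the full n x n one."""
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=m, replace=False)]
    F = rbf(X, landmarks)                      # n x m landmark features
    # Farthest-point initialization of the k centers in feature space.
    centers = [F[0]]
    for _ in range(k - 1):
        d = ((F[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(1)
        centers.append(F[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((F[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = F[labels == c].mean(0)
    return labels
```

The memory saving is the point: for n in the millions and m in the hundreds, the n x m matrix fits where the n x n one cannot.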
CONQUEST: A distributed tool for constructing summaries of high-dimensional discrete-attributed datasets
In Proc. 4th SIAM Intl. Conf. on Data Mining (SDM ’04), 2004
Cited by 2 (2 self)
Abstract: The problem of constructing bounded-error summaries of binary-attributed data of very high dimensionality is an important and difficult one. These summaries enable more expensive analysis techniques to be applied efficiently with little loss in accuracy. Recent work in this area has used discrete linear algebraic transforms to construct such summaries efficiently. This paper addresses the problem of constructing summaries of distributed datasets. Specifically, the problem can be stated as follows: given a set of n discrete-attributed vectors distributed across p sites, construct a summary of k << n vectors such that each of the input vectors is within a given bounded distance of some output vector. In addition to being algorithmically efficient (i.e., doing no more work than the corresponding serial algorithm), the distributed formulation must have low parallelization overheads. We present CONQUEST, a tool that achieves excellent performance and scalability for summarizing distributed datasets. In contrast to traditional parallel techniques that distribute the kernel operations, CONQUEST uses a less aggressive parallel formulation that relies on the principle of sampling to reduce communication overhead while maintaining high accuracy. Specifically, each individual site computes its local patterns independently. Sites then cooperate within dynamically orchestrated workgroups to construct consensus patterns from these local patterns, and each site decides whether to participate in the consensus or leave the group. Experimental results on a set of Intel Xeon servers demonstrate that this strategy achieves excellent performance in terms of compression time, compression ratio, and accuracy with respect to post-processing tasks. The communication overhead associated with CONQUEST is also shown to be minimal, making it well suited to wide-area deployment.
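CONQUEST itself uses discrete linear algebraic transforms; as a much simpler illustration of the bounded-error summary guarantee the abstract states, here is a naive greedy one-pass cover over binary vectors. The Hamming distance and the one-pass "leader" strategy are illustrative assumptions, not the paper's method.

```python
def hamming(a, b):
    """Number of positions where two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def bounded_error_summary(vectors, eps):
    """Greedy one-pass cover: keep a vector as a new representative
    unless it is already within eps of an existing one. Every input
    vector therefore ends up within eps of some summary vector."""
    summary = []
    for v in vectors:
        if not any(hamming(v, r) <= eps for r in summary):
            summary.append(v)
    return summary
```

The invariant this maintains (each input within bounded distance of some output vector) is exactly the summary property the paper's k << n formulation requires, though the transforms it uses achieve far better compression.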
Parallelism in knowledge discovery techniques
LNCS 2367: Applied Parallel Computing, 6th International Conference PARA ’02, 2002
Cited by 1 (0 self)
Abstract: Knowledge discovery in databases, or data mining, is the semi-automated analysis of large volumes of data, looking for the relationships and knowledge that are implicit in the data and are ’interesting’ in the sense of impacting an organization’s practice. Data mining and knowledge discovery on large amounts of data can benefit from the use of parallel computers, both to improve performance and to improve the quality of data selection. This paper presents and discusses the different forms of parallelism that can be exploited in data mining techniques and algorithms. For the main data mining techniques, such as rule induction, clustering algorithms, decision trees, genetic algorithms, and neural networks, the possible ways to exploit parallelism are presented and discussed in detail. Finally, some promising research directions in parallel data mining are outlined.
Dynamic Load Balancing in Parallel KD-Tree k-Means
The 10th IEEE International Conference on Scalable Computing and Communications (SCALCOM 2010), Bradford, UK, 29 June – 1 July 2010
Abstract: One of the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been explored largely in two main directions. First, the amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree); these techniques reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance; this issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested: two are based on a static partitioning of the data set, and a third incorporates a dynamic load balancing policy.
Keywords: Dynamic Load Balancing; Parallel k-Means; KD-Trees; Clustering
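The KD-Tree variants discussed in these two entries (the filtering-style algorithms) build the tree over the data points; as a simpler sketch of the same idea of replacing the all-pairs point-to-centroid comparison with a tree search, the toy version below builds a KD-tree over the k centroids instead, using SciPy. This is an illustrative simplification, not the papers' formulation.

```python
import numpy as np
from scipy.spatial import cKDTree

def kdtree_kmeans(X, k, iters=20, seed=0):
    """k-means where the nearest-centroid search at each iteration goes
    through a KD-tree built over the k centroids, instead of comparing
    every point against every centroid (roughly O(n log k) vs O(n k))."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        _, labels = cKDTree(centers).query(X)   # nearest centroid per point
        for c in range(k):
            pts = X[labels == c]
            if len(pts):
                centers[c] = pts.mean(0)
    return centers, labels
```

The load-imbalance problem the papers address arises when such tree-pruned workloads are split across processing nodes: pruning makes the per-node work irregular and data-dependent.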
Scalability of Efficient Parallel K-Means
5th IEEE International Conference on e-Science, Workshop on Computational e-Science, 2009
Abstract: Clustering is defined as the grouping of similar items in a set, and is an important process within the field of data mining. As the amount of data for various applications continues to increase, in terms of both size and dimensionality, efficient clustering methods become necessary. A popular clustering algorithm is K-Means, which adopts a greedy approach to produce a set of K clusters with associated centres of mass, and uses a squared-error distortion measure to determine convergence. Methods for improving the efficiency of K-Means have been explored largely in two main directions. First, the amount of computation can be significantly reduced by adopting a more efficient data structure, notably a multidimensional binary search tree (KD-Tree), to store either centroids or data points. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance; this issue has so far limited the adoption of these efficient K-Means techniques in parallel computational environments. In this work, we provide a parallel formulation for the KD-Tree based K-Means algorithm and address its load balancing issues.
Clustered Distributed Index for Efficient Text Retrieval Using Threads
Abstract: In this research paper, a novel method for improving clustered distributed indices for efficient text retrieval using threads is presented. In text retrieval, text search refers to a technique for searching a stored document or database. In a full-text search, the search engine examines all the words in every stored document as it tries to match the search words supplied by the user. When dealing with a small number of documents, a full-text search engine performs a serial scan, directly scanning the contents of the documents for each query. When the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks, viz. indexing and searching. The indexing stage scans the text of all the documents and builds a list of search terms, often called an index. In the search stage, when performing a specific query, only the index is referenced rather than the text of the original documents. With all the above criteria in mind, this paper aims at improving the search time on the index by clustering the index; threads are used to perform a parallel search on each of these clusters. The algorithm, developed in C, has been tested on various sizes of data and queries and compared with the sequential search method. The results clearly show that this approach improves search time significantly, and that the proposed method has an efficacy and effectiveness that can carry over to real-time applications.
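The paper's C implementation is not shown in the abstract; the sketch below illustrates the same scheme in Python under stated assumptions (a term -> document-id inverted index per cluster, one thread per cluster, single-term queries). The function names and data layout are illustrative, not from the paper.

```python
import threading
from collections import defaultdict

def build_cluster_indexes(clusters):
    """One inverted index (term -> set of doc ids) per document cluster.
    Each cluster is a list of (doc_id, text) pairs."""
    indexes = []
    for docs in clusters:
        idx = defaultdict(set)
        for doc_id, text in docs:
            for term in text.lower().split():
                idx[term].add(doc_id)
        indexes.append(idx)
    return indexes

def parallel_search(indexes, term):
    """Query every cluster index in its own thread and merge the hits."""
    hits, lock = set(), threading.Lock()

    def worker(idx):
        found = idx.get(term.lower(), set())
        with lock:                 # guard the shared result set
            hits.update(found)

    threads = [threading.Thread(target=worker, args=(idx,)) for idx in indexes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return hits
```

Each thread touches only its own cluster's (smaller) index, which is the source of the speedup over a single sequential scan of one monolithic index.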