Results 1 - 9 of 9
Distributed Higher Order Association Rule Mining Using Information Extracted from Textual Data
- SIGKDD Explorations, Volume 7, 2005
Abstract - Cited by 12 (6 self)
The burgeoning amount of textual data in distributed sources combined with the obstacles involved in creating and maintaining central repositories motivates the need for effective distributed information extraction and mining techniques. Recently, as the need to mine patterns across distributed databases has grown, Distributed Association Rule Mining (D-ARM) algorithms have been developed. These algorithms, however, assume that the databases are either horizontally or vertically distributed. In the special case of databases populated from information extracted from textual data, existing D-ARM algorithms cannot discover rules based on higher-order associations between items in distributed textual documents that are neither vertically nor horizontally distributed, but rather a hybrid of the two. In this article we present D-HOTM, a framework for Distributed Higher Order Text Mining. D-HOTM is a hybrid approach that combines information extraction and distributed data mining. We employ a novel information extraction technique to extract meaningful entities from unstructured text in a distributed environment. The information extracted is stored in local databases and a mapping function is applied to identify globally unique keys. Based on the extracted information, a novel distributed association rule mining algorithm is applied to discover higher-order associations between items (i.e., entities) in records fragmented across the distributed databases using the keys. Unlike existing algorithms, D-HOTM requires neither knowledge of a global schema nor that the distribution of data be horizontal or vertical. Evaluation methods are proposed to incorporate the performance of the mapping function into the traditional support metric used in ARM evaluation. An example application of the algorithm on distributed law enforcement data demonstrates the relevance of D-HOTM in the fight against terrorism. 
Keywords: Distributed data mining, distributed association rule mining, knowledge discovery, artificial intelligence, machine learning, data mining, association rule mining, text mining, evaluation, privacy-preserving, terrorism, law enforcement, criminal justice
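As a rough illustration of the higher-order linking this abstract describes (not D-HOTM itself, whose mapping function and mining algorithm are more involved), entities extracted at two sites can be joined through a shared globally unique key, so that entities which never co-occur in a single record become second-order associated. The site dictionaries, key format, and entity labels below are all invented for the sketch:

```python
def higher_order_pairs(site_a, site_b):
    """Pair up entities from two local databases that share a global key.

    Entities linked only through the common key (never co-occurring in a
    single local record) form a second-order association.
    """
    pairs = set()
    for key, entities_a in site_a.items():
        for ent_a in entities_a:
            for ent_b in site_b.get(key, []):
                pairs.add((ent_a, ent_b))
    return pairs

# Hypothetical extracted entities, keyed by a globally unique identifier
site_a = {"ID-17": ["alias:J. Doe"], "ID-42": ["vehicle:ABC-123"]}
site_b = {"ID-17": ["address:5 Main St"]}

print(higher_order_pairs(site_a, site_b))
# {('alias:J. Doe', 'address:5 Main St')}
```

Note that "ID-42" produces no pair: without a matching key at the second site, no higher-order association can be formed, which is why the quality of the mapping function feeds into the support metric the abstract mentions.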
A Fast Online Learning Algorithm for Distributed Mining of
Abstract - Cited by 7 (4 self)
BigData analytics require that distributed mining of numerous data streams is performed in real-time. Unique challenges associated with designing such distributed mining systems are: online adaptation to incoming data characteristics, online processing of large amounts of heterogeneous data, limited data access and communication capabilities between distributed learners, etc. We propose a general framework for distributed data mining and develop an efficient online learning algorithm based on it. Our framework consists of an ensemble learner and multiple local learners, which can only access different parts of the incoming data. By exploiting the correlations of the learning models among local learners, our proposed learning algorithms can optimize the prediction accuracy while requiring significantly less information exchange and computational complexity than existing state-of-the-art learning solutions.
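A minimal sketch of the ensemble-plus-local-learners layout this abstract describes, assuming a generic Hedge-style exponential weighting rule in place of the paper's actual update (the learners, stream, and learning rate are invented for illustration):

```python
import math

def ensemble_predict(weights, local_preds):
    """Weighted average of the local learners' predictions."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, local_preds)) / total

def update_weights(weights, local_preds, label, eta=0.5):
    """Exponentially down-weight learners with large squared loss
    (a generic Hedge-style rule, standing in for the paper's algorithm)."""
    return [w * math.exp(-eta * (p - label) ** 2)
            for w, p in zip(weights, local_preds)]

# Two local learners, each seeing a different part of the data stream
weights = [1.0, 1.0]
stream = [([0.9, 0.1], 1.0), ([0.8, 0.3], 1.0)]  # (local predictions, label)
for preds, label in stream:
    print(round(ensemble_predict(weights, preds), 3))
    weights = update_weights(weights, preds, label)
# learner 0 tracks the labels more closely, so it ends up with the larger weight
```

Only the scalar weights travel between the local learners and the ensemble, which is the kind of reduced information exchange the abstract emphasizes.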
Cooperative Training for Attribute-Distributed Data: Trade-off Between Data Transmission and Performance
Abstract - Cited by 2 (1 self)
This paper introduces a modeling framework for distributed regression with agents/experts observing attribute-distributed data (heterogeneous data). Under this model, a new algorithm, the iterative covariance optimization algorithm (ICOA), is designed to reshape the covariance matrix of the training residuals of individual agents so that the linear combination of the individual estimators minimizes the ensemble training error. Moreover, a scheme (Minimax Protection) is designed to provide a trade-off between the number of data instances transmitted among the agents and the performance of the ensemble estimator without undermining the convergence of the algorithm. This scheme also provides an upper bound (with high probability) on the test error of the ensemble estimator. The efficacy of ICOA combined with Minimax Protection and the comparison between the upper bound and actual performance are both demonstrated by simulations.
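The linear-combination step can be illustrated with the classical minimum-variance weighting w = C⁻¹1 / (1ᵀC⁻¹1) over the residual covariance C. This is only the combination rule that ICOA's covariance reshaping targets, not the iterative algorithm itself, and the residuals below are synthetic:

```python
import numpy as np

def min_variance_weights(residuals):
    """Weights (summing to one) that minimize the variance of a linear
    combination of estimators, given an (n_samples, n_agents) matrix of
    training residuals: w = C^{-1} 1 / (1^T C^{-1} 1)."""
    cov = np.cov(residuals, rowvar=False)
    ones = np.ones(cov.shape[0])
    w = np.linalg.solve(cov, ones)
    return w / w.sum()

# Three synthetic agents with independent residuals of increasing noise level
rng = np.random.default_rng(0)
residuals = rng.normal(size=(200, 3)) * np.array([1.0, 2.0, 4.0])
w = min_variance_weights(residuals)
print(np.round(w, 2))  # the least noisy agent receives the largest weight
```

For uncorrelated residuals this reduces to inverse-variance weighting; ICOA's point is to reshape the off-diagonal structure of C so that this optimal combination achieves a lower ensemble error.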
Attribute-distributed learning: The iterative covariance optimization algorithm and its applications
Abstract
This paper introduces a framework for multivariate regression with attribute-distributed data on a distributed system with a fusion center. Unlike other types of algorithms for attribute-distributed learning that directly refit the ensemble residual or average among the predictions of the agents, the new algorithm, the iterative covariance optimization algorithm (ICOA), coordinates the agents to reshape the covariance matrix of the individual training residuals so that the ensemble estimator, a linear combination of the individual estimators, minimizes the ensemble training error. Moreover, ICOA empirically demonstrates strong insusceptibility to overtraining, especially compared with residual refitting algorithms. Extensive simulations on both artificial and real datasets indicate that ICOA consistently outperforms weighted averaging algorithms and residual refitting algorithms.
A Distributed Approach for Prediction in Sensor Networks
Abstract
Sensor networks in which the sensors are capable of local computation create the possibility of training and using predictors in a distributed way. We have previously shown that global predictors based on voting the predictions of local predictors each trained on one (or a few) attributes can be as accurate as a centralized predictor. We extend this approach to sensor networks. We also show that, when the target function drifts over time, sensors are able to make local decisions about when they need to relearn to capture the changing class boundaries.
Decentralized Online Big Data Classification - a Bandit Framework
Abstract
Distributed, online data mining systems have emerged as a result of applications requiring analysis of large amounts of correlated and high-dimensional data produced by multiple distributed data sources. We propose a distributed online data classification framework where data is gathered by distributed data sources and processed by a heterogeneous set of distributed learners which learn online, at run-time, how to classify the different data streams, either by using their locally available classification functions or by helping each other by classifying each other's data. Importantly, since the data is gathered at different locations, sending data to another learner for processing incurs additional costs such as delays, and hence this is only beneficial if the gain from better classification exceeds the costs. We assume that the classification functions available to each processing element are fixed, but their prediction accuracies for various types of incoming data are unknown, can change dynamically over time, and thus need to be learned online. We model the problem of joint classification by the distributed and heterogeneous learners from multiple data sources as a cooperative contextual bandit problem in which each data instance is characterized by a specific context. We develop distributed online learning algorithms for which we can prove sublinear regret. Compared to prior work in distributed online data mining, our work is the first to provide analytic regret results characterizing the performance of the proposed algorithms.
Index Terms: distributed online learning, Big Data mining, online classification, exploration-exploitation tradeoff, decentralized classification, contextual bandits
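A toy rendition of the classify-locally-or-forward decision as a bandit over discrete contexts, using a plain UCB1-style index with a fixed forwarding cost. The contexts, costs, and accuracies below are made up, and the paper's cooperative algorithms and regret analysis are not reproduced here:

```python
import math
from collections import defaultdict

class ContextualUCB:
    """UCB1-style learner per discrete context; each arm is a classifier
    (the local one, or a remote learner whose use incurs a fixed cost)."""

    def __init__(self, n_arms, costs):
        self.n_arms, self.costs = n_arms, costs
        self.counts = defaultdict(lambda: [0] * n_arms)
        self.acc_sum = defaultdict(lambda: [0.0] * n_arms)

    def select(self, ctx):
        counts, acc = self.counts[ctx], self.acc_sum[ctx]
        t = sum(counts) + 1
        best_idx, best_val = 0, -math.inf
        for a in range(self.n_arms):
            if counts[a] == 0:
                return a  # try every arm once per context
            bonus = math.sqrt(2 * math.log(t) / counts[a])
            val = acc[a] / counts[a] - self.costs[a] + bonus
            if val > best_val:
                best_idx, best_val = a, val
        return best_idx

    def update(self, ctx, arm, accuracy):
        self.counts[ctx][arm] += 1
        self.acc_sum[ctx][arm] += accuracy

# Arm 0: local classifier (free, accuracy 0.6); arm 1: remote (cost 0.1, accuracy 0.9)
bandit = ContextualUCB(n_arms=2, costs=[0.0, 0.1])
accuracy = [0.6, 0.9]
for _ in range(200):
    arm = bandit.select("image-stream")
    bandit.update("image-stream", arm, accuracy[arm])
# forwarding pays off here (0.9 - 0.1 > 0.6), so the remote arm dominates
```

The cost term is subtracted directly from the estimated accuracy, so an arm is preferred only when its classification gain exceeds the delay cost of shipping the data, mirroring the trade-off the abstract highlights.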
Distributed Online Big Data Classification Using Context Information
Abstract
Distributed, online data mining systems have emerged as a result of applications requiring analysis of large amounts of correlated and high-dimensional data produced by multiple distributed data sources. We propose a distributed online data classification framework where data is gathered by distributed data sources and processed by a heterogeneous set of distributed learners which learn online, at run-time, how to classify the different data streams, either by using their locally available classification functions or by helping each other by classifying each other's data. Importantly, since the data is gathered at different locations, sending data to another learner for processing incurs additional costs such as delays, and hence this is only beneficial if the gain from better classification exceeds the costs. We model the problem of joint classification by the distributed and heterogeneous learners from multiple data sources as a distributed contextual bandit problem in which each data instance is characterized by a specific context. We develop a distributed online learning algorithm for which we can prove sublinear regret. Compared to prior work in distributed online data mining, our work is the first to provide analytic regret results characterizing the performance of the proposed algorithm.
Toward parallel feature selection from vertically partitioned data
Abstract
Feature selection is often required as a preliminary step for many pattern recognition problems. In recent years, parallel learning has been the focus of much attention due to the advent of high dimensionality. Still, most of the existing algorithms only work in a centralized manner, i.e. using the whole dataset at once. This paper proposes a parallel filter approach for vertically partitioned data. The idea is to split the data by features and then apply a filter at each partition, performing several rounds to obtain a stable set of features. Later, a merging procedure is carried out to combine the results into a single subset of relevant features. Experiments on three representative datasets show that the execution time is considerably shortened whereas the performance is maintained or even improved compared to the standard algorithms applied to the non-partitioned datasets. The proposed approach can be used with any filter algorithm, so it could be seen as a general framework for parallel feature selection.
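A minimal sketch of the split-filter-merge idea, assuming a plain correlation filter, a single ranking round per partition, and sequential execution in place of real parallelism (the paper's filter choice, multiple rounds, and merging procedure are more elaborate):

```python
import numpy as np

def parallel_filter(X, y, n_partitions, k_per_partition):
    """Split features column-wise, rank each partition independently with a
    correlation filter, and merge the survivors into one feature subset."""
    feature_ids = np.arange(X.shape[1])
    selected = []
    for part in np.array_split(feature_ids, n_partitions):
        # each partition could run on a separate worker in a real deployment
        scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in part]
        ranked = part[np.argsort(scores)[::-1]]
        selected.extend(int(j) for j in ranked[:k_per_partition])
    return sorted(selected)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = 3 * X[:, 0] + 3 * X[:, 5] + rng.normal(size=300)  # only features 0 and 5 matter
print(parallel_filter(X, y, n_partitions=2, k_per_partition=1))
# [0, 5]
```

Each worker sees only its own columns of X plus the target, which is what makes the approach fit vertically partitioned data: no partition ever needs the full feature matrix.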