Results 1–10 of 13
Scalable Statistical Monitoring of Fleet Data
, 2011
Abstract

Cited by 5 (1 self)
Abstract: This paper considers the problem of fitting regression models with mixed effects to historical fleet data, which arises in the context of statistical monitoring of data from a fleet (population) of similar units. A fleet-wide extension of the multivariable statistical process control approach is used to monitor for three different types of faults: a performance anomaly, a performance shift, and an anomalous unit. Our formulation requires the solution of a least-squares problem with very large numbers of both regressors (variables) and data measurements. For problems of interest, this least-squares problem cannot be solved using standard methods. We propose a method for solving the problem that is scalable to extremely large datasets, even ones that do not fit into the memory of a single computer system. Our method can be parallelized, but also works serially on a single processor. This approach is demonstrated in a simulated example for monitoring a fleet of aircraft from historical cruise flight data.
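The paper's own algorithm is not given in this abstract; as a minimal illustration of one standard way to solve a least-squares problem whose data matrix does not fit in memory, the sketch below accumulates the normal equations chunk by chunk (function names and the chunking scheme are my own, not the authors'). Each chunk contributes its block of AᵀA and Aᵀb, so the chunks can also be processed on separate machines and the partial sums combined.

```python
import numpy as np

def streaming_least_squares(chunks):
    """Accumulate the normal equations A^T A x = A^T b one row-block at a
    time, so the full data matrix never has to fit in memory."""
    AtA, Atb = None, None
    for A, b in chunks:  # each chunk: a block of rows of A and of b
        if AtA is None:
            AtA, Atb = A.T @ A, A.T @ b
        else:
            AtA += A.T @ A
            Atb += A.T @ b
    return np.linalg.solve(AtA, Atb)

# Toy check: recover coefficients from chunked synthetic data.
rng = np.random.default_rng(0)
x_true = np.array([2.0, -1.0, 0.5])
A = rng.normal(size=(300, 3))
b = A @ x_true
chunks = [(A[i:i + 50], b[i:i + 50]) for i in range(0, 300, 50)]
x_hat = streaming_least_squares(chunks)
```

Note that forming AᵀA squares the condition number, which is one reason large-scale solvers often prefer iterative methods; this sketch only shows the out-of-core idea.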
A Local Scalable Distributed Expectation Maximization Algorithm for Large Peer-to-Peer Networks
, 2009
Abstract

Cited by 2 (1 self)
This paper offers a local distributed algorithm for expectation maximization in large peer-to-peer environments. The algorithm can be used for a variety of well-known data mining tasks in a distributed environment, such as clustering, anomaly detection, and target tracking, to name a few. This technology is crucial for many emerging peer-to-peer applications in bioinformatics, astronomy, social networking, sensor networks, and web mining. Centralizing all or some of the data for building global models is impractical in such peer-to-peer environments because of the large number of data sources, the asynchronous nature of peer-to-peer networks, and the dynamic nature of the data and network. The distributed algorithm we have developed in this paper is provably correct, i.e., it converges to the same result as a similar centralized algorithm, and can automatically adapt to changes to the data and the network. We show that the communication overhead of the algorithm is very low due to its local nature. This monitoring algorithm is then used as a feedback loop to sample data from the network and rebuild the model when it is outdated. We present thorough experimental results to verify our theoretical claims.
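The key observation behind distributed EM is that the E-step decomposes over data: each peer can compute its local sufficient statistics, and only those small summaries need to be aggregated. The sketch below shows this for a two-component 1-D Gaussian mixture; plain summation stands in for the in-network aggregation protocol the paper develops, and all names are my own.

```python
import numpy as np

def peer_e_step(x, mu, var, pi):
    """Local E-step on one peer's data: responsibilities and the per-component
    sufficient statistics (soft counts, sums, sums of squares)."""
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = dens / dens.sum(axis=1, keepdims=True)
    return r.sum(0), (r * x[:, None]).sum(0), (r * x[:, None] ** 2).sum(0)

def distributed_em(peers, mu, var, pi, iters=50):
    """Combine peers' local statistics (summation here, standing in for the
    paper's peer-to-peer aggregation) and apply the usual M-step update."""
    for _ in range(iters):
        stats = [peer_e_step(x, mu, var, pi) for x in peers]
        n_k = sum(s[0] for s in stats)
        s_k = sum(s[1] for s in stats)
        q_k = sum(s[2] for s in stats)
        mu = s_k / n_k
        var = q_k / n_k - mu ** 2
        pi = n_k / n_k.sum()
    return mu, var, pi

# Two well-separated clusters, split across three peers.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 0.5, 150), rng.normal(5.0, 0.5, 150)])
rng.shuffle(data)
peers = np.array_split(data, 3)
mu, var, pi = distributed_em(peers, np.array([1.0, 4.0]),
                             np.array([1.0, 1.0]), np.array([0.5, 0.5]))
```

Because only the three small statistic vectors cross the network per round, the communication cost is independent of each peer's data volume, which is the property the paper's local algorithm exploits.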
Single-Pass Distributed Learning of Multi-Class SVMs using Core-Sets
Abstract

Cited by 2 (0 self)
We explore a technique to learn Support Vector Machine (SVM) models when training data is partitioned among several data sources. The basic idea is to consider SVMs which can be reduced to Minimal Enclosing Ball (MEB) problems in a feature space. Computation of such SVMs can be efficiently achieved by finding a core-set for the image of the data in the feature space. Our main result is that the union of local core-sets provides a close approximation to a global core-set from which the SVM can be recovered. The method hence requires only a single pass through each source of data in order to compute local core-sets and then recover the SVM from their union. Extensive simulations on small and large datasets are presented in order to evaluate its classification accuracy, transmission efficiency, and global complexity, comparing its results with a widely used single-pass heuristic to learn standard SVMs.
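To make the core-set idea concrete, the sketch below uses the classic Badoiu-Clarkson iteration for the Minimal Enclosing Ball (a standard MEB core-set construction; the paper's exact procedure and the kernel feature-space mapping are not shown here, and all names are my own). Each source computes a local core-set, and only the union of those small sets is shipped to one site and re-solved:

```python
import numpy as np

def meb_coreset(points, eps=0.05):
    """Badoiu-Clarkson sketch: pull the center toward the farthest point
    O(1/eps^2) times; the touched points form a small core-set whose
    enclosing ball is a (1+eps)-approximation of the exact MEB."""
    c = points[0].astype(float)
    core = {0}
    for t in range(1, int(np.ceil(1 / eps ** 2)) + 1):
        far = int(np.argmax(np.linalg.norm(points - c, axis=1)))
        core.add(far)
        c += (points[far] - c) / (t + 1)  # step toward the farthest point
    return c, sorted(core)

# Distributed use: local core-sets per source, then re-solve on their union.
angles = np.linspace(0, 2 * np.pi, 90, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])  # true MEB: center 0, radius 1
sources = np.array_split(circle, 3)
union = np.vstack([src[meb_coreset(src)[1]] for src in sources])
center, _ = meb_coreset(union)
radius = np.linalg.norm(circle - center, axis=1).max()
```

Only the union of core-sets, typically far smaller than the raw data, crosses the network, which is what makes the single-pass distributed scheme transmission-efficient.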
Scalable Distributed Change Detection from Astronomy Data Streams using Local, Asynchronous Eigen Monitoring Algorithms
In Proceedings of SDM’09
, 2009
Abstract

Cited by 1 (1 self)
This paper considers the problem of change detection using local distributed eigen monitoring algorithms for next-generation astronomy petascale data pipelines such as the Large Synoptic Survey Telescope (LSST). This telescope will take repeat images of the night sky every 20 seconds, thereby generating 30 terabytes of calibrated imagery every night that will need to be co-analyzed with other astronomical data stored at different locations around the world. Change point detection and event classification in such data sets may provide useful insights into unique astronomical phenomena displaying astrophysically significant variations: quasars, supernovae, variable stars, and potentially hazardous asteroids. However, performing such data mining tasks is a challenging problem for such high-throughput distributed data streams. In this paper we propose a highly scalable and distributed asynchronous algorithm for monitoring the principal components (PCs) of such dynamic data streams. We demonstrate the algorithm on a large set of distributed astronomical data to accomplish well-known astronomy tasks such as measuring variations in the fundamental plane of galaxy parameters. The proposed algorithm is provably correct (i.e., converges to the correct PCs without centralizing any data) and can seamlessly handle changes to the data or the network. Real experiments performed on Sloan Digital Sky Survey (SDSS) catalogue data show the effectiveness of the algorithm.
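The abstract does not spell out the monitoring algorithm; a minimal single-node sketch of the underlying idea is Oja's rule, a standard streaming estimator of the top principal component, with change detection done by comparing the running estimate to an agreed baseline PC (the distributed, asynchronous coordination is the paper's contribution and is not reproduced here; all names are my own).

```python
import numpy as np

def oja_step(w, x, lr=0.01):
    """One step of Oja's rule; w tracks the top principal component
    of the stream's covariance."""
    w = w + lr * (x @ w) * x
    return w / np.linalg.norm(w)

# A node streams readings whose dominant direction is the first axis,
# and would flag a change when its estimate drifts from the baseline.
rng = np.random.default_rng(2)
w = np.array([1.0, 1.0]) / np.sqrt(2)   # arbitrary start
baseline = np.array([1.0, 0.0])         # agreed-upon PC from the last global fit
for _ in range(3000):
    x = rng.normal(size=2) * np.array([3.0, 0.3])  # variance 9 vs 0.09
    w = oja_step(w, x)
alignment = abs(w @ baseline)           # near 1 means no eigenvector drift
```

A drop in `alignment` below some threshold would be the local trigger for the network-wide recomputation the paper describes.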
Privacy Preservation for Feature Selection in Data Mining using a Centralized Network
 IJCSI
Abstract

Cited by 1 (0 self)
This paper proposes feature selection with privacy preservation in a centralized network. Data privacy is preserved through a perturbation technique using alias names. In centralized data evaluation, data classification and feature selection are performed for a data mining decision model, which forms the structural information of the model in this paper. The gain ratio technique is applied to improve the performance of feature selection in the centralized computational task. Not all features need privacy preservation of their confidential data to obtain the best model. Chi-square testing is used for the classification of data by the centralized data mining model on its own processing unit. The alias data model for privacy-preserving data mining is used to develop a technique that builds the best model without violating the privacy of individuals. The proposed data miner task achieves the best feature selection, and two types of experimental tests are carried out in this paper.
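Of the two statistics the abstract names, gain ratio is the easier to show compactly. The sketch below is the standard C4.5-style gain ratio for a categorical feature, not the paper's specific pipeline (the perturbation/alias step is omitted, and all names are my own):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature, labels):
    """Information gain of a categorical feature over the class labels,
    normalized by the feature's own split entropy (C4.5-style)."""
    n = len(labels)
    cond, split = 0.0, 0.0
    for v in set(feature):
        sub = [l for f, l in zip(feature, labels) if f == v]
        p = len(sub) / n
        cond += p * entropy(sub)
        split -= p * math.log2(p)
    return (entropy(labels) - cond) / split if split > 0 else 0.0

perfect = gain_ratio(['a', 'a', 'b', 'b'], [0, 0, 1, 1])  # feature mirrors class
useless = gain_ratio(['a', 'a', 'a', 'a'], [0, 0, 1, 1])  # constant feature
```

Ranking features by this score and keeping the top ones is the usual way gain ratio drives feature selection; the chi-square test the paper also uses would score features by independence from the class instead.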
Distributed Monitoring of the R2 Statistic for Linear Regression
Abstract

Cited by 1 (0 self)
The problem of monitoring a multivariate linear regression model is relevant in studying the evolving relationship between a set of input variables (features) and one or more dependent target variables. This problem becomes challenging for large-scale data in a distributed computing environment when only a subset of instances is available at individual nodes and the local data changes frequently. Data centralization and periodic model recomputation can add high overhead to tasks like anomaly detection in such dynamic settings. Therefore, the goal is to develop techniques for monitoring and updating the model over the union of all nodes’ data in a communication-efficient fashion. Correctness guarantees on such techniques are also often highly desirable, especially in safety-critical application scenarios. In this paper we develop DReMo, a distributed algorithm with very low resource overhead, for monitoring the quality of a regression model in terms of its coefficient of determination (R2 statistic). When the nodes collectively determine that R2 has dropped below a fixed threshold, the linear regression model is recomputed via a network-wide convergecast and the updated model is broadcast back to all nodes. We show empirically, using both synthetic and real data, that our proposed method is highly communication-efficient and scalable, and also provide theoretical guarantees on correctness.
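The trigger logic the abstract describes (refit only when R2 falls below a threshold) can be sketched in a few lines. This is not DReMo itself: the centralized refit below stands in for the paper's convergecast-and-broadcast, and all names are my own.

```python
import numpy as np

def r2_statistic(X, y, w):
    """Coefficient of determination of the fixed linear model y ~ X @ w."""
    ss_res = ((y - X @ w) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def monitor(nodes, w, threshold=0.9):
    """Refit only when global R^2 of the current model falls below the
    threshold (a plain centralized refit here, standing in for the
    network-wide convergecast and broadcast)."""
    X = np.vstack([Xi for Xi, _ in nodes])
    y = np.concatenate([yi for _, yi in nodes])
    if r2_statistic(X, y, w) < threshold:
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 2))
nodes = [(X[:60], X[:60] @ np.array([1.0, 2.0])),
         (X[60:], X[60:] @ np.array([1.0, 2.0]))]
w = monitor(nodes, np.zeros(2))               # poor initial model triggers a refit
drifted = [(Xi, Xi @ np.array([3.0, -1.0])) for Xi, _ in nodes]
w2 = monitor(drifted, w)                      # concept drift triggers another refit
```

The communication savings come from evaluating the threshold condition distributively and paying for the expensive refit only when it actually fires.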
In-Network Outlier Detection in Wireless Sensor Networks
Abstract
To address the problem of unsupervised outlier detection in wireless sensor networks, we develop an approach that (1) is flexible with respect to the outlier definition, (2) computes the result in-network to reduce both bandwidth and energy consumption, (3) uses only single-hop communication, thus permitting very simple node failure detection and message reliability assurance mechanisms (e.g., carrier-sense), and (4) seamlessly accommodates dynamic updates to data. We examine performance by simulation, using real sensor data streams. Our results demonstrate that our approach is accurate and imposes reasonable communication and power consumption demands.
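One instance of the "flexible outlier definition" mentioned above is the classic distance-based definition: a reading is an outlier if too few other readings lie close to it. The sketch below applies that definition to the readings visible within one neighborhood (a plain local computation; the paper's in-network protocol and message handling are not shown, and the function name is my own).

```python
def distance_outliers(values, r, k):
    """Flag a reading as an outlier if fewer than k other readings lie
    within distance r of it (classic distance-based definition; the
    paper supports a more general, pluggable notion of outlier)."""
    flagged = []
    for i, v in enumerate(values):
        support = sum(1 for j, u in enumerate(values) if j != i and abs(u - v) <= r)
        if support < k:
            flagged.append(i)
    return flagged

# Temperature readings gathered over single-hop links; 35.0 is anomalous.
readings = [20.1, 20.3, 19.9, 20.2, 35.0]
outliers = distance_outliers(readings, r=1.0, k=2)
```

Because each node only needs its single-hop neighbors' readings to compute support counts, the decision can be made in-network without routing raw data to a base station.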
Scalable, Asynchronous, Distributed Eigen-Monitoring of Astronomy Data Streams
, 2011
Abstract
In this paper we develop a distributed algorithm for monitoring the principal components (PCs) for next-generation astronomy petascale data pipelines such as the Large Synoptic Survey Telescope (LSST). This telescope will take repeat images of the night sky every 20 seconds, thereby generating 30 terabytes of calibrated imagery every night that will need to be co-analyzed with other astronomical data stored at different locations around the world. Event detection, classification, and isolation in such data sets may provide useful insights into unique astronomical phenomena displaying astrophysically significant variations: quasars, supernovae, variable stars, and potentially hazardous asteroids. However, performing such data mining tasks is a challenging problem for such high-throughput distributed data streams. In this paper we propose a highly scalable and distributed asynchronous algorithm for monitoring the principal components (PCs) of such dynamic data streams and discuss a prototype web-based system, PADMINI (Peer-to-Peer Astronomy Data Mining), which implements this algorithm for use by astronomers. We demonstrate the algorithm on a large set of distributed astronomical data to accomplish well-known astronomy tasks such as measuring variations in the fundamental plane of galaxy parameters. The proposed algorithm is provably correct (i.e., converges to the correct PCs without centralizing any data).
Constructing a Scalable Local Distributed Decision Tree Algorithm for Heterogeneous Data Sources
Abstract
Abstract: This paper proposes a new scalable and robust distributed algorithm for constructing distributed decision trees in a peer-to-peer environment over heterogeneous data sources. Computation and communication costs in the peer-to-peer environment are higher, and the chances of reduced accuracy and response time may also be higher. The proposed algorithm scales well and also provides the best prediction model using the well-known classification technique of distributed decision trees.
Keywords: distributed decision trees, peer-to-peer environment, heterogeneous data