Results 1–10 of 17
Algorithms for Distributed Functional Monitoring
, 2008
Abstract

Cited by 40 (13 self)
We study what we call functional monitoring problems. We have k players each tracking their inputs, say player i tracking a multiset Ai(t) up until time t, and communicating with a central coordinator. The coordinator’s task is to monitor a given function f computed over the union of the inputs ∪iAi(t), continuously at all times t. The goal is to minimize the number of bits communicated between the players and the coordinator. A simple example is when f is the sum, and the coordinator is required to alert when the sum of a distributed set of values exceeds a given threshold τ. Of interest is the approximate version where the coordinator outputs 1 if f ≥ τ and 0 if f ≤ (1 − ε)τ. This defines the (k, f, τ, ε) distributed, functional monitoring problem. Functional monitoring problems are fundamental in distributed systems, in particular sensor networks, where we must minimize communication; they also connect to problems in communication complexity, communication theory, and signal processing. Yet few formal bounds are known for functional monitoring. We give upper and lower bounds for the (k, f, τ, ε) problem for some of the basic f’s. In particular, we study the frequency moments (F0, F1, F2). For F0 and F1, we obtain continuous monitoring algorithms with costs almost the same as their one-shot computation algorithms. However, for F2 the monitoring problem seems much harder. We give a carefully constructed multi-round algorithm that uses “sketch summaries” at multiple levels of detail and solves the (k, F2, τ, ε) problem with communication Õ(k²/ε + (√k/ε)³). Since frequency moment estimation is central to other problems, our results have immediate applications to histograms, wavelet computations, and others. Our algorithmic techniques are likely to be useful for other functional monitoring problems as well.
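For the counting case f = F1, the flavor of such upper bounds can be seen in a minimal sketch (illustrative only, not the paper's algorithm; the class names `Site` and `Coordinator` are invented): each player stays silent until its local count has grown by a slack of ⌊ετ/k⌋ items since its last report, then sends one message, so the coordinator's running total undercounts the true sum by less than ετ.

```python
import math

class Coordinator:
    """Receives slack-sized increments and alarms near the threshold."""
    def __init__(self, k, tau, eps):
        self.slack = max(1, math.floor(eps * tau / k))  # per-report granularity
        self.reported = 0
        self.tau = tau
        self.eps = eps

    def receive(self):
        self.reported += self.slack
        # Invariant: reported <= true sum < reported + k*slack <= reported + eps*tau,
        # so this alarm never fires while f <= (1 - eps)*tau, and it must have
        # fired by the time f >= tau.
        return self.reported > (1 - self.eps) * self.tau

class Site:
    """Reports one message per `slack` new items; stays silent otherwise."""
    def __init__(self, coord):
        self.coord = coord
        self.pending = 0  # items not yet reflected at the coordinator

    def observe(self, count=1):
        self.pending += count
        alarm = False
        while self.pending >= self.coord.slack:
            self.pending -= self.coord.slack
            alarm = self.coord.receive() or alarm
        return alarm
```

Since every message accounts for ⌊ετ/k⌋ items, at most about k/ε messages are sent in total before the threshold is reached in this naive scheme.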
Issues in Evaluation of Stream Learning Algorithms
Abstract

Cited by 20 (5 self)
Learning from data streams is a research area of increasing importance, and several stream learning algorithms have been developed. Most of them learn decision models that continuously evolve over time, run in resource-aware environments, and detect and react to changes in the environment generating the data. One important issue, not yet conveniently addressed, is the design of experimental work to evaluate and compare decision models that evolve over time. There are no golden standards for assessing performance in non-stationary environments. This paper proposes a general framework for assessing predictive stream learning algorithms. We defend the use of Predictive Sequential methods for error estimation – the prequential error. The prequential error allows us to monitor the evolution of the performance of models that evolve over time. Nevertheless, it is known to be a pessimistic estimator in comparison to holdout estimates. To obtain more reliable estimators we need some forgetting mechanism. Two viable alternatives are sliding windows and fading factors. We observe that the prequential error converges to a holdout estimator when estimated over a sliding window or using fading factors. We present illustrative examples of the use of prequential error estimators, using fading factors, for the tasks of: i) assessing the performance of a learning algorithm; ii) comparing learning algorithms; iii) hypothesis testing using the McNemar test; and iv) change detection using the Page-Hinkley test. In these tasks, the prequential error estimated using fading factors provides reliable estimates. In comparison to sliding windows, fading factors are faster and memoryless, a requirement for streaming applications. This paper is a contribution to a discussion of good practices for performance assessment when learning dynamic models that evolve over time.
Categories and Subject Descriptors H.2.8 [Database Management]: Database applications— data mining; I.2.6 [Artificial Intelligence]: Learning—
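The fading-factor prequential estimator described above fits in a few lines (a sketch of the standard recurrence, with an invented class name): both the faded loss sum and the faded example count follow x_i = value + α·x_{i−1}, and their ratio is the error estimate.

```python
class FadingPrequential:
    """Prequential (test-then-train) error with a fading factor alpha.

    S_i = L_i + alpha * S_{i-1};  B_i = 1 + alpha * B_{i-1};  error = S_i / B_i.
    alpha = 1 recovers the plain prequential mean; alpha < 1 forgets the past
    geometrically, keeping the estimate responsive under concept drift while
    using O(1) memory, unlike a sliding window.
    """
    def __init__(self, alpha=0.999):
        self.alpha = alpha
        self.s = 0.0   # faded sum of losses
        self.b = 0.0   # faded count of examples

    def update(self, loss):
        self.s = loss + self.alpha * self.s
        self.b = 1.0 + self.alpha * self.b
        return self.s / self.b
```

Calling `update` with each example's 0/1 loss before training on that example yields the running prequential error.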
Optimal tracking of distributed heavy hitters and quantiles
In PODS, 2009
Abstract

Cited by 17 (8 self)
We consider the problem of tracking heavy hitters and quantiles in the distributed streaming model. Heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let A be a multiset of elements drawn from the universe U = {1,..., u}. For a given 0 ≤ φ ≤ 1, the φ-heavy hitters are those elements of A whose frequency in A is at least φ|A|; a φ-quantile of A is an element x of U such that at most φ|A| elements of A are smaller than x and at most (1 − φ)|A| elements of A are greater than x. Suppose the elements of A are received at k remote sites over time, and each of the sites has a two-way communication channel to a designated coordinator, whose goal is to track the set of φ-heavy hitters and the φ-quantile of A approximately at all times with minimum communication. We give tracking algorithms with worst-case communication cost O(k/ε · log n) for both problems, where n is the total number of items in A and ε is the approximation error. This substantially improves upon the previously known algorithms. We also give matching lower bounds on the communication costs for both problems, showing that our algorithms are optimal. We also consider a more general version of the problem where we simultaneously track the φ-quantiles for all 0 ≤ φ ≤ 1.
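The two definitions can be grounded with a small offline sketch (exact and centralized, unlike the paper's communication-efficient tracking algorithms; the function names are invented):

```python
from collections import Counter

def heavy_hitters(a, phi):
    """phi-heavy hitters: elements whose frequency in A is at least phi*|A|."""
    counts = Counter(a)
    return {x for x, c in counts.items() if c >= phi * len(a)}

def quantile(a, phi):
    """A phi-quantile: an x with at most phi*|A| elements of A smaller than it
    and at most (1 - phi)*|A| greater than it (offline, exact)."""
    s = sorted(a)
    # the element at rank floor(phi*|A|) satisfies both counting conditions
    idx = min(int(phi * len(s)), len(s) - 1)
    return s[idx]
```

For example, in the multiset [1, 1, 1, 2, 3, 3, 3, 3, 4, 5] the 0.3-heavy hitters are {1, 3} and 3 is a 0.5-quantile (median).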
Shape sensitive geometric monitoring
In Proc. ACM Symposium on Principles of Database Systems, 2008
Abstract

Cited by 14 (5 self)
A fundamental problem in distributed computation is the distributed evaluation of functions. The goal is to determine the value of a function over a set of distributed inputs in a communication-efficient manner. Specifically, we assume that each node holds a time-varying input vector, and we are interested in determining, at any given time, whether the value of an arbitrary function on the average of these vectors crosses a predetermined threshold. In this paper, we introduce a new method for monitoring distributed data, which we term shape sensitive geometric monitoring. It is based on a geometric interpretation of the problem, which enables us to define local constraints on the data received at the nodes. It is guaranteed that as long as none of these constraints has been violated, the value of the function does not cross the threshold. We generalize previous work on geometric monitoring and solve two problems which seriously hampered its performance: as opposed to the constraints used so far, which depend only on the current values of the local input vectors, here we incorporate their temporal behavior into the constraints. Also, the new constraints are tailored to the geometric properties of the specific function being monitored, whereas the previous constraints were generic. Experimental results on real-world data reveal that using the new geometric constraints reduces communication by up to three orders of magnitude in comparison to existing approaches, and considerably narrows the gap between existing results and a newly defined lower bound on the communication complexity.
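The flavor of a geometric local constraint can be sketched for the concrete function f(x) = ‖x‖², whose minimum and maximum over a ball have closed forms (a toy version of the generic ball test from earlier geometric monitoring work, not the shape-sensitive constraints introduced in this paper; the function name is invented): the ball whose diameter is the segment between the coordinator's last estimate vector e and the node's current vector v must lie entirely on one side of the surface f(x) = τ.

```python
import math

def ball_safe(e, v, tau):
    """Local test of the geometric method, specialized to f(x) = ||x||^2.

    The ball whose diameter is the segment [e, v] must lie entirely on one
    side of the surface f(x) = tau. The geometric method guarantees that the
    union of all nodes' balls covers the convex hull of their vectors, so if
    every ball is monochromatic, f(mean) has not crossed the threshold and
    no communication is needed."""
    c = [(a + b) / 2 for a, b in zip(e, v)]   # ball center
    r = math.dist(e, v) / 2                   # ball radius
    norm_c = math.hypot(*c)
    f_min = max(0.0, norm_c - r) ** 2         # min of ||x||^2 over the ball
    f_max = (norm_c + r) ** 2                 # max of ||x||^2 over the ball
    return f_max < tau or f_min >= tau        # entirely on one side?
```

A node whose ball straddles the surface must contact the coordinator; the paper's contribution is shaping such constraints to the monitored function and to the vectors' temporal behavior instead of using this generic ball.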
Compressing kinetic data from sensor networks
, 2009
Abstract

Cited by 4 (3 self)
We introduce a framework for storing and processing kinetic data observed by sensor networks. These sensor networks generate vast quantities of data, which motivates a significant need for data compression. We are given a set of sensors, each of which continuously monitors some region of space. We are interested in the kinetic data generated by a finite set of objects moving through space, as observed by these sensors. Our model relies purely on sensor observations; it allows points to move freely and requires no advance notification of motion plans. Sensor outputs are represented as random processes, where nearby sensors may be statistically dependent. We model the local nature of sensor networks by assuming that two sensor outputs are statistically dependent only if the two sensors are among the k nearest neighbors of each other. We present an algorithm for the lossless compression of the data produced by the network. We show that, under the statistical dependence and locality assumptions of our framework, this compression algorithm asymptotically encodes the data to within a constant factor of the information-theoretic optimum dictated by the joint entropy of the system.
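Why joint coding can beat per-sensor coding when nearby outputs are statistically dependent can be seen in a toy experiment (invented setup and constants, not the paper's algorithm): two binary "sensors" that agree 90% of the time have joint entropy H(X, Y) = H(X) + H(Y|X) ≈ 1.47 bits per pair, well below the H(X) + H(Y) ≈ 2 bits that independent encoders would pay.

```python
import math
import random
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy (bits per symbol) of a sequence."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Two nearby "sensors" whose bits agree 90% of the time.
random.seed(0)
x = [random.randint(0, 1) for _ in range(10000)]
y = [b if random.random() < 0.9 else 1 - b for b in x]

h_joint = entropy(list(zip(x, y)))   # cost of coding both streams together
h_indep = entropy(x) + entropy(y)    # cost of coding each stream on its own
```

The gap h_indep − h_joint is exactly the savings that the paper's locality assumption (dependence only among k-nearest neighbors) lets a lossless encoder exploit.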
Multi-scale real-time grid monitoring with job stream mining
In 9th IEEE International Symposium on Cluster Computing and the Grid (CCGrid), 2009
Abstract

Cited by 2 (1 self)
The ever-increasing scale and complexity of large computational systems call for sophisticated management tools, paving the way toward Autonomic Computing. A first step toward Autonomic Grids is presented in this paper; the interactions between the grid middleware and the stream of computational queries are modeled using statistical learning. The approach is implemented and validated in the context of the EGEE grid. The GSTRAP system, embedding the STRAP data streaming algorithm, provides manageable and understandable views of the computational workload based on gLite reporting services. An online monitoring module shows the instantaneous distribution of the jobs in real time and its dynamics, enabling anomaly detection. An offline monitoring module provides the administrator with a consolidated view of the workload, enabling the visual inspection of its long-term trends.
Continuous Adaptive Outlier Detection on Distributed Data Streams
Abstract

Cited by 1 (0 self)
In many applications, stream data are too voluminous to be collected centrally and are often transmitted over a distributed network. In this paper, we focus on outlier detection over distributed data streams in real time. First, we formalize the problem of outlier detection using the kernel density estimation technique. Then, we adopt the fading strategy to keep pace with the transient and evolving nature of stream data, and the micro-cluster technique to handle data partitioning and the “one-pass” scan. Furthermore, our extensive experiments with synthetic and real data show that the proposed algorithm is efficient and effective compared with existing outlier detection algorithms, and more suitable for data streams.
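A minimal version of a fading kernel-density outlier score looks as follows (illustrative assumptions throughout: Gaussian kernel, scalar data, per-point storage instead of the paper's micro-clusters, and invented constants and class name):

```python
import math

class FadingKDEOutlier:
    """Outlier score from a fading Gaussian kernel density estimate.

    Each stored point's weight decays by `lam` per arrival, so old data
    fades out; an arriving point is flagged when the estimated density
    at its location falls below `threshold`."""
    def __init__(self, bandwidth=1.0, lam=0.99, threshold=0.05):
        self.h = bandwidth
        self.lam = lam
        self.threshold = threshold
        self.points = []   # (value, weight) pairs; a real stream would
                           # compress these into micro-clusters

    def density(self, x):
        z = sum(w for _, w in self.points)
        if z == 0:
            return 0.0
        k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
        return sum(w * k((x - p) / self.h) for p, w in self.points) / (z * self.h)

    def observe(self, x):
        # score against the faded history, then fade and absorb the point
        is_outlier = bool(self.points) and self.density(x) < self.threshold
        self.points = [(p, w * self.lam) for p, w in self.points]
        self.points.append((x, 1.0))
        return is_outlier
```

After many readings near 0, a reading at 10 lands in a near-zero-density region and is flagged, while a reading at 0.05 is not.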
Algorithms for Calculating Statistical Properties of Moving Points
Abstract
Robust statistics and kinetic data structures are two frequently studied theoretical areas with practical motivations. The former topic is the study of statistical estimators that are robust to data outliers. The latter topic is the study of data structures for calculations on moving point sets. The combination of these two areas has not previously been studied. In studying this intersection, we consider these problems in the context of both an established kinetic framework (called KDS) that relies on advance point motion information and calculates properties continuously, and a new sensor-based framework that uses discrete point observations. Using the KDS model, we present an approximation algorithm for the kinetic robust k-center problem, a clustering problem that requires k clusters but allows some outlying points to remain unclustered. For many practical problems that inspired the exploration into robustness, the KDS model is inapplicable due to the point motion restrictions and the advance flight plans required. Working towards a solution to the kinetic robust k-center problem in a framework that allows unrestricted point motion, we present a new framework for kinetic data that allows calculations on moving points via sensor-recorded observations. This new framework is one
GEOMETRIC ALGORITHMS FOR OBJECTS IN MOTION
Abstract
In this thesis, the theoretical analysis of real-world motivated problems regarding objects in motion is considered. Specifically, four major results are presented addressing the issues of robustness, data collection and compression, realistic theoretical analyses of this compression, and data retrieval. Robust statistics is the study of statistical estimators that are robust to data outliers. The combination of robust statistics and data structures for moving objects has not previously been studied. In studying this intersection, we consider a problem in the context of an established kinetic data structures framework (called KDS) that relies on advance point motion information and calculates properties continuously. Using the KDS model, we present an approximation algorithm for the kinetic robust k-center problem, a clustering problem that requires k clusters but allows some outlying points to remain unclustered. For many practical problems that inspired the exploration into robustness, the KDS model is inapplicable due to the point motion restrictions and the advance flight plans required. We present a new framework for kinetic data that allows calculations on moving points via sensor-recorded observations. This new framework
References for Data Stream Algorithms
, 2007
Abstract
Many scenarios, such as network analysis, utility monitoring, and financial applications, generate massive streams of data. These streams consist of millions or billions of simple updates every hour, and must be processed to extract the information described in these tiny pieces. These notes provide an introduction to (and set of references for) data stream algorithms, and some of the techniques that have been developed over recent years to help mine the data while avoiding drowning in these massive flows of information.