Results 1–10 of 78
Communication-efficient distributed monitoring of thresholded counts
 In Proc. of SIGMOD’06
, 2006
Abstract

Cited by 78 (11 self)
Monitoring is an issue of primary concern in current and next-generation networked systems. For example, the objective of sensor networks is to monitor their surroundings for a variety of different applications like atmospheric conditions, wildlife behavior, and troop movements, among others. Similarly, monitoring in data networks is critical not only for accounting and management, but also for detecting anomalies and attacks. Such monitoring applications are inherently continuous and distributed, and must be designed to minimize the communication overhead that they introduce. In this context we introduce and study a fundamental class of problems called “thresholded counts”, where we must return the aggregate frequency count of an event that is continuously monitored by distributed nodes with a user-specified accuracy whenever the actual count exceeds a given threshold value. In this paper we propose to address the problem of thresholded counts by setting local thresholds at each monitoring node and initiating communication only when the locally observed data exceeds these local thresholds. We explore algorithms in two categories: static thresholds and adaptive thresholds. In the static case, we consider thresholds based on a linear combination of two alternate strategies, and show that there exists an optimal blend of the two strategies that results in minimum communication overhead. We further show that this optimal blend can be found using a steepest-descent search. In the adaptive case, we propose algorithms that adjust the local thresholds based on the observed distributions of updated information in the distributed monitoring system. We use extensive simulations not only to verify the accuracy of our algorithms and validate our theoretical results, but also to evaluate the performance of the two approaches. We find that both approaches yield significant savings over the naive approach of performing all processing at a centralized location.
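The local-threshold idea behind the static scheme can be illustrated with a small Python sketch (a hypothetical simulation, not the paper's algorithm): each node reports a block of `slack` events whenever its local count has grown by that much, so the coordinator's estimate trails the true total by less than `slack` per node.

```python
def local_threshold_counts(node_counts, slack):
    """Hypothetical sketch: each node reports a block of `slack` events
    every time its local count grows by `slack`; the coordinator sums
    the blocks, undercounting by less than `slack` per node."""
    estimate, messages = 0, 0
    for count in node_counts:
        blocks = count // slack      # reports this node has sent so far
        estimate += blocks * slack
        messages += blocks
    return estimate, messages
```

With `node_counts = [10, 7, 3]` and `slack = 4`, the coordinator holds an estimate of 12 after 3 messages against a true count of 20; the error stays below one `slack` per node. Choosing `slack` trades message count against accuracy, which is the trade-off the static analysis optimizes.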
Algorithms for Distributed Functional Monitoring
, 2008
Abstract

Cited by 60 (12 self)
We study what we call functional monitoring problems. We have k players, each tracking their inputs, say player i tracking a multiset Ai(t) up until time t, and communicating with a central coordinator. The coordinator’s task is to monitor a given function f computed over the union of the inputs ∪iAi(t), continuously at all times t. The goal is to minimize the number of bits communicated between the players and the coordinator. A simple example is when f is the sum, and the coordinator is required to alert when the sum of a distributed set of values exceeds a given threshold τ. Of interest is the approximate version where the coordinator outputs 1 if f ≥ τ and 0 if f ≤ (1 − ε)τ. This defines the (k, f, τ, ε) distributed functional monitoring problem. Functional monitoring problems are fundamental in distributed systems, in particular sensor networks, where we must minimize communication; they also connect to problems in communication complexity, communication theory, and signal processing. Yet few formal bounds are known for functional monitoring. We give upper and lower bounds for the (k, f, τ, ε) problem for some of the basic f’s. In particular, we study the frequency moments (F0, F1, F2). For F0 and F1, we obtain continuous monitoring algorithms with costs almost the same as their one-shot computation algorithms. However, for F2 the monitoring problem seems much harder. We give a carefully constructed multi-round algorithm that uses “sketch summaries” at multiple levels of detail and solves the (k, F2, τ, ε) problem with communication Õ(k²/ε + (√k/ε)³). Since frequency moment estimation is central to other problems, our results have immediate applications to histograms, wavelet computations, and others. Our algorithmic techniques are likely to be useful for other functional monitoring problems as well.
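The sum example (f = F1) admits a simple round-based protocol, sketched below in Python. This is an illustrative simulation under assumed constants, not the paper's exact algorithm: in each round the coordinator splits the remaining slack among the k sites, each site sends one bit per `slack` units of local growth, and after k bits the coordinator synchronises and shrinks the slack.

```python
def monitor_sum(updates, k, tau, eps):
    """Hypothetical round-based sketch of (k, SUM, tau, eps) monitoring.
    `updates` is a sequence of site ids, each denoting a +1 at that site.
    Returns (update index at which the alarm fires, messages used)."""
    exact = 0                  # global sum known at the last synchronisation
    pending = [0] * k          # unreported growth per site
    slack = max(1, int(eps * (tau - exact) / k))
    signals = messages = 0
    for t, site in enumerate(updates):
        pending[site] += 1
        if pending[site] >= slack:        # site -> coordinator: one bit
            pending[site] -= slack
            signals += 1
            messages += 1
        if signals >= k:                  # round ends: full synchronisation
            exact += signals * slack + sum(pending)
            pending = [0] * k
            signals = 0
            messages += k
            if exact >= (1 - eps) * tau:  # the sum may already exceed tau
                return t, messages
            slack = max(1, int(eps * (tau - exact) / k))
    return None, messages
```

Because at most k·slack ≤ ε(τ − exact) units go unreported between synchronisations, the coordinator never misses the threshold by more than the allowed slack, while using far fewer messages than forwarding every update.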
Communication-efficient online detection of network-wide anomalies
 In IEEE Conference on Computer Communications (INFOCOM)
, 2007
Abstract

Cited by 50 (10 self)
There has been growing interest in building large-scale distributed monitoring systems for sensor, enterprise, and ISP networks. Recent work has proposed using Principal Component Analysis (PCA) over global traffic matrix statistics to effectively isolate network-wide anomalies. To allow such a PCA-based anomaly detection scheme to scale, we propose a novel approximation scheme that dramatically reduces the burden on the production network. Our scheme avoids the expensive step of centralizing all the data by performing intelligent filtering at the distributed monitors. This filtering reduces monitoring bandwidth overheads, but can result in the anomaly detector making incorrect decisions based on a perturbed view of the global data set. We employ stochastic matrix perturbation theory to bound such errors. Our algorithm selects the filtering parameters at local monitors such that the errors made by the detector are guaranteed to lie below a user-specified upper bound. Our algorithm thus allows network operators to explicitly balance the trade-off between detection accuracy and the amount of data communicated over the network. In addition, our approach enables real-time detection because we exploit continuous monitoring at the distributed monitors. Experiments with traffic data from the Abilene backbone network demonstrate that our methods yield significant communication benefits while simultaneously achieving high detection accuracy.
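The PCA step at the heart of this approach can be sketched in pure Python (a hypothetical centralized version that omits the distributed filtering; the function name and threshold are illustrative): model "normal" traffic as the top principal component and flag rows whose residual, after projecting onto it, is large.

```python
import math

def top_eigenvector(cov, iters=200):
    """Power iteration for the dominant eigenvector of a small matrix."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def pca_residual_anomalies(rows, threshold):
    """Hypothetical sketch: flag rows whose residual norm, outside the
    top principal subspace, exceeds `threshold`."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[i] for r in rows) / n for i in range(d)]
    centered = [[r[i] - mean[i] for i in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    v = top_eigenvector(cov)
    flagged = []
    for idx, r in enumerate(centered):
        proj = sum(r[i] * v[i] for i in range(d))         # component on v
        resid = math.sqrt(sum((r[i] - proj * v[i]) ** 2   # what v misses
                              for i in range(d)))
        if resid > threshold:
            flagged.append(idx)
    return flagged
```

On synthetic data where twenty rows lie along one direction and a single row deviates sharply, only the deviating row is flagged; the distributed scheme in the paper bounds how much local filtering can perturb exactly this residual test.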
Conquering the divide: Continuous clustering of distributed data streams
 In Intl. Conf. on Data Engineering
, 2007
Abstract

Cited by 35 (4 self)
Data is often collected over a distributed network, but in many cases is so voluminous that it is impractical and undesirable to collect it in a central location. Instead, we must perform distributed computations over the data, guaranteeing high-quality answers even as new data arrives. In this paper, we formalize and study the problem of maintaining a clustering of such distributed data that is continuously evolving. In particular, our goal is to minimize the communication and computational cost while still providing guaranteed accuracy of the clustering. We focus on k-center clustering, and provide a suite of algorithms that vary based on which centralized algorithm they derive from, and on whether they maintain a single global clustering or many local clusterings that can be merged together. We show that these algorithms can be designed to give accuracy guarantees that are close to the best possible even in the centralized case. In our experiments, we see clear trends among these algorithms, showing that the choice of algorithm is crucial, and that we can achieve a clustering that is as good as the best centralized clustering, with only a small fraction of the communication required to collect all the data in a single location.
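The centralized baseline these algorithms derive from is the classic greedy 2-approximation for k-center (Gonzalez's farthest-point heuristic). A minimal Python sketch, assuming Euclidean points:

```python
import math

def gonzalez_k_center(points, k):
    """Greedy 2-approximation for k-center: start anywhere, then
    repeatedly add the point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c)
                                             for c in centers)))
    # clustering radius: worst distance of any point to its nearest center
    radius = max(min(math.dist(p, c) for c in centers) for p in points)
    return centers, radius
```

For `points = [(0, 0), (1, 0), (10, 0), (11, 0)]` and `k = 2`, the greedy picks `(0, 0)` and the farthest point `(11, 0)`, giving radius 1.0. The distributed algorithms in the paper maintain an approximation of this radius continuously without shipping every point to one site.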
Shape sensitive geometric monitoring
 In Proc. ACM Symposium on Principles of Database Systems
, 2008
Abstract

Cited by 31 (15 self)
A fundamental problem in distributed computation is the distributed evaluation of functions. The goal is to determine the value of a function over a set of distributed inputs in a communication-efficient manner. Specifically, we assume that each node holds a time-varying input vector, and we are interested in determining, at any given time, whether the value of an arbitrary function on the average of these vectors crosses a predetermined threshold. In this paper, we introduce a new method for monitoring distributed data, which we term shape sensitive geometric monitoring. It is based on a geometric interpretation of the problem, which enables us to define local constraints on the data received at the nodes. It is guaranteed that as long as none of these constraints has been violated, the value of the function does not cross the threshold. We generalize previous work on geometric monitoring, and solve two problems which seriously hampered its performance: as opposed to the constraints used so far, which depend only on the current values of the local input vectors, here we incorporate their temporal behavior into the constraints. Also, the new constraints are tailored to the geometric properties of the specific function being monitored, whereas the previous constraints were generic. Experimental results on real-world data reveal that using the new geometric constraints reduces communication by up to three orders of magnitude in comparison to existing approaches, and considerably narrows the gap between existing results and a newly defined lower bound on the communication complexity.
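The basic geometric method that this work generalizes can be sketched for one concrete function, f(x) = ||x||², monitored against a threshold τ (an illustrative special case, not the paper's shape-sensitive construction): each node checks a ball centered midway between the last global estimate and its current local vector, and stays silent while f cannot cross τ anywhere on that ball.

```python
import math

def drift_ball_violated(estimate, local, tau):
    """Local test of the basic geometric method for f(x) = ||x||^2 <= tau.
    While every node's drift ball stays below tau, so does f of the
    global average; a True result forces a synchronisation."""
    center = [(e, l) and (e + l) / 2 for e, l in zip(estimate, local)]
    radius = math.dist(estimate, local) / 2
    # for f(x) = ||x||^2 the maximum over a ball is (||center|| + radius)^2
    hi = (math.hypot(*center) + radius) ** 2
    return hi > tau
```

For instance, with last estimate `(1, 0)` and τ = 9, a local drift to `(2, 0)` keeps the ball's maximum at 4 and needs no communication, while a drift to `(5, 0)` pushes it to 25 and triggers a sync. The shape-sensitive constraints replace these generic balls with regions tailored to f and to the vectors' temporal behavior.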
Query Recommendations for Interactive Database Exploration
Abstract

Cited by 31 (0 self)
Relational database systems are becoming increasingly popular in the scientific community to support the interactive exploration of large volumes of data. In this scenario, users employ a query interface (typically, a web-based client) to issue a series of SQL queries that aim to analyze the data and mine it for interesting information. First-time users, however, may not know where to start their exploration. Other times, users may simply overlook queries that retrieve important information. To assist users in this context, we draw inspiration from Web recommender systems and propose the use of personalized query recommendations. The idea is to track the querying behavior of each user, identify which parts of the database may be of interest for the corresponding data analysis task, and recommend queries that retrieve relevant data. We discuss the main challenges in this novel application of recommendation systems, and outline a possible solution based on collaborative filtering. Preliminary experimental results on real user traces demonstrate that our framework can generate effective query recommendations.
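A collaborative-filtering recommender of the kind outlined here can be sketched in a few lines (a hypothetical toy model, with sessions represented as query-to-weight dictionaries rather than the paper's actual representation): score each candidate query by the similarity of the sessions that issued it to the target session.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend_queries(target, sessions, top_n=1):
    """Score each query unseen by `target` by similarity-weighted votes
    from other sessions, and return the top suggestions."""
    scores = {}
    for other in sessions:
        sim = cosine(target, other)
        for q, w in other.items():
            if q not in target:
                scores[q] = scores.get(q, 0.0) + sim * w
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A user who has issued `q1` and `q2` is steered toward `q3` when another session containing `q1` and `q2` also contains `q3`, while queries from dissimilar sessions score near zero.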
An energy-efficient querying framework in sensor networks for detecting node similarities
 In: ACM Int. Symp. on Modeling, Analysis and Simulation of Wireless and Mobile Systems
, 2006
Abstract

Cited by 26 (1 self)
We propose an energy-efficient framework, called SAF, for approximate querying and clustering of nodes in a sensor network. SAF uses simple time-series forecasting models to predict sensor readings. The idea is to build these local models at each node, transmit them to the root of the network (the “sink”), and use them to approximately answer user queries. Our approach dramatically reduces communication relative to previous approaches for querying sensor networks by exploiting properties of these local models, since each sensor communicates with the sink only when its local model varies due to changes in the underlying data distribution. In our experimental results, performed on a trace of real data, we observed on average about 150 message transmissions from each sensor over a week (including the learning phase) to correctly predict temperatures to within ±0.5 °C. SAF also provides a mechanism to detect data similarities between nodes and organize nodes into clusters at the sink at no additional communication cost. This is again achieved by exploiting properties of our local time-series models, and by means of a novel definition of data similarity between nodes that is based not on raw data but on the predicted values. Our clustering algorithm is both very efficient and provably optimal in the number of clusters. Our clusters have several interesting features: first, they can capture similarity between faraway nodes that are not geographically adjacent; second, cluster membership adapts to variations in sensors’ local models; third, nodes within a cluster are not required to track the membership of other nodes in the cluster. We present a number of simulation-based experimental results that demonstrate these properties of SAF.
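The model-based reporting idea can be sketched as follows (a deliberately simplified stand-in for SAF's forecasting models: a node keeps a linear predictor and transmits a new model only when a reading falls outside the error budget δ):

```python
def saf_node(readings, delta):
    """Hypothetical sketch of model-based reporting: the node predicts
    value = a + b * (t - t0) and sends a new (t, a, b) model to the
    sink only when the prediction misses a reading by more than delta."""
    a, b, t0 = readings[0], 0.0, 0
    models = [(0, a, b)]                    # messages sent to the sink
    for t, x in enumerate(readings):
        if abs((a + b * (t - t0)) - x) > delta:
            # refit: anchor at the new reading, slope from the last anchor
            b = (x - a) / (t - t0) if t > t0 else 0.0
            a, t0 = x, t
            models.append((t, a, b))
    return models
```

On a piecewise-linear trace such as `[0, 1, 2, 3, 10, 11, 12]` with δ = 0.5, the node transmits only four models for seven readings; the sink can answer queries from the models alone, and, as in SAF, compare models rather than raw data to cluster similar nodes.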
Multi-Dimensional Online Tracking
Abstract

Cited by 26 (2 self)
We propose and study a new class of online problems, which we call online tracking. Suppose an observer, say Alice, observes a multi-valued function f : Z⁺ → Zᵈ over time in an online fashion, i.e., she only sees f(t) for t ≤ t_now, where t_now is the current time. She would like to keep a tracker, say Bob, informed of the current value of f at all times. Under this setting, Alice could send new values of f to Bob from time to time, so that the current value of f is always within a distance ∆ of the last value received by Bob. We give competitive online algorithms whose communication costs are compared with that of the optimal offline algorithm, which knows the entire f in advance. We also consider variations of the problem where Alice is allowed to send “predictions” to Bob, to further reduce communication for well-behaved functions. These online tracking problems have a variety of applications, ranging from sensor monitoring and location-based services to publish/subscribe systems.
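The baseline protocol described above, for the one-dimensional case, fits in a few lines of Python (a minimal sketch, not one of the paper's competitive algorithms): Alice sends f(t) exactly when it drifts more than ∆ from the last value Bob received.

```python
def track_online(values, delta):
    """Alice sends a value whenever f drifts more than `delta` from the
    last value sent, so Bob's copy is always within `delta` of the truth.
    Returns the list of values actually transmitted."""
    sent = [values[0]]                 # the initial value must be sent
    for v in values[1:]:
        if abs(v - sent[-1]) > delta:
            sent.append(v)
    return sent
```

For the trace `[0, 1, 2, 5, 6, 11]` with ∆ = 3, only `[0, 5, 11]` is transmitted. The paper's contribution is bounding how far such online decisions can be from the best offline choice of messages, and improving on this greedy rule.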
Optimal tracking of distributed heavy hitters and quantiles
 In PODS
, 2009
Abstract

Cited by 24 (9 self)
We consider the problem of tracking heavy hitters and quantiles in the distributed streaming model. The heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let A be a multiset of elements, drawn from the universe U = {1, ..., u}. For a given 0 ≤ φ ≤ 1, the φ-heavy hitters are those elements of A whose frequency in A is at least φ|A|; the φ-quantile of A is an element x of U such that at most φ|A| elements of A are smaller than x and at most (1 − φ)|A| elements of A are greater than x. Suppose the elements of A are received at k remote sites over time, and each of the sites has a two-way communication channel to a designated coordinator, whose goal is to track the set of φ-heavy hitters and the φ-quantile of A approximately at all times with minimum communication. We give tracking algorithms with worst-case communication cost O(k/ε · log n) for both problems, where n is the total number of items in A and ε is the approximation error. This substantially improves upon the previously known algorithms. We also give matching lower bounds on the communication costs for both problems, showing that our algorithms are optimal. We also consider a more general version of the problem where we simultaneously track the φ-quantiles for all 0 ≤ φ ≤ 1.
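The simplest building block of such trackers, per-item counts reported in blocks, can be sketched as follows (a hypothetical simulation, not the paper's optimal protocol): each of the k sites forwards an item's count in units of `slack`, so the coordinator's estimate trails the truth by less than k·slack per item.

```python
from collections import defaultdict

def track_counts(stream, k, slack):
    """Each site forwards an item's count in blocks of `slack`; the
    coordinator's per-item estimate undercounts by < k * slack.
    `stream` is a sequence of (site, item) arrivals."""
    local = [defaultdict(int) for _ in range(k)]
    estimate = defaultdict(int)
    messages = 0
    for site, item in stream:
        local[site][item] += 1
        if local[site][item] % slack == 0:   # a full block: report it
            estimate[item] += slack
            messages += 1
    return dict(estimate), messages
```

With 7 arrivals of item `'a'` at one site and 4 at another, and `slack = 3`, the coordinator's estimate is 9 after only 3 messages, against a true count of 11, within the 2·3 error bound. The paper's algorithms choose and adapt the slack to achieve the optimal O(k/ε · log n) total cost.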
Optimal sampling from distributed streams
 Proc. ACM Symposium on Principles of Database Systems
, 2009
Abstract

Cited by 23 (7 self)
A fundamental problem in data management is to draw a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed sites. The challenge is to ensure that a sample is drawn uniformly across the union of the data while minimizing the communication needed to run the protocol and track parameters of the evolving data. At the same time, it is also necessary to make the protocol lightweight, by keeping the space and time costs low for each participant. In this paper, we present communication-efficient protocols for sampling (both with and without replacement) from k distributed streams. These apply to the case when we want a sample from the full streams, and to the sliding-window cases of only the W most recent items, or arrivals within the last w time units. We show that our protocols are optimal not just in terms of the communication used, but also in that they use minimal or near-minimal (up to logarithmic factors) time to process each new item, and space to operate.
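The standard trick underlying communication-efficient distributed sampling can be sketched for a sample of size one (a hypothetical illustration, not the paper's full protocol): every arriving item draws a random tag, a site forwards an item only when its tag beats the site's running minimum, and the coordinator keeps the item with the globally smallest tag, which is a uniform sample of the union.

```python
import random

def min_tag_sample(stream, k, seed=42):
    """Uniform sample of one from k distributed streams via minimum
    random tags. `stream` is a sequence of (site, item) arrivals.
    Only per-site record-minimum tags cost a message, so the expected
    message count is O(k log n) rather than n."""
    rng = random.Random(seed)
    site_min = [float('inf')] * k
    best_tag, sample = float('inf'), None
    messages = 0
    for site, item in stream:
        tag = rng.random()
        if tag < site_min[site]:        # new site record: site -> coordinator
            site_min[site] = tag
            messages += 1
            if tag < best_tag:          # coordinator keeps the global minimum
                best_tag, sample = tag, item
    return sample, messages
```

Because each item's tag is an independent uniform draw, every item is equally likely to hold the global minimum, and only the rare record-breaking tags are communicated. Sampling without replacement keeps the s smallest tags instead of one, and the sliding-window variants restrict which tags are still eligible.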