Results 11  20
of
283
Online filtering, smoothing and probabilistic modeling of streaming data
 in ICDE
, 2008
"... In this paper, we address the problem of extending a relational database system to facilitate efficient realtime application of dynamic probabilistic models to streaming data. We use the recently proposed abstraction of modelbased views for this purpose, by allowing users to declaratively specify ..."
Abstract

Cited by 50 (3 self)
 Add to MetaCart
In this paper, we address the problem of extending a relational database system to facilitate efficient realtime application of dynamic probabilistic models to streaming data. We use the recently proposed abstraction of modelbased views for this purpose, by allowing users to declaratively specify the model to be applied, and by presenting the output of the models to the user as a probabilistic database view. We support declarative querying over such views using an extended version of SQL that allows for querying probabilistic data. Underneath we use particle filters, a class of sequential Monte Carlo algorithms commonly used to implement dynamic probabilistic models, to represent the present and historical states of the model as sets of weighted samples (particles) that are kept uptodate as new readings arrive. We develop novel techniques to convert the queries on the modelbased view directly into queries over particle tables, enabling highly efficient query processing. Finally, we present experimental evaluation of our prototype implementation over sensor data from the Intel Lab dataset that demonstrates the feasibility of online modeling of streaming data using our system and establishes the advantages of such tight integration between dynamic probabilistic models and database systems. 1
Using Probabilistic Models for Data Management in Acquisitional Environments
, 2005
"... Traditional database systems, particularly those focused on capturing and managing data from the real world, are poorly equipped to deal with the noise, loss, and uncertainty in data. We discuss a suite of techniques based on probabilistic models that are designed to allow database to tolerate noise ..."
Abstract

Cited by 47 (3 self)
 Add to MetaCart
Traditional database systems, particularly those focused on capturing and managing data from the real world, are poorly equipped to deal with the noise, loss, and uncertainty in data. We discuss a suite of techniques based on probabilistic models that are designed to allow database to tolerate noise and loss. These techniques are based on exploiting correlations to predict missing values and identify outliers. Interestingly, correlations also provide a way to give approximate answers to users at a significantly lower cost and enable a range of new types of queries over the correlation structure itself. We illustrate a host of applications for our new techniques and queries, ranging from sensor networks to network monitoring to data stream management. We also present a unified architecture for integrating such models into database systems, focusing in particular on acquisitional systems where the cost of capturing data (e.g., from sensors) is itself a significant part of the query processing cost.
Semantics of ranking queries for probabilistic data and expected ranks
 In Proc. of ICDE’09
, 2009
"... Abstract — When dealing with massive quantities of data, topk queries are a powerful technique for returning only the k most relevant tuples for inspection, based on a scoring function. The problem of efficiently answering such ranking queries has been studied and analyzed extensively within traditi ..."
Abstract

Cited by 43 (1 self)
 Add to MetaCart
Abstract — When dealing with massive quantities of data, topk queries are a powerful technique for returning only the k most relevant tuples for inspection, based on a scoring function. The problem of efficiently answering such ranking queries has been studied and analyzed extensively within traditional database settings. The importance of the topk is perhaps even greater in probabilistic databases, where a relation can encode exponentially many possible worlds. There have been several recent attempts to propose definitions and algorithms for ranking queries over probabilistic data. However, these all lack many of the intuitive properties of a topk over deterministic data. Specifically, we define a number of fundamental properties, including exactk, containment, uniquerank, valueinvariance, and stability, which are all satisfied by ranking queries on certain data. We argue that all these conditions should also be fulfilled by any reasonable definition for ranking uncertain data. Unfortunately, none of the existing definitions is able to achieve this. To remedy this shortcoming, this work proposes an intuitive new approach of expected rank. This uses the wellfounded notion of the expected rank of each tuple across all possible worlds as the basis of the ranking. We are able to prove that, in contrast to all existing approaches, the expected rank satisfies all the required properties for a ranking query. We provide efficient solutions to compute this ranking across the major models of uncertain data, such as attributelevel and tuplelevel uncertainty. For an uncertain relation of N tuples, the processing cost is O(N log N)—no worse than simply sorting the relation. In settings where there is a high cost for generating each tuple in turn, we provide pruning techniques based on probabilistic tail bounds that can terminate the search early and guarantee that the topk has been found. Finally, a comprehensive experimental study confirms the effectiveness of our approach. I.
A samplingbased approach to optimizing topk queries in sensor networks
 In ICDE
, 2006
"... Wireless sensor networks generate a vast amount of data. This data, however, must be sparingly extracted to conserve energy, usually the most precious resource in batterypowered sensors. When approximation is acceptable, a modeldriven approach to query processing is effective in saving energy by a ..."
Abstract

Cited by 42 (3 self)
 Add to MetaCart
Wireless sensor networks generate a vast amount of data. This data, however, must be sparingly extracted to conserve energy, usually the most precious resource in batterypowered sensors. When approximation is acceptable, a modeldriven approach to query processing is effective in saving energy by avoiding contacting nodes whose values can be predicted or are unlikely to be in the result set. However, to optimize queries such as topk, reasoning directly with models of joint probability distributions can be prohibitively expensive. Instead of using models explicitly, we propose to use samples of past sensor readings. Not only are such samples simple to maintain, but they are also computationally efficient to use in query optimization. With these samples, we can formulate the problem of optimizing approximate topk queries under an energy constraint as a linear program. We demonstrate the power and flexibility of our samplingbased approach by developing a series of topk query planning algorithms with linear programming, which are capable of efficiently producing plans with better performance and novel features. We show that our approach is both theoretically sound and practically effective on simulated and realworld datasets. 1
Probabilistic Verifiers: Evaluating Constrained NearestNeighbor Queries over Uncertain Data
"... Abstract — In applications like locationbased services, sensor monitoring and biological databases, the values of the database items are inherently uncertain in nature. An important query for uncertain objects is the Probabilistic NearestNeighbor Query (PNN), which computes the probability of each ..."
Abstract

Cited by 42 (6 self)
 Add to MetaCart
Abstract — In applications like locationbased services, sensor monitoring and biological databases, the values of the database items are inherently uncertain in nature. An important query for uncertain objects is the Probabilistic NearestNeighbor Query (PNN), which computes the probability of each object for being the nearest neighbor of a query point. Evaluating this query is computationally expensive, since it needs to consider the relationship among uncertain objects, and requires the use of numerical integration or MonteCarlo methods. Sometimes, a query user may not be concerned about the exact probability values. For example, he may only need answers that have sufficiently high confidence. We thus propose the Constrained NearestNeighbor Query (CPNN), which returns the IDs of objects whose probabilities are higher than some threshold, with a given error bound in the answers. The CPNN can be answered efficiently with probabilistic verifiers. These are methods that derive the lower and upper bounds of answer probabilities, so that an object can be quickly decided on whether it should be included in the answer. We have developed three probabilistic verifiers, which can be used on uncertain data with arbitrary probability density functions. Extensive experiments were performed to examine the effectiveness of these approaches. I.
Optimal nonmyopic value of information in graphical models  efficient algorithms and theoretical limits
 In Proc. of IJCAI
, 2005
"... Many realworld decision making tasks require us to choose among several expensive observations. In a sensor network, for example, it is important to select the subset of sensors that is expected to provide the strongest reduction in uncertainty. It has been general practice to use heuristicguided ..."
Abstract

Cited by 37 (5 self)
 Add to MetaCart
Many realworld decision making tasks require us to choose among several expensive observations. In a sensor network, for example, it is important to select the subset of sensors that is expected to provide the strongest reduction in uncertainty. It has been general practice to use heuristicguided procedures for selecting observations. In this paper, we present the first efficient optimal algorithms for selecting observations for a class of graphical models containing Hidden Markov Models (HMMs). We provide results for both selecting the optimal subset of observations, and for obtaining an optimal conditional observation plan. For both problems, we present algorithms for the filtering case, where only observations made in the past are taken into account, and the smoothing case, where all observations are utilized. Furthermore we prove a surprising result: In most graphical models tasks, if one designs an efficient algorithm for chain graphs, such as HMMs, this procedure can be generalized to polytrees. We prove that the value of information problem is NP PPhard even for discrete polytrees. It also follows from our results that even computing conditional entropies, which are widely used to measure value of information, is a #Pcomplete problem on polytrees. Finally, we demonstrate the effectiveness of our approach on several realworld datasets. 1
Modelbased Approximate Querying in Sensor Networks
 VLDB JOURNAL
, 2005
"... Declarative queries are proving to be an attractive paradigm for interacting with networks of wireless sensors. The metaphor that “the sensornet is a database” is problematic, however, because sensors do not exhaustively represent the data in the real world. In order to map the raw sensor readings ..."
Abstract

Cited by 36 (0 self)
 Add to MetaCart
Declarative queries are proving to be an attractive paradigm for interacting with networks of wireless sensors. The metaphor that “the sensornet is a database” is problematic, however, because sensors do not exhaustively represent the data in the real world. In order to map the raw sensor readings onto physical reality, a model of that reality is required to complement the readings. In this article, we enrich interactive sensor querying with statistical modeling techniques. We demonstrate that such models can help provide answers that are both more meaningful, and, by introducing approximations with probabilistic confidences, significantly more efficient to compute in both time and energy. Utilizing the combination of a model and live data acquisition raises the challenging optimization problem of selecting the best sensor readings to acquire, balancing the increase in the confidence of our answer against the communication and data acquisition costs in the network. We describe an exponential time algorithm for finding the optimal solution to this optimization problem, and a polynomialtime heuristic for identifying solutions that perform well in practice. We evaluate our approach on several realworld sensornetwork data sets, taking into account the real measured data and communication quality, demonstrating that our modelbased approach provides a highfidelity representation of the real phenomena and leads to significant performance gains versus traditional data acquisition techniques.
Exploiting correlated attributes in acquisitional query processing
 In ICDE
, 2005
"... Sensor networks and other distributed information systems (such as the Web) must frequently access data that has a high perattribute acquisition cost, in terms of energy, latency, or computational resources. When executing queries that contain several predicates over such expensive attributes, we o ..."
Abstract

Cited by 32 (6 self)
 Add to MetaCart
Sensor networks and other distributed information systems (such as the Web) must frequently access data that has a high perattribute acquisition cost, in terms of energy, latency, or computational resources. When executing queries that contain several predicates over such expensive attributes, we observe that it can be beneficial to use correlations to automatically introduce lowcost attributes whose observation will allow the query processor to better estimate the selectivity of these expensive predicates. In particular, we show how to build conditional plans that branch into one or more subplans, each with a different ordering for the expensive query predicates, based on the runtime observation of lowcost attributes. We frame the problem of constructing the optimal conditional plan for a given user query and set of candidate lowcost attributes as an optimization problem. We describe an exponential time algorithm for finding such optimal plans, and describe a polynomialtime heuristic for identifying conditional plans that perform well in practice. We also show how to compactly model conditional probability distributions needed to identify correlations and build these plans. We evaluate our algorithms against several realworld sensornetwork data sets, showing severaltimes performance increases for a variety of queries versus traditional optimization techniques. 1.
Algorithms for Subset Selection in Linear Regression
 STOC'08
, 2008
"... We study the problem of selecting a subset of k random variables to observe that will yield the best linear prediction of another variable of interest, given the pairwise correlations between the observation variables and the predictor variable. Under approximation preserving reductions, this proble ..."
Abstract

Cited by 30 (3 self)
 Add to MetaCart
We study the problem of selecting a subset of k random variables to observe that will yield the best linear prediction of another variable of interest, given the pairwise correlations between the observation variables and the predictor variable. Under approximation preserving reductions, this problem is also equivalent to the“sparse approximation”problem of approximating signals concisely. We propose and analyze exact and approximation algorithms for several special cases of practical interest. We give an FPTAS when the covariance matrix has constant bandwidth, and exact algorithms when the associated covariance graph, consisting of edges for pairs of variables with nonzero correlation, forms a tree or has a large (known) independent set. Furthermore, we give an exact algorithm when the variables can be embedded into a line such that the covariance decreases exponentially in the distance, and a constantfactor approximation when the variables have no “conditional suppressor variables”. Much of our reasoning is based on perturbation results for the R 2 multiple correlation measure, frequently used as a measure for “goodnessoffit statistics”. It lies at the core of our FPTAS, and also allows us to extend exact algorithms to approximation algorithms when the matrix “nearly ” falls into one of the above classes. We also use perturbation analysis to prove approximation guarantees for the widely used “Forward Regression ” heuristic when the observation variables are nearly independent.