Results 1-10 of 42
Data Streams: Algorithms and Applications
, 2005
Cited by 404 (24 self)
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudorandom computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].
Issues in Data Stream Management
, 2003
Cited by 137 (6 self)
Traditional databases store sets of relatively static records with no predefined notion of time, unless timestamp attributes are explicitly added. While this model adequately represents commercial catalogues or repositories of personal information, many current and emerging applications require support for online analysis of rapidly changing data streams. Limitations of traditional DBMSs in supporting streaming applications have been recognized, prompting research to augment existing technologies and build new systems to manage streaming data. The purpose of this paper is to review recent work in data stream management systems, with an emphasis on application requirements, data models, continuous query languages, and query evaluation.
Approximate Join Processing Over Data Streams
, 2003
Cited by 108 (3 self)
We consider the problem of approximating sliding window joins over data streams in a data stream processing system with limited resources. In our model, we deal with resource constraints by shedding load in the form of dropping tuples from the data streams. We first discuss alternate architectural models for data stream join processing, and we survey suitable measures for the quality of an approximation of a set-valued query result. We then consider the number of generated result tuples as the quality measure, and we give optimal offline and fast online algorithms for it. In a thorough experimental study with synthetic and real data we show the efficacy of our solutions. For applications with demand for exact results we introduce a new Archive-metric which captures the amount of work needed to complete the join in case the streams are archived for later processing.
Conceptual partitioning: an efficient method for continuous nearest neighbor monitoring
 In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data
, 2005
Cited by 83 (16 self)
Given a set of objects P and a query point q, a k nearest neighbor (kNN) query retrieves the k objects in P that lie closest to q. Even though the problem is well-studied for static datasets, the traditional methods do not extend to highly dynamic environments where multiple continuous queries require real-time results, and both objects and queries receive frequent location updates. In this paper we propose conceptual partitioning (CPM), a comprehensive technique for the efficient monitoring of continuous NN queries. CPM achieves low running time by handling location updates only from objects that fall in the vicinity of some query (and ignoring the rest). It can be used with multiple, static or moving queries, and it does not make any assumptions about the object moving patterns. We analyze the performance of CPM and show that it outperforms the current state-of-the-art algorithms for all problem settings. Finally, we extend our framework to aggregate NN (ANN) queries, which monitor the data objects that minimize the aggregate distance with respect to a set of query points (e.g., the objects with the minimum sum of distances to all query points).
Tuple Routing Strategies for Distributed Eddies
, 2003
Cited by 59 (2 self)
Many applications that consist of streams of data are inherently distributed. Since input stream rates and other system parameters such as the amount of available computing resources can fluctuate significantly, a stream query plan must be able to adapt to these changes. Routing tuples between operators of a distributed stream query plan is used in several data stream management systems as an adaptive query optimization technique. The routing policy used can have a significant impact on system performance. In this paper, we use a queuing network to model a distributed stream query plan and define performance metrics for response time and system throughput. We also propose and evaluate several practical routing policies for a distributed stream management system. The performance results of these policies are compared using a discrete event simulator.
Reverse kNN Search in Arbitrary Dimensionality
 In VLDB
, 2004
Cited by 59 (4 self)
Given a point q, a reverse k nearest neighbor (RkNN) query retrieves all the data points that have q as one of their k nearest neighbors. Existing methods for processing such queries have at least one of the following shortcomings: [...] they are applicable only to 2D data (but not to higher dimensionality), and (iv) they retrieve only approximate results. Motivated by these shortcomings, we develop algorithms for exact processing of RkNN with arbitrary values of k on dynamic multidimensional datasets. Our methods utilize a conventional data-partitioning index on the dataset and do not require any precomputation. In addition to their flexibility, we experimentally verify that the proposed algorithms outperform the existing ones even in their restricted focus.
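The RkNN definition in the abstract above can be made concrete with a brute-force sketch. This is only an illustration of the query semantics, not the index-based algorithms the paper proposes; the function names `knn` and `rknn`, the sample points, and the tie-breaking rule (ties are resolved against q) are assumptions made for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def knn(points, q, k):
    """Return the k points of `points` closest to q (brute force)."""
    return sorted(points, key=lambda p: dist(p, q))[:k]


def rknn(points, q, k):
    """Return every data point p that has q among its k nearest neighbors.

    The candidate neighbors of p are the other data points plus q itself.
    q is appended last, so a distance tie is resolved against q (assumption).
    """
    result = []
    for i, p in enumerate(points):
        others = points[:i] + points[i + 1:] + [q]
        if q in knn(others, p, k):
            result.append(p)
    return result


# Hypothetical 2D data: two points near q, two far away.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
q = (0.5, 0.0)
print(rknn(points, q, 1))
```

With k = 1, only the two points next to q report q as their nearest neighbor; the two distant points are each other's nearest neighbor and are excluded. This scan costs a full sort per data point, which is the kind of cost an index-based method is meant to avoid.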
High Dimensional Reverse Nearest Neighbor Queries
 In CIKM
, 2003
Cited by 40 (0 self)
Reverse Nearest Neighbor (RNN) queries are of particular interest in a wide range of applications such as decision support systems, profile-based marketing, data streaming, document databases, and bioinformatics. The earlier approaches to solving this problem mostly deal with two-dimensional data. However, most of the above applications inherently involve high dimensions, and the high-dimensional RNN problem is still unexplored. In this paper, we propose an approximate solution to answer RNN queries in high dimensions. Our approach [...] kNN and RNN. It works in two phases. In the first phase the kNN of a query point are found, and in the next phase they are further analyzed using a novel type of query, the Boolean Range Query (BRQ). Experimental results show that BRQ is much more efficient than both NN and range queries, and can be effectively used to answer RNN queries. Performance is further improved by running multiple BRQs simultaneously. The proposed approach can also be used to answer other variants of RNN queries such as RNN of order k, bichromatic RNN, and Matching Query, which has many applications of its own. Our technique can efficiently answer NN, RNN, and its variants with approximately the same number of I/Os as running an NN query.
Group nearest neighbor queries
 In ICDE
, 2004
Cited by 34 (2 self)
Given two sets of points P and Q, a group nearest neighbor (GNN) query retrieves the point(s) of P with the smallest sum of distances to all points in Q. Consider, for instance, three users at locations q1, q2 and q3 that want to find a meeting point (e.g., a restaurant); the corresponding query returns the data point p that minimizes the sum of Euclidean distances |pq_i| for 1≤i≤3. Assuming that Q fits in memory and P is indexed by an R-tree, we propose several algorithms for finding the group nearest neighbors efficiently. As a second step, we extend our techniques for situations where Q cannot fit in memory, covering both indexed and non-indexed query points. An experimental evaluation identifies the best alternative based on the data and query properties.
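The GNN definition in the abstract above can be illustrated with a brute-force sketch that scans all of P. It shows only what the query returns; it is not one of the R-tree algorithms the paper proposes. The function name `gnn` and the sample coordinates are assumptions made for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def gnn(P, Q):
    """Return the point(s) of P with the smallest sum of Euclidean distances to Q."""
    # Pair every candidate p with its total distance to all query points.
    totals = [(sum(dist(p, q) for q in Q), p) for p in P]
    best = min(total for total, _ in totals)
    # Several points can share the minimum, hence a list is returned.
    return [p for total, p in totals if total == best]


# Hypothetical meeting-point example: three users and three candidate places.
Q = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
P = [(2.0, 1.0), (0.0, 0.0), (5.0, 5.0)]
print(gnn(P, Q))
```

Here the candidate (2.0, 1.0) lies between the three users and has the smallest total distance, so it is the only point returned. The scan computes |P| x |Q| distances.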
Approximate NN Queries on Streams with Guaranteed Error/Performance Bounds
 In VLDB
, 2004
Cited by 29 (2 self)
In data stream applications, data arrive continuously and can only be scanned once as the query processor has very limited memory (relative to the size of the stream) to work with.
Computing Diameter in the Streaming and Sliding-Window Models
 Algorithmica
, 2002
Cited by 28 (2 self)
We investigate the diameter problem in the streaming and sliding-window models. We show that, for a stream of n points or a sliding window of size n, any exact algorithm for diameter requires Ω(n) bits of space. We present a simple ε-approximation algorithm for computing the diameter in the streaming model. Our main result is an ε-approximation algorithm that maintains the diameter in two dimensions in the sliding-window model using O((1/ε^{3/2}) log^3 n (log R + log log n + log(1/ε))) bits of space, where R is the maximum, over all windows, of the ratio of the diameter to the minimum nonzero distance between any two points in the window. 1 Introduction. In recent years, massive data sets have become increasingly important in a wide range of applications. In many applications, the input can be viewed as a data stream [12, 7] that the