Results 1–10 of 19
Deterministic Sampling and Range Counting in Geometric Data Streams
 In Proc. 20th ACM Sympos. Comput. Geom
, 2004
Abstract

Cited by 31 (0 self)
We present memory-efficient deterministic algorithms for constructing ε-nets and ε-approximations of streams of geometric data. Unlike probabilistic approaches, these deterministic samples provide guaranteed bounds on their approximation factors. We show how our deterministic samples can be used to answer approximate online iceberg geometric queries on data streams. We use these techniques to approximate several robust statistics of geometric data streams, including Tukey depth, simplicial depth, regression depth, the Theil-Sen estimator, and the least median of squares. Our algorithms use only a polylogarithmic amount of memory, provided the desired approximation factors are inverse-polylogarithmic. We also include a lower bound for non-iceberg geometric queries.
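An ε-approximation guarantees that, for every range in the family, the fraction of sample points inside the range differs from the fraction of data points inside it by at most ε. As a minimal one-dimensional illustration (not the paper's algorithm: the helper and sampling rule below are assumptions), keeping every k-th point of the sorted data is a deterministic sample whose discrepancy over interval ranges is at most k/n:

```python
import bisect

def interval_discrepancy(data, sample):
    """Max over interval ranges of |fraction of data inside - fraction of sample inside|.
    Brute force: intervals with endpoints at data points suffice for the supremum."""
    data, sample = sorted(data), sorted(sample)
    worst = 0.0
    for i in range(len(data)):
        for j in range(i, len(data)):
            lo, hi = data[i], data[j]
            d_frac = (j - i + 1) / len(data)
            s_in = bisect.bisect_right(sample, hi) - bisect.bisect_left(sample, lo)
            worst = max(worst, abs(d_frac - s_in / len(sample)))
    return worst

data = list(range(100))
k = 10
sample = sorted(data)[k // 2::k]      # deterministic: every k-th point of the sorted data
# the discrepancy over all intervals is bounded by k/n = 1/|sample|
assert interval_discrepancy(data, sample) <= k / len(data)
```

Unlike a random sample, this guarantee holds unconditionally, which is the "guaranteed bounds on approximation factors" the abstract emphasizes.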
Deterministic algorithms for sampling count data
 Data and Knowledge Engineering, accepted for publication
, 2007
Abstract

Cited by 6 (1 self)
Processing and extracting meaningful knowledge from count data is an important problem in data mining. The volume of data is increasing dramatically, as data is generated by day-to-day activities such as market-basket transactions, web clickstreams, or network traffic. Most mining and analysis algorithms require multiple passes over the data, which can be prohibitively time-consuming. One way to save time is to use samples, since a sample is a good surrogate for the data and the same sample can be used to answer many kinds of queries. In this paper, we propose two deterministic sampling algorithms, Biased-L2 and DRS. Both produce samples vastly superior to previous deterministic and random algorithms, in both sample quality and accuracy. Our algorithms also improve on the running time and memory footprint of existing deterministic algorithms. The new algorithms can be used to sample from a relational database as well as from data streams, examining each transaction only once and maintaining the sample on the fly in a streaming fashion. We further show how to engineer one of our algorithms (DRS) to adapt and recover from changes to the underlying data distribution or to the sample size. We evaluate our algorithms on …
Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees
 ACM Trans. Knowl. Disc. from Data
Abstract

Cited by 5 (5 self)
The tasks of extracting (top-K) Frequent Itemsets (FI’s) and Association Rules (AR’s) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High-quality approximations of FI’s and AR’s are sufficient for most practical uses, and a number of recent works explored the application of sampling for fast discovery of approximate solutions to the problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-K) FI’s and AR’s. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a characterization of the VC-dimension of this range space and a proof that it is upper bounded by an easy-to-compute characteristic quantity of the dataset which we call the d-index, namely the maximum integer d such that the dataset contains at least d transactions of length at least d. We show that this bound is strict for a large class of datasets. The resulting sample size for an absolute (resp. relative) (ε, δ)-approximation of the collection of FI’s is O((1/ε²)(d + log(1/δ))) (resp. O(((2+ε)/(ε²(2−ε)θ))(d log …
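The d-index is defined directly in the abstract: the largest d such that the dataset contains at least d transactions of length at least d, i.e., an h-index over transaction lengths. A small sketch of that computation (the toy dataset is illustrative):

```python
def d_index(transactions):
    """d-index: the largest d such that the dataset contains at least d
    transactions of length at least d (an h-index over transaction lengths)."""
    lengths = sorted((len(t) for t in transactions), reverse=True)
    d = 0
    for i, ell in enumerate(lengths, start=1):
        if ell >= i:
            d = i           # at least i transactions have length >= i
        else:
            break
    return d

# toy dataset: 3 transactions of length >= 3 exist, but not 4 of length >= 4
dataset = [{1, 2, 3, 4}, {1, 2, 3}, {2, 3, 5}, {1, 4}]
assert d_index(dataset) == 3
# per the abstract, the absolute (eps, delta)-approximation sample size then
# scales as (1/eps**2) * (d_index + log(1/delta)) up to a constant factor
```

One pass over transaction lengths suffices, so the bound is indeed "easy to compute" even for very large datasets.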
Mission-Critical Management of Mobile Sensors (or, How to Guide a Flock of Sensors)
 In DMSN
, 2004
Abstract

Cited by 4 (1 self)
This work addresses the problem of optimizing the deployment of sensors in order to ensure the quality of readings of a value of interest in a given (critical) geographic region. As usual, we assume that each sensor is capable of reading a particular physical phenomenon (e.g., the concentration of toxic materials in the air) and transmitting it to a server or a peer. The key assumptions of this work, however, are: 1. each sensor is capable of moving, where the motion may be remotely controlled; and 2. the spatial range for which an individual sensor’s reading is guaranteed to be of the desired quality is limited. In scenarios like disaster management and homeland security, if some of the sensors dispersed in a larger geographic area report a value above a certain threshold, one may want to ensure a given quality of readings for the affected region. This, in turn, implies ensuring that there are enough sensors in the affected region and, consequently, guiding a subset of the remaining sensors towards it. In this paper we explore variants of the problem of optimizing the guidance of mobile sensors towards an affected geographic region and present algorithms for their solution.
Deterministic Data Reduction in Sensor Networks
 In Third IEEE International Conference on Mobile Ad-hoc and Sensor Systems (short paper)
, 2006
Abstract

Cited by 3 (2 self)
A wide range of mining and analysis problems involve extracting knowledge from count data. Such data typically arises from transactional data sets; here we consider the case where it arises from a highly distributed source such as a sensor network. A general approach that scales well with the data is sampling, and we have proposed several deterministic streaming algorithms for efficiently reducing such data. Those algorithms perform with significantly better accuracy than random sampling for problems such as frequency estimation, correlation detection, association rules, and iceberg cube computations. In this paper, we consider the distributed version of the problem, and specifically the case when the data originates from a sensor network. We engineer a fully distributed version of our algorithm, which builds a deterministic sample along a tree-like aggregation structure. We demonstrate that this distributed sample has about the same quality as one computed by running our (non-distributed) deterministic algorithm on the underlying data. We compare to other (non-deterministic) sampling algorithms while focusing on issues such as sample quality, network longevity, and energy and communication costs.
Index Terms: distributed algorithm, sampling, data reduction, data mining, count dataset, frequency estimation, association …
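The abstract does not spell out the paper's algorithm, so the sketch below only illustrates the general shape of building a bounded-size deterministic sample along a tree-like aggregation structure: each node merges its own readings with its children's samples and deterministically shrinks the result before forwarding it upstream (here, the classic sort-and-halve step for one-dimensional values; the node representation and the size cap are illustrative assumptions):

```python
def reduce_half(sample):
    """Deterministically halve a sample: sort and keep every other element
    (the classic merge-and-reduce step for 1-D summaries)."""
    return sorted(sample)[::2]

def aggregate(node):
    """Recursively build a bounded-size deterministic sample up an
    aggregation tree. A node is (local_readings, children)."""
    readings, children = node
    merged = sorted(readings)
    for child in children:
        merged = sorted(merged + aggregate(child))
    while len(merged) > 8:          # cap the sample size sent upstream
        merged = reduce_half(merged)
    return merged

leaf1 = ([1, 5, 9, 13], [])
leaf2 = ([2, 6, 10, 14], [])
root = ([3, 7], [leaf1, leaf2])
out = aggregate(root)               # bounded-size summary of all readings
```

Because every step is deterministic, the summary reaching the root is reproducible, which is what allows its quality to be compared directly against the non-distributed algorithm run on the pooled data.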
A New Deterministic Data Aggregation Method For Wireless Sensor Networks
, 2007
Abstract

Cited by 3 (1 self)
The processing capabilities of wireless sensor nodes make it possible to aggregate redundant data and thereby limit total data flow over the network. The main property of a good aggregation algorithm is that it extracts the most representative data using minimum resources. From this point of view, sampling is a promising aggregation method: a sample acts as a surrogate for the whole data and, once extracted, can be used to answer many kinds of queries (such as AVG, MEDIAN, SUM, COUNT, etc.) at no extra cost to the sensor network. Additionally, sampling preserves correlations between attributes of multidimensional data, which is quite valuable for further data mining. In this paper, we propose a novel, distributed, weighted sampling algorithm to sample sensor network data, and compare it to an existing random sampling algorithm, the only previous algorithm that works in this setting. We run popular queries to evaluate our algorithm on a real-world data set covering climate data in the U.S. for the past 100 years. During testing, we focus on issues such as sample quality, network longevity, and energy and communication costs.
Frequent Pattern Mining in Data Streams
Abstract

Cited by 2 (0 self)
Frequent pattern mining is a core data mining operation and has been extensively studied over the last decade. Recently, mining frequent patterns over data streams has attracted a lot of research interest. Compared with other streaming queries, frequent pattern mining poses great challenges due to high memory and computational costs and the accuracy requirements on the mining results. In this chapter, we survey the state-of-the-art techniques for mining frequent patterns over data streams. We also introduce a new approach to this problem, which makes two major contributions. First, this one-pass algorithm for frequent itemset mining has deterministic bounds on its accuracy and does not require any out-of-core summary structure. Second, because the one-pass algorithm does not produce any false negatives, it can easily be extended to an accurate two-pass algorithm. The two-pass algorithm is very memory-efficient.
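The chapter's algorithm is not detailed in the abstract, but Lossy Counting (Manku and Motwani) is the canonical one-pass scheme with exactly the properties described: a deterministic accuracy bound and no false negatives. A sketch for single items (itemsets require an additional subset-enumeration step; the toy stream is illustrative):

```python
def lossy_count(stream, eps):
    """One-pass Lossy Counting: each stored count undercounts the true
    frequency by at most eps * N, so reporting items with stored count
    >= (theta - eps) * N yields no false negatives at support theta."""
    counts, deltas = {}, {}
    width = int(1 / eps)                  # bucket width
    for n, item in enumerate(stream, start=1):
        bucket = (n - 1) // width + 1
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1     # max undercount for a new entry
        if n % width == 0:                # end of bucket: prune rare entries
            for key in [k for k in counts if counts[k] + deltas[k] <= bucket]:
                del counts[key]
                del deltas[key]
    return counts, deltas

stream = ['a'] * 50 + ['b'] * 30 + list('cdefghij') * 2 + ['a'] * 4
counts, deltas = lossy_count(stream, eps=0.1)
N, theta = len(stream), 0.3
frequent = sorted(k for k in counts if counts[k] >= (theta - 0.1) * N)
```

The summary size stays bounded regardless of stream length, and a second pass with exact counters over just the reported candidates turns this into the accurate two-pass variant the abstract mentions.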
Structure-Aware Sampling: Flexible and Accurate Summarization
Abstract

Cited by 2 (1 self)
In processing large quantities of data, a fundamental problem is to obtain a summary which supports approximate query answering. Random sampling yields flexible summaries which naturally support subset-sum queries with unbiased estimators and well-understood confidence bounds. Classic sample-based summaries, however, are designed for arbitrary subset queries and are oblivious to the structure in the set of keys. That structure, such as hierarchy, order, or product space (multi-dimensional), makes range queries much more relevant for most analyses of the data. Dedicated summarization algorithms for range-sum queries have also been extensively studied. They can outperform existing sampling schemes in terms of accuracy on range queries per summary size. Their accuracy, however, rapidly degrades when, as is often the case, the query spans multiple ranges. They are also less flexible, being targeted at range-sum queries alone, and are often quite costly to build and use. In this paper we propose and evaluate variance-optimal sampling schemes that are structure-aware. These summaries improve on the accuracy of existing structure-oblivious sampling schemes on range queries while retaining the benefits of sample-based summaries: flexible summaries with high accuracy on both range queries and arbitrary subset queries.
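The unbiased subset-sum estimation the abstract refers to is, in its simplest form, the Horvitz-Thompson estimator: keep each key with some inclusion probability and store its weight divided by that probability, so every subset-sum estimate is unbiased. A minimal sketch (the paper's structure-aware schemes choose the inclusion probabilities far more carefully; the names and numbers below are illustrative):

```python
import random

def poisson_sample(keys, weights, probs):
    """Poisson sampling: keep key i independently with probability probs[i],
    storing the Horvitz-Thompson adjusted weight weights[i] / probs[i]."""
    return {k: w / p for k, w, p in zip(keys, weights, probs)
            if random.random() < p}

def subset_sum(summary, predicate):
    """Unbiased estimate of the total weight of keys matching predicate."""
    return sum(aw for k, aw in summary.items() if predicate(k))

random.seed(0)
keys = list(range(10))
weights = [float(k) for k in keys]
probs = [0.5] * len(keys)
# average many independent estimates of the even-key subset sum (true value 20.0)
est = [subset_sum(poisson_sample(keys, weights, probs), lambda k: k % 2 == 0)
       for _ in range(2000)]
mean = sum(est) / len(est)
```

Any predicate over the keys can be posed against the same summary, which is the flexibility the abstract contrasts with dedicated range-sum structures; variance, not bias, is what structure-aware probability choices then improve.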
Deterministic sampling beyond EASE: reducing multidimensional data
Abstract

Cited by 1 (1 self)
A wide range of mining and analysis problems involve extracting knowledge from count data. A general approach that scales well with the data is sampling. One algorithm for efficiently reducing such data is EASE, which was presented at this conference two years ago. Building on this work, we present a vastly simplified and more powerful algorithm (Biased-L2) for deterministically sampling streaming count data and iceberg cubes. The algorithm produces samples more than an order of magnitude better than random samples. Coupled with other methods such as Lossy Counting, we drastically improve on the memory footprint and accuracy of EASE. We also present a deterministic variant of reservoir sampling (DRS) which produces samples of similar quality; it can maintain a sample for a data stream, or for a table in a relational database or data cube under insertions and deletions, and has an excellent recovery rate in case the distribution of items changes.