Sampling algorithms in a stream operator
 In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data
, 2005
Cited by 30 (3 self)
Complex queries over high-speed data streams often need to rely on approximations to keep up with their input. The research community has developed a rich literature on approximate streaming algorithms for this application. Many of these algorithms produce samples of the input stream, providing better properties than conventional random sampling. In this paper, we abstract the stream sampling process and design a new stream sample operator. We show how it can be used to implement a wide variety of algorithms that perform sampling and sampling-based aggregations. We also show how to implement the operator in Gigascope, a high-speed stream database specialized for IP network monitoring applications. As an example study, we apply the operator within such an enhanced Gigascope to perform subset-sum sampling, which is of great interest for IP network management. We evaluate this implementation on a live, high-speed Internet traffic data stream and find that (a) the operator is a flexible, versatile addition to Gigascope suitable for tuning and algorithm engineering, and (b) the operator imposes only a small evaluation overhead. This is the first operational implementation we know of for a wide variety of stream sampling algorithms at line speed within a data stream management system.
Range counting over multidimensional data streams
 Discrete & Computational Geometry
, 2004
Cited by 27 (0 self)
We consider the problem of approximate range counting over streams of d-dimensional points. In the data stream model, the algorithm makes a single scan of the data, which is presented in an arbitrary order, and computes a compact summary (called a sketch). The sketch, whose size depends on the approximation parameter ε, can be used to count the number of points inside a query range within additive error εn, where n is the size of the stream. We present several results, deterministic and randomized, for both rectangle and half-plane ranges. 1. Introduction. Data streams have emerged as an important paradigm for processing data that arrives and needs to be processed continuously. For instance, telecom service providers routinely monitor packet flows through their networks to infer usage patterns and signs of attack, or to optimize their routing tables. Financial markets, banks, web servers, and news organizations also generate rapid and continuous data streams.
Bayesian Statistics
 in WWW', Computing Science and Statistics
, 1989
Cited by 20 (0 self)
This dissertation presents two topics from opposite disciplines: one is from a parametric realm and the other is based on nonparametric methods. The first topic is a jackknife maximum likelihood approach to statistical model selection, and the second is a convex hull peeling depth approach to nonparametric massive multivariate data analysis. The second topic includes simulations and applications on massive astronomical data. First, we present a model selection criterion that minimizes the Kullback-Leibler distance by using the jackknife method. Various model selection methods have been developed to choose a model of minimum Kullback-Leibler distance to the true model, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), minimum description length (MDL), and the bootstrap information criterion. Likewise, the jackknife method chooses a model of minimum Kullback-Leibler distance through bias reduction. This bias, which is inevitable in model
Adaptive spatial partitioning for multidimensional data streams
 In ISAAC
, 2004
Cited by 19 (5 self)
We propose a space-efficient scheme for summarizing multidimensional data streams. Our sketch can be used to solve spatial versions of several classical data stream queries efficiently. For instance, we can track ε-hotspots, which are congruent boxes containing at least an ε fraction of the stream, and maintain hierarchical heavy hitters in d dimensions. Our sketch can also be viewed as a multidimensional generalization of the ε-approximate quantile summary. The space complexity of our scheme is O((1/ε) log R) if the points lie in the domain [0, R]^d, where d is assumed to be a constant. The scheme extends to the sliding window model with a log(εn) factor increase in space, where n is the size of the sliding window. Our sketch can also be used to answer ε-approximate rectangular range queries over a stream of d-dimensional points.
A Space-Optimal Data-Stream Algorithm for Coresets in the Plane
Cited by 15 (5 self)
Given a point set P ⊆ R², a subset Q ⊆ P is an ε-kernel of P if for every slab W containing Q, the (1+ε)-expansion of W also contains P. We present a data-stream algorithm for maintaining an ε-kernel of a stream of points in R² that uses O(1/√ε) space and takes O(log(1/ε)) amortized time to process each point. This is the first space-optimal data-stream algorithm for this problem. As a consequence, we obtain improved data-stream approximation algorithms for other extent measures, such as width and robust kernels, as well as for ε-kernels in higher dimensions.
A unified framework for approximating and clustering data. Manuscript available at arXiv.org
, 2011
Cited by 14 (7 self)
Given a set F of n positive functions over a ground set X, we consider the problem of computing x* that minimizes the expression ∑_{f∈F} f(x), over x ∈ X. A typical application is shape fitting, where we wish to approximate a set P of n elements (say, points) by a shape x from a (possibly infinite) family X of shapes. Here, each point p ∈ P corresponds to a function f such that f(x) is the distance from p to x, and we seek a shape x that minimizes the sum of distances from each point in P. In the k-clustering variant, each x ∈ X is a tuple of k shapes, and f(x) is the distance from p to its closest shape in x. Our main result is a unified framework for constructing coresets and approximate clustering for such general sets of functions. To achieve our results, we forge a link between the classic and well-defined notion of ε-approximations from the theory of PAC learning and VC dimension and the relatively new (and not so consistent) paradigm of coresets, which are some kind of “compressed representation” of the input set F. Using traditional techniques, a coreset usually implies an LTAS (linear time approximation scheme) for the corresponding optimization problem, which can be computed in parallel, via one pass over the data, and using only polylogarithmic space (i.e., in the streaming model). For several function families F for which coresets are known not to exist, or the corresponding (approximate) optimization problems are hard, our framework yields bicriteria approximations, or coresets that are large but contained in a low-dimensional space. We demonstrate our unified framework by applying it to projective clustering problems. We obtain new coreset constructions and significantly smaller coresets, over the ones that
Measuring independence of datasets
 CoRR
Cited by 10 (1 self)
Approximating pairwise, or k-wise, independence with sublinear memory is of considerable importance in the data stream model. In the streaming model the joint distribution is given by a stream of k-tuples, with the goal of testing correlations among the components measured over the entire stream. Indyk and McGregor (SODA 08) recently gave exciting new results for measuring pairwise independence in this model. Statistical distance is one of the most fundamental metrics for measuring the similarity of two distributions, and it has been a metric of choice in many papers that discuss distribution closeness. For pairwise independence, the Indyk and McGregor methods provide a log n-approximation under statistical distance between the joint and product distributions in the streaming model. Indyk and McGregor leave, as their main open question, the problem of improving their log n-approximation for the statistical distance metric. In this paper we solve the main open problem posed by Indyk and McGregor for the statistical distance for pairwise independence and extend this result to any constant k. In particular, we present an algorithm that computes an (ε, δ)-approximation of the statistical distance between the joint and product distributions defined by a stream of k-tuples. Our algorithm requires O(((1/ε) log(nm/δ))^((30+k)^k)) memory and a single pass over the data stream.
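To make the quantity being approximated concrete, here is a minimal exact baseline for the pairwise case (k = 2). It is not the paper's algorithm: it stores the full empirical distributions and therefore uses linear rather than sublinear memory. The function name is hypothetical, and the sketch assumes the common convention that statistical distance is half the L1 difference (some papers omit the factor 1/2).

```python
from collections import Counter
from itertools import product


def statistical_distance_from_independence(pairs):
    """Exact distance between the empirical joint distribution of a stream
    of pairs and the product of its two marginal distributions.

    Assumption: statistical distance = (1/2) * sum |joint - product|.
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        raise ValueError("empty stream")

    joint = Counter(pairs)                 # counts of (a, b)
    left = Counter(a for a, _ in pairs)    # marginal counts of first component
    right = Counter(b for _, b in pairs)   # marginal counts of second component

    total = 0.0
    # Pairs outside left x right have zero mass under both distributions,
    # so iterating over the product of the observed supports is sufficient.
    for a, b in product(left, right):
        p_joint = joint.get((a, b), 0) / n
        p_product = (left[a] / n) * (right[b] / n)
        total += abs(p_joint - p_product)
    return total / 2
```

A stream whose components are independent gives distance 0, while a perfectly correlated stream such as (0, 0), (1, 1) gives 0.5 under this convention.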
Identifying high cardinality internet hosts
 In Proceedings of IEEE INFOCOM
, 2009
Cited by 6 (1 self)
The Internet host cardinality, defined as the number of distinct peers that an Internet host communicates with, is an important metric for profiling Internet hosts. Example applications include behavior-based network intrusion detection, p2p host identification, and server identification. However, due to the tremendous number of hosts in the Internet and high-speed links, tracking the exact cardinality of each host is not feasible given limited memory and computation resources. Existing approaches to host cardinality counting have primarily focused on hosts of extremely high cardinalities. These methods do not work well with hosts of moderately large cardinalities, which are needed for certain host behavior profiling such as detection of p2p hosts or port scanners. In this paper, we propose an online sampling approach for identifying hosts whose cardinality exceeds some moderate prescribed threshold, e.g. 50, or falls within specific ranges. The main advantage of our approach is that it can filter out the majority of low-cardinality hosts while preserving the hosts of interest, and hence minimize the memory resources wasted by tracking irrelevant hosts. Our approach consists of three components: 1) two-phase filtering for eliminating low-cardinality hosts, 2) a thresholded bitmap for counting cardinalities, and 3) bias correction. Through both theoretical analysis and experiments using real Internet traces, we demonstrate that our approach requires much less memory than existing approaches while yielding more accurate estimates.
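As a rough illustration of bitmap-based cardinality counting for a single host, the sketch below uses plain linear counting, a standard estimator swapped in for illustration. It is not the paper's thresholded bitmap and omits its two-phase filtering and bias correction. The class name, the bitmap size, and the use of SHA-256 as the hash are all assumptions.

```python
import hashlib
import math


class BitmapCounter:
    """Linear-counting estimate of the number of distinct peers of one host."""

    def __init__(self, num_bits=1024):
        self.num_bits = num_bits
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def add(self, peer):
        # Hash the peer identifier to a bit position; duplicates hit the same bit.
        digest = hashlib.sha256(str(peer).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % self.num_bits
        self.bits[index] = 1

    def estimate(self):
        zeros = self.num_bits - sum(self.bits)
        if zeros == 0:
            return float("inf")  # bitmap saturated; estimate is unbounded
        # Linear counting: n is approximately -m * ln(fraction of zero bits).
        return -self.num_bits * math.log(zeros / self.num_bits)

    def exceeds(self, threshold):
        return self.estimate() >= threshold
```

Because repeated peers set the same bit, the estimate depends only on the set of distinct peers seen, which is what makes a fixed-size bitmap usable on a high-speed link.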
Monitoring continuous band-join queries over dynamic data
 In Proc. of the 16th Intl. Sympos. Algorithms and Computation
, 2005
Cited by 5 (4 self)
A continuous query is a standing query over a dynamic data set whose query result needs to be constantly updated as new data arrive. We consider the problem of constructing a data structure on a set of continuous band-join queries over two data sets R and S, where each band-join query asks for reporting the set {(r, s) ∈ R×S | a ≤ r − s ≤ b} for some parameters a and b, so that given a data update in R or S, one can quickly identify the subset of continuous queries whose results are affected by the update, and compute changes to these results. We present the first nontrivial data structure for this problem that simultaneously achieves subquadratic space and sublinear query time. This is achieved by first decomposing the original problem into two independent subproblems, and then carefully designing data structures suitable for each case, by exploiting the particular structure in each subproblem. A key step in the above construction is a data structure whose performance increases with the degree of clusteredness of the band-joins being indexed. We believe that this structure is of independent interest and should have broad impact in practice. We present the details in [1].
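The band-join definition and the update problem can be stated as a brute-force baseline. This is only a sketch of the problem, not the paper's data structure: it scans everything on each update, which is exactly the cost the paper aims to beat. Function names are hypothetical, and R and S are assumed to hold plain numbers.

```python
def band_join(R, S, a, b):
    """Return all pairs (r, s) in R x S with a <= r - s <= b."""
    return [(r, s) for r in R for s in S if a <= r - s <= b]


def affected_queries(queries, r_new, S):
    """Indices of band-join queries (a, b) whose result gains at least one
    pair when r_new is inserted into R, found by scanning every query and S."""
    return [
        i
        for i, (a, b) in enumerate(queries)
        if any(a <= r_new - s <= b for s in S)
    ]
```

For example, with S = [2, 4] and a new value 10 in R, the differences are 8 and 6, so a query with band [5, 6] is affected while one with band [0, 3] is not.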
Small and Stable Descriptors of Distributions for Geometric Statistical Problems
, 2009
Cited by 2 (2 self)
This thesis explores how to sparsely represent distributions of points for geometric statistical problems. A coreset C is a small summary of a point set P such that if a certain statistic is computed on P and C, then the difference in the results is guaranteed to be bounded by a parameter ε. Two examples of coresets are ε-samples and ε-kernels. An ε-sample can estimate the density of a point set in any range from a geometric family of ranges (e.g., disks, axis-aligned rectangles). An ε-kernel approximates the width of a point set in all directions. Both coresets have size that depends only on ε, the error parameter, not the size of the original data set. We demonstrate several improvements to these coresets and how they are useful for geometric statistical problems. We reduce the size of ε-samples for density queries in axis-aligned rectangles to nearly a square root of the size when the queries are with respect to more general families of shapes, such as disks. We also show how to construct ε-samples of probability distributions. We show how to maintain “stable” ε-kernels, that is, if the point set P changes by