Results 1  10
of
63
Finding Frequent Items in Data Streams
 PVLDB
, 2008
"... The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, ..."
Abstract

Cited by 52 (7 self)
 Add to MetaCart
(Show Context)
The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms, and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
Private and continual release of statistics
 In Automata, Languages and Programming, volume 6199 of LNCS
, 2010
"... We ask the question: how can Web sites and data aggregators continually release updated statistics, and meanwhile preserve each individual user’s privacy? Suppose we are given a stream of 0’s and 1’s. We propose a differentially private continual counter that outputs at every time step the approxima ..."
Abstract

Cited by 46 (3 self)
 Add to MetaCart
We ask the question: how can Web sites and data aggregators continually release updated statistics, and meanwhile preserve each individual user’s privacy? Suppose we are given a stream of 0’s and 1’s. We propose a differentially private continual counter that outputs at every time step the approximate number of 1’s seen thus far. Our counter construction has error that is only polylog in the number of time steps. We can extend the basic counter construction to allow Web sites to continually give topk and hot items suggestions while preserving users ’ privacy.
Abbadi, “An integrated efficient solution for computing frequent and topk elements in data streams
 ACM Trans. Database Syst
, 2006
"... We propose an approximate integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and topk elements with tight guarantees on errors. Fo ..."
Abstract

Cited by 38 (6 self)
 Add to MetaCart
We propose an approximate integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and topk elements with tight guarantees on errors. For general data distributions, our topk algorithm returns k elements that have roughly the highest frequencies; and it uses limited space for calculating frequent elements. For realistic Zipfian data, the space requirement of the proposed algorithm for solving the exact frequent elements problem decreases dramatically with the parameter of the distribution; and for topk queries, the analysis ensures that only the topk elements, in the correct order, are reported. The experiments, using real and synthetic data sets, show space reductions with hardly any loss in accuracy. Having proved the effectiveness of the proposed approach through both analysis and experiments, we extend it to be able to answer continuous queries about frequent and topk elements. Although the problems of incremental reporting of frequent and topk elements are useful in many applications, to the best of our knowledge, no solution has been proposed.
Duplicate detection in click streams
 In WWW ’05: Proceedings of the 14th international conference on World Wide Web
, 2005
"... We consider the problem of finding duplicates in data streams. Duplicate detection in data streams is utilized in various applications including fraud detection. We develop a solution based on Bloom Filters [9], and discuss the space and time requirements for running the proposed algorithm in both t ..."
Abstract

Cited by 21 (2 self)
 Add to MetaCart
(Show Context)
We consider the problem of finding duplicates in data streams. Duplicate detection in data streams is utilized in various applications including fraud detection. We develop a solution based on Bloom Filters [9], and discuss the space and time requirements for running the proposed algorithm in both the contexts of sliding, and landmark stream windows. We run a comprehensive set of experiments, using both real and synthetic click streams, to evaluate the performance of the proposed solution. The results demonstrate that the proposed solution yields extremely low error rates. 1
Spaceoptimal Heavy Hitters with Strong Error Bounds
, 2009
"... The problem of finding heavy hitters and approximating the frequencies of items is at the heart of many problems in data stream analysis. It has been observed that several proposed solutions to this problem can outperform their worstcase guarantees on real data. This leads to the question of whethe ..."
Abstract

Cited by 19 (4 self)
 Add to MetaCart
The problem of finding heavy hitters and approximating the frequencies of items is at the heart of many problems in data stream analysis. It has been observed that several proposed solutions to this problem can outperform their worstcase guarantees on real data. This leads to the question of whether some stronger bounds can be guaranteed. We answer this in the positive by showing that a class of “counterbased algorithms” (including the popular and very spaceefficient FREQUENT and SPACESAVING algorithms) provide much stronger approximation guarantees than previously known. Specifically, we show that errors in the approximation of individual elements do not depend on the frequencies of the most frequent elements, but only on the frequency of the remaining “tail.” This shows that counterbased methods are the most spaceefficient (in fact, spaceoptimal) algorithms having this strong error bound. This tail guarantee allows these algorithms to solve the “sparse recovery ” problem. Here, the goal is to recover a faithful representation of the vector of frequencies, f. We prove that using space O(k), the algorithms construct an approximation f ∗ to the frequency vector f so that the L1 error ‖f − f ∗ ‖1 is close to the best possible error minf ′ ‖f ′ − f‖1, where f ′ ranges over all vectors with at most k nonzero entries. This improves the previously best known space bound of about O(k log n) for streams without element deletions (where n is the size of the domain from which stream elements are drawn). Other consequences of the tail guarantees are results for skewed (Zipfian) data, and guarantees for accuracy of merging multiple summarized streams.
Forward Decay: A Practical Time Decay Model for Streaming Systems
"... Abstract — Temporal data analysis in data warehouses and data streaming systems often uses time decay to reduce the importance of older tuples, without eliminating their influence, on the results of the analysis. While exponential time decay is commonly used in practice, other decay functions (e.g. ..."
Abstract

Cited by 14 (0 self)
 Add to MetaCart
(Show Context)
Abstract — Temporal data analysis in data warehouses and data streaming systems often uses time decay to reduce the importance of older tuples, without eliminating their influence, on the results of the analysis. While exponential time decay is commonly used in practice, other decay functions (e.g. polynomial decay) are not, even though they have been identified as useful. We argue that this is because the usual definitions of time decay are “backwards”: the decayed weight of a tuple is based on its age, measured backward from the current time. Since this age is constantly changing, such decay is too complex and unwieldy for scalable implementation. In this paper, we propose a new class of “forward ” decay functions based on measuring forward from a fixed point in time. We show that this model captures the more practical models already known, such as exponential decay and landmark windows, but also includes a wide class of other types of time decay. We provide efficient algorithms to compute a variety of aggregates and draw samples under forward decay, and show that these are easy to implement scalably. Further, we provide empirical evidence that these can be executed in a production data stream management system with little or no overhead compared to the undecayed computations. Our implementation required no extensions to the query language or the DSMS, demonstrating that forward decay represents a practical model of time decay for systems that deal with timebased data. I.
Time decaying aggregates in outoforder streams
, 2007
"... Processing large data streams is now a major topic in data management. The data involved can be truly massive, and the required analyses complex. In a stream of sequential events such as stock feeds, sensor readings, or IP traffic measurements, data tuples pertaining to recent events are typically m ..."
Abstract

Cited by 13 (4 self)
 Add to MetaCart
(Show Context)
Processing large data streams is now a major topic in data management. The data involved can be truly massive, and the required analyses complex. In a stream of sequential events such as stock feeds, sensor readings, or IP traffic measurements, data tuples pertaining to recent events are typically more important than older ones. This can be formalized via timedecay functions, which assign weights to data based on the age of data. Decay functions such as sliding windows and exponential decay have been studied under the assumption of wellordered arrivals, i.e., data arrives in nondecreasing order of time stamps. However, data quality issues are prevalent in massive streams (due to network asynchrony and delays etc.), and correct arrival order is not guaranteed. We focus on the computation of decayed aggregates such as range queries, quantiles, and heavy hitters on outoforder streams, where elements do not necessarily arrive in increasing order of timestamps. Existing techniques such as Exponential Histograms and Waves are unable to handle outoforder streams. We give the first deterministic algorithms for approximating these aggregates under popular decay functions such as sliding window and polynomial decay. We study the overhead of allowing outoforder arrivals when compared to wellordered arrivals, both analytically and experimentally. Our experiments confirm that these algorithms can be applied in practice, and compare the relative performance of different approaches for handling outoforder arrivals.
SLEUTH: SinglepubLisher attack dEtection Using correlaTion Hunting ∗
"... Several data management challenges arise in the context of Internet advertising networks, where Internet advertisers pay Internet publishers to display advertisements on their Web sites and drive traffic to the advertisers from surfers ’ clicks. Although advertisers can target appropriate market seg ..."
Abstract

Cited by 13 (6 self)
 Add to MetaCart
Several data management challenges arise in the context of Internet advertising networks, where Internet advertisers pay Internet publishers to display advertisements on their Web sites and drive traffic to the advertisers from surfers ’ clicks. Although advertisers can target appropriate market segments, the model allows dishonest publishers to defraud the advertisers by simulating fake traffic to their own sites to claim more revenue. This paper addresses the case of publishers launching fraud attacks from numerous machines, which is the most widespread scenario. The difficulty of uncovering these attacks is proportional to the number of machines and resources exploited by the fraudsters. In general, detecting this class of fraud entails solving a new data mining problem, which is finding correlations in multidimensional data. Since the dimensions have large cardinalities, the search space is huge, which has long allowed dishonest publishers to inflate their traffic, and deplete the advertisers ’ advertising budgets. We devise the approximate SLEUTH algorithms to solve the problem efficiently, and uncover singlepublisher frauds. We demonstrate the effectiveness of SLEUTH both analytically and by reporting some of its results on the Fastclick network, where numerous fraudsters were discovered. 1.
Abbadi. Using Association Rules for Fraud Detection in Web Advertising Networks
, 2005
"... Discovering associations between elements occurring in a stream is applicable in numerous applications, including predictive caching and fraud detection. These applications require a new model of association between pairs of elements in streams. We develop an algorithm, StreamingRules, to report ..."
Abstract

Cited by 12 (6 self)
 Add to MetaCart
(Show Context)
Discovering associations between elements occurring in a stream is applicable in numerous applications, including predictive caching and fraud detection. These applications require a new model of association between pairs of elements in streams. We develop an algorithm, StreamingRules, to report association rules with tight guarantees on errors, using limited processing per element, and minimal space. The modular design of StreamingRules allows for integration with current stream management systems, since it employs existing techniques for finding frequent elements. The presentation emphasizes the applicability of the algorithm to fraud detection in advertising networks. Such fraud instances have not been successfully detected by current techniques. Our experiments on synthetic data demonstrate scalability and efficiency. On real data, potential fraud was discovered. 1
Finding the Frequent Items in Streams of Data
"... doi:10.1145/1562764.1562789 The frequent items problem is to process a stream of items and find all those which occur more than a given fraction of the time. It is one of the most heavily studied problems in mining data streams, dating back to the 1980s. Many other applications rely directly or indi ..."
Abstract

Cited by 12 (1 self)
 Add to MetaCart
doi:10.1145/1562764.1562789 The frequent items problem is to process a stream of items and find all those which occur more than a given fraction of the time. It is one of the most heavily studied problems in mining data streams, dating back to the 1980s. Many other applications rely directly or indirectly on finding the frequent items, and implementations are in use in largescale industrial systems. In this paper, we describe the most important algorithms for this problem in a common framework. We place the different solutions in their historical context, and describe the connections between them, with the aim of clarifying some of the confusion that has surrounded their properties. To further illustrate the different properties of the algorithms, we provide baseline implementations. This allows us to give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware. 1.