Results 1-10 of 47
Finding Frequent Items in Data Streams
PVLDB, 2008
Cited by 39 (5 self)
Abstract
The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms, and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
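The counter-based family this survey covers can be illustrated with a minimal Python sketch of the classic FREQUENT (Misra-Gries) algorithm; the function name and interface here are illustrative assumptions, not taken from the paper:

```python
def misra_gries(stream, k):
    """Candidate items occurring more than a 1/k fraction of the time,
    tracked with at most k - 1 counters (a sketch of FREQUENT)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and evict zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Every item occurring more than n/k times in a stream of length n is guaranteed to survive in `counters`, though each reported count may undercount the true frequency by up to n/k.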
Private and continual release of statistics
In Automata, Languages and Programming, volume 6199 of LNCS, 2010
Cited by 21 (3 self)
Abstract
We ask the question: how can Web sites and data aggregators continually release updated statistics, and meanwhile preserve each individual user’s privacy? Suppose we are given a stream of 0’s and 1’s. We propose a differentially private continual counter that outputs at every time step the approximate number of 1’s seen thus far. Our counter construction has error that is only polylog in the number of time steps. We can extend the basic counter construction to allow Web sites to continually give top-k and hot items suggestions while preserving users’ privacy.
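As a hedged illustration of the idea, here is a toy Python sketch of a binary (dyadic-interval) counter in the spirit the abstract describes: each dyadic block of the stream receives Laplace noise once, and every prefix is answered from the O(log T) noisy blocks that cover it. Names, structure, and the exact noise calibration are illustrative assumptions, not the paper's construction verbatim:

```python
import math
import random

def laplace_noise(scale):
    # The difference of two unit exponentials is Laplace(0, 1)-distributed.
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def binary_mechanism(bits, epsilon):
    """Noisy running counts of 1's: partition time into dyadic blocks,
    noise each block's sum once, and answer each prefix from the
    O(log T) finalized blocks that cover it."""
    T = len(bits)
    levels = T.bit_length()          # block sizes 1, 2, ..., 2^(levels-1)
    noisy = {}                       # (level, block index) -> noisy block sum
    outputs = []
    for t in range(1, T + 1):
        # Finalize every dyadic block that ends exactly at time t.
        for level in range(levels):
            size = 1 << level
            if t % size == 0:
                b = t // size - 1
                block_sum = sum(bits[b * size:(b + 1) * size])
                # Each bit touches one block per level, so sensitivity is
                # `levels`; calibrate the noise accordingly.
                noisy[(level, b)] = block_sum + laplace_noise(levels / epsilon)
        # Greedily cover the prefix [0, t) by already-finalized blocks.
        estimate, start = 0.0, 0
        for level in range(levels - 1, -1, -1):
            size = 1 << level
            if t - start >= size:
                estimate += noisy[(level, start // size)]
                start += size
        outputs.append(estimate)
    return outputs
```

Each prefix estimate sums at most log T independent Laplace noises, which is the source of the polylog error bound the abstract states.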
Duplicate detection in click streams
In WWW ’05: Proceedings of the 14th International Conference on World Wide Web, 2005
Cited by 15 (1 self)
Abstract
We consider the problem of finding duplicates in data streams. Duplicate detection in data streams is utilized in various applications including fraud detection. We develop a solution based on Bloom Filters [9], and discuss the space and time requirements for running the proposed algorithm in both the contexts of sliding and landmark stream windows. We run a comprehensive set of experiments, using both real and synthetic click streams, to evaluate the performance of the proposed solution. The results demonstrate that the proposed solution yields extremely low error rates.
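A minimal Python sketch of the Bloom-filter core of this approach (class and method names are illustrative; the paper's sliding- and landmark-window variants add eviction policies on top of this):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for duplicate detection in a click stream."""
    def __init__(self, num_bits, num_hashes):
        self.bits = bytearray((num_bits + 7) // 8)
        self.m = num_bits
        self.k = num_hashes

    def _positions(self, item):
        # Derive k bit positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def seen_then_add(self, item):
        # Report whether item was (probably) seen before, then record it.
        dup = all(self.bits[p // 8] >> (p % 8) & 1
                  for p in self._positions(item))
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
        return dup
```

A repeated element is always flagged as a duplicate; a fresh element is misflagged only with the filter's small false-positive probability, which is the error-rate trade-off the experiments measure.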
Time-decaying aggregates in out-of-order streams
2007
Cited by 11 (4 self)
Abstract
Processing large data streams is now a major topic in data management. The data involved can be truly massive, and the required analyses complex. In a stream of sequential events such as stock feeds, sensor readings, or IP traffic measurements, data tuples pertaining to recent events are typically more important than older ones. This can be formalized via time-decay functions, which assign weights to data based on the age of the data. Decay functions such as sliding windows and exponential decay have been studied under the assumption of well-ordered arrivals, i.e., data arrives in non-decreasing order of time stamps. However, data quality issues are prevalent in massive streams (due to network asynchrony and delays etc.), and correct arrival order is not guaranteed. We focus on the computation of decayed aggregates such as range queries, quantiles, and heavy hitters on out-of-order streams, where elements do not necessarily arrive in increasing order of timestamps. Existing techniques such as Exponential Histograms and Waves are unable to handle out-of-order streams. We give the first deterministic algorithms for approximating these aggregates under popular decay functions such as sliding window and polynomial decay. We study the overhead of allowing out-of-order arrivals when compared to well-ordered arrivals, both analytically and experimentally. Our experiments confirm that these algorithms can be applied in practice, and compare the relative performance of different approaches for handling out-of-order arrivals.
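For the easy case the abstract contrasts with, exponential decay is naturally insensitive to arrival order, because an element's weight depends only on its timestamp; a small Python sketch (names are illustrative, and the paper's hard cases, sliding-window and polynomial decay, need genuinely different machinery):

```python
def decayed_sum(events, query_time, half_life):
    """Exponentially time-decayed sum: a value with timestamp ts carries
    weight 2^-((query_time - ts) / half_life). Maintain
    S = sum(v * 2^(ts / half_life)) and rescale once at query time,
    so out-of-order arrivals need no special handling."""
    s = 0.0
    for ts, value in events:      # arrival order is irrelevant
        # For very long streams, periodically renormalize s to avoid overflow.
        s += value * 2.0 ** (ts / half_life)
    return s * 2.0 ** (-query_time / half_life)
```

With a half-life of 10, a value timestamped 10 units before the query time contributes exactly half its weight, whether it arrived early or late.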
Space-optimal Heavy Hitters with Strong Error Bounds
2009
Cited by 10 (3 self)
Abstract
The problem of finding heavy hitters and approximating the frequencies of items is at the heart of many problems in data stream analysis. It has been observed that several proposed solutions to this problem can outperform their worst-case guarantees on real data. This leads to the question of whether some stronger bounds can be guaranteed. We answer this in the positive by showing that a class of “counter-based algorithms” (including the popular and very space-efficient FREQUENT and SPACESAVING algorithms) provide much stronger approximation guarantees than previously known. Specifically, we show that errors in the approximation of individual elements do not depend on the frequencies of the most frequent elements, but only on the frequency of the remaining “tail.” This shows that counter-based methods are the most space-efficient (in fact, space-optimal) algorithms having this strong error bound. This tail guarantee allows these algorithms to solve the “sparse recovery” problem. Here, the goal is to recover a faithful representation of the vector of frequencies, f. We prove that using space O(k), the algorithms construct an approximation f∗ to the frequency vector f so that the L1 error ‖f − f∗‖1 is close to the best possible error min f′ ‖f′ − f‖1, where f′ ranges over all vectors with at most k nonzero entries. This improves the previously best known space bound of about O(k log n) for streams without element deletions (where n is the size of the domain from which stream elements are drawn). Other consequences of the tail guarantees are results for skewed (Zipfian) data, and guarantees for accuracy of merging multiple summarized streams.
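A compact Python sketch of SPACESAVING, one of the counter-based algorithms the tail guarantee applies to (the linear-scan eviction here is illustrative; practical implementations use a min-heap or the Stream-Summary structure for constant-time updates):

```python
def space_saving(stream, k):
    """SPACESAVING with k counters: on overflow, the minimum counter is
    reassigned to the new item and incremented, so each estimate
    overestimates the true count by at most the evicted minimum."""
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
        else:
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return counts
```

Unlike FREQUENT, SPACESAVING only ever overestimates, and an item that dominates the stream is never evicted, so its count stays exact.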
Fast Data Stream Algorithms using Associative Memories
SIGMOD, 2007
Cited by 10 (1 self)
Abstract
The primary goal of data stream research is to develop space- and time-efficient solutions for answering continuous online summarization queries. Research efforts over the last decade have resulted in a number of efficient algorithms with varying degrees of space and time complexities. While these techniques are developed in a standard CPU setting, many of their applications, such as click-fraud detection and network-traffic summarization, typically execute on special networking architectures called Network Processing Units (NPUs). These NPUs interface with a special type of associative memory, known as Ternary Content Addressable Memories (TCAMs), to provide gigabit-rate forwarding at network routers. In this paper, we describe how the integrated architecture of NPUs and TCAMs can be exploited towards achieving the goal of developing high-speed stream summarization solutions. We propose two TCAM-conscious solutions for the frequent elements problem in data streams and present a comprehensive evaluation of these techniques on a state-of-the-art networking platform.
Using Association Rules for Fraud Detection in Web Advertising Networks
In Proceedings of the 31st VLDB International Conference on Very Large Data Bases, 2005
Cited by 10 (5 self)
Abstract
Discovering associations between elements occurring in a stream is applicable in numerous applications, including predictive caching and fraud detection. These applications require a new model of association between pairs of elements in streams. We develop an algorithm, StreamingRules, to report association rules with tight guarantees on errors, using limited processing per element, and minimal space.
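To make the association model concrete, here is an exact but memory-unbounded Python sketch counting ordered pairs (x, y) where y arrives within a fixed span after x; StreamingRules' contribution is approximating such counts with limited space and per-element work, and every name here is an illustrative assumption:

```python
from collections import Counter, deque

def pair_associations(stream, span):
    """Count ordered pairs (x, y) where y occurs within `span` elements
    after x -- a naive exact baseline for streaming association rules."""
    recent = deque(maxlen=span)   # the last `span` elements seen
    pairs = Counter()
    for y in stream:
        for x in recent:
            pairs[(x, y)] += 1
        recent.append(y)
    return pairs
```

Rules like "x implies y soon after" are then read off as pairs whose count is large relative to the count of the antecedent x.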
Estimating the Confidence of Conditional Functional Dependencies
Cited by 8 (2 self)
Abstract
Conditional functional dependencies (CFDs) have recently been proposed as extensions of classical functional dependencies that apply to a certain subset of the relation, as specified by a pattern tableau. Calculating the support and confidence of a CFD (i.e., the size of the applicable subset and the extent to which it satisfies the CFD) gives valuable information about data semantics and data quality. While computing the support is easier, computing the confidence exactly is expensive if the relation is large, and estimating it from a random sample of the relation is unreliable unless the sample is large. We study how to efficiently estimate the confidence of a CFD with a small number of passes (one or two) over the input using small space. Our solutions are based on a variety of sampling and sketching techniques, and apply when the pattern tableau is known in advance, and also in the harder case when this is given after the data have been seen. We analyze our algorithms, and show that they can guarantee a small additive error; we also show that relative error guarantees are not possible. We demonstrate the power of these methods empirically, with a detailed study using both real and synthetic data. These experiments show that it is possible to estimate the CFD confidence very accurately with summaries which are much smaller than the size of the data they represent.
SLEUTH: Single-pubLisher attack dEtection Using correlaTion Hunting
Cited by 8 (5 self)
Abstract
Several data management challenges arise in the context of Internet advertising networks, where Internet advertisers pay Internet publishers to display advertisements on their Web sites and drive traffic to the advertisers from surfers’ clicks. Although advertisers can target appropriate market segments, the model allows dishonest publishers to defraud the advertisers by simulating fake traffic to their own sites to claim more revenue. This paper addresses the case of publishers launching fraud attacks from numerous machines, which is the most widespread scenario. The difficulty of uncovering these attacks is proportional to the number of machines and resources exploited by the fraudsters. In general, detecting this class of fraud entails solving a new data mining problem, which is finding correlations in multidimensional data. Since the dimensions have large cardinalities, the search space is huge, which has long allowed dishonest publishers to inflate their traffic, and deplete the advertisers’ advertising budgets. We devise the approximate SLEUTH algorithms to solve the problem efficiently, and uncover single-publisher frauds. We demonstrate the effectiveness of SLEUTH both analytically and by reporting some of its results on the Fastclick network, where numerous fraudsters were discovered.
Sliding Window Query Processing over Data Streams
2006
Cited by 8 (0 self)
Abstract
Database management systems (DBMSs) have been used successfully in traditional business applications that require persistent data storage and an efficient querying mechanism. Typically, it is assumed that the data are static, unless explicitly modified or deleted by a user or application. Database queries are executed when issued and their answers reflect the current state of the data. However, emerging applications, such as sensor networks, real-time Internet traffic analysis, and online financial trading, require support for processing of unbounded data streams. The fundamental assumption of a data stream management system (DSMS) is that new data are generated continually, making it infeasible to store a stream in its entirety. At best, a sliding window of recently arrived data may be maintained, meaning that old data must be removed as time goes on. Furthermore, as the contents of the sliding windows evolve over time, it makes
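The sliding-window model the thesis studies can be illustrated with a minimal Python sketch of a tuple-based (count-based) window maintaining a running sum; the function name and interface are illustrative:

```python
from collections import deque

def sliding_window_sums(stream, window):
    """Running sum over the last `window` items: each arrival is added,
    and the oldest item is evicted once the window is full, matching the
    DSMS requirement that old data expire as time goes on."""
    buf = deque()
    total = 0
    out = []
    for value in stream:
        buf.append(value)
        total += value
        if len(buf) > window:
            total -= buf.popleft()   # expire the oldest tuple
        out.append(total)
    return out
```

Time-based windows work the same way, except eviction is driven by timestamps falling outside the window rather than by a fixed tuple count.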