Results 1 - 3 of 3
Flexible approximate counting
In 15th International Database Engineering & Applications Symposium (IDEAS 2011), 2011
Abstract

Cited by 2 (0 self)
Approximate counting [18] is useful for data stream and database summarization. It can help in many settings that allow only one pass over the data, want low memory usage, and can accept some relative error. Approximate counters use fewer bits; we focus on 8 bits, but our results are general. These small counters represent a sparse sequence of larger numbers. Counters are incremented probabilistically based on the spacing between the numbers they represent. Our contributions are a customized distribution of counter values and efficient strategies for deciding when to increment them. At runtime, users may independently select the spacing (accuracy) of the approximate counter for small, medium, and large values. We allow the user to select the maximum number to count up to, and our algorithm will select the exponential base of the spacing. These provide additional flexibility over both classic and Csűrös’s [4] floating-point approximate counting. These provide additional structure, a useful schema for users, over Kruskal and Greenberg [13]. We describe two new and efficient strategies for incrementing approximate counters: use a deterministic countdown or sample from a geometric distribution. In Csűrös all increments are powers of two, so random bits rather than full random numbers can be used. We also provide the option to use powers of two but retain flexibility. We show when each strategy is fastest in our implementation.
Categories and Subject Descriptors: E.4 [Coding and information theory]: Data compaction
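To make the idea of probabilistic increments concrete, here is a minimal sketch of a classic Morris-style counter in Python. It is not the paper's algorithm: it uses a single exponential base for all values and a plain random draw per event, and does not reproduce the paper's customized value distribution, deterministic countdown, or geometric sampling. The names `ApproxCounter`, `solve_base`, and `represented` are hypothetical, and the bisection used to pick the base from a user-supplied maximum is an assumption about one simple way to do it.

```python
import random


def represented(base, c):
    """Value represented by stored counter c: 1 + base + ... + base**(c-1)."""
    return (base ** c - 1.0) / (base - 1.0)


def solve_base(max_count, max_c):
    """Bisect for a base in (1, 2] so that the top counter value represents max_count.

    Assumption: a single exponential base; small counters (e.g. 8 bits) only.
    """
    if max_count <= max_c:
        raise ValueError("max_count fits in the counter exactly")
    lo, hi = 1.0 + 1e-9, 2.0
    if represented(hi, max_c) < max_count:
        raise ValueError("max_count too large for base <= 2")
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if represented(mid, max_c) < max_count:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


class ApproxCounter:
    """Morris-style approximate counter (illustrative sketch only)."""

    def __init__(self, max_count, bits=8, rng=None):
        self.max_c = (1 << bits) - 1
        self.base = solve_base(max_count, self.max_c)
        self.c = 0
        self.rng = rng or random.Random()

    def estimate(self, c=None):
        return represented(self.base, self.c if c is None else c)

    def increment(self):
        if self.c >= self.max_c:
            return  # saturated
        # The gap between represented values c and c+1 is base**c,
        # so advancing with probability base**-c keeps the estimate unbiased.
        if self.rng.random() < self.base ** -self.c:
            self.c += 1
```

With this representation, each call to `increment` raises the expected estimate by exactly one until the counter saturates, which is why a larger base (coarser spacing) trades accuracy for range.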
Streaming Pointwise Mutual Information
Abstract

Cited by 2 (0 self)
Recent work has led to the ability to perform space-efficient, approximate counting over large vocabularies in a streaming context. Motivated by the existence of data structures of this type, we explore the computation of associativity scores, otherwise known as pointwise mutual information (PMI), in a streaming context. We give theoretical bounds showing the impracticality of perfect online PMI computation, and detail an algorithm with high expected accuracy. Experiments on news articles show our approach gives high accuracy on real-world data.
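For reference, PMI for a pair (x, y) is log p(x, y) / (p(x) p(y)). The sketch below computes it from exact counts held in memory, so it illustrates only the score itself, not the paper's streaming algorithm; in a streaming setting the exact `Counter` objects would be replaced by approximate counting structures. The function name `pmi_from_pairs` and the toy data are hypothetical.

```python
import math
from collections import Counter


def pmi_from_pairs(pairs):
    """Return {(x, y): PMI in bits} from an iterable of observed (x, y) pairs."""
    pair_counts = Counter()
    x_counts = Counter()
    y_counts = Counter()
    total = 0
    for x, y in pairs:
        pair_counts[(x, y)] += 1
        x_counts[x] += 1
        y_counts[y] += 1
        total += 1
    # p(x, y) / (p(x) p(y)) simplifies to c_xy * N / (c_x * c_y).
    return {
        (x, y): math.log2(c_xy * total / (x_counts[x] * y_counts[y]))
        for (x, y), c_xy in pair_counts.items()
    }


pairs = [("new", "york")] * 3 + [("new", "car")] + [("old", "car")] * 2
scores = pmi_from_pairs(pairs)
```

A positive score means the pair co-occurs more often than independence would predict; a negative score means less often.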
Data Cube Materialization and Mining over MapReduce
Abstract

Cited by 1 (0 self)
Computing interesting measures for data cubes and subsequent mining of interesting cube groups over massive datasets are critical for many important analyses done in the real world. Previous studies have focused on algebraic measures such as SUM that are amenable to parallel computation and can easily benefit from the recent advancement of parallel computing infrastructure such as MapReduce. Dealing with holistic measures such as TOP-K, however, is nontrivial. In this paper we detail real-world challenges in cube materialization and mining tasks on Web-scale datasets. Specifically, we identify an important subset of holistic measures and introduce MR-Cube, a MapReduce-based framework for efficient cube computation and identification of interesting cube groups on holistic measures. We provide extensive experimental analyses over both real and synthetic data. We demonstrate that, unlike existing techniques which cannot scale to the 100-million-tuple mark for our datasets, MR-Cube successfully and efficiently computes cubes with holistic measures over billion-tuple datasets.
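The distinction between algebraic and holistic measures can be illustrated with a small Python sketch. It uses the median as the holistic measure because it is a compact textbook case; the abstract itself names TOP-K, and nothing here reflects how the framework in the paper actually works. The partition data and function names are hypothetical.

```python
from statistics import median

# Two partitions of one cube group, as separate workers might see them.
partitions = [[1, 2, 9], [3, 4, 5]]
all_rows = [v for part in partitions for v in part]


def combined_sum(parts):
    """SUM is algebraic: per-partition sums combine into the exact total."""
    return sum(sum(p) for p in parts)


def median_of_medians(parts):
    """Naively combining per-partition medians does NOT give the true median."""
    return median([median(p) for p in parts])


exact_sum = sum(all_rows)          # 24
merged_sum = combined_sum(partitions)   # 24, matches
exact_median = median(all_rows)    # 3.5
naive_median = median_of_medians(partitions)  # 3, differs
```

Because no fixed-size partial result per partition suffices for a measure like the median, a parallel framework has to arrange for the relevant rows of a group to be processed together, which is the kind of difficulty the abstract refers to.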