Results 1 - 5 of 5
A sketch algorithm for estimating two-way and multi-way associations
Computational Linguistics, 2007
Abstract

Cited by 27 (13 self)
We should not have to look at the entire corpus (e.g., the Web) to know if two (or more) words are strongly associated or not. One can often obtain estimates of associations from a small sample. We develop a sketch-based algorithm that constructs a contingency table for a sample. One can estimate the contingency table for the entire population using straightforward scaling. However, one can do better by taking advantage of the margins (also known as document frequencies). The proposed method cuts the errors roughly in half over Broder’s sketches.
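The "straightforward scaling" step mentioned in the abstract can be illustrated with a minimal sketch. This is an assumption-laden simplification: it draws a plain uniform sample of documents rather than using the paper's sketch construction, and it ignores the margins. The function name and parameters are hypothetical.

```python
import random

def sample_contingency_table(docs_x, docs_y, num_docs, sample_size, seed=0):
    """Estimate the 2x2 co-occurrence table of two words from a document sample.

    docs_x, docs_y: sets of document IDs (0..num_docs-1) containing each word.
    Returns scaled estimates (a, b, c, d):
      a = both words, b = x only, c = y only, d = neither.
    Assumption: uniform sampling without replacement, not the paper's sketches.
    """
    rng = random.Random(seed)
    sample = rng.sample(range(num_docs), sample_size)
    a = b = c = d = 0
    for doc in sample:
        in_x, in_y = doc in docs_x, doc in docs_y
        if in_x and in_y:
            a += 1
        elif in_x:
            b += 1
        elif in_y:
            c += 1
        else:
            d += 1
    # Scale sample counts up to the full collection.
    scale = num_docs / sample_size
    return tuple(scale * v for v in (a, b, c, d))
```

When the sample covers the whole collection the estimate is exact; with smaller samples the four cells still sum to `num_docs`, but each cell carries sampling error that margin information could help reduce.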
Conditional random sampling: A sketch-based sampling technique for sparse data
In NIPS, 2006
Abstract

Cited by 23 (14 self)
We develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.
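For context, the random-projection baseline that the abstract compares against can be sketched as follows. This is not CRS itself; it is a textbook Gaussian random projection for estimating a squared l2 distance, and the function names, the seed handling, and the choice of `k` are assumptions made for illustration.

```python
import random

def random_projection(vec, k, seed=0):
    """Project vec onto k dimensions using i.i.d. N(0, 1) entries.

    The same seed (and the same vector length) must be used for every vector
    so that all vectors share one projection matrix.
    """
    rng = random.Random(seed)
    proj = [0.0] * k
    for x in vec:
        for j in range(k):
            # Always draw, even when x == 0, so the matrix stays aligned.
            proj[j] += x * rng.gauss(0.0, 1.0)
    return proj

def estimate_sq_l2(proj_u, proj_v):
    """Unbiased estimate of ||u - v||^2 from two projected vectors."""
    k = len(proj_u)
    return sum((a - b) ** 2 for a, b in zip(proj_u, proj_v)) / k
```

The estimate concentrates around the true squared distance as `k` grows, with relative standard deviation of roughly `sqrt(2 / k)`.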
On estimating frequency moments of data streams
In International Workshop on Randomization and Approximation Techniques in Computer Science, 2007
Abstract

Cited by 19 (0 self)
Abstract. Space-economical estimation of the $p$th frequency moments, defined as $F_p = \sum_{i=1}^{n} |f_i|^p$ for $p > 0$, is of interest in estimating all-pairs distances in a large data matrix [14], in machine learning, and in data stream computation. Random sketches formed by the inner product of the frequency vector $f_1, \ldots, f_n$ with a suitably chosen random vector were pioneered by Alon, Matias and Szegedy [1], and have since played a central role in estimating $F_p$ and in data stream computations in general. The concept of $p$-stable sketches, formed by the inner product of the frequency vector with a random vector whose components are drawn from a $p$-stable distribution, was proposed by Indyk [11] for estimating $F_p$, for $0 < p < 2$, and has been further studied in Li [13]. In this paper, we consider the problem of estimating $F_p$, for $0 < p < 2$. A disadvantage of the stable sketches technique and its variants is that they require $O(1/\epsilon^2)$ inner products of the frequency vector with dense vectors of stable (or nearly stable [14, 13]) random variables to be maintained. This means that each stream update can be quite time-consuming. We present algorithms for estimating $F_p$, for $0 < p < 2$, that do not require the use of stable sketches or their approximations. Our technique is elementary in nature, in that it uses simple randomization in conjunction with well-known summary structures for data streams, such as the COUNT-MIN sketch [7] and the COUNTSKETCH structure [5]. Our algorithms require space $\tilde{O}(1/\epsilon^{2+p})$ to estimate $F_p$ to within $1 \pm \epsilon$ factors and require expected time $O(\log F_1 \log \frac{1}{\delta})$ to process each update. Thus, our technique trades an $O(1/\epsilon^p)$ factor in space for much more efficient processing of stream updates. We also present a stand-alone iterative estimator for $F_1$.
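To make the quantity being estimated concrete, the following minimal sketch computes the frequency moment exactly from a stream of updates. It is only a reference baseline that stores every frequency, which is exactly what the paper's small-space algorithms avoid. The function name and the `(item, delta)` update format are assumptions, as is the use of absolute values so that negative net frequencies are handled.

```python
from collections import defaultdict

def frequency_moment(updates, p):
    """Exact F_p = sum_i |f_i|^p for a stream of (item, delta) updates.

    Assumption: each update adds delta (possibly negative) to the item's
    frequency. Uses space linear in the number of distinct items.
    """
    freq = defaultdict(int)
    for item, delta in updates:
        freq[item] += delta
    # Items whose net frequency is zero contribute nothing for p > 0.
    return sum(abs(f) ** p for f in freq.values() if f != 0)
```

For example, the updates `[("a", 3), ("b", 1), ("a", -1), ("c", 2)]` leave frequencies of 2, 1, and 2, so the first moment is 5 and the second moment is 9.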