Results 1 - 5 of 5
A sketch algorithm for estimating two-way and multi-way associations
- Computational Linguistics, 2007
Cited by 27 (13 self)
We should not have to look at the entire corpus (e.g., the Web) to know if two (or more) words are strongly associated or not. One can often obtain estimates of associations from a small sample. We develop a sketch-based algorithm that constructs a contingency table for a sample. One can estimate the contingency table for the entire population using straightforward scaling. However, one can do better by taking advantage of the margins (also known as document frequencies). The proposed method cuts the errors roughly in half over Broder’s sketches.
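The "straightforward scaling" baseline mentioned in this abstract can be illustrated with a minimal sketch. It assumes a uniform random sample of documents rather than the paper's sketch construction, and it does not implement the margin-based improvement; the function name and arguments are hypothetical.

```python
import random

def estimate_contingency(docs_a, docs_b, num_docs, sample_size, seed=0):
    """Estimate the 2x2 contingency table for words A and B from a document sample.

    docs_a / docs_b are the IDs of documents containing each word (hypothetical inputs).
    Returns (both, only_a, only_b, neither), each scaled up to the full corpus.
    """
    rng = random.Random(seed)
    sample = rng.sample(range(num_docs), sample_size)  # uniform, without replacement
    set_a, set_b = set(docs_a), set(docs_b)
    counts = [0, 0, 0, 0]
    for doc in sample:
        in_a, in_b = doc in set_a, doc in set_b
        if in_a and in_b:
            counts[0] += 1
        elif in_a:
            counts[1] += 1
        elif in_b:
            counts[2] += 1
        else:
            counts[3] += 1
    scale = num_docs / sample_size  # straightforward scaling to the population
    return tuple(scale * c for c in counts)
```

When the sample covers the whole corpus the scale factor is 1 and the table is exact; with smaller samples the four cells still sum to `num_docs`, but each cell is only an estimate.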
Conditional random sampling: A sketch-based sampling technique for sparse data
- In NIPS, 2006
Cited by 23 (14 self)
We develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.
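One way to read "converts sketches into conditional random samples, with the sample size determined retrospectively" is sketched below. The details are assumptions not stated in the abstract: columns are relabeled by a shared random permutation, each row keeps its `k` nonzero entries with the smallest permuted IDs, and a pair of sketches is trimmed to the smaller of their largest IDs before scaling. The names `build_sketch` and `estimate_l1` are hypothetical, and both sketches are assumed non-empty.

```python
def build_sketch(row, perm, k):
    """Keep the k nonzero entries of `row` with the smallest permuted column IDs.

    perm[j] is the 1-based permuted ID of original column j (shared by all rows).
    """
    nonzero = sorted((perm[j], v) for j, v in enumerate(row) if v != 0)
    return nonzero[:k]

def estimate_l1(sketch_a, sketch_b, num_cols):
    """Estimate the l1 distance between two rows from their sketches."""
    # Effective sample size, known only once both sketches are seen:
    # every nonzero with permuted ID <= d_s is present in both sketches.
    d_s = min(sketch_a[-1][0], sketch_b[-1][0])
    a = {i: v for i, v in sketch_a if i <= d_s}
    b = {i: v for i, v in sketch_b if i <= d_s}
    sample_l1 = sum(abs(a.get(i, 0) - b.get(i, 0)) for i in a.keys() | b.keys())
    return num_cols / d_s * sample_l1  # scale the sample distance to all columns
```

In practice `perm` would be a random permutation of `1..num_cols`, for example `random.sample(range(1, num_cols + 1), num_cols)`; a fixed permutation is only useful for checking the arithmetic by hand.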
On estimating frequency moments of data streams
- In International Workshop on Randomization and Approximation Techniques in Computer Science, 2007
Cited by 19 (0 self)
Abstract. Space-economical estimation of the pth frequency moments, defined as $F_p = \sum_{i=1}^{n} |f_i|^p$ for $p > 0$, is of interest in estimating all-pairs distances in a large data matrix [14], in machine learning, and in data stream computation. Random sketches formed by the inner product of the frequency vector $f_1, \ldots, f_n$ with a suitably chosen random vector were pioneered by Alon, Matias and Szegedy [1], and have since played a central role in estimating $F_p$ and in data stream computations in general. The concept of p-stable sketches, formed by the inner product of the frequency vector with a random vector whose components are drawn from a p-stable distribution, was proposed by Indyk [11] for estimating $F_p$, for $0 < p < 2$, and has been further studied in Li [13]. In this paper, we consider the problem of estimating $F_p$, for $0 < p < 2$. A disadvantage of the stable sketches technique and its variants is that they require $O(1/\epsilon^2)$ inner products of the frequency vector with dense vectors of stable (or nearly stable [14, 13]) random variables to be maintained. This means that each stream update can be quite time-consuming. We present algorithms for estimating $F_p$, for $0 < p < 2$, that do not require the use of stable sketches or their approximations. Our technique is elementary in nature, in that it uses simple randomization in conjunction with well-known summary structures for data streams, such as the COUNT-MIN sketch [7] and the COUNTSKETCH structure [5]. Our algorithms require space $\tilde{O}(1/\epsilon^{2+p})$ to estimate $F_p$ to within $1 \pm \epsilon$ factors and require expected time $O(\log F_1 \log(1/\delta))$ to process each update. Thus, our technique trades an $O(1/\epsilon^p)$ factor in space for much more efficient processing of stream updates. We also present a stand-alone iterative estimator for $F_1$.
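For orientation, the definition of $F_p$ and one of the summary structures named in this abstract can be sketched as follows. This is not the paper's $F_p$ estimator: `frequency_moment` computes the moment exactly from the full stream, and `CountMin` is a textbook Count-Min sketch whose class name, hash family, and parameters are illustrative assumptions (nonnegative integer item IDs, nonnegative updates).

```python
import random
from collections import Counter

def frequency_moment(stream, p):
    """Exact F_p = sum_i |f_i|^p, where f_i is the frequency of item i."""
    return sum(abs(f) ** p for f in Counter(stream).values())

class CountMin:
    """Count-Min sketch for nonnegative integer item IDs and nonnegative updates."""
    PRIME = 2_147_483_647  # 2^31 - 1, modulus for the (a*x + b) hash family

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One (a, b) pair per row; this hash family is an illustrative choice.
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cell(self, row, item):
        a, b = self.hashes[row]
        return (a * item + b) % self.PRIME % self.width

    def update(self, item, count=1):
        # Each update touches one counter per row, so cost is O(depth).
        for r in range(len(self.table)):
            self.table[r][self._cell(r, item)] += count

    def estimate(self, item):
        # Collisions only add to a counter, so the minimum never undercounts.
        return min(self.table[r][self._cell(r, item)]
                   for r in range(len(self.table)))
```

With nonnegative updates, every row of the table sums to $F_1$ and `estimate` never falls below the true frequency; the per-update cost depends only on `depth`, which is the kind of cheap update the abstract contrasts with dense stable sketches.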