Results 1 - 5 of 5
Homomorphic Fingerprints under Misalignments: Sketching Edit and Shift Distances
, 2013
Abstract

Cited by 3 (1 self)
Fingerprinting is a widely used technique for efficiently verifying that two files are identical. More generally, linear sketching is a form of lossy compression (based on random projections) that also enables the “dissimilarity” of non-identical files to be estimated. Many sketches have been proposed for dissimilarity measures that decompose coordinate-wise, such as the Hamming distance between alphanumeric strings or the Euclidean distance between vectors. However, virtually nothing is known about sketches that would accommodate alignment errors. With such errors, Hamming or Euclidean distances are rendered useless: a small misalignment may result in a file that looks very dissimilar to the original file according to such measures. In this paper, we present the first linear sketch that is robust to a small number of alignment errors. Specifically, the sketch can be used to determine whether two files are within a small Hamming distance of being a cyclic shift of each other. Furthermore, the sketch is homomorphic with respect to rotations: it is possible to construct the sketch of a cyclic shift of a file given only the sketch of the original file. The relevant dissimilarity measure, known as the shift distance, arises in the context of embedding edit distance, and our result addresses an open problem [26, Question 13] with a rather surprising outcome. Our sketch projects a length-n file into D(n) · polylog n dimensions, where D(n) ≪ n is the number of divisors of n. The striking fact is that this is near-optimal, i.e., the D(n) dependence is inherent to a problem that is ostensibly about lossy compression. In contrast, we then show that any sketch for estimating the edit distance between two files, even when it is small, requires sketches whose size is nearly linear in n. This lower bound addresses a longstanding open problem on low-distortion …
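The coordinate-wise linear sketches that the abstract contrasts with can be illustrated in a few lines. The following is a minimal sketch of the generic idea only (a random ±1 projection in the Johnson–Lindenstrauss style, not the paper's shift-robust construction); all names and parameters are illustrative:

```python
import random

def make_matrix(m, n, seed=0):
    """A random +/-1 projection matrix: one concrete choice of linear sketch."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(m)]

def sketch(A, x):
    """s = A.x.  Linearity means sketch(x) - sketch(y) is a sketch of x - y."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def estimate_sq_dist(sx, sy):
    """Estimate ||x - y||^2 from the two sketches alone."""
    return sum((a - b) ** 2 for a, b in zip(sx, sy)) / len(sx)

n, m = 512, 128
A = make_matrix(m, n)
rng = random.Random(42)
x = [float(rng.randint(0, 1)) for _ in range(n)]
y = list(x)
for i in rng.sample(range(n), 40):       # flip 40 coordinates: Hamming distance 40
    y[i] = 1.0 - y[i]
print(estimate_sq_dist(sketch(A, x), sketch(A, y)))   # close to 40
```

For 0/1 vectors the squared Euclidean distance equals the Hamming distance, so this sketch estimates it well. It also shows the failure mode the paper targets: cyclically shifting y by one position changes almost every coordinate, so the estimate blows up even though the files are nearly identical up to alignment.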
ℓ2/ℓ2-Foreach Sparse Recovery with Low Risk
 In ICALP, 2013
Abstract

Cited by 3 (2 self)
In this paper, we consider the “foreach” sparse recovery problem with failure probability p. The goal is to design a distribution over m × N matrices Φ and a decoding algorithm A such that for every x ∈ R^N, we have the following error guarantee with probability at least 1 − p: ‖x − A(Φx)‖₂ ≤ C‖x − x_k‖₂, where C is a constant (ideally arbitrarily close to 1) and x_k is the best k-sparse approximation of x. Much of the sparse recovery or compressive sensing literature has focused on the case of either p = 0 or p = Ω(1). We initiate the study of this problem for the entire range of failure probability. Our two main results are as follows: 1. We prove a lower bound on m, the number of measurements, of Ω(k log(N/k) + log(1/p)) for 2^(−Θ(N)) ≤ p < 1. Cohen, Dahmen, and DeVore [5] prove that this bound is tight. 2. We prove nearly matching upper bounds for sublinear-time decoding. Previous such results addressed only p = Ω(1). Our results and techniques lead to the following corollaries: (i) the first ever sublinear-time decoding ℓ1/ℓ1 “for-all” sparse recovery system that requires a log^γ N extra factor (for some γ < 1) over the …
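The ℓ2/ℓ2 guarantee can be checked concretely against a toy scheme. The sketch below uses a standard CountSketch-style construction (my own illustration, not the paper's sublinear-time decoder; the bucket count, repetitions, and signal are all invented for the demo):

```python
import random
import statistics

def make_hashes(n, buckets, reps, seed=0):
    """Shared randomness: per repetition, a bucket index and a sign per coordinate."""
    rng = random.Random(seed)
    return [([rng.randrange(buckets) for _ in range(n)],
             [rng.choice((-1, 1)) for _ in range(n)]) for _ in range(reps)]

def measure(x, hashes, buckets):
    """Phi.x: each repetition adds signed coordinates into its buckets (a linear map)."""
    tables = []
    for h, s in hashes:
        t = [0.0] * buckets
        for i, v in enumerate(x):
            t[h[i]] += s[i] * v
        tables.append(t)
    return tables

def decode(tables, hashes, n, k):
    """Estimate each coordinate by a median over repetitions; keep the top k."""
    est = [statistics.median(s[i] * t[h[i]] for t, (h, s) in zip(tables, hashes))
           for i in range(n)]
    xhat = [0.0] * n
    for j in sorted(range(n), key=lambda i: -abs(est[i]))[:k]:
        xhat[j] = est[j]
    return xhat

n, k = 1000, 5
rng = random.Random(1)
x = [10.0] * k + [rng.uniform(-0.1, 0.1) for _ in range(n - k)]  # k spikes + noise tail
hashes = make_hashes(n, buckets=100, reps=11)
xhat = decode(measure(x, hashes, 100), hashes, n, k)

tail = sum(sorted(v * v for v in x)[:-k]) ** 0.5         # ||x - x_k||_2
err = sum((a - b) ** 2 for a, b in zip(x, xhat)) ** 0.5  # ||x - A(Phi x)||_2
print(err, tail)  # err should be within a small constant factor of tail
```

The point of the demo is the shape of the guarantee: the recovery error is bounded by a constant times the tail norm ‖x − x_k‖₂, not by anything depending on the large entries.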
Accurate Decoding of Pooled Sequenced Data Using Compressed Sensing
Abstract

Cited by 2 (0 self)
In order to overcome the limitations imposed by DNA barcoding when multiplexing a large number of samples in the current generation of high-throughput sequencing instruments, we have recently proposed a new protocol that leverages advances in combinatorial pooling design (group testing) [9]. We have also demonstrated how this new protocol would enable de novo selective sequencing and assembly of large, highly repetitive genomes. Here we address the problem of decoding pooled sequenced data obtained from such a protocol. Our algorithm employs a synergistic combination of ideas from compressed sensing and the decoding of error-correcting codes. Experimental results on synthetic data for the rice genome and real data for the barley genome show that our novel decoding algorithm enables significantly higher-quality assemblies than the previous approach.
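The decoding problem has the flavor of classical group testing. As a toy illustration, here is plain COMP decoding of a random pooling design (the paper's algorithm additionally combines compressed-sensing and error-correction ideas; the design and parameters below are made up):

```python
import random

def random_design(n_samples, n_pools, pools_per_sample, seed=0):
    """Assign each sample to a fixed number of random pools."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_pools), pools_per_sample))
            for _ in range(n_samples)]

def run_pools(design, positives, n_pools):
    """A pool tests positive iff it contains at least one positive sample."""
    result = [False] * n_pools
    for s in positives:
        for p in design[s]:
            result[p] = True
    return result

def comp_decode(design, result):
    """COMP: a sample in any negative pool is cleared; the rest are called positive."""
    return {s for s, pools in enumerate(design) if all(result[p] for p in pools)}

design = random_design(n_samples=100, n_pools=60, pools_per_sample=8)
positives = {3, 41, 77}
decoded = comp_decode(design, run_pools(design, positives, 60))
print(decoded)  # always contains {3, 41, 77}: COMP never misses a true positive
```

COMP can produce false positives (a sample whose pools all happen to be contaminated), which is one reason a real decoder layers more machinery on top of this step.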
Black-Box Trace&Revoke Codes
 ALGORITHMICA
, 2012
Abstract

Cited by 1 (0 self)
We address the problem of designing an efficient broadcast encryption scheme which is also capable of tracing traitors. We introduce a code framework to formalize the problem. Then, we give a probabilistic construction of a code which supports both traceability and revocation. Given N users with at most r revoked users and at most t traitors, our code construction gives rise to a Trace&Revoke system with private keys of size O((r + t) log N) (which can also be reduced to constant size based on an additional computational assumption), ciphertexts of size O((r + t) log N), and O(1) decryption time. Our scheme can deal with certain classes of pirate decoders, which we believe are sufficiently powerful to capture practical pirate strategies. In particular, our code construction is based on a combinatorial object called an (r, s)-disjunct matrix, which is designed to capture both the classic traceability notion of a disjunct matrix and the new requirement of revocation capability. We then probabilistically construct (r, s)-disjunct matrices which help design efficient Black-Box Trace&Revoke systems. For dealing with “smart” pirates, we introduce a tracing technique called “shadow group testing” that uses (close to) legitimate broadcast signals for tracing. Along the way, we also prove several bounds on the number of queries needed for black-box tracing under different assumptions about the pirate’s strategies.
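The classic disjunctness notion underlying tracing can be sketched directly. The code below checks ordinary d-disjunctness and shows why it pins down traitors; the paper's (r, s)-disjunct matrices generalize this, and the brute-force check is for tiny matrices only:

```python
import itertools

def is_d_disjunct(M, d):
    """Classic d-disjunctness: no column of the 0/1 matrix M is covered by the
    union (OR) of any d other columns.  Brute force: small M only."""
    n_cols = len(M[0])
    for j in range(n_cols):
        support = [i for i in range(len(M)) if M[i][j]]
        for others in itertools.combinations([c for c in range(n_cols) if c != j], d):
            if all(any(M[i][c] for c in others) for i in support):
                return False
    return True

def trace(M, outcome):
    """Given the OR of at most d traitor columns, a d-disjunct M identifies exactly
    the columns whose support lies inside the observed outcome."""
    return {j for j in range(len(M[0]))
            if all(outcome[i] for i in range(len(M)) if M[i][j])}

I5 = [[1 if i == j else 0 for j in range(5)] for i in range(5)]
print(is_d_disjunct(I5, 2))        # True: the identity matrix is d-disjunct
outcome = [1, 0, 0, 1, 0]          # OR of the traitor columns {0, 3}
print(trace(I5, outcome))          # {0, 3}
```

The identity matrix is the trivial disjunct example (one key per user); the point of probabilistic constructions like the paper's is to get far fewer rows (keys) than users while keeping this covering property.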
Parallel Feature Selection inspired by Group Testing
Abstract

Cited by 1 (0 self)
This paper presents a parallel feature selection method for classification that scales up to very high dimensions and large data sizes. Our original method is inspired by group testing theory, under which the feature selection procedure consists of a collection of randomized tests to be performed in parallel. Each test corresponds to a subset of features, for which a scoring function may be applied to measure the relevance of the features in a classification task. We develop a general theory providing sufficient conditions under which true features are guaranteed to be correctly identified. Superior performance of our method is demonstrated on a challenging relation extraction task from a very large data set that has both redundant features and a sample size on the order of millions. We present comprehensive comparisons with state-of-the-art feature selection methods on a range of data sets, for which our method exhibits competitive performance in terms of running time and accuracy. Moreover, it also yields substantial speedup when used as a preprocessing step for most other existing methods.
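The test-then-aggregate structure described above can be mimicked with a small self-contained toy (my own illustration, not the authors' algorithm: the correlation-based scoring rule, thresholds, and synthetic data are all invented, and each test is independent so they could run in parallel):

```python
import random

def make_data(n_samples, n_features, true_features, seed=0):
    """Toy regression data: y depends only on the features in true_features."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [sum(row[j] for j in true_features) + rng.gauss(0, 0.1) for row in X]
    return X, y

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5 if va and vb else 0.0

def group_test_select(X, y, n_tests=200, subset_size=5, pass_thresh=0.3, seed=0):
    """Each randomized test scores a random feature subset; a test 'passes' when
    some feature in it correlates with y.  Features present in (almost) every
    passing test they joined are selected - the group-testing aggregation step."""
    rng = random.Random(seed)
    n_features = len(X[0])
    joined = [0] * n_features
    passed = [0] * n_features
    for _ in range(n_tests):
        subset = rng.sample(range(n_features), subset_size)
        score = max(abs(corr([row[j] for row in X], y)) for j in subset)
        for j in subset:
            joined[j] += 1
            if score >= pass_thresh:
                passed[j] += 1
    return {j for j in range(n_features)
            if joined[j] and passed[j] / joined[j] >= 0.95}

X, y = make_data(200, 30, true_features={0, 1, 2})
print(group_test_select(X, y))
```

A true feature makes every test it joins pass, while an irrelevant feature only rides along when a true feature happens to share its subset, so the pass rates separate cleanly.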