Efficiently decoding strings from their shingles
Citations
10599 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 2009
Citation Context ...totically optimal. B. Edit distance The problem of determining the minimum number of edits (insertions or deletions) required to transform one string into another has a long history in the literature =-=[6, 11]-=-. Orlitsky [22] shows that the amount of communication C_α̂(x, y) necessary to reconcile two strings x and y (of lengths |x| and |y| respectively) that are known to be at most α̂-edits apart is at most... |
2653 | The Theory of Error-Correcting Codes
- MacWilliams, Sloane
- 1977
Citation Context ...ents into an equivalent characteristic polynomial, so that the problem of set reconciliation is reduced to an equivalent problem of rational function interpolation, much like in Reed-Solomon decoding =-=[18]-=-. The resulting algorithm requires one message of roughly bm bits of communication and bm^3 computation time to reconcile two sets that differ in m entries. The approach can be improved to expected bm ... |
573 | On learning the past tenses of English verbs
- Rumelhart, McClelland
- 1986
Citation Context ...eads [4] and reconstruction of protein sequences from K-peptides [28]. This idea has even shown up in computational linguistics, where it was used to learn transformations on varying-length sequences =-=[27]-=-. In a simple formal statement of the unique string decoding problem, one is given a string s ∈ Σ∗ over the alphabet Σ. The string is considered uniquely decodable if there is no other string s′ ∈ Σ∗ ... |
535 | Fuzzy extractors: How to generate strong keys from biometrics and other noisy data
- Dodis, Ostrovsky, et al.
- 2008
Citation Context ...two hosts seek to reconcile remote strings that differ in a fixed number of unknown edits, using a minimum amount of communication. A similar problem is faced in cryptography through fuzzy extractors =-=[8]-=-, which can be used to match noisy biometric data to encrypted baseline measurements in a secure fashion. Within a biological context, this problem has common roots with the sequencing of DNA from sho... |
489 | A digital fountain approach to reliable distribution of bulk data
- Byers, Luby, et al.
- 1998
Citation Context ...ication points can be added to probabilistically check the result). The merged shingle indices, which can be determined independently of the reconciliation, can be encoded with any standard rateless code =-=[3, 17, 29]-=-, and the two rateless streams can be combined by considering them inputs to yet a third rateless encoding. VI. CONCLUSION We have provided a linear-time algorithm for determining whether a given stri... |
284 | Practical loss-resilient codes
- Luby, Mitzenmacher, et al.
- 1997
Citation Context ...ication points can be added to probabilistically check the result). The merged shingle indices, which can be determined independently of the reconciliation, can be encoded with any standard rateless code =-=[3, 17, 29]-=-, and the two rateless streams can be combined by considering them inputs to yet a third rateless encoding. VI. CONCLUSION We have provided a linear-time algorithm for determining whether a given stri... |
218 | On the Lambert W function
- Corless, Gonnet, et al.
- 1996
Citation Context ...xpect each node in the de Bruijn graph of length-l shingles to have only one outgoing edge (implying unique decodability) if $l \le n + 1 + \frac{W(-\ln(p)\, p^{-n})}{\ln p}$ (5), where W(·) is the Lambert W function =-=[5]-=-. When n goes to infinity, (5) is O(log(n)), meaning that logarithmically sized shingles should avoid communicationally expensive merges. Thus, when the two strings are composed of random iid bit... |
215 | Approximate string matching with q-grams and maximal matches
- Ukkonen
- 1992
Citation Context ...s time complexity O(n|Σ|^3) and space complexity Θ(|Σ|^3). A. Approach Two principal approaches have been put forth for deciding unique string decodability. The first is due to Pevzner [25] and Ukkonen =-=[32]-=-, who characterized the type of strings that have the same collection of shingles. This approach can be used to generate a simple unique decodability tester whose naive worst-case running time on stri... |
131 | Efficient algorithms for sorting and synchronization
- Tridgell
- 1999
Citation Context ...α log |y| (log |y| + log log |y| + log(1/ε) + log α) bits of communication. Other approaches include those of Evfimievski [10] for small edit distances, Suel [30] based on delta compression, and Tridgell =-=[31]-=-, which presents the computationally efficient (but potentially communicationally inefficient) rsync protocol. C. Reconciliation Another natural approach to the α-edits problem involves the utilization... |
75 | Set reconciliation with nearly optimal communication complexity
- Minsky, Trachtenberg, et al.
- 2003
Citation Context ...e data with minimum communication. a) Set reconciliation: The problem of set reconciliation seeks to reconcile two remote sets S_A and S_B of b-bit integers using minimum communication. The approach in =-=[20]-=- involves translating the set elements into an equivalent characteristic polynomial, so that the problem of set reconciliation is reduced to an equivalent problem of rational function interpolation, m... |
45 | Interactive communication of balanced distributions and of correlated files.
- Orlitsky
- 1993
Citation Context ...efficiently reconstructing a string from a given encoding is fundamental to a broad range of settings. In the information theory world, this is related to the α-edits or string reconciliation problem =-=[4, 22]-=-, wherein two hosts seek to reconcile remote strings that differ in a fixed number of unknown edits, using a minimum amount of communication. A similar problem is faced in cryptography through fuzzy e... |
43 | Fragment assembly with short reads
- Chaisson, Pevzner, et al.
- 2004
Citation Context ...efficiently reconstructing a string from a given encoding is fundamental to a broad range of settings. In the information theory world, this is related to the α-edits or string reconciliation problem =-=[4, 22]-=-, wherein two hosts seek to reconcile remote strings that differ in a fixed number of unknown edits, using a minimum amount of communication. A similar problem is faced in cryptography through fuzzy e... |
31 | The probability of unique solution of sequencing by hybridization
- Dyer, Frieze, et al.
- 1994
Citation Context ...an expect a unique decoding for substrings of identically distributed, independent random bits as long as the substrings are roughly logarithmic in the size of the overall decoded string. The work in =-=[9]-=- also provides evidence of a high probability of unique decoding for logarithmically sized substrings, and includes generalizations to non-binary and even non-uniformly random characters for the strin... |
29 | Raptor codes
- Shokrollahi
- 2006
Citation Context ...ication points can be added to probabilistically check the result). The merged shingle indices, which can be determined independently of the reconciliation, can be encoded with any standard rateless code =-=[3, 17, 29]-=-, and the two rateless streams can be combined by considering them inputs to yet a third rateless encoding. VI. CONCLUSION We have provided a linear-time algorithm for determining whether a given stri... |
27 | DNA physical mapping and alternating Eulerian cycles in colored graphs.
- Pevzner
- 1995
Citation Context ...algorithm [14] has time complexity O(n|Σ|^3) and space complexity Θ(|Σ|^3). A. Approach Two principal approaches have been put forth for deciding unique string decodability. The first is due to Pevzner =-=[25]-=- and Ukkonen [32], who characterized the type of strings that have the same collection of shingles. This approach can be used to generate a simple unique decodability tester whose naive worst-case run... |
24 | Scalable set reconciliation
- Minsky, Trachtenberg
- 2002
Citation Context ...bits of communication and bm^3 computation time to reconcile two sets that differ in m entries. The approach can be improved to expected bm communication and computation through the use of interaction =-=[19]-=- and generalized to multisets and to arbitrary error-correcting codes [12]. b) String reconciliation: A string σ can be transformed into a multiset S through shingling, or collecting all contiguous su... |
22 | Sequencing-by-hybridization at the information-theory bound: an optimal algorithm
- Preparata, Upfal
Citation Context ...ncludes generalizations to non-binary and even non-uniformly random characters for the strings. This is extended in [2] to characterize the number of decodings for a given collection of shingles, and =-=[26]-=- considers decoding from regularly gapped collections of substrings in a DNA sequencing framework. Finally, [21] considers an information-theoretic capacity of the sequencing problem, and presents a g... |
17 | A probabilistic algorithm for updating files over a communication link
- Evfimievski
- 1998
(Show Context)
Citation Context ... that does not need to know the number of edits in advance and requires at most 2α log |y| (log |y| + log log |y| + log(1/ε) + log α) bits of communication. Other approaches include those of Evfimievski =-=[10]-=- for small edit distances, Suel [30] based on delta compression, and Tridgell [31], which presents the computationally efficient (but potentially communicationally inefficient) rsync protocol. C. Reconc... |
17 | Data verification and reconciliation with generalized error-correction codes
- Karpovsky, Levitin, et al.
Citation Context ...differ in m entries. The approach can be improved to expected bm communication and computation through the use of interaction [19] and generalized to multisets and to arbitrary error-correcting codes =-=[12]-=-. b) String reconciliation: A string σ can be transformed into a multiset S through shingling, or collecting all contiguous substrings of a given length, including repetitions. For example, shingling ... |
16 | Euler circuits and DNA sequencing by hybridization
- Arratia, Bollobás, et al.
14 | Communication complexity of document exchange
- Cormode, Paterson, et al.
- 2000
Citation Context ...(y)⌉ ≤ $\log \binom{|y| + \hat{\alpha}}{\hat{\alpha}} + 3 \log(\hat{\alpha})$, although he leaves an efficient one-way protocol as an open question. The literature includes a variety of proposed protocols for this problem. Cormode et al. =-=[7]-=- propose a hash-based approach that requires a known bound α̂ on edits between x and y (assuming, without loss of generality, that y is the longer string) and communicates at most 4α log(2|y|/α) log... |
13 | Bandwidth efficient string reconciliation using puzzles
- Agarwal, Chauhan, et al.
Citation Context ...s language, and instead we have exhibited an equivalent NFA with Θ(|Σ|^3) states. There has also been work on the probability of a collection of shingles having a unique reconstruction. The authors in =-=[1]-=- show that one can expect a unique decoding for substrings of identically distributed, independent random bits as long as the substrings are roughly logarithmic in the size of the overall decoded stri... |
11 | Practical algorithms for interactive communication
- Orlitsky, Viswanathan
- 2001
Citation Context ...is the longer string) and communicates at most $4\alpha \log\left(\frac{2|y|}{\alpha}\right) \log(2\hat{\alpha}) + O\left(\alpha \log|y| \, \log\frac{\log(|y|)}{\ln\frac{1}{1-\epsilon}}\right)$ (1) bits to reconcile the strings with probability of failure ε. Orlitsky and Viswanathan =-=[24]-=- propose an interactive protocol that does not need to know the number of edits in advance and requires at most 2α log |y| (log |y| + log log |y| + log(1/ε) + log α) bits of communication. Other approache... |
8 | Improved file synchronization techniques for maintaining large replicated collections over slow networks
- Suel, Noel, et al.
- 2004
Citation Context ...er of edits in advance and requires at most 2α log |y| (log |y| + log log |y| + log(1/ε) + log α) bits of communication. Other approaches include those of Evfimievski [10] for small edit distances, Suel =-=[30]-=- based on delta compression, and Tridgell [31], which presents the computationally efficient (but potentially communicationally inefficient) rsync protocol. C. Reconciliation Another natural approach to... |
6 | Uniquely decodable n-gram embeddings
- Kontorovich
- 2004
Citation Context ...tring character. • Line 9. The key observation here is that the graph is necessarily sparse, since any node with more than two parents or children necessarily renders the graph not uniquely decodable =-=[15]-=-. As such, the graph can be stored as an adjacency list so that this line represents a constant time operation for each string character. • Lines 10-19. We maintain a stack onto which vertices are pus... |
6 | Decomposition and reconstruction of protein sequences: The problem of uniqueness and factorizable language
- Shi, Xie, et al.
Citation Context ...ne measurements in a secure fashion. Within a biological context, this problem has common roots with the sequencing of DNA from short reads [4] and reconstruction of protein sequences from K-peptides =-=[28]-=-. This idea has even shown up in computational linguistics, where it was used to learn transformations on varying-length sequences [27]. In a simple formal statement of the unique string decoding prob... |
4 | Finite automata for testing composition-based reconstructibility of sequences
- Li, Xie
Citation Context ...bservation that the set of uniquely decodable strings forms a regular language [15]. With this observation, it is possible to produce a deterministic finite state machine on exp(Ω(|Σ| log |Σ|)) states =-=[16]-=- and a non-deterministic one on O(|Σ|^3) states [14]. The DFA is prohibitively expensive to construct explicitly, while the NFA may be simulated in time O(n|Σ|^3) and space Θ(|Σ|^3). In this work, we pre... |
4 | Information theory of DNA sequencing
- Motahari, Bresler, et al.
Citation Context ... in [2] to characterize the number of decodings for a given collection of shingles, and [26] considers decoding from regularly gapped collections of substrings in a DNA sequencing framework. Finally, =-=[21]-=- considers an information-theoretic capacity of the sequencing problem, and presents a greedy algorithm for reconstruction that is asymptotically optimal. B. Edit distance The problem of determining t... |
3 | String reconciliation with unknown edit distance
- Kontorovich, Trachtenberg
Citation Context ... same connected component if and only if [i_a, j_a] ∩ [i_b, j_b] ≠ ∅. This check is a constant-time operation per character. V. STRING RECONCILIATION We next present the string reconciliation protocol in =-=[13]-=- as a specific example where our online unique decodability algorithm is applicable. This specific protocol is a refinement of a shingling approach in [1], and is further based on a transformation to ... |