Results 1 - 10 of 61
Clustering by compression
 IEEE Transactions on Information Theory
, 2005
Cited by 296 (26 self)
Abstract: We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis. Index Terms: Heterogeneous data analysis, hierarchical unsupervised clustering, Kolmogorov complexity, normalized compression distance, parameter-free data mining, quartet tree method, universal dissimilarity distance.
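The abstract describes the NCD as computed from the lengths of compressed files, singly and in pairwise concatenation. A minimal sketch of that computation follows, using the commonly stated form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). The choice of zlib as the compressor C and the helper names `compressed_len` and `ncd` are assumptions made for illustration; this is not the authors' public software.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C(.))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # compress the pairwise concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

With a real compressor the value is only approximately within [0, 1]: near 0 for highly similar inputs and near 1 for unrelated ones. A pairwise matrix of such values is the kind of distance matrix a hierarchical clustering step would take as input.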
Secure multiparty computation of approximations
, 2001
Cited by 107 (26 self)
Approximation algorithms can sometimes provide efficient solutions when no efficient exact computation is known. In particular, approximations are often useful in a distributed setting where the inputs are held by different parties and may be extremely large. Furthermore, for some applications, the parties want to compute a function of their inputs securely, without revealing more information than necessary. In this work we study the question of simultaneously addressing the above efficiency and security concerns via what we call secure approximations. We start by extending standard definitions of secure (exact) computation to the setting of secure approximations. Our definitions guarantee that no additional information is revealed by the approximation beyond what follows from the output of the function being approximated. We then study the complexity of specific secure approximation problems. In particular, we obtain a sublinear-communication protocol for securely approximating the Hamming distance and a polynomial-time protocol for securely approximating the permanent and related #P-hard problems.
Set Reconciliation with Nearly Optimal Communication Complexity
 in International Symposium on Information Theory
, 2000
Cited by 77 (16 self)
We consider the problem of efficiently reconciling two similar sets held by different hosts while minimizing the communication complexity. This type of problem arises naturally from gossip protocols used for the distribution of information. We describe an approach to set reconciliation based on the encoding of sets as polynomials. The resulting protocols exhibit tractable computational complexity and nearly optimal communication complexity. Also, these protocols can be adapted to work over a broadcast channel, allowing many clients to reconcile with one host based on a single broadcast, even if each client is missing a different subset.
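The idea of encoding sets as polynomials can be illustrated with a small sketch. It assumes each set S is represented by its characteristic polynomial, the product of (z - x) over x in S, evaluated modulo a prime. Elements common to both sets then cancel in the ratio of the two evaluations, so the ratio depends only on the set differences. The modulus `P`, the function names, and the evaluation point are hypothetical choices. The sketch shows only this cancellation identity, not the full reconciliation protocol from the paper.

```python
P = 2_147_483_647  # the prime 2^31 - 1, an arbitrary illustrative modulus


def char_poly_eval(s, z, p=P):
    """Evaluate prod_{x in s} (z - x) modulo p."""
    result = 1
    for x in s:
        result = result * (z - x) % p
    return result


def eval_ratio(a, b, z, p=P):
    """Evaluate chi_a(z) / chi_b(z) modulo p (z must not be an element of b)."""
    return char_poly_eval(a, z, p) * pow(char_poly_eval(b, z, p), -1, p) % p


host_a = {1, 2, 3, 4, 9}
host_b = {1, 2, 3, 5}
z = 1000  # evaluation point chosen outside both sets

# Common elements {1, 2, 3} cancel: the ratio equals that of the differences.
same = eval_ratio(host_a, host_b, z) == eval_ratio(host_a - host_b, host_b - host_a, z)
```

Because the ratio is determined by the differences alone, the number of evaluation points needed to recover it would scale with the size of the difference rather than the size of the sets, which is the intuition behind the low communication cost.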
The String Edit Distance Matching Problems with Moves
, 2006
Cited by 72 (3 self)
The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit distance matching problem is to compute the smallest edit distance between p and substrings of t. We relax the problem so that (a) we allow an additional operation, namely, substring moves, and (b) we allow approximation of this string edit distance. Our result is a near-linear-time deterministic algorithm to produce a factor of O(log n log* n) approximation to the string edit distance with moves. This is the first known significantly subquadratic algorithm for a string edit distance problem in which the distance involves nontrivial alignments. Our results are obtained by embedding strings into L1 vector space using a simplified parsing technique we call Edit
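For reference, the exact (unrelaxed) version of the matching problem stated above, the smallest edit distance between p and any substring of t, can be sketched with a classical O(mn) dynamic program. This is the standard quadratic baseline, not the paper's near-linear algorithm. It handles only character inserts, deletes and changes, with no substring moves, and the function name is hypothetical.

```python
def min_substring_edit_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between pattern and any substring of text."""
    # prev[j]: best cost of matching the pattern prefix seen so far against
    # some substring of text ending at position j. The empty pattern prefix
    # matches the empty substring everywhere, hence the row of zeros.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, start=1):
        cur = [i] + [0] * len(text)  # i pattern chars vs. empty substring
        for j, tc in enumerate(text, start=1):
            cur[j] = min(
                prev[j] + 1,               # pattern char left unmatched
                cur[j - 1] + 1,            # extra text char in the substring
                prev[j - 1] + (pc != tc),  # match or change
            )
        prev = cur
    return min(prev)  # the substring may end anywhere in the text
```

The only difference from the ordinary two-string edit distance table is the all-zero first row and the final minimum over end positions, which together let the match start and end anywhere in t.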
Algorithmic clustering of music based on string compression
 Computer Music Journal
, 2004
Cited by 67 (19 self)
All musical pieces are similar, but some are more similar than others. Apart from serving as an infinite source of discussion (“Haydn is just like Mozart - No, he’s not!”), such similarities are also crucial for the design of efficient music information retrieval systems. The amount of digitized music available on the Internet has grown dramatically in recent years, both in the public domain and on commercial sites; Napster and its clones are prime examples. Web sites offering musical content in some form like MP3, MIDI, or other, need a way to organize their wealth of material; they need to somehow classify their files according to musical genres and subgenres, putting similar pieces together. The purpose of such organization is to enable users to navigate to pieces of music they already know and like, but also to give them advice and recommendations (“If you like this, you might also like...”). Currently, such organization is mostly done manually by humans, or based on patterns in the purchasing behaviors of customers. However, some recent research has been examining the possibilities of automating music classification. A human expert, comparing different pieces of music with the goal of clustering similar works together, will generally look for certain specific similarities. Previous attempts to automate this process do the same. Generally speaking, they take a file containing a piece of music and extract from it various specific numerical features, related to pitch, rhythm, harmony, etc. One can extract such features using, for instance, Fourier transforms (Tzanetakis and Cook 2002) or wavelet transforms
Approximate Nearest Neighbors and Sequence Comparison With Block Operations
 In STOC
, 2000
Cited by 48 (6 self)
We study sequence nearest neighbors (SNN). Let D be a database of n sequences; we would like to preprocess D so that given any online query sequence Q we can quickly find a sequence S in D for which d(S, Q) ≤ d(S, T) for any other sequence T in D. Here d(S, Q) denotes the distance between sequences S and Q, defined to be the minimum number of edit operations needed to transform one to another (all edit operations will be reversible so that d(S, T) = d(T, S) for any two sequences T and S). These operations correspond to the notion of similarity between sequences that we wish to capture in a given application. Natural edit operations include character edits (inserts, replacements, deletes, etc.), block edits (moves, copies, deletes, reversals) and block numerical transformations (scaling by an additive or a multiplicative constant). The SNN problem arises in many applications. We present the first known efficient algorithm for "approximate" nearest neighbor search for sequences with p...
A novel method for multiple alignment of sequences with repeated and shuffled elements
, 2004
A sublinear algorithm for weakly approximating edit distance
 In Proc. STOC 2003
, 2003
Cited by 38 (4 self)
We show how to determine whether the edit distance between two given strings is small in sublinear time. Specifically, we present a test which, given two n-character strings A and B, runs in time o(n) and with high probability returns “CLOSE” if their edit distance is O(n^α), and “FAR” if their edit distance is Ω(n), where α is a fixed parameter less than 1. Our algorithm for testing the edit distance works by recursively subdividing the strings A and B into smaller substrings and looking for pairs of substrings in A, B with small edit distance. To do this, we query both strings at random places using a special technique for economizing on the samples which does not pick the samples independently and provides better query and overall complexity. As a result, our test runs in time Õ(n^{max(α/2, 2α−1)}) for any fixed α < 1. Our algorithm thus provides a tradeoff between accuracy and efficiency that is particularly useful when the input data is very large. We also show a lower bound of Ω(n^{α/2}) on the query complexity of every algorithm that distinguishes pairs of strings with edit distance at most n^α from those with edit distance at least n/6.
Approximating edit distance efficiently
 In Proc. FOCS 2004
, 2004
Cited by 35 (5 self)
Edit distance has been extensively studied for the past several years. Nevertheless, no linear-time algorithm is known to compute the edit distance between two strings, or even to approximate it to within a modest factor. Furthermore, for various natural algorithmic problems such as low-distortion embeddings into normed spaces, approximate nearest-neighbor schemes, and sketching algorithms, known results for the edit distance are rather weak. We develop algorithms that solve gap versions of the edit distance problem: given two strings of length n with the promise that their edit distance is either at most k or greater than ℓ, decide which of the two holds. We present two sketching algorithms for gap versions of edit distance. Our first algorithm solves the k vs. (kn)^{2/3} gap problem, using a constant-size sketch. A more involved algorithm solves the stronger k vs. ℓ gap problem, where ℓ can be as small as O(k^2), still with a constant-size sketch, but works only for strings that are mildly “non-repetitive”. Finally, we develop an n^{3/7}-approximation quasi-linear time algorithm for edit distance, improving the previous best factor of n^{3/4} [5]; if the input strings are assumed to be non-repetitive, then the approximation factor can be strengthened to n^{1/3}.
Lower bounds for embedding edit distance into normed spaces
 In Proc. SODA 2003
, 2003
Cited by 27 (3 self)
MIT; S. Raskhodnikova, MIT. 1 Introduction. The edit distance (also called Levenshtein metric) between two strings is the minimum number of operations (insertions, deletions and character substitutions) needed to transform one string into another. This distance is of key importance in computational biology, as well as text processing and other areas. Algorithms for problems involving this metric have been extensively investigated. In particular, the quadratic-time dynamic programming algorithm for computing the edit distance between two strings is one of the most investigated and used algorithms in computational biology. Recently, a new approach to problems involving edit distance has been proposed. Its basic component is construction of a mapping f (called an embedding), which maps any string s into a vector f(s) ∈ ...
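The quadratic-time dynamic programming algorithm mentioned in this abstract can be sketched as follows. This is the textbook formulation with unit costs for insertion, deletion and substitution, keeping only two rows of the table at a time. The function name is hypothetical and the code is not taken from the paper.

```python
def edit_distance(s: str, r: str) -> int:
    """Levenshtein distance between s and r via row-by-row dynamic programming."""
    # prev[j]: distance between the current prefix of s and r[:j].
    prev = list(range(len(r) + 1))  # empty prefix of s: j insertions
    for i, sc in enumerate(s, start=1):
        cur = [i]  # s[:i] vs. empty string: i deletions
        for j, rc in enumerate(r, start=1):
            cur.append(min(
                prev[j] + 1,               # delete sc
                cur[j - 1] + 1,            # insert rc
                prev[j - 1] + (sc != rc),  # keep or substitute
            ))
        prev = cur
    return prev[-1]
```

The running time is proportional to len(s) * len(r), which is the quadratic cost that embedding and sketching approaches aim to avoid on large inputs.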