Results 1  10
of
24
The Similarity Metric
 IEEE TRANSACTIONS ON INFORMATION THEORY
, 2003
"... A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it min ..."
Abstract

Cited by 192 (29 self)
 Add to MetaCart
A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This theory forms the foundation for a new practical tool. To evidence generality and robustness we give two distinctive applications in widely divergent areas using standard compression programs like gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatic computed whole mitochondrial phylogeny tree. Secondly, we fully automatically compute the language tree of 52 different languages.
An Efficient Index Structure for String Databases
 In VLDB
, 2001
"... We consider the problem of substring searching in large databases. Typical applications of this problem are genetic data, web data, and event sequences. Since the size of such databases grows exponentially, it becomes impractical to use inmemory algorithms for these problems. In this paper, we ..."
Abstract

Cited by 70 (8 self)
 Add to MetaCart
We consider the problem of substring searching in large databases. Typical applications of this problem are genetic data, web data, and event sequences. Since the size of such databases grows exponentially, it becomes impractical to use inmemory algorithms for these problems. In this paper, we propose to map the substrings of the data into an integer space with the help of wavelet coefficients. Later, we index these coefficients using MBRs (Minimum Bounding Rectangles). We define a distance function which is a lower bound to the actual edit distance between strings. We experiment with both nearest neighbor queries and range queries. The results show that our technique prunes significant amount of the database (typically 5095%), thus reducing both the disk I/O cost and the CPU cost significantly. 1
The String Edit Distance Matching Problems with Moves
, 2006
"... The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit distance matching problem is to compute the smalles ..."
Abstract

Cited by 58 (3 self)
 Add to MetaCart
The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit distance matching problem is to compute the smallest edit distance between p and substrings of t. We relax the problem so that (a) we allow an additional operation, namely, substring moves, and (b) we allow approximation of this string edit distance. Our result is a near linear time deterministic algorithm to produce a factor of O(log n log ∗ n) approximation to the string edit distance with moves. This is the first known significantly subquadratic algorithm for a string edit distance problem in which the distance involves nontrivial alignments. Our results are obtained by embedding strings into L1 vector space using a simplified parsing technique we call Edit
LowDistortion Embeddings of Finite Metric Spaces
 in Handbook of Discrete and Computational Geometry
, 2004
"... INTRODUCTION An npoint metric space (X; D) can be represented by an n n table specifying the distances. Such tables arise in many diverse areas. For example, consider the following scenario in microbiology: X is a collection of bacterial strains, and for every two strains, one is given their diss ..."
Abstract

Cited by 51 (0 self)
 Add to MetaCart
INTRODUCTION An npoint metric space (X; D) can be represented by an n n table specifying the distances. Such tables arise in many diverse areas. For example, consider the following scenario in microbiology: X is a collection of bacterial strains, and for every two strains, one is given their dissimilarity (computed, say, by comparing their DNA). It is dicult to see any structure in a large table of numbers, and so we would like to represent a given metric space in a more comprehensible way. For example, it would be very nice if we could assign to each x 2 X a point f(x) in the plane in such a way that D(x; y) equals the Euclidean distance of f(x) and f(y). Such a representation would allow us to see the structure of the metric space: tight clusters, isolated points, and so on. Another advantage would be that the metric would now be represented by only 2n real numbers, the coordinates of the n points in the plane, instead of numbers as before. Moreover, many quantities concern
Nonembeddability theorems via Fourier analysis
"... Various new nonembeddability results (mainly into L1) are proved via Fourier analysis. In particular, it is shown that the Edit Distance on {0, 1}d has L1 distortion (log d) 12o(1). We also give new lower bounds on the L1 distortion of flat tori, quotients of the discrete hypercube under group ac ..."
Abstract

Cited by 43 (9 self)
 Add to MetaCart
Various new nonembeddability results (mainly into L1) are proved via Fourier analysis. In particular, it is shown that the Edit Distance on {0, 1}d has L1 distortion (log d) 12o(1). We also give new lower bounds on the L1 distortion of flat tori, quotients of the discrete hypercube under group actions, and the transportation cost (Earthmover) metric.
CLUSEQ: Efficient and Effective Sequence Clustering
 In ICDE
, 2003
"... Analyzing sequence data has become increasingly important recently in the area of biological sequences, text documents, web access logs, etc. In this paper, we investigate the problem of clustering sequences based on their structural features. As a widely recognized technique, clustering has proven ..."
Abstract

Cited by 27 (2 self)
 Add to MetaCart
Analyzing sequence data has become increasingly important recently in the area of biological sequences, text documents, web access logs, etc. In this paper, we investigate the problem of clustering sequences based on their structural features. As a widely recognized technique, clustering has proven to be very useful in detecting unknown object categories and revealing hidden correlations among objects. One difficulty that prevents clustering from being performed extensively on sequence data (in categorical domain) is the lack of an effective yet efficient similarity measure. Therefore, we propose a novel model (CLUSEQ) for sequence cluster by exploring significant statistical properties possessed by the sequences. The conditional probability distribution (CPD) of the next symbol given a preceding segment is derived and used to characterize sequence behavior and to support the similarity measure. A variation of the suffix tree, namely probabilistic suffix tree, is employed to organize (the significant portion of) the CPD in a concise way. A novel algorithm is devised to efficiently discover clusters with high quality and is able to automatically adjust the number of clusters to its optimal range via a unique combination of successive new cluster generation and cluster consolidation. The performance of CLUSEQ has been demonstrated via extensive experiments on several real and synthetic sequence databases. 1
Low distortion embeddings for edit distance
 In Proceedings of the Symposium on Theory of Computing
, 2005
"... We show that {0, 1} d endowed with edit distance embeds into ℓ1 with distortion 2 O( √ log d log log d). We further show efficient implementations of the embedding that yield solutions to various computational problems involving edit distance. These include sketching, communication complexity, neare ..."
Abstract

Cited by 23 (1 self)
 Add to MetaCart
We show that {0, 1} d endowed with edit distance embeds into ℓ1 with distortion 2 O( √ log d log log d). We further show efficient implementations of the embedding that yield solutions to various computational problems involving edit distance. These include sketching, communication complexity, nearest neighbor search. For all these problems, we improve upon previous bounds. 1
The computational hardness of estimating edit distance
 In Proceedings of the Symposium on Foundations of Computer Science
, 2007
"... We prove the first nontrivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings. To the best of our knowledge, this is the first computational setting in which the complexity of computing the edit distance is provably ..."
Abstract

Cited by 17 (7 self)
 Add to MetaCart
We prove the first nontrivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings. To the best of our knowledge, this is the first computational setting in which the complexity of computing the edit distance is provably larger than that of Hamming distance. Our lower bound exhibits a tradeoff between approximation and communication, asserting, for example, that protocols with O(1) bits of communication can only obtain approximation α ≥ Ω(log d / log log d), where d is the length of the input strings. This case of O(1) communication is of particular importance since it captures constantsize sketches as well as embeddings into spaces like L1 and squaredL2, two prevailing algorithmic approaches for dealing with edit distance. Furthermore, the bound holds not only for strings over alphabet Σ = {0, 1}, but also for strings that are permutations (aka the Ulam metric). Besides being applicable to a much richer class of algorithms than all previous results, our bounds are neartight in at least one case, namely of embedding permutations into L1. The proof uses a new technique, that relies on Fourier analysis in a rather elementary way. 1
Comparing sequences with segment rearrangements
 In Proceedings of Foundations of Software Technology and Theoretical Computer Science
, 2003
"... Abstract. Computational genomics involves comparing sequences based on "similarity " for detecting evolutionary and functional relationships. Until very recently, available portions of the human genome sequence (and that of other species) were fairly short and sparse. Mostsequencing effor ..."
Abstract

Cited by 10 (0 self)
 Add to MetaCart
Abstract. Computational genomics involves comparing sequences based on "similarity " for detecting evolutionary and functional relationships. Until very recently, available portions of the human genome sequence (and that of other species) were fairly short and sparse. Mostsequencing effort was focused on genes and other short units; similarity between such sequences was measured based on character level differences. However with the advent of whole genome sequencing technology there is emerging consensus that the measure of similarity between long genome sequences must capture the rearrangements of large segmentsfound in abundance in the human genome. In this paper, we abstract the general problem of computing sequence similarity in the presence of segment rearrangements. This problem isclosely related to computing the smallest grammar for a string or the block edit distance between two strings. Our problem, like these otherproblems, is NP hard. Our main result here is a simple O(1) factor approximation algorithm for this problem. In contrast, best known approximations for the related problems are factor \Omega (log n) off from the optimal. Our algorithm works in linear time, and in one pass. In proving our result, we relate sequence similarity measures based on different segment rearrangements, to each other, tight up to constant factors. 1 Introduction Similarity comparison between biomolecular sequences play an important rolein computational genomics due to the premise that sequence similarity usually
Indexing DNA sequences using qgrams
 In Proceedings of the 10th International Conference on Database Systems for Advanced Applications
, 2005
"... Abstract. We have observed in recent years a growing interest in similarity search on large collections of biological sequences. Contributing to the interest, this paper presents a method for indexing the DNA sequences efficiently based on qgrams to facilitate similarity search in a DNA database an ..."
Abstract

Cited by 8 (0 self)
 Add to MetaCart
Abstract. We have observed in recent years a growing interest in similarity search on large collections of biological sequences. Contributing to the interest, this paper presents a method for indexing the DNA sequences efficiently based on qgrams to facilitate similarity search in a DNA database and sidestep the need for linear scan of the entire database. Two level index – hash table and ctrees – are proposed based on the qgrams of DNA sequences. The proposed data structures allow the quick detection of sequences within a certain distance to the query sequence. Experimental results show that our method is efficient in detecting similarity regions in a DNA sequence database with high sensitivity. 1