## Linear-Time Computation of Similarity Measures for Sequential Data (2008)

Citations: 23 (17 self)

### BibTeX

@MISC{Rieck08linear-timecomputation,

author = {Konrad Rieck and Pavel Laskov},

title = {Linear-Time Computation of Similarity Measures for Sequential Data},

year = {2008}

}

### Abstract

Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms—enabling peak performances of up to 10^6 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA.
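The embedding idea in the abstract can be illustrated with a minimal sketch, assuming k-grams as the embedding language. The function names (`kgram_embedding`, `linear_kernel`, `manhattan_distance`) are hypothetical, and the hash-based `Counter` is used only for brevity; it is not one of the sorted-array, trie or suffix-tree realizations the article proposes.

```python
from collections import Counter


def kgram_embedding(sequence, k):
    """Map a sequence to the occurrence counts of its contiguous k-grams."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def linear_kernel(phi_x, phi_y):
    """Inner product of two k-gram count vectors (a kernel function)."""
    return sum(count * phi_y[word] for word, count in phi_x.items())


def manhattan_distance(phi_x, phi_y):
    """Sum of absolute count differences over all words in either sequence."""
    words = phi_x.keys() | phi_y.keys()
    return sum(abs(phi_x[word] - phi_y[word]) for word in words)


# Illustrative inputs, not taken from the article's experiments.
phi_x = kgram_embedding("abab", 2)  # counts: ab=2, ba=1
phi_y = kgram_embedding("abba", 2)  # counts: ab=1, bb=1, ba=1
print(linear_kernel(phi_x, phi_y))
print(manhattan_distance(phi_x, phi_y))
```

Both measures touch only words that actually occur in one of the two sequences, which is what keeps the comparison cost tied to sequence length rather than to the size of the embedding language.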

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ... an inner product in high-dimensional feature spaces, the importance of an abstraction from data representation has been quickly realized (e.g., Vapnik, 1995). Consequently, kernel-based methods have been proposed for non-vectorial domains, such as analysis of images (e.g., Schölkopf et al., 1998a; Chapelle et al., 1999), sequences (e.g., Jaakkola et al., ... |

3836 | Introduction to automata theory, languages, and computation - Hopcroft, Motwani, et al. |

2030 | Learning with Kernels
- Scholkopf, Smola
- 2002
Citation Context ...ndled. Thus, a powerful abstraction between algorithms and data representations can be established. The most prominent example of such abstraction is kernel-based learning (e.g., Müller et al., 2001; Schölkopf and Smola, 2002) in which pairwise relationships between objects are expressed by a Mercer kernel, an inner product in a reproducing kernel Hilbert space. Following the seminal work of Boser et al. (1992), various l... |

1461 | Identification of common molecular subsequences
- Smith, Waterman
- 1981
Citation Context ...1983). Applications in bioinformatics motivated extensions and adaptations of this concept, for example defining sequence similarity in terms of local and global alignments (Needleman and Wunsch, 1970; Smith and Waterman, 1981). However, similarity measures based on the Hamming distance are restricted to sequences of equal length and measures derived from the Levenshtein distance (e.g., Liao and Noble, 2003; Vert et al., 2... |

1410 | A general method applicable to the search for similarities in the amino acid sequence of two proteins
- Needleman, Wunsch
- 1970
Citation Context ...other (Sankoff and Kruskal, 1983). Applications in bioinformatics motivated extensions and adaptations of this concept, for example defining sequence similarity in terms of local and global alignments (Needleman and Wunsch, 1970; Smith and Waterman, 1981). However, similarity measures based on the Hamming distance are restricted to sequences of equal length and measures derived from the Levenshtein distance (e.g., Liao and N... |

1323 | Color indexing
- Swain, Ballard
- 1991
Citation Context ...2 list well-known kernel and distance functions (see Vapnik, 1995; Schölkopf and Smola, 2002; Webb, 2002) in terms of L. The histogram intersection kernel in Table 1 derives from computer vision (see Swain and Ballard, 1991; Odone et al., 2005) and the Jensen-Shannon divergence in Table 2 is defined using $H(x, y) = x \log \frac{2x}{x+y} + y \log \frac{2y}{x+y}$. A further and rather exotic class of vectorial similarity measures are similarity coefficients (see ... |
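The two measures named in this excerpt can be sketched as follows. This is a sketch under stated assumptions: the excerpt only says the divergence is "defined using" H, so summing H over all words, the natural logarithm, and the convention 0 log 0 = 0 are assumptions here, as are the function names and the use of plain dictionaries of word counts.

```python
import math


def h_term(x, y):
    """H(x, y) = x log(2x / (x + y)) + y log(2y / (x + y)), with 0 log 0 := 0."""
    total = x + y
    result = 0.0
    if x > 0:
        result += x * math.log(2 * x / total)
    if y > 0:
        result += y * math.log(2 * y / total)
    return result


def jensen_shannon(phi_x, phi_y):
    """Assumed form: sum of H over every word present in either count vector."""
    words = phi_x.keys() | phi_y.keys()
    return sum(h_term(phi_x.get(w, 0), phi_y.get(w, 0)) for w in words)


def histogram_intersection(phi_x, phi_y):
    """Sum of the element-wise minimum of two count vectors."""
    return sum(min(count, phi_y.get(w, 0)) for w, count in phi_x.items())


# Illustrative word counts, not data from the article.
phi_x = {"ab": 2, "ba": 1}
phi_y = {"ab": 1, "bb": 1}
print(jensen_shannon(phi_x, phi_y))
print(histogram_intersection(phi_x, phi_y))
```

Words with equal counts in both vectors contribute zero to the divergence, so identical embeddings yield a value of zero.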

1291 | A training algorithm for optimal margin classifiers - Boser, Guyon, et al. |

1201 | Binary codes capable of correcting deletions, insertions, and reversals
- Levenshtein
- 1966
Citation Context ...sequence lengths, at the expense of narrowing the scope of similarity measures that can be handled. For example, we do not consider super-linear comparison algorithms such as the Levenshtein distance (Levenshtein, 1966) and the all-subsequences kernel (Lodhi et al., 2002). The basis of our framework is embedding of sequences in a high-dimensional feature space using a formal language, a classical tool of computer s... |

1048 | Nonlinear component analysis as a kernel eigenvalue problem
- Schölkopf, Smola, et al.
- 1998
Citation Context ...r kernel, an inner product in a reproducing kernel Hilbert space. Following the seminal work of Boser et al. (1992), various learning methods have been re-formulated in terms of kernels, such as PCA (Schölkopf et al., 1998b), ridge regression (Cherkassky et al., 1999), ICA (Harmeling et al., 2003) and many others. Although the initial motivation for the “kernel trick” was to allow efficient computation of an inner prod... |

899 | Algorithms on Strings, Trees, and Sequences
- Gusfield
- 1997
Citation Context ...link to only one child node and as explicit otherwise. By iteratively removing implicit nodes and appending their labels to edges of explicit parent nodes one obtains a compact trie (cf. Knuth, 1973; Gusfield, 1997). Edges are labeled by subsequences encoded using indices i and j pointing to x[i..j]. The major benefit of compact tries is reduced space complexity, which decreases from O(k|x|) to O(|x|) independ... |

828 | A Vector Space Model for Automatic Indexing
- Salton, Wong, et al.
- 1975
Citation Context .../log n) for sequences of length n (Masek and Paterson, 1980). A different approach to sequence comparison originated in the field of information retrieval with the vector space or bag-of-words model (Salton et al., 1975; Salton, 1979). Textual documents are embedded into a vector space spanned by weighted frequencies of contained words. The similarity of two documents is assessed by an inner product between the corr... |

787 | Kernel Methods for Pattern Analysis
- Shawe-Taylor, Cristianini
- 2004
Citation Context ...e. Several data structures have been previously considered for specific similarity measures, such as hash tables (Damashek, 1995), sorted arrays (Sonnenburg et al., 2007), tries (Leslie et al., 2002; Shawe-Taylor and Cristianini, 2004; Rieck et al., 2006), suffix trees using matching statistics (Vishwanathan and Smola, 2004), suffix trees using recursive matching (Rieck et al., 2007) and suffix arrays (Teo and Vishwanathan, 2006).... |

710 | Cluster Analysis for Applications - Anderberg - 1973 |

644 | Suffix arrays: a new method for on-line string searches - Manber, Myers - 1990 |

548 | A space-economical suffix tree construction algorithm
- McCreight
- 1976
Citation Context ...es, even if a sequence x contains $O(|x|^2)$ words and the embedding language corresponds to $L = A^*$. There are well-known algorithms for linear-time construction of suffix trees (e.g., Weiner, 1973; McCreight, 1976; Ukkonen, 1995), so that a GST for two sequences x and y can be constructed in $O(|x| + |y|)$ using the concatenation $z = x \$_1 y \$_2$. As a GST contains at most $2|z|$ nodes, the worst-case run-time of any trav... |

426 | Linear pattern matching algorithms
- Weiner
- 1973
Citation Context ...ilarity measures, even if a sequence x contains $O(|x|^2)$ words and the embedding language corresponds to $L = A^*$. There are well-known algorithms for linear-time construction of suffix trees (e.g., Weiner, 1973; McCreight, 1976; Ukkonen, 1995), so that a GST for two sequences x and y can be constructed in $O(|x| + |y|)$ using the concatenation $z = x \$_1 y \$_2$. As a GST contains at most $2|z|$ nodes, the worst-case run... |

389 | Learning to Classify Text using Support Vector Machines
- Joachims
- 2002
Citation Context ...ashek, 1995). The idea of determining similarity of sequences by an inner product was revived in kernel-based learning in the form of bag-of-words kernels (e.g., Joachims, 1998; Drucker et al., 1999; Joachims, 2002) and various string kernels (e.g., Zien et al., 2000; Leslie et al., 2002; Vishwanathan and Smola, 2004). Moreover, research in bioinformatics and text processing advanced the capabilities of string ... |

385 | Text classification using string kernels - Lodhi, Cristianini, et al. - 2001 |

380 | Error detecting and error correcting codes - Hamming, RW - 1950 |

375 | An Introduction to Kernel-Based Learning Algorithms
- Muller, Mika, et al.
- 2001
Citation Context ...f data that can be handled. Thus, a powerful abstraction between algorithms and data representations can be established. The most prominent example of such abstraction is kernel-based learning (e.g., Müller et al., 2001; Schölkopf and Smola, 2002) in which pairwise relationships between objects are expressed by a Mercer kernel, an inner product in a reproducing kernel Hilbert space. Following the seminal work of Bos... |

334 | Numerical Taxonomy
- Sneath, Sokal
Citation Context ...d the Jensen-Shannon divergence in Table 2 is defined using $H(x, y) = x \log \frac{2x}{x+y} + y \log \frac{2y}{x+y}$. A further and rather exotic class of vectorial similarity measures are similarity coefficients (see Sokal and Sneath, 1963; Anderberg, 1973). These coefficients have been designed for comparison of binary vectors and often express non-metric properties. They are constructed using three summation variables a, b and c, whic... |

328 | On-line construction of suffix trees - Ukkonen - 1995 |

284 | N-gram-based text categorization
- Cavnar, Trenkle
- 1994
Citation Context ...oduct between the corresponding vectors. This concept was extended to k-grams (k consecutive characters or words) in the domain of natural language processing and computational linguistics (e.g., Suen, 1979; Cavnar and Trenkle, 1994; Damashek, 1995). The idea of determining similarity of sequences by an inner product was revived in kernel-based learning in the form of bag-of-words kernels (e.g., Joachims, 1998; Drucker et al., 1... |

280 | Statistical Pattern Recognition
- Webb
- 1999
Citation Context ... at hand, we can now express common vectorial similarity measures in the domain of sequences. Tables 1 and 2 list well-known kernel and distance functions (see Vapnik, 1995; Schölkopf and Smola, 2002; Webb, 2002) in terms of L. The histogram intersection kernel in Table 1 derives from computer vision (see Swain and Ballard, 1991; Odone et al., 2005) and the Jensen-Shannon divergence in Table 2 is defined usi... |

259 | Trie memory
- Fredkin
- 1960
Citation Context ...for two sequences x and y holds $|x| \log_2 |y| < |x| + |y|$. 4.2 Tries. Data structure. A trie is a tree structure for storage and retrieval of sequences. The edges of a trie are labeled with symbols of A (Fredkin, 1960; Knuth, 1973). A path from the root to a marked node x represents a stored sequence, hereafter denoted by $\bar{x}$. A trie node x contains a vector of size |A| linking to child nodes, a binary flag to indi... |
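The trie node described in this excerpt can be sketched as below. The excerpt is cut off after "a binary flag to indi...", so treating the flag as a marker for the end of a stored sequence is an assumption, and the class and method names (`TrieNode`, `Trie`, `insert`, `contains`) are hypothetical rather than taken from the article.

```python
class TrieNode:
    """Node holding a vector of |A| child links and a binary mark flag."""

    def __init__(self, alphabet_size):
        self.children = [None] * alphabet_size  # one slot per symbol of A
        self.marked = False  # assumed meaning: a stored sequence ends here


class Trie:
    def __init__(self, alphabet):
        # Map each symbol of the alphabet A to its slot in the child vector.
        self.index = {symbol: i for i, symbol in enumerate(alphabet)}
        self.root = TrieNode(len(self.index))

    def insert(self, sequence):
        """Store a sequence along a root-to-node path and mark its end."""
        node = self.root
        for symbol in sequence:
            slot = self.index[symbol]
            if node.children[slot] is None:
                node.children[slot] = TrieNode(len(self.index))
            node = node.children[slot]
        node.marked = True

    def contains(self, sequence):
        """Return True only if the path exists and ends in a marked node."""
        node = self.root
        for symbol in sequence:
            slot = self.index.get(symbol)
            if slot is None or node.children[slot] is None:
                return False
            node = node.children[slot]
        return node.marked


trie = Trie("ACGT")
trie.insert("ACG")
trie.insert("AC")
print(trie.contains("AC"), trie.contains("A"))
```

The fixed-size child vector gives constant-time descent per symbol at the cost of |A| slots per node, which is the space overhead that compact tries are meant to reduce.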

254 | Convolution kernels for natural language
- Collins, Duffy
- 2002
Citation Context ...torial domains, such as analysis of images (e.g., Schölkopf et al., 1998a; Chapelle et al., 1999), sequences (e.g., Jaakkola et al., 2000; Watkins, 2000; Zien et al., 2000) and structured data (e.g., Collins and Duffy, 2002; Gärtner et al., 2004). Although kernel-based learning has gained significant attention in recent years, a Mercer kernel is only one of many possibilities for defining pairwise relationships between ... |

240 | Support vector machines for spam categorization
- Drucker, Wu, et al.
- 1999
Citation Context ...and Trenkle, 1994; Damashek, 1995). The idea of determining similarity of sequences by an inner product was revived in kernel-based learning in the form of bag-of-words kernels (e.g., Joachims, 1998; Drucker et al., 1999; Joachims, 2002) and various string kernels (e.g., Zien et al., 2000; Leslie et al., 2002; Vishwanathan and Smola, 2004). Moreover, research in bioinformatics and text processing advanced the capabil... |

195 | Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory - McHugh |

193 | A discriminative framework for detecting remote protein homologies - Jaakkola, Diekhans - 2000 |

167 | A faster algorithm computing string edit distances
- Masek, Paterson
- 1980
Citation Context ...utational complexity: No linear-time algorithm for determining the shortest trace of operations is currently known. One of the fastest exact algorithms runs in $O(n^2 / \log n)$ for sequences of length n (Masek and Paterson, 1980). A different approach to sequence comparison originated in the field of information retrieval with the vector space or bag-of-words model (Salton et al., 1975; Salton, 1979). Textual documents are e... |

156 | Reuters-21578 text categorization test collection. http://www.daviddlewis.com/resources/testcollections/reuters21578 - Lewis - 2004 |

144 | The 1999 DARPA Off-Line Intrusion Detection Evaluation - Lippmann, Fried, et al. - 1999 |

134 | Mismatch string kernels for discriminative protein classification, Bioinformatics - Leslie, Eskin, et al. |

132 | The art of computer programming. Volume 3
- Knuth
- 1973
Citation Context ... in each field indicates the number of occurrences. Algorithm. Comparison of two sorted arrays X and Y is carried out by looping over the fields of both arrays in the manner of merging sorted arrays (Knuth, 1973). During each iteration the inner function m is computed over contained words and aggregated using the operator ⊕. The corresponding comparison procedure in pseudo-code is given in Algorithm 1. Herei... |
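The merge-style loop described in this excerpt might look roughly like the sketch below. It is not the article's Algorithm 1, which is not reproduced on this page: the field layout as `(word, count)` tuples, the names `sorted_array` and `compare`, and the use of Python's comparison sort (which is not linear-time) to build the arrays are all assumptions made for illustration.

```python
import operator
from collections import Counter


def sorted_array(sequence, k):
    """Build an array of (k-gram, count) fields sorted by k-gram.

    Assumption: sorted() is used for brevity; it is O(n log n), not linear.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return sorted(counts.items())


def compare(X, Y, m, aggregate=operator.add, initial=0):
    """Walk both sorted arrays as in a merge, applying m to each word's counts."""
    s = initial
    i = j = 0
    while i < len(X) or j < len(Y):
        if j == len(Y) or (i < len(X) and X[i][0] < Y[j][0]):
            s = aggregate(s, m(X[i][1], 0))  # word occurs only in X
            i += 1
        elif i == len(X) or Y[j][0] < X[i][0]:
            s = aggregate(s, m(0, Y[j][1]))  # word occurs only in Y
            j += 1
        else:
            s = aggregate(s, m(X[i][1], Y[j][1]))  # word occurs in both
            i += 1
            j += 1
    return s


X = sorted_array("abab", 2)
Y = sorted_array("abba", 2)
print(compare(X, Y, lambda a, b: a * b))       # inner function of a linear kernel
print(compare(X, Y, lambda a, b: abs(a - b)))  # inner function of a Manhattan distance
```

Swapping the inner function `m` and the aggregation operator changes the similarity measure without touching the traversal, and each field of either array is visited exactly once.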

122 | Dynamic alignment kernels - Watkins - 1999 |

116 | Gauging similarity with n-grams: Language-independent categorization of text
- Damashek
- 1995
Citation Context ...s one to investigate different data structures to obtain optimal efficiency in practice. Several data structures have been previously considered for specific similarity measures, such as hash tables (Damashek, 1995), sorted arrays (Sonnenburg et al., 2007), tries (Leslie et al., 2002; Shawe-Taylor and Cristianini, 2004; Rieck et al., 2006), suffix trees using matching statistics (Vishwanathan and Smola, 2004), ... |

98 | Prior knowledge in support vector kernels
- Schölkopf, Simard, et al.
- 1998
Citation Context ...r kernel, an inner product in a reproducing kernel Hilbert space. Following the seminal work of Boser et al. (1992), various learning methods have been re-formulated in terms of kernels, such as PCA (Schölkopf et al., 1998b), ridge regression (Cherkassky et al., 1999), ICA (Harmeling et al., 2003) and many others. Although the initial motivation for the “kernel trick” was to allow efficient computation of an inner prod... |

83 | SVMs for histogram-based image classification
- Chapelle, Haffner, et al.
- 1999
Citation Context ...presentation has been quickly realized (e.g., Vapnik, 1995). Consequently, kernel-based methods have been proposed for non-vectorial domains, such as analysis of images (e.g., Schölkopf et al., 1998a; Chapelle et al., 1999), sequences (e.g., Jaakkola et al., 2000; Watkins, 2000; Zien et al., 2000) and structured data (e.g., Collins and Duffy, 2002; Gärtner et al., 2004). Although kernel-based learning has gained signif... |

82 | Fast Kernels for String and Tree Matching
- Vishwanathan, Smola
- 2004
Citation Context ... as hash tables (Damashek, 1995), sorted arrays (Sonnenburg et al., 2007), tries (Leslie et al., 2002; Shawe-Taylor and Cristianini, 2004; Rieck et al., 2006), suffix trees using matching statistics (Vishwanathan and Smola, 2004), suffix trees using recursive matching (Rieck et al., 2007) and suffix arrays (Teo and Vishwanathan, 2006). All of these data structures allow one to develop linear-time algorithms for computation o... |

79 | Linear-time longest-common-prefix computation in suffix arrays and its applications - Kasai, Lee, et al. |

71 | Classification with nonmetric distances: Image retrieval and class representation
- Jacobs, Weinshall, et al.
- 2000
Citation Context ...ne of many possibilities for defining pairwise relationships between objects. Numerous applications exist for which relationships are defined as metric or non-metric distances (e.g., Anderberg, 1973; Jacobs et al., 2000; von Luxburg and Bousquet, 2004), similarity or dissimilarity measures (e.g., Graepel et al., 1999; Roth et al., 2003; Laub and Müller, 2004; Laub et al., 2006) or non-positive kernel functions (e.g.... |

61 | A new discriminative kernel from probabilistic models - Tsuda, Kawanabe, et al. |

59 | Engineering a lightweight suffix array construction algorithm - Manzini, Ferragina |

58 | Feature space interpretation of SVMs with indefinite kernels - Haasdonk |

50 | Kernels and distances for structured data - Gartner, Flach |

47 | Classification on pairwise proximity data
- Graepel, Herbrich, et al.
- 1999
Citation Context ...s exist for which relationships are defined as metric or non-metric distances (e.g., Anderberg, 1973; Jacobs et al., 2000; von Luxburg and Bousquet, 2004), similarity or dissimilarity measures (e.g., Graepel et al., 1999; Roth et al., 2003; Laub and Müller, 2004; Laub et al., 2006) or non-positive kernel functions (e.g., Ong et al., 2004; Haasdonk, 2005). It is therefore imperative to address pairwise comparison of o... |

42 | Fast string kernels using inexact matching for protein sequences - Leslie, Kuang |

42 | Learning with non-positive kernels
- Ong, Mary, et al.
- 2004
Citation Context ...von Luxburg and Bousquet, 2004), similarity or dissimilarity measures (e.g., Graepel et al., 1999; Roth et al., 2003; Laub and Müller, 2004; Laub et al., 2006) or non-positive kernel functions (e.g., Ong et al., 2004; Haasdonk, 2005). It is therefore imperative to address pairwise comparison of objects in a most general setup. The aim of this contribution is to develop a general framework for pairwise comparison ... |

42 | Optimal cluster preserving embedding of nonmetric proximity data - Roth, Laub, et al. |

41 | Building kernels from binary strings for image matching - Odone, Barla, et al. - 2005 |