Linear-Time Computation of Similarity Measures for Sequential Data
, 2008
Cited by 38 (24 self)
Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms, enabling peak performances of up to 10^6 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA.
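The sorted-array realization mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`kgram_counts`, `compare`, `linear_kernel`, `manhattan_distance`) are hypothetical, and the embedding here uses a comparison sort, so only the merge pass is linear in the array sizes.

```python
from collections import Counter

def kgram_counts(seq, k):
    """Embed a sequence as a sorted array of (k-gram, count) pairs."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sorted(counts.items())

def compare(a, b, match, mismatch):
    """Merge two sorted (k-gram, count) arrays in one pass.

    match(p, q) scores a k-gram present in both arrays;
    mismatch(p) scores a k-gram present in only one of them.
    """
    i = j = 0
    total = 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            total += match(a[i][1], b[j][1])
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            total += mismatch(a[i][1])
            i += 1
        else:
            total += mismatch(b[j][1])
            j += 1
    # Whatever remains in either array has no counterpart in the other.
    total += sum(mismatch(c) for _, c in a[i:])
    total += sum(mismatch(c) for _, c in b[j:])
    return total

def linear_kernel(x, y, k):
    """Inner product of the two k-gram count vectors."""
    return compare(kgram_counts(x, k), kgram_counts(y, k),
                   match=lambda p, q: p * q, mismatch=lambda p: 0)

def manhattan_distance(x, y, k):
    """L1 distance between the two k-gram count vectors."""
    return compare(kgram_counts(x, k), kgram_counts(y, k),
                   match=lambda p, q: abs(p - q), mismatch=lambda p: p)
```

Passing the `match` and `mismatch` functions as parameters is one way to let a single merge loop serve both a kernel and a distance, which is the kind of generality the abstract describes.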
Fast lightweight suffix array construction and checking
 14th Annual Symposium on Combinatorial Pattern Matching
, 2003
Cited by 35 (5 self)
We describe an algorithm that, for any v ∈ [2, n], constructs the suffix array of a string of length n in O(vn + n log n) time using O(v + n/√v) space in addition to the input (the string) and the output (the suffix array). By setting v = log n, we obtain an O(n log n) time algorithm using O(n/√ ...
Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE
 Proc. CPM, volume 4009 of LNCS
, 2006
Cited by 34 (9 self)
The Range-Minimum-Query-Problem is to preprocess an array such that the position of the minimum element between two specified indices can be obtained efficiently. We present a direct algorithm for the general RMQ-problem with linear preprocessing time and constant query time, without making use of any dynamic data structure. It consumes less than half of the space that is needed by the method by Berkman and Vishkin. We use our new algorithm for RMQ to improve on LCA-computation for binary trees, and further give a constant-time LCE-algorithm solely based on arrays. Both LCA and LCE have important applications, e.g., in computational biology. Experimental studies show that our new method is almost twice as fast in practice as previous approaches, and asymptotically slower variants of the constant-time algorithms perform even better for today’s common problem sizes.
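To make the RMQ problem statement concrete, here is a minimal sketch of a sparse table, a simpler and well-known technique that answers queries in constant time but needs O(n log n) preprocessing time and space. It is not the linear-preprocessing algorithm of this paper; the function names are hypothetical.

```python
def build_sparse_table(values):
    """table[j][i] is the position of the minimum of values[i : i + 2**j]."""
    n = len(values)
    table = [list(range(n))]  # windows of length 1
    j = 1
    while (1 << j) <= n:
        prev = table[j - 1]
        half = 1 << (j - 1)
        level = []
        for i in range(n - (1 << j) + 1):
            left, right = prev[i], prev[i + half]
            # Ties go to the left, so the leftmost minimum is reported.
            level.append(left if values[left] <= values[right] else right)
        table.append(level)
        j += 1
    return table

def rmq(values, table, lo, hi):
    """Position of the minimum of values[lo..hi], both bounds inclusive."""
    j = (hi - lo + 1).bit_length() - 1  # largest power of two fitting in the range
    left = table[j][lo]
    right = table[j][hi - (1 << j) + 1]
    return left if values[left] <= values[right] else right
```

A query covers the range with two overlapping windows of length 2^j, which is why no loop is needed at query time.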
Faster Entropy-Bounded Compressed Suffix Trees
, 2009
Cited by 31 (15 self)
Suffix trees are among the most important data structures in stringology, with a number of applications in flourishing areas like bioinformatics. Their main problem is space usage, which has triggered much research striving for compressed representations that are still functional. A smaller suffix tree representation could fit in a faster memory, outweighing by far the theoretical slowdown brought by the space reduction. We present a novel compressed suffix tree, which is the first achieving at the same time sublogarithmic complexity for the operations, and space usage that asymptotically goes to zero as the entropy of the text does. The main ideas in our development are compressing the longest common prefix information, totally getting rid of the suffix tree topology, and expressing all the suffix tree operations using range minimum queries and a novel primitive called next/previous smaller value in a sequence. Our solutions to those operations are of independent interest.
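The next/previous smaller value primitive named in this abstract can be illustrated on a plain array. The sketch below precomputes all answers with a stack in linear time; it only shows what the primitive returns and is not the compressed query structure of the paper. The function names and the sentinel values (-1 and the array length) are assumptions made for this example.

```python
def previous_smaller(values):
    """result[i] is the largest j < i with values[j] < values[i], or -1."""
    result = [-1] * len(values)
    stack = []  # indices whose values are strictly increasing
    for i, v in enumerate(values):
        while stack and values[stack[-1]] >= v:
            stack.pop()
        if stack:
            result[i] = stack[-1]
        stack.append(i)
    return result

def next_smaller(values):
    """result[i] is the smallest j > i with values[j] < values[i], or len(values)."""
    n = len(values)
    result = [n] * n
    stack = []
    for i in range(n - 1, -1, -1):
        while stack and values[stack[-1]] >= values[i]:
            stack.pop()
        if stack:
            result[i] = stack[-1]
        stack.append(i)
    return result
```

Each index is pushed and popped at most once, so both passes run in time linear in the array length.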
Picky: oligo microarray design for large genomes
 Bioinformatics
, 2004
Cited by 28 (0 self)
Motivation: Many large genomes are getting sequenced nowadays. Biologists are eager to start microarray analysis taking advantage of all known genes of a species, but existing microarray design tools were very inefficient for large genomes. Also, many existing tools operate in a batch mode that does not assure best designs. Results: PICKY is an efficient oligo microarray design tool for large genomes. PICKY integrates novel computer science techniques and the best known nearest-neighbor parameters to quickly identify sequence similarities and estimate their hybridization properties. Oligos designed by PICKY are computationally optimized to guarantee the best specificity, sensitivity and uniformity under the given design constraints. PICKY can be used to design arrays for whole genomes, or for only a subset of genes. The latter can still be screened against a whole genome to attain the same quality as a whole genome array, thereby permitting low budget, pathway-specific experiments to be conducted with large genomes. PICKY is the fastest oligo array design tool currently available to the public, requiring only a few hours to process large gene sets from rice, maize or human. Availability: PICKY is independent of any external software to execute, is designed for non-programmers to easily operate through a graphical user interface, and is made available for all major computing platforms (e.g., Mac, Windows and Linux) at
HiTEC: accurate error correction in high-throughput sequencing data
, 2010
Cited by 20 (0 self)
Motivation: High-throughput sequencing technologies produce very large amounts of data and sequencing errors constitute one of the major problems in analyzing such data. Current algorithms for correcting these errors are not very accurate and do not automatically adapt to the given data. Results: We present HiTEC, an algorithm which provides a highly accurate, robust, and fully automated method to correct reads produced by high-throughput sequencing methods. Our approach provides significantly higher accuracy than previous methods. It is time and space efficient and works very well for all read lengths, genome sizes, and coverage levels. Availability: The source code of HiTEC is freely available at www.csd.uwo.ca/~ilie/HiTEC/
Self-Indexed Grammar-Based Compression
, 2001
Cited by 19 (7 self)
Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable to fully exploit the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammars. If an SLP of n rules represents a text T[1, u], then an SLP-compressed representation of T requires 2n log₂ n bits. For that same SLP, our self-index takes O(n log n) + n log₂ u bits. It extracts any text substring of length m in time O((m + h) log n), and finds occ occurrences of a pattern string of length m in time O((m(m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log₂ n(1 + o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our self-index to grammar
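Substring extraction from an SLP can be sketched on the "plain array of rules" that the abstract uses as its baseline. The code below is only an illustration under that assumption, not the compressed self-index of the paper: each rule is either a single character or a pair of earlier rule indices, and the names `expansion_lengths` and `extract` are hypothetical. With stored expansion lengths, extracting m characters visits O(m + h) rules, where h is the height of the parse tree.

```python
def expansion_lengths(rules):
    """rules[i] is a one-character string or a pair (left, right) with left, right < i."""
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths

def extract(rules, lengths, symbol, start, m):
    """Return m characters from 0-based offset start in the expansion of symbol."""
    out = []
    stack = [(symbol, start, m)]
    while stack:
        sym, s, count = stack.pop()
        if count <= 0:
            continue
        rule = rules[sym]
        if isinstance(rule, str):
            out.append(rule)
            continue
        left, right = rule
        left_len = lengths[left]
        if s + count <= left_len:      # range lies entirely in the left child
            stack.append((left, s, count))
        elif s >= left_len:            # range lies entirely in the right child
            stack.append((right, s - left_len, count))
        else:                          # range straddles both children
            stack.append((right, 0, s + count - left_len))
            stack.append((left, s, left_len - s))  # popped first, so output stays in order
    return "".join(out)
```

For example, the rules `["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)]` generate the string `abaababa` from the last rule, and any substring can be read without expanding the whole text.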
Permuted longest-common-prefix array
 In Proc. 20th CPM, LNCS 5577
, 2009
Cited by 17 (2 self)
The longest-common-prefix (LCP) array is an adjunct to the suffix array that allows many string processing problems to be solved in optimal time and space. Its construction is a bottleneck in practice, taking almost as long as suffix array construction. In this paper, we describe algorithms for constructing the permuted LCP (PLCP) array in which the values appear in position order rather than lexicographical order. Using the PLCP array, we can either construct or simulate the LCP array. We obtain a family of algorithms including the fastest known LCP construction algorithm and some extremely space efficient algorithms. We also prove a new combinatorial property of the LCP values.
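The relationship between the PLCP and LCP arrays can be sketched as follows. This is a minimal illustration, assuming a naively sorted suffix array and hypothetical function names; it is not presented as any specific algorithm from the paper's family. It relies on the standard fact that, in position order, each value is at least the previous value minus one, so the comparison can resume where the last one stopped.

```python
def suffix_array(text):
    """Naive suffix array by sorting suffixes; suitable for small examples only."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def plcp_array(text, sa):
    """plcp[i] is the LCP of suffix i with its lexicographic predecessor."""
    n = len(text)
    phi = [-1] * n  # phi[i]: start of the suffix that precedes suffix i in sa
    for rank in range(1, n):
        phi[sa[rank]] = sa[rank - 1]
    plcp = [0] * n
    l = 0
    for i in range(n):
        j = phi[i]
        if j < 0:
            l = 0  # the smallest suffix has no predecessor
        else:
            while i + l < n and j + l < n and text[i + l] == text[j + l]:
                l += 1
        plcp[i] = l
        l = max(l - 1, 0)  # the next value is at least l - 1
    return plcp

def lcp_from_plcp(plcp, sa):
    """Permute PLCP values from position order into lexicographic order."""
    return [plcp[pos] for pos in sa]
```

For the text `banana` the suffix array is `[5, 3, 1, 0, 4, 2]`, the PLCP array is `[0, 3, 2, 1, 0, 0]`, and permuting it by the suffix array gives the LCP array `[0, 1, 3, 0, 0, 2]`.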
Practical methods for constructing suffix trees
, 2005
Cited by 16 (1 self)
Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not