Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract)
 in Proceedings of the 32nd Annual ACM Symposium on the Theory of Computing
, 2000
Abstract.

The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg_{|Σ|} n), which is significant when Σ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in O(m lg |Σ|) time or in O(m + lg n) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg_{|Σ|} n + lg^ε_{|Σ|} n) search time in the worst case, for any constant
Lempel-Ziv parsing and sublinear-size index structures for string matching (Extended Abstract)
 Proc. 3rd South American Workshop on String Processing (WSP'96)
, 1996
Abstract

String matching over a long text can be significantly sped up with an index structure formed by preprocessing the text. For very long texts, the size of such an index can be a problem. This paper presents the first sublinear-size index structure. The new structure is based on Lempel-Ziv parsing of the text and has size linear in N, the size of the Lempel-Ziv parse. For a text of length n, N = O(n / log n) and can be still smaller if the text is compressible. With the new index structure, all occurrences of a pattern string of length m can be found in time O(m 2
Efficient Searches for Similar Subsequences of Different Lengths in Sequence Databases
 In ICDE
, 2000
Abstract

We propose an indexing technique for fast retrieval of similar subsequences using time warping distances. A time warping distance is a more suitable similarity measure than the Euclidean distance in many applications, where sequences may be of different lengths or different sampling rates. Our indexing technique uses a disk-based suffix tree as an index structure and employs lower-bound distance functions to filter out dissimilar subsequences without false dismissals. To make the index structure compact and thus accelerate the query processing, we convert sequences of continuous values to sequences of discrete values via a categorization method and store only a subset of suffixes whose first values are different from their preceding values. The experimental results reveal that our proposed technique can be a few orders of magnitude faster than sequential scanning.
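As a rough illustration of the similarity measure mentioned above, the following is a minimal sketch of the textbook dynamic-programming time warping distance. It assumes the absolute difference as the per-element cost, which the abstract does not specify, and the function name `time_warping_distance` is hypothetical. It does not reproduce the paper's suffix-tree index or its lower-bound filters.

```python
import math


def time_warping_distance(a, b):
    """Textbook time warping distance between two numeric sequences.

    Assumption: the per-element cost is |x - y|; the paper may use a
    different base distance.
    """
    n, m = len(a), len(b)
    # dp[i][j] holds the cheapest warping cost aligning a[:i] with b[:j].
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Either sequence may "stall" on an element, or both advance.
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]
```

Because an element may be matched against several consecutive elements of the other sequence, the two inputs need not have the same length, which is the property the abstract contrasts with the Euclidean distance.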
Finding Optimal Pairs of Cooperative and Competing Patterns with Bounded Distance
 In Proc. 7th International Conference on Discovery Science (DS’04)
, 2004
Abstract

We consider the problem of discovering the optimal pair of substring patterns with bounded distance α, from a given set S of strings.
Linear-time offline text compression by longest-first substitution
 in Proc. 10th International Symp. on String Processing and Information Retrieval (SPIRE’03)
, 2003
Abstract.

Given a text, grammar-based compression is to construct a grammar that generates the text. There are many kinds of text compression techniques of this type. Each compression scheme is categorized as being either offline or online, according to how a text is processed. One representative tactic for offline compression is to substitute the longest repeated factors of a text with a production rule. In this paper, we present an algorithm that compresses a text based on this longest-first principle, in linear time. The algorithm employs a suitable index structure for a text, and involves technically efficient operations on the structure.
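To make the notion of a "longest repeated factor" concrete, here is a minimal sketch that finds one by sorting suffixes and comparing neighbours. This is a naive, superlinear illustration and not the paper's linear-time algorithm. As a simplifying assumption it allows the two occurrences to overlap, whereas a substitution scheme would need to handle overlaps separately. The function name `longest_repeated_factor` is hypothetical.

```python
def longest_repeated_factor(text):
    """Return a longest substring occurring at least twice in text.

    Naive sketch: occurrences may overlap, and running time is far from
    linear because suffixes are sorted by direct comparison.
    """
    # Start positions of all suffixes, in lexicographic order of the suffixes.
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    # Any repeated factor is a common prefix of two adjacent sorted suffixes.
    for prev, cur in zip(suffixes, suffixes[1:]):
        k = 0
        while (prev + k < len(text) and cur + k < len(text)
               and text[prev + k] == text[cur + k]):
            k += 1
        if k > len(best):
            best = text[cur:cur + k]
    return best
```

A longest-first scheme would replace occurrences of such a factor with a fresh nonterminal and repeat on the remaining text; that substitution step is omitted here.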
Sparse Directed Acyclic Word Graphs
 in Proc. 13th International Symp. on String Processing and Information Retrieval (SPIRE’06), Lecture Notes in Computer Science
, 2006
Abstract.

The suffix tree of string w is a text indexing structure that represents all suffixes of w. A sparse suffix tree of w represents only a subset of suffixes of w. An application of sparse suffix trees is composite pattern discovery from biological sequences. In this paper, we introduce a new data structure named sparse directed acyclic word graphs (SDAWGs), which are a sparse text indexing version of directed acyclic word graphs (DAWGs) of Blumer et al. We show that the size of SDAWGs is linear in the length of w, and present an online linear-time construction algorithm for SDAWGs.
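The idea of indexing only a subset of suffixes can be sketched with a sparse suffix array, a simpler relative of the sparse suffix tree. This is not an SDAWG and not the paper's construction. The helper names `sparse_suffix_array` and `find_in_sparse` and the choice of indexed positions are assumptions made for illustration.

```python
def sparse_suffix_array(w, starts):
    """Sort only the chosen start positions by the suffixes they begin."""
    return sorted(starts, key=lambda i: w[i:])


def find_in_sparse(w, ssa, p):
    """Return the indexed start positions at which pattern p occurs in w."""
    m = len(p)
    lo, hi = 0, len(ssa)
    # Binary search for the first indexed suffix whose m-prefix is >= p.
    while lo < hi:
        mid = (lo + hi) // 2
        if w[ssa[mid]:ssa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Matching suffixes are contiguous in the sorted order.
    while lo < len(ssa) and w[ssa[lo]:ssa[lo] + m] == p:
        hits.append(ssa[lo])
        lo += 1
    return sorted(hits)
```

Only occurrences that start at an indexed position are reported. With the even positions of "abracadabra" indexed, for example, the occurrence of "bra" at position 8 is found while the one at position 1 is not.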
