Results 1 - 10 of 13
Suffix arrays on words
In Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching, volume 4580 of LNCS, 2007
Cited by 6 (2 self)
Abstract. Surprisingly enough, it is not yet known how to build directly a suffix array that indexes just the k positions at word-boundaries of a text T[1,n], taking O(n) time and O(k) space in addition to T. We propose a class-note solution to this problem that achieves such optimal time and space bounds. Word-based versions of indexes achieving the same time/space bounds were already known for suffix trees [1,2] and (compact) DAWGs [3,4]. Our solution inherits the simplicity and efficiency of suffix arrays, with respect to such other word-indexes, and thus it foresees applications in word-based approaches to data compression [5] and computational linguistics [6]. To support this, we have run a large set of experiments showing that word-based suffix arrays may be constructed twice as fast as their full-text counterparts, and with a working space as low as 20%. The space reduction of the final word-based suffix array also improves query time (i.e., fewer random-access binary-search steps), making queries faster by a factor of up to 3.
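To make the indexed object concrete, the following is a minimal Python sketch of a word-based suffix array and a binary search over it. It is a naive illustration, not the O(n)-time, O(k)-space construction the abstract proposes: it simply sorts the k word-aligned suffixes by direct comparison. The function names, and the assumption that words are separated by single spaces, are illustrative choices and do not come from the paper.

```python
from typing import List


def word_suffix_array(text: str, delimiter: str = " ") -> List[int]:
    """Start positions of word-aligned suffixes, in lexicographic order.

    Naive sketch: sorts by full suffix comparison, so it does NOT meet
    the O(n) time / O(k) space bounds discussed in the abstract.
    Assumes words are separated by single delimiter characters.
    """
    starts = [0] if text else []
    starts += [i + 1 for i, c in enumerate(text)
               if c == delimiter and i + 1 < len(text)]
    return sorted(starts, key=lambda i: text[i:])


def find_word_aligned(text: str, sa: List[int], pattern: str) -> List[int]:
    """Positions where `pattern` occurs starting at a word boundary."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "to be or not to be"
sa = word_suffix_array(text)
print(sa)                                   # [16, 3, 9, 6, 13, 0]
print(find_word_aligned(text, sa, "to be"))  # [0, 13]
```

Because only k entries are stored, each query performs binary search over k rather than n suffixes, which is the source of the query-time savings the abstract reports.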
Linear-time off-line text compression by longest-first substitution
In Proc. 10th International Symp. on String Processing and Information Retrieval (SPIRE’03), 2003
Cited by 3 (3 self)
Abstract. Given a text, grammar-based compression is to construct a grammar that generates the text. There are many kinds of text compression techniques of this type. Each compression scheme is categorized as being either offline or online, according to how a text is processed. One representative tactic for offline compression is to substitute the longest repeated factors of a text with a production rule. In this paper, we present an algorithm that compresses a text based on this longest-first principle, in linear time. The algorithm employs a suitable index structure for a text, and involves technically efficient operations on the structure.
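The longest-first principle can be illustrated with a small Python sketch. This is a brute-force version under simplifying assumptions, not the linear-time algorithm of the paper: it repeatedly finds the longest factor with two non-overlapping occurrences by exhaustive search, replaces its occurrences with a fresh nonterminal, and stops when no factor of length at least 2 repeats. It does not look for repeats inside rule right-hand sides. All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple, Union

Symbol = Union[str, int]  # str = terminal character, int = nonterminal id


def longest_repeated_factor(seq: List[Symbol]) -> Optional[Tuple[Symbol, ...]]:
    """Longest factor (length >= 2) with two non-overlapping occurrences."""
    n = len(seq)
    for length in range(n // 2, 1, -1):
        first_seen: Dict[Tuple[Symbol, ...], int] = {}
        for i in range(n - length + 1):
            key = tuple(seq[i:i + length])
            if key not in first_seen:
                first_seen[key] = i
            elif i - first_seen[key] >= length:
                return key
    return None


def compress(text: str) -> Tuple[List[Symbol], Dict[int, Tuple[Symbol, ...]]]:
    """Brute-force longest-first substitution; returns (start sequence, rules)."""
    seq: List[Symbol] = list(text)
    rules: Dict[int, Tuple[Symbol, ...]] = {}
    while True:
        factor = longest_repeated_factor(seq)
        if factor is None:
            return seq, rules
        nonterminal = len(rules)
        rules[nonterminal] = factor
        out: List[Symbol] = []
        i = 0
        while i < len(seq):
            # Greedy left-to-right replacement of non-overlapping occurrences.
            if tuple(seq[i:i + len(factor)]) == factor:
                out.append(nonterminal)
                i += len(factor)
            else:
                out.append(seq[i])
                i += 1
        seq = out


def expand(seq, rules: Dict[int, Tuple[Symbol, ...]]) -> str:
    """Invert `compress` by expanding nonterminals recursively."""
    return "".join(expand(rules[s], rules) if isinstance(s, int) else s
                   for s in seq)


seq, rules = compress("abababab")
print(seq, rules)  # [0, 0] {0: ('a', 'b', 'a', 'b')}
```

The exhaustive search costs far more than linear time; the index structure mentioned in the abstract is what removes that cost.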
Sparse compact directed acyclic word graphs
In: Stringology, 2006
Cited by 3 (0 self)
Abstract. The suffix tree of string w represents all suffixes of w, and thus it supports full indexing of w for exact pattern matching. On the other hand, a sparse suffix tree of w represents only a subset of the suffixes of w, and therefore it supports sparse indexing of w. There has been a wide range of applications of sparse suffix trees, e.g., natural language processing and biological sequence analysis. Word suffix trees are a variant of sparse suffix trees that are defined for strings that contain a special word delimiter #. Namely, the word suffix tree of string w = w1 w2 ··· wk, consisting of k words each ending with #, represents only the k suffixes of w of the form wi ··· wk. Recently, we presented an algorithm which builds word suffix trees in O(n) time with O(k) space, where n is the length of w. In addition, we proposed sparse directed acyclic word graphs (SDAWGs) and an online algorithm for constructing them, working in O(n) time and space. As a further achievement of this research direction, this paper introduces yet another text indexing structure named sparse compact directed acyclic word graphs (SCDAWGs). We show that the size of SCDAWGs is smaller than that of word suffix trees and SDAWGs, and present an SCDAWG construction algorithm that works in O(n) time with O(k) space and in an online manner.
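The definition of the k word-aligned suffixes can be made concrete with a short Python sketch. It enumerates the suffixes wi ··· wk of a #-delimited string and stores them in a plain, uncompacted trie built from nested dictionaries. This is only an illustration of what a word suffix tree represents; it is neither compacted nor built within the O(n) time and O(k) space bounds quoted above, and the function names are hypothetical.

```python
from typing import Dict, List


def word_suffixes(w: str, delim: str = "#") -> List[str]:
    """The k suffixes of w that start at the beginning of a word.

    Assumes w is a concatenation of k words, each ending with `delim`.
    """
    if not w.endswith(delim):
        raise ValueError("every word, including the last, must end with the delimiter")
    starts = [0] + [i + 1 for i, c in enumerate(w[:-1]) if c == delim]
    return [w[s:] for s in starts]


def build_trie(strings: List[str]) -> Dict:
    """Uncompacted trie as nested dicts (one node per character)."""
    root: Dict = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def occurs_at_word_start(trie: Dict, pattern: str) -> bool:
    """True if `pattern` is a prefix of some word-aligned suffix."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


w = "ab#a#ab#"
suffixes = word_suffixes(w)
print(suffixes)  # ['ab#a#ab#', 'a#ab#', 'ab#']
trie = build_trie(suffixes)
print(occurs_at_word_start(trie, "a#ab"))  # True
print(occurs_at_word_start(trie, "b#"))    # False: "b#" only occurs mid-word
```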
Bidirectional construction of suffix trees
Nordic Journal of Computing, 2002
Cited by 2 (0 self)
Abstract. String matching is critical in information retrieval since in many cases information is stored and manipulated as strings. Constructing and utilizing suitable data structures for text strings, we can solve the string matching problem efficiently. Such structures are called index structures. The suffix tree is certainly the most widely-known and extensively-studied structure of this kind. In this paper, we present a linear-time algorithm for bidirectional construction of suffix trees.
On-Line Construction of Symmetric Compact Directed Acyclic Word Graphs
In Proc. of 8th International Symposium on String Processing and Information Retrieval (SPIRE’01), 2001
Cited by 2 (1 self)
The Compact Directed Acyclic Word Graph (CDAWG) is a space-efficient data structure that supports indices of a string. The Symmetric Compact Directed Acyclic Word Graph (SCDAWG) for a string w is a dual structure that supports indices of both w and the reverse of w simultaneously. Blumer et al. gave the first algorithm to construct an SCDAWG from a given string, which works in an offline manner. In this paper, we show an online algorithm that constructs an SCDAWG from a given string directly.
Space-economical construction of index structures for all suffixes of a string
Proc. 27th International Symposium on Mathematical Foundations of Computer Science (MFCS’02), Lecture Notes in Computer Science, 2002
Cited by 2 (2 self)
The minimum all-suffixes directed acyclic word graph (MASDAWG) of a string w has
Efficient Computation of Substring Equivalence Classes with Suffix Arrays
Cited by 2 (1 self)
Abstract. This paper considers enumeration of substring equivalence classes introduced by Blumer et al. [1]. They used the equivalence classes to define an index structure called compact directed acyclic word graphs (CDAWGs). In text analysis, considering these equivalence classes is useful since they group together redundant substrings with essentially identical occurrences. In this paper, we present how to enumerate those equivalence classes using suffix arrays. Our algorithm uses rank and lcp arrays for traversing the corresponding suffix trees, but does not need any other additional data structure. The algorithm runs in time linear in the length of the input string. We show experimental results comparing the running time and space consumption of our algorithm with those of suffix-tree-based and CDAWG-based approaches.
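The rank and lcp arrays that the abstract relies on can be sketched in Python as follows. This shows only the building blocks, not the paper's enumeration of equivalence classes: the suffix array is built by naive sorting (an assumption made for brevity, not a linear-time construction), and the lcp array is computed with the well-known linear-time method of Kasai et al. The function names are illustrative.

```python
from typing import List, Tuple


def suffix_array(s: str) -> List[int]:
    """Suffix start positions in lexicographic order (naive sorting)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def rank_and_lcp(s: str, sa: List[int]) -> Tuple[List[int], List[int]]:
    """rank[i] = position of suffix i in sa;
    lcp[r] = longest common prefix of suffixes sa[r-1] and sa[r] (lcp[0] = 0).
    """
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0  # current match length, reused across iterations
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters
    return rank, lcp


s = "banana"
sa = suffix_array(s)
rank, lcp = rank_and_lcp(s, sa)
print(sa)    # [5, 3, 1, 0, 4, 2]
print(rank)  # [3, 2, 5, 1, 4, 0]
print(lcp)   # [0, 1, 3, 0, 0, 2]
```

Intervals of the suffix array bounded by lcp values correspond to internal nodes of the suffix tree, which is why these two arrays suffice to simulate a suffix tree traversal.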
On the Suffix Automaton with mismatches
Cited by 2 (1 self)
Abstract. In this paper we focus on the construction of the minimal deterministic finite automaton S_k that recognizes the set of suffixes of a word w up to k errors. We present an algorithm that makes use of the automaton S_k in order to accept in an efficient way the language of all suffixes of w up to k errors in every window of size r, where r is the value of the repetition index of w. Moreover, we give some experimental results on some well-known words, like prefixes of Fibonacci and Thue-Morse words, and we make a conjecture on the size of the suffix automaton with mismatches.
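The language recognized by such an automaton can be specified directly with a membership test, sketched below in Python. The sketch assumes that "errors" means mismatches (substitutions), as the title suggests, and it checks membership by comparing against the actual suffix of w rather than by building any automaton. It does not model the window of size r mentioned in the abstract. The function name is hypothetical.

```python
def is_suffix_up_to_k_mismatches(u: str, w: str, k: int) -> bool:
    """True if u differs from the length-|u| suffix of w in at most k positions.

    Direct O(|u|) check of the definition; the automaton in the abstract
    recognizes the same language without access to w at query time.
    """
    if len(u) > len(w):
        return False
    tail = w[len(w) - len(u):]
    mismatches = sum(1 for a, b in zip(u, tail) if a != b)
    return mismatches <= k


w = "abaab"
print(is_suffix_up_to_k_mismatches("aab", w, 0))  # True: exact suffix
print(is_suffix_up_to_k_mismatches("bab", w, 1))  # True: one mismatch
print(is_suffix_up_to_k_mismatches("bab", w, 0))  # False
```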
The minimum DAWG for all suffixes of a string and its applications
In Proc. 13th Annual Symposium on Combinatorial Pattern Matching (CPM’02), volume 2373 of Lecture Notes in Computer Science, 2002
Cited by 1 (1 self)
Abstract. For a string w over an alphabet Σ, we consider a composite data structure called the all-suffixes directed acyclic word graph (ASDAWG). ASDAWG(w) has |w| + 1 initial nodes, and the dag induced by all reachable nodes from the k-th initial node conforms with DAWG(w[k:]), where w[k:] denotes the k-th suffix of w. We prove that the size of the minimum ASDAWG(w) (MASDAWG(w)) is Θ(|w|) for |Σ| = 1, and is Θ(|w|^2) for |Σ| ≥ 2. Moreover, we introduce an online algorithm which directly constructs MASDAWG(w) for given w, whose running time is linear with respect to its size. We also demonstrate some application problems, beginning-sensitive pattern matching, region-sensitive pattern matching, and VLDC-pattern matching, for which ASDAWGs are useful.
General Suffix Automaton Construction Algorithm and Space Bounds
Suffix automata and factor automata are efficient data structures for representing the full index of a set of strings. They are minimal deterministic automata representing the set of all suffixes or substrings of a set of strings. This paper presents a novel analysis of the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings U has at most 2Q − 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U. This bound significantly improves over 2‖U‖ − 1, the bound given by Blumer et al. (1987), where ‖U‖ is the sum of the lengths of all strings in U. More generally, we give novel and general bounds for the size of the suffix or factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. We also describe in detail a linear-time algorithm for constructing the suffix automaton S or factor automaton F of U in time O(|S|). Our algorithm applies in fact to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties in string-matching. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
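For reference, the standard on-line construction for a single input string, which the abstract says its algorithm generalizes, can be sketched in Python as below. This is the textbook single-string version, not the paper's algorithm for sets of strings or for automata; the class and method names are illustrative.

```python
from typing import Dict, List, Set


class SuffixAutomaton:
    """Textbook on-line suffix automaton of a single string."""

    def __init__(self, s: str = "") -> None:
        self.next: List[Dict[str, int]] = [{}]  # transitions per state
        self.link: List[int] = [-1]             # suffix links
        self.length: List[int] = [0]            # longest string reaching each state
        self.last = 0                           # state of the whole current string
        for ch in s:
            self.extend(ch)

    def _new_state(self, length: int, trans: Dict[str, int], link: int) -> int:
        self.next.append(dict(trans))
        self.length.append(length)
        self.link.append(link)
        return len(self.next) - 1

    def extend(self, ch: str) -> None:
        """Append one character to the indexed string."""
        cur = self._new_state(self.length[self.last] + 1, {}, -1)
        p = self.last
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][ch]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps the shorter strings of q's class.
                clone = self._new_state(self.length[p] + 1, self.next[q], self.link[q])
                while p != -1 and self.next[p].get(ch) == q:
                    self.next[p][ch] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def _walk(self, pattern: str) -> int:
        state = 0
        for ch in pattern:
            if ch not in self.next[state]:
                return -1
            state = self.next[state][ch]
        return state

    def is_factor(self, pattern: str) -> bool:
        return self._walk(pattern) != -1

    def is_suffix(self, pattern: str) -> bool:
        terminals: Set[int] = set()
        p = self.last
        while p != -1:  # terminal states lie on the suffix-link path from last
            terminals.add(p)
            p = self.link[p]
        return self._walk(pattern) in terminals


sa = SuffixAutomaton("abcbc")
print(sa.is_factor("bcb"), sa.is_factor("cc"))   # True False
print(sa.is_suffix("cbc"), sa.is_suffix("ab"))   # True False
```

For a single string of length n ≥ 2 this construction creates at most 2n − 1 states, which matches the form of the 2‖U‖ − 1 bound cited in the abstract.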