Results 1 - 10 of 20
A Theoretical and Experimental Study on the Construction of Suffix Arrays in External Memory
"... ..."
Indexing Compressed Text
- Proceedings of the 4th South American Workshop on String Processing
, 1997
"... We present a technique to build an index based on suffix arrays for compressed texts. We also propose a compression scheme for textual databases based on words that generates a compression code that preserves the lexicographical ordering of the text words. As a consequence it permits the sorting of ..."
Cited by 25 (9 self)
We present a technique to build an index based on suffix arrays for compressed texts. We also propose a compression scheme for textual databases, based on words, that generates a compression code preserving the lexicographical ordering of the text words. As a consequence, the compressed strings can be sorted to generate the suffix array without decompressing. Because the compressed text is under 30% of the size of the original text, we are able to build the suffix array twice as fast on the compressed text. The compressed text plus index is 55-60% of the size of the original text plus index, and search times are approximately halved. We also present analytical and experimental results for different variations of the word-oriented compression paradigm.
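As a rough illustration of the order-preserving idea (a minimal sketch, not the scheme from the paper; the fixed-width word ranks and the sample text are illustrative only): if each word is replaced by a code whose byte order matches the word order, comparing coded suffixes gives the same result as comparing word suffixes, so the suffix array can be built on the compressed text directly.

```python
# Minimal sketch of an order-preserving word code (illustrative, not the paper's scheme).

def build_codes(text):
    """Map each distinct word to a 2-byte code that preserves lexicographic word order."""
    words = sorted(set(text.split()))
    return {w: i.to_bytes(2, "big") for i, w in enumerate(words)}

def compress(text, codes):
    """Replace every word by its order-preserving code."""
    return b"".join(codes[w] for w in text.split())

text = "to be or not to be"
codes = build_codes(text)
coded = compress(text, codes)

# Suffixes of the coded text taken at word (2-byte) boundaries sort in the same
# order as the word suffixes of the original text, so no decompression is needed.
suffix_array = sorted(range(0, len(coded), 2), key=lambda i: coded[i:])
print(suffix_array)  # byte offsets, in the same order as the sorted word suffixes
```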
A Lempel-Ziv text index on secondary storage
- IN PROC. CPM, LNCS 4580
, 2007
"... Full-text searching consists in locating the occurrences of a given pattern P[1..m] in a text T[1..u], both sequences over an alphabet of size σ. In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring 8uHk +o(u lo ..."
Cited by 11 (6 self)
Full-text searching consists in locating the occurrences of a given pattern P[1..m] in a text T[1..u], both sequences over an alphabet of size σ. In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring 8uH_k + o(u log σ) bits of space, where H_k denotes the k-th order empirical entropy of T, for any k = o(log_σ u). Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4–2.3 times the text size including the text, which means 39%–65% of the size of traditional indexes like String B-trees [Ferragina and Grossi, JACM 1999]. In exchange, our index requires more disk accesses to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to count pattern occurrences, the space can be reduced to about 1.04–1.68 times the text size, requiring about 20–60 disk accesses, depending on the pattern length.
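For readers unfamiliar with the H_k term in the space bound, the following sketch (not from the paper; the sample string is illustrative) computes the k-th order empirical entropy directly from its definition: for each length-k context w, take the zeroth-order entropy of the characters that follow w, and average weighted by context frequency.

```python
# Minimal sketch of k-th order empirical entropy H_k (bits per symbol).
from collections import Counter, defaultdict
from math import log2

def h0(counts, n):
    """Zeroth-order empirical entropy of a multiset of symbols."""
    return sum((c / n) * log2(n / c) for c in counts.values()) if n else 0.0

def hk(T, k):
    """k-th order empirical entropy of string T."""
    followers = defaultdict(Counter)
    for i in range(len(T) - k):
        followers[T[i:i + k]][T[i + k]] += 1   # character following each context
    total = sum(sum(c.values()) * h0(c, sum(c.values())) for c in followers.values())
    return total / len(T)

T = "abracadabra"
print(hk(T, 0), hk(T, 1), hk(T, 2))  # entropy drops as longer contexts are used
```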
Improving Suffix Array Locality for Fast Pattern Matching on Disk
, 2008
"... The suffix tree (or equivalently, the enhanced suffix array) provides efficient solutions to many problems involving pattern matching and pattern discovery in large strings, such as those arising in computational biology. Here we address the problem of arranging a suffix array on disk so that queryi ..."
Cited by 11 (2 self)
The suffix tree (or equivalently, the enhanced suffix array) provides efficient solutions to many problems involving pattern matching and pattern discovery in large strings, such as those arising in computational biology. Here we address the problem of arranging a suffix array on disk so that querying is fast in practice. We show that the combination of a small trie and a suffix array-like blocked data structure allows queries to be answered as much as three times faster than the best alternative disk-based suffix array arrangement. Construction of our data structure requires only modest processing time on top of that required to build the suffix tree, and requires negligible extra memory.
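A minimal sketch of the two-level layout described here, with a sorted list of block keys standing in for the paper's small trie; the block size, the routing step, and the assumption that a pattern's matches fall inside one block are all simplifications for illustration.

```python
# Minimal sketch: in-memory routing keys + blocked suffix array, one block per query.
from bisect import bisect_left, bisect_right

def build_blocks(text, block_size=4):
    """Split the suffix array into fixed-size blocks; keep the first suffix of
    each block as an in-memory routing key (a trie or short prefixes in practice)."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    blocks = [sa[i:i + block_size] for i in range(0, len(sa), block_size)]
    keys = [text[b[0]:] for b in blocks]
    return blocks, keys

def search(text, blocks, keys, pattern):
    """Route the pattern to one block, then binary search inside that block only
    (reading the block would be the single disk access)."""
    j = max(0, bisect_right(keys, pattern) - 1)
    block = blocks[j]
    prefixes = [text[i:i + len(pattern)] for i in block]
    return block[bisect_left(prefixes, pattern):bisect_right(prefixes, pattern)]

text = "mississippi"
blocks, keys = build_blocks(text)
print(search(text, blocks, keys, "ss"))  # -> [5, 2], the positions of "ss"
```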
LEDA-SM: Extending LEDA to secondary memory
- IN PROC. WORKSHOP ON ALGORITHM ENGINEERING
, 1999
"... During the last years, many software libraries for in-core computation have been developed. Most internal memory algorithms perform very badly when used in an external memory setting. We introduce LEDA-SM that extends the LEDA-library [22] towards secondary memory computation. LEDA-SM uses I/O-effi ..."
Cited by 10 (2 self)
In recent years, many software libraries for in-core computation have been developed. Most internal-memory algorithms perform very badly when used in an external-memory setting. We introduce LEDA-SM, which extends the LEDA library [22] towards secondary-memory computation. LEDA-SM uses I/O-efficient algorithms and data structures that do not suffer from the so-called I/O bottleneck. LEDA is used for in-core computation. We explain the design of LEDA-SM and report on performance results.
A Compressed Text Index on Secondary Memory
"... Abstract. We introduce a practical disk-based compressed text index that, when the text is compressible, takes much less space than the suffix array. It provides very good I/O times for searching, which in particular improve when the text is compressible. In this aspect our index is unique, as compr ..."
Cited by 9 (1 self)
We introduce a practical disk-based compressed text index that, when the text is compressible, takes much less space than the suffix array. It provides very good I/O times for searching, which in particular improve when the text is compressible. In this aspect our index is unique, as compressed indexes have been slower than their classical counterparts on secondary memory. We analyze our index and show experimentally that it is extremely competitive on compressible texts.
Binary Searching with Non-uniform Costs and Its Application to Text Retrieval
- ALGORITHMICA
, 1998
"... We study the problem of minimizing the expected cost of binary searching for data where the access cost is not fixed and depends on the last accessed element, such as data stored in magnetic or optical disk. We present an optimal algorithm for this problem that finds the optimal search strategy in O ..."
Cited by 6 (2 self)
We study the problem of minimizing the expected cost of binary searching for data where the access cost is not fixed and depends on the last accessed element, such as data stored on magnetic or optical disks. We present an optimal algorithm for this problem that finds the optimal search strategy in O(n³) time, which matches the time complexity of the simpler classical problem with fixed costs. Next, we present two practical linear expected-time algorithms, under the assumption that the access cost of an element is independent of its physical position. Both practical algorithms are online, that is, they find the next element to access as the search proceeds. The first is an approximation algorithm that minimizes the access cost while disregarding the quality of the problem partitioning. The second is a heuristic algorithm whose quality depends on its ability to estimate the final search cost, and which can therefore be tuned by recording statistics of previous runs. We present an appli...
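The O(n³) bound can be made concrete with a small dynamic program over intervals. The sketch below illustrates that style of DP under a toy seek-cost model; it is not the paper's algorithm as published, and the positions, weights, and cost function are assumptions made for the example.

```python
# Minimal sketch: optimal binary search strategy when the cost of probing
# element r depends on the previously probed element p (e.g. head movement).
# State = (interval, last probe outside it); O(n^2) states x O(n) choices = O(n^3).
from functools import lru_cache

def optimal_search_cost(n, cost, weight):
    """Expected cost of the best search strategy; weight[i] is the probability
    of searching key i, cost(p, r) the access cost of probing r after p (p == -1
    means no previous probe)."""
    @lru_cache(maxsize=None)
    def best(i, j, last):
        if i > j:
            return 0.0
        w = sum(weight[i:j + 1])          # every key in [i, j] pays for this probe
        return min(w * cost(last, r) + best(i, r - 1, r) + best(r + 1, j, r)
                   for r in range(i, j + 1))
    return best(0, n - 1, -1)

# Toy cost model: probing is more expensive the farther the head moves.
positions = [0, 3, 4, 9, 10, 20]
def seek_cost(p, r):
    return 1 + (abs(positions[p] - positions[r]) if p >= 0 else 0)

n = len(positions)
print(optimal_search_cost(n, seek_cost, [1.0 / n] * n))
```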
Optimized Binary Search and Text Retrieval
, 1995
"... We present an algorithm that minimizes the expected cost of indirect binary search for data with non-constant access costs, such as disk data. Indirect binary search means that sorted access to the data is obtained through an array of pointers to the raw data. One immediate application of this algor ..."
Cited by 6 (4 self)
We present an algorithm that minimizes the expected cost of indirect binary search for data with non-constant access costs, such as disk data. Indirect binary search means that sorted access to the data is obtained through an array of pointers to the raw data. One immediate application of this algorithm is to improve the retrieval performance of disk databases that are indexed using the suffix array model (also called PAT array). We consider the cost model of magnetic and optical disks, together with advance knowledge of the expected size of the subproblem produced by reading each disk track. This information is used to devise a modified binary search algorithm that decreases overall retrieval costs. Both an optimal and a practical algorithm are presented, together with analytical and experimental results. For 100 megabytes of text, the practical algorithm incurs 60% of the standard binary search cost on magnetic disk and 65% on optical disk.
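For context, a minimal sketch of plain indirect binary search itself (not the paper's optimized variant): each probe dereferences a pointer into the raw text, and on disk that dereference is exactly the step the cost model charges for. The text and pattern below are illustrative.

```python
# Minimal sketch of indirect binary search through a pointer (suffix) array.
def indirect_search(text, pointers, pattern):
    """Return one pointer whose suffix starts with pattern, or None."""
    lo, hi = 0, len(pointers) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        probe = text[pointers[mid]:pointers[mid] + len(pattern)]  # the "disk" read
        if probe == pattern:
            return pointers[mid]
        if probe < pattern:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

text = "mississippi"
pointers = sorted(range(len(text)), key=lambda i: text[i:])  # suffix array
print(indirect_search(text, pointers, "ssi"))  # -> 5, one position where "ssi" occurs
```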
Large-Scale Pattern Search Using Reduced-Space On-Disk Suffix Arrays
"... Abstract—The suffix array is an efficient data structure for in-memory pattern search. Suffix arrays can also be used for external-memory pattern search, via two-level structures that use an internal index to identify the correct block of suffix pointers. In this paper we describe a new two-level su ..."
Cited by 4 (0 self)
The suffix array is an efficient data structure for in-memory pattern search. Suffix arrays can also be used for external-memory pattern search, via two-level structures that use an internal index to identify the correct block of suffix pointers. In this paper we describe a new two-level suffix-array-based index structure that requires significantly less disk space than previous approaches. Key to the saving is the use of disk blocks that are based on prefixes rather than the more usual uniform-sampling approach, allowing reductions between blocks and subparts of other blocks. We also describe a new in-memory structure – the condensed BWT – and show that it allows common patterns to be resolved without access to the text. Experiments using 64 GB of English web text on a computer with 4 GB of main memory demonstrate the speed and versatility of the new approach. For this data the index is around one-third the size of previous two-level mechanisms, and a memory footprint of as little as 1% of the text size means that queries can be processed more quickly than is possible with a compact FM-INDEX. Index Terms: string search, pattern matching, suffix array, Burrows-Wheeler transform, succinct data structure, disk-based algorithm, experimental evaluation.
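A minimal sketch of prefix-based blocking as opposed to uniform sampling (the prefix length and the in-memory dictionary are illustrative assumptions; the paper's condensed BWT and the block-sharing reductions are not modeled here): because blocks are named by prefixes, the block a pattern belongs to is determined by the pattern itself, with no suffix text needed in memory.

```python
# Minimal sketch: suffix pointers grouped by a short leading prefix.
from collections import defaultdict

def build_prefix_blocks(text, plen=1):
    """Group suffix pointers by their first plen characters, in suffix order."""
    blocks = defaultdict(list)
    for i in sorted(range(len(text)), key=lambda i: text[i:]):
        blocks[text[i:i + plen]].append(i)
    return blocks

def locate(text, blocks, pattern, plen=1):
    """Fetch only the block named by the pattern's prefix, then scan it."""
    block = blocks.get(pattern[:plen], [])       # one "disk" block
    return [i for i in block if text[i:i + len(pattern)] == pattern]

text = "mississippi"
blocks = build_prefix_blocks(text)
print(locate(text, blocks, "si"))  # -> [6, 3], the positions of "si"
```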
Syntactic Similarity of Web Documents
, 2003
"... This paper presents and compares two methods for evaluating the syntactic similarity between documents. The first method uses the Patricia tree, constructed from the original document, and the similarity is computed searching the text of each candidate document in the tree. The second method uses sh ..."
Cited by 4 (3 self)
This paper presents and compares two methods for evaluating the syntactic similarity between documents. The first method uses a Patricia tree constructed from the original document, and the similarity is computed by searching the text of each candidate document in the tree. The second method uses the shingles concept to obtain a similarity measure for every document pair: each shingle from the original document is inserted into a hash table, in which the shingles of each candidate document are then searched. Given an original document and a set of candidates, both methods find the documents that have some similarity relationship with the original. Experimental results were obtained using a plagiarized-document generator system applied to 900 documents collected from the Web. Considering the arithmetic average of the absolute differences between the expected and obtained similarities, the algorithm that uses shingles obtained a performance of ##### and the algorithm that uses the Patricia tree a performance of #####.
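A minimal sketch of the shingle-based measure (not the paper's system; the shingle width and the sample sentences are illustrative): each document is reduced to its set of w-word shingles, and similarity is the Jaccard coefficient of the two sets.

```python
# Minimal sketch of shingle-based document similarity.
def shingles(text, w=3):
    """Set of all w-word substrings (shingles) of the document."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=3):
    """Jaccard similarity of the two shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

original = "the quick brown fox jumps over the lazy dog"
candidate = "the quick brown fox leaps over the lazy dog"
print(resemblance(original, candidate))  # -> 0.4
```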