Results 1–10 of 77
Compressed full-text indexes
ACM Computing Surveys, 2007
Cited by 263 (94 self)

Abstract:
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it has triggered a wealth of activity, produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and the regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes to date, focusing on how they exploit text compressibility and how they solve various search problems efficiently. We aim at giving the theoretical background needed to understand and follow the developments in this area.
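As an editorial illustration of the core self-index mechanism this survey covers, here is a minimal sketch of Burrows-Wheeler backward search. The helper names and naive rank computation are illustrative assumptions; real FM-indexes use compressed rank structures such as wavelet trees.

```python
# Minimal sketch of BWT-based backward search (the core of FM-index
# self-indexes). Naive O(n) rank for clarity only.

def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (text must end in '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def backward_search(L, pattern):
    """Count occurrences of pattern using only the BWT string L."""
    chars = sorted(set(L))
    C, total = {}, 0
    for c in chars:                     # C[c] = # characters in L smaller than c
        C[c] = total
        total += L.count(c)
    occ = lambda c, i: L[:i].count(c)   # rank of c in L[0:i] (naive)
    lo, hi = 0, len(L)                  # current suffix-array interval
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

L = bwt("banana$")
print(L)                          # "annb$aa"
print(backward_search(L, "ana"))  # 2
```

Each step of the loop narrows the suffix-array interval of suffixes prefixed by the current pattern suffix, touching only the (compressible) BWT string rather than the text.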
Efficient token-based clone detection with flexible tokenization
In Proceedings of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), 2007
Cited by 19 (2 self)

Abstract:
Code clones are similar code fragments that occur at multiple locations in a software system. Detection of code clones provides useful information for maintenance, reengineering, program understanding, and reuse. Several techniques have been proposed to detect code clones. These techniques differ in the code representation used for analysis of clones, ranging from plain text to parse trees and program dependence graphs. Clone detection based on lexical tokens involves minimal code transformation and gives good results, but is computationally expensive because of the large number of tokens that need to be compared. We explored string algorithms to find suitable data structures and algorithms for efficient token-based clone detection, and implemented them in our tool Repeated Tokens Finder (RTF). Instead of using a suffix tree for string matching, we use the more memory-efficient suffix array. RTF incorporates a suffix-array-based linear-time algorithm to detect string matches. It also provides a simple and customizable tokenization mechanism. Initial analysis and experiments show that our clone detection is simple, scalable, and performs better than previous well-known tools.
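To make the token-plus-suffix-array idea concrete, here is an editorial sketch (not RTF itself): a crude tokenizer that normalizes identifiers, and a naive suffix array used to find the longest repeated token run via adjacent-suffix comparison. The tokenizer and keyword set are assumptions for illustration.

```python
# Illustrative sketch of token-based clone detection with a suffix array.
import re

def tokenize(code):
    # crude lexer: identifiers normalized to "id" so renamed clones still match
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)
    keywords = {"if", "for", "return"}
    return ["id" if re.match(r"[A-Za-z_]", t) and t not in keywords else t
            for t in tokens]

def longest_repeat(tokens):
    """Longest token sequence occurring at least twice (naive suffix array)."""
    n = len(tokens)
    sa = sorted(range(n), key=lambda i: tokens[i:])
    best = (0, 0)                       # (length, start position)
    for a, b in zip(sa, sa[1:]):        # max repeat = max LCP of adjacent suffixes
        k = 0
        while a + k < n and b + k < n and tokens[a + k] == tokens[b + k]:
            k += 1
        if k > best[0]:
            best = (k, a)
    return tokens[best[1]: best[1] + best[0]]

code = "x = a + b; y = a + b; z = c * d;"
print(longest_repeat(tokenize(code)))
```

A production tool would build the suffix array in linear time and report all maximal repeats above a length threshold, but the structure of the computation is the same.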
Compressed permuterm index
In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Cited by 17 (6 self)

Abstract:
The Permuterm index (Garfield, 1976) is a time-efficient and elegant solution to the string dictionary problem in which pattern queries may include one wildcard symbol (the so-called Tolerant Retrieval problem). Unfortunately, the Permuterm index is space-inefficient because it quadruples the dictionary size. In this paper we propose the Compressed Permuterm Index, which solves the Tolerant Retrieval problem in time proportional to the length of the searched pattern and in space close to the k-th order empirical entropy of the indexed dictionary. We also design a dynamic version of this index which supports efficient insertion and deletion of individual strings in the dictionary. The result is based on a simple variant of the Burrows-Wheeler Transform, defined on a dictionary of strings of variable length, that allows the Tolerant Retrieval problem to be solved efficiently via known (dynamic) compressed indexes [17]. We complement our theoretical study with a rich set of experiments, which show that the Compressed Permuterm Index supports fast queries within a space occupancy close to the one achievable by compressing the string dictionary via gzip or bzip2. This improves known approaches based on Front-Coding [19] by more than 50% in absolute space occupancy, while still guaranteeing comparable query time.
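For context, here is an editorial sketch of the classic (uncompressed) permuterm idea that this paper compresses: index every rotation of word+"$", so a wildcard query X*Y becomes a prefix search for Y$X. Function names are illustrative.

```python
# Sketch of Garfield's permuterm rotation trick for one-wildcard queries.
from bisect import bisect_left, bisect_right

def build_permuterm(words):
    index = []
    for w in words:
        t = w + "$"                       # '$' marks the word boundary
        for i in range(len(t)):
            index.append((t[i:] + t[:i], w))
    index.sort()
    return index

def wildcard_query(index, pattern):
    """Answer queries with a single '*', e.g. 'he*o' -> prefix search 'o$he'."""
    x, y = pattern.split("*")
    key = y + "$" + x                     # rotate so the '*' lands at the end
    keys = [k for k, _ in index]
    lo = bisect_left(keys, key)
    hi = bisect_right(keys, key + "\uffff")  # all rotations with prefix `key`
    return sorted({w for _, w in index[lo:hi]})

idx = build_permuterm(["hello", "help", "halo", "hero"])
print(wildcard_query(idx, "he*o"))   # ['hello', 'hero']
print(wildcard_query(idx, "h*lo"))   # ['halo', 'hello']
```

The space blow-up is visible here: every word of length L contributes L+1 rotations, which is exactly the overhead the compressed version removes.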
Improved dynamic rank-select entropy-bound structures
In Proc. of the Latin American Theoretical Informatics Symposium (LATIN)
Cited by 17 (3 self)

Abstract:
Abstract. Operations rank and select over a sequence of symbols have many applications to the design of succinct and compressed data structures for managing text collections, structured text, binary relations, trees, graphs, and so on. We are interested in the case where the collections can be updated via insertions and deletions of symbols. Two current solutions stand out as the best in the tradeoff of space versus time (considering all the operations). One, by Mäkinen and Navarro, achieves compressed space (i.e., nH0 + o(n log σ) bits) and O(log n log σ) worst-case time for all the operations, where n is the sequence length, σ is the alphabet size, and H0 is the zero-order entropy of the sequence. The other, by Lee and Park, achieves O(log n (1 + log σ / log log n)) amortized time and uncompressed space, i.e., n log σ + O(n) + o(n log σ) bits. In this paper we show that the best of both worlds can be achieved. We combine the solutions to obtain nH0 + o(n log σ) bits of space and O(log n (1 + log σ / log log n)) worst-case time for all the operations. Apart from the best current solution, we obtain some byproducts that might be of independent interest.
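To make the two operations concrete, here is an editorial toy of a static rank/select bitvector with a block directory. It is a deliberate simplification: the paper's structures are compressed, over general alphabets, and support updates.

```python
# Toy static rank/select bitvector: cumulative popcounts per block, then a
# short in-block scan. Illustration only; not the paper's dynamic structure.

class BitVector:
    BLOCK = 4  # tiny block size so the directory is visible

    def __init__(self, bits):
        self.bits = bits
        self.blocks = [0]                 # popcount of bits[0 : j*BLOCK]
        for i in range(0, len(bits), self.BLOCK):
            self.blocks.append(self.blocks[-1] + sum(bits[i:i + self.BLOCK]))

    def rank1(self, i):
        """Number of 1s in bits[0:i]."""
        b = i // self.BLOCK
        return self.blocks[b] + sum(self.bits[b * self.BLOCK:i])

    def select1(self, k):
        """Position of the k-th 1 (1-based), or -1 if there is no k-th 1."""
        count = 0
        for pos, bit in enumerate(self.bits):
            count += bit
            if bit and count == k:
                return pos
        return -1

bv = BitVector([1, 0, 1, 1, 0, 0, 1, 0, 1])
print(bv.rank1(5))    # 3
print(bv.select1(3))  # 3
```

With word-sized blocks and a two-level directory this becomes the usual o(n)-overhead static solution; the difficulty the paper addresses is keeping such bounds under insertions and deletions.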
Permuted longest-common-prefix array
In Proc. 20th CPM, LNCS 5577, 2009
Cited by 17 (2 self)

Abstract:
Abstract. The longest-common-prefix (LCP) array is an adjunct to the suffix array that allows many string processing problems to be solved in optimal time and space. Its construction is a bottleneck in practice, taking almost as long as suffix array construction. In this paper, we describe algorithms for constructing the permuted LCP (PLCP) array, in which the values appear in position order rather than lexicographical order. Using the PLCP array, we can either construct or simulate the LCP array. We obtain a family of algorithms, including the fastest known LCP construction algorithm and some extremely space-efficient algorithms. We also prove a new combinatorial property of the LCP values.
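As an editorial sketch of the PLCP idea, here is the Phi-based construction in text order, exploiting the fact that PLCP[i] can drop by at most one from PLCP[i-1]. The naive suffix array is for illustration only.

```python
# Sketch: PLCP via the Phi array (lexicographic-predecessor links).

def suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])  # naive, illustration

def plcp_array(s, sa):
    n = len(s)
    phi = [0] * n
    for i in range(1, n):
        phi[sa[i]] = sa[i - 1]      # predecessor of each suffix in sorted order
    phi[sa[0]] = -1                 # lexicographically smallest suffix
    plcp = [0] * n
    l = 0
    for i in range(n):              # text order: l decreases by at most 1/step
        if phi[i] == -1:
            l = 0
        else:
            j = phi[i]
            while i + l < n and j + l < n and s[i + l] == s[j + l]:
                l += 1
        plcp[i] = l
        l = max(l - 1, 0)
    return plcp

s = "banana"
sa = suffix_array(s)
plcp = plcp_array(s, sa)
lcp = [plcp[sa[i]] for i in range(len(s))]  # permute back to SA order
print(sa, lcp)
```

The total work of the matching loop is O(n) because l only increases n times overall, which is why PLCP construction is so fast in practice.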
Rank/Select on Dynamic Compressed Sequences and Applications
2008
Cited by 16 (4 self)

Abstract:
Operations rank and select over a sequence of symbols have many applications to the design of succinct and compressed data structures managing text collections, structured text, binary relations, trees, graphs, and so on. We are interested in the case where the collections can be updated via insertions and deletions of symbols. Two current solutions stand out as the best in the tradeoff of space versus time (when considering all the operations). One solution, by Mäkinen and Navarro, achieves compressed space (i.e., nH0 + o(n log σ) bits) and O(log n log σ) worst-case time for all the operations, where n is the sequence length, σ is the alphabet size, and H0 is the zero-order entropy of the sequence. The other solution, by Lee and Park, achieves O(log n (1 + log σ / log log n)) amortized time and uncompressed space, i.e., n log σ + O(n) + o(n log σ) bits. In this paper we show that the best of both worlds can be achieved. We combine the solutions to obtain nH0 + o(n log σ) bits of space and O(log n (1 + log σ / log log n)) worst-case time for all the operations. Apart from the best current solution to the problem, we obtain several byproducts of independent interest, applicable to partial sums, text indexes, suffix arrays, the Burrows-Wheeler transform, and others.
Faster Lightweight Suffix Array Construction
Cited by 13 (3 self)

Abstract:
The suffix array is a data structure formed by sorting the suffixes of a string into lexicographical order. It is important for a variety of applications, perhaps most notably pattern matching, pattern discovery, and block-sorting data compression. The last decade has seen intensive research toward efficient construction of suffix arrays, with algorithms striving not only to be fast, but also "lightweight" (in the sense that they use little working memory). In this paper we describe a new lightweight suffix array construction algorithm. By exploiting several interesting properties of suffixes in combination with cache-conscious programming, we achieve excellent runtimes. Extensive experiments show our approach to be faster than all other known algorithms for the task.
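As an editorial illustration of suffix array construction (not the paper's algorithm), here is a compact prefix-doubling construction: suffixes are repeatedly re-ranked by pairs of ranks at distance k, doubling k until all ranks are distinct.

```python
# Prefix-doubling suffix array construction, O(n log^2 n) with Python's sort.
# Illustration only; the paper's lightweight algorithm is different.

def suffix_array(s):
    n = len(s)
    rank = [ord(c) for c in s]
    sa = list(range(n))
    k = 1
    while True:
        # sort by (rank of suffix, rank of suffix k positions later)
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for a, b in zip(sa, sa[1:]):
            new_rank[b] = new_rank[a] + (key(a) < key(b))
        rank = new_rank
        if rank[sa[-1]] == n - 1:   # all ranks distinct: fully sorted
            break
        k *= 2
    return sa

print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

The "lightweight" algorithms the abstract refers to fight for small working memory beyond the array itself; prefix doubling is shown here only because it is short and self-contained.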
Improving Suffix Array Locality for Fast Pattern Matching on Disk
2008
Cited by 11 (2 self)

Abstract:
The suffix tree (or equivalently, the enhanced suffix array) provides efficient solutions to many problems involving pattern matching and pattern discovery in large strings, such as those arising in computational biology. Here we address the problem of arranging a suffix array on disk so that querying is fast in practice. We show that the combination of a small trie and a suffix-array-like blocked data structure allows queries to be answered as much as three times faster than the best alternative disk-based suffix array arrangement. Construction of our data structure requires only modest processing time on top of that required to build the suffix tree, and requires negligible extra memory.
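For readers new to the query side, here is an editorial sketch of pattern location by binary search over a suffix array; the disk-layout techniques in the paper are about reducing how many of these suffix accesses become disk seeks. Materializing the prefix keys is an illustration-only simplification.

```python
# Locating a pattern via binary search on a suffix array (in-memory sketch).
from bisect import bisect_left, bisect_right

def sa_search(text, sa, pattern):
    """Return all starting positions of pattern in text, via the suffix array."""
    # fixed-length prefixes of suffixes, in SA order, are themselves sorted
    keys = [text[i:i + len(pattern)] for i in sa]
    lo = bisect_left(keys, pattern)
    hi = bisect_right(keys, pattern)
    return sorted(sa[lo:hi])

text = "abracadabra"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(sa_search(text, sa, "abra"))  # [0, 7]
print(sa_search(text, sa, "cad"))   # [4]
```

In a disk setting each comparison may touch a different part of the text, which is exactly why locality-improving layouts like the one in this paper pay off.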
Lempel–Ziv Factorization Using Less Time & Space
Cited by 11 (2 self)

Abstract:
Abstract. For 30 years the Lempel–Ziv factorization LZx of a string x = x[1..n] has been a fundamental data structure of string processing, especially valuable for string compression and for computing all the repetitions (runs) in x. Traditionally, the standard method for computing LZx was based on Θ(n)-time (or, depending on the measure used, O(n log n)-time) processing of the suffix tree STx of x. Recently, Abouelhoda et al. proposed an efficient Lempel–Ziv factorization algorithm based on an "enhanced" suffix array – that is, a suffix array SAx together with supporting data structures, principally an "interval tree". In this paper we introduce a collection of fast, space-efficient algorithms for LZ factorization, also based on suffix arrays, that in theory as well as in many practical circumstances are superior to those previously proposed; one family out of this collection achieves true Θ(n)-time alphabet-independent processing in the worst case by avoiding tree structures altogether. Mathematics Subject Classification (2000): 68W05. Keywords: Lempel–Ziv factorization, suffix array, suffix tree, LZ factorization.
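To pin down what is being computed, here is an editorial sketch of greedy LZ factorization in its naive quadratic form: each factor is the longest prefix of the remaining text that already occurs earlier (overlaps allowed), or a single fresh character. The paper's contribution is computing exactly this decomposition from the suffix array in linear time.

```python
# Naive greedy LZ factorization (quadratic; for definition only).

def lz_factorize(s):
    """Split s into factors: longest earlier-occurring prefix, else one char."""
    factors = []
    i = 0
    while i < len(s):
        length = 0
        # longest prefix of s[i:] that also starts at some j < i
        for j in range(i):
            k = 0
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1           # overlap past i is allowed, as in LZ77
            length = max(length, k)
        step = max(length, 1)    # a fresh character forms its own factor
        factors.append(s[i:i + step])
        i += step
    return factors

print(lz_factorize("aaaa"))            # ['a', 'aaa']
print(lz_factorize("abbaabbbaaabab"))
```

Note the self-referential factor in "aaaa": the second factor "aaa" copies from a source that overlaps it, which any correct LZ factorizer must permit.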
Efficient in-memory top-k document retrieval
2012
Cited by 9 (2 self)

Abstract:
For over forty years the dominant data structure for ranked document retrieval has been the inverted index. Inverted indexes are effective for a variety of document retrieval tasks, and particularly efficient for large data collection scenarios that require disk access and storage. However, many efficiency-bound search tasks can now easily be supported entirely in memory as a result of recent hardware advances. In this paper we present a hybrid algorithmic framework for in-memory bag-of-words ranked document retrieval using a self-index derived from the FM-Index, wavelet tree, and compressed suffix tree data structures, and evaluate the various algorithmic tradeoffs for performing efficient queries entirely in memory. We compare our approach with two classic approaches to bag-of-words queries using inverted indexes: term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing. We show that our framework is competitive with state-of-the-art indexing structures, and describe new capabilities provided by our algorithms that can be leveraged by future systems to improve effectiveness and efficiency for a variety of fundamental search operations.
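As an editorial sketch of one baseline the paper compares against, here is term-at-a-time (TAAT) ranked retrieval over toy inverted lists: scores accumulate one posting list at a time before the top-k are extracted. The weights and list contents are made-up illustration data.

```python
# Term-at-a-time (TAAT) top-k over inverted lists (baseline sketch).
import heapq
from collections import defaultdict

def taat_topk(inverted, query_terms, k):
    """Accumulate per-document scores one term (posting list) at a time."""
    acc = defaultdict(float)
    for term in query_terms:
        for doc, weight in inverted.get(term, []):
            acc[doc] += weight          # e.g. a tf-idf contribution
    return heapq.nlargest(k, acc.items(), key=lambda kv: kv[1])

inverted = {                            # term -> [(doc_id, weight), ...]
    "suffix": [(1, 0.5), (2, 1.2)],
    "array":  [(1, 0.8), (3, 0.4)],
    "tree":   [(2, 0.3)],
}
print(taat_topk(inverted, ["suffix", "array"], 2))
```

DAAT instead walks all query lists in parallel by document id, scoring one document completely before moving on; the paper's self-index framework replaces the explicit posting lists that both strategies assume.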