## Space-Efficient Framework for Top-k String Retrieval Problems

Citations: 25 (3 self)

### BibTeX

@MISC{Hon_space-efficientframework,
    author = {Wing-kai Hon and Rahul Shah and Jeffrey Scott Vitter},
    title = {Space-Efficient Framework for Top-k String Retrieval Problems},
    year = {}
}

### Abstract

Given a set D = {d1, d2, ..., dD} of D strings of total length n, our task is to report the “most relevant” strings for a given query pattern P. This involves somewhat more advanced query functionality than the usual pattern matching, as some notion of “most relevant” is involved. In the information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for some predefined set of patterns. In the pattern matching community, the most popular pattern-matching data structures are suffix trees and suffix arrays. However, a typical suffix tree search involves going through all the occurrences of the pattern over the entire string collection, which might be far more than the number of relevant documents required. The first formal framework for studying this kind of retrieval problem was given by Muthukrishnan [25]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures taking O(n log n) words of space. We study this problem in a slightly different framework of reporting the top k most relevant documents (in sorted order) under similar and more general relevance metrics. Our framework gives a linear-space data structure with optimal query times for arbitrary score functions. As a corollary, it improves the space utilization for the problems in [25] while maintaining optimal query performance. We also develop compressed variants of these data structures for several specific relevance metrics.
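
To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the linear-space structure the abstract describes: it simply scans every document, scores it by frequency (one of the relevance metrics mentioned above), and reports the top k in sorted order. The function names `count_occurrences` and `top_k_by_frequency`, the tie-breaking rule, and the sample documents are illustrative assumptions.

```python
import heapq


def count_occurrences(doc: str, pattern: str) -> int:
    """Count (possibly overlapping) occurrences of pattern in doc."""
    count = 0
    start = doc.find(pattern)
    while start != -1:
        count += 1
        start = doc.find(pattern, start + 1)
    return count


def top_k_by_frequency(docs, pattern, k):
    """Return up to k (doc_index, frequency) pairs, highest frequency first.

    Documents that do not contain the pattern are never reported.
    Ties are broken by the smaller document index (an arbitrary choice).
    """
    matching = []
    for index, doc in enumerate(docs):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            matching.append((freq, index))
    best = heapq.nsmallest(k, matching, key=lambda t: (-t[0], t[1]))
    return [(index, freq) for freq, index in best]


docs = ["banana", "bandana", "cabana", "apple"]
print(top_k_by_frequency(docs, "an", 2))  # → [(0, 2), (1, 2)]
```

This baseline touches every occurrence of the pattern in every document, which is the cost the abstract says a suffix tree search also incurs and that the proposed framework is designed to avoid.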

### Citations

2374 | The PageRank Citation Ranking: Bringing Order to the Web
- Page, Brin, et al.
- 1999
Citation Context ...n can be arbitrary and can capture many practical measures, such as the frequency of P in d, the distance between two closest occurrences of P in d (known as proximity), or simply the static PageRank [27] of d which is independent of pattern P. The score function may also be some combinations of these measures. Muthukrishnan’s formulation tends to capture the notions of frequency and proximity by int...

851 | Managing Gigabytes: Compressing and Indexing Documents and Images
- Witten, Moffat, et al.
- 1999
Citation Context ...dakane [31] showed how to solve the document listing problem using succinct data structures which take space very close to that of the compressed text. He also showed how to compute the TF-IDF scores [36] of each document with such data structures. However, one limitation of Sadakane’s approach is that it needs to first retrieve all the documents where the pattern (or patterns) occurs, and then find t...

667 | Suffix arrays: a new method for on-line string searches
- Manber, Myers
- 1993
Citation Context ...rch which is almost about half a century old. Some of the earliest algorithms like [19] achieved optimal linear time performance. In the data structural sense, suffix trees [23, 35] and suffix arrays [21] are the most popular linear space data structures with optimal (or near-optimal) query performance. Although thought to be linear, the practical space requirement of suffix tree turned out to be abou...

658 | Fast pattern matching in strings
- Knuth, Morris, et al.
- 1977
Citation Context ... to find, given a text of size n and a pattern P of length p, all the locations in the text where this pattern matches. Earlier work has focussed on developing linear-time algorithms for this problem [19]. When the text is given beforehand, and the pattern queries come online, one might want to build a data structure on the text such that pattern matching queries can be answered in O(p + occ) time, wh...

591 | A Block-sorting Lossless Data Compression Algorithm
- Burrows, Wheeler
- 1994
Citation Context ...red unfavorably to inverted indexes. Recently, Grossi and Vitter [17] and Ferragina and Manzini [11] gave compressed variants of text searching data structures, based on the Burrows-Wheeler Transform [6]. These data structures not only compared well with inverted indexes in their space utilization but also provided query functionality for arbitrary patterns. Since then, designing succinct or compress...

577 | Optimal Aggregation Algorithms for Middleware
- Fagin, Lotem, et al.
- 2001
Citation Context ...can be used as alternative tools (in place of RMQ structures) in our framework also. Top-k query processing has been an extensive field of research in the information retrieval and database community [9, 18]. Many theoretical results have also appeared in the context of aggregating ranks from various ranked lists [1, 32]. 2. PRELIMINARIES 2.1. Generalized Suffix Tree Given a set of D strings {d1, d2, . ....

565 | A space-economical suffix tree construction algorithm
- McCreight
- 1976
Citation Context ...eries come online, one might want to build a data structure on the text such that pattern matching queries can be answered in O(p + occ) time, where occ denotes the number of occurrences. Suffix tree [23, 35] is the most popular data structure which achieves this goal. Most string databases consist of a collection of multiple text documents (or strings) rather than just one single text. In this case, the ...

439 | Linear pattern matching algorithms
- Weiner
- 1973
Citation Context ...eries come online, one might want to build a data structure on the text such that pattern matching queries can be answered in O(p + occ) time, where occ denotes the number of occurrences. Suffix tree [23, 35] is the most popular data structure which achieves this goal. Most string databases consist of a collection of multiple text documents (or strings) rather than just one single text. In this case, the ...

192 | Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract)
- Grossi, Vitter
- 2000
Citation Context ...ut 15–50 times that of the text and for suffix arrays this was almost about 5–20 times the text. Due to this limitation, they compared unfavorably to inverted indexes. Recently, Grossi and Vitter [17] and Ferragina and Manzini [11] gave compressed variants of text searching data structures, based on the Burrows-Wheeler Transform [6]. These data structures not only compared well with inverted index...

188 | The LCA problem revisited
- Bender, Farach-Colton
- 2000
Citation Context ...solving range minimum/maximum query (RMQ) using succinct variant of Cartesian tree (see [31]). Although solving RMQ is as old as Chazelle’s original paper on range searching [8], many simplifications [3] and improvements have been made, culminating in Fischer et al’s 2n + o(n) bits of space data structure [13, 14]. Even our results shall extensively use RMQ as a tool to obtain top-k in a given set of...

181 | Scaling and related techniques for geometry problems
- Gabow, Bentley, et al.
- 1984
Citation Context ... term of O(k) (as against our O(k log k)) for reporting k items in sorted order, their data structure necessarily takes super-linear space. The other data structures having an O(k) additive term like [16, 24] do not output the top-k items in sorted order. Although these data structures do not directly address the notion of frequency or proximity (when rank score is dependent on set of items rather than a ...

178 | Compressed full-text indexes
- Navarro, Mäkinen

164 | Aggregating inconsistent information: Ranking and clustering
- Ailon, Charikar, et al.
- 2005
Citation Context ...n an extensive field of research in the information retrieval and database community [9, 18]. Many theoretical results have also appeared in the context of aggregating ranks from various ranked lists [1, 32]. 2. PRELIMINARIES 2.1. Generalized Suffix Tree Given a set of D strings {d1, d2, . . . , dD} of total length n, the generalized suffix tree (GST) is a compact trie storing all suffixes of all the D s...

133 | A functional approach to data structures and its use in multidimensional searching
- Chazelle
- 1988
Citation Context ...cument listing problem by solving range minimum/maximum query (RMQ) using succinct variant of Cartesian tree (see [31]). Although solving RMQ is as old as Chazelle’s original paper on range searching [8], many simplifications [3] and improvements have been made, culminating in Fischer et al’s 2n + o(n) bits of space data structure [13, 14]. Even our results shall extensively use RMQ as a tool to obta...

122 | Indexing compressed texts
- Ferragina, Manzini
Citation Context ... and for suffix arrays this was almost about 5–20 times the text. Due to this limitation, they compared unfavorably to inverted indexes. Recently, Grossi and Vitter [17] and Ferragina and Manzini [11] gave compressed variants of text searching data structures, based on the Burrows-Wheeler Transform [6]. These data structures not only compared well with inverted indexes in their space utilization b...

112 | Compressed representations of sequences and full-text indexes
- Ferragina, Manzini, et al.
- 2007
Citation Context ... SA[L], SA[L + 1], ..., SA[R]. There are various versions of CSA in the literature which provide different performance tradeoffs. Throughout this paper, we shall assume the version by Ferragina et al [12] and assume |Σ| = O(polylog n), such that SA[i] and $SA^{-1}[j]$ can be reported in $O(\log^{1+\epsilon} n)$ time, while the exact range [L, R] for P can be computed in $O(p + \log^{1+\epsilon} n)$ time for any $\epsilon > 0$. Note that t...

73 | Efficient algorithms for document retrieval problems
- Muthukrishnan
Citation Context ...ttern over the entire string collection, which might be a lot more than the required relevant documents. The first formal framework to study such kind of retrieval problems was given by Muthukrishnan [25]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures taking O(n log n) words of space. We study this problem ...

54 | Structuring labeled trees for optimal succinctness, and beyond
- Ferragina, Luccio, et al.
- 2005
Citation Context ... query functionality for arbitrary patterns. Since then, designing succinct or compressed data structures for text problems has been a thriving field of research with many improvements and extensions [2, 10, 26, 29, 30]. Puglisi et al [28] indeed showed that compressed text indexes provide faster searching than inverted indexes. However, the authors also showed that if the number of occurrences ‖ Here, |CSA| denotes...

37 | A fast merging algorithm
- Brown, Tarjan
- 1979
Citation Context ...pair with mindist(v1) and mindist(v2) and obtain mindist(v) for v. This merging step can be done in O(|L1| log(|L2|/|L1|)) time (assuming |L1| ≤ |L2|) using Brown and Tarjan’s fast merging algorithm [5]. The total time can be shown to be O(n log n) for processing the entire document (see a similar analysis in [33]). This thus gives us an O(n log n) algorithm for calculating mindist scores over the G...
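
The merging step in this excerpt can be illustrated with a small Python sketch. It assumes that L1 and L2 are sorted lists of text positions and that mindist denotes the smallest gap between two positions in a list; both readings are assumptions, since the excerpt is truncated. For simplicity it uses a plain linear-time merge rather than Brown and Tarjan's fast merging algorithm, so it does not achieve the O(|L1| log(|L2|/|L1|)) bound quoted above. The name `merge_with_mindist` is hypothetical.

```python
def merge_with_mindist(left, right, left_min, right_min):
    """Merge two sorted position lists and return (merged, mindist).

    left_min and right_min are the smallest gaps already known inside
    each input list (use float("inf") for a single-element list).
    """
    merged = []
    best = min(left_min, right_min)
    i = j = 0
    while i < len(left) or j < len(right):
        # Take the smaller head element; fall back to whichever list remains.
        if j == len(right) or (i < len(left) and left[i] <= right[j]):
            nxt = left[i]
            i += 1
        else:
            nxt = right[j]
            j += 1
        # Any new closest pair must be adjacent in the merged order.
        if merged:
            best = min(best, nxt - merged[-1])
        merged.append(nxt)
    return merged, best


positions, gap = merge_with_mindist([2, 10], [5, 20], 8, 15)
print(positions, gap)  # → [2, 5, 10, 20] 3
```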

34 | Optimizing scoring functions and indexes for proximity search in type-annotated corpora
- Chakrabarti, Puniyani, et al.
- 2006
Citation Context ...d gave data structures taking O(n log n) words of space for them. His work quickly motivated a flurry of new results with some seeking to improve them and some utilizing them in specific applications [7]. Many particular algorithms in bio-informatics have been based on the frequency metric [15, 20]. One particular line of research was to obtain compressed/succinct data structures for document listing...

33 | Space-efficient algorithms for document retrieval
- Välimäki, Mäkinen
Citation Context ...the documents first. Nevertheless, Sadakane did show some very useful tools and techniques for deriving succinct data structures for these problems. Similar work was also done by Välimäki and Mäkinen [34] where they derived alternative succinct data structures for the document listing problem. In all these papers, deriving succinct data structures for the more meaningful (in the IR sense) K-mine and K...

18 | Evaluating rank joins with optimal cost
- Schnaitter, Polyzotis
- 2008

12 | Inverted files versus suffix arrays for locating patterns in primary memory, in Proc. Symp. String Processing and Information Retrieval, 2006, pp. 122–133
- Puglisi, Smyth, et al.
- 2006
Citation Context ...patterns. Since then, designing succinct or compressed data structures for text problems has been a thriving field of research with many improvements and extensions [2, 10, 26, 29, 30]. Puglisi et al [28] indeed showed that compressed text indexes provide faster searching than inverted indexes. However, the authors also showed that if the number of occurrences ‖ Here, |CSA| denotes space (in bits) of ...

10 | Augmenting suffix trees, with applications
- Matias, Muthukrishnan, et al.
- 1998
Citation Context .... The formal study of document retrieval problems is motivated by this fact that occurrences of a pattern may be too many but the number of documents carrying the pattern might be fewer. Matias et al [22] gave the first solution for the document listing problem which answers this in O(p log D + ndoc) time. Muthukrishnan [25] improved the result to optimal O(p + ndoc). He also initiated the formal fram...

9 | Rank-Sensitive Data Structures
- Bialynicka-Birula, Grossi
- 2005
Citation Context ...p-k in a given set of ranges. The study of reporting top-k matching items in the given range in sorted order can be traced back to McCreight’s priority search trees [24]. Bialynicka-Birula and Grossi [4] gave a general framework to add rank information to items being outputted (from any range reporting data structure) and report top-k items in sorted order. We wish to note here that, although they ac...

9 | Undiscretized dynamic programming: Faster algorithms for facility location and related problems on trees
- Shah, Farach-Colton
- 2002
Citation Context ...|L2|/|L1|)) time (assuming |L1| ≤ |L2|) using Brown and Tarjan’s fast merging algorithm [5]. The total time can be shown to be O(n log n) for processing the entire document (see a similar analysis in [33]). This thus gives us an O(n log n) algorithm for calculating mindist scores over the GST. 4. SUCCINCT STRUCTURES In this section, we describe succinct structure for the problem of top-k string retrie...

5 | Space Efficient String Mining under Frequency Constraints
- Fischer, Mäkinen, et al.
- 2008
Citation Context ...d a flurry of new results with some seeking to improve them and some utilizing them in specific applications [7]. Many particular algorithms in bio-informatics have been based on the frequency metric [15, 20]. One particular line of research was to obtain compressed/succinct data structures for document listing problem by solving range minimum/maximum query (RMQ) using succinct variant of Cartesian tree (...

4 | Storage and retrieval of individual genomes
- Mäkinen, Navarro, et al.
- 2009
Citation Context ...d a flurry of new results with some seeking to improve them and some utilizing them in specific applications [7]. Many particular algorithms in bio-informatics have been based on the frequency metric [15, 20]. One particular line of research was to obtain compressed/succinct data structures for document listing problem by solving range minimum/maximum query (RMQ) using succinct variant of Cartesian tree (...

3 | A New Succinct Representation of RMQ-Information and
- Fischer, Heun
- 2007
Citation Context ... O(n)-word data structure was proposed by [25], and the space was subsequently improved through a series of papers to |CSA| + 2n + o(n) + D log(n/D) bits with query answered in O(p + ndoc log n) time [13, 31, 34]. We remove the additional 2n bits required (with slight increase in query time) to achieve a better space bound of |CSA| + o(n) + D log(n/D) bits. 1.2. Related Work Pattern matching is a field of res...

3 | Practical Entropy-Bounded Schemes for O(1)-Range Minimum Queries
- Fischer, Heun, et al.
- 2008
Citation Context ...ving RMQ is as old as Chazelle’s original paper on range searching [8], many simplifications [3] and improvements have been made, culminating in Fischer et al’s 2n + o(n) bits of space data structure [13, 14]. Even our results shall extensively use RMQ as a tool to obtain top-k in a given set of ranges. The study of reporting top-k matching items in the given range in sorted order can be traced back to Mc...

2 | A Survey of Top-K
- Ilyas, Beskales, et al.
- 2008
Citation Context ...can be used as alternative tools (in place of RMQ structures) in our framework also. Top-k query processing has been an extensive field of research in the information retrieval and database community [9, 18]. Many theoretical results have also appeared in the context of aggregating ranks from various ranked lists [1, 32]. 2. PRELIMINARIES 2.1. Generalized Suffix Tree Given a set of D strings {d1, d2, . ....

1 | Compressed Suffix Trees with Full Functionality, Theory of Computing Systems
- Sadakane
- 2007
Citation Context ... query functionality for arbitrary patterns. Since then, designing succinct or compressed data structures for text problems has been a thriving field of research with many improvements and extensions [2, 10, 26, 29, 30]. Puglisi et al [28] indeed showed that compressed text indexes provide faster searching than inverted indexes. However, the authors also showed that if the number of occurrences ‖ Here, |CSA| denotes...

1 | Succinct Data Structures for Flexible Text Retrieval Systems
- Sadakane
- 2007
Citation Context ...s to another popular heuristic in information retrieval called proximity. He gave O(n log n)-word data structures for these problems which can answer the queries in optimal O(p + ndoc) time. Sadakane [31] showed how to solve the document listing problem using succinct data structures which take space very close to that of the compressed text. He also showed how to compute the TF-IDF scores [36] of eac...