Indexing Straight-Line Programs (2008)
Abstract
Straight-line programs offer powerful text compression by representing a text T[1, u] in terms of a context-free grammar of n rules, so that T can be recovered in O(u) time. However, the problem of operating the grammar in compressed form has not been studied much. We present the first grammar representation capable of extracting text substrings, and of searching the text for patterns, in time o(n). Its size is of the same order as that of a plain SLP representation, and it can be of independent interest for other grammar-based problems. We also give some byproducts on representing binary relations.
1 Introduction and Related Work
Grammar-based compression has been a well-known technique since at least the seventies [53, 50, 3, 28, 47], and is still a very active area of research, stimulated by the recent interest in XML compression [33, 22, 37]. The main idea is to replace a given text T[1, u] by a context-free grammar (CFG) from which T can be derived. In fact, two different approaches fall under the same name [28]. In the
Optimizing XML Compression
Abstract
The eXtensible Markup Language (XML) provides a powerful and flexible means of encoding and exchanging data. As it turns out, its main advantage as an encoding format (namely, its requirement that all open and close markup tags are present and properly balanced) also yields one of its main disadvantages: verbosity. XML-conscious compression techniques seek to overcome this drawback. Many of these techniques first separate XML structure from the document content, and then compress each independently. Further compression gains can be realized by identifying and compressing together document content that is highly similar, thereby amortizing the storage costs of auxiliary information required by the chosen compression algorithm. Additionally, the proper choice of compression algorithm is an important factor not only for the achievable compression gain, but also for access performance. Hence, choosing a compression configuration that optimizes compression gain requires one to determine (1) a partitioning strategy for document content, and (2) the best available compression algorithm to apply to each set within this partition. In this paper, we show that finding an optimal compression configuration with respect to compression gain is an NP-hard optimization problem. This problem remains intractable even if one considers a single compression algorithm for all content. We also describe an approximation algorithm for selecting a partitioning strategy for document content based on the branch-and-bound paradigm.
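To make the notion of a "compression configuration" concrete, here is a minimal sketch of a brute-force search over partitions and compressors. It is not the paper's branch-and-bound approximation algorithm. The candidate compressors (zlib, bz2, lzma), the cost model (total compressed bytes), and all function names are assumptions chosen for illustration only.

```python
import bz2
import lzma
import zlib
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

# Hypothetical candidate compressors; the abstract does not name any.
COMPRESSORS: Dict[str, Callable[[bytes], bytes]] = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def set_partitions(items: List[int]) -> Iterator[List[List[int]]]:
    """Yield every partition of items into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or into a new block of its own.
        yield [[first]] + partition


def best_compressor(data: bytes) -> Tuple[str, int]:
    """Return (name, compressed size) of the smallest output for data."""
    sizes = ((name, len(fn(data))) for name, fn in COMPRESSORS.items())
    return min(sizes, key=lambda pair: pair[1])


def optimal_configuration(
    containers: Sequence[bytes],
) -> Tuple[int, List[Tuple[List[int], str]]]:
    """Exhaustive search; the number of partitions grows as the Bell numbers."""
    best_cost = None
    best_config: List[Tuple[List[int], str]] = []
    for partition in set_partitions(list(range(len(containers)))):
        cost = 0
        config = []
        for block in partition:
            # Content in one block is compressed together.
            data = b"".join(containers[i] for i in block)
            name, size = best_compressor(data)
            config.append((block, name))
            cost += size
        if best_cost is None or cost < best_cost:
            best_cost, best_config = cost, config
    return best_cost, best_config
```

Calling `optimal_configuration` on a handful of byte strings returns the smallest total size found and, for each block of container indices, the compressor chosen for it. The exhaustive enumeration is only feasible for very small inputs, which is consistent with the hardness result stated in the abstract.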

