Results 1–10 of 10
Compressed suffix arrays and suffix trees with applications to text indexing and string matching
, 2005
Cited by 188 (17 self)
Abstract:
The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit-cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg_|Σ| n), which is significant when Σ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in O(m lg |Σ|) time or in O(m + lg n) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg_|Σ| n + lg^ε_|Σ| n) search time in the worst case, for any constant ...
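The Ω(n)-word suffix-array baseline this abstract contrasts against can be sketched in a few lines. This is a naive illustrative version (quadratic construction, binary-search lookup), not the paper's compressed index, and all names are ours:

```python
import bisect

def build_suffix_array(text):
    """Suffix array of `text`: starting positions of all suffixes, in sorted order.
    Naive O(n^2 lg n) construction; production indexes build this in O(n) time."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """All positions where `pattern` occurs in `text`, via binary search:
    the suffixes starting with `pattern` form one contiguous block of `sa`."""
    m = len(pattern)
    prefixes = [text[i:i + m] for i in sa]  # m-symbol prefix of each suffix
    lo = bisect.bisect_left(prefixes, pattern)
    hi = bisect.bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])

text = "mississippi"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ssi"))  # -> [2, 5]
```

The compressed indexes surveyed in these results replace the Ω(n lg n)-bit array `sa` with an entropy-bounded representation while preserving this query interface.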
Compression, indexing, and retrieval for massive string data
 Combinatorial Pattern Matching, LNCS
, 2010
Cited by 7 (1 self)
Abstract:
The field of compressed data structures seeks to achieve fast search time using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
A note on the Burrows-Wheeler transformation
 Theor. Comput. Sci
Cited by 6 (1 self)
Abstract:
We relate the Burrows-Wheeler transformation to a result in combinatorics on words known as the Gessel-Reutenauer transformation.
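For reference, the Burrows-Wheeler transformation itself is small enough to sketch. This naive version sorts all rotations explicitly and uses a '$' sentinel assumed smaller than every text symbol; the names and the sentinel convention are ours, not the note's:

```python
def bwt(s, sentinel="$"):
    """Burrows-Wheeler transform: last column of the sorted rotation matrix."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last, sentinel="$"):
    """Invert the BWT naively by repeatedly prepending `last` and re-sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(t for t in table if t.endswith(sentinel))
    return row[:-1]  # strip the sentinel

print(bwt("banana"))           # -> annb$aa
print(inverse_bwt("annb$aa"))  # -> banana
```

The transform is a permutation of the input, so it compresses nothing by itself; it tends to group equal symbols into runs, which is what the encoding-length analyses in these results quantify.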
An algorithmic framework for compression and text indexing
Cited by 5 (0 self)
Abstract:
We present a unified algorithmic framework to obtain nearly optimal space bounds for text compression and compressed text indexing, apart from lower-order terms. For a text T of n symbols drawn from an alphabet Σ, our bounds are stated in terms of the hth-order empirical entropy of the text, H_h. In particular, we provide a tight analysis of the Burrows-Wheeler transform (BWT), establishing a bound of n H_h + M(T, Σ, h) bits, where M(T, Σ, h) denotes the asymptotical number of bits required to store the empirical statistical model for contexts of order h appearing in T. Using the same framework, we also obtain an implementation of the compressed suffix array (CSA) which achieves n H_h + M(T, Σ, h) + O(n lg lg n / lg_|Σ| n) bits of space while still retaining competitive full-text indexing functionality. The novelty of the proposed framework lies in its use of the finite set model instead of the empirical probability model (as in previous work), giving us new insight into the design and analysis of our algorithms. For example, we show that our analysis gives improved bounds, since M(T, Σ, h) ≤ min{g'_h lg(n/g'_h + 1), H*_h n + lg n + g''_h}, where g'_h = O(|Σ|^(h+1)) and g''_h = O(|Σ|^(h+1) lg |Σ|^(h+1)) do not depend on the text length n, while H*_h ≥ H_h is the modified hth-order empirical entropy of T. Moreover, we show a strong relationship between a compressed full-text index and the succinct dictionary problem. We also examine the importance of lower-order terms, as these can dwarf any savings achieved by high-order entropy. We report further results and tradeoffs on high-order entropy-compressed text indexes in the paper.
Golatowski, “Encoding and Compression for Devices Profile for Web Services”
 5th International Workshop on Service Oriented Architectures in Converging Networked Environments (SOCNE 2010)
, 2010
Cited by 4 (2 self)
Abstract:
Most solutions for Wireless Sensor Networks (WSN) come equipped with their own architectural concepts, which raises the problem of possible incompatibility between computer networks and the WSN. Gateway concepts are often used to overcome this problem, but this is not the best solution in the long term. Other research fields and industrial domains are heading for universal cross-domain architecture concepts based on Internet technologies that are more mature and better understood. The IETF 6LoWPAN working group provides the groundwork for standardized communication using existing network protocols such as IPv6 in low-power radio networks as well. A big challenge when deploying further application-layer network protocols on top of 6LoWPAN is the message size of existing, mostly XML-based protocols, which does not meet the resource constraints of deeply embedded devices without further research efforts. This paper presents different data compression techniques for the Devices Profile for Web Services (DPWS) to be applied in 6LoWPAN networks. To this end, we analyze a realistic scenario, identify 18 message types in it, and compress and encode all messages using existing schemes and tools. For the first time, we also investigate the Efficient XML Interchange (EXI) format for DPWS.
Nearly tight bounds on the encoding length of the Burrows-Wheeler transform
 In Proc. Workshop on Analytical Algorithmics and Combinatorics, January 2008
Cited by 1 (1 self)
Abstract:
In this paper, we present a nearly tight analysis of the encoding length of the Burrows-Wheeler Transform (BWT) that is motivated by the text indexing setting. For a text T of n symbols drawn from an alphabet Σ, our encoding scheme achieves bounds in terms of the hth-order empirical entropy H_h of the text, and takes linear time for encoding and decoding. We also describe a lower bound on the encoding length of the BWT that constructs an infinite (nontrivial) class of texts that are among the hardest to compress using the BWT. We then show that our upper bound on the encoding length is nearly tight with this lower bound for the class of texts we described. In designing our BWT encoding and its lower bound, we also address the t-subset problem; here, the goal is to store a subset of t items drawn from a universe [1..n] using just lg (n choose t) + O(1) bits of space. A number of solutions to this basic problem are known; however, encoding or decoding usually requires either O(t) operations on large integers [Knu05, Rus05] or O(n) operations. We provide a novel approach that reduces the encoding/decoding time to just O(t) operations on small integers (of size O(lg n) bits), without increasing the space required.
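The t-subset problem mentioned above is classically solved by the combinatorial number system, which assigns every t-subset of an n-element universe a unique rank in [0, C(n, t)), i.e. an encoding of about lg (n choose t) bits. The sketch below shows that classical ranking over a 0-based universe, not the paper's faster scheme, and the function names are ours:

```python
from math import comb

def rank_subset(subset):
    """Colexicographic rank of a t-subset {c_1 < ... < c_t} of {0, ..., n-1}:
    rank = C(c_1, 1) + C(c_2, 2) + ... + C(c_t, t)."""
    return sum(comb(c, i + 1) for i, c in enumerate(sorted(subset)))

def unrank_subset(r, t):
    """Inverse mapping: recover the t-subset whose colexicographic rank is r."""
    subset = []
    for i in range(t, 0, -1):
        c = i - 1
        while comb(c + 1, i) <= r:  # find the largest c with C(c, i) <= r
            c += 1
        subset.append(c)
        r -= comb(c, i)
    return sorted(subset)
```

This naive unranking scans the universe and performs arithmetic on potentially large binomial coefficients; the paper's contribution is reducing the cost to O(t) operations on O(lg n)-bit words without increasing the space.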
PARALLELIZATION OF WEIGHTED SEQUENCE COMPARISON BY USING EBWT
Abstract:
In this paper, we describe the design of a high-performance weighted sequence comparison algorithm based on the extended Burrows-Wheeler transform (EBWT) for many-core GPUs, taking advantage of the full programmability offered by the Compute Unified Device Architecture (CUDA) and its standard library Thrust. Our Thrust-based implementation is the fastest implementation of the weighted sequence comparison algorithm to date: it is on average 56.3× faster than our previous EBWT-based implementation without the Thrust library. Moreover, it is also competitive with CPU implementations, being up to 2.9× faster than a comparable routine on a 2.99 GHz Intel Pentium 4 CPU with 3 GB RAM.
Improving HTML Compression
, 2008
Abstract:
The verbosity of the Hypertext Markup Language (HTML) remains one of its main weaknesses. This problem can be solved with the aid of HTML-specialized compression algorithms. In this work, we describe a lossless HTML transform which, combined with generally used compression algorithms, attains high compression ratios. Its core is a fully reversible transform featuring substitution of words in an HTML document using a static English dictionary or a semi-static dictionary of the most frequent words in the document, together with effective encoding of dictionary indexes, numbers, and specific patterns. The experimental results show that the proposed transform improves the HTML compression efficiency of general-purpose compressors on average by 15% in the case of gzip, achieving comparable processing speed. Moreover, we show that the compression ratio of gzip can be improved by up to 28% at the price of higher memory requirements and much slower processing. Summary (translated from Slovenian): An original algorithm for improving HTML compression is described.
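The semi-static word-substitution idea at the transform's core can be illustrated with a toy, fully reversible sketch. The escape scheme, thresholds, and names below are ours, not the paper's actual encoding, which also handles numbers and HTML-specific patterns:

```python
import re
from collections import Counter

ESC = "\x01"  # escape byte assumed absent from the input document

def build_dictionary(text, k=64):
    """Semi-static dictionary: the k most frequent words of 3+ letters."""
    words = re.findall(r"[A-Za-z]{3,}", text)
    return [w for w, _ in Counter(words).most_common(k)]

def transform(text, dictionary):
    """Replace every dictionary word by a short code: ESC, index, ';'."""
    index = {w: i for i, w in enumerate(dictionary)}
    repl = lambda m: f"{ESC}{index[m.group()]};" if m.group() in index else m.group()
    return re.sub(r"[A-Za-z]{3,}", repl, text)

def inverse_transform(text, dictionary):
    """Undo `transform` exactly, restoring the original document."""
    return re.sub(f"{re.escape(ESC)}(\\d+);",
                  lambda m: dictionary[int(m.group(1))], text)
```

A general-purpose compressor such as gzip is then run on the transformed text, where repeated multi-byte words have become short, highly repetitive codes.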
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Abstract:
Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale and local repetitive structures are better understood than large and complex interspersed ones. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19–50 times the text size even with the best engineering efforts, prohibiting their usability on massive data such as the whole human genome. We focus on finding all the maximal repeats from massive texts in a time- and space-efficient manner. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural-language texts and protein data, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage of our method is less than double the sequence size. Our space-efficient method keeps the timing performance fast. In fact, our method is orders of magnitude faster than the prior methods for processing massive texts such as the whole human genome, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8 GB internal memory (actual internal memory usage is less than 6 GB) to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.
Article 13.2.1 Clustering Words and Interval Exchanges
Abstract:
We characterize words which cluster under the Burrows-Wheeler transform as those words w such that ww occurs in a trajectory of an interval exchange transformation, and build examples of clustering words.
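The clustering property being characterized is easy to test directly: a word clusters when its Burrows-Wheeler transform, taken here over cyclic rotations without a sentinel as is usual in the combinatorics-on-words setting, groups each symbol into a single run. A small sketch with illustrative names:

```python
from itertools import groupby

def bwt_rotations(w):
    """BWT over cyclic rotations (no sentinel): last column of sorted rotations."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)

def clusters(w):
    """True iff every distinct symbol of bwt(w) occurs as one contiguous run."""
    b = bwt_rotations(w)
    runs = [sym for sym, _ in groupby(b)]  # one entry per maximal run
    return len(runs) == len(set(runs))

print(bwt_rotations("banana"), clusters("banana"))  # nnbaaa True
print(bwt_rotations("abba"), clusters("abba"))      # baba False
```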