Searching BWT compressed text with the Boyer-Moore algorithm and binary search
Proceedings, IEEE Data Compression Conference, 2002
Cited by 13 (7 self)
Abstract: This paper explores two techniques for online exact pattern matching in files that have been compressed using the Burrows-Wheeler transform. We investigate two approaches. The first is an application of the Boyer-Moore algorithm (Boyer & Moore 1977) to a transformed string. The second approach is based on the observation that the transform effectively contains a sorted list of all substrings of the original text, which can be exploited for very rapid searching using a variant of binary search. Both methods are faster than a decompress-and-search approach for small numbers of queries, and binary search is much faster even for large numbers of queries.
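The binary-search idea in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the paper recovers the sorted order from the BWT output itself, whereas this sketch simply sorts the rotations of the plain text. The function names `bwt` and `find_all` and the `\x00` sentinel are assumptions made for illustration.

```python
def bwt(text, sentinel="\x00"):
    """Return (last column, sorted rotation start positions) for text + sentinel.

    Assumes the sentinel does not occur in text and sorts before every symbol,
    so sorting rotations is equivalent to sorting suffixes.
    """
    s = text + sentinel
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    last = "".join(s[i - 1] for i in order)  # symbol preceding each rotation
    return last, order


def find_all(text, order, pattern, sentinel="\x00"):
    """Binary-search the sorted suffixes for every occurrence of pattern."""
    if not pattern:
        return []
    s = text + sentinel
    m = len(pattern)

    def prefix(i):
        # First m symbols of the suffix starting at position i.
        return s[i:i + m]

    # Lower bound: first sorted suffix whose prefix is >= pattern.
    lo, hi = 0, len(order)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(order[mid]) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first sorted suffix whose prefix is > pattern.
    hi = len(order)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(order[mid]) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(order[start:lo])


last, order = bwt("mississippi")
print(find_all("mississippi", order, "ssi"))
```

All suffixes that begin with the pattern sit in one contiguous run of the sorted order, which is why two binary searches are enough to find every occurrence.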
The SCP and compressed domain analysis of biological sequences
Proc., IEEE Bioinformatics Conference, 2003
Cited by 4 (3 self)
We introduce the SCP, the sorted common prefix, and study some of its properties. Based on the internal representations used by a class of new compression schemes, we show how the SCP table can be constructed using $O(u + \Sigma\kappa_{\max})$ comparisons on average, and $O(u\Sigma)$ in the worst case, where $u$ is the size of the sequence, $\Sigma$ is the number of symbols, and $\kappa_{\max}$ is the maximum SCP value. We describe two applications of the SCP in biological sequence analysis. In particular, using the SCP and the compressed representation of the sequence, we present an algorithm for finding all the $\eta_{occ}$ canonical tandem arrays in the sequence in $O(u + \eta_{occ} + \Sigma\kappa_{\max})$ time on average, and $O(\eta_{occ} + u\Sigma)$ in the worst case. Preliminary results on the statistics of the SCP for some DNA and protein sequences are included.
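The abstract does not define the SCP table. As a loosely related illustration only, and assuming the SCP resembles the length of the common prefix shared by lexicographically adjacent suffixes, the following naive sketch computes such a table. The function name `adjacent_common_prefixes` is hypothetical, and the quadratic construction here is not the paper's method.

```python
def adjacent_common_prefixes(text):
    """Return (sorted suffix start positions, common-prefix lengths).

    table[j] is the number of leading symbols shared by the j-th sorted
    suffix and the one before it; table[0] is 0 by convention.
    ASSUMPTION: this stands in for the SCP, which the abstract leaves undefined.
    """
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])
    table = [0] if suffixes else []
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        table.append(k)
    return suffixes, table


print(adjacent_common_prefixes("banana"))
```

Long entries in such a table flag repeated substrings, which is the kind of signal a repeat or tandem-array search could start from.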
Pattern Matching in LZW Compressed Files
IEEE Transactions on Computers, 2005
LZW Based Compressed Pattern Matching
Cited by 2 (1 self)
Compressed pattern matching is an emerging research area that addresses the following problem: given a file in compressed format and a pattern, report the occurrence(s) of the pattern in the file with minimal (or no) decompression. In this paper, we report our work on compressed pattern matching in LZW compressed files. The work is based on Amir’s well-known “almost-optimal” algorithm [1], improved to find not only the first occurrence of the pattern but also all other occurrences. The improvements also include multi-pattern matching and a faster implementation for the so-called “simple pattern”, defined as “a pattern with no symbol appearing more than once”. Extensive experiments were conducted to test the search performance and to compare it not only with the “decompress-then-search” approach but also with the best available compressed pattern matching algorithms, particularly the BWT-based algorithms [2, 3]. The results show that our method is competitive with the best algorithms.
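For orientation, the “decompress-then-search” baseline mentioned in this abstract can be sketched as follows. This is a textbook LZW codec with a plain substring scan, not Amir's algorithm or the paper's improvement. All function names are assumptions for illustration, and the codec assumes every input symbol has a code point below 256.

```python
def lzw_compress(text):
    """Textbook LZW: emit one integer code per longest known phrase."""
    table = {chr(i): i for i in range(256)}
    w, out = "", []
    for c in text:
        wc = w + c
        if wc in table:
            w = wc
        else:
            out.append(table[w])
            table[wc] = len(table)  # next free code
            w = c
    if w:
        out.append(table[w])
    return out


def lzw_decompress(codes):
    """Invert lzw_compress, rebuilding the phrase table on the fly."""
    if not codes:
        return ""
    table = {i: chr(i) for i in range(256)}
    w = table[codes[0]]
    parts = [w]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            entry = w + w[0]  # phrase defined by the step that is using it
        else:
            raise ValueError("invalid LZW code: %d" % code)
        parts.append(entry)
        table[len(table)] = w + entry[0]
        w = entry
    return "".join(parts)


def decompress_then_search(codes, pattern):
    """Baseline: fully decompress, then report every occurrence of pattern."""
    text = lzw_decompress(codes)
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def is_simple_pattern(pattern):
    """A 'simple pattern' has no symbol appearing more than once."""
    return len(set(pattern)) == len(pattern)


codes = lzw_compress("abracadabra abracadabra")
print(decompress_then_search(codes, "abra"))
```

A compressed-domain method aims to report the same positions while scanning the code sequence directly, without materializing the decompressed text.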
Approximate pattern match using the Burrows-Wheeler transform
Proceedings of Data Compression Conference, 2003
Cited by 2 (1 self)
Abstract. The compressed pattern matching problem is to locate the occurrence(s) of a pattern P in a text string T using a compressed representation of T, with minimal (or no) decompression. In this paper, we consider approximate pattern matching directly on BWT compressed text. The BWT provides a lexicographic ordering of the input text as part of its inverse transformation process. Based on this observation, pattern matching is performed by text prefiltering, using a fast q-gram intersection of segments from the pattern P and the text T. Algorithms are proposed that solve the k-mismatch problem in $O(\min\{m(m-k)|\Sigma|^{k}\log\frac{u}{|\Sigma|},\ mu\log\frac{u}{|\Sigma|}\})$ time in the worst case, and the k-approximate matching problem in $O(|\Sigma|\log|\Sigma| + \frac{m^{2}}{k}\log\frac{u}{|\Sigma|} + \alpha k)$ time on average ($\alpha \le u$), where $u = |T|$ is the size of the text, $m = |P|$ is the size of the pattern, and $\Sigma$ is the symbol alphabet. Each algorithm requires $O(u)$ auxiliary arrays, which are constructed in $O(u)$ time and space.
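The prefilter-then-verify structure described here can be illustrated on plain text with a generic pigeonhole filter. This is a swapped-in technique, not the paper's BWT-based q-gram intersection: if the pattern is cut into k + 1 pieces, any occurrence with at most k mismatches must contain at least one piece exactly. The function name `k_mismatch` is an assumption.

```python
def k_mismatch(text, pattern, k):
    """Return start positions where pattern matches text with <= k mismatches."""
    m, u = len(pattern), len(text)
    if m == 0 or m > u:
        return []
    if k >= m:
        return list(range(u - m + 1))  # every window qualifies

    # Cut the pattern into k + 1 non-empty pieces (possible because m > k).
    bounds = [i * m // (k + 1) for i in range(k + 2)]

    # Filtering: an exact hit of any piece proposes one candidate alignment.
    candidates = set()
    for start, end in zip(bounds, bounds[1:]):
        piece = pattern[start:end]
        j = text.find(piece)
        while j != -1:
            pos = j - start
            if 0 <= pos <= u - m:
                candidates.add(pos)
            j = text.find(piece, j + 1)

    # Verification: count mismatches only at the candidate alignments.
    return sorted(
        pos for pos in candidates
        if sum(a != b for a, b in zip(text[pos:pos + m], pattern)) <= k
    )


print(k_mismatch("the cat sat on the mat", "cat", 1))
```

The filter pays off when few alignments survive it, since the mismatch count is then computed at only a small fraction of the text positions.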
COMPRESSED PATTERN MATCHING FOR TEXT AND IMAGES
2005
Cited by 1 (1 self)
The amount of information that we deal with today is being generated at an ever-increasing rate. On one hand, data compression is needed to store and organize the data efficiently and to transport it over limited-bandwidth networks. On the other hand, efficient information retrieval is needed to quickly find the relevant information in this huge mass of data using available resources. The compressed pattern matching problem can be stated as: given the compressed format of a text or an image and a pattern string or a pattern image, report the occurrence(s) of the pattern in the text or image with minimal (or no) decompression. The main advantages of compressed pattern matching over the naïve decompress-then-search approach are as follows. First, reduced storage cost: since the data need not be decompressed, or only minimal decompression is required, disk space and memory costs are reduced. Second, less search time: since the compressed data is smaller than the original data, a search performed on the compressed data takes less time. The challenge of efficient compressed pattern matching can be met from two inseparable
Compressed pattern matching for predictive lossless image encoding
Proceeding of Distributed Multimedia Systems, 2003
Cited by 1 (1 self)
Pattern matching in the compressed image domain is a new topic in computer science. Much work has been reported on pattern matching for compressed text and for lossy compressed images. However, searching images in the lossless compressed domain is almost a blank area and needs to be explored. Lossless image compression is widely used in areas such as medical images, satellite images, geometric images and many others that need to maintain the image data losslessly. Being able to search in the compressed domain will save disk space and searching time and bring considerable economic savings in these areas. In our work, we have studied the possibility of compressed pattern matching for the three most popular lossless image compression schemes: lossless JPEG, CALIC and JPEG-LS. Our study indicates that these algorithms can be made search-aware with minor modification. We also present a modified JPEG-LS algorithm and the corresponding searching algorithm. Experimental results show that our method, compared with the “decompress-then-search” method, improves searching time by nearly 30% for most natural images. The modified JPEG-LS algorithm also has shorter encoding and decoding times, with improvements of about 12-15% and 8-12%, respectively, for most natural images. The trade-off is a decrease in compression of about 2%-8%. To the best of our knowledge, this is the first report of a JPEG-LS compressed matching algorithm and the first “competitive” compressed pattern matching algorithm for lossless image compression.
Circular Pattern Discovery
2013. Advance Access publication on 2 March 2014. doi:10.1093/comjnl/bxu009
Given a text or database T, the circular pattern discovery (CPD) problem is to identify ‘interesting’ circular patterns in T. Here, no specific input pattern is provided, and what is interesting is typically defined in terms of constraints in the search. We propose two algorithms for the CPD problem. The first algorithm uses suffix trees and suffix links to solve the exact CPD problem in $O(m_2^2 N)$ time, where $m_2$ is the maximum length of the circular patterns and $N$ is the total length of the sequence database. The second algorithm uses suffix arrays to solve the more challenging approximate CPD (ACPD) problem in $O(k m_2^2 N^2)$ time in the worst case, and $O(k m_2^2 N)$ on average, where $k$ is the maximum allowed error(s). By exploiting the nature of the ACPD problem, the complexity is reduced to $O(m_2^2 N^2)$ time in the worst case, and $O(m_2^2 N)$ on average.
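To make the notion of a circular pattern concrete, the following naive sketch groups all length-m windows of a text by their lexicographically smallest rotation, so that windows that are rotations of one another fall into the same class. It is an illustration only, not the suffix-tree or suffix-array algorithms of the paper. The names `canonical_rotation` and `circular_patterns` are hypothetical, and treating a minimum occurrence count as the 'interesting' constraint is an assumption.

```python
from collections import defaultdict


def canonical_rotation(s):
    """Smallest rotation of s, used as the representative of its circular class."""
    return min(s[i:] + s[:i] for i in range(len(s)))


def circular_patterns(text, m, min_count):
    """Map each circular class of length m to its window start positions.

    Only classes with at least min_count windows are kept.
    ASSUMPTION: frequency is the constraint that makes a pattern interesting.
    """
    if m <= 0:
        return {}
    groups = defaultdict(list)
    for i in range(len(text) - m + 1):
        groups[canonical_rotation(text[i:i + m])].append(i)
    return {p: pos for p, pos in groups.items() if len(pos) >= min_count}


print(circular_patterns("abcxbcaycab", 3, 3))
```

Overlapping windows inside a single occurrence are often rotations of one another, so a practical discovery procedure would likely add further constraints, such as requiring non-overlapping occurrences.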
Approximate Pattern Matching Over the Burrows-Wheeler Transformed Text
2002
The compressed pattern matching problem is to locate the occurrence(s) of a pattern P in a text string T using a compressed representation of T, with minimal (or no) decompression. In this paper, we consider approximate pattern matching directly on Burrows-Wheeler transformed (BWT) text, which is a critical step for a fully compressed pattern matching algorithm on a BWT-based compression algorithm.