Models of Bitmap Generation: A Systematic Approach to Bitmap Compression
- Inf. Proc. & Management, v28
, 1992
Cited by 5 (2 self)
In large IR systems, information about word occurrence may be stored in the form of a bit matrix, with rows corresponding to different words and columns to documents. Such a matrix is generally very large and very sparse. New methods for compressing such matrices are presented, which exploit possible correlations between rows and between columns. The methods are based on partitioning the matrix into small blocks and predicting the 1-bit distribution within a block by means of various bit generation models. Each block is then encoded using Huffman or arithmetic coding. The methods also use a new way of enumerating subsets of fixed size from a given superset. Preliminary experimental results indicate improvements over previous methods. 1. Introduction The common approach to processing complex boolean queries in large full-text document retrieval systems is to use inverted files: a concordance is accessed via a dictionary, and includes for each different word of the text, the ordered list ...
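The abstract mentions enumerating subsets of fixed size from a given superset but does not describe the paper's new scheme. As a rough illustration of the general idea only, the sketch below uses the classical combinatorial number system (not the paper's method) to map the positions of the k 1-bits in an n-bit block to a single integer and back. The function names `rank_subset` and `unrank_subset` are hypothetical.

```python
from math import comb

def rank_subset(positions):
    """Map a set of distinct 0-based bit positions to its rank in
    colexicographic order (classical combinatorial number system)."""
    return sum(comb(p, i + 1) for i, p in enumerate(sorted(positions)))

def unrank_subset(rank, k):
    """Inverse of rank_subset: recover the k sorted positions from a rank."""
    positions = []
    for i in range(k, 0, -1):
        p = i - 1
        # Find the largest p such that comb(p, i) <= rank.
        while comb(p + 1, i) <= rank:
            p += 1
        positions.append(p)
        rank -= comb(p, i)
    return positions[::-1]

# An 8-bit block with 1-bits at positions 1, 4 and 6.
ones = [1, 4, 6]
r = rank_subset(ones)
restored = unrank_subset(r, len(ones))
```

Under this scheme a block with k ones out of n bits needs only an index below C(n, k); for n = 8 and k = 3 that is one of 56 values, which fits in 6 bits rather than 8.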
Using Fibonacci Compression Codes as Alternatives to Dense Codes
Cited by 2 (1 self)
Recent publications advocate the use of various variable-length codes for which each codeword consists of an integral number of bytes in compression applications using large alphabets. This paper shows that another tradeoff with similar properties can be obtained by Fibonacci codes. These are fixed codeword sets, using binary representations of integers based on Fibonacci numbers of order m ≥ 2. Fibonacci codes have been used before, and this paper extends previous work, presenting several novel features. In particular,
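For readers unfamiliar with Fibonacci codes, here is a minimal sketch of the standard order-2 code only (m = 2). Each positive integer is written as a sum of non-consecutive Fibonacci numbers, least significant position first, and a final 1 is appended so that every codeword ends in "11". The higher-order codes and the novel features the abstract refers to are not covered, and the function names `fib_encode` and `fib_decode` are hypothetical.

```python
def fib_encode(n):
    """Return the order-2 Fibonacci codeword of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Fibonacci codes are defined for positive integers")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number greater than n
    bits = []
    # Greedy (Zeckendorf) decomposition, largest Fibonacci number first.
    for f in reversed(fibs):
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    # Emit the smallest weight first, then the terminating 1.
    return "".join(reversed(bits)) + "1"

def fib_decode(stream):
    """Decode a concatenation of order-2 Fibonacci codewords into integers."""
    values, total, prev_one = [], 0, False
    a, b = 1, 2  # weights of the current and the next bit position
    for bit in stream:
        if bit == "1" and prev_one:
            # Two consecutive 1s mark the end of a codeword.
            values.append(total)
            total, prev_one, a, b = 0, False, 1, 2
            continue
        if bit == "1":
            total += a
        prev_one = bit == "1"
        a, b = b, a + b
    return values

encoded = "".join(fib_encode(v) for v in [1, 2, 11])
decoded = fib_decode(encoded)
```

Because "11" never occurs inside a codeword, the codewords can be concatenated without separators and still be split unambiguously.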
Comparative Study between Various Algorithms of Data Compression Techniques
Cited by 2 (0 self)
The spread of computing has led to an explosion in the volume of data to be stored on hard disks and sent over the Internet. This growth has led to a need for "data compression", that is, the ability to reduce the amount of storage or Internet bandwidth required to handle this data. This paper provides a survey of data compression techniques. The focus is on the most prominent data compression schemes, particularly on popular .DOC, .TXT, .BMP, .TIF, .GIF, and .JPG files. Using different compression algorithms, we obtain results and, based on them, suggest the most efficient algorithm for each type of file to be compressed, taking into consideration both the compression ratio and the compressed file size.
DNA compression challenge revisited
- IN PROC. CPM-2005 COMBINATORIAL PATTERN MATCHING, JEJU ISLAND, KOREA
, 2005
Cited by 2 (0 self)
Standard compression algorithms are not able to compress DNA sequences. Recently, new algorithms have been introduced specifically for this purpose, often using detection of long approximate repeats. In this paper, we present another algorithm, DNAPack, based on dynamic programming. Compared with existing programs, it compresses DNA slightly better, while the cost of dynamic programming is almost negligible.
The Responsa Storage and Retrieval System - Whither?
, 1996
p. 173). We did develop such a tool [CCDFS1971]. As each of these methods has certain advantages and disadvantages, we ended up by merging them into a joint analysis-synthesis method; a global analysis of all words in the database is done, but without prepositions (otiyot shimush), in order to end up with a database of manageable size; the prepositions are left to the synthesis phase. See [AFCS1972] for full details. I also set up a "Committee for the Mechanization in Jewish Law Research" whose first members were, I think, Dr. Choueka, Mr. Asa Kasher, later professor of Philosophy at Tel Aviv University, Mr. Joseph Dueck, a young lawyer and research assistant at the IRJL, who served as their representative, and assistants, to formulate procedures for preediting and postediting texts to be inputted, and various algorithms needed for the work. (Many other persons, such as Mr. Reuven Mirkin of the Academy of the Hebrew Language, and research students, joined later.) I also felt ...
Bioinformatics
, 2003
Selection of significant genes via expression patterns is an important problem in microarray experiments. Owing to small sample size and the large number of variables (genes), the selection process can be unstable. This paper proposes a hierarchical Bayesian model for gene (variable) selection. We employ latent variables to specialize the model to a regression setting and use a Bayesian mixture prior to perform the variable selection. We control the size of the model by assigning a prior distribution over the dimension (number of significant genes) of the model. The posterior distributions of the parameters are not in explicit form and we need to use a combination of truncated sampling and Markov Chain Monte Carlo (MCMC) based computation techniques to simulate the parameters from the posteriors. The Bayesian model is flexible enough to identify significant genes as well as to perform future predictions. The method is applied to cancer classification via cDNA microarrays where the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the method is used to identify a set of significant genes. The method is also applied successfully to the leukemia data.
The Burrows-Wheeler compression algorithm is even better than what you have thought
, 2005
The best compression algorithm today for English text is based on the Burrows-Wheeler transform. This algorithm (whose common implementation is bzip2) consists of the following three essential steps: 1) Obtain the Burrows-Wheeler transform of the text, 2) Convert the transform into a sequence of integers using the move-to-front algorithm, 3) Encode the integers using an arithmetic code or any order-0 encoding (possibly with run-length encoding). In this paper we achieve a strong bound on the worst-case compression ratio of this algorithm that is significantly better than bounds known to date and is obtained via simple analytical techniques. Specifically, for any input string $s$ and any $\mu > 1$, the length of the compressed string is bounded by $\mu \cdot |s| H_k(s) + \log(\zeta(\mu)) \cdot |s| + g_k$, where $H_k$ is the $k$-th order empirical entropy, $g_k$ is a constant depending only on $k$ and on the size of the alphabet, and $\zeta(\mu) = \frac{1}{1^\mu} + \frac{1}{2^\mu} + \cdots$ is the standard zeta function. In fact we prove a stronger result: this bound, without the additive term $g_k$, holds when we replace $H_k(s)$ by the sum of the logarithms of the integers obtained by the move-to-front encoding of the transform. This refined bound is tight and close to the actual compression achieved in practice. To obtain this result we prove a tight result on the compressibility of integer sequences, which is of independent interest.
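The first two steps named in the abstract, and the sum-of-logarithms quantity used in the refined bound, can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's construction: the transform is computed by naively sorting all rotations, a sentinel character is appended to mark the end of the text, and move-to-front ranks are taken as 0-based so the sum uses log2(rank + 1). The paper's exact integer convention is not given in the abstract. The function names are hypothetical.

```python
from math import log2

def bwt(text, sentinel="\0"):
    """Burrows-Wheeler transform via sorted rotations (quadratic, illustration only).
    Assumes the sentinel does not occur in text and sorts below its characters."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def move_to_front(seq):
    """Replace each symbol by its current index in a list that moves
    the most recently seen symbol to the front."""
    alphabet = sorted(set(seq))
    ranks = []
    for ch in seq:
        idx = alphabet.index(ch)
        ranks.append(idx)
        alphabet.insert(0, alphabet.pop(idx))
    return ranks

def sum_of_logs(ranks):
    """Sum of log2(rank + 1) over 0-based move-to-front ranks (assumed convention)."""
    return sum(log2(r + 1) for r in ranks)

transformed = bwt("banana", sentinel="$")
ranks = move_to_front(transformed)
cost = sum_of_logs(ranks)
```

Runs of equal symbols in the transform become runs of zeros after move-to-front, and zeros contribute nothing to the sum, which is the intuition for why this quantity tracks the compressed size.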
Modular Data Compression to Optimally Locate Regular Segments in Sequences. Application to DNA Sequence Analysis
A new location method for regular segments in sequences is presented. It uses the Minimum Description Length (MDL) criterion. If a lossless compressor achieves size reduction by exploiting a regularity, our algorithm TurboOptLift very quickly locates the segments where the regularity is probably present and those where it is not. The location is optimal from an MDL viewpoint. We apply the method to the problem of locating approximate tandem repeats in DNA sequences.
Theoretical Bioinformatic (815)
Long direct repeats in genomes arise from molecular duplication mechanisms like retrotransposition, copy of genes, exon shuffling,... Their study in a given sequence reveals its internal repeat structure as well as part of its evolutionary history. Moreover, detailed knowledge about the mechanisms can be gained from a systematic investigation of repeats. The problem of finding such repeats is viewed as an NP-complete problem of the optimal compression of a sequence thanks to the encoding of its exact repeats. The repeats chosen for compression must not overlap each other as do the repeats which result from molecular duplications. We present a new heuristic algorithm, Search Repeats, where the selection of exact repeats is guided by two biologically sound criteria: their length and the absence of overlap between those repeats. Search Repeats detects approximate repeats, as clusters of exact sub-repeats, and points out large insertions/deletions in them. Search Repeats takes only 3 seconds of CPU time for the genome of Haemophilus influenzae on a Sun Ultrasparc workstation.

