Bytecode Compression via Profiled Grammar Rewriting
2001
Cited by 24 (3 self)
This paper describes the design and implementation of a method for producing compact, bytecoded instruction sets and interpreters for them. It accepts a grammar for programs written using a simple bytecoded stack-based instruction set, as well as a training set of sample programs. The system transforms the grammar, creating an expanded grammar that represents the same language as the original grammar, but permits a shorter derivation of the sample programs and others like them. A program's derivation under the expanded grammar forms the compressed bytecode representation of the program. The interpreter for this bytecode is automatically generated from the original bytecode interpreter and the expanded grammar. Programs expressed using compressed bytecode can be substantially smaller than their original bytecode representation and even their machine code representation. For example, compression cuts the bytecode for lcc from 199KB to 58KB but increases the size of the interpreter by just over 11KB.
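The net effect of the figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and "just over 11KB" is rounded down to 11KB as an assumption.

```python
# Figures quoted in the abstract (KB); the interpreter growth is rounded
# down from "just over 11KB", which is an assumption of this sketch.
original_kb = 199
compressed_kb = 58
interpreter_growth_kb = 11

# Ratio of compressed to original bytecode size.
bytecode_ratio = compressed_kb / original_kb

# Net footprint once the larger interpreter is charged to the program.
net_kb = compressed_kb + interpreter_growth_kb
net_saving_kb = original_kb - net_kb

print(f"bytecode ratio: {bytecode_ratio:.2f}")
print(f"net size: {net_kb}KB, net saving: {net_saving_kb}KB")
```

Even with the interpreter growth included, the compressed form stays well below the original size under these rounded figures.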
Frequent Pattern Compression: A Significance-Based Compression Scheme for L2 Caches
2004
Cited by 14 (2 self)
With the widening gap between processor and memory speeds, memory system designers may find cache compression beneficial to increase cache capacity and reduce off-chip bandwidth. Most hardware compression algorithms fall into the dictionary-based category, which depends on building a dictionary and using its entries to encode repeated data values. Such algorithms are effective in compressing large data blocks and files. Cache lines, however, are typically short (32-256 bytes), and a per-line dictionary imposes a significant overhead that limits the compressibility and increases the decompression latency of such algorithms. For such short lines, significance-based compression is an appealing alternative. We propose and evaluate a simple significance-based compression scheme that has a low compression and decompression overhead. This scheme, Frequent Pattern Compression (FPC), compresses individual cache lines on a word-by-word basis by storing common word patterns in a compressed format accompanied by an appropriate prefix. For a 64-byte cache line, compression can be completed in three cycles and decompression in five cycles, assuming 12 FO4 gate delays per cycle. We propose a compressed cache design in which data is stored in compressed form in the L2 caches but uncompressed in the L1 caches. L2 cache lines are compressed to predetermined sizes that never exceed their original size to reduce decompression overhead. This simple scheme provides comparable compression ratios to more complex schemes that have higher cache hit latencies.
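A minimal sketch of word-by-word, significance-based encoding follows. The pattern set, the 3-bit prefix width, and the function names are assumptions chosen for illustration; the abstract does not list the paper's actual pattern table.

```python
PREFIX_BITS = 3   # assumed prefix width, not taken from the abstract
WORD_BITS = 32

def fits_signed(word, bits):
    """True if the 32-bit word is the sign extension of a `bits`-bit value."""
    signed = word - (1 << WORD_BITS) if word & 0x80000000 else word
    return -(1 << (bits - 1)) <= signed < (1 << (bits - 1))

def classify(word):
    """Return (pattern_name, payload_bits) for one 32-bit word."""
    if word == 0:
        return "zero", 0
    for bits in (4, 8, 16):
        if fits_signed(word, bits):
            return f"sign-extended {bits}-bit", bits
    if word == (word & 0xFF) * 0x01010101:
        return "repeated byte", 8
    return "uncompressed", WORD_BITS

def compressed_bits(words):
    """Size of a line in bits, falling back to the raw size if encoding grows it."""
    encoded = sum(PREFIX_BITS + classify(w)[1] for w in words)
    return min(encoded, WORD_BITS * len(words))

# A hypothetical 64-byte line: twelve zero words plus four mixed words.
line = [0] * 12 + [0xFFFFFFFF, 0x7F, 0x41414141, 0x12345678]
print(compressed_bits(line), "bits instead of", WORD_BITS * len(line))
```

The fallback in `compressed_bits` models the abstract's statement that compressed lines never exceed their original size; how the paper rounds sizes to its predetermined segments is not modeled here.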
A Robust Main-Memory Compression Scheme
In Proceedings of the 32nd Annual International Symposium on Computer Architecture, 2005
Cited by 10 (1 self)
Lossless data compression techniques can potentially free up more than 50% of the memory resources. However, previously proposed schemes suffer from high access costs. The proposed main-memory compression scheme practically eliminates performance losses of previous schemes by exploiting a simple and yet effective compression scheme, a highly efficient structure for locating a compressed block in memory, and a hierarchical memory layout that allows compressibility of blocks to vary with a low fragmentation overhead. We have evaluated an embodiment of the proposed scheme in detail using 14 integer and floating point applications from the SPEC2000 suite along with two server applications and we show that the scheme robustly frees up 30% of the memory resources, on average, with a negligible impact on the performance of only
An Instruction for Direct Interpretation of LZ77-compressed Programs
Cited by 7 (0 self)
A new instruction adapts LZ77 compression for use inside running programs. The instruction economically references and reuses code fragments that are too small to package as conventional subroutines. The compressed code is interpreted directly, with neither prior nor on-the-fly decompression. Hardware implementations seem plausible and could benefit both memory-constrained and more conventional systems. The method is extremely simple. It has been added to a pre-existing, bytecoded instruction set, and it added only ten lines of C to the bytecode interpreter. It typically cuts code size by a third; that is, typical compression ratios are roughly 0.67x. More ambitious compressors are available, but they are more complex, which retards adoption. The current method offers a useful trade-off to these more complex systems.
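The idea of interpreting LZ77-style back-references directly can be sketched with a toy stack machine. The instruction name `ECHO`, its `(offset, length)` operands, and the tuple encoding are hypothetical choices for this sketch, not the paper's instruction format.

```python
def run(program):
    """Interpret a list of instruction tuples and return the final stack."""
    stack = []

    def execute(start, count):
        # Run `count` instructions beginning at index `start`.
        pc = start
        for _ in range(count):
            op = program[pc]
            if op[0] == "PUSH":
                stack.append(op[1])
            elif op[0] == "ADD":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op[0] == "MUL":
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            elif op[0] == "ECHO":
                # Re-execute `length` earlier instructions in place,
                # starting `offset` instructions back; nothing is copied
                # or decompressed.
                _, offset, length = op
                execute(pc - offset, length)
            else:
                raise ValueError(f"unknown instruction {op!r}")
            pc += 1

    execute(0, len(program))
    return stack

original = [("PUSH", 2), ("PUSH", 3), ("ADD",),
            ("PUSH", 2), ("PUSH", 3), ("ADD",), ("MUL",)]
compressed = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("ECHO", 3, 3), ("MUL",)]

print(run(original), run(compressed))
```

Here the repeated three-instruction fragment is too small to be worth a subroutine call, but a single back-reference replaces it, and the interpreter needs only one extra case to support that.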
Using Compression to Improve Chip Multiprocessor Performance
2006
Cited by 7 (2 self)
Chip multiprocessors (CMPs) combine multiple processors on a single die, typically with private level-one caches and a shared level-two cache. However, the increasing number of processor cores on a single chip increases the demand on two critical resources: the shared L2 cache capacity and the off-chip pin bandwidth. Demand on these critical resources is further exacerbated by latency-hiding techniques such as hardware prefetching. In this dissertation, we explore using compression to effectively increase cache and pin bandwidth resources and ultimately CMP performance. We identify two distinct and complementary designs where compression can help improve CMP performance: Cache Compression and Link Compression. Cache compression stores compressed lines in the cache, potentially increasing the effective cache size, reducing off-chip misses and improving performance. On the downside, decompression overhead can slow down cache hit latencies, possibly degrading performance. Link (i.e., off-chip interconnect) compression compresses communication messages before sending to or receiving from off-chip system components, thereby increasing the effective off-chip pin bandwidth, reducing contention and improving performance for bandwidth-limited configurations. While compression can have a positive impact on CMP performance, practical implementations of compression
Design and Performance of Compressed Interconnects for High Performance Servers
Cited by 4 (0 self)
As microprocessors scale rapidly in frequency, the design of fast and efficient interconnects becomes extremely important for low latency data access and high performance. Furthermore, in a multiprocessor configuration, the width of the shared interconnect can pose a significant hurdle in terms of design complexity, cost, and achievable interconnect frequency. In this paper, we evaluate a technique for reducing the interconnect width by exploiting the spatial and temporal locality in communication transfers (addresses & data). The width reduction implies a number of other advantages including higher operating frequency, reduced pin-count, lower chip & board cost, etc. We evaluate the effectiveness of the proposed scheme by performing trace-driven simulations for two well-known commercial server workloads (SPECweb99 and TPC-C). We also study the sensitivity of the compression hit ratio with respect to the number of bits compressed, size of the encoding/decoding table used and the replacement policy. The results indicate that the proposed technique has a potential to reduce address bus width in most cases and data bus widths in some cases while maintaining equal or better performance than in the uncompressed case.
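One way to picture table-based width reduction is a pair of synchronized tables that cache recently seen high-order address bits, so that a repeat sends only a short index plus the low-order bits. The sketch below assumes an LRU replacement policy and invented class names; the abstract does not specify the paper's encoding, table organization, or policy.

```python
class UpperBitsTable:
    """Small LRU table of recently seen high-order bit patterns."""

    def __init__(self, entries, low_bits):
        self.entries = entries
        self.low_bits = low_bits
        self.table = []  # most recently used first

    def _touch(self, high):
        # Move `high` to the front and evict the least recently used entry.
        if high in self.table:
            self.table.remove(high)
        self.table.insert(0, high)
        del self.table[self.entries:]


class Encoder(UpperBitsTable):
    def encode(self, addr):
        high = addr >> self.low_bits
        low = addr & ((1 << self.low_bits) - 1)
        if high in self.table:
            msg = ("hit", self.table.index(high), low)   # narrow transfer
        else:
            msg = ("miss", high, low)                    # full-width transfer
        self._touch(high)
        return msg


class Decoder(UpperBitsTable):
    def decode(self, msg):
        kind, value, low = msg
        high = self.table[value] if kind == "hit" else value
        self._touch(high)  # same update as the encoder keeps both in sync
        return (high << self.low_bits) | low


enc, dec = Encoder(entries=4, low_bits=12), Decoder(entries=4, low_bits=12)
addresses = [0x12345678, 0x12345ABC, 0x9000F000, 0x12345000]
messages = [enc.encode(a) for a in addresses]
decoded = [dec.decode(m) for m in messages]
print([m[0] for m in messages], decoded == addresses)
```

The hit ratio of such a table depends on the number of high-order bits cached, the table size, and the replacement policy, which are the sensitivities the abstract says were studied.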
Compressibility Characteristics of Address/Data Transfers in Commercial Workloads
2002
Cited by 3 (0 self)
In this paper, we evaluate the compressibility of address and data transfers in commercial servers. Our proposed compression scheme is geared towards improving the efficiency of the transfer medium (buses, links, etc.) and increasing the performance of the system. We evaluate the potential of the basic compression techniques for two commercial workloads -- SPECweb99 [21] and TPC-C [22] -- based on trace-driven simulations. Based on the obtained results, we show that simple compression schemes show significant promise for reducing address bus width and moderate benefits for data bus width reduction. We also show the sensitivity of these performance benefits to the number of bits compressed and the size of the encoding/decoding table used. Additionally, we propose enhancements to the compression schemes based on (1) recognizing and utilizing data-type-specific knowledge and (2) improving the replacement policy of the encoding/decoding table. The performance benefits of bus compression schemes with these enhancements are also presented and analyzed.
A Dynamically Partitionable Compressed Cache
In Proceedings of the Singapore-MIT Alliance Symposium, 2003
Cited by 3 (0 self)
The effective size of an L2 cache can be increased by using a dictionary-based compression scheme. Naive application of this idea performs poorly since the data values in a cache greatly vary in their “compressibility.” The novelty of this paper is a scheme that dynamically partitions the cache into sections of different compressibilities. While compression is often researched in the context of a large stream, in this work it is applied repeatedly on smaller cache-line sized blocks so as to preserve the random access requirement of a cache. When a cache line is brought into the L2 cache or the cache line is to be modified, the line is compressed using a dynamic LZW dictionary. Depending on the compression, it is placed into the relevant partition. The partitioning is dynamic in that the ratio of space allocated to compressed and uncompressed varies depending on the actual performance. Certain SPEC-2000 benchmarks using a compressed L2 cache show an 80% reduction in L2 miss-rate when compared to using an uncompressed L2 cache of the same area, taking into account all area overhead associated with the compression circuitry. For other SPEC-2000 benchmarks, the compressed cache performs as well as a traditional cache that is 4.3 times as large as the compressed cache in terms of hit rate. The adaptivity ensures that, in terms of miss rates, the compressed cache never performs worse than a traditional cache.
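The per-line compress-then-place step can be sketched with textbook LZW. The 12-bit code width, the two-way partition choice, and the function names are assumptions for illustration; the paper's dictionary management and partition sizes are not described in the abstract.

```python
CODE_BITS = 12  # assumed fixed code width, not taken from the abstract

def lzw_compress(data):
    """Textbook LZW over bytes; returns a list of integer codes."""
    table = {bytes([i]): i for i in range(256)}
    codes, current = [], b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            table[candidate] = len(table)
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes

def lzw_decompress(codes):
    """Inverse of lzw_compress, rebuilding the dictionary on the fly."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            entry = previous + previous[:1]  # code defined by this very step
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)

def choose_partition(line):
    """Place a line in the compressed partition only if LZW shrinks it."""
    compressed_bits = len(lzw_compress(line)) * CODE_BITS
    return "compressed" if compressed_bits < len(line) * 8 else "uncompressed"

zero_line = bytes(64)            # highly compressible
mixed_line = bytes(range(64))    # no repeated bytes at all
print(choose_partition(zero_line), choose_partition(mixed_line))
```

Compressing each 64-byte line independently, as above, keeps lines individually addressable, which is the random-access requirement the abstract mentions.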
Zero-content augmented caches
In ICS ’09: Proceedings of the 23rd annual International Conference on Supercomputing, 2009
Cited by 2 (1 self)
It has been observed that some applications manipulate large amounts of null data. Moreover, these zero data often exhibit high spatial locality. In some applications, more than 20% of the data accesses concern null data blocks. Representing a null block in a cache on a standard cache line appears to be a waste of resources. In this paper, we propose the Zero-Content Augmented cache, the ZCA cache. A ZCA cache consists of a conventional cache augmented with a specialized cache for memorizing null blocks, the Zero-Content cache or ZC cache. In the ZC cache, the data block is represented by its address tag and a validity bit. Moreover, as null blocks generally exhibit high spatial locality, several null blocks can be associated with a single address tag in the ZC cache. For instance, a ZC cache mapping 32MB of zero 64-byte lines uses less than 80KB of storage. Decompression of a null block is very simple; therefore, read access time on the ZCA cache is in the same range as that of a conventional cache. On applications manipulating large amounts of null data blocks, such a ZC cache can significantly reduce the miss rate and memory traffic, and therefore increase performance for a small hardware overhead. In particular, the write-back traffic on null blocks is limited. For applications with a low null block rate, no performance loss is observed.
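The 80KB figure can be sanity-checked with simple arithmetic. The sketch below assumes one validity bit per null block, as the abstract describes, and additionally assumes 64 blocks share each address tag; that sharing factor is a hypothetical value chosen for illustration, not one stated in the abstract.

```python
KB = 1024
MB = 1024 * KB

mapped_bytes = 32 * MB      # zero data mapped by the ZC cache (from the abstract)
line_bytes = 64             # cache line size (from the abstract)
budget_kb = 80              # storage bound quoted in the abstract

# One validity bit per mapped 64-byte block.
null_lines = mapped_bytes // line_bytes
validity_kb = null_lines / 8 / KB

# Assumption: 64 consecutive blocks share one address tag.
blocks_per_tag = 64
entries = null_lines // blocks_per_tag

# Bits left per entry for the tag if the total must stay within the budget.
tag_budget_bits = (budget_kb - validity_kb) * KB * 8 / entries

print(null_lines, validity_kb, entries, tag_budget_bits)
```

Under these assumptions the validity bits alone take 64KB, so the tags must fit in the remaining space, which is only possible because many null blocks share a single tag.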
Selective Main Memory Compression by Identifying Program Phase Changes
2004
Cited by 1 (0 self)
During a program's runtime, the stack and data segments of the main memory often contain much redundancy, which makes them good candidates for compression. Compression and decompression, however, require either extra hardware or substantial processing resources. This paper presents a new approach in which a mostly software solution is suggested, but without the processing power penalty that usually accompanies such a solution. This is achieved by not compressing all the memory all the time.

