A Comparison of Compiler Tiling Algorithms
, 1999
Cited by 50 (8 self)
Linear algebra codes contain data locality which can be exploited by tiling multiple loop nests. Several approaches to tiling have been suggested for avoiding conflict misses in low associativity caches. We propose a new technique based on intra-variable padding and compare its performance with existing techniques. Results show padding improves performance of matrix multiply by over 100% in some cases over a range of matrix sizes. Comparing the efficacy of different tiling algorithms, we discover rectangular tiles are slightly more efficient than square tiles. Overall, tiling improves performance by 0-250%. Copying tiles at run time proves to be quite effective.
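As a rough illustration of the loop tiling this abstract refers to, the sketch below multiplies two square matrices block by block so that each tile is reused while it is still likely to be in cache. It is a minimal sketch only: the function names `matmul_tiled` and `matmul_naive` and the default tile size are assumptions made for illustration, and it does not implement the intra-variable padding or tile copying studied in the paper.

```python
def matmul_naive(a, b, n):
    """Reference n x n matrix multiply on lists of lists."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile=4):
    """Same product, computed one tile x tile block at a time.

    The tile size is a hypothetical tuning parameter; a real compiler
    would choose it from the cache size and associativity.
    """
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):              # tile origin in rows of C
        for kk in range(0, n, tile):          # tile origin along the shared dimension
            for jj in range(0, n, tile):      # tile origin in columns of C
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

The `min(...)` bounds handle matrix sizes that are not a multiple of the tile size. Rectangular tiles, which the abstract reports as slightly more efficient, would use separate tile extents per dimension instead of the single `tile` value assumed here.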
Monet. An Impressionist Sketch of an Advanced Database System
- In Proc. IEEE BIWIT workshop
, 1994
Cited by 50 (13 self)
Monet is a customizable database system developed at CWI and University of Amsterdam, intended to be used as the database backend for widely varying application domains. It is designed to get maximum database performance out of today's workstations and multiprocessor systems. It has already achieved considerable success in supporting a Data Mining application [12, 13], and work is well under way in a project where it is used in a high-end GIS application. Monet is a type- and algebra-extensible database system and employs shared memory parallelism. In this paper, we give the goals and motivation of Monet, and outline its architectural features, including its use of the Decomposed Storage Model (DSM), emphasis on bulk operations, use of main virtual-memory and server customization. As a case example, we discuss some issues on how to build a GIS on top of Monet; amongst others how Monet can handle the very large data volumes involved. Parts of this work are supported by SION grant no. ...
Fusion of Loops for Parallelism and Locality
- IEEE Transactions on Parallel and Distributed Systems
, 1995
Cited by 50 (3 self)
Loop fusion improves data locality and reduces synchronization in data-parallel applications. However, loop fusion is not always legal. Even when legal, fusion may introduce loop-carried dependences which reduce parallelism. In addition, performance losses result from cache conflicts in fused loops. We present new, systematic techniques which: (1) allow fusion of loop nests in the presence of fusion-preventing dependences, (2) allow parallel execution of fused loops with minimal synchronization, and (3) eliminate cache conflicts in fused loops. We evaluate our techniques on a 56-processor KSR2 multiprocessor, and show improvements of up to 20% for representative loop nest sequences. The results also indicate a performance tradeoff as more processors are used, suggesting careful evaluation of the profitability of fusion. 1 Introduction The performance of data-parallel applications on cache-coherent shared-memory multiprocessors is significantly affected by data locality and by the cost ...
AccMon: Automatically Detecting Memory-related Bugs via Program Counter-based Invariants
- In 37th International Symposium on Microarchitecture (MICRO)
, 2004
Cited by 47 (10 self)
This paper makes two contributions to architectural support for software debugging. First, it proposes a novel statistics-based, on-the-fly bug detection method called PC-based invariant detection. The idea is based on the observation that, in most programs, a given memory location is typically accessed by only a few instructions. Therefore, by capturing the invariant of the set of PCs that normally access a given variable, we can detect accesses by outlier instructions, which are often caused by memory corruption, buffer overflow, stack smashing or other memory-related bugs. Since this method is statistics-based, it can detect bugs that do not violate any programming rules and that, therefore, are likely to be missed by many existing tools. The second contribution is a novel architectural extension called the Check Look-aside Buffer (CLB). The CLB uses a Bloom filter to reduce monitoring overheads in the recently proposed iWatcher architectural framework for software debugging. The CLB significantly reduces the overhead of PC-based invariant debugging.
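The PC-based invariant idea described in this abstract can be sketched in software as follows. This is a hedged illustration, not AccMon's hardware mechanism: the class name `PCInvariantMonitor`, the explicit training flag, and the choice to ignore addresses never seen during training are all assumptions, and the Check Look-aside Buffer and its Bloom filter are not modelled.

```python
from collections import defaultdict


class PCInvariantMonitor:
    """Learn which program counters (PCs) touch each address, then flag outliers."""

    def __init__(self):
        # addr -> set of PCs observed accessing it during training
        self.allowed = defaultdict(set)
        self.training = True

    def access(self, pc, addr):
        """Process one memory access; return True if it violates the invariant."""
        if self.training:
            self.allowed[addr].add(pc)
            return False
        # Assumption: addresses never seen in training are unmonitored.
        return addr in self.allowed and pc not in self.allowed[addr]


monitor = PCInvariantMonitor()
monitor.access(0x400, 0x1000)   # training: two instructions normally
monitor.access(0x404, 0x1000)   # access the variable at 0x1000
monitor.training = False
suspicious = monitor.access(0x999, 0x1000)  # access from an unexpected PC
```

Because the check is statistical, an outlier is only a hint that something like a buffer overflow may have occurred; a legitimate but rarely executed instruction would be flagged in the same way.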
Examination of a Memory Access Classification Scheme for Pointer-Intensive and Numeric Programs
, 1996
Cited by 40 (0 self)
In recent work, we described a data prefetch mechanism for pointer-intensive and numeric computations, and presented some aggregate measurements on a suite of benchmarks to quantify its performance potential [MH95]. The basis for this device is a simple classification of memory access patterns in programs that we introduced earlier [HM94]. In this paper we take a close look at two codes from our suite, an English parser called Link-Gram, and the circuit simulation program spice2g6, and present a detailed analysis of them in the context of our model. Focusing on just two programs allows us to display a wider range of data, and discuss relevant code fragments extracted from their source distributions. Results from this study provide a deeper understanding of our memory access classification scheme, and suggest additional optimizations for future data prefetch mechanisms. Keywords: CPU architecture, data cache, memory access pattern classification, instruction profiling, memory latency t...
Design issues and tradeoffs for write buffers
- In Proceedings of the Third IEEE Symposium on High Performance Computer Architecture
, 1997
Cited by 39 (3 self)
Processors with write-through caches typically require a write buffer to hide the write latency to the next level of memory hierarchy and to reduce write traffic. A write buffer can cause processor stalls when it is full, when it contends with a cache miss for access to the next level of the hierarchy, and when it contains the freshest copy of data needed by a load. This paper uses instruction-level simulation of SPEC92 benchmarks to investigate how different write buffer depths, retirement policies, and load-hazard policies affect these three types of write-buffer stalls. Deeper buffers with adequate headroom, lazier retirement policies, and the ability to read data directly from the write buffer combine to substantially reduce write-buffer-induced stalls.
Tuning Memory Performance in Sequential and Parallel Programs
- IEEE Computer
, 1995
Cited by 38 (7 self)
Recent architecture and technology trends have led to a significant and increasing gap between processor and main memory speeds. Caches hide these latencies to some extent, but when cache misses are frequent, memory stalls can significantly degrade program execution time. This paper describes MemSpy, a performance monitoring system designed to help identify and fix program memory bottlenecks. The natural interrelationship between memory bottlenecks and program data structures motivates MemSpy's introduction of data-oriented statistics for memory performance information. Furthermore, MemSpy's detailed statistics on the causes of cache misses are crucial for determining sources of memory bottlenecks.
Memory Behavior of the SPEC2000 Benchmark Suite
, 2000
Cited by 34 (0 self)
The SPEC CPU benchmarks are frequently used in computer architecture research. The newly released SPEC'2000 benchmarks consist of fourteen floating point and twelve integer applications. In this paper we present measurements of the number of cache misses for all the applications for a variety of cache configurations. Prior studies have shown that SPEC benchmarks do not put much stress on the memory system. Our simulation results demonstrate that SPEC'2000 places only modest pressure on the first level caches, confirming the results of similar experiments. 1 Introduction SPEC CPU benchmarks have long been used to gauge the performance of uniprocessor systems as well as microarchitectural enhancements. The newly released SPEC'2000 benchmark suite replaced the previous release, SPEC'95. Many studies [1, 3, 4] showed that only a few applications place more than modest stress on the memory system. The purpose of this study is to examine the memory behavior of the SPEC'2000 benchmark suite an...
Tuning Strassen's Matrix Multiplication for Memory Efficiency
- In Proceedings of SC98 (CD-ROM)
, 1998
Cited by 34 (4 self)
Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this algorithm that uses several unconventional techniques to make the algorithm memory-friendly. First, the algorithm internally uses a non-standard array layout known as Morton order that is based on a quad-tree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting the performance of the algorithm, which we can do by virtue of the cache behavior of the Morton ordering. Each technique is critical for performance, and their combination as done in our code multiplies their effectiveness. Performance comparisons of our implementation with that of competing implementations show that our implementation often outperforms th...
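The Morton order mentioned in this abstract maps a (row, column) pair to a position in a one-dimensional array by interleaving the bits of the two coordinates, which keeps each quadrant of the quad-tree decomposition contiguous in memory. The sketch below shows that index computation only; the function name `morton_index`, the 16-bit coordinate limit, and the convention that column bits occupy the even positions are assumptions, not details taken from the paper.

```python
def morton_index(row, col, bits=16):
    """Interleave the low `bits` bits of row and col into one Morton (Z-order) index.

    Assumed convention: column bits go to even positions, row bits to odd ones.
    """
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)       # column bit i -> bit 2i
        z |= ((row >> i) & 1) << (2 * i + 1)   # row bit i    -> bit 2i + 1
    return z


# Storage order of a 4 x 4 matrix: each 2 x 2 quadrant is contiguous.
layout = [[morton_index(r, c) for c in range(4)] for r in range(4)]
```

With this convention the top-left 2 x 2 quadrant occupies positions 0 to 3, the top-right 4 to 7, the bottom-left 8 to 11 and the bottom-right 12 to 15, so a recursive algorithm that descends into quadrants always works on a contiguous block.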

