Results 11 - 20 of 42
Synthesis of Hardware Models in C with Pointers and Complex Data Structures
IEEE Transactions on VLSI Systems, 2001
Cited by 23 (0 self)
One of the greatest challenges in a C/C++-based design methodology is efficiently mapping C/C++ models into hardware. Many networking and multimedia applications implemented in hardware or mixed hardware/software systems now use complex data structures stored in multiple memories, so many C/C++ features that were originally designed for software applications are now making their way into hardware. Such features include dynamic memory allocation and pointers for managing data. We present a solution for efficiently mapping arbitrary C code with pointers and malloc/free into hardware. Our solution, which fits current memory management methodologies, instantiates an application-specific hardware memory allocator coupled with a memory architecture. Our work also supports the resolution of pointers without restriction on the data structures. We present an implementation based on the SUIF framework along with case studies such as the realization of a video filter and an ATM segmentation engine.
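As a rough illustration of what an allocator reduced to hardware-friendly terms might look like, the sketch below models a fixed-size block pool in C. It is not the allocator described in the paper: the names `pool_t`, `pool_alloc`, and `pool_free`, the block-index "pointers", and the pool dimensions are all assumptions made for this example.

```c
#include <assert.h>

#define POOL_BLOCKS 8   /* hypothetical number of blocks in the managed memory */
#define BLOCK_WORDS 4   /* hypothetical block size, in words */
#define NIL (-1)        /* "null pointer" expressed as a block index */

typedef struct {
    int mem[POOL_BLOCKS][BLOCK_WORDS]; /* the memory being managed */
    int next_free[POOL_BLOCKS];        /* free list threaded through block indices */
    int head;                          /* first free block, or NIL if exhausted */
} pool_t;

/* Link every block into the free list: 0 -> 1 -> ... -> NIL. */
void pool_init(pool_t *p) {
    for (int i = 0; i < POOL_BLOCKS; i++)
        p->next_free[i] = (i + 1 < POOL_BLOCKS) ? i + 1 : NIL;
    p->head = 0;
}

/* malloc analogue: pop the head of the free list in constant time. */
int pool_alloc(pool_t *p) {
    int b = p->head;
    if (b != NIL)
        p->head = p->next_free[b];
    return b;
}

/* free analogue: push the block back onto the free list. */
void pool_free(pool_t *p, int b) {
    assert(b >= 0 && b < POOL_BLOCKS);
    p->next_free[b] = p->head;
    p->head = b;
}
```

Both operations touch a constant number of table entries, which is the kind of property that makes a software allocator plausible to realize as a small state machine next to a memory; whether the paper's allocator works this way is not stated in the abstract.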
Energy-Oriented Compiler Optimizations for Partitioned Memory Architectures
2000
Cited by 22 (2 self)
Due to low power requirements of many embedded/portable devices such as mobile phones and laptop computers and dramatic increases in clock frequencies of general-purpose processors, low-power software technology is becoming increasingly important in system design. Many applications from image and video processing as well as from dense linear algebra are array-dominated and data-intensive, thereby spending a major portion of their execution time and energy in the memory subsystem. This paper presents a compiler-based optimization framework that targets reducing the energy consumption in a partitioned off-chip memory architecture that contains multiple memory banks by organizing the order of computations and the layout of data. The optimizations considered in this work take advantage of low-power operating modes and the partitioned (multi-bank) structure of the off-chip memory. Our preliminary experiments show that the proposed framework reduces memory energy consumption by up to 86% over a scheme that keeps all the memory banks in the active (fully-operational) operating mode all the time, and by up to 70% over a scheme that utilizes low-power operating modes without doing any loop and data optimizations.
Locality Optimizations for Multi-Level Caches
In Proceedings of SC99: High-Performance Networking and Computing, 1999
Cited by 20 (2 self)
Compiler transformations can significantly improve data locality of scientific programs. In this paper, we examine the impact of multi-level caches on data locality optimizations. We find nearly all the benefits can be achieved by simply targeting the L1 (primary) cache. Most locality transformations are unaffected because they improve reuse for all levels of the cache; however, some optimizations can be enhanced. Inter-variable padding can take advantage of modular arithmetic to eliminate conflict misses and preserve group reuse on multiple cache levels. Loop fusion can balance increasing group reuse for the L2 (secondary) cache at the expense of losing group reuse at the smaller L1 cache. Tiling for the L1 cache also exploits locality available in the L2 cache. Experiments show enhanced algorithms are able to reduce cache misses, but performance improvements are rarely significant. Our results indicate existing compiler optimizations are usually sufficient to achieve good performance...
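For readers unfamiliar with tiling, one of the locality transformations named in the abstract, the following is a minimal sketch in C of a tiled matrix multiply next to its untiled reference. The function names, the row-major flat-array layout, and the `tile` parameter are assumptions for illustration; the paper's own algorithms for choosing tile sizes are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Reference: untiled i-j-k multiply of n x n row-major matrices, c = a * b. */
void matmul_naive(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Tiled version: the three outer loops walk tile x tile blocks so that the
 * block of b touched by the inner loops is reused while it is still cached. */
void matmul_tiled(size_t n, size_t tile, const double *a, const double *b,
                  double *c) {
    assert(tile > 0);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += tile)
        for (size_t kk = 0; kk < n; kk += tile)
            for (size_t jj = 0; jj < n; jj += tile)
                for (size_t i = ii; i < min_sz(ii + tile, n); i++)
                    for (size_t k = kk; k < min_sz(kk + tile, n); k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < min_sz(jj + tile, n); j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

In this sketch the tile size is a single parameter; a tile chosen to fit the L1 cache necessarily also fits in a larger L2 cache, which is consistent with the abstract's observation that tiling for L1 exploits L2 locality as well.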
Evaluating the Impact of Advanced Memory Systems on Compiler-Parallelized Codes
1995
Cited by 19 (7 self)
Compiler-parallelized applications are increasing in importance as moderate-scale multiprocessors become common. This paper evaluates how features of advanced memory systems (e.g., longer cache lines) impact memory system behavior for applications amenable to compiler parallelization. Using full-sized input data sets and applications taken from the SPEC, NAS, PERFECT, and RICEPS benchmark suites, we measure statistics such as speedups, memory costs, causes of cache misses, cache line utilization, and data traffic. This exploration allows us to draw several conclusions. First, we find that larger granularity parallelism often correlates with good memory system behavior, good overall performance, and high speedup in these applications. Second, we show that when long (512-byte) cache lines are used, many of these applications suffer from false sharing and low cache line utilization. Third, we identify some of the common artifacts in compiler-parallelized codes that can lead to false sharin...
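The false-sharing effect mentioned in the abstract can be made concrete with a small layout check in C. This is only a sketch: the 512-byte line size comes from the abstract, while the counter structs, the `share_line` helper, and the assumption that the array starts on a line boundary are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 512  /* long cache line size mentioned in the abstract */

/* Per-processor counter with no padding: many of these fit in one line. */
typedef struct {
    long count;
} packed_counter_t;

/* Same counter padded out to a full line, so each one owns its line. */
typedef struct {
    long count;
    char pad[LINE_BYTES - sizeof(long)];
} padded_counter_t;

/* Returns 1 if elements i and j of an array of elem_size-byte elements fall
 * in the same cache line, assuming the array base is line-aligned. */
int share_line(size_t elem_size, size_t i, size_t j) {
    assert(elem_size > 0);
    return (i * elem_size) / LINE_BYTES == (j * elem_size) / LINE_BYTES;
}
```

If two processors each update their own `packed_counter_t` in adjacent array slots, `share_line` reports that the slots share a line, so every write invalidates the other processor's copy. Padding avoids that, at the cost of the low cache line utilization the abstract also warns about.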
Compiler-Assisted Memory Exclusion for Fast Checkpointing
IEEE Technical Committee on Operating Systems and Application Environments, 1995
Cited by 19 (4 self)
Memory exclusion is a powerful tool for optimizing the performance of checkpointing; however, it has not been completely automated with low enough overhead. In this paper we present compiler-assisted memory exclusion (CAME), a technique that uses static program analysis to optimize the performance of checkpointing. With the assistance of user-placed directives, the compiler can perform data flow analyses for dead and read-only regions of memory that can be omitted from checkpoints. The result can be a significant reduction in the size of checkpoints, thereby reducing the overhead of checkpointing.
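The effect of memory exclusion on checkpoint size can be sketched in a few lines of C. In CAME the dead and read-only regions are found by compiler analysis; in this illustration they are simply flags set by hand, and the `region_t` type and `checkpoint_bytes` function are hypothetical names, not part of any real checkpointing library.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous region of program memory, with hand-set exclusion flags
 * standing in for the results of the compiler's data flow analyses. */
typedef struct {
    const char *name;
    size_t bytes;
    int dead;       /* contents will not be read again before being rewritten */
    int read_only;  /* contents unchanged since the previous checkpoint */
} region_t;

/* Bytes a checkpoint would have to write. With exclude != 0, regions flagged
 * dead or read-only are omitted; with exclude == 0, everything is saved. */
size_t checkpoint_bytes(const region_t *regions, size_t n, int exclude) {
    assert(regions != NULL || n == 0);
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (exclude && (regions[i].dead || regions[i].read_only))
            continue;
        total += regions[i].bytes;
    }
    return total;
}
```

Comparing the two modes on the same region list gives the size reduction that exclusion buys; a real checkpointer would write the surviving regions to stable storage instead of only summing their sizes.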
C to Asynchronous Dataflow Circuits: An End-to-End Toolflow
2004
Cited by 19 (8 self)
We present a complete toolflow that translates ANSI-C programs into asynchronous circuits. The toolflow is built around a compiler that converts C into a functional dataflow intermediate representation, exposing instruction-level, pipeline and memory parallelism.
Initial Results on the Performance and Cost of Vector Microprocessors
1997
Cited by 16 (2 self)
Increasingly wider superscalar processors are experiencing diminishing performance returns while requiring larger portions of die area dedicated to control rather than datapath. As an alternative to using these processors to exploit parallelism effectively, we are investigating the viability of using single-chip vector microprocessors. This paper presents some initial results of our investigation where we compare the performance and cost of vector microprocessors to that of aggressive, out-of-order superscalar microprocessors. On the performance side, we show that vector processors are able to execute a highly parallel, integer-based application 1.5-7.3 times faster than superscalar processors can by exploiting parallelism more effectively. This ability stems from the use of vector instructions. Vector instructions exploit parallelism across loop iterations by implicitly re-scheduling operations and temporally localizing the parallelism. Vector instructions also reduce instruction bandw...
Locality Optimizations For Adaptive Irregular Scientific Codes
2000
Cited by 10 (0 self)
Irregular scientific codes experience poor cache performance due to their memory access patterns. We examine several data and computation locality transformations including GPART, a new technique based on hierarchical clustering. GPART constructs quality partitions quickly by clustering multiple neighboring nodes in a few passes, with priority on nodes with high degree. Overhead is kept low by considering only edges between partitions. We develop compiler analyses and transformations in SUIF to automatically apply locality transformations, and propose user annotations to locate coordinate information needed by geometric partitioning algorithms. We experimentally evaluate locality optimizations for both static and adaptive codes, where connection patterns dynamically change at intervals during program execution. We derive a simple cost model to guide locality optimizations when access patterns change. Experiments on several irregular scientific codes show locality optimization t...
Automatic Topology-Based Identification Of Instruction-Set Extensions
In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2002
Improving the Compiler/Software DSM Interface: Preliminary Results
In Proceedings of the First SUIF Compiler Workshop, 1996
Cited by 7 (3 self)
Current parallelizing compilers for message-passing machines only support a limited class of data-parallel applications. One method for eliminating this restriction is to combine powerful shared-memory parallelizing compilers with software distributed-shared-memory (DSM) systems. Preliminary results show simply combining the parallelizer and software DSM yields very poor performance. The compiler/software DSM interface can be improved based on relatively little compiler input by: 1) combining synchronization and parallelism information communication on parallel task invocation, 2) employing customized routines for evaluating reduction operations, and 3) selecting a hybrid update protocol to presend data by flushing updates at barriers. These optimizations yield decent speedups for program kernels, but are not sufficient for entire programs. Based on our experimental results, we point out areas where additional compiler analysis and software DSM improvements are necessary to achieve goo...

