Results 1 - 10
of
49
SPARK: A high-level synthesis framework for applying parallelizing compiler transformations
- In International Conference on VLSI Design
, 2003
"... This paper presents a modular and extensible high-level synthesis research system, called SPARK, that takes a behavioral description in ANSI-C as input and produces synthesizable register-transfer level VHDL. SPARK uses parallelizing compiler technology developed previously to enhance instruction-le ..."
Abstract
-
Cited by 66 (7 self)
- Add to MetaCart
This paper presents a modular and extensible high-level synthesis research system, called SPARK, that takes a behavioral description in ANSI-C as input and produces synthesizable register-transfer level VHDL. SPARK uses parallelizing compiler technology developed previously to enhance instruction-level parallelism and re-instruments it for high-level synthesis by incorporating ideas of mutual exclusivity of operations, resource sharing and hardware cost models. In this paper, we present the design flow through the SPARK system, a set of transformations that include speculative code motions and dynamic transformations and show how these transformations and other optimizing synthesis and compiler techniques are employed by a scheduling heuristic. Experiments are performed on two moderately complex industrial applications, namely, MPEG-1 and the GIMP image processing tool. The results show that the various code transformations lead to up to 70 % improvements in performance without any increase in the overall area and critical path of the final synthesized design. 1
A Framework for Exploiting Task- and Data-Parallelism on Distributed Memory Multicomputers
- IEEE Transactions on Parallel and Distributed Systems
, 1997
"... offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler a ..."
Abstract
-
Cited by 30 (0 self)
- Add to MetaCart
offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications–the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster. A practical implementation of a task and data parallel scheme of execution for an application on a distributed memory multicomputer also involves data redistribution. This data redistribution causes an overhead. However, as our experimental results show, this overhead is not a problem; execution of a program using task and data parallelism together can be significantly faster than its execution using data parallelism alone. This makes our proposed optimization practical and extremely useful.
A Convex Programming Approach for Exploiting Data and Functional Parallelism on Distributed Memory Multicomputers
, 1994
"... Compilers have focussed on the exploitation of one of functional or data parallelism in the past. The PARADIGM compiler project at the University of Illinois is among the #rst to incorporate techniques for simultaneous exploitation of both. The work in this paper describes the techniques used in the ..."
Abstract
-
Cited by 29 (8 self)
- Add to MetaCart
Compilers have focussed on the exploitation of one of functional or data parallelism in the past. The PARADIGM compiler project at the University of Illinois is among the #rst to incorporate techniques for simultaneous exploitation of both. The work in this paper describes the techniques used in the PARADIGM compiler and analyzes the optimality of these techniques. It is the #rst of its kind to use realistic cost models and includes data transfer costs which all previous researchers have neglected. Preliminary results on the CM-5 show the e#cacy of our methods and the signi#cant advantages of using functional and data parallelism together for execution of real applications. 1. INTRODUCTION Distributed memory multicomputers such as the Intel Paragon, the IBM SP-1 and the Thinking Machines CM-5 o#er signi#cant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately,to extract all that computational power from these machines, users have to write e#...
Double Standards: Bringing Task Parallelism to HPF Via the Message Passing Interface
- IN PROCEEDINGS OF SUPERCOMPUTING '96
, 1996
"... High Performance Fortran (HPF) does not allow efficient expression of mixed task/dataparallel computations or the coupling of separately compiled data-parallel modules. In this paper, we show how a coordination library implementing the Message Passing Interface (MPI) can be used to represent these c ..."
Abstract
-
Cited by 27 (4 self)
- Add to MetaCart
High Performance Fortran (HPF) does not allow efficient expression of mixed task/dataparallel computations or the coupling of separately compiled data-parallel modules. In this paper, we show how a coordination library implementing the Message Passing Interface (MPI) can be used to represent these common parallel program structures. This library allows data-parallel tasks to exchange distributed data structures using calls to simple communication functions. We present microbenchmark results that characterize the performance of this library and that quantify the impact of optimizations that allow reuse of communication schedules in common situations. In addition, results from twodimensional FFT, convolution, and multiblock programs demonstrate that the HPF/MPI library can provide performance superior to that of pure HPF. We conclude that this synergistic combination of two parallel programming standards represents a useful approach to task parallelism in a data-parallel framework, incre...
SpecSyn: An Environment Supporting the Specify-Explore-Refine Paradigm for Hardware/Software System Design
- IEEE Transactions on VLSI Systems
, 1998
"... System-level design issues are gaining increasing attention, as behavioral synthesis tools and methodologies mature. We present the SpecSyn system-level design environment, which supports the new specify-explore-refine (SER) design paradigm. This three-step approach to design includes precise specif ..."
Abstract
-
Cited by 26 (13 self)
- Add to MetaCart
System-level design issues are gaining increasing attention, as behavioral synthesis tools and methodologies mature. We present the SpecSyn system-level design environment, which supports the new specify-explore-refine (SER) design paradigm. This three-step approach to design includes precise specification of system functionality, rapid exploration of numerous systemlevel design options, and refinement of the specification into one reflecting the chosen option. A system-level design option consists of an allocation of system components, such as standard and custom processors, memories, and buses, and a partitioning of functionality among those components. After refinement, the functionality assigned to each component can then be synthesized to hardware or compiled to software. We describe the issues and approaches for each part of the SpecSyn environment. The new paradigm and environment are expected to lead to a more than ten times reduction in design time, and our experiments support...
Fastforward for efficient pipeline parallelism: A cache-optimized concurrent lock-free queue
- In PPoPP ’08: Proceedings of the The 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 2008
"... A Cache-Optimized Concurrent Lock-Free Queue Low overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents Fast-Forward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore ..."
Abstract
-
Cited by 21 (2 self)
- Add to MetaCart
A Cache-Optimized Concurrent Lock-Free Queue Low overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents Fast-Forward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures, with weak to strongly ordered consistency models. Enqueue and dequeue times on a 2.66 GHz Opteron 2218 based system are as low as 28.5 ns, up to 5x faster than the next best solution. FastForward’s effectiveness is demonstrated for real applications by applying it to line-rate soft network processing on Gigabit Ethernet with general purpose commodity hardware.
Kernel-Level Scheduling for the Nano-Threads Programming Model
, 1998
"... Multiprocessor systems are increasingly becoming the systems of choice for low and high-end servers, running such diverse tasks as number crunching, large-scale simulations, data base engines and world wide web server applications. With such diverse workloads, system utilization and throughput, as w ..."
Abstract
-
Cited by 19 (14 self)
- Add to MetaCart
Multiprocessor systems are increasingly becoming the systems of choice for low and high-end servers, running such diverse tasks as number crunching, large-scale simulations, data base engines and world wide web server applications. With such diverse workloads, system utilization and throughput, as well as execution time become important performance metrics. In this paper we present efficient kernel scheduling policies and propose a new kernel-user interface aiming at supporting efficient parallel execution in diverse workload environments. Our approach relies on support for user level threads which are used to exploit parallelism within applications, and a two-level scheduling policy which coordinates the number of resources allocated by the kernel with the number of threads generated by each application. We compare our scheduling policies with the native gang scheduling policy of the IRIX 6.4 operating system on a Silicon Graphics Origin2000. Our experimental results show substantial ...
An Evaluation of Optimized Threaded Code Generation
- In Proc. of the Conf. on Parallel Architectures and Compilation Techniques
, 1994
"... : Multithreaded architectures hold many promises: the exploitation of intra-thread locality and the latency tolerance of multithreaded synchronization can result in a more efficient processor utilization and higher scalability. The challenge for a code generation scheme is to make effective use of t ..."
Abstract
-
Cited by 15 (9 self)
- Add to MetaCart
: Multithreaded architectures hold many promises: the exploitation of intra-thread locality and the latency tolerance of multithreaded synchronization can result in a more efficient processor utilization and higher scalability. The challenge for a code generation scheme is to make effective use of the underlying hardware by generating large threads with a large degree of internal locality without limiting the program level parallelism. Top-down code generation, where threads are created directly from the compiler's intermediate form, is effective at creating a relatively large thread. However, having only a limited view of the code at any one time limits the thread size. These top-down generated threads can therefore be optimized by global, bottom-up optimization techniques. In this paper, we present such bottom-up optimizations and evaluate their effectiveness in terms of overall performance and specific thread characteristics such as size, length, instruction level parallelism, numbe...
Speculation Techniques for High Level Synthesis of Control Intensive Designs
- In Design Automation Conference
, 2001
"... The quality of synthesis results for most high level synthesis approaches is strongly affected by the choice of control flow (through conditions and loops) in the input description. In this paper, we explore the effectiveness of various types of code motions, suchasmoving operations across condition ..."
Abstract
-
Cited by 15 (8 self)
- Add to MetaCart
The quality of synthesis results for most high level synthesis approaches is strongly affected by the choice of control flow (through conditions and loops) in the input description. In this paper, we explore the effectiveness of various types of code motions, suchasmoving operations across conditionals, out of conditionals (speculation) and into conditionals (reverse speculation), and how they can be effectively directed by heuristics so as to lead to improved synthesis results in terms of fewer execution cycles and fewer number of states in the finite state machine controller. We also study the effects of the code motions on the area and latency of the final synthesized netlist. Based on speculative code motions, we presentanovel way to perform early condition execution that leads to significant improvements in highly control-intensive designs. Overall, reductions of up to 38% in execution cycles are obtained with all the code motions enabled.

