Results 1–10 of 150
Exploiting coarsegrained task, data, and pipeline parallelism in stream programs
 In 14th International Conference on Architectural Support for Programming Languages and Operating Systems, 2006
Cited by 103 (6 self)
Abstract:
As multicore architectures enter the mainstream, there is a pressing demand for high-level programming models that can effectively map to them. Stream programming offers an attractive way to expose coarse-grained parallelism, as streaming applications (image, video, DSP, etc.) are naturally represented by independent filters that communicate over explicit data channels. In this paper, we demonstrate an end-to-end stream compiler that attains robust multicore performance in the face of varying application characteristics. As benchmarks exhibit different amounts of task, data, and pipeline parallelism, we exploit all types of parallelism in a unified manner in order to achieve this generality. Our compiler, which maps from the StreamIt language to the 16-core Raw architecture, attains an 11.2x mean speedup over a single-core baseline, and a 1.84x speedup over our previous work. Categories and Subject Descriptors: D.3.2 [Programming Languages]
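The interplay of the parallelism types this abstract names can be illustrated with a toy Python sketch (hypothetical filter names, not the StreamIt compiler): a stateless filter is fissed into data-parallel replicas, while a stateful filter stays serial and forms the next pipeline stage.

```python
# Toy illustration of stream parallelism (not StreamIt itself).
# A pipeline of filters; the stateless `scale` filter is fissed into
# data-parallel replicas that each process an interleaved shard.

def scale(x):          # stateless: safe to replicate (data parallelism)
    return 2 * x

def accumulate(xs):    # stateful: must run on a single core
    total, out = 0, []
    for x in xs:
        total += x
        out.append(total)
    return out

def run_pipeline(stream, replicas=4):
    # Data parallelism: round-robin split among replicas of `scale`.
    shards = [stream[i::replicas] for i in range(replicas)]
    scaled = [[scale(x) for x in shard] for shard in shards]
    # Join: re-interleave shards back into the original order.
    joined = [0] * len(stream)
    for r, shard in enumerate(scaled):
        joined[r::replicas] = shard
    # Pipeline parallelism: `accumulate` is the downstream stage.
    return accumulate(joined)

print(run_pipeline([1, 2, 3, 4]))  # [2, 6, 12, 20]
```

In a real compiler the split/join and stage boundaries are placed by the partitioner; here they are hard-coded to show the structure.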
Parameterized Dataflow Modeling for DSP Systems
 IEEE Transactions on Signal Processing, 2001
Cited by 96 (37 self)
Abstract:
Dataflow has proven to be an attractive computation model for programming digital signal processing (DSP) applications. A restricted version of dataflow, termed synchronous dataflow (SDF), which offers strong compile-time predictability properties but has limited expressive power, has been studied extensively in the DSP context. Many extensions to synchronous dataflow have been proposed to increase its expressivity while maintaining its compile-time predictability properties as much as possible. We propose a parameterized dataflow framework that can be applied as a meta-modeling technique to significantly improve the expressive power of any dataflow model that possesses a well-defined concept of a graph iteration. Indeed, the parameterized dataflow framework is compatible with many of the existing dataflow models for DSP, including SDF, cyclo-static dataflow, scalable synchronous dataflow, and Boolean dataflow. In this paper, we develop precise, formal semantics for parameterized synchr...
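The compile-time predictability of SDF that this abstract builds on comes from its balance equations: for every edge, producer firings times the production rate must equal consumer firings times the consumption rate. A minimal sketch (assuming a consistent, connected graph; not the paper's parameterized framework):

```python
# Sketch: solving SDF balance equations for a repetitions vector.
# Edge (u, v, p, c): actor u produces p tokens per firing, actor v
# consumes c per firing. Balance: q[u] * p == q[v] * c on every edge.
from fractions import Fraction
from math import lcm

def repetitions(actors, edges):
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:                       # propagate rates through the graph
        changed = False
        for u, v, p, c in edges:
            if u in q and v not in q:
                q[v] = q[u] * p / c
                changed = True
            elif v in q and u not in q:
                q[u] = q[v] * c / p
                changed = True
    # Scale the rational solution to the smallest integer vector.
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(q[a] * scale) for a in actors}

# Classic two-actor example: A produces 2 tokens, B consumes 3 per firing.
print(repetitions(["A", "B"], [("A", "B", 2, 3)]))  # {'A': 3, 'B': 2}
```

This sketch omits the consistency check an inconsistent graph would need; parameterized dataflow additionally lets the rates p and c change between graph iterations.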
Static Scheduling for Synthesis of DSP Algorithms on Various Models
 Journal of VLSI Signal Processing, 1995
Cited by 63 (35 self)
Abstract:
Given a behavioral description of a DSP algorithm represented by a dataflow graph, we show how to obtain a rate-optimal static schedule with the minimum unfolding factor under two models, the integral grid model and the fractional grid model, and two kinds of implementations for each model, pipelined and non-pipelined. We present a simple and unified approach to deal with the four possible combinations. A unified polynomial-time scheduling algorithm is presented, which works on the original dataflow graphs without actually unfolding them. The values of the minimum rate-optimal unfolding factors and the general properties of all four combinations are proved.
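"Rate-optimal" here means the schedule's period matches the graph's iteration bound: the maximum, over all cycles, of total computation time divided by total delays. A brute-force sketch for small graphs (my own toy code, not the paper's polynomial-time algorithm):

```python
# Sketch: the iteration bound of a dataflow graph lower-bounds the
# period of any static schedule; a rate-optimal schedule achieves it.
# Brute-force cycle enumeration, adequate only for tiny graphs.

def iteration_bound(times, edges):
    # times: node -> computation time; edges: list of (u, v, delays).
    best = 0.0
    def dfs(start, node, t_sum, d_sum, visited):
        nonlocal best
        for u, v, d in edges:
            if u != node:
                continue
            if v == start and d_sum + d > 0:
                best = max(best, t_sum / (d_sum + d))
            elif v not in visited:
                dfs(start, v, t_sum + times[v], d_sum + d, visited | {v})
    for s in times:
        dfs(s, s, times[s], 0, {s})
    return best

# Two-node loop: A (time 1) -> B (time 2), one delay on the back edge:
# the single cycle has time 3 and 1 delay, so the bound is 3.
print(iteration_bound({"A": 1, "B": 2},
                      [("A", "B", 0), ("B", "A", 1)]))  # 3.0
```

The paper's contribution is finding the minimum unfolding factor at which a static schedule actually attains this bound.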
Orchestrating the execution of stream programs on multicore platforms
 In Proc. of the SIGPLAN '08 Conference on Programming Language Design and Implementation, 2008
Cited by 59 (9 self)
Abstract:
While multicore hardware has become ubiquitous, explicitly parallel programming models and compiler techniques for exploiting parallelism on these systems have noticeably lagged behind. Stream programming is one model that has wide applicability in the multimedia, graphics, and signal processing domains. Streaming models execute as a set of independent actors that explicitly communicate data through channels. This paper presents a compiler technique for planning and orchestrating the execution of streaming applications on multicore platforms. An integrated unfolding and partitioning step based on integer linear programming is presented that unfolds data-parallel actors as needed and maximally packs actors onto cores. Next, the actors are assigned to pipeline stages in such a way that all communication is maximally overlapped with computation on the cores. To facilitate experimentation, a generalized code generation template for mapping the software pipeline onto the Cell architecture is presented. For a range of streaming applications, a geometric mean speedup of 14.7x is achieved on a 16-core Cell platform compared to a single core.
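The partitioning objective the abstract describes is a load-balancing one: assign actors to cores so the most loaded core (which bounds throughput) is as light as possible. The paper solves it with an integer linear program; a tiny exhaustive search over made-up actor names illustrates the same objective:

```python
# Sketch of the partitioning objective (the paper uses ILP; exhaustive
# search shown here is only viable for a handful of actors): assign
# actors to cores minimizing the maximum per-core load.
from itertools import product

def best_partition(work, n_cores):
    # work: actor -> estimated work per steady-state iteration.
    actors = list(work)
    best_assign, best_makespan = None, float("inf")
    for assign in product(range(n_cores), repeat=len(actors)):
        loads = [0] * n_cores
        for actor, core in zip(actors, assign):
            loads[core] += work[actor]
        if max(loads) < best_makespan:
            best_makespan = max(loads)
            best_assign = dict(zip(actors, assign))
    return best_assign, best_makespan

assign, makespan = best_partition({"src": 2, "fir": 8, "sink": 2}, 2)
print(makespan)  # 8: `fir` alone on one core, src + sink on the other
```

Unfolding a data-parallel actor first (splitting its work into replicas) gives the packer smaller pieces and can lower the achievable makespan.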
Scheduling dataflow graphs via retiming and unfolding
 IEEE Trans. on Parallel and Distributed Systems, 1997
Cited by 58 (25 self)
Abstract:
Loop scheduling is an important problem in parallel processing. The retiming technique reorganizes an iteration; the unfolding technique schedules several iterations together. We combine these two techniques to obtain a static schedule with a reduced average computation time per iteration. We first prove that the order of retiming and unfolding is immaterial for scheduling a dataflow graph (DFG). From this nice property, we present a polynomial-time algorithm on the original DFG, before unfolding, to find the minimum-rate static schedule for a given unfolding factor. For the case of a unit-time DFG, efficient checking and retiming algorithms are presented.
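The unfolding transformation referenced here has a standard mechanical form: each node is replicated f times and each edge's delays are redistributed among the copies. A minimal sketch of that edge rule:

```python
# Sketch of the unfolding transformation: replicate each node f times
# so f consecutive iterations are scheduled together. An edge (u, v)
# with d delays becomes f edges (u_i, v_{(i+d) mod f}), each carrying
# floor((i + d) / f) delays, for i = 0..f-1.

def unfold(edges, f):
    unfolded = []
    for u, v, d in edges:
        for i in range(f):
            unfolded.append((f"{u}{i}", f"{v}{(i + d) % f}", (i + d) // f))
    return unfolded

# One edge A -> B with 3 delays, unfolded by a factor of 2.
print(unfold([("A", "B", 3)], 2))
# [('A0', 'B1', 1), ('A1', 'B0', 2)]
```

Note that the total delay count (3 here) is preserved across the unfolded copies; the paper's result is that retiming may be applied before or after this transformation with the same scheduling outcome.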
Multidimensional Synchronous Dataflow
 IEEE Transactions on Signal Processing, 2002
Cited by 51 (4 self)
Abstract:
Signal flow graphs with dataflow semantics have been used in signal processing system simulation, algorithm development, and real-time system design. Dataflow semantics implicitly expose function parallelism by imposing only a partial ordering constraint on the execution of functions. One particular form of dataflow called synchronous dataflow (SDF) has been quite popular in programming environments for digital signal processing (DSP), since it has strong formal properties and is ideally suited for expressing multirate DSP algorithms. However, SDF and other dataflow models use first-in first-out (FIFO) queues on the communication channels and are thus ideally suited only for one-dimensional (1-D) signal processing algorithms. While multidimensional systems can also be expressed by collapsing arrays into 1-D streams, such modeling is often awkward and can obscure potential data parallelism that might be present. SDF can be generalized...
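In the multidimensional generalization, token rates become tuples and the balance equations decouple per dimension. A toy sketch for a single producer–consumer edge (my simplification, not the paper's full model):

```python
# Sketch: multidimensional SDF balance for one edge, solved per
# dimension. prod/cons are per-dimension token rates; the smallest
# integer repetition counts per dimension are c/gcd and p/gcd.
from math import gcd

def md_repetitions(prod, cons):
    reps_a = tuple(c // gcd(p, c) for p, c in zip(prod, cons))
    reps_b = tuple(p // gcd(p, c) for p, c in zip(prod, cons))
    return reps_a, reps_b

# A produces a (2, 3) block of tokens, B consumes (4, 1) per firing:
# A fires (2, 1) times and B fires (1, 3) times per iteration.
print(md_repetitions((2, 3), (4, 1)))  # ((2, 1), (1, 3))
```

Because the dimensions are independent, nested data parallelism (e.g. over image rows and columns) stays visible instead of being flattened into one stream.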
Scheduling And Behavioral Transformations For Parallel Systems
 1993
Cited by 39 (3 self)
Abstract:
In a parallel system, whether a VLSI architecture in hardware or a parallel program in software, the quality of the final design depends on the ability of a synthesis system to exploit the parallelism hidden in the input description of applications. Since iterative or recursive algorithms are usually the most time-critical parts of an application, the parallelism embedded in the repetitive pattern of an iterative algorithm needs to be explored. This thesis studies techniques and algorithms to expose the parallelism in an iterative algorithm so that the designer can find an implementation achieving a desired execution rate. In particular, the objective is to find an efficient schedule to be executed iteratively. A form of dataflow graphs is used to model the iterative part of an application, e.g. a digital signal filter or the while/for loop of a program. Nodes in the graph represent operations to be performed and edges represent both intra-iteration and inter-iteration precedence relat...
Minimizing Memory Requirements in RateOptimal Schedules
 1994
Cited by 30 (2 self)
Abstract:
In this paper we address the problem of minimizing the buffer storage requirement in constructing rate-optimal compile-time schedules for multirate dataflow graphs. We demonstrate that this problem, called the Minimum Buffer Rate-Optimal (MBRO) scheduling problem, can be formulated as a unified linear programming problem. A novel feature of our method is that it minimizes the memory requirement while simultaneously maximizing the computation rate. We have constructed an experimental testbed which implements our scheduling algorithm as well as (i) the widely used periodic admissible parallel schedules proposed by Lee and Messerschmitt [12], (ii) the optimal scheduling buffer allocation (OSBA) algorithm of Ning and Gao [15], and (iii) the multirate software pipelining (MRSP) algorithm [7]. The experimental results demonstrate a significant improvement in buffer requirements for the MBRO schedules compared to the schedules generated by the other three methods. Compared to bloc...
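Why buffer requirements depend on the schedule at all can be seen by simulating token counts under two admissible firing orders for the same graph (a toy simulation of the quantity being minimized, not the paper's linear program):

```python
# Sketch: for a fixed periodic firing order, the buffer needed on each
# edge is the peak number of resident tokens. Different admissible
# orders for the same SDF graph need different peak buffers.

def peak_buffers(edges, firing_order):
    # edges: list of (producer, consumer, prod_rate, cons_rate).
    tokens = {i: 0 for i in range(len(edges))}
    peak = {i: 0 for i in range(len(edges))}
    for actor in firing_order:
        for i, (u, v, p, c) in enumerate(edges):
            if actor == u:                       # producer fires
                tokens[i] += p
                peak[i] = max(peak[i], tokens[i])
            if actor == v:                       # consumer fires
                tokens[i] -= c
    return peak

edges = [("A", "B", 2, 3)]   # A fires 3x, B fires 2x per period
print(peak_buffers(edges, ["A", "A", "A", "B", "B"]))  # {0: 6}
print(peak_buffers(edges, ["A", "A", "B", "A", "B"]))  # {0: 4}
```

Interleaving consumer firings earlier shrinks the peak from 6 to 4 tokens; MBRO searches for such schedules without sacrificing the rate-optimal period.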
Cyclic Scheduling on Parallel Processors: An Overview
 1994
Cited by 25 (2 self)
Abstract:
A recent research effort has been devoted to cyclic scheduling problems that arise in the design of compilers for parallel architectures as well as in manufacturing systems. This paper focuses on the extensions of the basic cyclic scheduling problem (BCS), which seems to be one of the most suitable models for parallel processing applications. The properties of the earliest schedule of BCS are recalled and their most recent extensions are presented. Several generalizations of BCS that include resource constraints are then discussed. In particular, structural results and algorithms for periodic versions of job-shop and m-machine problems are reported.

Up to now, cyclic scheduling problems have been studied from several points of view depending on the target application. A few theoretical studies have recently been devoted to these problems, in which basic results are often proved independently using different formalisms. We hope that this paper, without pretending to...
A transformationbased method for loop folding
 IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1994
Cited by 20 (2 self)
Abstract:
We propose a transformation-based scheduling algorithm for the following problem: given a loop construct, a target initiation interval, and a set of resource constraints, schedule the loop in a pipelined fashion such that the iteration time of executing an iteration of the loop is minimized. The iteration time is an important quality measure of a data path design because it affects both storage and control costs. Our algorithm first performs an As Soon As Possible Pipelined (ASAPp) scheduling regardless of the resource constraints. It then resolves resource-constraint violations by rescheduling some operations. The software system implementing the proposed algorithm, called Theda.Fold, can deal with behavioral loop descriptions that contain chained, multi-cycle, and/or structurally pipelined operations, as well as those having data dependencies across iteration boundaries. Experiments on a number of benchmarks are reported.
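The ASAP step the algorithm starts from has a simple recursive form: ignoring resources, each operation starts at the earliest cycle permitted by its data dependences. A minimal sketch with hypothetical operation names (plain ASAP; the paper's ASAPp variant additionally handles pipelined iteration overlap):

```python
# Sketch of ASAP scheduling: ignoring resource constraints, start each
# operation at the earliest cycle allowed by its intra-iteration
# data dependences.

def asap(deps, latency):
    # deps: op -> list of predecessor ops; latency: op -> cycles.
    start = {}
    def time_of(op):
        if op not in start:
            start[op] = max((time_of(p) + latency[p] for p in deps[op]),
                            default=0)
        return start[op]
    for op in deps:
        time_of(op)
    return start

latency = {"load": 1, "mul": 2, "add": 1}
deps = {"load": [], "mul": ["load"], "add": ["mul", "load"]}
print(asap(deps, latency))  # {'load': 0, 'mul': 1, 'add': 3}
```

The transformation-based part of the algorithm then moves operations later (or across iteration boundaries) only where the ASAP schedule oversubscribes a resource.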