Results 1 - 10 of 10
High-Level Synthesis of Nonprogrammable Hardware Accelerators
 JOURNAL OF VLSI SIGNAL PROCESSING
, 2000
Abstract

Cited by 70 (6 self)
The PICO-N system automatically synthesizes embedded nonprogrammable accelerators to be used as coprocessors for functions expressed as loop nests in C. The output is synthesizable VHDL that defines the accelerator at the register-transfer level (RTL). The system generates a synchronous array of customized VLIW (very long instruction word) processors, their controller, local memory, and interfaces. The system also modifies the user's application software to make use of the generated accelerator. The user indicates the throughput to be achieved by specifying the number of processors and their initiation interval. In experimental comparisons, PICO-N designs are slightly more costly than hand-designed accelerators with the same performance.
Generalized multipartitioning for multidimensional arrays
 In Proceedings of the International Parallel and Distributed Processing Symposium, Fort Lauderdale, FL
, 2002
Abstract

Cited by 18 (2 self)
Multipartitioning is a strategy for parallelizing computations that require solving 1-D recurrences along each dimension of a multidimensional array. Previous techniques for multipartitioning yield efficient parallelizations over 3-D domains only when the number of processors is a perfect square. This paper considers the general problem of computing multipartitionings for d-dimensional data volumes on an arbitrary number of processors. We describe an algorithm that computes an optimal multipartitioning onto all of the processors for this general case. Finally, we describe how we extended the Rice dHPF compiler for High Performance Fortran to generate code that exploits generalized multipartitioning and show that the compiler's generated code for the NAS SP computational fluid dynamics benchmark achieves scalable high performance.
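To make the abstract's setting concrete, here is a minimal sketch of the classic diagonal construction that the paper generalizes, for the restricted case of p = q*q processors partitioning a 3-D array; the function name and tile encoding are illustrative, not the paper's algorithm.

```python
# Sketch of the classic 3-D multipartitioning for p = q*q processors
# (the restricted case this paper generalizes). Each array dimension is
# cut into q slabs, giving q^3 tiles; tile (i, j, k) is assigned by a
# diagonal, Latin-square-like rule so that every slab, along any
# dimension, contains exactly one tile of each processor.
def multipartition_3d(q):
    owner = {}
    for i in range(q):
        for j in range(q):
            for k in range(q):
                # Shifting both coordinates by k makes each 2-D slice
                # a Latin square of processor ids.
                owner[(i, j, k)] = ((i + k) % q) * q + ((j + k) % q)
    return owner
```

With q = 3 (9 processors) each processor owns 3 tiles, one in every slab of every dimension, which is what lets a line sweep along any axis keep all processors busy at every step.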
Constructing and Exploiting Linear Schedules with Prescribed Parallelism
, 2002
Abstract

Cited by 17 (3 self)
This paper appeared in the proceedings of the 14th International Parallel and Distributed Processing Symposium (IEEE Computer Society, 2000, pp. 815-821) under the title "A constructive solution to the juggling problem in processor array synthesis".
On Efficient Parallelization of Line-Sweep Computations
 In 9th Workshop on Compilers for Parallel Computers
, 2001
Abstract

Cited by 6 (5 self)
Multipartitioning is a strategy for partitioning multidimensional arrays among a collection of processors so that line-sweep computations can be performed efficiently. The principal property of a multipartitioned array is that for a line sweep along any array dimension, all processors have the same number of tiles to compute at each step in the sweep. This property results in full, balanced parallelism. A secondary benefit of multipartitionings is that they induce only coarse-grain communication. Previously, computing a d-dimensional multipartitioning required that p^(1/(d-1)) be integral, where p is the number of processors. Here, we describe an algorithm to compute a d-dimensional multipartitioning of an array of n dimensions for an arbitrary number of processors, for any d, 2 <= d <= n. When using a multipartitioning to parallelize a line-sweep computation, the best partitioning is the one that exploits all of the processors and has the smallest communication volume. To compute the best multipartitioning of an n-dimensional array, we describe a cost model for selecting d, the dimensionality of the best partitioning, and the number of cuts along each partitioned dimension. In practice, our technique will choose a 3-dimensional multipartitioning for a 3-dimensional line-sweep computation, except when p is a prime; previously, a 3-dimensional multipartitioning could be applied only when sqrt(p) is integral. We describe an implementation of multipartitioning in the Rice dHPF compiler and performance results obtained to parallelize a line-sweep computation on a range of different numbers of processors. (This work was performed while a visiting scholar at Rice University.)
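The applicability condition that this abstract says earlier techniques imposed, that p^(1/(d-1)) be integral, can be tested exactly with integer arithmetic; a small hedged sketch (the function name is ours, and the generalized algorithm of the paper removes this restriction entirely):

```python
# Sketch: applicability test for the *previous*, restricted
# multipartitioning technique, which requires p^(1/(d-1)) to be an
# integer (e.g. p a perfect square when d = 3). Exact integer
# arithmetic avoids floating-point rounding on the (d-1)-th root.
def restricted_applicable(p, d):
    r = 1
    while r ** (d - 1) < p:
        r += 1
    return r ** (d - 1) == p
```

For example, the restricted technique handles p = 16 processors in 3-D (sqrt(16) = 4) but not p = 10, which is exactly the gap the generalized algorithm closes.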
Generalized multipartitioning
 In Second Annual Los Alamos Computer Science Institute (LACSI) Sy mposisum
, 2001
Abstract

Cited by 2 (0 self)
Multipartitioning is a strategy for partitioning multidimensional arrays among a collection of processors. With multipartitioning, computations that require solving one-dimensional recurrences along each dimension of a multidimensional array can be parallelized effectively. Previous techniques for multipartitioning yield efficient parallelizations over three-dimensional domains only when the number of processors is a perfect square. This paper considers the general problem of computing optimal multipartitionings for d-dimensional data volumes on an arbitrary number of processors. We describe an algorithm that computes an optimal multipartitioning for this general case, which enables multipartitioning to be used for performing efficient parallelizations of line-sweep computations under arbitrary conditions. Finally, we describe a prototype implementation of generalized multipartitioning in the Rice dHPF compiler and performance results obtained when using it to parallelize a line-sweep computation for different numbers of processors.
Latin Hyper-Rectangles for Efficient Parallelization of Line-Sweep Computations
Abstract
Multipartitioning is a strategy for partitioning multidimensional arrays among a collection of processors so that line-sweep computations can be performed efficiently. The principal property of a multipartitioned array is that for a line sweep along any array dimension, all processors have the same number of tiles to compute at each step in the sweep; in other words, it describes a Latin hyper-rectangle, a natural extension of the notion of Latin squares. This property results in full, balanced parallelism. A secondary benefit of multipartitionings is that they induce only coarse-grain communication. All of the multipartitionings described in the literature to date assign only one tile per processor per hyperplane of a multipartitioning (a Latin hypercube). While this class of multipartitionings is optimal for two dimensions, in three dimensions it requires the number of processors to be a perfect square. This paper considers the general problem of computing optimal multipartitionings for multidimensional data volumes on an arbitrary number of processors. We describe an algorithm to compute a d-dimensional multipartitioning of a multidimensional array for an arbitrary number of processors. When using a multipartitioning to parallelize a line-sweep computation, the best partitioning is the one that exploits all of the processors and has the smallest communication volume. To compute the best multipartitioning of a multidimensional array, we describe a cost model for selecting d, the dimensionality of the best partitioning, and the number of cuts along each partitioned dimension. In practice, our technique will choose a 3-dimensional multipartitioning for a 3-dimensional line-sweep computation, except when p is a prime; previously, a 3-dimensional multipartitioning could be a...
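The "Latin hyper-rectangle" property the abstract defines, equal per-processor tile counts in every hyperplane along every dimension, is easy to state as an executable check; a small sketch (function and parameter names are ours):

```python
# Sketch: check the defining "Latin hyper-rectangle" property of a
# tile-to-processor map: along every array dimension, every hyperplane
# of tiles must give each processor the same number of tiles, so a line
# sweep stays load-balanced at every step.
from itertools import product

def is_balanced_partitioning(owner, cuts, p):
    """owner maps a tile-coordinate tuple to a processor id; cuts[i] is
    the number of tiles along dimension i; p is the processor count."""
    all_tiles = list(product(*(range(c) for c in cuts)))
    for dim in range(len(cuts)):
        for plane in range(cuts[dim]):
            tiles = [t for t in all_tiles if t[dim] == plane]
            per_proc = [sum(1 for t in tiles if owner[t] == q)
                        for q in range(p)]
            if len(set(per_proc)) != 1:   # unequal load in this plane
                return False
    return True
```

A 2-D diagonal assignment (a Latin square) passes this check, while mapping every tile to one processor fails it.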
Internal Accession Date Only
, 2000
Abstract
Keywords: systolic array synthesis, affine scheduling. We describe a new, practical, constructive method for solving the well-known conflict-free scheduling problem for the locally sequential, globally parallel (LSGP) case of processor array synthesis. Previous solutions have an important practical disadvantage. Here we provide a closed-form solution that enables the enumeration of all conflict-free schedules. The second part of the paper discusses reduction of the cost of hardware whose function is to control the flow of data, enable or disable functional units, and generate memory addresses. We present a new technique for controlling the complexity of these housekeeping functions in a processor array.
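The conflict-free condition for LSGP that the abstract refers to can be illustrated with a brute-force check, rather than the paper's closed form: under a linear schedule, no two iteration points executed by the same processor may receive the same time step. The schedule vector and tile shape below are illustrative assumptions.

```python
# Sketch: brute-force check of the conflict-free condition in LSGP
# processor-array synthesis. Each processor sequentially executes one
# rectangular tile of iteration points; a linear schedule
# tau(x) = lam . x is conflict-free for that processor iff it assigns
# distinct times to all points of the tile.
from itertools import product

def conflict_free(lam, tile):
    """lam: schedule vector; tile: per-dimension extents of the block
    of points one processor executes sequentially."""
    times = [sum(l * x for l, x in zip(lam, pt))
             for pt in product(*(range(t) for t in tile))]
    return len(times) == len(set(times))
```

For a 2x2 tile, lam = (1, 2) yields times 0, 2, 1, 3 (conflict-free), while lam = (1, 1) gives two points time 1 (a conflict); the paper's contribution is enumerating all conflict-free lam in closed form instead of testing candidates.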
Hardware Synthesis for Systems of Recurrence Equations with Multidimensional Schedule
"... Abstract — This paper introduces methods for extending the classical systolic synthesis methodology to multidimensional time. Multidimensional scheduling enables complex algorithms that do not admit linear schedules to be parallelized, but it requires the use of memories in the architecture. The s ..."
Abstract
Abstract — This paper introduces methods for extending the classical systolic synthesis methodology to multidimensional time. Multidimensional scheduling enables complex algorithms that do not admit linear schedules to be parallelized, but it requires the use of memories in the architecture. The synthesis of such an architecture requires the definition of an allocation function that maps the calculations onto the processors, and memory functions that define where the data are stored during execution. As our approach targets custom VLSI architectures, we constrain the synthesis method to produce parallel architectures that satisfy the owner-computes rule, i.e., each processor computes the data that are stored in its local memory. We explain how to combine the allocation and memory functions in order to meet the owner-computes rule, and we present an original mechanism for controlling the operation of the architecture. We detail the different steps needed to generate an HDL description of the architecture, and we illustrate our method on the matrix multiplication algorithm. We describe a structural VHDL program that has been derived and synthesized for an FPGA platform using these design principles. Our results show that the complexity added in each processor by the memories and the control is moderate and justifies in practice the use of such architectures.
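To illustrate the owner-computes constraint on the paper's running example, here is a sequential sketch of matrix multiplication organized the way the abstract describes; the loop structure mirrors the architecture (processor grid plus a memory-backed time dimension) rather than implementing the synthesis method itself.

```python
# Sketch: the owner-computes rule on matrix multiplication, the
# paper's running example. Processor (i, j) owns C[i][j] in its local
# memory and therefore performs every k-step of that element's
# accumulation; partial sums never cross processor boundaries, and the
# k loop plays the role of the extra (memory-backed) time dimension.
def matmul_owner_computes(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):              # processor row
        for j in range(n):          # processor column: owner of C[i][j]
            acc = 0
            for k in range(n):      # sequential time steps at the owner
                acc += A[i][k] * B[k][j]
            C[i][j] = acc           # result stays in the owner's memory
    return C
```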
Abstract
Abstract
This paper introduces basic principles for extending the classical systolic synthesis methodology to multidimensional time. Multidimensional scheduling enables complex algorithms that do not admit linear schedules to be parallelized, but it also implies the use of memories in the architecture. The paper explains how to obtain compatible allocation and memory functions for VLSI (or SIMD-like code) generation. It also presents an original mechanism for controlling a VLSI architecture which has a multidimensional schedule. A structural VHDL code has been derived and synthesized (for implementation on an FPGA platform) using these systematic design principles. These results are preliminary steps to the possibility of a systematic hardware synthesis for multidimensional time.
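A one-line reminder of what "multidimensional time" means operationally, under the usual assumption that execution follows the lexicographic order of time vectors:

```python
# Sketch: under a multidimensional schedule, each operation gets a time
# *vector* rather than a scalar, and execution follows the
# lexicographic order of those vectors; in the synthesized hardware the
# outer component is realized via local memories, not extra clocks.
timestamps = [(1, 0), (0, 2), (0, 1), (1, 1)]
execution_order = sorted(timestamps)  # tuples sort lexicographically
```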
On Efficient Parallelization of Line-Sweep Computations
, 2001
Abstract
Multipartitioning is a strategy for partitioning multidimensional arrays on a collection of processors. With multipartitioning, computations that require solving 1-D recurrences along each dimension of a multidimensional array can be parallelized effectively. Previous techniques for multipartitioning yield efficient parallelizations over 3-D domains only when the number of processors is a perfect square. This paper considers the general problem of computing optimal multipartitionings for d-dimensional data volumes on an arbitrary number of processors. We describe an algorithm that computes an optimal multipartitioning for this general case, which enables efficient parallelizations of line-sweep computations under arbitrary conditions. Finally, we describe a prototype implementation of generalized multipartitioning in the Rice dHPF compiler and performance results obtained when using it to parallelize a line-sweep computation for different numbers of processors.