Results 1-10 of 59
Generation of Efficient Nested Loops from Polyhedra
International Journal of Parallel Programming, 2000
Cited by 72 (3 self)
Automatic parallelization in the polyhedral model is based on affine transformations from an original computation domain (iteration space) to a target space-time domain, often with a different transformation for each variable. Code generation is an often-ignored step in this process that has a significant impact on the quality of the final code. It involves making a tradeoff between code size and control code simplification/optimization. Previous code generation methods are based on loop splitting; however, they behave non-optimally on parameterized programs. We present a general parameterized method for code generation based on the dual representation of polyhedra. Our algorithm uses a simple recursion on the dimensions of the domains, and enables fine control over the tradeoff between code size and control overhead.
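To make the tradeoff between control overhead and loop structure concrete, here is a minimal hand-written sketch in Python. It is not the algorithm of the paper: the domain {(i, j) | 0 <= i <= j <= n} and both function names are assumptions chosen for illustration. The first version scans a bounding box and tests a guard on every iteration; the second folds the constraint into the loop bounds.

```python
def scan_with_guards(n):
    """Scan the bounding box 0..n x 0..n and filter with a guard.

    The guard `i <= j` is evaluated on every iteration (control overhead).
    """
    return [(i, j) for i in range(n + 1) for j in range(n + 1) if i <= j]


def scan_with_tight_bounds(n):
    """Scan the same domain with bounds derived from the constraints.

    The constraint i <= j becomes the lower bound of the inner loop,
    so no guard is needed in the loop body.
    """
    points = []
    for i in range(n + 1):
        for j in range(i, n + 1):
            points.append((i, j))
    return points
```

Both functions visit the same points in the same lexicographic order for any parameter value `n`; only the amount of control code executed per iteration differs.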
Compaan: Deriving Process Networks from Matlab for Embedded Signal Processing Architectures
In Proceedings of the 8th International Workshop on Hardware/Software Codesign (CODES), 2000
Cited by 50 (11 self)
This paper presents the Compaan tool that automatically transforms a nested loop program written in Matlab into a process-network specification. The process
Parameterized Polyhedra and their Vertices
International Journal of Parallel Programming, 1995
Cited by 42 (11 self)
Algorithms specified for parametrically sized problems are more general-purpose and more reusable than algorithms for fixed-size problems. For this reason, there is a need for representing and symbolically analyzing linearly parameterized algorithms. An important class of parallel algorithms can be described as systems of parameterized affine recurrence equations (PARE). In this representation, linearly parameterized polyhedra are used to describe the domains of variables. This paper describes an algorithm that computes the set of parameterized vertices of a polyhedron, given its representation as a system of parameterized inequalities. This provides an important tool for the symbolic analysis of the parameterized domains used to define variables and computation domains in PAREs. A library of operations on parameterized polyhedra based on the Polyhedral Library has been written in C and is freely distributed. 1 Introduction In order to improve the performance of scientific programs...
Generating Cache Hints for Improved Program Efficiency
Journal of Systems Architecture, 2004
Cited by 29 (4 self)
One of the new extensions in EPIC architectures is cache hints. Two kinds of hints can be attached to each memory instruction: a source cache hint and a target cache hint. The source hint indicates the true latency of the instruction, which the compiler uses to improve the instruction schedule. The target hint indicates at which cache levels it is profitable to retain data, which allows cache replacement decisions to be improved at run time. A compile-time method is presented which calculates appropriate cache hints. Both kinds of hints are based on the locality of the instruction, measured by the reuse distance metric. Two
Analytical Computation of Ehrhart Polynomials: Enabling more Compiler Analyses and Optimizations
In CASES, 2004
Cited by 28 (10 self)
Many optimization techniques, including several targeted specifically at embedded systems, depend on the ability to calculate the number of elements that satisfy certain conditions. If these conditions can be represented by linear constraints, then such problems are equivalent to counting the number of integer points in (possibly) parametric polytopes. It is well known that this parametric count can be represented by a set of Ehrhart polynomials. Previously, interpolation was used to obtain these polynomials, but this technique has several disadvantages. Its worst-case computation time for a single Ehrhart polynomial is exponential in the input size, even for fixed dimensions. The worst-case size of such an Ehrhart polynomial (measured in bits needed to represent the polynomial) is also exponential in the input size. Under certain conditions this technique even fails to produce a solution.
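As a small worked instance of a parametric count, consider the triangle {(i, j) | 0 <= i <= j <= n}, chosen here purely for illustration and not taken from the paper. Its number of integer points is the polynomial (n + 1)(n + 2) / 2 in the parameter n. The Python sketch below checks that closed form against brute-force enumeration; the function names are hypothetical.

```python
def count_points_brute_force(n):
    """Enumerate the integer points of {(i, j) | 0 <= i <= j <= n}."""
    return sum(1 for i in range(n + 1) for j in range(i, n + 1))


def count_points_closed_form(n):
    """Evaluate the polynomial (n + 1)(n + 2) / 2 for the same triangle."""
    return (n + 1) * (n + 2) // 2


if __name__ == "__main__":
    # The two counts should agree for every non-negative parameter value.
    for n in range(10):
        assert count_points_brute_force(n) == count_points_closed_form(n)
```

The brute-force version takes time proportional to the number of points, whereas the closed form is evaluated in constant time for any n, which is why a symbolic count is useful inside a compiler.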
PolyLib: A Library for Manipulating Parameterized Polyhedra
1999
Cited by 27 (3 self)
This document is a continuation of the technical report [Wil93], describing release 1.1 of the PolyLib. Version 1.1 manipulates non-parameterized unions of polyhedra through the following operations: intersection, difference, union, convex hull, simplify, image and preimage, plus some input and output functions. The polyhedra are computed in their dual implicit and Minkowski representations, in homogeneous spaces. Each polyhedron is represented by two matrices: a matrix of lines and rays, and a matrix of equalities and inequalities. The first column of these matrices distinguishes lines from rays and equalities from inequalities, respectively.
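The constraint-matrix idea described above can be sketched in Python as follows. This is an illustrative encoding, not PolyLib's C API: the row layout `[kind, a_1, ..., a_d, c]`, with `kind` 0 for an equality `a.x + c == 0` and 1 for an inequality `a.x + c >= 0`, is an assumption modeled on the description, and `CONSTRAINTS` and `contains` are hypothetical names.

```python
# Assumed row layout: [kind, a_1, ..., a_d, c]
#   kind == 0  ->  a.x + c == 0   (equality)
#   kind == 1  ->  a.x + c >= 0   (inequality)
CONSTRAINTS = [
    [1, 1, 0, 0],    # i >= 0
    [1, -1, 1, 0],   # j - i >= 0
    [1, 0, -1, 5],   # 5 - j >= 0
]


def contains(constraints, point):
    """Return True if `point` satisfies every row of the constraint matrix."""
    for kind, *rest in constraints:
        *coeffs, const = rest
        value = sum(a * x for a, x in zip(coeffs, point)) + const
        if kind == 0 and value != 0:
            return False
        if kind == 1 and value < 0:
            return False
    return True
```

With these rows, `contains(CONSTRAINTS, (2, 3))` holds because 0 <= 2 <= 3 <= 5, while `(3, 2)` violates the second row.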
Precise Data Locality Optimization of Nested Loops
J. Supercomput., 2002
Cited by 25 (3 self)
Improving the usage of the memory hierarchy is a significant means of enhancing application performance and reducing power consumption in embedded processor applications. In this paper, a temporal and spatial locality optimization framework for nested loops is proposed, driven by parameterized cost functions. The considered loops can be imperfectly nested. New data layouts are propagated through the connected references and through the loop nests as constraints for optimizing the next connected reference in the same nest or in the other ones. Unlike many existing methods, special attention is paid to TLB (Translation Lookaside Buffer) effectiveness, since TLB misses can take from tens to hundreds of processor cycles. Our approach only considers active data, that is, array elements that are actually accessed by a loop, in order to prevent useless memory loads and take advantage of storage compression and temporal locality. Moreover, the same data transformation is not necessarily applied to a whole array. Depending on the referenced data subsets, the transformation can result in different data layouts for the same array. This can significantly improve performance, since a priori incompatible references can be optimized simultaneously. Finally, the process considers not only the innermost loop level but all levels. Hence, large strides when control returns to the enclosing loop are avoided in several cases, and better optimization is provided when the innermost loop has a small index range.
Counting integer points in parametric polytopes using Barvinok’s rational functions
Algorithmica, 2007
Cited by 18 (6 self)
Many compiler optimization techniques depend on the ability to calculate the number of elements that satisfy certain conditions. If these conditions can be represented by linear constraints, then such problems are equivalent to counting the number of integer points in (possibly) parametric polytopes. It is well known that the enumerator of such a set can be represented by an explicit function consisting of a set of quasipolynomials, each associated with a chamber in the parameter space. Previously, interpolation was used to obtain these quasipolynomials, but this technique has several disadvantages. Its worst-case computation time for a single quasipolynomial is exponential in the input size, even for fixed dimensions. The worst-case size of such a quasipolynomial (measured in bits needed to represent the quasipolynomial) is also exponential in the input size. Under certain conditions this technique even fails to produce a solution. Our main contribution is a novel method for calculating the required quasipolynomials analytically. It extends an existing method, based on Barvinok’s decomposition,
Handling Memory Cache Policy with Integer Points Countings
1997
Cited by 17 (0 self)
We propose an automatic method for computing useful information on the distribution of data in the memory cache. The number of distinct memory locations and cache lines touched by a loop, and the number of processors resulting from a distribution (CYCLIC, BLOCK, REPLICATE) of array elements are computed. It is shown that these problems fall under the same general mathematical problem, which is the counting of the exact number of integer points resulting from linear and nonlinear mappings of polytopes. 1 Introduction The construction of realistic and efficient parallel programs requires the extraction and the use of much information, concerning both the potential parallelism and the resource needs of the algorithm and the availability of resources on the target architecture. But these are generally not trivial to determine. Nowadays, processors are designed with a memory hierarchy organized into several levels, each of which is smaller, faster, and ...
Design Space Exploration for Massively Parallel Processor Arrays
Parallel Computing Technologies, 6th International Conference, PaCT 2001, Proceedings, 2001
Cited by 16 (11 self)
In this paper, we describe an approach for the optimization of dedicated coprocessors that are implemented either in hardware (ASIC) or configware (FPGA). Such massively parallel coprocessors are typically part of a heterogeneous hardware/software system. Each coprocessor is a massively parallel system consisting of an array of processing elements (PEs). In order to decide whether to map a computationally intensive task into hardware, existing approaches either try to optimize for performance or for cost with the other objective being a secondary goal.