Exact Memory Size Estimation for Array Computations without Loop Unrolling
In Proceedings of the 36th Annual ACM/IEEE Design Automation Conference (DAC '99), 1999
Cited by 51 (0 self)
Abstract: This paper presents a new algorithm for exact estimation of the minimum memory size required by programs dealing with array computations. Memory size is an important factor affecting the area and power cost of memory units. For programs dealing mostly with array computations, memory cost is a dominant factor in the overall system cost. Thus, exact estimation of the memory size required by a program is necessary to provide quantitative information for making high-level design decisions. Based on formulated live-variable analysis, our algorithm transforms minimum memory size estimation into an equivalent problem: integer point counting for intersections/unions of mappings of parameterized polytopes. A heuristic is then proposed to solve the counting problem. Experimental results show that the algorithm achieves the exactness traditionally associated with totally unrolling loops while exploiting the reduced computational complexity of preserving the original loop structure.
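The exactness that total unrolling would give can be illustrated with a brute-force sketch: simulate a concrete schedule, track which array elements are live (written but not yet past their last read), and record the peak. The producer/consumer schedule below is an invented toy example, not from the paper; the paper's contribution is obtaining the same exact count analytically, by counting integer points in parameterized polytopes, without unrolling.

```python
def peak_live_elements(N):
    # Toy schedule: a producer loop writes a[0..N-1] at times 0..N-1, then a
    # consumer loop reads a[i] once at time N+i. Element i is live from its
    # write until its last read.
    write_time = {i: i for i in range(N)}
    last_read = {i: N + i for i in range(N)}
    peak = 0
    for t in range(2 * N):
        live = sum(1 for i in range(N) if write_time[i] <= t <= last_read[i])
        peak = max(peak, live)
    return peak

print(peak_live_elements(8))   # all N elements are live at the loop boundary -> 8
```

For this access pattern the exact minimum memory size equals N, since every element produced by the first loop is still awaiting its read when the second loop starts.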
Fast and Extensive System-Level Memory Exploration for ATM
In 10th International Symposium on System Synthesis (ISSS '97), 1997
Cited by 29 (7 self)
Abstract: In this paper, our memory architecture exploration methodology and CAD techniques for network protocol applications are presented. Prototype tools have been implemented and applied to part of an industrial ATM application to show how our novel approach can be used to easily and thoroughly explore the memory organization search space at the system level. An extended, novel method for signal-to-memory assignment is proposed which takes memory access conflict constraints into account. The number of conflicts is first optimized by our flow-graph balancing technique. Significant power and area savings were obtained by performing the exploration thoroughly along each of the degrees of freedom in the global search space.
Flow Graph Balancing for Minimizing the Required Memory Bandwidth
In ISSS, La Jolla, CA, 1996
Cited by 25 (4 self)
Abstract: In this paper we present the problem of flow graph balancing for minimizing the required memory bandwidth. Our goal is to minimize the required memory bandwidth within a given cycle budget by adding ordering constraints to the flow graph. This allows the subsequent memory allocation and assignment tasks to arrive at a cheaper memory architecture with fewer memories and memory ports. The effect of flow graph balancing is shown on an example. We show that it is important to take into account which data are being accessed in parallel, instead of only considering the number of simultaneous memory accesses. This leads to the optimization of a conflict graph.
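The conflict-graph idea mentioned above can be sketched minimally: two arrays conflict if some schedule step accesses both in the same cycle, since they then cannot share a single-port memory. The schedule below is a made-up example for illustration, not taken from the paper.

```python
from itertools import combinations

def conflict_graph(schedule):
    """schedule: list of sets, each the arrays accessed in one cycle.
    Returns the set of conflict edges (pairs that must be assigned to
    different memories or to multi-port memories)."""
    edges = set()
    for cycle in schedule:
        for a, b in combinations(sorted(cycle), 2):
            edges.add((a, b))
    return edges

# Cycle 1 accesses A and B together, cycle 3 accesses B and C together,
# so (A, B) and (B, C) conflict while (A, C) does not.
print(conflict_graph([{'A', 'B'}, {'A'}, {'B', 'C'}]))
```

Balancing the flow graph then amounts to choosing orderings that keep this edge set (and hence the required number of memories and ports) small.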
Data dependency size estimation for use in memory optimization
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2003
Cited by 19 (5 self)
Abstract: A novel storage requirement estimation methodology is presented for use in the early system design phases, when the data transfer ordering is only partly fixed. At that stage, none of the existing estimation tools are adequate, as they either assume a fully specified execution order or ignore it completely. This paper presents an algorithm for automated estimation of strict upper and lower bounds on the individual data dependency sizes in high-level application code, given a partially fixed execution ordering. In the overall estimation technique, this is followed by detection of the maximal combined size of simultaneously alive dependencies, resulting in the overall storage requirement of the application. Using representative application demonstrators, we show how our techniques can effectively guide the designer to a transformed specification with low storage requirements.
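The notion of bounds under a partially fixed ordering can be demonstrated by brute force: enumerate every execution order consistent with the given constraints, measure peak storage for each, and report the min/max as lower/upper bounds. The paper computes such bounds analytically; this exhaustive version only works for toy sizes, and the tasks and constraints below are invented for illustration.

```python
from itertools import permutations

def storage_bounds(writes, reads, before):
    """writes/reads: task -> set of elements written / last read (a read frees
    its elements). before: set of (x, y) pairs, task x must precede task y.
    Returns (lower, upper) bounds on peak storage over all legal orders."""
    tasks = sorted(set(writes) | set(reads))
    peaks = []
    for order in permutations(tasks):
        pos = {t: k for k, t in enumerate(order)}
        if any(pos[x] > pos[y] for x, y in before):
            continue                       # violates the partial order
        live, peak = set(), 0
        for t in order:
            live |= writes.get(t, set())   # elements become alive at the write
            peak = max(peak, len(live))
            live -= reads.get(t, set())    # and die at their last read
        peaks.append(peak)
    return min(peaks), max(peaks)

# Two independent write/read pairs: interleaving them needs 1 location,
# serializing both writes first needs 2, so the bounds are (1, 2).
print(storage_bounds({'w0': {0}, 'w1': {1}},
                     {'r0': {0}, 'r1': {1}},
                     {('w0', 'r0'), ('w1', 'r1')}))
```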
Reducing Memory Requirements of Nested Loops for Embedded Systems
In DAC, 2001
Cited by 19 (0 self)
Abstract: Most embedded systems have a limited amount of memory. In contrast, the memory requirements of code (in particular loops) running on embedded systems are significant. This paper addresses the problem of estimating the amount of memory needed for data transfers in embedded systems. The problem of estimating the region associated with a statement, i.e., the set of elements referenced by a statement during the execution of the entire nested loop, is analyzed. A quantitative analysis of the number of elements referenced is presented; exact expressions for uniformly generated references and close upper and lower bounds for non-uniformly generated references are derived. In addition to presenting an algorithm that computes the total memory required, we discuss the effect of transformations on the lifetimes of array variables, i.e., the time between the first and last accesses to a given array location. The term maximum window size is introduced, and quantitative expressions are derived to compute it, together with a detailed analysis of the effect of unimodular transformations on data locality. The smaller the value of the maximum window size, the higher the degree of data locality in the loop.
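The "maximum window size" notion can be sketched directly from its definition: the largest number of array locations simultaneously alive (first access already seen, last access still to come) at any point in the execution. The traces below are invented toy examples showing why a transformation that brings each write closer to its last read shrinks the window.

```python
def max_window(trace):
    """trace: sequence of array addresses in execution order."""
    first, last = {}, {}
    for t, a in enumerate(trace):
        first.setdefault(a, t)   # first access to each address
        last[a] = t              # last access to each address
    # peak count of addresses whose lifetime [first, last] covers time t
    return max(sum(1 for a in first if first[a] <= t <= last[a])
               for t in range(len(trace)))

# a[i] written then read immediately: the window never exceeds 1
fused = [a for i in range(4) for a in (i, i)]
# all writes before all reads: the window grows to the full array size
split = list(range(4)) + list(range(4))
print(max_window(fused), max_window(split))
```

The fused trace reports a window of 1 and the split trace a window of 4, matching the paper's point that a smaller maximum window size corresponds to better data locality.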
Counting integer points in parametric polytopes using Barvinok’s rational functions
Algorithmica, 2007
Cited by 18 (6 self)
Abstract: Many compiler optimization techniques depend on the ability to calculate the number of elements that satisfy certain conditions. If these conditions can be represented by linear constraints, then such problems are equivalent to counting the number of integer points in (possibly) parametric polytopes. It is well known that the enumerator of such a set can be represented by an explicit function consisting of a set of quasi-polynomials, each associated with a chamber in the parameter space. Previously, interpolation was used to obtain these quasi-polynomials, but this technique has several disadvantages. Its worst-case computation time for a single quasi-polynomial is exponential in the input size, even for fixed dimensions. The worst-case size of such a quasi-polynomial (measured in bits needed to represent it) is also exponential in the input size. Under certain conditions this technique even fails to produce a solution. Our main contribution is a novel method for calculating the required quasi-polynomials analytically. It extends an existing method based on Barvinok's decomposition...
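A tiny brute-force check illustrates the quasi-polynomial structure such enumerators have. Counting the integers i with 0 <= 2i <= N gives floor(N/2) + 1, a polynomial with a periodic (period-2) correction; this is an example constraint chosen here for illustration, while Barvinok-based methods like the paper's derive such closed forms symbolically.

```python
def count(N):
    # naive enumeration of integer points i with 0 <= 2*i <= N
    return sum(1 for i in range(N + 1) if 2 * i <= N)

# the enumerator is the quasi-polynomial floor(N/2) + 1:
# N/2 + 1 on even N, (N+1)/2 on odd N
for N in range(20):
    assert count(N) == N // 2 + 1
print("quasi-polynomial floor(N/2) + 1 matches brute force for N < 20")
```

For genuinely parametric polytopes the parameter space splits into chambers, with one quasi-polynomial per chamber; the periodic coefficient seen here is the one-dimensional shadow of that phenomenon.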
Synthesis of Application-Specific Memory Structures
1995
Cited by 17 (1 self)
Abstract: The benefits of a behavioral synthesis design methodology, including higher designer productivity and shorter time-to-market, are the result of allowing the designer to use more abstract and familiar specifications. One of the most common and familiar abstractions used by hardware and software designers is the array, which allows the specification of a set of values with a unique index associated with each element. A behavioral synthesis tool that accepts specifications containing arrays must map the storage implied by the arrays to memory in an implementation of that specification. In many data-intensive applications, these memory design decisions have a larger impact on the cost, performance, and power of the implementation than any other design decision made by a behavioral synthesis tool. Most behavioral synthesis tools, however, fail to separate the concepts of array specification and memory implementation, which severely restricts the span of designs that can be explored gi...
Experiences with Enumeration of Integer Projections of Parametric Polytopes
In Compiler Construction: 14th Int. Conf., 2005
Cited by 16 (5 self)
Abstract: Many compiler optimization techniques depend on the ability to calculate the number of integer values that satisfy a given set of linear constraints. This count (the enumerator of a parametric polytope) is a function of the symbolic parameters that may appear in the constraints. In an extended problem (the "integer projection" of a parametric polytope), some of the variables that appear in the constraints may be existentially quantified, and the enumerated set then corresponds to the projection of the integer points in a parametric polytope. This paper shows how to...
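What makes integer projection harder than plain counting is that several values of the existentially quantified variable can witness the same projected point, so pairs must not be counted twice. A brute-force sketch, with invented constraint sets, makes the distinction concrete:

```python
def count_projection(pairs):
    """Count distinct x for which some witness y exists, given the integer
    points (x, y) of the constraint set; deduplication is the whole point."""
    return len({x for x, _ in pairs})

N = 5
# x = 3*y, 0 <= y <= N: the map is one-to-one, projection count = N + 1
one_to_one = [(3 * y, y) for y in range(N + 1)]
# x = floor(y/2), 0 <= y <= N: two witnesses y share each x, count shrinks
many_to_one = [(y // 2, y) for y in range(N + 1)]
print(count_projection(one_to_one), count_projection(many_to_one))
```

Enumeration techniques for integer projections have to account symbolically for exactly this witness-sharing, which naive point counting over the full (x, y) space gets wrong.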
Automatic On-chip Memory Minimization for Data Reuse
In FCCM '07: Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2007
Cited by 11 (10 self)
Abstract: FPGA-based computing engines have become a promising option for the implementation of computationally intensive applications due to their high flexibility and parallelism. However, one of the main obstacles to overcome when trying to accelerate an application on an FPGA is the bottleneck in off-chip communication, typically to large memories. Often it is known at compile time that the same data item is accessed many times, and as a result it can be loaded once from large off-chip RAM into scarce on-chip RAM, alleviating this bottleneck. This paper addresses how to automatically derive an address mapping that reduces the size of the required on-chip memory for a given memory access pattern. Experimental results demonstrate that, in practice, our approach reduces on-chip storage requirements to the minimum, corresponding to a reduction in on-chip memory size of up to 40× (average 10×) for some benchmarks compared to a naïve approach. At the same time, no clock period penalty or increase in control logic area compared to this approach is observed for these benchmarks.
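The address-mapping idea can be sketched with a simple modular mapping: if at most W off-chip addresses are live at any time, then addr mod M with M >= W can let a small on-chip buffer hold the reused data without aliasing. The sliding-window (3-tap filter) access pattern below is an invented example; the paper derives suitable mappings automatically from the access pattern.

```python
def buffer_ok(n_iters, taps, M):
    """Check that addr % M never maps two simultaneously live addresses
    to the same on-chip buffer slot, for a sliding-window pattern where
    iteration i reads a[i], ..., a[i + taps - 1]."""
    for i in range(n_iters):
        window = range(i, i + taps)        # live addresses at iteration i
        slots = {a % M for a in window}
        if len(slots) != taps:             # collision: two live addrs alias
            return False
    return True

# A 3-slot buffer suffices for a 3-tap window (a[i] is dead before
# a[i + 3] arrives and reuses its slot); a 2-slot buffer is too small.
print(buffer_ok(100, 3, 3), buffer_ok(100, 3, 2))
```

This is why the minimum on-chip size tracks the peak number of simultaneously live reused elements rather than the total off-chip footprint.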