Automatic Application-Specific Instruction-Set Extensions Under Microarchitectural Constraints
, 2003
Cited by 95 (23 self)
Many commercial processors now offer the possibility of extending their instruction set for a specific application, that is, to introduce customised functional units. There is a need to develop algorithms that decide automatically, from high-level application code, which operations are to be carried out in the customised extensions. A few algorithms exist but are severely limited in the type of operation clusters they can choose and hence reduce significantly the effectiveness of specialisation. In this paper we introduce a more general algorithm which selects maximal-speedup convex subgraphs of the application dataflow graph under fundamental microarchitectural constraints, and which improves significantly on the state of the art.
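The convexity requirement mentioned in this abstract can be illustrated with a small check. This is only a sketch, not the paper's selection algorithm: it assumes the usual definition of convexity (no path between two nodes of the subgraph passes through a node outside it), and the names `is_convex`, `succ`, and `dfg` are hypothetical.

```python
def is_convex(succ, subset):
    """Return True if no path leaves `subset` and later re-enters it.

    `succ` maps each node of a dataflow DAG to a list of its successors
    (a hypothetical representation chosen for this sketch).
    """
    subset = set(subset)
    # Start from every outside node that is fed directly by the subset.
    stack = [m for n in subset for m in succ.get(n, ()) if m not in subset]
    seen = set(stack)
    while stack:
        node = stack.pop()
        for m in succ.get(node, ()):
            if m in subset:
                return False  # a path went out of the subset and came back
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return True


# Tiny dataflow graph: a -> b -> c and a -> c.
dfg = {"a": ["b", "c"], "b": ["c"], "c": []}
print(is_convex(dfg, {"a", "b"}))  # True
print(is_convex(dfg, {"a", "c"}))  # False: the path a -> b -> c leaves the set at b
```

A non-convex cluster cannot be issued as one atomic instruction, because part of its input would depend on its own output; this is why candidate clusters are restricted to convex subgraphs.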
A General Compiler Framework for Speculative Multithreading
, 2002
Cited by 39 (1 self)
Speculative multithreading (SpMT) promises to be an effective mechanism for parallelizing non-numeric programs, which tend to use irregular data structures with pointers and have complex flows of control. Proper thread selection is crucial to obtaining good speedup in an SpMT system. This paper presents a compiler framework for partitioning a sequential program into multiple threads for parallel execution in an SpMT system. This framework is very general, and supports a wide variety of threads, such as speculative threads, non-speculative threads, loop-centric threads, and out-of-order thread spawning. To do efficient partitioning, the compiler uses profiling, intra-procedural pointer analysis, data dependence information and control dependence information. Our compiler framework is implemented on the SUIF-MachSUIF platform, and is able to partition large programs, such as the SPEC benchmarks. A simulation-based evaluation of the generated threads shows that an average speedup of 3 can be obtained with 6 processing elements for non-numeric programs. This speedup reduces to 2 if we use only loop-based threads.
Exact and approximate algorithms for the extension of embedded processor instruction sets
- IEEE Trans. on CAD of Integrated Circuits and Systems
Cited by 30 (14 self)
In embedded computing, cost, power, and performance constraints call for the design of specialized processors, rather than for the use of the existing off-the-shelf solutions. While the design of these application-specific CPUs could be tackled from scratch, a cheaper and more effective option is that of extending the existing processors and toolchains. Extensibility is indeed a feature now offered in real designs, e.g., by processors such as Tensilica Xtensa [T. R. Halfhill, Microprocess
A model-based framework: an approach for profit-driven optimization
- In Third Annual IEEE/ACM International Conference on Code Generation and Optimization
, 2005
Cited by 22 (6 self)
Although optimizations have been applied for a number of years to improve the performance of software, problems that have been long-standing remain, which include knowing what optimizations to apply and how to apply them. To systematically tackle these problems, we need to understand the properties of optimizations. In our current research, we are investigating the profitability property, which is useful for determining the benefit of applying an optimization. Due to the high cost of applying optimizations and then experimentally evaluating their profitability, we use an analytic model framework for predicting the profitability of optimizations. In this paper, we target scalar optimizations, and in particular, describe framework instances for Partial Redundancy Elimination (PRE) and Loop Invariant Code Motion (LICM). We implemented the framework for both optimizations and compare profit-driven PRE and LICM with a heuristic-driven approach. Our experiments demonstrate that a model-based approach is effective and efficient in that it can accurately predict the profitability of optimizations with low overhead. By predicting the profitability using models, we can selectively apply optimizations. The model-based approach does not require tuning of parameters used in heuristic approaches and works well across different code contexts and optimizations.
A Unified Theory of Timing Budget Management
- In IEEE/ACM International Conference on Computer-Aided Design
, 2004
Cited by 18 (3 self)
This paper presents a theoretical framework that optimally solves many open problems in time budgeting. Our approach unifies a large class of existing time-management paradigms. Examples include time budgeting for maximizing total weighted delay relaxation, minimizing the maximum relaxation and min-skew time budget distribution. We show that many of the time management problems can be transformed into a min-cost flow instance that can be optimally and efficiently solved through well-known combinatorial techniques. Experiments include mapping of several designs, which are implemented using parameterized CoreGen IP cores, on Xilinx FPGA devices. Different time budgeting policies have been applied during the mapping stage. Our time management techniques always improved the area requirement of the implemented testbenches compared to a widely-used path-based method. We also compared the maximum budgeting and fairness in delay budget assignments. Our experimental results show that an average improvement of 19% in area can be achieved when fairness and maximum budgeting policies are combined, compared to pure maximum budgeting.
Input data reuse in compiling window operations onto reconfigurable hardware
- Proc. ACM Symp. on Languages, Compilers and Tools for Embedded Systems (LCTES)
, 2004
Cited by 15 (6 self)
Balancing computation with I/O has been considered a critical factor in the overall performance of embedded systems in general and reconfigurable computing systems in particular. Data I/O often dominates the overall computation performance for window operations, which are frequently used in image processing, image compression, pattern recognition and digital signal processing. This problem is more acute in reconfigurable systems since the compiler must generate the data path and the sequence of operations. The challenge is to intelligently exploit data reuse on the reconfigurable fabric (FPGA) to minimize the required memory or I/O bandwidth while maximizing parallelism. In this paper, we present a compile-time approach to reuse data in window-based codes. The compiler, called ROCCC, first analyzes and optimizes the window operation in C. It then computes the size of the hardware buffer and defines three sets of data values for each window: the window set, the managed set and the killed set. This compile-time analysis simplifies the HDL code generation and improves the resulting hardware performance. We also discuss in-place window operations.
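The benefit of reusing input data across overlapping windows can be seen in a small software model. This is a sketch only and not ROCCC's analysis: it does not model the window, managed, or killed sets, a `deque` merely stands in for an on-chip buffer, and the function names are hypothetical.

```python
from collections import deque


def naive_window_sums(data, k):
    """Sum each width-k window, re-fetching all k inputs for every window."""
    out, fetches = [], 0
    for i in range(len(data) - k + 1):
        out.append(sum(data[i:i + k]))
        fetches += k  # every window reads its k elements from memory again
    return out, fetches


def buffered_window_sums(data, k):
    """Sum each width-k window while fetching every input exactly once."""
    buf = deque(maxlen=k)  # stand-in for a small hardware buffer
    out, fetches = [], 0
    for x in data:
        buf.append(x)  # when full, the oldest value is dropped automatically
        fetches += 1
        if len(buf) == k:
            out.append(sum(buf))
    return out, fetches


data = [1, 2, 3, 4, 5]
print(naive_window_sums(data, 3))     # ([6, 9, 12], 9)
print(buffered_window_sums(data, 3))  # ([6, 9, 12], 5)
```

Both versions produce the same window sums, but the buffered one issues one memory fetch per input element instead of k per window, which is the kind of bandwidth saving that data reuse aims for.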
Architecture and Synthesis for On-Chip Multicycle Communication
- IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems
, 2004
Cited by 9 (3 self)
For multigigahertz designs in nanometer technologies, data transfers on global interconnects take multiple clock cycles. In this paper, we propose a regular distributed register (RDR) microarchitecture, which offers high regularity and direct support of multicycle on-chip communication. The RDR microarchitecture divides the entire chip into an array of islands so that all local computation and communication within an island can be performed in a single clock cycle. Each island contains a cluster of computational elements, local registers, and a local controller. On top of the RDR microarchitecture, novel layout-driven architectural synthesis algorithms have been developed for multicycle communication, including scheduling-driven placement, placement-driven simultaneous scheduling with rebinding, and distributed control generation. Experiments on a number of real-life examples demonstrate promising results. For dataflow-intensive examples, we obtain a 44% improvement on average in terms of the clock period and a 37% improvement on average in terms of the final latency, over the traditional flow. For designs with control flow, our approach achieves a 28% clock-period reduction and a 23% latency reduction on average.
Architectural synthesis integrated with global placement for multi-cycle communication
- Proc. of International Conference on Computer Aided Design
, 2003
Cited by 7 (3 self)
Multiple clock cycles are needed to cross the global interconnects for multi-gigahertz designs in nanometer technologies. For synchronous design, this requires the consideration of multi-cycle on-chip communication at the high level. In this paper, we present a new architectural synthesis system integrated with global placement, named MCAS (Multi-Cycle Architectural Synthesis), on top of the recently-proposed Regular Distributed Register (RDR) micro-architecture [3]. The RDR architecture provides a regular synthesis platform for supporting multi-cycle communication. Novel architectural synthesis algorithms that integrate high-level synthesis with global placement have been developed in MCAS, including scheduling-driven placement and distributed controller generation. Experimental results show that our methodology can achieve a clock period improvement of 31% and a total latency improvement of 24% on average compared to the conventional architectural synthesis flow.
Data Communication Estimation and Reduction for Reconfigurable Systems
- In Proc. 40th Design Automation Conf.
, 2003
Cited by 6 (3 self)
Widespread adoption of reconfigurable devices requires system level synthesis techniques to take an application written in a high level language and map it to the reconfigurable device. This paper describes methods for synthesizing the internal representation of a compiler into a hardware description language in order to program reconfigurable hardware devices. We demonstrate the usefulness of static single assignment (SSA) in reducing the amount of data communication in the hardware. However, the placement of φ-nodes by current SSA algorithms is not optimal in terms of minimizing data communication. We propose a new algorithm which optimally places φ-nodes, further decreasing area and communication latency. Our algorithm reduces the data communication (measured as total edge weight in a control data flow graph) by as much as 20% for some applications as compared to the best-known SSA algorithm, the pruned algorithm. We also describe future modifications to our model that should increase the effectiveness of our methods.
A Compiler Intermediate Representation for Reconfigurable Fabrics
- International Conference on Field Programmable Logic and Applications
, 2006
Cited by 5 (1 self)
An intermediate representation (IR) is a central structure around which tools such as compilers and synthesis tools are built. In this paper we propose such an IR specifically designed for reconfigurable fabrics: CIRRF (Compiler

