Compiler optimization-space exploration
- In Proceedings of the International Symposium on Code Generation and Optimization, 2003
Cited by 87 (1 self)
To meet the demands of modern architectures, optimizing compilers must incorporate an ever larger number of increasingly complex transformation algorithms. Since code transformations may often degrade performance or interfere with subsequent transformations, compilers employ predictive heuristics to guide optimizations by predicting their effects a priori. Unfortunately, the unpredictability of optimization interaction and the irregularity of today’s wide-issue machines severely limit the accuracy of these heuristics. As a result, compiler writers may temper high-variance optimizations with overly conservative heuristics or may exclude these optimizations entirely. While this process results in a compiler capable of generating good average code quality across the target benchmark set, it is at the cost of missed optimization opportunities in individual code segments. To replace predictive heuristics, researchers have proposed compilers that explore many optimization options, selecting the best one a posteriori. Unfortunately, these existing iterative compilation techniques are not practical for reasons of compile time and applicability. In this paper, we present the Optimization-Space Exploration (OSE) compiler organization, the first practical iterative compilation strategy applicable to optimizations in general-purpose compilers. Instead of replacing predictive heuristics, OSE uses the compiler writer’s knowledge encoded in the heuristics to select a small number of promising optimization alternatives for a given code segment. Compile time is limited by evaluating only these alternatives for hot code segments using a general compile-time performance estimator. An OSE-enhanced version of Intel’s highly-tuned, aggressively optimizing production compiler for IA-64 yields a significant performance improvement, more than 20% in some cases, on Itanium for SPEC codes.
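The select-a-posteriori idea in this abstract can be illustrated with a small sketch. It is not the OSE implementation: the `explore` function, the toy "compiler", and the toy cost model (including the spill penalty for unroll factors above 4) are all hypothetical stand-ins chosen only to show the loop of compiling a hot segment under a few candidate configurations and keeping the one with the lowest estimated cost.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

Config = Dict[str, int]  # hypothetical: one set of optimization parameters


def explore(segment: Any,
            configs: Iterable[Config],
            compile_fn: Callable[[Any, Config], Any],
            estimate_fn: Callable[[Any], float]) -> Tuple[float, Config, Any]:
    """Compile `segment` under every candidate config and keep the cheapest.

    compile_fn and estimate_fn are placeholders for a real compiler back end
    and a compile-time performance estimator (lower cost is better).
    """
    best = None
    for config in configs:
        code = compile_fn(segment, config)
        cost = estimate_fn(code)
        if best is None or cost < best[0]:
            best = (cost, config, code)
    if best is None:
        raise ValueError("no candidate configurations supplied")
    return best


def toy_compile(segment, config):
    # Toy "compilation": just attach the chosen parameters to the segment.
    return {**segment, **config}


def toy_estimate(code):
    # Toy cost model (an assumption, not a real machine model):
    # unrolling saves one cycle per extra body copy, but unroll factors
    # above 4 pay a fixed spill penalty per iteration.
    u = code["unroll"]
    iterations = code["trip_count"] // u
    per_iteration = code["body_cycles"] * u - (u - 1)
    spill = 10 * max(0, u - 4)
    return iterations * (per_iteration + spill)


hot_loop = {"trip_count": 96, "body_cycles": 4}
candidates = [{"unroll": u} for u in (1, 2, 4, 8)]
best_cost, best_config, _ = explore(hot_loop, candidates, toy_compile, toy_estimate)
```

In this toy setting the most aggressive candidate is not the winner, which is the kind of outcome a fixed predictive heuristic can miss and an a posteriori comparison catches.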
Convergent Scheduling
- In Proceedings of the 35th Annual International Symposium on Microarchitecture, 2002
Cited by 21 (1 self)
Convergent scheduling is a general framework for cluster assignment and instruction scheduling on spatial architectures. A convergent scheduler is composed of independent passes, each implementing a heuristic that addresses a particular problem or constraint. The passes share a simple, common interface that provides spatial and temporal preference for each instruction. Preferences are not absolute; instead, the interface allows a pass to express the confidence of its preferences, as well as preferences for multiple space and time slots. A pass operates by modifying these preferences. By applying a series of passes that address all the relevant constraints, the convergent scheduler can produce a schedule that satisfies all the important constraints. Because all passes are independent and need to understand only one interface to interact with each other, convergent scheduling simplifies the problem of handling multiple constraints and co-developing different heuristics. We have applied convergent scheduling to two spatial architectures: the Raw processor and a clustered VLIW machine. It is able to successfully handle traditional constraints such as parallelism, load balancing, and communication minimization, as well as constraints due to preplaced instructions, which are instructions with predetermined cluster assignment. Convergent scheduling is able to obtain an average performance improvement of 21% over the existing space-time scheduler of the Raw processor, and an improvement of 14% over state-of-the-art assignment and scheduling techniques on a clustered VLIW architecture.
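The shared preference interface described in this abstract can be pictured with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: preferences are stored as one weight per (cluster, time slot) pair for each instruction, each pass rescales weights and renormalizes, and the final schedule takes the highest-weight pair. The three passes, the `boost` parameter, and the function names are hypothetical.

```python
from itertools import product


def make_prefs(instrs, n_clusters, n_slots):
    """Start with a uniform weight for every (cluster, slot) of each instruction."""
    w = 1.0 / (n_clusters * n_slots)
    return {i: {cs: w for cs in product(range(n_clusters), range(n_slots))}
            for i in instrs}


def normalize(prefs):
    """Rescale each instruction's weights to sum to 1 (assumes a nonzero total)."""
    for weights in prefs.values():
        total = sum(weights.values())
        for key in weights:
            weights[key] /= total


def cluster_totals(weights):
    """Total weight an instruction currently gives to each cluster."""
    totals = {}
    for (c, _), w in weights.items():
        totals[c] = totals.get(c, 0.0) + w
    return totals


def preplaced_pass(prefs, fixed):
    """Instructions in `fixed` have a predetermined cluster: zero all others."""
    for i, cluster in fixed.items():
        for (c, t) in prefs[i]:
            if c != cluster:
                prefs[i][(c, t)] = 0.0
    normalize(prefs)


def dependence_pass(prefs, deps):
    """Zero slots earlier than an instruction's dependence depth.

    Assumes `deps` lists (pred, succ) edges in topological order.
    """
    depth = {i: 0 for i in prefs}
    for pred, succ in deps:
        depth[succ] = max(depth[succ], depth[pred] + 1)
    for i, weights in prefs.items():
        for (c, t) in weights:
            if t < depth[i]:
                weights[(c, t)] = 0.0
    normalize(prefs)


def communication_pass(prefs, deps, boost=2.0):
    """Pull dependent instructions toward the cluster their neighbour prefers."""
    for pred, succ in deps:
        pulls = {pred: cluster_totals(prefs[succ]),
                 succ: cluster_totals(prefs[pred])}
        for i, pull in pulls.items():
            for (c, t) in prefs[i]:
                prefs[i][(c, t)] *= 1.0 + boost * pull[c]
        normalize(prefs)


def extract(prefs):
    """Pick the highest-weight (cluster, slot) pair for each instruction."""
    return {i: max(w, key=w.get) for i, w in prefs.items()}


deps = [("a", "b"), ("b", "c")]          # a -> b -> c dependence chain
prefs = make_prefs(["a", "b", "c"], n_clusters=2, n_slots=3)
preplaced_pass(prefs, {"a": 1})          # "a" is preplaced on cluster 1
dependence_pass(prefs, deps)
communication_pass(prefs, deps)
schedule = extract(prefs)
```

Because each pass only reads and rescales the shared weights, passes can be reordered or added without knowing about one another, which is the property the abstract attributes to the common interface.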
High-Quality Operation Binding for Clustered VLIW Datapaths
- In Design Automation Conference, 2001
Cited by 5 (2 self)
Clustering is an effective method to increase the available parallelism in VLIW datapaths without incurring the severe penalties associated with a large number of register file ports. Efficient utilization of a clustered datapath requires careful binding of operations to clusters. The paper proposes a binding algorithm that effectively explores tradeoffs between in-cluster operation serialization and delays associated with data transfers between clusters. Extensive experimental evidence is provided showing that the algorithm generates high-quality solutions for basic blocks, with up to 25% improvement over a state-of-the-art advanced binding algorithm.
Integrated Temporal and Spatial Scheduling for Extended Operand Clustered VLIW Processors
- In Proc. of Conf. on Computing Frontiers, 2004
Cited by 5 (4 self)
Centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption and are thus not suitable for consumer electronic devices. The consequence is the emergence of architectures having many interconnected clusters, each with a separate register file and a few functional units. Among the many inter-cluster communication models proposed, the extended operand model extends some of the operand fields of an instruction with a cluster specifier and allows an instruction to read some of its operands from other clusters without any extra cost. Scheduling for clustered processors involves spatial concerns (where to schedule) as well as temporal concerns (when to schedule). A scheduler is responsible for resolving the conflicting requirements of aggressively exploiting the parallelism offered by hardware and limiting the communication among clusters to available slots. This paper proposes an integrated spatial and temporal scheduling algorithm for extended operand clustered VLIW processors and evaluates its effectiveness in improving the run time performance of the code without code size penalty.
Design Space Exploration for Real-Time Embedded Stream
Cited by 3 (0 self)
We present a framework for rapidly exploring the design space for stream processors in real-time embedded systems. Stream processors are high performance digital signal processors with clusters of arithmetic units. There is a trade-off between the number of arithmetic units in a cluster of a stream processor, the number of clusters and the clock frequency as each solution meets real-time at a different power consumption. We have developed a design exploration tool that explores this trade-off and provides candidate configurations for low power and an estimate of their real-time performance. Our design methodology relates the instruction level parallelism, subword parallelism and data parallelism to the organization of the functional units and also provides insights into the functional unit utilization of the processor. A sensitivity analysis to the technology and the processor model enables the designer to check the robustness of the design exploration results.
Algorithms for Compiler-Assisted Design Space Exploration of Clustered VLIW ASIP Datapaths
- 2001
Cited by 3 (1 self)
Clustered Very Large Instruction Word Application-Specific Instruction Set Processors (VLIW ASIPs) combined with effective compilation techniques enable aggressive exploitation of the instruction level parallelism inherent in many embedded media applications, while unlocking a variety of possible performance/cost tradeoffs. In this dissertation we propose and validate an algorithm to support early design space exploration (DSE) over classes of datapaths, in the context of a specific target application, and carry out an empirical study for a set of representative benchmarks. We argue that at an early DSE phase one should use design space parameters that have a first-order impact on two key physical figures of merit: clock rate f and power dissipation P. We found these parameters to be: maximum cluster capacity (number of functional units in a cluster) NF, number of clusters NC, and the interconnect capacity NB. The experimental validation of our DSE algorithm shows that a thorough exploration of the complex design space can be performed very efficiently in this parameterized design space. Moreover, our case studies suggest that penalties of clustered versus nonclustered datapaths are often minimal and that clustering indeed unlocks a variety of valuable design alternatives. Our exploration methodology is enabled by an efficient algorithm for binding operations in a dataflow graph to the clusters of a datapath, so as to minimize latency and the number of data transfers. The algorithm utilizes effective cost and ranking functions that enable the exploration of complex tradeoffs between: (1) operation serialization, due to cluster overload; and (2) penalties incurred by data transfers, due to scattering operations with data dependencies over different clusters.
The core binding algorithm has shown robustness over a large set of datapaths and application kernels, and demonstrated up to 29% improvement in schedule latency compared to a state-of-the-art advanced binding algorithm.
Data-parallel Digital Signal Processors: Algorithm mapping, Architecture scaling and Workload adaptation
- 2004
Instruction buffering exploration for low energy VLIW with instruction clusters
- In Proc. IEEE Asia and South Pacific Design Automation Conference (ASPDAC’04), 2004
Cited by 3 (2 self)
For multimedia applications, loop buffering is an efficient mechanism to reduce the power in the instruction memory of embedded processors. In particular, software controlled clustered loop buffers are energy efficient. However, current compilers for VLIW do not fully exploit the potential offered by such a clustered organization. This paper presents an algorithm to find the optimal loop buffer configuration and the optimal way to use this configuration for an application or a set of applications. Results for the MediaBench application suite show an additional 18% reduction (on average) in energy in the instruction memory hierarchy as compared to traditional nonclustered approaches to the loop buffer, without compromising performance.
Compilation Techniques for Energy-, Code-Size-, and Run-Time-Efficient Embedded Software
- 2001
Cited by 2 (0 self)
This paper is motivated by two essential characteristics of embedded systems: the increasing amount of software that is used for implementing embedded systems and the need for implementing embedded systems efficiently. As a consequence, embedded software has to be efficient. In the following, we will present techniques for generating efficient machine code for architectures which are typically found in embedded systems. We will demonstrate, using examples, how compilers for embedded processors can exploit features that are found in embedded processors.
ORC2DSP: Compiler Infrastructure Supports for VLIW DSP Processors
- Proceedings of 2005 IEEE International Symposium on VLSI Design, Automation, and Test, 2005
Cited by 2 (2 self)
In this paper, we describe our experiences in deploying ORC infrastructures for a novel 32-bit VLIW DSP processor (known as PAC core), which is equipped with new architectural features, such as distributed and ‘ping-pong’ register files. We also present methods for retargeting ORC compilers for PAC VLIW DSP processors. In addition, mechanisms are proposed to incorporate register allocation policies in the compiler framework for distributed register files in PAC architectures. In the early design stage, several iterations of tuning are needed between architecture and software designs. Our work gives an early estimation of architecture performance so that refinements of architectures are possible with feedback from the software.

