Profile-Driven Instruction Level Parallel Scheduling with Application to Super Blocks
- IN PROCEEDINGS OF THE 29TH INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE
, 1996
Cited by 33 (3 self)
Code scheduling to exploit instruction level parallelism (ILP) is a critical problem in compiler optimization research, in light of the increased use of long-instruction-word machines. Unfortunately, optimum scheduling is computationally intractable, and one must resort to carefully crafted heuristics in practice. If the scope of application of a scheduling heuristic is limited to basic blocks, considerable performance loss may be incurred at block boundaries. To overcome this obstacle, basic blocks can be coalesced across branches to form larger regions such as super blocks. In the literature, these regions are typically scheduled using algorithms that are either oblivious to profile information (under the assumption that the process of forming the region has fully utilized the profile information), or use the profile information as an addendum to classical scheduling techniques. We believe that even for the simple case of linear code regions such as super blocks, additional performanc...
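To make the role of profile information concrete, here is a minimal sketch that scores a superblock schedule by its expected, profile-weighted completion time: each exit branch contributes the cycle at which it completes, weighted by the probability that control leaves through it. The function name, the one-cycle branch latency, and the example numbers are illustrative assumptions and are not taken from the paper.

```python
def weighted_completion_time(exit_cycles, exit_probs, branch_latency=1):
    """Expected cycles spent in a superblock under a given schedule.

    exit_cycles[i] is the cycle in which exit branch i is issued;
    exit_probs[i] is the profiled probability of leaving through exit i.
    branch_latency is an assumed value, not one reported in the paper.
    """
    if len(exit_cycles) != len(exit_probs):
        raise ValueError("one probability per exit is required")
    if abs(sum(exit_probs) - 1.0) > 1e-9:
        raise ValueError("exit probabilities must sum to 1")
    return sum(p * (c + branch_latency) for c, p in zip(exit_cycles, exit_probs))


# Two hypothetical schedules of the same superblock, exits taken 30% / 70%.
early_side_exit = weighted_completion_time([2, 9], [0.3, 0.7])
early_final_exit = weighted_completion_time([5, 7], [0.3, 0.7])
```

Under these made-up numbers, delaying the rarely taken side exit to pull the final exit earlier gives the lower expected cost, which is the kind of tradeoff a profile-oblivious scheduler cannot see.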
Tuning Compiler Optimizations for Simultaneous Multithreading
- in International Symposium on Microarchitecture
, 1997
Cited by 32 (8 self)
Compiler optimizations are often driven by specific assumptions about the underlying architecture and implementation of the target machine. For example, when targeting shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in order to decrease high-cost, inter-processor communication. This paper reexamines several compiler optimizations in the context of simultaneous multithreading (SMT), a processor architecture that issues instructions from multiple threads to the functional units each cycle. Unlike shared-memory multiprocessors, SMT provides and benefits from fine-grained sharing of processor and memory system resources; unlike current uniprocessors, SMT exposes and benefits from inter-thread instruction-level parallelism when hiding latencies. Therefore, optimizations that are appropriate for these conventional machines may be inappropriate for SMT. We revisit three optimizations in this light: loop-iteration scheduling, software speculative execution, a...
Partial dead code elimination using slicing transformations
- In Proceedings of the ACM SIGPLAN '97 Conference on Programming Language Design and Implementation
, 1997
Achieving High Levels of Instruction-Level Parallelism With Reduced Hardware Complexity
, 1997
Cited by 17 (3 self)
Keywords: instruction-level parallelism, VLIW processors, superscalar processors, overlapped execution, out-of-order execution, speculative execution, branch prediction, instruction scheduling, compile-time speculation, predicated execution, data speculation, HPL PlayDoh
Instruction-level parallel processing (ILP) has established itself as the only viable approach for achieving the goal of providing continuously increasing performance without having to fundamentally re-write the application. ILP processors differ in their strategies for deciding exactly when, and on which functional unit, an operation should be executed. The alternatives lie somewhere on a spectrum depending on the extent to which these decisions are made by the compiler rather than by the hardware and on the manner in which information regarding parallelism is communicated by the compiler to the hardware via the program. HPL PlayDoh is a research architecture that has been defined to support research in ILP, with a bias towards VLIW processing. The overall objective of this research effort is to develop a suite of architectural features and compiler techniques that will enable a second generation of VLIW processors to achieve high levels of ILP, across both scientific and non-scientific computations, but with hardware that is simple compared to out-of-order superscalar processors. The basic approach is to provide the program (compiler) more control over capabilities that, in superscalar processors, are typically microarchitectural (i.e., controlled by the hardware) by raising them to the architectural level.
Balance Scheduling: Weighting Branch Tradeoffs in Superblocks
- PROC. 32ND ANN. INT’L SYMP. MICROARCHITECTURE (MICRO32)
, 1999
Cited by 15 (3 self)
Since there is generally insufficient instruction level parallelism within a single basic block, higher performance is achieved by speculatively scheduling operations in superblocks. This is difficult in general because each branch competes for the processor's limited resources. Previous work manages the performance tradeoffs that exist between branches only indirectly. We show here that dependence and resource constraints can be used to gather explicit knowledge about scheduling tradeoffs between branches. The first contribution of this paper is a set of new, tighter lower bounds on the execution times of superblocks that specifically accounts for the dependence and resource conflicts between pairs of branches. The second contribution of this paper is a novel superblock scheduling heuristic that finds high performance schedules by determining the operations that each branch needs to be scheduled early and selecting branches with compatible needs that favor beneficial branch tradeoffs....
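As a point of reference for what a lower bound on a branch's issue time looks like, the sketch below computes the two classical per-branch bounds: the dependence bound (longest latency-weighted path into the branch) and the resource bound (number of operations that must precede it, divided by the issue width). This is not the paper's tighter pairwise bound; the function name, the graph encoding, and the example superblock are hypothetical.

```python
from math import ceil


def branch_lower_bounds(preds, latency, branches, issue_width):
    """Classical lower bounds on each branch's issue cycle (cycles start at 0).

    preds maps each operation to the operations it depends on.
    latency maps each operation to its latency; latencies >= 1 are assumed,
    so every ancestor must issue in a strictly earlier cycle.
    """
    earliest = {}

    def dep_bound(op):
        # Longest latency-weighted path from any root to op.
        if op not in earliest:
            earliest[op] = max(
                (dep_bound(p) + latency[p] for p in preds[op]), default=0
            )
        return earliest[op]

    def ancestors(op, seen):
        # Collect every operation that op transitively depends on.
        for p in preds[op]:
            if p not in seen:
                seen.add(p)
                ancestors(p, seen)
        return seen

    bounds = {}
    for b in branches:
        n_before = len(ancestors(b, set()))
        # At most issue_width ancestors can issue per cycle.
        res_bound = ceil(n_before / issue_width)
        bounds[b] = max(dep_bound(b), res_bound)
    return bounds


# Hypothetical superblock: a..e are ordinary operations, br1/br2 are exits.
preds = {
    "a": [], "b": [], "c": [], "d": ["a"],
    "br1": ["b", "c", "d"],
    "e": [],
    "br2": ["br1", "e"],
}
latency = {op: 1 for op in preds}
bounds = branch_lower_bounds(preds, latency, ["br1", "br2"], issue_width=2)
```

Each bound here treats a branch in isolation; the abstract's point is that tighter bounds follow from considering how pairs of branches compete for the same resources.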
Tartan: Evaluating spatial computation for whole program execution
- In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems
, 2006
Cited by 11 (1 self)
Spatial Computing (SC) has been shown to be an energy-efficient model for implementing program kernels. In this paper we explore the feasibility of using SC for more than small kernels. To this end, we evaluate the performance and energy efficiency of entire applications on Tartan, a general-purpose architecture which integrates a reconfigurable fabric (RF) with a superscalar core. Our compiler automatically partitions and compiles an application into an instruction stream for the core and a configuration for the RF. We use a detailed simulator to capture both timing and energy numbers for all parts of the system. Our results indicate that a hierarchical RF architecture, designed around a scalable interconnect, is instrumental in harnessing the benefits of spatial computation. The interconnect uses static configuration and routing at the lower levels and a packet-switched, dynamically-routed network at the top level. Tartan is most energy-efficient when almost all of the application is mapped to the RF, indicating the need for the RF to support most general-purpose programming constructs. Our initial investigation reveals that such a system can provide, on average, an order of magnitude improvement in energy-delay compared to an aggressive superscalar core on single-threaded workloads.
Exploiting Program Hotspots and Code Sequentiality for Instruction Cache Leakage Management
, 2003
Cited by 10 (0 self)
Leakage energy optimization for caches has been the target of much recent effort. In this work, we focus on instruction caches and tailor two techniques that exploit the two major factors that shape the instruction access behavior, namely, hotspot execution and sequentiality. First, we adopt a hotspot detection mechanism by profiling the branch behavior at runtime and utilize this to implement a HotSpot based Leakage Management (HSLM) mechanism. Second, we exploit code sequentiality in implementing a Just-In-Time Activation (JITA) that transitions cache lines to active mode just before they are accessed. We utilize the recently proposed drowsy cache that dynamically scales voltages for leakage reduction and implement various schemes that use different combinations of HSLM and JITA. Our experimental evaluation using the SPEC2000 benchmark suite shows that instruction cache leakage energy consumption can be reduced by 63%, 49% and 29%, on the average, as compared to an unoptimized cache, a recently proposed hardware optimized cache, and a cache optimized using compiler, respectively. Further, we observe that these energy savings can be obtained without a significant impact on performance.
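The idea behind Just-In-Time Activation can be illustrated with a toy model: when a cache line is fetched, the next sequential line is woken in advance so that a later access does not pay the drowsy wake-up latency. The sketch below is a deliberately simplified assumption-laden model (all lines start drowsy, lines are never put back to sleep, and the function name and trace are hypothetical); it is not the mechanism evaluated in the paper.

```python
def count_drowsy_accesses(trace, num_lines, jita=True):
    """Count accesses that find their cache line in drowsy mode.

    trace is a sequence of cache-line indices fetched in order. All lines
    start drowsy. An access wakes its own line; with jita=True it also
    pre-activates the next line, on the assumption that code is sequential.
    Lines are never returned to drowsy mode in this simplified model.
    """
    active = [False] * num_lines
    drowsy_hits = 0
    for line in trace:
        if not active[line]:
            drowsy_hits += 1  # this access would pay the wake-up latency
            active[line] = True
        if jita:
            active[(line + 1) % num_lines] = True
    return drowsy_hits


# Hypothetical fetch trace: a sequential run, then a jump to line 7.
trace = [0, 1, 2, 3, 7, 8]
without_jita = count_drowsy_accesses(trace, num_lines=16, jita=False)
with_jita = count_drowsy_accesses(trace, num_lines=16, jita=True)
```

In this trace only the first access and the jump target find a drowsy line when pre-activation is on, which is why sequentiality alone is not enough and the abstract pairs it with hotspot detection.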
An integrated approach to accelerate data and predicate computations in hyperblocks
- In Proceedings of the 33rd International Symposium on Microarchitecture
, 2000
Cited by 5 (0 self)
To exploit increased instruction-level parallelism available in modern processors, we describe the formation and optimization of tracenets, an integrated approach to reducing the length of the critical path in data and predicated computation. By tightly integrating selective path expansion and path optimization within hyperblocks, our algorithm is able to produce highly optimized code without exploring the exponentially large number of paths included in a hyperblock. Our approach extracts more of the implicit predicate correlations in hyperblocks and uses a precise model of predicate correlations to aggressively accelerate data and predicate computations. Experimental results indicate that tracenets can significantly reduce the number of
Dataflow: A Complement to Superscalar
- In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
, 2005
Cited by 5 (1 self)
There has been a resurgence of interest in dataflow architectures, because of their potential for exploiting parallelism with low overhead. In this paper we analyze the performance of a class of static dataflow machines on integer media and control-intensive programs and we explain why a dataflow machine, even with unlimited resources, does not always outperform a superscalar processor on general-purpose codes, under the assumption that both machines take the same time to execute basic operations. We compare a program-specific dataflow machine with unlimited parallelism to a superscalar processor running the same program. While the dataflow machines provide very good performance on most data-parallel programs, we show that the dataflow machine cannot always take advantage of the available parallelism. Using the dynamic critical path we investigate the mechanisms used by superscalar processors to provide a performance advantage and their impact on a dataflow model.
An Architecture Framework for Introducing Predicated Execution into Embedded Microprocessors
, 1999
Cited by 4 (0 self)
Growing demand for high performance in embedded systems is creating new opportunities for Instruction-Level Parallelism (ILP) techniques that are traditionally used in high performance systems. Predicated execution, an important ILP technique, can be used to improve branch handling, reduce frequently mispredicted branches, and expose multiple execution paths to hardware resources. However, there is a major tradeoff in the design of the instruction set, the addition of a predicate operand for all instructions. We propose a new architecture framework for introducing predicated execution to embedded designs. Experimental results show a 10% performance improvement and a code reduction of 25% over a traditionally predicated architecture.
1 Introduction
Growing demand for high performance in embedded computing systems is creating new opportunities for Instruction-Level Parallelism (ILP) techniques that are traditionally used in high performance systems. In several ways, the nee...

