Compiling for EDGE architectures
In International Symposium on Code Generation and Optimization, 2006
Cited by 33 (23 self)
Explicit Data Graph Execution (EDGE) architectures offer the possibility of high instruction-level parallelism with energy efficiency. In EDGE architectures, the compiler breaks a program into a sequence of structured blocks that the hardware executes atomically. The instructions within each block communicate directly, instead of communicating through shared registers. The TRIPS EDGE architecture imposes several restrictions on its blocks to simplify the microarchitecture: each TRIPS block has at most 128 instructions, issues at most 32 loads and/or stores, and executes at most 32 register bank reads and 32 writes. To detect block completion, each TRIPS block must produce a constant number of outputs (stores and register writes) and a branch decision. The goal of the TRIPS compiler is to produce TRIPS blocks full of useful instructions while enforcing these constraints. This paper describes a set of compiler algorithms that meet these sometimes conflicting goals, including an algorithm that assigns load and store identifiers to maximize the number of loads and stores within a block. We demonstrate the correctness of these algorithms in simulation on SPEC2000, EEMBC, and microbenchmarks extracted from SPEC2000 and others. We measure speedup in cycles over an Alpha 21264 on microbenchmarks.
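The block limits quoted in this abstract can be read as a simple feasibility check. The following is a minimal sketch under stated assumptions, not the TRIPS compiler's actual logic: `BlockStats`, its field names, and `violations` are hypothetical and exist only for illustration. Only the numeric limits come from the abstract.

```python
from dataclasses import dataclass
from typing import List

# Limits quoted in the abstract above.
MAX_INSTRUCTIONS = 128
MAX_LOAD_STORES = 32
MAX_REG_READS = 32
MAX_REG_WRITES = 32


@dataclass
class BlockStats:
    """Hypothetical per-block summary; the field names are illustrative."""
    instructions: int
    load_stores: int
    reg_reads: int
    reg_writes: int


def violations(block: BlockStats) -> List[str]:
    """Return the names of the limits this block exceeds (empty if it fits)."""
    checks = [
        ("instructions", block.instructions, MAX_INSTRUCTIONS),
        ("load_stores", block.load_stores, MAX_LOAD_STORES),
        ("reg_reads", block.reg_reads, MAX_REG_READS),
        ("reg_writes", block.reg_writes, MAX_REG_WRITES),
    ]
    return [name for name, value, limit in checks if value > limit]
```

A block former could call `violations` on a candidate block after each merge and back off when the list is non-empty. The constant-output requirement mentioned in the abstract is not modeled here.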
A Systematic Approach to Delivering Instruction-Level Parallelism in EPIC Systems
2005
Cited by 3 (0 self)
Computer systems designed under the explicitly parallel instruction computing (EPIC) paradigm rely extensively on compiler technology to deliver the instruction-level parallelism (ILP) required for them to achieve high levels of performance. While manifold techniques have been proposed in the literature for delivering such parallelism, this dissertation is unique in integrating and applying a comprehensive suite of techniques, embodied in the IMPACT Research Compiler, to a concrete system consisting of the SPEC CINT2000 benchmarks and the Intel Itanium 2 platform. These techniques include advanced pointer analysis, aggressive cross-file procedure inlining, targeted region formation, profile-guided optimizations, and, most importantly, aggressive and pervasive use of predication and control speculation. The collective effect of these techniques is evaluated with real-system measurements, showing them to achieve a 1.20 average (up to 1.59) speedup relative to classically optimized code and a 1.70 average (up to 2.51) speedup relative to code compiled with the GNU GCC compiler. Achieving these results in the real-machine environment required advances in region formation heuristics, optimization, and speculation methods.
Global Instruction Scheduling for Multi-Threaded Architectures
2008
Cited by 1 (0 self)
Recently, the microprocessor industry has moved toward multi-core or chip multiprocessor (CMP) designs as a means of utilizing the increasing transistor counts in the face of physical and micro-architectural limitations. Despite this move, CMPs do not directly improve the performance of single-threaded codes, a characteristic of most applications. In effect, the move to CMPs has shifted even more the task of improving performance from the hardware to the software. Since developing parallel applications has long been recognized as significantly harder than developing sequential ones, it is very desirable to have automatic tools to extract thread-level parallelism (TLP) from sequential applications. Unfortunately, automatic parallelization has only been successful in the restricted domains of scientific and data-parallel applications, which usually have regular array-based memory accesses and little control flow. In order to support parallelization of general-purpose applications, computer architects have proposed CMPs with light-weight, fine-grained (scalar) communication mechanisms. Despite such support, most existing multi-threading compilation techniques have
Region-Based Partial Dead Code Elimination on Predicated Code
This paper presents the design, implementation and experimental evaluation of a practical region-based partial dead code elimination (PDE) algorithm on predicated code in an existing compiler framework. Our algorithm processes PDE candidates using a worklist and reasons about their partial deadness using predicate partition graphs. It operates uniformly on hyperblocks and regions comprising basic blocks and hyperblocks. The result of applying our algorithm to an SEME region is optimal: partially dead code cannot be removed without changing the branching structure of the program or potentially introducing new predicate defining instructions. We present statistical evidence about the PDE opportunities in the 12 SPECint2000 benchmarks. In addition to exhibiting small compilation overheads, our algorithm achieves moderate performance improvements in 8 out of the 12 benchmarks on an Itanium machine. Our performance results and statistics show the usefulness of our algorithm as a pass applied before instruction scheduling.
Explicit Data Graph Compilation
This research would not have been possible without the support of many people. First, I would like to thank my advisor Doug Burger for his advice and friendship throughout my time in graduate school. I would like to thank my remaining committee members for their service and mentorship: Steve Keckler, Calvin Lin, Kathryn McKinley, and Lizy John. A special thanks to Calvin Lin for advising me during my first few years in graduate school, and to Steve Keckler and Kathryn McKinley for their close collaboration on the TRIPS project over all these years. I thank the entire TRIPS team for their collaboration: Robert McDonald,
RECOMMENDED FOR ACCEPTANCE
In recent years, microprocessor manufacturers have shifted their focus from single-core to multi-core processors. Since many of today’s applications are single-threaded and since it is likely that many of tomorrow’s applications will have far fewer threads than there will be processor cores, automatic thread extraction is an essential tool for effectively leveraging today’s multi-core and tomorrow’s many-core processors. A recently proposed technique, Decoupled Software Pipelining (DSWP), has demonstrated promise by partitioning loops into long-running threads organized into a pipeline. Using a pipeline organization and execution decoupled by inter-core communication queues, DSWP offers increased execution efficiency that is largely independent of inter-core communication latency and variability in intra-thread performance. This dissertation extends the pipelined parallelism paradigm with speculation. Using speculation, dependences that manifest infrequently or are easily predictable can be safely ignored by the compiler allowing it to carve more, and better balanced, thread-based pipeline stages from a single thread of execution. Prior speculative threading proposals
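The pipeline organization described in this abstract, with stages decoupled by a communication queue, can be illustrated with a small sketch. This is an assumption-laden illustration of the non-speculative pipelining idea only, not the dissertation's compiler transformation: the function name, the sentinel, and the choice of a sum of squares as the loop body are all hypothetical. Python threads are used to show the structure, not to demonstrate any speedup.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream (illustrative)


def pipelined_sum_of_squares(values, queue_size=8):
    """Split a loop into two stages that communicate only through a queue."""
    q = queue.Queue(maxsize=queue_size)  # bounded queue decouples the stages
    result = []

    def traverse():
        # Stage 1: walk the input and forward each item downstream.
        for v in values:
            q.put(v)
        q.put(_DONE)

    def compute():
        # Stage 2: consume items and do the per-iteration work.
        total = 0
        while True:
            item = q.get()
            if item is _DONE:
                break
            total += item * item
        result.append(total)

    stages = [threading.Thread(target=traverse), threading.Thread(target=compute)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return result[0]
```

Because data flows in one direction through the queue, a slow iteration in one stage delays the other only when the queue fills or empties, which is the latency tolerance the abstract attributes to the pipeline organization.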

