A Chip-Multiprocessor Architecture with Speculative Multithreading
- IEEE Transactions on Computers
, 1999
Cited by 112 (13 self)
Keywords: Chip-multiprocessor, speculative multithreading, data-dependence speculation, control speculation
1 INTRODUCTION
The superscalar approach [12], which allows more than one instruction to be issued in a single cycle, has become the norm for today's high-performance microprocessors. The issue rate of these microprocessors has continued to increase over the past few years, with today's high-performance superscalar processors such as the Compaq Alpha 21264 [4], IBM PowerPC [16], Intel Pentium-Pro [3] or MIPS R10000 [19] able to issue up to four instructions per cycle.
Architectural Support for Scalable Speculative Parallelization in Shared-Memory Systems
- in Proc. of the 27th Int. Symp. on Computer Architecture
, 2000
Cited by 92 (17 self)
Speculative parallelization aggressively executes in parallel codes that cannot be fully parallelized by the compiler. Past proposals of hardware schemes have mostly focused on single-chip multiprocessors (CMPs), whose effectiveness is necessarily limited by their small size. Very few schemes have attempted this technique in the context of scalable shared-memory systems. In this paper, we present and evaluate a new hardware scheme for scalable speculative parallelization. This design needs relatively simple hardware and is efficiently integrated into a cache-coherent NUMA system. We have designed the scheme in a hierarchical manner that largely abstracts away the internals of the node. We effectively utilize a speculative CMP as the building block for our scheme. Simulations show that the proposed architecture delivers good speedups at a modest hardware cost. For a set of important nonanalyzable scientific loops, we report average speedups of 4.2 for 16 processors. We show that support for per-word speculative state is required by our applications, or else the performance suffers greatly.
SUIF Explorer: an interactive and interprocedural parallelizer
, 1999
Cited by 55 (5 self)
The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system are successful in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiring attention. Second, the system uses dynamic execution analyzers to identify those important loops that are likely to be parallelizable. Third, the SUIF Explorer is the first to apply program slicing to aid programmers in interactive parallelization. The system guides the programmer in the parallelization process using a set of sophisticated visualization techniques. This paper demonstrates the effectiveness of the SUIF Explorer with three case studies. The programmer was able to speed up all three programs by examining only a small fraction of the program and privatizing a few variables. 1. Introduction Exploiting coarse-grain parallelism i...
Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture
- In Supercomputing
, 1999
Cited by 53 (12 self)
Processing-in-memory (PIM) chips that integrate processor logic into memory devices offer a new opportunity for bridging the growing gap between processor and memory speeds, especially for applications with high memory-bandwidth requirements. The Data-IntensiVe Architecture (DIVA) system combines PIM memories with one or more external host processors and a PIM-to-PIM interconnect. DIVA increases memory bandwidth through two mechanisms: (1) performing selected computation in memory, reducing the quantity of data transferred across the processor-memory interface; and (2) providing communication mechanisms called parcels for moving both data and computation throughout memory, further bypassing the processor-memory bus. DIVA uniquely supports acceleration of important irregular applications, including sparse-matrix and pointer-based computations. In this paper, we focus on several aspects of DIVA designed to effectively support such computations at very high performance levels: (1) the mem...
Instruction Generation for Hybrid Reconfigurable Systems
- ACM Transactions on Design Automation of Electronic Systems
, 2001
Cited by 53 (5 self)
Building Blocks (ABBs), or instructions available from a given hardware library. The customized data path generated from many ABBs was referred to as an application specific unit (ASU). Cathedral's synthesis targeted ASUs, which could be executed in very few clock cycles. This goal was achieved via manual clustering of necessary operations into more compact operations, essentially a form of template construction. Whereas our template generation and matching algorithms are automated, the definition of clusters in Cathedral was a manual operation, mainly clustering loop and function bodies. Their results demonstrated an expected reduction of critical path length as well as interconnect as a result of clustering.
Energy-Efficient Design of Battery-Powered Embedded Systems
, 1999
Cited by 51 (7 self)
Energy-efficient design of battery-powered systems demands optimizations in both hardware and software. We present a modular approach for enhancing instruction-level simulators with cycle-accurate simulation of energy dissipation in embedded systems. Our methodology has tightly coupled component models, thus making our approach more accurate. Performance and energy computed by our simulator are within a 5% tolerance of hardware measurements on the SmartBadge [2]. We show how the simulation methodology can be used for hardware design exploration aimed at enhancing the SmartBadge with a real-time MPEG video feature. In addition, we present a profiler that relates energy consumption to the source code. Using the profiler, we can quickly and easily redesign the MP3 audio decoder software to run in real time on the SmartBadge with low energy consumption. A performance increase of 92% and an energy consumption decrease of 77% over the original executable specification have been achieved.
Keywords: low-power design, system-level, performance tradeoffs, power-consumption model
Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor
, 1997
Cited by 45 (3 self)
Thread-level speculation (TLS) makes it possible to parallelize general-purpose C programs. This paper proposes software and hardware mechanisms that support speculative thread-level execution on a single-chip multiprocessor. A detailed analysis of programs using the TLS execution model shows a bound on the performance of a TLS machine that is promising. In particular, TLS makes it feasible to find speculative doacross parallelism in outer loops that can greatly improve the performance of general-purpose applications. Exploiting speculative thread-level parallelism on a multiprocessor requires the compiler to determine where to speculate, and to generate SPMD (single program multiple data) code. We have developed a fully automatic compiler system that uses profile information to determine the best loops to execute speculatively, and to generate the synchronization code that improves the performance of speculative execution. The hardware mechanisms required to support speculation are simple extensions to the cache hierarchy of a single-chip multiprocessor. We show that with our proposed mechanisms, thread-level speculation provides significant performance benefits.
Maps: A Compiler-Managed Memory System for Raw Machines
- In Proceedings of the 26th International Symposium on Computer Architecture
, 1998
Cited by 41 (9 self)
This paper describes Maps, a compiler-managed memory system for Raw architectures. Traditional processors for sequential programs maintain the abstraction of a unified memory by using a single centralized memory system. This implementation leads to the infamous "Von Neumann bottleneck," with machine performance limited by the large memory latency and limited memory bandwidth. A Raw architecture addresses this problem by taking advantage of the rapidly increasing transistor budget to move much of its memory on chip. To remove the bottleneck and complexity associated with centralized memory, Raw distributes the memory with its processing elements. Unified memory semantics are implemented jointly by the hardware and the compiler. The hardware provides a clean compiler interface to its two inter-tile interconnects: a fast, statically schedulable network and a traditional dynamic network. Maps then uses these communication mechanisms to orchestrate the memory accesses for low latency and parallelism while enforcing proper dependence. It optimizes for speed in two ways: by finding accesses that can be scheduled on the static interconnect through static promotion, and by minimizing dependence sequentialization for the remaining accesses. Static promotion is performed using equivalence class unification and modulo unrolling; memory dependences are enforced through explicit synchronization and software serial ordering. We have implemented Maps based on the SUIF infrastructure. This paper demonstrates that the exclusive use of static promotion yields roughly 20-fold speedup on 32 tiles for our regular applications and about 5-fold speedup on 16 or more tiles for our irregular applications. The paper also shows that selective use of dynamic accesses can be a useful complement to...
Parallelizing applications into silicon
- IEEE Symposium on Field-Programmable Custom Computing Machines
, 1999
Cited by 40 (4 self)
The next decade of computing will be dominated by embedded systems, information appliances and application-specific computers. In order to build these systems, designers will need high-level compilation and CAD tools that generate architectures that effectively meet the needs of each application. In this paper we present a novel compilation system that allows sequential programs, written in C or FORTRAN, to be compiled directly into custom silicon or reconfigurable architectures. This capability is also interesting because trends in computer architecture are moving towards more reconfigurable hardware-like substrates, such as FPGA-based systems. Our system works by successfully combining two resource-efficient computing disciplines: Small Memories and Virtual Wires. For a given application, the compiler first analyzes the memory access patterns of pointers and arrays in the program and constructs a partitioned memory system made up of many small memories. The computation is implemented by active computing elements that are spatially distributed within the memory array. A space-time scheduler assigns instructions to the computing elements in a way that maximizes locality and minimizes physical communication distance. It also generates an efficient static schedule for the interconnect. Finally, specialized hardware for the resulting schedule of memory accesses, wires, and computation is generated as a multi-process state machine in synthesizable Verilog. With this system, implemented as a set of SUIF compiler passes, we have successfully compiled programs into hardware and achieve specialization performance enhancements by up to an order of magnitude versus a single general-purpose processor. We also achieve additional parallelization speedups similar to those obtainable using a tightly-interconnected multiprocessor.

