RAMP: Research Accelerator for Multiple Processors
- In Proceedings of Hot Chips 18, 2006
Cited by 54 (12 self)
Copyright © 2006, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
Ramp: Research accelerator for multiple processors—a community vision for a shared experimental parallel hw/sw platform.
, 2006
A Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs
Cited by 22 (3 self)
Functional full-system simulators are powerful and versatile research tools for accelerating architectural exploration and advanced software development. Their main shortcoming is limited throughput when simulating systems with hundreds of processors or more. To overcome this bottleneck, we propose the PROTOFLEX simulation architecture, which uses FPGAs to accelerate simulation. Prior FPGA approaches that prototype a complete system in hardware are either too complex when scaling to large-scale configurations or require significant effort to provide full-system support. In contrast, PROTOFLEX reduces complexity by virtualizing the execution of many logical processors onto a consolidated set of multiple-context execution engines on the FPGA. Through virtualization, the number of engines can be judiciously scaled, as needed, to deliver on necessary simulation performance. To achieve low-complexity full-system support, a hybrid simulation technique called transplanting allows implementing in the FPGA only the frequently encountered behaviors, while a software simulator preserves the abstraction of a complete system. We have created a first instance of the PROTOFLEX simulation architecture, which is an FPGA-based, full-system functional simulator for a 16-way UltraSPARC III symmetric multiprocessor server hosted on a single Xilinx Virtex-II XCV2P70 FPGA. On average, the simulator achieves a 39x speedup (and as high as 49x) over comparable software simulation across a suite of applications, including OLTP on a commercial database server.
ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs
- ACM Transactions on Reconfigurable Technology and Systems, 2009
Cited by 22 (2 self)
Functional full-system simulators are powerful and versatile research tools for accelerating architectural exploration and advanced software development. Their main shortcoming is limited throughput when simulating large multiprocessor systems with hundreds or thousands of processors or when instrumentation is introduced. We propose the PROTOFLEX simulation architecture, which uses FPGAs to accelerate full-system multiprocessor simulation and to facilitate high-performance instrumentation. Prior FPGA approaches that prototype a complete system in hardware are either too complex when scaling to large-scale configurations or require significant effort to provide full-system support. In contrast, PROTOFLEX virtualizes the execution of many logical processors onto a consolidated number of multiple-context execution engines on the FPGA. Through virtualization, the number of engines can be judiciously scaled, as needed, to deliver on necessary simulation performance at a large savings in complexity. Further, to achieve low-complexity full-system support, a hybrid simulation technique called transplanting allows implementing in the FPGA only the frequently encountered behaviors, while a software simulator preserves the abstraction of a complete system. We have created a first instance of the PROTOFLEX simulation architecture, which is an FPGA-based,
RPM: A rapid prototyping engine for multiprocessor systems
- IEEE Computer, 1995
Cited by 19 (5 self)
In multiprocessor systems, processing nodes contain a processor, some cache and a share of the system memory, and are connected through a scalable interconnect. The system memory partitions may be shared (shared-memory systems) or disjoint (message-passing systems). Within each class of systems many architectural variations are possible. Fair comparisons among systems are difficult because of the lack of a common hardware platform to implement the different architectures. RPM (Rapid Prototyping engine for Multiprocessors) is a hardware emulator for the rapid prototyping of various multiprocessor architectures. In RPM, the hardware of the target machine is emulated by reprogrammable controllers implemented with Field-Programmable Gate Arrays (FPGAs). The processors, memories and interconnect are off-the-shelf and their relative speeds can be modified to emulate various component technologies. Every emulation is an actual incarnation of the target machine and therefore software written for the target machine can be easily ported to it with little modification and without instrumentation of the code. In this paper, we describe the architecture of RPM, its performance and the prototyping methodology. We also compare our approach with simulation and breadboard prototyping. Keywords: Field-Programmable Gate Arrays (FPGAs), message-passing multicomputers, shared-memory multiprocessors, design verification, performance evaluation, simulation.
RACE: A Reconfigurable and Adaptive Computing Environment
, 1997
Cited by 7 (0 self)
The Reconfigurable and Adaptive Computing Environment, or RACE, is a reconfigurable computer that has been developed in the Design Automation Laboratory at the University of Cincinnati. RACE was developed to facilitate any type of reconfigurable computing. Reconfigurable computing can be thought of as having the ability to repeatedly perform applications on a reconfigurable hardware system. Such reconfigurable hardware has been made possible by the advent of FPGAs, or Field-Programmable Gate Arrays. The RACE system has five Xilinx XC4013 FPGAs, one of which acts as a controller, which provide approximately 52,000 logic gates for computing. Furthermore, each FPGA has 128KB of local data memory and 64KB of local configuration memory, which is used to store FPGA configurations. RACE was designed to make reconfigurable computing easy to use, so a library of software functions has been developed to control the reconfigurable hardware without detailed hardware knowledge of RACE. Likewise, im...
Rapid Hardware Prototyping On RPM-2: Methodology And Experience
, 1998
Cited by 4 (0 self)
Field-Programmable Gate Arrays are an emerging technology that promises easy hardware reconfigurability by software at low cost. Entire systems can be built in which some parts are programmable. Such systems implement various architectures. Each architecture prototype is a detailed hardware implementation of the architecture (including I/O) to which complex software systems can be ported. We have built a multiprocessor emulator called RPM (Rapid Prototyping engine for Multiprocessor systems). The second version of the hardware, called RPM-2, is up and running. In this paper, we present the design and the performance of our first emulator, a cache-coherent non-uniform memory access (CC-NUMA) multiprocessor.
1. Introduction
There are currently many competing ideas to implement multiprocessor systems and some of them have been prototyped in hardware. However, hardware prototypes take too long to build and are very expensive. By the time a hardware prototype really works it is often obs...
Multiprocessor Emulation With Rpm: Early Experience
, 1995
Cited by 1 (1 self)
Field-Programmable Gate Arrays are an emerging technology that promises easy hardware reconfigurability by software at low cost. Entire systems can be built in which some parts are easily programmable. Such systems are flexible hardware platforms or emulators, which are then tailored to implement various architectures. The performance of these architectures can be compared on the same hardware substrate. Besides having a large speedup advantage over software simulation, the emulator is a detailed hardware implementation of the architecture (including I/O) on which complex software systems can be run without code instrumentation, and it is a more convincing proof of concept. On the other hand, it is much more cost-effective than a full-fledged prototype. We have built a multiprocessor emulator called RPM (Rapid Prototyping engine for Multiprocessor systems). RPM can emulate various configurations of shared-memory and message-passing systems. The bandwidth and latency of various componen...
Summarizing Multiprocessor . . . Microarchitecture-Independent Snapshots
, 2006
Computer architects rely heavily on software simulation to evaluate, refine, and validate new designs before they are implemented. However, simulation time continues to increase as computers become more complex and multicore designs become more common. This thesis investigates software structures and algorithms for quickly simulating modern cache-coherent multiprocessors by amortizing the time spent to simulate the memory system and branch predictors. The Memory Timestamp Record (MTR) summarizes the directory and cache state of a multiprocessor system in a compact data structure. A single MTR snapshot is versatile enough to reconstruct the microarchitectural state resulting from various coherence protocols and cache organizations. The MTR may be quickly updated by each simulated processor during a fast-forwarding phase and optionally stored off-line for reuse. To fill large branch prediction tables, we introduce Branch Predictor-based Compression (BPC) which compactly stores a branch trace so that it may be used to fill in any branch predictor structure. An entire BPC trace requires less space than single discrete predictor
A Practical FPGA-based Framework for Novel CMP Research
Chip-multiprocessors are quickly gaining momentum in all segments of computing. However, the practical success of CMPs strongly depends on addressing the difficulty of multithreaded application development. To address this challenge, it is necessary to co-develop new CMP architecture with novel programming models. Currently, architecture research relies on software simulators, which are too slow to facilitate interesting experiments with CMP software without using small datasets or significantly reducing the level of detail in the simulated models. An alternative to simulation is to exploit the rich capabilities of modern FPGAs to create FPGA-based platforms for novel CMP research. This paper presents ATLAS, the first prototype for CMPs with hardware support for Transactional Memory (TM), a technology aiming to simplify parallel programming. ATLAS uses the BEE2 multi-FPGA board to provide a system with 8 PowerPC cores that run at 100 MHz and that runs Linux. ATLAS provides significant benefits for CMP research, such as a 100x performance improvement over a software simulator and good visibility that helps with software tuning and architectural improvements. In addition to presenting and evaluating ATLAS, we share our observations about building an FPGA-based framework for CMP research. Specifically, we address issues such as overall performance, challenges of mapping ASIC-style CMP RTL onto FPGAs, software support, the selection criteria for the base processor, and the challenges of using pre-designed IP libraries.