Results 1 - 10
of
48
Slipstream processors: improving both performance and fault tolerance
- In Proceedings of the ninth international conference on Architectural
"... Processors execute the full dynamic instruction stream to arrive at the final output of a program, yet there exist shorter instruction streams that produce the same overall effect. We propose creating a shorter but otherwise equivalent version of the original program by removing ineffectual computat ..."
Abstract
-
Cited by 145 (6 self)
- Add to MetaCart
Processors execute the full dynamic instruction stream to arrive at the final output of a program, yet there exist shorter instruction streams that produce the same overall effect. We propose creating a shorter but otherwise equivalent version of the original program by removing ineffectual computation and computation related to highly-predictable control flow. The shortened program is run concurrently with the full program on a chip multiprocessor or simultaneous multithreaded processor, with two key advantages: 1) Improved single-program performance. The shorter program speculatively runs ahead of the full program and supplies the full program with control and data flow outcomes. The full program executes efficiently due to the communicated outcomes, at the same time validating the speculative, shorter program. The two programs combined run faster than the original program alone. Detailed simulations of an example implementation show an average improvement of 7 % for the SPEC95 integer benchmarks. 2) Fault tolerance. The shorter program is a subset of the full program and this partial-redundancy is transparently leveraged for detecting and recovering from transient hardware faults. 1.
Continual flow pipelines
- In International Conference on Architectural Support for Programming Languages and Operating Systems
, 2004
"... Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects. How to build a processor that provides high single-thread performance and enables multip ..."
Abstract
-
Cited by 83 (0 self)
- Add to MetaCart
Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects. How to build a processor that provides high single-thread performance and enables multiple of these to be placed on the same die for high throughput while dynamically adapting for future applications? Conventional approaches for high single-thread performance rely on large and complex cores to sustain a large instruction window for memory tolerance, making them unsuitable for multi-core chips. We present Continual Flow Pipelines (CFP) as a new nonblocking processor pipeline architecture that achieves the performance of a large instruction window without requiring cycle-critical structures such as the scheduler and register file to be large. We show that to achieve benefits of a large instruction window, inefficiencies in management of both the scheduler and register file must be addressed, and we propose a unified solution. The non-blocking property of CFP keeps key processor structures affecting cycle time and power (scheduler, register file), and die size (second level cache) small. The memory latency-tolerant CFP core allows multiple cores on a single die while outperforming current processor cores for single-thread applications.
A New Direction for Computer Architecture Research
- IEEE Computer
, 1998
"... Abstract In this paper we suggest a different computing environment as a worthy new direction for computer architecture research: personal mobile computing, where portable devices are used for visual computing and personal communications tasks. Such a device supports in an integrated fashion all the ..."
Abstract
-
Cited by 72 (2 self)
- Add to MetaCart
Abstract In this paper we suggest a different computing environment as a worthy new direction for computer architecture research: personal mobile computing, where portable devices are used for visual computing and personal communications tasks. Such a device supports in an integrated fashion all the functions provided today by a portable computer, a cellular phone, a digital camera and a video game. The requirements placed on the processor in this environment are energy efficiency, high performance for multimedia and DSP functions, and area efficient, scalable designs. We examine the architectures that were recently proposed for billion transistor microprocessors. While they are very promising for the stationary desktop and server workloads, we discover that most of them are unable to meet the challenges of the new environment and provide the necessary enhancements for multimedia applications running on portable devices. We conclude with Vector IRAM, an initial example of a microprocessor architecture and implementation that matches the new environment.
Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture
- In Supercomputing
, 1999
"... Processing-in-memory (PIM) chips that integrate processor logic into memory devices offer a new opportunity for bridging the growing gap between processor and memory speeds, especially for applications with high memory-bandwidth requirements. The Data-IntensiVe Architecture (DIVA) system combines PI ..."
Abstract
-
Cited by 53 (12 self)
- Add to MetaCart
Processing-in-memory (PIM) chips that integrate processor logic into memory devices offer a new opportunity for bridging the growing gap between processor and memory speeds, especially for applications with high memory-bandwidth requirements. The Data-IntensiVe Architecture (DIVA) system combines PIM memories with one or more external host processors and a PIM-to-PIM interconnect. DIVA increases memory bandwidth through two mechanisms: (1) performing selected computation in memory, reducing the quantity of data transferred across the processor-memory interface; and (2) providing communication mechanisms called parcels for moving both data and computation throughout memory, further bypassing the processor-memory bus. DIVA uniquely supports acceleration of important irregular applications, including sparse-matrix and pointer-based computations. In this paper, we focus on several aspects of DIVA designed to effectively support such computations at very high performance levels: (1) the mem...
Mobile Multimedia Systems
- In Proc. of PROGRESS workshop 2000
, 2000
"... system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior written permission of the author. ..."
Abstract
-
Cited by 31 (7 self)
- Add to MetaCart
system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior written permission of the author.
LLVA: A Low-level Virtual Instruction Set Architecture
- IN MICRO-36
, 2003
"... A virtual instruction set architecture (V-ISA) implemented via a processor-specific software translation layer can provide great flexibility to processor designers. Recent examples such as Crusoe and DAISY, however, have used existing hardware instruction sets as virtual ISAs, which complicates tran ..."
Abstract
-
Cited by 20 (3 self)
- Add to MetaCart
A virtual instruction set architecture (V-ISA) implemented via a processor-specific software translation layer can provide great flexibility to processor designers. Recent examples such as Crusoe and DAISY, however, have used existing hardware instruction sets as virtual ISAs, which complicates translation and optimization. In fact, there has been little research on specific designs for a virtual ISA for processors. This paper proposes a novel virtual ISA (LLVA) and a translation strategy for implementing it on arbitrary hardware. The instruction set is typed, uses an infinite virtual register set in Static Single Assignment form, and provides explicit control-flow and dataflow information, and yet uses low-level operations closely matched to traditional hardware. It includes novel mechanisms to allow more flexible optimization of native code, including a flexible exception model and minor constraints on self-modifying code. We propose a translation strategy that enables offline translation and transparent offline caching of native code and profile information, while remaining completely OS-independent. It also supports optimizations directly on the representation at install-time, runtime, and offline between executions. We show experimentally that the virtual ISA is compact, it is closely matched to ordinary hardware instruction sets, and permits very fast code generation, yet has enough high-level information to permit sophisticated program analyses and optimizations.
Exploiting Instruction and Data Level Parallelism in Future High Performance Processors
- IEEE Micro
, 1997
"... Introduction Historically, there have been two different approaches to high performance computing: instructionlevel parallelism (ILP) and data-level parallelism (DLP). The ILP paradigm seeks to execute several instructions each cycle by exploring a sequential instruction stream and extracting indep ..."
Abstract
-
Cited by 20 (5 self)
- Add to MetaCart
Introduction Historically, there have been two different approaches to high performance computing: instructionlevel parallelism (ILP) and data-level parallelism (DLP). The ILP paradigm seeks to execute several instructions each cycle by exploring a sequential instruction stream and extracting independent instructions that can be sent to several execution units in parallel. The DLP paradigm, on the other hand, uses vectorization techniques to specify with a single instruction (a vector instruction) a large number of operations to be performed on independent data. A few of these vector instructions running concurrently can provide a large operation parallelism for many consecutive cycles. Figure 1 graphically presents the different microarchitecture generations that have appeared to date in the DLP world. The first DLP machines appeared shortly after the introduction of pipelining in the ILP world. The prototype example of the fir
Exploiting the Cache Capacity of a Single-Chip Multi-Core Processor with Execution Migration
- In HPCA 10
, 2004
"... We propose to modify a conventional single-chip multicore so that a sequential program can migrate from one core to another automatically during execution. The goal of execution migration is to take advantage of the overall onchip cache capacity. We introduce the affinity algorithm, a method for dis ..."
Abstract
-
Cited by 19 (1 self)
- Add to MetaCart
We propose to modify a conventional single-chip multicore so that a sequential program can migrate from one core to another automatically during execution. The goal of execution migration is to take advantage of the overall onchip cache capacity. We introduce the affinity algorithm, a method for distributing cache lines automatically on several caches. We show that on working-sets exhibiting a property called "splittability", it is possible to trade cache misses for migrations. Our experimental results indicate that the proposed method has a potential for improving the performance of certain sequential programs, without degrading significantly the performance of others.
Toward A Cost-Effective DSM Organization That Exploits Processor-Memory Integration
, 2000
"... Dramatic increases in the number of transistors that can be integrated on a VLSI chip will soon allow commodity microprocessors to include both processor and a sizable fraction of main memory on chip. Distributed Shared-Memory (DSM) multiprocessors typically use the latest off-the-shelf microprocess ..."
Abstract
-
Cited by 14 (0 self)
- Add to MetaCart
Dramatic increases in the number of transistors that can be integrated on a VLSI chip will soon allow commodity microprocessors to include both processor and a sizable fraction of main memory on chip. Distributed Shared-Memory (DSM) multiprocessors typically use the latest off-the-shelf microprocessors and thus will be affected by the upcoming processor-memory integration. In this paper, we explore how a cache-coherent DSM machine built around Processor-In-Memory (PIM) chips might be cost-effectively organized. To take advantage of the close coupling between processor and memory, we propose tagging the memory and organizing it as a cache. Furthermore, commercial considerations dictate the use of off-the-shelf hardware largely designed for uniprocessors. Consequently, we keep the directory control off-chip. To keep the multiprocessor cheap and simple, and to allow for recon gurability, directory control is performed by chips that are identical to the ones used as compute nodes. As a result, ...
Distributed Vector Architecture: Beyond a Single Vector-IRAM
- In First Workshop on Mixing Logic and DRAM: Chips that Compute and Remember
, 1997
"... Abstract—The integration of memory on the same die as the processor (IRAM) has the potential to offer unprecedented bandwidth that can be exploited efficiently by vector processors. However, real-world scientific vector applications with their very large memory requirements and their poor locality, ..."
Abstract
-
Cited by 12 (2 self)
- Add to MetaCart
Abstract—The integration of memory on the same die as the processor (IRAM) has the potential to offer unprecedented bandwidth that can be exploited efficiently by vector processors. However, real-world scientific vector applications with their very large memory requirements and their poor locality, would easily overflow any single IRAM device. In this environment, traditional approaches such as caching or paging generate considerable traffic, diminishing the performance advantage of processor-memory integration. To exploit the full potential of IRAM in the realm of large-scale scientific computing, we propose a DIstributed Vector Architecture (DIVA), that uses multiple vector-capable IRAM nodes in a distributed shared-memory configuration. The advantages of our approach are twofold: (i) we speed up

