Reconfigurable Architectures and General-Purpose Computing in the MOS VLSI Era
, 1996
Cited by 97 (6 self)
The hallmark of general-purpose computing has been the capability to run almost any large, functionally diverse computational task on a single hardware system. General-purpose computing platforms, to date, have largely been built around one or more moderately coarse-grained, fixed processors. As available silicon density increases, it is worthwhile to consider other computing structures for providing flexible computation. In this paper, we look specifically at both conventional processors and reconfigurable logic to understand their relative merits in general-purpose computing scenarios. A simple analysis of delivered functional capacity suggests that conventional processors are best suited for tasks which require a diverse set of operations which are well matched to the processor's ALU primitives and datapaths. Reconfigurable logic, on the other hand, can deliver higher capacity on a broader range of functionality and datapaths when the function required is highly repetitive.
Exploiting Superword Level Parallelism with Multimedia Instruction Sets
- in Proceedings of the SIGPLAN ’00 Conference on Programming Language Design and Implementation
, 2000
Cited by 69 (8 self)
Increasing focus on multimedia applications has prompted the addition of multimedia extensions to most existing general-purpose microprocessors. This added functionality comes primarily in the addition of short SIMD instructions. Unfortunately, access to these instructions is limited to in-line assembly and library calls. Some researchers have proposed using vector compilers as a means of exploiting multimedia instructions. Although vectorization technology is well understood, it is inherently complex and fragile. In addition, it is incapable of locating SIMD-style parallelism within a basic block. In this paper we introduce the concept of Superword Level Parallelism (SLP), a novel way of viewing parallelism in multimedia applications. We believe SLP is fundamentally different from the loop-level parallelism exploited by traditional vector processing, and therefore warrants a different method for extracting it. We have developed a simple and robust compiler technique for detecting SLP that targets basic blocks rather than loop nests. As with techniques designed to extract ILP, ours is able to exploit parallelism both across loop iterations and within basic blocks. The result is an algorithm that provides excellent performance in several application domains. Experiments on scientific and multimedia benchmarks have yielded average performance improvements of 84%, with improvements ranging as high as 253%.
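To make the basic-block idea concrete, here is a minimal Python sketch. It is not the paper's algorithm: the toy statement form `dst[i] = a[i] op b[i]`, the `Stmt` type, and the greedy `pack_adjacent` helper are all hypothetical, and the statements are assumed to be mutually independent (no dependence analysis is performed). It only shows how isomorphic statements on consecutive indices within one block could be grouped into SIMD-width packs.

```python
from typing import NamedTuple


class Stmt(NamedTuple):
    """One scalar statement dst[index] = a[index] op b[index] (hypothetical toy IR)."""
    dst: str
    a: str
    b: str
    op: str
    index: int


def pack_adjacent(block, width=4):
    """Greedily group isomorphic statements on consecutive indices.

    Assumes every statement in `block` is independent of the others.
    Returns (packs, leftovers); each entry is (shape_key, list_of_indices).
    """
    groups = {}
    for s in block:
        # Statements are isomorphic when they share arrays and operator.
        groups.setdefault((s.dst, s.a, s.b, s.op), []).append(s.index)

    packs, leftovers = [], []
    for key, indices in groups.items():
        indices.sort()
        run = [indices[0]]
        for i in indices[1:]:
            if i == run[-1] + 1 and len(run) < width:
                run.append(i)  # extends the current run of adjacent elements
            else:
                (packs if len(run) > 1 else leftovers).append((key, run))
                run = [i]
        (packs if len(run) > 1 else leftovers).append((key, run))
    return packs, leftovers


# An unrolled loop body (four adds) plus one unrelated multiply.
block = [Stmt("c", "a", "b", "+", i) for i in range(4)]
block.append(Stmt("d", "a", "b", "*", 7))
packs, leftovers = pack_adjacent(block)
print(packs)      # the four adds form one 4-wide pack
print(leftovers)  # the multiply stays scalar
```

Note that the grouping looks only at statements inside a single block, with no loop-nest analysis involved; a real implementation would also need dependence and alignment checks.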
Out-of-Order Vector Architectures
, 1997
Cited by 46 (21 self)
Register renaming and out-of-order instruction issue are now commonly used in superscalar processors. These techniques can also be used to significant advantage in vector processors, as this paper shows. Performance is improved and available memory bandwidth is used more effectively. Using a trace-driven simulation we compare a conventional vector implementation, based on the Convex C3400, with an out-of-order, register-renaming vector implementation. When the number of physical registers is above 12, out-of-order execution coupled with register renaming provides a speedup of 1.24–1.72 for realistic memory latencies. Out-of-order techniques also tolerate main memory latencies of 100 cycles with a performance degradation of less than 6%. The mechanisms used for register renaming and out-of-order issue can be used to support precise interrupts, generally a difficult problem in vector machines. When precise interrupts are implemented, there is typically less than a 10% degradation in performance. A new technique based on register renaming is targeted at dynamically eliminating spill code; this technique is shown to provide an extra speedup ranging between 1.10 and 1.20 while reducing total memory traffic by an average of 15–20%.
Simple Vector Microprocessors for Multimedia Applications
- In Proceedings of the 31st Annual International Symposium on Microarchitecture
, 1998
Cited by 32 (1 self)
In anticipation of the emergence of multimedia applications as an important workload, microprocessor companies have augmented their instruction-set architectures with short vector extensions, thus adding basic vector hardware to state-of-the-art superscalar processors. Although a vector architecture may be a good match for multimedia applications, there is growing evidence that the control logic for increasingly complex superscalar processors is difficult to implement. Rather than combining a complex superscalar core with short wide vector hardware, we propose using a much simpler processor design that is similar to traditional vector computers with long vectors and simple control logic for instruction issue. Such a design would use the bulk of its transistors and die area for datapath and registers, and thus lessen the time required to design, implement, and verify control. In this paper, we present data that quantifies this trading of control transistors for datapath and register tr...
A Media-Enhanced Vector Architecture for Embedded Memory Systems
, 1999
Cited by 27 (2 self)
Next generation portable devices will require processors with both low energy consumption and high performance for media functions. At the same time, modern CMOS technology creates the need for highly scalable VLSI architectures. Conventional processor architectures fail to meet these requirements. This paper presents the architecture of Vector IRAM (VIRAM), a processor that combines vector processing with embedded DRAM technology. Vector processing achieves high multimedia performance with simple hardware, while embedded DRAM provides high memory bandwidth at low energy consumption. VIRAM provides flexible support for media data types, short vectors, and DSP features. The vector pipeline is enhanced to hide DRAM latency without using caches. The peak performance is 3.2 GFLOPS (single precision) and maximum memory bandwidth is 25.6 GBytes/s. With a target power consumption of 2 Watts for the vector pipeline and the memory system, VIRAM supports 1.6 GFLOPS/Watt. For a set of representat...
Tarantula: A Vector Extension to the Alpha Architecture
- In Proceedings of The 29th International Symposium on Computer Architecture
, 2002
Overcoming the Limitations of Conventional Vector Processors
- In ISCA-30
, 2003
Cited by 25 (0 self)
Despite their superior performance for multimedia applications, vector processors have three limitations that hinder their widespread acceptance. First, the complexity and size of the centralized vector register file limits the number of functional units. Second, precise exceptions for vector instructions are difficult to implement. Third, vector processors require an expensive on-chip memory system that supports high bandwidth at low access latency.
Exploring the VLSI Scalability of Stream Processors
- In International Conference on High Performance Computer Architecture (HPCA-2003)
, 2003
Cited by 21 (8 self)
Stream processors are high-performance programmable processors optimized to run media applications. Recent work has shown these processors to be more area- and energy-efficient than conventional programmable architectures. This paper explores the scalability of stream architectures to future VLSI technologies where over a thousand floating-point units on a single chip will be feasible. Two techniques for increasing the number of ALUs in a stream processor are presented: intracluster and intercluster scaling. These scaling techniques are shown to be cost-efficient to tens of ALUs per cluster and to hundreds of arithmetic clusters. A 640-ALU stream processor with 128 clusters and 5 ALUs per cluster is shown to be feasible in 45 nanometer technology, sustaining over 300 GOPS on kernels and providing 15.3x of kernel speedup and 8.0x of application speedup over a 40-ALU stream processor with a 2% degradation in area per ALU and a 7% degradation in energy dissipated per ALU operation.
Exploiting Instruction and Data Level Parallelism in Future High Performance Processors
- IEEE Micro
, 1997
Cited by 20 (5 self)
Introduction. Historically, there have been two different approaches to high performance computing: instruction-level parallelism (ILP) and data-level parallelism (DLP). The ILP paradigm seeks to execute several instructions each cycle by exploring a sequential instruction stream and extracting independent instructions that can be sent to several execution units in parallel. The DLP paradigm, on the other hand, uses vectorization techniques to specify with a single instruction (a vector instruction) a large number of operations to be performed on independent data. A few of these vector instructions running concurrently can provide a large operation parallelism for many consecutive cycles. Figure 1 graphically presents the different microarchitecture generations that have appeared to date in the DLP world. The first DLP machines appeared shortly after the introduction of pipelining in the ILP world. The prototype example of the fir
Dynamic Cache Partitioning via Columnization
, 2000
Cited by 18 (0 self)
This paper introduces column caching, a flexible mechanism that allows software to dynamically customize cache behavior through fine-grain control of its placement policy. For a set-associative cache, specific data can be restricted to a subset of the usual target cache set during replacement. Through this simple enhancement, column caching enables the cache to be partitioned. When done properly, this improves cache utilization through both constructive and destructive interference, leading to better overall system performance. Column caching provides a basic mechanism which can emulate many different hard-wired specializations, such as dedicated SRAM and separate temporal and spatial caches, and further offers the advantage of dynamic repartitioning under software control. This paper introduces column caching, describes possible implementations and presents a few example usages including preliminary performance numbers.
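The replacement restriction described in this abstract can be illustrated with a small Python model. This is a sketch under stated assumptions rather than the paper's design: the `ColumnCache` class, the `allowed_ways` parameter, and the LRU victim choice are all hypothetical. Lookups search every way of the set, while a miss may only fill one of the ways (columns) the caller permits.

```python
class ColumnCache:
    """Toy set-associative cache with per-access column (way) restrictions.

    Hypothetical model: names, the LRU policy, and the allowed_ways
    interface are assumptions made for illustration only.
    """

    def __init__(self, num_sets, num_ways, line_size=64):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.line_size = line_size
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        self.last_used = [[0] * num_ways for _ in range(num_sets)]
        self.clock = 0

    def access(self, addr, allowed_ways):
        """Return True on a hit; on a miss, fill a way taken from allowed_ways."""
        self.clock += 1
        line = addr // self.line_size
        set_index, tag = line % self.num_sets, line // self.num_sets
        # Lookup is unrestricted: data may hit in any way of the set.
        for way in range(self.num_ways):
            if self.tags[set_index][way] == tag:
                self.last_used[set_index][way] = self.clock
                return True
        # Replacement is restricted: evict the LRU way among the allowed ones.
        victim = min(allowed_ways, key=lambda w: self.last_used[set_index][w])
        self.tags[set_index][victim] = tag
        self.last_used[set_index][victim] = self.clock
        return False


# Pin one hot line in column 0 and confine a streaming scan to columns 1-3.
cache = ColumnCache(num_sets=1, num_ways=4, line_size=1)
cache.access(0, allowed_ways=(0,))
for addr in range(100, 110):
    cache.access(addr, allowed_ways=(1, 2, 3))
print(cache.access(0, allowed_ways=(0,)))  # the hot line survives the scan
```

Because only the victim choice is constrained, changing a region's permitted columns in this model needs no flush, which loosely mirrors the software-controlled repartitioning the abstract mentions.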

