Scalable Vector Media-processors for Embedded Systems. (2002)

by C Kozyrakis
Results 1 - 10 of 40

Vector vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks

by Christoforos Kozyrakis, David Patterson - In Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002
"... Multimedia processing on embedded devices requires an architecture that leads to high performance, low power consumption, reduced design complexity, and small code size. In this paper, we use EEMBC, an industrial benchmark suite, to compare the VIRAM vector architecture to superscalar and VLIW proce ..."
Abstract - Cited by 52 (2 self) - Add to MetaCart
Multimedia processing on embedded devices requires an architecture that leads to high performance, low power consumption, reduced design complexity, and small code size. In this paper, we use EEMBC, an industrial benchmark suite, to compare the VIRAM vector architecture to superscalar and VLIW processors for embedded multimedia applications. The comparison covers the VIRAM instruction set, vectorizing compiler, and the prototype chip that integrates a vector processor with DRAM main memory.
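As a rough illustration of the kind of loop the VIRAM vectorizing compiler described in this abstract targets, here is a plain-C saturating pixel add. It is a generic example of a data-parallel multimedia kernel, not code from the paper or from the VIRAM toolchain.

```c
#include <stdint.h>
#include <stddef.h>

/* Generic example of a vectorizable multimedia kernel: a saturating 8-bit
 * add over two pixel buffers. Every iteration is independent, so a
 * vectorizing compiler can map the loop onto vector (or SIMD) instructions.
 * Illustrative only; not taken from the paper or the VIRAM compiler. */
void add_saturate_u8(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        out[i] = (uint8_t)(sum > 255u ? 255u : sum);  /* clamp to 8 bits */
    }
}
```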

The Vector-Thread Architecture

by Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, Krste Asanović - In 31st International Symposium on Computer Architecture, 2004
"... The vector-thread (VT) architectural paradigm unifies the vector and multithreaded compute models. The VT abstraction provides the programmer with a control processor and a vector of virtual processors (VPs). The control processor can use vector-fetch commands to broadcast instructions to all the VP ..."
Abstract - Cited by 52 (7 self) - Add to MetaCart
The vector-thread (VT) architectural paradigm unifies the vector and multithreaded compute models. The VT abstraction provides the programmer with a control processor and a vector of virtual processors (VPs). The control processor can use vector-fetch commands to broadcast instructions to all the VPs, or each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and threaded control mechanisms allows a VT architecture to flexibly and compactly encode application parallelism and locality, and a VT machine exploits these to improve performance and efficiency. We present SCALE, an instantiation of the VT architecture designed for low-power and high-performance embedded systems. We evaluate the SCALE prototype design using detailed simulation of a broad range of embedded applications and show that its performance is competitive with larger and more complex processors.

Citation Context

... a 512 KB L2 cache, and a 64-bit memory port running at 133 MHz. The OPT results for the processor use its Altivec SIMD unit which has a 128-bit datapath and four execution units. The VIRAM processor [4] is a research vector processor with four 64-bit lanes. VIRAM runs at 200 MHz and includes 13 MB of embedded DRAM supporting up to 256 bits each of load and store data, and four independent addresses ...
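To make the vector-fetch/thread-fetch distinction in the VT abstract above concrete, the following toy C model treats virtual processors as plain structs: a vector-fetch runs the same block on every VP, while a thread-fetch lets each VP pick its own next block from its own state. The names, the 8-VP configuration, and the function-pointer encoding are assumptions for illustration; this is not the SCALE ISA or its programming interface.

```c
#include <stdio.h>

#define NUM_VPS 8                        /* assumed VP count, for illustration */

typedef struct { int acc; } vp_state;    /* one private register per VP        */
typedef void (*aib_fn)(vp_state *);      /* stand-in for an instruction block  */

static void double_acc(vp_state *vp) { vp->acc *= 2; }
static void incr_acc(vp_state *vp)   { vp->acc += 1; }

/* vector-fetch: the control processor broadcasts one block to every VP. */
static void vector_fetch(vp_state vps[], int n, aib_fn block)
{
    for (int i = 0; i < n; i++)
        block(&vps[i]);
}

/* thread-fetch: each VP chooses its own next block from its own data,
 * layering per-VP control flow on top of the vector organization. */
static void thread_fetch(vp_state vps[], int n, aib_fn if_even, aib_fn if_odd)
{
    for (int i = 0; i < n; i++)
        (vps[i].acc % 2 == 0 ? if_even : if_odd)(&vps[i]);
}

int main(void)
{
    vp_state vps[NUM_VPS];
    for (int i = 0; i < NUM_VPS; i++)
        vps[i].acc = i;

    vector_fetch(vps, NUM_VPS, double_acc);            /* same work on every VP    */
    thread_fetch(vps, NUM_VPS, incr_acc, double_acc);  /* data-dependent next block */

    for (int i = 0; i < NUM_VPS; i++)
        printf("VP%d: %d\n", i, vps[i].acc);
    return 0;
}
```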

Overcoming the Limitations of Conventional Vector Processors

by Christos Kozyrakis, David Patterson - In Proc. of the 30th Annual Intl. Symp. on Comp. Architecture, 2003
"... Despite their superior performance for multimedia ap-plications, vector processors have three limitations that hin-der their widespread acceptance. First, the complexity and size of the centralized vector register file limits the number of functional units. Second, precise exceptions for vector inst ..."
Abstract - Cited by 40 (1 self) - Add to MetaCart
Despite their superior performance for multimedia applications, vector processors have three limitations that hinder their widespread acceptance. First, the complexity and size of the centralized vector register file limits the number of functional units. Second, precise exceptions for vector instructions are difficult to implement. Third, vector processors require an expensive on-chip memory system that supports high bandwidth at low access latency. This paper introduces CODE, a scalable vector microarchitecture that addresses these three shortcomings. It is designed around a clustered vector register file and uses a separate network for operand transfers across functional units. With extensive use of decoupling, it can hide the latency of communication across functional units and provides 26% performance improvement over a centralized organization. CODE scales efficiently to 8 functional units without requiring wide instruction issue capabilities. A renaming table makes the clustered register file transparent at the instruction set level. Renaming also enables precise exceptions for vector instructions at a performance loss of less than 5%. Finally, decoupling allows CODE to tolerate large increases in memory latency at sub-linear performance degradation without using on-chip caches. Thus, CODE can use economical, off-chip memory systems.

Citation Context

...M ISA for multimedia processing. However, CODE is equally applicable to any other modern vector ISA, such as the Cray X1 [1] or the Alpha Tarantula [7]. A complete description of CODE is available in [18]. The VIRAM ISA is a vector load-store extension to the MIPS architecture. It defines an 8-KByte vector register file that stores 32 general-purpose registers with 32 64-bit, 64 32-bit, or 128 16-bit e...
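The renaming idea in the CODE abstract above, a table that hides the clustered register file behind the architectural register numbers, can be sketched in a few lines of C. The table sizes, the round-robin cluster choice, and the trivial allocator below are invented for illustration; they are not the CODE microarchitecture.

```c
#include <stdint.h>
#include <stdio.h>

#define ARCH_VREGS        32   /* architectural vector registers (ISA-visible) */
#define CLUSTERS           8   /* assumed cluster count, illustration only     */
#define PHYS_PER_CLUSTER  16   /* assumed physical registers per cluster       */

typedef struct {
    uint8_t cluster;           /* cluster currently holding the value */
    uint8_t phys;              /* physical register within it         */
} rename_entry;

static rename_entry table[ARCH_VREGS];
static uint8_t next_phys[CLUSTERS];    /* trivial allocator, no reclamation */

/* On a write, pick a cluster (round robin by register number here) and a
 * fresh physical register; the ISA-level register number never changes,
 * so the clustering stays invisible at the instruction-set level. */
static rename_entry rename_write(int vr)
{
    uint8_t c = (uint8_t)(vr % CLUSTERS);
    table[vr].cluster = c;
    table[vr].phys = (uint8_t)(next_phys[c]++ % PHYS_PER_CLUSTER);
    return table[vr];
}

int main(void)
{
    rename_entry e = rename_write(5);  /* e.g. a write to vector register v5 */
    printf("v5 -> cluster %u, phys %u\n", e.cluster, e.phys);
    return 0;
}
```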

The Stream Virtual Machine

by Francois Labonte, Christos Kozyrakis - In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, 2004
"... Stream programming is currently being pushed as a way to expose concurrency and separate communication from computation. Since there are many stream languages and potential stream execution engines, this paper proposes an abstract machine model that captures the essential characteristics of stream a ..."
Abstract - Cited by 29 (0 self) - Add to MetaCart
Stream programming is currently being pushed as a way to expose concurrency and separate communication from computation. Since there are many stream languages and potential stream execution engines, this paper proposes an abstract machine model that captures the essential characteristics of stream architectures, the Stream Virtual Machine (SVM). The goal of the SVM is to improve interoperability, allow development of common compilation tools, and support reasoning about stream program performance. The SVM contains control processors, slave kernel processors, and slave DMA units. It is presented along with the compilation process that takes a stream program down to the SVM and finally down to machine binary. To extract the parameters for our SVM model, we use micro-kernels to characterize two graphics processors and a stream engine, Imagine. The results are encouraging; the model estimates the performance of the target machines with high accuracy.

Citation Context

...ted through defined stream operators. Data-parallel architectures have made a come-back recently due to the diminishing returns of ILP gains in a single thread. Vector processors, whether for embedded [9] or scientific purposes, share a lot of common benefits with dedicated stream architectures for media [8] or scientific [5] applications. Some other architectures like Raw [20] exploit streams by mappi...
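To make the SVM's division of labor in the abstract above (control processor, slave kernel processors, slave DMA units) concrete, here is a host-side C sketch that stages a chunk of a stream into a local buffer, runs a kernel over it, and writes it back. The chunked loop mirrors the DMA-then-compute structure, but all names and the buffer size are invented for illustration, and the "DMA" and "kernel" steps are simulated in plain C rather than issued to real SVM processors.

```c
#include <string.h>
#include <stddef.h>

#define LOCAL_WORDS 256          /* assumed on-chip buffer size, illustration only */
static float local_buf[LOCAL_WORDS];

/* Simulated slave DMA unit: memory -> local store. */
static void dma_load(const float *src, size_t n)  { memcpy(local_buf, src, n * sizeof(float)); }

/* Simulated slave DMA unit: local store -> memory. */
static void dma_store(float *dst, size_t n)       { memcpy(dst, local_buf, n * sizeof(float)); }

/* Simulated slave kernel processor: scale every element in the local store. */
static void kernel_scale(float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        local_buf[i] *= k;
}

/* Control-processor view: chunk the stream, stage it in, compute, stage it out.
 * On a real SVM target the three calls per chunk would be issued to DMA units
 * and kernel processors and could overlap; here they run sequentially. */
void scale_stream(const float *in, float *out, size_t n, float k)
{
    for (size_t off = 0; off < n; off += LOCAL_WORDS) {
        size_t len = (n - off < LOCAL_WORDS) ? n - off : LOCAL_WORDS;
        dma_load(in + off, len);
        kernel_scale(k, len);
        dma_store(out + off, len);
    }
}
```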

Exploring the VLSI Scalability of Stream Processors

by Brucek Khailany, William J. Dally, Scott Rixner, Ujval J. Kapasi, John D. Owens, Brian Towles - In International Conference on High Performance Computer Architecture (HPCA-2003), 2003
"... Stream processors are high-performance programmable processors optimized to run media applications. Recent work has shown these processors to be more area- and energy-efficient than conventional programmable architectures. This paper explores the scalability of stream architectures to future VLSI te ..."
Abstract - Cited by 26 (9 self) - Add to MetaCart
Stream processors are high-performance programmable processors optimized to run media applications. Recent work has shown these processors to be more area- and energy-efficient than conventional programmable architectures. This paper explores the scalability of stream architectures to future VLSI technologies where over a thousand floating-point units on a single chip will be feasible. Two techniques for increasing the number of ALUs in a stream processor are presented: intracluster and intercluster scaling. These scaling techniques are shown to be cost-efficient up to tens of ALUs per cluster and to hundreds of arithmetic clusters. A 640-ALU stream processor with 128 clusters and 5 ALUs per cluster is shown to be feasible in 45-nanometer technology, sustaining over 300 GOPS on kernels and providing a 15.3x kernel speedup and an 8.0x application speedup over a 40-ALU stream processor, with a 2% degradation in area per ALU and a 7% degradation in energy dissipated per ALU operation.

ALP: Efficient Support for All Levels of Parallelism for Complex Media Applications

by Ruchira Sasanka, Man-lap Li, Sarita V. Adve, Yen-kuang Chen, Eric Debes - ACM Trans. Archit. Code Optim., 2005
"... The real-time execution of contemporary complex media applications requires energy-efficient processing capabilities beyond those of current superscalar processors. We observe that the complexity of contemporary media applications requires support for multiple forms of parallelism, including ILP, TL ..."
Abstract - Cited by 17 (2 self) - Add to MetaCart
The real-time execution of contemporary complex media applications requires energy-efficient processing capabilities beyond those of current superscalar processors. We observe that the complexity of contemporary media applications requires support for multiple forms of parallelism, including ILP, TLP, and various forms of DLP, such as subword SIMD, short vectors, and streams. Based on our observations, we propose an architecture, called ALP, that efficiently integrates all of these forms of parallelism with evolutionary changes to the programming model and hardware. The novel part of ALP is a DLP technique called SIMD vectors and streams (SVectors/SStreams), which is integrated within a conventional superscalar-based CMP/SMT architecture with subword SIMD. This technique lies between subword SIMD and vectors, providing significant benefits over the former at a lower cost than the latter. Our evaluations show that each form of parallelism supported by ALP is important. Specifically, SVectors/SStreams are effective, compared to a system with the other enhancements in ALP. They give speedups of 1.1 to 3.4X and energy-delay product improvements of 1.1 to 5.1X for applications with DLP.

Citation Context

...P). Much of the recent effort therefore has been on architectures that target such DLP in various ways (e.g., sub-word SIMD ISA extensions in most processors, VIRAM’s and CODE’s vector architecture [16, 15], Imagine’s streaming architecture [4], Scale’s vector-threading architecture [17], Raw’s tiled architecture with stream support [31]). While the quantitative evaluations of these architectures are ve...

Stream Register Files with Indexed Access

by Nuwan Jayasena, Mattan Erez, Jung Ho Ahn, William J. Dally - In Tenth International Symposium on High Performance Computer Architecture (HPCA-2004), 2004
"... Many current programmable architectures designed to exploit data parallelism require computation to be structured to operate on sequentially accessed vectors or streams of data. Applications with less regular data access patterns perform sub-optimally on such architectures. This paper presents a reg ..."
Abstract - Cited by 15 (4 self) - Add to MetaCart
Many current programmable architectures designed to exploit data parallelism require computation to be structured to operate on sequentially accessed vectors or streams of data. Applications with less regular data access patterns perform sub-optimally on such architectures. This paper presents a register file for streams (SRF) that allows arbitrary, indexed accesses. Compared to sequential SRF access, indexed access captures more temporal locality, reduces data replication in the SRF, and provides efficient support for certain types of complex access patterns. Our simulations show that indexed SRF access provides speedups of 1.03x to 4.1x and memory bandwidth reductions of up to 95% over sequential SRF access for a set of benchmarks representative of data-parallel applications with irregular accesses. Indexed SRF access also provides greater speedups than caches for a number of application classes despite significantly lower hardware costs. The area overhead of our indexed SRF implementation is 11%-22% over a sequentially accessed SRF, which corresponds to a modest 1.5%-3% increase in the total die area of a typical stream processor.
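A minimal C sketch of the access-pattern difference this abstract turns on: a sequentially accessed SRF only yields elements in order, while indexed access lets a kernel gather arbitrary elements through an index stream. The SRF is modeled as a plain array; this illustrates the concept, not the paper's hardware interface.

```c
#include <stddef.h>

/* Sequential SRF access: the kernel consumes the stream strictly in order. */
float sum_sequential(const float *srf, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += srf[i];
    return s;
}

/* Indexed SRF access: an index stream selects elements in any order,
 * capturing irregular, data-dependent reuse without replicating data. */
float sum_indexed(const float *srf, const size_t *idx, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += srf[idx[i]];
    return s;
}
```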

VEGAS: Soft Vector Processor with Scratchpad Memory

by Christopher H. Chou, Zhiduo Liu, Aaron Severance, Saurabh Sant, Alex D. Brant, Guy Lemieux
"... This paper presents VEGAS, a new soft vector architecture, in which the vector processor reads and writes directly to a scratchpad memory instead of a vector register file. The scratchpad memory is a more efficient storage medium than a vector register file, allowing up to 9 × more data elements to ..."
Abstract - Cited by 14 (4 self) - Add to MetaCart
This paper presents VEGAS, a new soft vector architecture, in which the vector processor reads and writes directly to a scratchpad memory instead of a vector register file. The scratchpad memory is a more efficient storage medium than a vector register file, allowing up to 9× more data elements to fit into on-chip memory. In addition, the use of fracturable ALUs in VEGAS allows efficient processing of bytes, halfwords and words in the same processor instance, providing up to 4× the operations compared to existing fixed-width soft vector ALUs. Benchmarks show the new VEGAS architecture is 10× to 208× faster than Nios II and has 1.7× to 3.1× better area-delay product than previous vector work, achieving much higher throughput per unit area. To put this performance in perspective, VEGAS is faster than a leading-edge Intel processor at integer matrix multiply. To ease programming effort and provide full debug support, VEGAS uses a C macro API that outputs vector instructions as standard Nios II/f custom instructions.

Citation Context

...ing closure. Unfortunately, traditional soft processors are too slow for most processing-intensive tasks. However, vector processing is known to accelerate data-parallel tasks. The VIRAM architecture [9] demonstrated that embedded tasks such as the EEMBC benchmark suite [1] can be accelerated with vectors. Embedded vector architectures SODA [12] and Ardbeg [18] were developed for low-power wireless a...
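The VEGAS abstract above mentions a C macro API whose macros emit vector operations as Nios II/f custom instructions. The sketch below only gestures at what macro-based vector code can look like: the macro name and its expansion to a plain C loop are invented so the example compiles anywhere, and they are not the actual VEGAS API.

```c
#include <stdint.h>

/* Hypothetical vector macro, NOT the VEGAS API: element-wise byte add.
 * On a soft vector processor such a macro would expand into custom
 * instructions operating on scratchpad regions; here it is an ordinary loop. */
#define VEC_ADD_U8(dst, a, b, n)                          \
    do {                                                  \
        for (int _i = 0; _i < (n); _i++)                  \
            (dst)[_i] = (uint8_t)((a)[_i] + (b)[_i]);     \
    } while (0)

void blend_rows(uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
{
    VEC_ADD_U8(dst, a, b, n);   /* one macro call stands in for the vector op */
}
```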

Vector-thread architecture and implementation

by Ronny Meir Krashinsky, 2007
"... ..."
Abstract - Cited by 13 (1 self) - Add to MetaCart
Abstract not found

A Performance Analysis of PIM, Stream Processing, and Tiled Processing on Memory-Intensive Signal Processing Kernels

by Jinwoo Suh, Eun-gyu Kim, Stephen P. Crago, Lakshmi Srinivasan, Matthew C. French - Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA-03), volume 31 of Computer Architecture News, 2003
"... Trends in microprocessors of increasing die size and clock speed and decreasing feature sizes have fueled rapidly increasing performance. However, the limited improvements in DRAM latency and bandwidth and diminishing returns of increasing superscalar ILP and cache sizes have led to the proposal of ..."
Abstract - Cited by 13 (0 self) - Add to MetaCart
Trends in microprocessors of increasing die size and clock speed and decreasing feature sizes have fueled rapidly increasing performance. However, the limited improvements in DRAM latency and bandwidth and the diminishing returns of increasing superscalar ILP and cache sizes have led to the proposal of new microprocessor architectures that implement processor-in-memory, stream processing, and tiled processing. Each architecture is typically evaluated separately and compared to a baseline architecture. In this paper, we evaluate the performance of processors that implement these architectures on a common set of signal processing kernels. The implementation results are compared with the measured performance of a conventional system based on the PowerPC with Altivec. The results show that these new processors provide significant improvements over conventional systems and that each architecture has its own strengths and weaknesses.

Citation Context

... increase the bandwidth between the processor and memory. PIM technology also has the potential to decrease other important system parameters such as power consumption, cost, and area. The VIRAM chip [5] is a PIM research prototype being developed at the University of California at Berkeley. A simplified architecture of the chip is shown in Figure 1. The VIRAM contains two vector-processing units in ...
