Results 1 - 10
of
157
Dynamically Exploiting Narrow Width Operands to Improve Processor Power and Performance
, 1999
"... In general-purpose microprocessors, recent trends have pushed towards 64-bit word widths, primarily to accommodate the large addressing needs of some programs. Many integer problems, however, rarely need the full 64-bit dynamic range these CPUs provide. In fact, another recent instruction set trend ..."
Abstract
-
Cited by 124 (7 self)
- Add to MetaCart
In general-purpose microprocessors, recent trends have pushed towards 64-bit word widths, primarily to accommodate the large addressing needs of some programs. Many integer problems, however, rarely need the full 64-bit dynamic range these CPUs provide. In fact, another recent instruction set trend has been increased support for sub-word operations (that is, manipulating data in quantities less than the full word size). In particular, most major processor families have introduced "multimedia" instruction set extensions that operate in parallel on several sub-word quantities in the same ALU. This paper notes that across the SPECint95 benchmarks, over half of the integer operation executions require 16 bits or less. With this as motivation, our work proposes hardware mechanisms that dynamically recognize and capitalize on these "narrow-bitwidth" instances. Both optimizations require little additional hardware, and neither requires compiler support. The first, power-oriented, optimizati...
Register Organization for Media Processing
- HPCA6
, 2000
"... Processor architectures with tens to hundreds of arithmetic units are emerging to handle media processing applications. These applications, such as image coding, image synthesis, and image understanding, require arithmetic rates of up to 10 11 operations per second. As the number of arithmetic uni ..."
Abstract
-
Cited by 94 (10 self)
- Add to MetaCart
Processor architectures with tens to hundreds of arithmetic units are emerging to handle media processing applications. These applications, such as image coding, image synthesis, and image understanding, require arithmetic rates of up to 10 11 operations per second. As the number of arithmetic units in a processor increases to meet these demands, register storage and communication between the arithmetic units dominate the area, delay, and power of the arithmetic units. In this paper we show that partitioning the register file along three axes reduces the cost of register storage and communication without significantly impacting performance. We develop a taxonomy of register architectures by partitioning across the data-parallel, instruction-level parallel, and memory hierarchy axes, and by optimizing the hierarchical register organization to operate on streams of data. Compared to a centralized global register file, the most compact of these organizations reduces the register file ar...
Bitwidth Analysis with Application to Silicon Compilation
, 2000
"... This paper introduces Bitwise, a compiler that minimizes the bitwidth --- the number of bits used to representeach operand --- for both integers and pointers in a program. By propagating static information both forward and backward in the program dataflowgraph,Bitwise frees the programmer from decla ..."
Abstract
-
Cited by 80 (0 self)
- Add to MetaCart
This paper introduces Bitwise, a compiler that minimizes the bitwidth --- the number of bits used to representeach operand --- for both integers and pointers in a program. By propagating static information both forward and backward in the program dataflowgraph,Bitwise frees the programmer from declaring bitwidth invariants in cases where the compiler can determine bitwidths automatically. We find a rich opportunity for bitwidth reduction in modern multimedia and streaming application workloads. For new architectures that support sub-word quantities, we expect that our bitwidth reductions will savepower and increase processor performance. This paper
Exploiting Superword Level Parallelism with Multimedia Instruction Sets
- in Proceedings of the SIGPLAN ’00 Conference on Programming Language Design and Implementation
, 2000
"... Increasing focus on multimedia applications has prompted the addition of multimedia extensions to most existing general-purpose microprocessors. This added functionality comes primarily in the addition of short SIMD instructions. Unfortunately, access to these instructions is limited to in-line asse ..."
Abstract
-
Cited by 69 (8 self)
- Add to MetaCart
Increasing focus on multimedia applications has prompted the addition of multimedia extensions to most existing general-purpose microprocessors. This added functionality comes primarily in the addition of short SIMD instructions. Unfortunately, access to these instructions is limited to in-line assembly and library calls. Some researchers have proposed using vector compilers as a means of exploiting multimedia instructions. Although vectorization technology is well understood, it is inherently complex and fragile. In addition, it is incapable of locating SIMD-style parallelism within a basic block. In this paper we introduce the concept of Superword Level Parallelism(SLP), a novel way of viewing parallelism in multimedia applications. We believe SLP is fundamentally different from the loop-level parallelism exploited by traditional vector processing, and therefore warrants a different method for extracting it. We have developed a simple and robust compiler technique for detecting SLP that targets basic blocks rather than loop nests. As with techniques designed to extract ILP, ours is able to exploit parallelism both across loop iterations and within basic blocks. The result is an algorithm that provides excellent performance in several application domains. Experiments on scientific and multimedia benchmarks have yielded average performance improvements of 84%, and range as high as 253%.
Performance of Image and Video Processing with General-Purpose Processors and Media ISA Extensions
, 1999
"... This paper aims to provide a quantitative understanding of the performance of image and video processing applications on general-purpose processors, without and with media ISA extensions. We use detailed simulation of 12 benchmarks to study the effectiveness of current architectural features and ide ..."
Abstract
-
Cited by 68 (4 self)
- Add to MetaCart
This paper aims to provide a quantitative understanding of the performance of image and video processing applications on general-purpose processors, without and with media ISA extensions. We use detailed simulation of 12 benchmarks to study the effectiveness of current architectural features and identify future challenges for these workloads. Our results show that conventional techniques in current processors to enhance instruction-level parallelism (ILP) provide a factor of 2.3X to 4.2X performance improvement. The Sun VIS media ISA extensions provide an additional 1.1X to 4.2X performance improvement. The ILP features and media ISA extensions significantly reduce the CPU component of execution time, making 5 of the image processing benchmarks memory-bound. The memory behavior of our benchmarks is characterized by large working sets and streaming data accesses. Increasing the cache size has no impact on 8 of the benchmarks. The remaining benchmarks require relatively large cache sizes (dependent on the display sizes) to exploit data reuse, but derive less than 1.2X performance benefits with the larger caches. Software prefetching provides 1.4X to 2.5X performance improvement in the image processing benchmarks where memory is a significant problem. With the addition of software prefetching, all our benchmarks revert to being compute-bound. 1
Altivec extension to powerpc accelerates media processing
- IEEE Micro
"... There is a clear trend in personal computing toward multimedia-rich applications. These applications will incorporate a wide variety of multimedia technologies, including audio and video compression, 2D image processing, 3D graphics, speech and handwriting recognition, media mining, and narrow-/broa ..."
Abstract
-
Cited by 59 (2 self)
- Add to MetaCart
There is a clear trend in personal computing toward multimedia-rich applications. These applications will incorporate a wide variety of multimedia technologies, including audio and video compression, 2D image processing, 3D graphics, speech and handwriting recognition, media mining, and narrow-/broadband signal processing for communication. In response to this demand, major microprocessor vendors have announced architectural extensions to their general-purpose processors in an effort to improve their multimedia performance. Intel extended IA-32 with MMX1 and SSE (alias KNI), 2 Sun enhanced Sparc with VIS, 3
Out-of-Order Vector Architectures
, 1997
"... Register renaming and out-of-order instruction issue are now commonly used in superscalar processors. These techniques can also be used to significant advantage in vector processors, as this paper shows. Performance is improved and available memory bandwidth is used more effectively. Using a trace d ..."
Abstract
-
Cited by 46 (21 self)
- Add to MetaCart
Register renaming and out-of-order instruction issue are now commonly used in superscalar processors. These techniques can also be used to significant advantage in vector processors, as this paper shows. Performance is improved and available memory bandwidth is used more effectively. Using a trace driven simulation we compare a conventional vector implementation, based on the Convex C3400, with an out-of-order, register renaming, vector implementation. When the number of physical registers is above 12, out-of-order execution coupled with register renaming provides a speedup of 1.24--1.72 for realistic memory latencies. Out-of-order techniques also tolerate main memory latencies of 100 cycles with a performance degradation less than 6%. The mechanisms used for register renaming and out-of-order issue can be used to support precise interrupts -- generally a difficult problem in vector machines. When precise interrupts are implemented, there is typically less than a 10% degradation in performance. A new technique based on register renaming is targeted at dynamically eliminating spill code; this technique is shown to provide an extra speedup ranging between 1.10 and 1.20 while reducing total memory traffic by an average of 15--20%.
Efficient Conditional Operations for Data-parallel Architectures
- In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture
, 2000
"... Many data-parallel applications, including emerging media applications, have regular structures that can easily be expressed as a series of arithmetic kernels operating on data streams. Data-parallel architectures are designed to exploit this regularity by performing the same operation on many data ..."
Abstract
-
Cited by 42 (8 self)
- Add to MetaCart
Many data-parallel applications, including emerging media applications, have regular structures that can easily be expressed as a series of arithmetic kernels operating on data streams. Data-parallel architectures are designed to exploit this regularity by performing the same operation on many data elements concurrently. However, applications containing data-dependent control constructs perform poorly on these architectures. Conditional streams convert these constructs into data-dependent data movement. This allows data-parallel architectures to efficiently execute applications with data-dependent control flow. Essentially, conditional streams extend the range of applications that a data-parallel architecture can execute efficiently. For example, polygon rendering speeds up by a factor of 1.8 with the use of conditional streams. 1. Introduction Many applications contain abundant data-parallelism, particularly emerging media applications such as graphics and video, image, and signal p...
A Quantitative Analysis of Reconfigurable Coprocessors for Multimedia Applications
, 1998
"... Recently, computer architectures that combine a reconfigurable (or retargetable) coprocessor with a general-purpose microprocessor have been proposed. These architectures are designed to exploit large amounts of fine grain parallelism in applications. In this paper, we study the performance of the r ..."
Abstract
-
Cited by 39 (0 self)
- Add to MetaCart
Recently, computer architectures that combine a reconfigurable (or retargetable) coprocessor with a general-purpose microprocessor have been proposed. These architectures are designed to exploit large amounts of fine grain parallelism in applications. In this paper, we study the performance of the reconfigurable coprocessors on multimedia applications. We compare a Field Programmable Gate Array (FPGA) based reconfigurable coprocessor with the array processor called REMARC (Reconfigurable Multimedia Array Coprocessor). REMARC uses a 16-bit simple processor that is much larger than a Configurable Logic Block (CLB) of an FPGA. We have developed a simulator, a programming environment, and multimedia application programs to evaluate the performance of the two coprocessor architectures. The simulation results show that REMARC achieves speedups ranging from a factor of 2.3 to 7.3 on these applications. The FPGA coprocessor achieves similar performance improvements. However, the FPGA coprocess...
Multimedia Extensions for General-Purpose Processors
- Proc. IEEE Workshop on Signal Processing Systems
, 1997
"... This paper gives an overview of the multimedia instructions that have been added to the instruction set architectures of general-purpose microprocessors to accelerate media processing. Examples are MAX, MMX and VIS, the multimedia extensions for PA-RISC, ix86, and SPARC processor architectures. We d ..."
Abstract
-
Cited by 39 (7 self)
- Add to MetaCart
This paper gives an overview of the multimedia instructions that have been added to the instruction set architectures of general-purpose microprocessors to accelerate media processing. Examples are MAX, MMX and VIS, the multimedia extensions for PA-RISC, ix86, and SPARC processor architectures. We describe subword parallelism, a low overhead form of SIMD parallelism, and the classes of instructions needed to support subword parallel computations efficiently. Features described include arithmetic operations with saturation, averaging, multiply alternatives, data rearrangement primitives like Permute and Mix, formatting instructions, conditional execution, and complex instructions. 1. INTRODUCTION The general-purpose information processing workload is changing to include an increasing amount of media processing. Media processing is the processing of digital multimedia information, such as images, video, 2-dimensional and 3dimensional graphics, animations, audio and text. The definition...

