Exploiting Superword Level Parallelism with Multimedia Instruction Sets
- in Proceedings of the SIGPLAN ’00 Conference on Programming Language Design and Implementation
, 2000
Cited by 69 (8 self)
Increasing focus on multimedia applications has prompted the addition of multimedia extensions to most existing general-purpose microprocessors. This added functionality comes primarily in the addition of short SIMD instructions. Unfortunately, access to these instructions is limited to in-line assembly and library calls. Some researchers have proposed using vector compilers as a means of exploiting multimedia instructions. Although vectorization technology is well understood, it is inherently complex and fragile. In addition, it is incapable of locating SIMD-style parallelism within a basic block. In this paper we introduce the concept of Superword Level Parallelism (SLP), a novel way of viewing parallelism in multimedia applications. We believe SLP is fundamentally different from the loop-level parallelism exploited by traditional vector processing, and therefore warrants a different method for extracting it. We have developed a simple and robust compiler technique for detecting SLP that targets basic blocks rather than loop nests. As with techniques designed to extract ILP, ours is able to exploit parallelism both across loop iterations and within basic blocks. The result is an algorithm that provides excellent performance in several application domains. Experiments on scientific and multimedia benchmarks have yielded average performance improvements of 84%, with improvements ranging as high as 253%.
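As a rough illustration of SIMD-style parallelism inside a single basic block, the sketch below contrasts four independent scalar statements with one "packed" operation over the same data. It assumes that packable statements are isomorphic operations over adjacent array elements, which the abstract itself does not spell out. The names `scalar_block`, `simd_add`, and `packed_block` are hypothetical, and the Python list operations only stand in for wide loads and a short SIMD add.

```python
def scalar_block(b, c, i):
    # Straight-line code: four independent, isomorphic additions
    # on adjacent elements (no loop involved).
    a0 = b[i + 0] + c[i + 0]
    a1 = b[i + 1] + c[i + 1]
    a2 = b[i + 2] + c[i + 2]
    a3 = b[i + 3] + c[i + 3]
    return [a0, a1, a2, a3]


def simd_add(x, y):
    # Hypothetical stand-in for a single short SIMD add instruction.
    return [p + q for p, q in zip(x, y)]


def packed_block(b, c, i):
    # The same computation expressed as two wide loads and one wide add.
    vb = b[i:i + 4]
    vc = c[i:i + 4]
    return simd_add(vb, vc)


b = [1, 2, 3, 4, 5, 6]
c = [10, 20, 30, 40, 50, 60]
same = scalar_block(b, c, 1) == packed_block(b, c, 1)
```

Both functions compute the same four sums; the packed form simply makes explicit that the four statements could be issued as one wide operation.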
Simple Vector Microprocessors for Multimedia Applications
- In Proceedings of the 31st Annual International Symposium on Microarchitecture
, 1998
Cited by 32 (1 self)
In anticipation of the emergence of multimedia applications as an important workload, microprocessor companies have augmented their instruction-set architectures with short vector extensions, thus adding basic vector hardware to state-of-the-art superscalar processors. Although a vector architecture may be a good match for multimedia applications, there is growing evidence that the control logic for increasingly complex superscalar processors is difficult to implement. Rather than combining a complex superscalar core with short wide vector hardware, we propose using a much simpler processor design that is similar to traditional vector computers with long vectors and simple control logic for instruction issue. Such a design would use the bulk of its transistors and die area for datapath and registers, and thus lessen the time required to design, implement, and verify control. In this paper, we present data that quantifies this trading of control transistors for datapath and register tr...
Adding a Vector Unit to a Superscalar Processor
, 1999
Cited by 19 (9 self)
The focus of this paper is on adding a vector unit to a superscalar core, as a way to scale current state-of-the-art superscalar processors. The proposed architecture has a vector register file that shares functional units both with the integer datapath and with the floating point datapath. A key point in our proposal is the design of a high performance cache interface that delivers high bandwidth to the vector unit at a low cost and low latency. We propose a double-banked cache with alignment circuitry to serve vector accesses and we study two cache hierarchies: one feeds the vector unit from the L1; the other from the L2. Our results show that large IPC values (higher than 10 in some cases) can be achieved. Moreover, scaling our architecture simply requires adding functional units, without requiring more issue bandwidth. As a consequence, the proposed vector unit achieves high performance for numerical and multimedia codes with minimal impact on the cycle time of the p...
MOM: a Matrix SIMD Instruction Set Architecture for Multimedia Applications
- In Proc. of the IEEE/ACM SC99 Conf. on Supercomputing
, 1999
Cited by 6 (0 self)
MOM is a novel matrix-oriented ISA paradigm for multimedia applications, based on fusing conventional vector ISAs with SIMD ISAs such as MMX. This paper justifies why MOM is a suitable alternative for the multimedia domain due to its efficiency in handling the small matrix structures typically found in most multimedia kernels. MOM delivers a performance boost of between 1.3x and 4x over more conventional multimedia extensions (such as MMX and MDMX), which already achieve performance benefits ranging from 1.3x to 15x over conventional Alpha code. Moreover, MOM exhibits high relative performance at low issue rates and a high tolerance to memory latency. Both advantages present MOM as an attractive alternative for the embedded domain.
An Evaluation of Different DLP Alternatives for the Embedded Media Domain
, 1999
Cited by 4 (3 self)
The importance of media processing has produced a revolution in the design of embedded processors. In order to face the high computational and technological demands of near future media applications, new embedded processors are including features that were commonly restricted to the general purpose and the supercomputing domains. In this paper we have evaluated the performance of various DLP (Data Level Parallelism) oriented embedded architectures and analyzed quantitative data in order to determine the strengths and disadvantages of each approach. Additionally we have analyzed the differences between the explicit parallel versions of code (often based on the standard algorithms) and the highly tuned, non-vectorizable versions usually found in real multimedia programs. We will show that sub-word SIMD architectures (like MMX) are a very cost-effective solution, and that, while long vector architectures provide few improvements at a very high cost, a smart combination between vector and SI...
Vector Microprocessors for Desktop Computing
, 1999
Cited by 3 (0 self)
Desktop workloads are expected to shift over the next few years to become increasingly media-centric. These multimedia applications impose much larger computational demands than current desktop processors can meet. In this paper, we describe four major requirements that we believe any effective desktop processor should address: it should meet the performance requirements of desktop workloads, it should exploit advances in VLSI fabrication technology to provide this performance, it should provide scalable performance for different processor generations with binary compatibility, and it should have mature compiler technology. We explain how vector microprocessors meet three of these requirements, but there is a perception that the performance for non-vectorizable codes would be unacceptably low. The first half of this paper argues that current desktop workloads such as productivity applications and the SPEC95 integer benchmarks are either highly interactive or contain little ...
Exposing Data-Level Parallelism in Sequential Image Processing Algorithms
- In Proc. of the 9th Working Conference on Reverse Engineering (WCRE ’02)
, 2002
Cited by 2 (1 self)
As new computer architectures are developed to exploit large-scale data-level parallelism, techniques are needed to retarget legacy sequential code to these platforms. Sequential programming languages force programmers to include sequential artifacts in their code, particularly with respect to how the source code expresses data references (generally assuming a linear address space). In contrast, data-parallel programs apply many operations in parallel to elements in two-dimensional data sets, and a given data parallel operation can access other spatially local elements along either dimension. Of key importance in exposing data parallelism is determining these two-dimensional data dependencies among elements of a matrix. This paper
A Comparison Between Processor Architectures for Multimedia Applications
- In Proc. 15th Annual Workshop on Circuits, Systems and Signal Processing (ProRISC)
, 2004
Cited by 2 (2 self)
The efficient processing of MultiMedia Applications (MMAs) is currently one of the main bottlenecks in the media processing field. Many architectures have been proposed for processing MMAs, such as VLIW, superscalar (general-purpose processor enhanced with a multimedia extension such as MMX), vector architectures, SIMD architectures, and reconfigurable computing devices. The question then arises: which architecture can best exploit the characteristic features of MMAs? In this paper, we first explain the characteristics of MMAs and then discuss the different architectures that have been proposed for processing them. Subsequently, the architectures are compared based on their ability to exploit the characteristics of MMAs. Superscalar processors with dynamic out-of-order scheduling provide higher performance than both VLIW processors and superscalar processors with in-order scheduling. Because superscalar architectures include complicated control logic for out-of-order execution, and because VLIW processors have to decode every instruction slot in parallel and need a register file with multiple read and write ports, they are more complex than single-issue vector architectures.
XMT-M: A Scalable Decentralized Processor
, 1999
Cited by 2 (2 self)
A defining challenge for research in computer science and engineering has been the ongoing quest for reducing the completion time of a single computation task. Even outside the parallel processing communities, there is little doubt that the key to further progress in this quest is to do parallel processing of some kind. A recently proposed parallel processing framework that spans the entire spectrum from (parallel) algorithms to architecture to implementation is the explicit multi-threading (XMT) framework. This framework provides: (i) simple and natural parallel algorithms for essentially every general-purpose application, including notoriously difficult irregular integer applications, and (ii) a multi-threaded programming model for these algorithms which allows an "independence-of-order" semantics: every thread can proceed at its own speed, independent of other concurrent threads. To the extent possible, the XMT framework uses established ideas in parallel processing. This paper pre...
Width-Sensitive Scheduling for Resource-Constrained VLIW Processors
- In Workshop on Feedback Directed and Dynamic Optimizations, held in conjunction with the 33rd International Symposium on Microarchitecture
, 2000
Cited by 1 (0 self)
As the width of processor instruction words increases, so do the opportunities for optimizations that exploit the widths of operands in instructions. This paper presents a feedback-directed technique, called width-sensitive scheduling, that packs operations on a functional unit, thereby enabling sharing of a functional unit among multiple operations. We target this technique as a static optimization to VLIW processors that are resource-constrained, and use profile data to guide the optimization. We first discuss the significant factors in optimization using operand widths. We then describe and evaluate various approaches to optimizing operand widths on a realistic VLIW model. We find that there is sufficient potential, up to 13% speedup, for performance improvement using our technique. Packing of homogeneous operations on a functional unit is unable to exploit most of this available potential. An approach to pack heterogeneous operations on the same functional unit with minimal hardw...
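The idea of sharing one wide functional unit among several narrow operations can be illustrated with a software analogy. This is not the paper's scheduling technique or hardware model, only a minimal sketch: two independent 16-bit additions carried out with a single 32-bit addition, with masking so that a carry out of the low lane does not leak into the high lane. The helper names `pack16x2` and `packed_add16x2` are hypothetical.

```python
HIGH_BITS = 0x80008000  # top bit of each 16-bit lane
LOW_BITS = 0x7FFF7FFF   # remaining 15 bits of each lane


def pack16x2(hi, lo):
    # Place two 16-bit values side by side in one 32-bit word.
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def packed_add16x2(x, y):
    # Add the low 15 bits of each lane; no carry can cross a lane boundary.
    partial = (x & LOW_BITS) + (y & LOW_BITS)
    # Restore each lane's top bit with XOR, which discards its carry-out.
    return (partial ^ ((x ^ y) & HIGH_BITS)) & 0xFFFFFFFF


# High lane: 0xFFFF + 1 wraps to 0. Low lane: 1 + 2 = 3.
r1 = packed_add16x2(pack16x2(0xFFFF, 1), pack16x2(1, 2))
# Low lane: 0xFFFF + 1 wraps to 0 without disturbing the high lane (1 + 1 = 2).
r2 = packed_add16x2(pack16x2(1, 0xFFFF), pack16x2(1, 1))
```

Each lane wraps around independently, so one wide add yields the results of two narrow ones, which is the kind of operand-width slack the abstract refers to.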

