Adding a Vector Unit to a Superscalar Processor (1999)
Citations: 22 (11 self)
Citations
1844 | The SimpleScalar Tool Set, Version 2.0
- Burger, Austin
Citation Context ... for this study: a purely scalar one and a version with vector instructions. For the scalar version, each program was compiled using GCC v2.6.3 and then simulated using the SimpleScalar Tool Set v3.0 [16]. For the scalar programs, we always simulate 1500 million graduated instructions. For the vector enhanced version, the programs were compiled on a Convex C3400 with vectorization turned on. Some impo...
966 | Mediabench: a tool for evaluating and synthesizing multimedia and communications systems.
- Lee
- 1997
Citation Context ... performance of our cache hierarchy, discussing the relative merits of the CA and CB cache organizations. 4.1 Benchmarks and Simulation Tools We have studied the Perfect Club, SPECfp92 and Mediabench [15] suites, and results shown here correspond to a set of selected benchmarks that conform a representative sample of the different behaviors found in these numerical and multimedia applications. Our wor...
467 | Complexity-Effective Superscalar Processors
- Palacharla
- 1997
Citation Context ... achieve large amounts of ILP is an area of very active research. There is a growing consensus that it can not be done by simply trying to fetch, decode and issue more and more instructions per cycle [1]. First of all, an aggressive fetch and decode engine must be designed, which is far from being trivial due to branches as well as instruction cache bandwidth issues [2, 3]. Second, an aggressive issu...
311 | Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching.
- Rotenberg, Bennett, et al.
- 1996
Citation Context ...nd more instructions per cycle [1]. First of all, an aggressive fetch and decode engine must be designed, which is far from being trivial due to branches as well as instruction cache bandwidth issues [2, 3]. Second, an aggressive issue engine with a large instruction window is required to be able to feed a large number of functional units. The instruction window lookup time increases quadratically with ...
259 | MMX Technology Extension to the Intel Architecture
- Peleg, Weiser
- 1996
Citation Context ...y in microprocessor design targeted at the exploitation of sub-word parallelism. Most major computer vendors have recently included multimedia specific instructions in their architectures such as MMX [4], VIS [5] or Altivec [6]. Except for the Altivec case, all other extensions only offer sub-word parallelism. That is, a 64-bit register can be broken into independent entities of 8, 16 or 32 bits that...
133 | Optimization of instruction fetch mechanisms for high issue rates.
- Conte, Menezes, et al.
- 1995
Citation Context ...nd more instructions per cycle [1]. First of all, an aggressive fetch and decode engine must be designed, which is far from being trivial due to branches as well as instruction cache bandwidth issues [2, 3]. Second, an aggressive issue engine with a large instruction window is required to be able to feed a large number of functional units. The instruction window lookup time increases quadratically with ...
70 | Direct rambus technology: The new main memory standard
- Crisp
- 1997
Citation Context ... bus and simply load a maximum of one word per cycle. Finally, the L2 data cache, which is assumed on-chip, is a conventional cache that will be connected by a bidirectional bus to an external RAMBUS [14] controller. Since we would like our model to be extensible to multiprocessing, our simulators faithfully model all the coherency traffic required to maintain the inclusion property 2. There are sever...
59 | Multithreaded vector architectures.
- Espasa, Valero
- 1997
Citation Context ...esigns such as the Alpha lineage, the vector unit must be adapted to out-of-order execution and register renaming. In this paper we leverage from our previous studies on out-of-order vector processors [9] to fully integrate the vector unit in an out-of-order superscalar processor. Furthermore, this paper is mainly devoted to designing a feasible cache hierarchy that fits both the scalar engine and the...
46 | Design issues and tradeoffs for write buffers
- Skadron, Clark
- 1997
Citation Context ...herefore, we will assume that the required time to perform a store operation is two cycles. The write buffer The write buffer for our vector cache has been designed following the results presented in [13]. We have implemented a coalescing write buffer with a width equal to the cache line size, since it has to be able to insert both scalar and vector accesses. The retirement order is FIFO, as usual, ex...
36 | Alpha 21364: A Scalable Single-chip SMP. Presented at the Microprocessor Forum '98 (http://www.digital.com/alphaoem/microprocessorforum.htm)
- Bannon
- 1998
Citation Context ...nd a small and fast on-chip L1 cache, backed up by a large (1-4MB) off-chip L2 cache. However, advances in logic integration will allow next generation superscalar processors (such as the Alpha 21364 [10]) to include both caches on-chip. As the number of instructions executed in parallel increases, data caches with higher bandwidth will be required. To obtain high bandwidth from a cache, several requi...
35 | Simple Vector Microprocessors for Multimedia Applications
- Lee, Stoodley
- 1998
Citation Context ... vector units to also include sub-word parallel instructions, such as the ones provided by Altivec. Second, we must also differentiate this paper from the work recently presented by C. Lee in [7] and [8]. Lee advocates using simple vector processors in future desktop systems. Although this may sound similar to our proposal, indeed the differences are profound. Lee argues that by using vector units, o...
27 | On High-Bandwidth Data Cache Design for Multi-Issue Processors
- Rivers, Tyson, et al.
- 1997
Citation Context ...the 21164). However, research shows that for large number of ports, say 4 to 16, this is not feasible and alternative designs using multiple banks or hybrids of multi-bank and multi-port must be used [11, 12]. Thus, no obvious solution seems to be available for scaling current superscalar processors up to issuing four or more memory accesses per cycle. Even though accesses tend to exhibit high spatial loc...
24 | Cache performance in vector supercomputers
- Kontothanassis, Sugumar, et al.
- 1994
Citation Context ...l and temporal locality, that this locality is well exploited by the large vector registers and that using a cache only gets in the way of memory accesses that end up accessing main memory regardless [19]. To answer this question, we have measured the percentage of 64-bit words that are filtered by the cache hierarchy, that is, the fraction of words requested by the processor that are serviced by eith...
21 | Initial Results on the Performance and Cost of Vector Microprocessors
- Lee, DeVries
Citation Context ...vent our vector units to also include sub-word parallel instructions, such as the ones provided by Altivec. Second, we must also differentiate this paper from the work recently presented by C. Lee in [7] and [8]. Lee advocates using simple vector processors in future desktop systems. Although this may sound similar to our proposal, indeed the differences are profound. Lee argues that by using vector ...
16 | Data caches for superscalar processors
- Juan, Navarro, et al.
- 1997
Citation Context ...the 21164). However, research shows that for large number of ports, say 4 to 16, this is not feasible and alternative designs using multiple banks or hybrids of multi-bank and multi-port must be used [11, 12]. Thus, no obvious solution seems to be available for scaling current superscalar processors up to issuing four or more memory accesses per cycle. Even though accesses tend to exhibit high spatial loc...
3 | A Case for Merging the ILP and DLP Paradigms
- Quintana, Espasa, et al.
- 1998
Citation Context ...gnificant problem. As it is widely known, to execute the same code, a vector machine uses much fewer instructions than a scalar machine (because each vector instruction specifies multiple operations) [18]. Therefore, using raw IPC as a performance measure would be meaningless. The solution is as follows. First, each program is run to completion in pure scalar mode on a R10000 processor and, using the ...
1 | Digital Compression for Multimedia
- Gibson, Berger, et al.
- 1998
Citation Context ...er to get them to vectorize. These changes range from simple loop interchange techniques (as in epic) to a major rewrite of the idct algorithm (in mpeg and epic) following the standard specifications [17]. The resulting vector binaries are then processed using our own tracing and simulation environment, described in [9]. Comparing a scalar and a vector program in terms of IPC poses a significant probl...