Results 1 - 10
of
31
StreamIt: A Language for Streaming Applications
- In International Conference on Compiler Construction
, 2001
"... We characterize high-performance streaming applications as a new and distinct domain of programs that is becoming increasingly important. ..."
Abstract
-
Cited by 236 (24 self)
- Add to MetaCart
We characterize high-performance streaming applications as a new and distinct domain of programs that is becoming increasingly important.
PipeRench: A Coprocessor for Streaming Multimedia Acceleration
, 1999
"... Future computing workloads will emphasize an architecture 's ability to perform relatively simple calculations on massive quantities of mixed-width data. This paper describes a novel reconfigurable fabric architecture, PipeRench, optimized to accelerate these types of computations. PipeRench enables ..."
Abstract
-
Cited by 109 (11 self)
- Add to MetaCart
Future computing workloads will emphasize an architecture 's ability to perform relatively simple calculations on massive quantities of mixed-width data. This paper describes a novel reconfigurable fabric architecture, PipeRench, optimized to accelerate these types of computations. PipeRench enables fast, robust compilers, supports forward compatibility, and virtualizes configurations, thus removing the fixed size constraint present in other fabrics. For the first time we explore how the bit-width of processing elements affects performance and show how the PipeRench architecture has been optimized to balance the needs of the compiler against the realities of silicon. Finally, we demonstrate extreme performance speedup on certain computingkernels (up to 190x versus a modernRISC processor), and analyze how this acceleration translates to application speedup. 1.
Impulse: Building a Smarter Memory Controller
- IN PROCEEDINGS OF THE FIFTH ANNUAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE
, 1999
"... Impulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application-specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accesse ..."
Abstract
-
Cited by 74 (17 self)
- Add to MetaCart
Impulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application-specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accessed and cached, improving their cache and bus utilization. Second, Impulse supports prefetching at the memory controller, which can hide much of the latency of DRAM accesses. In this paper
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs
- In 14th International Conference on Architectural Support for Programming Languages and Operating Systems
, 2006
"... As multicore architectures enter the mainstream, there is a pressing demand for high-level programming models that can effectively map to them. Stream programming offers an attractive way to expose coarse-grained parallelism, as streaming applications (image, video, DSP, etc.) are naturally represen ..."
Abstract
-
Cited by 63 (4 self)
- Add to MetaCart
As multicore architectures enter the mainstream, there is a pressing demand for high-level programming models that can effectively map to them. Stream programming offers an attractive way to expose coarse-grained parallelism, as streaming applications (image, video, DSP, etc.) are naturally represented by independent filters that communicate over explicit data channels. In this paper, we demonstrate an end-to-end stream compiler that attains robust multicore performance in the face of varying application characteristics. As benchmarks exhibit different amounts of task, data, and pipeline parallelism, we exploit all types of parallelism in a unified manner in order to achieve this generality. Our compiler, which maps from the StreamIt language to the 16-core Raw architecture, attains a 11.2x mean speedup over a single-core baseline, and a 1.84x speedup over our previous work. Categories and Subject Descriptors D.3.2 [Programming Languages]:
Task Selection for a Multiscalar Processor
- In Proceedings of the 31st annual international symposium on Microarchitecture
, 1998
"... The Multiscalar architecture advocates a distributed processor organization and task-level speculation to exploit high degrees of instruction level parallelism (ILP) in sequential programs without impeding improvements in clock speeds. The main goal of this paper is to understand the key implication ..."
Abstract
-
Cited by 58 (6 self)
- Add to MetaCart
The Multiscalar architecture advocates a distributed processor organization and task-level speculation to exploit high degrees of instruction level parallelism (ILP) in sequential programs without impeding improvements in clock speeds. The main goal of this paper is to understand the key implications of the architectural features of distributed processor organization and task-level speculation for compiler task selection from the point of view of performance. We identify the fundamental performance issues to be: control flow speculation, data communication, data dependence speculation, load imbalance, and task overhead. We show that these issues are intimately related to a few key characteristics of tasks: task size, inter-task control flow, and inter-task data dependence. We describe compiler heuristics to select tasks with favorable characteristics. We report experimental results to show that the heuristics are successful in boosting overall performance by establishing larger ILP win...
Min-Cut Program Decomposition for Thread-Level Speculation
, 2004
"... With billion-transistor chips on the horizon, single-chip multiprocessors (CMPs) are likely to become commodity components. Speculative CMPs use hardware to enforce dependence, allowing the compiler to improve performance by speculating on ambiguous dependences without absolute guarantees of indepen ..."
Abstract
-
Cited by 50 (2 self)
- Add to MetaCart
With billion-transistor chips on the horizon, single-chip multiprocessors (CMPs) are likely to become commodity components. Speculative CMPs use hardware to enforce dependence, allowing the compiler to improve performance by speculating on ambiguous dependences without absolute guarantees of independence. The compiler is responsible for decomposing a sequential program into speculatively parallel threads, while considering multiple performance overheads related to data dependence, load imbalance, and thread prediction. Although the decomposition problem lends itself to a min-cut-based approach, the overheads depend on the thread size, requiring the edge weights to be changed as the algorithm progresses. The changing weights make our approach di#erent from graph-theoretic solutions to the general problem of task scheduling. One recent work uses a set of heuristics, each targeting a specific overhead in isolation, and gives precedence to thread prediction, without comparing the performance of the threads resulting from each heuristic. By contrast, our method uses a sequence of balanced min-cuts that give equal consideration to all the overheads, and adjusts the edge weights after every cut. This method achieves an (geometric) average speedup of 74% for floating-point programs and 23% for integer programs on a four-processor chip, improving on the 52% and 13% achieved by the previous heuristics.
Compiling for the Multiscalar Architecture
, 1998
"... High-performance, general-purpose microprocessors serve as compute engines for computers ranging from personal computers to supercomputers. Sequential programs constitute a major portion of real-world software that run on the computers. State-of-the-art microprocessors exploit instruction level para ..."
Abstract
-
Cited by 45 (2 self)
- Add to MetaCart
High-performance, general-purpose microprocessors serve as compute engines for computers ranging from personal computers to supercomputers. Sequential programs constitute a major portion of real-world software that run on the computers. State-of-the-art microprocessors exploit instruction level parallelism (ILP) to achieve high performance on such applications by searching for independent instructions in a dynamic window of instructions and executing them on a wide-issue pipeline. Increasing the window size and the issue width to extract more ILP may hinder achieving high clock speeds, limiting overall performance. The Multiscalar architecture employs multiple small windows and many narrow-issue processing units to exploit ILP at high clock speeds. Sequential programs are partitioned into code fragments called tasks, which are speculatively executed in parallel. Inter-task register dependences are honored via communication and synchronization and inter-task control flow and memory depe...
FlexRAM: Toward an Advanced Intelligent Memory System
- In International Conference on Computer Design
, 1999
"... Major advances in Merged Logic DRAM (MLD) technology coupled with the popularization of memory-intensive applications provide fertile ground for architectures based on Intelligent Memory (IRAM) or rrocessors-in-Memory (riM). The contribution of this paper is to explore one way to use the current sta ..."
Abstract
-
Cited by 44 (10 self)
- Add to MetaCart
Major advances in Merged Logic DRAM (MLD) technology coupled with the popularization of memory-intensive applications provide fertile ground for architectures based on Intelligent Memory (IRAM) or rrocessors-in-Memory (riM). The contribution of this paper is to explore one way to use the current state-of-the-art MLD technology for general-purpose computers. To satisfy requirements of general purpose and low programming cost, we place the rim chips in the memory system and let them default to plain DRAM if the application is not enabled for intelligent memory. Since wide usability is crucial, we identify and analyze a range of real applications for riM. Based on the requirements of these applications and current technological constraints, we design a rim chip and a riM-based memory system. We call the chip FlexRAM. We describe FlexRAM's design and floorplan, and the resulting memory system. Evaluation of the system through simulations shows that 4 FlexRAM chips often allow a workstation to run 25-40 times faster.
The Impulse Memory Controller
- IEEE TRANSACTIONS ON COMPUTERS
, 2001
"... Impulse is a memory system architecture that adds an optional level of address indirection at the memory controller. Applications can use this level of indirection to remap their data structures in memory. As a result, they can control how their data is accessed and cached, which can improve cach ..."
Abstract
-
Cited by 33 (6 self)
- Add to MetaCart
Impulse is a memory system architecture that adds an optional level of address indirection at the memory controller. Applications can use this level of indirection to remap their data structures in memory. As a result, they can control how their data is accessed and cached, which can improve cache and bus utilization. The Impulse design does not require any modification to processor, cache, or bus designs, since all the functionality resides at the memory controller. As a result, Impulse can be adopted in conventional systems without major system changes. We describe the design of the Impulse architecture and show how an Impulse memory system can be used in a variety of ways to improve the performance of memory-bound applications. Impulse can be used to dynamically create superpages cheaply, to dynamically recolor physical pages, to perform strided fetches, and to perform gathers and scatters through indirection vectors. Our performance results demonstrate the effectiveness of these optimizations in a variety of scenarios. Using Impulse can speed up a range of applications from 20% to over a factor of 5. Alternatively, Impulse can be used by the OS for dynamic superpage creation; the best policy for creating superpages using Impulse outperforms previously known superpage creation policies.
Efficient On-Chip Global Interconnects
- in Symposium on VLSI Circuits
, 2003
"... We present circuits for a high-efficiency low-swing interconnect scheme suitable for the Smart Memories reconfigurable architecture. By using a separate supply, global clocking, and differential signaling, we reduce design complexity; and by using overdrive circuits, equalization techniques, and sen ..."
Abstract
-
Cited by 19 (1 self)
- Add to MetaCart
We present circuits for a high-efficiency low-swing interconnect scheme suitable for the Smart Memories reconfigurable architecture. By using a separate supply, global clocking, and differential signaling, we reduce design complexity; and by using overdrive circuits, equalization techniques, and senseamplifiers we retain high performance. A testchip built in a 1.8V 0.18-#m technology consumed # 1pJ/bit for a 10mm bus at 1GHz, a power savings over full-swing signaling of up to 10x, and demonstrated amplifier input offset voltages of under 100mV.

