Results 1 - 10
of
139
The Landscape of Parallel Computing Research: A View from Berkeley
- TECHNICAL REPORT, UC BERKELEY
, 2006
"... All rights reserved. ..."
Active Disks: Programming Model, Algorithms and Evaluation
, 1998
"... Several application and technology trends indicate that it might be both profitable and feasible to move computation closer to the data that it processes. In this paper, we evaluate Active Disk architectures which integrate significant processing power and memory into a disk drive and allow applicat ..."
Abstract
-
Cited by 159 (9 self)
- Add to MetaCart
Several application and technology trends indicate that it might be both profitable and feasible to move computation closer to the data that it processes. In this paper, we evaluate Active Disk architectures which integrate significant processing power and memory into a disk drive and allow application-specific code to be downloaded and executed on the data that is being read from (written to) disk. The key idea is to o#oad bulk of the processing to the disk-resident processors and to use the host processor primarily for coordination, scheduling and combination of results from individual disks. To program Active Disks, we propose a stream-based programming model which allows disklets to be executed efficiently and safely. Simulation results for a suite of six algorithms from three application domains (commercial data warehouses, image processing and satellite data processing) indicate that for these algorithms, Active Disks outperform conventional-disk architectures.
Active Pages: A Computation Model for Intelligent Memory
- IN INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 1998
"... Microprocessors and memory systems suffer from a growing gap in performance. We introduce Active Pages, a computation model which addresses this gap by shifting data-intensive computations to the memory system. An Active Page consists of a page of data and a set of associated functions which can ope ..."
Abstract
-
Cited by 91 (7 self)
- Add to MetaCart
Microprocessors and memory systems suffer from a growing gap in performance. We introduce Active Pages, a computation model which addresses this gap by shifting data-intensive computations to the memory system. An Active Page consists of a page of data and a set of associated functions which can operate upon that data. We describe an implementation of Active Pages on RADram (Reconfigurable Architecture DRAM), a memory system based upon the integration of DRAM and reconfigurable logic. Results from the SimpleScalar simulator [BA97] demonstrate up to 1000X speedups on several applications using the RADram system versus conventional memory systems. We also explore the sensitivity of our results to implementations in other memory technologies.
Cell multiprocessor communication network: Built for speed
- IEEE Micro
, 2006
"... Over the past decade, high-performance computing has ridden the wave of commodity computing, building clusterbased parallel computers that leverage the tremendous growth in processor performance fueled by the commercial world. As this pace slows, processor designers face complex problems in their ef ..."
Abstract
-
Cited by 53 (0 self)
- Add to MetaCart
Over the past decade, high-performance computing has ridden the wave of commodity computing, building clusterbased parallel computers that leverage the tremendous growth in processor performance fueled by the commercial world. As this pace slows, processor designers face complex problems in their efforts to increase gate density, reduce power consumption, and design efficient memory hierarchies. Processor developers are looking for solutions that can keep up with the scientific and industrial communities’ insatiable demand for computing capability and that also have a sustainable market outside science and industry. A major trend in computer architecture is integrating system components onto the processor chip. This trend is driving the development of processors that can perform functions typically associated with entire systems. Building modular processors with multiple cores is far more cost-effective than building Published by the IEEE Computer Society monolithic processors, which are prohibitively expensive to develop, have high power consumption, and give limited return on investment. Multicore system-on-chip (SoC) processors integrate several identical, independent processing units on the same die, together with network interfaces, acceleration units, and other specialized units. Researchers have explored several design avenues in both academia and industry. Examples include MIT’s Raw multiprocessor, the University of Texas’s Trips multiprocessor,
Compiler-Controlled Memory
- IN PROCEEDINGS OF THE 8TH INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS
, 1998
"... Optimizations aimed at reducing the impact of memory operations on execution speed have long concentrated on improving cache performance. These efforts achieve a. reasonable level of success. The primary limit on the compiler’s ability to improve memory behavior is its im-perfect knowledge about the ..."
Abstract
-
Cited by 45 (0 self)
- Add to MetaCart
Optimizations aimed at reducing the impact of memory operations on execution speed have long concentrated on improving cache performance. These efforts achieve a. reasonable level of success. The primary limit on the compiler’s ability to improve memory behavior is its im-perfect knowledge about the run-time behavior of the program. The compiler cannot completely predict run-time access patterns. There is an exception to this rule. During the reg-ister allocation phase, the compiler often must insert substantial amount,s of spill code; that is, instructions that move values from registers to memory and back again. Because the compiler itself inserts these memory instructions, it has more knowledge about them than other memory operations in the program. Spill-code operations are disjoint from the memory manipulations required by the semantics of the program being compiled, and, indeed, the two can interfere in the cache. This paper proposes a hardware solution to the problem of increased spill costs-a small compiler-con-trolled memory (CCM) to hold spilled values. This small random-access memory can (and should) be placed in a distinct address space from the main memory hierar-chy. The compiler can target spill instructions to use the CCM, moving most compiler-inserted memory traf-fic out of the pathway to main memory and eliminating any impact that those spill instructions would have on the state of the main memory hierarchy. Such mem-ories already exist on some DSP microprocessors. Our techniques can be applied directly on those chips. This paper presents two compiler-based methods to exploit such a memory, along with experimental results showing that speedups from using CCM may be sizable. It shows that using the register allocation’s coloring paradigm to assign spilled values to memory can greatly reduce the amount of memory required by a program.
An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors
, 1997
"... The memory consistency model of a shared-memory multiprocessor determines the extent to which memory operations may be overlapped or reordered for better performance. Studies on previous-generation shared-memory multiprocessors have shown that relaxed memory consistency models like release consisten ..."
Abstract
-
Cited by 42 (12 self)
- Add to MetaCart
The memory consistency model of a shared-memory multiprocessor determines the extent to which memory operations may be overlapped or reordered for better performance. Studies on previous-generation shared-memory multiprocessors have shown that relaxed memory consistency models like release consistency (RC) can significantly outperform the conceptually simpler model of sequential consistency (SC). Current and next-generation multiprocessors use commodity microprocessors that aggressively exploit instruction-level parallelism (ILP) using methods such as multiple issue, dynamic scheduling, and non-blocking reads. For such processors, researchers have conjectured that two techniques, hardware-controlled non-binding prefetching and speculative reads, have the potential to equalize the hardware performance of memory consistency models. These techniques have recently begun to appear in commercial microprocessors, and re-open the question of whether the performance benefits of release consiste...
A case for intelligent RAM: IRAM
- IEEE Micro
, 1997
"... Abstract: Two trends call into question the current practice of microprocessors and DRAMs being fabricated as different chips on different fab lines: 1) the gap between processor and DRAM speed is growing at 50% per year; and 2) the size and organization of memory on a single DRAM chip is becoming a ..."
Abstract
-
Cited by 41 (1 self)
- Add to MetaCart
Abstract: Two trends call into question the current practice of microprocessors and DRAMs being fabricated as different chips on different fab lines: 1) the gap between processor and DRAM speed is growing at 50% per year; and 2) the size and organization of memory on a single DRAM chip is becoming awkward to use in a system, yet size is growing at 60 % per year. Intelligent RAM, or IRAM, merges processing and memory into a single chip to lower memory latency, increase memory bandwidth, and improve energy efficiency as well as to allow more flexible selection of memory size and organization. In addition, IRAM promises savings in power and board area. This paper reviews the state of microprocessors and DRAMs today, explores some of the opportunities and challenges for IRAMs, and finally estimates performance and energy efficiency of three IRAM designs. 1. Introduction and Why
Design and Implementation of Power-Aware Virtual Memory
, 2003
"... Despite constant improvements in fabrication technology, hardware components are consuming more power than ever. With the everincreasing demand for higher performance in highly-integrated systems, and as battery technology falls further behind, managing energy is becoming critically important to var ..."
Abstract
-
Cited by 36 (0 self)
- Add to MetaCart
Despite constant improvements in fabrication technology, hardware components are consuming more power than ever. With the everincreasing demand for higher performance in highly-integrated systems, and as battery technology falls further behind, managing energy is becoming critically important to various embedded and mobile systems. In this paper, we propose and implement power-aware virtual memory to reduce the energy consumed by the memory in response to workloads becoming increasingly data-centric. We can use the power management features in current memory technology to put individual memory devices into low power modes dynamically under software control to reduce the power dissipation. However, it is imperative that any techniques employed weigh memory energy savings against any potential energy increases in other system components due to performance degradation of the memory. Using a novel power-aware virtual memory implementation, we show a significant reduction in memory power dissipation, from 4.1 W to 0.5--2.7 W, when using Rambus memory and running various real-world applications in a working Linux system. Applying more advanced techniques, we can reduce this further to 0.2--1.7 W, depending on the actual workload, with negligible effects on performance. We also show this work is applicable to other memory architectures, and is orthogonal to previouslyproposed hardware-controlled power-management techniques, so it can be applied simultaneously to further enhance energy conservation in a variety of platforms.
3d-stacked memory architectures for multi-core processors
- In International Symposium on Computer Architecture
"... Three-dimensional integration enables stacking memory directly on top of a microprocessor, thereby significantly reducing wire delay between the two. Previous studies have examined the performance benefits of such an approach, but all of these works only consider commodity 2D DRAM organizations. In ..."
Abstract
-
Cited by 33 (3 self)
- Add to MetaCart
Three-dimensional integration enables stacking memory directly on top of a microprocessor, thereby significantly reducing wire delay between the two. Previous studies have examined the performance benefits of such an approach, but all of these works only consider commodity 2D DRAM organizations. In this work, we explore more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count. Our simulation results show that with a few simple changes to the 3D-DRAM organization, we can achieve a 1.75 × speedup over previously proposed 3D-DRAM approaches on our memoryintensive multi-programmed workloads on a quad-core processor. The significant increase in memory system performance makes the L2 miss handling architecture (MHA) a new bottleneck, which we address by combining a novel data structure called the Vector Bloom Filter with dynamic MSHR capacity tuning. Our scalable L2 MHA yields an additional 17.8 % performance improvement over our 3D-stacked memory architecture. 1.
Dual-core execution: building a highly scalable single-thread instruction window
, 2005
"... Current integration trends embrace the prosperity of single-chip multi-core processors. Although multi-core processors deliver significantly improved system throughput, single-thread performance is not addressed. In this paper, we propose a new execution paradigm that utilizes multi-cores on a singl ..."
Abstract
-
Cited by 33 (2 self)
- Add to MetaCart
Current integration trends embrace the prosperity of single-chip multi-core processors. Although multi-core processors deliver significantly improved system throughput, single-thread performance is not addressed. In this paper, we propose a new execution paradigm that utilizes multi-cores on a single chip collaboratively to achieve high performance for single-thread memoryintensive workloads while maintaining the flexibility to support multithreaded applications. The proposed execution paradigm, dual-core execution, consists of two superscalar cores (a front and back processor) coupled with a queue. The front processor fetches and preprocesses instruction streams and retires processed instructions into the queue for the back processor to consume. The front processor executes instructions as usual except for cache-missing loads, which produce an invalid value instead of blocking the pipeline. As a result, the front processor runs far ahead to warm up the data caches and fix branch mispredictions for the back processor. In-flight instructions are distributed in the front processor, the queue, and the back processor, forming a very large instruction window for single-thread out-oforder execution. The proposed architecture incurs only minor hardware changes and does not require any large centralized structures such as large register files, issue queues, load/store queues, or reorder buffers. Experimental results show remarkable latency hiding capabilities of the proposed architecture, even outperforming more complex single-thread processors with much larger instruction windows than the front or back processor. 1.

