Optimizing replication, communication, and capacity allocation in CMPs
- INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 2005
Cited by 69 (0 self)
Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that compared to symmetric multiprocessors (SMPs), CMPs change the latency-capacity tradeoff in two significant ways. We propose three novel ideas to exploit the changes: (1) Though placing copies close to requestors allows fast access for read-only sharing, the copies also reduce the already-limited on-chip capacity in CMPs. We propose controlled replication to reduce capacity pressure by not making extra copies in some cases, and obtaining the data from an existing on-chip copy. This option is not suitable for SMPs because obtaining data from another processor is expensive and capacity is not limited to on-chip storage. (2) Unlike SMPs, CMPs allow fast on-chip communication between processors for read-write sharing. Instead of incurring slow access to read-write shared data through coherence misses as do SMPs, we propose in-situ communication to provide fast access without making copies or incurring coherence misses. (3) Accessing neighbors' caches is not as expensive in CMPs as it is in SMPs. We propose capacity stealing in which private data that exceeds a core's capacity is placed in a neighboring cache with less capacity demand. To incorporate our ideas, we use a hybrid of private, per-processor tag arrays and a shared data array. Because the shared data array is slow, we employ non-uniform access and distance associativity from previous proposals to hold frequently-accessed data in regions close to the requestor. We extend the previously-proposed Non-uniform access with Replacement And Placement usIng Distance associativity (NuRAPID) to CMPs, and call our cache CMP-NuRAPID.
Our results show that for a 4-core CMP with 8 MB cache, CMP-NuRAPID improves performance by 13% over a shared cache and 8% over private caches for three commercial multithreaded workloads.
Memory-System Design Considerations For Dynamically-Scheduled Microprocessors
, 1997
Cited by 66 (4 self)
Keith Istvan Farkas, Doctor of Philosophy, Graduate Department of Electrical and Computer Engineering, University of Toronto, 1997.
Dynamically-scheduled processors challenge hardware and software architects to develop designs that balance hardware complexity and compiler technology against performance targets. This dissertation presents a first thorough look at some of the issues introduced by this hardware complexity. The focus of the investigation of these issues is the register file and the other components of the data memory system. These components are: the lockup-free data cache, the stream buffers, and the interface to the lower levels of the memory system. The investigation is based on software models. These models incorporate the features of a dynamically-scheduled processor that affect the design of the data-memory components. The models represent a balance between accuracy and generality, and ar...
Vector Microprocessors
- In Hot Chips VII
, 1998
Cited by 62 (4 self)
Krste Asanovic, Doctor of Philosophy in Computer Science, University of California, Berkeley. Professor John Wawrzynek, Chair.
Most previous research into vector architectures has concentrated on supercomputing applications and small enhancements to existing vector supercomputer implementations. This thesis expands the body of vector research by examining designs appropriate for single-chip full-custom vector microprocessor implementations targeting a much broader range of applications. I present the design, implementation, and evaluation of T0 (Torrent-0): the first single-chip vector microprocessor. T0 is a compact but highly parallel processor that can sustain over 24 operations per cycle while issuing only a single 32-bit instruction per cycle. T0 demonstrates that vector architectures are well suited to full-custom VLSI implementation and that they perform well on many multimedia and human-machine interface tasks. The remainder of the thesis contains ...
SUIF Explorer: an interactive and interprocedural parallelizer
, 1999
Cited by 55 (5 self)
The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system are successful in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiring attention. Second, the system uses dynamic execution analyzers to identify those important loops that are likely to be parallelizable. Third, the SUIF Explorer is the first to apply program slicing to aid programmers in interactive parallelization. The system guides the programmer in the parallelization process using a set of sophisticated visualization techniques. This paper demonstrates the effectiveness of the SUIF Explorer with three case studies. The programmer was able to speed up all three programs by examining only a small fraction of the program and privatizing a few variables. 1. Introduction Exploiting coarse-grain parallelism i...
Reducing Set-Associative Cache Energy via Way-Prediction and Selective Direct-Mapping
, 2001
Cited by 48 (1 self)
Set-associative caches achieve low miss rates for typical applications but result in significant energy dissipation.
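The abstract above is cut short, so the following is only a generic illustration of the way-prediction idea named in the title, not the paper's scheme. It is a minimal Python sketch under stated assumptions: the predictor simply remembers the most recently used way per set, replacement is round-robin, and the number of ways probed stands in as a crude proxy for energy. The class name, method names, and both policies are hypothetical.

```python
class WayPredictedCache:
    """Toy set-associative cache that probes one predicted way first."""

    def __init__(self, num_sets=4, num_ways=4):
        self.num_sets = num_sets
        self.num_ways = num_ways
        # tags[s][w] is the tag held in way w of set s (None means empty).
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        # Assumed predictor: the most recently used way of each set.
        self.predicted_way = [0] * num_sets
        # Assumed replacement policy: round-robin victim pointer per set.
        self.next_victim = [0] * num_sets
        # Crude energy proxy: total number of ways probed so far.
        self.ways_probed = 0

    def access(self, block_addr):
        s = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        guess = self.predicted_way[s]

        # First probe touches only the predicted way.
        self.ways_probed += 1
        if self.tags[s][guess] == tag:
            return "hit-predicted"

        # Wrong guess: probe the remaining ways as a conventional cache would.
        self.ways_probed += self.num_ways - 1
        for w in range(self.num_ways):
            if w != guess and self.tags[s][w] == tag:
                self.predicted_way[s] = w
                return "hit-mispredicted"

        # Miss: fill the round-robin victim and predict it next time.
        victim = self.next_victim[s]
        self.tags[s][victim] = tag
        self.next_victim[s] = (victim + 1) % self.num_ways
        self.predicted_way[s] = victim
        return "miss"
```

In this toy model a correctly predicted hit probes one way instead of all four, which is where the energy saving of way-prediction would come from; a wrong guess costs the full probe plus the wasted first one.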
An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors
, 1997
Cited by 43 (13 self)
The memory consistency model of a shared-memory multiprocessor determines the extent to which memory operations may be overlapped or reordered for better performance. Studies on previous-generation shared-memory multiprocessors have shown that relaxed memory consistency models like release consistency (RC) can significantly outperform the conceptually simpler model of sequential consistency (SC). Current and next-generation multiprocessors use commodity microprocessors that aggressively exploit instruction-level parallelism (ILP) using methods such as multiple issue, dynamic scheduling, and non-blocking reads. For such processors, researchers have conjectured that two techniques, hardware-controlled non-binding prefetching and speculative reads, have the potential to equalize the hardware performance of memory consistency models. These techniques have recently begun to appear in commercial microprocessors, and re-open the question of whether the performance benefits of release consiste...
Efficient Sorting Using Registers and Caches
- in Proceedings of the 4th Workshop on Algorithm Engineering (WAE 2000)
, 2000
Cited by 18 (5 self)
Modern computer systems have increasingly complex memory systems. Common machine models for algorithm analysis do not reflect many of the features...
Dynamic scheduling in RISC architectures
- IEE Proceedings Computers and Digital Techniques
, 1996
Cited by 12 (2 self)
Multithreaded processors support a number of execution contexts and switch contexts rapidly in order to tolerate highly latent events such as external memory references. Existing multithreaded architectures are implicitly based on the assumption that latency tolerance requires massive parallelism, which must be found from diverse contexts. We have carried out a quantitative analysis of the efficiency of multithreaded execution as a function of the number of threads for two important classes of memory systems: conventional off-chip memory and symmetric networks. The results of these analyses show that there are fundamental reasons for the efficiency to grow very rapidly with the number of threads. This in turn implies that the original goal of latency tolerance can be achieved with only a limited number of threads that can typically be drawn from the same referential context and do not therefore require the heavyweight hardware solutions of conventional multithreading. A novel dynamically scheduled RISC architecture is presented based on this new understanding of the problem.
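To give a feel for how efficiency can depend on the number of threads, here is a minimal Python sketch of the simple deterministic saturation model often used in textbooks. It is not the analysis from the paper above, whose details the abstract does not give. The assumptions are mine: every thread runs for a fixed number of cycles between memory references, every reference has the same fixed latency, and context switches are free. The function and parameter names are hypothetical.

```python
def multithread_efficiency(n_threads, run_cycles, latency):
    """Processor efficiency under an assumed deterministic model.

    Each thread computes for run_cycles, then waits latency cycles.
    With zero-cost switching, n_threads can fill at most
    n_threads * run_cycles of every (run_cycles + latency) window,
    and efficiency saturates at 1.0 once the latency is fully hidden.
    """
    if n_threads < 1 or run_cycles <= 0 or latency < 0:
        raise ValueError("need n_threads >= 1, run_cycles > 0, latency >= 0")
    return min(1.0, n_threads * run_cycles / (run_cycles + latency))


# Example: 10 cycles of work per 40-cycle memory reference.
table = [(n, multithread_efficiency(n, 10, 40)) for n in range(1, 8)]
```

Under these assumptions the latency is fully hidden once n_threads reaches (run_cycles + latency) / run_cycles, which is five threads in the example, so adding further threads brings no gain.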
Informing Memory Operations: Memory Performance Feedback Mechanisms and Their Applications
- ACM Transactions on Computer Systems
, 1998
Cited by 11 (1 self)
This article proposes a new class of memory operations called informing memory operations, which essentially consist of a memory operation combined (either implicitly or explicitly) with a conditional branch-and-link operation that is taken only if the reference suffers a cache miss. This article describes two different implementations of informing memory operations. One is based on a cache-outcome condition code, and the other is based on low-overhead traps. We find that modern in-order-issue and out-of-order-issue superscalar processors already contain the bulk of the necessary hardware support. We describe how a number of software-based memory optimizations can exploit informing memory operations to enhance performance, and we look at cache coherence with fine-grained access control as a ...
Mark Horowitz is supported by ARPA Contract DABT63-94-C-0054. Margaret Martonosi is supported in part by a National Science Foundation Career Award (CCR-9502516). Todd C. Mowry is partially supported by a Faculty Development Award from IBM. Michael D. Smith is supported by a National Science Foundation Young Investigator award (CCR-9457779).
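The core idea of an informing memory operation, a load that transfers control to a handler only when it misses in the cache, can be mimicked in software. The Python sketch below is an illustration under my own assumptions, not either of the hardware implementations the article describes: the cache is a tiny direct-mapped tag array, and a Python callback stands in for the conditional branch-and-link. All names are hypothetical.

```python
class InformingCache:
    """Toy direct-mapped cache whose loads can report their own misses."""

    def __init__(self, num_lines=8):
        self.num_lines = num_lines
        self.tags = [None] * num_lines  # one tag per line, None means empty

    def informing_load(self, memory, addr, on_miss=None):
        index = addr % self.num_lines
        tag = addr // self.num_lines
        if self.tags[index] != tag:
            # Miss: fill the line, then call the handler. The callback plays
            # the role of the branch-and-link that is taken only on a miss.
            self.tags[index] = tag
            if on_miss is not None:
                on_miss(addr)
        return memory[addr]


# Usage: collect the addresses that miss, as a profiling tool might.
missed = []
cache = InformingCache(num_lines=4)
memory = list(range(100, 116))
values = [cache.informing_load(memory, a, missed.append) for a in (0, 1, 0, 4, 0)]
```

In this run addresses 0 and 4 map to the same line, so the final load of address 0 misses again after address 4 evicts it. A software memory optimization could use a handler like this to count misses per reference or to trigger prefetching.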
Efficient sorting using registers and caches
- WAE, WORKSHOP ON ALGORITHM ENGINEERING, LECTURE NOTES IN COMPUTER SCIENCE
, 2000
Cited by 7 (0 self)
Modern computer systems have increasingly complex memory systems. Common machine models for algorithm analysis do not reflect many of the features of these systems, e.g., large register sets, lockup-free caches, cache hierarchies, associativity, cache line fetching, and streaming behavior. Inadequate models lead to poor algorithmic choices and an incomplete understanding of algorithm behavior on real machines. A key step toward developing better models is to quantify the performance effects of features not reflected in the models. This paper explores the effect of memory system features on sorting performance. We introduce a new cache-conscious sorting algorithm, R-merge, which achieves better performance in practice over algorithms that are superior in the theoretical models. R-merge is designed to minimize memory stall cycles rather than cache misses by considering features common to many system designs.

