Results 1 - 9 of 9
Simultaneous Multithreading: Maximizing On-Chip Parallelism
, 1995
Cited by 623 (46 self)
This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar’s multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing architectures. Our results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are limited in their ability to utilize the resources of a wide-issue processor. Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multithreading. We evaluate several cache configurations made possible by this type of organization and evaluate tradeoffs between them. We also show that simultaneous multithreading is an attractive alternative to single-chip multiprocessors; simultaneous multithreaded processors with a variety of organizations outperform corresponding conventional multiprocessors with similar execution resources. While simultaneous multithreading has excellent potential to increase processor utilization, it can add substantial complexity to the design. We examine many of these complexities and evaluate alternative organizations in the design space.
A new memory monitoring scheme for memory-aware scheduling and partitioning
, 2002
Cited by 86 (2 self)
We propose a low overhead, on-line memory monitoring scheme utilizing a set of novel hardware counters. The counters indicate the marginal gain in cache hits as the size of the cache is increased, which gives the cache miss-rate as a function of cache size. Using the counters, we describe a scheme that enables an accurate estimate of the isolated miss-rates of each process as a function of cache size under the standard LRU replacement policy. This information can be used to schedule jobs or to partition the cache to minimize the overall miss-rate. The data collected by the monitors can also be used by an analytical model of cache and memory behavior to produce a more accurate overall miss-rate for the collection of processes sharing a cache in both time and space. This overall miss-rate can be used to
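As a rough illustration of how marginal-gain counters could yield a miss-rate curve, here is a minimal sketch. It assumes each counter holds the extra hits gained when the cache grows by one unit; the function name `miss_rate_curve`, the counter layout, and the sample values are hypothetical and are not taken from the paper.

```python
from itertools import accumulate


def miss_rate_curve(marginal_gains, total_accesses):
    """Return the miss rate for cache sizes 0..len(marginal_gains) units.

    Assumption (not from the paper): marginal_gains[i] is the number of
    additional hits obtained when the cache grows from i to i + 1 units.
    """
    if total_accesses <= 0:
        raise ValueError("total_accesses must be positive")
    # Cumulative hits for each cache size, starting with a size-0 cache.
    hits = [0] + list(accumulate(marginal_gains))
    if hits[-1] > total_accesses:
        raise ValueError("counters report more hits than accesses")
    return [(total_accesses - h) / total_accesses for h in hits]


# Made-up counter values for 100 memory accesses.
print(miss_rate_curve([50, 30, 10, 5], 100))
# prints [1.0, 0.5, 0.2, 0.1, 0.05]
```

The cumulative sum is what turns per-unit gains into a miss rate as a function of cache size: each extra unit removes its marginal hits from the miss count.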
Models of Parallel Computation: A Survey and Synthesis
- INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES
, 1995
Cited by 48 (0 self)
In the realm of sequential computing the random access machine has successfully provided an underlying model of computation that promoted consistency and coordination among algorithm developers, computer architects and language experts. In the realm of parallel computing, however, there has been no similar success. The need for such a unifying parallel model or set of models is heightened by the greater demand for performance and the greater diversity among machines. Yet the modeling of parallel computing still seems to be mired in controversy and chaos. This paper is an excerpt from a study which presents a broad range of models of parallel computation and the different roles they serve in algorithm, language and machine design. The objective is to better understand which model characteristics are important to each design community in order to elucidate the requirements of a unifying paradigm. As an impetus for discussion, we conclude by suggesting a model of parallel computation which...
Dynamic Partitioning of Shared Cache Memory
- JOURNAL OF SUPERCOMPUTING
, 2002
Cited by 43 (0 self)
This paper proposes dynamic cache partitioning amongst simultaneously executing processes/threads. We present a general partitioning scheme that can be applied to set-associative caches.
Since memory reference characteristics of processes/threads can change over time, our method collects the cache miss characteristics of processes/threads at run-time. Also, the workload is determined at run-time by the operating system scheduler. Our scheme combines the information, and partitions the cache amongst the executing processes/threads. Partition sizes are varied dynamically to reduce the total number of misses.
The partitioning scheme has been evaluated using a processor simulator modeling a two-processor CMP system. The results show that the scheme can improve the total IPC significantly over the standard least recently used (LRU) replacement policy. In one case, partitioning doubles the total IPC over standard LRU. Our results show that smart cache management and scheduling are essential to achieve high performance with shared cache memory.
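One way to picture partitioning a cache to reduce total misses is a greedy allocation over per-process marginal gains. The sketch below is an assumption for illustration only: the abstract does not state which allocation algorithm the authors use, and the name `greedy_partition`, the input layout, and the sample numbers are hypothetical.

```python
def greedy_partition(marginal_gains, total_blocks):
    """Split total_blocks cache blocks among processes, one block at a time.

    Assumption (not from the paper): marginal_gains[p][k] is the number of
    extra hits process p would get from its (k + 1)-th block. Each block
    goes to the process whose next block saves the most misses.
    """
    if not marginal_gains:
        raise ValueError("at least one process is required")
    alloc = {p: 0 for p in marginal_gains}
    for _ in range(total_blocks):
        best, best_gain = None, float("-inf")
        for p, gains in marginal_gains.items():
            k = alloc[p]
            # Past the end of the measured curve, assume no further gain.
            gain = gains[k] if k < len(gains) else 0
            if gain > best_gain:
                best, best_gain = p, gain
        alloc[best] += 1
    return alloc


# Made-up run-time measurements for two processes sharing four blocks.
print(greedy_partition({"A": [50, 30, 10, 5], "B": [40, 35, 2, 1]}, 4))
# prints {'A': 2, 'B': 2}
```

A greedy pass like this minimizes total misses only when each process's marginal gains are non-increasing; re-running it as the measured gains change is one way partition sizes could vary over time.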
Effects of memory performance on parallel job scheduling
- In JSSPP ’01: Revised Papers from the 7th International Workshop on Job Scheduling Strategies for Parallel Processing
, 2001
VLSI Datapath Choices: Cell-Based Versus Full-Custom
, 1998
Cited by 5 (1 self)
Traditionally, VLSI architects and designers have acknowledged the area, performance, and effort tradeoffs between cell-based and full-custom implementations of the same datapath function. However, few attempts have been made to characterize these tradeoffs in the context of contemporary fabrication processes and area place and route tools. More importantly, few attempts have been made to determine how to enable cell-based implementations to approach the density and speed of full-custom designs. This work quantifies the limits of cell-based datapath implementations based on results derived from a detailed analysis of the density and performance tradeoffs in the implementation of two full-custom datapaths, the Integer Register-Read Datapath (IRRDP) and the 64-bit adder/subtracter (ADDSUB), employed in the multi-ALU Processor (MAP) chip. A cell-based implementation of the IRRDP is 1.64x larger than the full-custom original. The critical timing path for the cell-based implementation is 11...
Optimistic Active Messages: Structuring Systems for High-Performance Communication
- In Sixth SIGOPS Eu Workshop: Matching Operating Systems to Application Needs
, 1994
Cited by 2 (1 self)
Recent networks and network interfaces promise remarkable communication performance with very little overhead, but current software structures impose substantial overhead that prevents applications from achieving the benefits of these new architectures. We propose a new software structure that eliminates much of the overhead while preserving the ease of programming of current systems. Our architecture relies on the compiler to bridge the gap between high-level application programs and low-level communication primitives. The compiler incorporates application code into message handlers using a new runtime mechanism called optimistic active messages. This work was supported in part by the Advanced Research Projects Agency under contracts N00014-91-J-1698 and DABT63-93-C-008, by the National Science Foundation under contract MIP-9012773, by an NSF Presidential Young Investigator Award, by IBM, AT&T, and Digital Equipment Corp., by Project Scout under ARPA contract MDA972-92-J-1032, by a...
Architecture, 1996. [Zilic95] Z. Zilic, G. Lemieux, K. Loveless, S. Brown, Z. Vranesic, "Designing for Canadian Workshop on Field-Programmable Devices
University, Stanford, CA, November 1991. [Stumm93] M. Stumm, Z. Vranesic, R. White, R. Unrau, K. Farkas, "Experiences with the Hector Multiprocessor," CSRI Technical Report CSRI-276, Computer Systems Research Institute, University of Toronto, Toronto, 1993. Extended version of paper with same title in Proc. Intl. Parallel Processing Symposium Parallel Systems Fair, 1993, pp. 9-16. [Sun95] Sun, SuperSPARC II Data Sheet, Sun, 1995. [Torrellas95] J. Torrellas, C. Xia, R. Daigle, "Optimizing Instruction Cache Performance for Operating System Intensive Workloads," to appear in IEEE Transactions on Computers, 1995. [Torrie95] E. Torrie, C-W. Tseng, M. Martonosi, M.W. Hall, "Evaluating the Impact of Advanced Memory Systems on Compiler-Parallelized Codes," International Conference on Parallel Architectures and Compilation Techniques, June 1995. [Varley93] D.A. Varley, "Practical Experience of the Limitations of gprof," Software --- Pract

