Results 1 - 9 of 9
Shared memory consistency models: A tutorial
- IEEE Computer
, 1996
Cited by 297 (8 self)
Parallel systems that support the shared memory abstraction are becoming widely accepted in many areas of computing. Writing correct and efficient programs for such systems requires a formal specification of memory semantics, called a memory consistency model. The most intuitive model, sequential consistency, greatly restricts the use of many performance optimizations commonly used by uniprocessor hardware and compiler designers, thereby reducing the benefit of using a multiprocessor. To alleviate this problem, many current multiprocessors support more relaxed consistency models. Unfortunately, the models supported by various systems differ from each other in subtle yet important ways. Furthermore, precisely defining the semantics of each model often leads to complex specifications that are difficult to understand for typical users and builders of computer systems. The purpose of this tutorial paper is to describe issues related to memory consistency models in a way that would be understandable to most computer professionals. We focus on consistency models proposed for hardware-based shared-memory systems. Many of these models are originally specified with an emphasis on the system optimizations they allow. We retain the system-centric emphasis, but use uniform and simple terminology to describe the different models. We also briefly discuss an alternate programmer-centric view that describes the models in terms of program behavior rather than specific system optimizations.
Complexity/Performance Tradeoffs with Non-Blocking Loads
, 1994
Cited by 86 (9 self)
Non-blocking loads are a very effective technique for tolerating the cache-miss latency on data cache references. We describe several methods for implementing non-blocking loads. A range of resulting hardware complexity/performance tradeoffs is investigated using an object-code translation and instrumentation system. We have investigated the SPEC92 benchmarks and have found that for the integer benchmarks, a simple hit-under-miss implementation achieves almost all of the available performance improvement for relatively little cost. However, for most of the numeric benchmarks, more expensive implementations are worthwhile. The results also point out the importance of using a compiler capable of scheduling load instructions for cache misses rather than cache hits in non-blocking systems. This Research Report is a preprint of a paper to appear at the 21st Annual International Symposium on Computer Architecture.
How Useful Are Non-blocking Loads, Stream Buffers, and Speculative Execution in Multiple Issue Processors?
, 1994
Cited by 47 (2 self)
We investigate the relative performance impact of non-blocking loads, stream buffers, and speculative execution, used both individually and in conjunction with each other. We have simulated the SPEC92 benchmarks on a statically scheduled quad-issue processor model, running code from the Multiflow compiler. Non-blocking loads and stream buffers both provide a significant performance advantage, and their combination performs significantly better than either alone. For example, with a 64-byte, 2-way set associative cache with 32 cycle fetch latency, non-blocking loads reduce the run-time by 21% while stream buffers reduce it by 26%, and the combined use of the two yields a 47% reduction. The addition of speculative execution further improves the performance of the systems that we have simulated, with or without non-blocking loads and stream buffers, by an additional 20% to 40%. We expect that the use of all three of these techniques will be important in future generations of microprocessor...
Experience with a Wireless World Wide Web Client
, 1994
Cited by 18 (0 self)
research relevant to the design and application of high performance scientific computers. We test our ideas by designing, building, and using real systems. The systems we build are research prototypes; they are not intended to become products. There are two other research laboratories located in Palo Alto, the Network Systems
Recursive Layout Generation
- WRL Research Report 95/2
, 1995
Cited by 17 (0 self)
A 300MHz 115W 32b Bipolar ECL Microprocessor
, 1993
Cited by 15 (5 self)
A full-custom single-chip bipolar ECL RISC microprocessor was implemented in a 1.0µm single-poly bipolar technology. This research prototype contains a CPU and on-chip 2KB instruction and 2KB data caches. Worst-case power dissipation with a nominal -5.2V supply is 115W. The chip has been designed for a worst-case clock frequency of 275MHz at a nominal supply. The chip verifies a new style of CAD tools developed during the design process, advanced packaging techniques for high-power microprocessors, and VLSI ECL circuit techniques. This Research Report is a reprint of a paper appearing in the November 1993 issue of the IEEE Journal of Solid-State Circuits.
Drip: A Schematic Drawing Interpreter
- WRL Research Report 95/1
, 1995
Cited by 11 (0 self)
This paper presents a design capture system in which schematics are translated into a procedural netlist specification language. The circuit designer draws schematics with a standard structured graphics editor that knows nothing about netlists or schematics. The translator program analyzes the structured graphics output file and translates it into a procedural netlist specification.
Performance implications of multiple pointer sizes
- IN: USENIX WINTER
, 1995
Cited by 9 (0 self)
... This paper analyzes several programs and programming techniques to understand the performance implications of different pointer sizes. Many (but not all) programs show small but definite performance consequences, primarily due to cache and paging effects.
Circuit and Process Directions for Low-Voltage Swing Submicron BiCMOS
, 1994
Low-swing (<600mV) submicron BiCMOS circuits have many advantages over full-swing BiCMOS, CMOS, or small-swing bipolar circuits. We show that the optimal speed fan-in for low-swing BiCMOS logic circuits is generally in the range of 7 to 20, depending on the process characteristics and gate topology. This high fan-in means that the bipolar device parasitic capacitances primarily determine the circuit speed and speed-power products, instead of f_T as in the case of low fan-in mux/demux communication circuits. SiGe HBT BiCMOS circuits are attractive for logic circuits not primarily for their higher f_T, but rather for their increased maximum device currents for a given parasitic capacitance and for their smaller V_be, which can lower chip power dissipation. Finally, for small-swing BiCMOS circuits to be competitive with CMOS they must also be built from the same lithography as CMOS circuits, have local interconnect for interdevice intra-gate wiring, and be built with a full-custom d...

