Results 1 - 10
of
47
Potential benefits of delta encoding and data compression for HTTP (Corrected version)
, 1997
"... ..."
The Multicluster Architecture: Reducing Cycle Time Through Partitioning
, 1997
"... The multicluster architecture that we introduce offers a decentralized, dynamically-scheduled architecture, in which the register files, dispatch queue, and functional units of the architecture are distributed across multiple clusters, and each cluster is assigned a subset of the architectural regis ..."
Abstract
-
Cited by 155 (0 self)
- Add to MetaCart
The multicluster architecture that we introduce offers a decentralized, dynamically-scheduled architecture, in which the register files, dispatch queue, and functional units of the architecture are distributed across multiple clusters, and each cluster is assigned a subset of the architectural registers. The motivation for the multicluster architecture is to reduce the clock cycle time, relative to a single-cluster architecture with the same number of hardware resources, by reducing the size and complexity of components on critical timing paths. Resource partitioning, however, introduces instruction-execution overhead and may reduce the number of concurrently executing instructions. To counter these two negative by-products of partitioning, we developed a static instruction scheduling algorithm. We describe this algorithm, and using tracedriven simulations of SPEC92 benchmarks, evaluate its effectiveness. This evaluation indicates that for the configurations considered, the multicluste...
Scalable kernel performance for Internet servers under realistic loads
, 1998
"... UNIX Internet servers with an event-driven architecture often perform poorly under real workloads, even if they perform well under laboratory benchmarking conditions. We investigated the poor performance of event-driven servers. We found that the delays typical in wide-area networks cause busy serve ..."
Abstract
-
Cited by 86 (9 self)
- Add to MetaCart
UNIX Internet servers with an event-driven architecture often perform poorly under real workloads, even if they perform well under laboratory benchmarking conditions. We investigated the poor performance of event-driven servers. We found that the delays typical in wide-area networks cause busy servers to manage a large number of simultaneous connections. We also observed that the select system call implementation in most UNIX kernels scales poorly with the number of connections being managed by a process. The UNIX algorithm for allocating file descriptors also scales poorly. These algorithmic problems lead directly to the poor performance of event-driven servers. We implemented scalable versions of the select system call and the descriptor allocation algorithm. This led to an improvement of up to 58% in Web proxy and Web server throughput, and dramatically improved the scalability of the system.
Predictor-directed stream buffers
- In 33rd International Symposium on Microarchitecture
, 2000
"... An effective method for reducing the effect of load latency in modern processors is data prefetching. One form of data prefetching, stream buffers, has been shown to be particularly effective due to its ’ ability to detect data streams and run ahead of them, prefetching as it goes. Unfortunately, in ..."
Abstract
-
Cited by 58 (12 self)
- Add to MetaCart
An effective method for reducing the effect of load latency in modern processors is data prefetching. One form of data prefetching, stream buffers, has been shown to be particularly effective due to its ’ ability to detect data streams and run ahead of them, prefetching as it goes. Unfortunately, in the past, the applicability of streaming was limited to stride intensive code. In this paper we propose Predictor-Directed Stream Buffers (PSB), a scheme in which the stream buffer follows an address prediction stream instead of a fixed stride. In addition, we examine using confidence techniques to guide the allocation and prioritization of stream buffers and their prefetch requests. Our results show for pointer-based applications that PSB provides a 30 % speedup on average over no prefetching, and provides an average 10 % speedup over using previously proposed stride-based stream buffers for pointer-intensive applications. 1
Branch prediction, instruction-window size, and cache size: Performance tradeoffs and simulation techniques
- IEEE Transactions on Computers
, 1999
"... Design parameters interact in complex ways in modern processors, especially because out-of-order issue and decoupling buffers allow latencies to be overlapped. Tradeoffs among instruction-window size, branch-prediction accuracy, and instruction- and datacache size can change as these parameters move ..."
Abstract
-
Cited by 57 (18 self)
- Add to MetaCart
Design parameters interact in complex ways in modern processors, especially because out-of-order issue and decoupling buffers allow latencies to be overlapped. Tradeoffs among instruction-window size, branch-prediction accuracy, and instruction- and datacache size can change as these parameters move through different domains. For example, modeling unrealistic caches can under- or over-state the benefits of better prediction or a larger instruction window. Avoiding such pitfalls requires understanding how all these parameters interact. Because such methodological mistakes are common, this paper provides a comprehensive set of SimpleScalar simulation results from SPECint95 programs, showing the interactions among these major structures. In addition to presenting this database of simulation results, major mechanisms driving the observed tradeoffs are described. The paper also considers appropriate simulation techniques when sampling full-length runs with the SPEC reference inputs. In particular, the results show that branch mispredictions limit the benefits of larger instruction windows, that better branch prediction and better instruction cache behavior have synergistic effects, and that the benefits of larger instruction windows and larger data caches trade off and have overlapping effects. In addition, simulations of only 50 million instructions can yield representative results if these short windows are carefully selected.
Exploiting Dead Value Information
- In 30th International Symposium on Microarchitecture
, 1997
"... We describe Dead Value Information (DVI) and introduce three new optimizations which exploit it. DVI provides assertions that certain register values are dead, meaning they will not be read before being overwritten. The processor can use DVI to track dead registers and dynamically eliminate unnecess ..."
Abstract
-
Cited by 47 (0 self)
- Add to MetaCart
We describe Dead Value Information (DVI) and introduce three new optimizations which exploit it. DVI provides assertions that certain register values are dead, meaning they will not be read before being overwritten. The processor can use DVI to track dead registers and dynamically eliminate unnecessary save and restore instructions from the execution stream at procedure calls and context switches. Our results indicate that dynamic saves and restore instances can be reduced by 46 % for procedure calls and by 51 % for context switches. In addition, save/restore elimination for procedure calls can improve overall performance by up to 5%. DVI also allows the processor manage physical registers to efficiently, reducing the size requirements of the physical register file. When the system clock rate is proportional to the register file cycle time, this optimization can improve performance. All of these optimizations can be supported with only a few new instructions and minimal additional hardware structures. 1
Dynamic Hammock Predication for Non-predicated Instruction Set Architectures
, 1998
"... Conventional speculative architectures use branch prediction to evaluate the most likely execution path during program execution. However, certain branches are difficult to predict. One solution to this problem is to evaluate both paths following such a conditional branch. Predicated execution can b ..."
Abstract
-
Cited by 34 (2 self)
- Add to MetaCart
Conventional speculative architectures use branch prediction to evaluate the most likely execution path during program execution. However, certain branches are difficult to predict. One solution to this problem is to evaluate both paths following such a conditional branch. Predicated execution can be used to implement this form of multi-path execution. Predicated architectures fetch and issue instructions that have associated predicates. These predicates indicate if the instruction should commit its result. Predicating a branch reduces the number of branches executed, eliminating the chance of branch misprediction at the cost of executing additional instructions. In this paper, we propose a restricted form of multi-path execution called Dynamic Predication for architectures with little or no support for predicated instructions in their instruction set. Dynamic predication dynamically predicates instruction sequences in the form of a branch hammock, concurrently executing both paths of...
The Impulse Memory Controller
- IEEE TRANSACTIONS ON COMPUTERS
, 2001
"... Impulse is a memory system architecture that adds an optional level of address indirection at the memory controller. Applications can use this level of indirection to remap their data structures in memory. As a result, they can control how their data is accessed and cached, which can improve cach ..."
Abstract
-
Cited by 33 (6 self)
- Add to MetaCart
Impulse is a memory system architecture that adds an optional level of address indirection at the memory controller. Applications can use this level of indirection to remap their data structures in memory. As a result, they can control how their data is accessed and cached, which can improve cache and bus utilization. The Impulse design does not require any modification to processor, cache, or bus designs, since all the functionality resides at the memory controller. As a result, Impulse can be adopted in conventional systems without major system changes. We describe the design of the Impulse architecture and show how an Impulse memory system can be used in a variety of ways to improve the performance of memory-bound applications. Impulse can be used to dynamically create superpages cheaply, to dynamically recolor physical pages, to perform strided fetches, and to perform gathers and scatters through indirection vectors. Our performance results demonstrate the effectiveness of these optimizations in a variety of scenarios. Using Impulse can speed up a range of applications from 20% to over a factor of 5. Alternatively, Impulse can be used by the OS for dynamic superpage creation; the best policy for creating superpages using Impulse outperforms previously known superpage creation policies.
Access order and effective bandwidth for streams on a direct rambus memory
- Proc. HPCA-5
, 1999
"... Processor speeds are increasing rapidly, and memory speeds are not keeping up. Streaming computations (such as multi-media or scientific applications) are among those whose performance is most limited by the memory bottleneck. Rambus hopes to bridge the processor/memory performance gap with a recent ..."
Abstract
-
Cited by 33 (5 self)
- Add to MetaCart
Processor speeds are increasing rapidly, and memory speeds are not keeping up. Streaming computations (such as multi-media or scientific applications) are among those whose performance is most limited by the memory bottleneck. Rambus hopes to bridge the processor/memory performance gap with a recently introduced DRAM that can deliver up to 1.6Gbytes/sec. We analyze the performance of these interesting new memory devices on the inner loops of streaming computations, both for traditional memory controllers that treat all DRAM transactions as random cacheline accesses, and for controllers augmented with streaming hardware. For our benchmarks, we find that accessing unit-stride streams in cacheline bursts in the natural order of the computation exploits from 44-76 % of the peak bandwidth of a memory system composed of a single Direct RDRAM device, and that accessing streams via a streaming mechanism with a simple access ordering scheme can improve performance by factors of 1.18 to 2.25. 1.
Dual-core execution: building a highly scalable single-thread instruction window
, 2005
"... Current integration trends embrace the prosperity of single-chip multi-core processors. Although multi-core processors deliver significantly improved system throughput, single-thread performance is not addressed. In this paper, we propose a new execution paradigm that utilizes multi-cores on a singl ..."
Abstract
-
Cited by 33 (2 self)
- Add to MetaCart
Current integration trends embrace the prosperity of single-chip multi-core processors. Although multi-core processors deliver significantly improved system throughput, single-thread performance is not addressed. In this paper, we propose a new execution paradigm that utilizes multi-cores on a single chip collaboratively to achieve high performance for single-thread memoryintensive workloads while maintaining the flexibility to support multithreaded applications. The proposed execution paradigm, dual-core execution, consists of two superscalar cores (a front and back processor) coupled with a queue. The front processor fetches and preprocesses instruction streams and retires processed instructions into the queue for the back processor to consume. The front processor executes instructions as usual except for cache-missing loads, which produce an invalid value instead of blocking the pipeline. As a result, the front processor runs far ahead to warm up the data caches and fix branch mispredictions for the back processor. In-flight instructions are distributed in the front processor, the queue, and the back processor, forming a very large instruction window for single-thread out-oforder execution. The proposed architecture incurs only minor hardware changes and does not require any large centralized structures such as large register files, issue queues, load/store queues, or reorder buffers. Experimental results show remarkable latency hiding capabilities of the proposed architecture, even outperforming more complex single-thread processors with much larger instruction windows than the front or back processor. 1.

