An analysis of database workload performance on simultaneous multithreaded processors
In Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998
Cited by 119 (13 self)
Simultaneous multithreading (SMT) is an architectural technique in which the processor issues multiple instructions from multiple threads each cycle. While SMT has been shown to be effective on scientific workloads, its performance on database systems is still an open question. In particular, database systems have poor cache performance, and the addition of multithreading has the potential to exacerbate cache conflicts. This paper examines database performance on SMT processors using traces of the Oracle database management system. Our research makes three contributions. First, it characterizes the memory-system behavior of database systems running on-line transaction processing and decision support system workloads. Our data show that while DBMS workloads have large memory footprints, there is substantial data reuse in a small, cacheable “critical” working set. Second, we show that the additional data cache conflicts caused by simultaneous multithreaded instruction scheduling can be nearly eliminated by the proper choice of software-directed policies for virtual-to-physical page mapping and per-process address offsetting. Our results demonstrate that with the best policy choices, D-cache miss rates on an 8-context SMT are roughly equivalent to those on a single-threaded superscalar. Multithreading also leads to better interthread instruction cache sharing, reducing I-cache miss rates by up to 35%. Third, we show that SMT’s latency tolerance is highly effective for database applications. For example, using a memory-intensive OLTP workload, an 8-context SMT processor achieves a 3-fold increase in instruction throughput over a single-threaded superscalar with similar resources.
A performance comparison of contemporary DRAM architectures
In Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999
Cited by 92 (9 self)
In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. This paper presents a simulation-based performance study of a representative group, each evaluated in a small system organization. These small-system organizations correspond to workstation-class computers and use on the order of 10 DRAM
Compressibility Characteristics of Address/Data Transfers in Commercial Workloads
2002
Cited by 3 (0 self)
In this paper, we evaluate the compressibility of address and data transfers in commercial servers. Our proposed compression scheme is geared towards improving the efficiency of the transfer medium (buses, links, etc.) and increasing the performance of the system. We evaluate the potential of the basic compression techniques for two commercial workloads -- SPECweb99 [21] and TPC-C [22] -- based on trace-driven simulations. Based on the obtained results, we find that simple compression schemes show significant promise for reducing address bus width and moderate benefits for data bus width reduction. We also show the sensitivity of these performance benefits to the number of bits compressed and the size of the encoding/decoding table used. Additionally, we propose enhancements to the compression schemes based on (1) recognizing and utilizing data-type-specific knowledge and (2) improving the replacement policy of the encoding/decoding table. The performance benefits of bus compression schemes with these enhancements are also presented and analyzed.
Comparing the Memory System Performance of DSS Workloads on the HP
2002
In this paper, we present an in-depth analysis of the memory system performance of DSS commercial workloads on two state-of-the-art multiprocessors: the SGI Origin 2000 and the HP V-Class. Our results show that a single query process takes almost the same number of cycles on both machines. However, when multiple query processes run simultaneously on the system, the execution time tends to increase more on the SGI Origin 2000 than on the HP V-Class because of the more expensive communication overhead on the SGI Origin 2000. We also show the rate at which the number of data cache misses, the number of context switches, and the overall execution time increase as more query processes run simultaneously.
Exploring Performance Limits to Future Instruction-Level-Parallel Processors
1998
In this paper, we examine the relative importance of memory latency, memory bandwidth, and branch predictability on the performance of future processors. We develop and validate a sampling-based simulation methodology that allows us to simulate a large number of design points. Our methodology ensures that the entire execution profile of the application is captured while limiting the errors induced by sampling to less than 2%. We extend our simulation results by fitting the data to analytic expressions of filters. Using the insight gained from these expressions, our simulation data, and known technological trends, we develop an understanding of the factors that will limit the performance of future-generation processors. From our examination, we conclude the following. The amount of instruction-level parallelism exploited by an application changes the relative importance of performance bottlenecks. In systems with less capacity to exploit instruction-level parallelism, memory l...
Shih-lien Lu
To access a set-associative L1 cache in a high-performance processor, all ways of the selected set are searched and fetched in parallel using physical address bits. Such a cache is oblivious to the software semantics of memory references, such as the stack-heap bifurcation of the memory space and user-kernel ring levels. This constitutes a waste of energy since, for example, a user-mode instruction fetch will never hit a cache block that contains kernel code. Similarly, a stack access will not hit a cache line that contains heap data. We propose to exploit software semantics in cache design to avoid unnecessary associative searches, thus reducing dynamic power consumption. Specifically, we utilize virtual memory region properties to optimize the data cache and ring level information to optimize the instruction cache. Our design does not impact performance and incurs very small hardware cost. Simulation results using SPEC CPU and SPECjapps indicate that the proposed designs help to reduce cache block fetches from DL1 and IL1 by 27% and 57% respectively, resulting in average savings of 15% of DL1 power and more than 30% of IL1 power compared to an aggressively clock-gated baseline.

