DBMSs on a modern processor: Where does time go
, 1999
Cited by 166 (23 self)
Recent high-performance processors employ sophisticated techniques to overlap and simultaneously execute multiple computation and memory operations. Intuitively, these techniques should help database applications, which are becoming increasingly compute and memory bound. Unfortunately, recent studies report that they do not improve database system performance to the same extent as scientific workloads. Recent work on database systems focusing on minimizing memory latencies, such as cache-conscious algorithms for sorting and data placement, is one step toward addressing this problem. However, to best design high performance DBMSs, we must carefully evaluate and understand the processor and memory behavior of commercial DBMSs on today’s hardware platforms. In this paper we answer the question "Where does time go when a database system executes on a modern computer platform?" We examine four commercial DBMSs running on an Intel Xeon and NT 4.0. We introduce a framework for analyzing query execution time on a DBMS running on a server with a modern processor and memory architecture. To focus on processor and memory interactions and exclude effects from the I/O subsystem, we use a memory resident database. Using simple queries, we find that database developers should (a) optimize data placement for the second level of data cache, and not the first, (b) optimize instruction placement to reduce first-level instruction cache stalls, but (c) not expect the overall execution time to decrease significantly without addressing stalls related to subtle implementation issues (e.g., branch prediction).
A Case for Intelligent Disks (IDISKs)
, 1998
Cited by 106 (4 self)
Decision support systems (DSS) and data warehousing workloads comprise an increasing fraction of the database market today. I/O capacity and associated processing requirements for DSS workloads are increasing at a rapid rate, doubling roughly every nine to twelve months [38]. In response to this increasing storage and computational demand, we present a computer architecture for decision support database servers that utilizes "intelligent" disks (IDISKs). IDISKs utilize low-cost embedded general-purpose processing, main memory, and high-speed serial communication links on each disk. IDISKs are connected to each other via these serial links and high-speed crossbar switches, overcoming the I/O bus bottleneck of conventional systems. By off-loading computation from expensive desktop processors, IDISK systems may improve cost-performance. More importantly, the IDISK architecture allows the processing of the system to scale with increasing storage demand.
A performance comparison of contemporary DRAM architectures
- In Proceedings of the 26th Annual International Symposium on Computer Architecture
, 1999
Cited by 92 (9 self)
In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. This paper presents a simulation-based performance study of a representative group, each evaluated in a small system organization. These small-system organizations correspond to workstation-class computers and use on the order of 10 DRAM
Weaving Relations for Cache Performance
, 2001
Cited by 83 (14 self)
Relational database systems have traditionally optimized for I/O performance and organized records sequentially on disk pages using the N-ary Storage Model (NSM) (a.k.a., slotted pages). Recent research, however, indicates that cache utilization and performance is becoming increasingly important on modern platforms. In this paper, we first demonstrate that in-page data placement is the key to high cache performance and that NSM exhibits low cache utilization on modern platforms. Next, we propose a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page. Because PAX only affects layout inside the pages, it incurs no storage penalty and does not affect I/O behavior. According to our experimental results, when compared to NSM (a) PAX exhibits superior cache and memory bandwidth utilization, saving at least 75% of NSM's stall time due to data cache accesses, (b) range selection queries and updates on memory-resident relations execute 17-25% faster, and (c) TPC-H queries involving I/O execute 11-48% faster.
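The difference between the two in-page layouts named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`nsm_page`, `pax_page`, `pax_record`) and the sample records are hypothetical, pages are modeled as plain Python lists rather than fixed-size byte buffers, and details such as slot directories and variable-length values are left out.

```python
def nsm_page(records):
    """NSM-style layout (assumed simplification): whole records stored
    back to back, so values of different attributes are interleaved."""
    flat = []
    for rec in records:
        flat.extend(rec)
    return flat


def pax_page(records, num_attrs):
    """PAX-style layout (assumed simplification): within the page, all
    values of one attribute are grouped together in their own list."""
    groups = [[] for _ in range(num_attrs)]
    for rec in records:
        for i, value in enumerate(rec):
            groups[i].append(value)
    return groups


def pax_record(groups, slot):
    """Reassemble the record at position `slot` from the attribute groups."""
    return tuple(group[slot] for group in groups)


# Hypothetical sample data: (id, name, age)
records = [(1, "ann", 30), (2, "bob", 41), (3, "cy", 27)]

nsm = nsm_page(records)
pax = pax_page(records, num_attrs=3)

# Scanning one attribute: strided access in NSM, contiguous in PAX.
ages_from_nsm = nsm[2::3]
ages_from_pax = pax[2]
```

Both layouts hold the same values for the same set of records in one page, which matches the abstract's point that only the arrangement inside the page changes. A scan over a single attribute touches adjacent values in the PAX-style layout, whereas the NSM-style layout has to step over the other attributes of every record.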
Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors
, 1998
Cited by 81 (4 self)
Database applications such as online transaction processing (OLTP) and decision support systems (DSS) constitute the largest and fastest-growing segment of the market for multiprocessor servers. However, most current system designs have been optimized to perform well on scientific and engineering workloads. Given the radically different behavior of database workloads (especially OLTP), it is important to re-evaluate key system design decisions in the context of this important class of applications. This paper examines the behavior of database workloads on shared-memory multiprocessors with aggressive out-of-order processors, and considers simple optimizations that can provide further performance improvements. Our study is based on detailed simulations of the Oracle commercial database engine. The results show that the combination of out-of-order execution and multiple instruction issue is indeed effective in improving performance of database workloads, providing gains of 1.5 and 2.6 t...
Exploring the Design Space of Future CMPs
, 2001
Cited by 78 (12 self)
In this paper, we study the space of chip multiprocessor (CMP) organizations. We compare the area and performance trade-offs for CMP implementations to determine how many processing cores future server CMPs should have, whether the cores should have in-order or out-of-order issue, and how big the per-processor on-chip caches should be. We find that, contrary to some conventional wisdom, out-of-order processing cores will maximize job throughput on future CMPs. As technology shrinks, limited off-chip bandwidth will begin to curtail the number of cores that can be effective on a single die. Current projections show that the transistor/signal pin ratio will increase by a factor of 45 between 180 and 35 nanometer technologies. That disparity will force increases in per-processor cache capacities as technology shrinks, from 128KB at 100nm, to 256KB at 70nm, and to 1MB at 50 and 35nm, reducing the number of cores that would otherwise be possible.
Using Cohort Scheduling to Enhance Server Performance
, 2002
Cited by 73 (0 self)
A server application is commonly organized as a collection of concurrent threads, each of which executes the code necessary to process a request. This software architecture, which causes frequent control transfers between unrelated pieces of code, decreases instruction and data locality, and consequently reduces the effectiveness of hardware mechanisms such as caches, TLBs, and branch predictors. Numerous measurements demonstrate this effect in server applications, which often utilize only a fraction of a modern processor's computational throughput.
A New Direction for Computer Architecture Research
- IEEE Computer
, 1998
Cited by 72 (2 self)
In this paper we suggest a different computing environment as a worthy new direction for computer architecture research: personal mobile computing, where portable devices are used for visual computing and personal communications tasks. Such a device supports in an integrated fashion all the functions provided today by a portable computer, a cellular phone, a digital camera and a video game. The requirements placed on the processor in this environment are energy efficiency, high performance for multimedia and DSP functions, and area efficient, scalable designs. We examine the architectures that were recently proposed for billion transistor microprocessors. While they are very promising for the stationary desktop and server workloads, we discover that most of them are unable to meet the challenges of the new environment and provide the necessary enhancements for multimedia applications running on portable devices. We conclude with Vector IRAM, an initial example of a microprocessor architecture and implementation that matches the new environment.
An architectural evaluation of Java TPC-W
- In Proceedings of the Seventh International Symposium on High-Performance Computer Architecture
, 2001
Cited by 68 (5 self)
The use of the Java programming language for implementing server-side application logic is increasing in popularity, yet there is very little known about the architectural requirements of this emerging commercial workload. We present a detailed characterization of the Transaction Processing Council’s TPC-W web benchmark, implemented in Java. The TPC-W benchmark is designed to exercise the web server and transaction processing system of a typical e-commerce web site. We have implemented TPC-W as a collection of Java servlets, and present an architectural study detailing the memory system and branch predictor behavior of the workload. We also evaluate the effectiveness of a coarse-grained multithreaded processor at increasing system throughput using TPC-W and other commercial workloads. We measure system throughput
Characterizing the Memory Behavior of Java Workloads: A Structured View and Opportunities for Optimizations
, 2000
Cited by 54 (3 self)
This paper studies the memory behavior of important Java workloads used in benchmarking Java Virtual Machines (JVMs), based on instrumentation of both application and library code in a state-of-the-art JVM, and provides structured information about these workloads to help guide systems' design. We begin by characterizing the inherent memory behavior of the benchmarks, such as information on the breakup of heap accesses among different categories and on the hotness of references to fields and methods. We then provide detailed information about misses in the data TLB and caches, including the distribution of misses over different kinds of accesses and over different methods. In the process, we make interesting discoveries about TLB behavior and limitations of data prefetching schemes discussed in the literature in dealing with pointer-intensive Java codes. Throughout this paper, we develop a set of recommendations to computer architects and compiler writers on how to optimize computer systems and system software to run Java programs more efficiently. This paper also makes the first attempt to compare the characteristics of SPECjvm98 to those of a server-oriented benchmark, pBOB, and explain why the current set of SPECjvm98 benchmarks may not be adequate for a comprehensive and objective evaluation of JVMs and just-in-time (JIT) compilers. We discover that the fraction of accesses to array elements is quite significant, demonstrate that the number of "hot spots" in the benchmarks is small, and show that field reordering cannot yield significant performance gains. We also show that even a fairly large L2 data cache is not effective for many Java benchmarks. We observe that instructions used to prefetch data into the L2 data cache are often squashed because of high TLB miss ...

