Piranha: A scalable architecture based on single-chip multiprocessing
- SIGARCH Comput. Archit. News, 2000
Cited by 174 (5 self)
The microprocessor industry is currently struggling with higher development costs and longer design times that arise from exceedingly complex processors that are pushing the limits of instruction-level parallelism. Meanwhile, such designs are especially ill suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. The abundance of explicit thread-level parallelism in commercial workloads, along with advances in semiconductor integration density, identify chip multiprocessing (CMP) as potentially the most promising approach for designing processors
Trace-Driven Memory Simulation: A Survey
- ACM Computing Surveys, 2004
Cited by 133 (0 self)
This article surveys and analyzes these developments by establishing criteria for evaluating trace-driven methods, and then applies these criteria to describe, categorize, and compare over 50 trace-driven simulation tools. We discuss the strengths and weaknesses of different approaches and show that no single method is best when all criteria, including accuracy, speed, memory, flexibility, portability, expense, and ease of use are considered. In a concluding section, we examine fundamental limitations to trace-driven simulation, and survey some recent developments in memory simulation that may overcome these bottlenecks
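As a rough illustration of what "trace-driven" means here, the sketch below replays a recorded list of memory addresses through a tiny direct-mapped cache model and counts hits and misses. It is not taken from the survey or from any of the tools it compares; the function name `simulate_direct_mapped`, the cache geometry, and the sample trace are all assumptions made for illustration.

```python
def simulate_direct_mapped(trace, num_lines=4, line_size=16):
    """Replay byte addresses through a direct-mapped cache model.

    Hypothetical helper: num_lines and line_size are illustrative defaults,
    not parameters drawn from the surveyed tools.
    Returns a (hits, misses) tuple.
    """
    tags = [None] * num_lines          # one stored tag per cache line
    hits = misses = 0
    for addr in trace:
        block = addr // line_size      # memory block containing this address
        index = block % num_lines      # cache line the block maps to
        tag = block // num_lines       # distinguishes blocks sharing a line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag          # fill the line on a miss
    return hits, misses


# A made-up address trace; real traces are captured from running programs.
sample_trace = [0, 4, 64, 0, 16, 20]
print(simulate_direct_mapped(sample_trace))
```

The trace is the only input, so the same recording can be replayed against different cache geometries, which is the property that makes the approach attractive for comparing memory-system designs.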
An analysis of database workload performance on simultaneous multithreaded processors
- In Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998
Cited by 118 (13 self)
Simultaneous multithreading (SMT) is an architectural technique in which the processor issues multiple instructions from multiple threads each cycle. While SMT has been shown to be effective on scientific workloads, its performance on database systems is still an open question. In particular, database systems have poor cache performance, and the addition of multithreading has the potential to exacerbate cache conflicts. This paper examines database performance on SMT processors using traces of the Oracle database management system. Our research makes three contributions. First, it characterizes the memory-system behavior of database systems running on-line transaction processing and decision support system workloads. Our data show that while DBMS workloads have large memory footprints, there is substantial data reuse in a small, cacheable “critical” working set. Second, we show that the additional data cache conflicts caused by simultaneous-multithreaded instruction scheduling can be nearly eliminated by the proper choice of software-directed policies for virtual-to-physical page mapping and per-process address offsetting. Our results demonstrate that with the best policy choices, D-cache miss rates on an 8-context SMT are roughly equivalent to those on a single-threaded superscalar. Multithreading also leads to better interthread instruction cache sharing, reducing I-cache miss rates by up to 35%. Third, we show that SMT’s latency tolerance is highly effective for database applications. For example, using a memory-intensive OLTP workload, an 8-context SMT processor achieves a 3-fold increase in instruction throughput over a single-threaded superscalar with similar resources.
A performance comparison of contemporary DRAM architectures
- In Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999
Cited by 92 (9 self)
In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. This paper presents a simulation-based performance study of a representative group, each evaluated in a small system organization. These small-system organizations correspond to workstation-class computers and use on the order of 10 DRAM
Vector Microprocessors
- In Hot Chips VII, 1998
Cited by 62 (4 self)
Vector Microprocessors, by Krste Asanovic. Doctor of Philosophy in Computer Science, University of California, Berkeley. Professor John Wawrzynek, Chair. Most previous research into vector architectures has concentrated on supercomputing applications and small enhancements to existing vector supercomputer implementations. This thesis expands the body of vector research by examining designs appropriate for single-chip full-custom vector microprocessor implementations targeting a much broader range of applications. I present the design, implementation, and evaluation of T0 (Torrent-0): the first single-chip vector microprocessor. T0 is a compact but highly parallel processor that can sustain over 24 operations per cycle while issuing only a single 32-bit instruction per cycle. T0 demonstrates that vector architectures are well suited to full-custom VLSI implementation and that they perform well on many multimedia and human-machine interface tasks. The remainder of the thesis contains ...
Alphasort: a RISC machine sort
- In Proceedings of 1994 ACM SIGMOD Conference, 1994
Cited by 59 (7 self)
A new sort algorithm, called AlphaSort, demonstrates that commodity processors and disks can handle commercial batch workloads. Using Alpha AXP processors, commodity memory, and arrays of SCSI disks, AlphaSort runs the industry-standard sort benchmark in seven seconds. This beats the best published record on a 32-cpu 32-disk Hypercube by 8:1. On another benchmark, AlphaSort sorted more than a gigabyte in a minute. AlphaSort is a cache-sensitive memory-intensive sort algorithm. It uses file striping to get high disk bandwidth. It uses QuickSort to generate runs and uses replacement-selection to merge the runs. It uses shared-memory multiprocessors to break the sort into subsort chores. Because startup times are becoming a significant part of the total time, we propose two new benchmarks:
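The two-phase structure this abstract describes (sort fixed-size runs, then merge them) can be sketched as follows. This is only an illustration of the general run-generation-plus-merge pattern, not AlphaSort itself: Python's built-in `sorted` (Timsort) stands in for QuickSort, `heapq.merge` stands in for the replacement-selection merge, and the function name `sort_in_runs` and the `run_size` parameter are hypothetical.

```python
import heapq


def sort_in_runs(records, run_size):
    """Sort records by generating sorted runs and then merging them.

    Assumptions for illustration: run_size is a positive integer chosen by
    the caller; in a cache-sensitive sort it would be sized so that a run
    fits in cache.
    """
    # Phase 1: cut the input into runs and sort each run independently.
    runs = [
        sorted(records[start:start + run_size])
        for start in range(0, len(records), run_size)
    ]
    # Phase 2: k-way merge of the sorted runs into one sorted output.
    return list(heapq.merge(*runs))


print(sort_in_runs([5, 3, 9, 1, 7, 2, 8], run_size=3))
```

Because each run is sorted independently, the first phase could in principle be split across processors, which loosely matches the "subsort chores" idea mentioned in the abstract.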
Characterizing the Memory Behavior of Java Workloads: A Structured View and Opportunities for Optimizations
- 2000
Cited by 54 (3 self)
This paper studies the memory behavior of important Java workloads used in benchmarking Java Virtual Machines (JVMs), based on instrumentation of both application and library code in a state-of-the-art JVM, and provides structured information about these workloads to help guide systems' design. We begin by characterizing the inherent memory behavior of the benchmarks, such as information on the breakup of heap accesses among different categories and on the hotness of references to fields and methods. We then provide detailed information about misses in the data TLB and caches, including the distribution of misses over different kinds of accesses and over different methods. In the process, we make interesting discoveries about TLB behavior and limitations of data prefetching schemes discussed in the literature in dealing with pointer-intensive Java codes. Throughout this paper, we develop a set of recommendations to computer architects and compiler writers on how to optimize computer systems and system software to run Java programs more efficiently. This paper also makes the first attempt to compare the characteristics of SPECjvm98 to those of a server-oriented benchmark, pBOB, and explain why the current set of SPECjvm98 benchmarks may not be adequate for a comprehensive and objective evaluation of JVMs and just-in-time (JIT) compilers. We discover that the fraction of accesses to array elements is quite significant, demonstrate that the number of "hot spots" in the benchmarks is small, and show that field reordering cannot yield significant performance gains. We also show that even a fairly large L2 data cache is not effective for many Java benchmarks. We observe that instructions used to prefetch data into the L2 data cache are often squashed because of high TLB miss ...
The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors
Cited by 54 (3 self)
Although cache-coherent shared-memory multiprocessors are sometimes used to run commercial workloads, little work has been done to characterize how well they support such applications. In particular, we do not have many insights on the demands of commercial workloads on the memory subsystem of such machines. In this paper, we analyze the memory access patterns of several queries that are representative of Decision Support Systems (DSS) databases. Our analysis shows that queries differ largely depending on how they access the database data, namely via indices or by sequentially scanning the records. The former queries, which we call Index-Queries, suffer most of their misses on the index data structure and on lock-related metadata structures that we identify. The latter queries, which we call Sequential-Queries, suffer most of their misses on the database records as they are scanned. An analysis of the data locality of the queries shows that both Index-Queries and Sequential-...
Scaling and Characterizing Database Workloads: Bridging the Gap between Research and Practice
- In Proceedings of the 36th International Symposium on Microarchitecture, 2003
Cited by 24 (1 self)
On-Line Transaction Processing (OLTP) workloads are crucial benchmarks for the design and analysis of server processors. Typical cached configurations used by researchers to simulate OLTP workloads are orders of magnitude smaller than the fully scaled configurations used by OEM vendors to achieve world-record transaction processing throughput. The objective of this study is to discover the underlying relationships that characterize OLTP performance over a wide range of configurations. To this end, we have derived the "iron law" of database performance. Using our iron law, we show that both the average instructions executed per transaction (IPX) and the average cycles per instruction (CPI) are critical to the transaction-throughput performance. We use an extensive, empirical examination of an Oracle-based commercial OLTP workload on an Intel Xeon multiprocessor system to characterize the scaling behavior of both the IPX and the CPI. We demonstrate that across a wide range of configurations the IPX and CPI behavior follows predictable trends, which can be accurately characterized by simple linear or piece-wise linear approximations. Based on our data, we propose a method for selecting a minimal, representative workload configuration from which behaviors of much larger OLTP configurations can be accurately extrapolated.
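The abstract names IPX and CPI as the two quantities that drive transaction throughput but does not state the formula. The sketch below assumes, by analogy with the classic iron law of processor performance, that throughput is the cycles available per second divided by the cycles needed per transaction (IPX × CPI). The function name, the `processors` parameter, and the sample numbers are all illustrative assumptions, not values from the paper.

```python
def transactions_per_second(clock_hz, ipx, cpi, processors=1):
    """Estimate transaction throughput from IPX and CPI.

    Assumed relationship (not quoted from the paper):
        throughput = processors * clock_hz / (ipx * cpi)
    where ipx is instructions per transaction and cpi is cycles per
    instruction, so ipx * cpi is cycles per transaction.
    """
    cycles_per_transaction = ipx * cpi
    return processors * clock_hz / cycles_per_transaction


# Made-up example: a 2 GHz processor, one million instructions per
# transaction, four cycles per instruction.
print(transactions_per_second(clock_hz=2e9, ipx=1e6, cpi=4.0))
```

Under this assumed form, halving either IPX or CPI doubles throughput, which is consistent with the abstract's claim that both terms are critical.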
Evaluation of existing architectures in IRAM systems
- In First Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, 1997
Cited by 19 (0 self)
Computer memory systems are increasingly a bottleneck limiting application performance. IRAM architectures, which integrate a CPU with DRAM main memory on a single chip, promise to remove this limitation by providing tremendous main memory bandwidth and significant reductions in memory latency. To determine whether existing microarchitectures can tap the potential performance advantages of IRAM systems, we examined both execution time analyses of existing microprocessors and system simulation of hypothetical processors. Our results indicate that, for current benchmarks, existing architectures, whether simple, superscalar or out-of-order, are unable to exploit IRAM’s increased memory bandwidth and decreased memory latency to achieve significant performance benefits.

