Using Hardware Performance Monitors to Isolate Memory Bottlenecks
- In ACM, editor, Supercomputing, 2000
Cited by 27 (5 self)
In this paper, we present and evaluate two techniques that use different styles of hardware support to provide data structure specific processor cache information. In one approach, hardware performance counter overflow interrupts are used to sample cache misses. In the other, cache misses within regions of memory are counted to perform an n-way search for the areas in which the most misses are occurring. We present a simulation-based study and comparison of the two techniques. We find that both techniques can provide accurate information, and describe the relative advantages and disadvantages of each.
1 Introduction
As processor speeds have rapidly increased, the gap between these speeds and the access time of main memory has widened. Because of this, it is becoming ever more important for applications to make effective use of memory caches. Information about an application's interaction with the cache is therefore crucial to tuning its performance. To be most useful to a programmer...
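The n-way search mentioned in the abstract can be illustrated with a minimal sketch. It assumes the miss addresses are available as a recorded list (in the paper the counts would come from hardware support, not from a trace), and the function name `n_way_search` and its parameters are hypothetical, chosen only for this illustration.

```python
def n_way_search(miss_addrs, lo, hi, n=4, min_size=64):
    """Narrow the address range [lo, hi) to the sub-region with the most misses.

    Assumption: miss_addrs is a recorded list of miss addresses standing in
    for per-region hardware miss counters.
    """
    while hi - lo > min_size:
        width = -(-(hi - lo) // n)  # ceiling division: size of each of the n regions
        counts = [0] * n
        for addr in miss_addrs:
            if lo <= addr < hi:
                counts[(addr - lo) // width] += 1
        # Descend into the region that saw the most misses.
        best = max(range(n), key=counts.__getitem__)
        new_lo = lo + best * width
        lo, hi = new_lo, min(new_lo + width, hi)
    return lo, hi


# Five misses near address 700 and two near address 100:
# the search converges on the 64-byte region containing 700.
print(n_way_search([700] * 5 + [100] * 2, 0, 1024, n=4, min_size=64))  # (640, 704)
```

Each round costs one pass of counting per candidate region, so a larger `n` trades more counters per round for fewer rounds.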
Optimizing main-memory join on modern hardware
- IEEE Transactions on Knowledge and Data Eng, 2002
Cited by 27 (9 self)
Abstract: In the past decade, the exponential growth in commodity CPU speed has far outpaced advances in memory latency. A second trend is that CPU performance advances are brought not only by increased clock rate, but also by increasing parallelism inside the CPU. Current database systems have not yet adapted to these trends and show poor utilization of both CPU and memory resources on current hardware. In this paper, we show how these resources can be optimized for large joins and translate these insights into guidelines for future database architectures, encompassing data structures, algorithms, cost modeling, and implementation. In particular, we discuss how vertically fragmented data structures optimize cache performance on sequential data access. On the algorithmic side, we refine the partitioned hash-join with a new partitioning algorithm called radix-cluster, which is specifically designed to optimize memory access. The performance of this algorithm is quantified using a detailed analytical model that incorporates memory access costs in terms of a limited number of parameters, such as cache sizes and miss penalties. We also present a calibration tool that extracts such parameters automatically from any computer hardware. The accuracy of our models is proven by exhaustive experiments conducted with the Monet database system on three different hardware platforms. Finally, we investigate the effect of implementation techniques that optimize CPU resource usage. Our experiments show that large joins can be accelerated by almost an order of magnitude on modern RISC hardware when both memory and CPU resources are optimized.
Index Terms: Main-memory databases, query processing, memory access optimization, decomposed storage model, join algorithms, implementation techniques.
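The abstract names radix-cluster without describing it. The following is a rough sketch of the general idea of multi-pass partitioning on key bits, so that each pass writes to only a small number of output buckets. It assumes the key itself serves as the hash value and that passes consume the clustering bits from most significant to least significant; the function name and parameters are hypothetical and this is not the paper's implementation.

```python
def radix_cluster(keys, bits_per_pass):
    """Partition integer keys on their lowest sum(bits_per_pass) bits.

    Assumptions: identity hash, and passes take bits left to right.
    Returns 2**sum(bits_per_pass) clusters; cluster i holds the keys whose
    low bits equal i.
    """
    shift = sum(bits_per_pass)
    clusters = [list(keys)]
    for bits in bits_per_pass:
        shift -= bits
        mask = (1 << bits) - 1
        next_clusters = []
        for cluster in clusters:
            # Each pass splits a cluster into only 2**bits buckets,
            # keeping the set of active write targets small.
            buckets = [[] for _ in range(1 << bits)]
            for key in cluster:
                buckets[(key >> shift) & mask].append(key)
            next_clusters.extend(buckets)
        clusters = next_clusters
    return clusters


keys = [5, 2, 7, 0, 3, 6, 1, 4]
print(radix_cluster(keys, [1, 1]))  # [[0, 4], [5, 1], [2, 6], [7, 3]]
```

In this sketch, two passes of one bit each produce the same clusters as a single pass of two bits; the point of splitting the work is to limit how many buckets are written at once in any one pass.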
Perfsuite: An Accessible, Open Source Performance Analysis Environment for Linux
- In Proc. of the Linux Cluster Conference, Chapel, 2005
Cited by 22 (4 self)
This paper describes the motivation, design, implementation, and current status of PerfSuite, a new set of software tools for performance analysis of user applications on Linux-based systems. These tools emphasize ease of use and deployment, as well as portability and reuse, both in implementation details and in data representation and format. After a year of public beta availability and production deployment on Linux clusters that rank among the largest in the country, PerfSuite is gaining acceptance as a user-oriented and flexible software tool set that is as valuable on the desktop as it is on leading-edge terascale clusters.
Performance Evaluation of the SGI Origin2000: A Memory-Centric Characterization of LANL ASCI Applications
- In Supercomputing '97, 1997
Cited by 21 (4 self)
In this paper we compare single-processor performance of the SGI Origin and PowerChallenge and utilize a previously-reported performance model for hierarchical memory systems to explain the results. Both the Origin and PowerChallenge use the same microprocessor (MIPS R10000) but have significant differences in their memory subsystems. Our memory model includes the effect of overlap between CPU and memory operations and allows us to infer the individual contributions of all three improvements in the Origin's memory architecture and relate the effectiveness of each improvement to application characteristics.
1 Introduction
The biggest challenge in the design and use of high-performance computer systems involves managing the disparity between central processing unit (CPU) speed and memory subsystem speed. The need to address this issue is likely to become more acute in the future, because processor speed may double every eighteen months but DRAM memory access speed is expected to inc...
Using Interaction Costs for Microarchitectural Bottleneck Analysis
- ABSTRACT APPEARS IN 36TH INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO ’03), 2003
Cited by 21 (2 self)
Attacking bottlenecks in modern processors is difficult because many microarchitectural events overlap with each other. This parallelism makes it difficult to both (a) assign a cost to an event (e.g., to one of two overlapping cache misses) and (b) assign blame for each cycle (e.g., for a cycle where many, overlapping resources are active). This paper introduces a new model for understanding event costs to facilitate processor design and optimization. First, we observe that everything in a machine (instructions, hardware structures, events) can interact in only one of two ways (in parallel or serially). We quantify these interactions by defining interaction cost, which can be zero (independent, no interaction), positive (parallel), or negative (serial). Second, we illustrate the value of using interaction costs in processor design and optimization. Finally, we propose performance-monitoring hardware for measuring interaction costs that is suitable for modern processors.
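A small worked example may help with the sign convention above. The abstract does not give a formula; the sketch assumes the usual definition, in which the cost of a set of events is the execution time saved by idealizing them, and icost(a, b) = cost({a, b}) - cost({a}) - cost({b}). The toy machine model (execution time as the longest of several event chains) and all function names are assumptions made for illustration.

```python
def exec_time(paths, idealized=frozenset()):
    """Toy model: run time is the longest chain; idealized events take zero cycles."""
    return max(
        sum(latency for event, latency in path if event not in idealized)
        for path in paths
    )


def cost(paths, events):
    """Cycles saved by idealizing the given set of events."""
    return exec_time(paths) - exec_time(paths, frozenset(events))


def interaction_cost(paths, a, b):
    """Assumed definition: icost(a, b) = cost({a, b}) - cost({a}) - cost({b})."""
    return cost(paths, {a, b}) - cost(paths, {a}) - cost(paths, {b})


# Two overlapping misses: removing either alone saves nothing, removing both saves 100.
parallel = [[("a", 100)], [("b", 100)]]
# a and b in series, overlapped with c: removing both saves no more than removing one.
serial = [[("a", 100), ("b", 100)], [("c", 100)]]
# a and b in series with nothing else: the savings simply add up.
independent = [[("a", 100), ("b", 100)]]

print(interaction_cost(parallel, "a", "b"))     # 100  (positive: parallel)
print(interaction_cost(serial, "a", "b"))       # -100 (negative: serial)
print(interaction_cost(independent, "a", "b"))  # 0    (independent)
```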
A performance counter architecture for computing accurate CPI components
- In ASPLOS, 2006
Cited by 19 (3 self)
A common way of representing processor performance is to use Cycles per Instruction (CPI) ‘stacks’ which break performance into a baseline CPI plus a number of individual miss event CPI components. CPI stacks can be very helpful in gaining insight into the behavior of an application on a given microprocessor; consequently, they are widely used by software application developers and computer architects. However, computing CPI stacks on superscalar out-of-order processors is challenging because of various overlaps among execution and miss events (cache misses, TLB misses, and branch mispredictions). This paper shows that meaningful and accurate CPI stacks can be computed for superscalar out-of-order processors. Using interval analysis, a novel method for analyzing out-of-order processor performance, we gain understanding into the performance impact of the various miss events. Based on this understanding, we propose a novel way of architecting hardware performance counters for building accurate CPI stacks. The additional hardware for implementing these counters is limited and comparable to existing hardware performance counter architectures while being significantly more accurate than previous approaches.
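To make the shape of a CPI stack concrete, here is a minimal arithmetic sketch. It assumes cycles have already been attributed to a base component and to individual miss events; how to do that attribution accurately on an out-of-order processor is the paper's contribution and is not modeled here. The function name and the component names are hypothetical.

```python
def cpi_stack(cycles_by_component, instructions):
    """Turn per-component cycle counts into per-component CPI contributions.

    Assumption: the cycle attribution is given; this only does the division.
    """
    return {
        name: cycles / instructions
        for name, cycles in cycles_by_component.items()
    }


stack = cpi_stack(
    {"base": 500_000, "l2_miss": 300_000, "branch_mispredict": 200_000},
    instructions=1_000_000,
)
for name, cpi in stack.items():
    print(f"{name:>18}: {cpi:.2f}")
print(f"{'total CPI':>18}: {sum(stack.values()):.2f}")
```

The components sum to the overall CPI by construction, which is what makes the stack readable as a breakdown of where cycles go.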
A Methodology and an Evaluation of the SGI Origin2000
- IN PROC. OF THE INTL. CONF. ON MEASUREMENT AND MODELING OF COMPUTER SYSTEMS (SIGMETRICS
Cited by 17 (5 self)
As hardware-coherent, distributed shared memory (DSM) multiprocessing becomes popular commercially, it is important to evaluate modern realizations to understand how they perform and scale for a range of interesting applications and to identify the nature of the key bottlenecks. This paper evaluates the SGI Origin2000, the machine that perhaps has the most aggressive communication architecture of the recent cache-coherent offerings, and, in doing so, articulates a sound methodology for evaluating real systems. We examine data access and synchronization microbenchmarks; speedups for different application classes, problem sizes and scaling models; detailed interactions and time breakdowns using performance tools; and the impact of special hardware support. We find that overall the Origin appears to deliver on the promise of cache-coherent shared address space multiprocessing, at least at the 32-processor scale we examine. The machine is quite easy to program for performance and has fewer...
Interaction cost and shotgun profiling
- ACM Transactions on Architecture and Code Optimization, 2004
Cited by 16 (2 self)
We observe that the challenges software optimizers and microarchitects face every day boil down to a single problem: bottleneck analysis. A bottleneck is any event or resource that contributes to execution time, such as a critical cache miss or window stall. Tasks such as tuning processors for energy efficiency and finding the right loads to prefetch all require measuring the performance costs of bottlenecks. In the past, simple event counts were enough to find the important bottlenecks. Today, the parallelism of modern processors makes such analysis much more difficult, rendering traditional performance counters less useful. If two microarchitectural events (such as a fetch stall and a cache miss) occur in the same cycle, which event should we blame for the cycle? What cost should we assign to each event? In this paper, we introduce a new model for understanding event costs to facilitate processor design and optimization. First, we observe that all instructions, hardware structures, and events in a machine can interact in only one of two ways (in parallel or serially). We quantify these interactions by defining interaction cost, which can be zero (independent, no interaction), positive (parallel), or negative (serial). Second, we illustrate the value of using interaction costs in processor design and optimization.
Critical Path Profiling of Message Passing and Shared-Memory Programs
- IEEE Transactions on Parallel and Distributed Systems, 1998
Cited by 15 (4 self)
In this paper, we introduce a runtime, nontrace-based algorithm to compute the critical path profile of the execution of message passing and shared-memory parallel programs. Our algorithm permits starting or stopping the critical path computation during program execution and reporting intermediate values. We also present an online algorithm to compute a variant of critical path, called critical path zeroing, that measures the reduction in application execution time that improving a selected procedure will have. Finally, we present a brief case study to quantify the runtime overhead of our algorithm and to show that online critical path profiling can be used to find program bottlenecks.
Index Terms: Parallel and distributed processing, measurement, tools, program tuning, on-line evaluation.
1 INTRODUCTION
In performance tuning parallel programs, simple sums of sequential metrics, such as CPU utilization, do not ...
Warp processors
- ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006
Cited by 13 (2 self)
We describe a new processing architecture, known as a warp processor, that utilizes a field-programmable gate array (FPGA) to improve the speed and energy consumption of a software binary executing on a microprocessor. Unlike previous approaches that also improve software using an FPGA but do so using a special compiler, a warp processor achieves those improvements completely transparently and operates from a standard binary. A warp processor dynamically detects the binary’s critical regions, re-implements those regions as a custom hardware circuit in the FPGA, and replaces the software region by a call to the new hardware implementation of that region. While not all benchmarks can be improved using warp processing, many can, and the improvements are dramatically better than those achievable by more traditional architecture improvements. The hardest part of warp processing is dynamically re-implementing code regions on an FPGA, which requires partitioning, decompilation, synthesis, placement, and routing tools that all execute with minimal computation time and data memory so as to coexist on-chip with the main processor. We describe our results of developing a warp processor. We developed a custom FPGA fabric specifically designed to enable lean place and route tools, and we developed extremely fast and efficient versions of partitioning, decompilation, synthesis, technology mapping, placement, and routing. Warp processors achieve overall application speedups of 6.3X with energy savings of 66% across a set of embedded benchmark applications. We...

