Results 1 - 10
of
29
Memory System Characterization of Commercial Workloads
- In Proceedings of the 25th annual international symposium on Computer architecture
, 1998
"... Commercial applications such as databases and Web servers constitute the largest and fastest-growing segment of the market for multiprocessor servers. Ongoing innovations in disk subsystems, along with the ever increasing gap between processor and memory speeds, have elevated memory system design as ..."
Abstract
-
Cited by 203 (5 self)
- Add to MetaCart
Commercial applications such as databases and Web servers constitute the largest and fastest-growing segment of the market for multiprocessor servers. Ongoing innovations in disk subsystems, along with the ever increasing gap between processor and memory speeds, have elevated memory system design as the critical performance factor for such workloads. However, most current server designs have been optimized to perform well on scientific and engineering workloads, potentially leading to design decisions that are non-ideal for commercial applications. The above problem is exacerbated by the lack of information on the performance requirements of commercial workloads, the lack of available applications for widespread study, and the fact that most representative applications are too large and complex to serve as suitable benchmarks for evaluating trade-offs in the design of processors and servers. This paper presents a detailed performance study of three important classes of commercial workloads: online transaction processing (OLTP), decision support systems (DSS), and Web index search. We use the Oracle commercial database engine for our OLTP and DSS workloads, and the AltaVista search engine for our Web index search workload. This study characterizes the memory system behavior of these workloads through a large number of architectural experiments on Alpha multiprocessors augmented with full system simulations to determine the impact of architectural trends. We also identify a set of simplifications that make these workloads more amenable to monitoring and simulation without affecting representative memory system behavior. We observe that systems optimized for OLTP versus DSS and index search workloads may lead to diverging designs, specifically in the size and speed requirements for off-chip caches. 1
Cross-Architecture Performance Predictions for Scientific Applications Using Parameterized Models
, 2004
"... This paper describes a toolkit for semi-automatically measuring and modeling static and dynamic characteristics of applications in an architecture-neutral fashion. For predictable applications, models of dynamic characteristics have a convex and diferentiable profile. Our toolkit operates on applica ..."
Abstract
-
Cited by 74 (2 self)
- Add to MetaCart
This paper describes a toolkit for semi-automatically measuring and modeling static and dynamic characteristics of applications in an architecture-neutral fashion. For predictable applications, models of dynamic characteristics have a convex and diferentiable profile. Our toolkit operates on application binaries and succeeds in modeling key application characteristics that determine program performance. We use these characterizations to explore the interactions between an application and a target architecture. We apply our toolkit to SPARC binaries to develop architectureneutral models of computation and memory access patterns of the ASCI Sweep3D and the NAS SP, BT and LU benchmarks. From our models, we predict the L1, L2 and TLB cache miss counts as well as the overall execution time of these applications on an Origin 2000 system. We evaluate our predictions by comparing them against measurements collected using hardware performance counters.
Improving effective bandwidth through compiler enhancement of global cache reuse
- In Proceedings of International Parallel and Distributed Processing Symposium
, 2001
"... While CPU speed has been improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply simply cannot keep a CPU busy, and applications often utilize only a few percent o ..."
Abstract
-
Cited by 62 (17 self)
- Add to MetaCart
While CPU speed has been improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply simply cannot keep a CPU busy, and applications often utilize only a few percent of peak CPU performance. The hardware solution, which provides layers of high-bandwidth data cache, is not effective for large and complex applications primarily for two reasons: far-separated data reuse and large-stride data access. The first repeats unnecessary transfer and the second communicates useless data. Both waste memory bandwidth. This dissertation pursues a software remedy. It investigates the potential for compiler optimizations to alter program behavior and reduce its memory bandwidth consumption. To this end, this research has studied a two-step transformation strategy: first fuse computations on the same data and then group data used by the same computation. Existing techniques such as loop blocking can be viewed as an application of this strategy within a single loop nest. In order to carry out this strategy
FLASH vs. (Simulated) FLASH: Closing the Simulation Loop
, 2000
"... Simulation is the primary method for evaluating computer systems during all phases of the design process. One significant problem with simulation is that it rarely models the system exactly, and quantifying the resulting simulator error can be di#cult. More importantly, architects often assume witho ..."
Abstract
-
Cited by 47 (0 self)
- Add to MetaCart
Simulation is the primary method for evaluating computer systems during all phases of the design process. One significant problem with simulation is that it rarely models the system exactly, and quantifying the resulting simulator error can be di#cult. More importantly, architects often assume without proof that although their simulator may make inaccurate absolute performance predictions, it will still accurately predict architectural trends. This paper studies the source and magnitude of error in a range of architectural simulators by comparing the simulated execution time of several applications and microbenchmarks to their execution time on the actual hardware being modeled. The existence of a hardware gold standard allows us to find, quantify, and fix simulator inaccuracies. We then use the simulators to predict architectural trends and analyze the sensitivity of the results to the simulator configuration. We find that most of our simulators predict trends accurately, as long as ...
Compiler-Controlled Memory
- In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems
, 1998
"... Optimizations aimed at reducing the impact of memory operations on execution speed have long concentrated on improving cache performance. These efforts achieve a. reasonable level of success. The primary limit on the compiler’s ability to improve memory behavior is its im-perfect knowledge about the ..."
Abstract
-
Cited by 46 (0 self)
- Add to MetaCart
Optimizations aimed at reducing the impact of memory operations on execution speed have long concentrated on improving cache performance. These efforts achieve a. reasonable level of success. The primary limit on the compiler’s ability to improve memory behavior is its im-perfect knowledge about the run-time behavior of the program. The compiler cannot completely predict run-time access patterns. There is an exception to this rule. During the reg-ister allocation phase, the compiler often must insert substantial amount,s of spill code; that is, instructions that move values from registers to memory and back again. Because the compiler itself inserts these memory instructions, it has more knowledge about them than other memory operations in the program. Spill-code operations are disjoint from the memory manipulations required by the semantics of the program being compiled, and, indeed, the two can interfere in the cache. This paper proposes a hardware solution to the problem of increased spill costs-a small compiler-con-trolled memory (CCM) to hold spilled values. This small random-access memory can (and should) be placed in a distinct address space from the main memory hierar-chy. The compiler can target spill instructions to use the CCM, moving most compiler-inserted memory traf-fic out of the pathway to main memory and eliminating any impact that those spill instructions would have on the state of the main memory hierarchy. Such mem-ories already exist on some DSP microprocessors. Our techniques can be applied directly on those chips. This paper presents two compiler-based methods to exploit such a memory, along with experimental results showing that speedups from using CCM may be sizable. It shows that using the register allocation’s coloring paradigm to assign spilled values to memory can greatly reduce the amount of memory required by a program. Permtsslon to make dIgItal or hard copies of all or part of this work for personal or classroom “se IS granted wthout fee prowded that copnes are not made or distributed for profn or commercial advan-tage and that copws bear thts notice and the full c!tatmn on the Hurst page.
Timestamp snooping: An approach for extending SMPs
- In Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems
, 2000
"... Symmetric multiprocessor (SMP) servers provide superior performance for the commercial workloads that dominate the Internet. Our simulation results show that over one-third of cache misses by these applications result in cache-to-cache transfers, where the data is found in another processor’s cache ..."
Abstract
-
Cited by 23 (7 self)
- Add to MetaCart
Symmetric multiprocessor (SMP) servers provide superior performance for the commercial workloads that dominate the Internet. Our simulation results show that over one-third of cache misses by these applications result in cache-to-cache transfers, where the data is found in another processor’s cache rather than in memory. SMPs are optimized for this case by using snooping protocols that broadcast address transactions to all processors. Conversely, directory-based shared-memory systems must indirectly locate the owner and sharers through a directory, resulting in larger average miss latencies. This paper proposes timestamp snooping, a technique that allows SMPs to i) utilize high-speed switched interconnection networks and ii) exploit physical locality by delivering address transactions to processors and memories without regard to order. Traditional snooping requires physical ordering of transactions. Timestamp snooping works by processing address transactions in a logical order. Logical time is maintained by adding a few bits per address transaction and having network switches perform a handshake to ensure on-time delivery. Processors and memories then reorder transactions based on their timestamps to establish a total order. We evaluate timestamp snooping with commercial workloads on a 16-processor SPARC system using the Simics full-system simulator. We simulate both an indirect (butterfly) and a direct (torus) network design. For OLTP, DSS, web serving, web searching, and one scientific application, timestamp snooping with the butterfly network runs 6-28 % faster than directories, at a cost of 13-43 % more link traffic. Similarly, with the torus network, timestamp snooping runs 6-29 % faster for 17-37% more link traffic. Thus, timestamp snooping is worth considering when buying more interconnect bandwidth is easier than reducing interconnect latency. 1
A Case for User-Level Dynamic Page Migration
, 2000
"... This paper presents user-level dynamic page migration, a runtime technique which transparently enables parallel pro-grams to tune their memory performance on distributed shared memory multiprocessors, with feedback obtained from dynamic monitoring of memory activity. Our technique exploits the itera ..."
Abstract
-
Cited by 11 (8 self)
- Add to MetaCart
This paper presents user-level dynamic page migration, a runtime technique which transparently enables parallel pro-grams to tune their memory performance on distributed shared memory multiprocessors, with feedback obtained from dynamic monitoring of memory activity. Our technique exploits the iterative nature of parallel programs and information available to the program both at compile time and at runtime in order to improve the accuracy and the timeliness of page migrations, as well as amortize better the overhead, compared to page migration engines implemented in the operating system. We present an adaptive page migration algorithm based on a competitive and a predictive criterion. The competitive criterion is used to correct poor page placement decisions of the operating system, while the predictive criterion makes the algorithm respon-sive to scheduling events that necessitate immediate page migrations, such as preemptions and migrations of threads. We also present a new technique for preventing page ping-pong and a mechanism for monitoring the performance of page migration algorithms at runtime and tuning their sen-sitive parameters accordingly. Our experimental evidence on a SGI Origin2000 shows that unmodified OpenMP codes linked with our runtime system for dynamic page migration are effectively immune to the page placement strategy of the operating system and the associated problems with data locality. Furthermore, our runtime system achieves solid performance improvements compared to the IRIX 6.5.5 page migration engine, for single parallel OpenMP codes and multiprogrammed workloads.
The Trade-Off between Implicit and Explicit Data Distribution in Shared-Memory Programming Paradigms
- In Proc. of the 15th ACM International Conference on Supercomputing (ICS’2001
, 2001
"... ..."
Performance Prediction for Random Write Reductions: A Case Study in Modeling Shared Memory Programs
- ACM SIGMETRICS intern.conf. on Measur. and Model. of computer systems, p117 – 128, 2002
, 2002
"... In this paper, we revisit the problem of performance prediction on shared memory parallel machines, motivated by the need for selecting parallelization strategy for random write reductions. Such reductions frequently arise in data mining algorithms. ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
In this paper, we revisit the problem of performance prediction on shared memory parallel machines, motivated by the need for selecting parallelization strategy for random write reductions. Such reductions frequently arise in data mining algorithms.
Comparing the memory system performance of the HP V-Class and SGI Origin 2000 multiprocessors using microbenchmarks and scientific applications
- IN PROC. OF ACM ICS
, 1999
"... As processor technology continues to advance at a rapid pace, the principal performance bottleneck of shared memory systems has become the memory access latency. In order to understand the effects of cache and memory hierarchy on system latencies, performance analysts perform benchmark analysis on e ..."
Abstract
-
Cited by 6 (6 self)
- Add to MetaCart
As processor technology continues to advance at a rapid pace, the principal performance bottleneck of shared memory systems has become the memory access latency. In order to understand the effects of cache and memory hierarchy on system latencies, performance analysts perform benchmark analysis on existing state-of-the-art multiprocessors. In this study, we present a detailed comparison of the memory system of two recent commercial ventures, the HP V-Class and the SGI Origin 2000. Our goal is to compare and contrast design techniques used in these multiprocessors to tolerate the effect of memory latency. Our experimental methodology uses microbenchmarks as well as scientific applications to characterize the user-level performance.

