Memory System Characterization of Commercial Workloads
In Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998
Cited by 203 (5 self)
Commercial applications such as databases and Web servers constitute the largest and fastest-growing segment of the market for multiprocessor servers. Ongoing innovations in disk subsystems, along with the ever increasing gap between processor and memory speeds, have elevated memory system design as the critical performance factor for such workloads. However, most current server designs have been optimized to perform well on scientific and engineering workloads, potentially leading to design decisions that are non-ideal for commercial applications. The above problem is exacerbated by the lack of information on the performance requirements of commercial workloads, the lack of available applications for widespread study, and the fact that most representative applications are too large and complex to serve as suitable benchmarks for evaluating trade-offs in the design of processors and servers. This paper presents a detailed performance study of three important classes of commercial workloads: online transaction processing (OLTP), decision support systems (DSS), and Web index search. We use the Oracle commercial database engine for our OLTP and DSS workloads, and the AltaVista search engine for our Web index search workload. This study characterizes the memory system behavior of these workloads through a large number of architectural experiments on Alpha multiprocessors augmented with full system simulations to determine the impact of architectural trends. We also identify a set of simplifications that make these workloads more amenable to monitoring and simulation without affecting representative memory system behavior. We observe that systems optimized for OLTP versus DSS and index search workloads may lead to diverging designs, specifically in the size and speed requirements for off-chip caches.
Bug Isolation via Remote Program Sampling
In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, 2003
Cited by 193 (15 self)
We propose a low-overhead sampling infrastructure for gathering information from the executions experienced by a program's user community. Several example applications illustrate ways to use sampled instrumentation to isolate bugs. Assertion-dense code can be transformed to share the cost of assertions among many users. Lacking assertions, broad guesses can be made about predicates that predict program errors and a process of elimination used to whittle these down to the true bug. Finally, even for non-deterministic bugs such as memory corruption, statistical modeling based on logistic regression allows us to identify program behaviors that are strongly correlated with failure and are therefore likely places to look for the error.
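The idea of sharing assertion cost through sampling can be illustrated with a minimal sketch. It assumes each instrumentation site should be evaluated with a fixed probability and uses a geometrically distributed countdown so that the common path is only a counter decrement. The class name `SampledChecks`, its methods, and the site labels are hypothetical and are not taken from the paper's implementation.

```python
import math
import random
from collections import Counter


class SampledChecks:
    """Evaluate each instrumentation site with probability `rate` (hypothetical helper)."""

    def __init__(self, rate, seed=None):
        if not 0.0 < rate <= 1.0:
            raise ValueError("rate must be in (0, 1]")
        self.rate = rate
        self.rng = random.Random(seed)
        self.observed = Counter()  # site -> number of times the site was sampled
        self.violated = Counter()  # site -> number of sampled runs where the predicate was false
        self.countdown = self._next_countdown()

    def _next_countdown(self):
        # Number of sites to pass until the next sample, geometrically
        # distributed with success probability `rate` (always >= 1).
        if self.rate == 1.0:
            return 1
        u = 1.0 - self.rng.random()  # in (0, 1], so log(u) is defined
        return int(math.log(u) / math.log(1.0 - self.rate)) + 1

    def check(self, site, predicate):
        """Fast path: decrement a counter. Slow path: run the predicate and record it."""
        self.countdown -= 1
        if self.countdown > 0:
            return
        self.countdown = self._next_countdown()
        self.observed[site] += 1
        if not predicate():
            self.violated[site] += 1
```

In such a scheme, the per-site counts from many users would be aggregated offline; sites whose violations correlate with failing runs are candidates for closer inspection.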
An Infrastructure for Adaptive Dynamic Optimization
2003
Cited by 130 (5 self)
Dynamic optimization is emerging as a promising approach to overcome many of the obstacles of traditional static compilation. But while there are a number of compiler infrastructures for developing static optimizations, there are very few for developing dynamic optimizations. We present a framework for implementing dynamic analyses and optimizations. We provide an interface for building external modules, or clients, for the DynamoRIO dynamic code modification system. This interface abstracts away many low-level details of the DynamoRIO runtime system while exposing a simple and powerful, yet efficient and lightweight, API. This is achieved by restricting optimization units to linear streams of code and using adaptive levels of detail for representing instructions. The interface is not restricted to optimization and can be used for instrumentation, profiling, dynamic translation, etc. To demonstrate
An analysis of database workload performance on simultaneous multithreaded processors
In Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998
Cited by 119 (13 self)
Simultaneous multithreading (SMT) is an architectural technique in which the processor issues multiple instructions from multiple threads each cycle. While SMT has been shown to be effective on scientific workloads, its performance on database systems is still an open question. In particular, database systems have poor cache performance, and the addition of multithreading has the potential to exacerbate cache conflicts. This paper examines database performance on SMT processors using traces of the Oracle database management system. Our research makes three contributions. First, it characterizes the memory-system behavior of database systems running on-line transaction processing and decision support system workloads. Our data show that while DBMS workloads have large memory footprints, there is substantial data reuse in a small, cacheable “critical” working set. Second, we show that the additional data cache conflicts caused by simultaneous-multithreaded instruction scheduling can be nearly eliminated by the proper choice of software-directed policies for virtual-to-physical page mapping and per-process address offsetting. Our results demonstrate that with the best policy choices, D-cache miss rates on an 8-context SMT are roughly equivalent to those on a single-threaded superscalar. Multithreading also leads to better interthread instruction cache sharing, reducing I-cache miss rates by up to 35%. Third, we show that SMT’s latency tolerance is highly effective for database applications. For example, using a memory-intensive OLTP workload, an 8-context SMT processor achieves a 3-fold increase in instruction throughput over a single-threaded superscalar with similar resources.
ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors
1997
Cited by 118 (2 self)
Profile data is valuable for identifying performance bottlenecks and guiding optimizations. Periodic sampling of a processor's performance monitoring hardware is an effective, unobtrusive way to obtain detailed profiles. Unfortunately, existing hardware simply counts events, such as cache misses and branch mispredictions, and cannot accurately attribute these events to instructions, especially on out-of-order machines. We propose an alternative approach, called ProfileMe, that samples instructions. As a sampled instruction moves through the processor pipeline, a detailed record of all interesting events and pipeline stage latencies is collected. ProfileMe also supports paired sampling, which captures information about the interactions between concurrent instructions, revealing information about useful concurrency and the utilization of various pipeline stages while an instruction is in flight. We describe an inexpensive hardware implementation of ProfileMe, outline a variety of software...
Energy-Efficient Soft Real-Time CPU Scheduling for Mobile Multimedia Systems
2003
Cited by 87 (7 self)
This paper presents GRACE-OS, an energy-efficient soft real-time CPU scheduler for mobile devices that primarily run multimedia applications. The major goal of GRACE-OS is to support application quality of service and save energy. To achieve this goal, GRACE-OS integrates dynamic voltage scaling into soft real-time scheduling and decides how fast to execute applications in addition to when and how long to execute them. GRACE-OS makes such scheduling decisions based on the probability distribution of application cycle demands, and obtains the demand distribution via online profiling and estimation. We have implemented GRACE-OS in the Linux kernel and evaluated it on an HP laptop with a variable-speed CPU and multimedia codecs. Our experimental results show that (1) the demand distribution of the studied codecs is stable or changes smoothly. This stability implies that it is feasible to perform stochastic scheduling and voltage scaling with low overhead; (2) GRACE-OS delivers soft performance guarantees by bounding the deadline miss ratio under application-specific requirements; and (3) GRACE-OS reduces CPU idle time and spends more busy time in lower-power speeds. Our measurement indicates that compared to deterministic scheduling and voltage scaling, GRACE-OS saves energy by 7% to 72% while delivering statistical performance guarantees.
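A deliberately simplified sketch can illustrate how scheduling decisions might be derived from a profiled cycle-demand distribution. It assumes the cycle budget is taken as an empirical quantile of observed demands and that a single uniform speed is chosen per period; the abstract does not give the actual algorithm, so the function names, the quantile parameter `rho`, and the speed list are all assumptions made for illustration.

```python
import math


def cycle_budget(demands, rho):
    """Smallest observed demand C such that at least a fraction rho of jobs need <= C cycles."""
    ordered = sorted(demands)
    index = max(0, math.ceil(rho * len(ordered)) - 1)
    return ordered[index]


def lowest_feasible_speed(budget_cycles, period_s, speeds_hz):
    """Slowest available speed that completes the budget within one period.

    Falls back to the fastest speed when no speed is sufficient.
    """
    for speed in sorted(speeds_hz):
        if budget_cycles / speed <= period_s:
            return speed
    return max(speeds_hz)
```

A caller would profile per-job cycle counts online, recompute the budget when the distribution shifts, and pass the result to `lowest_feasible_speed` together with the task period. Jobs that exceed the budget would miss their deadline, which is the sense in which the guarantee is statistical rather than hard.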
End-System Optimizations for High-Speed TCP
IEEE Communications Magazine, 2000
Cited by 84 (4 self)
Modern TCP implementations are capable of very high point-to-point bandwidths. Delivered performance on the fastest networks is often limited by the sending and receiving hosts, rather than by the network hardware or the TCP protocol implementation itself. In this case, systems can achieve higher bandwidth by reducing host overheads through a variety of optimizations above and below the TCP protocol stack, given support from the network interface. This paper surveys the most important of these optimizations, and illustrates their effects quantitatively with empirical results from an experimental network delivering up to two gigabits per second of point-to-point TCP bandwidth.
1 Introduction
Good TCP/IP protocol implementations are capable of transferring data at a high percentage of available network link bandwidth, reflecting the success of many years of refinements to TCP/IP protocol handling software and policies. On the fastest networks, application-to-application throughput is...
The benefits of event-driven energy accounting in power-sensitive systems
In Proceedings of the 9th ACM SIGOPS European Workshop, 2000
Cited by 70 (1 self)
A prerequisite of energy-aware scheduling is precise knowledge of any activity inside the computer system. Embedded hardware monitors (e.g., processor performance counters) have proved to offer valuable information in the field of performance analysis. The same approach can be applied to investigate the energy usage patterns of individual threads. We use information about active hardware units (e.g., integer/floating-point unit, cache/memory interface) gathered by event counters to establish a thread-specific energy accounting. The evaluation shows that the correlation of events and energy values provides the necessary information for energy-aware scheduling policies. Our approach to OS-directed power management adds the energy usage pattern to the runtime context of a thread. Depending on the field of application we present two scenarios that
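Thread-specific energy accounting from event counters can be sketched as a weighted sum of counter increments charged to the thread that was running. The sketch below assumes a linear model; the event names, the weights in `ENERGY_PER_EVENT_NJ`, and the class `EnergyAccountant` are hypothetical placeholders, and real weights would have to be calibrated against measured energy.

```python
from collections import defaultdict

# Hypothetical per-event energy weights in nanojoules (illustrative values only).
ENERGY_PER_EVENT_NJ = {"int_op": 1.0, "fp_op": 2.5, "mem_access": 10.0}


class EnergyAccountant:
    """Accumulate estimated energy per thread from event-counter deltas."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.energy_nj = defaultdict(float)  # thread id -> estimated energy so far

    def charge(self, thread_id, counter_deltas):
        """Charge a thread for the counter increments observed during its time slice."""
        cost = sum(self.weights[event] * count for event, count in counter_deltas.items())
        self.energy_nj[thread_id] += cost
        return cost
```

A scheduler could read the counters at each context switch, pass the deltas to `charge`, and then use the accumulated per-thread totals as one input to an energy-aware policy.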
A Hardware-Driven Profiling Scheme for Identifying Program Hot Spots to Support Runtime Optimization
In Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999
Cited by 68 (4 self)
This paper presents a novel hardware-based approach for identifying, profiling, and monitoring hot spots in order to support runtime optimization of general-purpose programs. The proposed approach consists of a set of tightly coupled hardware tables and control logic modules that are placed in the retirement stage of a processor pipeline removed from the critical path. The features of the proposed design include rapid detection of program hot spots after changes in execution behavior, runtime-tunable selection criteria for hot spot detection, and negligible overhead during application execution. Experiments using several SPEC95 benchmarks, as well as several large Windows NT applications, demonstrate the promise of the proposed design.
1 Introduction
Optimizing compilers can gain significant performance benefits by performing code transformations based on a program's runtime profile. Traditionally, profiles are collected by running an instrumented version of the executable. However, bec...

