Results 11 - 20 of 81
Thread-Sensitive Scheduling for SMT Processors
2000
Cited by 40 (2 self)
A simultaneous-multithreaded (SMT) processor executes multiple instructions from multiple threads every cycle. As a result, threads on SMT processors -- unlike those on traditional shared-memory machines -- simultaneously share all low-level hardware resources in a single CPU. Because of this fine-grained resource sharing, SMT threads have the ability to interfere or conflict with each other, as well as to share these resources to mutual benefit. This paper examines thread-sensitive scheduling for SMT processors. When more threads exist than hardware execution contexts, the operating system is responsible for selecting which threads to execute at any instant, inherently deciding which threads will compete for resources. Thread-sensitive scheduling uses thread-behavior feedback to choose the best set of threads to execute together, in order to maximize processor throughput. We introduce several thread-sensitive scheduling schemes and compare them to traditional oblivious schemes, such as round-robin. Our measurements show how these scheduling algorithms impact performance and the utilization of low-level hardware resources. We also demonstrate how thread-sensitive scheduling algorithms can be tuned to trade off performance and fairness. For the workloads we measured, we show that an IPC-based thread-sensitive scheduling algorithm can achieve speedups over oblivious schemes of 7% to 15%, with minimal hardware costs.
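As a rough illustration of the contrast this abstract draws between oblivious and thread-sensitive selection, here is a minimal Python sketch. It is not the paper's algorithm: the function names, the per-thread `last_ipc` feedback table, and the `aging_weight` fairness knob are all hypothetical, chosen only to show how feedback could drive the choice of co-scheduled threads.

```python
def round_robin_pick(threads, contexts, start):
    """Oblivious baseline: take the next `contexts` threads in circular order."""
    n = len(threads)
    return [threads[(start + i) % n] for i in range(min(contexts, n))]


def ipc_sensitive_pick(last_ipc, contexts, waited=None, aging_weight=0.0):
    """Feedback-driven selection (hypothetical policy, for illustration only).

    last_ipc:     thread id -> IPC observed the last time the thread ran
    waited:       thread id -> number of quanta spent waiting (optional)
    aging_weight: 0.0 favors throughput only; larger values favor fairness
    """
    waited = waited or {}

    def score(tid):
        return last_ipc[tid] + aging_weight * waited.get(tid, 0)

    # Highest score first; ties are broken by thread id to stay deterministic.
    ranked = sorted(last_ipc, key=lambda tid: (-score(tid), tid))
    return ranked[:contexts]


if __name__ == "__main__":
    ipc = {"a": 1.2, "b": 0.4, "c": 2.0, "d": 0.9}
    print(round_robin_pick(list(ipc), contexts=2, start=3))
    print(ipc_sensitive_pick(ipc, contexts=2))
    print(ipc_sensitive_pick(ipc, contexts=2, waited={"b": 3}, aging_weight=1.0))
```

With `aging_weight` at zero the sketch always picks the highest-IPC threads, which can starve slow ones; raising it trades throughput for fairness, which is the kind of tuning the abstract mentions.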
Computation spreading: Employing hardware migration to specialize CMP cores on-the-fly
In Proc. of 12th ASPLOS, 2006
Cited by 39 (7 self)
In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different times, we observe extensive code reuse among different processors, causing redundancy (e.g., in our server workloads, 45–65% of all instruction blocks are accessed by all processors). Moreover, largely independent fragments of computation compete for the same private resources, causing destructive interference. Together, this redundancy and interference lead to poor utilization of private microarchitecture resources such as caches and branch predictors. We present Computation Spreading (CSP), which employs hardware migration to distribute a thread’s dissimilar fragments of computation across the multiple processing cores of a chip multiprocessor (CMP), while grouping similar computation fragments from different threads together. This paper focuses on a specific example of CSP for OS-intensive server applications: separating application-level (user) computation from the OS calls it makes. When performing CSP, each core becomes temporally specialized to execute certain computation fragments, and the same core is repeatedly used for such fragments. We examine two specific thread assignment policies for CSP, and show that these policies, across four server workloads, are able to reduce instruction misses in private L2 caches by 27–58%, private L2 load misses by 0–19%, and branch mispredictions by 9–25%.
A Performance Comparison of DRAM Memory System Optimizations for SMT Processors
In HPCA-11, 2005
Cited by 27 (3 self)
Memory system optimizations have been well studied on single-threaded systems; however, the wide use of simultaneous multithreading (SMT) techniques raises questions over their effectiveness in the new context. In this study, we thoroughly evaluate contemporary multi-channel DDR SDRAM and Rambus DRAM systems in SMT systems, and search for new thread-aware DRAM optimization techniques. Our major findings are: (1) in general, increasing the number of threads tends to increase the memory concurrency and thus the pressure on DRAM systems, but some exceptions do exist; (2) the application performance is sensitive to memory channel organizations, e.g., independent channels may outperform ganged organizations by up to 90%; (3) the DRAM latency reduction through improving row buffer hit rates becomes less effective due to increased bank contention; and (4) thread-aware DRAM access scheduling schemes may improve performance by up to 30% on workload mixes of memory-intensive applications. In short, the use of SMT techniques has somewhat changed the context of DRAM optimizations but does not make them obsolete.
DBmbench: Fast and Accurate Database Workload Representation on Modern Microarchitecture
In Proceedings of the IBM Center for Advanced Studies Conference, 2005
Cited by 27 (14 self)
With the proliferation of database workloads on servers, much recent research on server architecture has focused on database system benchmarks. The TPC benchmarks for the two most common server workloads, OLTP and DSS, have been used extensively in the database community to evaluate the database system functionality and performance. Unfortunately, these benchmarks fall short of being effective in microarchitecture and memory system research due to several key shortcomings. First, setting up the experimental environment and tuning these benchmarks to match the workload behavior of interest involves extremely complex procedures. Second, the benchmarks themselves are complex and preclude accurate correlation of microarchitecture- and memory-level bottlenecks to dominant workload characteristics. Finally, industrial-grade configurations of such benchmarks are too large and preclude their use in detailed but slow microarchitectural simulation studies of future servers. In this paper, we first present an analysis of the dominant behavior in DSS and OLTP workloads, and highlight their key processor and memory performance characteristics. We then introduce a systematic scaling framework to scale down the TPC benchmarks. Finally, we propose the DBmbench, consisting of two substantially scaled-down benchmarks: µTPC-H and µTPC-C that accurately (>95%) capture the processor and memory performance behavior of DSS and OLTP workloads. Copyright © 2005 Minglong Shao. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.
Architectural support for enhanced SMT job scheduling
In Proc. of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT’04), 2004
Cited by 27 (2 self)
By converting thread-level parallelism to instruction-level parallelism, Simultaneous Multithreaded (SMT) processors are emerging as effective ways to utilize the resources of modern superscalar architectures. However, the full potential of SMT has not yet been reached as most modern operating systems use existing single-thread or multiprocessor algorithms to schedule threads, neglecting contention for resources between threads. To date, even the best SMT scheduling algorithms simply try to group threads for co-residency based on each thread’s expected resource utilization but do not take into account variance in thread behavior. As such, we introduce architectural support that enables new thread scheduling algorithms to group threads for co-residency based on fine-grain memory system activity information. The proposed memory monitoring framework centers on the concept of a cache activity vector, which exposes runtime cache resource information to the operating system to improve job scheduling. Using this scheduling technique, we experimentally evaluate the overall performance improvement of workloads on an SMT machine compared against the most recent Linux job scheduler. This work is first motivated with experiments in a simulated environment, then validated on a Hyperthreading-enabled Intel Pentium-4 Xeon microprocessor running a modified version of the latest Linux kernel.
Scaling and Characterizing Database Workloads: Bridging the Gap between Research and Practice
In Proceedings of the 36th International Symposium on Microarchitecture, 2003
Cited by 24 (1 self)
On-Line Transaction Processing (OLTP) workloads are crucial benchmarks for the design and analysis of server processors. Typical cached configurations used by researchers to simulate OLTP workloads are orders of magnitude smaller than the fully scaled configurations used by OEM vendors to achieve world-record transaction processing throughput. The objective of this study is to discover the underlying relationships that characterize OLTP performance over a wide range of configurations. To this end, we have derived the "iron law" of database performance. Using our iron law, we show that both the average instructions executed per transaction (IPX) and the average cycles per instruction (CPI) are critical to the transaction-throughput performance. We use an extensive, empirical examination of an Oracle-based commercial OLTP workload on an Intel Xeon multiprocessor system to characterize the scaling behavior of both the IPX and the CPI. We demonstrate that across a wide range of configurations the IPX and CPI behavior follows predictable trends, which can be accurately characterized by simple linear or piece-wise linear approximations. Based on our data, we propose a method for selecting a minimal, representative workload configuration from which behaviors of much larger OLTP configurations can be accurately extrapolated.
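The abstract names IPX and CPI as the two factors behind transaction throughput but does not spell out the relationship. The sketch below follows only from the definitions (cycles per transaction = IPX × CPI) and is not necessarily the paper's exact formulation. The function name and the assumption that every processor spends all of its cycles on transaction work are illustrative.

```python
def transactions_per_second(clock_hz, processors, ipx, cpi):
    """Throughput implied by the definitions of IPX and CPI.

    clock_hz:   processor clock frequency in cycles per second
    processors: number of processors, assumed fully busy with transactions
    ipx:        average instructions executed per transaction
    cpi:        average cycles per instruction
    """
    if ipx <= 0 or cpi <= 0:
        raise ValueError("ipx and cpi must be positive")
    cycles_per_transaction = ipx * cpi
    return processors * clock_hz / cycles_per_transaction


if __name__ == "__main__":
    # Hypothetical numbers: 4 processors at 2 GHz, 1M instructions per
    # transaction, 2 cycles per instruction.
    print(transactions_per_second(2e9, 4, 1e6, 2.0))
```

Because IPX and CPI enter as a product, a configuration change that lowers one while raising the other can leave throughput unchanged, which is why both need to be characterized.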
Data Page Layouts for Relational Databases on Deep Memory Hierarchies
2002
Cited by 24 (2 self)
Relational database systems have traditionally optimized for I/O performance and organized records sequentially on disk pages using the N-ary Storage Model (NSM) (a.k.a., slotted pages).
Predicting multiple metrics for queries: Better decisions enabled by machine learning
In ICDE, 2009
Cited by 23 (1 self)
One of the most challenging aspects of managing a very large data warehouse is identifying how queries will behave before they start executing. Yet knowing their performance characteristics (their runtimes and resource usage) can solve two important problems. First, every database vendor struggles with managing unexpectedly long-running queries. When these long-running queries can be identified before they start, they can be rejected or scheduled when they will not cause extreme resource contention for the other queries in the system. Second, deciding whether a system can complete a given workload in a given time period (or a bigger system is necessary) depends on knowing the resource requirements of the queries in that workload. We have developed a system that uses machine learning to accurately predict the performance metrics of database queries whose execution times range from milliseconds to hours. For training and testing our system, we used both real customer queries and queries generated from an extended set of TPC-DS templates. The extensions mimic queries that caused customer problems. We used these queries to compare how accurately different techniques predict metrics such as elapsed time, records used, disk I/Os, and message bytes. The most promising technique was not only the most accurate, but also predicted these metrics simultaneously and using only information available prior to query execution. We validated the accuracy of this machine learning technique on a number of HP Neoview configurations. We were able to predict individual query elapsed time within 20% of its actual time for 85% of the test queries. Most importantly, we were able to correctly identify both the short and long-running (up to two-hour) queries to inform workload management and capacity planning.
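The accuracy figure quoted above ("within 20% of its actual time for 85% of the test queries") is a fraction-within-tolerance metric. A minimal sketch of how such a figure could be computed follows. The function name and the sample numbers are hypothetical, and relative error is measured against the actual value, which is one reasonable reading of the abstract.

```python
def fraction_within(predicted, actual, tolerance=0.20):
    """Share of predictions whose error is at most `tolerance` times the actual value."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(
        1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance * a
    )
    return hits / len(actual)


if __name__ == "__main__":
    # Hypothetical elapsed times in seconds, not data from the paper.
    predicted = [10, 50, 130, 400]
    actual = [11, 40, 120, 390]
    print(fraction_within(predicted, actual))
```

Under this reading, the abstract's claim corresponds to `fraction_within(...)` returning about 0.85 on the test queries at the default tolerance.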
Cache-conscious frequent pattern mining on a modern processor
In Proceedings of the International Conference on Very Large Data Bases (VLDB), 2005
Cited by 22 (6 self)
In this paper, we examine the performance of frequent pattern mining algorithms on a modern processor. A detailed performance study reveals that even the best frequent pattern mining implementations, with highly efficient memory managers, still grossly under-utilize a modern processor. The primary performance bottlenecks are poor data locality and low instruction-level parallelism (ILP). We propose a cache-conscious prefix tree to address this problem. The resulting tree improves spatial locality and also enhances the benefits from hardware cache line prefetching. Furthermore, the design of this data structure allows the use of a novel tiling strategy to improve temporal locality. The result is an overall speedup of up to 3.2 when compared with state-of-the-art implementations. We then show how these algorithms can be improved further by realizing a non-naive thread-based decomposition that targets simultaneously multi-threaded processors. A key aspect of this decomposition is to ensure cache re-use between threads that are co-scheduled at a fine granularity. This optimization affords an additional speedup of 50%, resulting in an overall speedup of up to 4.8.
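The two speedup figures in this abstract combine multiplicatively: reading the "additional speedup of 50%" as a factor of 1.5 applied on top of the 3.2 from the cache-conscious tree gives the quoted 4.8. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def compose_speedups(*factors):
    """Multiply successive speedup factors, each relative to the previous stage."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total


if __name__ == "__main__":
    # 3.2x from the cache-conscious prefix tree, then 1.5x from the
    # thread-based decomposition (interpreting "50%" as a 1.5x factor).
    print(round(compose_speedups(3.2, 1.5), 2))
```

Both figures are "up to" values, so the product describes the best reported case rather than a typical one.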
Detailed Characterization of a Quad Pentium Pro Server Running TPC-D
In Proceedings of International Conference on Computer Design, 1999
Cited by 20 (0 self)
While database workloads consume a major fraction of the cycles in today's machines, there are only a few public-domain performance studies that characterize in detail how these workloads exercise the machines. This fact is due to the complexity of setting up and tuning database workloads, the high cost of the equipment required to evaluate them, and the frequent use of proprietary systems.

