ASR: Adaptive selective replication for CMP caches
- In Proceedings of MICRO-39
, 2006
Cited by 32 (3 self)
The large working sets of commercial and scientific workloads stress the L2 caches of Chip Multiprocessors (CMPs). Some CMPs use a shared L2 cache to maximize the on-chip cache capacity and minimize off-chip misses. Others use private L2 caches, replicating data to limit the delay due to global wires and minimize cache access time. Recent hybrid proposals use selective replication to balance latency and capacity, but their static replication rules result in performance degradation for some combinations of workloads and system configurations. This paper proposes Adaptive Selective Replication (ASR), a mechanism that dynamically monitors workload behavior to control replication. ASR replicates cache blocks only when it estimates the benefit of replication (lower L2 hit latency) exceeds the cost (more L2 misses). Full-system simulations of 8-processor CMPs show that ASR provides robust performance: improving performance by as much as 29% versus shared caches, 19% versus private caches, and 12% versus CMP-NuRapid [9] and Victim Replication [41]. Furthermore, while ASR does not improve the performance of all workloads, it provides performance stability by always performing at least comparably to the best alternative including Cooperative Caching [8].
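The cost-benefit rule stated in the ASR abstract (replicate only when the estimated benefit exceeds the estimated cost) can be illustrated with a minimal sketch. The function name, its parameters, and the cycle-based accounting are hypothetical and are not taken from the paper, which estimates these quantities in hardware; the sketch only shows the comparison itself.

```python
def should_replicate(extra_local_hits, cycles_saved_per_hit,
                     extra_misses, cycles_lost_per_miss):
    """Return True when the estimated benefit of replication exceeds its cost.

    All inputs are hypothetical per-interval estimates:
      extra_local_hits     - remote L2 hits that replication would turn into local hits
      cycles_saved_per_hit - latency saved by each such hit
      extra_misses         - additional L2 misses caused by the capacity replicas consume
      cycles_lost_per_miss - latency added by each additional miss
    """
    benefit = extra_local_hits * cycles_saved_per_hit   # lower L2 hit latency
    cost = extra_misses * cycles_lost_per_miss          # more L2 misses
    return benefit > cost
```

With these assumed inputs, a workload whose replicas save many cheap remote hits but cause few expensive misses would replicate, and one with the opposite profile would not.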
An Analysis of Linux Scalability to Many Cores
Cited by 22 (3 self)
This paper analyzes the scalability of seven system applications (Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce) running on Linux on a 48-core computer. Except for gmake, all applications trigger scalability bottlenecks inside a recent Linux kernel. Using mostly standard parallel programming techniques (this paper introduces one new technique, sloppy counters), these bottlenecks can be removed from the kernel or avoided by changing the applications slightly. Modifying the kernel required a total of 3002 lines of code changes. A speculative conclusion from this analysis is that there is no scalability reason to give up on traditional operating system organizations just yet.
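The abstract names sloppy counters but does not describe them. As a rough illustration of the general idea (keep per-core counts and touch a shared counter only occasionally), here is a minimal sketch. It uses a threshold-flush scheme chosen for simplicity; that scheme is an assumption and may differ from the paper's design, and the class and method names are hypothetical.

```python
import threading


class SloppyCounter:
    """Approximate counter: per-core counts are flushed to a shared total
    only when they reach a threshold (an assumed, simplified policy)."""

    def __init__(self, num_cores, threshold):
        self.threshold = threshold
        self.global_count = 0
        self.global_lock = threading.Lock()
        self.local_counts = [0] * num_cores
        self.local_locks = [threading.Lock() for _ in range(num_cores)]

    def add(self, core, amount=1):
        # Common case: only the core-local lock and count are touched.
        with self.local_locks[core]:
            self.local_counts[core] += amount
            if abs(self.local_counts[core]) >= self.threshold:
                # Rare case: fold the local count into the shared total.
                with self.global_lock:
                    self.global_count += self.local_counts[core]
                self.local_counts[core] = 0

    def approximate(self):
        # Cheap read: may lag the true value by the unflushed local counts.
        with self.global_lock:
            return self.global_count

    def exact(self):
        # Expensive read: take every local lock (in index order), then the global lock.
        for lock in self.local_locks:
            lock.acquire()
        try:
            with self.global_lock:
                return self.global_count + sum(self.local_counts)
        finally:
            for lock in reversed(self.local_locks):
                lock.release()
```

The design trade-off is that updates rarely contend on shared state, at the price of a cheap read that is only approximate.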
A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts
, 1999
Cited by 20 (3 self)
This paper presents a data layout optimization technique for sequential and parallel programs based on the theory of hyperplanes from linear algebra. Given a program, our framework automatically determines suitable memory layouts that can be expressed by hyperplanes for each array that is referenced. We discuss the cases where data transformations are preferable to loop transformations and show that under certain conditions a loop nest can be optimized for perfect spatial locality by using data transformations. We argue that data transformations can also optimize spatial locality for some arrays without distorting temporal/spatial locality exhibited by others. We divide the problem of optimizing data layout into two independent subproblems: 1) determining optimal static data layouts, and 2) determining data transformation matrices to implement the optimal layouts. By postponing the determination of the transformation matrix to the last stage, our method can be adapted to compilers with different default layouts. We then present an algorithm that considers optimizing parallelism and spatial locality simultaneously. Our results on eight programs on two distributed shared-memory multiprocessors, the Convex Exemplar SPP-2000 and the SGI Origin 2000, show that the layout optimizations are effective in optimizing spatial locality and parallelism.
Dynamic Hardware-Assisted Software-Controlled Page Placement to Manage Capacity Allocation and Sharing within Large Caches
- IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE
, 2009
Cited by 20 (7 self)
In future multi-cores, large amounts of delay and power will be spent accessing data in large L2/L3 caches. It has been recently shown that OS-based page coloring allows a non-uniform cache architecture (NUCA) to provide low latencies and not be hindered by complex data search mechanisms. In this work, we extend that concept with mechanisms that dynamically move data within caches. The key innovation is the use of a shadow address space to allow hardware control of data placement in the L2 cache while being largely transparent to the user application and off-chip world. These mechanisms allow the hardware and OS to dynamically manage cache capacity per thread as well as optimize placement of data shared by multiple threads. We show an average IPC improvement of 10-20% for multi-programmed workloads with capacity allocation policies and an average IPC improvement of 8% for multi-threaded workloads with policies for shared page placement.
Kernel-Level Scheduling for the Nano-Threads Programming Model
, 1998
Cited by 19 (14 self)
Multiprocessor systems are increasingly becoming the systems of choice for low- and high-end servers, running such diverse tasks as number crunching, large-scale simulations, database engines and World Wide Web server applications. With such diverse workloads, system utilization and throughput, as well as execution time, become important performance metrics. In this paper we present efficient kernel scheduling policies and propose a new kernel-user interface aimed at supporting efficient parallel execution in diverse workload environments. Our approach relies on support for user-level threads, which are used to exploit parallelism within applications, and a two-level scheduling policy which coordinates the number of resources allocated by the kernel with the number of threads generated by each application. We compare our scheduling policies with the native gang scheduling policy of the IRIX 6.4 operating system on a Silicon Graphics Origin2000. Our experimental results show substantial ...
Towards A Simplified Database Workload For Computer Architecture Evaluations
- In Workload Characterization for Computer System Design
, 2000
Cited by 17 (0 self)
We propose and evaluate a simplified technique for studying the architectural behavior of database workloads. This "microbenchmark" technique poses simple queries of the database to generate the same dominant I/O patterns exhibited in more complex, fully-scaled workloads. The potential benefits from this microbenchmark approach include smaller hardware requirements, less extensive workload parameter tuning, and simpler database parameter tuning. We demonstrate that the microbenchmark workload exhibits processor and memory system behavior relatively similar to that of the more complex standardized benchmarks. We also enumerate several factors that impact the representativeness of these microbenchmark workloads.
Keywords: Database, transaction processing, decision support, microbenchmark, and performance evaluation.
1. INTRODUCTION
In the last five to ten years, several studies have explored the architectural characteristics of online transaction processing (OLTP) database workloads ...
The Effectiveness of SRAM Network Caches in Clustered DSMs
, 1998
Cited by 17 (0 self)
The frequency of accesses to remote data is a key factor affecting the performance of all Distributed Shared Memory (DSM) systems. Remote data caching is one of the most effective and general techniques to fight processor stalls due to remote capacity misses in the processor caches. The design space of remote data caches (RDC) has many dimensions and one essential performance trade-off: hit ratio versus speed. Some recent commercial systems have opted for large and slow (S)DRAM network caches (NC), but others completely avoid them because of their damaging effects on the remote/local latency ratio. In this paper we explore small and fast SRAM network caches as a means to reduce the remote stalls and capacity traffic of multiprocessor clusters. The major appeal of SRAM NCs is that they add less penalty to the latency of NC hits and remote accesses. Their small capacity can handle conflict misses and a limited number of capacity misses. However, they can be coupled with main memory...
An Experimental Evaluation of Processor Pool-Based Scheduling for Shared-Memory NUMA multiprocessors
- In Job Scheduling Strategies for Parallel Processing, Lecture Notes in Computer Science 1291
, 1997
Cited by 16 (0 self)
Abstract. In this paper we describe the design, implementation and experimental evaluation of a technique for operating system schedulers called processor pool-based scheduling [51]. Our technique is designed to assign processes (or kernel threads) of parallel applications to processors in multiprogrammed, shared-memory NUMA multiprocessors. The results of the experiments conducted in this research demonstrate that: 1) Pool-based scheduling is an effective method for localizing application execution and reducing mean response times. 2) Although application parallelism should be considered, the optimal pool size is a function of the system architecture. 3) The strategies of placing new applications in a pool with the largest potential for in-pool growth (i.e., the pool containing the fewest jobs) and of isolating applications from each other are desirable properties of algorithms for operating system schedulers executing on NUMA architectures. The "Worst-Fit" policy we examine incorporates both of these properties.
1 Introduction
The number of bus-based shared-memory multiprocessors being manufactured
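The placement rule in the pool-based scheduling abstract above (put a new application in the pool containing the fewest jobs) can be sketched in a few lines. The function name and the list-based representation of pools are hypothetical; the tie-breaking rule (lowest pool index) is also an assumption, since the abstract does not specify one.

```python
def worst_fit_pool(jobs_per_pool):
    """Return the index of the pool with the fewest jobs.

    jobs_per_pool is a hypothetical list where element i is the number of
    jobs currently assigned to pool i. Ties go to the lowest index (assumed).
    """
    if not jobs_per_pool:
        raise ValueError("at least one pool is required")
    return min(range(len(jobs_per_pool)), key=lambda i: jobs_per_pool[i])
```

For example, with pools holding 3, 1, and 2 jobs, a new application would be placed in the second pool, which leaves it the most room to grow within that pool.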
Using Hardware Counters to Automatically Improve Memory Performance
- ACM/IEEE Conference on Supercomputing
, 2004
Cited by 15 (1 self)
In this paper, we introduce a profile-driven online page migration scheme and investigate its impact on the performance of multithreaded applications. We use lightweight, inexpensive plug-in hardware counters to profile the memory access behavior of an application, and then migrate pages to memory local to the most frequently accessing processor. Using the Dyninst runtime instrumentation combined with hardware counters, we were able to add page migration capabilities to the system without having to modify the operating system kernel or recompile application programs. This approach reduced the total number of non-local memory accesses of applications by up to 90%. Even on a system with small remote-to-local memory access latency ratios, this resulted in up to 16% improvement in execution time.
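The migration rule in the abstract above (move a page to memory local to its most frequent accessor) can be sketched as a simple selection over sampled access counts. The function name and the dictionary of per-node counts are hypothetical; the paper gathers this profile with hardware counters, and the choice to keep the page in place on ties is an assumption made here to avoid needless migrations.

```python
def pick_home_node(access_counts, current_node):
    """Choose the node whose memory should hold a page.

    access_counts is a hypothetical dict mapping node id -> sampled number of
    accesses to the page from that node. The page moves only if some other
    node accessed it strictly more often than the current node (assumed policy).
    """
    best_node = current_node
    best_count = access_counts.get(current_node, 0)
    for node, count in access_counts.items():
        if count > best_count:
            best_node, best_count = node, count
    return best_node
```

A caller would migrate the page only when the returned node differs from `current_node`.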
User-Level Dynamic Page Migration for Multiprogrammed Shared-Memory Multiprocessors
, 2000
Cited by 15 (4 self)
This paper presents algorithms for improving the performance of parallel programs on multiprogrammed shared-memory NUMA multiprocessors, via the use of user-level dynamic page migration. The idea that drives the algorithms is that a page migration engine can perform accurate and timely page migrations in a multiprogrammed system if it can correlate page reference information with scheduling information obtained from the operating system. The necessary page migrations can be performed as a response to scheduling events that break the implicit association between threads and their memory affinity sets. We present two algorithms that use feedback from the kernel scheduler to aggressively migrate pages upon thread migrations. The first algorithm exploits the iterative nature of parallel programs, while the second targets generic codes without making assumptions about their structure. Performance evaluation on an SGI Origin2000 shows that our page migration algorithms provide substantial improv...

