Cooperative caching for chip multiprocessors
- In Proceedings of the 33rd Annual International Symposium on Computer Architecture
, 2006
Cited by 87 (1 self)
Chip multiprocessor (CMP) systems have made the on-chip caches a critical resource shared among co-scheduled threads. Limited off-chip bandwidth, increasing on-chip wire delay, destructive inter-thread interference, and diverse workload characteristics pose key design challenges. To address these challenges, we propose CMP cooperative caching (CC), a unified framework to efficiently organize and manage on-chip cache resources. By forming a globally managed, shared cache using cooperative private caches, CC can effectively support two important caching applications: (1) reduction of average memory access latency and (2) isolation of destructive inter-thread interference. CC reduces the average memory access latency by balancing between cache latency and capacity optimizations. Based on private caches, CC naturally exploits their access latency benefits. To improve the effective cache capacity, CC forms a “shared” cache using replication control and LRU-based global replacement policies. Via cooperation throttling, CC provides a spectrum of caching behaviors between the two extremes of private and shared caches, thus enabling dynamic adaptation to suit workload requirements. We show that CC can achieve a robust performance advantage over private and shared cache schemes across different processor, cache and memory configurations, and a wide selection of multithreaded and multiprogrammed
A Performance Study of General-Purpose Applications on Graphics Processors Using CUDA
Cited by 28 (6 self)
Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general-purpose applications compared to contemporary general-purpose processors (CPUs). This paper uses NVIDIA’s C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU. GPU performance is compared to both single-core and multicore CPU performance, with multicore CPU implementations written using OpenMP. The paper also discusses advantages and inefficiencies of the CUDA programming model and some desirable features that might allow for greater ease of use and also more readily support a larger body of applications.
Understanding Failures in Petascale Computers
Cited by 25 (5 self)
With petascale computers only a year or two away there is a pressing need to anticipate and compensate for a probable increase in failure and application interruption rates. Researchers, designers and integrators have available to them far too little detailed information on the failures and interruptions that even smaller terascale computers experience. The information that is available suggests that application interruptions will become far more common in the coming decade, and the largest applications may surrender large fractions of the computer’s resources to taking checkpoints and restarting from a checkpoint after an interruption. This paper reviews sources of failure information for compute clusters and storage systems, projects failure rates and the corresponding decrease in application effectiveness, and discusses coping strategies such as application-level checkpoint compression and system-level process-pairs fault-tolerance for supercomputing. The need for a public repository for detailed failure and interruption records is particularly concerning, as projections from one architectural family of machines to another are widely disputed. To this end, this paper introduces the Computer Failure Data Repository and issues a call for failure history data to publish in it.
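The trade-off between checkpoint overhead and interruption rate described in this abstract can be illustrated with a small calculation. The sketch below is not the paper's projection model: it assumes Young's first-order approximation for the checkpoint interval, and the function names and the numbers in the usage lines (a 300-second checkpoint, mean times to interrupt of 24 hours and 1 hour) are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s, mtti_s):
    """Young's first-order approximation of the best checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds (assumed input).
    mtti_s: mean time to interrupt, in seconds (assumed input).
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtti_s)


def useful_fraction(checkpoint_cost_s, mtti_s, restart_cost_s=0.0):
    """Rough fraction of wall-clock time left for useful work.

    First-order model (an assumption, not the paper's): one checkpoint is
    written per interval, and each interruption costs a restart plus, on
    average, half an interval of recomputed work.
    """
    tau = young_interval(checkpoint_cost_s, mtti_s)
    checkpoint_overhead = checkpoint_cost_s / tau
    failure_overhead = (tau / 2.0 + restart_cost_s) / mtti_s
    return max(0.0, 1.0 - checkpoint_overhead - failure_overhead)


if __name__ == "__main__":
    for mtti_hours in (24, 1):
        fraction = useful_fraction(300.0, mtti_hours * 3600.0)
        print(f"MTTI {mtti_hours:>2} h: useful fraction ~ {fraction:.2f}")
```

Under these assumptions, shortening the mean time to interrupt while holding checkpoint cost fixed lowers the useful fraction, which is the qualitative effect the abstract warns about.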
Data-Intensive Supercomputing: The case for DISC
, 2007
Cited by 17 (1 self)
Google and its competitors have created a new class of large-scale computer systems to support Internet search. These “Data-Intensive Super Computing” (DISC) systems differ from conventional supercomputers in their focus on data: they acquire and maintain continually changing data sets, in addition to performing large-scale computations over the data. With the massive amounts of data arising from such diverse sources as telescope imagery, medical records, online transaction records, and web pages, DISC systems have the potential to achieve major advances in science, health care, business efficiencies, and information access. DISC opens up many important research topics in system design, resource management, programming models, parallel algorithms, and applications. By engaging the academic research community in these issues, we can more systematically and in a more open forum explore fundamental aspects of a societally important style of computing. Keywords: parallel computing, data storage, web search. When a teenage boy wants to find information about his idol by using Google with the search query “Britney Spears,” he unleashes the power of several hundred processors operating on a data set of over 200 terabytes. Why then can’t a scientist seeking a cure for cancer invoke large amounts of computation over a terabyte-sized database of DNA microarray data at the click of a button? Recent papers on parallel programming by researchers at Google [13] and Microsoft [19] present the results of using up to 1800 processors to perform computations accessing up to 10 terabytes of data. How can university researchers demonstrate the credibility of their work without having comparable computing facilities available?
A performance study of general purpose applications on graphics processors
- First Workshop on General Purpose Processing on Graphics Processing Units
, 2007
Cited by 15 (11 self)
Graphic processors (GPUs), with many light-weight data-parallel cores, can provide substantial parallel computational power to accelerate general purpose applications. To best utilize the GPU’s parallel computing resources, it is crucial to understand how GPU architectures and programming models can be applied to different categories of traditionally CPU applications. In this paper we examine several common, computationally demanding applications (Traffic Simulation, Thermal Simulation, and K-Means) whose performance may benefit from graphics hardware’s parallel computing capabilities. We show that all of our applications can be accelerated using the GPU, demonstrating as high as 40× speedup when compared with a CPU implementation. We also examine the performance characteristics of our applications, presenting advantages and inefficiencies of the programming model and desirable features to more easily and completely support a larger body of applications.
A Note on Auto-tuning GEMM for GPUs
, 2009
Cited by 15 (10 self)
The development of high performance dense linear algebra (DLA) critically depends on highly optimized BLAS, and especially on the matrix multiplication routine (GEMM). This is especially true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM [13, 11]. However, the current best GEMM performance, e.g. of up to 375 GFlop/s in single precision and of up to 75 GFlop/s in double precision arithmetic on NVIDIA’s GTX 280, is difficult to achieve. The development involves extensive GPU knowledge and even backward engineering to understand some undocumented insides about the architecture that have been of key importance in the development [12]. In this paper, we describe some GPU GEMM auto-tuning optimization techniques that allow us to keep up with changing hardware by rapidly reusing, rather than reinventing, the existing ideas. Auto-tuning, as we show in this paper, is a very practical solution where in addition to getting an easy portability, we can often get substantial speedups even on current GPUs (e.g. up to 27% in certain cases for both single and double precision GEMMs on the GTX 280). Keywords: Auto-tuning, matrix multiply, dense linear algebra, GPUs.
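The general idea of auto-tuning, an empirical search over implementation parameters, can be sketched in a few lines. The example below is only an illustration on the CPU in plain Python, not the paper's GPU kernel generator: the tile size is the single tuned parameter, and the names `blocked_matmul` and `autotune` as well as the candidate list are hypothetical.

```python
import time


def blocked_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Multiply one tile of a by one tile of b into a tile of c.
                for i in range(ii, min(ii + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + tile, n)):
                        aik = row_a[k]
                        row_b = b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += aik * row_b[j]
    return c


def autotune(n, candidates, repeats=3):
    """Time every candidate tile size and return (best_tile, timings)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for tile in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, tile)
            best = min(best, time.perf_counter() - start)
        timings[tile] = best  # keep the fastest of the repeated runs
    best_tile = min(timings, key=timings.get)
    return best_tile, timings


if __name__ == "__main__":
    tile, timings = autotune(n=32, candidates=[4, 8, 16, 32])
    print("fastest tile size on this machine:", tile)
```

The winning tile size depends on the machine, which is the point of the approach: the search is rerun when the hardware changes instead of retuning the kernel by hand.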
The impact of multicore on math software
- In PARA 2006, Umeå, Sweden
, 2006
Cited by 15 (8 self)
The idea that computational modeling and simulation represents a new branch of scientific methodology, alongside theory and experimentation, was introduced about two decades ago. It has since come to symbolize the enthusiasm and sense of importance that people in our community feel for the work they are doing. But
Dynamic Task Scheduling for Linear Algebra Algorithms on Distributed-Memory Multicore Systems
Cited by 13 (7 self)
Multicore systems have increasingly gained importance in both shared-memory and distributed-memory environments. This paper presents a dynamic task scheduling approach to executing dense linear algebra algorithms on multicore systems (either shared- or distributed-memory). We use a task-based library to replace the existing linear algebra subroutines such as PBLAS to transparently provide the same interface and computational function as the ScaLAPACK library. Linear algebra programs are written with the task-based library and executed by a dynamic runtime system. We mainly focus our runtime system design on the metric of performance scalability. We propose an algorithm to solve data dependences without process cooperation in a distributed manner. We have implemented the runtime system and applied it to three linear algebra algorithms: Cholesky factorization, LU factorization, and QR factorization. Our experiments on both shared-memory machines (16-core Intel Tigerton, 32-core IBM Power6) and distributed-memory machines (Cray XT4 using 1024 cores) demonstrate that our runtime system is able to achieve good scalability. Furthermore, we provide an analysis that shows why the tiled algorithms are scalable and gives their expected execution time.
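To make the notion of dynamically scheduled tile tasks concrete, the sketch below enumerates the tasks of a tiled Cholesky factorization, derives their data dependences, and runs them in ready order. It is a single-process illustration under stated assumptions, not the paper's distributed dependence-solving algorithm: the kernel names follow the usual LAPACK/BLAS naming (POTRF, TRSM, SYRK, GEMM), the function names are hypothetical, and dependences are found by tracking only the last writer of each tile, which suffices here because no tile is overwritten after another task has read it as an input.

```python
from collections import deque


def cholesky_tasks(nt):
    """List tiled-Cholesky tasks in sequential order as (name, reads, write)."""
    tasks = []
    for k in range(nt):
        tasks.append((f"POTRF({k})", [], (k, k)))
        for i in range(k + 1, nt):
            tasks.append((f"TRSM({i},{k})", [(k, k)], (i, k)))
        for i in range(k + 1, nt):
            tasks.append((f"SYRK({i},{k})", [(i, k)], (i, i)))
            for j in range(k + 1, i):
                tasks.append((f"GEMM({i},{j},{k})", [(i, k), (j, k)], (i, j)))
    return tasks


def build_dependencies(tasks):
    """Map each task index to the set of earlier task indices it waits for."""
    last_writer = {}
    deps = []
    for index, (_, reads, write) in enumerate(tasks):
        # A task waits for the last writer of every tile it reads or updates.
        needed = {last_writer[t] for t in reads + [write] if t in last_writer}
        deps.append(needed)
        last_writer[write] = index
    return deps


def run_dynamically(tasks, deps):
    """Run tasks as they become ready; return task names in execution order."""
    remaining = [len(d) for d in deps]
    successors = [[] for _ in tasks]
    for index, needed in enumerate(deps):
        for parent in needed:
            successors[parent].append(index)
    ready = deque(i for i, count in enumerate(remaining) if count == 0)
    order = []
    while ready:
        index = ready.popleft()
        order.append(tasks[index][0])  # a real runtime would run the kernel here
        for child in successors[index]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return order


if __name__ == "__main__":
    tasks = cholesky_tasks(3)
    print(run_dynamically(tasks, build_dependencies(tasks)))
```

In a multicore runtime the ready queue would be drained by several workers at once; the single queue here only shows which tasks become runnable and when.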
Diverse replication for single-machine Byzantine-fault tolerance
- In Submission
, 2008
Cited by 13 (2 self)
New single-machine environments are emerging from abundant computation available through multiple cores and secure virtualization. In this paper, we describe the research challenges and opportunities around diversified replication as a method to increase the Byzantine-fault tolerance (BFT) of single-machine servers to software attacks or errors. We then discuss the design space of BFT protocols enabled by these new environments.

