Design and Evaluation of Nemesis, a Scalable, Low-Latency, Message-Passing Communication Subsystem
- Proceedings of the International Symposium on Cluster Computing and the Grid, 2006
- Cited by 29 (3 self)
This paper presents a new low-level communication subsystem called Nemesis. Nemesis has been designed and implemented to be scalable and efficient both in the intranode communication context using shared memory and in the internode communication case using high-performance networks, and is natively multimethod-enabled. Nemesis has been integrated in MPICH2 as a CH3 channel and delivers better performance than other dedicated communication channels in MPICH2. Furthermore, the resulting MPICH2 architecture outperforms other MPI implementations in point-to-point benchmarks.
Implementation and shared-memory evaluation of MPICH2 over the Nemesis communication subsystem
- Proceedings of the Euro PVM/MPI Conference, 2006
- Cited by 20 (2 self)
This paper presents the implementation of MPICH2 over the Nemesis communication subsystem and the evaluation of its shared-memory performance. We describe design issues as well as some of the optimization techniques we employed. We conducted a performance evaluation over shared memory using microbenchmarks as well as application benchmarks. The evaluation shows that MPICH2 Nemesis has very low communication overhead, making it suitable for smaller-grained applications.
Data transfers between processes in an SMP system: performance study and application to MPI
- Proceedings of the 2006 International Conference on Parallel Processing (ICPP 2006), IEEE Computer Society, 2006
- Cited by 16 (3 self)
This paper focuses on the transfer of large data in SMP systems. Achieving good performance for intranode communication is critical for developing an efficient communication system, especially in the context of SMP clusters. We evaluate the performance of five transfer mechanisms: shared-memory buffers, message queues, the Ptrace system call, kernel module-based copy, and a high-speed network. We evaluate each mechanism based on latency, bandwidth, its impact on application cache usage, and its suitability to support MPI two-sided and one-sided messages.
I. MOTIVATION AND SCOPE
Designing a communication system tailored for a particular architecture requires understanding the achievable performance levels of the underlying hardware and software. Such understanding is key to a more efficient design and
Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs?
- Cited by 11 (3 self)
Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared cache, rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications remains only preliminarily understood. In this work, we conduct a systematic measurement of the influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, and considering a number of potentially important factors at the program, OS, and architecture levels. The measurement shows some surprising results. Contrary to the commonly perceived importance of cache sharing, neither positive nor negative effects from the cache sharing are significant for most of the program executions, regardless of the types of parallelism, input datasets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch between the current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe up to a 36% performance increase when the threads are placed on cores appropriately.
Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?
- Cited by 9 (2 self)
On Chip Multiprocessors (CMP), it is common that multiple cores share certain levels of cache. The sharing increases the contention in cache and memory-to-chip bandwidth, further highlighting the importance of data locality analysis. As a rigorous and hardware-independent locality metric, reuse distance has served for a variety of locality analyses, program transformations, and performance predictions. However, previous studies have concentrated on sequential programs running on unicore processors. On CMP, accesses by different threads (or jobs) interact in the shared cache. How reuse distance applies to the new architecture remains an open question, particularly how the interactions in shared cache affect the collection and application of reuse distance, and how reuse-distance-based locality analysis should adapt to such architecture changes. This paper presents our explorations towards answering those questions. It first introduces the concept of concurrent reuse distance, a direct extension of the traditional concept of reuse distance with data references by all co-running threads (or jobs) considered. It then discusses the properties of concurrent reuse distance, revealing the special challenges facing the collection and application of concurrent reuse distance on CMP platforms. Finally, it presents the solutions to those challenges for a class of multithreading applications. The solutions center on a probabilistic model that connects concurrent reuse distance with the data locality of each individual thread. Experiments demonstrate the effectiveness of the proposed techniques in facilitating the uses of concurrent reuse distance for CMP computing.
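To illustrate the metric this abstract refers to, here is a minimal sketch of reuse distance, taken in its usual sense as the number of distinct data elements accessed between two consecutive references to the same element. The function name, the traces, and the interleaving are hypothetical and chosen only for illustration. This is not the paper's probabilistic model; it only shows how one assumed interleaving of co-running threads changes the distances each thread would see on its own.

```python
from typing import Hashable, List, Optional, Sequence


def reuse_distances(trace: Sequence[Hashable]) -> List[Optional[int]]:
    """Return the reuse distance of every access in `trace`.

    The distance is the number of distinct elements touched since the
    previous access to the same element; first accesses yield None.
    This is a simple quadratic-time sketch, not an optimized algorithm.
    """
    last_seen = {}  # element -> index of its most recent access
    distances: List[Optional[int]] = []
    for i, element in enumerate(trace):
        if element in last_seen:
            between = trace[last_seen[element] + 1:i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[element] = i
    return distances


# Hypothetical per-thread traces and one assumed interleaving of them.
thread_a = ["a", "b", "a"]
thread_b = ["x", "y", "x"]
interleaved = ["a", "x", "b", "y", "a", "x"]

print(reuse_distances(thread_a))     # per-thread view
print(reuse_distances(interleaved))  # concurrent view of the shared cache
```

In this toy interleaving, the reuse of `a` has distance 1 within its own thread but distance 3 in the merged trace, which is the kind of interaction the abstract describes.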
Accelerating the reduction to upper Hessenberg form through hybrid GPU-based computing
- 2009
- Cited by 7 (5 self)
We present a Hessenberg reduction (HR) algorithm for hybrid multicore + GPU systems that gets more than 16× performance improvement over the current LAPACK algorithm running just on current multicores (in double precision arithmetic). This enormous acceleration is due to proper matching of algorithmic requirements to architectural strengths of the hybrid components. The reduction itself is an important linear algebra problem, especially with its relevance to eigenvalue problems. The results described in this paper are significant because Hessenberg reduction has not yet been accelerated on multicore architectures, and it plays a significant role in solving nonsymmetric eigenvalue problems. The approach can be applied to the symmetric problem and, in general, to two-sided matrix transformations. The work further motivates and highlights the strengths of hybrid computing: to harness the strengths of the components of a hybrid architecture to get significant computational acceleration which otherwise may have been impossible.
Architectural considerations for efficient software execution on parallel microprocessors
- Parallel and Distributed Processing Symposium, 2007
- Cited by 5 (0 self)
Chip Multiprocessors (CMPs) and Simultaneous Multithreading (SMT) processors provide high CPU performance but tend to put more pressure on the memory interface than their single-thread counterparts. The commonly known “memory wall” problem is exacerbated by multiple threads sharing a single memory interface. In the future, this problem will only get worse as more cores are added to a single chip. Therefore, communication mechanisms between the execution contexts/cores, using either shared caches or fast interconnects between private caches, are essential to keep the CPUs busy without overly taxing the memory interface. Larger systems built with multiple CMPs add another dimension to this already challenging problem, as the communication mechanism is no longer uniform across the entire system. In order to parallelize data-intensive applications to achieve high performance on these systems, one must explore a number of execution behaviors in a complex architecture-dependent exercise that entails identifying key components of the communication subsystem and understanding their behavior under varying degrees of workload. As part of ongoing research into developing efficient program execution models for parallel microprocessors, we have developed a tool to evaluate the performance of the storage controllers at different levels of the memory hierarchy under varying workloads and measure the overhead of maintaining cache coherence in the system. The tool allows exploration of architectural features of real processors that affect the performance of several parallel execution approaches. In this paper, we demonstrate its use by evaluating two of our parallel programming models (one introduced in an earlier paper and the other introduced here) that employ architecture-specific optimizations. We compare our approaches with a conventional model for a number of applications on several modern parallel microprocessor systems.
Combining Locality Analysis with Online Proactive Job Co-Scheduling in Chip Multiprocessors
- Cited by 5 (2 self)
The shared-cache contention on Chip Multiprocessors causes performance degradation to applications and hurts system fairness. Many previously proposed solutions schedule programs according to runtime sampled cache performance to reduce cache contention. The strong dependence on runtime sampling inherently limits the scalability and effectiveness of those techniques. This work explores the combination of program locality analysis with job co-scheduling. The rationale is that program locality analysis typically offers a large-scope view of various facets of an application, including data access patterns and cache requirements. That knowledge complements the local behaviors sampled by runtime systems. The combination offers the key to overcoming the limitations of prior co-scheduling techniques. Specifically, this work develops a lightweight locality model that enables efficient, proactive prediction of the performance of co-running processes, offering the potential for integration in online scheduling systems. Compared to existing multicore scheduling systems, the technique reduces performance degradation by 34% (7% performance improvement) and unfairness by 47%. Its proactivity makes it resilient to the scalability issues that constrain the applicability of previous techniques.
A Profiler for a Heterogeneous Multi-Core Multi-FPGA System
- Cited by 4 (1 self)
A thesis submitted in conformity with the requirements
Developing Scalable Applications with Vampir,
- Cited by 4 (0 self)

