Communication Characteristics of Large-Scale Scientific Applications for Contemporary Cluster Architectures
- In International Parallel and Distributed Processing Symposium, 2002
Cited by 74 (9 self)
This paper examines the explicit communication characteristics of several sophisticated scientific applications, which, by themselves, constitute a representative suite of publicly available benchmarks for large cluster architectures. By focusing on the Message Passing Interface (MPI) and by using hardware counters on the microprocessor, we observe each application's inherent behavioral characteristics: point-to-point and collective communication, and floating-point operations. Furthermore, we explore the sensitivities of these characteristics to both problem size and number of processors. Our analysis reveals several striking similarities across our diverse set of applications including the use of collective operations, especially those collectives with very small data payloads. We also highlight a trend of novel applications parting with regimented, static communication patterns in favor of dynamically evolving patterns, as evidenced by our experiments on applications that use implicit linear solvers and adaptive mesh refinement. Overall, our study contributes a better understanding of the requirements of current and emerging paradigms of scientific computing in terms of their computation and communication demands.
POEMS: End-to-End Performance Design of Large Parallel Adaptive Computational Systems
- IEEE Transactions on Software Engineering, 2001
Cited by 44 (10 self)
The POEMS project is creating an environment for end-to-end performance modeling of complex parallel and distributed systems, spanning the domains of application software, runtime and operating system software, and hardware architecture. Towards this end, the POEMS framework supports composition of component models from these different domains into an end-to-end system model. This composition can be specified using a generalized graph model of a parallel system, together with interface specifications that carry information about component behaviors and evaluation methods. The POEMS Specification Language compiler, under development, will generate an end-to-end system model automatically from such a specification. The components of the target system may be modeled using different modeling paradigms (analysis, simulation, or direct measurement) and may be modeled at various levels of detail. As a result, evaluation of a POEMS end-to-end system model may require a variety of eval...
Performance and Scalability Analysis of Teraflop-Scale Parallel Architectures using Multidimensional Wavefront Applications
- The International Journal of High Performance Computing Applications, 2000
Cited by 42 (7 self)
The authors develop a model for the parallel performance of algorithms that consist of concurrent, two-dimensional wavefronts implemented in a message-passing environment. The model, based on a LogGP machine parameterization, combines the separate contributions of computation and communication wavefronts. The authors validate the model on three important supercomputer systems, on up to 500 processors. They use data from a deterministic particle transport application taken from the ASCI workload, although the model is general to any wavefront algorithm implemented on a 2-D processor domain. They also use the validated model to make estimates of performance and scalability of wavefront algorithms on 100 TFLOPS computer systems expected to be in existence within the next decade as part of the ASCI
Statistical Scalability Analysis of Communication Operations in Distributed Applications
Cited by 34 (2 self)
Current trends in high performance computing suggest that users will soon have widespread access to clusters of multiprocessors with hundreds, if not thousands, of processors. This unprecedented degree of parallelism will undoubtedly expose scalability limitations in existing applications, where scalability is the ability of a parallel algorithm on a parallel architecture to effectively utilize an increasing number of processors. Users will need precise and automated techniques for detecting the cause of limited scalability. This paper addresses this dilemma. First, we argue that users face numerous challenges in understanding application scalability: managing substantial amounts of experiment data, extracting useful trends from this data, and reconciling performance information with their application's design. Second, we propose a solution to automate this data analysis problem by applying fundamental statistical techniques to scalability experiment data. Finally, we evaluate our operational prototype on several applications, and show that statistical techniques offer an effective strategy for assessing application scalability. In particular, we find that non-parametric correlation of the number of tasks to the ratio of the time for communication operations to overall communication time provides a reliable measure for identifying communication operations that scale poorly.
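The correlation measure described in this abstract can be illustrated with a small sketch. It assumes Spearman rank correlation as the non-parametric statistic (the abstract does not name one), and the function names, the input layout, and the 0.9 threshold are hypothetical choices made for illustration only, not the authors' prototype.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def poorly_scaling_ops(tasks, op_times, threshold=0.9):
    """Flag operations whose share of communication time grows with task count.

    tasks:    task counts, one per experiment run
    op_times: {operation name: [time in that operation, one per run]}
    threshold is a hypothetical cutoff, not a value from the paper.
    """
    totals = [sum(times[i] for times in op_times.values())
              for i in range(len(tasks))]
    flagged = {}
    for op, times in op_times.items():
        shares = [t / total for t, total in zip(times, totals)]
        rho = spearman(tasks, shares)
        if rho >= threshold:
            flagged[op] = rho
    return flagged
```

For example, calling `poorly_scaling_ops([2, 4, 8, 16], {"MPI_Allreduce": [1, 2, 4, 8], "MPI_Send": [9, 8, 7, 6]})` flags only `MPI_Allreduce`, because its share of total communication time rises monotonically as the task count grows.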
Local grid scheduling techniques using performance prediction
- IEE Proc.-Comput. Digit. Tech., 2003
Cited by 32 (13 self)
The use of computational grids to provide an integrated computer platform, composed of differentiated and distributed systems, presents fundamental resource and workload management questions. Key services such as resource discovery, monitoring and scheduling are inherently more complicated in a grid environment where the resource pool is large, dynamic and architecturally diverse. In this research, we approach the problem of grid workload management through the development of a multi-tiered scheduling architecture (TITAN) that employs a performance prediction system (PACE) and task distribution brokers to meet user-defined deadlines and improve resource usage efficiency. This paper focuses on the lowest tier which is responsible for local scheduling. By coupling application performance data with scheduling heuristics, the architecture is able to balance the processes of minimising run-to-completion time and processor idle time, whilst adhering to service deadlines on a per-task basis.
Scalable analysis techniques for microprocessor performance counter metrics
- In Proc. of the Conference on Supercomputers (SC2002), 2002
Cited by 30 (1 self)
Contemporary microprocessors provide a rich set of integrated performance counters that allow application developers and system architects alike the opportunity to gather important information about workload behaviors. Current techniques for analyzing data produced from these counters use raw counts, ratios, and visualization techniques to help users make decisions about their application performance. While these techniques are appropriate for analyzing data from one process, they do not scale easily to new levels demanded by contemporary computing systems. Very simply, this paper addresses these concerns by evaluating several multivariate statistical techniques on these datasets. We find that several techniques, such as statistical clustering, can automatically extract important features from the data. These derived results can, in turn, be fed directly back to an application developer, or used as input to a more comprehensive performance analysis environment, such as a visualization or an expert system.
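As a rough illustration of how statistical clustering might group processes by their counter profiles, the sketch below uses plain k-means. This is an assumption: the abstract does not say which clustering method the paper evaluates. The function name `kmeans`, the deterministic seeding, and the two-metric input vectors are all hypothetical.

```python
import math


def kmeans(points, k, max_iters=50):
    """Cluster per-process counter vectors with basic k-means.

    points: one vector per process, e.g. (miss rate, FP fraction).
            Values are assumed to be normalized to comparable scales already.
    Returns (labels, centers). Seeding is deterministic: k points taken
    at even spacing through the input (an illustrative choice).
    """
    step = len(points) / k
    centers = [list(points[int(i * step)]) for i in range(k)]
    labels = None
    for _ in range(max_iters):
        # Assign every point to its nearest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

Given six processes whose vectors fall into two visibly separate groups, such as three near `(0.1, 0.9)` and three near `(0.8, 0.2)`, `kmeans(points, 2)` assigns each group its own label, which is the kind of feature a developer would otherwise have to spot by inspecting raw counts per process.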
An Empirical Performance Evaluation of Scalable Scientific Applications
- In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, 2002
Cited by 23 (1 self)
We investigate the scalability, architectural requirements, and performance characteristics of eight scalable scientific applications. Our analysis is driven by empirical measurements using statistical and tracing instrumentation for both communication and computation. Based on these measurements, we refine our analysis into precise explanations of the factors that influence performance and scalability for each application; we distill these factors into common traits and overall recommendations for both users and designers of scalable platforms. Our experiments demonstrate that some traits, such as improvements in the scaling and performance of MPI's collective operations, will benefit most applications. We also find specific characteristics of some applications that limit performance. For example, one application's intensive use of a 64-bit, floating-point divide instruction, which has high latency and is not pipelined on the POWER3, limits the performance of the application's primary computation.
STAPL: An adaptive, generic parallel C++ library
- In Int. Workshop on Languages and Compilers for Parallel Computing, 2001 [CDROM], 2003
Cited by 22 (5 self)
The Standard Template Adaptive Parallel Library (STAPL) is a parallel library designed as a superset of the ANSI C++ Standard Template Library (STL). It is sequentially consistent for functions with the same name, and executes on uni- or multi-processor systems that utilize shared or distributed memory. STAPL is implemented using simple parallel extensions of C++ that currently provide an SPMD model of parallelism, and supports nested parallelism. The library is intended to be general purpose, but emphasizes irregular programs to allow the exploitation of parallelism for applications that use dynamically linked data structures, such as particle transport calculations, molecular dynamics, geometric modeling, and graph algorithms. STAPL provides several different algorithms for some library routines, and selects among them adaptively at runtime. STAPL can replace STL automatically by invoking a preprocessing translation phase. In the applications studied, the performance of translated code was within 5% of the results obtained using STAPL directly. STAPL also provides functionality to allow the user to further optimize the code and achieve additional performance gains. We present results obtained using STAPL for a molecular dynamics code and a particle transport code.
Performance Evaluation of the SGI Origin2000: A Memory-Centric Characterization of LANL ASCI Applications
- In Supercomputing '97, 1997
Cited by 21 (4 self)
In this paper we compare single-processor performance of the SGI Origin and PowerChallenge and utilize a previously-reported performance model for hierarchical memory systems to explain the results. Both the Origin and PowerChallenge use the same microprocessor (MIPS R10000) but have significant differences in their memory subsystems. Our memory model includes the effect of overlap between CPU and memory operations and allows us to infer the individual contributions of all three improvements in the Origin's memory architecture and relate the effectiveness of each improvement to application characteristics.
1 Introduction
The biggest challenge in the design and use of high-performance computer systems involves managing the disparity between central processing unit (CPU) speed and memory subsystem speed. The need to address this issue is likely to become more acute in the future, because processor speed may double every eighteen months but DRAM memory access speed is expected to inc...
STORM: Lightning-Fast Resource Management
- In Supercomputing 2002, 2002
Cited by 20 (9 self)
Although workstation clusters are a common platform for high-performance computing (HPC), they remain more difficult to manage than sequential systems or even symmetric multiprocessors. Furthermore, as cluster sizes increase, the quality of the resource-management subsystem (essentially, all of the code that runs on a cluster other than the applications) increasingly impacts application efficiency. In this paper, we present STORM, a resource-management framework designed for scalability and performance. The key innovation behind STORM is a software architecture that enables resource management to exploit low-level network features. As a result of this HPC-application-like design, STORM is orders of magnitude faster than the best reported results in the literature on two sample resource-management functions: job launching and process scheduling.

