Results 1 - 10 of 66
The GrADS project: Software support for high-level grid application development
International Journal of High Performance Computing Applications, 2001
"... Advances in networking technologies will soon make it possible to use the global information infrastructure in a qualitatively different way—as a computational resource as well as an information resource. This idea for an integrated computation and information resource called the Computational Power ..."
Abstract
-
Cited by 162 (24 self)
Abstract: Advances in networking technologies will soon make it possible to use the global information infrastructure in a qualitatively different way: as a computational resource as well as an information resource. This idea for an integrated computation and information resource, called the Computational Power Grid, is described in the recent book The Grid: Blueprint for a New Computing Infrastructure [18]. The Grid will connect the nation's computers, databases, instruments, and people in a seamless web, supporting emerging computation-rich application concepts such as remote computing, distributed supercomputing, tele-immersion, smart instruments, and data mining. To realize this vision, significant scientific and technical obstacles must be overcome. Principal among these is usability. Because the Grid will be inherently more complex than existing computer systems, programs that execute on the Grid will reflect some of this complexity. Hence, making Grid resources useful and accessible to scientists and engineers will require new software tools that embody major advances in both the theory and practice of building Grid applications. The goal of the Grid Application Development Software (GrADS) Project is to simplify distributed heterogeneous computing in the same way that the World Wide Web simplified information sharing.
SUIF Explorer: an interactive and interprocedural parallelizer
1999
"... The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system is successful in parallelizing many coarse-grain loops, thus mini ..."
Abstract
-
Cited by 76 (5 self)
Abstract: The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system are successful in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiring attention. Second, the system uses dynamic execution analyzers to identify those important loops that are likely to be parallelizable. Third, the SUIF Explorer is the first to apply program slicing to aid programmers in interactive parallelization. The system guides the programmer through the parallelization process using a set of sophisticated visualization techniques. This paper demonstrates the effectiveness of the SUIF Explorer with three case studies. The programmer was able to speed up all three programs by examining only a small fraction of the program and privatizing a few variables.
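As an aside for readers unfamiliar with the transformation the abstract alludes to, the sketch below shows what privatizing a variable in a coarse-grain loop looks like. It is not taken from the SUIF paper; the function, the variable names, and the use of OpenMP (in place of SUIF's own parallelizer) are illustrative assumptions.

    #include <stddef.h>

    /* Hypothetical example: 'scratch' carries no value across iterations,
       so each thread can safely work on a private copy. Left shared, it
       would create a spurious loop-carried dependence of exactly the kind
       an interactive tool asks the programmer to resolve. */
    void smooth(double *out, const double *in, size_t n)
    {
        double scratch;
    #pragma omp parallel for private(scratch)
        for (size_t i = 1; i + 1 < n; i++) {
            scratch = in[i - 1] + in[i] + in[i + 1];
            out[i] = scratch / 3.0;
        }
    }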
HPCView: A tool for top-down analysis of node performance
The Journal of Supercomputing, 2002
"... ..."
(Show Context)
The Jrpm System for Dynamically Parallelizing Java Programs
In Proceedings of the 30th International Symposium on Computer Architecture, 2003
"... We describe the Java runtime parallelizing machine (Jrpm), a complete system for parallelizing sequential programs automatically. Jrpm is based on a chip multiprocessor (CMP) with thread-level speculation (TLS) support. CMPs have low sharing and communication costs relative to traditional multt)roce ..."
Abstract
-
Cited by 65 (4 self)
Abstract: We describe the Java runtime parallelizing machine (Jrpm), a complete system for parallelizing sequential programs automatically. Jrpm is based on a chip multiprocessor (CMP) with thread-level speculation (TLS) support. CMPs have low sharing and communication costs relative to traditional multiprocessors, and thread-level speculation (TLS) simplifies program parallelization by allowing us to parallelize optimistically without violating correct sequential program behavior. Using a Java virtual machine with dynamic compilation support coupled with a hardware profiler, speculative buffer requirements and inter-thread dependencies of prospective speculative thread loops (STLs) are analyzed in real time to identify the best loops to parallelize. Once sufficient data has been collected to make a reasonable decision, selected loops are dynamically recompiled to run in parallel. Experimental results demonstrate that Jrpm can exploit thread-level parallelism with minimal effort from the programmer. On four processors, we achieved speedups of 3 to 4 for floating point applications, 2 to 3 on multimedia applications, and between 1.5 and 2.5 on integer applications. Performance was achieved by automatic selection of thread decompositions by the hardware profiler, intra-procedural optimizations on code compiled dynamically into speculative threads, and some minor programmer transformations for exposing parallelism that cannot be performed automatically.
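TLS itself is a hardware mechanism and cannot be written down in portable code, but the shape of a loop Jrpm would select is easy to show. The sketch below is an invented example, not code from the paper: iterations usually update different buckets, so running them as speculative threads is almost always safe, and the hardware catches the rare collision.

    /* Illustrative speculative-thread-loop candidate (hypothetical code).
       Cross-iteration dependences through table[b] are possible but rare,
       so TLS hardware can run iterations in parallel, buffer their writes,
       and re-execute a later iteration only when a real conflict occurs. */
    typedef struct { int count; } bucket_t;

    void histogram(bucket_t *table, const int *keys, int n, int nbuckets)
    {
        for (int i = 0; i < n; i++) {   /* each iteration -> one speculative thread */
            int b = keys[i] % nbuckets; /* usually a distinct bucket */
            table[b].count++;           /* rare, unpredictable dependence */
        }
    }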
Using Thread-Level Speculation to Simplify Manual Parallelization
2003
"... In this paper, we provide examples of how thread-level speculation (TLS) simplifies manual parallelization and enhances its performance. A number of techniques for manual parallelization using TLS are presented and results are provided that indicate the performance contribution of each technique on ..."
Abstract
-
Cited by 46 (7 self)
Abstract: In this paper, we provide examples of how thread-level speculation (TLS) simplifies manual parallelization and enhances its performance. A number of techniques for manual parallelization using TLS are presented, and results are provided that indicate the performance contribution of each technique on seven SPEC CPU2000 benchmark applications. We also provide indications of the programming effort required to parallelize each benchmark. TLS parallelization yielded a 110% speedup on our four floating point applications and a 70% speedup on our three integer applications, while requiring only approximately 80 programmer hours and 150 lines of non-template code per application. These results support the idea that manual parallelization using TLS is an efficient way to extract fine-grain thread-level parallelism.
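One family of manual TLS-enabling transformations separates a serial pointer chase from the parallel work. The sketch below is my own illustration of that general idea, not a technique quoted from the paper; heavy_work and the use of OpenMP as a stand-in for speculative threads are assumptions.

    #include <stddef.h>

    typedef struct node { struct node *next; double val; } node_t;

    static double heavy_work(double v) { return v * v; } /* stand-in */

    /* A cheap serial pass gathers node pointers; the expensive loop then
       has independent iterations that speculative threads (here faked with
       OpenMP) can execute without waiting on p = p->next from the previous
       iteration. */
    double process_list(node_t *head, node_t **scratch, size_t max)
    {
        size_t n = 0;
        for (node_t *p = head; p != NULL && n < max; p = p->next)
            scratch[n++] = p;

        double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
        for (size_t i = 0; i < n; i++)
            sum += heavy_work(scratch[i]->val);
        return sum;
    }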
An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors
1997
"... The memory consistency model of a shared-memory multiprocessor determines the extent to which memory operations may be overlapped or reordered for better performance. Studies on previous-generation shared-memory multiprocessors have shown that relaxed memory consistency models like release consisten ..."
Abstract
-
Cited by 45 (12 self)
Abstract: The memory consistency model of a shared-memory multiprocessor determines the extent to which memory operations may be overlapped or reordered for better performance. Studies on previous-generation shared-memory multiprocessors have shown that relaxed memory consistency models like release consistency (RC) can significantly outperform the conceptually simpler model of sequential consistency (SC). Current and next-generation multiprocessors use commodity microprocessors that aggressively exploit instruction-level parallelism (ILP) using methods such as multiple issue, dynamic scheduling, and non-blocking reads. For such processors, researchers have conjectured that two techniques, hardware-controlled non-binding prefetching and speculative reads, have the potential to equalize the hardware performance of memory consistency models. These techniques have recently begun to appear in commercial microprocessors, and they re-open the question of whether the performance benefits of release consistency...
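For readers who want the SC-versus-RC distinction made concrete, here is a minimal message-passing sketch in C11 atomics. It is a generic illustration, not an example from the paper: under sequential consistency every access is globally ordered, while release/acquire orders only the accesses around the synchronization pair and leaves the rest free to overlap.

    #include <stdatomic.h>

    int payload;                 /* ordinary data */
    atomic_int ready = 0;        /* synchronization flag */

    void producer(void)
    {
        payload = 42;            /* free to be buffered or reordered... */
        atomic_store_explicit(&ready, 1, memory_order_release);
        /* ...but the release guarantees it is visible before 'ready' */
    }

    void consumer(void)
    {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                    /* acquire pairs with the release */
        /* payload is guaranteed to be 42 here */
    }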
Scaling applications to massively parallel machines using projections performance analysis tool
Future Generation Computer Systems, Special Issue on Large-Scale System Performance Modeling and Analysis, 2005
"... Some of the most challenging applications to parallelize scalably are the ones that present a relatively small amount of computation per iteration. Multiple interacting performance challenges must be identified and solved to attain high parallel efficiency in such cases. We present case studies invo ..."
Abstract
-
Cited by 41 (23 self)
Abstract: Some of the most challenging applications to parallelize scalably are the ones that present a relatively small amount of computation per iteration. Multiple interacting performance challenges must be identified and solved to attain high parallel efficiency in such cases. We present case studies involving NAMD, a parallel classical molecular dynamics application for large biomolecular systems, and CPAIMD, a Car-Parrinello ab initio molecular dynamics application, and efforts to scale them to large numbers of processors. Both applications are implemented in Charm++, and the performance analysis was carried out using Projections, the performance visualization/analysis tool associated with Charm++. We showcase a series of optimizations facilitated by Projections. The resultant performance of NAMD led to a Gordon Bell award at SC2002, with unprecedented speedup on 3,000 processors and teraflops-level peak performance. We also explore techniques for applying the performance visualization/analysis tool on future-generation extreme-scale parallel machines and discuss the scalability issues with Projections.
Scaling molecular dynamics to 3000 processors with projections: A performance analysis case study
In Terascale Performance Analysis Workshop, International Conference on Computational Science (ICCS), 2003
"... Abstract. Some of the most challenging applications to parallelize scalably are the ones that present a relatively small amount of computation per iteration. Multiple interacting performance challenges must be identified and solved to attain high parallel efficiency in such cases. We present a case ..."
Abstract
-
Cited by 35 (13 self)
Abstract: Some of the most challenging applications to parallelize scalably are the ones that present a relatively small amount of computation per iteration. Multiple interacting performance challenges must be identified and solved to attain high parallel efficiency in such cases. We present a case study involving NAMD, a parallel molecular dynamics application, and efforts to scale it to run on 3000 processors with teraflops-level performance. NAMD is implemented in Charm++, and the performance analysis was carried out using Projections, the performance visualization/analysis tool associated with Charm++. We showcase a series of optimizations facilitated by Projections. The resultant performance of NAMD led to a Gordon Bell award at SC2002.
ZPL: A Machine Independent Programming Language for Parallel Computers
IEEE Transactions on Software Engineering, 2000
"... The goal of producing architecture-independent parallel programs is complicated by the competing need for high performance. The ZPL programming language achieves both goals by building upon an abstract parallel machine and by providing programming constructs that allow the programmer to "see ..."
Abstract
-
Cited by 34 (3 self)
Abstract: The goal of producing architecture-independent parallel programs is complicated by the competing need for high performance. The ZPL programming language achieves both goals by building upon an abstract parallel machine and by providing programming constructs that allow the programmer to "see" this underlying machine. This paper describes ZPL and provides a comprehensive evaluation of the language with respect to its goals of performance, portability, and programming convenience. In particular, we describe ZPL's machine-independent performance model, describe the programming benefits of ZPL's region-based constructs, summarize the compilation benefits of the language's high-level semantics, and summarize empirical evidence that ZPL has achieved both high performance and portability on diverse machines such as the IBM SP-2, Cray T3E, and SGI Power Challenge.
Index Terms: portable, efficient, parallel programming language.
This research was supported by DARPA Grant F30602-97-1-0152, a grant of HPC time from the Arctic Region Supercomputing Center, NSF Grant CCR-9707056, and ONR Grant N00014-99-1-0402.
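To give a feel for the region-based constructs the abstract mentions, the sketch below pairs a ZPL-style array statement (paraphrased from the Jacobi example commonly used in the ZPL literature) with the C loop nest it corresponds to. The C rendering is my own rough translation, not output of the ZPL compiler.

    /* ZPL (paraphrased):
         region R = [1..n, 1..n];
         [R] Temp := (A@north + A@east + A@south + A@west) / 4.0;
       The region names the index set once; the @ operators shift the
       array and make the implied communication visible to the
       programmer, which is the basis of ZPL's performance model. */
    void jacobi_step(int n, double A[n + 2][n + 2], double Temp[n + 2][n + 2])
    {
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                Temp[i][j] = (A[i - 1][j] + A[i][j + 1] +
                              A[i + 1][j] + A[i][j - 1]) / 4.0;
    }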
Real-time Performance Monitoring, Adaptive Control, and Interactive Steering of Computational Grids
"... this paper is organized as follows. In 2, we 1 This work was supported in part by the Defense Advanced Research Projects Agency under DARPA contracts DABT6394 -C0049 (SIO Initiative), F30602-96-C-0161, and DABT63-96-C-0027 by the National Science Foundation under grants NSF CDA 94-01124, ASC 97-202 ..."
Abstract
-
Cited by 24 (0 self)
this paper is organized as follows. In Section 2, we ... This work was supported in part by the Defense Advanced Research Projects Agency under DARPA contracts DABT63-94-C-0049 (SIO Initiative), F30602-96-C-0161, and DABT63-96-C-0027; by the National Science Foundation under grants NSF CDA 94-01124, ASC 97-20202, and EIA 99-77284; by the NSF PACI Computational Science Alliance Cooperative Agreement; and by the Department of Energy under contracts DOE B-341492, W-7405-ENG-48, and 1-B-333164.