Implementation of a Portable Nested Data-Parallel Language
- Journal of Parallel and Distributed Computing
, 1994
Cited by 154 (26 self)
This paper gives an overview of the implementation of Nesl, a portable nested data-parallel language. This language and its implementation are the first to fully support nested data structures as well as nested data-parallel function calls. These features allow the concise description of parallel algorithms on irregular data, such as sparse matrices and graphs. In addition, they maintain the advantages of data-parallel languages: a simple programming model and portability. The current Nesl implementation is based on an intermediate language called Vcode and a library of vector routines called Cvl. It runs on the Connection Machine CM-2, the Cray Y-MP C90, and serial machines. We compare initial benchmark results of Nesl with those of machine-specific code on these machines for three algorithms: least-squares line-fitting, median finding, and a sparse-matrix vector product. These results show that Nesl's performance is competitive with that of machine-specific codes for regular dense da...
ICC++ -- A C++ Dialect for High Performance Parallel Computing
- In Proceedings of the 2nd International Symposium on Object Technologies for Advanced Software
, 1996
Cited by 55 (10 self)
ICC++ is a new C++ concurrent dialect which allows sequential/parallel program versions to be maintained with single source, the construction of concurrent data abstractions, convenient expression of irregular and fine-grained concurrency, and supports high performance implementations. ICC++ provides annotations for potential concurrency, facilitating both sharing source with sequential programs and grain size tuning for efficient execution. ICC++ has a notion of object consistency which can be extended structurally and procedurally to implement larger data abstractions. Finally, ICC++ integrates arrays into the object system and hence the concurrency model. In short, ICC++ addresses concurrency and its relation to abstractions -- whether they are implemented by single objects, several objects, or object collections. The design of the language, its rationale, and current status are all described. Keywords concurrent object-oriented programming, concurrent languages, parallel...
Optimal Evaluation of Array Expressions on Massively Parallel Machines
- ACM Transactions on Programming Languages and Systems
, 1992
Techniques for the Translation of MATLAB Programs into Fortran 90
- ACM Transactions on Programming Languages and Systems
, 1999
Cited by 21 (3 self)
This article describes the main techniques developed for FALCON's MATLAB-to-Fortran 90 compiler. FALCON is a programming environment for the development of high-performance scientific programs. It combines static and dynamic inference methods to translate MATLAB programs into Fortran 90. The static inference is supported with advanced value propagation techniques and symbolic algorithms for subscript analysis. Experiments show that FALCON's MATLAB translator can generate code that performs more than 1000 times faster than the interpreted version of MATLAB and substantially faster than commercially available MATLAB compilers on one processor of an SGI Power Challenge. Furthermore, in most cases we have tested, the compiler-generated code is as fast as corresponding hand-written programs.
Flattening Trees
, 1998
Cited by 14 (5 self)
Nested data-parallelism can be efficiently implemented by mapping it to flat parallelism using Blelloch & Sabot's flattening transformation. So far, the only dynamic data structures supported by flattening are vectors. We extend it with support for user-defined recursive types, which allow parallel tree structures to be defined. Thus, important parallel algorithms can be implemented more clearly and efficiently.
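As a rough illustration of the idea behind flattening, the sketch below covers only the existing vector case, not the recursive-type extension this paper proposes. It stores a nested list as one flat data vector plus a vector of segment lengths, then computes per-segment sums from flat operations alone. The names `flatten` and `segmented_sum` are hypothetical and are not taken from the paper or from any Nesl library.

```python
from itertools import accumulate


def flatten(nested):
    """Split a nested list into a flat data vector and a segment-length vector."""
    data = [x for segment in nested for x in segment]
    lengths = [len(segment) for segment in nested]
    return data, lengths


def segmented_sum(data, lengths):
    """Sum each segment using only operations over flat vectors."""
    # Prefix sums over the flat data, with a leading 0.
    prefix = [0] + list(accumulate(data))
    # End offset of each segment in the flat vector, with a leading 0.
    ends = [0] + list(accumulate(lengths))
    # Each segment sum is the difference of two prefix sums.
    return [prefix[ends[i + 1]] - prefix[ends[i]] for i in range(len(lengths))]


rows = [[1, 2, 3], [], [4, 5]]
data, lengths = flatten(rows)
print(segmented_sum(data, lengths))  # [6, 0, 9]
```

The nested loop over rows disappears: both prefix-sum passes run over flat vectors, which is what lets irregular nested data map onto flat data-parallel primitives.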
Automatic Synchronization Elimination in Synchronous FORALLs
- In Frontiers '95: The 5th Symposium on the Frontiers of Massively Parallel Computation
, 1995
Cited by 13 (1 self)
This paper investigates a promising optimization technique that automatically eliminates redundant synchronization barriers in synchronous FORALLs. We present complete algorithms for the necessary program restructurings and subsequent code generation. Furthermore, we discuss the correctness, complexity, and performance of our restructuring algorithm before we finally evaluate its practical usefulness by quantitative experimentation. The experimental evaluation results are very encouraging. An implementation of the optimization algorithms in our Modula-2* compiler eliminated more than 50% of the originally present synchronization barriers in a set of seven parallel benchmarks. This barrier reduction improved the execution times of the generated programs by over 40% on a MasPar MP-1 with 16384 processors and by over 100% on a sequential workstation.
On the Distributed Implementation of Aggregate Data Structures by Program Transformation
- In Proceedings of the 4th IPPS/SDP International Workshop on High-Level Parallel Programming Models and Supportive Environments,IPPS/SDP99
, 1999
Cited by 11 (6 self)
A critical component of many data-parallel programming languages are operations that manipulate aggregate data structures as a whole---this includes Fortran 90, Nesl, and languages based on BMF. These operations are commonly implemented by a library whose routines operate on a distributed representation of the aggregate structure; the compiler merely generates the control code invoking the library routines and all machine-dependent code is encapsulated in the library. While this approach is convenient, we argue that by breaking the abstraction enforced by the library and by presenting some of its internals in the form of a new intermediate language to the compiler back-end, we can optimize on all levels of the memory hierarchy and achieve more flexible data distribution. The new intermediate language allows us to present these optimisations elegantly as program transformations. We report on first results obtained by our approach in the implementation of nested data parallelis...
Piecewise Execution of Nested Data-Parallel Programs
- Languages and Compilers for Parallel Computing, volume 1033 of Lecture Notes in Computer Science
, 1995
Cited by 10 (2 self)
The technique of flattening nested data parallelism combines all the independent operations in nested apply-to-all constructs and generates large amounts of potential parallelism for both regular and irregular expressions. However, the resulting data-parallel programs can have enormous memory requirements, limiting their utility. In this paper, we present piecewise execution, an automatic method of partially serializing data-parallel programs so that they achieve maximum parallelism within storage limitations. By computing large intermediate sequences in pieces, our approach requires asymptotically less memory to perform the same amount of work. By using characteristics of the underlying parallel architecture to drive the computation size, we retain effective use of a parallel machine at each step. This dramatically expands the class of nested data-parallel programs that can be executed using the flattening technique. With the addition of piecewise I/O operations, these techniques can be applied to generate out-of-core execution on large datasets.
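The memory argument in this abstract can be sketched with a toy example. The code below is an assumption-laden illustration and not the paper's method: it computes a sum of squares over `n` elements while materializing at most `piece_size` intermediate values at a time. The function names and the fixed `piece_size` parameter are hypothetical. In the paper the piece size is driven by characteristics of the parallel machine, and the transformation is automatic rather than hand-written.

```python
def pieces(n, piece_size):
    """Yield consecutive index ranges of at most piece_size elements covering 0..n-1."""
    # piece_size must be positive; range() raises ValueError for a step of 0.
    for start in range(0, n, piece_size):
        yield range(start, min(start + piece_size, n))


def sum_of_squares_piecewise(n, piece_size):
    """Compute sum(i*i for i in 0..n-1) without building the full intermediate sequence."""
    total = 0
    for piece in pieces(n, piece_size):
        # Stand-in for a data-parallel step: only one piece of the
        # intermediate sequence of squares exists at any time.
        squares = [i * i for i in piece]
        total += sum(squares)
    return total


print(sum_of_squares_piecewise(10, 3))  # 285
```

The full intermediate sequence would need storage proportional to `n`; the piecewise version needs storage proportional to `piece_size` while doing the same total work.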
The Advantages of Multiple Parallelizations in Combinatorial Search
, 1994
Cited by 7 (2 self)
Applications typically have several potential sources of parallelism, and in choosing a particular parallelization, the programmer must balance the benefits of each source of parallelism with the corresponding overhead. The trade-offs are often difficult to analyze, as they may depend on the hardware architecture, software environment, input data, and properties of the algorithm. An example of this dilemma occurs in a wide range of problems that involve processing trees, wherein processors can be assigned either to separate subtrees, or to parallelizing the work performed on individual tree nodes. We explore the complexity of the trade-offs involved in this decision by considering alternative parallelizations of combinatorial search, examining the factors that determine the best-performing implementation for this important class of problems. Using subgraph isomorphism as a representative search problem, we show how the density of the solution space, the
Evaluating High Level Parallel Programming Support for Irregular Applications in ICC++
- In Proceedings of the International Scientific Computing in Object-oriented Parallel Environments Conference
Cited by 4 (1 self)
Object-oriented techniques have been proffered as aids for managing complexity, enhancing reuse, and improving readability of irregular parallel applications. However, as performance is the major reason for employing parallelism, programmability and high performance must be delivered together. Using a suite of seven challenging irregular applications and the mature Illinois Concert system (a high-level concurrent object-oriented programming model backed by an aggressive implementation), we evaluate what programming efforts are required to achieve high performance. For all seven applications, we achieve performance comparable to the best reported for low-level programming means on large-scale parallel systems. In general, a high-level concurrent object-oriented programming model supported by aggressive implementation techniques can eliminate programmer management of many concerns -- procedure and computation granularity, namespace management, and low-level concurrency management. Our st...

