Nonlinear Array Dependence Analysis
, 1991
Cited by 63 (5 self)
Standard array data dependence techniques can only reason about linear constraints. There has also been work on analyzing some dependences involving polynomial constraints. Analyzing array data dependences in real-world programs requires handling many "unanalyzable" terms: subscript arrays, run-time tests, function calls. The standard approach to analyzing such programs has been to omit and ignore any constraints that cannot be reasoned about. This is unsound when reasoning about value-based dependences and whether privatization is legal. Also, this prevents us from determining the conditions that must be true to disprove the dependence. These conditions could be checked by a run-time test or verified by a programmer or aggressive, demand-driven interprocedural analysis. We describe a solution to these problems. Our solution makes our system sound and more accurate for analyzing value-based dependences and derives conditions that can be used to disprove dependences. We also give some p...
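The idea of deriving a condition that disproves a dependence and checking it at run time can be illustrated with a small sketch. It is not the authors' system: the loop, the function names, and the specific condition (all entries of a subscript array are distinct) are assumptions chosen for illustration. When the condition holds, no two iterations touch the same array element, so the loop carries no dependence and its iterations may execute in any order.

```python
# Hypothetical loop under analysis:  a[idx[i]] = a[idx[i]] * 2 + b[i]
# A static analyzer cannot reason about idx[i], but it can report the
# condition "idx[i] != idx[j] for all i != j" as sufficient to rule out
# any loop-carried dependence. That condition is cheap to test at run time.

def subscripts_are_distinct(idx):
    """Run-time test: True if no two iterations access the same element."""
    return len(set(idx)) == len(idx)


def run_loop(a, idx, b, order):
    """Execute the loop body over the iterations in the given order."""
    out = list(a)
    for i in order:
        out[idx[i]] = out[idx[i]] * 2 + b[i]
    return out


def run_checked(a, idx, b):
    """Pick an iteration order based on the run-time test.

    Reversed order stands in for "any order", e.g. a parallel schedule.
    """
    n = len(idx)
    if subscripts_are_distinct(idx):
        return run_loop(a, idx, b, reversed(range(n)))
    return run_loop(a, idx, b, range(n))  # fall back to sequential order
```

With distinct subscripts the reordered loop produces the same result as the sequential one; with a repeated subscript the two orders differ, which is why the fallback path is needed.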
An Overview of Symbolic Analysis Techniques Needed for the Effective Parallelization of the Perfect Benchmarks
, 1994
Cited by 34 (11 self)
We have identified symbolic analysis techniques that will improve the effectiveness of parallelizing Fortran compilers, with emphasis upon data dependence analysis. We have done this by comparing the automatically and manually parallelized versions of the Perfect Benchmarks. The techniques include: symbolic data dependence tests for nonlinear expressions, constraint propagation, array summary information, and run-time tests.
FALCON: A MATLAB Interactive Restructuring Compiler
- IN LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING
, 1995
Cited by 28 (10 self)
The development of efficient numerical programs and library routines for high-performance parallel computers is a complex task requiring not only an understanding of the algorithms to be implemented, but also detailed knowledge of the target machine and the software environment. In this paper, we describe a programming environment that can utilize such knowledge for the development of high-performance numerical programs and libraries. This environment uses an existing high-level array language (MATLAB) as source language and performs static, dynamic, and interactive analysis to generate Fortran 90 programs with directives for parallelism. It includes capabilities for interactive and automatic transformations at both the operation-level and the functional- or algorithm-level. Preliminary experiments, comparing interpreted MATLAB programs with their compiled versions, show that compiled programs can perform up to 48 times faster on a serial machine, and up to 140 times fas...
The Polaris Internal Representation
, 1994
Cited by 24 (3 self)
The Polaris Program Manipulation System is a production quality tool for source-to-source transformations and complex analysis of Fortran code. In this paper we describe the motivations for and the design of Polaris' internal representation. The internal representation is composed of a basic abstract syntax tree on top of which exist many layers of functionality. This functionality allows complex operations on the data structure as well as allowing it to emulate other internal representations. Further, the internal representation is designed to enforce the consistency of the state of the internal structure in terms of both the correctness of the data structure and the correctness of the Fortran code being manipulated. In addition, operations on the internal representation result in the automatic updating of affected data structures such as flow information. We describe how the system's philosophies developed from its predecessor, the Delta prototyping system, and how they were implemen...
Strings: A High-Performance Distributed Shared Memory for Symmetrical Multiprocessor Clusters
- in Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing
, 1998
Cited by 14 (7 self)
This paper describes Strings, a multi-threaded DSM developed by us. The distinguishing feature of Strings is that it incorporates Posix1.c threads multiplexed on kernel light-weight processes for better performance. The kernel can schedule multiple threads across multiple processors using these lightweight processes. Thus, Strings is designed to exploit data parallelism at the application level and task parallelism at the DSM system level. We show how using multiple kernel threads can improve the performance even in the presence of false sharing, using matrix multiplication as a case study. We also show the performance results with benchmark programs from the SPLASH-2 suite [17]. Though similar work has been demonstrated with SoftFLASH [18], our implementation is completely in user space and thus more portable. Some other research has studied the effect of clustering in SMPs using simulations [19]. We have shown results from runs on an actual network of SMPs.
FALCON: An Environment for the Development of Scientific Libraries and Applications
, 1995
Cited by 9 (4 self)
We summarize our work on the development of FALCON, a programming environment based on MATLAB. This environment includes capabilities for the rapid prototyping of algorithms, and for interactive and automatic transformations at both the operation-level and the function- or algorithmic-level in order to obtain good numerical and computational performance. FALCON supports the development and reuse of numerical programs and libraries, and combines the transformation and analysis techniques used in restructuring compilers with the algebraic techniques used by developers to express and manipulate their algorithms in an intuitively useful manner.
A Compiler-Directed Cache Coherence Scheme with Improved Intertask Locality
, 1994
Cited by 9 (3 self)
In this paper, we introduce a compiler-directed coherence scheme which can exploit most of the temporal and spatial locality across task boundaries. It requires only an extended tag field per cache word, one modified memory access instruction, and a counter called the epoch counter in each processor. By using the epoch counter as a system-wide version number, the scheme simplifies the cache hardware of previous version control [5] or timestamp-based schemes [12], but still exploits most of the temporal and spatial locality across task boundaries. We present a compiler algorithm to generate the appropriate memory access instructions for the proposed scheme. The algorithm is based on a data flow analysis technique. It identifies potential stale references by examining memory reference patterns in a source program. 1 Introduction Reducing memory latency is critical to the performance of large-scale parallel systems. Due to the temporal and spatial locality of memory reference patter...
Hardware And Compiler Support For Cache Coherence In Large-Scale Shared-Memory Multiprocessors
, 1996
Cited by 8 (4 self)
... compiler can detect potentially stale references and what kind of performance can be obtained using a real compiler. Also, most of the compiler-directed coherence schemes proposed to date have not addressed the real cost of the required hardware support. For example, many of the schemes require expensive hardware support and assume a cache organization with single-word cache lines and a word-addressable architecture. Also, the issues of synchronization, such as lock variables and critical sections, have been addressed rarely. This dissertation addresses these hardware and compiler implementation issues and investigates the feasibility and performance of the compiler-directed cache coherence approach. We propose a new compiler-directed scheme that can be implemented on a large-scale multiprocessor using off-the-shelf microprocessors. The scheme can be adapted to various cache organizations, including multi-word cache lines and byte-addressable architectures. Several system related is
Parallelization of NAS Benchmarks for Shared Memory Multiprocessors
- in Proceedings of High Performance Computing and Networking (HPCN Europe '98)
, 1998
Cited by 7 (3 self)
This paper presents our experiences of parallelizing the sequential implementation of the NAS benchmarks using compiler directives on the SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of porting the code to parallelization tools and compilers. Due to the simplicity of programming shared-memory multiprocessors, compiler developers have provided various facilities to allow users to exploit parallelism. Native compilers on the SGI Origin2000 support multiprocessing directives to allow users to exploit loop-level parallelism in their programs. Additionally, supporting tools can accomplish this process automatically. We experimented with these compiler directives and supporting tools by parallelizing the sequential implementation of the NAS benchmarks. Results reported in this paper indicate that with minimal effort, the performance gain is comparable with the hand-parallelized, carefully optimized, message-passing implementations of the same benchmarks.

