Automatic Program Parallelization
1993
Cited by 97 (8 self)
This paper presents an overview of automatic program parallelization techniques. It covers dependence analysis techniques, followed by a discussion of program transformations, including straight-line code parallelization, do loop transformations, and parallelization of recursive routines. The last section of the paper surveys several experimental studies on the effectiveness of parallelizing compilers.
Localizing Non-affine Array References
1999
Cited by 44 (9 self)
Existing techniques can enhance the locality of arrays indexed by affine functions of induction variables. This paper presents a technique to localize non-affine array references, such as the indirect memory references common in sparse-matrix computations. Our optimization combines elements of tiling, data-centric tiling, data remapping and inspector-executor parallelization. We describe our technique, bucket tiling, which includes the tasks of permutation generation, data remapping, and loop regeneration. We show that profitability cannot generally be determined at compile-time, but requires an extension to run-time. We demonstrate our technique on three codes: integer sort, conjugate gradient, and a kernel used in simulating a beating heart. We observe speedups of 1.91 on integer sort, 1.57 on conjugate gradient, and 2.69 on the heart kernel.
1. Introduction
Researchers have long sought to increase data locality and exploit parallelism in loop nests [34, 32, 16, 5, 33, 18]. These wor...
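The abstract above names three tasks (permutation generation, data remapping, loop regeneration) without giving details. The following is a minimal Python sketch of the general inspector-executor idea only, not the paper's algorithm: the function names, the scatter-add loop, and the `tile_size` parameter are all assumptions chosen for illustration. The inspector groups iterations by the tile of the target array that their indirect reference touches, and the executor then runs the iterations in that order.

```python
def bucket_permutation(idx, tile_size):
    """Inspector (hypothetical helper): return an iteration order in which
    iterations touching the same tile of the target array are adjacent.
    Implemented as a stable counting sort on the bucket id idx[i] // tile_size."""
    num_buckets = (max(idx) // tile_size) + 1 if idx else 0
    starts = [0] * (num_buckets + 1)
    for j in idx:
        starts[j // tile_size + 1] += 1
    for b in range(num_buckets):
        starts[b + 1] += starts[b]  # starts[b] is now the first slot of bucket b
    perm = [0] * len(idx)
    for i, j in enumerate(idx):
        b = j // tile_size
        perm[starts[b]] = i
        starts[b] += 1
    return perm


def scatter_add_original(y_len, idx, x):
    """Original loop: y[idx[i]] += x[i], an indirect (non-affine) reference."""
    y = [0] * y_len
    for i in range(len(idx)):
        y[idx[i]] += x[i]
    return y


def scatter_add_bucketed(y_len, idx, x, tile_size):
    """Executor: same computation, with iterations reordered bucket by bucket."""
    perm = bucket_permutation(idx, tile_size)
    # Data remapping: copy the inputs into the permuted order so the
    # regenerated loop reads them sequentially.
    idx_r = [idx[i] for i in perm]
    x_r = [x[i] for i in perm]
    y = [0] * y_len
    for k in range(len(perm)):
        y[idx_r[k]] += x_r[k]
    return y
```

Reordering is safe here because integer addition is commutative and associative; with floating-point data the reordered sum may round differently. Whether the reordering pays for the inspector cost depends on the input, which matches the abstract's point that profitability cannot generally be decided at compile time.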
Compiler Generated Multithreading to Alleviate Memory Latency
Journal of Universal Computer Science, special issue on Multithreaded Processors and Chip-Multiprocessors, 2000
Cited by 6 (6 self)
Since the era of vector and pipelined computing, computational speed has been limited by memory access time. Faster caches and more cache levels are used to bridge the growing gap between memory and processor speeds. With the advent of multithreaded processors, it becomes feasible to fetch data and compute concurrently in two cooperating threads. A technique is presented to generate these threads at compile time, taking into account the characteristics of both the program and the underlying architecture. The results have been evaluated for an explicitly parallel processor. For a number of common programs, the data-fetch thread allows the computation to continue without cache miss stalls.
Guiding Program Transformations with Modal Performance Models
2000
Cited by 4 (1 self)
Successful program optimization requires analysis of profitability. From this analysis, a compiler or runtime system can decide where and how to apply an assortment of program transformations. This two-faced problem is called transformation guidance. We consider the goal of robust guidance of performance optimizations for hierarchical systems. A guidance system is robust if it unifies disparate sources of knowledge and makes reasonable decisions that hold up despite a lack of definitive information. In particular, we seek to address concerns presented by aspects of syntax, architecture, and data set. Syntax may not be statically analyzable; for example, the data dependences due to A(B(i)) (an indirect memory reference) cannot be determined until runtime. Architecture poses a problem in the complexity of the relationship between its properties and performance. Data set shares both problems: on the one hand, we cannot analyze properties of unavailable data; on the other, once the data is available, we cannot easily predict how its properties, combined with architectural properties, influence execution time. This thesis solves aspects of this robust guidance problem. First, we present bucket tiling, a program transformation for locality which handles non-affine array references (such as the indirect reference mentioned above). Bucket tiling improves the performance of codes such as conjugate gradient and integer sort by 1.5 to 2.8 times. We have developed a tool which automatically applies bucket tiling to C or Fortran codes. Guiding locality optimizations such as bucket tiling in a robust manner requires a new modeling strategy. We present the abstraction of modal models. A modal model recognizes and leverages the following observation: many aspects of a program's behavior can be assigned to a small, finite number of distinguishable categories.
We develop a modal model for guiding locality transformations which uses three parameterized modes to represent three different access patterns. We show how to experimentally determine parameterized formulas for the execution time of these modes on any given target platform. Further, we use these modes as the basis for a calculus of performance modeling for our guidance system. Given any program, represented as a tree of modes, we show how to determine an execution time formula for the program. For bucket tiling, we determine execution time formulas for the original and transformed programs, and use these to guide the decision on performing the transformation. We also contrast the modal modeling approach with a static-combinatoric approach, which models behavior by counting some observable property, such as cache misses. This contrast highlights the principal advantage of modal modeling: robustness to syntax, architecture, and data set properties.
Asynchronous Resource Management
In Proceedings of the International Parallel and Distributed Processing Symposium, 2001
Cited by 1 (0 self)
As the organization of high-performance computers becomes more complex, the task of managing resources on them becomes increasingly difficult. The software layers, which include the operating system, the runtime system, and the compiler, must now map applications to machine architectures that consist of multiple CPUs, several layers of cache, deep memory hierarchies, file systems, interconnects, and network interface cards, along with logical resources defined in the software system. To fully utilize all the available resources, software systems may use multiple sequential processes or threads that act on the passive resources of the system. This paper introduces a resource-centric, event-driven model, where resources are active objects. We describe an algorithm that implements this model and show that it can significantly improve the performance of a wide variety of applications.
1 Introduction
In order to fully utilize the available resources of a computer system, software systems use s...
A Geometric Approach for Partitioning N-Dimensional Non-Rectangular Iteration Spaces
Cited by 1 (0 self)
Parallel loops account for the greatest percentage of program parallelism. The degree to which parallelism can be exploited and the amount of overhead involved during parallel execution of a nested loop directly depend on partitioning, i.e., the way the different iterations of a parallel loop are distributed across different processors. Thus, partitioning of parallel loops is of key importance for high performance and efficient use of multiprocessor systems. Although a significant amount of work has been done on partitioning and scheduling of rectangular iteration spaces, the problem of partitioning non-rectangular iteration spaces (e.g., triangular or trapezoidal iteration spaces) has not been given enough attention so far. In this paper, we present a geometric approach for partitioning N-dimensional non-rectangular iteration spaces for optimizing performance on parallel processor systems. Speedup measurements for kernels (loop nests) of linear algebra packages are presented.
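To illustrate why non-rectangular iteration spaces need special treatment, here is a minimal Python sketch, assuming a lower-triangular loop nest in which outer iteration i runs i + 1 inner iterations. It is not the paper's geometric method; the function names and the prefix-sum strategy are assumptions made for illustration. It cuts the outer loop into contiguous chunks so that each processor receives roughly the same number of inner iterations, rather than the same number of rows.

```python
def partition_triangular(n, p):
    """Split outer iterations 0..n-1 of the nest
        for i in range(n):
            for j in range(i + 1): ...
    into p contiguous chunks of roughly equal total work.
    Returns p + 1 boundaries; chunk k covers rows bounds[k]..bounds[k+1]-1."""
    total = n * (n + 1) // 2          # number of points in the triangle
    bounds = [0]
    work = 0                          # points in rows 0..i-1
    i = 0
    for k in range(1, p):
        # Advance until the cumulative work reaches k/p of the total
        # (integer comparison avoids floating-point rounding).
        while i < n and work * p < k * total:
            work += i + 1
            i += 1
        bounds.append(i)
    bounds.append(n)
    return bounds


def chunk_work(lo, hi):
    """Number of inner iterations executed for rows lo..hi-1."""
    return sum(i + 1 for i in range(lo, hi))
```

For n = 8 and p = 2, an even split by rows ([0, 4, 8]) gives the two processors 10 and 26 inner iterations, while the boundaries computed here ([0, 6, 8]) give 21 and 15.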
βτοο: Object Oriented Language Compilation for Fine Grained Targets
1992
βτοο (Bee-too) is an object oriented programming language integrating class abstraction with the block structured function semantics of conventional imperative languages. High concurrency and architecture independence are promoted by the programming model, the βτοο compiler, and the fine-grained method-state graph generated as an intermediate representation for subsequent target-generation stages. After briefly introducing the aims of the project and the programming model, this paper concentrates on the qualities of the programming model and the role of the compiler in generating its intermediate-representation method-state graph. The graph is generated from expressions with imperative operations on closure-abstracted object interfaces, and describes actions and storage which approach the granularity of those found at the lowest hardware levels and hence are extremely fine-grained. Graph structure re-writing during target-generation caters for different architectures. A formal specification and concrete syntax o...
Cache Remapping to Improve the Performance of Tiled Algorithms
With increasing processing power, the latency of the memory hierarchy becomes the stumbling block of many modern computer architectures. In order to speed up the calculations, different forms of tiling are used to keep data at the fastest cache level. However, conflict misses cannot easily be avoided using the current techniques. In this paper, cache remapping is presented as a new way to eliminate conflict as well as capacity and cold misses in regular array computations. The method uses advanced cache hints which can be exploited at compile time. The results on a set of typical examples are very favorable, and it is shown that cache remapping is amenable to an efficient compiler implementation.
A Novel Approach for Partitioning Iteration Spaces with Variable Densities
Efficient partitioning of parallel loops plays a critical role in high performance and efficient use of multiprocessor systems. Although a significant amount of work has been done on partitioning and scheduling of loops with rectangular iteration spaces, the problem of partitioning non-rectangular iteration spaces (e.g., triangular or trapezoidal iteration spaces) with variable densities has not been addressed so far, to the best of our knowledge. In this paper, we present a mathematical model for partitioning N-dimensional non-rectangular iteration spaces with variable densities. We present a unimodular loop transformation and a geometric approach for partitioning an iteration space along an axis corresponding to the outermost loop across a given number of processors to achieve near-optimal performance, i.e., to achieve near-optimal load balance across different processors. We present a case study to illustrate the effectiveness of our approach.
Categories and Subject Descriptors: D.1.3 [Software]: Programming Techniques - parallel programming

