Results 1 - 10 of 35
Finding effective optimization phase sequences
- In Proceedings of the 2003 ACM SIGPLAN conference on Language, Compiler, and Tool for Embedded Systems, 2003
Cited by 45 (11 self)
It has long been known that a single ordering of optimization phases will not produce the best code for every application. This phase ordering problem can be more severe when generating code for embedded systems due to the need to meet conflicting constraints on time, code size, and power consumption. Given that many embedded application developers are willing to spend time tuning an application, we believe a viable approach is to allow the developer to steer the process of optimizing a function. In this paper, we describe support in VISTA, an interactive compilation system, for finding effective sequences of optimization phases. VISTA provides the user with dynamic and static performance information that can be used during an interactive compilation session to gauge the progress of improving the code. In addition, VISTA provides support for automatically using performance information to select the best optimization sequence among several attempted. One such feature is the use of a genetic algorithm to search for the most efficient sequence based on specified fitness criteria. We have included a number of experimental results that evaluate the effectiveness of using a genetic algorithm in VISTA to find effective optimization phase sequences.
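The genetic-algorithm search mentioned in this abstract can be illustrated with a minimal sketch. This is not VISTA's implementation: the phase names, the `toy_cost` fitness function, and every parameter below are hypothetical stand-ins. A real system would compile the function with each candidate phase order and measure code size or dynamic instruction count instead.

```python
import random

# Hypothetical phase names, chosen for illustration only (not VISTA's list).
PHASES = ["cse", "dce", "licm", "unroll", "regalloc", "peephole"]

# Assumption: these adjacent orderings are "beneficial" in the toy model.
GOOD_PAIRS = {("cse", "dce"), ("licm", "unroll"), ("regalloc", "peephole")}


def toy_cost(sequence):
    """Stand-in fitness (lower is better): 100 minus 10 per beneficial adjacent pair."""
    bonus = sum(1 for a, b in zip(sequence, sequence[1:]) if (a, b) in GOOD_PAIRS)
    return 100 - 10 * bonus


def crossover(rng, a, b):
    """Single-point crossover: prefix of one parent, suffix of the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(rng, seq, rate=0.1):
    """Replace each phase with a random one with probability `rate`."""
    return [rng.choice(PHASES) if rng.random() < rate else p for p in seq]


def search(seq_len=8, pop_size=20, generations=30, seed=0):
    """Evolve phase sequences and return the lowest-cost one found."""
    rng = random.Random(seed)
    population = [[rng.choice(PHASES) for _ in range(seq_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged (elitism), refill with offspring.
        population.sort(key=toy_cost)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            children.append(mutate(rng, crossover(rng, a, b)))
        population = survivors + children
    return min(population, key=toy_cost)


if __name__ == "__main__":
    best = search()
    print(best, toy_cost(best))
```

Because the better half of each generation survives unchanged, the best cost in the population never gets worse from one generation to the next; whether that matches the selection scheme used in the paper is not stated in the abstract.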
Thread Fork/Join Techniques for Multi-level Parallelism Exploitation in NUMA Multiprocessors
- In 13th Int. Conference on Supercomputing ICS'99, Rhodes, 1999
Cited by 35 (18 self)
This paper presents some techniques for efficient thread forking and joining in parallel execution environments, taking into consideration the physical structure of NUMA machines and the support for multi-level parallelization and processor grouping. Two work generation schemes and one join mechanism are designed, implemented, evaluated and compared with the ones used in the IRIX MP library, an efficient implementation which supports a single level of parallelism. Supporting multiple levels of parallelism is a current research goal, both in shared and distributed memory machines. Our proposals include a first work generation scheme (GWD, or global work descriptor) which supports multiple levels of parallelism, but not processor grouping. The second work generation scheme (LWD, or local work descriptor) has been designed to support multiple levels of parallelism and processor grouping. Processor grouping is needed to distribute processors among different parts of the computation and ma...
A Framework for Exploiting Task- and Data-Parallelism on Distributed Memory Multicomputers
- IEEE Transactions on Parallel and Distributed Systems, 1997
Cited by 30 (0 self)
offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications–the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster. A practical implementation of a task and data parallel scheme of execution for an application on a distributed memory multicomputer also involves data redistribution. This data redistribution causes an overhead. However, as our experimental results show, this overhead is not a problem; execution of a program using task and data parallelism together can be significantly faster than its execution using data parallelism alone. This makes our proposed optimization practical and extremely useful.
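The diminishing-returns argument in this abstract can be checked with a small worked example. The sketch below is not taken from the PARADIGM compiler: it assumes an Amdahl-style cost model (a fixed serial part plus perfectly divided parallel work), two independent tasks with made-up costs, and no data redistribution overhead, which the abstract notes a real implementation would incur.

```python
def task_time(work, serial, procs):
    """Assumed cost model: serial part plus parallel work divided over procs."""
    return serial + work / procs


def data_parallel_only(tasks, procs):
    """Run the tasks one after another, each using all processors."""
    return sum(task_time(w, s, procs) for w, s in tasks)


def task_and_data_parallel(tasks, procs):
    """Run the tasks concurrently, each on an equal share of the processors.

    Assumes the tasks are independent and procs >= len(tasks).
    """
    share = procs // len(tasks)
    return max(task_time(w, s, share) for w, s in tasks)


# Hypothetical tasks as (parallel work, serial part) in arbitrary time units.
tasks = [(64.0, 4.0), (64.0, 4.0)]

if __name__ == "__main__":
    # Data parallelism alone: 2 * (4 + 64/16) = 16.0
    print(data_parallel_only(tasks, 16))
    # Two tasks side by side on 8 processors each: 4 + 64/8 = 12.0
    print(task_and_data_parallel(tasks, 16))
```

Under this model each task's serial part is paid twice in sequence when the tasks run back to back, but only once in wall-clock time when they run side by side, which is why the mixed schedule wins here.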
Execution-Driven Tools for Parallel Simulation of Parallel Architectures and Applications (*)
- 1993
Cited by 20 (12 self)
Execution-driven techniques instrument application codes to generate events for simulation or tracing. Previously reported systems using execution-driven techniques have provided critical path simulation, trace generation, or execution-driven simulation capabilities. Critical path simulation performs implicit, optimistic parallelization of serial codes and is used to study parallelism and performance. This paper describes EPG-sim, a newly-developed set of execution-driven tools that performs parallel execution-driven simulation and trace generation for studying parallel systems. These tools integrate the above execution-driven techniques into a single framework through the use of intelligent source-level instrumentation. The tools can be used to model varying processor and system architectures, and can simulate serial, optimistically parallelized, or parallel codes being executed on modeled parallel systems. Critical path simulations, trace generation, and execution-driven simulations ...
Kernel-Level Scheduling for the Nano-Threads Programming Model
- 1998
Cited by 19 (14 self)
Multiprocessor systems are increasingly becoming the systems of choice for low and high-end servers, running such diverse tasks as number crunching, large-scale simulations, data base engines and world wide web server applications. With such diverse workloads, system utilization and throughput, as well as execution time become important performance metrics. In this paper we present efficient kernel scheduling policies and propose a new kernel-user interface aiming at supporting efficient parallel execution in diverse workload environments. Our approach relies on support for user level threads which are used to exploit parallelism within applications, and a two-level scheduling policy which coordinates the number of resources allocated by the kernel with the number of threads generated by each application. We compare our scheduling policies with the native gang scheduling policy of the IRIX 6.4 operating system on a Silicon Graphics Origin2000. Our experimental results show substantial ...
Microarchitecture Support for Dynamic Scheduling of Acyclic Task Graphs
- In 25th Annual International Symposium on Microarchitecture, 1992
Cited by 16 (5 self)
It can be shown that any program can be broken into its loop structure, plus acyclic dependence graphs representing the body of each loop or subroutine. The parallelism inherent in these acyclic graphs augments the loop-level parallelism available in the program. This paper presents two algorithms for dynamic scheduling of such acyclic task graphs containing both data and control dependences, and describes a microarchitecture which implements these algorithms efficiently.
Keywords: Functional parallelism, fine-grain parallelism, microarchitecture, dynamic scheduling, parallelizing compiler.
This work was funded in part by NSF grant CCR 89-57310 PYI, DOE grant DE-FG0285ER25001, and a Shell Doctoral Fellowship (Carl Beckmann).
1. Introduction Traditional approaches to parallel processing have focused largely on loop-level parallelism. Another source of parallelism in programs is non-loop, or functional, parallelism [Girk91]. While the amount o...
Graphical visualization of compiler optimizations
- Journal of Programming Languages, 1995
Cited by 15 (4 self)
This paper describes xvpodb, a visualization tool developed to support the analysis of optimizations performed by the vpo optimizer. The tool is a graphical optimization viewer that can display the state of the program representation before and after sequences of changes, referred to as transformations, that result in semantically equivalent (and usually improved) code. The information and insight such visualization provides can simplify the debugging of problems with the optimizer. Unique features of xvpodb include reverse viewing (or undoing) of transformations and the ability to stop at breakpoints associated with the generated instructions. The viewer facilitates the retargeting of vpo to a new machine, supports experimentation with new optimizations, and has been used as a teaching aid in compiler classes.
A Visualization System for Parallelizing Programs
- In Proceedings of Supercomputing, 1992
Cited by 14 (2 self)
A software environment for visualization when parallelizing programs is described in this paper. The system supports multi-paradigm program visualization, automatic generation of optimizers from specifications, interactive and undo facilities for code transformations, and a multilevel browser. Textual and graphical forms of optimization specification languages are utilized in the specification of both traditional and parallelizing transformations. An undo facility is provided for the user to remove ineffective or inappropriate transformations. An extended form of the Program Dependence Graph is developed that not only enables the exploitation of parallelism in sequential programs by applying transformations but also facilitates mappings for code visualization. The conceptual framework that allows multi-paradigm program visualization is presented. A multi-level browser can be used to browse any block of statements in one program view and the corresponding code of another program view is...
Strings: A High-Performance Distributed Shared Memory for Symmetrical Multiprocessor Clusters
- In Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, 1998
Cited by 14 (7 self)
This paper describes Strings, a multi-threaded DSM developed by us. The distinguishing feature of Strings is that it incorporates Posix1.c threads multiplexed on kernel light-weight processes for better performance. The kernel can schedule multiple threads across multiple processors using these lightweight processes. Thus, Strings is designed to exploit data parallelism at the application level and task parallelism at the DSM system level. We show how using multiple kernel threads can improve the performance even in the presence of false sharing, using matrix multiplication as a case study. We also show the performance results with benchmark programs from the SPLASH-2 suite [17]. Though similar work has been demonstrated with SoftFLASH [18], our implementation is completely in user space and thus more portable. Some other research has studied the effect of clustering in SMPs using simulations [19]. We have shown results from runs on an actual network of SMPs.
Unallocated Memory Space in COMA Multiprocessors
- 1995
Cited by 7 (2 self)
Cache only memory architecture (COMA) for distributed shared memory multiprocessors attempts to provide high utilization of local memory by organizing the local memory as a large cache, called attraction memory (AM), without traditional main memory. To facilitate caching of replicated data, it is desirable to have some of the physical storage space in the AMs left unallocated, i.e. not utilized as a part of the physical address space. Without the unallocated space, excessive relocation and migration of memory blocks between the AMs can happen due to replacement, negating the very purpose of the AM. It is important in a COMA machine that the operating system maintains a certain amount of unallocated memory space to provide good performance. In this paper, we identify an important relation between the amount of unallocated space and the set associativity of the AM, and discuss the trade-off between additional unallocated memory space and higher set associativity. 1. Introduction Distr...

