Designing Memory Consistency Models for Shared-Memory Multiprocessors
, 1993
Cited by 51 (8 self)
The memory consistency model (or memory model) of a shared-memory multiprocessor system influences both the performance and the programmability of the system. The simplest and most intuitive model for programmers, sequential consistency, restricts the use of many performance-enhancing optimizations exploited by uniprocessors. For higher performance, several alternative models have been proposed. However, many of these are hardware-centric in nature and difficult to program. Further, the multitude of seemingly unrelated memory models inhibits portability. We use the 3P criteria of programmability, portability, and performance to assess memory models, and find current models lacking in one or more of these criteria. This thesis establishes a unifying framework for reasoning about memory models that leads to models that adequately satisfy the 3P criteria. The first contribution of this thesis is a programmer-centric methodology, called sequential consistency normal form (SCNF), for specifying memory models. This methodology is based on the observation that performance-enhancing optimizations can be allowed without violating sequential consistency if the system is given some information about the program. An SCNF model is a contract between the system and the programmer, where the system guarantees both high performance and sequential consistency only if the programmer provides certain information about the program. Insufficient information gives lower performance, but incorrect information
Hierarchical Tiling for Improved Superscalar Performance
- IN INTERNATIONAL PARALLEL PROCESSING SYMPOSIUM
, 1995
Cited by 36 (6 self)
It takes more than a good algorithm to achieve high performance: inner-loop performance and data locality are also important. Tiling is a well-known method for parallelization and for improving data locality. However, tiling has the potential of being even more beneficial. At the finest granularity, it can be used to guide register allocation and instruction scheduling; at the coarsest level, it can help manage magnetic storage media. It also can be useful in overlapping data movement with computation, for instance by prefetching data from archival storage, disks and main memory into cache and registers, or by choreographing data movement between processors. Hierarchical tiling is a framework for applying both known tiling methods and new techniques to an expanded set of uses. It eases the burden on several compiler phases that are traditionally treated separately, such as scalar replacement, register allocation, generation of message passing calls, and storage mapping. By explicitly ...
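As a rough illustration of the data-locality use of tiling mentioned in this abstract, the following minimal sketch applies a single level of tiling to matrix multiplication. It is not the hierarchical tiling framework itself; the function name `matmul_tiled` and the `tile` parameter are hypothetical names chosen for this example.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply matrices a (n x k) and b (k x m) using one level of loop tiling.

    Illustrative sketch only: `tile` is an assumed block size, and the code
    shows plain cache-style blocking, not the paper's multi-level scheme.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    # The three outer loops walk over tiles; the three inner loops stay
    # inside one tile, so each block of a, b and c is reused while it is hot.
    for ii in range(0, n, tile):
        for kk in range(0, k, tile):
            for jj in range(0, m, tile):
                for i in range(ii, min(ii + tile, n)):
                    for p in range(kk, min(kk + tile, k)):
                        a_ip = a[i][p]
                        for j in range(jj, min(jj + tile, m)):
                            c[i][j] += a_ip * b[p][j]
    return c
```

The result is the same as the untiled triple loop; only the iteration order changes, which is what lets a tile's working set fit in a faster level of the memory hierarchy.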
Hierarchical Tiling: A Methodology for High Performance
, 1996
Cited by 21 (6 self)
Good parallel algorithms are not enough; computer features such as the memory hierarchy and processor architecture need to be exploited to achieve high performance on parallel machines. Hierarchical tiling is a methodology for exploiting parallelism and locality at all levels of the memory/processor hierarchy: functional units, registers, caches, multiple processors, and disks. Hierarchical tiling concentrates on the interaction between multiple levels of tilings. One novel idea of hierarchical tiling is the naming of the values on the surface of a tile. Names determine where values are stored in the memory/processor hierarchy. Storage for the surface of a tile is materialized at that level, while interior elements of the tile only require storage as temporaries at a lower level of hierarchical memory. A second distinctive feature is that hierarchical tiling provides explicit control of all data movement, both within and between the levels of the memory/processor hierarchy. This is a...
Towards a Scalable Parallel Object Database - The Bulk Synchronous Parallel Approach
, 1996
Cited by 8 (2 self)
Parallel computers have been successfully deployed in many scientific and numerical application areas, although their use in non-numerical and database applications has been scarce. In this report, we first survey the architectural advancements beginning to make general-purpose parallel computing cost-effective, the requirements for non-numerical (or symbolic) applications, and the previous attempts to develop parallel databases. The central theme of the Bulk Synchronous Parallel model is to provide a high level abstraction of parallel computing hardware whilst providing a realisation of a parallel programming model that enables architecture independent programs to deliver scalable performance on diverse hardware platforms. Therefore, the primary objective of this report is to investigate the feasibility of developing a portable, scalable, parallel object database, based on the Bulk Synchronous Parallel model of computation. In particular, we devise a way of providing high-level abstra...
Mechanisms for Efficient Shared-Memory, Lock-Based Synchronization
- PhD thesis, University of Wisconsin, Madison, 1999
, 1999
Cited by 7 (1 self)
Efficient locking synchronization primitives are essential for achieving high performance in fine-grain, shared-memory parallel programs. One function of locking primitives is to enable exclusive access to shared data and critical sections of code. In this dissertation, I make the following six contributions. (1) I propose a framework, the synchronization period, in which to reason about the inefficiencies of locking primitives. (2) I identify four previously proposed locking mechanisms (local spinning, queue-based locking, collocation, and synchronous prefetch) and use them to classify existing locking primitives according to which of these mechanisms they incorporate. (3) With detailed simulations, I show the extent to which these four mechanisms can improve the performance of shared-memory programs. I evaluate the space of these mechanisms using sixteen synchronization constructs, which are formed from six base types of locks (test&set, test&test&set, MCS, LH, M, and QOLB). I show t...
Software Issues In High-Performance Computing And A Framework For The Development Of HPC Applications
- COMPUTING, U. VISHKIN, ED.: ACM
, 1994
Cited by 6 (3 self)
We identify the following key problems faced by HPC software: (1) the large gap between HPC design and implementation models in application development, (2) achieving high performance for a single application on different HPC platforms, and (3) accommodating constant changes in both problem specification and target architecture as computational methods and architectures evolve. To attack these problems, we suggest an application development methodology in which high-level architecture-independent specifications are elaborated, through an iterative refinement process which introduces architectural detail, into a form which can be translated to efficient low-level architecture-specific programming notations. A tree-structured development process permits multiple architectures to be targeted with implementation strategies appropriate to each architecture, and also provides a systematic means to accommodate changes in specification and target architecture. We describe the Pr...
The Proteus System for the Development of Parallel Applications
, 1994
Cited by 4 (2 self)
Target Language
In our methodology we have identified a small set of specifications that comprise the abstract target language (ATL) of the refinement system. These are specifications of types such as arrays, lists, tuples, integers, characters, etc., that commonly appear in programming languages. The refinement expresses a system as a definitional extension of the ATL specs. Thus, by associating a model (a concrete type in a specific programming language) with each ATL specification, the complete system specification is compiled.
5.2.3 Proteus to DPL Translation
The translation of Proteus to DPL consists of a series of major steps:
1. Expansion of iterator expressions into image and filter expressions.
2. Conversion to data-parallel form.
3. An interpretation of sequences into the nested sequence vocabulary of DPL.
4. Addition of storage management code.
5. Conversion into C.
[Figure labels: Source, Mediating, Target; CORE-SEQ, SEQ-AS-ARRAY, ARRAY, SEQ; Component 1, System, Component 2 ...]
A Communication Model for Small Messages with InfiniBand
- PARS Proceedings
, 2005
Cited by 3 (3 self)
Designing new and optimal algorithms for a specific architecture requires accurate modelling of this architecture. This is especially needed to choose one out of different solutions for the same problem or to prove a lower bound for a problem. Assuming that the model is highly accurate, a given algorithm can be seen as an optimal solution if it reaches the lower bound. Therefore the accuracy of a model is extremely important for algorithmic design. A detailed model can also help to understand the architectural details and their influence on the running time of different solutions, and it can be used to derive better algorithms for a given problem. This work introduces some architectural specialities of the InfiniBand network and shows that the most widely used models introduce inaccuracies for sending small messages with InfiniBand. Therefore a comparative model analysis is performed to find the most accurate model for InfiniBand. Based on this analysis and a description of the architectural specialities of InfiniBand, a new, more accurate but also much more complex model called LoP is deduced from LogP, which can be used to assess the running time of different algorithms. The newly developed model can be used to find lower bounds for algorithmic problems and to enhance several algorithms.
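For orientation, the baseline LogP model that this abstract builds on can be sketched as two cost formulas. The sketch below covers only standard LogP (latency L, per-message overhead o, gap g); the LoP model is not specified in the abstract and is not reproduced here. The function names are hypothetical.

```python
def logp_single_message(L, o):
    """Time for one small message under LogP.

    Sender overhead o, network latency L, receiver overhead o.
    """
    return 2 * o + L


def logp_message_train(n, L, o, g):
    """Time until the last of n back-to-back small messages is received.

    Consecutive sends are spaced by max(g, o); the final message then
    costs the usual o + L + o. Assumes n >= 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) * max(g, o) + 2 * o + L
```

With L = 5, o = 1 and g = 2 (arbitrary units), a single message costs 7 and a train of four messages costs 13. Predictions of this kind are what the abstract reports as inaccurate for small messages on InfiniBand.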
Partitioning Regular Applications for Cache-Coherent Multiprocessors
, 1994
Cited by 1 (0 self)
In all massively parallel systems (MPPs), whether message-passing or shared-address space, the memory is physically distributed for scalability and the latency of accessing remote data is orders of magnitude higher than the processor cycle time. Therefore, the programmer/compiler must not only identify parallelism but also specify the distribution of data among the processor memories in order to obtain reasonable efficiency. Shared-address MPPs provide an easier paradigm for programmers than message passing systems since the communication is automatically handled by the hardware and/or operating system. However, it is just as important to optimize the communication in shared-address systems if high performance is to be achieved. Since communication is implied by the data layout and data reference pattern of the application, the data layout scheme and data access pattern must be controlled by the compiler in order to optimize communication. Machine specific parameters, such as cache siz...
A Performance Model for Unified Parallel C
, 2007
www.cs.mtu.edu This research is a performance-centric investigation of Unified Parallel C (UPC), a parallel programming language that belongs to the Partitioned Global Address Space (PGAS) language family. The objective is to develop a performance modeling methodology that targets UPC but can be generalized to other PGAS languages. The performance modeling methodology relies on platform characterization and program characterization, achieved through shared memory benchmarking and static code analysis, respectively. Models built using this methodology can predict the performance of simple UPC application kernels with relative errors below 15%. Besides performance prediction, this work provides a framework based on shared memory benchmarking and code analysis for platform evaluation and compiler/runtime optimization study. A few platforms are evaluated in terms of their fitness for UPC computing. Some optimization techniques, such as remote reference caching, are studied using this framework. A UPC implementation, MuPC, is developed along with the performance study. MuPC consists of a UPC-to-C translator built upon a modified version of the EDG C/C++ front end and a runtime system built upon MPI and POSIX threads. MuPC performance features include a runtime software cache for remote accesses and low-latency access to shared memory with affinity to the issuing thread. In this research, MuPC serves as a platform that facilitates the development, testing, and validation of performance microbenchmarks and optimization techniques.

