The Communication Requirements of Mutual Exclusion
- In Proceedings of the Seventh Annual Symposium on Parallel Algorithms and Architectures
, 1995
Cited by 32 (0 self)
This paper examines the amount of communication that is required for performing mutual exclusion. It is assumed that n processors communicate via accesses to a shared memory that is physically distributed among the processors. We consider the possibility of creating a scalable mutual exclusion protocol that requires only a constant amount of communication per access to a critical section. We present two main results. First, we show that there does not exist a scalable mutual exclusion protocol that uses only read and write operations. This result solves an open problem posed by Yang and Anderson. Second, we prove that the same result holds even if test-and-set, compare-and-swap, load-and-reserve and store-conditional operations are allowed in addition to read and write operations. Our results hold even if an amortized analysis of communication costs is used, an arbitrary amount of memory is available, and the processors have coherent caches. In contrast, a mutual exclusion protocol is ...
Local-Area MultiProcessor: the Scalable Coherent Interface
- DEFINING THE GLOBAL INFORMATION INFRASTRUCTURE: INFRASTRUCTURE, SYSTEMS, AND SERVICES
, 1994
Cited by 29 (0 self)
There is rapidly increasing demand for very high performance shared access to distributed data, for multiprocessors, networked workstation clusters, distributed databases, industrial data acquisition and control systems, etc. The objective is to satisfy this demand at the lowest long-term cost. This paper first considers the general properties that an appropriate system architecture should have. A new architectural model, Local-Area MultiProcessor, is introduced. These properties are then considered in more detail, and practical design decisions are made, illustrated by the evolution of the ISO/ANSI/IEEE standard Scalable Coherent Interface (SCI) as it addressed these issues. Finally, the current status of the various SCI follow-on and support projects is reported.
Packet Routing In Fixed-Connection Networks: A Survey
, 1998
Cited by 26 (3 self)
We survey routing problems on fixed-connection networks. We consider many aspects of the routing problem and provide known theoretical results for various communication models. We focus on (partial) permutation, k-relation routing, routing to random destinations, dynamic routing, isotonic routing, fault tolerant routing, and related sorting results. We also provide a list of unsolved problems and numerous references.
A Combinatorial Treatment of Balancing Networks
, 1999
Cited by 23 (11 self)
Balancing networks, originally introduced by Aspnes et al. (Proc. of the 23rd Annual ACM Symposium on Theory of Computing, pp. 348-358, May 1991), represent a new class of distributed, low-contention data structures suitable for solving many fundamental multi-processor coordination problems that can be expressed as balancing problems. In this work, we present a mathematical study of the combinatorial structure of balancing networks, and a variety of its applications. Our study identifies important combinatorial transfer parameters of balancing networks. In turn, necessary and sufficient combinatorial conditions are established, expressed in terms of transfer parameters, which precisely characterize many important and well studied classes of balancing networks such as counting networks and smoothing networks. We propose these combinatorial conditions to be "balancing analogs" of the well known Zero-One principle holding for sorting networks.
Efficient Barriers for Distributed Shared Memory Computers
- In Proceedings of 8th International Parallel Processing Symposium
, 1994
Cited by 17 (3 self)
Barrier algorithms are central to the performance of numerous algorithms on scalable, high-performance architectures. Numerous barrier algorithms have been suggested and studied for Non-Uniform Memory Access (NUMA) architectures, but less work has been done for Cache Only Memory Access (COMA) or attraction memory [2] architectures such as the KSR-1. In this paper, we present two new barrier algorithms that offer the best performance we have recorded on the KSR-1 distributed cache multiprocessor. We discuss the trade-offs and the performance of seven algorithms on two architectures. The new barrier algorithms adapt well to a hierarchical caching memory model and take advantage of parallel communication offered by most multiprocessor interconnection networks. Performance results are shown for a 256-processor KSR-1 and a 20-processor Sequent Symmetry.
1 Introduction
Barriers are a synchronization tool for parallel computers, including shared and distributed (message-passing) address-spac...
The System-on-a-Chip Lock Cache
, 2004
Cited by 10 (3 self)
CONTENTS: DEDICATION; ACKNOWLEDGMENTS; LIST OF TABLES; LIST OF FIGURES; SUMMARY; I INTRODUCTION (1.1 Problem Statement; 1.2 Thesis Contributions; 1.3 Thesis Organization and Roadmap); II BACKGROUND AND PREVIOUS WORK (2.1 Locking Schemes; 2.1.1 Hardware Instructions for Locking; 2.1.2 Traditional Spin-Lock)
Efficient parallel algorithms for closest point problems
, 1994
Cited by 10 (1 self)
This dissertation develops and studies fast algorithms for solving closest point problems. Algorithms for such problems have applications in many areas including statistical classification, crystallography, data compression, and finite element analysis. In addition to a comprehensive empirical study of known sequential methods, I introduce new parallel algorithms for these problems that are both efficient and practical. I present a simple and flexible programming model for designing and analyzing parallel algorithms. Also, I describe fast parallel algorithms for nearest-neighbor searching and constructing Voronoi diagrams. Finally, I demonstrate that my algorithms actually obtain good performance on a wide variety of machine architectures. The key algorithmic ideas that I examine are exploiting spatial locality and random sampling. Spatial decomposition allows many concurrent threads to work independently of one another in local areas of a shared data structure. Random sampling provides a simple way to adaptively decompose irregular problems, and to balance workload among many threads. Used together, these techniques result in effective algorithms for a wide range of geometric problems. The key
Impact of Load Imbalance on the Design of Software Barriers
- in Proceedings of the 1995 International Conference on Parallel Processing
, 1995
Cited by 7 (0 self)
Software barriers have been designed and evaluated for barrier synchronization in large-scale shared-memory multiprocessors, under the assumption that all processors reach the synchronization point simultaneously. When relaxing this assumption, we demonstrate that the optimum degree of combining trees is not four as previously thought but increases from four to as much as 128 in a 4K system as the load imbalance increases. The optimum degree calculated using our analytic model yields a performance that is within 7% of the optimum obtained by exhaustive simulation with a range of degrees. We also investigate a dynamic placement barrier where slow processors migrate toward the root of the software combining tree. We show that through dynamic placement the synchronization delay can be reduced by a factor close to the depth of the tree, when sufficient slack is available. By choosing a suitable tree degree and using dynamic placement, software barriers that are scalable to large numbers of pr...
A System-on-a-Chip Lock Cache with Task Preemption Support
- Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES’01)
, 2001
Cited by 5 (4 self)
Intertask/interprocess synchronization overheads may be significant in a multiprocessor-shared memory System-on-a-Chip implementation. These overheads are observed in terms of lock latency, lock delay and memory bandwidth consumption in the system. It has been shown that a hardware solution brings a much better performance improvement than the synchronization algorithms developed in software [3]. Our previous work presented a SoC Lock Cache (SoCLC) hardware mechanism which resolves the Critical Section (CS) interactions among multiple processors and improves the performance criteria in terms of lock latency, lock delay and bandwidth consumption in a shared memory multiprocessor SoC for short CSes [1]. This paper extends our previous work to support long CSes as well. This combined support involves modifications both in the RTOS kernel level facilities (such as support for preemptive versus non-preemptive synchronization, interrupt handling and RTOS initialization) and in the hardware mechanism. The worst-case simulation results of a database application model with a client-server pair of tasks on a four-processor system showed that our mechanism achieved a 57% improvement in lock latency, a 14% speedup in lock delay and a 35% overall speedup in total execution time.
The Concurrent Execution of Non-communicating Programs on SIMD Processors
- In The Fourth Symposium on the Frontiers of Massively Parallel Computation
, 1992
Cited by 3 (0 self)
This paper explores the use of SIMD (or SIMD-like) hardware to support the efficient interpretation of concurrent, non-communicating programs. This approach places compiled programs into the local memory space of each distinct processing element (PE). Within each PE, a local program counter is initialized and the instructions are interpreted in parallel across all of the PEs by control signals emanating from the central control unit. Initial experiments have been conducted with two distinct software architectures (MINTABs and MIPS R2000) on the MasPar MP-1 and two distinct applications (program mutation analysis and Monte Carlo simulation). While these experiments have shown only marginal performance improvement, it appears that with several minor hardware modifications, SIMD-like hardware can be constructed that will cost-effectively support both SIMD and MIMD processing.

