Results 1 - 10 of 67
On the Design of Chant: A Talking Threads Package
- Proc. Supercomputing 94, pp. 350-359, Washington, D.C.
, 1994
Cited by 71 (9 self)
Lightweight threads are becoming increasingly useful in supporting parallelism and asynchronous control structures in applications and language implementations. However, lightweight thread packages traditionally support only shared memory synchronization and communication primitives, limiting their use in distributed memory environments. We introduce the design of a runtime interface, called Chant, that supports lightweight threads with the capability of communication using both point-to-point and remote service request primitives, built from standard message passing libraries. This is accomplished by extending the POSIX pthreads interface with global thread identifiers, global thread operations, and message passing primitives. This paper introduces the Chant interface and describes the runtime issues in providing an efficient, portable implementation of such an interface. In particular, we present performance results of the initial portion of our runtime system: point-to-point message passing among threads. We examine the issue of thread scheduling in the presence of polling for messages, and measure the overhead incurred when using this interface as opposed to using the underlying communication layer directly. We show that our design can accommodate various polling methods, depending on the level of support present in the underlying thread system, and imposes little overhead in point-to-point message passing over the existing communication layer.
Static analyses for eliminating unnecessary synchronizations from Java programs
- In Static Analysis Symposium (SAS
, 1999
Cited by 62 (4 self)
This paper presents and evaluates a set of analyses designed to reduce synchronization overhead in Java programs. Monitor-based synchronization in Java often causes significant overhead, accounting for 5-10% of total execution time in our benchmark applications. To reduce this overhead, programmers often try to eliminate unnecessary lock operations by hand. Such manual optimizations are tedious, error-prone, and often result in poorly structured and less reusable programs. Our approach replaces manual optimizations with static analyses that automatically find and remove unnecessary synchronization from Java programs. These analyses optimize cases where a monitor is entered multiple times by a single thread, where one monitor is nested within another, and where a monitor is accessible by only one thread. A partial implementation of our analyses eliminates up to 70% of synchronization overhead and improves running time by up to 5% for several already hand-optimized benchmarks. Thus, our automated analyses have the potential to significantly improve the performance of Java applications while enabling programmers to design simpler and more reusable multithreaded code.
Ariadne: Architecture of a Portable Threads system supporting Mobile Processes
- Software-Practice and Experience
, 1996
Cited by 50 (15 self)
Threads exhibit a simply expressed and powerful form of concurrency, easily exploitable in applications that run on both uni- and multi-processors, shared- and distributed-memory systems. This paper presents the design and implementation of Ariadne: a layered, C-based software architecture for multi-threaded computing on a variety of platforms. Ariadne is a portable user-space threads system that runs on shared- and distributed-memory multiprocessors. It can be used for parallel and distributed applications. Thread-migration is supported at the application level in homogeneous environments (e.g., networks of SPARCs and Sequent Symmetrys, Intel hypercubes). Threads may migrate between processes to access remote data, preserving locality of reference for computations with a dynamic data space. Ariadne can be tuned to specific applications through a customization layer. Support is provided for scheduling via a built-in or application-specific scheduler, and interfacing with any communicat...
The Pantheon storage-system simulator
, 1995
Cited by 43 (5 self)
This paper presents the facilities and design of the Pantheon storage system simulator. This simulator was originally intended to do performance modelling of parallel disk arrays. Over time it has been extended and generalized so that it now supports modelling of a wide range of I/O systems for both uniprocessors and parallel computers. The intent of this document is to provide an overview of the simulator, its architecture and its capabilities. The intended audience is potential users of the simulator and people who will be assessing its results. Copyright 1995 Hewlett-Packard Company. All rights reserved.
Filaments: Efficient Support for Fine-Grain Parallelism
, 1993
Cited by 34 (5 self)
It has long been thought that coarse-grain parallelism is much more efficient than fine-grain parallelism due to the overhead of process (thread) creation, context switching, and synchronization. On the other hand, there are several advantages to fine-grain parallelism: architecture independence, ease of programming, ease of use as a target for code generation, and load-balancing potential. This paper describes a portable threads package, Filaments, that supports efficient execution of fine-grain parallel programs on shared-memory multiprocessors. Filaments supports three kinds of threads, namely run-to-completion, barrier (iterative), and fork/join, which appear to be sufficient for scientific computations. Filaments employs a unique combination of techniques to achieve efficiency: stateless threads, very small thread descriptors, optimized barrier synchronization, scheduling that enhances data locality, and automatic pruning of fork/join threads. The gains in performance are such that ...
Modular Specification Of Interaction Policies In Distributed Computing
, 1996
Cited by 32 (0 self)
Software executing on distributed systems consists of many asynchronous, autonomous components which interact in order to coordinate local activity. The need for such coordination, as well as requirements such as heterogeneity, scalability, security and availability, considerably increases the complexity of code in distributed applications. Moreover, changing requirements, as well as changes in hardware platforms, lead to software that is constantly evolving, which complicates reuse. Supporting the development and evolution of distributed applications requires techniques that allow coordination code to be specified, customized, and maintained independently of application components; these goals cannot be realized solely through object-oriented techniques. This thesis demonstrates that meta-level specification of interaction policies enables modular description of component interaction policies, as well as customization of policy implementations. We present the high-level language Dil for spec...
Monitors and Exceptions: How to implement Java efficiently
- In ACM 1998 Workshop on Java for High-Performance Network Computing
, 1998
Cited by 30 (4 self)
Efficient implementation of monitors and exceptions is crucial for the performance of Java. One implementation of threads showed a factor of 30 difference in run time on some benchmark programs. This article describes an efficient implementation of monitors for Java as used in the CACAO just-in-time compiler. With this implementation the thread overhead is less than 40% for typical application programs and can be completely eliminated for some applications. This article also gives the implementation details of the new exception handling scheme in CACAO. The new approach reduces the size of the generated native code by a half and allows null pointers to be checked by hardware. By using these techniques, the CACAO system has become the fastest JavaVM implementation for the Alpha processor.
SMART: a Simulator of Massive ARchitectures and Topologies
- In International Conference on Parallel and Distributed Systems Euro-PDS'97
, 1997
Cited by 22 (20 self)
Many important results in the area of computer architecture have been achieved using simulators. In this paper we present SMART, a simulator of parallel architectures. SMART provides a flexible and efficient simulation environment that includes the most common interconnection networks and routing algorithms and gives the user basic mechanisms to define the internal structure of the processing nodes. To show the characteristics of SMART, we analyze the relations between the degree of overlapping of the transpose FFT algorithm and the presence of a communication processor on a fat tree and on a bi-dimensional cube that have the same normalized communication bandwidth.
Data movement and control substrate for parallel scientific computing
- of Lecture Notes in Computer Science
, 1997
Cited by 21 (10 self)
In this paper, we describe the design and implementation of a data movement and control substrate (DMCS) for network-based, homogeneous communication within a single multiprocessor. DMCS is an implementation of an API for communication and computation that has been proposed by the PORTS consortium. One of the goals of this consortium is to define an API that can support heterogeneous computing without undue performance penalties for homogeneous computing. Preliminary results in our implementation suggest that this is quite feasible. The DMCS implementation seeks to minimize the assumptions made about the homogeneous nature of its target architecture. Finally, we present some extensions to the API for PORTS that will improve the performance of sparse, adaptive and irregular types of numeric computations.
The Performance Implications of Locality Information Usage in Shared-Memory . . .
, 1996
Cited by 21 (8 self)
This paper examines the performance implications of locality information usage in thread scheduling algorithms for scalable shared-memory multiprocessors. A prototype implementation shows that a locality-conscious scheduler outperforms approaches ignoring locality information.
1 Introduction
Cache-coherent multiprocessors with non-uniform memory access (NUMA architectures) have become quite attractive as compute servers for parallel applications in the field of scientific computing. They combine scalability and the shared-memory programming model, relieving the application designer of data distribution and coherency maintenance. But locality of reference, load balancing and scheduling are still of crucial importance. One goal of software development is a high degree of locality of reference from the system up to the application level. Even if application designers develop code with high lo

