Results 1 - 10 of 19
Stardust: an Environment for Parallel Programming on Networks of Heterogeneous Workstations
- Journal of Parallel and Distributed Computing
, 1996
Cited by 20 (1 self)
This paper describes Stardust, an environment for parallel programming on networks of heterogeneous machines. Stardust runs on distributed memory multicomputers and networks of workstations. Applications using Stardust can communicate both through message-passing and distributed shared memory. Stardust includes a mechanism for application reconfiguration. This mechanism is used for balancing the load of the machines hosting the application, as well as for tolerating machine restarts (anticipated or not). At reconfiguration time, application processes can migrate between heterogeneous machines, and the number of application processes can vary (increase or decrease) depending on the available resources. Stardust is currently implemented on a heterogeneous system including an Intel Paragon running Mach/OSF1 and a set of Pentiums running Chorus/classiX. The paper details the design and implementation of Stardust, as well as its performance. Contact author Isabelle Puaut IRISA, Campus Uni...
Hardware Fault Containment in Scalable Shared-Memory Multiprocessors
- Proc. of the 24th ISCA
, 1997
Cited by 20 (3 self)
Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size. The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive fault-tolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine. Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems up to 128 processors.
Overview of distributed shared memory
- Trinity College Dublin
, 1998
Cited by 8 (0 self)
"So much has already been written about everything that you can't find out anything about it." (James Thurber, Lanterns and Lances, 1961) Loosely-coupled distributed systems have evolved using message passing as the main paradigm for sharing information. Other paradigms used in loosely-coupled distributed systems, such as RPC, are usually implemented on top of an underlying message-passing system. On the other hand, in tightly-coupled architectures, such as multi-processor machines, the paradigm is usually based on shared memory with its attractively simple programming model. The shared-memory paradigm has recently been extended for use in more loosely-coupled architectures and is known as distributed shared memory (DSM [153, 178, 58]) in this context. This chapter discusses some of the issues involved in the design and implementation of such a DSM in loosely-coupled distributed systems and briefly discusses related work in other fields. In DSM systems, processes share data transparently across node boundaries: data faulting, location, and movement are handled by the DSM system. Among other things, this allows parallel programs designed to use the shared-memory abstraction to execute without modification on a
Smooth and efficient integration of high-availability in a parallel single level store system
- In Proceedings of the Seventh International Euro-Par Conference Manchester on Parallel Processing (Euro-Par ’01)
, 2001
"... apport de recherche ..."
A practical transparent data sharing service for the grid
- In Proceedings of the Fifth International Workshop on Distributed Shared Memory (DSM 2005)
, 2005
Cited by 3 (2 self)
We consider a transparent data sharing service for distributed applications in the Grid. Our service may relieve the user and the programmer of the burden of managing the distribution and migration of data by transparently locating, caching, and managing the consistency of the data. To fit in a large-scale and dynamic environment such as the Grid, our data sharing service tolerates any number of reconfigurations (benign failures, arrivals and departures of sites) in the system’s lifetime, and up to a fixed number of simultaneous reconfigurations of the system. This service relies on application backward error recovery and replication to ensure the liveness of the application. Reconfiguration has a high impact on data location mechanisms. Our service leverages an underlying structured overlay network to solve this issue. We have experimentally evaluated our service and present in this paper an analysis of the results.
dsl: An environment with automatic code distribution for industrial control systems
- Principles of Distributed Systems: 7th International Conference, volume 3114 of LNCS
, 2003
"... for industrial control systems ..."
Device Driver Programming in a Transactional DSM Operating System
- In Proceedings of the AsiaPacific Computer Systems Architecture Conference
, 2002
Cited by 1 (0 self)
The Plurix project implements an object-oriented operating system (OS) for PC clusters. Network communication is implemented via the distributed shared memory (DSM) paradigm. Memory consistency is maintained by restartable transactions and an optimistic synchronization scheme, both of which have been used in database technology in the past. Originally, DSM systems were built to support parallel algorithms, but using DSM as a foundation for a general-purpose OS offers interesting perspectives in designing and using distributed applications. The OS, including kernel and all drivers, is written in Plurix Java. Our Java compiler directly translates Java source code into Intel machine instructions. Some minor language extensions support device-level programming. During the development of the system we identified conceptual problems caused by the restartability requirement of transactions. Clearly, interrupts do not reoccur when a transaction is aborted. Without proper precautions, interrupts would get lost or devices could receive broken commands. In this paper we briefly review our DSM system and present the "smart buffer" concept to bridge the gap between restartable DSM transactions and non-restartable device operations and events. Finally, we validate our proposed solution by performance measurements and compare the kernel interface to traditional operating systems.
Bootstrapping and Startup of an object-oriented Operating System
- European Conference on Object-Oriented Programming - Workshop on Object-Orientation and Operating Systems, Malaga
, 2002
Cited by 1 (1 self)
The Plurix project implements an object-oriented Operating System (OS) for PC clusters.
A Cluster Operating System Based on Software COMA Memory Management
Cited by 1 (0 self)
Clusters of SMPs are attractive for executing shared memory parallel applications, but reconciling high performance and ease of programming remains an open issue. A possible approach is to provide an efficient Single System Image operating system giving the illusion of an SMP machine. In this paper, we present such a system, focusing on global management of the memory resource. We introduce the concept of a container at the lowest operating system level to build a COMA-like memory management subsystem. Higher-level operating system services such as the virtual memory system and the file cache can be easily implemented on top of containers and transparently benefit from the whole memory resource available in the cluster.
Linking and Loading in a Persistent DSM Operating System
, 2000
Cited by 1 (1 self)
Our native Java compiler directly generates runtime structures in a persistent Distributed Shared Memory (DSM). The compiler has been used to build a general purpose PC Operating System (OS) on top of a persistent DSM memory. The persistent DSM operating environment lends itself naturally to an integration of symbol tables, class descriptors and naming during Java program compilation and execution.

