Job Scheduling in Multiprogrammed Parallel Systems
, 1997
Cited by 145 (15 self)
Scheduling in the context of parallel systems is often thought of in terms of assigning tasks in a program to processors, so as to minimize the makespan. This formulation assumes that the processors are dedicated to the program in question. But when the parallel system is shared by a number of users, this is not necessarily the case. In the context of multiprogrammed parallel machines, scheduling refers to the execution of threads from competing programs. This is an operating system issue, concerned with resource allocation, not a program development issue. Scheduling schemes for multiprogrammed parallel systems can be classified as single-level or two-level. Single-level scheduling combines the allocation of processing power with the decision of which thread will use it. Two-level scheduling decouples the two issues: first, processors are allocated to the job, and then the job's threads are scheduled using this pool of processors. The processors of a parallel system can be shared i...
STiNG: A CC-NUMA Computer System for the Commercial Marketplace
, 1996
Cited by 142 (0 self)
"STiNG" is a Cache Coherent Non-Uniform Memory Access (CC-NUMA) multiprocessor designed and built by Sequent Computer Systems, Inc. It combines four-processor Symmetric Multiprocessor (SMP) nodes (called Quads), using a Scalable Coherent Interface (SCI) based coherent interconnect. The Quads are based on the Intel P6 processor and the external bus it defines. In addition to 4 P6 processors, each Quad may contain up to 4 GBytes of system memory, 2 Peripheral Component Interconnect (PCI) buses for I/O, and a Lynx board. The Lynx board provides the datapath to the SCI-based interconnect and ensures systemwide cache coherency. STiNG is one of the first commercial CC-NUMA systems to be built. This paper describes the motivation for building STiNG as well as its architecture and implementation. In addition, performance analysis is provided for On-Line Transaction Processing (OLTP) and Decision Support System (DSS) workloads. Finally, the status of the current implementation is reviewed.
Effects of communication latency, overhead, and bandwidth in a cluster architecture
- In Proceedings of the 24th Annual International Symposium on Computer Architecture
, 1997
Cited by 98 (5 self)
This work provides a systematic study of the impact of communication performance on parallel applications in a high-performance network of workstations. We develop an experimental system in which the communication latency, overhead, and bandwidth can be independently varied to observe the effects on a wide range of applications. Our results indicate that current efforts to improve cluster communication performance to that of tightly integrated parallel machines result in significantly improved application performance. We show that applications demonstrate strong sensitivity to overhead, slowing down by a factor of 60 on 32 processors when overhead is increased from 3 to 103 µs. Applications in this study are also sensitive to per-message bandwidth, but are surprisingly tolerant of increased latency and lower per-byte bandwidth. Finally, most applications demonstrate a highly linear dependence on both overhead and per-message bandwidth, indicating that further improvements in communication performance will continue to improve application performance.
Hierarchical Clustering: A Structure for Scalable Multiprocessor Operating System Design
- JOURNAL OF SUPERCOMPUTING
, 1993
Cited by 57 (18 self)
We introduce the concept of Hierarchical Clustering as a way to structure shared-memory multiprocessor operating systems for scalability. As the name implies, the concept is based on clustering and hierarchical system design. Hierarchical Clustering leads to a modular system, composed of easy-to-design and efficient building blocks. The resulting structure is scalable because it i) maximizes locality, which is key to good performance in NUMA systems, and ii) provides for concurrency that increases linearly with the number of processors. At the same time, there is tight coupling within a cluster, so the system performs well for local interactions, which are expected to constitute the common case. A clustered system can easily be adapted to different hardware configurations and architectures by changing the size of the clusters. We show how this structuring technique is applied to the design of a microkernel-based operating system called HURRICANE. This prototype system is the first complete and running implementation of its kind, and demonstrates the feasibility of a hierarchically clustered system. We present performance results based on the prototype, demonstrating the characteristics and behavior of a clustered system. In particular, we show how clustering trades off the efficiencies of tight coupling for the advantages of replication, increased locality, and decreased lock contention. We describe some of the lessons we learned from our implementation efforts and close with a discussion of our future work.
HFS: A performance-oriented flexible file system based on building-block compositions
- ACM Transactions on Computer Systems
, 1997
Cited by 49 (8 self)
The Hurricane File System (HFS) is designed for (potentially large-scale) shared-memory multiprocessors. Its architecture is based on the principle that, in order to maximize performance for applications with diverse requirements, a file system must support a wide variety of file structures, file system policies, and I/O interfaces. Files in HFS are implemented using simple building blocks composed in potentially complex ways. This approach yields great flexibility, allowing an application to customize the structure and policies of a file to exactly meet its requirements. As an extreme example, HFS allows a file’s structure to be optimized for concurrent random-access write-only operations by 10 threads, something no other file system can do. Similarly, the prefetching, locking, and file cache management policies can all be chosen to match an application’s access pattern. In contrast, most parallel file systems support a single file structure and a small set of policies. We have implemented HFS as part of the Hurricane operating system running on the Hector shared-memory multiprocessor. We demonstrate that the flexibility of HFS comes with little processing or I/O overhead. We also show that for a number of file access patterns, HFS is able to deliver to the applications the full I/O bandwidth of the disks on our system.
Implementing a Parallel C++ Runtime System for Scalable Parallel Systems
- In Proceedings of Supercomputing '93
, 1993
Cited by 48 (9 self)
pC++ is a language extension to C++ designed to allow programmers to compose "concurrent aggregate" collection classes that can be aligned and distributed over the memory hierarchy of a parallel machine in a manner modeled on the High Performance Fortran Forum (HPFF) directives for Fortran 90. pC++ allows the user to write portable and efficient code that will run on a wide range of scalable parallel computer systems. The first version of the compiler is a preprocessor that generates Single Program Multiple Data (SPMD) C++ code. Currently, it runs on the Thinking Machines CM-5, the Intel Paragon, the BBN TC2000, the Kendall Square Research KSR-1, and the Sequent Symmetry. In this paper we describe the implementation of the runtime system, which provides the concurrency and communication primitives between objects in a distributed collection. To illustrate the behavior of the runtime system we include a description and performance results on four benchmark programs.
Multicast snooping: a new coherence method using a multicast address network
- In Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA)
, 1999
Cited by 40 (7 self)
This paper proposes a new coherence method called “multicast snooping” that dynamically adapts between broadcast snooping and a directory protocol. Multicast snooping is unique because processors predict which caches should snoop each coherence transaction by specifying a multicast “mask.” Transactions are delivered with an ordered multicast network, such as an Isotach network, which eliminates the need for acknowledgment messages. Processors handle transactions as they would with a snooping protocol, while a simplified directory operates in parallel to check masks and gracefully handle incorrect ones (e.g., previous owner missing). Preliminary performance numbers with mostly SPLASH-2 benchmarks running on 32 processors show that we can limit multicasts to an average of 2-6 destinations (<< 32) and we can deliver 2-5 multicasts per network cycle (>> broadcast snooping’s 1 per cycle). While these results do not include timing, they do provide encouragement that multicast snooping can obtain data directly (like broadcast snooping) but apply to larger systems (like directories).
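The mask check that the simplified directory performs can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the paper's protocol: the function name `missing_from_mask`, the request kinds `"read"` and `"write"`, and the rule that a read must reach the current owner while a write must reach the owner and every sharer.

```python
def missing_from_mask(kind, mask, owner, sharers):
    """Return the nodes a predicted multicast mask failed to include.

    Hypothetical rule (an assumption, not taken from the paper):
    a "read" must reach the current owner; a "write" must reach the
    owner and every sharer. owner=None means memory holds the block.
    """
    required = set()
    if owner is not None:
        required.add(owner)
    if kind == "write":
        required |= set(sharers)
    # An empty result means the prediction was sufficient.
    return required - set(mask)
```

A non-empty result corresponds to the "incorrect mask" case in the abstract, for example a previous owner that the requester did not predict.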
Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors
- In Proceedings of the 25th International Symposium on Computer Architecture (ISCA)
, 1998
Cited by 33 (3 self)
Given the limitations of bus-based multiprocessors, CC-NUMA is the scalable architecture of choice for shared-memory machines. The most important characteristic of the CC-NUMA architecture is that the latency to access data on a remote node is considerably larger than the latency to access local memory. On such machines, good data locality can reduce memory stall time and is therefore a critical factor in application performance. In this paper we study the various options available to system designers to transparently decrease the fraction of data misses serviced remotely. This work is done in the context of the Stanford FLASH multiprocessor. FLASH is unique in that each node has a single pool of DRAM that can be used in a variety of ways by the programmable memory controller. We use the programmability of FLASH to explore different options for cache coherence and data locality in compute-server workloads. First, we consider two protocols for providing base cache coherence, one with centralized directory information (dynamic pointer allocation) and another with distributed directory information (SCI). While several commercial systems are based on SCI, we find that a centralized scheme has superior performance. Next, we consider different hardware and software techniques that use some or all of the local memory in a node to improve data locality. Finally, we propose a hybrid scheme that combines hardware and software techniques. These schemes work on the same base platform with both user and kernel references from the workloads. The paper thus offers a realistic and fair comparison of replication/migration techniques that has not previously been feasible.
Testing shared memories
- SIAM Journal on Computing
, 1997
Cited by 33 (1 self)
Sequential consistency is the most widely used correctness condition for multiprocessor memory systems. This paper studies the problem of testing shared-memory multiprocessors to determine if they are indeed providing a sequentially consistent memory. It presents the first formal study of this problem, which has applications to testing new memory system designs and realizations, providing run-time fault tolerance, and detecting bugs in parallel programs. A series of results are presented for testing an execution of a shared memory under various scenarios, comparing sequential consistency with linearizability, another well-known correctness condition. Linearizability imposes additional restrictions on the shared memory, beyond those of sequential consistency; these restrictions are shown to be useful in testing such memories.
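To make the testing problem concrete, here is a minimal brute-force sketch that decides whether a small recorded execution is sequentially consistent. It is not the paper's algorithm: it simply searches for one interleaving that respects each processor's program order and in which every read returns the most recent write. The operation encoding `(kind, address, value)` and the function name are assumptions for illustration, and the search is exponential, so it only suits tiny histories.

```python
def is_sequentially_consistent(histories, initial=0):
    """histories: one list per processor of ("r" | "w", address, value) tuples."""

    def search(positions, memory):
        # Every processor has issued all of its operations: witness found.
        if all(pos == len(h) for pos, h in zip(positions, histories)):
            return True
        for p, history in enumerate(histories):
            pos = positions[p]
            if pos == len(history):
                continue
            kind, addr, value = history[pos]
            advanced = positions[:p] + (pos + 1,) + positions[p + 1:]
            if kind == "w":
                # A write always succeeds and updates the single shared memory.
                if search(advanced, {**memory, addr: value}):
                    return True
            elif memory.get(addr, initial) == value:
                # A read may be scheduled only if it sees the latest write.
                if search(advanced, memory):
                    return True
        return False

    return search(tuple(0 for _ in histories), {})
```

For example, the classic two-processor history in which each processor writes its own flag and then reads the other's flag as 0 has no valid interleaving, so the function returns `False` for it.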
Efficient Low-Contention Parallel Algorithms
- In Proceedings of the 1994 ACM Symposium on Parallel Algorithms and Architectures
, 1994
Cited by 29 (11 self)
The queue-read, queue-write (qrqw) parallel random access machine (pram) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. The qrqw pram model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied crcw pram or erew pram models, and can be efficiently emulated with only logarithmic slowdown on hypercube-type non-combining networks. This paper describes fast, low-contention, work-optimal, randomized qrqw pram algorithms for the fundamental problems of load balancing, multiple compaction, generating a random permutation, parallel hashing, and distributive sorting. These logarithmic or sublogarithmic time algorithms considerably improve upon the best known erew pram algorithms for these problems, while avoiding the high-contention steps typical of crcw pram algorithms. An illustrative expe...
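The contention rule described in this abstract can be sketched as a simple cost function. This is a simplified reading, assuming each step costs the largest number of accesses queued at any single memory location (at least 1) and ignoring local computation; the function names are hypothetical.

```python
from collections import Counter


def qrqw_step_cost(accesses):
    """Cost of one step: the maximum number of accesses to any one location.

    accesses: iterable of memory locations, one entry per read or write
    issued in this step. Local computation is ignored (an assumption).
    """
    counts = Counter(accesses)
    return max(counts.values(), default=1)


def qrqw_program_cost(steps):
    """Total cost of a sequence of steps under the same simplified rule."""
    return sum(qrqw_step_cost(step) for step in steps)
```

Under this rule, four processors touching four distinct locations cost 1, while three of them hitting the same location cost 3, which is why low-contention algorithms are favored by the model.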

