Sensitivity of Parallel Applications to Large Differences in Bandwidth and Latency in Two-Layer Interconnects
- In Fifth International Symposium on High-Performance Computer Architecture, 1999
Cited by 31 (11 self)
This paper studies application performance on systems with strongly non-uniform remote memory access. In current generation NUMAs the speed difference between the slowest and fastest link in an interconnect---the "NUMA gap"---is typically less than an order of magnitude, and many conventional parallel programs achieve good performance. We study how different NUMA gaps influence application performance, up to and including typical wide-area latencies and bandwidths. We find that for gaps larger than those of current generation NUMAs, performance suffers considerably (for applications that were designed for a uniform access interconnect). For many applications, however, performance can be greatly improved with comparatively simple changes: traffic over slow links can be reduced by making communication patterns hierarchical---like the interconnect. We find that in four out of our six applications the size of the gap can be increased by an order of magnitude or more without severel...
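As a rough illustration of why a hierarchical communication pattern reduces traffic over slow links, the sketch below counts slow-link messages for a single broadcast. The cluster layout, the function name, and the one-coordinator-per-cluster scheme are assumptions made for illustration; they are not taken from the paper.

```python
def slow_link_messages(num_clusters, nodes_per_cluster, hierarchical):
    """Count messages that cross slow inter-cluster links for one broadcast.

    Assumed model (hypothetical): one sender in one cluster, and every
    other cluster is reachable only over a slow link.
    """
    remote_clusters = num_clusters - 1
    if hierarchical:
        # One message per remote cluster goes to a local coordinator,
        # which forwards it over fast intra-cluster links.
        return remote_clusters
    # Flat pattern: the sender contacts every remote node directly.
    return remote_clusters * nodes_per_cluster


if __name__ == "__main__":
    print(slow_link_messages(4, 8, hierarchical=False))
    print(slow_link_messages(4, 8, hierarchical=True))
```

In this model the flat broadcast sends one slow-link message per remote node, while the hierarchical variant sends one per remote cluster.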
Solving the Game of Awari Using Parallel Retrograde Analysis
- IEEE Computer, 2003
Cited by 22 (2 self)
- Add to MetaCart
We have solved the game of awari, an ancient African board game that is now played worldwide. The game is a draw when both players play optimally. To solve awari, we computed several databases that can be used jointly to select the best move from any position that can occur in a game. The largest database contains 204 billion entries (178 gigabytes), and is much larger than the largest (endgame) database for any game computed so far. In total, we determined the results for 889 billion positions. We solved the game on a large computer cluster, using a new parallel search algorithm that optimally uses the available resources (processors, memories, disks, and network).
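To give a feel for retrograde analysis in general, here is a minimal sequential sketch on a toy subtraction game (remove 1 or 2 stones; a player who cannot move loses). It is not the authors' parallel algorithm and does not model awari, whose databases are far larger; the game, the function name, and the WIN/LOSS labels are assumptions chosen for illustration.

```python
from collections import deque


def retrograde_solve(n_max, moves=(1, 2)):
    """Label every position 0..n_max as WIN or LOSS for the player to move."""
    value = [None] * (n_max + 1)
    # remaining[p]: successors of p not yet shown to be wins for the opponent.
    remaining = [sum(1 for m in moves if p - m >= 0) for p in range(n_max + 1)]

    # Start from terminal positions (no legal move) and work backwards.
    queue = deque()
    for p in range(n_max + 1):
        if remaining[p] == 0:
            value[p] = "LOSS"
            queue.append(p)

    while queue:
        p = queue.popleft()
        for m in moves:
            pred = p + m  # a position from which p can be reached
            if pred > n_max or value[pred] is not None:
                continue
            if value[p] == "LOSS":
                # The mover at pred can hand the opponent a lost position.
                value[pred] = "WIN"
                queue.append(pred)
            else:
                remaining[pred] -= 1
                if remaining[pred] == 0:
                    # Every move from pred leads to a win for the opponent.
                    value[pred] = "LOSS"
                    queue.append(pred)
    return value


if __name__ == "__main__":
    print(retrograde_solve(6))
```

In a game with cycles, positions that are never labeled by this backward propagation would be draws; the toy game here has none.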
Integrating Polling, Interrupts, and Thread Management
- In Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation, 1996
Cited by 18 (2 self)
- Add to MetaCart
Many user-level communication systems receive network messages by polling the network adapter from user space. While polling avoids the overhead of interrupt-based mechanisms, it is not suited for all parallel applications. This paper describes a general-purpose, multithreaded, communication system that uses both polling and interrupts to receive messages. Users need not insert polls into their code; through a careful integration of the user-level communication software with a user-level thread scheduler, the system can automatically switch between polling and interrupts. We have evaluated the performance of this integrated system on Myrinet, using a synthetic benchmark and a number of applications that have very different communication requirements. We show that the integrated system achieves robust performance: in most cases, it performs as well as or better than systems that rely exclusively on interrupts or polling. 1 Introduction An emerging trend in the parallel programming commu...
A Performance Analysis of Transposition-Table-Driven Scheduling in Distributed Search
- IEEE Transactions on Parallel and Distributed Systems, 2002
Cited by 10 (6 self)
- Add to MetaCart
This paper discusses a new work-scheduling algorithm for parallel search of single-agent state spaces, called Transposition-Table-Driven Work Scheduling, that places the transposition table at the heart of the parallel work scheduling. The scheme results in less synchronization overhead, less processor idle time, and less redundant search effort. Measurements on a 128-processor parallel machine show that the scheme achieves close-to-linear speedups; for large problems the speedups are even superlinear due to better memory usage. On the same machine, the algorithm is 1.6 to 12.9 times faster than traditional work-stealing-based schemes.
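The idea of putting the transposition table at the center of scheduling can be sketched as follows: each state has a "home" worker that owns its table entry, and work on a state is sent to that worker rather than being stolen. This is a single-process simulation under simplifying assumptions (plain graph exploration instead of the paper's single-agent search, round-robin workers instead of real message passing); `home_worker`, `ttd_search`, and the toy successor function are hypothetical names, not the authors' code.

```python
from collections import deque


def home_worker(state, num_workers):
    # Assumed partitioning: the worker that owns this state's table entry.
    return hash(state) % num_workers


def ttd_search(start, successors, num_workers):
    """Explore a state space by sending each state to its home worker."""
    queues = [deque() for _ in range(num_workers)]  # per-worker work queues
    tables = [set() for _ in range(num_workers)]    # per-worker table partitions
    remote_sends = 0

    queues[home_worker(start, num_workers)].append(start)
    while any(queues):
        for w in range(num_workers):
            if not queues[w]:
                continue
            state = queues[w].popleft()
            if state in tables[w]:
                continue  # duplicate detected by a purely local lookup
            tables[w].add(state)
            for child in successors(state):
                dest = home_worker(child, num_workers)
                if dest != w:
                    remote_sends += 1
                queues[dest].append(child)  # work migrates to the data
    return tables, remote_sends


def toy_successors(n):
    # Hypothetical toy state space: integers up to 20.
    return [c for c in (n + 1, 2 * n) if c <= 20]


if __name__ == "__main__":
    tables, sends = ttd_search(1, toy_successors, num_workers=4)
    print([sorted(t) for t in tables], sends)
```

Because every table lookup is local to the owning worker, no worker ever waits on a remote table query in this sketch; the cost is that work items travel to wherever their entry lives.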
Transposition Table Driven Work Scheduling in Distributed Search
- In 16th National Conference on Artificial Intelligence (AAAI'99), 1999
Cited by 8 (0 self)
- Add to MetaCart
This paper introduces a new scheduling algorithm for parallel single-agent search, transposition table driven work scheduling, that places the transposition table at the heart of the parallel work scheduling. The scheme results in less synchronization overhead, less processor idle time, and less redundant search effort. Measurements on a 128-processor parallel machine show that the scheme achieves nearly optimal performance and scales well. The algorithm performs 2.0 to 13.7 times better than traditional work-stealing-based schemes.
Challenging Applications on Fast Networks
- Fourth International Symposium on High-Performance Computer Architecture (HPCA-4), 1998
Cited by 7 (0 self)
- Add to MetaCart
Parallel computing on clusters of workstations is attractive because of the low costs in comparison to MPPs, but the speed of the local area network limits the class of applications that can be run efficiently. Fortunately, faster network technology is becoming available for the next generation of workstation clusters. This paper studies the effect of running challenging applications that communicate heavily on three types of modern interconnects: 100 Mbit/s Fast Ethernet, 155 Mbit/s ATM, and 1.28 Gbit/s Myrinet. Experimental results show that even challenging communication-intensive applications can achieve acceptable performance on workstation clusters, but only if the communication software has been designed and tuned for high performance. 1. Introduction Parallel computing on clusters of workstations is attractive because of the low costs in comparison to commercial Massively Parallel Processors (MPPs). A disadvantage of workstation clusters is that MPPs contain high-performance pro...
Friendly and Efficient Message Handling
- 1996
Cited by 6 (4 self)
Since communication software spends a significant amount of time on handling incoming messages, it is desirable that message handlers avoid expensive context-switches on frequently executed paths. High-performance Active Message systems demand that handlers run to completion without blocking. Unfortunately, disallowing all blocking in handlers makes it hard to integrate them into large, preemption-based systems, because each potentially blocking action, including library calls, must be rewritten. We have implemented a portable, hybrid upcall mechanism that is easier to use than Active Messages yet avoids unnecessary thread switching. The key idea is that message handlers are only allowed to block on locks protecting shared data. Inside message handlers, blocking on synchronous communication and condition variables is not allowed. This restriction allows most messages to be processed without unnecessary thread switching on the critical path. When a message handler has to suspend its wor...
Awari Is Solved
- Journal of the ICGA, 2002
Cited by 2 (0 self)
- Add to MetaCart
to the random accesses of database entries during construction. The choice or design of the algorithm to create awari (endgame) databases is mainly determined by the amount of main memory available (Lincke, 2002), trading memory for additional computational effort and storing intermediate results on disk. Our system contains 144 Pentium III processors at 1.0 GHz, 72 GB of distributed main memory, a total disk space of 1.4 TB, and a Myrinet interconnect: a fast, switched network. One of the challenges was to handle the relatively "small" amount of memory. The parallel retrograde search algorithm described by Bal and Allis (1995) is efficient, but would have required more than 350 GB of memory, much more than we had. Sequential memory-limited search algorithms for awari endgame databases exist (cf. Lincke and Marzetta, 2000), but solving awari entirely on a single machine would take decades, if not centuries, since these algorithms still require much more memory than a single computer pro
Performance study of parallel programs on a clustered Wide-Area Network
- 1997
Cited by 1 (0 self)
Contents
1 Introduction
2 The environment of the experiment
2.1 The Orca language and implementation
2.2 The Amoeba processor pool
2.3 The simulation of a clustered wide-area network
2.4 Analyzing Orca programs
2.5 Compiling and running Orca programs
3 Related work
4 All-pairs Shortest Paths
4.1 Performance of ASP
4.2 Optimizations
4.3 Summary
5 The traveling salesman problem
5.1 Performance of TSP
5.2 Optimizations
5.3 Summary

