Results 1 - 10 of 91
The TAU Parallel Performance System
- The International Journal of High Performance Computing Applications, 2006
"... The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems depends on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. Flexibility and portabilit ..."
Abstract
-
Cited by 242 (21 self)
- Add to MetaCart
(Show Context)
The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems depends on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. Flexibility and portability in empirical methods and processes are influenced primarily by the strategies available for instrumentation and measurement, and how effectively they are integrated and composed. This paper presents the TAU (Tuning and Analysis Utilities) parallel performance system and describes how it addresses diverse requirements for performance observation and analysis.
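As a rough illustration of the instrumentation-plus-measurement separation the abstract emphasizes (this is a generic Python sketch, not TAU's actual API, which spans source, compiler, and runtime instrumentation), the decorator below instruments functions while a separate profile store holds the measurements, so the two concerns can be composed independently; all names are hypothetical.

```python
# Minimal sketch: separating instrumentation (the decorator) from
# measurement (the profile store). Names and structure are illustrative only.
import time
from collections import defaultdict
from functools import wraps

profile = defaultdict(lambda: {"calls": 0, "inclusive_s": 0.0})

def instrument(fn):
    """Wrap a function so each call is timed and recorded."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            rec = profile[fn.__qualname__]
            rec["calls"] += 1
            rec["inclusive_s"] += time.perf_counter() - start
    return wrapper

@instrument
def solve(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    for _ in range(3):
        solve(100_000)
    for name, rec in profile.items():
        print(f"{name}: {rec['calls']} calls, {rec['inclusive_s']:.4f} s inclusive")
```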
MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets
1999
"... Clustering techniques are used in database mining for finding interesting patterns in high dimensional data. These are useful in various applications of knowledge discovery in databases. Some challenges in clustering for large data sets in terms of scalability, data distribution, understanding en ..."
Abstract
-
Cited by 84 (0 self)
- Add to MetaCart
Clustering techniques are used in database mining for finding interesting patterns in high dimensional data. These are useful in various applications of knowledge discovery in databases. Some challenges in clustering for large data sets, in terms of scalability, data distribution, understanding end-results, and sensitivity to input order, have received attention in the recent past. Recent approaches attempt to find clusters embedded in subspaces of high dimensional data. In this paper we propose the use of adaptive grids for efficient and scalable computation of clusters in subspaces for large data sets and a large number of dimensions. The bottom-up algorithm for subspace clustering computes the dense units in all dimensions and combines these to generate the dense units in higher dimensions. Computation is heavily dependent on the choice of the partitioning parameter chosen to partition each dimension into intervals (bins) to be tested for density. The number of bins determines...
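A minimal sketch of the adaptive-grid idea in one dimension: build a fine histogram, merge adjacent cells of similar density into adaptive bins, and keep bins that are denser than expected under a uniform assumption. The thresholds, merge rule, and function names are assumptions for illustration, not MAFIA's exact algorithm.

```python
# Sketch of adaptive 1-D binning: fine histogram -> merge similar neighbours
# into adaptive bins -> keep dense bins as candidate units.
import random

def adaptive_bins(values, fine_bins=50, merge_tol=0.25, density_factor=1.5):
    lo, hi = min(values), max(values)
    width = (hi - lo) / fine_bins or 1.0
    counts = [0] * fine_bins
    for v in values:
        idx = min(int((v - lo) / width), fine_bins - 1)
        counts[idx] += 1

    # Merge neighbouring fine cells whose densities differ by < merge_tol.
    bins = []  # each bin: [start_cell, end_cell, count]
    for i, c in enumerate(counts):
        if bins and abs(c - counts[i - 1]) <= merge_tol * max(counts[i - 1], 1):
            bins[-1][1] = i
            bins[-1][2] += c
        else:
            bins.append([i, i, c])

    # A bin is "dense" if its per-cell count exceeds the uniform expectation.
    expected = len(values) / fine_bins
    dense = []
    for start, end, count in bins:
        cells = end - start + 1
        if count / cells > density_factor * expected:
            dense.append((lo + start * width, lo + (end + 1) * width, count))
    return dense

if __name__ == "__main__":
    random.seed(0)
    data = [random.gauss(2.0, 0.1) for _ in range(500)] + \
           [random.uniform(0, 10) for _ in range(500)]
    for left, right, count in adaptive_bins(data):
        print(f"dense interval [{left:.2f}, {right:.2f}) with {count} points")
```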
Ibis: A Flexible and Efficient Java-based Grid Programming Environment
- Concurrency & Computation: Practice & Experience, 2005
"... In computational grids, performance-hungry applications need to simultaneously tap the computational power of multiple, dynamically available sites. The crux of designing grid programming environments stems exactly from the dynamic availability of compute cycles: grid programming environments (a) ne ..."
Abstract
-
Cited by 74 (19 self)
- Add to MetaCart
(Show Context)
In computational grids, performance-hungry applications need to simultaneously tap the computational power of multiple, dynamically available sites. The crux of designing grid programming environments stems exactly from the dynamic availability of compute cycles: grid programming environments (a) need to be portable to run on as many sites as possible, (b) need to be flexible to cope with different network protocols and dynamically changing groups of compute nodes, and (c) need to provide efficient (local) communication that enables high-performance computing in the first place. Existing programming environments are either portable (Java), flexible (Jini, Java RMI), or highly efficient (MPI). No system combines all three properties that are necessary for grid computing. In this paper, we present Ibis, a new programming environment that combines Java’s “run everywhere” portability both with flexible treatment of dynamically available networks and processor pools, and with highly efficient, object-based communication. Ibis can transfer Java objects very efficiently by combining streaming object serialization with a zero-copy protocol. Using RMI as a simple test case, we show that Ibis outperforms existing RMI implementations, achieving up to 9 times higher throughput with trees of objects.
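The combination of streaming serialization with zero-copy transfer can be sketched as follows (Ibis itself is Java-based and far more involved; the toy tree type, buffer size, and send stub here are assumptions): objects are packed incrementally into one preallocated buffer, and a memoryview of the filled region is handed to the transport without further copying.

```python
# Sketch: stream a tree of objects into one preallocated buffer, then hand
# a zero-copy view of the filled region to the transport layer.
import struct

class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

def serialize(node, buf, offset=0):
    """Depth-first, streaming write: value, child count, then children."""
    struct.pack_into("!ii", buf, offset, node.value, len(node.children))
    offset += 8
    for child in node.children:
        offset = serialize(child, buf, offset)
    return offset

def send(view):
    # Stand-in for a socket send; a real transport could take this
    # memoryview directly (e.g. socket.sendall) without copying the buffer.
    print(f"sending {len(view)} bytes")

if __name__ == "__main__":
    tree = Node(1, [Node(2), Node(3, [Node(4)])])
    buf = bytearray(1 << 16)          # preallocated send buffer
    used = serialize(tree, buf, 0)
    send(memoryview(buf)[:used])      # zero-copy view of the filled region
```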
Efficient Wire Formats for High Performance Computing
- In Proceedings of Supercomputing 2000, 2000
"... ... in non-traditional circumstances where it must interoperate with other applications. For example, online visualization is being used to monitor the progress of applications, and real-world sensors are used as inputs to simulations. Whenever these situations arise, there is a question of what com ..."
Abstract
-
Cited by 59 (26 self)
- Add to MetaCart
... in non-traditional circumstances where it must interoperate with other applications. For example, online visualization is being used to monitor the progress of applications, and real-world sensors are used as inputs to simulations. Whenever these situations arise, there is a question of what communications infrastructure should be used to link the different components. Traditional HPC-style communications systems such as MPI offer relatively high performance, but are poorly suited for developing these less tightly-coupled cooperating applications. Object-based systems and meta-data formats like XML offer substantial plug-and-play flexibility, but with substantially lower performance. We observe that the flexibility and baseline performance of all these systems are strongly determined by their "wire format", or how they represent data for transmission in a heterogeneous environment. We examine the performance implications of different wire formats and present an alternative with significant advantages in terms of both performance and flexibility.
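To make the trade-off concrete, the following hedged sketch encodes the same record in a self-describing XML form and in a fixed binary layout and compares their sizes; the record fields and layout are invented for illustration and are not the formats evaluated in the paper.

```python
# Sketch: the same record in a self-describing text format versus a fixed
# binary wire format. Field names and layout are illustrative only.
import struct
import xml.etree.ElementTree as ET

record = {"step": 1024, "time": 3.14159, "energy": -1.25e6}

# Text/XML encoding: self-describing and flexible, but verbose to produce and parse.
root = ET.Element("record")
for key, val in record.items():
    ET.SubElement(root, key).text = repr(val)
xml_bytes = ET.tostring(root)

# Binary encoding: fixed layout agreed on ahead of time (int, double, double),
# compact and cheap to decode, but both sides must know the layout.
binary_bytes = struct.pack("!idd", record["step"], record["time"], record["energy"])

print(f"XML encoding:    {len(xml_bytes)} bytes")
print(f"binary encoding: {len(binary_bytes)} bytes")

# Round-trip the binary form: decoding is a single unpack.
step, t, energy = struct.unpack("!idd", binary_bytes)
assert step == record["step"]
```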
Parallel Structures and Dynamic Load Balancing for Adaptive Finite Element Computation
- Applied Numerical Mathematics, 1996
"... this paper, we have focused on describing and comparing several load balancing schemes. Comparisons by timing are difficult, since times vary between runs having the same parameters. The high-speed switch of the IBM SP2 computer is a shared resource that affects run times. More subtle effects can re ..."
Abstract
-
Cited by 41 (12 self)
- Add to MetaCart
(Show Context)
In this paper, we have focused on describing and comparing several load balancing schemes. Comparisons by timing are difficult, since times vary between runs having the same parameters. The high-speed switch of the IBM SP2 computer is a shared resource that affects run times. More subtle effects can result from differences in the order in which messages used for migration are processed. Changes in the order in which those messages are received and integrated into the local MDB result in different traversal orders of the mesh entities. These differences cause small changes in load balancings and coarsenings. While such differences in meshes and partitionings do not affect the solution accuracy, they can cause sufficient changes in efficiency to make precise timings difficult. Qualitatively, PSIRB produced the best partitions (measured as a function of total analysis time). Octree-generated partitions were comparable but resulted in slightly longer solution times. In both cases, one or two iterations of partition boundary smoothing led to a quality improvement. ITB by itself resulted in poorer partition quality, but is useful when mesh changes are small between computational stages. Predictive enrichment provided superior performance to our current enrichment process with transient problems where there are frequent enrichment and balancing steps. Enhancements to the existing load balancing procedures and the implementation of new ones are under investigation. Improvements in the slice-by-slice technique used by ITB for migration are necessary. Experiments with geometrical methods that use the spatial location of elements relative to the centroids of sending and receiving processors showed promise at reducing the number of processor interconnections. Vidwans et al. [39] pr...
Sensitivity of Parallel Applications to Large Differences in Bandwidth and Latency in Two-Layer Interconnects
- In Fifth International Symposium on High-Performance Computer Architecture, 1999
"... This paper studies application performance on systems with strongly non-uniform remote memory access. In current generation NUMAs the speed difference between the slowest and fastest link in an interconnect---the "NUMA gap"---is typically less than an order of magnitude, and many convent ..."
Abstract
-
Cited by 40 (11 self)
- Add to MetaCart
This paper studies application performance on systems with strongly non-uniform remote memory access. In current generation NUMAs the speed difference between the slowest and fastest link in an interconnect---the "NUMA gap"---is typically less than an order of magnitude, and many conventional parallel programs achieve good performance. We study how different NUMA gaps influence application performance, up to and including typical wide-area latencies and bandwidths. We find that for gaps larger than those of current generation NUMAs, performance suffers considerably (for applications that were designed for a uniform access interconnect). For many applications, however, performance can be greatly improved with comparatively simple changes: traffic over slow links can be reduced by making communication patterns hierarchical---like the interconnect. We find that in four out of our six applications the size of the gap can be increased by an order of magnitude or more without severel...
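The "make communication patterns hierarchical" remedy can be illustrated with a toy reduction: instead of every node sending its value directly over the slow wide-area link, each cluster reduces locally over fast links and forwards a single partial result. The cluster sizes and message counts below come from this toy model, not from the paper's experiments.

```python
# Sketch: counting slow (wide-area) messages for a flat versus a
# hierarchical reduction. Cluster sizes and topology are assumptions.
def flat_reduce_slow_msgs(clusters, root_cluster=0):
    # Every node outside the root cluster sends its value across the slow link.
    return sum(size for i, size in enumerate(clusters) if i != root_cluster)

def hierarchical_reduce_slow_msgs(clusters, root_cluster=0):
    # Each non-root cluster reduces locally (fast links), then sends one
    # partial result across the slow link.
    return sum(1 for i in range(len(clusters)) if i != root_cluster)

if __name__ == "__main__":
    clusters = [16, 16, 16, 16]   # four clusters of 16 nodes each
    print("flat reduction:        ", flat_reduce_slow_msgs(clusters), "slow messages")
    print("hierarchical reduction:", hierarchical_reduce_slow_msgs(clusters), "slow messages")
```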
Cut-Through Delivery in Trapeze: An Exercise in Low-Latency Messaging
- In Proceedings of the Sixth IEEE International Symposium on High Performance Distributed Computing (HPDC-6), 1997
"... New network technology continues to improve both the latency and bandwidth of communication in computer clusters. The fastest high-speed networks approach or exceed the I/O bus bandwidths of "gigabitready " hosts. These advances introduce new considerations for the design of network interf ..."
Abstract
-
Cited by 33 (11 self)
- Add to MetaCart
New network technology continues to improve both the latency and bandwidth of communication in computer clusters. The fastest high-speed networks approach or exceed the I/O bus bandwidths of "gigabit-ready" hosts. These advances introduce new considerations for the design of network interfaces and messaging systems for low-latency communication. This paper investigates cut-through delivery, a technique for overlapping host I/O DMA transfers with network traversal. Cut-through delivery significantly reduces end-to-end latency of large messages, which are often critical for application performance. We have implemented cut-through delivery in Trapeze, a new messaging substrate for network memory and other distributed operating system services. Our current Trapeze prototype is capable of demand-fetching 8K virtual memory pages in 200 μs across a Myrinet cluster of DEC AlphaStations.
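A back-of-the-envelope latency model (with assumed bandwidths, stage count, and fragment size, not the paper's measured numbers) shows why overlapping DMA and network transfer helps: store-and-forward pays the full message time at every stage, while cut-through streams the message in fragments so later stages overlap with earlier ones.

```python
# Toy pipelining model for cut-through delivery. Bandwidths, stage count,
# and fragment size are illustrative assumptions, not measured values.
def store_and_forward_latency(msg_bytes, stages, bandwidth_bytes_per_s):
    # Each stage must receive the whole message before forwarding it.
    return stages * (msg_bytes / bandwidth_bytes_per_s)

def cut_through_latency(msg_bytes, stages, bandwidth_bytes_per_s, fragment_bytes):
    # The message streams through in fragments; only the pipeline fill
    # (stages - 1 fragments) is added to the base transfer time.
    return (msg_bytes + (stages - 1) * fragment_bytes) / bandwidth_bytes_per_s

if __name__ == "__main__":
    page = 8 * 1024            # an 8 KB page, as in the abstract
    stages = 3                 # e.g. source DMA, network link, sink DMA
    bw = 100e6                 # 100 MB/s per stage (assumed)
    frag = 1024                # 1 KB fragments (assumed)
    print(f"store-and-forward: {store_and_forward_latency(page, stages, bw) * 1e6:.1f} us")
    print(f"cut-through:       {cut_through_latency(page, stages, bw, frag) * 1e6:.1f} us")
```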
System-Level versus User-Defined Checkpointing
1998
"... Checkpointing and rollback recovery is a very effective technique to tolerate transient faults and preventive shutdowns. In the past, most of the checkpointing schemes published in the literature were supposed to be transparent to the application programmer and implemented at the operating-system le ..."
Abstract
-
Cited by 24 (3 self)
- Add to MetaCart
Checkpointing and rollback recovery is a very effective technique to tolerate transient faults and preventive shutdowns. In the past, most of the checkpointing schemes published in the literature were supposed to be transparent to the application programmer and implemented at the operating-system level. In recent years, there has been some work on higher-level forms of checkpointing. In this second approach, the user is responsible for the checkpoint placement and is required to specify the checkpoint contents. In this paper, we compare the two approaches: system-level and user-defined checkpointing. We discuss the pros and cons of both approaches and we present an experimental study that was conducted on a commercial parallel machine.
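A minimal sketch of the user-defined approach, where the programmer chooses both the checkpoint placement and its contents; the file name, interval, and state dictionary are illustrative assumptions.

```python
# Sketch of user-defined checkpointing: the programmer decides what state
# to save (here, just the loop index and accumulator) and where in the
# program to save it. File name and interval are illustrative.
import os
import pickle

CHECKPOINT_FILE = "state.ckpt"
CHECKPOINT_INTERVAL = 1000

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return {"i": 0, "total": 0.0}       # initial state

def save_checkpoint(state):
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)    # atomic swap avoids torn checkpoints

if __name__ == "__main__":
    state = load_checkpoint()
    for i in range(state["i"], 10_000):
        state["total"] += i * 0.5       # the "long-running computation"
        state["i"] = i + 1
        if state["i"] % CHECKPOINT_INTERVAL == 0:
            save_checkpoint(state)      # user-chosen placement and contents
    print("result:", state["total"])
```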
Runtime Mechanisms for Efficient Dynamic Multithreading
1996
"... High performance on distributed memory machines for programming models with dynamic thread creation and multithreading requires efficient thread management and communication. Traditional multithreading runtimes, consisting of few general-purpose, bundled mechanisms that assume minimal compiler and h ..."
Abstract
-
Cited by 22 (9 self)
- Add to MetaCart
High performance on distributed memory machines for programming models with dynamic thread creation and multithreading requires efficient thread management and communication. Traditional multithreading runtimes, consisting of a few general-purpose, bundled mechanisms that assume minimal compiler and hardware support, are suitable for computations involving coarse-grained threads but provide low efficiency in the presence of small granularity threads and irregular communication behavior. We describe two mechanisms of the Illinois Concert runtime system which address this shortcoming. The first, hybrid stack-heap execution, exploits close coupling with the compiler to dynamically form coarse-grained execution units; threads are lazily created as required by runtime situations. The second, pull messaging, exploits hardware support to implement a distributed message queue with receiver-initiated data transfer, delivering robust performance across a wide range of dynamic communication charact...
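The general idea of lazy, demand-driven thread creation (a simplification, not the Illinois Concert mechanisms themselves) can be sketched as below: a spawn call runs its work inline on the caller's stack unless there is demand for parallel work, in which case a heap-allocated task object is created and queued. Result synchronization for queued tasks is deliberately omitted from the sketch.

```python
# Sketch of lazy task creation: run spawned work inline unless parallel
# demand exists, in which case package it as a heap-allocated task.
from collections import deque

task_queue = deque()     # heap-allocated tasks awaiting idle workers
idle_workers = 0         # in a real runtime, maintained by the scheduler

def spawn(fn, *args):
    """Lazily create a thread: stay on the caller's stack unless there is
    demand for parallel work. Result synchronization for the queued case
    is omitted from this sketch."""
    if idle_workers == 0:
        return fn(*args)              # common case: plain stack-based call
    task_queue.append((fn, args))     # rare case: materialize a heap task
    return None

def fib(n):
    if n < 2:
        return n
    a = spawn(fib, n - 1)             # with no idle workers, just a call
    b = fib(n - 2)
    return a + b

if __name__ == "__main__":
    print(fib(20))                    # 6765, computed entirely inline here
```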
Open Metadata Formats: Efficient XML-Based Communication for High Performance Computing
- In Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, 2001
"... High-performance computing faces considerable change as the Internet and the Grid mature. Applications that once were tightly-coupled and monolithic are now decentralized, with collaborating components spread across diverse computational elements. Such distributed systems most commonly communicate t ..."
Abstract
-
Cited by 20 (7 self)
- Add to MetaCart
(Show Context)
High-performance computing faces considerable change as the Internet and the Grid mature. Applications that once were tightly-coupled and monolithic are now decentralized, with collaborating components spread across diverse computational elements. Such distributed systems most commonly communicate through the exchange of structured data. Definition and translation of metadata is incorporated in all systems that exchange structured data. We observe that the manipulation of this metadata can be decomposed into three separate steps: discovery, binding of program objects to the metadata, and marshaling of data to and from wire formats. We have designed a method of representing message formats in XML, using datatypes available in the XML Schema specification. We have implemented a tool, XMIT, that uses such metadata and exploits this decomposition in order to provide flexible run-time metadata definition facilities for an efficient binary communication mechanism. We also demonstrate that the use of XMIT makes possible such flexibility at little performance cost.
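A small sketch of the discovery/binding/marshaling decomposition described above; the XML vocabulary, field names, and type mapping are invented for illustration and are not XMIT's schema. The format description is parsed once, bound to a binary layout, and then used to marshal records cheaply.

```python
# Sketch: describe a message format in XML, bind it to a binary layout
# once, then marshal records to a compact wire format. The XML vocabulary
# and type names here are illustrative assumptions.
import struct
import xml.etree.ElementTree as ET

FORMAT_XML = """
<message name="sample">
  <field name="step"   type="int32"/>
  <field name="time"   type="float64"/>
  <field name="energy" type="float64"/>
</message>
"""

TYPE_CODES = {"int32": "i", "float64": "d"}   # assumed type mapping

def bind(format_xml):
    """Discovery + binding: parse the metadata once into (names, struct layout)."""
    root = ET.fromstring(format_xml)
    names = [f.get("name") for f in root.findall("field")]
    fmt = "!" + "".join(TYPE_CODES[f.get("type")] for f in root.findall("field"))
    return names, struct.Struct(fmt)

def marshal(record, names, layout):
    return layout.pack(*(record[name] for name in names))

def unmarshal(data, names, layout):
    return dict(zip(names, layout.unpack(data)))

if __name__ == "__main__":
    names, layout = bind(FORMAT_XML)
    wire = marshal({"step": 7, "time": 0.125, "energy": -3.5}, names, layout)
    print(len(wire), "bytes on the wire:", unmarshal(wire, names, layout))
```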