Scalable Distributed Visualization Using Off-the-Shelf Components
, 1999
Cited by 27 (5 self)
This paper describes a visualization architecture for scalable computer systems. The architecture is currently being prototyped for use in Beowulf-class clustered systems. A set of OpenGL frame buffers is driven in parallel by a set of CPUs. The visualization architecture merges the contents of these frame buffers by user-programmable associative and commutative combining operations. The system hardware is built from off-the-shelf components including OpenGL accelerators, Field Programmable Gate Arrays (FPGAs), and gigabit network interfaces and switches. A second-generation prototype supports 60 Hz operation at 1024 × 1024 pixel resolution with interactive latency up to 1000 nodes.
CR Categories: B.7.1 [Integrated circuits]: Types and design styles---Gate arrays; C.2.5 [Computer-communication networks]: Local and wide-area networks---High-speed; D.1.3 [Programming techniques]: Concurrent programming---Parallel programming; I.3.1 [Computer graphics]: Hardware architecture---Parallel processing; I.3.2 [Computer graphics]: Graphics systems---Distributed/network graphics
Keywords: FPGA, OpenGL, visualization, cluster, Beowulf, gigabit, fat-tree
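The merge step in the abstract above relies on the combining operation being associative and commutative, so that partial results can be combined in any order or grouping. The following is a minimal Python sketch of that idea, assuming a depth-test ("nearest pixel wins") operator as the combining operation. The pixel representation and all function names are hypothetical illustrations, not the paper's hardware design.

```python
from functools import reduce

# Assumption: a pixel is a (depth, color) pair and a smaller depth is closer
# to the viewer. FAR marks an empty pixel and is the identity of the operator.
FAR = (float("inf"), 0)

def nearest(a, b):
    """Keep the closer pixel. min over a total order is associative and commutative."""
    return min(a, b)

def merge_buffers(x, y):
    """Combine two frame buffers of equal length pixel by pixel."""
    return [nearest(p, q) for p, q in zip(x, y)]

def composite(buffers):
    """Left-to-right reduction over all frame buffers."""
    return reduce(merge_buffers, buffers)

def composite_tree(buffers):
    """Pairwise (tree-shaped) reduction; gives the same result as composite()
    because the operator is associative and commutative."""
    level = list(buffers)
    while len(level) > 1:
        merged = [merge_buffers(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            merged.append(level[-1])  # odd buffer out moves up unchanged
        level = merged
    return level[0]
```

Because the grouping does not affect the result, a tree-shaped reduction can stand in for a sequential one, which is the property that lets the merge be spread across many nodes.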
Sepia: Scalable 3D compositing using PCI Pamette
- In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines
, 1999
Cited by 22 (3 self)
We have implemented an image combining architecture that allows distributed rendering of a partitioned data set at interactive rates. The architecture achieves real-time frame rates and low latency through pipelining and the use of a high bandwidth network technology to transfer the image data. It is flexible because it uses programmable FPGA devices to implement the combining logic. The implementation cost is kept low by using only commodity components for the network and graphics, and FPGA logic. The result is a cost-effective interactive visualization system that can be used with a variety of applications running on distributed computing systems such as clusters of workstations and personal computers. We first motivate the development of a distributed rendering system and we introduce some of the concepts related to the 3D-visualization domain. We then describe our implementation of this system using the PCI Pamette FPGA-based board. We emphasize the advantages of using a programmable board for the prototype development and also for a potential commercial version.
A Simple MPI Process Swapping Architecture for Iterative Applications
- The International Journal of High Performance Computing Applications
, 2004
Cited by 20 (5 self)
Parallel computing is now popular and mainstream, but performance and ease-of-use remain elusive to many end users. There exists a need for performance improvements that can be easily retrofitted to existing parallel applications. In this paper we present MPI process swapping, a simple performance-enhancing add-on to the MPI programming paradigm. MPI process swapping improves performance by dynamically choosing the best available resources throughout application execution, using MPI process over-allocation and real-time performance measurement. Swapping provides fully automated performance monitoring and process management, and a rich set of primitives to control execution behavior manually or through an external tool. Swapping, as defined in this implementation, can be added to iterative MPI applications and requires as few as three lines of source code change. We verify our design for a particle dynamics application on desktop resources within a production commercial environment.
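The idea of over-allocating processes and moving work to the best-performing resources can be illustrated with a small simulation. The sketch below is pure Python and makes no MPI calls. The function `swap_step`, the `threshold` parameter, and the host names are hypothetical, and the decision rule is one plausible policy rather than the one used by the authors.

```python
def swap_step(active, spares, rate, threshold=1.2):
    """Swap the slowest active host with the fastest spare if it pays off.

    active, spares: lists of host names (spares are the over-allocated hosts).
    rate: dict mapping host name to its most recently measured speed.
    threshold: assumed minimum speed ratio that justifies a swap.
    Returns the (possibly updated) active and spare lists.
    """
    if not active or not spares:
        return active, spares
    slow = min(active, key=rate.get)   # worst performer currently doing work
    fast = max(spares, key=rate.get)   # best performer currently idle
    if rate[fast] > threshold * rate[slow]:
        active = [fast if h == slow else h for h in active]
        spares = [slow if h == fast else h for h in spares]
    return active, spares
```

In an iterative application such a check would run between iterations, for example `active, spares = swap_step(active, spares, measured_rates)`, so that the working set follows the measured performance of the hosts.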
Performance Effects of Scheduling Strategies for Master/Slave Distributed Applications
, 1998
Cited by 12 (1 self)
The achievement of parallel application performance on non-dedicated workstation clusters requires careful attention to the scheduling of tasks and communication on the underlying platform. In the literature, application scheduling policies are usually chosen by matching the resource requirements of an application with the performance characteristics of the target platform. However, when clusters of workstations are shared with other users, platform performance is non-uniform and varies over time. As a result, the performance of distinct scheduling policies may also vary depending on dynamic system state and particular characteristics of the job being run. Our experimental work focuses on a master/slave parallel ray-tracing application executing on a set of workstation clusters at UCSD and the San Diego Supercomputer Center. The experiments show that two different scheduling strategies, one static and one dynamic, exhibit very different performance sensitivities to variabilities in resou...
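The contrast between a static and a dynamic master/slave strategy can be sketched with a toy simulation. The code below assumes a fixed per-worker speed and ignores communication cost. Both functions and the equal-block static split are illustrative assumptions, not the specific strategies evaluated in the paper.

```python
import heapq

def static_makespan(task_costs, speeds):
    """Static strategy: split the task list into equal contiguous blocks up front."""
    n, w = len(task_costs), len(speeds)
    finish_times = []
    for i, speed in enumerate(speeds):
        block = task_costs[i * n // w:(i + 1) * n // w]
        finish_times.append(sum(block) / speed)
    return max(finish_times)

def dynamic_makespan(task_costs, speeds):
    """Dynamic strategy (self-scheduling): the master hands the next task
    to whichever worker becomes idle first."""
    idle = [(0.0, i) for i in range(len(speeds))]  # (time worker is free, worker id)
    heapq.heapify(idle)
    end = 0.0
    for cost in task_costs:
        free_at, i = heapq.heappop(idle)
        free_at += cost / speeds[i]
        end = max(end, free_at)
        heapq.heappush(idle, (free_at, i))
    return end
```

With identical workers the two strategies finish at the same time in this model. When one worker is slowed down by other users, the static split waits for that worker's whole block, while self-scheduling gives it fewer tasks.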
On Dynamic Load Balancing on Graphics Processors
Cited by 5 (2 self)
To get maximum performance on many-core graphics processors it is important to have an even balance of the workload so that all processing units contribute equally to the task at hand. This can be hard to achieve when the cost of a task is not known beforehand and when new sub-tasks are created dynamically during execution. With the recent advent of scatter operations and atomic hardware primitives it is now possible to bring some of the more elaborate dynamic load balancing schemes from the conventional SMP systems domain to the graphics processor domain. We have compared four different dynamic load balancing methods to see which one is most suited to the highly parallel world of graphics processors. Three of these methods were lock-free and one was lock-based. We evaluated them on the task of creating an octree partitioning of a set of particles. The experiments showed that synchronization can be very expensive and that new methods that take more advantage of the graphics processors' features and capabilities might be required. They also showed that lock-free methods achieve better performance than blocking ones and that they can be made to scale with increased numbers of processing units.
Dynamic Load Balancing for Parallel Interval-Newton Using Message Passing
, 2002
Cited by 4 (3 self)
Branch-and-prune (BP) and branch-and-bound (BB) techniques are commonly used for intelligent search in finding all solutions, or the optimal solution, within a space of interest. The corresponding binary tree structure provides a natural parallelism allowing concurrent evaluation of subproblems using parallel computing technology. Of special interest here are techniques derived from interval analysis, in particular an interval-Newton/generalized-bisection procedure. In this context, we discuss issues of load balancing and work scheduling that arise in the implementation of parallel interval-Newton on a cluster of workstations using message passing, and describe and analyze techniques for this purpose. Results using an asynchronous diffusive load balancing strategy show that a consistently high efficiency can be achieved in solving nonlinear equations, providing excellent scalability, especially with the use of a two-dimensional torus virtual network. The effectiveness of the approach used, especially in connection with a novel stack management scheme, is also demonstrated in the consistent superlinear speedups observed in performing global optimization.
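Diffusive load balancing on a two-dimensional torus, as mentioned in the abstract above, can be illustrated with a short simulation. The sketch below uses a synchronous update with an assumed diffusion coefficient `alpha`, whereas the paper describes an asynchronous strategy. The function name and the representation of load as a grid of floats are hypothetical.

```python
def diffuse(load, alpha=0.2):
    """One synchronous diffusion step on a 2D torus.

    load: list of rows, one float per node (e.g. number of pending subproblems).
    alpha: assumed fraction of each load difference exchanged with a neighbor;
           values up to 0.25 keep the four-neighbor update stable.
    Each node exchanges work with its four wrap-around neighbors, so the
    total load is conserved while differences shrink.
    """
    rows, cols = len(load), len(load[0])
    new = [row[:] for row in load]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = (r + dr) % rows, (c + dc) % cols  # torus wrap-around
                new[r][c] += alpha * (load[nr][nc] - load[r][c])
    return new
```

Starting from a grid in which a single node holds all the work, repeated calls such as `load = diffuse(load)` spread the work until every node holds roughly the average.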
Policies for swapping MPI processes
- In Proceedings of HPDC-12, the Symposium on High Performance and Distributed Computing
, 2003
Cited by 3 (2 self)
Despite the enormous amount of research and development work in the area of parallel computing, it is a common observation that simultaneous performance and ease-of-use are elusive. We believe that ease-of-use is critical for many end users, and thus seek performance-enhancing techniques that can be easily retrofitted to existing parallel applications. In a previous paper we presented MPI process swapping, a simple add-on to the MPI programming environment that can improve performance in shared computing environments. MPI process swapping requires as few as three lines of source code change to an existing application. In this paper we explore a question that we had left open in our previous work: based on which policies should processes be swapped for best performance? Our results show that, with adequate swapping policies, MPI process swapping can provide substantial performance benefits with very limited implementation effort.
A Framework for Opportunistic Cluster Computing using JavaSpaces
, 2001
Cited by 3 (0 self)
... applications that can be broken into manageable components, such an opportunistic adaptive parallel computing framework can provide performance gains. Furthermore, the results indicate that monitoring and reacting to System State enables us to minimize intrusiveness to the machines within the cluster.
I thank Manish Parashar for being a patient thesis advisor and for providing an environment to conduct excellent research. He has built a great research lab, TASSL, and I am grateful that he has allowed me to be a part of it. I thank my brother Deepak Batheja for encouragement and moral support during the entire span of my thesis. This thesis would never have been a reality without his continual support and guidance. He also taught me a great deal about doing and presenting research. I also thank everyone in the TASSL lab and ECE Department for making the past two years a wonderful experience. I really appreciate all the support and encou...
Adaptive Cluster Computing using JavaSpaces
, 2001
Cited by 2 (0 self)
In this paper we present the design, implementation and evaluation of a framework that uses JavaSpaces [1] to support this type of opportunistic adaptive parallel/distributed computing over networked clusters in a non-intrusive manner. The framework targets applications exhibiting coarse-grained parallelism and has three key features: (1) portability across heterogeneous platforms, (2) minimal configuration overheads for participating nodes, and (3) automated system state monitoring (using SNMP) to ensure non-intrusive behavior. Experimental results presented in this paper demonstrate that for applications that can be broken into coarse-grained, relatively independent tasks, the opportunistic adaptive parallel computing framework can provide performance gains. Furthermore, the results indicate that monitoring and reacting to the current system state minimizes the intrusiveness of the framework.
Keywords: Adaptive cluster computing, Parallel/Distributed computing, JavaSpaces, Jini, SNMP.
Parallel Rendering with an Actor Model
- Proceedings of Eurographics '97, Workshop on Programming Paradigms for Graphics
, 1997
Cited by 1 (1 self)
This paper describes an application of autonomous concurrent objects (Actors) to parallel rendering. The resulting rendering system is shown to be both scalable and portable. A parallel rendering application based on Monte Carlo path tracing is constructed using programming abstractions defined by the Actor model. This application is demonstrated to scale to hundreds of computers with efficiencies approaching 99%. The abstractions and the application are demonstrated to be portable across a range of parallel and distributed computer systems with various communications characteristics and topologies. A similar set of abstractions has been implemented in VLSI, suggesting that the entire rendering application could be realized by a special purpose systolic architecture.
1 Introduction
Photorealistic rendering requires a physically accurate simulation of light transport in complex geometric domains. Such domains may contain a diversity of materials, each scattering light according to a d...

