Software Engineering for Multicore Systems – An Experience Report
, 2007
Cited by 12 (7 self)
The emergence of inexpensive parallel computers powered by multicore chips, combined with stagnating clock rates, raises new challenges for software engineering. As future performance improvements will not come “for free” from increased clock rates, performance-critical applications will need to be parallelized. However, little is known about the engineering principles for parallel general-purpose applications. This paper presents an experience report with four diverse case studies on multicore software development for general-purpose applications. They were programmed in different languages and benchmarked on several multicore computers. Empirical findings include:
• Multicore computers deliver: Real speedups are achievable, albeit with significant programming effort and speedups that are typically lower than the number of cores employed.
• Massive refactoring of sequential programs is required, sometimes at several levels. Special tools for parallelization refactorings appear to be an important area of research.
• Autotuning is indispensable, as manually tuning thread assignment, number of pipeline stages, size of data partitions and other parameters is difficult and error-prone.
• Architectures that encompass several parallel components are poorly understood. Tuneable architectural patterns with parallelism at several levels need to be discovered.
SRUMMA: a matrix multiplication algorithm suitable for clusters and scalable shared memory systems
- in proceedings of Parallel and Distributed Processing Symposium
, 2004
Cited by 8 (5 self)
This paper describes a novel parallel algorithm that implements a dense matrix multiplication operation with algorithmic efficiency equivalent to that of Cannon’s algorithm. It is suitable for clusters and scalable shared memory systems. The current approach differs from other parallel matrix multiplication algorithms by the explicit use of shared memory and remote memory access (RMA) communication rather than message passing. The experimental results on clusters (IBM SP, Linux-Myrinet) and shared memory systems (SGI Altix, Cray X1) demonstrate consistent performance advantages over pdgemm from the ScaLAPACK/PBLAS suite, the leading implementation of the parallel matrix multiplication algorithms used today. In the best case on the SGI Altix, the new algorithm performs 20 times better than pdgemm for a matrix size of 1000 on 128 processors. The impact of zero-copy nonblocking RMA communication and shared memory communication on matrix multiplication performance on clusters is investigated.
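To make the block-oriented, one-sided style of communication concrete, here is a minimal single-process sketch in Python. It is not the SRUMMA implementation: the names `get_block`, `multiply_add`, and `blocked_matmul` are hypothetical, a plain list copy stands in for a remote memory "get", and the staggered block order is only an assumption about how owners might avoid fetching the same block at the same time.

```python
def get_block(matrix, row, col, bs):
    """Stand-in for a one-sided remote get: copy one bs x bs block."""
    return [r[col * bs:(col + 1) * bs] for r in matrix[row * bs:(row + 1) * bs]]


def multiply_add(c_block, a_block, b_block):
    """Accumulate a_block @ b_block into c_block in place."""
    bs = len(c_block)
    for i in range(bs):
        for j in range(bs):
            c_block[i][j] += sum(a_block[i][k] * b_block[k][j] for k in range(bs))


def blocked_matmul(a, b, grid):
    """Multiply two n x n matrices on a logical grid x grid process grid."""
    n = len(a)
    if n % grid != 0:
        raise ValueError("matrix size must be divisible by the grid size")
    bs = n // grid
    c = [[0] * n for _ in range(n)]
    for pi in range(grid):          # each (pi, pj) pair plays the role of one process
        for pj in range(grid):
            c_block = [[0] * bs for _ in range(bs)]
            for step in range(grid):
                # Assumed staggering: owners start on different k blocks.
                k = (pi + pj + step) % grid
                multiply_add(c_block, get_block(a, pi, k, bs), get_block(b, k, pj, bs))
            for i in range(bs):     # the owner writes its finished block of C
                c[pi * bs + i][pj * bs:(pj + 1) * bs] = c_block[i]
    return c
```

Each owner only reads blocks of A and B and writes its own block of C, so no owner has to take part in another owner's transfers. That is the property one-sided communication relies on.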
Computational Complexity of an Optical Model of Computation
, 2005
Cited by 6 (6 self)
We investigate the computational complexity of an optically inspired model of computation. The model is called the continuous space machine and operates in discrete timesteps over a number of two-dimensional complex-valued images of constant size and arbitrary spatial resolution. We define a number of optically inspired complexity measures and data representations for the model. We show the growth of each complexity measure under each of the model's operations. We characterise the power of an important discrete restriction of the model. Parallel time on this variant of the model is shown to correspond, within a polynomial, to sequential space on Turing machines, thus verifying the parallel computation thesis. We also give a characterisation of the class NC. As a result the model has computational power equivalent to that of many well-known parallel models. These characterisations give a method to translate parallel algorithms to optical algorithms and facilitate the application of the complexity theory toolbox to optical computers. Finally we show that another variation on the model is very powerful
A Parallel Dynamic Programming Algorithm on a Multi-core
, 2007
Cited by 5 (0 self)
Dynamic programming is an efficient technique for solving combinatorial search and optimization problems. There have been many parallel dynamic programming algorithms. The purpose of this paper is to study a family of dynamic programming algorithms in which data dependences appear between non-consecutive stages; in other words, the data dependence is non-uniform. This kind of dynamic programming is typically called nonserial polyadic dynamic programming. Owing to the non-uniform data dependence, it is harder to optimize this problem for parallelism and locality on parallel architectures. In this paper, we address the challenge of exploiting fine-grain parallelism and locality of nonserial polyadic dynamic programming on a multi-core architecture. We present a programming and execution model for multi-core architectures with a memory hierarchy. In the framework of the new model, parallelism and locality benefit from a data dependence transformation. We propose a parallel pipelined algorithm for filling the dynamic programming matrix by decomposing the computation operators. The new parallel algorithm tolerates memory access latency using multithreading and is easily improved with tiling. We formulate and analytically solve the optimization problem of determining the tile size that minimizes the total execution time.
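A familiar example of nonserial polyadic dynamic programming is matrix-chain ordering, where each cell depends on pairs of cells from many earlier stages. The Python sketch below is only an illustration of that dependence pattern, not the paper's pipelined algorithm. The function name `matrix_chain_cost` is hypothetical.

```python
def matrix_chain_cost(dims):
    """Minimum scalar multiplications to multiply a chain of matrices.

    Matrix i has shape dims[i] x dims[i + 1].
    """
    n = len(dims) - 1
    if n <= 0:
        return 0
    cost = [[0] * n for _ in range(n)]
    for span in range(1, n):
        # All cells on one diagonal (fixed span) are mutually independent,
        # so a parallel version could fill a whole diagonal at once.
        for i in range(n - span):
            j = i + span
            # Non-uniform dependence: cell (i, j) reads (i, k) and (k + 1, j)
            # for every split point k, i.e. cells from many earlier diagonals.
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]
```

The inner `min` over `k` is what makes locality hard. The cells it reads lie along a row and a column of the table, not in a compact neighbourhood.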
A Compositional Framework for Developing Parallel Programs on Two Dimensional Arrays
, 2005
Cited by 3 (2 self)
The METR technical reports are published as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author’s copyright. These works may not be reposted without the explicit permission of the copyright holder.
Parallel PSO Using MapReduce
- In Proc. of the Congress on Evolutionary Computation
, 2007
Cited by 3 (0 self)
In optimization problems involving large amounts of data, such as web content, commercial transaction information, or bioinformatics data, individual function evaluations may take minutes or even hours. Particle Swarm Optimization (PSO) must be parallelized for such functions. However, large-scale parallel programs must communicate efficiently, balance work across all processors, and address problems such as failed nodes. We present MapReduce Particle Swarm Optimization (MRPSO), a PSO implementation based on the MapReduce parallel programming model. We describe MapReduce and show how PSO can be naturally expressed in this model, without explicitly addressing any of the details of parallelization. We present a benchmark function for evaluating MRPSO and note that MRPSO is not appropriate for optimizing easily evaluated functions. We demonstrate that MRPSO scales to 256 processors on moderately difficult problems and tolerates node failures.
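One way to picture PSO as map and reduce steps is the single-process Python sketch below. It is a simplification under stated assumptions, not the MRPSO implementation. It uses a global-best topology, a cheap sphere function as the objective, and made-up coefficient values. The names `map_particle`, `reduce_best`, and `run_pso` are hypothetical.

```python
import random


def sphere(position):
    """Toy objective (assumed for illustration): sum of squares, minimum at 0."""
    return sum(x * x for x in position)


def map_particle(particle, global_best_pos, rng, inertia=0.7, c1=1.5, c2=1.5):
    """Map step: move and re-evaluate one particle, independently of the others."""
    velocity = [
        inertia * v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (gb - x)
        for x, v, pb, gb in zip(
            particle["pos"], particle["vel"], particle["best_pos"], global_best_pos
        )
    ]
    position = [x + v for x, v in zip(particle["pos"], velocity)]
    value = sphere(position)
    best_val, best_pos = particle["best_val"], particle["best_pos"]
    if value < best_val:
        best_val, best_pos = value, position
    return {"pos": position, "vel": velocity, "best_val": best_val, "best_pos": best_pos}


def reduce_best(particles):
    """Reduce step: combine personal bests into one global best."""
    best = min(particles, key=lambda p: p["best_val"])
    return best["best_val"], best["best_pos"]


def run_pso(num_particles=10, dims=2, iterations=30, seed=0):
    rng = random.Random(seed)
    particles = []
    for _ in range(num_particles):
        pos = [rng.uniform(-5, 5) for _ in range(dims)]
        particles.append(
            {"pos": pos, "vel": [0.0] * dims, "best_val": sphere(pos), "best_pos": pos}
        )
    global_best = reduce_best(particles)
    history = [global_best[0]]
    for _ in range(iterations):
        # In a real MapReduce job the map calls would run on separate workers.
        particles = [map_particle(p, global_best[1], rng) for p in particles]
        global_best = reduce_best(particles)
        history.append(global_best[0])
    return global_best, history
```

Because the map step touches only one particle plus the broadcast global best, a failed map task can simply be rerun, which is the kind of fault tolerance the abstract attributes to the MapReduce model.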
Pervasive parallelism in data mining: dataflow solution to co-clustering large and sparse netflix data
- In KDD
, 2009
Cited by 3 (0 self)
All Netflix Prize algorithms proposed so far are prohibitively costly for large-scale production systems. In this paper, we describe an efficient dataflow implementation of a collaborative filtering (CF) solution to the Netflix Prize problem [1] based on weighted co-clustering [5]. The dataflow library we use facilitates the development of sophisticated parallel programs designed to fully utilize commodity multicore hardware, while hiding traditional difficulties such as queuing, threading, memory management, and deadlocks. The dataflow CF implementation first compresses the large, sparse training dataset into co-clusters. Then it generates recommendations by combining the average ratings of the co-clusters with the biases of the users and movies. When configured to identify 20x20 co-clusters in the Netflix training dataset, the implementation predicted over 100 million ratings in 16.31 minutes and achieved an RMSE of 0.88846 without any fine-tuning or domain knowledge. This is an effective real-time prediction runtime of 9.7 µs per rating which is far superior to previously reported results. Moreover, the implemented co-clustering framework supports a wide variety of other large-scale data mining applications and forms the basis for predictive modeling on large, dyadic datasets [4, 7].
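The prediction step, which combines a co-cluster average with user and movie biases, can be sketched as follows. The exact formula is an assumption: it takes each bias to be the difference between an individual average and its cluster average, which is one common form for co-clustering-based collaborative filtering. The function name `predict_rating`, the `model` layout, and all numbers in the usage example are made up for illustration.

```python
def predict_rating(user, movie, model):
    """Predict a rating from co-cluster statistics (assumed bias-corrected form)."""
    g = model["user_cluster"][user]      # row cluster of this user
    h = model["movie_cluster"][movie]    # column cluster of this movie
    user_bias = model["user_avg"][user] - model["user_cluster_avg"][g]
    movie_bias = model["movie_avg"][movie] - model["movie_cluster_avg"][h]
    return model["cocluster_avg"][(g, h)] + user_bias + movie_bias


# Toy model with invented values, only to show the lookup structure.
toy_model = {
    "user_cluster": {"alice": 0},
    "movie_cluster": {"heat": 1},
    "user_avg": {"alice": 4.0},
    "movie_avg": {"heat": 3.0},
    "user_cluster_avg": {0: 3.6},
    "movie_cluster_avg": {1: 3.2},
    "cocluster_avg": {(0, 1): 3.5},
}
toy_prediction = predict_rating("alice", "heat", toy_model)
```

Every prediction is a handful of dictionary lookups and additions once the co-clusters are fixed, which is consistent with the microsecond-scale per-rating cost the abstract reports.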
CONQUEST: A distributed tool for constructing summaries of high-dimensional discrete-attributed datasets
- in Proc. 4th SIAM Intl. Conf. Data Mining (SDM’04)
, 2004
Cited by 2 (2 self)
The problem of constructing bounded-error summaries of binary attributed data of very high dimensions is an important and difficult one. These summaries enable more expensive analysis techniques to be applied efficiently with little loss in accuracy. Recent work in this area has resulted in the use of discrete linear algebraic transforms to construct such summaries efficiently. This paper addresses the problem of constructing summaries of distributed datasets. Specifically, the problem can be stated as follows: given a set of n discrete attributed vectors distributed across p sites, construct a summary of k << n vectors such that each of the input vectors is within a given bounded distance from some output vector. In addition to being algorithmically efficient (i.e., it must do no more work than the corresponding serial algorithm), the distributed formulation must have low parallelization overheads. We present CONQUEST, a tool that achieves excellent performance and scalability for summarizing distributed datasets. In contrast to traditional parallel techniques that distribute the kernel operations, CONQUEST uses a less aggressive parallel formulation that relies on the principle of sampling to reduce communication overhead while maintaining high accuracy. Specifically, each individual site computes its local patterns independently. Various sites cooperate within dynamically orchestrated workgroups to construct consensus patterns from these local patterns. Individual sites then decide to participate in the consensus or leave the group. Experimental results on a set of Intel Xeon servers demonstrate that this strategy is capable of excellent performance in terms of compression time, ratio, and accuracy with respect to postprocessing tasks. The communication overhead associated with CONQUEST is also shown to be minimal, making it ideally suited to wide-area deployment.
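The bounded-error condition in the problem statement can be written down as a small check. The Python sketch below assumes binary vectors and Hamming distance. The abstract says only "bounded distance", so the metric is an assumption, and the names `hamming` and `is_bounded_summary` are hypothetical. It verifies a candidate summary and does not construct one.

```python
def hamming(u, v):
    """Number of positions where two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def is_bounded_summary(vectors, summary, max_distance):
    """True if every input vector is within max_distance of some summary vector."""
    return all(
        any(hamming(v, s) <= max_distance for s in summary)
        for v in vectors
    )
```

For example, `is_bounded_summary([(1, 1, 0, 0), (1, 0, 0, 0), (0, 0, 1, 1)], [(1, 1, 0, 0), (0, 0, 1, 1)], 1)` holds because the middle vector differs from the first summary vector in a single position.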
Dynamic Load Balancing in Distributed Mining of Molecular Compounds
- IEEE Transactions on Parallel and Distributed Systems, Special Issue on High Performance Computational Biology
, 2006
Cited by 2 (1 self)
In molecular biology it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, whereby no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute’s HIV-screening dataset, where we were able to show close-to-linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non-dedicated computational environment. These features make it suitable for large-scale, multi-domain, heterogeneous environments, such as computational Grids.
Index Terms: Distributed computing, peer-to-peer computing, dynamic load balancing, subgraph mining, frequent patterns, biochemical databases, molecular compounds.
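The idea of receiver-initiated load balancing over an irregular search tree can be illustrated with a small round-based simulation in Python. This is not the paper's algorithm. The synthetic tree in `children`, the policy of asking the most loaded worker, and the names `tree_size` and `balance_search` are all assumptions chosen to keep the sketch short.

```python
from collections import deque


def children(node):
    """Hypothetical irregular search tree: branching depends on the node label."""
    depth, label = node
    if depth >= 4:
        return []
    return [(depth + 1, label * 3 + i) for i in range(label % 3 + 1)]


def tree_size(root):
    """Sequential reference: count every node reachable from root."""
    return 1 + sum(tree_size(child) for child in children(root))


def balance_search(num_workers, root=(0, 1)):
    """Explore the tree with idle workers requesting work from busy ones."""
    queues = [deque() for _ in range(num_workers)]
    queues[0].append(root)                 # all work starts on worker 0
    processed = [0] * num_workers
    transfers = 0
    while any(queues):
        for w in range(num_workers):
            if not queues[w]:
                # Receiver-initiated: the idle worker is the one that asks.
                donor = max(range(num_workers), key=lambda d: len(queues[d]))
                if len(queues[donor]) > 1:
                    # Donate the oldest pending node; older nodes sit closer
                    # to the root and tend to carry larger subtrees.
                    queues[w].append(queues[donor].popleft())
                    transfers += 1
            if queues[w]:
                node = queues[w].pop()     # expand depth-first locally
                processed[w] += 1
                queues[w].extend(children(node))
    return processed, transfers
```

Busy workers never probe their peers, so no balancing traffic is generated while everyone has work. Requests appear only once a worker runs dry, which suits a tree whose shape cannot be predicted in advance.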

