A Data-Clustering Algorithm On Distributed Memory Multiprocessors
 In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence
, 2000
Cited by 96 (1 self)
To cluster increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message passing model. The proposed algorithm exploits the inherent data-parallelism in the k-means algorithm. We analytically show that the speedup and the scaleup of our algorithm approach the optimal as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups, for example, 15.62 on 16 nodes, and essentially linear scaleup in the size of the data set and in the number of clusters desired. For a 2 gigabyte test data set, our implementation drives the 16-node SP2 at more than 1.8 gigaflops. Keywords: k-means, data mining, massive data sets, message-passing, text mining. 1 Introduction Data sets measuring in gigabytes and even terabytes are now quite common in data and text minin...
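The data-parallel structure this abstract describes can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names (`local_stats`, `parallel_kmeans_step`) are hypothetical, each list in `shards` stands in for one node's share of the data, and the plain summation loop stands in for the reduction a message-passing library would perform.

```python
def local_stats(points, centroids):
    """Per-shard work: assign each point to its nearest centroid and
    accumulate per-cluster coordinate sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # Index of the centroid with the smallest squared Euclidean distance.
        j = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
        )
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def parallel_kmeans_step(shards, centroids):
    """One k-means iteration; the loop over shards models the nodes, and
    the accumulation models a global sum-reduction of their statistics."""
    k, dim = len(centroids), len(centroids[0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    for shard in shards:
        sums, counts = local_stats(shard, centroids)
        for j in range(k):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    # A cluster that received no points keeps its previous centroid.
    return [
        [s / total_counts[j] for s in total_sums[j]]
        if total_counts[j]
        else list(centroids[j])
        for j in range(k)
    ]


# Two shards of two points each, two initial centroids (illustrative data).
shards = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
new_centroids = parallel_kmeans_step(shards, [(1.0, 1.0), (9.0, 9.0)])
```

Only the per-cluster sums and counts cross shard boundaries, so the communication volume in this sketch depends on the number of clusters and dimensions rather than on the number of data points.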
Auto-Blocking Matrix-Multiplication or Tracking BLAS3 Performance from Source Code
 In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1997
Cited by 76 (6 self)
An elementary, machine-independent, recursive algorithm for matrix multiplication C += A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking hand-coded BLAS3 routines. "Proof of concept" is demonstrated by racing the in-place algorithm against manufacturer's hand-tuned BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into independent block-oriented processes, that each writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers that merrily unroll loops for the superscalar R8000 processor, but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter programmers from using this rich class of recursive algorithms.
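The quadrant recursion behind C += A*B can be sketched as follows. This is an illustrative version only, assuming square matrices whose order is a power of two and stored as ordinary nested lists; the name `rec_matmul` and its offset parameters are hypothetical, and the sketch does not reproduce the paper's storage layout or its base-case unfolding.

```python
def rec_matmul(A, B, C, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0, n=None):
    """Accumulate C += A*B on n-by-n blocks by recursing on quadrants.

    (ai, aj), (bi, bj), (ci, cj) are the top-left corners of the current
    blocks of A, B, and C. Assumes n is a power of two.
    """
    if n is None:
        n = len(A)
    if n == 1:
        # Base case: a single scalar multiply-accumulate.
        C[ci][cj] += A[ai][aj] * B[bi][bj]
        return
    h = n // 2
    # Eight recursive calls: C[i][j] += A[i][k] * B[k][j] over quadrants.
    for i in (0, h):
        for j in (0, h):
            for k in (0, h):
                rec_matmul(A, B, C,
                           ai + i, aj + k,
                           bi + k, bj + j,
                           ci + i, cj + j, h)


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
rec_matmul(A, B, C)
```

The four (i, j) pairs at the top level write to disjoint quadrants of C, which is what allows the recursion to split into independent block-oriented tasks.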
Models of Computation: Exploring the Power of Computing
Cited by 59 (5 self)
Theoretical computer science treats any computational subject for which a good model can be created. Research on formal models of computation was initiated in the 1930s and 1940s by Turing, Post, Kleene, Church, and others. In the 1950s and 1960s programming languages, language translators, and operating systems were under development and therefore became both the subject and basis for a great deal of theoretical work. The power of computers of this period was limited by slow processors and small amounts of memory, and thus theories (models, algorithms, and analysis) were developed to explore the efficient use of computers as well as the inherent complexity of problems. The former subject is known today as algorithms and data structures, the latter computational complexity. The focus of theoretical computer scientists in the 1960s on languages is reflected in the first textbook on the subject, Formal Languages and Their Relation to Automata by John Hopcroft and Jeffrey Ullman. This influential book led to the creation of many language-centered theoretical computer science courses; many introductory theory courses today continue to reflect the content of this book and the interests of theoreticians of the 1960s and early 1970s. Although
Ahnentafel indexing into Morton-ordered arrays, or matrix locality for free
 In Euro-Par 2000 – Parallel Processing
, 2000
Cited by 27 (5 self)
Definitions for the uniform representation of d-dimensional matrices serially in Morton-order (or Z-order) support both their use with cartesian indices, and their divide-and-conquer manipulation as quaternary trees. In the latter case, d-dimensional arrays are accessed as 2^d-ary trees. This data structure is important because, at once, it relaxes serious problems of locality and latency, and the tree helps schedule multiprocessing. It enables algorithms that avoid cache misses and page faults at all levels in hierarchical memory, independently of a specific runtime environment. This paper gathers the properties of Morton order and its mappings to other indexings, and outlines for compiler support of it. Statistics elsewhere show that the new ordering and block algorithms achieve high flop rates and, indirectly, parallelism without any low-level tuning.
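The mapping from cartesian indices to a Morton (Z-order) offset can be illustrated for the two-dimensional case by interleaving the bits of the row and column. This is a sketch of plain bit interleaving only, not of the Ahnentafel indexing the paper defines; the function names and the choice to place row bits in the odd positions are assumptions made for illustration.

```python
def morton_index(row, col, bits=16):
    """Interleave the low `bits` bits of row and col into one offset.

    Assumed convention: column bits occupy even positions and row bits
    occupy odd positions, so quadrants are visited NW, NE, SW, SE.
    """
    idx = 0
    for b in range(bits):
        idx |= ((col >> b) & 1) << (2 * b)
        idx |= ((row >> b) & 1) << (2 * b + 1)
    return idx


def morton_decode(idx, bits=16):
    """Inverse of morton_index: recover (row, col) from a Morton offset."""
    row = col = 0
    for b in range(bits):
        col |= ((idx >> (2 * b)) & 1) << b
        row |= ((idx >> (2 * b + 1)) & 1) << b
    return row, col


# Offsets of a 2-by-2 block listed in row-major order of (row, col).
order = [morton_index(r, c) for r in range(2) for c in range(2)]
```

Under this convention every aligned 2^k-by-2^k block occupies a contiguous range of offsets, which is the property that lets the same array be treated as a quaternary tree.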
Parallel Sorting With Limited Bandwidth
 in Proc. 7th ACM Symp. on Parallel Algorithms and Architectures
, 1995
Cited by 26 (5 self)
We study the problem of sorting on a parallel computer with limited communication bandwidth. By using the recently proposed PRAM(m) model, where p processors communicate through a small, globally shared memory consisting of m bits, we focus on the tradeoff between the amount of local computation and the amount of interprocessor communication required for parallel sorting algorithms. We prove a lower bound of $\Omega\left(\frac{n \log m}{m}\right)$ on the time to sort n numbers in an exclusive-read variant of the PRAM(m) model. We show that Leighton's Columnsort can be used to give an asymptotically matching upper bound in the case where m grows as a fractional power of n. The bounds are of a surprising form, in that they have little dependence on the parameter p. This implies that attempting to distribute the workload across more processors while holding the problem size and the size of the shared memory fixed will not improve the optimal running time of sorting in this model. We also show that bot...
Opportunity cost algorithms for reduction of i/o and interprocess communication overhead in a computing cluster
 IEEE Transactions on Parallel and Distributed Systems
, 2003
Cited by 15 (1 self)
Computing Clusters (CC), consisting of several connected machines, could provide a high-performance, multiuser, time-sharing environment for executing parallel and sequential jobs. In order to achieve good performance in such an environment, it is necessary to assign processes to machines in a manner that ensures efficient allocation of resources among the jobs. This paper presents opportunity cost algorithms for online assignment of jobs to machines in a CC. These algorithms are designed to improve the overall CPU utilization of the cluster and to reduce the I/O and the Interprocess Communication (IPC) overhead. Our approach is based on known theoretical results on competitive algorithms. The main contribution of the paper is how to adapt this theory into working algorithms that can assign jobs to machines in a manner that guarantees near-optimal utilization of the CPU resource for jobs that perform I/O and IPC operations. The developed algorithms are easy to implement. We tested the algorithms by means of simulations and executions in a real system and show that they outperform existing methods for process allocation that are based on ad hoc heuristics. Index Terms: Load balancing, competitive algorithms, cluster computing, I/O overhead, IPC overhead.
Sampling and Analytical Techniques for Data Distribution of Parallel Sparse Computation
, 1997
Cited by 9 (7 self)
We present a compile-time method to select compression and distribution schemes for sparse matrices which are computed using Fortran 90 array intrinsic operations. The selection process samples input sparse matrices to determine their sparsity structures. It is also guided by cost functions of various sparse routines as measured from the target machine. The Fortran 90 array expression is then transformed into a sparse array expression that calls the selected compression and distribution routines. 1 Introduction It has long been a challenging research topic to devise general guidelines for selecting efficient compression and distribution schemes for parallel executions of sparse matrix computations. We feel that this problem is difficult at least for the following three reasons. First, the cost of a sparse matrix computation depends greatly on the structures (i.e., the distributions of nonzero elements) of its input matrices [2]. Such information, however, may not be available at c...
One-bit Counts between Unique and Sticky
 ACM SIGPLAN Notices
, 1998
Cited by 8 (0 self)
Stoye's one-bit reference tagging scheme can be extended to local counts of two or more via two strategies. The first, suited to pure register transactions, is a cache of referents to two shared references. The analog of Deutsch's and Bobrow's multiple-reference table, this cache is sufficient to manage small counts across successive assignment statements. Thus, accurate reference counts above one can be tracked for short intervals, like those bridging one function's environment to its successor's. The second, motivated by runtime stacks that duplicate references, avoids counting any references from the stack. It requires a local pointer-inversion protocol in the mutator, but one still local to the referent and the stack frame. Thus, an accurate reference count of one can be maintained regardless of references from the recursion stack. CCS categories and Subject Descriptors: D.4.2 [Storage Management]: Allocation/Deallocation strategies; E.2 [Data Storage Representations]: Linked re...
From Algorithm Parallelism to Instruction-Level Parallelism: An Encode-Decode Chain Using Prefix-Sum (Extended Abstract)
, 1997
Cited by 5 (3 self)
A novel comprehensive and coherent approach for the purpose of increasing instruction-level parallelism (ILP) is devised. The key new tool in our envisioned system update is the addition of a parallel prefix-sum (PS) instruction, which will have efficient implementation in hardware, to the instruction-set architecture. This addition gives for the first time a concrete way for recruiting the whole knowledge base of parallel algorithms for that purpose. The potential increase in ILP is demonstrated by experimental results for a test application. The main technical contribution is in the form of a "completeness theorem". Perhaps surprisingly, the current abstract proves that in an envisioned system which employs parallel PS functional units, a proper use of a serial programming language suffices for the following. With a moderate effort, one can program a parallel algorithm (in a serial language), so that a parallelizing compiler (even without runtime methods!) will be able to extract th...
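For readers unfamiliar with the prefix-sum primitive itself, the following is a minimal software sketch of a step-doubling inclusive scan. It is a generic textbook formulation chosen for illustration, not the hardware PS instruction the abstract proposes; the name `inclusive_scan` is hypothetical, and the inner loop is executed serially here although its iterations are independent within each step.

```python
def inclusive_scan(values):
    """Inclusive prefix sum computed in about log2(n) doubling steps.

    Within one step every position reads only the previous step's
    snapshot, so all positions could be updated simultaneously.
    """
    out = list(values)
    step = 1
    while step < len(out):
        prev = out[:]  # snapshot: all reads happen before any write
        for i in range(step, len(out)):
            out[i] = prev[i] + prev[i - step]
        step *= 2
    return out


prefix = inclusive_scan([1, 2, 3, 4])
```

A serial loop would need n - 1 dependent additions, while this formulation has a dependency depth that grows only logarithmically with n, at the cost of more total additions.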
The Opie Compiler: from Row-major Source to Morton-ordered Matrices
, 2004
Cited by 3 (1 self)
The Opie Project aims to develop a compiler to transform C codes written for row-major matrix representation into equivalent codes for Morton-order matrix representation, and to apply its techniques to other languages. Accepting a possible reduction in performance, we seek to compile libraries of usable code to support future development of new algorithms better suited to Morton-ordered matrices.