Results 1 – 10 of 47
1 Parallel Spectral Clustering in Distributed Systems
Cited by 61 (1 self)
Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through ...
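The sparsification strategy the abstract settles on — retaining only each point's nearest neighbors in the dense similarity matrix — can be sketched as follows. This is a minimal single-machine illustration, not the paper's parallel implementation; the function name, the choice of k, and the toy data are assumptions.

```python
import numpy as np

def knn_sparsify(S, k):
    """Keep the k largest off-diagonal similarities per row, zero the rest,
    then symmetrize so the result is a valid similarity graph."""
    n = S.shape[0]
    W = np.zeros_like(S)
    for i in range(n):
        row = S[i].copy()
        row[i] = -np.inf                 # exclude self-similarity
        nbrs = np.argsort(row)[-k:]      # indices of the k most similar points
        W[i, nbrs] = S[i, nbrs]
    return np.maximum(W, W.T)            # keep an edge if either endpoint keeps it

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))                               # 6 toy points in 2D
S = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))  # Gaussian similarity
W = knn_sparsify(S, k=2)
```

The symmetrization step matters: keeping an edge whenever either endpoint retains it preserves graph connectivity better than requiring mutual nearest neighbors.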
Multiparty Session C: Safe parallel programming with message optimisation
 In TOOLS Europe, volume 7304 of LNCS, 2012
Cited by 12 (9 self)
Abstract. This paper presents a new efficient programming toolchain for message-passing parallel algorithms which can fully ensure, for any typable program and for any execution path, deadlock-freedom, communication safety and global progress through static checking. The methodology is embodied as a multiparty session-based programming environment for C and its runtime libraries, which we call Session C. Programming starts from specifying a global protocol for a target parallel algorithm, using a protocol description language. From this global protocol, the projection algorithm generates endpoint protocols, based on which each endpoint C program is designed and implemented with a small number of concise session primitives. The endpoint protocol can further be refined to a more optimised protocol through subtyping for asynchronous communication, preserving the original safety guarantees. The underlying theory ensures that the complexity of the toolchain remains polynomial in the size of programs. We apply this framework to representative parallel algorithms with complex communication topologies. The benchmark results show that Session C performs competitively against MPI.
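The projection step the abstract describes — deriving each participant's endpoint protocol from one global protocol — can be illustrated with a toy sketch. Session C operates on typed protocols for C programs; the Python `project` function and the tuple encoding of messages below are purely illustrative assumptions.

```python
def project(global_protocol, role):
    """Project a global protocol (an ordered list of (sender, receiver, label)
    messages) to the local send/receive sequence one participant must follow."""
    local = []
    for src, dst, label in global_protocol:
        if src == role:
            local.append(("send", dst, label))   # this role emits the message
        elif dst == role:
            local.append(("recv", src, label))   # this role consumes it
    return local

# A tiny ring protocol among three roles.
ring = [("A", "B", "data"), ("B", "C", "data"), ("C", "A", "result")]
```

Checking each endpoint program against its projected sequence is what lets safety properties be verified locally, without inspecting the other participants.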
Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems
 In Proceedings of the 3rd International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks (HeteroPar’04)
Cited by 11 (3 self)
Abstract — In this paper, the possibility of including automatic optimization techniques in the design of parallel dynamic programming algorithms for heterogeneous systems is analyzed. The main idea is to automatically approach the optimum values of a number of algorithmic parameters (number of processes, number of processors, processes per processor), and thus obtain low execution times. Hence, users could be provided with routines which execute efficiently regardless of their experience in heterogeneous computing and dynamic programming, and which adapt automatically to a new network of processors or a new network configuration.
Ranking and Semi-supervised Classification on Large-Scale Graphs Using MapReduce
Cited by 10 (0 self)
Label Propagation, a standard algorithm for semi-supervised classification, suffers from scalability issues involving memory and computation when used with large-scale graphs from real-world datasets. In this paper we approach Label Propagation as the solution to a system of linear equations, which can be implemented as a scalable parallel algorithm using the MapReduce framework. In addition to semi-supervised classification, this approach to Label Propagation allows us to adapt the algorithm for ranking on graphs and to derive the theoretical connection between Label Propagation and PageRank. We provide empirical evidence to that effect using two natural language tasks – lexical relatedness and polarity induction. The version of the Label Propagation algorithm presented here scales linearly in the size of the data with a constant main memory requirement, in contrast to the quadratic cost of both in traditional approaches.
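A minimal single-machine sketch of the linear-system view of Label Propagation, assuming a row-normalized propagation matrix and a damping factor alpha (the function name, alpha, and the toy graph are illustrative; the paper distributes this same row-wise update via MapReduce):

```python
import numpy as np

def label_propagation(A, Y, alpha=0.85, iters=50):
    """A: adjacency matrix; Y: one-hot seed labels (zero rows = unlabeled).
    Iterates F <- alpha * P F + (1 - alpha) * Y, the fixed-point form of
    the linear system F = (I - alpha P)^(-1) (1 - alpha) Y."""
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.where(deg == 0, 1, deg)          # row-normalized transition matrix
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * (P @ F) + (1 - alpha) * Y   # each row update is one map task
    return F

# Path graph 0-1-2-3: node 0 seeded with class 0, node 3 with class 1.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
Y = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], float)
F = label_propagation(A, Y)
```

Replacing Y with a uniform teleportation vector turns the same fixed-point iteration into PageRank, which is the connection the abstract derives.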
Scalable Node Level Computation Kernels for Parallel Exact Inference
Cited by 7 (6 self)
In this paper, we investigate data parallelism in exact inference with respect to arbitrary junction trees. Exact inference is a key problem in exploring probabilistic graphical models, where the computation complexity increases dramatically with clique width and the number of states of random variables. We study potential table representation and scalable algorithms for node level primitives. Based on such node level primitives, we propose computation kernels for evidence collection and evidence distribution. A data parallel algorithm for exact inference is presented using the proposed computation kernels. We analyze the scalability of node level primitives, computation kernels and the exact inference algorithm using the coarse grained multicomputer (CGM) model. According to the analysis, we achieve O(N · d_C · w_C · ∏_{j=1}^{w_C} r_{C,j} / P) local computation time and O(N) global communication rounds using P processors, 1 ≤ P ≤ max_C ∏_{j=1}^{w_C} r_{C,j}, where N is the number of cliques in the junction tree; d_C is the clique degree; r_{C,j} is the number of states of the jth random variable in clique C; w_C is the clique width; and w_S is the separator width. We implemented the proposed algorithm on state-of-the-art clusters. Experimental results show that the proposed algorithm exhibits almost linear scalability over a wide range.
Anomaly Localization in Large-Scale Clusters
Cited by 5 (1 self)
Abstract — A critical problem in managing large-scale clusters is identifying the location of problems in a system when unusual events occur. As the scale of high performance computing (HPC) grows, systems are getting bigger. When a system fails to function properly, health-related data are collected for troubleshooting. However, due to the massive quantities of information obtained from a large number of components, the root causes of anomalies are often buried like needles in a haystack. In this paper, we present a localization method to automatically identify the potential root causes (i.e. a subset of nodes) of a problem from the overwhelming amount of data collected system-wide. System managers can focus on examining these potential locations, thereby significantly reducing the human effort required for anomaly localization. Our method consists of three interrelated steps: (1) feature collection, to assemble a feature space for the system; (2) feature extraction, to obtain the most significant features for efficient data analysis by applying the principal component analysis (PCA) algorithm; and (3) outlier detection, to quickly identify the nodes that are “far away” from the majority using the cell-based detection algorithm. Preliminary studies are presented to demonstrate the potential of our method for localizing anomalies in a computing environment where the nodes perform comparable tasks.
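Steps (2) and (3) of the pipeline above can be sketched as follows, with a plain robust-distance rule standing in for the cell-based detector the abstract names (all function names, thresholds, and data are illustrative assumptions):

```python
import numpy as np

def pca_outliers(X, n_components=2, threshold=3.0):
    """Project node health metrics onto the top principal components, then
    flag nodes whose distance from the majority exceeds a robust cutoff."""
    Xc = X - X.mean(axis=0)                            # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # principal directions
    Z = Xc @ Vt[:n_components].T                       # scores in the PCA subspace
    d = np.linalg.norm(Z - np.median(Z, axis=0), axis=1)
    mad = np.median(np.abs(d - np.median(d)))          # robust spread estimate
    cut = np.median(d) + threshold * (mad + 1e-12)
    return np.flatnonzero(d > cut)                     # indices of anomalous nodes

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(50, 8))   # 50 nodes reporting 8 health metrics
X[7] += 15.0                             # inject one faulty node
anomalies = pca_outliers(X)
```

A median/MAD cutoff is used instead of mean/standard deviation so the injected outlier cannot inflate its own detection threshold.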
Automatic Task Reorganization in MapReduce
Cited by 3 (2 self)
Abstract — MapReduce is increasingly considered a useful parallel programming model for large-scale data processing. It exploits parallelism in the execution of primitive map and reduce operations. Hadoop is an open source implementation of MapReduce that has been used in both academic research and industry production. However, its strategy of having one map task process one data block limits the degree of concurrency and degrades performance because it cannot fully utilize the available resources. In addition, its assumption that task execution time in each phase does not vary much does not always hold, which makes speculative execution useless. In this paper, we present mechanisms to dynamically split and consolidate tasks to cope with load balancing and to break through the concurrency limit resulting from fixed task granularity. For the single-job case, two algorithms are proposed for circumstances where prior knowledge is known and where it is unknown. For the multi-job case, we propose a modified shortest-job-first strategy, which theoretically minimizes job turnaround time when combined with task splitting. We compare the effectiveness of our approach to the default task scheduling strategy using both synthesized and trace-based workloads. Simulation results show that our approach improves performance significantly.
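The scheduling property the abstract relies on — shortest-job-first minimizing mean turnaround time — can be checked with a small sketch (a single execution slot is assumed and the job lengths are illustrative):

```python
def mean_turnaround(jobs):
    """jobs: run times in execution order; a job's turnaround time is its
    completion time, so earlier-finishing jobs lower the average."""
    t, total = 0.0, 0.0
    for j in jobs:
        t += j          # job finishes at cumulative time t
        total += t
    return total / len(jobs)

jobs = [8.0, 1.0, 4.0, 2.0]
fifo = mean_turnaround(jobs)          # arrival order
sjf = mean_turnaround(sorted(jobs))   # shortest-job-first order
```

Task splitting complements this: SJF is only optimal when job lengths are known and divisible enough that no long task blocks the queue.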
LINVIEW: incremental view maintenance for complex analytical queries
 In SIGMOD, 2014
Cited by 2 (0 self)
Many analytics tasks and machine learning problems can be naturally expressed by iterative linear algebra programs. In this paper, we study the incremental view maintenance problem for such complex analytical queries. We develop a framework, called Linview, for capturing deltas of linear algebra programs and understanding their computational cost. Linear algebra operations tend to cause an avalanche effect where even very local changes to the input matrices spread out and infect all of the intermediate results and the final view, causing incremental view maintenance to lose its performance benefit over reevaluation. We develop techniques based on matrix factorizations to contain such epidemics of change. As a consequence, our techniques make incremental view maintenance of linear algebra practical and usually substantially cheaper than reevaluation. We show, both analytically and experimentally, the usefulness of these techniques when applied to standard analytics tasks. Our evaluation demonstrates the efficiency of Linview in generating parallel incremental programs that outperform reevaluation techniques by more than an order of magnitude.
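The core idea — containing the avalanche effect by keeping deltas factorized — can be sketched for a single matrix-product view under a rank-1 change (sizes and names are illustrative assumptions; Linview itself handles general iterative programs):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
view = A @ B                       # materialized view, O(n^3) to evaluate

# Rank-1 change to the input: A <- A + u v^T.
u = rng.standard_normal((n, 1))
v = rng.standard_normal((n, 1))

# Incremental maintenance: delta(A B) = u (v^T B), computed in O(n^2)
# by keeping the delta factorized instead of forming the dense u v^T first.
view += u @ (v.T @ B)
A += u @ v.T                       # apply the same change to the base input
```

Evaluating `(u @ v.T) @ B` instead would already cost O(n^3); the factorized association order is exactly what keeps the delta cheap.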
Basic parallel and distributed computing curriculum
 Second NSF/TCPP Workshop on Parallel and Distributed Computing Education (EduPar’12), in conjunction with the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
Cited by 2 (0 self)
Abstract — With the advent of multicore processors and their fast expansion, it is quite clear that parallel computing is now a genuine requirement in Computer Science and Engineering (and related) curricula. In addition to the pervasiveness of parallel computing devices, we should take into account the fact that a great deal of existing software is implemented sequentially and thus needs to be adapted for parallel execution. Therefore, programmers need to be able to design parallel programs and also to have the skills to move from a given sequential code to the corresponding parallel code. In this paper, we present a basic educational scenario for giving a consistent and efficient background in parallel computing to ordinary computer scientists and engineers. Keywords: HPC; multicore; scheduling; SIMD; accelerator; benchmark; dependence; graph; shared memory; distributed memory; thread; synchronization
Asynchronous Iterative Solution for Dominant Eigenvectors with Applications in Performance Modelling and . . .
2009
Cited by 1 (1 self)
Performance analysis calculations, for models of any complexity, require a distributed computation effort that can easily occupy a large compute cluster for many days. Producing a simple steady-state measure involves an enormous dominant eigenvector calculation, with even modest performance models having upwards of 10^12 variables. Computations such as passage-time analysis are an order of magnitude more difficult, producing many hundreds of repeated linear system calculations. As models describe greater concurrency, the state space of the model increases, and with it the magnitude of any performance analysis problem being attempted. The PageRank algorithm is used by Google to measure the relative importance of web pages. It does this by formulating and solving a similarly enormous dominant eigenvector problem, with one variable for every page on the web. As with performance problems, as the number of web pages grows, so does the size of the underlying system calculation. With the number of web pages currently estimated to exceed one trillion, the PageRank problem requires many thousands of computers running concurrently over many different clusters. Both problems share the same underlying mathematical type and also the same requirement ...
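The shared computation both problem classes reduce to is a dominant eigenvector calculation. A synchronous power-iteration sketch is below; the thesis concerns asynchronous variants of this scheme, and the matrix and iteration count here are illustrative assumptions.

```python
import numpy as np

def power_iteration(P, iters=200):
    """P: column-stochastic matrix; returns its dominant eigenvector,
    i.e. the stationary distribution satisfying pi = P pi."""
    x = np.full(P.shape[0], 1.0 / P.shape[0])  # start from a uniform guess
    for _ in range(iters):
        x = P @ x          # one synchronous update of every component
        x /= x.sum()       # renormalize so x stays a probability distribution
    return x

# Tiny 3-state Markov chain (columns sum to 1).
P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
pi = power_iteration(P)
```

At the scales the abstract cites (10^12 states), each `P @ x` is itself a distributed computation, which is what motivates relaxing the synchronous update.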