Results 1–10 of 23
Residual splash for optimally parallelizing belief propagation
 In Artificial Intelligence and Statistics (AISTATS)
, 2009
Abstract

Cited by 42 (8 self)
As computer architectures move towards multicore, we must build a theoretical understanding of parallelism in machine learning. In this paper we focus on parallel inference in graphical models. We demonstrate that the natural, fully synchronous parallelization of belief propagation is highly inefficient. By bounding the achievable parallel performance in chain graphical models, we develop a theoretical understanding of the parallel limitations of belief propagation. We then provide a new parallel belief propagation algorithm which achieves optimal performance. Using two challenging real-world tasks, we empirically evaluate the performance of our algorithm on large cyclic graphical models, where we achieve near-linear parallel scaling and outperform alternative algorithms.
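The residual-priority idea this abstract contrasts with synchronous updates can be sketched on a small chain MRF. This is a minimal illustration, not the paper's algorithm: the function names, the shared pairwise potential, and the max-norm residual are my own assumptions.

```python
import heapq
import numpy as np

def residual_bp_chain(unary, pairwise, tol=1e-8, max_updates=10000):
    """Residual-prioritized belief propagation on a pairwise-MRF chain.

    unary:    (n, k) array, one row of unary potentials per node
    pairwise: (k, k) array shared by every edge, pairwise[a, b] scoring
              (state a at node i, state b at node i+1)
    Instead of updating all messages in lockstep (synchronous BP), the
    message whose pending update changes it the most is sent first.
    """
    n, k = unary.shape
    edges = [(i, i + 1) for i in range(n - 1)] + [(i + 1, i) for i in range(n - 1)]
    msgs = {e: np.full(k, 1.0 / k) for e in edges}

    def new_msg(i, j):
        # product of unary at i and incoming messages, excluding the one from j
        b = unary[i].astype(float).copy()
        for nb in (i - 1, i + 1):
            if 0 <= nb < n and nb != j:
                b = b * msgs[(nb, i)]
        m = pairwise.T @ b if i < j else pairwise @ b
        return m / m.sum()

    heap = []
    for e in edges:
        r = np.abs(new_msg(*e) - msgs[e]).max()
        heapq.heappush(heap, (-r, e))

    updates = 0
    while heap and updates < max_updates:
        _, (i, j) = heapq.heappop(heap)
        cand = new_msg(i, j)
        if np.abs(cand - msgs[(i, j)]).max() < tol:
            continue                     # residual has gone stale; nothing to do
        msgs[(i, j)] = cand
        updates += 1
        for nxt in (j - 1, j + 1):       # messages leaving j are now stale
            if 0 <= nxt < n and nxt != i:
                r = np.abs(new_msg(j, nxt) - msgs[(j, nxt)]).max()
                heapq.heappush(heap, (-r, (j, nxt)))

    beliefs = unary.astype(float).copy()
    for (i, j) in edges:
        beliefs[j] *= msgs[(i, j)]       # belief = unary x incoming messages
    beliefs /= beliefs.sum(axis=1, keepdims=True)
    return beliefs, updates
```

On a chain (a tree) this fixed point is exact, which makes the sketch easy to check against brute-force marginals.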
Node Level Primitives for Parallel Exact Inference
, 2007
Abstract

Cited by 9 (8 self)
We present node level primitives for parallel exact inference on an arbitrary Bayesian network. We explore the probability representation on each node of Bayesian networks and each clique of junction trees. We study the operations with respect to these probability representations and categorize the operations into four node level primitives: table extension, table multiplication, table division, and table marginalization. Exact inference on Bayesian networks can be implemented based on these node level primitives. We develop parallel algorithms for the above and achieve O(w^2 r^(w+1) N/p) parallel computational complexity, O(N r^w) space complexity, and scalability up to O(r^w), where N is the number of cliques in the junction tree, r is the number of states of a random variable, w is the maximal size of the cliques, and p is the number of processors. Experimental results illustrate the scalability of our parallel algorithms for each of these primitives.
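The four primitives named in the abstract have natural serial counterparts when a potential table is stored as a multidimensional array, one axis per variable. A minimal NumPy sketch (my own naming and conventions, not the paper's parallel formulation; `table_extension` assumes the input variable set is a subsequence of the output set):

```python
from itertools import combinations
import numpy as np

def table_extension(table, axes_in, axes_out, r=2):
    """Extend a table over axes_in to the larger set axes_out by broadcasting.
    Assumes axes_in appears in axes_out in the same order; r states per variable."""
    shape = [r if a in axes_in else 1 for a in axes_out]
    ext = table.reshape(shape)
    return np.broadcast_to(ext, [r] * len(axes_out)).copy()

def table_multiplication(t1, t2):
    """Elementwise product of two tables over the same variable set."""
    return t1 * t2

def table_division(t1, t2):
    """Elementwise division with the inference convention 0/0 := 0."""
    out = np.zeros_like(t1, dtype=float)
    np.divide(t1, t2, out=out, where=t2 != 0)
    return out

def table_marginalization(table, keep_axes):
    """Sum out every axis not listed in keep_axes."""
    drop = tuple(a for a in range(table.ndim) if a not in keep_axes)
    return table.sum(axis=drop)
```

Evidence propagation on a junction tree is then a sequence of these four operations on clique and separator tables, which is why parallelizing the primitives parallelizes the whole inference.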
Scalable Node Level Computation Kernels for Parallel Exact Inference
Abstract

Cited by 7 (6 self)
In this paper, we investigate data parallelism in exact inference with respect to arbitrary junction trees. Exact inference is a key problem in exploring probabilistic graphical models, where the computation complexity increases dramatically with clique width and the number of states of random variables. We study potential table representation and scalable algorithms for node level primitives. Based on such node level primitives, we propose computation kernels for evidence collection and evidence distribution. A data parallel algorithm for exact inference is presented using the proposed computation kernels. We analyze the scalability of node level primitives, computation kernels and the exact inference algorithm using the coarse grained multicomputer (CGM) model. According to the analysis, we achieve O(N d_C w_C ∏_{j=1}^{w_C} r_{C,j} / P) local computation time and O(N) global communication rounds using P processors, 1 ≤ P ≤ max_C ∏_{j=1}^{w_C} r_{C,j}, where N is the number of cliques in the junction tree; d_C is the clique degree; r_{C,j} is the number of states of the jth random variable in clique C; w_C is the clique width; and w_s is the separator width. We implemented the proposed algorithm on state-of-the-art clusters. Experimental results show that the proposed algorithm exhibits almost linear scalability over a wide range.
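The two kernels named above, evidence collection and evidence distribution, are the upward and downward passes over a rooted junction tree. A minimal sketch of the traversal orders they must respect (the data structures and names are illustrative, not the paper's):

```python
def collect_distribute_order(children, root=0):
    """Edge orders for the two evidence-propagation phases on a rooted tree.

    children: dict mapping clique id -> list of child clique ids.
    Collection sends each message child -> parent only after the child's
    whole subtree is done (post-order); distribution reverses every edge
    and every dependency (pre-order).
    """
    collect = []

    def post(v):
        for c in children.get(v, []):
            post(c)
            collect.append((c, v))      # message child -> parent
    post(root)
    distribute = [(p, c) for (c, p) in reversed(collect)]
    return collect, distribute
```

Messages at the same depth have no mutual dependencies, which is exactly the slack a data parallel implementation exploits.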
Scalable parallel implementation of Bayesian network to junction tree conversion for exact inference
 Information Retrieval: Data Structures and Algorithms
, 2006
Abstract

Cited by 7 (2 self)
We present a scalable parallel implementation for converting a Bayesian network to a junction tree, which can then be used for a complete parallel implementation for exact inference. We explore parallelism during the process of moralization, triangulation, clique identification, junction tree construction and potential table calculation. For an arbitrary Bayesian network with n vertices using p processors, the worst-case running time is shown to be O(n^2 w/p + w r^w n/p + n log p), where w is the clique width and r is the number of states of the random variables. Our algorithm is scalable over 1 ≤ p ≤ nw/log n. We have implemented our parallel algorithm using OpenMP and experimented with up to 128 processors. We consider three types of Bayesian networks: linear, balanced and random. While the state-of-the-art PNL library implementation does not scale, we achieve speedups of 31, 29 and 24 for the above graphs respectively on the DataStar cluster at the San Diego Supercomputing Center.
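Moralization, the first step in the conversion pipeline listed above, is simple enough to sketch serially: connect ("marry") every pair of co-parents, then drop edge directions. A minimal sketch, with my own representation of the network as a parent map:

```python
from itertools import combinations

def moralize(parents):
    """Moral graph of a Bayesian network.

    parents: dict node -> list of its parents in the DAG.
    Returns an undirected adjacency map: each arc parent->child becomes an
    undirected edge, and every pair of a node's parents is also connected.
    """
    adj = {v: set() for v in parents}

    def add_edge(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    for child, ps in parents.items():
        for p in ps:
            add_edge(p, child)           # undirected version of each arc
        for u, v in combinations(ps, 2):
            add_edge(u, v)               # "marry" co-parents
    return adj
```

The classic effect is visible on a v-structure A -> C <- B: moralization adds the edge A-B even though the DAG has none.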
Parallel Evidence Propagation on Multicore Processors
Abstract

Cited by 6 (5 self)
In this paper, we design and implement an efficient technique for parallel evidence propagation on state-of-the-art multicore processor systems. Evidence propagation is a major step in exact inference, a key problem in exploring probabilistic graphical models. We propose a rerooting algorithm to minimize the critical path in evidence propagation. The rerooted junction tree is used to construct a directed acyclic graph (DAG) where each node represents a computation task for evidence propagation. We develop a collaborative scheduler to dynamically allocate the tasks to the cores of the processors. In addition, we integrate a task partitioning module in the scheduler to partition large tasks so as to achieve load balance across the cores. We implemented the proposed method using Pthreads on both AMD and Intel quad-core processors. For a representative set of junction trees, our method achieved almost linear speedup and ran around twice as fast as the OpenMP-based implementation on both platforms.
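The scheduling idea in this abstract, dynamically allocating DAG tasks to cores, can be sketched as a sequential simulation of list scheduling with a critical-path priority. This is a generic textbook sketch under my own assumptions, not the paper's collaborative scheduler:

```python
def bottom_level(deps, cost):
    """Longest path to an exit task ("bottom level") for each task.
    deps: dict task -> list of predecessor tasks; cost: dict task -> time."""
    succ = {t: [] for t in deps}
    for t, ps in deps.items():
        for p in ps:
            succ[p].append(t)
    memo = {}

    def bl(t):
        if t not in memo:
            memo[t] = cost[t] + max((bl(s) for s in succ[t]), default=0)
        return memo[t]
    for t in deps:
        bl(t)
    return memo

def simulate_schedule(deps, cost, n_cores):
    """Simulate dynamic list scheduling: the ready task with the longest
    remaining critical path goes to the earliest-free core. Returns makespan."""
    prio = bottom_level(deps, cost)
    indeg = {t: len(ps) for t, ps in deps.items()}
    ready = [t for t in deps if indeg[t] == 0]
    core_free = [0.0] * n_cores
    finish = {}
    while ready:
        ready.sort(key=lambda u: -prio[u])
        t = ready.pop(0)
        c = min(range(n_cores), key=lambda i: core_free[i])
        start = max([core_free[c]] + [finish[p] for p in deps[t]])
        finish[t] = start + cost[t]
        core_free[c] = finish[t]
        for s, ps in deps.items():          # release successors of t
            if t in ps:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
    return max(finish.values())
```

The makespan is bounded below by the critical path, which is why the abstract's rerooting step, shortening that path, matters before any scheduler runs.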
Parallel Exact Inference on the Cell Broadband Engine Processor
, 2009
Abstract

Cited by 4 (1 self)
We present the design and implementation of a parallel exact inference algorithm on the Cell Broadband Engine (Cell BE) processor, a heterogeneous multicore architecture. Exact inference is a key problem in exploring probabilistic graphical models, where the computation complexity increases dramatically with the network structure and clique size. In this paper, we exploit parallelism in exact inference at multiple levels. We propose a rerooting method to minimize the critical path for exact inference, and an efficient scheduler to dynamically allocate SPEs. In addition, we explore potential table representation and layout to optimize DMA transfer between local store and main memory. We implemented the proposed method and conducted experiments on the Cell BE processor in the IBM QS20 Blade. We achieved speedups of up to 10× on the Cell, compared to state-of-the-art processors. The methodology proposed in this paper can be used for online scheduling of directed acyclic graph (DAG) structured computations.
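Rerooting to minimize the critical path has a simple unweighted analogue: pick the root that minimizes tree height, since root-to-leaf propagation cannot finish faster than the deepest path. A sketch under that simplifying assumption (uniform clique costs; the real method would weight nodes by their table sizes):

```python
def best_root(adj):
    """Choose the tree node whose rooting minimizes tree height, a proxy
    for the critical path of root-to-leaf evidence propagation.
    adj: dict node -> list of neighbors in the (undirected) tree."""
    def height(root):
        seen = {root}
        frontier, h = [root], 0
        while frontier:                 # BFS level by level
            nxt = [v for u in frontier for v in adj[u] if v not in seen]
            seen.update(nxt)
            if nxt:
                h += 1
            frontier = nxt
        return h
    return min(adj, key=height)
```

On a path, this picks the middle node, halving the depth compared to rooting at an endpoint.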
Parallelization of Inference in Bayesian Networks
, 1999
Abstract

Cited by 4 (1 self)
This report gives a survey of different approaches to the parallelization of probabilistic inference in Bayesian networks. Results from preliminary experiments are presented, and the conclusion is that the largest performance improvements are obtained with low-level parallelization of the potential operations performed during inference.
A probabilistic model for information and sensor validation
 The Computer Journal
, 2006
Abstract

Cited by 3 (0 self)
This paper develops a new theory and model for information and sensor validation. The model represents relationships between variables using Bayesian networks and utilizes probabilistic propagation to estimate the expected values of variables. If the estimated value of a variable differs from the actual value, an apparent fault is detected. The fault is only apparent since it may be that the estimated value is itself based on faulty data. The theory extends our understanding of when it is possible to isolate real faults from potential faults and supports the development of an algorithm that is capable of isolating real faults without deferring the problem to the use of expert-provided domain-specific rules. To enable practical adoption for real-time processes, an anytime version of the algorithm is developed that, unlike most other algorithms, is capable of returning improving assessments of the validity of the sensors as it accumulates more evidence over time. The developed model is tested by applying it to the validation of temperature sensors during the startup phase of a gas turbine when conditions are not stable; a problem that is known to be challenging. The paper concludes with a discussion of the practical applicability and scalability of the model.
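The apparent-fault test described in this abstract reduces, in its simplest form, to comparing each sensor's reading against a model prediction. A toy sketch (threshold test and names are my own; the paper's predictions come from Bayesian network propagation, not from a fixed table):

```python
def apparent_faults(observed, predicted, tol):
    """Flag sensors whose observed reading deviates from the model's
    prediction by more than tol. A flagged sensor is only an *apparent*
    fault: its prediction may itself rest on other faulty readings,
    which is why a separate isolation step is needed for real faults."""
    return {s for s in observed if abs(observed[s] - predicted[s]) > tol}
```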
Parallel Exact Inference on a CPU-GPGPU Heterogeneous System
Abstract

Cited by 3 (1 self)
Exact inference is a key problem in exploring probabilistic graphical models, where the computational complexity varies dramatically as the parameters of the graphical models change. Achieving scalability over hundreds of threads remains a fundamental challenge. In this paper, we design an efficient scheduler hosted by the CPU to allocate cliques in junction trees to the GPGPU at run time. The scheduler can merge multiple small cliques or split large cliques dynamically so as to maximize the utilization of the GPGPU resources. We propose a conflict-free potential table organization and an optimal data layout for coalescing memory access. In addition, we develop a double-buffering based asynchronous data transfer between the CPU and GPGPU to overlap the clique processing on the GPGPU with the data transfer and scheduling. Our implementation of the proposed method on GPGPU platforms achieved 30× speedup compared with state-of-the-art multicore processors, and it sustains 70% of the theoretical upper bound of the GPGPU throughput.
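The double-buffering overlap described above can be sketched generically with a bounded producer-consumer queue: while one buffer is being processed, the next transfer is already staged. This is an illustration of the pattern only (thread-based, in-process), not the paper's CPU/GPGPU DMA implementation:

```python
import threading
from queue import Queue

def pipeline(tasks, transfer, process, depth=2):
    """Overlap 'transfer' (staging data) with 'process' (consuming it).

    depth=2 gives classic double buffering: the producer may stage one
    buffer ahead while the consumer works on the previous one; the
    bounded queue blocks the producer from running further ahead.
    """
    q = Queue(maxsize=depth)

    def producer():
        for t in tasks:
            q.put(transfer(t))      # stage next buffer ("DMA transfer")
        q.put(None)                 # sentinel: no more work

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (buf := q.get()) is not None:
        results.append(process(buf))   # consume staged buffer ("kernel")
    return results
```

Results arrive in task order because the queue is FIFO, even though staging and processing run concurrently.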