Results 1  10
of
10
Low Depth CacheOblivious Algorithms
, 2009
"... In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on a variety of parallel cache architectures. The approach is to design nested parallel algorithms that have low depth (span, critical path length) and for which the natural s ..."
Abstract

Cited by 13 (1 self)
 Add to MetaCart
(Show Context)
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on a variety of parallel cache architectures. The approach is to design nested parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cacheoblivious model. We describe several cacheoblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparsematrix vector multiply on matrices with good vertex separators. Our sorting algorithm yields the first cacheoblivious algorithms with polylogarithmic depth and low sequential cache complexities for list ranking, Euler tour tree labeling, tree contraction, least common ancestors, graph connectivity, and minimum spanning forest. Using known mappings, our results lead to low cache complexities on multicore processors (and sharedmemory multiprocessors) with a single level of private caches or a single shared cache. We generalize these mappings to a multilevel parallel treeofcaches model that reflects current and future trends in multicore cache hierarchies—these new mappings imply that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the
Cacheoptimal algorithms for option pricing
, 2008
"... Today computers have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy. In this paper, we study the computation of option pricing using the binomial and trinomial model ..."
Abstract

Cited by 4 (3 self)
 Add to MetaCart
Today computers have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy. In this paper, we study the computation of option pricing using the binomial and trinomial models on processors with a multilevel memory hierarchy. We derive lower bounds on memory traffic between different levels of hierarchy for these two models. We also develop algorithms for the binomial and trinomial models that have nearoptimal memory traffic between levels. We have implemented these algorithms on an UltraSparc IIIi processor with a 4level of memory hierarchy and demonstrated that our algorithms outperform algorithms without cache blocking by a factor of up to 5 and operate at 70 % of peak performance.
Firewall modules and modular firewalls
 in Proc. 18th IEEE Int. Conf. Network Protocols (ICNP
, 2010
"... Abstract—A firewall is a packet filter placed at an entry point of a network in the Internet. Each packet that goes through this entry point is checked by the firewall to determine whether to accept or discard the packet. The firewall makes this determination based on a specified sequence of overlap ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
(Show Context)
Abstract—A firewall is a packet filter placed at an entry point of a network in the Internet. Each packet that goes through this entry point is checked by the firewall to determine whether to accept or discard the packet. The firewall makes this determination based on a specified sequence of overlapping rules. The firewall uses the firstmatch criterion to determine which rule in the sequence should be applied to which packet. Thus, to compute the set of packets to which a rule is applied, the firewall designer needs to consider all the rules that precede this rule in the sequence. This “rule dependency ” complicates the task of designing firewalls (especially those with thousands of rules), and makes firewalls hard to understand. In this paper, we present a metric, called the dependency metric, for measuring the complexity of firewalls. This metric, though accurate, does not seem to suggest ways to design firewalls whose dependency metrics are small. Thus, we present another metric, called the inversion metric, and develop methods for designing firewalls with small inversion metrics. We show that the dependency metric and the inversion metric are correlated for some classes of firewalls. So by aiming to design firewalls with small inversion metrics, the designer may end up with firewalls whose dependency metrics are small as well. We present a method for designing modular firewalls whose inversion metrics are very small. Each modular firewall consists of several components, called firewall modules. The inversion metric of each firewall module is very small in fact, 1 or 2. Thus, we conclude that modular firewalls are easy to design and easy to understand. I.
Efficient Scheduling for Parallel Memory Hierarchies (Regular Submission)
"... This paper presents a scheduling algorithm for efficiently implementing nestedparallel computations on parallel memory hierarchies (trees of caches). To capture the cache cost of nestedparallel computations we introduce a parallel version of the ideal cache model. In the model algorithms can be wr ..."
Abstract
 Add to MetaCart
(Show Context)
This paper presents a scheduling algorithm for efficiently implementing nestedparallel computations on parallel memory hierarchies (trees of caches). To capture the cache cost of nestedparallel computations we introduce a parallel version of the ideal cache model. In the model algorithms can be written cache obliviously (no choices are made based on machine parameters) and analyzed using a single level of cache with parameters Z (cache size) and L (cache line size), and a parameter α specifying the algorithm’s parallelism (for input size n, n α represents the number of processors that can be effectively used). For several fundamental algorithms we show that the cache cost in the parallel ideal cache model is optimal, matching the sequential bounds, with a parallelism α → 1. For example, for cacheoblivious sorting of n keys, the cache cost is Q ∗ (n; Z, L) = Θ((n/L)log Z+2 n). Our scheduler guarantees that the number of misses across all caches at each level i of the machine’s hierarchy is at most the cache cost Q ∗ (n; Zi/3, Li) as analyzed for an algorithm. Machine hierarchies are modeled as trees of caches using a symmetric variant of the parallel memory hierarchy (PMH) model. In this model, every cache at level i is of size Zi, has line size Li, transfer cost Ci (the cost of fetching a line of data from its parent cache at level i + 1), and child fanout fi. Each leaf node (level 0) is a processor, with parameters set so that its cost corresponds to the processor’s work (i.e., its instruction count). Finally, we show that if the algorithm parallelism exceeds the machine parallelism (as defined in the paper) the work is balanced including the cost of cache misses. In particular for an hlevel memory hierarchy, our scheduler guarantees a total runtime of T(n) = O ( ∑h−1 i=0 Ci ̂ Qα(n; Zi/3, Li)
A Bridging Model for . . .
, 2010
"... Writing software for one parallel system is a feasible though arduous task. Reusing the substantial intellectual effort so expended for programming a second system has proved much more challenging. In sequential computing algorithms textbooks and portable software are resources that enable software ..."
Abstract
 Add to MetaCart
Writing software for one parallel system is a feasible though arduous task. Reusing the substantial intellectual effort so expended for programming a second system has proved much more challenging. In sequential computing algorithms textbooks and portable software are resources that enable software systems to be written that are efficiently portable across changing hardware platforms. These resources are currently lacking in the area of multicore architectures, where a programmer seeking high performance has no comparable opportunity to build on the intellectual efforts of others. In order to address this problem we propose a bridging model aimed at capturing the most basic resource parameters of multicore architectures. We suggest that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully expended in designing portable algorithms, once and for all, for such a bridging model. Portable algorithms would contain efficient designs for all reasonable combinations of the basic resource parameters and input sizes, and would form the basis for implementation or compilation for particular machines. Our MultiBSP model is a multilevel model that has explicit parameters for processor numbers, memory/cache sizes, communication costs, and synchronization costs. The lowest level corresponds to shared memory or the PRAM, acknowledging the relevance of that model for whatever limitations on memory and processor numbers it may be efficacious to emulate it. We propose parameteraware portable algorithms that run efficiently on
A Bridging Model for MultiCore Computing
"... Writing software for one parallel system is a feasible though arduous task. Reusing the substantial intellectual effort so expended for programming a second system has proved much more challenging. In sequential computing algorithms textbooks and portable software are resources that enable software ..."
Abstract
 Add to MetaCart
(Show Context)
Writing software for one parallel system is a feasible though arduous task. Reusing the substantial intellectual effort so expended for programming a second system has proved much more challenging. In sequential computing algorithms textbooks and portable software are resources that enable software systems to be written that are efficiently portable across changing hardware platforms. These resources are currently lacking in the area of multicore architectures, where a programmer seeking high performance has no comparable opportunity to build on the intellectual efforts of others. In order to address this problem we propose a bridging model aimed at capturing the most basic resource parameters of multicore architectures. We suggest that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully expended in designing portable algorithms, once and for all, for such a bridging model. Portable algorithms would contain efficient designs for all reasonable combinations of the basic resource parameters and input sizes, and would form the basis for implementation or compilation for particular machines. Our MultiBSP model is a multilevel model that has explicit parameters for processor numbers, memory/cache sizes, communication costs, and synchronization costs. The lowest level corresponds to shared memory or the PRAM, acknowledging the relevance of that model for whatever limitations on memory and processor numbers it may be efficacious to emulate it. We propose parameteraware portable algorithms that run efficiently on
permission. Single Program, Multiple Data Programming for Hierarchical Computations
"... All rights reserved. ..."
(Show Context)
Modeling SharedMemory Multiprocessor Systems with AADL
"... Abstract. Multiprocessor chips are now commonly supplied by IC manufacturers. Realtime applications based on this kind of execution platforms are difficult to develop for many reasons: complexity, design space, unpredictability,... The allocation of the multiple hardware resources to the software ..."
Abstract
 Add to MetaCart
(Show Context)
Abstract. Multiprocessor chips are now commonly supplied by IC manufacturers. Realtime applications based on this kind of execution platforms are difficult to develop for many reasons: complexity, design space, unpredictability,... The allocation of the multiple hardware resources to the software entities could have a great impact on the final result. Then, if the interest to represent the set of hardware computing resources inside an architectural model is obvious, we argue that other hardware elements like shared buses or memory hierarchies must be included in the models if analyze of the timing behavior is expected to be performed. This article gives guidelines to represent, with AADL, sharedmemory multiprocessing systems and their memory hierarchies while keeping a highlevel model of the hardware architecture.
Published In Low Depth CacheOblivious Algorithms
, 2009
"... In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on a variety of parallel cache architectures. The approach is to design nested parallel algorithms that have low depth (span, critical path length) and for which the natural s ..."
Abstract
 Add to MetaCart
(Show Context)
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on a variety of parallel cache architectures. The approach is to design nested parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cacheoblivious model. We describe several cacheoblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparsematrix vector multiply on matrices with good vertex separators. Our sorting algorithm yields the first cacheoblivious algorithms with polylogarithmic depth and low sequential cache complexities for list ranking, Euler tour tree labeling, tree contraction, least common ancestors, graph connectivity, and minimum spanning forest. Using known mappings, our results lead to low cache complexities on multicore processors (and sharedmemory multiprocessors) with a single level of private caches or a single shared cache. We generalize these