Cache-optimal algorithms for option pricing
, 2008
Cited by 1 (1 self)
Today's computers have several levels of memory hierarchy. To obtain good performance on these processors, it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy. In this paper, we study the computation of option pricing using the binomial and trinomial models on processors with a multilevel memory hierarchy. We derive lower bounds on memory traffic between different levels of the hierarchy for these two models. We also develop algorithms for the binomial and trinomial models that have near-optimal memory traffic between levels. We have implemented these algorithms on an UltraSparc IIIi processor with a 4-level memory hierarchy and demonstrated that our algorithms outperform algorithms without cache blocking by a factor of up to 5 and operate at 70% of peak performance.
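For orientation, the binomial model named in this abstract can be sketched as a plain backward induction over a price lattice. This is only the unblocked baseline, not the paper's cache-blocked algorithm. The Cox-Ross-Rubinstein parameterization, the function name `binomial_option_price`, and all parameter names are assumptions made for illustration.

```python
import math


def binomial_option_price(spot, strike, rate, sigma, maturity, steps, is_call=True):
    """Price a European option on a recombining binomial lattice.

    Assumed (not from the abstract): Cox-Ross-Rubinstein up/down factors,
    continuous compounding, and no dividends.
    """
    dt = maturity / steps
    up = math.exp(sigma * math.sqrt(dt))
    down = 1.0 / up
    growth = math.exp(rate * dt)
    prob_up = (growth - down) / (up - down)  # risk-neutral probability
    discount = 1.0 / growth

    # Payoffs at expiry; node j has seen j up-moves and (steps - j) down-moves.
    values = []
    for j in range(steps + 1):
        price = spot * up ** j * down ** (steps - j)
        payoff = price - strike if is_call else strike - price
        values.append(max(payoff, 0.0))

    # Backward induction. Each sweep reads the whole previous level, so for a
    # large number of steps the array is streamed through cache repeatedly;
    # this is the memory traffic that cache blocking aims to reduce.
    for level in range(steps, 0, -1):
        for j in range(level):
            values[j] = discount * (prob_up * values[j + 1] + (1.0 - prob_up) * values[j])
    return values[0]


if __name__ == "__main__":
    print(binomial_option_price(100.0, 100.0, 0.05, 0.2, 1.0, 500))
```

The in-place update is safe because `values[j + 1]` is read before it is overwritten in the same sweep. The working set per sweep is the full level, which is why the unblocked version becomes memory-bound once a level no longer fits in cache.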
Low Depth Cache-Oblivious Algorithms
, 2009
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on a variety of parallel cache architectures. The approach is to design nested parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Our sorting algorithm yields the first cache-oblivious algorithms with polylogarithmic depth and low sequential cache complexities for list ranking, Euler tour tree labeling, tree contraction, least common ancestors, graph connectivity, and minimum spanning forest. Using known mappings, our results lead to low cache complexities on multi-core processors (and shared-memory multiprocessors) with a single level of private caches or a single shared cache. We generalize these mappings to a multi-level parallel tree-of-caches model that reflects current and future trends in multi-core cache hierarchies; these new mappings imply that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the
Efficient Scheduling for Parallel Memory Hierarchies (Regular Submission)
This paper presents a scheduling algorithm for efficiently implementing nested-parallel computations on parallel memory hierarchies (trees of caches). To capture the cache cost of nested-parallel computations, we introduce a parallel version of the ideal cache model. In the model, algorithms can be written cache obliviously (no choices are made based on machine parameters) and analyzed using a single level of cache with parameters $Z$ (cache size) and $L$ (cache line size), and a parameter $\alpha$ specifying the algorithm's parallelism (for input size $n$, $n^\alpha$ represents the number of processors that can be effectively used). For several fundamental algorithms we show that the cache cost in the parallel ideal cache model is optimal, matching the sequential bounds, with a parallelism $\alpha \to 1$. For example, for cache-oblivious sorting of $n$ keys, the cache cost is $Q^*(n; Z, L) = \Theta((n/L) \log_{Z+2} n)$. Our scheduler guarantees that the number of misses across all caches at each level $i$ of the machine's hierarchy is at most the cache cost $Q^*(n; Z_i/3, L_i)$ as analyzed for an algorithm. Machine hierarchies are modeled as trees of caches using a symmetric variant of the parallel memory hierarchy (PMH) model. In this model, every cache at level $i$ is of size $Z_i$, has line size $L_i$, transfer cost $C_i$ (the cost of fetching a line of data from its parent cache at level $i + 1$), and child fanout $f_i$. Each leaf node (level 0) is a processor, with parameters set so that its cost corresponds to the processor's work (i.e., its instruction count). Finally, we show that if the algorithm parallelism exceeds the machine parallelism (as defined in the paper), the work is balanced, including the cost of cache misses. In particular, for an $h$-level memory hierarchy, our scheduler guarantees a total runtime of $T(n) = O(\sum_{i=0}^{h-1} C_i \hat{Q}_\alpha(n; Z_i/3, L_i)$
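To make the quoted sorting bound concrete, the following sketch evaluates Q*(n; Z, L) = (n/L) log_{Z+2} n and the per-level miss bound Q*(n; Z_i/3, L_i) for a list of cache levels. It drops the constant hidden by the Θ notation, so the numbers show growth rates only, not predicted miss counts. The function names and the `(Z_i, L_i)` tuple layout are hypothetical, and the runtime formula is not modeled because it is truncated in the text above.

```python
import math


def sorting_cache_cost(n, cache_size, line_size):
    """Evaluate (n / L) * log_{Z+2}(n), ignoring the constant hidden by Theta."""
    return (n / line_size) * math.log(n, cache_size + 2)


def per_level_miss_bounds(n, levels):
    """Evaluate Q*(n; Z_i / 3, L_i) for each (Z_i, L_i) pair in `levels`.

    `levels` is a hypothetical description of the hierarchy: one
    (cache size, line size) tuple per level, in the same units as n.
    """
    return [sorting_cache_cost(n, z / 3, l) for (z, l) in levels]


if __name__ == "__main__":
    # Illustrative sizes only; they do not describe any real machine.
    hierarchy = [(2 ** 12, 8), (2 ** 18, 8), (2 ** 24, 16)]
    for (z, l), bound in zip(hierarchy, per_level_miss_bounds(2 ** 30, hierarchy)):
        print(f"Z={z:>10} L={l:>3} bound={bound:.3e}")
```

Larger caches shrink the logarithmic factor and longer lines shrink the n/L factor, so the bound falls at each higher level of the hierarchy.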
A Bridging Model for . . .
, 2010
Writing software for one parallel system is a feasible though arduous task. Reusing the substantial intellectual effort so expended for programming a second system has proved much more challenging. In sequential computing, algorithms textbooks and portable software are resources that enable software systems to be written that are efficiently portable across changing hardware platforms. These resources are currently lacking in the area of multi-core architectures, where a programmer seeking high performance has no comparable opportunity to build on the intellectual efforts of others. To address this problem, we propose a bridging model aimed at capturing the most basic resource parameters of multi-core architectures. We suggest that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully expended in designing portable algorithms, once and for all, for such a bridging model. Portable algorithms would contain efficient designs for all reasonable combinations of the basic resource parameters and input sizes, and would form the basis for implementation or compilation for particular machines. Our Multi-BSP model is a multi-level model that has explicit parameters for processor numbers, memory/cache sizes, communication costs, and synchronization costs. The lowest level corresponds to shared memory or the PRAM, acknowledging the relevance of that model for whatever limitations on memory and processor numbers it may be efficacious to emulate it. We propose parameter-aware portable algorithms that run efficiently on

