Data Parallel Haskell: a status report
, 2007
Cited by 56 (14 self)
We describe the design and current status of our effort to implement the programming model of nested data parallelism into the Glasgow Haskell Compiler. We extended the original programming model and its implementation, both of which were first popularised by the NESL language, in terms of expressiveness as well as efficiency. Our current aim is to provide a convenient programming environment for SMP parallelism, and especially multicore architectures. Preliminary benchmarks show that we are, at least for some programs, able to achieve good absolute performance and excellent speedups.
Functional array fusion
- In ICFP ’01: Proceedings of the sixth ACM SIGPLAN international conference on Functional programming
, 2001
Cited by 17 (5 self)
This paper introduces a new approach to optimising array algorithms in functional languages. We are specifically aiming at an efficient implementation of irregular array algorithms that are hard to implement in conventional array languages such as Fortran. We optimise the storage layout of arrays containing complex data structures and reduce the running time of functions operating on these arrays by means of equational program transformations. In particular, this paper discusses a novel form of combinator loop fusion, which by removing intermediate structures optimises the use of the memory hierarchy. We identify a combinator named loopP that provides a general scheme for iterating over an array and that in conjunction with an array constructor replicateP is sufficient to express a wide range of array algorithms. On this basis, we define equational transformation rules that combine traversals of loopP and replicateP as well as sequences of applications of loopP into a single loopP traversal. Our approach naturally generalises to a parallel implementation and includes facilities for optimising load balancing and communication. A prototype implementation based on the rewrite rule pragma of the Glasgow Haskell Compiler is significantly faster than standard Haskell arrays and approaches the speed of hand coded C for simple examples.
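The loop fusion idea in this abstract can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: `loop_p` is a hypothetical list-based analogue of a general loop combinator (a step function threads an accumulator and optionally emits an element), and `fuse` is one plausible way to combine two such loops into a single traversal. Neither is taken from the paper, whose combinators operate on Haskell parallel arrays.

```python
def loop_p(step, acc, xs):
    """Hypothetical loop combinator: step(x, acc) returns (new_acc, y).

    y is appended to the output unless it is None, so this sketch cannot
    emit None as a genuine element.
    """
    out = []
    for x in xs:
        acc, y = step(x, acc)
        if y is not None:
            out.append(y)
    return out, acc


def fuse(step1, step2):
    """Combine two step functions so that both loops run in one traversal."""
    def fused(x, accs):
        acc1, acc2 = accs
        acc1, y = step1(x, acc1)
        if y is None:
            # The first loop dropped x, so the second loop never sees it.
            return (acc1, acc2), None
        acc2, z = step2(y, acc2)
        return (acc1, acc2), z
    return fused


def inc(x, acc):
    # A map expressed as a loop: no accumulator, always emits x + 1.
    return acc, x + 1


def keep_even_sum(x, acc):
    # A filter with a running sum of the elements that are kept.
    return (acc + x, x) if x % 2 == 0 else (acc, None)


xs = [1, 2, 3, 4, 5]

# Two traversals with an intermediate list in between.
intermediate, _ = loop_p(inc, None, xs)
two_pass = loop_p(keep_even_sum, 0, intermediate)

# One traversal, no intermediate list.
fused_out, (_, fused_sum) = loop_p(fuse(inc, keep_even_sum), (None, 0), xs)
one_pass = (fused_out, fused_sum)
```

Both versions compute the same result; the fused version simply never materialises `intermediate`, which is the source of the memory-hierarchy benefit the abstract describes.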
Optimizing overall loop schedules using prefetching and partitioning
- IEEE Transactions on Parallel and Distributed Systems
, 2000
Cited by 3 (3 self)
In this paper, a method combining the loop pipelining technique with data prefetching, called Partition Scheduling with Prefetching (PSP), is proposed. In PSP, the iteration space is first divided into regular partitions. Then a two-part schedule, consisting of the ALU and memory parts, is produced and balanced to produce high throughput. These two parts are executed simultaneously, and hence the remote memory latencies are overlapped. We study the optimal partition shape and size so that a well balanced overall schedule can be obtained. Experiments on DSP benchmarks show that the proposed methodology consistently produces optimal or near optimal solutions.
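Why balancing the two parts of the schedule matters can be seen in a toy cost model. The sketch below is an assumption-laden illustration, not the PSP algorithm: each partition is a hypothetical `(mem, alu)` pair of time units, the prefetch for the next partition is assumed to run fully in parallel with the computation on the current one, and the first prefetch cannot be hidden.

```python
def sequential_time(partitions):
    """Total time when each partition is fetched, then computed, in turn."""
    return sum(mem + alu for mem, alu in partitions)


def overlapped_time(partitions):
    """Total time when the next partition's fetch overlaps the current compute.

    Assumed model: the first fetch is exposed; afterwards each stage lasts
    as long as the slower of (current compute, next fetch).
    """
    if not partitions:
        return 0
    total = partitions[0][0]
    for i, (_, alu) in enumerate(partitions):
        next_mem = partitions[i + 1][0] if i + 1 < len(partitions) else 0
        total += max(alu, next_mem)
    return total


# Same total work per partition (8 units), split differently.
balanced = [(4, 4)] * 3    # memory and ALU parts take equal time
unbalanced = [(6, 2)] * 3  # memory part dominates

seq_balanced = sequential_time(balanced)
seq_unbalanced = sequential_time(unbalanced)
ovl_balanced = overlapped_time(balanced)
ovl_unbalanced = overlapped_time(unbalanced)
```

Under this model the balanced split finishes in 16 units and the unbalanced split in 20, against 24 without overlap, which is why the choice of partition shape and size affects how much latency can be hidden.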
Minimizing Average Schedule Length under Memory Constraints by Optimal Partitioning and Prefetching
- Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology
, 2000
Cited by 2 (2 self)
Over the last 20 years, the performance gap between CPU and memory has been steadily increasing. As a result, a variety of techniques has been devised to hide that performance gap, from intermediate fast memories (caches) to various prefetching and memory management techniques for manipulating the data present in these caches. In this paper we propose a new memory management technique that takes advantage of access pattern information that is available at compile time by prefetching certain data elements before explicitly being requested by the CPU, as well as maintaining certain data in the local memory over a number of iterations. In order to better take advantage of the locality of reference present in loop structures, our technique also uses a new approach to memory by partitioning it and reducing execution to each partition, so that information is reused at much smaller time intervals than if execution followed the usual pattern. These combined approaches - using a new set of memory instructions as well as partitioning the memory - lead to improvements in total execution time of approximately 25% over existing methods.
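The claim that partitioned execution reuses information at much smaller time intervals can be made concrete with a small experiment. This sketch is an assumption for illustration only: it uses a hypothetical two-level loop nest in which iteration `(i, j)` touches data item `j`, and it ignores data dependences, which a real scheduler would have to respect before reordering iterations.

```python
def row_order(n, m):
    """Usual execution pattern: sweep all j for each i."""
    return [(i, j) for i in range(n) for j in range(m)]


def partitioned_order(n, m, width):
    """Execute one vertical partition of `width` columns at a time."""
    order = []
    for j0 in range(0, m, width):
        for i in range(n):
            for j in range(j0, min(j0 + width, m)):
                order.append((i, j))
    return order


def max_reuse_interval(order):
    """Largest number of steps between two consecutive uses of the same item j."""
    last_seen = {}
    worst = 0
    for t, (_, j) in enumerate(order):
        if j in last_seen:
            worst = max(worst, t - last_seen[j])
        last_seen[j] = t
    return worst


usual = row_order(4, 8)
tiled = partitioned_order(4, 8, 2)

usual_interval = max_reuse_interval(usual)
tiled_interval = max_reuse_interval(tiled)
```

Both orders execute the same 32 iterations, but the reuse interval drops from 8 steps to 2, so a local memory holding only two items would already capture every reuse in the partitioned order.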
Loop Scheduling Optimization with Data Prefetching based on Multi-dimensional Retiming
- In Proc. ICSA 11th Intl. Conference on Parallel and Distributed Computing Systems
, 1998
Cited by 1 (1 self)
In this paper, we propose a novel loop scheduling technique based on multi-dimensional retiming in a balanced fashion, which considers the computation schedule and memory access schedule simultaneously. Experiments show that the proposed technique, which combines data prefetching and retiming, is successful in hiding memory latency and improving the overall performance. Compared with the traditional list scheduling algorithm, the average improvement is 28%.
1 Introduction
In the past few years, the processing speed of processors has increased more rapidly than the speed of memory access. In a system with a local and a remote memory, the program execution time becomes significantly dependent on the remote-memory access latency. Data prefetching, i.e., retrieving data from the remote memory and storing it in the local memory before using it, is an attractive approach to hide memory access latencies. A prefetching scheme reduces the memory miss penalty by overlapping the processor comp...
Optimal Loop Scheduling for Hiding Memory Latency Based on Two Level Partitioning and Prefetching.
- IEEE Transactions on Signal Processing
, 2000
Cited by 1 (0 self)
The large latency of memory accesses in modern computers is a key obstacle in achieving high processor utilization. As a result, a variety of techniques have been devised to hide this latency. These techniques range from cache hierarchies to various prefetching and memory management techniques for manipulating the data present in the caches. In DSP applications, the existence of large numbers of uniform nested loops makes the issue of loop scheduling very important. In this paper, we propose a new memory management technique that can be applied to computer architectures with three levels of memory, the scheme generally adopted in contemporary computer architecture. This technique takes advantage of access pattern information that is available at compile time by prefetching certain data elements from the higher level memory before they are explicitly requested by the lower level memory or CPU. It also maintains certain data for a period of time to prevent unnecessary data swapping. In order to take better advantage of the locality of references present in these loop structures, our technique introduces a new approach to memory management by partitioning it and reducing execution to each partition, so that data locality is much improved compared with the usual pattern. These combined approaches - using a new set of memory instructions as well as partitioning the memory - lead to improvements in average execution times of approximately 35% over the one-level partition algorithm and more than 80% over list scheduling and hardware prefetching.
Software Methods to Increase Data Cache Performance
Cache performance is critical to the overall performance of modern CPUs. In most processors, cycle time is almost entirely determined by the cache hit time. This places practical limits on the complexity of cache

