Results 1 - 5 of 5
Speeding up External Mergesort
IEEE Transactions on Knowledge and Data Engineering
Abstract

Cited by 20 (0 self)
External mergesort is normally implemented so that each run is stored contiguously on disk and blocks of data are read exactly in the order they are needed during merging. We investigate two ideas for improving the performance of external mergesort: interleaved layout and a new reading strategy. Interleaved layout places blocks from different runs in consecutive disk addresses. This is done in the hope that interleaving will reduce seek overhead during merging. The new reading strategy precomputes the order in which data blocks are to be read according to where they are located on disk and when they are needed for merging. Extra buffer space makes it possible to read blocks in an order that reduces seek overhead, instead of reading them exactly in the order they are needed for merging. A detailed simulation model was used to compare the two layout strategies and three reading strategies. The effects of using multiple work disks were also investigated. We found that, in most cases, inte...
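The precomputed reading strategy described above can be sketched in a toy model. In the Python sketch below, the single-disk seek model, the block addresses, and the greedy nearest-block-within-a-window heuristic are illustrative assumptions, not the paper's exact algorithm: extra buffer space is modeled as a lookahead window over the blocks next needed by the merge, within which the block closest to the current head position is fetched first.

```python
def plan_reads(blocks, window):
    """blocks[i] is the disk address of the i-th block in merge-need order.
    Returns a read order that may deviate from merge order within a
    lookahead 'window' (the extra buffers), greedily minimizing head travel.
    The next-needed block is always inside the window, so it is read soon
    enough to be available when the merge consumes it."""
    pending = list(range(len(blocks)))  # unread blocks, in need order
    head, order = 0, []
    while pending:
        candidates = pending[:window]
        best = min(candidates, key=lambda i: abs(blocks[i] - head))
        pending.remove(best)
        order.append(best)
        head = blocks[best]
    return order

def seek_cost(read_order, blocks):
    """Total head movement for reading blocks in the given order."""
    head, cost = 0, 0
    for i in read_order:
        cost += abs(blocks[i] - head)
        head = blocks[i]
    return cost
```

With a window of one block (no extra buffers) the plan degenerates to exact merge order; a larger window lets blocks that are adjacent on disk be fetched together, reducing total seek distance.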
MCSTL: The Multi-Core Standard Template Library
Abstract

Cited by 14 (4 self)
Future gains in computing performance will not stem from increased clock rates, but from even more cores in a processor. Since automatic parallelization is still limited to easily parallelizable sections of code, most applications will soon have to support parallelism explicitly. The Multi-Core Standard Template Library (MCSTL) simplifies parallelization by providing efficient parallel implementations of the algorithms in the C++ Standard Template Library. Thus, simple recompilation will provide partial parallelization of applications that make consistent use of the STL. We present performance measurements on several architectures. For example, our sorter achieves a speedup of 21 on an 8-core, 32-thread Sun T1.
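The split-sort-merge structure behind such a parallel sorter can be sketched as follows. This is a Python illustration of the general structure only, not MCSTL's API; MCSTL itself parallelizes native C++ code with OpenMP threads, where the per-chunk sorts genuinely run on separate cores, whereas the thread pool here merely keeps the sketch portable and runnable.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_sort(data, workers=4):
    """Split the input into roughly equal chunks, sort the chunks
    concurrently, then combine the sorted runs with a k-way merge."""
    if not data:
        return []
    step = -(-len(data) // workers)  # ceiling division
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(sorted, chunks))
    return list(heapq.merge(*runs))
```

`parallel_sort(xs)` returns the same result as `sorted(xs)`; in a native implementation the per-chunk sorts dominate the running time and scale with the number of cores.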
Towards Optimal Range Medians
Abstract

Cited by 4 (0 self)
We consider the following problem: given an unsorted array of n elements, and a sequence of intervals in the array, compute the median in each of the subarrays defined by the intervals. We describe a simple algorithm which needs O(n log k + k log n) time to answer k such median queries. This improves previous algorithms by a logarithmic factor and matches a comparison lower bound for k = O(n). The space complexity of our simple algorithm is O(n log n) in the pointer-machine model, and O(n) in the RAM model. In the latter model, a more involved O(n)-space data structure can be constructed in O(n log n) time, where the time per query is reduced to O(log n / log log n). We also give efficient dynamic variants of both data structures, achieving O(log^2 n) query time using O(n log n) space in the comparison model and O((log n / log log n)^2) query time using O(n log n / log log n) space in the RAM model, and show that in the cell-probe model, any data structure which supports updates in O(log^{O(1)} n) time must have Ω(log n / log log n) query time.
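The problem itself is easy to state in code. The Python sketch below is only a naive baseline that answers each query independently by sorting its subarray, at O(m log m) per query of length m; it defines the task but does not implement the paper's shared-work O(n log k + k log n) algorithm.

```python
def range_medians(a, queries):
    """Answer median queries over inclusive index intervals [lo, hi].
    Returns the lower median of each subarray a[lo..hi]."""
    out = []
    for lo, hi in queries:
        sub = sorted(a[lo:hi + 1])
        out.append(sub[(len(sub) - 1) // 2])  # lower median
    return out
```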
Efficient Chip Multi-Processor Programming - Programming a Multi-Core Processor
, 2011
Abstract
In this work a realistic machine model, the CMP model, is investigated. This model captures the cache hierarchies on mainstream CMPs as well as the ways these caches interact. A parallel programming library for benchmarking is presented. The library introduces policy-based scheduling, allowing fair evaluation of scheduling algorithms. The parallel-depth-first scheduler theoretically performs better than the widely used work-stealing scheduler in the CMP cache model. Both schedulers are implemented in the benchmarking library. Efficient parallelizations of the well-known sequential sorting algorithms quicksort and multiway mergesort are analysed based on the CMP cache model. The parallel quicksort algorithm is based on a parallelization of the in-place sequential partitioning algorithm. The parallel multiway mergesort is based on an f-way partitioning algorithm. The presented library is used to evaluate the two analysed parallel sorting algorithms using both of the implemented schedulers. We find that efficient parallel ...
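The f-way partitioning idea mentioned above can be illustrated with a small sketch. In the Python code below, the function name is illustrative, and for brevity the splitter keys are taken from a pre-sorted concatenation of the runs; a real implementation would instead find them with multisequence selection, so that each part of the output can be merged independently on its own core.

```python
import bisect
import heapq

def fway_partition_merge(runs, parts=2):
    """Cut every sorted run at common splitter keys via binary search,
    then merge the corresponding slices of all runs independently.
    Each of the 'parts' slice-merges touches a disjoint value range,
    so they could be dispatched to separate workers."""
    allv = sorted(v for r in runs for v in r)
    # splitters dividing the merged output into 'parts' equal pieces
    splitters = [allv[len(allv) * i // parts] for i in range(1, parts)]
    cuts = [[0] + [bisect.bisect_left(r, s) for s in splitters] + [len(r)]
            for r in runs]
    out = []
    for p in range(parts):
        slices = [r[c[p]:c[p + 1]] for r, c in zip(runs, cuts)]
        out.extend(heapq.merge(*slices))
    return out
```

Because every element in part p is strictly below the p-th splitter and every element in part p+1 is at or above it, concatenating the independently merged parts yields a globally sorted sequence.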
Synchronization, Coherence, and Consistency for High Performance Shared-Memory Multiprocessing
, 1992
Abstract
Although improved device technology has increased the performance of computer systems, fundamental hardware limitations and the need to build faster systems using existing technology have led many computer system designers to consider parallel designs with multiple computing elements. Unfortunately, the design of efficient and scalable multiprocessors has proven to be an elusive goal. This dissertation describes a hierarchical bus-based multiprocessor architecture, an adaptive cache coherence protocol, and efficient and simple synchronization support that together meet this challenge. We have also developed an execution-driven tool for the simulation of shared-memory multiprocessors, which we use to evaluate the proposed architectural enhancements. Our simulator offers substantial advantages in terms of reduced time and space overheads when compared to instruction-driven or trace-driven simulation techniques, without significant loss of accuracy. The simulator generates correctly inter...