Results 1 - 10 of 22
Copy Or Discard Execution Model For Speculative Parallelization On Multicores
Abstract - Cited by 55 (9 self)
The advent of multicores presents a promising opportunity for speeding up sequential programs via profile-based speculative parallelization of these programs. In this paper we present a novel solution for efficiently supporting software speculation on multicore processors. We propose the Copy or Discard (CorD) execution model in which the state of speculative parallel threads is maintained separately from the nonspeculative computation state. If speculation is successful, the results of the speculative computation are committed by copying them into the non-speculative state. If misspeculation is detected, no costly state recovery mechanisms are needed as the speculative state can be simply discarded. Optimizations are proposed to reduce the cost of data copying between nonspeculative and speculative state. A lightweight mechanism that maintains version numbers for non-speculative data values enables misspeculation detection. We also present an algorithm for profile-based speculative parallelization that is effective in extracting parallelism from sequential programs. Our experiments show that the combination of CorD and our speculative parallelization algorithm achieves speedups ranging from 3.7 to 7.8 on a Dell PowerEdge 1900 server with two Intel Xeon quad-core processors.
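The copy-or-discard commit path described above can be sketched as a toy version-number check: speculative work reads through to the non-speculative state, computes privately, and commits only if every value it read is still at the version it saw. The class and names below are illustrative stand-ins, not the paper's implementation.

```python
class NonSpecState:
    """Toy model of CorD state separation: non-speculative values carry
    version numbers, and speculative results are committed by copying."""
    def __init__(self, values):
        self.values = dict(values)                # variable -> value
        self.version = {k: 0 for k in values}     # bumped on each committed write

    def read(self, key):
        return self.values[key], self.version[key]

    def commit(self, writes, read_versions):
        # Misspeculation check: every speculatively read value must be unchanged.
        for key, ver in read_versions.items():
            if self.version[key] != ver:
                return False                      # speculative state is discarded
        for key, val in writes.items():           # copy results into non-spec state
            self.values[key] = val
            self.version[key] += 1
        return True

state = NonSpecState({"x": 1, "y": 2})

# A speculative thread reads x, computes privately, then commits successfully.
val, ver = state.read("x")
assert state.commit({"y": val + 10}, {"x": ver})
assert state.values["y"] == 11

# A speculation whose input changed underneath it is simply discarded.
val, ver = state.read("x")
state.commit({"x": 99}, {})                       # another thread commits to x first
assert state.commit({"y": val}, {"x": ver}) is False
```

No state recovery is needed on the failure path: the discarded `writes` dictionary was the entire speculative state.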
Thread Tailor: Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications
Abstract - Cited by 16 (2 self)
Extracting performance from modern parallel architectures requires that applications be divided into many different threads of execution. Unfortunately, selecting the appropriate number of threads for an application is a daunting task. Having too many threads can quickly saturate shared resources, such as cache capacity or memory bandwidth, thus degrading performance. On the other hand, having too few threads makes inefficient use of the available resources. Beyond static resource assignment, the program inputs and dynamic system state (e.g., what other applications are executing in the system) can have a significant impact on the right number of threads to use for a particular application. To address this problem we present Thread Tailor, a dynamic system that automatically adjusts the number of threads in an application to optimize system efficiency. Thread Tailor leverages offline analysis to estimate what types of threads will exist at runtime and the communication patterns between them. Using this information, Thread Tailor dynamically combines threads to better suit the needs of the target system. Thread Tailor adjusts not only to the architecture but also to the other applications in the system, and this paper demonstrates that this type of adjustment can lead to significantly better use of thread-level parallelism on real-world architectures.
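The core idea of weaving many logical threads onto fewer OS threads can be sketched as follows; this is a minimal stand-in that ignores the communication-pattern analysis the real system performs, and all names are invented for illustration.

```python
import os
import threading

def weave(tasks, num_workers=None):
    """Run many logical 'threads' (callables) on a smaller pool of OS threads.
    A toy sketch of combining threads; the real system also weighs
    communication patterns between threads, which this version ignores."""
    if num_workers is None:
        num_workers = min(len(tasks), os.cpu_count() or 1)
    results = [None] * len(tasks)

    def worker(indices):
        for i in indices:                 # several logical threads share one worker
            results[i] = tasks[i]()

    # Stride-partition the tasks across the chosen number of workers.
    groups = [list(range(w, len(tasks), num_workers)) for w in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(g,)) for g in groups]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# 16 logical threads woven onto at most 4 OS threads.
out = weave([(lambda i=i: i * i) for i in range(16)], num_workers=4)
assert out == [i * i for i in range(16)]
```

Choosing `num_workers` from `os.cpu_count()` is the naive static policy; the point of the paper is that the right value also depends on inputs and co-running applications.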
Multi-Execution: Multicore Caching for Data-Similar Executions
Abstract - Cited by 14 (3 self)
While microprocessor designers turn to multicore architectures to sustain performance expectations, the dramatic increase in parallelism of such architectures will put substantial demands on off-chip bandwidth and make the memory wall more significant than ever. This paper demonstrates that one profitable application of multicore processors is the execution of many similar instantiations of the same program. We identify that this model of execution is used in several practical scenarios and term it “multi-execution.” Often, each such instance utilizes very similar data. In conventional cache hierarchies, each instance would cache its own data independently. We propose the Mergeable cache architecture that detects data similarities and merges cache blocks, resulting in substantial savings in cache storage requirements. This leads to reductions in off-chip memory accesses and overall power usage, and increases in application performance. We present cycle-accurate simulation results of 8 benchmarks (6 from SPEC2000) to demonstrate that our technique provides a scalable solution and leads to significant speedups due to reductions in main memory accesses. For 8 cores running 8 similar executions of the same application and sharing an exclusive 4-MB, 8-way L2 cache, the Mergeable cache speeds up execution by 2.5× on average (ranging from 0.93× to 6.92×), while posing an overhead of only 4.28% on cache area and 5.21% on power.
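The block-merging idea can be sketched in software as content-addressed storage: each instance keeps its own tag, but identical block contents are stored once. This toy model (all names invented) only shows the bookkeeping, not the hardware similarity detection.

```python
class MergeableCache:
    """Toy model of merging identical cache blocks across program instances:
    per-instance tags point into a shared, content-deduplicated block store."""
    def __init__(self):
        self.store = {}      # block content -> shared block id
        self.tags = {}       # (instance, address) -> shared block id

    def fill(self, instance, address, content):
        # Identical content maps to the same shared block; new content gets a new id.
        block = self.store.setdefault(content, len(self.store))
        self.tags[(instance, address)] = block

    def unique_blocks(self):
        return len(self.store)

cache = MergeableCache()
# 8 instances of the same program load the same read-only table.
for inst in range(8):
    cache.fill(inst, 0x1000, b"shared-lookup-table")
assert cache.unique_blocks() == 1      # stored once instead of 8 times
assert len(cache.tags) == 8            # but every instance still has its own tag
```

The storage saving is what translates into fewer off-chip accesses in the paper's setting: a block merged 8 ways occupies one L2 slot instead of eight.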
A Helper Thread Based EDP Reduction Scheme for Adapting Application Execution in CMPs
Abstract - Cited by 9 (0 self)
In parallel to the changes in both the architecture domain – the move toward chip multiprocessors (CMPs) – and the application domain – the move toward increasingly data-intensive workloads – issues such as performance, energy efficiency and CPU availability are becoming increasingly critical. CPU availability can change dynamically for several reasons, such as thermal overload, an increase in transient errors, or operating system scheduling. An important question in this context is how to adapt, in a CMP, the execution of a given application to CPU availability changes at runtime. Our paper studies this problem, targeting the energy-delay product (EDP) as the main metric to optimize. We first discuss that, in adapting the application execution to the varying CPU availability, one needs to consider the number of CPUs to use, the number of application threads to accommodate and the voltage/frequency levels to employ (if the CMP has this capability). We then propose to use helper threads to adapt the application execution to CPU availability changes, with the goal of minimizing the EDP. The helper thread runs in parallel with the application execution threads and tries to determine the ideal number of CPUs, threads and voltage/frequency levels to employ at any given point in execution. We illustrate this idea using two applications (Fast Fourier Transform and MultiGrid) under different execution scenarios. The results collected through our experiments are very promising and indicate that significant EDP reductions are possible using helper threads. For example, we achieved up to 66.3% and 83.3% savings in EDP when adjusting all the parameters properly in the FFT and MG applications, respectively.
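The helper thread's decision step can be sketched as a search over candidate configurations for the one minimizing estimated EDP, restricted to the CPUs currently available. The profile numbers and function name below are made up for illustration; the real system measures these online.

```python
def pick_config(available_cpus, profile):
    """Choose the (threads, frequency) pair minimizing estimated
    energy-delay product, given how many CPUs are currently available.
    `profile` maps (threads, freq) -> (energy, delay); values are invented."""
    candidates = [(t, f) for (t, f) in profile if t <= available_cpus]
    return min(candidates, key=lambda c: profile[c][0] * profile[c][1])

profile = {
    (2, 1.0): (10.0, 8.0),   # (energy, delay) -> EDP 80
    (4, 1.0): (12.0, 5.0),   # EDP 60
    (4, 2.0): (22.0, 3.0),   # EDP 66: faster, but too much energy
    (8, 2.0): (36.0, 2.0),   # EDP 72: fastest, worst EDP
}

# With all 8 CPUs available, the EDP-optimal point uses only 4 of them.
assert pick_config(8, profile) == (4, 1.0)
# When availability drops to 2 CPUs, the feasible set shrinks accordingly.
assert pick_config(2, profile) == (2, 1.0)
```

This illustrates the paper's central observation: the fastest configuration and the EDP-optimal one generally differ, so the helper thread must track both delay and energy.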
Speculative parallelization of sequential loops on multicores
- International Journal of Parallel Programming
Abstract - Cited by 5 (3 self)
The advent of multicores presents a promising opportunity for speeding up the execution of sequential programs through their parallelization. In this paper we present a novel solution for efficiently supporting software-based speculative parallelization of sequential loops on multicore processors. The execution model we employ is based upon state separation, an approach for separately maintaining the speculative state of parallel threads and the non-speculative state of the computation. If speculation is successful, the results produced by parallel threads in speculative state are committed by copying them into the computation’s non-speculative state. If misspeculation is detected, no costly state recovery mechanisms are needed as the speculative state can be simply discarded. Techniques are proposed to reduce the cost of data copying between non-speculative and speculative state and to efficiently carry out misspeculation detection. We apply the above approach to speculative parallelization of loops in several sequential programs, which results in significant speedups on a Dell PowerEdge 1900 server with two Intel Xeon quad-core processors.
Performance evaluation and analysis of thread pinning strategies on multi-core platforms: Case study of SPEC OMP applications on Intel architectures
- International Conference on High Performance Computing and Simulation (HPCS), IEEE, 2011
Abstract - Cited by 3 (1 self)
With the introduction of multi-core processors, thread affinity has quickly emerged as one of the most important factors for accelerating program execution times. This article presents a complete experimental study of the performance of various thread pinning strategies. We investigate four application-independent thread pinning strategies and five application-sensitive ones based on cache sharing. We performed extensive performance evaluations on three different multi-core machines reflecting three typical uses: a workstation machine, a server machine and a high-performance machine. Overall, we show that fixing thread affinities (whatever the tested strategy) is a better choice for improving program performance on HPC ccNUMA machines compared to OS-based thread placement. This means that the current Linux OS scheduling strategy is not necessarily the best choice in terms of performance on ccNUMA machines, even if it is a good choice in terms of core usage ratio and work balancing. On the smaller Core2 and Nehalem machines, we show that the benefit of thread pinning is not satisfactory in terms of speedups versus OS-based scheduling, but the performance stability is much better.
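Fixing a thread's affinity, the mechanism underlying every pinning strategy compared above, takes one system call on Linux. The sketch below uses Python's wrapper around that Linux-only API; on other operating systems different APIs apply, and the helper name is our own.

```python
import os

def pin_to_cpu(cpu):
    """Pin the calling thread to a single CPU (Linux sched_setaffinity;
    pid 0 means the calling thread) and return the resulting affinity set."""
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)

if hasattr(os, "sched_setaffinity"):          # Linux-only; absent elsewhere
    original = os.sched_getaffinity(0)        # remember the OS-chosen placement
    pinned = pin_to_cpu(min(original))        # pin to the lowest allowed CPU
    assert pinned == {min(original)}
    os.sched_setaffinity(0, original)         # undo the pinning
```

An application-sensitive strategy in the paper's sense would choose *which* CPU per thread from cache-sharing information (e.g. co-locating threads that share data on cores behind the same L2), rather than pinning to an arbitrary core as this sketch does.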
A constraint programming approach to instruction assignment
- The 15th Annual Workshop on the Interaction between Compilers and Computer Architecture (INTERACT’15), 2011
Abstract - Cited by 2 (1 self)
A fundamental problem in compiler optimization, which has increased in importance due to the spread of multi-core architectures, is finding parallelism in sequential programs. Current processors can only be fully exploited if the workload is distributed over the available processors. In this paper we look at distributing the instructions in a block of code over multi-cluster processors: the instruction assignment problem. The optimal assignment of instructions in blocks of code to multiple processors is known to be NP-complete. We present a constraint programming approach for scheduling instructions on multi-cluster systems that feature fast inter-processor communication. We employ a problem decomposition technique to solve the problem in a hierarchical manner, where an instance of the master problem solves multiple sub-problems to derive a solution. We found that our approach achieved an improvement of 6%-20%, on average, over state-of-the-art techniques on superblocks from the SPEC 2000 benchmarks.
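The instruction assignment problem itself is easy to state concretely: map each instruction to a cluster so the schedule is shortest, paying a communication delay when a value crosses clusters. The brute-force sketch below is a stand-in for the paper's constraint solver (which scales far beyond this), with a deliberately simplified machine model: one instruction in flight per cluster, issue in index order.

```python
from itertools import product

def best_assignment(latency, deps, num_clusters, comm_cost=1):
    """Try every instruction-to-cluster mapping and keep the shortest schedule.
    latency[i] is instruction i's latency; deps[i] lists its predecessors.
    Crossing clusters adds comm_cost to a dependence edge."""
    best_len, best_map = None, None
    for mapping in product(range(num_clusters), repeat=len(latency)):
        busy = [0] * num_clusters            # when each cluster is next free
        finish = {}
        for i, lat in enumerate(latency):    # instructions issue in index order
            ready = max((finish[d] + (comm_cost if mapping[d] != mapping[i] else 0)
                         for d in deps.get(i, ())), default=0)
            finish[i] = max(ready, busy[mapping[i]]) + lat
            busy[mapping[i]] = finish[i]
        length = max(finish.values())
        if best_len is None or length < best_len:
            best_len, best_map = length, mapping
    return best_len, best_map

# Four independent instructions split evenly across two clusters.
length, _ = best_assignment([2, 2, 2, 2], {}, num_clusters=2)
assert length == 4                           # two per cluster, run in parallel

# A dependence chain with expensive communication stays on one cluster.
length, mapping = best_assignment([1, 1, 1, 1], {1: [0], 2: [1], 3: [2]},
                                  num_clusters=2, comm_cost=5)
assert length == 4 and len(set(mapping)) == 1
```

The two asserts show the tension the solver balances: independent work wants to spread out, while tightly coupled work wants to stay put when inter-cluster communication is costly.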
Conflict-Avoidance in Multicore Caching for Data-Similar Executions
Abstract - Cited by 1 (0 self)
Power density constraints have affected the scaling of clock speed in processors, but following Moore’s law we have entered the multicore domain and are about to step into the era of manycores. Harnessing the full potential of a large number of cores is a challenging problem, as shared on-chip resources such as the memory subsystem and interconnect networks become the bottlenecks. One easy and popular way of utilizing parallelism in large-scale systems is running multiple instances of the same application, as we observe in many domains such as verification and security; we term this “multi-execution.” This model of computation will probably become more popular as the number of cores in a processor grows. We identify that leveraging the similarity in data across the instances of an application, by dynamically merging identical data in a cache, can reduce off-chip traffic and thereby lead to faster execution. However, dissimilarities in content increase the competition for cache lines as well. In this paper we explore the design space of a hybrid mergeable cache architecture that places dissimilar data blocks in a conventional cache, enabling us to exploit data similarity more efficiently by reducing conflicts. We experiment with benchmarks from various multi-execution domains and show that our hybrid mergeable cache design leads to an average of 8.9% additional speedup over the Mergeable cache while running 8 copies of an application, with an overhead of less than 2.26% in area.
An Improvement Over Threads Communications on Multi-Core Processors
, 2012
Abstract - Add to MetaCart
Abstract: A multicore is an integrated circuit chip that places two or more computational engines (cores) in a single processor. This approach splits the computational work of a threaded application and spreads it over multiple execution cores, so that the computer system benefits from better performance and better responsiveness. A thread is a unit of execution inside a process that is created and maintained to execute a set of actions/instructions. Threads can be implemented differently from one operating system to another, but the operating system is in most cases responsible for scheduling the execution of the different threads. Multi-threading improves the efficiency of processor performance with a cost-effective memory system. In this paper, we explore one approach to improving communication between threads. Pre-send is a software-controlled data forwarding technique that sends data to the destination's cache before it is needed, eliminating cache misses in the destination's cache as well as reducing coherence traffic on the bus. We show how the addition of these architectural optimizations to multi-core processors can improve overall system performance.
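The pre-send effect can be modeled in a few lines: a consumer that must fetch data on demand takes a "miss" per access, while a consumer whose producer forwarded the data ahead of time hits locally every time. The classes and miss counter below are our own toy model, not the paper's hardware mechanism.

```python
class Consumer:
    """Toy model of a core's local cache: reads hit the local buffer if the
    data was pre-sent, otherwise they count as a miss and fetch from shared memory."""
    def __init__(self, shared):
        self.shared = shared     # backing store (models memory / another cache)
        self.local = {}          # this consumer's cache contents
        self.misses = 0

    def read(self, key):
        if key not in self.local:            # data was not forwarded ahead of time
            self.misses += 1                 # models a cache miss / bus transaction
            self.local[key] = self.shared[key]
        return self.local[key]

shared = {i: i * i for i in range(8)}

demand = Consumer(shared)                    # baseline: fetch everything on demand
for i in range(8):
    demand.read(i)
assert demand.misses == 8

presend = Consumer(shared)
presend.local.update(shared)                 # producer pre-sends the data it will need
for i in range(8):
    presend.read(i)
assert presend.misses == 0                   # every access hits locally
```

The gap between the two miss counts is what pre-send converts into saved bus traffic; the technique's cost, not modeled here, is forwarding data the consumer never ends up reading.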