Results 1 - 9 of 9
Copy or Discard Execution Model for Speculative Parallelization on Multicores
Cited by 55 (9 self)
Abstract:
The advent of multicores presents a promising opportunity for speeding up sequential programs via profile-based speculative parallelization of these programs. In this paper we present a novel solution for efficiently supporting software speculation on multicore processors. We propose the Copy or Discard (CorD) execution model in which the state of speculative parallel threads is maintained separately from the nonspeculative computation state. If speculation is successful, the results of the speculative computation are committed by copying them into the non-speculative state. If misspeculation is detected, no costly state recovery mechanisms are needed as the speculative state can be simply discarded. Optimizations are proposed to reduce the cost of data copying between nonspeculative and speculative state. A lightweight mechanism that maintains version numbers for non-speculative data values enables misspeculation detection. We also present an algorithm for profile-based speculative parallelization that is effective in extracting parallelism from sequential programs. Our experiments show that the combination of CorD and our speculative parallelization algorithm achieves speedups ranging from 3.7 to 7.8 on a Dell PowerEdge 1900 server with two Intel Xeon quad-core processors.
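The commit/discard mechanics this abstract describes can be pictured with a minimal sketch, assuming a shared dictionary as the non-speculative state and per-value version numbers for misspeculation detection; all names here are invented for illustration and this is not the paper's actual implementation:

```python
# Sketch of the Copy or Discard (CorD) idea: a speculative task works on
# private copies of the values it reads, remembering each value's version
# number at copy-in. At commit, the versions are rechecked: if unchanged,
# the copies are written back (commit); otherwise the speculative state
# is simply dropped (discard), with nothing to roll back.

nonspec = {"x": 10, "y": 20}          # non-speculative state
versions = {"x": 0, "y": 0}           # version number per value

def start_speculative(reads):
    # copy-in: private copies plus the versions they were taken at
    return ({k: nonspec[k] for k in reads},
            {k: versions[k] for k in reads})

def try_commit(local, seen):
    # misspeculation check: did any value change since copy-in?
    if any(versions[k] != v for k, v in seen.items()):
        return False                   # discard: speculative state dropped
    for k, v in local.items():         # commit: copy results back
        nonspec[k] = v
        versions[k] += 1
    return True

local, seen = start_speculative(["x", "y"])
local["x"] = local["x"] + local["y"]   # speculative computation
assert try_commit(local, seen)         # no conflict: commit succeeds

local2, seen2 = start_speculative(["x"])
versions["x"] += 1                     # simulate a conflicting update
assert not try_commit(local2, seen2)   # version changed: discarded
```

The discard path needs no recovery code because speculative writes only ever touch the private copies, which is the point of keeping the two states separate.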
SDC-Based Modulo Scheduling for Pipeline Synthesis
Int’l Conf. on Computer-Aided Design, 2013
Cited by 6 (4 self)
Abstract:
Modulo scheduling is a popular technique to enable pipelined execution of successive loop iterations for performance improvement. While a variety of modulo scheduling algorithms exist for software pipelining, they are not amenable to many complex design constraints and optimization goals that arise in the hardware synthesis context. In this paper we describe a modulo scheduling framework based on the formulation of a system of difference constraints (SDC). Our framework can systematically model a rich set of performance constraints that are specific to the hardware design. The scheduler also exploits the unique mathematical properties of SDC to carry out efficient global optimization and fast incremental update on the constraint system to minimize the resource usage of the synthesized pipeline. Experiments demonstrate that our proposed technique provides efficient solutions for a set of real-life applications and compares favorably against a widely used lifetime-sensitive modulo scheduling algorithm.
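As one way to picture an SDC formulation (a toy sketch, not the paper's framework): each difference constraint s_v - s_u <= c becomes an edge u -> v of weight c in a constraint graph, and single-source shortest paths (Bellman-Ford) yield a feasible schedule exactly when that graph has no negative cycle:

```python
# Toy illustration of scheduling with a system of difference constraints.
# Each constraint (u, v, c) means s_v - s_u <= c. All nodes start at
# distance 0, as if relaxed from a virtual source; Bellman-Ford then
# finds a feasible assignment or detects infeasibility.

def sdc_schedule(n, constraints):
    """n operations; constraints is a list of (u, v, c): s_v - s_u <= c."""
    dist = [0] * n
    for _ in range(n + 1):             # n+1 rounds: enough to converge
        changed = False
        for u, v, c in constraints:
            if dist[u] + c < dist[v]:
                dist[v] = dist[u] + c
                changed = True
        if not changed:
            break
    else:
        return None                    # negative cycle: infeasible system
    lo = min(dist)
    return [d - lo for d in dist]      # shift so the earliest step is 0

# A dependence "op b starts at least 1 cycle after op a" is
# s_b >= s_a + 1, i.e. s_a - s_b <= -1, encoded as (b, a, -1).
sched = sdc_schedule(3, [(1, 0, -1), (2, 1, -1)])  # chain 0 -> 1 -> 2
assert sched == [0, 1, 2]
```

Dependence and latency requirements in modulo scheduling take exactly this s_v - s_u <= c shape, which is what makes the efficient global optimization and incremental constraint updates mentioned in the abstract possible.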
Speculative parallelization of sequential loops on multicores
International Journal of Parallel Programming
Cited by 5 (3 self)
Abstract:
The advent of multicores presents a promising opportunity for speeding up the execution of sequential programs through their parallelization. In this paper we present a novel solution for efficiently supporting software-based speculative parallelization of sequential loops on multicore processors. The execution model we employ is based upon state separation, an approach for separately maintaining the speculative state of parallel threads and the non-speculative state of the computation. If speculation is successful, the results produced by parallel threads in speculative state are committed by copying them into the computation’s non-speculative state. If misspeculation is detected, no costly state recovery mechanisms are needed, as the speculative state can be simply discarded. Techniques are proposed to reduce the cost of data copying between non-speculative and speculative state and to efficiently carry out misspeculation detection. We apply the above approach to speculative parallelization of loops in several sequential programs, which results in significant speedups on a Dell PowerEdge 1900 server with two Intel Xeon quad-core processors.
A General Constraint-centric Scheduling Framework for Spatial Architectures
Cited by 4 (1 self)
Abstract:
Specialized execution using spatial architectures provides energy-efficient computation, but requires effective algorithms for spatially scheduling the computation. Generally, this has been solved with architecture-specific heuristics, an approach that suffers from poor compiler/architect productivity and a lack of insight on optimality, and that inhibits migration of techniques between architectures. Our goal is to develop a scheduling framework usable for all spatial architectures. To this end, we express spatial scheduling as a constraint satisfaction problem using Integer Linear Programming (ILP). We observe that architecture primitives and scheduler responsibilities can be related through five abstractions: placement of computation, routing of data, managing event timing, managing resource utilization, and forming the optimization objectives. We encode these responsibilities as 20 general ILP constraints, which are used to create schedulers for the disparate TRIPS, DySER, and PLUG architectures. Our results show that a general declarative approach using ILP is implementable, practical, and typically matches or outperforms specialized schedulers.
IMPRESS: Improving Multicore Performance and Reliability via Efficient Support for Software Monitoring
2009
Cited by 1 (0 self)
Abstract:
My sincere thanks to my advisor Prof. Rajiv Gupta, with whom I have worked for the past 5 years. I have learnt a lot about compilers, architecture and research from him during this time. I have also learnt a lot from the ways in which he interacts with his students. His humility, patience and concern for his students are characteristics that have really struck me. I have thoroughly enjoyed the time spent with him solving research problems, and I want to thank him for all the help, advice and encouragement towards this dissertation. I also thank my other committee members Prof. Walid Najjar and Prof. Frank Vahid. I would also like to thank all the members of my research group including Dr.
An Energy-Efficient Patchable Accelerator for Post-Silicon Engineering Changes
Abstract:
With the shorter time-to-market and the rising cost of SoC development, the demand for post-silicon programmability has been increasing. Recently, programmable accelerators have attracted more attention as an enabling solution for post-silicon engineering change. However, programmable accelerators suffer from 5∼10X lower energy efficiency than fixed-function accelerators, mainly due to their extensive use of memories. This paper proposes a highly energy-efficient accelerator which enables post-silicon engineering change through a control patching mechanism. We then propose a patch compilation method from a given pair of an original design and a modified design. Experimental results demonstrate that the proposed accelerators offer energy efficiency competitive with fixed-function accelerators and can achieve about 5X higher efficiency than existing programmable accelerators.
Optimal Control Problem and Power-Efficient Medical Image Processing Using PUMA
Abstract:
As a starting point of this paper we present a problem from mammographic image processing. We show how it can be formulated as an optimal control problem for PDEs and illustrate that it leads to penalty terms which are nonstandard in the theory of optimal control of PDEs. To solve this control problem we use a generalization of the conditional gradient method which is especially suitable for non-convex problems. We apply this method to our control problem and illustrate that this method also covers the recently proposed method of surrogate functionals from the theory of inverse problems.

Graphics processing units (GPUs) are becoming an increasingly popular platform to run applications that require a high computation throughput. They are limited, however, by memory bandwidth and power and, as such, cannot always achieve their full potential. This paper presents the PUMA architecture, a domain-specific accelerator designed specifically for medical imaging applications, but with sufficient generality to make it programmable. The goal is to closely match the performance achieved by GPUs in this domain but at a fraction of the power consumption. The results are quite promising: PUMA achieves up to 2X the performance of a modern GPU architecture and has up to a 54X improved efficiency on a floating-point and memory-intensive MRI reconstruction algorithm.
Automatic Design of Efficient Application-Centric Architectures
2008
Abstract:
would not have been possible without the guidance and support of many people. First and foremost, I would like to thank my advisor, Scott Mahlke. His insight, expertise, enthusiasm, and encouragement played a large part in my success in graduate school. Without his guidance, this dissertation would not exist. In addition, Scott was one of the first to encourage me to undertake graduate studies in the first place after I worked with him as an undergraduate. I would also like to thank my thesis committee, Professors Trevor Mudge, Todd Austin, and Stéphane Lafortune. They donated their time, giving valuable comments and suggestions to help improve the research and refine the thesis. In addition, I would like to thank Bill Mangione-Smith, who first exposed me to compilers and to graduate school when I worked with him at UCLA. The research presented in this dissertation was not the work of one person; I was fortunate to have the assistance of a number of other students in the Compilers Creating Custom Processors research group. Manjunath Kudlur provided significant help with the ILP and SMT scheduling formulations presented in this dissertation.