BOLT: Energy-Efficient Out-of-Order Latency-Tolerant Execution
Cited by 3 (2 self)
LT (latency tolerant) execution is an attractive candidate technique for future out-of-order cores. LT defers the forward slices of LLC (last-level cache) misses to a slice buffer and re-executes them when the misses return. An LT core increases ILP without physically scaling the issue queue and register file and increases MLP without additional software threads that can reduce cache performance. Unfortunately, proposed LT designs are not energy efficient. They require too many additional structures and they defer and re-execute too many instructions to justify their performance gains. In this paper, we address these inefficiencies. We introduce a microarchitecture called BOLT (Better Out-of-Order Latency-Tolerance) that implements LT as an alternative use of SMT (Simultaneous Multi-Threading). We also present a new slice buffer organization and traversal scheme that increases performance and reduces overhead by pruning instances of useless and redundant LT. Collectively, these modifications turn out-of-order LT into a technique that improves performance in an energy-efficient way.
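The deferral idea in the abstract above can be illustrated with a toy software model. This is only a sketch under strong assumptions, not the BOLT microarchitecture: the function name `execute_latency_tolerant`, the tuple encoding of instructions, and the rule that every register is written at most once are all hypothetical simplifications introduced here.

```python
def execute_latency_tolerant(program, regs, miss_values):
    """Toy model of latency-tolerant execution (hypothetical encoding).

    program:     list of (dest, srcs, fn); fn is None for a load that misses.
    regs:        dict of register values available before the program runs.
    miss_values: dict mapping each missing load's dest to its eventual value.
    Assumes each register is written at most once, so deferred instructions
    never see a source overwritten by a younger instruction.
    """
    poisoned = set()     # registers whose values depend on a pending miss
    slice_buffer = []    # deferred forward slice, kept in program order

    # First pass: execute miss-independent work, defer the forward slice.
    for dest, srcs, fn in program:
        if fn is None:
            poisoned.add(dest)                  # the miss itself
        elif poisoned.intersection(srcs):
            poisoned.add(dest)                  # depends on a pending miss
            slice_buffer.append((dest, srcs, fn))
        else:
            regs[dest] = fn(*(regs[s] for s in srcs))

    # The misses return: fill in their values, then re-execute the slice.
    regs.update(miss_values)
    for dest, srcs, fn in slice_buffer:
        regs[dest] = fn(*(regs[s] for s in srcs))

    return [dest for dest, _, _ in slice_buffer]
```

In this model, instructions that do not depend on the miss complete during the first pass, and only the dependent slice waits for the miss data.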
RetCon: Transactional Repair without Replay
In Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010
Cited by 2 (1 self)
Over the past decade there has been a surge of academic and industrial interest in optimistic concurrency, i.e. the speculative parallel execution of code regions that have the semantics of isolation. This work analyzes scalability bottlenecks of workloads that use optimistic concurrency. We find that one common bottleneck is updates to auxiliary program data in otherwise non-conflicting operations, e.g. reference count updates and hashtable occupancy field increments. To eliminate the performance impact of conflicts on such auxiliary data, this work proposes RETCON, a hardware mechanism that tracks the relationship between input and output values symbolically and uses this symbolic information to transparently repair the output state of a transaction at commit. RETCON is inspired by instruction replay-based mechanisms but exploits simplifying properties of the nature of computations on auxiliary data to perform repair without replay. Our experiments show that RETCON provides significant speedups for workloads that exhibit conflicts on auxiliary data, including transforming a transactionalized version of the Python interpreter from a workload that exhibits no scaling to one that exhibits near-linear scaling on 32 cores.
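The repair-at-commit idea described above can be sketched in software for the simplest case, an additive update such as a reference count increment. This is a minimal sketch, not the RETCON hardware: the class `SymbolicTxn`, its methods, and the restriction to additions are assumptions made here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SymbolicTxn:
    """Toy transaction that tracks each updated location as input + delta."""
    memory: dict
    # location -> (value observed at first access, accumulated delta)
    tracked: dict = field(default_factory=dict)

    def add(self, loc, amount):
        # Record the observed input once, then accumulate the symbolic delta.
        seen, delta = self.tracked.get(loc, (self.memory[loc], 0))
        self.tracked[loc] = (seen, delta + amount)

    def commit(self):
        # Repair at commit: apply each delta to the current value, so a
        # concurrent update to the same location does not force a replay.
        repaired = 0
        for loc, (seen, delta) in self.tracked.items():
            if self.memory[loc] != seen:
                repaired += 1
            self.memory[loc] += delta
        self.tracked.clear()
        return repaired


def demo():
    memory = {"refcount": 5}
    t1 = SymbolicTxn(memory)
    t2 = SymbolicTxn(memory)
    t1.add("refcount", 1)    # both transactions observe refcount == 5
    t2.add("refcount", 1)
    t1.commit()
    repaired = t2.commit()   # input changed under t2; repaired, not replayed
    return memory["refcount"], repaired
```

Here the second transaction's input changes before it commits, yet both increments are preserved without re-running either transaction.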
Decoupled Store Completion/Silent Deterministic Replay: Enabling Scalable Data Memory for CPR/CFP Processors
Cited by 2 (1 self)
CPR/CFP (Checkpoint Processing and Recovery/Continual Flow Pipeline) support an adaptive instruction window that scales to tolerate last-level cache misses. CPR/CFP scale the register file by aggressively reclaiming the destination registers of many in-flight instructions. However, an analogous mechanism does not exist for stores and loads. As the window expands, CPR/CFP processors must track all in-flight stores and loads to support forwarding and detect memory ordering violations. The previously-described SVW (Store Vulnerability Window) and SQIP (Store Queue Index Prediction) schemes provide scalable, non-associative load and store queues, respectively. However, they don’t work smoothly in a CPR/CFP context. SVW/SQIP rely on the ability to dynamically stall some loads until a specific older store writes to the cache. Enforcing this serialization in CPR/CFP is expensive if the load and store are in the same checkpoint. We introduce two complementary procedures that implement this serialization efficiently. Decoupled Store Completion (DSC) allows stores to write to the cache before the enclosing checkpoint completes execution. Silent Deterministic Replay (SDR) supports mis-speculation recovery in the presence of DSC by replaying loads older than completed stores using values from the load queue. The combination of DSC and SDR enables an SVW/SQIP based CPR/CFP memory system that outperforms previous designs while occupying less area.
MECHANISMS FOR UNBOUNDED, CONFLICT-ROBUST HARDWARE TRANSACTIONAL MEMORY
COPYRIGHT 2010 Colin Blundell. This dissertation is dedicated to my wife, Angelina. Without you, this would not have been possible. Acknowledgements. This dissertation would not have been possible without the love and support of my family. My deepest thanks go to my wife, Angelina. The opportunity to meet her has been the greatest reward of my decision to go to graduate school. She is both the source of my success and the reason that this success has meaning. I also thank Jacob for the joy that he has brought to my life and Merlin for his constant good humor, support, and loyalty. The support of my mother, father, and brother has been instrumental in my reaching this point. They have shared in the joy of my successes and have helped me weather the setbacks. The foundation of my later success was laid in my parents' teaching when I was young. My brother David has been there with me through the ups and downs of our entire lives; he is the best friend a brother could ever hope for.
Scalable Cores in Chip Multiprocessors
Chip design is at an inflection point. It is now clear that chip multiprocessors (CMPs) will dominate product offerings for the foreseeable future. Such designs integrate many processing cores onto a single chip. However, debate remains about the relative merits of the explicit software threading necessary to use these designs. At the same time, the pursuit of improved performance for single threads must continue, as legacy applications and hard-to-parallelize codes will remain important. These concerns lead computer architects to a quandary with each new design. Too much focus on per-core performance will fail to encourage software (and software developers) to migrate programs toward explicit concurrency; too little focus on cores will hurt performance of vital existing applications. Furthermore, because future chips will be constrained by power, it may not be possible to deploy both aggressive cores and many hardware threads in the same chip. To address the need for chips delivering both high single-thread performance and many hardware threads, this thesis evaluates Scalable Cores in Chip Multiprocessors: CMPs equipped with cores that deliver high performance (at high per-core power) when the situation merits, but can also operate in lower-power modes, to enable concurrent execution of many threads. Toward this vision, I make several contributions. First, I discuss a method for representing inter-instruction
ENERGY EFFICIENT LOAD LATENCY TOLERANCE: SINGLE-THREAD PERFORMANCE FOR THE MULTI-CORE ERA
2010
"... COPYRIGHT ..."
Idempotent Processor Architecture
Appears in the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011
Improving architectural energy efficiency is important to address diminishing energy efficiency gains from technology scaling. At the same time, limiting hardware complexity is also important. This paper presents a new processor architecture, the idempotent processor architecture, that advances both of these directions by presenting a new execution paradigm that allows speculative execution without the need for hardware checkpoints to recover from misspeculation, instead using only re-execution to recover. Idempotent processors execute programs as a sequence of compiler-constructed idempotent (re-executable) regions. The nature of these regions allows precise state to be reproduced by re-execution, obviating the need for hardware recovery support. We build upon the insight that programs naturally decompose into a series of idempotent regions and that these regions can be large. The paradigm of executing idempotent regions, which we call idempotent processing, can be used to support various types of speculation, including branch prediction, dependence prediction, or execution in the presence of hardware faults or exceptions. In this paper, we demonstrate how idempotent processing simplifies the design of in-order processors. Conventional in-order processors suffer from significant complexities to achieve high performance while supporting the execution of variable latency instructions and enforcing precise exceptions. Idempotent processing eliminates many of these complexities and the resulting inefficiencies by allowing instructions to retire out of order with support for re-execution when necessary to recover precise state. Across a diverse set of benchmark suites, our quantitative results show that we obtain a geometric mean performance increase of 4.4% (up to 25% and beyond) while maintaining an overall reduction in power and hardware complexity.
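The recovery-by-re-execution property can be illustrated with a hand-written toy region. This is a sketch under stated assumptions rather than the paper's compiler-constructed regions: the functions `run_region` and `execute_with_reexecution`, the dictionary used as machine state, and the simulated fault are all hypothetical. The region is idempotent because it never overwrites its live-in values `a` and `b`.

```python
def run_region(state, fail_after=None):
    """Execute one toy region; optionally fault before step `fail_after`.

    The region only reads its inputs "a" and "b" and writes "t", "u", "out",
    so running it again from the top always produces the same final state.
    """
    steps = [
        lambda s: s.__setitem__("t", s["a"] + s["b"]),    # t = a + b
        lambda s: s.__setitem__("u", s["t"] * 2),         # u = t * 2
        lambda s: s.__setitem__("out", s["u"] - s["a"]),  # out = u - a
    ]
    for i, step in enumerate(steps):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated fault")
        step(state)


def execute_with_reexecution(state):
    """Recover from a mid-region fault by re-running the region from its
    entry point, with no checkpoint of the partially updated state."""
    try:
        run_region(state, fail_after=2)   # faults after writing t and u
    except RuntimeError:
        run_region(state)                 # re-execute the whole region
    return state
```

A region that instead updated `a` in place would not be safe to re-run this way, which is why the sketch keeps inputs and outputs in separate locations.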
Graduate Group Chairperson’s Name Graduate Group Chairperson Acknowledgments
2011
This dissertation would not have been possible without the guidance, help, and support from my committee, family and friends. I owe my deepest gratitude to my advisor, Dr. Amir Roth, for his excellent guidance, extreme patience, and great caring. Dr. Amir Roth led me through my Ph.D. life, providing me with a great vision of computer architecture. I would like to thank Dr. Milo Martin, who not only inspired me with research ideas, but also helped me consider my career in the long run. I would like to thank Dr. Tali Moreshet, Dr. André DeHon, and Dr. Oleg Sokolsky for their great suggestions on my dissertation. I would also like to thank my great Penn CIS architecture and compiler group companions, Dr. Vlad Petric, Dr. Anne Bracy, Dr. Marc Corliss, Dr. Andrew Hilton, Dr. Colin Blundell, Arun Raghavan, Santosh Nagarakatte, and Vivek Rane, for the great days we spent together. I would like to specially thank my grandfather, Mingfang Sha, who was a great man with strong determination. It was he who motivated me to start my graduate study at the University of Pennsylvania. Although I was not able to accompany him through the last moments of his life, I know that his spirit is always with me, keeping me moving forward. My great thanks go to my grandmother, Sujuan Tang, for encouraging me every time I was down and holding the family together after grandfather left us. I want to thank my parents, Zuojia Sha and Liqiong Tan, who have selflessly devoted all their love to me since the day I was born. I would like to thank my uncle, Dr. Yuejia Sha, my aunt, Qi Yin, and my cousin, Hao Sha, for their non-stop support and help. I would like to thank a good friend of mine, Gang Song, for accompanying me through the good and bad days, sharing my happiness, and encouraging me to pursue my dream
PRACTICAL LOW-OVERHEAD ENFORCEMENT OF MEMORY SAFETY FOR C PROGRAMS
COPYRIGHT 2012 Santosh Ganapati Nagarakatte. This dissertation is dedicated to my parents. Without them, this would not have been possible. Acknowledgments. This dissertation is a direct result of constant support and encouragement from my parents, who at times had more confidence in my abilities than I had myself. Apart from my parents, there are numerous people who have been instrumental in the growth of my research career and my development as an individual. My adviser, Milo Martin, has had a transformative influence on me as a researcher. I am fortunate to have worked with him for the last five years. Milo provided me the initial insights, the motivation to work on the problem, and eventually has nourished my ideas. He was generous with his time and wisdom. He provided me an excellent platform where I could excel. Apart from the research under him, he gave me freedom to collaborate with other researchers independently. I have also learned a great deal about teaching, presentations, and mentoring that will be amazingly useful in my future

