Results 1 - 10 of 10
Efficient processor support for DRFx, a memory model with exceptions
- In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
, 2011
"... Abstract A longstanding challenge of shared-memory concurrency is to provide a memory model that allows for efficient implementation while providing strong and simple guarantees to programmers. The C++0x and Java memory models admit a wide variety of compiler and hardware optimizations and provide ..."
Abstract
-
Cited by 15 (1 self)
- Add to MetaCart
(Show Context)
A longstanding challenge of shared-memory concurrency is to provide a memory model that allows for efficient implementation while providing strong and simple guarantees to programmers. The C++0x and Java memory models admit a wide variety of compiler and hardware optimizations and provide sequentially consistent (SC) semantics for data-race-free programs. However, they either do not provide any semantics (C++0x) or provide a hard-to-understand semantics (Java) for racy programs, compromising the safety and debuggability of such programs. In earlier work we proposed the DRFx memory model, which addresses this problem by dynamically detecting potential violations of SC due to the interaction of compiler or hardware optimizations with data races, and halting execution upon detection. In this paper, we present a detailed micro-architecture design for supporting the DRFx memory model, formalize the design and prove its correctness, and evaluate the design using a hardware simulator. We describe a set of DRFx-compliant complexity-effective optimizations that allow us to attain performance close to that of TSO (Total Store Order) and DRF0 while providing strong guarantees for all programs.
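
The core idea is region-level conflict detection: optimizations are confined to bounded regions, and a data race between two concurrently executing regions raises a memory-model exception. Below is a minimal software sketch of that idea; the class and method names (RegionConflictDetector, begin_region, access, end_region) are illustrative, not the paper's micro-architecture.

    # Hypothetical sketch of region-conflict detection in the spirit of DRFx:
    # each core executes bounded regions; a conflicting access between two
    # in-flight regions raises a memory-model exception.

    class MemoryModelException(Exception):
        pass

    class RegionConflictDetector:
        def __init__(self):
            # core -> {"reads": set of addrs, "writes": set of addrs}
            self.active = {}

        def begin_region(self, core):
            self.active[core] = {"reads": set(), "writes": set()}

        def access(self, core, addr, is_write):
            me = self.active.setdefault(core, {"reads": set(), "writes": set()})
            # Conflict: another in-flight region touched addr and at least
            # one of the two accesses is a write.
            for other, sets in self.active.items():
                if other == core:
                    continue
                if addr in sets["writes"] or (is_write and addr in sets["reads"]):
                    raise MemoryModelException(f"potential SC violation at {addr:#x}")
            (me["writes"] if is_write else me["reads"]).add(addr)

        def end_region(self, core):
            # Region committed; its accesses can no longer conflict.
            self.active.pop(core, None)
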
Software-hardware cooperative DRAM bank partitioning for chip multiprocessors
- In Proc. of the 2010 IFIP Int’l Conf. on Network and Parallel Computing (NPC)
, 2010
"... Abstract. DRAM row buffer conflicts can increase the memory access latency significantly for single-threaded applications. In a chip multiprocessor system, multiple applications competing for DRAM will suffer additional row buffer conflicts due to interthread interference. This paper presents a new ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
(Show Context)
DRAM row buffer conflicts can increase memory access latency significantly for single-threaded applications. In a chip multiprocessor system, multiple applications competing for DRAM suffer additional row buffer conflicts due to inter-thread interference. This paper presents a new hardware and software cooperative DRAM bank partitioning method that combines page coloring and XOR cache mapping to evaluate the potential benefit of reducing inter-thread interference. Using SPECfp2000 as our benchmarks, our simulation results show that our scheme can boost the performance of most of the benchmark combinations tested, with speedups of up to 13%, 14%, and 8.06% observed for two cores (with 16 banks), two cores (with 32 banks), and four cores (with …)
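
The method pairs two mechanisms: an XOR function that spreads consecutive rows across banks, and OS page coloring that confines each core's pages to a disjoint set of banks. A minimal sketch, assuming 16 banks, 4 KB pages, and an 8 KB interleave granularity; the bit positions, constants, and the allocate_page helper are assumptions for illustration, not the paper's configuration.

    # XOR bank mapping plus page coloring (illustrative parameters).

    PAGE_SHIFT = 12      # 4 KB pages
    BANK_BITS  = 4       # 16 banks
    ROW_SHIFT  = 13      # assumed 8 KB bank-interleave granularity

    def xor_bank(paddr):
        # XOR the low bank-index field with higher-order bits so that
        # consecutive rows of one application spread across banks.
        low  = (paddr >> ROW_SHIFT) & (2**BANK_BITS - 1)
        high = (paddr >> (ROW_SHIFT + BANK_BITS)) & (2**BANK_BITS - 1)
        return low ^ high

    def allocate_page(free_pages, core, cores):
        # Page coloring: give each core only physical pages whose XOR bank
        # index falls in its partition, so cores do not share banks.
        banks_per_core = 2**BANK_BITS // cores
        mine = range(core * banks_per_core, (core + 1) * banks_per_core)
        for pfn in free_pages:
            if xor_bank(pfn << PAGE_SHIFT) in mine:
                free_pages.remove(pfn)
                return pfn
        return None  # partition exhausted; a real OS would fall back

Because the interleave granularity is above the page offset, the bank index is fully determined by the physical frame number, which is what lets the OS steer pages by color.
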
SynFull: Synthetic traffic models capturing a full range of cache coherent behaviour
- In ISCA
, 2014
"... Modern and future many-core systems represent complex ar-chitectures. The communication fabrics of these large systems heavily influence their performance and power consumption. Current simulation methodologies for evaluating networks-on-chip (NoCs) are not keeping pace with the increased com-plexit ..."
Abstract
-
Cited by 4 (3 self)
- Add to MetaCart
(Show Context)
Modern and future many-core systems represent complex architectures. The communication fabrics of these large systems heavily influence their performance and power consumption. Current simulation methodologies for evaluating networks-on-chip (NoCs) are not keeping pace with the increased complexity of our systems; architects often want to explore many different design knobs quickly. Methodologies that capture workload trends with faster simulation times are highly beneficial at early stages of architectural exploration. We propose SynFull, a synthetic traffic generation methodology that captures both application and cache coherence behaviour to rapidly evaluate NoCs. SynFull allows designers to run detailed performance simulations quickly, without the cost of long-running full-system simulation. By capturing a full range of application and coherence behaviour, architects can avoid over- or under-designing the network, as may occur when using traditional synthetic traffic patterns such as uniform random. SynFull has errors as low as 0.3% and provides 50× speedup on average over full-system simulation.
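
To show the flavour of model-driven synthetic traffic, here is a toy generator that walks a Markov chain over application phases and injects packets at phase-dependent rates. SynFull's actual models are fit to full-system behaviour; the phases, rates, and transition probabilities below are invented for illustration.

    import random

    # Toy phase-based synthetic traffic generator (illustrative values).
    PHASES = {
        "compute": {"rate": 0.02, "next": [("compute", 0.95), ("memory", 0.05)]},
        "memory":  {"rate": 0.30, "next": [("memory", 0.80), ("compute", 0.20)]},
    }

    def generate(cycles, nodes=16, seed=0):
        rng, phase, packets = random.Random(seed), "compute", []
        for cycle in range(cycles):
            # Inject a packet with the current phase's per-node probability.
            for src in range(nodes):
                if rng.random() < PHASES[phase]["rate"]:
                    packets.append((cycle, src, rng.randrange(nodes)))
            # Advance the phase Markov chain once per cycle.
            states, probs = zip(*PHASES[phase]["next"])
            phase = rng.choices(states, weights=probs)[0]
        return packets

Fitting the phase structure and rates to a real workload, rather than using a fixed pattern such as uniform random, is what lets this style of generator track application and coherence behaviour.
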
When Less Is MOre (LIMO): Controlled Parallelism for Improved Efficiency
"... While developing shared-memory programs, programmers often contend with the problem of how many threads to create for best efficiency. Creating as many threads as the number of available processor cores, or more, may not be the most efficient configuration. Too many threads can result in excessive c ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
While developing shared-memory programs, programmers often contend with the problem of how many threads to create for best efficiency. Creating as many threads as the number of available processor cores, or more, may not be the most efficient configuration. Too many threads can result in excessive contention for shared resources, wasting energy, which is of primary concern for embedded devices. Furthermore, thermal and power constraints prevent us from operating all the processor cores at the highest possible frequency, favoring fewer threads. The best number of threads to run depends on the application, user input, and the hardware resources available. It can also change at runtime, making it infeasible for the programmer to determine this number. To address this problem, we propose LIMO, a runtime system that dynamically manages the number of running threads of an application to maximize performance and energy efficiency. LIMO monitors threads' progress along with the usage of shared hardware resources to determine the best number of threads to run and the voltage and frequency level. With dynamic adaptation, LIMO provides an average of 21% performance improvement and a 2× improvement in energy efficiency on a 32-core system over the default configuration of 32 threads, for a set of concurrent applications from the PARSEC suite, the Apache web server, and the Sphinx speech recognition system.
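
A minimal sketch of the control idea: measure progress over an interval and hill-climb the thread count. The real system also folds in shared-resource usage and picks voltage/frequency levels; measure_throughput and set_active_threads here are assumed hooks into the runtime, not an actual API.

    # Hill-climbing thread-count controller (illustrative sketch).

    def control_loop(max_threads, measure_throughput, set_active_threads):
        threads, step, best = max_threads, -1, 0.0
        while True:
            set_active_threads(threads)
            tput = measure_throughput()   # e.g. work items per interval
            if tput < best:
                step = -step              # got worse: reverse direction
            best = max(best, tput)
            threads = min(max_threads, max(1, threads + step))
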
Data Criticality in Network-On-Chip Design
"... ABSTRACT Many network-on-chip (NoC) designs focus on maximizing performance, delivering data to each core no later than needed by the application. Yet to achieve greater energy efficiency, we argue that it is just as important that data is delivered no earlier than needed. To address this, we explo ..."
Abstract
- Add to MetaCart
(Show Context)
Many network-on-chip (NoC) designs focus on maximizing performance, delivering data to each core no later than needed by the application. Yet to achieve greater energy efficiency, we argue that it is just as important that data is delivered no earlier than needed. To address this, we explore data criticality in CMPs. Caches fetch data in bulk (blocks of multiple words). Depending on the application's memory access patterns, some words are needed right away (critical) while other data are fetched too soon (non-critical). On a wide range of applications, we perform a limit study of the impact of data criticality in NoC design. Criticality-oblivious designs can waste up to 37.5% energy, compared to an idealized NoC that fetches each word both no later and no earlier than needed. Furthermore, 62.3% of energy is wasted fetching data that is not used by the application. We present NoCNoC, a practical, criticality-aware NoC design that achieves up to 60.5% energy savings with no loss in performance. Our work moves towards an ideally efficient NoC, delivering data both no later and no earlier than needed.
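
One way to make the word-level classification concrete is by time-to-first-use after the block fill, as in the sketch below. The 16-cycle window is an assumed threshold for illustration, not a value from the paper.

    # Classify words of a fetched cache block by when the core first uses them.
    WINDOW = 16  # assumed criticality window, in cycles

    def classify_block(fill_cycle, first_use):
        # first_use: first-use cycle per word, or None if never used
        labels = []
        for use in first_use:
            if use is None:
                labels.append("unused")        # energy wasted fetching it
            elif use - fill_cycle <= WINDOW:
                labels.append("critical")      # must arrive promptly
            else:
                labels.append("non-critical")  # may be delivered lazily
        return labels
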
Load Value Approximation
"... Abstract—Approximate computing explores opportunities that emerge when applications can tolerate error or inex-actness. These applications, which range from multimedia processing to machine learning, operate on inherently noisy and imprecise data. We can trade-off some loss in output value integrity ..."
Abstract
- Add to MetaCart
(Show Context)
Approximate computing explores opportunities that emerge when applications can tolerate error or inexactness. These applications, which range from multimedia processing to machine learning, operate on inherently noisy and imprecise data. We can trade off some loss in output value integrity for improved processor performance and energy efficiency. As memory accesses consume substantial latency and energy, we explore load value approximation, a microarchitectural technique to learn value patterns and generate approximations for the data. The processor uses these approximate data values to continue executing without incurring the high cost of accessing memory, removing load instructions from the critical path. Load value approximation can also inhibit approximated loads from accessing memory, resulting in energy savings. On a range of PARSEC workloads, we observe up to 28.6% speedup (8.5% on average) and 44.1% energy savings (12.6% on average), while maintaining low output error. By exploiting the approximate nature of applications, we draw closer to the ideal latency and energy of accessing memory.
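
A simple instance of learning value patterns is a per-PC last-value-plus-stride approximator gated by confidence, sketched below. The table organization and confidence scheme are assumptions for illustration, not the paper's design.

    # Minimal load-value approximator: last value + stride per load PC.

    class LoadValueApproximator:
        def __init__(self, threshold=4):
            self.table = {}           # pc -> [last_value, stride, confidence]
            self.threshold = threshold

        def predict(self, pc):
            e = self.table.get(pc)
            if e and e[2] >= self.threshold:
                return e[0] + e[1]    # approximate: last value plus stride
            return None               # not confident: access memory

        def train(self, pc, actual):
            last, stride, conf = self.table.get(pc, [actual, 0, 0])
            new_stride = actual - last
            conf = conf + 1 if new_stride == stride else 0
            self.table[pc] = [actual, new_stride, conf]

Because the consumer tolerates approximate values, a confident prediction lets the load retire immediately, and training happens off the critical path when the real value eventually arrives.
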
ACKNOWLEDGMENTS
, 2011
"... My heartfelt gratitude and thanks are due to my advisor Dr. Renato J. Figueiredo for supporting, encouraging and guiding me in my academic journey culminating in the PhD degree. His patience and guidance, especially during the initial years, gave me the confidence to persevere. Learning from him abo ..."
Abstract
- Add to MetaCart
(Show Context)
My heartfelt gratitude and thanks are due to my advisor Dr. Renato J. Figueiredo for supporting, encouraging and guiding me in my academic journey culminating in the PhD degree. His patience and guidance, especially during the initial years, gave me the confidence to persevere. Learning from him about computer architecture and systems, virtualization, the art of research, techniques for good writing and strategies for creating good presentations has been a wonderful experience. I am privileged to have him as my advisor and mentor. I thank Dr. P. Oscar Boykin for teaching me techniques of analytical modeling and for the invigorating discussions on applying engineering principles to solve real-world problems. I am grateful to Dr. Jose Fortes for giving me an opportunity to be a part of the ACIS Lab at the University of Florida and for sharing his insight and perspective on research and the PhD process. I also thank Dr. Tao Li and Dr. Prabhat Mishra for serving on my committee and for their insightful questions and suggestions which have enhanced this dissertation. A good portion of my computer architecture knowledge and simulation skills were learned and honed during my internships at Intel Corporation. I thank Ramesh Illikkal, Greg Regnier, Donald Newell and Dr. Ravi Iyer for giving me these opportunities and Nilesh Jain, Jaideep Moses, Dr. Omesh Tickoo and Paul M. Stillwell Jr. for helping me complete these internships successfully. I also thank the members of the SoC Platform and Architecture group at Intel Labs for their ideas and perspectives on my research. I am especially thankful to Dr. Omesh Tickoo for being a wonderful mentor during and after my internship. I would also like to thank my past and present colleagues at ACIS Labs and at …
On the Performance of Tagged Translation Lookaside Buffers: A Simulation-Driven Analysis
"... Abstract—Recent virtualization-driven CPU architectural extensions involve tagging the hardware-managed Translation Lookaside Buffer (TLB) entries to avoid TLB flushes during context switches, thereby sharing the TLB among multiple address spaces. While tagged TLBs are expected to improve the perfor ..."
Abstract
- Add to MetaCart
(Show Context)
Recent virtualization-driven CPU architectural extensions involve tagging the hardware-managed Translation Lookaside Buffer (TLB) entries to avoid TLB flushes during context switches, thereby sharing the TLB among multiple address spaces. While tagged TLBs are expected to improve the performance of virtualized workloads, a systematic evaluation of this improvement, its dependence on TLB- and workload-related factors, and the performance implications of the contention arising from TLB sharing have yet to be investigated. This paper undertakes these investigations using a simulation-driven approach. We develop a simulation model for the tagged TLB and integrate it into a full-system simulation framework. Using this model, we show that the performance impact of using tagged TLBs ranges from 1% to 25% and is highly dependent on the size of the TLB, the TLB miss penalty, the nature of the workload, and the type of tag used. The performance of consolidated workloads is also simulated, and the observations from these simulations are used to highlight the performance variation due to resource contention in the shared TLB. We also explore isolating the TLB behavior of one application in a consolidated workload from these variations by means of a static TLB usage control scheme. Furthermore, we show that the performance improvement due to tagged TLBs can be further increased by 1.4× for selected high-priority applications by restricting the TLB usage of other low-priority workloads in a consolidated workload scenario.
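
The essential behaviour being modelled is that entries carry an address-space tag, so a context switch changes the current tag instead of flushing. A minimal sketch of such a model follows; it is fully associative with FIFO replacement, both assumed simplifications rather than the paper's simulated configuration.

    from collections import OrderedDict

    # Simple tagged-TLB model: (tag, vpn) -> pfn; no flush on context switch.

    class TaggedTLB:
        def __init__(self, entries=64):
            self.entries = entries
            self.map = OrderedDict()   # insertion order gives FIFO eviction
            self.tag = 0               # current address-space tag

        def context_switch(self, new_tag):
            # No flush: entries of other address spaces simply stop matching.
            self.tag = new_tag

        def lookup(self, vpn):
            return self.map.get((self.tag, vpn))   # None signals a TLB miss

        def fill(self, vpn, pfn):
            if len(self.map) >= self.entries:
                self.map.popitem(last=False)       # evict the oldest entry
            self.map[(self.tag, vpn)] = pfn

The contention effect the paper studies falls out of this structure: address spaces now compete for the same fixed pool of entries, which a usage control scheme can cap per tag.
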
Holistic Routing Algorithm Design to Support Workload Consolidation in NoCs
"... Abstract—To provide efficient, high-performance routing algorithms, a holistic approach should be taken. The key aspects of routing algorithm design include adaptivity, path selection strategy, VC allocation, isolation and hardware implementation cost; these design aspects are not independent. The k ..."
Abstract
- Add to MetaCart
(Show Context)
To provide efficient, high-performance routing algorithms, a holistic approach should be taken. The key aspects of routing algorithm design include adaptivity, path selection strategy, virtual channel (VC) allocation, isolation, and hardware implementation cost; these design aspects are not independent. The key contribution of this work lies in the design of a novel selection strategy, Destination-Based Selection Strategy (DBSS), which targets interference that can arise in many-core systems running consolidation workloads. In the process of this design, we holistically consider all aspects to ensure an efficient design. Existing routing algorithms largely overlook issues associated with workload consolidation. Locally adaptive algorithms do not consider enough status information to avoid network congestion. Globally adaptive routing algorithms attack this issue by utilizing network status beyond neighboring nodes. However, they may suffer from interference, coupling the behavior of otherwise independent applications. To address these issues, DBSS leverages both local and non-local network status to provide more effective adaptivity. More importantly, by integrating the destination into the selection procedure, DBSS mitigates interference and offers dynamic isolation among applications. Results show that DBSS offers better performance than the best baseline selection strategy and improves the energy-delay product at medium and high injection rates; it is well suited for workload consolidation.
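
To make "integrating the destination into the selection procedure" concrete, here is a loose sketch: among the output ports the routing function admits, pick the one whose congestion estimate toward this packet's destination is lowest, mixing local VC availability with propagated non-local status. The table layout, names, and weighting are assumptions for illustration, not the paper's design.

    # Destination-keyed port selection (illustrative sketch).

    def select_port(admissible_ports, dest, local_free_vcs, remote_congestion):
        # local_free_vcs[port]: free VCs at the downstream neighbour (local)
        # remote_congestion[(port, dest)]: propagated estimate toward dest
        def score(port):
            return remote_congestion.get((port, dest), 0) - local_free_vcs[port]
        return min(admissible_ports, key=score)

Keying the estimate by destination is what keeps congestion caused by one application from steering traffic of another, unrelated application.
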
Accelerating Network-on-Chip Simulation via Sampling
"... Abstract—Architectural complexity continues to grow as we consider the large design space of multiple cores, cache architectures, networks-on-chip and memory controllers for emerging architectures. Simulators are growing in complexity to reflect each of these system components. However, many full-sy ..."
Abstract
- Add to MetaCart
(Show Context)
Architectural complexity continues to grow as we consider the large design space of multiple cores, cache architectures, networks-on-chip, and memory controllers for emerging architectures. Simulators are growing in complexity to reflect each of these system components. However, many full-system simulators fail to take advantage of underlying hardware resources such as multiple cores; as a result, simulation times have grown significantly in recent years. Long turnaround times limit the range and depth of design space exploration that is tractable. Communication has emerged as a first-class design consideration and has led to significant research into networks-on-chip (NoCs). The NoC is yet another component of the architecture that must be faithfully modeled in simulation. Given its importance, we focus on accelerating NoC simulation through the use of sampling techniques; sampling can provide both accurate results and fast evaluation. We propose NoCLabs and NoCPoint, two sampling methodologies utilizing statistical sampling theory and traffic phase behavior, respectively. Experimental results show that NoCLabs and NoCPoint estimate NoC performance with an average error of 5% while achieving one order of magnitude speedup on average.
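
The statistical-sampling half of the idea can be sketched simply: measure short windows at random offsets and report the mean packet latency with a confidence interval. simulate_window is an assumed hook returning the mean latency of one measured window; the window placement and interval math below illustrate the principle, not NoCLabs itself.

    import math
    import random

    # Estimate mean NoC packet latency from n randomly placed windows.

    def sample_latency(simulate_window, total_cycles, window, n, z=1.96, seed=0):
        rng = random.Random(seed)
        xs = [simulate_window(rng.randrange(total_cycles - window), window)
              for _ in range(n)]
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / (n - 1)   # sample variance
        half = z * math.sqrt(var / n)                      # 95% CI half-width
        return mean, half

The phase-based companion (NoCPoint, per the abstract) would instead place windows to cover each traffic phase, weighting results by phase duration.
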