Design and evaluation of hybrid fault-detection systems
- In Proceedings of the 32nd Annual International Symposium on Computer Architecture
, 2005
Cited by 31 (5 self)
As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. Up to now, system designers have primarily considered hardware-only and software-only fault-detection mechanisms to identify and mitigate the deleterious effects of transient faults. These two fault-detection systems, however, are extremes in the design space, representing sharp trade-offs between hardware cost, reliability, and performance. In this paper, we identify hybrid hardware/software fault-detection mechanisms as promising alternatives to hardware-only and software-only systems. These hybrid systems offer designers more options to fit their reliability needs within their hardware and performance budgets. We propose and evaluate CRAFT, a suite of three such hybrid techniques, to illustrate the potential of the hybrid approach. For fair, quantitative comparisons among hardware, software, and hybrid systems, we introduce a new metric, Mean Work To Failure, which is able to compare systems for which machine instructions do not represent a constant unit of work. Additionally, we present a new simulation framework which rapidly assesses reliability and does not depend on manual identification of failure modes. Our evaluation illustrates that CRAFT, and hybrid techniques in general, offer attractive options in the fault-detection design space.
Rapid development of a flexible validated processor model
- In Proceedings of the 2005 Workshop on Modeling, Benchmarking, and Simulation
, 2005
Cited by 22 (11 self)
For a variety of reasons, most architectural evaluations use simulation models. An accurate baseline model validated against existing hardware provides confidence in the results of these evaluations. Meanwhile, a meaningful exploration of the design space requires a wide range of quickly-obtainable variations of the baseline. Unfortunately, these two goals are generally considered to be at odds; the set of validated models is considered exclusive of the set of easily malleable models. Vachharajani et al. challenge this belief and propose a modeling methodology they claim allows rapid construction of flexible validated models. Unfortunately, they only present anecdotal and secondary evidence to support their claims. In this paper, we present our experience using this methodology to construct a validated flexible model of Intel’s Itanium 2 processor. Our practical experience lends support to the above claims. Our initial model was constructed by a single researcher in only 11 weeks and predicts processor cycles-per-instruction (CPI) to within 7.9% on average for the entire SPEC CINT2000 benchmark suite. Our experience with this model showed us that aggregate accuracy for a metric like CPI is not sufficient. Aggregate measures like CPI may conceal remaining internal “offsetting errors” which can adversely affect conclusions drawn from the model. Using this as our motivation, we explore the flexibility of the model by modifying it to target specific error constituents, such as front-end stall errors. In 2 1/2 person-weeks, average CPI error was reduced to 5.4%. The targeted error constituents were reduced more dramatically; front-end stall errors were reduced from 5.6% to 1.6%. The swift implementation of significant new architectural features on this model further demonstrated its flexibility.
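The point about "offsetting errors" can be made concrete with a small worked example. The sketch below is only an illustration: the component names and all numbers are hypothetical and are not taken from the paper. It shows how a model can overestimate one CPI component and underestimate another, so that the aggregate CPI error looks small while individual components are noticeably off.

```python
# Hypothetical CPI breakdown (cycles per instruction); NOT data from the paper.
measured = {"front_end_stalls": 0.30, "memory_stalls": 0.50, "execution": 0.70}
modeled = {"front_end_stalls": 0.38, "memory_stalls": 0.43, "execution": 0.70}


def aggregate_cpi_error(model, reference):
    """Relative error of total CPI (sum of all components)."""
    total_ref = sum(reference.values())
    total_model = sum(model.values())
    return abs(total_model - total_ref) / total_ref


def component_errors(model, reference):
    """Per-component absolute error, expressed as a fraction of measured total CPI."""
    total_ref = sum(reference.values())
    return {name: abs(model[name] - reference[name]) / total_ref for name in reference}


if __name__ == "__main__":
    print(f"aggregate CPI error: {aggregate_cpi_error(modeled, measured):.2%}")
    for name, err in component_errors(modeled, measured).items():
        print(f"{name}: {err:.2%}")
```

With these made-up numbers, the front-end overestimate and the memory underestimate nearly cancel, which is why checking only the aggregate figure can be misleading.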
Software-controlled fault tolerance
- ACM Transactions on Architecture and Code Optimization
, 2005
Cited by 17 (1 self)
Traditional fault tolerance techniques typically utilize resources ineffectively because they cannot adapt to the changing reliability and performance demands of a system. This paper proposes software-controlled fault tolerance, a concept allowing designers and users to tailor their performance and reliability for each situation. Several software-controllable fault detection techniques are then presented: SWIFT, a software-only technique, and CRAFT, a suite of hybrid hardware/software techniques. Finally, the paper introduces PROFiT, a technique which adjusts the level of protection and performance at fine granularities through software control. When coupled with software-controllable techniques like SWIFT and CRAFT, PROFiT offers attractive and novel reliability options.
FPGA-based Fast, Cycle-Accurate, Full-System Simulators
- In Proceedings of the second Workshop on Architecture Research using FPGA Platforms, held in conjunction with HPCA-12
, 2006
Cited by 10 (1 self)
An ideal computer simulator is (i) fast, (ii) accurate to cycle-level resolution, (iii) complete, modeling the entire system and running unmodified applications and operating systems, (iv) transparent, providing visibility into all aspects of the system with minimum impact to simulation performance, (v) inexpensive, and (vi) easy to create, extend and modify. Conventional wisdom says that no simulator can simultaneously have all these properties [1] and none currently does. Instead, simulators are specialized, emphasizing some desired properties over others. For example, architectural simulators traditionally trade speed for accuracy while full-system simulators traditionally trade accuracy for speed. This paper describes an approach to simulation that potentially has all of the characteristics of an ideal simulator listed above. It achieves its capabilities by partitioning a simulator into a software component and a hardware component implemented in FPGAs. The resulting simulators are capable of 1M to 100M cycles per second, full cycle-accuracy, the ability to run unmodified applications and operating systems, and full visibility at a reasonable price. Such a simulator could potentially result in simulator convergence, where different groups can use the same simulation infrastructure, resulting in more coherent architectures, implementations and software.
An OS-Based Alternative to Full Hardware Coherence on Tiled CMPs
Cited by 10 (0 self)
The interconnect mechanisms (shared bus or crossbar) used in current chip-multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling to a larger number of cores. Tiled CMPs offer better scalability by integrating relatively simple cores with a lightweight point-to-point interconnect. However, such interconnects make snooping impractical and, thus, require alternative solutions to cache coherence. This paper proposes a novel, cost-effective mechanism to support shared-memory parallel applications that forgoes hardware-maintained cache coherence. The proposed mechanism is based on the key ideas that mapping of lines to physical caches is done at the page level with OS support and that hardware supports remote cache accesses. It allows only some controlled migration and replication of data and provides a sufficient degree of flexibility in the mapping through an extra level of indirection between virtual pages and physical tiles. We evaluate the proposed tiled CMP architecture on the Splash-2 scientific benchmarks and ALPBench multimedia benchmarks against one with private caches and a distributed directory cache coherence mechanism. Experimental results show that the performance degradation is as little as 0%, and 16% on average, compared to the cache coherent architecture across all benchmarks for 16 and 32 processors.
Microarchitecture Modeling for Design-Space Exploration
, 2004
Cited by 9 (2 self)
To identify the best processor designs, designers explore a vast design space. To assess the quality of candidate designs, designers construct and use simulators. Unfortunately, simulator construction is a bottleneck in this design-space exploration because existing simulator construction methodologies lead to long simulator development times. This bottleneck limits exploration to a small set of designs, potentially diminishing quality of the final design.
The Liberty Simulation Environment, Version 1.0
- PERFORMANCE EVALUATION REVIEW: SPECIAL ISSUE ON TOOLS FOR ARCHITECTURE RESEARCH
, 2004
Cited by 6 (1 self)
High-level hardware modeling via simulation is an essential step in hardware systems design and research. Despite the importance of simulation, current model creation methods are error-prone and unnecessarily time-consuming. To address these problems, we have publicly released the Liberty Simulation Environment (LSE), Version 1.0, consisting of a simulator builder and automatic visualizer based on a shared hardware description language. LSE's design was motivated by a careful analysis of the strengths and weaknesses of existing systems. This has resulted in a system in which models are easier to understand, faster to develop, and have performance on par with other systems. LSE is capable of modeling any synchronous hardware system. To date, LSE has been used to simulate and convey ideas about a diverse set of complex systems including a chip multiprocessor out-of-order IA64 machine and a multiprocessor system with detailed device models.
Separate compilation of synchronous modules
- In International Conference on Embedded Software and Systems (ICESS)
, 2005
Cited by 6 (1 self)
Synchronous models are useful for designing real-time embedded systems because they provide timing control and deterministic concurrency. However, the semantics of such models usually require an entire system to be compiled at once to analyze the dependencies among modules. The alternative is to write modules that can respond when the values of some of their inputs are unknown, a tedious and error-prone process. We present a compilation technique that allows a programmer to describe synchronous modules without having to consider undefined inputs. Our algorithm transforms such a description into code that does as much as it can with undefined inputs, allowing modules to be compiled separately and assembled later. We implemented our technique in a compiler for the Esterel language and present results that show the technique does not impose a substantial overhead.
Pipelined Multithreading Transformations and Support Mechanisms
- Princeton University
, 2007
Cited by 4 (0 self)
Even though chip multiprocessors have emerged as the predominant organization for future microprocessors, the multiple on-chip cores do not directly result in improved application performance (especially for legacy applications, which are predominantly sequential C/C++ codes). Consequently, parallelizing applications to execute on multiple cores is essential to their success. Independent multithreading techniques, like DOALL extraction, create partially or fully independent threads, which communicate rarely, if at all. While such strategies keep high inter-thread communication costs from impacting program performance, they cannot be applied to parallelize general-purpose applications which are characterized by difficult-to-break recurrences. Even though cyclic multithreading techniques, such as DOACROSS, are more applicable, the cyclic inter-thread dependences created by these techniques cause them to have very low tolerance to rising inter-core latencies.
To address these problems, this work introduces a pipelined multithreading (PMT) transformation called Decoupled Software Pipelining (DSWP). DSWP, in particular, and PMT techniques, in general, are able to tolerate inter-core latencies, while still handling codes with complex recurrences. They achieve this by enforcing an acyclic communication discipline amongst threads, which allows threads to use inter-thread queues to communicate in a pipelined fashion. This dissertation demonstrates that DSWPed codes not only tolerate inter-core communication costs, but also effectively tolerate variable latency stalls in applications better than single-threaded execution on both in-order and out-of-order issue processors with comparable resources. It then performs a thorough analysis of the performance scalability of automatically generated DSWPed codes and identifies the conditions necessary to achieve peak PMT performance.
Next, the dissertation shows that even though PMT applications tolerate inter-core latencies well, the high frequency of inter-thread communication (once every 5 to 20 dynamic instructions) in these codes makes them very sensitive to the intra-thread overhead imposed by communication operations. In order to understand the issues surrounding inter-thread communication for PMT applications, this dissertation undertakes a methodical exploration of the design space of communication support options for PMT. Three new communication mechanisms with varying cost-performance tradeoffs are introduced and are shown to perform 38% to 200% better than the state of the art.
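The acyclic, queue-based communication discipline described in this abstract can be sketched in a few lines. The example below is a hand-written illustration, not the dissertation's compiler transformation: the function names (`traverse`, `compute`, `run_pipeline`), the linked-list workload, and the queue size are all assumptions chosen for the sketch. One thread carries the loop's recurrence (the pointer chase) and only produces into a queue; the other thread only consumes, so no dependence ever flows backwards.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream (an assumption of this sketch)


def traverse(head, out_q):
    """Stage 1: holds the loop-carried recurrence (node = node.next)."""
    node = head
    while node is not None:
        out_q.put(node["value"])  # one-way communication to the next stage
        node = node["next"]
    out_q.put(SENTINEL)


def compute(in_q, results):
    """Stage 2: per-element work; never sends anything back to stage 1."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append(item * item)


def run_pipeline(values):
    # Build a simple linked list so stage 1 has a real pointer-chasing recurrence.
    head = None
    for v in reversed(values):
        head = {"value": v, "next": head}

    q = queue.Queue(maxsize=32)  # bounded inter-thread queue decouples the stages
    results = []
    stage1 = threading.Thread(target=traverse, args=(head, q))
    stage2 = threading.Thread(target=compute, args=(q, results))
    stage1.start()
    stage2.start()
    stage1.join()
    stage2.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3, 4]))
```

Because the producer never waits on a value from the consumer, extra latency on the queue delays the pipeline's fill time but does not sit on the critical path of every iteration. In CPython this only illustrates the structure; the global interpreter lock means it should not be expected to show a speedup.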
Performance Scalability of Decoupled Software Pipelining
Cited by 2 (0 self)
Any successful solution to using multicore processors to scale general-purpose program performance will have to contend with rising intercore communication costs while exposing coarse-grained parallelism. Recently proposed pipelined multithreading (PMT) techniques have been demonstrated to have general-purpose applicability and are also able to effectively tolerate intercore latencies through pipelined interthread communication. These desirable properties make PMT techniques strong candidates for program parallelization on current and future multicore processors, and understanding their performance characteristics is critical to their deployment. To that end, this paper evaluates the performance scalability of a general-purpose PMT technique called decoupled software pipelining (DSWP) and presents a thorough analysis of the communication bottlenecks that must be overcome for optimal DSWP scalability.

