AccMon: Automatically Detecting Memory-related Bugs via Program Counter-based Invariants
- In 37th International Symposium on Microarchitecture (MICRO), 2004
Cited by 47 (10 self)
This paper makes two contributions to architectural support for software debugging. First, it proposes a novel statistics-based, on-the-fly bug detection method called PC-based invariant detection. The idea is based on the observation that, in most programs, a given memory location is typically accessed by only a few instructions. Therefore, by capturing the invariant of the set of PCs that normally access a given variable, we can detect accesses by outlier instructions, which are often caused by memory corruption, buffer overflow, stack smashing or other memory-related bugs. Since this method is statistics-based, it can detect bugs that do not violate any programming rules and that, therefore, are likely to be missed by many existing tools. The second contribution is a novel architectural extension called the Check Look-aside Buffer (CLB). The CLB uses a Bloom filter to reduce monitoring overheads in the recently proposed iWatcher architectural framework for software debugging. The CLB significantly reduces the overhead of PC-based invariant debugging.
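The idea of PC-based invariant detection can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the class name, method names, and addresses are hypothetical, and the rule that unmonitored locations are never flagged is an assumption made to keep the example short.

```python
from collections import defaultdict


class PCInvariantMonitor:
    """Toy software model of PC-based invariant detection (hypothetical names)."""

    def __init__(self):
        # Maps a memory address to the set of PCs seen accessing it in training.
        self.allowed_pcs = defaultdict(set)

    def train(self, address, pc):
        """Record that the instruction at `pc` normally accesses `address`."""
        self.allowed_pcs[address].add(pc)

    def check(self, address, pc):
        """Return True if the access is consistent with the learned invariant."""
        known = self.allowed_pcs.get(address)
        if known is None:
            # Assumption: locations with no training data are not monitored.
            return True
        return pc in known


monitor = PCInvariantMonitor()
for pc in (0x400100, 0x400180):
    monitor.train(0x7FFF0010, pc)

normal_access = monitor.check(0x7FFF0010, 0x400100)   # PC seen in training
outlier_access = monitor.check(0x7FFF0010, 0x400999)  # PC never seen before
```

An access that fails the check is only a candidate bug: the method is statistical, so an outlier PC may also be a legitimate but rare access.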
RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence
- 2005
Cited by 39 (3 self)
It has been shown that many requests miss in all remote nodes in shared memory multiprocessors. We are motivated by the observation that this behavior extends to much coarser grain areas of memory. We define a region to be a continuous, aligned memory area whose size is a power of two and observe that many requests find that no other node caches a block in the same region, even for regions as large as 16K bytes. We propose RegionScout, a family of simple filter mechanisms that dynamically detect most non-shared regions. A node with a RegionScout filter can determine in advance that a request will miss in all remote nodes. RegionScout filters are implemented as a layered extension over existing snoop-based coherence systems. They require no changes to existing coherence protocols or caches and impose no constraints on what can be cached simultaneously. Their operation is completely transparent to software and the operating system. RegionScout filters require little additional storage and a single additional global signal. These characteristics are made possible by utilizing imprecise information about the regions cached in each node. Since they rely on dynamically collected information, RegionScout filters can adapt to changing sharing patterns. We present two applications of RegionScout: in the first, RegionScout is used to avoid broadcasts for non-shared regions, thus reducing bandwidth; in the second, RegionScout is used to avoid snoop-induced tag lookups, thus reducing energy.
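Because a region is an aligned, power-of-two-sized area, the region an address belongs to can be found with a single shift. The sketch below shows that mapping and a simple per-node record of regions believed to be non-shared. It is an illustration only: the class, its method names, and the rule for discarding an entry when a remote request is observed are assumptions, not the structures described in the paper.

```python
REGION_BYTES = 16 * 1024  # 16K-byte regions, the largest size the abstract mentions


def region_of(address, region_bytes=REGION_BYTES):
    """Return the index of the aligned power-of-two region containing `address`."""
    if region_bytes <= 0 or region_bytes & (region_bytes - 1):
        raise ValueError("region size must be a power of two")
    return address >> (region_bytes.bit_length() - 1)


class NonSharedRegionFilter:
    """Hypothetical per-node record of regions believed to be non-shared."""

    def __init__(self):
        self.non_shared = set()

    def record_global_miss(self, address):
        # An earlier broadcast found no remote node caching this region.
        self.non_shared.add(region_of(address))

    def observe_remote_request(self, address):
        # Assumption: another node touching the region invalidates the entry.
        self.non_shared.discard(region_of(address))

    def can_skip_broadcast(self, address):
        return region_of(address) in self.non_shared


node_filter = NonSharedRegionFilter()
node_filter.record_global_miss(0x4000)
skip_same_region = node_filter.can_skip_broadcast(0x7FFF)   # same 16K region
skip_other_region = node_filter.can_skip_broadcast(0x8000)  # next region
```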
Interplay of Energy and Performance for Disk Arrays Running Transaction Processing Workloads
- In Proceedings of the International Symposium on Performance Analysis of Systems and Software, 2003
Cited by 31 (6 self)
The growth of business enterprises and the emergence of the Internet as a medium for data processing have led to a proliferation of applications that are server-centric. The power dissipation of such servers has a major consequence not only on the costs and environmental concerns of power generation and delivery, but also on their reliability and on the design of cooling and packaging mechanisms for these systems. This paper examines the energy and performance ramifications in the design of disk arrays, which consume a major portion of the power in transaction processing environments. Using traces of TPC-C and TPC-H running on commercial servers, we conduct in-depth simulations of energy and performance behavior of disk arrays with different RAID configurations. Our results demonstrate that conventional disk power optimizations that have been previously proposed and evaluated for single disk systems (laptops/workstations) are not very effective in server environments, even if we can design disks that have extremely fast spinup/spindown latencies and predict the idle periods accurately. On the other hand, tuning RAID parameters (RAID type, number of disks, stripe size, etc.) has more impact on the power and performance behavior of these systems, sometimes having opposite effects on these two criteria.
Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking
- In Proc. of the 32nd Annual International Symposium on Computer Architecture, 2005
Cited by 29 (5 self)
To maintain coherence in conventional shared-memory multiprocessor systems, processors first check other processors’ caches before obtaining data from memory. This coherence checking adds latency to memory requests and leads to large amounts of interconnect traffic in broadcast-based systems. Our results for a set of commercial, scientific and multiprogrammed workloads show that on average 67% (and up to 94%) of broadcasts are unnecessary. Coarse-Grain Coherence Tracking is a new technique that supplements a conventional coherence mechanism and optimizes the performance of coherence enforcement. The Coarse-Grain Coherence mechanism monitors the coherence status of large regions of memory, and uses that information to avoid unnecessary broadcasts. Coarse-Grain Coherence Tracking is shown to eliminate 55-97% of the unnecessary broadcasts, and improve performance by 8.8% on average (and up to 21.7%).
Flexible Snooping: Adaptive forwarding and filtering of snoops in embedded-ring multiprocessors
- In Proceedings of the 33rd International Symposium on Computer Architecture, 2006
Cited by 21 (3 self)
A simple and low-cost approach to supporting snoopy cache coherence is to logically embed a unidirectional ring in the network of a multiprocessor, and use it to transfer snoop messages. Other messages can use any link in the network. While this scheme works for any network topology, a naive implementation may result in long response times or in many snoop messages and snoop operations. To address this problem, this paper proposes Flexible Snooping algorithms, a family of adaptive forwarding and filtering snooping algorithms. In these algorithms, a node receiving a snoop request may either forward it to another node and then perform the snoop, or snoop and then forward it, or simply forward it without snooping. The resulting design space offers trade-offs in number of snoop operations and messages, response time, and energy consumption. Our analysis using SPLASH-2, SPECjbb, and SPECweb workloads finds several snooping algorithms that are more cost-effective than current ones. Specifically, our choice for a high-performance snooping algorithm is faster than the currently fastest algorithm while consuming 9-17% less energy; our choice for an energy-efficient algorithm is only 3-6% slower than the previous one while consuming 36-42% less energy.
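The three per-node choices named in the abstract can be written down as a small enumeration. The sketch below is illustrative only: the abstract does not say how a node decides between the options, so `choose_action` and its two boolean hints are a hypothetical policy invented for this example.

```python
from enum import Enum


class SnoopAction(Enum):
    FORWARD_THEN_SNOOP = "forward_then_snoop"
    SNOOP_THEN_FORWARD = "snoop_then_forward"
    FORWARD_ONLY = "forward_only"


def steps_for(action):
    """Return the ordered operations a node performs for a snoop request."""
    if action is SnoopAction.FORWARD_THEN_SNOOP:
        return ["forward", "snoop"]
    if action is SnoopAction.SNOOP_THEN_FORWARD:
        return ["snoop", "forward"]
    return ["forward"]


def choose_action(likely_supplier, surely_absent):
    """Hypothetical policy: the hints stand in for whatever a node can predict."""
    if surely_absent:
        # Snooping cannot find the line here, so skip the snoop operation.
        return SnoopAction.FORWARD_ONLY
    if likely_supplier:
        # Snoop first so the forwarded message can carry the outcome.
        return SnoopAction.SNOOP_THEN_FORWARD
    # Otherwise forward first to keep the request moving around the ring.
    return SnoopAction.FORWARD_THEN_SNOOP


plan = steps_for(choose_action(likely_supplier=True, surely_absent=False))
```

Forwarding first favors response time, snooping first can save later snoop operations, and forwarding without snooping saves energy; these are the trade-offs the abstract describes.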
TLB and snoop energy-reduction using virtual caches
- In Proceedings of International Symposium on Low Power Electronics and Design, 2002
Cited by 20 (2 self)
In our quest to bring down the power consumption in low-power chip-multiprocessors, we have found that TLB and snoop accesses account for about 40% of the energy wasted by all L1 data-cache accesses. We have investigated the prospects of using virtual caches to bring down the number of TLB accesses. A key observation is that while the energy wasted in the TLBs is cut, the energy associated with snoop accesses becomes higher. We then contribute two techniques to reduce the number of snoop accesses and their energy cost. Virtual caches together with the proposed techniques are shown to reduce the energy wasted in the L1 caches and the TLBs by about 30%.
The Thrifty Barrier: Energy-aware synchronization in shared-memory multiprocessors
- In International Symposium on High-Performance Computer Architecture, 2004
Cited by 18 (1 self)
Much research has been devoted to making microprocessors energy-efficient. However, little attention has been paid to multiprocessor environments where, due to the co-operative nature of the computation, the most energy-efficient execution in each processor may not translate into the most energy-efficient overall execution. We present the thrifty barrier, a hardware-software approach to saving energy in parallel applications that exhibit barrier synchronization imbalance. Threads that arrive early to a thrifty barrier pick among existing low-power processor sleep states based on predicted barrier stall time and other factors. We leverage the coherence protocol and propose small hardware extensions to achieve timely wake-up of these dormant threads, maximizing energy savings while minimizing the impact on performance.
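The selection step, picking a sleep state from a predicted stall time, can be sketched as follows. Everything concrete here is an assumption made for illustration: the state names, power figures, and transition latencies are invented, and the rule (take the lowest-power state whose enter-plus-exit latency fits within the predicted stall) covers only the stall-time factor, not the "other factors" the abstract mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SleepState:
    name: str
    power_watts: float    # hypothetical power drawn while in this state
    transition_us: float  # hypothetical time to enter plus exit the state


# Invented example values; "spin" models staying awake at the barrier.
SLEEP_STATES = [
    SleepState("spin", 10.0, 0.0),
    SleepState("light_sleep", 4.0, 20.0),
    SleepState("deep_sleep", 1.0, 200.0),
]


def pick_sleep_state(predicted_stall_us, states=SLEEP_STATES):
    """Pick the lowest-power state whose transition fits in the predicted stall."""
    budget = max(predicted_stall_us, 0.0)
    # The default list always contains a zero-latency state, so this is non-empty.
    feasible = [s for s in states if s.transition_us <= budget]
    return min(feasible, key=lambda s: s.power_watts)


short_stall = pick_sleep_state(5.0).name
medium_stall = pick_sleep_state(50.0).name
long_stall = pick_sleep_state(1000.0).name
```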
HARD: Hardware-Assisted Lockset-based Race Detection
Cited by 18 (1 self)
The emergence of multicore architectures will lead to an increase in the use of multithreaded applications that are prone to synchronization bugs, such as data races. Software solutions for detecting data races generally incur large overheads. Hardware support for race detection can significantly reduce that overhead. However, all existing hardware proposals for race detection are based on the happens-before algorithm, which is sensitive to thread interleaving and cannot detect races that are not exposed during the monitored run. The lockset algorithm addresses this limitation. Unfortunately, due to challenging issues such as storing the lockset information and performing complex set operations, so far it has been implemented only in software, with a 10-30 times performance hit. This paper proposes the first hardware implementation (called HARD) of the lockset algorithm to exploit the race detection capability of this algorithm with minimal overhead. HARD efficiently stores lock sets in hardware Bloom filters and converts the expensive set operations into fast bit-wise logic operations with negligible overhead. We evaluate HARD using six SPLASH-2 applications with 60 randomly injected bugs. Our results show that HARD can detect 54 out of 60 tested bugs, 20% more than happens-before, with only 0.1-2.6% execution overhead. We also show our hardware design is cost-effective by comparing with the ideal lockset implementation, which would require a large amount of hardware resources.
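How a lock set can become a bit vector, and set intersection a bit-wise AND, is sketched below. This is a simplified software illustration, not HARD's design: the vector width, the one-bit hash, and the class are hypothetical, and the per-variable state machine that lockset detectors commonly use to tolerate initialization is omitted.

```python
VECTOR_BITS = 16                    # hypothetical Bloom-filter width
ALL_ONES = (1 << VECTOR_BITS) - 1   # "every lock is still a candidate"


def lock_bits(lock_addr):
    """Hypothetical hash: map a lock address to a single bit of the vector."""
    return 1 << ((lock_addr >> 2) % VECTOR_BITS)


def lockset_vector(held_locks):
    """Encode the set of held locks as a bit vector (bit-wise OR of hashes)."""
    vector = 0
    for lock_addr in held_locks:
        vector |= lock_bits(lock_addr)
    return vector


class LocksetChecker:
    """Simplified lockset check: intersect candidate locks on every access."""

    def __init__(self):
        self.candidates = {}  # variable address -> candidate-lock bit vector

    def access(self, var_addr, held_locks):
        """Return True if no common lock remains, i.e. a potential data race."""
        current = self.candidates.get(var_addr, ALL_ONES)
        updated = current & lockset_vector(held_locks)  # intersection as AND
        self.candidates[var_addr] = updated
        return updated == 0


LOCK_A, LOCK_B, SHARED_VAR = 0x1000, 0x2004, 0x9000
checker = LocksetChecker()
first = checker.access(SHARED_VAR, [LOCK_A, LOCK_B])  # candidates: {A, B}
second = checker.access(SHARED_VAR, [LOCK_A])         # candidates: {A}
third = checker.access(SHARED_VAR, [LOCK_B])          # candidates: empty
```

Because different locks can hash to the same bit, an intersection may look non-empty when the real lock sets are disjoint, so an encoding like this trades some missed races for cheap storage and fast operations.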
Evaluation of Snoop-Energy Reduction Techniques for Chip-Multiprocessors
- Workshop on Duplicating, Deconstructing, and Debunking, 2002
Cited by 15 (0 self)
Chip multiprocessors (CMPs) have become an interesting micro-architectural style for high-end systems as well as low-power systems. While power-performance tradeoffs differ in these systems, high power consumption can lead to devastating power densities in the former and a reduced operating time in the latter owing to limited battery capacity. In this paper, we focus on the energy wasted in the snoopy cache protocols that keep the L1 caches in CMPs consistent. Previous studies have focussed on the energy wasted by snoop accesses in the private caches in SMP systems and found that it can be a big fraction of the total energy. We apply two techniques, serial snooping and Jetty, that were developed for SMP servers and see if they can lead to energy savings in a CMP. We find that the techniques
Power-performance considerations of parallel computing on chip multiprocessors
- ACM Transactions on Architecture and Code Optimization, 2005
Cited by 13 (0 self)
This paper looks at the power-performance implications of running parallel applications on chip multiprocessors (CMPs). First, we develop an analytical model that, for the first time, puts together parallel efficiency, granularity of parallelism, and voltage/frequency scaling, to establish a formal connection with the power consumption and performance of a parallel code running on a CMP. Then, we conduct detailed simulations of parallel applications running on a detailed power-performance CMP model to confirm the analytical results and provide further insights. Both analytical and experimental models show that parallel computing can bring significant power savings and still meet a given performance target, by choosing granularity and voltage/frequency levels judiciously. The particular choice, however, is dependent on the application’s parallel efficiency curve and the process technology utilized, which our model captures. Likewise, the analytical model and experiments show the effect of a limited power budget on the application’s scalability curve. In particular, we show that a limited power budget can cause a rapid performance degradation beyond a number of cores, even in the case of applications with excellent scalability properties. On the other hand, our experiments show that, when a limited power budget is in place, power-thrifty memory-bound applications may actually enjoy better scalability than more compute-intensive codes, even if the latter would exhibit higher scalability in a power-unconstrained scenario.

