Results 1 - 10 of 27
Larrabee: a many-core x86 architecture for visual computing
- In SIGGRAPH ’08: ACM SIGGRAPH 2008 papers, 2008
- Cited by 104 (6 self)
This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this
Heat-and-Run: Leveraging SMT and CMP to Manage Power Density Through the Operating System
- In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, 2004
- Cited by 83 (0 self)
Power density in high-performance processors continues to increase with technology generations as scaling of current, clock speed, and device density outpaces the downscaling of supply voltage and the thermal ability of packages to dissipate heat. Power density is characterized by localized chip hot spots that can reach critical temperatures and cause failure. Previous architectural approaches to power density have used global clock gating, fetch toggling, dynamic frequency scaling, or resource duplication to either prevent heating or relieve overheated resources in a superscalar processor. Previous approaches also evaluate design technologies where power density is not a major problem and most applications do not overheat the processor. Future processors, however, are likely to be chip multiprocessors (CMPs) with simultaneously multithreaded (SMT) cores. SMT CMPs pose unique challenges and opportunities for power density. SMT and CMP increase throughput and thus on-chip heat, but also provide natural granularities for managing power density. This paper is the first work to leverage SMT and CMP to address power density. We propose heat-and-run SMT thread assignment to increase processor-resource utilization before cooling becomes necessary by co-scheduling threads that use complementary resources. We propose heat-and-run CMP thread migration to migrate threads away from overheated cores and assign them to free SMT contexts on alternate cores, leveraging availability of SMT contexts on alternate CMP cores to maintain throughput while allowing overheated cores to cool. We show that our proposal has an average of 9% and up to 34% higher throughput than a previous superscalar technique running the same number of threads.
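The migration step described in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `Core` class, the `migrate_from_hot_cores` function, the temperature threshold, and the "pick the coolest core with a free context" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical model of one CMP core with several SMT contexts."""
    core_id: int
    temperature: float            # assumed sensor reading, degrees Celsius
    smt_contexts: int             # number of hardware thread slots
    threads: list = field(default_factory=list)

    def free_contexts(self):
        return self.smt_contexts - len(self.threads)


def migrate_from_hot_cores(cores, threshold):
    """Move threads off cores at or above `threshold` onto free SMT
    contexts of cooler cores. Returns (thread, source, target) tuples."""
    moves = []
    for hot in cores:
        if hot.temperature < threshold:
            continue
        for thread in list(hot.threads):
            # Candidate targets: cooler cores that still have a free context.
            candidates = [c for c in cores
                          if c.temperature < threshold and c.free_contexts() > 0]
            if not candidates:
                break  # nowhere to go; the thread stays on the hot core
            target = min(candidates, key=lambda c: c.temperature)
            hot.threads.remove(thread)
            target.threads.append(thread)
            moves.append((thread, hot.core_id, target.core_id))
    return moves


cores = [
    Core(0, 90.0, 2, ["a", "b"]),
    Core(1, 60.0, 2, ["c"]),
    Core(2, 55.0, 2, []),
]
moves = migrate_from_hot_cores(cores, threshold=85.0)
```

In this toy run both threads leave core 0 for core 2, the coolest core with free contexts. A real policy would also update temperatures over time and weigh migration cost, which the sketch ignores.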
Region-Based Caching: An Energy-Delay Efficient Memory Architecture for Embedded Processors
- In CASES ’00: Proceedings of the 2000 international conference on Compilers, architecture, and, 2000
- Cited by 27 (7 self)
Power consumption has been a major concern in designing microprocessors for portable systems such as notebook computers, hand-held computing and personal telecommunication devices. As these devices increase in popularity and are used in a wider range of applications, a low-power design becomes more critical. In this paper, we propose a new microarchitectural data cache design called region-based caching that can reduce power consumption. Power savings are achieved by reorganizing the first-level cache to more efficiently exploit memory reference characteristics produced by programming language semantics. These characteristics enable the cache to be partitioned by memory region (stack, global, heap), reducing power consumption while retaining comparable performance to a conventional cache design. Applications from the MediaBench benchmark suite indicate that a design with two additional small region-based caches results in a 66% reduction on average in energy-delay product.
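A rough sketch of the partitioning idea follows: each access is routed to a cache chosen by its memory region. The address boundaries, cache sizes, class names, and the choice to give the stack and global regions the two small caches while the heap uses the larger one are all assumptions for illustration, not details taken from the paper.

```python
class DirectMappedCache:
    """Tiny direct-mapped cache model that only tracks tags and hit counts."""

    def __init__(self, num_lines, line_size=32):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.line_size
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag   # fill the line on a miss
        self.misses += 1
        return False


# Hypothetical memory map; real layouts depend on the ABI and the OS.
GLOBAL_BASE = 0x1000_0000
GLOBAL_END = 0x2000_0000
STACK_BASE = 0x7000_0000


def region_of(address):
    if address >= STACK_BASE:
        return "stack"
    if GLOBAL_BASE <= address < GLOBAL_END:
        return "global"
    return "heap"


class RegionBasedCache:
    """Routes each access to the cache that serves its memory region."""

    def __init__(self):
        self.caches = {
            "stack": DirectMappedCache(16),    # small region cache (assumed size)
            "global": DirectMappedCache(16),   # small region cache (assumed size)
            "heap": DirectMappedCache(256),    # stands in for the main L1
        }

    def access(self, address):
        region = region_of(address)
        return region, self.caches[region].access(address)


cache = RegionBasedCache()
first = cache.access(0x7000_0010)    # stack access, cold miss
second = cache.access(0x7000_0000)   # same 32-byte line, so a hit
```

The intended power benefit is that a stack or global access activates only a small structure rather than the full first-level cache; the model above captures the routing but not the energy accounting.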
System-Level Power-Aware Design Techniques in Real-Time Systems
- Proceedings of the IEEE, 2003
- Cited by 18 (0 self)
Power and energy consumption has recently become an important issue and, consequently, power-aware techniques are being devised at all levels of system design, from the circuit and device level to the architectural, compiler, operating system and networking layers. In this survey we concentrate on power-aware design techniques for real-time systems. While the main focus is on hard real-time, soft real-time systems are considered as well. We start with the motivation for focusing on these systems and provide a brief discussion on power and energy objectives. We then follow with a survey of current research on a layer-by-layer basis. We conclude with illustrative examples and open research challenges. This work provides an overview of power-aware techniques for the real-time system engineer as well as an up-to-date reference list for the researcher.
Power reduction techniques for microprocessor systems
- ACM Computing Surveys, 2005
- Cited by 15 (1 self)
Power consumption is a major factor that limits the performance of computers. We survey the “state of the art” in techniques that reduce the total power consumed by a microprocessor system over time. These techniques are applied at various levels ranging from circuits to architectures, architectures to system software, and system
Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors
- In Computer Architecture Letters, Vol, 2005
- Cited by 14 (1 self)
This paper evaluates asymmetric cluster chip multiprocessor (ACCMP) architectures as a mechanism to achieve the highest performance for a given power budget. ACCMPs execute serial phases of multithreaded programs on large high-performance cores whereas parallel phases are executed on a mix of large and many small simple cores. Theoretical analysis reveals a performance upper bound for symmetric multiprocessors, which is surpassed by asymmetric configurations at certain power ranges. Our emulations show that asymmetric multiprocessors can reduce power consumption by more than two thirds with similar performance compared to symmetric multiprocessors.
Performance Implications of Single Thread Migration on a Chip Multi-Core
- SIGARCH Computer Architecture News, 2005
- Cited by 11 (3 self)
High-performance multi-core processors are becoming an industry reality. Although multi-cores are suited for multithreaded and multi-programmed workloads, many applications are still single-threaded, and multi-core performance with a single-thread workload is an important issue. Furthermore, recent studies suggest that performance, power and temperature considerations of future multi-cores may necessitate activity migration between cores. Motivated by the above, this paper investigates the performance implications of single thread migration on a multi-core. Specifically, the study considers the influence on the performance of a single thread of the following migration and multi-core parameters: frequency of migration, core warm-up modes, subset of resources that are warmed up, number of cores, and cache hierarchy organization. The results of this study can provide insight to architects on how to design performance-efficient power and thermal strategies for a multi-core chip. The experimental results, for the benchmarks and microarchitectures used in this study, show that the performance loss due to activity migration on a multi-core with private L1s and a shared L2 can be minimized if: (a) a migrating thread continues its execution on a core that was previously visited by the thread, and (b) cores remember their predictor state since their previous activation (all other core resources can be cold). The data also show that the transfer of the register state between two cores can be slow, with a latency of several hundred cycles, without limiting performance.
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?
- In MICRO-43: Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010
- Cited by 10 (2 self)
To extend the exponential performance scaling of future chip multiprocessors, improving energy efficiency has become a first-class priority. Single-chip heterogeneous computing has the potential to achieve greater energy efficiency by combining traditional processors with unconventional cores (U-cores) such as custom logic, FPGAs, or GPGPUs. Although U-cores are effective at increasing performance, their benefits can also diminish given the scarcity of projected bandwidth in the future. To understand the relative merits between different approaches in the face of technology constraints, this work builds on prior modeling of heterogeneous multicores to support U-cores. Unlike prior models that trade performance, power, and area using well-known relationships between simple and complex processors, our model must consider the less-obvious relationships between conventional processors and a diverse set of U-cores. Further, our model supports speculation of future designs from scaling trends predicted by the ITRS road map. The predictive power of our model depends upon U-core-specific parameters derived by measuring performance and power of tuned applications on today’s state-of-the-art multicores, GPUs, FPGAs, and ASICs. Our results reinforce some current-day understandings of the potential and limitations of U-cores and also provide new insights on their relative merits.
Heat Stroke: Power-Density-Based Denial of Service in SMT
- In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA, 2005
- Cited by 8 (0 self)
In the past, there have been several denial-of-service (DOS) attacks which exhaust some shared resource (e.g., physical memory, process table, file descriptors, TCP connections) of the targeted machine. Though these attacks have been addressed, it is important to continue to identify and address new attacks because DOS is one of the most prominent methods used to cause significant financial loss. A recent paper shows how to prevent attacks that exploit the sharing of pipeline resources (e.g., shared trace cache) in SMT to degrade the performance of normal threads. In this paper, we show that power density can be exploited in SMT to launch a novel DOS attack, called heat stroke. Heat stroke repeatedly accesses a shared resource to create a hot spot at the resource. Current solutions to hot spots inevitably involve slowing down the pipeline to let the hot spot cool down. Consequently, heat stroke slows down the entire SMT pipeline and severely degrades normal threads. We present a solution to heat stroke by identifying the thread that causes the hot spot and selectively slowing down the malicious thread while minimally affecting normal threads.
Power Reduction in Superscalar Datapaths Through Dynamic Bit-Slice Activation
- International Workshop on Innovative Architecture (IWIA, 2001
- Cited by 3 (1 self)
We show by simulating the execution of SPEC 95 benchmarks on a true hardware-level, cycle-by-cycle simulator for a superscalar CPU that about half of the bytes of operands flowing on the datapath, particularly the leading bytes, are all zeros. Furthermore, a significant number of the bits within the non-zero part of the data flowing on the various paths within the processor do not change from their prior value. We show how these two facts, attesting to the lack of a high level of entropy in the data streams, can be exploited to reduce power dissipation within all explicit and implicit storage components of a typical superscalar datapath such as register files, dispatch buffers, reorder buffers, as well as interconnections such as buses and direct links. Our simulation results and SPICE measurements from representative VLSI layouts show power savings of about 25% on average over all SPEC 95 benchmarks.
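The leading-zero-byte observation in this abstract can be made concrete with a small sketch. It assumes 32-bit unsigned operands and a byte-granular slice width; the function names and the sample operand list are hypothetical, and the sketch only counts leading zero bytes rather than modeling the paper's circuits.

```python
def significant_bytes(value, width_bytes=4):
    """Number of low-order bytes needed to hold `value`.
    The remaining high-order (leading) bytes are all zero."""
    if not 0 <= value < 1 << (8 * width_bytes):
        raise ValueError("value does not fit in the operand width")
    return (value.bit_length() + 7) // 8


def active_slice_mask(value, width_bytes=4):
    """Per-byte activation mask, index 0 being the least significant byte.
    False marks a leading zero byte whose slice could stay inactive."""
    n = significant_bytes(value, width_bytes)
    return [i < n for i in range(width_bytes)]


def leading_zero_byte_fraction(operands, width_bytes=4):
    """Fraction of all operand bytes that are leading zero bytes."""
    total = len(operands) * width_bytes
    zero = sum(width_bytes - significant_bytes(v, width_bytes) for v in operands)
    return zero / total


# Hypothetical operand stream: small values dominate, as the abstract suggests.
sample = [0x12, 0x1234, 0xFFFFFFFF, 0]
fraction = leading_zero_byte_fraction(sample)
```

For the sample stream, 9 of the 16 operand bytes are leading zeros, so `fraction` is 0.5625. Only the slices flagged `True` by `active_slice_mask` would need to be driven or stored in such a scheme.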

