Hotspot: A compact thermal modeling method for CMOS VLSI systems
- IEEE Transactions on
, 2006
Cited by 39 (9 self)
This paper presents HotSpot, a modeling methodology for developing compact thermal models based on the popular stacked-layer packaging scheme in modern very large-scale integration (VLSI) systems. In addition to modeling silicon and packaging layers, HotSpot includes a high-level on-chip interconnect self-heating power and thermal model, so that thermal impacts on interconnects can also be considered during early design stages. The HotSpot compact thermal modeling approach is especially well suited to pre-register-transfer-level (pre-RTL) and presynthesis thermal analysis: it provides detailed static and transient temperature information across the die and the package while remaining computationally efficient. Index Terms: compact thermal model, early design stages, interconnect self-heating, temperature, VLSI.
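To make the idea of a compact thermal model over stacked layers concrete, here is a minimal sketch of a series thermal RC ladder (die, interface material, spreader, sink, then ambient). It is not HotSpot's implementation: the one-node-per-layer topology, the function names `steady_state` and `transient`, and every resistance and capacitance value are assumptions chosen for illustration.

```python
# Hypothetical series thermal RC ladder: die -> TIM -> spreader -> sink -> ambient.
# Resistances are in K/W and capacitances in J/K; all values are made up.


def steady_state(power_w, resistances, t_ambient):
    """Steady-state node temperatures when all power is injected at node 0.

    resistances[i] connects node i to node i + 1; the last one connects
    the final node to ambient.
    """
    temps = []
    for i in range(len(resistances)):
        # In steady state the full power flows through every resistor
        # between node i and ambient.
        temps.append(t_ambient + power_w * sum(resistances[i:]))
    return temps


def transient(power_w, resistances, capacitances, t_ambient, dt, steps):
    """Forward-Euler integration of C * dT/dt = heat in - heat out.

    All nodes start at ambient. dt must be small compared with the
    fastest RC time constant or the explicit scheme becomes unstable.
    """
    n = len(resistances)
    temps = [t_ambient] * n
    for _ in range(steps):
        new = temps[:]
        for i in range(n):
            below = temps[i + 1] if i + 1 < n else t_ambient
            flow_out = (temps[i] - below) / resistances[i]
            if i == 0:
                flow_in = power_w
            else:
                flow_in = (temps[i - 1] - temps[i]) / resistances[i - 1]
            new[i] = temps[i] + dt * (flow_in - flow_out) / capacitances[i]
        temps = new
    return temps


if __name__ == "__main__":
    r = [0.2, 0.3, 0.1, 0.4]   # assumed layer resistances
    c = [0.5, 0.2, 2.0, 5.0]   # assumed layer capacitances
    print(steady_state(10.0, r, 45.0))
    print(transient(10.0, r, c, 45.0, dt=0.01, steps=20000))
```

Run long enough, the transient result should settle onto the steady-state temperatures, which is a quick sanity check on any such model. A realistic model would also divide each layer laterally into blocks or grid cells.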
A case for thermal-aware floorplanning at the microarchitectural level
- Journal of ILP
, 2005
Cited by 28 (4 self)
In current-day microprocessors, exponentially increasing power densities, leakage, cooling costs, and reliability concerns have resulted in temperature becoming a first-class design constraint like performance and power. Hence, virtually every high-performance microprocessor uses a combination of an elaborate thermal package and some form of Dynamic Thermal Management (DTM) scheme that adaptively controls its temperature. While DTM schemes exploit the important variable of power density to control temperature, this paper attempts to show that there is a significant peak temperature reduction potential in managing lateral heat spreading through floorplanning. It argues that this potential warrants consideration of the temperature-performance trade-off early in the design stage at the microarchitectural level using floorplanning. As a demonstration, it uses a previously proposed wire delay model and a floorplanning algorithm based on simulated annealing to present a profile-driven, thermal-aware floorplanning scheme that significantly reduces peak processor temperature with minimal performance impact and is quite competitive with DTM.
Thermal-Driven Multilevel Routing for 3-D ICs
- in Proceedings of the Asia South Pacific Design Automation Conference
, 2005
Cited by 25 (4 self)
3-D ICs have great potential for improving circuit performance and degree of integration. They are also an attractive platform for system-on-chip or system-in-package solutions. A critical issue in 3-D circuit design is heat dissipation. In this paper we propose an efficient 3-D multilevel routing approach that includes a novel through-the-silicon via (TS-via) planning algorithm. The proposed approach features an adaptive lumped resistive thermal model and a two-step multilevel TS-via planning scheme. Experimental results show that with multilevel TS-via planning, the thermal-driven approach can reduce the maximum temperature to the required temperature with a reasonable wirelength increase. Compared to a post-processing approach for dummy TS-via insertion, our approach uses 80% fewer TS-vias to achieve the same required temperature. To our knowledge, this proposed approach is the first thermal-driven 3-D routing algorithm.
PicoServer: Using 3D Stacking Technology To Enable A Compact Energy Efficient Chip Multiprocessor
- in ASPLOS-XII: Proceedings of the 12th international conference on Architectural
, 2006
Cited by 24 (0 self)
In this paper, we show how 3D stacking technology can be used to implement a simple, low-power, high-performance chip multiprocessor suitable for throughput processing. Our proposed architecture, PicoServer, employs 3D technology to bond one die containing several simple, slow processing cores to multiple DRAM dies sufficient for a primary memory. The 3D technology also enables wide low-latency buses between processors and memory. These remove the need for an L2 cache, allowing its area to be reallocated to additional simple cores. The additional cores allow the clock frequency to be lowered without impairing throughput. Lower clock frequency in turn reduces power and means that thermal constraints, a concern with 3D stacking, are easily satisfied. The PicoServer architecture specifically targets Tier 1 server applications, which exhibit a high degree of thread-level parallelism. An architecture targeted to efficient throughput is ideal for this application domain. We find that, for a similar logic die area, a 12-CPU system with 3D stacking and no L2 cache outperforms an 8-CPU system with a large on-chip L2 cache by about 14% while consuming 55% less power. In addition, we show that a PicoServer performs comparably to a Pentium 4-like class machine while consuming only about 1/10 of the power, even when conservative assumptions are made about the power consumption of the PicoServer.
Performance, energy, and thermal considerations for SMT and CMP architectures
- Proc. of the Eleventh Int’l Symp. on High-Performance Computer Architecture
, 2005
Cited by 16 (5 self)
Simultaneous multithreading (SMT) and chip multiprocessing (CMP) both allow a chip to achieve greater throughput, but their relative energy-efficiency and thermal properties are still poorly understood. This paper uses Turandot, PowerTimer, and HotSpot to explore this design space for a POWER4/POWER5-like core. For an equal-area comparison with this style of core, we find CMP to be superior in terms of performance and energy-efficiency for CPU-bound benchmarks, but SMT to be superior for memory-bound benchmarks due to a larger L2 cache. Although both exhibit similar peak operating temperatures and thermal management overheads, the mechanisms by which SMT and CMP heat up are quite different. More specifically, SMT heating is primarily caused by localized heating in certain key structures, whereas CMP heating is mainly caused by the global impact of increased energy output. Because of this difference in heat-up mechanism, we found that the best thermal management technique is also different for SMT and CMP. Indeed, non-DVS localized thermal management can outperform DVS for SMT. Finally, we show that CMP and SMT will scale differently as the contribution of leakage power grows, with CMP suffering from higher leakage due to the second core’s higher temperature and the exponential temperature dependence of subthreshold leakage.
Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach
Cited by 16 (1 self)
Efficient sharing of system resources is critical to obtaining high utilization and enforcing system-level performance objectives on chip multiprocessors (CMPs). Although several proposals that address the management of a single microarchitectural resource have been published in the literature, coordinated management of multiple interacting resources on CMPs remains an open problem. We propose a framework that manages multiple shared CMP resources in a coordinated fashion to enforce higher-level performance objectives. We formulate global resource allocation as a machine learning problem. At runtime, our resource management scheme monitors the execution of each application, and learns a predictive model of system performance as a function of allocation decisions. By learning each application’s performance response to different resource distributions, our approach makes it possible to anticipate the system-level performance impact of allocation decisions at runtime with little runtime overhead. As a result, it becomes possible to make reliable comparisons among different points in a vast and dynamically changing allocation space, allowing us to adapt our allocation decisions as applications undergo phase changes. Our evaluation concludes that a coordinated approach to managing multiple interacting resources is key to delivering high performance in multiprogrammed workloads, but this is possible only if accompanied by efficient search mechanisms. We also show that it is possible to build a single mechanism that consistently delivers high performance under various important performance metrics.
The need for a full-chip and package thermal model for thermally optimized IC designs
- in Proc. Int. Symp. on Low Power Electronics and Design (ISLPED)
, 2005
Cited by 12 (3 self)
Modeling and analyzing detailed die temperature with a full-chip thermal model at early design stages is important to discover and avoid potential thermal hazards. However, omitting important aspects of package details in a thermal model can result in significant temperature estimation errors. In this paper, we discuss the applications of an existing compact thermal model that models both die and package temperature details. As an example, a thermally self-consistent leakage power calculation of a POWER4-like microprocessor design is presented. We then demonstrate the importance of including detailed package information in the thermal model through several examples that consider the impact of the thermal interface material (TIM), which glues the die to the heat spreader. The fact that detailed package information is needed to build an accurate compact thermal model implies a design flow in which the chip- and package-level compact thermal model acts as a convenient medium for more productive collaboration among circuit designers, computer architects, and package designers, leading to early and efficient evaluation of different design tradeoffs for an optimal design from a thermal point of view.
System Level Leakage Reduction Considering the Interdependence of Temperature and Leakage
, 2004
Cited by 12 (2 self)
The high leakage devices in nanometer technologies as well as the low activity rates in system-on-a-chip (SOC) contribute to the growing significance of leakage power at the system level. We first present system-level leakage-power modeling and characteristics and discuss ways to reduce leakage for caches. Considering the interdependence between leakage power and temperature, we then discuss thermal runaway and dynamic power and thermal management (DPTM) to reduce power and prevent thermal violations. We show that a thermal-independent leakage model may hide actual failures of DPTM. Finally, we present voltage scaling considering DPTM for different packaging options. We show that the optimal Vdd for the best throughput may be smaller than the largest Vdd allowed by the given packaging platform, and that advanced cooling techniques can improve throughput significantly.
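The interdependence of leakage and temperature described above can be sketched as a fixed-point iteration: temperature sets leakage, and leakage feeds back into temperature. This is only an illustration and not the paper's model. The single thermal resistance, the exponential leakage form, the function name `settle_temperature`, and all parameter values are assumptions.

```python
import math


def settle_temperature(p_dynamic, r_thermal, t_ambient,
                       leak_ref=5.0, t_ref=45.0, beta=0.02,
                       t_limit=150.0, max_iter=1000, tol=1e-6):
    """Iterate T = T_amb + R * (P_dyn + P_leak(T)) until it settles.

    Leakage is assumed to grow as leak_ref * exp(beta * (T - t_ref)).
    Returns (temperature, converged). converged is False when the
    temperature passes t_limit, which this sketch treats as thermal
    runaway, or when max_iter is exhausted.
    """
    temp = t_ambient
    for _ in range(max_iter):
        p_leak = leak_ref * math.exp(beta * (temp - t_ref))
        new_temp = t_ambient + r_thermal * (p_dynamic + p_leak)
        if new_temp > t_limit:
            return new_temp, False
        if abs(new_temp - temp) < tol:
            return new_temp, True
        temp = new_temp
    return temp, False


if __name__ == "__main__":
    # Assumed good package: low thermal resistance, the loop settles.
    print(settle_temperature(p_dynamic=20.0, r_thermal=0.5, t_ambient=45.0))
    # Assumed poor package: no stable operating point, the loop runs away.
    print(settle_temperature(p_dynamic=40.0, r_thermal=2.0, t_ambient=45.0))
```

A temperature-independent leakage model would evaluate `p_leak` once at a fixed temperature and so could never show the second, diverging case.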
Performance Modeling and Automatic Ghost Zone Optimization for Iterative Stencil Loops on GPUs
- ICS ’09: Proceedings of the 23rd international conference on Supercomputing (2009)
Cited by 11 (5 self)
Iterative stencil loops (ISLs) are used in many applications, and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in a grid environment. To automate this process on shared memory systems, we establish a performance model using NVIDIA’s Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations achieve a speedup of no less than 98% of the optimal speedup.
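The ghost zone trade-off can be shown on a one-dimensional three-point stencil. The sketch below is a sequential Python illustration rather than the paper's GPU framework, and the names `step`, `run_global`, and `run_tiled` are hypothetical. Each tile copies `ghost` extra cells on each side and then runs `ghost` sweeps with no exchange. Stale values creep inward from the halo edge by one cell per sweep, so the tile's own cells remain correct.

```python
def step(values):
    """One 3-point averaging sweep; the two end cells are held fixed."""
    out = values[:]
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3.0
    return out


def run_global(values, iterations):
    """Reference: sweep the whole array, as if synchronizing every iteration."""
    for _ in range(iterations):
        values = step(values)
    return values


def run_tiled(values, tile_size, ghost):
    """Advance `ghost` iterations tile by tile with no halo exchange in between."""
    n = len(values)
    result = values[:]
    for start in range(0, n, tile_size):
        end = min(start + tile_size, n)
        lo = max(start - ghost, 0)   # halo is clamped at the array boundary
        hi = min(end + ghost, n)
        local = values[lo:hi]        # tile plus its ghost zone
        for _ in range(ghost):
            local = step(local)      # halo cells are recomputed redundantly
        # Keep only the tile's own cells; the halo may now hold stale values.
        result[start:end] = local[start - lo:end - lo]
    return result


if __name__ == "__main__":
    data = [float((i * 7) % 11) for i in range(40)]
    tiled = run_tiled(data, tile_size=8, ghost=3)
    reference = run_global(data, 3)
    print(max(abs(a - b) for a, b in zip(tiled, reference)))
```

A wider ghost zone means fewer exchanges but more redundant work per tile, which is the balance a performance model has to strike.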
Fast thermal simulation for architecture level dynamic thermal management
- in Proc. Int. Conf. on Computer Aided Design (ICCAD)
, 2005
Cited by 11 (2 self)
As power density increases exponentially, runtime regulation of operating temperature by dynamic thermal management becomes necessary. This paper proposes a novel approach to thermal analysis at the chip architecture level for efficient dynamic thermal management. Our new approach is based on the observation that the power consumption of architecture-level modules in microprocessors running typical workloads is strongly periodic. This feature can be exploited by fast spectrum analysis in the frequency domain to compute the steady-state response. To obtain the transient temperature changes due to the initial condition and constant power inputs, a numerically stable moment-matching approach is used. The total transient response is the sum of the two simulation results. The resulting fast thermal analysis algorithm leads to at least 10x-100x speedup over traditional integration-based transient analysis with small accuracy loss.

