OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance
In ASPLOS, 2013
"... Emerging GPGPU architectures, along with programming models like CUDA and OpenCL, offer a cost-effective platform for many applications by providing high thread level parallelism at lower energy budgets. Unfortunately, for many general-purpose applications, available hardware resources of a GPGPU ar ..."
Cited by 18 (4 self)
Abstract:
Emerging GPGPU architectures, along with programming models like CUDA and OpenCL, offer a cost-effective platform for many applications by providing high thread-level parallelism at lower energy budgets. Unfortunately, for many general-purpose applications, the available hardware resources of a GPGPU are not efficiently utilized, leading to lost opportunities for improving performance. A major cause of this is the inefficiency of current warp scheduling policies in tolerating long memory latencies. In this paper, we identify that the scheduling decisions made by such policies are agnostic to thread-block, or cooperative thread array (CTA), behavior, and as a result are inefficient. We present a coordinated CTA-aware scheduling policy that utilizes four schemes to minimize the impact of long memory latencies. The first two schemes, CTA-aware two-level warp scheduling and locality-aware warp scheduling, enhance per-core performance by effectively reducing cache contention and improving latency-hiding capability. The third scheme, bank-level-parallelism-aware warp scheduling, improves overall GPGPU performance by enhancing DRAM bank-level parallelism. The fourth scheme employs opportunistic memory-side prefetching to further enhance performance by taking advantage of open DRAM rows. Evaluations on a 28-core GPGPU platform with highly memory-intensive applications indicate that our proposed mechanism can provide a 33% average performance improvement over the commonly employed round-robin warp scheduling policy.
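The CTA-aware two-level idea lends itself to a compact sketch. The toy scheduler below (hypothetical names, illustrative only; not the paper's implementation) keeps warps of the same CTA in the same scheduling group and issues from one active group at a time, switching groups only when every warp in the active group is stalled on memory, which preserves intra-CTA cache and DRAM-row locality while still hiding long latencies.

```python
from collections import deque

class Warp:
    def __init__(self, warp_id, cta_id):
        self.warp_id = warp_id
        self.cta_id = cta_id
        self.stalled = False   # True while waiting on a long memory access

class CtaAwareTwoLevelScheduler:
    """Toy two-level scheduler: level 1 picks among warps of the
    'active' CTA group; level 2 swaps in the next CTA group when
    the whole active group is stalled on memory."""
    def __init__(self, warps, group_size=2):
        ctas = {}
        for w in warps:
            ctas.setdefault(w.cta_id, []).append(w)
        # Keep warps of the same CTA together so their accesses hit
        # the same cache lines / DRAM rows close together in time.
        cta_list = list(ctas.values())
        self.groups = deque()
        for i in range(0, len(cta_list), group_size):
            group = [w for cta in cta_list[i:i + group_size] for w in cta]
            self.groups.append(group)

    def next_warp(self):
        for _ in range(len(self.groups)):
            ready = [w for w in self.groups[0] if not w.stalled]
            if ready:
                return ready[0]        # issue from the active group
            self.groups.rotate(-1)     # whole group stalled: switch groups
        return None                    # everything is stalled on memory

warps = [Warp(i, cta_id=i // 4) for i in range(16)]  # 4 CTAs x 4 warps
sched = CtaAwareTwoLevelScheduler(warps)
print(sched.next_warp().warp_id)
```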
A Case for Heterogeneous On-Chip Interconnects for CMPs
In ISCA, 2011
"... Network-on-chip (NoC) has become a critical shared resource in the emerging Chip Multiprocessor (CMP) era. Most prior NoC designs have used the same type of router across the entire net-work. While this homogeneous network design eases the burden on a network designer, partitioning the resources equ ..."
Cited by 13 (1 self)
Abstract:
Network-on-chip (NoC) has become a critical shared resource in the emerging Chip Multiprocessor (CMP) era. Most prior NoC designs have used the same type of router across the entire network. While this homogeneous network design eases the burden on a network designer, partitioning the resources equally among all routers across the network does not lead to optimal resource usage, and hence, affects the performance-power envelope. In this work, we propose to apportion the resources in an NoC to leverage the non-uniformity in network resource demand. Our proposal includes partitioning the network resources, specifically buffers and links, in an optimal manner. This approach results in redistributing resources such that routers that require more resources are allocated more buffers and wider links compared to routers demanding fewer resources. This results in a novel heterogeneous network, called HeteroNoC, which is composed of two types of routers: small power-efficient routers and big high-performance routers. We evaluate a number of heterogeneous network configurations composed of big and small routers, and show that giving more resources to routers along the diagonals in a mesh network provides maximum benefits in terms of performance and power. We also show the potential benefits of the HeteroNoC design by co-evaluating it with memory controllers and configuring it with an asymmetric CMP consisting of heterogeneous cores.
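The headline result, giving more resources to routers along the mesh diagonals, can be illustrated with a small resource map. The sketch below uses illustrative buffer and link-width numbers (not the paper's configuration): routers on the two diagonals of an n x n mesh are marked "big" and everything else stays "small".

```python
def heteronoc_allocation(n, big_buffers=8, big_link_bits=256,
                         small_buffers=4, small_link_bits=128):
    """Toy HeteroNoC-style resource map for an n x n mesh: routers on
    the two diagonals get more buffers and wider links (numbers are
    illustrative assumptions, not from the paper)."""
    config = {}
    for x in range(n):
        for y in range(n):
            on_diagonal = (x == y) or (x + y == n - 1)
            config[(x, y)] = {
                "buffers_per_vc": big_buffers if on_diagonal else small_buffers,
                "link_width_bits": big_link_bits if on_diagonal else small_link_bits,
            }
    return config

cfg = heteronoc_allocation(8)
print(cfg[(3, 3)])   # a "big" diagonal router
print(cfg[(0, 3)])   # a "small" off-diagonal router
```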
NOC-Out: Microarchitecting a Scale-Out Processor
"... Scale-out server workloads benefit from many-core processor organizations that enable high throughput thanks to abundant request-level parallelism. A key characteristic of these workloads is the large instruction footprint that exceeds the capacity of private caches. While a shared last-level cache ..."
Cited by 3 (1 self)
Abstract:
Scale-out server workloads benefit from many-core processor organizations that enable high throughput thanks to abundant request-level parallelism. A key characteristic of these workloads is a large instruction footprint that exceeds the capacity of private caches. While a shared last-level cache (LLC) can capture the instruction working set, it necessitates a low-latency interconnect fabric to minimize the core stall time on instruction fetches serviced by the LLC. Many-core processors with a mesh interconnect sacrifice performance on scale-out workloads due to NOC-induced delays. Low-diameter topologies can overcome the performance limitations of meshes through rich inter-node connectivity, but at a high area expense. To address the drawbacks of existing designs, this work introduces NOC-Out, a many-core processor organization that affords low LLC access delays at a small area cost. NOC-Out is tuned to accommodate the bilateral core-to-cache access pattern, characterized by minimal coherence activity and a lack of inter-core communication, that is dominant in scale-out workloads. Optimizing for the bilateral access pattern, NOC-Out segregates cores and LLC banks into distinct network regions and reduces costly network connectivity by eliminating the majority of inter-core links. NOC-Out further simplifies the interconnect through the use of low-complexity tree-based topologies. A detailed evaluation targeting a 64-core CMP and a set of scale-out workloads reveals that NOC-Out improves system performance by 17% and reduces network area by 28% over a tiled mesh-based design. Compared to a design with a richly-connected flattened butterfly topology, NOC-Out reduces network area by 9x while matching the performance.
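The segregation of cores and LLC banks into distinct regions joined by simple trees can be pictured as a connectivity generator. The toy function below (shapes and names are illustrative assumptions, not the paper's floorplan) places two core regions around a central LLC row and connects each region through a binary reduction tree, with no core-to-core links at all.

```python
def build_noc_out(num_cores=8):
    """Toy NOC-Out-style connectivity: two core regions feed a central
    LLC row through binary reduction trees; cores never link to each
    other directly (illustrative topology only)."""
    links = []
    half = num_cores // 2
    for region, cores in (("north", range(half)),
                          ("south", range(half, num_cores))):
        nodes = [f"core{c}" for c in cores]
        level = 0
        while len(nodes) > 1:          # merge pairs level by level
            merged = []
            for i in range(0, len(nodes), 2):
                parent = f"{region}-t{level}-{i // 2}"
                for child in nodes[i:i + 2]:
                    links.append((child, parent))
                merged.append(parent)
            nodes = merged
            level += 1
        links.append((nodes[0], "llc-row"))   # tree root feeds the LLC row
    return links

for link in build_noc_out()[:6]:
    print(link)
```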
HNOCS: Modular Open-Source Simulator for Heterogeneous NoCs
"... Abstract — We present HNOCS (Heterogeneous Network-on-Chip Simulator), an open-source NoC simulator based on OMNeT++. To the best of our knowledge, HNOCS is the first simulator to support modeling of heterogeneous NoCs with variable link capacities and number of VCs per unidirectional port. The HNOC ..."
Cited by 2 (1 self)
Abstract:
We present HNOCS (Heterogeneous Network-on-Chip Simulator), an open-source NoC simulator based on OMNeT++. To the best of our knowledge, HNOCS is the first simulator to support modeling of heterogeneous NoCs with variable link capacities and numbers of VCs per unidirectional port. The HNOCS simulation platform provides an open-source, modular, scalable, extensible, and fully parameterizable framework for modeling NoCs. It includes three types of NoC routers: synchronous, synchronous virtual output queue (VoQ), and asynchronous. HNOCS provides a rich set of statistical measurements at the flit and packet levels: end-to-end latencies, throughput, VC acquisition latencies, transfer latencies, etc. We describe the architecture, structure, available models, and the features that make HNOCS suitable for advanced NoC exploration. We also evaluate several case studies which cannot be evaluated with any other existing NoC simulator.
Keywords: NoC simulator; heterogeneous NoC
NoC Architectures for Silicon Interposer Systems: Why pay for more wires when you can get them (from your interposer) for free?
"... enables the integration of multiple memory stacks with a processor chip, thereby greatly increasing in-package memory capacity while largely avoiding the thermal challenges of 3D stacking DRAM on the processor. Systems employing inter-posers for memory integration use the interposer to provide point ..."
Cited by 2 (2 self)
Abstract:
Silicon interposer technology enables the integration of multiple memory stacks with a processor chip, thereby greatly increasing in-package memory capacity while largely avoiding the thermal challenges of 3D-stacking DRAM on the processor. Systems employing interposers for memory integration use the interposer to provide point-to-point interconnects between chips. However, these interconnects only utilize a fraction of the interposer's overall routing capacity, and in this work we explore how to take advantage of this otherwise unused resource. We describe a general approach for extending the architecture of a network-on-chip (NoC) to better exploit the additional routing resources of the silicon interposer. We propose an asymmetric organization that distributes the NoC across both a multi-core chip and the interposer, where each sub-network differs from the other in terms of traffic types, topologies, the use or non-use of concentration, direct versus indirect network organization, and other network attributes. Through experimental evaluation, we show that exploiting the otherwise unutilized routing resources of the interposer can lead to significantly better performance.
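One way to picture the proposed asymmetry is as a traffic-steering rule: chip-local traffic stays on the multi-core die's sub-network, while bandwidth-hungry memory traffic rides the interposer's otherwise idle wires. The sketch below is a minimal illustration under that assumption; the class names are hypothetical and the paper's actual split may differ.

```python
def pick_subnetwork(packet_class):
    """Toy traffic steering for an interposer system: keep chip-local
    traffic on the die's sub-network and push memory traffic onto the
    interposer's sub-network (hypothetical traffic classes)."""
    on_chip = {"coherence", "core_to_llc", "llc_to_core"}
    on_interposer = {"llc_to_memory", "memory_to_llc"}
    if packet_class in on_chip:
        return "chip sub-network"
    if packet_class in on_interposer:
        return "interposer sub-network"
    raise ValueError(f"unknown class: {packet_class}")

for cls in ("coherence", "llc_to_memory"):
    print(cls, "->", pick_subnetwork(cls))
```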
On-Chip Network-Enabled Many-Core Architectures for Computational Biology Applications
2013
"... ..."
Optimizing Heterogeneous NoC Design
"... We develop a novel design methodology that optimizes capacity of each link in a NoC and the numbers of virtual channels (VCs) at each router port for a given set of flows and latency constraints. In order to lower computation costs associated with a simulated annealing search in the design space, we ..."
Abstract:
We develop a novel design methodology that optimizes the capacity of each link in a NoC and the number of virtual channels (VCs) at each router port for a given set of flows and latency constraints. In order to lower the computation costs associated with a simulated-annealing search in the design space, we utilize an approximate analysis of NoC performance, thus replacing the need for a NoC simulation. Therefore, computation time and resources are dramatically reduced. The area saving achieved by our heterogeneous NoC design is demonstrated by several use cases. The heterogeneous NoC design process is applied to SoCs running multimedia benchmarks, and to a Chip Multi-Processor (CMP) running PARSEC benchmark programs.
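The core loop, simulated annealing steered by an analytical model instead of simulation, is easy to sketch. Below, a queueing-style M/M/1 delay stands in for the paper's approximate NoC analysis, and the annealer shrinks per-link capacities (a proxy for area) subject to a latency budget; the model and all numbers are illustrative assumptions, not the authors' formulation.

```python
import math
import random

def link_delay(load, capacity):
    """M/M/1-style latency approximation (stand-in for the paper's
    analytical NoC model)."""
    if load >= capacity:
        return float("inf")
    return 1.0 / (capacity - load)

def cost(capacities, loads, latency_budget):
    worst = max(link_delay(l, c) for l, c in zip(loads, capacities))
    if worst > latency_budget:
        return float("inf")      # infeasible: violates latency constraint
    return sum(capacities)       # proxy for total link area/power

def anneal(loads, latency_budget, steps=20000, t0=2.0):
    caps = [l + 1.0 for l in loads]          # feasible starting point
    cur = best = cost(caps, loads, latency_budget)
    best_caps = caps[:]
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-3   # cooling schedule
        cand = caps[:]
        i = random.randrange(len(cand))
        cand[i] = max(0.1, cand[i] + random.uniform(-0.2, 0.2))
        c = cost(cand, loads, latency_budget)
        if c < cur or random.random() < math.exp((cur - c) / t):
            caps, cur = cand, c
            if c < best:
                best, best_caps = c, cand[:]
    return best_caps, best

caps, area = anneal(loads=[0.3, 0.8, 0.5, 0.2], latency_budget=5.0)
print([round(c, 2) for c in caps], round(area, 2))
```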
PhaseNoC: Versatile Network Traffic Isolation through TDM-Scheduled Virtual Channels
"... Abstract—As multi-/many-core architectures evolve, the de-mands on the Network-on-Chip (NoC) are amplified. In addi-tion to high performance and physical scalability, the NoC is increasingly required to also provide specialized functionality, such as network virtualization, flow isolation, and Quali ..."
Abstract:
As multi-/many-core architectures evolve, the demands on the Network-on-Chip (NoC) are amplified. In addition to high performance and physical scalability, the NoC is increasingly required to also provide specialized functionality, such as network virtualization, flow isolation, and Quality-of-Service (QoS). Although traditional architectures supporting Virtual Channels (VCs) offer the resources for flow partitioning and isolation, an adversarial workload can still interfere with and degrade the performance of other workloads that are active in a different set of VCs. In this paper, we present PhaseNoC, a truly non-interfering VC-based architecture that adopts Time-Division Multiplexing (TDM) at the VC level. Distinct flows, or application domains, mapped to disjoint sets of VCs are isolated, both inside the router's pipeline and at the network level. Any latency overhead is minimized by appropriate scheduling of flows in separate phases of operation, irrespective of the chosen topology. When strict isolation is not required, the proposed architecture can employ opportunistic bandwidth stealing. This novel mechanism works synergistically with the baseline PhaseNoC techniques to improve the overall latency/throughput characteristics of the NoC, while still preserving performance isolation. Experimental results corroborate that, with lower cost than state-of-the-art NoC architectures and with minimum latency overhead, PhaseNoC removes any flow interference and allows for efficient network traffic isolation.
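The TDM-at-the-VC-level mechanism plus opportunistic bandwidth stealing can be condensed into a toy arbiter. In the sketch below (a simplification, not PhaseNoC's actual pipeline-level scheduling), each slot of a static table is owned by one application domain; the owner always wins its slot, so isolation holds, and an idle slot may be reused by another domain with pending flits.

```python
class PhaseTdmArbiter:
    """Toy TDM arbiter in the spirit of PhaseNoC: each time slot is
    owned by one application domain (a disjoint set of VCs); an unused
    slot can be 'stolen' by another domain with pending flits."""
    def __init__(self, slot_table):
        self.slot_table = slot_table   # e.g. [0, 1, 0, 2]: domain per slot
        self.cycle = 0

    def grant(self, pending):          # pending: domain -> has flits?
        owner = self.slot_table[self.cycle % len(self.slot_table)]
        self.cycle += 1
        if pending.get(owner):
            return owner               # isolation: the owner always wins
        # opportunistic bandwidth stealing of an otherwise wasted slot
        for domain, ready in pending.items():
            if ready:
                return domain
        return None

arb = PhaseTdmArbiter([0, 1, 0, 2])
print(arb.grant({0: False, 1: True, 2: False}))  # slot owned by 0, stolen by 1
print(arb.grant({0: False, 1: True, 2: False}))  # slot owned by 1 itself
```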
Design Space Exploration of On-chip Ring Interconnection for a CPU-GPU Architecture
"... Future chip multiprocessors (CMP) will only grow in core count and diversity in terms of frequency, power consumption, and re-source distribution. Incorporating a GPU architecture into CMP, which is more efficient with certain types of applications, is the next stage in this trend. This heterogeneou ..."
Abstract:
Future chip multiprocessors (CMPs) will only grow in core count and diversity in terms of frequency, power consumption, and resource distribution. Incorporating a GPU architecture into the CMP, which is more efficient with certain types of applications, is the next stage in this trend. This heterogeneous mix of architectures will use an on-chip interconnection to access shared resources such as last-level cache tiles and memory controllers. The configuration of this on-chip network will likely have a significant impact on resource distribution, fairness, and overall performance. The heterogeneity of this architecture inevitably exerts different pressures on the interconnection due to the differing characteristics and requirements of applications running on CPU and GPU cores. CPU applications are sensitive to latency, while GPGPU applications require massive bandwidth. This is due to the difference in the thread-level parallelism of the two architectures. GPUs use more threads to hide the effect of memory latency but require massive bandwidth to supply those threads. On the other hand, CPU cores, typically running only one or two threads concurrently, are very sensitive to latency. This study surveys the impact and behavior of the interconnection network when CPU and GPGPU applications run simultaneously. This will shed light on other architectural interconnection studies on CPU-GPU heterogeneous architectures.
Adaptive Virtual Channel Partitioning for Network-on-Chip in Heterogeneous Architectures
"... Current heterogeneous chip-multiprocessors (CMPs) integrate a GPU architecture on a die. However, the heterogeneity of this architecture inevitably exerts different pressures on shared resource management due to differing characteristics of CPU and GPU cores. We consider how to efficiently share on- ..."
Abstract:
Current heterogeneous chip multiprocessors (CMPs) integrate a GPU architecture on a die. However, the heterogeneity of this architecture inevitably exerts different pressures on shared resource management due to the differing characteristics of CPU and GPU cores. We consider how to efficiently share on-chip resources between cores within the heterogeneous system, in particular the on-chip network. Heterogeneous architectures use an on-chip interconnection network to access shared resources such as last-level cache tiles and memory controllers, and this type of on-chip network will have a significant impact on performance. In this article, we propose a feedback-directed virtual channel partitioning (VCP) mechanism for on-chip routers to effectively share network bandwidth between CPU and GPU cores in a heterogeneous architecture. VCP dedicates a few virtual channels to CPU and GPU applications with separate injection queues. The proposed mechanism balances on-chip network bandwidth for applications running on CPU and GPU cores by adaptively choosing the best partitioning configuration. As a result, our mechanism improves system throughput by 15% over the baseline across 39 heterogeneous workloads.
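The feedback-directed selection of a partitioning configuration can be sketched as a simple explore-then-commit loop. The code below is a minimal illustration under assumed details (sampling each CPU/GPU VC split once and scoring it by summed IPC); the paper's actual metric, sampling intervals, and configurations are not reproduced here.

```python
class AdaptiveVcPartitioner:
    """Toy feedback loop in the spirit of the VCP proposal: every
    sampling interval, try one CPU/GPU virtual-channel split, then
    commit to the split with the best observed combined throughput
    (metric and interval are illustrative assumptions)."""
    def __init__(self, total_vcs=4):
        # candidate configs: how many VCs are reserved for CPU traffic
        self.candidates = list(range(1, total_vcs))
        self.best_cfg = self.candidates[0]
        self.best_score = float("-inf")
        self.trial = 0

    def current_config(self):
        if self.trial < len(self.candidates):
            return self.candidates[self.trial]   # still exploring
        return self.best_cfg                     # committed to the winner

    def report(self, cpu_ipc, gpu_ipc):
        score = cpu_ipc + gpu_ipc                # placeholder metric
        if self.trial < len(self.candidates):
            if score > self.best_score:
                self.best_score = score
                self.best_cfg = self.candidates[self.trial]
            self.trial += 1

part = AdaptiveVcPartitioner()
for cpu_ipc, gpu_ipc in [(1.0, 3.2), (1.4, 2.9), (1.6, 2.1)]:
    part.current_config()        # config used during this interval
    part.report(cpu_ipc, gpu_ipc)
print("chosen CPU VCs:", part.current_config())
```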