Results 1 - 6 of 6
GARNET: a detailed on-chip network model inside a full-system simulator
- in Proceedings of the International Symposium on Performance Analysis of Systems and Software, 2009
Abstract - Cited by 14 (5 self)
Until very recently, microprocessor designs were computation-centric. On-chip communication was frequently ignored. This was because of fast, single-cycle on-chip communication. The interconnect power was also insignificant compared to the transistor power. With uniprocessor designs providing diminishing returns and the advent of chip multiprocessors (CMPs) in mainstream systems, the on-chip network that connects different processing cores has become a critical part of the design. Transistor miniaturization has led to high global wire delay, and interconnect power comparable to transistor power. CMP design proposals can no longer ignore the interaction between the memory hierarchy and the interconnection network that connects various elements. This necessitates a detailed and accurate
A Case for Globally Shared-Medium On-Chip Interconnect
Abstract - Cited by 2 (1 self)
As microprocessor chips integrate a growing number of cores, the issue of interconnection becomes more important for overall system performance and efficiency. Compared to traditional distributed shared-memory architecture, chip-multiprocessors offer a different set of design constraints and opportunities. As a result, a conventional packet-relay multiprocessor interconnect architecture is a valid, but not necessarily optimal, design point. For example, the advantage of off-the-shelf interconnect and the in-field scalability of the interconnect are less important in a chip-multiprocessor. On the other hand, even with worsening wire delays, packet switching represents a non-trivial component of overall latency. In this paper, we show that with straightforward optimizations, the traffic between different cores can be kept relatively low. This in turn allows simple shared-medium interconnects to be built using communication circuits driving transmission lines. This architecture offers extremely low latencies and can support a large number of cores without the need for packet switching, eliminating costly routers.
TLSync: Support for Multiple Fast Barriers Using On-Chip Transmission Lines
- In Proc. Int’l Symp. on
, 2011
Abstract - Cited by 1 (0 self)
As the number of cores on a single chip grows, scalable barrier synchronization becomes increasingly difficult to implement. In software implementations, such as the tournament barrier, a larger number of cores results in a longer latency for each round and a larger number of rounds. Hardware barrier implementations require significant dedicated wiring, e.g., using a reduction (arrival) tree and a notification (release) tree, and multiple instances of this wiring are needed to support multiple barriers (e.g., when concurrently executing multiple parallel applications). This paper presents TLSync, a novel hardware barrier implementation that uses the high-frequency part of the spectrum in a transmission-line broadcast network, thus leaving the transmission-line network free for non-modulated (baseband) data transmission. In contrast to other implementations of hardware barriers, TLSync allows multiple thread groups to each have their own barrier. This is accomplished by allocating different bands in the radio-frequency spectrum to different groups. Our circuit-level and electromagnetic models show that the worst-case latency for a TLSync barrier is 4 ns to 10 ns, depending on the size of the frequency band allocated to each group, and our cycle-accurate architectural simulations show that low-latency TLSync barriers provide significant performance and scalability benefits to barrier-intensive applications.
Current Trend in CMP
, 2009
Abstract
• Future Network-on-Chip (NoC) needs and development trends
• Traditional baseband-interconnect constraints
• Multiband RF-Interconnect (RF-I) advantages
  – Scalability in latency, energy/bit, data rate (Gbps/link) and overhead (area/Gb)
  – On-chip demonstrations
  – Off-chip demonstrations
A Design Space Exploration of Transmission-Line Links for On-Chip Interconnect
Abstract
With increasing core count, chip multiprocessors (CMPs) require a high-performance interconnect fabric that is energy-efficient. Well-engineered transmission line-based communication systems offer an attractive solution, especially for CMPs with a moderate number of cores. While transmission lines have been used for a wide variety of purposes, comprehensive studies are lacking to guide architects in navigating the circuit and physical design space and in making proper architecture-level analyses and tradeoffs. This paper makes a first-step effort in exploring part of the design space. Using detailed simulation-based analysis, we show that a shared-medium fabric based on transmission lines can offer better performance and a much better energy profile than a conventional mesh interconnect.
Reducing Power and Area by Interconnecting Memory Controllers to Memory Ranks with RF Coplanar Waveguides on the Same Package
Abstract
The physical channel is the element that consumes the largest amount of power in a traditional memory controller (MC). Wired-RF can potentially decrease the amount of power dissipated by replacing the physical memory channel with an RF channel, just as optical memory systems do by replacing the physical memory channel with an optical channel. Considering that RF transmission can potentially consume less power than a traditional bus for on-chip distances, we propose to replace the traditional digital MC physical channel by coupling RF transmitters (TX), receivers (RX), an RF quilt-packaging coplanar waveguide (CPW), and a quilt-to- to interconnect MCs and memory ranks on the same package in a multicore. We evaluate the proposed solution in terms of power and area employing ITRS [1] and RF predictions [17]. Preliminary estimation shows that the proposed RF interface is able to save up to 57.3% in terms of area and up to 78.2% in terms of power consumption for next processor generations. Furthermore, considering a fixed area budget of one MC as a reference, the proposed interface can improve bandwidth by up to 2.2x for an 8-core multiprocessor with 3 MCs and, assuming a fixed power budget of one MC, the proposed interface can improve bandwidth by up to 2.4x.

