Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
- IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE
, 2007
Cited by 47 (15 self)
A significant part of future microprocessor real estate will be dedicated to L2 or L3 caches. These on-chip caches will heavily impact processor performance, power dissipation, and thermal management strategies. There are a number of interconnect design considerations that influence power/performance/area characteristics of large caches, such as wire models (width/spacing/repeaters), signaling strategy (RC/differential/transmission), router design, etc. Yet, to date, there exists no analytical tool that takes all of these parameters into account to carry out a design space exploration for large caches and estimate an optimal organization. In this work, we implement two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache. First, we add the ability to model different types of wires, such as RC-based wires with different power/delay characteristics and differential low-swing buses. Second, we add the ability to model Non-uniform Cache Access (NUCA). We not only adopt state-of-the-art design space exploration strategies for NUCA, we also enhance this exploration by considering on-chip network contention and a wider spectrum of wiring and routing choices. We present a validation analysis of the new tool (to be released as CACTI 6.0) and present a case study to showcase how the tool can improve architecture research methodologies.
An adaptive cache coherence protocol optimized for producer-consumer sharing
- In 13th Int’l Symp. on High Performance Computer Architecture (HPCA-13)
, 2007
Cited by 10 (0 self)
Shared memory multiprocessors play an increasingly important role in enterprise and scientific computing facilities. Remote misses limit the performance of shared memory applications, and their significance is growing as network latency increases relative to processor speeds. This paper proposes two mechanisms that improve shared memory performance by eliminating remote misses and/or reducing the amount of communication required to maintain coherence. We focus on improving the performance of applications that exhibit producer-consumer sharing. We first present a simple hardware mechanism for detecting producer-consumer sharing. We then describe a directory delegation mechanism whereby the “home node” of a cache line can be delegated to a producer node, thereby converting 3-hop coherence operations into 2-hop operations. We then extend the delegation mechanism to support speculative updates for data accessed in a producer-consumer pattern, which can convert 2-hop misses into local misses, thereby eliminating the remote memory latency. Both mechanisms can be implemented without changes to the processor. We evaluate our directory delegation and speculative update mechanisms on seven benchmark programs that exhibit producer-consumer sharing using a cycle-accurate execution-driven simulator of a future 16-node SGI multiprocessor. We find that the mechanisms proposed in this paper reduce the average remote miss rate by 40%, reduce network traffic by 15%, and improve performance by 21%. Finally, we use Murphi to verify that each mechanism is error-free and does not violate sequential consistency.
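The hop-count argument in the abstract above can be illustrated with a toy model. This is only a sketch under stated assumptions: the function `miss_hops`, its parameters, and the assumed message path (requester to home, home to owner, owner back to requester) are hypothetical names and simplifications introduced here, not the protocol described in the paper.

```python
def miss_hops(requester, home, owner, has_speculative_copy=False):
    """Count network hops needed to satisfy a read miss (toy model).

    Assumed message path: requester -> home -> owner -> requester.
    A leg between identical nodes costs nothing, so delegating the
    home role to the owner (the producer) removes one leg.
    """
    if has_speculative_copy:
        # Assumption: a speculative update already pushed the line
        # to the consumer, so the access is satisfied locally.
        return 0
    path = [requester, home, owner, requester]
    return sum(1 for src, dst in zip(path, path[1:]) if src != dst)


# Consumer is node 2, the original home is node 0, the producer is node 1.
print(miss_hops(requester=2, home=0, owner=1))  # undelegated home
print(miss_hops(requester=2, home=1, owner=1))  # home delegated to producer
print(miss_hops(requester=2, home=1, owner=1, has_speculative_copy=True))
```

In this model the three calls yield 3, 2, and 0 hops, mirroring the 3-hop to 2-hop to local progression the abstract describes.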
Virtual Tree Coherence: Leveraging Regions and In-Network Multicast Trees for Scalable Cache Coherence
Cited by 9 (1 self)
Scalable cache coherence solutions are imperative to drive the many-core revolution forward. To fully realize the massive computation power of these many-core architectures, the communication substrate must be carefully examined and streamlined. There is tension between the need for an ordered interconnect to simplify coherence and the need for an unordered interconnect to provide scalable communication. In this work, we propose a coherence protocol, Virtual Tree Coherence (VTC), that relies on a virtually ordered interconnect. Our virtual ordering can be overlaid on any unordered interconnect to provide scalable, high-bandwidth communication. Specifically, VTC keeps track of sharers of a coarse-grained region, and multicasts requests to them through a virtual tree, employing properties of the virtual tree to enforce ordering amongst coherence requests. We compare VTC against a commonly used directory-based protocol and a greedy-order protocol extended onto an unordered interconnect. VTC outperforms both of these by averages of 25% and 11% in execution time, respectively, across a suite of scientific and commercial applications on 16 cores. For a 64-core system running server consolidation workloads, VTC outperforms directory and greedy protocols with average runtime improvements of 31% and 12%.
Proximity-Aware Directory-based Coherence for Multi-core Processor Architectures
- In Proceedings of SPAA-19
, 2007
Cited by 5 (0 self)
As the number of cores increases on chip multiprocessors, coherence is fast becoming a central issue for multi-core performance. This is exacerbated by the fact that interconnection speeds are not scaling well with technology. This paper describes mechanisms to accelerate coherence for a multi-core architecture that has multiple private L2 caches and a scalable point-to-point interconnect between cores. These techniques exploit the differences in geometry between chip multiprocessors and traditional multiprocessor architectures. Directory-based protocols have been proposed as a scalable alternative to snoop-based protocols. In this paper, we discuss implementations of coherence for CMPs and propose and evaluate a novel directory-based coherence scheme to improve the performance of parallel programs on such processors. Proximity-aware coherence accelerates read and write misses by initiating cache-to-cache transfers from the spatially closest sharer. This has the dual benefit of eliminating unnecessary accesses to off-chip memory, and minimizing the distance over which communicated data moves across the network. The proposed schemes result in speedups of up to 74.9% for our workloads.
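The idea of serving a miss from the spatially closest sharer can be sketched as follows. The abstract does not specify the topology or distance metric, so this example assumes a 2D mesh with row-major tile numbering and Manhattan distance; the helper names `tile_coords`, `manhattan`, and `closest_sharer` are hypothetical.

```python
def tile_coords(tile_id, mesh_width):
    """Map a row-major tile id to (x, y) coordinates on the mesh."""
    return (tile_id % mesh_width, tile_id // mesh_width)


def manhattan(a, b):
    """Hop distance between two (x, y) points on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def closest_sharer(requester, sharers, mesh_width):
    """Pick the sharer with the fewest mesh hops from the requester.

    Returns None when no other tile holds the line, in which case the
    request would have to be served from memory instead.
    """
    candidates = [s for s in sharers if s != requester]
    if not candidates:
        return None
    req = tile_coords(requester, mesh_width)
    # Ties are broken by the lower tile id to keep the choice deterministic.
    return min(
        candidates,
        key=lambda s: (manhattan(req, tile_coords(s, mesh_width)), s),
    )


# 4x4 mesh: tile 5 misses on a line cached by tiles 0, 6 and 15.
print(closest_sharer(requester=5, sharers={0, 6, 15}, mesh_width=4))
```

Here tile 6 is one hop from tile 5, while tiles 0 and 15 are two and four hops away, so tile 6 would supply the data.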
In-Network Coherence Filtering: Snoopy Coherence without Broadcasts
Cited by 4 (0 self)
With transistor miniaturization leading to an abundance of on-chip resources and uniprocessor designs providing diminishing returns, the industry has moved beyond single-core microprocessors and embraced the many-core wave. Scalable cache coherence protocol implementations are necessary to allow fast sharing of data among various cores and drive the many-core revolution forward. Snoopy coherence protocols, if realizable, have the desirable property of having low storage overhead and not adding indirection delay to cache-to-cache accesses. There are various proposals, like Token Coherence (TokenB), Uncorq, Intel QPI, INSO and Timestamp Snooping, that tackle the ordering of requests in snoopy protocols and make them realizable on unordered networks. However, snoopy protocols still have the broadcast overhead because each coherence request goes to all cores in the system. This has substantial network bandwidth and power implications. In this work, we propose embedding small in-network coherence filters inside on-chip routers that dynamically track sharing patterns among various cores. This sharing information is used to filter away redundant snoop requests that are traveling towards unshared cores. Filtering these useless messages saves network bandwidth and power and makes snoopy protocols on many-core systems truly scalable. Our in-network coherence filters reduce the total number of snoops in the system by 41.9% on average, thereby reducing total network traffic by 25.4% on 16-processor chip multiprocessor (CMP) systems running parallel applications. For 64-processor CMP systems, our filtering technique achieves a 46.5% average reduction in the total number of snoops, which reduces the total network traffic by 27.3% on average.
Outstanding Research Problems in NoC Design: Circuit-, Microarchitecture-, and System-Level Perspectives
Cited by 4 (0 self)
Networks-on-Chip (NoCs) have been recently proposed to replace global interconnects in order to alleviate complex communication problems. While several research problems concerning NoC design have been already addressed in the literature, many others remain to be solved. In this work, we first provide a general description of NoC architectures and applications. Then, we enumerate several related research problems organized under five main categories: Application characterization, communication paradigm, communication infrastructure, analysis and solution evaluation. Motivation, problem formulation, proposed approaches and open issues are discussed for each problem enumerated in the paper from circuit, micro-architecture and system-level perspectives. Finally, we address the interactions among these research problems and put the NoC design process into perspective. Index terms: On-chip communication architecture, networks-on-chip, multiprocessor system-on-chip, CMP.
Subspace snooping: Filtering snoops with operating system support
- In Proceedings of the Nineteenth International Conference on Parallel Architectures and Compilation Techniques
, 2010
Cited by 4 (1 self)
Although snoop-based coherence protocols provide fast cache-to-cache transfers with a simple and robust coherence mechanism, scaling the protocols has been difficult due to the overheads of broadcast snooping. In this paper, we propose a coherence filtering technique called subspace snooping, which stores the potential sharers of each memory page in the page table entry. By using the sharer information in the page table entry, coherence transactions for a page generate snoop requests only to the subset of nodes in the system (subspace). However, the coherence subspace of a page may evolve, as the phases of applications may change or the operating system may migrate threads to different nodes. To adjust subspaces dynamically, subspace snooping supports a shrinking mechanism, which removes obsolete nodes from subspaces. Subspace snooping can be integrated with any type of coherence protocol and network topology. As subspace snooping guarantees that a subspace always contains the precise sharers of a page, it does not restrict the designs of coherence protocols and networks. We evaluate subspace snooping with Token Coherence on unordered mesh networks. For scientific and server applications on a 16-core system, subspace snooping reduces snoops by 44% on average.
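The per-page sharer tracking described above can be sketched with a small data structure. This is a minimal illustration under assumptions: a Python dictionary stands in for the sharer field of a page table entry, the class `SubspaceTable` and its methods are hypothetical names, and the policy for deciding which nodes are obsolete is left to the caller because the abstract does not describe it.

```python
class SubspaceTable:
    """Toy per-page sharer tracking (stand-in for page table entry bits)."""

    def __init__(self):
        self._subspace = {}  # page number -> set of node ids

    def record_access(self, page, node):
        """Add a node to the page's subspace when it first touches the page."""
        self._subspace.setdefault(page, set()).add(node)

    def snoop_targets(self, page, requester):
        """Nodes that must be snooped for a request to this page."""
        return sorted(self._subspace.get(page, set()) - {requester})

    def shrink(self, page, obsolete_nodes):
        """Drop nodes the caller has determined no longer share the page."""
        self._subspace.get(page, set()).difference_update(obsolete_nodes)


table = SubspaceTable()
for node in (0, 3, 7):
    table.record_access(page=0x42, node=node)

print(table.snoop_targets(page=0x42, requester=3))  # snoop only nodes 0 and 7
table.shrink(page=0x42, obsolete_nodes={7})
print(table.snoop_targets(page=0x42, requester=3))  # node 7 no longer snooped
```

On a 16-node system, a broadcast protocol would send this request to 15 other nodes, whereas the sketch sends it to two nodes before shrinking and to one afterwards.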
A Consistency Architecture for Hierarchical Shared Caches
Cited by 3 (0 self)
Hierarchical Cache Consistency (HCC) is a scalable cache-consistency architecture for chip multiprocessors in which caches are shared hierarchically. HCC’s cache-consistency protocol is embedded in the message-routing network that interconnects the caches, providing a distributed and scalable alternative to bus-based and directory-based consistency mechanisms. The HCC consistency protocol is “progressive” in that every message makes monotonic progress without timeouts, retries, negative acknowledgments, or retreating in any way. The latency is at most proportional to the diameter of the network. For HCC with a binary fat-tree network, the protocol requires at most 13 bits of additional state per cache line, no matter how large the system. We prove that the HCC protocol is deadlock free and provides sequential consistency.
In-network snoop ordering (INSO): Snoopy coherence on unordered interconnects
- In Proceedings of the 15th Symposium on High-Performance Computer Architecture. IEEE, Los Alamitos, CA
, 2009
Cited by 3 (2 self)
Realizing scalable cache coherence in the many-core era comes with a whole new set of constraints and opportunities. It is widely believed that multi-hop, unordered on-chip networks would be needed in many-core chip multiprocessors (CMPs) to provide scalable on-chip communication. However, providing ordering among coherence transactions on unordered interconnects is a challenge. Traditional approaches for tackling coherence either have to use ordered interconnects (snoop-based protocols), which lead to scalability problems, or rely on an ordering point (directory-based protocols), which adds indirection latency. In this paper, we propose In-Network Snoop Ordering (INSO), in which coherence requests from a snoop-based protocol are inserted into the interconnect fabric and the network orders the requests in a distributed manner, creating a global ordering among requests. Essentially, when coherence requests enter the network, they grab snoop-orders at the injection router before being broadcast. A snoop-order specifies the global ordering of the particular request with respect to other requests. Before requests reach their destinations, they get ordered along the way, at intermediate routers and destination network interfaces. Our logical ordering scheme can be mapped onto any unordered interconnect. This enables a cache coherence protocol which exploits the low-latency nature of unordered interconnects without adding indirection to coherence transactions. Our full-system evaluations compare INSO against a directory protocol and a broadcast-based Token Coherence protocol. INSO outperforms these protocols by up to 30% and 8.5%, respectively, on a wide range of scientific and emerging applications.
Improving support for locality and fine-grain sharing in chip multiprocessors
- In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques
, 2008
Cited by 2 (1 self)
Both commercial and scientific workloads benefit from concurrency and exhibit data sharing across threads/processes. The resulting sharing patterns are often fine-grain, with the modified cache lines still residing in the writer’s primary cache when accessed. Chip multiprocessors present an opportunity to optimize for fine-grain sharing using direct access to remote processor components through low-latency on-chip interconnects. In this paper, we present Adaptive Replication, Migration, and producer-Consumer Optimization (ARMCO), a coherence protocol that, to the best of our knowledge, is the first to exploit direct access to the L1 caches of remote processors (rather than via coherence mechanisms) in order to support fine-grain sharing. Our goal is to provide support for tightly coupled sharing by recognizing and adapting to common sharing patterns such as migratory,

