Piranha: A scalable architecture based on single-chip multiprocessing (2000)
Venue: SIGARCH Comput. Archit. News
Citations: 233 (7 self)
Citations
490 | The SGI Origin: A ccNUMA Highly Scalable Server
- Laudon
- 1997
Citation Context ...-hop write transactions involving a remote owner more efficiently. Second, we inherently eliminate livelock and starvation problems that arise due to the presence of NAKs. In contrast, the SGI Origin =-=[25]-=- uses a number of complicated mechanisms such as keeping retry counts and reverting to a strict request-reply protocol, while most other protocols with NAKs ignore this important problem (e.g, DASH [2... |
432 | A Single-Chip Multiprocessor
- Hammond, Nayfeh, et al.
- 1997
Citation Context ...us paragraph) along with an eight-instruction-wide out-of-order processor with SMT support for four simultaneous threads [13]. An alternative approach, often referred to as chip multiprocessing (CMP) =-=[15]-=-, involves integrating multiple (possibly simpler) processor cores onto a single chip. This approach has been adopted by the next-generation IBM Power4 design which integrates two superscalar cores al... |
368 | The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor
- Lenoski, Laudon, et al.
- 1990
Citation Context ...d request, the protocol can complete all directory state changes immediately. This property eliminates the need for extra confirmation messages sent back to the home (e.g., “ownership change” in DASH =-=[26]-=-), and also eliminates the associated protocol engine occupancy. Therefore, our protocol handles 3-hop write transactions involving a remote owner more efficiently. Second, we inherently eliminate liv... |
252 | The potential for using thread-level data speculation to facilitate automatic parallelization
- Steffan, Mowry
- 1998
Citation Context ...ing. Furthermore, Piranha focuses on commercial workloads, which have an abundance of explicit thread-level parallelism. Therefore, support for thread-level speculation as proposed by Hydra and others =-=[22,41]-=- is not necessary for achieving high performance on such workloads. Another CMP design in progress is the IBM Power4 [9]. Each Power4 chip has two 1-GHz, five-issue, out-of-order superscalar processor... |
250 | An Evaluation of Directory Schemes for Cache Coherence
- Agarwal, Simoni, et al.
- 1988
Citation Context ... the low latency, high bandwidth path provided by the integration of memory controllers on the chip. We use two different directory representations depending on the number of sharers: limited pointer =-=[1]-=- and coarse vector [14]. Two bits of the directory are used for state, with 42 bits available for encoding sharers. The directory is not used to maintain information about sharers at the home node. Fu... |
250 | Memory system characterization of commercial workloads
- Barroso, Gharachorloo, et al.
- 1998
Citation Context ...performance servers. A number of recent studies have underscored the radically different behavior of commercial workloads such as on-line transaction processing (OLTP) relative to technical workloads =-=[4,7, 8,21,28,34,36]-=-. First, commercial workloads often lead to inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication m... |
232 | Data speculation support for a chip multiprocessor. ASPLOS’98
- Hammond, Willey, et al.
- 1998
Citation Context ...have advocated and evaluated the use of chip multiprocessing (CMP) in the context of workloads such as SPEC [15,29,33], and the Hydra project is exploring CMP with a focus on thread-level speculation =-=[16,17]-=-. The current implementation integrates four 250MHz processors each with 8KB instruction and data caches and a shared 128KB second-level cache onto a small chip. There are a number of differences betw... |
179 | The Stanford FLASH multiprocessor
- Kuskin, O, et al.
- 1994
Citation Context ...lieve our design provides a nice balance between flexibility (e.g., for late binding of protocol) and performance. While the design is less flexible than using a general-purpose processor as in FLASH =-=[24]-=-, the specialized (more powerful) instructions lead to much lower protocol engine latency and occupancy. 2.5.2 Directory Storage The Piranha design supports directory data with virtually no memory spa... |
176 | Embra: Fast and flexible machine simulation
- Witchel, Rosenblum
- 1996
Citation Context ...lation detail, enabling the user to choose the most appropriate trade-off between simulation detail and slowdown. The fastest simulator uses an on-the-fly binary translation technique similar to Embra =-=[48]-=- to position the workload into a steady state. For the medium-speed (in simulation time) processor module, SimOS-Alpha models a single-issue pipelined processor. Finally, the slowest-speed processor m... |
170 | Using the SimOS machine simulator to study complex computer systems
- Rosenblum, Bugnion, et al.
- 1997
Citation Context ...database, and the queries are parallelized to generate four server processes per processor. 3.2 Simulation Environment For our simulations, we use the SimOS-Alpha environment (the Alpha port of SimOS =-=[37]-=-), which was used in a previous study of commercial applications and has been validated against Alpha multiprocessor hardware [4]. SimOS-Alpha is a full system simulation environment that simulates th... |
152 | Simultaneous Multithreading: A Platform for Next-generation Processors
- Eggers, Emer, et al.
- 1997
Citation Context ... to more cores. Furthermore, the small size of the L1 along with the lack of an on-chip L2 cache makes this design non-optimal for commercial workloads such as OLTP. Simultaneous multithreading (SMT) =-=[11]-=- (and other forms of multithreading) is an alternative to CMP for exploiting the thread-level parallelism in commercial workloads. In fact, Lo et al. [27] have shown that SMT can provide a substantial ... |
149 | The impact of architectural trends on operating system performance
- Rosenblum, Bugnion, et al.
- 1995
Citation Context ...performance servers. A number of recent studies have underscored the radically different behavior of commercial workloads such as on-line transaction processing (OLTP) relative to technical workloads =-=[4,7, 8,21,28,34,36]-=-. First, commercial workloads often lead to inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication m... |
146 | Missing the Memory Wall: The Case for Processor/Memory Integration
- Saulsbury, Pong, et al.
- 1996
Citation Context ...torage The Piranha design supports directory data with virtually no memory space overhead by computing ECC at a coarser granularity and utilizing the unused bits for storing the directory information =-=[31,38]-=-. ECC is computed across 256-bit boundaries (typical is 64-bit), leaving us with 44 bits for directory storage per 64-byte line. Compared to having a dedicated external storage and datapath for direct... |
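The figure of 44 bits in the context above can be checked with a short calculation. The sketch below assumes a SEC-DED (single-error-correct, double-error-detect) Hamming code and a memory that provisions the check bits needed at the typical 64-bit granularity; neither assumption is stated in the excerpt, and the function names are illustrative only.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a SEC-DED Hamming code over data_bits (assumed code family)."""
    r = 0
    # Single-error correction needs the smallest r with 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall-parity bit adds double-error detection.
    return r + 1


LINE_BITS = 64 * 8  # one 64-byte memory line


def ecc_bits_per_line(granularity_bits: int) -> int:
    """Total check bits for a line when ECC is computed per granularity_bits."""
    words = LINE_BITS // granularity_bits
    return words * secded_check_bits(granularity_bits)


typical = ecc_bits_per_line(64)   # 8 words x 8 check bits
coarse = ecc_bits_per_line(256)   # 2 words x 10 check bits
freed = typical - coarse          # bits left over for directory storage

print(typical, coarse, freed)     # prints: 64 20 44
```

Under these assumptions the coarser granularity frees 44 bits per line, which matches the number quoted in the excerpt.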
142 | Contrasting characteristics and cache performance of technical and multi-user commercial workloads
- Maynard, Donnelly, et al.
- 1994
132 | An analysis of database workload performance on simultaneous multithreaded processors
- Lo, Barroso, et al.
- 1998
Citation Context ...en used to hide I/O latency in such workloads. Previous studies have shown that techniques such as simultaneous multithreading (SMT) can provide a substantial performance boost for database workloads =-=[27]-=-. In fact, the Alpha 21464 (successor to Alpha 21364) is planning to combine aggressive chip-level integration (see previous paragraph) along with an eight-instruction-wide out-of-order processor with... |
117 | Reducing memory and traffic requirements for scalable directory-based cache coherence schemes
- Gupta, Weber, et al.
- 1990
Citation Context ... bandwidth path provided by the integration of memory controllers on the chip. We use two different directory representations depending on the number of sharers: limited pointer [1] and coarse vector =-=[14]-=-. Two bits of the directory are used for state, with 42 bits available for encoding sharers. The directory is not used to maintain information about sharers at the home node. Furthermore, directory in... |
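The two sharer representations named in the context above can be illustrated with a small sketch. Only the 42-bit sharer field comes from the excerpt. The node count, the pointer width, the switch-over threshold, and the separately stored pointer count are all hypothetical choices made for this example, not details confirmed by the source.

```python
SHARER_BITS = 42    # from the excerpt: bits available for encoding sharers
NUM_NODES = 1024    # assumption: system size chosen for illustration
POINTER_BITS = 10   # assumption: log2(NUM_NODES) bits per node pointer
MAX_POINTERS = SHARER_BITS // POINTER_BITS           # 4 pointers fit
NODES_PER_COARSE_BIT = -(-NUM_NODES // SHARER_BITS)  # ceiling division: 25


def encode_sharers(sharers):
    """Return (mode, pointer_count, value) for a collection of sharer node ids."""
    nodes = sorted(set(sharers))
    if len(nodes) <= MAX_POINTERS:
        # Limited pointer: pack exact node ids side by side.
        value = 0
        for i, node in enumerate(nodes):
            value |= node << (i * POINTER_BITS)
        return ("limited_pointer", len(nodes), value)
    # Coarse vector: each bit stands for a group of consecutive nodes.
    vector = 0
    for node in nodes:
        vector |= 1 << (node // NODES_PER_COARSE_BIT)
    return ("coarse_vector", None, vector)


def decode_sharers(encoded):
    """Return the set of nodes that must be treated as possible sharers."""
    mode, count, value = encoded
    if mode == "limited_pointer":
        mask = (1 << POINTER_BITS) - 1
        return {(value >> (i * POINTER_BITS)) & mask for i in range(count)}
    nodes = set()
    for bit in range(SHARER_BITS):
        if (value >> bit) & 1:
            low = bit * NODES_PER_COARSE_BIT
            nodes.update(range(low, min(low + NODES_PER_COARSE_BIT, NUM_NODES)))
    return nodes
```

With few sharers the decoded set is exact. Once the coarse vector is in use, the decoded set is a superset of the true sharers, so invalidations may reach nodes that hold no copy; that imprecision is the price of fitting any number of sharers into a fixed field.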
103 | Tradeoffs in Two-Level On-Chip Caching
- Jouppi, Wilton
- 1993
Citation Context ...h duplicate data. Therefore, Piranha opts for not maintaining the inclusion property. Although non-inclusive on-chip cache hierarchies have been previously studied in the context of a single-CPU chip =-=[20]-=-, the use of this technique in the context of a CMP leads to interesting issues related to coherence and allocation/replacement policies. To simplify intra-chip coherence and avoid the use of snooping... |
97 | Studies of Windows NT performance using dynamic execution traces
- Perl, Sites
- 1996
Citation Context ...performance servers. A number of recent studies have underscored the radically different behavior of commercial workloads such as on-line transaction processing (OLTP) relative to technical workloads =-=[4,7, 8,21,28,34,36]-=-. First, commercial workloads often lead to inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication m... |
74 | The future of systems research
- Hennessy
- 1999
Citation Context ...nt performance gains on target applications such as the SPEC benchmark [40], continuing along this path is becoming less viable due to substantial increases in development team sizes and design times =-=[18]-=-. Furthermore, more complex designs are yielding diminishing returns in performance even for applications such as SPEC. Meanwhile, commercial workloads such as databases and Web applications have surp... |
67 | The S3.mp Scalable Shared Memory Multiprocessor
- Nowatzyk, Aybay, et al.
- 1995
Citation Context ...torage The Piranha design supports directory data with virtually no memory space overhead by computing ECC at a coarser granularity and utilizing the unused bits for storing the directory information =-=[31,38]-=-. ECC is computed across 256-bit boundaries (typical is 64-bit), leaving us with 44 bits for directory storage per 64-byte line. Compared to having a dedicated external storage and datapath for direct... |
62 | Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor
- Krishnan, Torrellas
- 1998
Citation Context ...ing. Furthermore, Piranha focuses on commercial workloads, which have an abundance of explicit thread-level parallelism. Therefore, support for thread-level speculation as proposed by Hydra and others =-=[22,41]-=- is not necessary for achieving high performance on such workloads. Another CMP design in progress is the IBM Power4 [9]. Each Power4 chip has two 1-GHz, five-issue, out-of-order superscalar processor... |
61 | The memory performance of DSS commercial workloads in shared-memory multiprocessors
- Trancoso, Larriba-Pey, et al.
- 1997
Citation Context ...There have been a large number of recent studies of database applications (both OLTP and DSS) due to the increasing importance of these workloads =-=[4,7,8,12,21,27,28,34,35,36,42,46]-=-. To the best of our knowledge, this is the first paper that provides a detailed evaluation of database workloads in the context of chip multiprocessing. Ranganathan et al. [35] study user-level trace... |
53 | Simultaneous multithreading: Multiplying Alpha’s performance
- Emer
- 1999
Citation Context ...pha 21364) is planning to combine aggressive chip-level integration (see previous paragraph) along with an eight-instruction-wide out-of-order processor with SMT support for four simultaneous threads =-=[13]-=-. An alternative approach, often referred to as chip multiprocessing (CMP) [15], involves integrating multiple (possibly simpler) processor cores onto a single chip. This approach has been adopted by ... |
50 | Evaluation of multithreaded uniprocessors for commercial application environments
- Eickemeyer, Johnson, et al.
- 1996
Citation Context ...There have been a large number of recent studies of database applications (both OLTP and DSS) due to the increasing importance of these workloads =-=[4,7,8,12,21,27,28,34,35,36,42,46]-=-. To the best of our knowledge, this is the first paper that provides a detailed evaluation of database workloads in the context of chip multiprocessing. Ranganathan et al. [35] study user-level trace... |
46 | Characterization of Alpha AXP performance using TP and SPEC Workloads
- Cvetanovic, Bhandarkar
- 1994
Citation Context ...performance servers. A number of recent studies have underscored the radically different behavior of commercial workloads such as on-line transaction processing (OLTP) relative to technical workloads =-=[4,7, 8,21,28,34,36]-=-. First, commercial workloads often lead to inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication m... |
44 | Performance of an OLTP Application on Symmetry Multiprocessor System
- Thakkar, Sweiger
- 1990
Citation Context ...There have been a large number of recent studies of database applications (both OLTP and DSS) due to the increasing importance of these workloads =-=[4,7,8,12,21,27,28,34,35,36,42,46]-=-. To the best of our knowledge, this is the first paper that provides a detailed evaluation of database workloads in the context of chip multiprocessing. Ranganathan et al. [35] study user-level trace... |
43 | Evaluation of Design Alternatives for a Multiprocessor Microprocessor
- Nayfeh, Hammond, et al.
- 1996
Citation Context ... that are specifically focused on commercial markets [5,23]. Several papers from Stanford have advocated and evaluated the use of chip multiprocessing (CMP) in the context of workloads such as SPEC =-=[15,29,33]-=-, and the Hydra project is exploring CMP with a focus on thread-level speculation [16,17]. The current implementation integrates four 250MHz processors each with 8KB instruction and data caches and a ... |
36 | S-connect: From Network of Workstations to Supercomputer Performance
- Nowatzyk, Browne, et al.
- 1995
Citation Context ... process, where the primary caches are loaded from a small external EPROM over a bit-serial connection. 2.6.1 The Router (RT) The RT is similar to the S-Connect design developed for the S3.mp project =-=[30]-=-. Like the S-Connect, the RT uses a topology-independent, adaptive, virtual cut-through router core based on a common buffer pool that is shared across multiple priorities and virtual channels. Since ... |
35 | Alpha 21364: A Scalable Single-Chip SMP. Microprocessor Forum
- Bannon
- 1998
Citation Context ...g a scaled 1GHz 21264 core (i.e., shrink of the current Alpha processor core to 0.18um technology), two levels of caches, memory controller, coherence hardware, and network router all on a single die =-=[2]-=-. The tight coupling of these modules enables a more efficient and lower latency memory hierarchy which can substantially improve the performance of commercial workloads [3]. Furthermore, the reuse of... |
25 | AlphaServer 4100 performance characterization
- Cvetanovic, Donaldson
- 1996
Citation Context ...performance servers. A number of recent studies have underscored the radically different behavior of commercial workloads such as on-line transaction processing (OLTP) relative to technical workloads =-=[4,7, 8,21,28,34,36]-=-. First, commercial workloads often lead to inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication m... |
21 | Impact of Chip-Level Integration on Performance of OLTP Workloads
- Barroso, Gharachorloo, et al.
- 2000
Citation Context ...router all on a single die [2]. The tight coupling of these modules enables a more efficient and lower latency memory hierarchy which can substantially improve the performance of commercial workloads =-=[3]-=-. Furthermore, the reuse of an existing high-performance processor core in designs such as the Alpha 21364 effectively addresses the design complexity issues and provides better time-to-market without... |
15 | System Optimization for OLTP Workloads
- Kunkel, Armstrong, et al.
- 1999
Citation Context ...t segment for high-performance servers) at the possible expense of other types of workloads. There are several other contemporary processor designs that are specifically focused on commercial markets =-=[5,23]-=-. Several papers from Stanford have advocated and evaluated the use of chip multiprocessing (CMP) in the context of workloads such as SPEC [15,29,33], and the Hydra project is exploring CMP with a f... |
8 | MAJC-5200: A VLIW Convergent MPSOC
- Tremblay
- 1999
Citation Context ...ating eight much simpler processor cores on a single chip, and provides on-chip functionality for a scalable design. Finally, Sun Microsystems has also announced a new CMP design called the MAJC 5200 =-=[47]-=-, which is the first implementation of the MAJC architecture targeted at multimedia and Java applications. The 5200 contains two 500MHz VLIW processors, each capable of issuing four instructions per c... |
7 | Performance of database workloads on shared memory systems with out-of-order processors
- Ranganathan, Gharachorloo, et al.
- 1998
Citation Context ...iple instruction issue and out-of-order execution provide only small gains for workloads such as OLTP due to the data-dependent nature of the computation and the lack of instruction-level parallelism =-=[35]-=-. Third, commercial workloads do not have any use for the high-performance floating-point and multimedia functionality that is implemented in modern microprocessors. Therefore, it is not uncommon for ... |
5 | Exploiting parallelism in cache coherency protocol engines, Euro-Par’95
- Nowatzyk, Aybay, et al.
- 1995
Citation Context ..., with the home and remote engines being virtually identical except for the microcode that they execute. Our approach uses the same design philosophy as the protocol engines used in the S3.mp project =-=[32]-=-. Figure 4 shows a high-level block diagram of one protocol engine consisting of three independent (and decoupled) stages: the input controller, the microcode-controlled execution unit, and the output ... |
3 | The Stanford Hydra CMP. Presented at Hot Chips 11
- Hammond, Hubbert, et al.
- 1999
Citation Context ...have advocated and evaluated the use of chip multiprocessing (CMP) in the context of workloads such as SPEC [15,29,33], and the Hydra project is exploring CMP with a focus on thread-level speculation =-=[16,17]-=-. The current implementation integrates four 250MHz processors each with 8KB instruction and data caches and a shared 128KB second-level cache onto a small chip. There are a number of differences betw... |
1 | Memory System Characterization of Commercial Workloads
- chitecture
Citation Context ...ibly simpler) processor cores onto a single chip. This approach has been adopted by the next-generation IBM Power4 design which integrates two superscalar cores along with a shared second-level cache =-=[9]-=-. While the SMT approach is superior in single-thread performance (important for workloads without explicit thread-level parallelism), it is best suited for very wide-issue processors which are more c... |