A complexity-effective approach to alu bandwidth enhancement for instruction-level temporal redundancy
- In Proceedings of the International Symposium on Computer Architecture (ISCA), 2004
Cited by 31 (5 self)
Previous proposals for implementing instruction-level temporal redundancy in out-of-order cores have reported a performance degradation of up to 45% in certain applications compared to an execution without any temporal redundancy. An important contributor to this problem is the insufficient number of ALUs for handling the amplified load injected into the core. At the same time, increasing the number of ALUs can increase the complexity of the issue logic, which has been identified as one of the most timing-critical components of the processor. This paper proposes a novel extension of a prior idea on instruction reuse to ease ALU bandwidth requirements in a complexity-effective way by exploiting certain properties of a dual (temporally redundant) instruction stream. We present the microarchitectural extensions necessary for implementing an instruction reuse buffer (IRB) and integrating it with the issue logic of a dual instruction stream superscalar core, and conduct extensive evaluations to demonstrate how well it can alleviate the ALU bandwidth problem. We show that, on average, we can regain nearly 50% of the IPC loss caused by ALU bandwidth limitations in an instruction-level temporally redundant superscalar execution, and 23% of the overall IPC loss.
Keywords: Complexity-effective design, Instruction Reuse, Temporal Redundancy.
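As a rough illustration of the general instruction-reuse idea this abstract builds on, the sketch below models a small reuse buffer that returns a stored result when the same static instruction recurs with the same operand values, so no ALU is needed for that instance. It is a toy model under stated assumptions, not the paper's IRB: the class `ReuseBuffer`, the function `execute`, the LRU replacement policy, and the `(pc, op, operands)` stream format are all hypothetical, and the integration with a dual redundant stream and the issue logic is not modeled.

```python
from collections import OrderedDict


class ReuseBuffer:
    """Toy reuse buffer (hypothetical): maps (pc, operand values) to a stored result."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # kept in LRU order, oldest entry first

    def lookup(self, pc, operands):
        key = (pc, tuple(operands))
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        return None

    def insert(self, pc, operands, result):
        key = (pc, tuple(operands))
        self.entries[key] = result
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry


def execute(stream, buffer):
    """Run (pc, op, operands) tuples; return the results and the number of ALU uses."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    results, alu_uses = [], 0
    for pc, op, operands in stream:
        value = buffer.lookup(pc, operands)
        if value is None:  # miss: compute on an "ALU" and remember the result
            value = ops[op](*operands)
            alu_uses += 1
            buffer.insert(pc, operands, value)
        results.append(value)
    return results, alu_uses


# Two iterations of a loop body whose operand values happen to repeat.
loop_stream = [
    (0x10, "add", (1, 2)),
    (0x14, "mul", (3, 4)),
    (0x10, "add", (1, 2)),
    (0x14, "mul", (3, 4)),
]
loop_results, loop_alu_uses = execute(loop_stream, ReuseBuffer(capacity=4))
```

In this toy run the second loop iteration hits in the buffer, so only two of the four instructions occupy an ALU.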
Instruction replication for clustered microarchitectures
- Proceedings of the 36th International Symposium on Microarchitecture, 2003
Cited by 14 (2 self)
This work presents a new compilation technique that uses instruction replication to reduce the number of communications executed on a clustered microarchitecture. For such architectures, the need to communicate values between clusters can result in a significant performance loss. Inter-cluster communications can be reduced by selectively replicating an appropriate set of instructions. However, instruction replication must be done carefully, since it may also degrade performance due to the increased contention it can place on processor resources. The proposed scheme is built on top of a previously proposed state-of-the-art modulo scheduling algorithm that effectively reduces communications. Results show that replication decreases the number of communications, which yields significant speed-ups. IPC is increased by 25% on average for a 4-cluster microarchitecture and by as much as 70% for selected programs.
A Dependency Chain Clustered Microarchitecture
- In International Parallel and Distributed Processing Symposium, 2005
Cited by 6 (0 self)
In this paper we explore a new clustering approach for reducing the complexity of wide-issue in-order processors based on EPIC architectures. Complexity effectiveness is achieved by heavily clustering the pipeline from the decode to the commit stage without the need for any direct bypass between clusters. This is made possible by assuming support for executing compiler-constructed traces. One trace is executed at a time by executing its coarse-grained dependency chains (DCs) in different in-order clusters. Since the DCs of a trace are mutually data-independent, they can be executed in different clusters without any direct communication between them. To execute DCs in narrower clusters without compromising ILP, a compiler algorithm that splits large DCs by duplicating instructions is proposed.
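One way to picture mutually data-independent chains within a trace is as the weakly connected components of the trace's data-dependence graph: instructions in different components share no values, so they need no inter-cluster communication. The sketch below illustrates that reading only; it is an assumption rather than the paper's DC construction, and the function `dependency_chains` and the `(producer, consumer)` edge format are hypothetical. The DC-splitting step via instruction duplication is not modeled.

```python
def dependency_chains(num_instructions, deps):
    """Group instruction indices into mutually independent sets.

    deps is a list of (producer, consumer) index pairs. The groups are the
    weakly connected components of the dependence graph, found with union-find.
    """
    parent = list(range(num_instructions))

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for producer, consumer in deps:
        root_p, root_c = find(producer), find(consumer)
        if root_p != root_c:
            parent[root_c] = root_p  # merge the two groups

    chains = {}
    for i in range(num_instructions):
        chains.setdefault(find(i), []).append(i)
    return sorted(chains.values())


# A six-instruction trace: 0 -> 1 -> 2, 3 -> 4, and an isolated instruction 5.
example_chains = dependency_chains(6, [(0, 1), (1, 2), (3, 4)])
```

Here `example_chains` holds three groups, each of which could be assigned to its own cluster without exchanging values with the others.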
Removing Communications in Clustered Microarchitectures Through Instruction Replication
Cited by 3 (1 self)
The need to communicate values between clusters can result in a significant performance loss for clustered microarchitectures. In this work, we describe an optimization technique that removes communications by selectively replicating an appropriate set of instructions. Instruction replication is done carefully because it might degrade performance due to the increased contention it can place on processor resources. The proposed scheme is built on top of a previously proposed state-of-the-art modulo-scheduling algorithm. Though this algorithm has proved very effective at reducing communications, results show that the number of communications can be further decreased by around one-third through replication, which results in a significant speedup. IPC is increased by 25% on average for a four-cluster microarchitecture and by as much as 70% for selected programs. We also show that replicating appropriate sets of instructions is more effective than doubling the bandwidth of the inter-cluster connection network.
AGAMOS: A Graph-Based Approach to Modulo Scheduling for Clustere
- IEEE Transactions on Computers
This paper presents AGAMOS, a technique to modulo schedule loops on clustered microarchitectures. The proposed scheme uses a multi-level graph partitioning strategy to distribute the workload among clusters while reducing the number of inter-cluster communications. Partitioning is guided by approximate schedules (i.e., pseudo-schedules), which take into account all of the constraints that influence the final schedule. To further reduce the number of inter-cluster communications, heuristics for instruction replication are included. The proposed scheme is evaluated using the SPECfp95 programs. It outperforms a state-of-the-art scheduler for all programs and for different cluster configurations. For some configurations, the speed-up obtained with the new scheme is greater than 40%, and for selected programs, performance can be more than doubled.
Index Terms: Clustered microarchitectures, ILP, instruction replication, modulo scheduling, statically scheduled processors.
Replication-Based Partial Dynamic Scheduling on Heterogeneous Network Processors
Mapping network processing tasks onto the processing resources of advanced network processors (NPs), which are heterogeneous, multi-threaded multiprocessor Systems-on-Chip, is a great challenge. This paper proposes a novel scheduling algorithm called Replication-based Partial Dynamic Scheduling (RPDS). It aims to improve NP performance by combining the strategies of partial dynamic mapping and task replication in a two-phase schedule. RPDS differs from existing solutions in several respects: the processing elements are heterogeneous, fully connected, and multi-threaded; the application is decomposed into directed acyclic graph tasks with continuous data packets; and scheduling is conducted both at initialization and at run time. Experimental results show that the algorithm increases the largest average throughput by about 30% compared with scheduling without dynamic-phase replication.
OF THE REQUIREMENTS FOR THE DEGREE OF
- 2004
Over the past decade, superscalar microprocessors have achieved enormous improvements in computing power by exploiting higher levels of parallelism in many different ways. High-performance superscalar processors have experienced remarkable increases in processor width, pipeline depth and speculative execution. All of these trends have come at the cost of an extremely high increase in hardware complexity and chip resource consumption. Until recently, their main limitation has been the availability of such resources on the chip, but with current technology shrinks and increases in transistor budgets, other limiting factors have become preeminent, such as power consumption, temperature and wire delays. These new problems greatly compromise the scalability of conventional superscalar designs. Many previous works have demonstrated the effectiveness of partitioning the layout of several critical hardware components as a means to keep most of the parallelism while improving scalability. Some of these components are the register file, the issue queue and the bypass network. Their partitioning is the basis for the so-called clustered architectures. A clustered processor core, made up of several low-complexity blocks or clusters, can efficiently
ACKNOWLEDGEMENTS
This paper is made available online in accordance with publisher policies. Please scroll down to view the document itself. Please refer to the repository record for this item and our policy information available from the repository home page for further information. To see the final version of this paper please visit the publisher’s website. Access to the published version may require a subscription.
IEEE COPYRIGHT AND CONSENT FORM
To ensure uniformity of treatment among all contributors, other forms may not be substituted for this form, nor may any wording of the form be changed. This form is intended for original material submitted to the IEEE and must accompany any such material in order to be published by the IEEE. Please read the form carefully and keep a copy for your files.
By Entitled
- 2013
To the best of my knowledge and as understood by the student in the Research Integrity and Copyright Disclaimer (Graduate School Form 20), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy on Integrity in Research” and the use of copyrighted material. Approved by Major Professor(s):