Region-based hierarchical operation partitioning for multicluster processors
- In Proc. of the SIGPLAN ’03 Conference on Programming Language Design and Implementation
, 2003
Abstract
-
Cited by 42 (12 self)
Clustered architectures are a solution to the bottleneck of centralized register files in superscalar and VLIW processors. The main challenge associated with clustered architectures is compiler support to effectively partition operations across the available resources on each cluster. In this work, we present a novel technique for clustering operations based on graph partitioning methods. Our approach incorporates new methods of assigning weights to nodes and edges within the dataflow graph to guide the partitioner. Nodes are assigned weights to reflect their resource usage within a cluster, while a slack distribution method intelligently assigns weights to edges to reflect the cost of inserting moves across clusters. A multilevel graph partitioning algorithm, which globally divides a dataflow graph into multiple parts in a hierarchical manner, uses these weights to efficiently generate estimates for the quality of partitions. We found that our algorithm was able to achieve an average of 20% improvement in DSP kernels and 5% improvement in SPECint2000 for a four-cluster architecture.
Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors (code generation, retargetable compilers); C.1.1 [Processor Architectures]:
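To make the role of the node and edge weights concrete, here is a minimal sketch of how a candidate partition of a weighted dataflow graph could be scored. It is not the paper's algorithm: the function name `partition_cost`, the tiny graph, and all weight values are hypothetical, and the score is simply per-cluster load plus the total weight of edges that cross clusters (a stand-in for the cost of inter-cluster moves).

```python
from collections import defaultdict


def partition_cost(node_weight, edge_weight, assignment):
    """Score a cluster assignment of a weighted dataflow graph.

    Returns (load, cut): the summed node weight per cluster, and the
    summed weight of edges whose endpoints sit in different clusters.
    """
    load = defaultdict(int)
    for node, weight in node_weight.items():
        load[assignment[node]] += weight
    cut = sum(
        weight
        for (src, dst), weight in edge_weight.items()
        if assignment[src] != assignment[dst]
    )
    return dict(load), cut


# Hypothetical four-operation dataflow graph; every number is made up.
node_weight = {"a": 1, "b": 1, "c": 2, "d": 1}
edge_weight = {("a", "b"): 5, ("a", "c"): 1, ("b", "d"): 4, ("c", "d"): 1}

# Place the heavy edges a->b and b->d inside cluster 0; only the two
# light edges touching c cross the cluster boundary.
assignment = {"a": 0, "b": 0, "c": 1, "d": 0}

load, cut = partition_cost(node_weight, edge_weight, assignment)
```

A partitioner would compare such scores across candidate assignments, trading balanced load against a small cut; heavier edge weights make it costlier to separate the operations they connect.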
Instruction replication for clustered microarchitectures
- Proceedings of the 36th International Symposium on Microarchitecture
, 2003
Abstract
-
Cited by 14 (2 self)
This work presents a new compilation technique that uses instruction replication in order to reduce the number of communications executed on a clustered microarchitecture. For such architectures, the need to communicate values between clusters can result in a significant performance loss. Inter-cluster communications can be reduced by selectively replicating an appropriate set of instructions. However, instruction replication must be done carefully since it may also degrade performance due to the increased contention it can place on processor resources. The proposed scheme is built on top of a previously proposed state-of-the-art modulo scheduling algorithm that effectively reduces communications. Results show that the number of communications can decrease using replication, which results in significant speed-ups. IPC is increased by 25% on average for a 4-cluster microarchitecture and by as much as 70% for selected programs.
Heterogeneous clustered VLIW microarchitectures
- In CGO ’07: Proceedings of the International Symposium on Code Generation and Optimization
, 2007
Abstract
-
Cited by 4 (0 self)
Increasing performance, while at the same time reducing power consumption, is a major design tradeoff in current microprocessors. In this paper, we investigate the potential of using a heterogeneous clustered VLIW microarchitecture. In the proposed microarchitecture, each cluster, the interconnection network and the supporting memory hierarchy can run at different frequencies and voltages. Some of the clusters can then be configured to be performance-oriented and run at high frequency, while the other clusters can be configured to be low-power-oriented and run at lower frequencies, thus reducing overall consumption. For this heterogeneous design to be effective, we need to select the most suitable frequencies and voltages for each component. We propose a scheme to choose these parameters based on a model that estimates the energy consumption and the execution time of floating-point codes at compile time. Finally, we present a modulo scheduling technique based on graph partitioning that exploits the opportunities presented on heterogeneous clustered microarchitectures. Results show that the Energy-Delay² product (ED²) can be reduced by 15% on average for a microarchitecture with 4 clusters and by as much as 35% for selected programs.
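For readers unfamiliar with the metric, the energy-delay-squared product is conventionally defined as below. This is the standard textbook definition, not a formula quoted from the paper; the symbols $E$ (energy consumed) and $D$ (execution time) are notation chosen here, and the second line only restates what a 15% average reduction means arithmetically.

```latex
\[
  ED^2 = E \cdot D^2
\]
\[
  \frac{ED^2_{\text{new}}}{ED^2_{\text{base}}} = 1 - 0.15 = 0.85
\]
```

Because delay enters quadratically, a configuration that saves energy by running slower is penalized more heavily than under a plain energy-delay product.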
Removing Communications in Clustered Microarchitectures Through Instruction Replication
Abstract
-
Cited by 3 (1 self)
The need to communicate values between clusters can result in a significant performance loss for clustered microarchitectures. In this work, we describe an optimization technique that removes communications by selectively replicating an appropriate set of instructions. Instruction replication is done carefully because it might degrade performance due to the increased contention it can place on processor resources. The proposed scheme is built on top of a previously proposed state-of-the-art modulo-scheduling algorithm. Though this algorithm has been proved to be very effective at reducing communications, results show that the number of communications can be further decreased by around one-third through replication, which results in a significant speedup. IPC is increased by 25% on average for a four-cluster microarchitecture and by as much as 70% for selected programs. We also show that replicating appropriate sets of instructions is more effective than doubling the intercluster connection network bandwidth.
Automatic Data Partitioning for the Agere Payload Plus Network Processor
- In CASES ’04: Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems
, 2004
Abstract
-
Cited by 2 (0 self)
With the ever-increasing pervasiveness of the Internet and its stringent performance requirements, network system designers have begun utilizing specialized chips to increase the performance of network functions. To increase performance, many more advanced functions, such as traffic shaping and policing, are being implemented at the network interface layer to reduce delays that occur when these functions are handled by a general-purpose CPU. While some designs use ASICs to handle network functions, many system designers have moved toward using programmable network processors due to their increased flexibility and lower design cost.
HARDWARE/SOFTWARE TECHNIQUES FOR MEMORY POWER OPTIMIZATIONS IN EMBEDDED PROCESSORS
, 2007
Abstract
-
Cited by 1 (0 self)
To my family.
ACKNOWLEDGEMENTS
I would like to take this opportunity to express my sincere gratitude towards every individual who has contributed to this dissertation, and helped me both academically and personally during my graduate student years at the University of Michigan, Ann Arbor. Firstly, I would like to thank my adviser, Prof. Scott Mahlke, for his priceless guidance, patience, and support throughout my graduate studies. I had the privilege of working with him since my first summer internship at the Hewlett Packard Labs, Palo Alto in 2000. It was he who motivated me into the field of backend compilation and encouraged me to think, solve, and clearly articulate my research problems. Next, I would like to thank the members of my doctoral committee, Prof. Brown, Prof. Chandra, Prof. Mudge, and Prof. Sylvester. Their invaluable comments and insights helped to improve the quality of my thesis. I would especially like to thank Prof. Brown, with whom I had the honor of collaborating on my research. His vision of a system-
AGAMOS: A Graph-Based Approach to Modulo Scheduling for Clustere
- IEEE Transactions on Computers (accepted for publication in a future issue of the journal but not fully edited; content may change prior to final publication)
Abstract
This paper presents AGAMOS, a technique to modulo schedule loops on clustered micro-architectures. The proposed scheme uses a multi-level graph partitioning strategy to distribute the workload among clusters and reduces the number of inter-cluster communications at the same time. Partitioning is guided by approximate schedules (i.e., pseudoschedules), which take into account all of the constraints that influence the final schedule. To further reduce the number of inter-cluster communications, heuristics for instruction replication are included. The proposed scheme is evaluated using the SPECfp95 programs. The described scheme outperforms a state-of-the-art scheduler for all programs and for different cluster configurations. For some configurations, the speed-up obtained when using this new scheme is greater than 40%, and for selected programs, performance can be more than doubled.
Index Terms: Clustered microarchitectures, ILP, instruction replication, modulo scheduling, statically scheduled processors.
Abstract
Abstract
Modulo scheduling is an effective code generation technique that exploits the parallelism in program loops by overlapping iterations. One drawback of this optimization is that register requirements increase significantly because values across different loop iterations can be live concurrently. One possible solution to reduce register pressure is to insert spill code to release registers. Spill code stores values to memory between the producer and consumer instructions. Spilling heuristics can be divided into two classes: 1) a posteriori approaches (spill code is inserted after scheduling the loop) or 2) on-the-fly approaches (spill code is inserted during loop scheduling). Recent studies have reported obtaining better results for spilling on-the-fly. In this work, we study both approaches and propose two new techniques, one for each approach. Our new algorithms try to address the drawbacks observed in previous proposals. We show that the new algorithms outperform previous techniques and, at the same time, reduce compilation time. We also show that, much to our surprise, a posteriori spilling can in fact be slightly more effective than on-the-fly spilling.
COOPERATIVE DATA AND COMPUTATION PARTITIONING FOR DECENTRALIZED ARCHITECTURES
, 2007
Abstract
To my parents.
ACKNOWLEDGEMENTS
The work presented in this thesis could not have been completed without a significant amount of help and support from many people during my graduate career. First, I would like to thank my adviser, Professor Scott Mahlke, who has helped guide my research directions and develop my abilities to study and investigate new and exciting problems. Scott has been a great mentor and created a great research group which has made my work here enjoyable. I am very lucky to have worked with Scott as a student, teaching assistant and research assistant these past years. I would also like to thank my dissertation committee, Professor Igor Markov, Professor Steve Reinhardt and Professor Jim Freudenberg, for their time and effort to help improve and refine my thesis. Professor Markov provided me with more feedback and suggestions than I could have asked for. Professor Reinhardt also helped me flesh out the details of my work and pushed me to investigate new directions. I actually have
Register Pressure Guided Unroll-and-Jam
Abstract
Unroll-and-jam is an effective loop optimization that not only improves cache locality and instruction level parallelism (ILP) but also benefits other loop optimizations such as scalar replacement. However, unroll-and-jam increases register pressure, potentially resulting in performance degradation when the increase in register pressure causes register spilling. In this paper, we present a low-cost method to predict the register pressure of a loop before applying unroll-and-jam on high-level source code, taking into account the collaborative effects of scalar replacement, general scalar optimizations, software pipelining and register allocation. We also describe a performance model that uses the prediction results to automatically determine the unroll vector, from a given unroll space, that achieves the best run-time performance. Our experiments show that the heuristic prediction algorithm predicts the floating-point register pressure within 3 registers and the integer register pressure within 4 registers. With this algorithm, for the Polyhedron benchmark, our register pressure guided unroll-and-jam improves overall performance by about 2% over the model in the industry-leading optimizing Open64 backend for both the x86 and x86-64 architectures.
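As an illustration of the transformation itself (not of the paper's prediction method), the sketch below applies unroll-and-jam by hand to a matrix-vector product: the outer loop is unrolled by a factor of 2 and the two resulting inner loops are fused. The function names and the unroll factor are choices made for this example. Python has no registers, so the comments only point out where a compiler would see the extra live values.

```python
def matvec_baseline(a, x):
    """Plain doubly nested loop: y[i] = sum over j of a[i][j] * x[j]."""
    y = [0] * len(a)
    for i in range(len(a)):
        for j in range(len(x)):
            y[i] += a[i][j] * x[j]
    return y


def matvec_unroll_and_jam(a, x):
    """Same computation with the outer loop unrolled by 2 and jammed."""
    n = len(a)
    y = [0] * n
    i = 0
    while i + 1 < n:
        # Two accumulators are live at once: this is the register
        # pressure increase the abstract refers to.
        acc0 = 0
        acc1 = 0
        for j in range(len(x)):
            xj = x[j]  # loaded once, used by both rows (scalar replacement)
            acc0 += a[i][j] * xj
            acc1 += a[i + 1][j] * xj
        y[i] = acc0
        y[i + 1] = acc1
        i += 2
    # Remainder iteration when the row count is odd.
    if i < n:
        for j in range(len(x)):
            y[i] += a[i][j] * x[j]
    return y


matrix = [[1, 2], [3, 4], [5, 6]]
vector = [1, 1]
result = matvec_unroll_and_jam(matrix, vector)
```

Larger unroll factors expose more reuse of `x[j]` but keep more accumulators live, which is the tradeoff a register-pressure predictor is meant to resolve.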