Instruction scheduling for a tiled dataflow architecture
In ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2006
Cited by 9 (0 self)
This paper explores hierarchical instruction scheduling for a tiled processor. Our results show that at the top level of the hierarchy, a simple profile-driven algorithm effectively minimizes operand latency. After this schedule has been partitioned into large sections, the bottom-level algorithm must more carefully analyze program structure when producing the final schedule. Our analysis reveals that at this bottom level, good scheduling depends upon carefully balancing instruction contention for processing elements and operand latency between producer and consumer instructions. We develop a parameterizable instruction scheduler that more effectively optimizes this trade-off. We use this scheduler to determine the contention-latency sweet spot that generates the best instruction schedule for each application. To avoid this application-specific tuning, we also determine the parameters that produce the best performance across all applications. The result is a contention-latency setting that generates instruction schedules for all applications in our workload that come within 17% of the best schedule for each.
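The contention-latency trade-off described in this abstract can be illustrated with a toy placement routine. This is only a sketch under stated assumptions, not the paper's scheduler: the function `place`, the weight `alpha`, the `distance` callback, and the linear cost are all hypothetical names and choices introduced here for illustration.

```python
from collections import defaultdict


def place(instructions, deps, num_pes, alpha, distance):
    """Greedily assign each instruction to a processing element (PE).

    Hypothetical cost: alpha * contention + (1 - alpha) * latency, where
    contention is the number of instructions already on the PE and latency
    is the summed distance to the PEs holding the producers.
    `instructions` is assumed to be in topological (producer-first) order.
    """
    load = defaultdict(int)   # instructions already placed on each PE
    placement = {}            # instruction -> PE index

    for ins in instructions:
        def cost(pe):
            contention = load[pe]
            latency = sum(distance(placement[p], pe)
                          for p in deps.get(ins, ())
                          if p in placement)
            return alpha * contention + (1 - alpha) * latency

        best = min(range(num_pes), key=cost)  # ties go to the lowest index
        placement[ins] = best
        load[best] += 1
    return placement


def linear_distance(a, b):
    """Assumed operand latency model: PEs laid out on a line."""
    return abs(a - b)


chain = ["a", "b", "c", "d"]
chain_deps = {"b": ["a"], "c": ["b"], "d": ["c"]}

latency_only = place(chain, chain_deps, 4, 0.0, linear_distance)
contention_only = place(chain, chain_deps, 4, 1.0, linear_distance)
```

With `alpha = 0.0` the dependent chain collapses onto one PE (no operand latency, maximal contention); with `alpha = 1.0` it spreads across all four PEs (no contention, maximal latency). Intermediate values of `alpha` correspond to the kind of tunable setting the abstract calls a contention-latency sweet spot.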
Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning
In Proc. of the 11th International Conference on Parallel Architectures and Compilation Techniques, 2002
Cited by 8 (3 self)
This paper presents a new modulo scheduling algorithm for clustered microarchitectures. The main feature of the proposed scheme is that the assignment of instructions to clusters is done by means of graph partitioning algorithms that are guided by a pseudo-scheduler. This pseudo-scheduler is a simplified version of the full instruction scheduler and estimates key constraints that would be encountered in the final schedule.
Software and Hardware Techniques to Optimize Register File Utilization in VLIW Architectures
In Proc. of the International Workshop on Advanced Compiler Technology for High Performance and Embedded Systems (IWACT, 2001
Cited by 7 (0 self)
High-performance microprocessors are currently designed with the purpose of exploiting the inherent instruction level parallelism (ILP) available in applications. The techniques used in their design and the aggressive scheduling techniques used to exploit this ILP tend to increase the register requirements of the loops. In this paper we overview some hardware and software techniques proposed in the literature to alleviate the high register demands of aggressive scheduling heuristics on VLIW cores. From the software point of view, instruction scheduling can stretch lifetimes and reduce the register pressure. If more registers than those available in the architecture are required, some actions (such as the injection of spill code) have to be applied to reduce this pressure, at the expense of some performance degradation. From the hardware point of view, this degradation could be avoided if a high-capacity register file were included without causing a negative impact on the design of the processor (cycle time, area and power dissipation). Future scalable VLIW cores will require the use of clustering to decentralize the design and to meet the technology constraints. New aggressive instruction scheduling techniques will be required to minimize the negative effect of this resource clustering and of the delays incurred in moving data around. Keywords: Modulo scheduling, Register requirements, Spill code, Register file organization, Clustered organization.
Hierarchical clustered register file organization for VLIW processors
International Parallel and Distributed Processing Symposium, 2003. Proceedings. 22-26 April 2003, Page(s): 10, 2003
Cited by 6 (0 self)
Technology projections indicate that wire delays will become one of the biggest constraints in future microprocessor designs. To avoid long wire delays and therefore long cycle times, processor cores must be partitioned into components so that most of the communication is done locally. In this paper, we propose a novel register file organization for VLIW cores that combines clustering with a hierarchical register file organization. Functional units are organized in clusters, each one with a local first level register file. The local register files are connected to a global second level register file, which provides access to memory. All inter-cluster communications are done through the second level register file. This paper also proposes MIRS HC, a novel modulo scheduling technique that simultaneously performs instruction scheduling, cluster selection, insertion of communication operations, register allocation, and spill insertion for the proposed organization. The results show that although more cycles are required to execute applications, the execution time is reduced due to a shorter cycle time. In addition, the combination of clustering and hierarchy provides a larger design exploration space that trades off performance and technology requirements.
Optimizing Loop Performance for Clustered VLIW Architectures
In Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, 2002
Cited by 5 (1 self)
Modern embedded systems often require high degrees of instruction-level parallelism (ILP) within strict constraints on power consumption and chip cost. Unfortunately, a high-performance embedded processor with high ILP generally puts large demands on register resources, making it difficult to maintain a single, multi-ported register bank. To address this problem, some architectures, e.g. the Texas Instruments TMS320C6x, partition the register bank into multiple banks that are each directly connected only to a subset of functional units. These functional unit/register bank groups are called clusters.
Exploiting vector parallelism in software pipelined loops
In Proc. of the 38th Annual International Symposium on Microarchitecture, 2005
Cited by 4 (0 self)
An emerging trend in processor design is the incorporation of short vector instructions into the ISA. In fact, vector extensions have appeared in most general-purpose microprocessors. To utilize these instructions, traditional vectorization technology can be used to identify and exploit data parallelism. In contrast, efficient use of a processor’s scalar resources is typically achieved through ILP techniques such as software pipelining. In order to attain the best performance, it is necessary to utilize both sets of resources. This paper presents a novel approach for exploiting vector parallelism in a software pipelined loop. At its core is a method for judiciously partitioning operations between vector and scalar resources. The proposed algorithm (i) lowers the burden on the scalar resources by offloading computation to the vector functional units, and (ii) partially (or fully) inhibits the optimizations when full vectorization will decrease performance. This results in better resource usage and allows for software pipelining with shorter initiation intervals. Although our techniques complement statically scheduled machines most naturally, we believe they are applicable to any architecture that tightly integrates support for ILP and data parallelism. An important aspect of the proposed methodology is its ability to manage explicit communication of operands between vector and scalar instructions. Our methodology also allows for a natural handling of misaligned vector memory operations. For architectures that provide hardware support for misaligned references, software pipelining effectively hides the latency of these potentially expensive instructions. When explicit alignment is required in software, our algorithm accounts for these extra costs and vectorizes only when it is profitable. Finally, our heuristic can take advantage of alignment information where it is available. We evaluate our methodology using several DSP and SPEC FP benchmarks. 
Compared to software pipelining, our approach is able to achieve an average speedup of 1.30× and 1.18× for the two benchmark sets, respectively.
Exploring Energy-Performance Trade-offs for Heterogeneous Interconnect Clustered VLIW Processors
In Proc. of Intl. Conf. on High Performance Computing, 2005
Cited by 2 (1 self)
Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Although clustering helps by improving clock speed, reducing energy consumption of the logic, and making design simpler, it introduces extra overheads by way of inter-cluster communication. This communication happens over long global wires with high load capacitance, which leads to delays in execution and significantly higher energy consumption. Technological advancements permit the design of a variety of clustered architectures by varying the degree of clustering and the type of interconnects. In this paper, we focus on exploring energy-performance trade-offs in going from a unified VLIW architecture to different types of clustered VLIW architectures. We propose a new instruction scheduling algorithm that exploits scheduling slacks of instructions and communication slacks of data values together to achieve better energy-performance trade-offs for clustered architectures. Our instruction scheduling algorithm for clustered architectures with heterogeneous interconnect achieves 35% and 40% reductions in communication energy, whereas the overall energy-delay product improves by 4.5% and 6.5% for 2-cluster and 4-cluster machines, respectively, with marginal increases of 1.6% and 1.1% in execution time. Our test bed uses the Trimaran compiler infrastructure.
Register constrained modulo scheduling
IEEE Trans. Parallel Distrib. Syst.
Cited by 1 (0 self)
Software pipelining is an instruction scheduling technique that exploits the instruction level parallelism (ILP) available in loops by overlapping operations from various successive loop iterations. The main drawback of aggressive software pipelining techniques is their high register requirements. If the requirements exceed the number of registers available in the target architecture, some steps need to be applied to reduce the register pressure (incurring some performance degradation): reducing iteration overlap or spilling some lifetimes to memory. In the first part of this paper, we propose a set of heuristics to improve the spilling process and to better decide between adding spill code or directly decreasing the execution rate of iterations. The experimental evaluation, over a large number of representative loops and for a processor configuration, reports an increase in performance by a factor of 1.29 and a reduction of memory traffic by a factor of 1.36. In the second part of this paper, we analyze the use of backtracking and propose a novel approach for simultaneous instruction scheduling and register spilling in modulo scheduling: MIRS (Modulo Scheduling with Integrated Register Spilling). The experimental evaluation reports an increase in performance by a factor of 1.46 and a reduction of the memory traffic by a factor of 1.66 (or an additional 1.13 and 1.22 with regard to the proposal in the first part of the paper). These improvements are achieved at the expense of a reasonable increase in the compilation time. Index Terms: Instruction level parallelism, instruction scheduling, modulo scheduling, register allocation, spill code.
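The factors quoted in this abstract are mutually consistent if one assumes (the abstract does not state this explicitly) that the "additional" factors compose multiplicatively with the first-part gains over the same baseline:

```latex
\begin{align*}
\text{performance:}    \quad & 1.29 \times 1.13 = 1.4577 \approx 1.46 \\
\text{memory traffic:} \quad & 1.36 \times 1.22 = 1.6592 \approx 1.66
\end{align*}
```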
Removing Communications in Clustered Microarchitectures Through Instruction Replication
Cited by 1 (0 self)
The need to communicate values between clusters can result in a significant performance loss for clustered microarchitectures. In this work, we describe an optimization technique that removes communications by selectively replicating an appropriate set of instructions. Instruction replication is done carefully because it might degrade performance due to the increased contention it can place on processor resources. The proposed scheme is built on top of a previously proposed state-of-the-art modulo-scheduling algorithm. Though this algorithm has been proved to be very effective at reducing communications, results show that the number of communications can be further decreased by around one-third through replication, which results in a significant speedup. IPC is increased by 25% on average for a four-cluster microarchitecture and by as much as 70% for selected programs. We also show that replicating appropriate sets of instructions is more effective than doubling the intercluster connection network bandwidth.

