Integrated Temporal and Spatial Scheduling for Extended Operand Clustered VLIW Processors
- In Proc. of Conf. on computing frontiers
, 2004
Cited by 9 (5 self)
Centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption and are thus not suitable for consumer electronic devices. The consequence is the emergence of architectures having many interconnected clusters, each with a separate register file and a few functional units. Among the many inter-cluster communication models proposed, the extended operand model extends some of the operand fields of an instruction with a cluster specifier and allows an instruction to read some of its operands from other clusters without any extra cost. Scheduling for clustered processors involves spatial concerns (where to schedule) as well as temporal concerns (when to schedule). A scheduler is responsible for resolving the conflicting requirements of aggressively exploiting the parallelism offered by the hardware and limiting the communication among clusters to the available slots. This paper proposes an integrated spatial and temporal scheduling algorithm for extended operand clustered VLIW processors and evaluates its effectiveness in improving the run-time performance of the code without a code size penalty.
Instruction Level Parallelism through Microthreading -- A Scalable Approach to Chip Multiprocessors
- THE COMPUTER JOURNAL
, 2006
Cited by 7 (0 self)
Most microprocessor chips today use an out-of-order instruction execution mechanism. This mechanism allows superscalar processors to extract reasonably high levels of instruction level parallelism (ILP). The most significant problem with this approach is a large instruction window and the logic to support instruction issue from it. This includes generating wake-up signals to waiting instructions and a selection mechanism for issuing them. A wide issue width also requires a large multi-ported register file, so that each instruction can read and write its operands simultaneously. Neither structure scales well with issue width, leading to poor performance relative to the gates used. Furthermore, to obtain this ILP, the execution of instructions must proceed speculatively. An alternative, which avoids this complexity in instruction issue and eliminates speculative execution, is the microthreaded model. This model fragments sequential code at compile time and executes the fragments out of order while maintaining in-order execution within the fragments. The only constraints on the execution of fragments are the dependencies between them, which are managed in a distributed and scalable manner using synchronizing registers. The fragments of code are called microthreads and they capture ILP and loop concurrency. Fragments can be interleaved on a single processor to give tolerance to latency in operands or distributed to many processors to achieve speedup. The implementation of this model is fully scalable. It supports distributed instruction issue and
Instruction-level parallelism through Microthreading -- a scalable Approach to chip multiprocessors
- THE COMPUTER JOURNAL
, 2006
Cited by 5 (3 self)
Most microprocessor chips today use an out-of-order instruction execution mechanism. This mechanism allows superscalar processors to extract reasonably high levels of instruction level parallelism (ILP). The most significant problem with this approach is a large instruction window and the logic to support instruction issue from it. This includes generating wake-up signals to waiting instructions and a selection mechanism for issuing them. A wide issue width also requires a large multi-ported register file, so that each instruction can read and write its operands simultaneously. Neither structure scales well with issue width, leading to poor performance relative to the gates used. Furthermore, to obtain this ILP, the execution of instructions must proceed speculatively. An alternative, which avoids this complexity
A graph matching based integrated scheduling framework for clustered VLIW processors
, 2004
Cited by 4 (3 self)
Scheduling for clustered architectures involves spatial concerns (where to schedule) as well as temporal concerns (when to schedule), and various clustered VLIW configurations, connectivity types, and inter-cluster communication models present different performance trade-offs to a scheduler. The scheduler is responsible for resolving the conflicting requirements of exploiting the parallelism offered by the hardware and limiting the communication among clusters to achieve better performance without stretching the overall schedule. This paper proposes a generic graph matching based framework that resolves the phase-ordering and fixed-ordering problems associated with scheduling on a clustered VLIW processor by simultaneously considering various scheduling alternatives of instructions. We observe approximately 16% and 28% improvement in performance over an earlier integrated scheme and a phase-decoupled scheme respectively, without extra code size penalty.
Exploring Energy-Performance Trade-offs for Heterogeneous Interconnect Clustered VLIW Processors
- IN PROC. OF INTL. CONF. ON HIGH PERFORMANCE COMPUTING
, 2005
Cited by 3 (1 self)
Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Although clustering helps by improving clock speed, reducing energy consumption of the logic, and making design simpler, it introduces extra overheads by way of inter-cluster communication. This communication happens over long global wires with high load capacitance, which leads to execution delays and significantly higher energy consumption. Technological advancements permit the design of a variety of clustered architectures by varying the degree of clustering and the type of interconnects. In this paper, we focus on exploring energy-performance trade-offs in going from a unified VLIW architecture to different types of clustered VLIW architectures. We propose a new instruction scheduling algorithm that exploits scheduling slacks of instructions and communication slacks of data values together to achieve better energy-performance trade-offs for clustered architectures. Our instruction scheduling algorithm for clustered architectures with heterogeneous interconnect achieves 35% and 40% reduction in communication energy, whereas the overall energy-delay product improves by 4.5% and 6.5% respectively for 2-cluster and 4-cluster machines, with a marginal 1.6% and 1.1% increase in execution time. Our test bed uses the Trimaran compiler infrastructure.
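The figures quoted in this abstract can be related through the definition of the energy-delay product (EDP = energy × delay). The sketch below is illustrative arithmetic only, not taken from the paper: it assumes that "improves by" means the EDP ratio drops by that fraction, and that the execution-time increase applies to the same runs. The function name `implied_energy_ratio` is hypothetical.

```python
def implied_energy_ratio(edp_improvement: float, time_increase: float) -> float:
    """Return new_energy / old_energy implied by an EDP change and a delay change.

    Assumption (not stated in the abstract): EDP = energy * delay, so
    energy_ratio = edp_ratio / delay_ratio.
    """
    edp_ratio = 1.0 - edp_improvement   # e.g. 4.5% improvement -> 0.955
    delay_ratio = 1.0 + time_increase   # e.g. 1.6% slowdown    -> 1.016
    return edp_ratio / delay_ratio


# Figures quoted in the abstract: (clusters, EDP improvement, execution-time increase)
quoted = [(2, 0.045, 0.016), (4, 0.065, 0.011)]

for clusters, edp_gain, slowdown in quoted:
    ratio = implied_energy_ratio(edp_gain, slowdown)
    print(f"{clusters} clusters: implied overall energy reduction of about "
          f"{100 * (1 - ratio):.1f}%")
```

Under these assumptions the implied overall energy reduction is modest compared with the 35% and 40% communication-energy savings, which is consistent with communication being only one component of total energy.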
Clustered VLIW Architectures: a Quantitative Approach
Cited by 3 (0 self)
The cover design by Henny Herps, Floris van den Haar, and Andrei Terechko. The front cover shows a layout of a tiny synchronous digital IC with 24 standard cell gates and 26 nets in the Philips CMOS 65 nm standard Vth technology. The reader is challenged to guess the ASCII string with an extra proposition of this PhD thesis that this IC generates on the 8 output pins. Note that the clock and reset are not shown to simplify the figure; the colors for polysilicon, diffusion, etc. are non-standard; metal vias are not visible. The reverse engineering associated with this challenge is a “walk in the park” compared to what Soviet Union engineers did in the 1980s to clone DEC and Intel processors with thousands of gates. If no solution is found before 06 May 2007, I will publish hints to the solution on
Evaluation of bus based interconnect mechanisms in clustered VLIW architectures
- In Proceedings of the Conference on Design, Automation and Test in Europe (DATE-2005)
, 2005
Cited by 3 (1 self)
With new sophisticated compiler technology, it is possible to schedule distant instructions efficiently. As a consequence, the amount of exploitable instruction level parallelism (ILP) in applications has gone up considerably. However, monolithic register file VLIW architectures present scalability problems due to a centralized register file (RF) which is far slower than the functional units (FUs). Clustered VLIW architectures, with a subset of FUs connected to any RF, are the solution to this scalability problem. Recent studies with a wide variety of inter-cluster interconnection mechanisms have presented substantial gains in performance (number of cycles) over the most studied RF-to-RF type interconnections. However, these studies have compared only one or two design points in the RF-to-RF interconnect design space. In this paper, we extend the previously reported work. We consider both multi-cycle and pipelined buses. To obtain realistic bus latencies, we synthesized the various architectures and obtained post-layout clock periods. The results demonstrate that while there is very little variation in interconnect area, all the bus-based architectures are heavily performance constrained. Also, neither multi-cycle nor pipelined buses, nor an increase in the number of buses, can achieve performance comparable to point-to-point type interconnects.
Instruction-set architecture exploration strategies for deeply clustered VLIW ASIPs
- in ECyPS 2013 - EUROMICRO/IEEE Workshop on Embedded and Cyber-Physical Systems
, 2013
Cited by 2 (1 self)
Instruction-set architecture exploration for clustered VLIW processors is a very complex problem. Most of the existing exploration methods are hand-crafted and time-consuming. This paper presents and compares several methods for automating this exploration. We propose and discuss a two-phase method which can quickly explore many different architectures and experimentally demonstrate that this method is capable of automatically achieving a 50% improvement on the energy-delay product cost of an automatically generated architecture for an ECG detection application and a 1% energy-delay product cost improvement compared to a hand-crafted design.
Qibin Sun Shih-Fu Chang, Semi-Fragile Authentication of JPEG-2000 Images with Control, Columbia University ADVENT Technical Report, 2002-101. Qibin Shih-Fu Chang, Maeno Kurato Masayuki Suto, semi-fragile image authentication framework combining infrastruc
- Information Technology---JPEG2000 Image Coding System, ISO/IEC International Standard 15444-1, Recommendation T.800, 2000. Rabbani R. Joshi, overview JPEG2000 image compression standard, Signal Processing: Image Communication, Vol.17, No.1, 2001. Taubman,
, 2002
Cited by 1 (0 self)
Clustering is an effective microarchitectural technique for reducing the impact of wire delays, the complexity, and the power requirements of microprocessors. In this work, we investigate the design of on-chip interconnection networks for clustered superscalar microarchitectures. This new class of interconnects has demands and characteristics different from traditional multiprocessor networks. In particular, in a clustered microarchitecture, a low inter-cluster communication latency is essential for high performance. We propose some point-to-point cluster interconnects and new improved instruction steering schemes. The results show that these point-to-point interconnects achieve much better performance than bus-based ones, and that the connectivity of the network together with effective steering schemes are key for high performance. We also show that these interconnects can be built with