Results 11 - 20
of
116
Processor Design for Portable Systems
- Journal of VLSI Signal Processing
, 1996
"... : Processors used in portable systems must provide highly energy-efficient operation, due to the importance of battery weight and size, without compromising high performance when the user requires it. The user-dependent modes of operation of a processor in portable systems are described and separate ..."
Abstract
-
Cited by 74 (1 self)
- Add to MetaCart
: Processors used in portable systems must provide highly energy-efficient operation, due to the importance of battery weight and size, without compromising high performance when the user requires it. The user-dependent modes of operation of a processor in portable systems are described and separate metrics for energy efficiency for each of them are found to be required. A variety of well known low-power techniques are re-evaluated against these metrics and in some cases are not found to be appropriate leading to a set of energy-efficient design principles. Also, the importance of idle energy reduction and the joint optimization of hardware and software will be examined for achieving the ultimate in lowenergy, high-performance design. 1. Introduction The recent explosive growth in portable electronics requires energy conscious design, without sacrificing performance. Simply increasing the battery capacity is not sufficient because the battery has become a significant fraction of the t...
Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine
- In Proceedings of the Eighth ACM Conference on Architectural Support for Programming Languages and Operating Systems
, 1998
"... Increasing demand for both greater parallelism and faster clocks dictate that future generation architectures will need to decentralize their resources and eliminate primitives that require single cycle global communication. A Raw microprocessor distributes all of its resources, including instructio ..."
Abstract
-
Cited by 71 (15 self)
- Add to MetaCart
Increasing demand for both greater parallelism and faster clocks dictate that future generation architectures will need to decentralize their resources and eliminate primitives that require single cycle global communication. A Raw microprocessor distributes all of its resources, including instruction streams, register files, memory ports, and ALUs, over a pipelined two-dimensional mesh interconnect, and exposes them fully to the compiler. Because communication in Raw machines is distributed, compiling for instructionlevel parallelism (ILP) requires both spatial instruction partitioning as well as traditional temporal instruction scheduling. In addition, the compiler must explicitly manage all communication through the interconnect, including the global synchronization required at branch points. This paper describes RAWCC, the compiler we have developed for compiling general-purpose sequential programs to the distributed Raw architecture. We present performance results that demonstrate that although Raw machines provide no mechanisms for global communication the Raw compiler can schedule to achieve speedups that scale with the number of available functional units.
Unified Assign and Schedule: A New Approach to Scheduling for Clustered Register File Microarchitectures
- In International Symposium on Microarchitecture
, 1998
"... Recently, there has been a trend towards clustered microarchitectures to reduce the cycle time for wideissue microprocessors. In such processors, the register file and functional units are partitioned and grouped into clusters. Instruction scheduling for a clustered machine requires assignment and s ..."
Abstract
-
Cited by 71 (0 self)
- Add to MetaCart
Recently, there has been a trend towards clustered microarchitectures to reduce the cycle time for wideissue microprocessors. In such processors, the register file and functional units are partitioned and grouped into clusters. Instruction scheduling for a clustered machine requires assignment and scheduling of operations to the clusters. In this paper, a new scheduling algorithm named unified-assign-and-schedule (UAS) is proposed for clustered, statically-scheduled architectures. UAS merges the cluster assignment and instruction scheduling phases in a natural and straightforward fashion. We compared the performance of UAS with various heuristics to the well-known Bottom-up Greedy (BUG) algorithm and to an optimal cluster scheduling algorithm, measuring the schedule lengths produced by all of the schedulers. Our results show that UAS gives better performance than the BUG algorithm and is quite close to optimal. 1 Introduction Many high performance processors are designed with wide is...
Memory-System Design Considerations For Dynamically-Scheduled Microprocessors
, 1997
"... Memory-System Design Considerations for Dynamically-Scheduled Microprocessors Keith Istvan Farkas Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto 1997 Dynamically-scheduled processors challenge hardware and software architects to develop designs ..."
Abstract
-
Cited by 66 (4 self)
- Add to MetaCart
Memory-System Design Considerations for Dynamically-Scheduled Microprocessors Keith Istvan Farkas Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto 1997 Dynamically-scheduled processors challenge hardware and software architects to develop designs that balance hardware complexity and compiler technology against performance targets. This dissertation presents a first thorough look at some of the issues introduced by this hardware complexity. The focus of the investigation of these issues is the register file and the other components of the data memory system. These components are: the lockup-free data cache, the stream buffers, and the interface to the lower levels of the memory system. The investigation is based on software models. These models incorporate the features of a dynamically-scheduled processor that affect the design of the data-memory components. The models represent a balance between accuracy and generality, and ar...
Dynamic Prediction of Critical Path Instructions
- IN PROCEEDINGS OF THE SEVENTH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE
, 2001
"... Modern processors come close to executing as fast as true dependences allow. The particular dependences that constrain execution speed constitute the critical path of execution. To optimize the performance of the processor, we either have to reduce the critical path or execute it more efficiently. I ..."
Abstract
-
Cited by 61 (5 self)
- Add to MetaCart
Modern processors come close to executing as fast as true dependences allow. The particular dependences that constrain execution speed constitute the critical path of execution. To optimize the performance of the processor, we either have to reduce the critical path or execute it more efficiently. In both cases, it can be done more effectively if we know the actual instructions that constitute that path. This paper
tcc: A System for Fast, Flexible, and High-level Dynamic Code Generation
- IN PROCEEDINGS OF THE ACM SIGPLAN '97 CONFERENCE ON PROGRAMMING LANGUAGE DESIGN AND IMPLEMENTATION
, 1997
"... tcc is a compiler that provides efficient and high-level access to dynamic code generation. It implements the `C ("Tick-C") programming language, an extension of ANSI C that supports dynamic code generation [15]. `C gives power and flexibility in specifying dynamically generated code: whereas most o ..."
Abstract
-
Cited by 55 (3 self)
- Add to MetaCart
tcc is a compiler that provides efficient and high-level access to dynamic code generation. It implements the `C ("Tick-C") programming language, an extension of ANSI C that supports dynamic code generation [15]. `C gives power and flexibility in specifying dynamically generated code: whereas most other systems use annotations to denote run-time invariants, `C allows the programmer to specify and compose arbitrary expressions and statements at run time. This degree of control is needed to efficiently implement some of the most important applications of dynamic code generation, such as "just in time" compilers [17] and efficient simulators [10, 48, 46]. The paper focuses on the techniques that allow tcc to provide `C's flexibility and expressiveness without sacrificing run-time code generation efficiency. These techniques include fast register allocation, efficient creation and composition of dynamic code specifications, and link-time analysis to reduce the size of dynamic code generato...
Dynamically managing the communication-parallelism trade-off in future clustered processors
- IN PROCEEDINGS OF INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 2003
"... Clustered microarchitectures are an attractive alternative to large monolithic superscalar designs due to their potential for higher clock rates in the face of increasingly wire-delay-constrained process technologies. As increasing transistor counts allow an increase in the number of clusters, there ..."
Abstract
-
Cited by 47 (10 self)
- Add to MetaCart
Clustered microarchitectures are an attractive alternative to large monolithic superscalar designs due to their potential for higher clock rates in the face of increasingly wire-delay-constrained process technologies. As increasing transistor counts allow an increase in the number of clusters, thereby allowing more aggressive use of instructionlevel parallelism (ILP), the inter-cluster communication increases as data values get spread across a wider area. As a result of the emergence of this trade-off between communication and parallelism, a subset of the total on-chip clusters is optimal for performance. To match the hardware to the application’s needs, we use a robust algorithm to dynamically tune the clustered architecture. The algorithm, which is based on program metrics gathered at periodic intervals, achieves an 11 % performance improvement on average over the best statically defined architecture. We also show that the use of additional hardware and reconfiguration at basic block boundaries can achieve average improvements of 15%. Our results demonstrate that reconfiguration provides an effective solution to the communication and parallelism trade-off inherent in the communicationbound processors of the future.
Maps: A Compiler-Managed Memory System for Raw Machines
- IN PROCEEDINGS OF THE 26TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 1998
"... This paper describes Maps, a compiler managed memory system for Raw architectures. Traditional processors for sequential programs maintain the abstraction of a unified memory by using a single centralized memory system. This implementation leads to the infamous "Von Neumann bottleneck, " with machin ..."
Abstract
-
Cited by 41 (9 self)
- Add to MetaCart
This paper describes Maps, a compiler managed memory system for Raw architectures. Traditional processors for sequential programs maintain the abstraction of a unified memory by using a single centralized memory system. This implementation leads to the infamous "Von Neumann bottleneck, " with machine performance limited by the large memory latency and limited memory bandwidth. A Raw architecture addresses this problem by taking advantage of the rapidly increasing transistor budget to move much of its memory on chip. To remove the bottleneck and complexity associated with centralized memory, Raw distributes the memory with its processing elements. Unified memory semantics are implemented jointly by the hardware and the compiler. The hardware provides a clean compiler interface to its two inter-tile interconnects: a fast, statically schedulable network and a traditional dynamic network. Maps then uses these communication mechanisms to orchestrate the memory accesses for low latency and parallelism while enforcing proper dependence. It optimizes for speed in two ways: by finding accesses that can be scheduled on the static interconnect through static promotion, and by minimizing dependence sequentialization for the remaining accesses. Static promotion is performed using equivalence class unification and modulo unrolling; memory dependences are enforced through explicit synchronization and software serial ordering. We have implemented Maps based on the SUIF infrastructure. This paper demonstrates that the exclusive use of static promotion yields roughly 20-fold speedup on 32 tiles for our regular applications and about 5-fold speedup on 16 or more tiles for our irregular applications. The paper also shows that selective use of dynamic accesses can be a useful complement to...
CARS: A new code generation framework for clustered ILP processors
- In HPCA
, 2001
"... Clustered ILP processors are characterized by a large number of non-centralized on-chip resources grouped into clusters. Traditional code generation schemes for these processors consist of multiple phases for cluster assignment, register allocation and instruction scheduling. Most of these approache ..."
Abstract
-
Cited by 40 (1 self)
- Add to MetaCart
Clustered ILP processors are characterized by a large number of non-centralized on-chip resources grouped into clusters. Traditional code generation schemes for these processors consist of multiple phases for cluster assignment, register allocation and instruction scheduling. Most of these approaches need additional re-scheduling phases because they often do not impose finite resource constraints in all phases of code generation. These phase-ordered solutions have several drawbacks, resulting in the generation of poor performance code. Moreover, the iterative/back-tracking algorithms used in some of these schemes have large running times. In this paper we present CARS, a code generation framework for Clustered ILP processors, which combines the cluster assignment, register allocation, and instruction scheduling phases into a single code generation phase, thereby eliminating the problems associated with phase-ordered solutions. The CARS algorithm explicitly takes into account all the resource constraints at each cluster scheduling step to reduce spilling and to avoid iterative re-scheduling steps. We also present a new on-the-fly register allocation scheme developed for CARS. We describe an implementation of the proposed code generation framework and the results of a performance evaluation study using the SPEC95/2000 and MediaBench benchmarks.

