Results 1 - 10
of
104
Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing
- IEEE Transactions on Computers
, 1987
"... Abstract-hrge grain data flow (LGDF) programming is natural and convenient for describing digital signal processing (DSP) systems, but its runtime overhead is costly in real time or cost-sensitive applications. In some situations, designers are not willing to squander computing resources for the sak ..."
Abstract
-
Cited by 393 (34 self)
- Add to MetaCart
Abstract-hrge grain data flow (LGDF) programming is natural and convenient for describing digital signal processing (DSP) systems, but its runtime overhead is costly in real time or cost-sensitive applications. In some situations, designers are not willing to squander computing resources for the sake of program-mer convenience. This is particularly true when the target machine is a programmable DSP chip. However, the runtime overhead inherent in most LGDF implementations is not required for most signal processing systems because such systems are mostly synchronous (in the DSP sense). Synchronous data flow (SDF) differs from traditional data flow in that the amount of data produced and consumed by a data flow node is specified a priori for each input and output. This is equivalent to specifying the relative sample rates in signal processing system. This means that the scheduling of SDF nodes need not be done at runtime, but can be done at compile time (statically), so the runtime overhead evaporates. The sample rates can all be different, which is not true of most current data-driven digital signal processing programming methodologies. Synchronous data flow is closely related to computation graphs, a special case of Petri nets. This self-contained paper develops the theory necessary to statically schedule SDF programs on single or multiple proces-sors. A class of static (compile time) scheduling algorithms is proven valid, and specific algorithms are given for scheduling SDF systems onto single or multiple processors. Index Terms-Block diagram, computation graphs, data flow digital signal processing, hard real-time systems, multiprocessing,
Complexity-Effective Superscalar Processors
- In Proceedings of the 24th Annual International Symposium on Computer Architecture
, 1997
"... The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for ..."
Abstract
-
Cited by 385 (5 self)
- Add to MetaCart
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0:8 m, 0:35 m, and0:18 m. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster – consequently overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines. 1
IMPACT: An architectural framework for multiple-instruction-issue processors
- in Proceedings of the 18th International Symposium on Computer Architecture
, 1991
"... The performance of multiple-instruction-issue processors can be severely limited by the compiler's ability to generate e cient code for concurrent hardware. In the IM-PACT project, we havedeveloped IMPACT-I, a highly optimizing C compiler to exploit instruction level concurrency. The optimization ca ..."
Abstract
-
Cited by 203 (41 self)
- Add to MetaCart
The performance of multiple-instruction-issue processors can be severely limited by the compiler's ability to generate e cient code for concurrent hardware. In the IM-PACT project, we havedeveloped IMPACT-I, a highly optimizing C compiler to exploit instruction level concurrency. The optimization capabilities of the IMPACT-I C compiler are summarized in this paper. Using the IMPACT-I C compiler, we ran experiments to analyze the performance of multiple-instruction-issue processors executing some important non-numerical programs. The multiple-instruction-issue processors achieve solid speedup over high-performance single-instruction-issue processors. We ran experiments to characterize the following architectural design issues: code scheduling model, instruction issue rate, memory load latency, and function unit resource limitations. Based on the experimental results, we propose the IMPACT Architectural Framework, a set of architectural features that best support the IMPACT-I C compiler to generate e cient code for multiple-instructionissue processors. By supporting these architectural features, multiple-instruction-issue implementations of existing and new architectures receive immediate compilation support from the IMPACT-I C compiler. 1
Instruction-Level Parallel Processing: History, Overview and Perspective
, 1992
"... Instruction-level Parallelism CILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a muc ..."
Abstract
-
Cited by 166 (0 self)
- Add to MetaCart
Instruction-level Parallelism CILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design. Several systems were built, and sold commercially, which pushed ILP far beyond where it had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had incorporated ILP, and new techniques for ILP have become a popular topic at academic conferences. This article provides an overview and historical perspective of the field of ILP and its development over the past three decades.
Code scheduling and register allocation in large basic blocks
- in Proceedings of the 1988 International Conference on Supercomputing
, 1988
"... We discuss the issues about the interdependency between code scheduling and register allocation. We present two methods as solutions: (1) an integrated code scheduling technique; and (2) a DAGdriven register allocator. The integrated code scheduling method combines two scheduling techniques- one to ..."
Abstract
-
Cited by 103 (0 self)
- Add to MetaCart
We discuss the issues about the interdependency between code scheduling and register allocation. We present two methods as solutions: (1) an integrated code scheduling technique; and (2) a DAGdriven register allocator. The integrated code scheduling method combines two scheduling techniques- one to reduce pipeline delays and the other to minimize register usage- into a single phase. By keeping track of the number of available registers, the scheduler can choose the appropriate scheduling technique to schedule a better code sequence. The DAG-driven register allocator uses a dependency graph to assist in assigning registers; it introduces much less extra dependency than does an ordinary register allocator. For large basic blocks, both approaches were shown to generate more efficient code sequences than conventional techniques in the simulations. 1.
An Accurate Worst Case Timing Analysis for RISC Processors
- IN IEEE REAL-TIME SYSTEMS SYMPOSIUM
, 1995
"... An accurate and safe estimation of a task's worst case execution time (WCET) is crucial for reasoning about the timing properties of real-time systems. In RISC processors, the execution time of a program construct (e.g., a statement) is affected by various factors such as cache hits/misses and pi ..."
Abstract
-
Cited by 94 (3 self)
- Add to MetaCart
An accurate and safe estimation of a task's worst case execution time (WCET) is crucial for reasoning about the timing properties of real-time systems. In RISC processors, the execution time of a program construct (e.g., a statement) is affected by various factors such as cache hits/misses and pipeline hazards, and these factors impose serious problems in analyzing the WCETs of tasks. To analyze the timing effects of RISC's pipelined execution and cache memory, we propose extensions to the original timing schema where the timing information associated with each program construct is a simple time-bound. In our approach, associated with each program construct is what we call a WCTA (Worst Case Timing Abstraction), which contains detailed timing information of every execution path that might be the worst case execution path of the program construct. This extension leads to a revised timing schema that is similar to the original timing schema except that concatenation and pruning...
Branch Classification: a New Mechanism for Improving Branch Predictor Performance
- In Proceedings of the 27th International Symposium on Microarchitecture
, 1994
"... There is wide agreement that one of the most important impediments to the performance of current and future pipelined superscalar processors is the presence of conditional branches in the instruction stream. Speculative execution seems to be one solution of choice to the branch problem, but speculat ..."
Abstract
-
Cited by 84 (11 self)
- Add to MetaCart
There is wide agreement that one of the most important impediments to the performance of current and future pipelined superscalar processors is the presence of conditional branches in the instruction stream. Speculative execution seems to be one solution of choice to the branch problem, but speculative work is discarded if a branch is mispredicted. Therefore, we need a very accurate branch predictor; 95% accuracy is not good enough. This paper proposes branch classification to help improve the accuracy of branch predictors. Branch classification allows an individual branch instruction to be associated with the branch predictor best suited to predict its direction. Using this approach, a hybrid branch predictor can be constructed such that each component branch predictor predicts those branches for which it is best suited. This paper suggests one classification scheme, analyzes several branch predictors, and proposes a hybrid branch predictor that achieves higher prediction accuracy tha...
Theory of latency-insensitive design
- IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
, 2001
"... Abstract—The theory of latency-insensitive design is presented as the foundation of a new correct-by-construction methodology to design complex systems by assembling intellectual property components. Latency-insensitive designs are synchronous distributed systems and are realized by composing functi ..."
Abstract
-
Cited by 75 (10 self)
- Add to MetaCart
Abstract—The theory of latency-insensitive design is presented as the foundation of a new correct-by-construction methodology to design complex systems by assembling intellectual property components. Latency-insensitive designs are synchronous distributed systems and are realized by composing functional modules that exchange data on communication channels according to an appropriate protocol. The protocol works on the assumption that the modules are stallable, a weak condition to ask them to obey. The goal of the protocol is to guarantee that latency-insensitive designs composed of functionally correct modules behave correctly independently of the channel latencies. This allows us to increase the robustness of a design implementation because any delay variations of a channel can be “recovered ” by changing the channel latency while the overall system functionality remains unaffected. As a consequence, an important application of the proposed theory is represented by the latency-insensitive methodology to design large digital integrated circuits by using deep submicrometer technologies. Index Terms—Deep submicrometer design, formal methods, latency-insensitive protocols, system design. I.
Pipelined Processors And Worst Case Execution Times
- Real-Time Systems
, 1993
"... The calculation of worst case execution time (WCET) is a fundamental requirement of almost all scheduling approaches for hard real-time systems. Due to their unpredicatability, hardware enhancements such as cache and pipelining are often ignored in attempts to find WCET of programs. This results in ..."
Abstract
-
Cited by 51 (7 self)
- Add to MetaCart
The calculation of worst case execution time (WCET) is a fundamental requirement of almost all scheduling approaches for hard real-time systems. Due to their unpredicatability, hardware enhancements such as cache and pipelining are often ignored in attempts to find WCET of programs. This results in estimations that are excessively pessimistic. In this paper a simple instruction pipeline is modeled so that more accurate estimations are obtained. The model presented can be used with any schedulability analysis that allows sections of non-preemptable code to be included. Our results indicate that WCET over-estimates at basic block level can be reduced from over 20% to less than 2%, and that the over-estimates for typical structured real-time programs can be reduced by 17%-40%. 1. Introduction In real-time systems, predictable temporal behaviour is an essential requirement. To be able to predict, and therefore guarantee, the timing behaviour of a real-time system two issues need to be ad...
An Accurate Worst Case Timing Analysis Technique for RISC Processors
, 1994
"... An accurate and safe estimation of a task's worst case execution time (WCET) is crucial for reasoning about the timing properties of real-time systems. In RISC processors, the execution time of a program construct (e.g., a statement) is affected by various factors such as cache hits/misses and pipel ..."
Abstract
-
Cited by 45 (4 self)
- Add to MetaCart
An accurate and safe estimation of a task's worst case execution time (WCET) is crucial for reasoning about the timing properties of real-time systems. In RISC processors, the execution time of a program construct (e.g., a statement) is affected by various factors such as cache hits/misses and pipeline hazards, and these factors impose serious problems in analyzing the WCETs of tasks. To analyze the timing effects of RISC's pipelined execution and cache memory, this paper proposes extensions of the original timing schema [26] where the timing information associated with each program construct is a simple time-bound. We associate with each program construct what we call a WCTA (Worst Case Timing Abstraction), which contains detailed timing information of every execution path that might be the worst case execution path of the program construct. This extension leads to a revised timing schema that is similar to the original timing schema except that concatenation and pruning operations on...

