Memory Bank Disambiguation using Modulo Unrolling for Raw Machines
- In Proceedings of the ACM/IEEE Fifth Int'l Conference on High Performance Computing (HiPC), 1998
- Cited by 10 (3 self)
The Raw approach of replicated processor tiles interconnected with a fast static mesh network provides a simple, scalable design that maximizes the resources available in next generation processor technology. In Raw
An Aggressive Approach to Loop Unrolling
- Proc. Compiler Construction '96, 1995
- Cited by 9 (1 self)
A well-known code transformation for improving the execution performance of a program is loop unrolling. The most obvious benefit of unrolling a loop is that the transformed loop usually, but not always, requires fewer instruction executions than the original loop. The reduction in instruction executions comes from two sources: the number of branch instructions executed is reduced, and the index variable is modified fewer times. In addition, for architectures with features designed to exploit instruction-level parallelism, loop unrolling can expose greater levels of instruction-level parallelism. Loop unrolling is an effective code transformation, often improving by ten to thirty percent the execution performance of programs that spend much of their execution time in loops. Possibly because of the effectiveness of a simple application of loop unrolling, it has not been studied as extensively as other code improvements such as register allocation or common subexpression elimination. The r...
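The two sources of savings named in this abstract (fewer branch executions and fewer index updates) can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names `sum_rolled` and `sum_unrolled_by_4` and the unroll factor of 4 are assumptions chosen for illustration only.

```python
def sum_rolled(values):
    """Baseline loop: one loop test and one index update per element."""
    total = 0
    i = 0
    n = len(values)
    while i < n:
        total += values[i]
        i += 1
    return total


def sum_unrolled_by_4(values):
    """Same computation with the loop body replicated four times.

    The main loop performs one loop test and one index update per four
    elements; a cleanup loop handles the 0 to 3 leftover elements.
    """
    total = 0
    i = 0
    n = len(values)
    limit = n - (n % 4)  # largest multiple of 4 not exceeding n
    while i < limit:
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
        i += 4
    while i < n:  # cleanup loop for the remainder
        total += values[i]
        i += 1
    return total
```

Both functions return the same result for any list of numbers. The cleanup loop is needed because the element count is generally not a multiple of the unroll factor.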
Runtime Predictability of Loops
- 2001
- Cited by 7 (2 self)
To obtain the benefits of aggressive, wide-issue architectures, a large window of valid instructions must be available. While researchers have been successful in obtaining high accuracies with a range of dynamic branch predictors, there still remains the need for more aggressive instruction delivery.
Path-based Hardware Loop Prediction
- Cited by 4 (0 self)
For microprocessors that attempt to exploit instruction-level parallelism, it is necessary to have a large window of candidate instructions from which to issue. With loop prediction, we can predict the number of times that a loop will iterate, as well as the paths that will be followed inside the loop body. With this type of prediction, several basic blocks can be prefetched and stored in a dedicated loop buffer, reducing the number of instruction cache and memory requests, while providing a large window of instructions for speculative execution.
Software Bubbles: Using Predication to Compensate for Aliasing in Software Pipelines
- In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2002
- Cited by 3 (1 self)
This paper describes a technique for utilizing predication to support software pipelining on EPIC architectures in the presence of dynamic memory aliasing. The essential idea is that the compiler generates an optimistic software-pipelined schedule that assumes there is no memory aliasing. The operations in the pipeline kernel are predicated, however, so that if memory aliasing is detected by a run-time check, the predicate registers are set to disable the iterations that are so tightly overlapped as to violate the memory dependences. We refer to these disabled kernel operations as software bubbles.
A Measurement Study of the Linux TCP/IP Stack Performance and Scalability on SMP Systems
- In: Proceedings of the 1st International Conference on COMmunication Systems softWAre and middlewaRE (COMSWARE), 2006
- Cited by 2 (0 self)
The performance of the protocol stack implementation of an operating system can greatly impact the performance of networked applications that run on it. In this paper, we present a thorough measurement study and comparison of the network stack performance of the two popular Linux kernels, 2.4 and 2.6, with a special focus on their performance on SMP architectures. Our findings reveal that interrupt processing costs, device driver overheads, checksumming and buffer copying are dominant overheads of protocol processing. We find that although raw CPU costs are not very different between the two kernels, Linux 2.6 shows vastly improved scalability, attributed to better scheduling and kernel locking mechanisms. We also uncover an anomalous behaviour in which Linux 2.6 performance degrades when packet processing for a single connection is distributed over multiple processors. This, however, verifies the superiority of the “processor per connection” model for parallel processing.
Code restructuring for improving execution efficiency, code size and power consumption for embedded DSPs
- 12th International Workshop on Languages and Compilers for Parallel Computing, 1999
- Cited by 1 (0 self)
Many embedded systems such as personal digital assistants (PDAs), cellular phones, etc. involve heavy use of digital signal processing and are thus based on Digital Signal Processors (DSPs). DSPs such as the TMS320C2x and the DSP5600x have irregular data-paths that are typically the result of application-specific needs (such as chaining multiply-accumulate operations, etc). Efficient code generation for such embedded DSP processors is a challenging problem because of additional constraints such as tight memory and low power consumption demands, resulting in the need for compact code. In this work, we address the problem of generating compact and efficient code for embedded DSP processors. Most of the DSP instruction set architectures (ISAs) feature intra-instruction parallelism (IIP), enabling individual operations to be executed in parallel with a complex instruction. A reduction in generated code size and improved performance can be achieved by exploiting this parallelism present in s...
Compact and Efficient Code Generation through Program Restructuring on Limited Memory Embedded System DSPs
- 2000
Many embedded systems such as digital cameras, digital radios, high resolution printers, cellular phones, etc. involve heavy use of signal processing and are thus based on Digital Signal Processors (DSPs). DSPs such as the TMS320C2x and the DSP5600x have irregular data-paths that typically result from application-specific needs (such as chaining multiply-accumulate operations, etc). Efficient code generation for such embedded DSP processors is a challenging problem. Stringent requirements such as tight memory constraints and fast response time result in the need for compact and efficient code. In this work, we address the problem of generating compact and efficient code for embedded DSP processors. Most of the DSP instruction set architectures (ISAs) feature intra-instruction parallelism (IIP), enabling individual operations to be executed in parallel by generating a complex instruction. A reduction in generated code-size and improved performance can be achieved by exploiting this...
Limits and Graph Structure of Available Instruction-Level Parallelism
- Lecture Notes in Computer Science, 2001
We reexamine the limits of parallelism available in programs, using runtime reconstruction of program data-flow graphs. While limits of parallelism have been examined in the context of superscalar and VLIW machines, we also wish to study the causes of observed parallelism by examining the structure of the reconstructed data-flow graph. One aspect of structure analysis that we focus on is the isolation of instructions involved only in address calculations. We examine how address calculations present in RISC instruction streams generated by optimizing compilers affect the shape of the data-flow graph and often significantly reduce available parallelism.

